- 24 2月, 2018 4 次提交
-
-
由 Daniel Borkmann 提交于
Add a generic emit_mov_reg() helper in order to reuse it in BPF multiplication to load the src into rax, we can save a few bytes in alu32 while doing so. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
-
由 Daniel Borkmann 提交于
Instead of unconditionally performing push/pop on rax/rdx in case of multiplication, we can save a few bytes in case of dest register being either BPF r0 (rax) or r3 (rdx) since the result is written in there anyway. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
-
由 Daniel Borkmann 提交于
While analyzing some of the more complex BPF programs from Cilium, I found that LLVM generally prefers to emit LD_IMM64 instead of MOV32 BPF instructions for loading unsigned 32-bit immediates into a register. Given we cannot change the current/stable LLVM versions that are already out there, lets optimize this case such that the JIT prefers to emit 'mov %eax, imm32' over 'movabsq %rax, imm64' whenever suitable in order to reduce the image size by 4-5 bytes per such load in the typical case, reducing image size on some of the bigger programs by up to 4%. emit_mov_imm32() and emit_mov_imm64() have been added as helpers. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
-
由 Daniel Borkmann 提交于
When we shift by one, we can use a different encoding where imm is not explicitly needed, which saves 1 byte per such op. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
-
- 27 1月, 2018 1 次提交
-
-
由 Daniel Borkmann 提交于
Since we've changed div/mod exception handling for src_reg in eBPF verifier itself, remove the leftovers from x86_64 JIT. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
-
- 20 1月, 2018 2 次提交
-
-
由 Daniel Borkmann 提交于
For the BPF_REG_0 (BPF_REG_A in cBPF, respectively), we can use the short form of the opcode as dst mapping is on eax/rax and thus save a byte per such operation. Added to add/sub/and/or/xor for 32/64 bit when K immediate is used. There may be more such low-hanging fruit to add in future as well. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
-
由 Daniel Borkmann 提交于
Having a pure_initcall() callback just to permanently enable BPF JITs under CONFIG_BPF_JIT_ALWAYS_ON is unnecessary and could leave a small race window in future where JIT is still disabled on boot. Since we know about the setting at compilation time anyway, just initialize it properly there. Also consolidate all the individual bpf_jit_enable variables into a single one and move them under one location. Moreover, don't allow for setting unspecified garbage values on them. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
-
- 18 12月, 2017 2 次提交
-
-
由 Alexei Starovoitov 提交于
Typical JIT does several passes over bpf instructions to compute total size and relative offsets of jumps and calls. With multitple bpf functions calling each other all relative calls will have invalid offsets intially therefore we need to additional last pass over the program to emit calls with correct offsets. For example in case of three bpf functions: main: call foo call bpf_map_lookup exit foo: call bar exit bar: exit We will call bpf_int_jit_compile() indepedently for main(), foo() and bar() x64 JIT typically does 4-5 passes to converge. After these initial passes the image for these 3 functions will be good except call targets, since start addresses of foo() and bar() are unknown when we were JITing main() (note that call bpf_map_lookup will be resolved properly during initial passes). Once start addresses of 3 functions are known we patch call_insn->imm to point to right functions and call bpf_int_jit_compile() again which needs only one pass. Additional safety checks are done to make sure this last pass doesn't produce image that is larger or smaller than previous pass. When constant blinding is on it's applied to all functions at the first pass, since doing it once again at the last pass can change size of the JITed code. Tested on x64 and arm64 hw with JIT on/off, blinding on/off. x64 jits bpf-to-bpf calls correctly while arm64 falls back to interpreter. All other JITs that support normal BPF_CALL will behave the same way since bpf-to-bpf call is equivalent to bpf-to-kernel call from JITs point of view. Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Acked-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
由 Alexei Starovoitov 提交于
global bpf_jit_enable variable is tested multiple times in JITs, blinding and verifier core. The malicious root can try to toggle it while loading the programs. This race condition was accounted for and there should be no issues, but it's safer to avoid this race condition. Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Acked-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
- 04 10月, 2017 1 次提交
-
-
由 Alexei Starovoitov 提交于
- bpf prog_array just like all other types of bpf array accepts 32-bit index. Clarify that in the comment. - fix x64 JIT of bpf_tail_call which was incorrectly loading 8 instead of 4 bytes - tighten corresponding check in the interpreter to stay consistent The JIT bug can be triggered after introduction of BPF_F_NUMA_NODE flag in commit 96eabe7a in 4.14. Before that the map_flags would stay zero and though JIT code is wrong it will check bounds correctly. Hence two fixes tags. All other JITs don't have this problem. Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Fixes: 96eabe7a ("bpf: Allow selecting numa node during map creation") Fixes: b52f00e6 ("x86: bpf_jit: implement bpf_tail_call() helper") Acked-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NMartin KaFai Lau <kafai@fb.com> Reviewed-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 01 9月, 2017 1 次提交
-
-
由 Eric Dumazet 提交于
Saves 4 bytes replacing following instructions : lea rax, [rsi + rdx * 8 + offsetof(...)] mov rax, qword ptr [rax] cmp rax, 0 by : mov rax, [rsi + rdx * 8 + offsetof(...)] test rax, rax Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Acked-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 10 8月, 2017 1 次提交
-
-
由 Daniel Borkmann 提交于
This work implements jiting of BPF_J{LT,LE,SLT,SLE} instructions with BPF_X/BPF_K variants for the x86_64 eBPF JIT. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 07 6月, 2017 1 次提交
-
-
由 Martin KaFai Lau 提交于
Add jited_len to struct bpf_prog. It will be useful for the struct bpf_prog_info which will be added in the later patch. Signed-off-by: NMartin KaFai Lau <kafai@fb.com> Acked-by: NAlexei Starovoitov <ast@fb.com> Acked-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 01 6月, 2017 3 次提交
-
-
由 Alexei Starovoitov 提交于
Take advantage of stack_depth tracking in x64 JIT. Round up allocated stack by 8 bytes to make sure it stays aligned for functions called from JITed bpf program. Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Acked-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Alexei Starovoitov 提交于
in order to JIT programs with different stack sizes we need to make epilogue and exception path to be stack size independent, hence move auxiliary stack space from the bottom of the stack to the top of the stack. Nice side effect is that JITed function prologue becomes shorter due to imm8 offset encoding vs imm32. Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Acked-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Alexei Starovoitov 提交于
free up BPF_JMP | BPF_CALL | BPF_X opcode to be used by actual indirect call by register and use kernel internal opcode to mark call instruction into bpf_tail_call() helper. Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Acked-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 09 5月, 2017 1 次提交
-
-
由 Laura Abbott 提交于
set_memory_* functions have moved to set_memory.h. Switch to this explicitly. Link: http://lkml.kernel.org/r/1488920133-27229-6-git-send-email-labbott@redhat.comSigned-off-by: NLaura Abbott <labbott@redhat.com> Acked-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 29 4月, 2017 1 次提交
-
-
由 Daniel Borkmann 提交于
For both cases, the verifier is already rejecting such invalid formed instructions. Thus, remove these artifacts from old times and align it with ppc64, sparc64 and s390x JITs that don't have them in the first place. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 22 2月, 2017 1 次提交
-
-
由 Daniel Borkmann 提交于
Eric and Willem reported that they recently saw random crashes when JIT was in use and bisected this to 74451e66 ("bpf: make jited programs visible in traces"). Issue was that the consolidation part added bpf_jit_binary_unlock_ro() that would unlock previously made read-only memory back to read-write. However, DEBUG_SET_MODULE_RONX cannot be used for this to test for presence of set_memory_*() functions. We need to use ARCH_HAS_SET_MEMORY instead to fix this; also add the corresponding bpf_jit_binary_lock_ro() to filter.h. Fixes: 74451e66 ("bpf: make jited programs visible in traces") Reported-by: NEric Dumazet <edumazet@google.com> Reported-by: NWillem de Bruijn <willemb@google.com> Bisected-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Tested-by: NWillem de Bruijn <willemb@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 18 2月, 2017 2 次提交
-
-
由 Daniel Borkmann 提交于
Long standing issue with JITed programs is that stack traces from function tracing check whether a given address is kernel code through {__,}kernel_text_address(), which checks for code in core kernel, modules and dynamically allocated ftrace trampolines. But what is still missing is BPF JITed programs (interpreted programs are not an issue as __bpf_prog_run() will be attributed to them), thus when a stack trace is triggered, the code walking the stack won't see any of the JITed ones. The same for address correlation done from user space via reading /proc/kallsyms. This is read by tools like perf, but the latter is also useful for permanent live tracing with eBPF itself in combination with stack maps when other eBPF types are part of the callchain. See offwaketime example on dumping stack from a map. This work tries to tackle that issue by making the addresses and symbols known to the kernel. The lookup from *kernel_text_address() is implemented through a latched RB tree that can be read under RCU in fast-path that is also shared for symbol/size/offset lookup for a specific given address in kallsyms. The slow-path iteration through all symbols in the seq file done via RCU list, which holds a tiny fraction of all exported ksyms, usually below 0.1 percent. Function symbols are exported as bpf_prog_<tag>, in order to aide debugging and attribution. This facility is currently enabled for root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening is active in any mode. The rationale behind this is that still a lot of systems ship with world read permissions on kallsyms thus addresses should not get suddenly exposed for them. If that situation gets much better in future, we always have the option to change the default on this. Likewise, unprivileged programs are not allowed to add entries there either, but that is less of a concern as most such programs types relevant in this context are for root-only anyway. If enabled, call graphs and stack traces will then show a correct attribution; one example is illustrated below, where the trace is now visible in tooling such as perf script --kallsyms=/proc/kallsyms and friends. Before: 7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux) f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so) After: 7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux) [...] 7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux) f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so) Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Daniel Borkmann 提交于
Remove the dummy bpf_jit_compile() stubs for eBPF JITs and make that a single __weak function in the core that can be overridden similarly to the eBPF one. Also remove stale pr_err() mentions of bpf_jit_compile. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 09 1月, 2017 1 次提交
-
-
由 Daniel Borkmann 提交于
If after too many passes still no image could be emitted, then swap back to the original program as we do in all other cases and don't use the one with blinding. Fixes: 959a7579 ("bpf, x86: add support for constant blinding") Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 09 12月, 2016 1 次提交
-
-
由 Martin KaFai Lau 提交于
This patch allows XDP prog to extend/remove the packet data at the head (like adding or removing header). It is done by adding a new XDP helper bpf_xdp_adjust_head(). It also renames bpf_helper_changes_skb_data() to bpf_helper_changes_pkt_data() to better reflect that XDP prog does not work on skb. This patch adds one "xdp_adjust_head" bit to bpf_prog for the XDP-capable driver to check if the XDP prog requires bpf_xdp_adjust_head() support. The driver can then decide to error out during XDP_SETUP_PROG. Signed-off-by: NMartin KaFai Lau <kafai@fb.com> Acked-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Acked-by: NJohn Fastabend <john.r.fastabend@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 17 5月, 2016 3 次提交
-
-
由 Daniel Borkmann 提交于
This patch adds recently added constant blinding helpers into the x86 eBPF JIT. In the bpf_int_jit_compile() path, requirements are to utilize bpf_jit_blind_constants()/bpf_jit_prog_release_other() pair for rewriting the program into a blinded one, and to map the BPF_REG_AX register to a CPU register. The mapping of BPF_REG_AX is at non-callee saved register r10, and thus shared with cached skb->data used for ld_abs/ind and not in every program type needed. When blinding is not used, there's zero additional overhead in the generated image. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Daniel Borkmann 提交于
Since the blinding is strictly only called from inside eBPF JITs, we need to change signatures for bpf_int_jit_compile() and bpf_prog_select_runtime() first in order to prepare that the eBPF program we're dealing with can change underneath. Hence, for call sites, we need to return the latest prog. No functional change in this patch. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Daniel Borkmann 提交于
There is never such a situation, where bpf_int_jit_compile() is called with either prog as NULL or len as 0, so the tests are unnecessary and confusing as people would just copy them. s390 doesn't have them, so no change is needed there. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 19 12月, 2015 2 次提交
-
-
由 Daniel Borkmann 提交于
When sometimes structs or variables need to be initialized/'memset' to 0 in an eBPF C program, the x86 BPF JIT converts this to use immediates. We can however save a couple of bytes (f.e. even up to 7 bytes on a single emmission of BPF_LD | BPF_IMM | BPF_DW) in the image by detecting such case and use xor on the dst register instead. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Daniel Borkmann 提交于
Back in the days where eBPF (or back then "internal BPF" ;->) was not exposed to user space, and only the classic BPF programs internally translated into eBPF programs, we missed the fact that for classic BPF A and X needed to be cleared. It was fixed back then via 83d5b7ef ("net: filter: initialize A and X registers"), and thus classic BPF specifics were added to the eBPF interpreter core to work around it. This added some confusion for JIT developers later on that take the eBPF interpreter code as an example for deriving their JIT. F.e. in f75298f5 ("s390/bpf: clear correct BPF accumulator register"), at least X could leak stack memory. Furthermore, since this is only needed for classic BPF translations and not for eBPF (verifier takes care that read access to regs cannot be done uninitialized), more complexity is added to JITs as they need to determine whether they deal with migrations or native eBPF where they can just omit clearing A/X in their prologue and thus reduce image size a bit, see f.e. cde66c2d ("s390/bpf: Only clear A and X for converted BPF programs"). In other cases (x86, arm64), A and X is being cleared in the prologue also for eBPF case, which is unnecessary. Lets move this into the BPF migration in bpf_convert_filter() where it actually belongs as long as the number of eBPF JITs are still few. It can thus be done generically; allowing us to remove the quirk from __bpf_prog_run() and to slightly reduce JIT image size in case of eBPF, while reducing code duplication on this matter in current(/future) eBPF JITs. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Reviewed-by: NMichael Holzheu <holzheu@linux.vnet.ibm.com> Tested-by: NMichael Holzheu <holzheu@linux.vnet.ibm.com> Cc: Zi Shen Lim <zlim.lnx@gmail.com> Cc: Yang Shi <yang.shi@linaro.org> Acked-by: NYang Shi <yang.shi@linaro.org> Acked-by: NZi Shen Lim <zlim.lnx@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 03 10月, 2015 1 次提交
-
-
由 Daniel Borkmann 提交于
As we need to add further flags to the bpf_prog structure, lets migrate both bools to a bitfield representation. The size of the base structure (excluding insns) remains unchanged at 40 bytes. Add also tags for the kmemchecker, so that it doesn't throw false positives. Even in case gcc would generate suboptimal code, it's not being accessed in performance critical paths. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@plumgrid.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 10 8月, 2015 1 次提交
-
-
由 Wang Nan 提交于
All the map backends are of generic nature. In order to avoid adding much special code into the eBPF core, rewrite part of the bpf_prog_array map code and make it more generic. So the new perf_event_array map type can reuse most of code with bpf_prog_array map and add fewer lines of special code. Signed-off-by: NWang Nan <wangnan0@huawei.com> Signed-off-by: NKaixu Xia <xiakaixu@huawei.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 31 7月, 2015 1 次提交
-
-
由 Daniel Borkmann 提交于
When bpf_jit_compile() got split into two functions via commit f3c2af7b ("net: filter: x86: split bpf_jit_compile()"), bpf_jit_dump() was changed to always show 0 as number of compiler passes. Change it to dump the actual number. Also on sparc, we count passes starting from 0, so add 1 for the debug dump as well. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@plumgrid.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 30 7月, 2015 1 次提交
-
-
由 Daniel Borkmann 提交于
With eBPF JIT compiler enabled on x86_64, I was able to reliably trigger the following general protection fault out of an eBPF program with a simple tail call, f.e. tracex5 (or a stripped down version of it): [ 927.097918] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC [...] [ 927.100870] task: ffff8801f228b780 ti: ffff880016a64000 task.ti: ffff880016a64000 [ 927.102096] RIP: 0010:[<ffffffffa002440d>] [<ffffffffa002440d>] 0xffffffffa002440d [ 927.103390] RSP: 0018:ffff880016a67a68 EFLAGS: 00010006 [ 927.104683] RAX: 5a5a5a5a5a5a5a5a RBX: 0000000000000000 RCX: 0000000000000001 [ 927.105921] RDX: 0000000000000000 RSI: ffff88014e438000 RDI: ffff880016a67e00 [ 927.107137] RBP: ffff880016a67c90 R08: 0000000000000000 R09: 0000000000000001 [ 927.108351] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880016a67e00 [ 927.109567] R13: 0000000000000000 R14: ffff88026500e460 R15: ffff880220a81520 [ 927.110787] FS: 00007fe7d5c1f740(0000) GS:ffff880265000000(0000) knlGS:0000000000000000 [ 927.112021] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 927.113255] CR2: 0000003e7bbb91a0 CR3: 000000006e04b000 CR4: 00000000001407e0 [ 927.114500] Stack: [ 927.115737] ffffc90008cdb000 ffff880016a67e00 ffff88026500e460 ffff880220a81520 [ 927.117005] 0000000100000000 000000000000001b ffff880016a67aa8 ffffffff8106c548 [ 927.118276] 00007ffcdaf22e58 0000000000000000 0000000000000000 ffff880016a67ff0 [ 927.119543] Call Trace: [ 927.120797] [<ffffffff8106c548>] ? lookup_address+0x28/0x30 [ 927.122058] [<ffffffff8113d176>] ? __module_text_address+0x16/0x70 [ 927.123314] [<ffffffff8117bf0e>] ? is_ftrace_trampoline+0x3e/0x70 [ 927.124562] [<ffffffff810c1a0f>] ? __kernel_text_address+0x5f/0x80 [ 927.125806] [<ffffffff8102086f>] ? print_context_stack+0x7f/0xf0 [ 927.127033] [<ffffffff810f7852>] ? __lock_acquire+0x572/0x2050 [ 927.128254] [<ffffffff810f7852>] ? __lock_acquire+0x572/0x2050 [ 927.129461] [<ffffffff8119edfa>] ? trace_call_bpf+0x3a/0x140 [ 927.130654] [<ffffffff8119ee4a>] trace_call_bpf+0x8a/0x140 [ 927.131837] [<ffffffff8119edfa>] ? trace_call_bpf+0x3a/0x140 [ 927.133015] [<ffffffff8119f008>] kprobe_perf_func+0x28/0x220 [ 927.134195] [<ffffffff811a1668>] kprobe_dispatcher+0x38/0x60 [ 927.135367] [<ffffffff81174b91>] ? seccomp_phase1+0x1/0x230 [ 927.136523] [<ffffffff81061400>] kprobe_ftrace_handler+0xf0/0x150 [ 927.137666] [<ffffffff81174b95>] ? seccomp_phase1+0x5/0x230 [ 927.138802] [<ffffffff8117950c>] ftrace_ops_recurs_func+0x5c/0xb0 [ 927.139934] [<ffffffffa022b0d5>] 0xffffffffa022b0d5 [ 927.141066] [<ffffffff81174b91>] ? seccomp_phase1+0x1/0x230 [ 927.142199] [<ffffffff81174b95>] seccomp_phase1+0x5/0x230 [ 927.143323] [<ffffffff8102c0a4>] syscall_trace_enter_phase1+0xc4/0x150 [ 927.144450] [<ffffffff81174b95>] ? seccomp_phase1+0x5/0x230 [ 927.145572] [<ffffffff8102c0a4>] ? syscall_trace_enter_phase1+0xc4/0x150 [ 927.146666] [<ffffffff817f9a9f>] tracesys+0xd/0x44 [ 927.147723] Code: 48 8b 46 10 48 39 d0 76 2c 8b 85 fc fd ff ff 83 f8 20 77 21 83 c0 01 89 85 fc fd ff ff 48 8d 44 d6 80 48 8b 00 48 83 f8 00 74 0a <48> 8b 40 20 48 83 c0 33 ff e0 48 89 d8 48 8b 9d d8 fd ff ff 4c [ 927.150046] RIP [<ffffffffa002440d>] 0xffffffffa002440d The code section with the instructions that traps points into the eBPF JIT image of the root program (the one invoking the tail call instruction). Using bpf_jit_disasm -o on the eBPF root program image: [...] 4e: mov -0x204(%rbp),%eax 8b 85 fc fd ff ff 54: cmp $0x20,%eax <--- if (tail_call_cnt > MAX_TAIL_CALL_CNT) 83 f8 20 57: ja 0x000000000000007a 77 21 59: add $0x1,%eax <--- tail_call_cnt++ 83 c0 01 5c: mov %eax,-0x204(%rbp) 89 85 fc fd ff ff 62: lea -0x80(%rsi,%rdx,8),%rax <--- prog = array->prog[index] 48 8d 44 d6 80 67: mov (%rax),%rax 48 8b 00 6a: cmp $0x0,%rax <--- check for NULL 48 83 f8 00 6e: je 0x000000000000007a 74 0a 70: mov 0x20(%rax),%rax <--- GPF triggered here! fetch of bpf_func 48 8b 40 20 [ matches <48> 8b 40 20 ... from above ] 74: add $0x33,%rax <--- prologue skip of new prog 48 83 c0 33 78: jmpq *%rax <--- jump to new prog insns ff e0 [...] The problem is that rax has 5a5a5a5a5a5a5a5a, which suggests a tail call jump to map slot 0 is pointing to a poisoned page. The issue is the following: lea instruction has a wrong offset, i.e. it should be ... lea 0x80(%rsi,%rdx,8),%rax ... but it actually seems to be ... lea -0x80(%rsi,%rdx,8),%rax ... where 0x80 is offsetof(struct bpf_array, prog), thus the offset needs to be positive instead of negative. Disassembling the interpreter, we btw similarly do: [...] c88: lea 0x80(%rax,%rdx,8),%rax <--- prog = array->prog[index] 48 8d 84 d0 80 00 00 00 c90: add $0x1,%r13d 41 83 c5 01 c94: mov (%rax),%rax 48 8b 00 [...] Now the other interesting fact is that this panic triggers only when things like CONFIG_LOCKDEP are being used. In that case offsetof(struct bpf_array, prog) starts at offset 0x80 and in non-CONFIG_LOCKDEP case at offset 0x50. Reason is that the work_struct inside struct bpf_map grows by 48 bytes in my case due to the lockdep_map member (which also has CONFIG_LOCK_STAT enabled members). Changing the emitter to always use the 4 byte displacement in the lea instruction fixes the panic on my side. It increases the tail call instruction emission by 3 more byte, but it should cover us from various combinations (and perhaps other future increases on related structures). After patch, disassembly: [...] 9e: lea 0x80(%rsi,%rdx,8),%rax <--- CONFIG_LOCKDEP/CONFIG_LOCK_STAT 48 8d 84 d6 80 00 00 00 a6: mov (%rax),%rax 48 8b 00 [...] [...] 9e: lea 0x50(%rsi,%rdx,8),%rax <--- No CONFIG_LOCKDEP 48 8d 84 d6 50 00 00 00 a6: mov (%rax),%rax 48 8b 00 [...] Fixes: b52f00e6 ("x86: bpf_jit: implement bpf_tail_call() helper") Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@plumgrid.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 21 7月, 2015 1 次提交
-
-
由 Alexei Starovoitov 提交于
Allow eBPF programs attached to TC qdiscs call skb_vlan_push/pop via helper functions. These functions may change skb->data/hlen which are cached by some JITs to improve performance of ld_abs/ld_ind instructions. Therefore JITs need to recognize bpf_skb_vlan_push/pop() calls, re-compute header len and re-cache skb->data/hlen back into cpu registers. Note, skb->data/hlen are not directly accessible from the programs, so any changes to skb->data done either by these helpers or by other TC actions are safe. eBPF JIT supported by three architectures: - arm64 JIT is using bpf_load_pointer() without caching, so it's ok as-is. - x64 JIT re-caches skb->data/hlen unconditionally after vlan_push/pop calls (experiments showed that conditional re-caching is slower). - s390 JIT falls back to interpreter for now when bpf_skb_vlan_push() is present in the program (re-caching is tbd). These helpers allow more scalable handling of vlan from the programs. Instead of creating thousands of vlan netdevs on top of eth0 and attaching TC+ingress+bpf to all of them, the program can be attached to eth0 directly and manipulate vlans as necessary. Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 25 5月, 2015 1 次提交
-
-
由 Alexei Starovoitov 提交于
x86 has variable length encoding. x86 JIT compiler is trying to pick the shortest encoding for given bpf instruction. While doing so the jump targets are changing, so JIT is doing multiple passes over the program. Typical program needs 3 passes. Some very short programs converge with 2 passes. Large programs may need 4 or 5. But specially crafted bpf programs may hit the pass limit and if the program converges on the last iteration the JIT compiler will be producing an image full of 'int 3' insns. Fix this corner case by doing final iteration over bpf program. Fixes: 0a14842f ("net: filter: Just In Time compiler for x86-64") Reported-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com> Tested-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 22 5月, 2015 1 次提交
-
-
由 Alexei Starovoitov 提交于
bpf_tail_call() arguments: ctx - context pointer jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table index - index in the jump table In this implementation x64 JIT bypasses stack unwind and jumps into the callee program after prologue, so the callee program reuses the same stack. The logic can be roughly expressed in C like: u32 tail_call_cnt; void *jumptable[2] = { &&label1, &&label2 }; int bpf_prog1(void *ctx) { label1: ... } int bpf_prog2(void *ctx) { label2: ... } int bpf_prog1(void *ctx) { ... if (tail_call_cnt++ < MAX_TAIL_CALL_CNT) goto *jumptable[index]; ... and pass my 'ctx' to callee ... ... fall through if no entry in jumptable ... } Note that 'skip current program epilogue and next program prologue' is an optimization. Other JITs don't have to do it the same way. >From safety point of view it's valid as well, since programs always initialize the stack before use, so any residue in the stack left by the current program is not going be read. The same verifier checks are done for the calls from the kernel into all bpf programs. Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 13 5月, 2015 1 次提交
-
-
由 Alexei Starovoitov 提交于
FROM_BE16: 'ror %reg, 8' doesn't clear upper bits of the register, so use additional 'movzwl' insn to zero extend 16 bits into 64 FROM_LE16: should zero extend lower 16 bits into 64 bit FROM_LE32: should zero extend lower 32 bits into 64 bit Fixes: 89aa0758 ("net: sock: allow eBPF programs to be attached to sockets") Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 10 12月, 2014 2 次提交
-
-
由 Joe Perches 提交于
Let the compiler decide instead. No change in object size x86-64 -O2 no profiling Signed-off-by: NJoe Perches <joe@perches.com> Suggested-by: NEric Dumazet <eric.dumazet@gmail.com> Acked-by: NAlexei Starovoitov <ast@plumgrid.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Joe Perches 提交于
Use the (1 << reg) & mask trick to reduce code size. x86-64 size difference -O2 without profiling for various gcc versions: $ size arch/x86/net/bpf_jit_comp.o* text data bss dec hex filename 9266 4 0 9270 2436 arch/x86/net/bpf_jit_comp.o.4.4.new 10042 4 0 10046 273e arch/x86/net/bpf_jit_comp.o.4.4.old 9109 4 0 9113 2399 arch/x86/net/bpf_jit_comp.o.4.6.new 9717 4 0 9721 25f9 arch/x86/net/bpf_jit_comp.o.4.6.old 8789 4 0 8793 2259 arch/x86/net/bpf_jit_comp.o.4.7.new 10245 4 0 10249 2809 arch/x86/net/bpf_jit_comp.o.4.7.old 9671 4 0 9675 25cb arch/x86/net/bpf_jit_comp.o.4.9.new 10679 4 0 10683 29bb arch/x86/net/bpf_jit_comp.o.4.9.old Signed-off-by: NJoe Perches <joe@perches.com> Tested-by: NAlexei Starovoitov <ast@plumgrid.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 06 12月, 2014 1 次提交
-
-
由 Alexei Starovoitov 提交于
classic BPF has a restriction that last insn is always BPF_RET. eBPF doesn't have BPF_RET instruction and this restriction. It has BPF_EXIT insn which can appear anywhere in the program one or more times and it doesn't have to be last insn. Fix eBPF JIT to emit epilogue when first BPF_EXIT is seen and all other BPF_EXIT instructions will be emitted as jump. Since jump offset to epilogue is computed as: jmp_offset = ctx->cleanup_addr - addrs[i] we need to change type of cleanup_addr to signed to compute the offset as: (long long) ((int)20 - (int)30) instead of: (long long) ((unsigned int)20 - (int)30) Fixes: 62258278 ("net: filter: x86: internal BPF JIT") Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 15 10月, 2014 1 次提交
-
-
由 Alexei Starovoitov 提交于
1. JIT compiler using multi-pass approach to converge to final image size, since x86 instructions are variable length. It starts with large gaps between instructions (so some jumps may use imm32 instead of imm8) and iterates until total program size is the same as in previous pass. This algorithm works only if program size is strictly decreasing. Programs that use LD_ABS insn need additional code in prologue, but it was not emitted during 1st pass, so there was a chance that 2nd pass would adjust imm32->imm8 jump offsets to the same number of bytes as increase in prologue, which may cause algorithm to erroneously decide that size converged. Fix it by always emitting largest prologue in the first pass which is detected by oldproglen==0 check. Also change error check condition 'proglen != oldproglen' to fail gracefully. 2. while staring at the code realized that 64-byte buffer may not be enough when 1st insn is large, so increase it to 128 to avoid buffer overflow (theoretical maximum size of prologue+div is 109) and add runtime check. Fixes: 62258278 ("net: filter: x86: internal BPF JIT") Reported-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com> Tested-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-