1. 10 1月, 2018 1 次提交
    • A
      bpf: introduce BPF_JIT_ALWAYS_ON config · 290af866
      Alexei Starovoitov 提交于
      The BPF interpreter has been used as part of the spectre 2 attack CVE-2017-5715.
      
      A quote from goolge project zero blog:
      "At this point, it would normally be necessary to locate gadgets in
      the host kernel code that can be used to actually leak data by reading
      from an attacker-controlled location, shifting and masking the result
      appropriately and then using the result of that as offset to an
      attacker-controlled address for a load. But piecing gadgets together
      and figuring out which ones work in a speculation context seems annoying.
      So instead, we decided to use the eBPF interpreter, which is built into
      the host kernel - while there is no legitimate way to invoke it from inside
      a VM, the presence of the code in the host kernel's text section is sufficient
      to make it usable for the attack, just like with ordinary ROP gadgets."
      
      To make attacker job harder introduce BPF_JIT_ALWAYS_ON config
      option that removes interpreter from the kernel in favor of JIT-only mode.
      So far eBPF JIT is supported by:
      x64, arm64, arm32, sparc64, s390, powerpc64, mips64
      
      The start of JITed program is randomized and code page is marked as read-only.
      In addition "constant blinding" can be turned on with net.core.bpf_jit_harden
      
      v2->v3:
      - move __bpf_prog_ret0 under ifdef (Daniel)
      
      v1->v2:
      - fix init order, test_bpf and cBPF (Daniel's feedback)
      - fix offloaded bpf (Jakub's feedback)
      - add 'return 0' dummy in case something can invoke prog->bpf_func
      - retarget bpf tree. For bpf-next the patch would need one extra hunk.
        It will be sent when the trees are merged back to net-next
      
      Considered doing:
        int bpf_jit_enable __read_mostly = BPF_EBPF_JIT_DEFAULT;
      but it seems better to land the patch as-is and in bpf-next remove
      bpf_jit_enable global variable from all JITs, consolidate in one place
      and remove this jit_init() function.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      290af866
  2. 01 12月, 2017 1 次提交
  3. 16 11月, 2017 1 次提交
  4. 11 11月, 2017 2 次提交
  5. 05 11月, 2017 1 次提交
    • J
      bpf: offload: add infrastructure for loading programs for a specific netdev · ab3f0063
      Jakub Kicinski 提交于
      The fact that we don't know which device the program is going
      to be used on is quite limiting in current eBPF infrastructure.
      We have to reverse or limit the changes which kernel makes to
      the loaded bytecode if we want it to be offloaded to a networking
      device.  We also have to invent new APIs for debugging and
      troubleshooting support.
      
      Make it possible to load programs for a specific netdev.  This
      helps us to bring the debug information closer to the core
      eBPF infrastructure (e.g. we will be able to reuse the verifer
      log in device JIT).  It allows device JITs to perform translation
      on the original bytecode.
      
      __bpf_prog_get() when called to get a reference for an attachment
      point will now refuse to give it if program has a device assigned.
      Following patches will add a version of that function which passes
      the expected netdev in. @type argument in __bpf_prog_get() is
      renamed to attach_type to make it clearer that it's only set on
      attachment.
      
      All calls to ndo_bpf are protected by rtnl, only verifier callbacks
      are not.  We need a wait queue to make sure netdev doesn't get
      destroyed while verifier is still running and calling its driver.
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: NSimon Horman <simon.horman@netronome.com>
      Reviewed-by: NQuentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ab3f0063
  6. 25 10月, 2017 1 次提交
    • Y
      bpf: permit multiple bpf attachments for a single perf event · e87c6bc3
      Yonghong Song 提交于
      This patch enables multiple bpf attachments for a
      kprobe/uprobe/tracepoint single trace event.
      Each trace_event keeps a list of attached perf events.
      When an event happens, all attached bpf programs will
      be executed based on the order of attachment.
      
      A global bpf_event_mutex lock is introduced to protect
      prog_array attaching and detaching. An alternative will
      be introduce a mutex lock in every trace_event_call
      structure, but it takes a lot of extra memory.
      So a global bpf_event_mutex lock is a good compromise.
      
      The bpf prog detachment involves allocation of memory.
      If the allocation fails, a dummy do-nothing program
      will replace to-be-detached program in-place.
      Signed-off-by: NYonghong Song <yhs@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e87c6bc3
  7. 17 10月, 2017 1 次提交
  8. 08 10月, 2017 1 次提交
  9. 05 10月, 2017 2 次提交
    • A
      bpf: introduce BPF_PROG_QUERY command · 468e2f64
      Alexei Starovoitov 提交于
      introduce BPF_PROG_QUERY command to retrieve a set of either
      attached programs to given cgroup or a set of effective programs
      that will execute for events within a cgroup
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      for cgroup bits
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      468e2f64
    • A
      bpf: multi program support for cgroup+bpf · 324bda9e
      Alexei Starovoitov 提交于
      introduce BPF_F_ALLOW_MULTI flag that can be used to attach multiple
      bpf programs to a cgroup.
      
      The difference between three possible flags for BPF_PROG_ATTACH command:
      - NONE(default): No further bpf programs allowed in the subtree.
      - BPF_F_ALLOW_OVERRIDE: If a sub-cgroup installs some bpf program,
        the program in this cgroup yields to sub-cgroup program.
      - BPF_F_ALLOW_MULTI: If a sub-cgroup installs some bpf program,
        that cgroup program gets run in addition to the program in this cgroup.
      
      NONE and BPF_F_ALLOW_OVERRIDE existed before. This patch doesn't
      change their behavior. It only clarifies the semantics in relation
      to new flag.
      
      Only one program is allowed to be attached to a cgroup with
      NONE or BPF_F_ALLOW_OVERRIDE flag.
      Multiple programs are allowed to be attached to a cgroup with
      BPF_F_ALLOW_MULTI flag. They are executed in FIFO order
      (those that were attached first, run first)
      The programs of sub-cgroup are executed first, then programs of
      this cgroup and then programs of parent cgroup.
      All eligible programs are executed regardless of return code from
      earlier programs.
      
      To allow efficient execution of multiple programs attached to a cgroup
      and to avoid penalizing cgroups without any programs attached
      introduce 'struct bpf_prog_array' which is RCU protected array
      of pointers to bpf programs.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      for cgroup bits
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      324bda9e
  10. 04 10月, 2017 1 次提交
  11. 17 8月, 2017 1 次提交
  12. 10 8月, 2017 1 次提交
    • D
      bpf: add BPF_J{LT,LE,SLT,SLE} instructions · 92b31a9a
      Daniel Borkmann 提交于
      Currently, eBPF only understands BPF_JGT (>), BPF_JGE (>=),
      BPF_JSGT (s>), BPF_JSGE (s>=) instructions, this means that
      particularly *JLT/*JLE counterparts involving immediates need
      to be rewritten from e.g. X < [IMM] by swapping arguments into
      [IMM] > X, meaning the immediate first is required to be loaded
      into a register Y := [IMM], such that then we can compare with
      Y > X. Note that the destination operand is always required to
      be a register.
      
      This has the downside of having unnecessarily increased register
      pressure, meaning complex program would need to spill other
      registers temporarily to stack in order to obtain an unused
      register for the [IMM]. Loading to registers will thus also
      affect state pruning since we need to account for that register
      use and potentially those registers that had to be spilled/filled
      again. As a consequence slightly more stack space might have
      been used due to spilling, and BPF programs are a bit longer
      due to extra code involving the register load and potentially
      required spill/fills.
      
      Thus, add BPF_JLT (<), BPF_JLE (<=), BPF_JSLT (s<), BPF_JSLE (s<=)
      counterparts to the eBPF instruction set. Modifying LLVM to
      remove the NegateCC() workaround in a PoC patch at [1] and
      allowing it to also emit the new instructions resulted in
      cilium's BPF programs that are injected into the fast-path to
      have a reduced program length in the range of 2-3% (e.g.
      accumulated main and tail call sections from one of the object
      file reduced from 4864 to 4729 insns), reduced complexity in
      the range of 10-30% (e.g. accumulated sections reduced in one
      of the cases from 116432 to 88428 insns), and reduced stack
      usage in the range of 1-5% (e.g. accumulated sections from one
      of the object files reduced from 824 to 784b).
      
      The modification for LLVM will be incorporated in a backwards
      compatible way. Plan is for LLVM to have i) a target specific
      option to offer a possibility to explicitly enable the extension
      by the user (as we have with -m target specific extensions today
      for various CPU insns), and ii) have the kernel checked for
      presence of the extensions and enable them transparently when
      the user is selecting more aggressive options such as -march=native
      in a bpf target context. (Other frontends generating BPF byte
      code, e.g. ply can probe the kernel directly for its code
      generation.)
      
        [1] https://github.com/borkmann/llvm/tree/bpf-insnsSigned-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      92b31a9a
  13. 30 6月, 2017 1 次提交
    • M
      bpf: Fix out-of-bound access on interpreters[] · 8007e40a
      Martin KaFai Lau 提交于
      The index is off-by-one when fp->aux->stack_depth
      has already been rounded up to 32.  In particular,
      if stack_depth is 512, the index will be 16.
      
      The fix is to round_up and then takes -1 instead of round_down.
      
      [   22.318680] ==================================================================
      [   22.319745] BUG: KASAN: global-out-of-bounds in bpf_prog_select_runtime+0x48a/0x670
      [   22.320737] Read of size 8 at addr ffffffff82aadae0 by task sockex3/1946
      [   22.321646]
      [   22.321858] CPU: 1 PID: 1946 Comm: sockex3 Tainted: G        W       4.12.0-rc6-01680-g2ee87db3 #22
      [   22.323061] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.el7.centos 04/01/2014
      [   22.324260] Call Trace:
      [   22.324612]  dump_stack+0x67/0x99
      [   22.325081]  print_address_description+0x1e8/0x290
      [   22.325734]  ? bpf_prog_select_runtime+0x48a/0x670
      [   22.326360]  kasan_report+0x265/0x350
      [   22.326860]  __asan_report_load8_noabort+0x19/0x20
      [   22.327484]  bpf_prog_select_runtime+0x48a/0x670
      [   22.328109]  bpf_prog_load+0x626/0xd40
      [   22.328637]  ? __bpf_prog_charge+0xc0/0xc0
      [   22.329222]  ? check_nnp_nosuid.isra.61+0x100/0x100
      [   22.329890]  ? __might_fault+0xf6/0x1b0
      [   22.330446]  ? lock_acquire+0x360/0x360
      [   22.331013]  SyS_bpf+0x67c/0x24d0
      [   22.331491]  ? trace_hardirqs_on+0xd/0x10
      [   22.332049]  ? __getnstimeofday64+0xaf/0x1c0
      [   22.332635]  ? bpf_prog_get+0x20/0x20
      [   22.333135]  ? __audit_syscall_entry+0x300/0x600
      [   22.333770]  ? syscall_trace_enter+0x540/0xdd0
      [   22.334339]  ? exit_to_usermode_loop+0xe0/0xe0
      [   22.334950]  ? do_syscall_64+0x48/0x410
      [   22.335446]  ? bpf_prog_get+0x20/0x20
      [   22.335954]  do_syscall_64+0x181/0x410
      [   22.336454]  entry_SYSCALL64_slow_path+0x25/0x25
      [   22.337121] RIP: 0033:0x7f263fe81f19
      [   22.337618] RSP: 002b:00007ffd9a3440c8 EFLAGS: 00000202 ORIG_RAX: 0000000000000141
      [   22.338619] RAX: ffffffffffffffda RBX: 0000000000aac5fb RCX: 00007f263fe81f19
      [   22.339600] RDX: 0000000000000030 RSI: 00007ffd9a3440d0 RDI: 0000000000000005
      [   22.340470] RBP: 0000000000a9a1e0 R08: 0000000000a9a1e0 R09: 0000009d00000001
      [   22.341430] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000010000
      [   22.342411] R13: 0000000000a9a023 R14: 0000000000000001 R15: 0000000000000003
      [   22.343369]
      [   22.343593] The buggy address belongs to the variable:
      [   22.344241]  interpreters+0x80/0x980
      [   22.344708]
      [   22.344908] Memory state around the buggy address:
      [   22.345556]  ffffffff82aad980: 00 00 00 04 fa fa fa fa 04 fa fa fa fa fa fa fa
      [   22.346449]  ffffffff82aada00: 00 00 00 00 00 fa fa fa fa fa fa fa 00 00 00 00
      [   22.347361] >ffffffff82aada80: 00 00 00 00 00 00 00 00 00 00 00 00 fa fa fa fa
      [   22.348301]                                                        ^
      [   22.349142]  ffffffff82aadb00: 00 01 fa fa fa fa fa fa 00 00 00 00 00 00 00 00
      [   22.350058]  ffffffff82aadb80: 00 00 07 fa fa fa fa fa 00 00 05 fa fa fa fa fa
      [   22.350984] ==================================================================
      
      Fixes: b870aa90 ("bpf: use different interpreter depending on required stack size")
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@fb.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8007e40a
  14. 01 6月, 2017 3 次提交
    • A
      bpf: use different interpreter depending on required stack size · b870aa90
      Alexei Starovoitov 提交于
      16 __bpf_prog_run() interpreters for various stack sizes add .text
      but not a lot comparing to run-time stack savings
      
         text	   data	    bss	    dec	    hex	filename
        26350   10328     624   37302    91b6 kernel/bpf/core.o.before_split
        25777   10328     624   36729    8f79 kernel/bpf/core.o.after_split
        26970	  10328	    624	  37922	   9422	kernel/bpf/core.o.now
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b870aa90
    • A
      bpf: split bpf core interpreter · f696b8f4
      Alexei Starovoitov 提交于
      split __bpf_prog_run() interpreter into stack allocation and execution parts.
      The code section shrinks which helps interpreter performance in some cases.
         text	   data	    bss	    dec	    hex	filename
        26350	  10328	    624	  37302	   91b6	kernel/bpf/core.o.before
        25777	  10328	    624	  36729	   8f79	kernel/bpf/core.o.after
      
      Very short programs got slower (due to extra function call):
      Before:
      test_bpf: #89 ALU64_ADD_K: 1 + 2 = 3 jited:0 7 PASS
      test_bpf: #90 ALU64_ADD_K: 3 + 0 = 3 jited:0 8 PASS
      test_bpf: #91 ALU64_ADD_K: 1 + 2147483646 = 2147483647 jited:0 7 PASS
      test_bpf: #92 ALU64_ADD_K: 4294967294 + 2 = 4294967296 jited:0 11 PASS
      test_bpf: #93 ALU64_ADD_K: 2147483646 + -2147483647 = -1 jited:0 7 PASS
      After:
      test_bpf: #89 ALU64_ADD_K: 1 + 2 = 3 jited:0 11 PASS
      test_bpf: #90 ALU64_ADD_K: 3 + 0 = 3 jited:0 11 PASS
      test_bpf: #91 ALU64_ADD_K: 1 + 2147483646 = 2147483647 jited:0 11 PASS
      test_bpf: #92 ALU64_ADD_K: 4294967294 + 2 = 4294967296 jited:0 14 PASS
      test_bpf: #93 ALU64_ADD_K: 2147483646 + -2147483647 = -1 jited:0 10 PASS
      
      Longer programs got faster:
      Before:
      test_bpf: #266 BPF_MAXINSNS: Ctx heavy transformations jited:0 20286 20513 PASS
      test_bpf: #267 BPF_MAXINSNS: Call heavy transformations jited:0 31853 31768 PASS
      test_bpf: #268 BPF_MAXINSNS: Jump heavy test jited:0 9815 PASS
      test_bpf: #269 BPF_MAXINSNS: Very long jump backwards jited:0 6 PASS
      test_bpf: #270 BPF_MAXINSNS: Edge hopping nuthouse jited:0 13959 PASS
      test_bpf: #271 BPF_MAXINSNS: Jump, gap, jump, ... jited:0 210 PASS
      test_bpf: #272 BPF_MAXINSNS: ld_abs+get_processor_id jited:0 21724 PASS
      test_bpf: #273 BPF_MAXINSNS: ld_abs+vlan_push/pop jited:0 19118 PASS
      After:
      test_bpf: #266 BPF_MAXINSNS: Ctx heavy transformations jited:0 19008 18827 PASS
      test_bpf: #267 BPF_MAXINSNS: Call heavy transformations jited:0 29238 28450 PASS
      test_bpf: #268 BPF_MAXINSNS: Jump heavy test jited:0 9485 PASS
      test_bpf: #269 BPF_MAXINSNS: Very long jump backwards jited:0 12 PASS
      test_bpf: #270 BPF_MAXINSNS: Edge hopping nuthouse jited:0 13257 PASS
      test_bpf: #271 BPF_MAXINSNS: Jump, gap, jump, ... jited:0 213 PASS
      test_bpf: #272 BPF_MAXINSNS: ld_abs+get_processor_id jited:0 19389 PASS
      test_bpf: #273 BPF_MAXINSNS: ld_abs+vlan_push/pop jited:0 19583 PASS
      
      For real world production programs the difference is noise.
      
      This patch is first step towards reducing interpreter stack consumption.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f696b8f4
    • A
      bpf: free up BPF_JMP | BPF_CALL | BPF_X opcode · 71189fa9
      Alexei Starovoitov 提交于
      free up BPF_JMP | BPF_CALL | BPF_X opcode to be used by actual
      indirect call by register and use kernel internal opcode to
      mark call instruction into bpf_tail_call() helper.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      71189fa9
  15. 09 5月, 2017 1 次提交
  16. 29 4月, 2017 1 次提交
  17. 11 4月, 2017 1 次提交
  18. 18 2月, 2017 2 次提交
    • D
      bpf: make jited programs visible in traces · 74451e66
      Daniel Borkmann 提交于
      Long standing issue with JITed programs is that stack traces from
      function tracing check whether a given address is kernel code
      through {__,}kernel_text_address(), which checks for code in core
      kernel, modules and dynamically allocated ftrace trampolines. But
      what is still missing is BPF JITed programs (interpreted programs
      are not an issue as __bpf_prog_run() will be attributed to them),
      thus when a stack trace is triggered, the code walking the stack
      won't see any of the JITed ones. The same for address correlation
      done from user space via reading /proc/kallsyms. This is read by
      tools like perf, but the latter is also useful for permanent live
      tracing with eBPF itself in combination with stack maps when other
      eBPF types are part of the callchain. See offwaketime example on
      dumping stack from a map.
      
      This work tries to tackle that issue by making the addresses and
      symbols known to the kernel. The lookup from *kernel_text_address()
      is implemented through a latched RB tree that can be read under
      RCU in fast-path that is also shared for symbol/size/offset lookup
      for a specific given address in kallsyms. The slow-path iteration
      through all symbols in the seq file done via RCU list, which holds
      a tiny fraction of all exported ksyms, usually below 0.1 percent.
      Function symbols are exported as bpf_prog_<tag>, in order to aide
      debugging and attribution. This facility is currently enabled for
      root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
      is active in any mode. The rationale behind this is that still a lot
      of systems ship with world read permissions on kallsyms thus addresses
      should not get suddenly exposed for them. If that situation gets
      much better in future, we always have the option to change the
      default on this. Likewise, unprivileged programs are not allowed
      to add entries there either, but that is less of a concern as most
      such programs types relevant in this context are for root-only anyway.
      If enabled, call graphs and stack traces will then show a correct
      attribution; one example is illustrated below, where the trace is
      now visible in tooling such as perf script --kallsyms=/proc/kallsyms
      and friends.
      
      Before:
      
        7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
               f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)
      
      After:
      
        7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
        [...]
        7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
               f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      74451e66
    • D
      bpf: remove stubs for cBPF from arch code · 9383191d
      Daniel Borkmann 提交于
      Remove the dummy bpf_jit_compile() stubs for eBPF JITs and make
      that a single __weak function in the core that can be overridden
      similarly to the eBPF one. Also remove stale pr_err() mentions
      of bpf_jit_compile.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9383191d
  19. 26 1月, 2017 1 次提交
    • D
      bpf: add initial bpf tracepoints · a67edbf4
      Daniel Borkmann 提交于
      This work adds a number of tracepoints to paths that are either
      considered slow-path or exception-like states, where monitoring or
      inspecting them would be desirable.
      
      For bpf(2) syscall, tracepoints have been placed for main commands
      when they succeed. In XDP case, tracepoint is for exceptions, that
      is, f.e. on abnormal BPF program exit such as unknown or XDP_ABORTED
      return code, or when error occurs during XDP_TX action and the packet
      could not be forwarded.
      
      Both have been split into separate event headers, and can be further
      extended. Worst case, if they unexpectedly should get into our way in
      future, they can also removed [1]. Of course, these tracepoints (like
      any other) can be analyzed by eBPF itself, etc. Example output:
      
        # ./perf record -a -e bpf:* sleep 10
        # ./perf script
        sock_example  6197 [005]   283.980322:      bpf:bpf_map_create: map type=ARRAY ufd=4 key=4 val=8 max=256 flags=0
        sock_example  6197 [005]   283.980721:       bpf:bpf_prog_load: prog=a5ea8fa30ea6849c type=SOCKET_FILTER ufd=5
        sock_example  6197 [005]   283.988423:   bpf:bpf_prog_get_type: prog=a5ea8fa30ea6849c type=SOCKET_FILTER
        sock_example  6197 [005]   283.988443: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[06 00 00 00] val=[00 00 00 00 00 00 00 00]
        [...]
        sock_example  6197 [005]   288.990868: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[01 00 00 00] val=[14 00 00 00 00 00 00 00]
             swapper     0 [005]   289.338243:    bpf:bpf_prog_put_rcu: prog=a5ea8fa30ea6849c type=SOCKET_FILTER
      
        [1] https://lwn.net/Articles/705270/Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a67edbf4
  20. 17 1月, 2017 1 次提交
    • D
      bpf: rework prog_digest into prog_tag · f1f7714e
      Daniel Borkmann 提交于
      Commit 7bd509e3 ("bpf: add prog_digest and expose it via
      fdinfo/netlink") was recently discussed, partially due to
      admittedly suboptimal name of "prog_digest" in combination
      with sha1 hash usage, thus inevitably and rightfully concerns
      about its security in terms of collision resistance were
      raised with regards to use-cases.
      
      The intended use cases are for debugging resp. introspection
      only for providing a stable "tag" over the instruction sequence
      that both kernel and user space can calculate independently.
      It's not usable at all for making a security relevant decision.
      So collisions where two different instruction sequences generate
      the same tag can happen, but ideally at a rather low rate. The
      "tag" will be dumped in hex and is short enough to introspect
      in tracepoints or kallsyms output along with other data such
      as stack trace, etc. Thus, this patch performs a rename into
      prog_tag and truncates the tag to a short output (64 bits) to
      make it obvious it's not collision-free.
      
      Should in future a hash or facility be needed with a security
      relevant focus, then we can think about requirements, constraints,
      etc that would fit to that situation. For now, rework the exposed
      parts for the current use cases as long as nothing has been
      released yet. Tested on x86_64 and s390x.
      
      Fixes: 7bd509e3 ("bpf: add prog_digest and expose it via fdinfo/netlink")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f1f7714e
  21. 18 12月, 2016 2 次提交
    • D
      bpf: fix overflow in prog accounting · 5ccb071e
      Daniel Borkmann 提交于
      Commit aaac3ba9 ("bpf: charge user for creation of BPF maps and
      programs") made a wrong assumption of charging against prog->pages.
      Unlike map->pages, prog->pages are still subject to change when we
      need to expand the program through bpf_prog_realloc().
      
      This can for example happen during verification stage when we need to
      expand and rewrite parts of the program. Should the required space
      cross a page boundary, then prog->pages is not the same anymore as
      its original value that we used to bpf_prog_charge_memlock() on. Thus,
      we'll hit a wrap-around during bpf_prog_uncharge_memlock() when prog
      is freed eventually. I noticed this that despite having unlimited
      memlock, programs suddenly refused to load with EPERM error due to
      insufficient memlock.
      
      There are two ways to fix this issue. One would be to add a cached
      variable to struct bpf_prog that takes a snapshot of prog->pages at the
      time of charging. The other approach is to also account for resizes. I
      chose to go with the latter for a couple of reasons: i) We want accounting
      rather to be more accurate instead of further fooling limits, ii) adding
      yet another page counter on struct bpf_prog would also be a waste just
      for this purpose. We also do want to charge as early as possible to
      avoid going into the verifier just to find out later on that we crossed
      limits. The only place that needs to be fixed is bpf_prog_realloc(),
      since only here we expand the program, so we try to account for the
      needed delta and should we fail, call-sites check for outcome anyway.
      On cBPF to eBPF migrations, we don't grab a reference to the user as
      they are charged differently. With that in place, my test case worked
      fine.
      
      Fixes: aaac3ba9 ("bpf: charge user for creation of BPF maps and programs")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5ccb071e
    • D
      bpf: dynamically allocate digest scratch buffer · aafe6ae9
      Daniel Borkmann 提交于
      Geert rightfully complained that 7bd509e3 ("bpf: add prog_digest
      and expose it via fdinfo/netlink") added a too large allocation of
      variable 'raw' from bss section, and should instead be done dynamically:
      
        # ./scripts/bloat-o-meter kernel/bpf/core.o.1 kernel/bpf/core.o.2
        add/remove: 3/0 grow/shrink: 0/0 up/down: 33291/0 (33291)
        function                                     old     new   delta
        raw                                            -   32832  +32832
        [...]
      
      Since this is only relevant during program creation path, which can be
      considered slow-path anyway, lets allocate that dynamically and be not
      implicitly dependent on verifier mutex. Move bpf_prog_calc_digest() at
      the beginning of replace_map_fd_with_map_ptr() and also error handling
      stays straight forward.
      Reported-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aafe6ae9
  22. 09 12月, 2016 1 次提交
  23. 06 12月, 2016 1 次提交
    • D
      bpf: add prog_digest and expose it via fdinfo/netlink · 7bd509e3
      Daniel Borkmann 提交于
      When loading a BPF program via bpf(2), calculate the digest over
      the program's instruction stream and store it in struct bpf_prog's
      digest member. This is done at a point in time before any instructions
      are rewritten by the verifier. Any unstable map file descriptor
      number part of the imm field will be zeroed for the hash.
      
      fdinfo example output for progs:
      
        # cat /proc/1590/fdinfo/5
        pos:          0
        flags:        02000002
        mnt_id:       11
        prog_type:    1
        prog_jited:   1
        prog_digest:  b27e8b06da22707513aa97363dfb11c7c3675d28
        memlock:      4096
      
      When programs are pinned and retrieved by an ELF loader, the loader
      can check the program's digest through fdinfo and compare it against
      one that was generated over the ELF file's program section to see
      if the program needs to be reloaded. Furthermore, this can also be
      exposed through other means such as netlink in case of a tc cls/act
      dump (or xdp in future), but also through tracepoints or other
      facilities to identify the program. Other than that, the digest can
      also serve as a base name for the work in progress kallsyms support
      of programs. The digest doesn't depend/select the crypto layer, since
      we need to keep dependencies to a minimum. iproute2 will get support
      for this facility.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7bd509e3
  24. 23 10月, 2016 1 次提交
  25. 28 9月, 2016 1 次提交
  26. 10 9月, 2016 1 次提交
    • D
      bpf: add BPF_CALL_x macros for declaring helpers · f3694e00
      Daniel Borkmann 提交于
      This work adds BPF_CALL_<n>() macros and converts all the eBPF helper functions
      to use them, in a similar fashion like we do with SYSCALL_DEFINE<n>() macros
      that are used today. Motivation for this is to hide all the register handling
      and all necessary casts from the user, so that it is done automatically in the
      background when adding a BPF_CALL_<n>() call.
      
      This makes current helpers easier to review, eases to write future helpers,
      avoids getting the casting mess wrong, and allows for extending all helpers at
      once (f.e. build time checks, etc). It also helps detecting more easily in
      code reviews that unused registers are not instrumented in the code by accident,
      breaking compatibility with existing programs.
      
      BPF_CALL_<n>() internals are quite similar to SYSCALL_DEFINE<n>() ones with some
      fundamental differences, for example, for generating the actual helper function
      that carries all u64 regs, we need to fill unused regs, so that we always end up
      with 5 u64 regs as an argument.
      
      I reviewed several 0-5 generated BPF_CALL_<n>() variants of the .i results and
      they look all as expected. No sparse issue spotted. We let this also sit for a
      few days with Fengguang's kbuild test robot, and there were no issues seen. On
      s390, it barked on the "uses dynamic stack allocation" notice, which is an old
      one from bpf_perf_event_output{,_tp}() reappearing here due to the conversion
      to the call wrapper, just telling that the perf raw record/frag sits on stack
      (gcc with s390's -mwarn-dynamicstack), but that's all. Did various runtime tests
      and they were fine as well. All eBPF helpers are now converted to use these
      macros, getting rid of a good chunk of all the raw castings.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f3694e00
  27. 16 7月, 2016 1 次提交
    • D
      bpf: avoid stack copy and use skb ctx for event output · 555c8a86
      Daniel Borkmann 提交于
      This work addresses a couple of issues bpf_skb_event_output()
      helper currently has: i) We need two copies instead of just a
      single one for the skb data when it should be part of a sample.
      The data can be non-linear and thus needs to be extracted via
      bpf_skb_load_bytes() helper first, and then copied once again
      into the ring buffer slot. ii) Since bpf_skb_load_bytes()
      currently needs to be used first, the helper needs to see a
      constant size on the passed stack buffer to make sure BPF
      verifier can do sanity checks on it during verification time.
      Thus, just passing skb->len (or any other non-constant value)
      wouldn't work, but changing bpf_skb_load_bytes() is also not
      the proper solution, since the two copies are generally still
      needed. iii) bpf_skb_load_bytes() is just for rather small
      buffers like headers, since they need to sit on the limited
      BPF stack anyway. Instead of working around in bpf_skb_load_bytes(),
      this work improves the bpf_skb_event_output() helper to address
      all 3 at once.
      
      We can make use of the passed in skb context that we have in
      the helper anyway, and use some of the reserved flag bits as
      a length argument. The helper will use the new __output_custom()
      facility from perf side with bpf_skb_copy() as callback helper
      to walk and extract the data. It will pass the data for setup
      to bpf_event_output(), which generates and pushes the raw record
      with an additional frag part. The linear data used in the first
      frag of the record serves as programmatically defined meta data
      passed along with the appended sample.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      555c8a86
  28. 30 6月, 2016 1 次提交
  29. 21 5月, 2016 1 次提交
  30. 17 5月, 2016 4 次提交
    • D
      bpf: add generic constant blinding for use in jits · 4f3446bb
      Daniel Borkmann 提交于
      This work adds a generic facility for use from eBPF JIT compilers
      that allows for further hardening of JIT generated images through
      blinding constants. In response to the original work on BPF JIT
      spraying published by Keegan McAllister [1], most BPF JITs were
      changed to make images read-only and start at a randomized offset
      in the page, where the rest was filled with trap instructions. We
      have this nowadays in x86, arm, arm64 and s390 JIT compilers.
      Additionally, later work also made eBPF interpreter images read
      only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
      arm, arm64 and s390 archs as well currently. This is done by
      default for mentioned JITs when JITing is enabled. Furthermore,
      we had a generic and configurable constant blinding facility on our
      todo for quite some time now to further make spraying harder, and
      first implementation since around netconf 2016.
      
      We found that for systems where untrusted users can load cBPF/eBPF
      code where JIT is enabled, start offset randomization helps a bit
      to make jumps into crafted payload harder, but in case where larger
      programs that cross page boundary are injected, we again have some
      part of the program opcodes at a page start offset. With improved
      guessing and more reliable payload injection, chances can increase
      to jump into such payload. Elena Reshetova recently wrote a test
      case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
      can leave some more room for payloads. Note that for all this,
      additional bugs in the kernel are still required to make the jump
      (and of course to guess right, to not jump into a trap) and naturally
      the JIT must be enabled, which is disabled by default.
      
      For helping mitigation, the general idea is to provide an option
      bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
      that for cases where JIT should be enabled for performance reasons,
      the generated image can be further hardened with blinding constants
      for unpriviledged users (bpf_jit_harden == 1), with trading off
      performance for these, but not for privileged ones. We also added
      the option of blinding for all users (bpf_jit_harden == 2), which
      is quite helpful for testing f.e. with test_bpf.ko. There are no
      further e.g. hardening levels of bpf_jit_harden switch intended,
      rationale is to have it dead simple to use as on/off. Since this
      functionality would need to be duplicated over and over for JIT
      compilers to use, which are already complex enough, we provide a
      generic eBPF byte-code level based blinding implementation, which is
      then just transparently JITed. JIT compilers need to make only a few
      changes to integrate this facility and can be migrated one by one.
      
      This option is for eBPF JITs and will be used in x86, arm64, s390
      without too much effort, and soon ppc64 JITs, thus that native eBPF
      can be blinded as well as cBPF to eBPF migrations, so that both can
      be covered with a single implementation. The rule for JITs is that
      bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
      and in case blinding is disabled, we follow normally with JITing the
      passed program. In case blinding is enabled and we fail during the
      process of blinding itself, we must return with the interpreter.
      Similarly, in case the JITing process after the blinding failed, we
      return normally to the interpreter with the non-blinded code. Meaning,
      interpreter doesn't change in any way and operates on eBPF code as
      usual. For doing this pre-JIT blinding step, we need to make use of
      a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
      to the JIT and not in any way part of the eBPF architecture. Just like
      in the same way as JITs internally make use of some helper registers
      when emitting code, only that here the helper register is one
      abstraction level higher in eBPF bytecode, but nevertheless in JIT
      phase. That helper register is needed since f.e. manually written
      program can issue loads to all registers of eBPF architecture.
      
      The core concept with the additional register is: blind out all 32
      and 64 bit constants by converting BPF_K based instructions into a
      small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
      is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
      and REG <OP> BPF_REG_AX, so actual operation on the target register
      is translated from BPF_K into BPF_X one that is operating on
      BPF_REG_AX's content. During rewriting phase when blinding, RND is
      newly generated via prandom_u32() for each processed instruction.
      64 bit loads are split into two 32 bit loads to make translation and
      patching not too complex. Only basic thing required by JITs is to
      call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
      pair, and to map BPF_REG_AX into an unused register.
      
      Small bpf_jit_disasm extract from [2] when applied to x86 JIT:
      
      echo 0 > /proc/sys/net/core/bpf_jit_harden
      
        ffffffffa034f5e9 + <x>:
        [...]
        39:   mov    $0xa8909090,%eax
        3e:   mov    $0xa8909090,%eax
        43:   mov    $0xa8ff3148,%eax
        48:   mov    $0xa89081b4,%eax
        4d:   mov    $0xa8900bb0,%eax
        52:   mov    $0xa810e0c1,%eax
        57:   mov    $0xa8908eb4,%eax
        5c:   mov    $0xa89020b0,%eax
        [...]
      
      echo 1 > /proc/sys/net/core/bpf_jit_harden
      
        ffffffffa034f1e5 + <x>:
        [...]
        39:   mov    $0xe1192563,%r10d
        3f:   xor    $0x4989b5f3,%r10d
        46:   mov    %r10d,%eax
        49:   mov    $0xb8296d93,%r10d
        4f:   xor    $0x10b9fd03,%r10d
        56:   mov    %r10d,%eax
        59:   mov    $0x8c381146,%r10d
        5f:   xor    $0x24c7200e,%r10d
        66:   mov    %r10d,%eax
        69:   mov    $0xeb2a830e,%r10d
        6f:   xor    $0x43ba02ba,%r10d
        76:   mov    %r10d,%eax
        79:   mov    $0xd9730af,%r10d
        7f:   xor    $0xa5073b1f,%r10d
        86:   mov    %r10d,%eax
        89:   mov    $0x9a45662b,%r10d
        8f:   xor    $0x325586ea,%r10d
        96:   mov    %r10d,%eax
        [...]
      
      As can be seen, original constants that carry payload are hidden
      when enabled, actual operations are transformed from constant-based
      to register-based ones, making jumps into constants ineffective.
      Above extract/example uses single BPF load instruction over and
      over, but of course all instructions with constants are blinded.
      
      Performance wise, JIT with blinding performs a bit slower than just
      JIT and faster than interpreter case. This is expected, since we
      still get all the performance benefits from JITing and in normal
      use-cases not every single instruction needs to be blinded. Summing
      up all 296 test cases averaged over multiple runs from test_bpf.ko
      suite, interpreter was 55% slower than JIT only and JIT with blinding
      was 8% slower than JIT only. Since there are also some extremes in
      the test suite, I expect for ordinary workloads that the performance
      for the JIT with blinding case is even closer to JIT only case,
      f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
      35 (+ blinding), and 151 (interpreter).
      
      BPF test suite, seccomp test suite, eBPF sample code and various
      bigger networking eBPF programs have been tested with this and were
      running fine. For testing purposes, I also adapted interpreter and
      redirected blinded eBPF image to interpreter and also here all tests
      pass.
      
        [1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
        [2] https://github.com/01org/jit-spray-poc-for-ksp/
        [3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: NElena Reshetova <elena.reshetova@intel.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4f3446bb
    • D
      bpf: prepare bpf_int_jit_compile/bpf_prog_select_runtime apis · d1c55ab5
      Daniel Borkmann 提交于
      Since the blinding is strictly only called from inside eBPF JITs,
      we need to change signatures for bpf_int_jit_compile() and
      bpf_prog_select_runtime() first in order to prepare that the
      eBPF program we're dealing with can change underneath. Hence,
      for call sites, we need to return the latest prog. No functional
      change in this patch.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d1c55ab5
    • D
      bpf: add bpf_patch_insn_single helper · c237ee5e
      Daniel Borkmann 提交于
      Move the functionality to patch instructions out of the verifier
      code and into the core as the new bpf_patch_insn_single() helper
      will be needed later on for blinding as well. No changes in
      functionality.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c237ee5e
    • D
      bpf: minor cleanups in ebpf code · 4936e352
      Daniel Borkmann 提交于
      Besides others, remove redundant comments where the code is self
      documenting enough, and properly indent various bpf_verifier_ops
      and bpf_prog_type_list declarations. Moreover, remove two exports
      that actually have no module user.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4936e352
  31. 07 5月, 2016 1 次提交
    • A
      bpf: direct packet access · 969bf05e
      Alexei Starovoitov 提交于
      Extended BPF carried over two instructions from classic to access
      packet data: LD_ABS and LD_IND. They're highly optimized in JITs,
      but due to their design they have to do length check for every access.
      When BPF is processing 20M packets per second single LD_ABS after JIT
      is consuming 3% cpu. Hence the need to optimize it further by amortizing
      the cost of 'off < skb_headlen' over multiple packet accesses.
      One option is to introduce two new eBPF instructions LD_ABS_DW and LD_IND_DW
      with similar usage as skb_header_pointer().
      The kernel part for interpreter and x64 JIT was implemented in [1], but such
      new insns behave like old ld_abs and abort the program with 'return 0' if
      access is beyond linear data. Such hidden control flow is hard to workaround
      plus changing JITs and rolling out new llvm is incovenient.
      
      Therefore allow cls_bpf/act_bpf program access skb->data directly:
      int bpf_prog(struct __sk_buff *skb)
      {
        struct iphdr *ip;
      
        if (skb->data + sizeof(struct iphdr) + ETH_HLEN > skb->data_end)
            /* packet too small */
            return 0;
      
        ip = skb->data + ETH_HLEN;
      
        /* access IP header fields with direct loads */
        if (ip->version != 4 || ip->saddr == 0x7f000001)
            return 1;
        [...]
      }
      
      This solution avoids introduction of new instructions. llvm stays
      the same and all JITs stay the same, but verifier has to work extra hard
      to prove safety of the above program.
      
      For XDP the direct store instructions can be allowed as well.
      
      The skb->data is NET_IP_ALIGNED, so for common cases the verifier can check
      the alignment. The complex packet parsers where packet pointer is adjusted
      incrementally cannot be tracked for alignment, so allow byte access in such cases
      and misaligned access on architectures that define efficient_unaligned_access
      
      [1] https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/?h=ld_abs_dwSigned-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      969bf05e