1. 11 May, 2018 11 commits
    • J
      tcp: Add mark for TIMEWAIT sockets · 00483690
      Jon Maxwell authored
      This version has some suggestions by Eric Dumazet:
      
      - Use a local variable for the mark in IPv6 instead of ctl_sk to avoid SMP
      races.
      - Use the more elegant "IP4_REPLY_MARK(net, skb->mark) ?: sk->sk_mark"
      statement.
      - Factorize code as sk_fullsock() check is not necessary.
      
      Aidan McGurn from Openwave Mobility systems reported the following bug:
      
      "Marked routing is broken on customer deployment. Its effects are large
      increase in Uplink retransmissions caused by the client never receiving
      the final ACK to their FINACK - this ACK misses the mark and routes out
      of the incorrect route."
      
      Currently marks are added to sk_buffs for replies when the "fwmark_reflect"
      sysctl is enabled. But not for TW sockets that had sk->sk_mark set via
      setsockopt(SO_MARK..).
      
      Fix this in IPv4/v6 by adding tw->tw_mark for TIME_WAIT sockets. Copy the
      original sk->sk_mark in __inet_twsk_hashdance() to the new tw->tw_mark location.
      Then propagate this so that the skb gets sent with the correct mark. Do the same
      for resets. Give the "fwmark_reflect" sysctl precedence over sk->sk_mark so that
      netfilter rules are still honored.
      Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
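The precedence rule described in this commit (a reflected mark wins, otherwise the mark saved from the original socket is used) can be sketched in user space as follows. This is a toy model, not kernel code: `reply_mark()` stands in for the kernel's `IP4_REPLY_MARK(net, mark)` macro, and `timewait_reply_mark()` and its parameters are hypothetical names chosen for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for IP4_REPLY_MARK(net, mark): yields the incoming packet's
 * mark when the fwmark_reflect sysctl is enabled, otherwise 0. */
static uint32_t reply_mark(bool fwmark_reflect, uint32_t skb_mark)
{
    return fwmark_reflect ? skb_mark : 0;
}

/* Mark to put on a reply sent on behalf of a TIME_WAIT socket.
 * tw_mark models the copy of sk->sk_mark saved when the socket entered
 * TIME_WAIT (hypothetical parameter name). A non-zero reflected mark takes
 * precedence; otherwise the saved socket mark is used. */
static uint32_t timewait_reply_mark(bool fwmark_reflect, uint32_t skb_mark,
                                    uint32_t tw_mark)
{
    uint32_t reflected = reply_mark(fwmark_reflect, skb_mark);

    return reflected ? reflected : tw_mark;
}
```

With `fwmark_reflect` disabled, the reply carries the mark set earlier through `setsockopt(SO_MARK)`, which is the case the bug report describes as previously lost.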
    • J
      net: ipv4: remove define INET_CSK_DEBUG and unnecessary EXPORT_SYMBOL · 03bdfc00
      Joe Perches authored
      INET_CSK_DEBUG is always set and only is used for 2 pr_debug calls.
      
      EXPORT_SYMBOL(inet_csk_timer_bug_msg) is only used by these 2
      pr_debug calls and is also unnecessary as the exported string can
      be used directly by these calls.
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      net/ipv6: fix lock imbalance in ip6_route_del() · 9e575010
      Eric Dumazet authored
      WARNING: lock held when returning to user space!
      4.17.0-rc3+ #37 Not tainted
      
      syz-executor1/27662 is leaving the kernel with locks still held!
      1 lock held by syz-executor1/27662:
       #0: 00000000f661aee7 (rcu_read_lock){....}, at: ip6_route_del+0xea/0x13f0 net/ipv6/route.c:3206
      BUG: scheduling while atomic: syz-executor1/27662/0x00000002
      INFO: lockdep is turned off.
      Modules linked in:
      Kernel panic - not syncing: scheduling while atomic
      
      CPU: 1 PID: 27662 Comm: syz-executor1 Not tainted 4.17.0-rc3+ #37
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1b9/0x294 lib/dump_stack.c:113
       panic+0x22f/0x4de kernel/panic.c:184
       __schedule_bug.cold.85+0xdf/0xdf kernel/sched/core.c:3290
       schedule_debug kernel/sched/core.c:3307 [inline]
       __schedule+0x139e/0x1e30 kernel/sched/core.c:3412
       schedule+0xef/0x430 kernel/sched/core.c:3549
       exit_to_usermode_loop+0x220/0x310 arch/x86/entry/common.c:152
       prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline]
       syscall_return_slowpath arch/x86/entry/common.c:265 [inline]
       do_syscall_64+0x6ac/0x800 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x455979
      RSP: 002b:00007fbf4051dc68 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
      RAX: 0000000000000000 RBX: 00007fbf4051e6d4 RCX: 0000000000455979
      RDX: 00000000200001c0 RSI: 000000000000890c RDI: 0000000000000013
      RBP: 000000000072bea0 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
      R13: 00000000000003c8 R14: 00000000006f9b60 R15: 0000000000000000
      Dumping ftrace buffer:
         (ftrace buffer empty)
      Kernel Offset: disabled
      Rebooting in 86400 seconds..
      
      Fixes: 23fb93a4 ("net/ipv6: Cleanup exception and cache route handling")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: David Ahern <dsahern@gmail.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • V
      net: dsa: fix added_by_user switchdev notification · a37fb855
      Vivien Didelot authored
      Commit 161d82de ("net: bridge: Notify about !added_by_user FDB
      entries") causes the below oops when bringing up a slave interface,
      because dsa_port_fdb_add is still scheduled, but with a NULL address.
      
      To fix this, keep the dsa_slave_switchdev_event function agnostic of the
      notified info structure and handle the added_by_user flag in the
      specific dsa_slave_switchdev_event_work function.
      
          [   75.512263] Unable to handle kernel NULL pointer dereference at virtual address 00000000
          [   75.519063] pgd = (ptrval)
          [   75.520545] [00000000] *pgd=00000000
          [   75.522839] Internal error: Oops: 17 [#1] ARM
          [   75.525898] Modules linked in:
          [   75.527673] CPU: 0 PID: 9 Comm: kworker/u2:1 Not tainted 4.17.0-rc2 #78
          [   75.532988] Hardware name: Freescale Vybrid VF5xx/VF6xx (Device Tree)
          [   75.538153] Workqueue: dsa_ordered dsa_slave_switchdev_event_work
          [   75.542970] PC is at mv88e6xxx_port_db_load_purge+0x60/0x1b0
          [   75.547341] LR is at mdiobus_read_nested+0x6c/0x78
          [   75.550833] pc : [<804cd5c0>]    lr : [<804bba84>]    psr: 60070013
          [   75.555796] sp : 9f54bd78  ip : 9f54bd87  fp : 9f54bddc
          [   75.559719] r10: 00000000  r9 : 0000000e  r8 : 9f6a6010
          [   75.563643] r7 : 00000000  r6 : 81203048  r5 : 9f6a6010  r4 : 9f6a601c
          [   75.568867] r3 : 00000000  r2 : 00000000  r1 : 0000000d  r0 : 00000000
          [   75.574094] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
          [   75.579933] Control: 10c53c7d  Table: 9de20059  DAC: 00000051
          [   75.584384] Process kworker/u2:1 (pid: 9, stack limit = 0x(ptrval))
          [   75.589349] Stack: (0x9f54bd78 to 0x9f54c000)
          [   75.592406] bd60:                                                       00000000 00000000
          [   75.599295] bd80: 00000391 9f299d10 9f299d68 8014317c 9f7f0000 8120af00 00006dc2 00000000
          [   75.606186] bda0: 8120af00 00000000 9f54bdec 1c9f5d92 8014317c 9f6a601c 9f6a6010 00000000
          [   75.613076] bdc0: 00000000 00000000 9dd1141c 8125a0b4 9f54be0c 9f54bde0 804cd8a8 804cd56c
          [   75.619966] bde0: 0000000e 80143680 00000001 9dce9c1c 81203048 9dce9c10 00000003 00000000
          [   75.626858] be00: 9f54be5c 9f54be10 806abcac 804cd864 9f54be54 80143664 8014317c 80143054
          [   75.633748] be20: ffcaa81d 00000000 812030b0 1c9f5d92 00000000 81203048 9f54beb4 00000003
          [   75.640639] be40: ffffffff 00000000 9dd1141c 8125a0b4 9f54be84 9f54be60 80138e98 806abb18
          [   75.647529] be60: 81203048 9ddc4000 9dce9c54 9f72a300 00000000 00000000 9f54be9c 9f54be88
          [   75.654420] be80: 801390bc 80138e50 00000000 9dce9c54 9f54beac 9f54bea0 806a9524 801390a0
          [   75.661310] bea0: 9f54bedc 9f54beb0 806a9c7c 806a950c 9f54becc 00000000 00000000 00000000
          [   75.668201] bec0: 9f540000 1c9f5d92 805fe604 9ddffc00 9f54befc 9f54bee0 806ab228 806a9c38
          [   75.675092] bee0: 806ab178 9ddffc00 9f4c1900 9f40d200 9f54bf34 9f54bf00 80131e30 806ab184
          [   75.681983] bf00: 9f40d214 9f54a038 9f40d200 9f40d200 9f4c1918 812119a0 9f40d214 9f54a038
          [   75.688873] bf20: 9f40d200 9f4c1900 9f54bf7c 9f54bf38 80132124 80131d1c 9f5f2dd8 00000000
          [   75.695764] bf40: 812119a0 9f54a038 812119a0 81259c5b 9f5f2dd8 9f5f2dc0 9f53dbc0 00000000
          [   75.702655] bf60: 9f4c1900 801320b4 9f5f2dd8 9f4f7e88 9f54bfac 9f54bf80 80137ad0 801320c0
          [   75.709544] bf80: 9f54a000 9f53dbc0 801379a0 00000000 00000000 00000000 00000000 00000000
          [   75.716434] bfa0: 00000000 9f54bfb0 801010e8 801379ac 00000000 00000000 00000000 00000000
          [   75.723324] bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
          [   75.730206] bfe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
          [   75.737083] Backtrace:
          [   75.738252] [<804cd560>] (mv88e6xxx_port_db_load_purge) from [<804cd8a8>] (mv88e6xxx_port_fdb_add+0x50/0x68)
          [   75.746795]  r10:8125a0b4 r9:9dd1141c r8:00000000 r7:00000000 r6:00000000 r5:9f6a6010
          [   75.753323]  r4:9f6a601c
          [   75.754570] [<804cd858>] (mv88e6xxx_port_fdb_add) from [<806abcac>] (dsa_switch_event+0x1a0/0x660)
          [   75.762238]  r8:00000000 r7:00000003 r6:9dce9c10 r5:81203048 r4:9dce9c1c
          [   75.767655] [<806abb0c>] (dsa_switch_event) from [<80138e98>] (notifier_call_chain+0x54/0x94)
          [   75.774893]  r10:8125a0b4 r9:9dd1141c r8:00000000 r7:ffffffff r6:00000003 r5:9f54beb4
          [   75.781423]  r4:81203048
          [   75.782672] [<80138e44>] (notifier_call_chain) from [<801390bc>] (raw_notifier_call_chain+0x28/0x30)
          [   75.790514]  r9:00000000 r8:00000000 r7:9f72a300 r6:9dce9c54 r5:9ddc4000 r4:81203048
          [   75.796982] [<80139094>] (raw_notifier_call_chain) from [<806a9524>] (dsa_port_notify+0x24/0x38)
          [   75.804483] [<806a9500>] (dsa_port_notify) from [<806a9c7c>] (dsa_port_fdb_add+0x50/0x6c)
          [   75.811371] [<806a9c2c>] (dsa_port_fdb_add) from [<806ab228>] (dsa_slave_switchdev_event_work+0xb0/0x10c)
          [   75.819635]  r4:9ddffc00
          [   75.820885] [<806ab178>] (dsa_slave_switchdev_event_work) from [<80131e30>] (process_one_work+0x120/0x3a4)
          [   75.829241]  r6:9f40d200 r5:9f4c1900 r4:9ddffc00 r3:806ab178
          [   75.833612] [<80131d10>] (process_one_work) from [<80132124>] (worker_thread+0x70/0x574)
          [   75.840415]  r10:9f4c1900 r9:9f40d200 r8:9f54a038 r7:9f40d214 r6:812119a0 r5:9f4c1918
          [   75.846945]  r4:9f40d200
          [   75.848191] [<801320b4>] (worker_thread) from [<80137ad0>] (kthread+0x130/0x160)
          [   75.854300]  r10:9f4f7e88 r9:9f5f2dd8 r8:801320b4 r7:9f4c1900 r6:00000000 r5:9f53dbc0
          [   75.860830]  r4:9f5f2dc0
          [   75.862076] [<801379a0>] (kthread) from [<801010e8>] (ret_from_fork+0x14/0x2c)
          [   75.867999] Exception stack(0x9f54bfb0 to 0x9f54bff8)
          [   75.871753] bfa0:                                     00000000 00000000 00000000 00000000
          [   75.878640] bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
          [   75.885519] bfe0: 00000000 00000000 00000000 00000000 00000013 00000000
          [   75.890844]  r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5:801379a0
          [   75.897377]  r4:9f53dbc0 r3:9f54a000
          [   75.899663] Code: e3a02000 e3a03000 e14b26f4 e24bc055 (e5973000)
          [   75.904575] ---[ end trace fbca818a124dbf0d ]---
      
      Fixes: 816a3bed ("switchdev: Add fdb.added_by_user to switchdev notifications")
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • J
      tipc: clean up removal of binding table items · 5f30721c
      Jon Maloy authored
      In commit be47e41d ("tipc: fix use-after-free in tipc_nametbl_stop")
      we fixed a problem caused by premature release of service range items.
      
      That fix is correct, and solved the problem. However, it doesn't address
      the root of the problem, which is that we don't lookup the tipc_service
       -> service_range -> publication items in the correct hierarchical
      order.
      
      In this commit we try to make this right, and as a side effect obtain
      some code simplification.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      net/udp: Update udp_encap_needed static key to modern api · 88ab3108
      Davidlohr Bueso authored
      No changes in refcount semantics -- key init is false; replace
      
      static_key_enable         with   static_branch_enable
      static_key_slow_inc|dec   with   static_branch_inc|dec
      static_key_false          with   static_branch_unlikely
      
      Added a '_key' suffix to udp and udpv6 encap_needed, for better
      self documentation.
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
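The "no changes in refcount semantics" remark in this and the following static key commits can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the real kernel API patches branch sites at run time, which is not modelled here, and every `toy_` name below is hypothetical. The model shows just the counting behaviour: the key starts false, each `inc` adds a user, each `dec` removes one, and the guarded path runs while at least one user remains.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model of a reference-counted feature key (not the kernel type). */
struct toy_key {
    atomic_int users;
};

/* Register one more user of the feature guarded by the key. */
static void toy_branch_inc(struct toy_key *key)
{
    atomic_fetch_add(&key->users, 1);
}

/* Drop one user of the feature guarded by the key. */
static void toy_branch_dec(struct toy_key *key)
{
    atomic_fetch_sub(&key->users, 1);
}

/* True while at least one user holds the key. The kernel's
 * static_branch_unlikely() answers the same question, but with a patched
 * jump instead of a memory load. */
static bool toy_branch_unlikely(struct toy_key *key)
{
    return atomic_load(&key->users) > 0;
}
```

Two callers that each increment the key keep the guarded path enabled until both have decremented it again.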
    • D
      net: Update generic_xdp_needed static key to modern api · 02786475
      Davidlohr Bueso authored
      No changes in refcount semantics -- key init is false; replace
      
      static_key_slow_inc|dec   with   static_branch_inc|dec
      static_key_false          with   static_branch_unlikely
      
      Added a '_key' suffix to generic_xdp_needed, for better self
      documentation.
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      net: Update netstamp_needed static key to modern api · 39e83922
      Davidlohr Bueso authored
      No changes in refcount semantics -- key init is false; replace
      
      static_key_slow_inc|dec   with   static_branch_inc|dec
      static_key_false          with   static_branch_unlikely
      
      Added a '_key' suffix to netstamp_needed, for better self
      documentation.
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      net: Update [e/in]gress_needed static key to modern api · aabf6772
      Davidlohr Bueso authored
      No changes in semantics -- key init is false; replace
      
      static_key_slow_inc|dec   with   static_branch_inc|dec
      static_key_false          with   static_branch_unlikely
      
      Added a '_key' suffix to both ingress_needed and egress_needed,
      for better self documentation.
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      net/sock: Update memalloc_socks static key to modern api · a7950ae8
      Davidlohr Bueso authored
      No changes in refcount semantics -- key init is false; replace
      
      static_key_slow_inc|dec   with   static_branch_inc|dec
      static_key_false          with   static_branch_unlikely
      
      Added a '_key' suffix to memalloc_socks, for better self
      documentation.
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      net/ipv4: Update ip_tunnel_metadata_cnt static key to modern api · 5263a98f
      Davidlohr Bueso authored
      No changes in refcount semantics -- key init is false; replace
      
      static_key_slow_inc|dec   with   static_branch_inc|dec
      static_key_false          with   static_branch_unlikely
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 09 May, 2018 6 commits
  3. 08 May, 2018 6 commits
  4. 07 May, 2018 7 commits
  5. 05 May, 2018 1 commit
  6. 04 May, 2018 9 commits
    • S
      smc: add support for splice() · 9014db20
      Stefan Raspl authored
      Provide an implementation for splice() when we are using SMC. See
      smc_splice_read() for further details.
      Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      smc: allocate RMBs as compound pages · 2ef4f27a
      Stefan Raspl authored
      Preparatory work for splice() support.
      Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      smc: make smc_rx_wait_data() generic · b51fa1b1
      Stefan Raspl authored
      Turn smc_rx_wait_data into a generic function that can be used at various
      instances to wait on traffic to complete with varying criteria.
      Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      smc: simplify abort logic · c8b8ec8e
      Stefan Raspl authored
      Some of the conditions to exit recv() are common to two paths. Clean up the
      code by moving the check up so that it appears only once.
      Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • M
      xfrm: use a dedicated slab cache for struct xfrm_state · 565f0fa9
      Mathias Krause authored
      struct xfrm_state is rather large (768 bytes here) and therefore wastes
      quite a lot of memory as it falls into the kmalloc-1024 slab cache,
      leaving 256 bytes of unused memory per XFRM state object -- a net waste
      of 25%.
      
      Using a dedicated slab cache for struct xfrm_state reduces the level of
      internal fragmentation to a minimum.
      
      On my configuration SLUB chooses to create a slab cache covering 4
      pages holding 21 objects, resulting in an average memory waste of ~13
      bytes per object -- a net waste of only 1.6%.
      
      In my tests this led to memory savings of roughly 2.3MB for 10k XFRM
      states.
      Signed-off-by: Mathias Krause <minipli@googlemail.com>
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
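The figures quoted in this commit can be reproduced with a little integer arithmetic. The sketch below assumes 4096-byte pages and ignores any per-slab metadata, neither of which the commit message states; the helper names are hypothetical. Under those assumptions a 4-page slab holds 21 objects of 768 bytes with 256 bytes left over (roughly 12 bytes per object, in line with the "~13 bytes" quoted), while the kmalloc-1024 bucket wastes 256 bytes per object.

```c
#include <stddef.h>

struct slab_fit {
    size_t objects;   /* whole objects that fit into one slab */
    size_t leftover;  /* bytes unused at the end of the slab */
};

/* How many objects of object_size fit into a slab of slab_bytes. */
static struct slab_fit fit_slab(size_t slab_bytes, size_t object_size)
{
    struct slab_fit fit;

    fit.objects = slab_bytes / object_size;
    fit.leftover = slab_bytes - fit.objects * object_size;
    return fit;
}

/* Memory consumed when every object occupies one fixed-size kmalloc bucket. */
static size_t bucket_total(size_t count, size_t bucket_size)
{
    return count * bucket_size;
}

/* Memory consumed when objects are packed into dedicated slabs,
 * rounding the number of slabs up. */
static size_t dedicated_total(size_t count, size_t slab_bytes,
                              size_t object_size)
{
    size_t per_slab = slab_bytes / object_size;
    size_t slabs = (count + per_slab - 1) / per_slab;

    return slabs * slab_bytes;
}
```

For 10,000 objects, `bucket_total(10000, 1024) - dedicated_total(10000, 4 * 4096, 768)` gives 2,424,832 bytes, about 2.3 MiB, which agrees with the savings reported above.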
    • D
      bpf: add skb_load_bytes_relative helper · 4e1ec56c
      Daniel Borkmann authored
      This adds a small BPF helper similar to bpf_skb_load_bytes() that
      is able to load relative to mac/net header offset from the skb's
      linear data. Compared to bpf_skb_load_bytes(), it takes a fifth
      argument namely start_header, which is either BPF_HDR_START_MAC
      or BPF_HDR_START_NET. This allows for a more flexible alternative
      compared to LD_ABS/LD_IND with negative offset. It's enabled for
      tc BPF programs as well as sock filter program types where it's
      mainly useful in reuseport programs to ease access to lower header
      data.
      
      Reference: https://lists.iovisor.org/pipermail/iovisor-dev/2017-March/000698.html
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
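The idea of loading bytes relative to a chosen header start can be sketched with a user-space model. This is not the BPF helper itself: `struct pkt_model`, the `MODEL_HDR_START_*` constants, `load_bytes_relative()` and `read_u8()` are hypothetical stand-ins, and zeroing the destination on failure is a choice made for this sketch. The sample frame is a made-up 20-byte buffer with an Ethernet-sized header of 14 bytes.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins for the start_header argument described in the commit. */
enum model_hdr_start { MODEL_HDR_START_MAC, MODEL_HDR_START_NET };

struct pkt_model {
    const uint8_t *buf;  /* linear packet bytes */
    size_t len;          /* number of valid bytes in buf */
    size_t mac_off;      /* offset of the MAC header in buf */
    size_t net_off;      /* offset of the network header in buf */
};

/* Copy len bytes found at offset, counted from the selected header start.
 * Returns 0 on success or -EFAULT when the range is out of bounds. */
static int load_bytes_relative(const struct pkt_model *p, uint32_t offset,
                               void *to, uint32_t len,
                               enum model_hdr_start start)
{
    size_t base;

    switch (start) {
    case MODEL_HDR_START_MAC: base = p->mac_off; break;
    case MODEL_HDR_START_NET: base = p->net_off; break;
    default: goto err;
    }
    if (base > p->len || offset > p->len - base ||
        len > p->len - base - offset)
        goto err;
    memcpy(to, p->buf + base + offset, len);
    return 0;
err:
    memset(to, 0, len);
    return -EFAULT;
}

/* Convenience wrapper: one byte as a non-negative int, or a negative errno. */
static int read_u8(const struct pkt_model *p, uint32_t offset,
                   enum model_hdr_start start)
{
    uint8_t byte;
    int ret = load_bytes_relative(p, offset, &byte, 1, start);

    return ret ? ret : byte;
}

/* Made-up frame: EtherType 0x0800 at bytes 12-13, then 0x45 at byte 14. */
static const uint8_t sample_frame[20] = { [12] = 0x08, [13] = 0x00, [14] = 0x45 };
static const struct pkt_model sample_pkt = { sample_frame, sizeof(sample_frame), 0, 14 };
```

Reading offset 0 from the network header start and offset 14 from the MAC header start return the same byte, which is the flexibility the commit describes for reaching lower header data.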
    • D
      bpf: implement ld_abs/ld_ind in native bpf · e0cea7ce
      Daniel Borkmann authored
      The main part of this work is to finally allow removal of LD_ABS
      and LD_IND from the BPF core by reimplementing them through native
      eBPF instead. Both LD_ABS/LD_IND were carried over from cBPF and
      keeping them around in native eBPF caused way more trouble than
      actually worth it. To just list some of the security issues in
      the past:
      
        * fdfaf64e ("x86: bpf_jit: support negative offsets")
        * 35607b02 ("sparc: bpf_jit: fix loads from negative offsets")
        * e0ee9c12 ("x86: bpf_jit: fix two bugs in eBPF JIT compiler")
        * 07aee943 ("bpf, sparc: fix usage of wrong reg for load_skb_regs after call")
        * 6d59b7db ("bpf, s390x: do not reload skb pointers in non-skb context")
        * 87338c8e ("bpf, ppc64: do not reload skb pointers in non-skb context")
      
      For programs in native eBPF, LD_ABS/LD_IND are pretty much legacy
      these days due to their limitations and more efficient/flexible
      alternatives that have been developed over time such as direct
      packet access. LD_ABS/LD_IND only cover 1/2/4 byte loads into a
      register, the load happens in host endianness and its exception
      handling can yield unexpected behavior. The latter is explained
      in depth in f6b1b3bf ("bpf: fix subprog verifier bypass by
      div/mod by 0 exception") with similar cases of exceptions we had.
      In native eBPF more recent program types will disable LD_ABS/LD_IND
      altogether through may_access_skb() in verifier, and given the
      limitations in terms of exception handling, it's also disabled
      in programs that use BPF to BPF calls.
      
      In terms of cBPF, the LD_ABS/LD_IND is used in networking programs
      to access packet data. It is not used in seccomp-BPF but programs
      that use it for socket filtering or reuseport for demuxing with
      cBPF. This is mostly relevant for applications that have not yet
      migrated to native eBPF.
      
      The main complexity and source of bugs in LD_ABS/LD_IND is coming
      from their implementation in the various JITs. Most of them keep
      the model around from cBPF times by implementing a fastpath written
      in asm. They use typically two from the BPF program hidden CPU
      registers for caching the skb's headlen (skb->len - skb->data_len)
      and skb->data. Throughout the JIT phase this requires to keep track
      whether LD_ABS/LD_IND are used and if so, the two registers need
      to be recached each time a BPF helper would change the underlying
      packet data in native eBPF case. At least in eBPF case, available
      CPU registers are rare and the additional exit path out of the
      asm written JIT helper makes it also inflexible since not all
      parts of the JITer are in control from plain C. A LD_ABS/LD_IND
      implementation in eBPF therefore allows to significantly reduce
      the complexity in JITs with comparable performance results for
      them, e.g.:
      
      test_bpf             tcpdump port 22             tcpdump complex
      x64      - before    15 21 10                    14 19  18
               - after      7 10 10                     7 10  15
      arm64    - before    40 91 92                    40 91 151
               - after     51 64 73                    51 62 113
      
      For cBPF we now track any usage of LD_ABS/LD_IND in bpf_convert_filter()
      and cache the skb's headlen and data in the cBPF prologue. The
      BPF_REG_TMP gets remapped from R8 to R2 since it's mainly just
      used as a local temporary variable. This allows to shrink the
      image on x86_64 also for seccomp programs slightly since mapping
      to %rsi is not an ereg. In callee-saved R8 and R9 we now track
      skb data and headlen, respectively. For normal prologue emission
      in the JITs this does not add any extra instructions since R8, R9
      are pushed to stack in any case from eBPF side. cBPF uses the
      convert_bpf_ld_abs() emitter which probes the fast path inline
      already and falls back to bpf_skb_load_helper_{8,16,32}() helper
      relying on the cached skb data and headlen as well. R8 and R9
      never need to be reloaded due to bpf_helper_changes_pkt_data()
      since all skb access in cBPF is read-only. Then, for the case
      of native eBPF, we use the bpf_gen_ld_abs() emitter, which calls
      the bpf_skb_load_helper_{8,16,32}_no_cache() helper unconditionally,
      does neither cache skb data and headlen nor has an inlined fast
      path. The reason for the latter is that native eBPF does not have
      any extra registers available anyway, but even if there were, it
      avoids any reload of skb data and headlen in the first place.
      Additionally, for the negative offsets, we provide an alternative
      bpf_skb_load_bytes_relative() helper in eBPF which operates
      similarly as bpf_skb_load_bytes() and allows for more flexibility.
      Tested myself on x64, arm64, s390x, from Sandipan on ppc64.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
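The cached-headlen fast path mentioned for cBPF can be pictured with a user-space model. It is a sketch only: `struct skb_model` and the functions below are hypothetical, the slow path (the fallback helper for data outside the linear area) is represented by returning -1, and decoding the two bytes as a big-endian (network order) value is an assumption of this sketch rather than something the commit message spells out.

```c
#include <stdint.h>

struct skb_model {
    const uint8_t *data;  /* start of the linear data area */
    uint32_t len;         /* total packet length */
    uint32_t data_len;    /* bytes held outside the linear area */
};

/* Length of the linear area, as given in the text: len - data_len. */
static uint32_t skb_model_headlen(const struct skb_model *skb)
{
    return skb->len - skb->data_len;
}

/* Model of a 2-byte absolute load. Returns the value when both bytes lie
 * inside the linear area (fast path), or -1 when a slow-path helper would
 * have to take over. */
static int ld_abs_half(const struct skb_model *skb, uint32_t offset)
{
    uint32_t headlen = skb_model_headlen(skb);

    if (offset > headlen || headlen - offset < 2)
        return -1;
    return (skb->data[offset] << 8) | skb->data[offset + 1];
}

/* Made-up packet: 10 bytes in total, of which only 4 are linear. */
static const uint8_t sample_linear[4] = { 0x08, 0x00, 0x45, 0x00 };
static const struct skb_model sample_skb = { sample_linear, 10, 6 };
```

In this model the data pointer and headlen are recomputed on every call; the commit's point is that cBPF can cache both once in the prologue because its packet access is read-only.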
    • D
      bpf: migrate ebpf ld_abs/ld_ind tests to test_verifier · 93731ef0
      Daniel Borkmann authored
      Remove all eBPF tests involving LD_ABS/LD_IND from test_bpf.ko. Reason
      is that the eBPF tests from test_bpf module do not go via BPF verifier
      and therefore any instruction rewrites from verifier cannot take place.
      
      Therefore, move them into test_verifier which runs out of user space,
      so that the verifier can rewrite LD_ABS/LD_IND internally in upcoming patches.
      It will have the same effect since runtime tests are also performed from
      there. This also allows to finally unexport bpf_skb_vlan_{push,pop}_proto
      and keep it internal to core kernel.
      
      Additionally, also add further cBPF LD_ABS/LD_IND test coverage into
      test_bpf.ko suite.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • D
      bpf: prefix cbpf internal helpers with bpf_ · b390134c
      Daniel Borkmann authored
      No change in functionality, just remove the '__' prefix and replace it
      with a 'bpf_' prefix instead. We later on add a couple of more helpers
      for cBPF and keeping the scheme with '__' is suboptimal there.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>