提交 · 2479c2c9f4c4db24afc8be1a0ff507343ff000f8 · openanolis / cloud-kernel

30 1月, 2018 4 次提交

Merge branch 'rtnetlink-enable-IFLA_IF_NETNSID-for-RTM_DELLINK-RTM_SETINK' · 2479c2c9

由 David S. Miller 提交于 1月 29, 2018

Christian Brauner says:

====================
rtnetlink: enable IFLA_IF_NETNSID for RTM_{DEL,SET}LINK

Based on the previous discussion this enables passing a IFLA_IF_NETNSID
property along with RTM_SETLINK and RTM_DELLINK requests. The patch for
RTM_NEWLINK will be sent out in a separate patch since there are more
corner-cases to think about.

Changelog 2018-01-24:
* Preserve old behavior and report -ENODEV when either ifindex or ifname is
  provided and IFLA_GROUP is set. Spotted by Wolfgang Bumiller.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2479c2c9

rtnetlink: enable IFLA_IF_NETNSID for RTM_DELLINK · b61ad68a

由 Christian Brauner 提交于 1月 24, 2018

- Backwards Compatibility:
  If userspace wants to determine whether RTM_DELLINK supports the
  IFLA_IF_NETNSID property they should first send an RTM_GETLINK request
  with IFLA_IF_NETNSID on lo. If either EACCESS is returned or the reply
  does not include IFLA_IF_NETNSID userspace should assume that
  IFLA_IF_NETNSID is not supported on this kernel.
  If the reply does contain an IFLA_IF_NETNSID property userspace
  can send an RTM_DELLINK with a IFLA_IF_NETNSID property. If they receive
  EOPNOTSUPP then the kernel does not support the IFLA_IF_NETNSID property
  with RTM_DELLINK. Userpace should then fallback to other means.

- Security:
  Callers must have CAP_NET_ADMIN in the owning user namespace of the
  target network namespace.
Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b61ad68a

rtnetlink: enable IFLA_IF_NETNSID for RTM_SETLINK · c310bfcb

由 Christian Brauner 提交于 1月 24, 2018

- Backwards Compatibility:
  If userspace wants to determine whether RTM_SETLINK supports the
  IFLA_IF_NETNSID property they should first send an RTM_GETLINK request
  with IFLA_IF_NETNSID on lo. If either EACCESS is returned or the reply
  does not include IFLA_IF_NETNSID userspace should assume that
  IFLA_IF_NETNSID is not supported on this kernel.
  If the reply does contain an IFLA_IF_NETNSID property userspace
  can send an RTM_SETLINK with a IFLA_IF_NETNSID property. If they receive
  EOPNOTSUPP then the kernel does not support the IFLA_IF_NETNSID property
  with RTM_SETLINK. Userpace should then fallback to other means.

  To retain backwards compatibility the kernel will first check whether a
  IFLA_NET_NS_PID or IFLA_NET_NS_FD property has been passed. If either
  one is found it will be used to identify the target network namespace.
  This implies that users who do not care whether their running kernel
  supports IFLA_IF_NETNSID with RTM_SETLINK can pass both
  IFLA_NET_NS_{FD,PID} and IFLA_IF_NETNSID referring to the same network
  namespace.

- Security:
  Callers must have CAP_NET_ADMIN in the owning user namespace of the
  target network namespace.
Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c310bfcb

rtnetlink: enable IFLA_IF_NETNSID in do_setlink() · 7c4f63ba

由 Christian Brauner 提交于 1月 24, 2018

RTM_{NEW,SET}LINK already allow operations on other network namespaces
by identifying the target network namespace through IFLA_NET_NS_{FD,PID}
properties. This is done by looking for the corresponding properties in
do_setlink(). Extend do_setlink() to also look for the IFLA_IF_NETNSID
property. This introduces no functional changes since all callers of
do_setlink() currently block IFLA_IF_NETNSID by reporting an error before
they reach do_setlink().

This introduces the helpers:

static struct net *rtnl_link_get_net_by_nlattr(struct net *src_net, struct
                                               nlattr *tb[])

static struct net *rtnl_link_get_net_capable(const struct sk_buff *skb,
                                             struct net *src_net,
					     struct nlattr *tb[], int cap)

to simplify permission checks and target network namespace retrieval for
RTM_* requests that already support IFLA_NET_NS_{FD,PID} but get extended
to IFLA_IF_NETNSID. To perserve backwards compatibility the helpers look
for IFLA_NET_NS_{FD,PID} properties first before checking for
IFLA_IF_NETNSID.
Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7c4f63ba

29 1月, 2018 5 次提交

D
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 3e3ab9cc
由 David S. Miller 提交于 1月 29, 2018
```
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
3e3ab9cc

Merge tag 'wireless-drivers-next-for-davem-2018-01-26' of... · 868c36dc

由 David S. Miller 提交于 1月 28, 2018

Merge tag 'wireless-drivers-next-for-davem-2018-01-26' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next

Kalle Valo says:

====================
wireless-drivers-next patches for 4.16

Major changes:

wil6210

* add PCI device id for Talyn

* support flashless device

ath9k

* improve RSSI/signal accuracy on AR9003 series

mt76

* validate CCMP PN from received frames to avoid replay attacks

qtnfmac

* support 64-bit network stats

* report more hardware information to kernel log and some via ethtool
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

868c36dc

sfc: mark some unexported symbols as static · e7345ba3

由 kbuild test robot 提交于 1月 26, 2018

efx_default_channel_want_txqs() is only used in efx.c, while
 efx_ptp_want_txqs() and efx_ptp_channel_type (a struct) are only used
 in ptp.c.  In all cases these symbols should be static.

Fixes: 2935e3c3 ("sfc: on 8000 series use TX queues for TX timestamps")
Signed-off-by: NFengguang Wu <fengguang.wu@intel.com>
[ecree@solarflare.com: rewrote commit message]
Signed-off-by: NEdward Cree <ecree@solarflare.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e7345ba3

Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · 5abe9ead

由 David S. Miller 提交于 1月 28, 2018

Jeff Kirsher says:

====================
40GbE Intel Wired LAN Driver Updates 2018-01-26

This series contains updates to i40e and i40evf.

Michal updates the driver to pass critical errors from the firmware to
the caller.

Patryk fixes an issue of creating multiple identical filters with the
same location, by simply moving the functions so that we remove the
existing filter and then add the new filter.

Paweł adds back in the ability to turn off offloads when VLAN is set for
the VF driver.  Fixed an issue where the number of TC queue pairs was
exceeding MSI-X vectors count, causing messages about invalid TC mapping
and wrong selected Tx queue.

Alex cleans up the i40e/i40evf_set_itr_per_queue() by dropping all the
unneeded pointer chases.  Puts to use the reg_idx value, which was going
unused, so that we can avoid having to compute the vector every time
throughout the driver.

Upasana enable the driver to display LLDP information on the vSphere Web
Client by exposing DCB parameters.

Alice converts our flags from 32 to 64 bit size, since we have added
more flags.

Dave implements a private ethtool flag to disable the processing of LLDP
packets by the firmware, so that the firmware will not consume LLDPDU
and cause them to be sent up the stack.

Alan adds a mechanism for detecting/storing the flag for processing of
LLDP packets by the firmware, so that its current state is persistent
across reboots/reloads of the driver.

Avinash fixes kdump with i40e due to resource constraints.  We were
enabling VMDq and iWARP when we just have a single CPU, which was
starving kdump for the lack of IRQs.

Jake adds support to program the fragmented IPv4 input set PCTYPE.
Fixed the reported masks to properly report that the entire field is
masked, since we had accidentally swapped the mask values for the IPv4
addresses with the L4 port numbers.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5abe9ead

Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · 457740a9

由 David S. Miller 提交于 1月 28, 2018

Alexei Starovoitov says:

====================
pull-request: bpf-next 2018-01-26

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) A number of extensions to tcp-bpf, from Lawrence.
    - direct R or R/W access to many tcp_sock fields via bpf_sock_ops
    - passing up to 3 arguments to bpf_sock_ops functions
    - tcp_sock field bpf_sock_ops_cb_flags for controlling callbacks
    - optionally calling bpf_sock_ops program when RTO fires
    - optionally calling bpf_sock_ops program when packet is retransmitted
    - optionally calling bpf_sock_ops program when TCP state changes
    - access to tclass and sk_txhash
    - new selftest

2) div/mod exception handling, from Daniel.
    One of the ugly leftovers from the early eBPF days is that div/mod
    operations based on registers have a hard-coded src_reg == 0 test
    in the interpreter as well as in JIT code generators that would
    return from the BPF program with exit code 0. This was basically
    adopted from cBPF interpreter for historical reasons.
    There are multiple reasons why this is very suboptimal and prone
    to bugs. To name one: the return code mapping for such abnormal
    program exit of 0 does not always match with a suitable program
    type's exit code mapping. For example, '0' in tc means action 'ok'
    where the packet gets passed further up the stack, which is just
    undesirable for such cases (e.g. when implementing policy) and
    also does not match with other program types.
    After considering _four_ different ways to address the problem,
    we adapt the same behavior as on some major archs like ARMv8:
    X div 0 results in 0, and X mod 0 results in X. aarch64 and
    aarch32 ISA do not generate any traps or otherwise aborts
    of program execution for unsigned divides.
    Given the options, it seems the most suitable from
    all of them, also since major archs have similar schemes in
    place. Given this is all in the realm of undefined behavior,
    we still have the option to adapt if deemed necessary.

3) sockmap sample refactoring, from John.

4) lpm map get_next_key fixes, from Yonghong.

5) test cleanups, from Alexei and Prashant.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

457740a9

28 1月, 2018 2 次提交

Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · 6b2e2829

由 David S. Miller 提交于 1月 28, 2018

Jeff Kirsher says:

====================
10GbE Intel Wired LAN Driver Updates 2018-01-26

This series contains updates to ixgbe and ixgbevf.

Emil updates ixgbevf to match ixgbe functionality, starting with the
consolidating of functions that represent logical steps in the receive
process so we can later update them more easily.  Updated ixgbevf to
only synchronize the length of the frame, which will typically be the
MTU or smaller.  Updated the VF driver to use the length of the packet
instead of the DD status bit to determine if a new descriptor is ready
to be processed, which saves on reads and we can save time on
initialization.  Added support for DMA_ATTR_SKIP_CPU_SYNC/WEAK_ORDERING
to help improve performance on some platforms.  Updated the VF driver to
do bulk updates of the page reference count instead of just incrementing
it by one reference at a time.  Updated the VF driver to only go through
the region of the receive ring that was designated to be cleaned up,
rather than process the entire ring.

Colin Ian King adds the use of ARRAY_SIZE() on various arrays.

Miroslav Lichvar fixes an issue where ethtool was reporting timestamping
filters unsupported for X550, which is incorrect.

Paul adds support for reporting 5G link speed for some devices.

Dan Carpenter fixes a typo where && was used when it should have been
||.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6b2e2829

net/rocker: Remove unreachable return instruction · 751c45bd

由 Leon Romanovsky 提交于 1月 28, 2018

The "return 0" instruction follows other return instruction
and it makes it impossible to execute, hence remove it.

Fixes: 00fc0c51 ("rocker: Change world_ops API and implementation to be switchdev independant")
Signed-off-by: NLeon Romanovsky <leonro@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

751c45bd

27 1月, 2018 29 次提交

Merge branch 'fix-lpm-map' · 8223967f

由 Alexei Starovoitov 提交于 1月 26, 2018

Yonghong Song says:

====================
A kernel page fault which happens in lpm map trie_get_next_key is reported
by syzbot and Eric. The issue was introduced by commit b471f2f1
("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map").
Patch #1 fixed the issue in the kernel and patch #2 adds a multithreaded
test case in tools/testing/selftests/bpf/test_lpm_map.
====================
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

8223967f

tools/bpf: add a multithreaded stress test in bpf selftests test_lpm_map · af32efee

由 Yonghong Song 提交于 1月 26, 2018

The new test will spawn four threads, doing map update, delete, lookup
and get_next_key in parallel. It is able to reproduce the issue in the
previous commit found by syzbot and Eric Dumazet.
Signed-off-by: NYonghong Song <yhs@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

af32efee

bpf: fix kernel page fault in lpm map trie_get_next_key · 6dd1ec6c

由 Yonghong Song 提交于 1月 26, 2018

Commit b471f2f1 ("bpf: implement MAP_GET_NEXT_KEY command
for LPM_TRIE map") introduces a bug likes below:

    if (!rcu_dereference(trie->root))
        return -ENOENT;
    if (!key || key->prefixlen > trie->max_prefixlen) {
        root = &trie->root;
        goto find_leftmost;
    }
    ......
  find_leftmost:
    for (node = rcu_dereference(*root); node;) {

In the code after label find_leftmost, it is assumed
that *root should not be NULL, but it is not true as
it is possbile trie->root is changed to NULL by an
asynchronous delete operation.

The issue is reported by syzbot and Eric Dumazet with the
below error log:
  ......
  kasan: CONFIG_KASAN_INLINE enabled
  kasan: GPF could be caused by NULL-ptr deref or user memory access
  general protection fault: 0000 [#1] SMP KASAN
  Dumping ftrace buffer:
     (ftrace buffer empty)
  Modules linked in:
  CPU: 1 PID: 8033 Comm: syz-executor3 Not tainted 4.15.0-rc8+ #4
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  RIP: 0010:trie_get_next_key+0x3c2/0xf10 kernel/bpf/lpm_trie.c:682
  ......

This patch fixed the issue by use local rcu_dereferenced
pointer instead of *(&trie->root) later on.

Fixes: b471f2f1 ("bpf: implement MAP_GET_NEXT_KEY command or LPM_TRIE map")
Reported-by: Nsyzbot <syzkaller@googlegroups.com>
Reported-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NYonghong Song <yhs@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

6dd1ec6c

Merge branch 'bpf-improvements-and-fixes' · 1651e39e

由 Alexei Starovoitov 提交于 1月 26, 2018

Daniel Borkmann says:

====================
This set contains a small cleanup in cBPF prologue generation and
otherwise fixes an outstanding issue related to BPF to BPF calls
and exception handling. For details please see related patches.
Last but not least, BPF selftests is extended with several new
test cases.

Thanks!
====================
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

1651e39e

bpf: add further test cases around div/mod and others · 21ccaf21

由 Daniel Borkmann 提交于 1月 26, 2018

Update selftests to relfect recent changes and add various new
test cases.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

21ccaf21

bpf, arm: remove obsolete exception handling from div/mod · 73ae3c04

由 Daniel Borkmann 提交于 1月 26, 2018

Since we've changed div/mod exception handling for src_reg in
eBPF verifier itself, remove the leftovers from arm32 JIT.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Cc: Shubham Bansal <illusionist.neo@gmail.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

73ae3c04

bpf, mips64: remove unneeded zero check from div/mod with k · e472d5d8

由 Daniel Borkmann 提交于 1月 26, 2018

The verifier in both cBPF and eBPF reject div/mod by 0 imm,
so this can never load. Remove emitting such test and reject
it from being JITed instead (the latter is actually also not
needed, but given practice in sparc64, ppc64 today, so
doesn't hurt to add it here either).
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Cc: David Daney <david.daney@cavium.com>
Reviewed-by: NDavid Daney <david.daney@cavium.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

e472d5d8

bpf, mips64: remove obsolete exception handling from div/mod · 1fb5c9c6

由 Daniel Borkmann 提交于 1月 26, 2018

Since we've changed div/mod exception handling for src_reg in
eBPF verifier itself, remove the leftovers from mips64 JIT.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Cc: David Daney <david.daney@cavium.com>
Reviewed-by: NDavid Daney <david.daney@cavium.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

1fb5c9c6

bpf, sparc64: remove obsolete exception handling from div/mod · 740d52c6

由 Daniel Borkmann 提交于 1月 26, 2018

Since we've changed div/mod exception handling for src_reg in
eBPF verifier itself, remove the leftovers from sparc64 JIT.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

740d52c6

bpf, ppc64: remove obsolete exception handling from div/mod · 53fbf571

由 Daniel Borkmann 提交于 1月 26, 2018

Since we've changed div/mod exception handling for src_reg in
eBPF verifier itself, remove the leftovers from ppc64 JIT.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

53fbf571

bpf, s390x: remove obsolete exception handling from div/mod · a3212b8f

由 Daniel Borkmann 提交于 1月 26, 2018

Since we've changed div/mod exception handling for src_reg in
eBPF verifier itself, remove the leftovers from s390x JIT.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

a3212b8f

bpf, arm64: remove obsolete exception handling from div/mod · 96a71005

由 Daniel Borkmann 提交于 1月 26, 2018

Since we've changed div/mod exception handling for src_reg in
eBPF verifier itself, remove the leftovers from arm64 JIT.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

96a71005

bpf, x86_64: remove obsolete exception handling from div/mod · 3e5b1a39

由 Daniel Borkmann 提交于 1月 26, 2018

Since we've changed div/mod exception handling for src_reg in
eBPF verifier itself, remove the leftovers from x86_64 JIT.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

3e5b1a39

bpf: fix subprog verifier bypass by div/mod by 0 exception · f6b1b3bf

由 Daniel Borkmann 提交于 1月 26, 2018

One of the ugly leftovers from the early eBPF days is that div/mod
operations based on registers have a hard-coded src_reg == 0 test
in the interpreter as well as in JIT code generators that would
return from the BPF program with exit code 0. This was basically
adopted from cBPF interpreter for historical reasons.

There are multiple reasons why this is very suboptimal and prone
to bugs. To name one: the return code mapping for such abnormal
program exit of 0 does not always match with a suitable program
type's exit code mapping. For example, '0' in tc means action 'ok'
where the packet gets passed further up the stack, which is just
undesirable for such cases (e.g. when implementing policy) and
also does not match with other program types.

While trying to work out an exception handling scheme, I also
noticed that programs crafted like the following will currently
pass the verifier:

  0: (bf) r6 = r1
  1: (85) call pc+8
  caller:
   R6=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
  callee:
   frame1: R1=ctx(id=0,off=0,imm=0) R10=fp0,call_1
  10: (b4) (u32) r2 = (u32) 0
  11: (b4) (u32) r3 = (u32) 1
  12: (3c) (u32) r3 /= (u32) r2
  13: (61) r0 = *(u32 *)(r1 +76)
  14: (95) exit
  returning from callee:
   frame1: R0_w=pkt(id=0,off=0,r=0,imm=0)
           R1=ctx(id=0,off=0,imm=0) R2_w=inv0
           R3_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff))
           R10=fp0,call_1
  to caller at 2:
   R0_w=pkt(id=0,off=0,r=0,imm=0) R6=ctx(id=0,off=0,imm=0)
   R10=fp0,call_-1

  from 14 to 2: R0=pkt(id=0,off=0,r=0,imm=0)
                R6=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
  2: (bf) r1 = r6
  3: (61) r1 = *(u32 *)(r1 +80)
  4: (bf) r2 = r0
  5: (07) r2 += 8
  6: (2d) if r2 > r1 goto pc+1
   R0=pkt(id=0,off=0,r=8,imm=0) R1=pkt_end(id=0,off=0,imm=0)
   R2=pkt(id=0,off=8,r=8,imm=0) R6=ctx(id=0,off=0,imm=0)
   R10=fp0,call_-1
  7: (71) r0 = *(u8 *)(r0 +0)
  8: (b7) r0 = 1
  9: (95) exit

  from 6 to 8: safe
  processed 16 insns (limit 131072), stack depth 0+0

Basically what happens is that in the subprog we make use of a
div/mod by 0 exception and in the 'normal' subprog's exit path
we just return skb->data back to the main prog. This has the
implication that the verifier thinks we always get a pkt pointer
in R0 while we still have the implicit 'return 0' from the div
as an alternative unconditional return path earlier. Thus, R0
then contains 0, meaning back in the parent prog we get the
address range of [0x0, skb->data_end] as read and writeable.
Similar can be crafted with other pointer register types.

Since i) BPF_ABS/IND is not allowed in programs that contain
BPF to BPF calls (and generally it's also disadvised to use in
native eBPF context), ii) unknown opcodes don't return zero
anymore, iii) we don't return an exception code in dead branches,
the only last missing case affected and to fix is the div/mod
handling.

What we would really need is some infrastructure to propagate
exceptions all the way to the original prog unwinding the
current stack and returning that code to the caller of the
BPF program. In user space such exception handling for similar
runtimes is typically implemented with setjmp(3) and longjmp(3)
as one possibility which is not available in the kernel,
though (kgdb used to implement it in kernel long time ago). I
implemented a PoC exception handling mechanism into the BPF
interpreter with porting setjmp()/longjmp() into x86_64 and
adding a new internal BPF_ABRT opcode that can use a program
specific exception code for all exception cases we have (e.g.
div/mod by 0, unknown opcodes, etc). While this seems to work
in the constrained BPF environment (meaning, here, we don't
need to deal with state e.g. from memory allocations that we
would need to undo before going into exception state), it still
has various drawbacks: i) we would need to implement the
setjmp()/longjmp() for every arch supported in the kernel and
for x86_64, arm64, sparc64 JITs currently supporting calls,
ii) it has unconditional additional cost on main program
entry to store CPU register state in initial setjmp() call,
and we would need some way to pass the jmp_buf down into
___bpf_prog_run() for main prog and all subprogs, but also
storing on stack is not really nice (other option would be
per-cpu storage for this, but it also has the drawback that
we need to disable preemption for every BPF program types).
All in all this approach would add a lot of complexity.

Another poor-man's solution would be to have some sort of
additional shared register or scratch buffer to hold state
for exceptions, and test that after every call return to
chain returns and pass R0 all the way down to BPF prog caller.
This is also problematic in various ways: i) an additional
register doesn't map well into JITs, and some other scratch
space could only be on per-cpu storage, which, again has the
side-effect that this only works when we disable preemption,
or somewhere in the input context which is not available
everywhere either, and ii) this adds significant runtime
overhead by putting conditionals after each and every call,
as well as implementation complexity.

Yet another option is to teach verifier that div/mod can
return an integer, which however is also complex to implement
as verifier would need to walk such fake 'mov r0,<code>; exit;'
sequeuence and there would still be no guarantee for having
propagation of this further down to the BPF caller as proper
exception code. For parent prog, it is also is not distinguishable
from a normal return of a constant scalar value.

The approach taken here is a completely different one with
little complexity and no additional overhead involved in
that we make use of the fact that a div/mod by 0 is undefined
behavior. Instead of bailing out, we adapt the same behavior
as on some major archs like ARMv8 [0] into eBPF as well:
X div 0 results in 0, and X mod 0 results in X. aarch64 and
aarch32 ISA do not generate any traps or otherwise aborts
of program execution for unsigned divides. I verified this
also with a test program compiled by gcc and clang, and the
behavior matches with the spec. Going forward we adapt the
eBPF verifier to emit such rewrites once div/mod by register
was seen. cBPF is not touched and will keep existing 'return 0'
semantics. Given the options, it seems the most suitable from
all of them, also since major archs have similar schemes in
place. Given this is all in the realm of undefined behavior,
we still have the option to adapt if deemed necessary and
this way we would also have the option of more flexibility
from LLVM code generation side (which is then fully visible
to verifier). Thus, this patch i) fixes the panic seen in
above program and ii) doesn't bypass the verifier observations.

  [0] ARM Architecture Reference Manual, ARMv8 [ARM DDI 0487B.b]
      http://infocenter.arm.com/help/topic/com.arm.doc.ddi0487b.b/DDI0487B_b_armv8_arm.pdf
      1) aarch64 instruction set: section C3.4.7 and C6.2.279 (UDIV)
         "A division by zero results in a zero being written to
          the destination register, without any indication that
          the division by zero occurred."
      2) aarch32 instruction set: section F1.4.8 and F5.1.263 (UDIV)
         "For the SDIV and UDIV instructions, division by zero
          always returns a zero result."

Fixes: f4d7e40a ("bpf: introduce function calls (verification)")
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

f6b1b3bf

bpf: make unknown opcode handling more robust · 5e581dad

由 Daniel Borkmann 提交于 1月 26, 2018

Recent findings by syzcaller fixed in 7891a87e ("bpf: arsh is
not supported in 32 bit alu thus reject it") triggered a warning
in the interpreter due to unknown opcode not being rejected by
the verifier. The 'return 0' for an unknown opcode is really not
optimal, since with BPF to BPF calls, this would go untracked by
the verifier.

Do two things here to improve the situation: i) perform basic insn
sanity check early on in the verification phase and reject every
non-uapi insn right there. The bpf_opcode_in_insntable() table
reuses the same mapping as the jumptable in ___bpf_prog_run() sans
the non-public mappings. And ii) in ___bpf_prog_run() we do need
to BUG in the case where the verifier would ever create an unknown
opcode due to some rewrites.

Note that JITs do not have such issues since they would punt to
interpreter in these situations. Moreover, the BPF_JIT_ALWAYS_ON
would also help to avoid such unknown opcodes in the first place.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

5e581dad

bpf: improve dead code sanitizing · 2a5418a1

由 Daniel Borkmann 提交于 1月 26, 2018

Given we recently had c131187d ("bpf: fix branch pruning
logic") and 95a762e2 ("bpf: fix incorrect sign extension in
check_alu_op()") in particular where before verifier skipped
verification of the wrongly assumed dead branch, we should not
just replace the dead code parts with nops (mov r0,r0). If there
is a bug such as fixed in 95a762e2 in future again, where
runtime could execute those insns, then one of the potential
issues with the current setting would be that given the nops
would be at the end of the program, we could execute out of
bounds at some point.

The best in such case would be to just exit the BPF program
altogether and return an exception code. However, given this
would require two instructions, and such a dead code gap could
just be a single insn long, we would need to place 'r0 = X; ret'
snippet at the very end after the user program or at the start
before the program (where we'd skip that region on prog entry),
and then place unconditional ja's into the dead code gap.

While more complex but possible, there's still another block
in the road that currently prevents from this, namely BPF to
BPF calls. The issue here is that such exception could be
returned from a callee, but the caller would not know that
it's an exception that needs to be propagated further down.
Alternative that has little complexity is to just use a ja-1
code for now which will trap the execution here instead of
silently doing bad things if we ever get there due to bugs.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

2a5418a1

bpf: xor of a/x in cbpf can be done in 32 bit alu · 1d621674

由 Daniel Borkmann 提交于 1月 26, 2018

Very minor optimization; saves 1 byte per program in x86_64
JIT in cBPF prologue.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

1d621674

samples/bpf: Partially fixes the bpf.o build · c25ef6a5

由 Mickaël Salaün 提交于 1月 26, 2018

Do not build lib/bpf/bpf.o with this Makefile but use the one from the
library directory.  This avoid making a buggy bpf.o file (e.g. missing
symbols).

This patch is useful if some code (e.g. Landlock tests) needs both the
bpf.o (from tools/lib/bpf) and the bpf_load.o (from samples/bpf).
Signed-off-by: NMickaël Salaün <mic@digikod.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

c25ef6a5

bpf: clean up from test_tcpbpf_kern.c · 771fc607

由 Lawrence Brakmo 提交于 1月 26, 2018

Removed commented lines from test_tcpbpf_kern.c

Fixes: d6d4f60c bpf: add selftest for tcpbpf
Signed-off-by: NLawrence Brakmo <brakmo@fb.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

771fc607

i40e: Do not allow use more TC queue pairs than MSI-X vectors exist · 1563f2d2

由 Paweł Jabłoński 提交于 12月 29, 2017

This patch suppresses the message about invalid TC mapping and wrong
selected TX queue. The root cause of this bug was setting too many
TC queue pairs on huge multiprocessor machines. When quantity of the
TC queue pairs is exceeding MSI-X vectors count then TX queue number
can be selected beyond actual TX queues amount.
Signed-off-by: NPaweł Jabłoński <pawel.jablonski@intel.com>
Tested-by: NAndrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>

1563f2d2

i40e/i40evf: Record ITR register location in the q_vector · a3f9fb5e

由 Alexander Duyck 提交于 12月 29, 2017

The drivers for i40e and i40evf had a reg_idx value stored in the q_vector
that was going completely unused. I can only assume this was copied over
from ixgbe and nobody knew how to use it.

I'm going to make use of the value to avoid having to compute the vector
and thus the register index for multiple paths throughout the drivers.
Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com>
Tested-by: NAndrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>

a3f9fb5e

i40e: fix reported mask for ntuple filters · 40339af3

由 Jacob Keller 提交于 12月 27, 2017

In commit 36777d9f ("i40e: check current configured input set when
adding ntuple filters") some code was added to report the input set
mask for a given filter when reporting it to the user.

This code is necessary so that the reported filter correctly displays
that it is or is not masking certain fields.

Unfortunately the code was incorrect. Development error accidentally
swapped the mask values for the IPv4 addresses with the L4 port numbers.
The port numbers are only 16bits wide while IPv4 addresses are 32 bits.
Unfortunately we assigned only 16 bits to the IPv4 address masks.
Additionally we assigned 32bit value 0xFFFFFFF to the TCP port numbers.
This second part does not matter as the value would be truncated to
16bits regardless, but it is unnecessary.

Fix the reported masks to properly report that the entire field is
masked.

Fixes: 36777d9f ("i40e: check current configured input set when adding ntuple filters")
Signed-off-by: NJacob Keller <jacob.e.keller@intel.com>
Tested-by: NAndrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>

40339af3

i40e: disallow programming multiple filters with same criteria · 443ee71a

由 Jacob Keller 提交于 12月 27, 2017

Our hardware does not allow situations where two filters might conflict
when matching. Essentially hardware only programs one filter for each
set of matching criteria. We don't support filters with overlapping
input sets, because each flow type can only use a single input set.

Additionally, different flow types will never have overlapping matches,
because of how the hardware parses the flow type before checking
matching criteria.

For this reason, we do not need or use the location number when
programming filters to hardware.

In order to avoid confusing scenarios with filters that match the same
criteria but program the flow to different queues, do not allow multiple
filters that match identical criteria to be programmed.

This ensures that we avoid odd scenarios when deleting filters, and when
programming new filters that match the same criteria.

Instead, users that wish to update the criteria for a filter must use
the same location id, or must delete all the matching filters first.
Signed-off-by: NJacob Keller <jacob.e.keller@intel.com>
Tested-by: NAndrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>

443ee71a

i40e: program fragmented IPv4 filter input set · 02b4016b

由 Jacob Keller 提交于 12月 27, 2017

When implementing support for IP_USER_FLOW filters, we correctly
programmed a filter for both the non fragmented IPv4/Other filter, as
well as the fragmented IPv4 filters. However, we did not properly
program the input set for fragmented IPv4 PCTYPE. This meant that the
filters would almost certainly not match, unless the user specified all
of the flow types.

Add support to program the fragmented IPv4 filter input set. Since we
always program these filters together, we'll assume that the two input
sets must match, and will thus always program the input sets to the same
value.
Signed-off-by: NJacob Keller <jacob.e.keller@intel.com>
Tested-by: NAndrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>

02b4016b

i40e: Fix kdump failure · 69399873

由 Avinash Dayanand 提交于 12月 27, 2017

kdump fails in the system when used in conjunction with Ethernet driver
X722/X710. This is mainly because when we are resource constrained i.e.
when we have just one online_cpus, we are enabling VMDq and iWARP. It
doesn't make sense to enable them with just one CPU and starve kdump
for lack of IRQs.

So don't enable VMDq or iWARP when we just have a single CPU.
Signed-off-by: NAvinash Dayanand <avinash.dayanand@intel.com>
Tested-by: NAndrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>

69399873

i40e: cleanup unnecessary parens · 5056716c

由 Jeff Kirsher 提交于 12月 27, 2017

Clean up unnecessary parenthesis.
Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
Tested-by: NAndrew Bowers <andrewx.bowers@intel.com>

5056716c

i40e: fix FW_LLDP flag on init · 64e1dcbb

由 Alan Brady 提交于 12月 27, 2017

Using ethtool --set-priv-flags disable-fw-lldp <on/off> is persistent
across reboots/reloads so we need some mechanism in the driver to detect
if it's on or off on init so we can set the ethtool private flag
appropriately. Without this, every time the driver is reloaded the flag
will default to off regardless of whether it's on or off in FW.

We detect this by first attempting to program DCB and if AQ fails
returning I40E_AQ_RC_EPERM, we know that LLDP is disabled in FW.
Signed-off-by: NAlan Brady <alan.brady@intel.com>
Tested-by: NAndrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>

64e1dcbb

i40e: Implement an ethtool private flag to stop LLDP in FW · c61c8fe1

由 Dave Ertman 提交于 12月 27, 2017

Implement the private flag disable-fw-lldp for ethtool
to disable the processing of LLDP packets by the FW.
This will stop the FW from consuming LLDPDU and cause
them to be sent up the stack.

The FW is also being configured to apply a default DCB
configuration on link up.

Toggling the value of this flag will also cause a PF reset.

Disabling FW DCB will also disable DCBx.
Signed-off-by: NDave Ertman <david.m.ertman@intel.com>
Tested-by: NAndrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>

c61c8fe1

i40e: change flags to use 64 bits · 60f481b9

由 Alice Michael 提交于 12月 27, 2017

As we have added more flags, we need to now use more
bits and have over flooded the 32 bit size.  So
make it 64.

Also change all the existing bits to unsigned long long
bits.
Signed-off-by: NAlice Michael <alice.michael@intel.com>
Tested-by: NAndrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>

60f481b9

openanolis / cloud-kernel 接近 2 年 前同步成功

openanolis / cloud-kernel
接近 2 年前同步成功