1. 30 1月, 2018 7 次提交
    • M
      ptr_ring: clean up documentation · 8619d384
      Michael S. Tsirkin 提交于
      The only function safe to call without locks
      is __ptr_ring_empty. Move documentation about
      lockless use there to make sure people do not
      try to use __ptr_ring_peek outside locks.
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8619d384
    • M
      ptr_ring: keep consumer_head valid at all times · 406de755
      Michael S. Tsirkin 提交于
      The comment near __ptr_ring_peek says:
      
       * If ring is never resized, and if the pointer is merely
       * tested, there's no need to take the lock - see e.g.  __ptr_ring_empty.
      
      but this was in fact never possible since consumer_head would sometimes
      point outside the ring. Refactor the code so that it's always
      pointing within a ring.
      
      Fixes: c5ad119f ("net: sched: pfifo_fast use skb_array")
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      406de755
    • M
      ipv6: Fix SO_REUSEPORT UDP socket with implicit sk_ipv6only · 7ece54a6
      Martin KaFai Lau 提交于
      If a sk_v6_rcv_saddr is !IPV6_ADDR_ANY and !IPV6_ADDR_MAPPED, it
      implicitly implies it is an ipv6only socket.  However, in inet6_bind(),
      this addr_type checking and setting sk->sk_ipv6only to 1 are only done
      after sk->sk_prot->get_port(sk, snum) has been completed successfully.
      
      This inconsistency between sk_v6_rcv_saddr and sk_ipv6only confuses
      the 'get_port()'.
      
      In particular, when binding SO_REUSEPORT UDP sockets,
      udp_reuseport_add_sock(sk,...) is called.  udp_reuseport_add_sock()
      checks "ipv6_only_sock(sk2) == ipv6_only_sock(sk)" before adding sk to
      sk2->sk_reuseport_cb.  In this case, ipv6_only_sock(sk2) could be
      1 while ipv6_only_sock(sk) is still 0 here.  The end result is,
      reuseport_alloc(sk) is called instead of adding sk to the existing
      sk2->sk_reuseport_cb.
      
      It can be reproduced by binding two SO_REUSEPORT UDP sockets on an
      IPv6 address (!ANY and !MAPPED).  Only one of the socket will
      receive packet.
      
      The fix is to set the implicit sk_ipv6only before calling get_port().
      The original sk_ipv6only has to be saved such that it can be restored
      in case get_port() failed.  The situation is similar to the
      inet_reset_saddr(sk) after get_port() has failed.
      
      Thanks to Calvin Owens <calvinowens@fb.com> who created an easy
      reproduction which leads to a fix.
      
      Fixes: e32ea7e7 ("soreuseport: fast reuseport UDP socket selection")
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7ece54a6
    • D
      Merge branch 'rtnetlink-enable-IFLA_IF_NETNSID-for-RTM_DELLINK-RTM_SETINK' · 2479c2c9
      David S. Miller 提交于
      Christian Brauner says:
      
      ====================
      rtnetlink: enable IFLA_IF_NETNSID for RTM_{DEL,SET}LINK
      
      Based on the previous discussion this enables passing a IFLA_IF_NETNSID
      property along with RTM_SETLINK and RTM_DELLINK requests. The patch for
      RTM_NEWLINK will be sent out in a separate patch since there are more
      corner-cases to think about.
      
      Changelog 2018-01-24:
      * Preserve old behavior and report -ENODEV when either ifindex or ifname is
        provided and IFLA_GROUP is set. Spotted by Wolfgang Bumiller.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2479c2c9
    • C
      rtnetlink: enable IFLA_IF_NETNSID for RTM_DELLINK · b61ad68a
      Christian Brauner 提交于
      - Backwards Compatibility:
        If userspace wants to determine whether RTM_DELLINK supports the
        IFLA_IF_NETNSID property they should first send an RTM_GETLINK request
        with IFLA_IF_NETNSID on lo. If either EACCESS is returned or the reply
        does not include IFLA_IF_NETNSID userspace should assume that
        IFLA_IF_NETNSID is not supported on this kernel.
        If the reply does contain an IFLA_IF_NETNSID property userspace
        can send an RTM_DELLINK with a IFLA_IF_NETNSID property. If they receive
        EOPNOTSUPP then the kernel does not support the IFLA_IF_NETNSID property
        with RTM_DELLINK. Userpace should then fallback to other means.
      
      - Security:
        Callers must have CAP_NET_ADMIN in the owning user namespace of the
        target network namespace.
      Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b61ad68a
    • C
      rtnetlink: enable IFLA_IF_NETNSID for RTM_SETLINK · c310bfcb
      Christian Brauner 提交于
      - Backwards Compatibility:
        If userspace wants to determine whether RTM_SETLINK supports the
        IFLA_IF_NETNSID property they should first send an RTM_GETLINK request
        with IFLA_IF_NETNSID on lo. If either EACCESS is returned or the reply
        does not include IFLA_IF_NETNSID userspace should assume that
        IFLA_IF_NETNSID is not supported on this kernel.
        If the reply does contain an IFLA_IF_NETNSID property userspace
        can send an RTM_SETLINK with a IFLA_IF_NETNSID property. If they receive
        EOPNOTSUPP then the kernel does not support the IFLA_IF_NETNSID property
        with RTM_SETLINK. Userpace should then fallback to other means.
      
        To retain backwards compatibility the kernel will first check whether a
        IFLA_NET_NS_PID or IFLA_NET_NS_FD property has been passed. If either
        one is found it will be used to identify the target network namespace.
        This implies that users who do not care whether their running kernel
        supports IFLA_IF_NETNSID with RTM_SETLINK can pass both
        IFLA_NET_NS_{FD,PID} and IFLA_IF_NETNSID referring to the same network
        namespace.
      
      - Security:
        Callers must have CAP_NET_ADMIN in the owning user namespace of the
        target network namespace.
      Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c310bfcb
    • C
      rtnetlink: enable IFLA_IF_NETNSID in do_setlink() · 7c4f63ba
      Christian Brauner 提交于
      RTM_{NEW,SET}LINK already allow operations on other network namespaces
      by identifying the target network namespace through IFLA_NET_NS_{FD,PID}
      properties. This is done by looking for the corresponding properties in
      do_setlink(). Extend do_setlink() to also look for the IFLA_IF_NETNSID
      property. This introduces no functional changes since all callers of
      do_setlink() currently block IFLA_IF_NETNSID by reporting an error before
      they reach do_setlink().
      
      This introduces the helpers:
      
      static struct net *rtnl_link_get_net_by_nlattr(struct net *src_net, struct
                                                     nlattr *tb[])
      
      static struct net *rtnl_link_get_net_capable(const struct sk_buff *skb,
                                                   struct net *src_net,
      					     struct nlattr *tb[], int cap)
      
      to simplify permission checks and target network namespace retrieval for
      RTM_* requests that already support IFLA_NET_NS_{FD,PID} but get extended
      to IFLA_IF_NETNSID. To perserve backwards compatibility the helpers look
      for IFLA_NET_NS_{FD,PID} properties first before checking for
      IFLA_IF_NETNSID.
      Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7c4f63ba
  2. 29 1月, 2018 5 次提交
    • D
    • D
      Merge tag 'wireless-drivers-next-for-davem-2018-01-26' of... · 868c36dc
      David S. Miller 提交于
      Merge tag 'wireless-drivers-next-for-davem-2018-01-26' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next
      
      Kalle Valo says:
      
      ====================
      wireless-drivers-next patches for 4.16
      
      Major changes:
      
      wil6210
      
      * add PCI device id for Talyn
      
      * support flashless device
      
      ath9k
      
      * improve RSSI/signal accuracy on AR9003 series
      
      mt76
      
      * validate CCMP PN from received frames to avoid replay attacks
      
      qtnfmac
      
      * support 64-bit network stats
      
      * report more hardware information to kernel log and some via ethtool
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      868c36dc
    • K
      sfc: mark some unexported symbols as static · e7345ba3
      kbuild test robot 提交于
      efx_default_channel_want_txqs() is only used in efx.c, while
       efx_ptp_want_txqs() and efx_ptp_channel_type (a struct) are only used
       in ptp.c.  In all cases these symbols should be static.
      
      Fixes: 2935e3c3 ("sfc: on 8000 series use TX queues for TX timestamps")
      Signed-off-by: NFengguang Wu <fengguang.wu@intel.com>
      [ecree@solarflare.com: rewrote commit message]
      Signed-off-by: NEdward Cree <ecree@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e7345ba3
    • D
      Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · 5abe9ead
      David S. Miller 提交于
      Jeff Kirsher says:
      
      ====================
      40GbE Intel Wired LAN Driver Updates 2018-01-26
      
      This series contains updates to i40e and i40evf.
      
      Michal updates the driver to pass critical errors from the firmware to
      the caller.
      
      Patryk fixes an issue of creating multiple identical filters with the
      same location, by simply moving the functions so that we remove the
      existing filter and then add the new filter.
      
      Paweł adds back in the ability to turn off offloads when VLAN is set for
      the VF driver.  Fixed an issue where the number of TC queue pairs was
      exceeding MSI-X vectors count, causing messages about invalid TC mapping
      and wrong selected Tx queue.
      
      Alex cleans up the i40e/i40evf_set_itr_per_queue() by dropping all the
      unneeded pointer chases.  Puts to use the reg_idx value, which was going
      unused, so that we can avoid having to compute the vector every time
      throughout the driver.
      
      Upasana enable the driver to display LLDP information on the vSphere Web
      Client by exposing DCB parameters.
      
      Alice converts our flags from 32 to 64 bit size, since we have added
      more flags.
      
      Dave implements a private ethtool flag to disable the processing of LLDP
      packets by the firmware, so that the firmware will not consume LLDPDU
      and cause them to be sent up the stack.
      
      Alan adds a mechanism for detecting/storing the flag for processing of
      LLDP packets by the firmware, so that its current state is persistent
      across reboots/reloads of the driver.
      
      Avinash fixes kdump with i40e due to resource constraints.  We were
      enabling VMDq and iWARP when we just have a single CPU, which was
      starving kdump for the lack of IRQs.
      
      Jake adds support to program the fragmented IPv4 input set PCTYPE.
      Fixed the reported masks to properly report that the entire field is
      masked, since we had accidentally swapped the mask values for the IPv4
      addresses with the L4 port numbers.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5abe9ead
    • D
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · 457740a9
      David S. Miller 提交于
      Alexei Starovoitov says:
      
      ====================
      pull-request: bpf-next 2018-01-26
      
      The following pull-request contains BPF updates for your *net-next* tree.
      
      The main changes are:
      
      1) A number of extensions to tcp-bpf, from Lawrence.
          - direct R or R/W access to many tcp_sock fields via bpf_sock_ops
          - passing up to 3 arguments to bpf_sock_ops functions
          - tcp_sock field bpf_sock_ops_cb_flags for controlling callbacks
          - optionally calling bpf_sock_ops program when RTO fires
          - optionally calling bpf_sock_ops program when packet is retransmitted
          - optionally calling bpf_sock_ops program when TCP state changes
          - access to tclass and sk_txhash
          - new selftest
      
      2) div/mod exception handling, from Daniel.
          One of the ugly leftovers from the early eBPF days is that div/mod
          operations based on registers have a hard-coded src_reg == 0 test
          in the interpreter as well as in JIT code generators that would
          return from the BPF program with exit code 0. This was basically
          adopted from cBPF interpreter for historical reasons.
          There are multiple reasons why this is very suboptimal and prone
          to bugs. To name one: the return code mapping for such abnormal
          program exit of 0 does not always match with a suitable program
          type's exit code mapping. For example, '0' in tc means action 'ok'
          where the packet gets passed further up the stack, which is just
          undesirable for such cases (e.g. when implementing policy) and
          also does not match with other program types.
          After considering _four_ different ways to address the problem,
          we adapt the same behavior as on some major archs like ARMv8:
          X div 0 results in 0, and X mod 0 results in X. aarch64 and
          aarch32 ISA do not generate any traps or otherwise aborts
          of program execution for unsigned divides.
          Given the options, it seems the most suitable from
          all of them, also since major archs have similar schemes in
          place. Given this is all in the realm of undefined behavior,
          we still have the option to adapt if deemed necessary.
      
      3) sockmap sample refactoring, from John.
      
      4) lpm map get_next_key fixes, from Yonghong.
      
      5) test cleanups, from Alexei and Prashant.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      457740a9
  3. 28 1月, 2018 2 次提交
    • D
      Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · 6b2e2829
      David S. Miller 提交于
      Jeff Kirsher says:
      
      ====================
      10GbE Intel Wired LAN Driver Updates 2018-01-26
      
      This series contains updates to ixgbe and ixgbevf.
      
      Emil updates ixgbevf to match ixgbe functionality, starting with the
      consolidating of functions that represent logical steps in the receive
      process so we can later update them more easily.  Updated ixgbevf to
      only synchronize the length of the frame, which will typically be the
      MTU or smaller.  Updated the VF driver to use the length of the packet
      instead of the DD status bit to determine if a new descriptor is ready
      to be processed, which saves on reads and we can save time on
      initialization.  Added support for DMA_ATTR_SKIP_CPU_SYNC/WEAK_ORDERING
      to help improve performance on some platforms.  Updated the VF driver to
      do bulk updates of the page reference count instead of just incrementing
      it by one reference at a time.  Updated the VF driver to only go through
      the region of the receive ring that was designated to be cleaned up,
      rather than process the entire ring.
      
      Colin Ian King adds the use of ARRAY_SIZE() on various arrays.
      
      Miroslav Lichvar fixes an issue where ethtool was reporting timestamping
      filters unsupported for X550, which is incorrect.
      
      Paul adds support for reporting 5G link speed for some devices.
      
      Dan Carpenter fixes a typo where && was used when it should have been
      ||.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6b2e2829
    • L
      net/rocker: Remove unreachable return instruction · 751c45bd
      Leon Romanovsky 提交于
      The "return 0" instruction follows other return instruction
      and it makes it impossible to execute, hence remove it.
      
      Fixes: 00fc0c51 ("rocker: Change world_ops API and implementation to be switchdev independant")
      Signed-off-by: NLeon Romanovsky <leonro@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      751c45bd
  4. 27 1月, 2018 26 次提交
    • A
      Merge branch 'fix-lpm-map' · 8223967f
      Alexei Starovoitov 提交于
      Yonghong Song says:
      
      ====================
      A kernel page fault which happens in lpm map trie_get_next_key is reported
      by syzbot and Eric. The issue was introduced by commit b471f2f1
      ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map").
      Patch #1 fixed the issue in the kernel and patch #2 adds a multithreaded
      test case in tools/testing/selftests/bpf/test_lpm_map.
      ====================
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      8223967f
    • Y
      tools/bpf: add a multithreaded stress test in bpf selftests test_lpm_map · af32efee
      Yonghong Song 提交于
      The new test will spawn four threads, doing map update, delete, lookup
      and get_next_key in parallel. It is able to reproduce the issue in the
      previous commit found by syzbot and Eric Dumazet.
      Signed-off-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      af32efee
    • Y
      bpf: fix kernel page fault in lpm map trie_get_next_key · 6dd1ec6c
      Yonghong Song 提交于
      Commit b471f2f1 ("bpf: implement MAP_GET_NEXT_KEY command
      for LPM_TRIE map") introduces a bug likes below:
      
          if (!rcu_dereference(trie->root))
              return -ENOENT;
          if (!key || key->prefixlen > trie->max_prefixlen) {
              root = &trie->root;
              goto find_leftmost;
          }
          ......
        find_leftmost:
          for (node = rcu_dereference(*root); node;) {
      
      In the code after label find_leftmost, it is assumed
      that *root should not be NULL, but it is not true as
      it is possbile trie->root is changed to NULL by an
      asynchronous delete operation.
      
      The issue is reported by syzbot and Eric Dumazet with the
      below error log:
        ......
        kasan: CONFIG_KASAN_INLINE enabled
        kasan: GPF could be caused by NULL-ptr deref or user memory access
        general protection fault: 0000 [#1] SMP KASAN
        Dumping ftrace buffer:
           (ftrace buffer empty)
        Modules linked in:
        CPU: 1 PID: 8033 Comm: syz-executor3 Not tainted 4.15.0-rc8+ #4
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:trie_get_next_key+0x3c2/0xf10 kernel/bpf/lpm_trie.c:682
        ......
      
      This patch fixed the issue by use local rcu_dereferenced
      pointer instead of *(&trie->root) later on.
      
      Fixes: b471f2f1 ("bpf: implement MAP_GET_NEXT_KEY command or LPM_TRIE map")
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Reported-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      6dd1ec6c
    • A
      Merge branch 'bpf-improvements-and-fixes' · 1651e39e
      Alexei Starovoitov 提交于
      Daniel Borkmann says:
      
      ====================
      This set contains a small cleanup in cBPF prologue generation and
      otherwise fixes an outstanding issue related to BPF to BPF calls
      and exception handling. For details please see related patches.
      Last but not least, BPF selftests is extended with several new
      test cases.
      
      Thanks!
      ====================
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      1651e39e
    • D
      bpf: add further test cases around div/mod and others · 21ccaf21
      Daniel Borkmann 提交于
      Update selftests to relfect recent changes and add various new
      test cases.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      21ccaf21
    • D
      bpf, arm: remove obsolete exception handling from div/mod · 73ae3c04
      Daniel Borkmann 提交于
      Since we've changed div/mod exception handling for src_reg in
      eBPF verifier itself, remove the leftovers from arm32 JIT.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Cc: Shubham Bansal <illusionist.neo@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      73ae3c04
    • D
      bpf, mips64: remove unneeded zero check from div/mod with k · e472d5d8
      Daniel Borkmann 提交于
      The verifier in both cBPF and eBPF reject div/mod by 0 imm,
      so this can never load. Remove emitting such test and reject
      it from being JITed instead (the latter is actually also not
      needed, but given practice in sparc64, ppc64 today, so
      doesn't hurt to add it here either).
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Cc: David Daney <david.daney@cavium.com>
      Reviewed-by: NDavid Daney <david.daney@cavium.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      e472d5d8
    • D
      bpf, mips64: remove obsolete exception handling from div/mod · 1fb5c9c6
      Daniel Borkmann 提交于
      Since we've changed div/mod exception handling for src_reg in
      eBPF verifier itself, remove the leftovers from mips64 JIT.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Cc: David Daney <david.daney@cavium.com>
      Reviewed-by: NDavid Daney <david.daney@cavium.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      1fb5c9c6
    • D
      bpf, sparc64: remove obsolete exception handling from div/mod · 740d52c6
      Daniel Borkmann 提交于
      Since we've changed div/mod exception handling for src_reg in
      eBPF verifier itself, remove the leftovers from sparc64 JIT.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      740d52c6
    • D
      bpf, ppc64: remove obsolete exception handling from div/mod · 53fbf571
      Daniel Borkmann 提交于
      Since we've changed div/mod exception handling for src_reg in
      eBPF verifier itself, remove the leftovers from ppc64 JIT.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      53fbf571
    • D
      bpf, s390x: remove obsolete exception handling from div/mod · a3212b8f
      Daniel Borkmann 提交于
      Since we've changed div/mod exception handling for src_reg in
      eBPF verifier itself, remove the leftovers from s390x JIT.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      a3212b8f
    • D
      bpf, arm64: remove obsolete exception handling from div/mod · 96a71005
      Daniel Borkmann 提交于
      Since we've changed div/mod exception handling for src_reg in
      eBPF verifier itself, remove the leftovers from arm64 JIT.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      96a71005
    • D
      bpf, x86_64: remove obsolete exception handling from div/mod · 3e5b1a39
      Daniel Borkmann 提交于
      Since we've changed div/mod exception handling for src_reg in
      eBPF verifier itself, remove the leftovers from x86_64 JIT.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      3e5b1a39
    • D
      bpf: fix subprog verifier bypass by div/mod by 0 exception · f6b1b3bf
      Daniel Borkmann 提交于
      One of the ugly leftovers from the early eBPF days is that div/mod
      operations based on registers have a hard-coded src_reg == 0 test
      in the interpreter as well as in JIT code generators that would
      return from the BPF program with exit code 0. This was basically
      adopted from cBPF interpreter for historical reasons.
      
      There are multiple reasons why this is very suboptimal and prone
      to bugs. To name one: the return code mapping for such abnormal
      program exit of 0 does not always match with a suitable program
      type's exit code mapping. For example, '0' in tc means action 'ok'
      where the packet gets passed further up the stack, which is just
      undesirable for such cases (e.g. when implementing policy) and
      also does not match with other program types.
      
      While trying to work out an exception handling scheme, I also
      noticed that programs crafted like the following will currently
      pass the verifier:
      
        0: (bf) r6 = r1
        1: (85) call pc+8
        caller:
         R6=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
        callee:
         frame1: R1=ctx(id=0,off=0,imm=0) R10=fp0,call_1
        10: (b4) (u32) r2 = (u32) 0
        11: (b4) (u32) r3 = (u32) 1
        12: (3c) (u32) r3 /= (u32) r2
        13: (61) r0 = *(u32 *)(r1 +76)
        14: (95) exit
        returning from callee:
         frame1: R0_w=pkt(id=0,off=0,r=0,imm=0)
                 R1=ctx(id=0,off=0,imm=0) R2_w=inv0
                 R3_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff))
                 R10=fp0,call_1
        to caller at 2:
         R0_w=pkt(id=0,off=0,r=0,imm=0) R6=ctx(id=0,off=0,imm=0)
         R10=fp0,call_-1
      
        from 14 to 2: R0=pkt(id=0,off=0,r=0,imm=0)
                      R6=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
        2: (bf) r1 = r6
        3: (61) r1 = *(u32 *)(r1 +80)
        4: (bf) r2 = r0
        5: (07) r2 += 8
        6: (2d) if r2 > r1 goto pc+1
         R0=pkt(id=0,off=0,r=8,imm=0) R1=pkt_end(id=0,off=0,imm=0)
         R2=pkt(id=0,off=8,r=8,imm=0) R6=ctx(id=0,off=0,imm=0)
         R10=fp0,call_-1
        7: (71) r0 = *(u8 *)(r0 +0)
        8: (b7) r0 = 1
        9: (95) exit
      
        from 6 to 8: safe
        processed 16 insns (limit 131072), stack depth 0+0
      
      Basically what happens is that in the subprog we make use of a
      div/mod by 0 exception and in the 'normal' subprog's exit path
      we just return skb->data back to the main prog. This has the
      implication that the verifier thinks we always get a pkt pointer
      in R0 while we still have the implicit 'return 0' from the div
      as an alternative unconditional return path earlier. Thus, R0
      then contains 0, meaning back in the parent prog we get the
      address range of [0x0, skb->data_end] as read and writeable.
      Similar can be crafted with other pointer register types.
      
      Since i) BPF_ABS/IND is not allowed in programs that contain
      BPF to BPF calls (and generally it's also disadvised to use in
      native eBPF context), ii) unknown opcodes don't return zero
      anymore, iii) we don't return an exception code in dead branches,
      the only last missing case affected and to fix is the div/mod
      handling.
      
      What we would really need is some infrastructure to propagate
      exceptions all the way to the original prog unwinding the
      current stack and returning that code to the caller of the
      BPF program. In user space such exception handling for similar
      runtimes is typically implemented with setjmp(3) and longjmp(3)
      as one possibility which is not available in the kernel,
      though (kgdb used to implement it in kernel long time ago). I
      implemented a PoC exception handling mechanism into the BPF
      interpreter with porting setjmp()/longjmp() into x86_64 and
      adding a new internal BPF_ABRT opcode that can use a program
      specific exception code for all exception cases we have (e.g.
      div/mod by 0, unknown opcodes, etc). While this seems to work
      in the constrained BPF environment (meaning, here, we don't
      need to deal with state e.g. from memory allocations that we
      would need to undo before going into exception state), it still
      has various drawbacks: i) we would need to implement the
      setjmp()/longjmp() for every arch supported in the kernel and
      for x86_64, arm64, sparc64 JITs currently supporting calls,
      ii) it has unconditional additional cost on main program
      entry to store CPU register state in initial setjmp() call,
      and we would need some way to pass the jmp_buf down into
      ___bpf_prog_run() for main prog and all subprogs, but also
      storing on stack is not really nice (other option would be
      per-cpu storage for this, but it also has the drawback that
      we need to disable preemption for every BPF program types).
      All in all this approach would add a lot of complexity.
      
      Another poor-man's solution would be to have some sort of
      additional shared register or scratch buffer to hold state
      for exceptions, and test that after every call return to
      chain returns and pass R0 all the way down to BPF prog caller.
      This is also problematic in various ways: i) an additional
      register doesn't map well into JITs, and some other scratch
      space could only be on per-cpu storage, which, again has the
      side-effect that this only works when we disable preemption,
      or somewhere in the input context which is not available
      everywhere either, and ii) this adds significant runtime
      overhead by putting conditionals after each and every call,
      as well as implementation complexity.
      
      Yet another option is to teach verifier that div/mod can
      return an integer, which however is also complex to implement
      as verifier would need to walk such fake 'mov r0,<code>; exit;'
      sequeuence and there would still be no guarantee for having
      propagation of this further down to the BPF caller as proper
      exception code. For parent prog, it is also is not distinguishable
      from a normal return of a constant scalar value.
      
      The approach taken here is a completely different one with
      little complexity and no additional overhead involved in
      that we make use of the fact that a div/mod by 0 is undefined
      behavior. Instead of bailing out, we adapt the same behavior
      as on some major archs like ARMv8 [0] into eBPF as well:
      X div 0 results in 0, and X mod 0 results in X. aarch64 and
      aarch32 ISA do not generate any traps or otherwise aborts
      of program execution for unsigned divides. I verified this
      also with a test program compiled by gcc and clang, and the
      behavior matches with the spec. Going forward we adapt the
      eBPF verifier to emit such rewrites once div/mod by register
      was seen. cBPF is not touched and will keep existing 'return 0'
      semantics. Given the options, it seems the most suitable from
      all of them, also since major archs have similar schemes in
      place. Given this is all in the realm of undefined behavior,
      we still have the option to adapt if deemed necessary and
      this way we would also have the option of more flexibility
      from LLVM code generation side (which is then fully visible
      to verifier). Thus, this patch i) fixes the panic seen in
      above program and ii) doesn't bypass the verifier observations.
      
        [0] ARM Architecture Reference Manual, ARMv8 [ARM DDI 0487B.b]
            http://infocenter.arm.com/help/topic/com.arm.doc.ddi0487b.b/DDI0487B_b_armv8_arm.pdf
            1) aarch64 instruction set: section C3.4.7 and C6.2.279 (UDIV)
               "A division by zero results in a zero being written to
                the destination register, without any indication that
                the division by zero occurred."
            2) aarch32 instruction set: section F1.4.8 and F5.1.263 (UDIV)
               "For the SDIV and UDIV instructions, division by zero
                always returns a zero result."
      
      Fixes: f4d7e40a ("bpf: introduce function calls (verification)")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      f6b1b3bf
    • D
      bpf: make unknown opcode handling more robust · 5e581dad
      Daniel Borkmann 提交于
      Recent findings by syzcaller fixed in 7891a87e ("bpf: arsh is
      not supported in 32 bit alu thus reject it") triggered a warning
      in the interpreter due to unknown opcode not being rejected by
      the verifier. The 'return 0' for an unknown opcode is really not
      optimal, since with BPF to BPF calls, this would go untracked by
      the verifier.
      
      Do two things here to improve the situation: i) perform basic insn
      sanity check early on in the verification phase and reject every
      non-uapi insn right there. The bpf_opcode_in_insntable() table
      reuses the same mapping as the jumptable in ___bpf_prog_run() sans
      the non-public mappings. And ii) in ___bpf_prog_run() we do need
      to BUG in the case where the verifier would ever create an unknown
      opcode due to some rewrites.
      
      Note that JITs do not have such issues since they would punt to
      interpreter in these situations. Moreover, the BPF_JIT_ALWAYS_ON
      would also help to avoid such unknown opcodes in the first place.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      5e581dad
    • D
      bpf: improve dead code sanitizing · 2a5418a1
      Daniel Borkmann 提交于
      Given we recently had c131187d ("bpf: fix branch pruning
      logic") and 95a762e2 ("bpf: fix incorrect sign extension in
      check_alu_op()") in particular where before verifier skipped
      verification of the wrongly assumed dead branch, we should not
      just replace the dead code parts with nops (mov r0,r0). If there
      is a bug such as fixed in 95a762e2 in future again, where
      runtime could execute those insns, then one of the potential
      issues with the current setting would be that given the nops
      would be at the end of the program, we could execute out of
      bounds at some point.
      
      The best in such case would be to just exit the BPF program
      altogether and return an exception code. However, given this
      would require two instructions, and such a dead code gap could
      just be a single insn long, we would need to place 'r0 = X; ret'
      snippet at the very end after the user program or at the start
      before the program (where we'd skip that region on prog entry),
      and then place unconditional ja's into the dead code gap.
      
      While more complex but possible, there's still another block
      in the road that currently prevents from this, namely BPF to
      BPF calls. The issue here is that such exception could be
      returned from a callee, but the caller would not know that
      it's an exception that needs to be propagated further down.
      Alternative that has little complexity is to just use a ja-1
      code for now which will trap the execution here instead of
      silently doing bad things if we ever get there due to bugs.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      2a5418a1
    • D
      bpf: xor of a/x in cbpf can be done in 32 bit alu · 1d621674
      Daniel Borkmann 提交于
      Very minor optimization; saves 1 byte per program in x86_64
      JIT in cBPF prologue.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      1d621674
    • M
      samples/bpf: Partially fixes the bpf.o build · c25ef6a5
      Mickaël Salaün 提交于
      Do not build lib/bpf/bpf.o with this Makefile but use the one from the
      library directory.  This avoid making a buggy bpf.o file (e.g. missing
      symbols).
      
      This patch is useful if some code (e.g. Landlock tests) needs both the
      bpf.o (from tools/lib/bpf) and the bpf_load.o (from samples/bpf).
      Signed-off-by: NMickaël Salaün <mic@digikod.net>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      c25ef6a5
    • L
      bpf: clean up from test_tcpbpf_kern.c · 771fc607
      Lawrence Brakmo 提交于
      Removed commented lines from test_tcpbpf_kern.c
      
      Fixes: d6d4f60c bpf: add selftest for tcpbpf
      Signed-off-by: NLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      771fc607
    • P
      i40e: Do not allow use more TC queue pairs than MSI-X vectors exist · 1563f2d2
      Paweł Jabłoński 提交于
      This patch suppresses the message about invalid TC mapping and wrong
      selected TX queue. The root cause of this bug was setting too many
      TC queue pairs on huge multiprocessor machines. When quantity of the
      TC queue pairs is exceeding MSI-X vectors count then TX queue number
      can be selected beyond actual TX queues amount.
      Signed-off-by: NPaweł Jabłoński <pawel.jablonski@intel.com>
      Tested-by: NAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
      1563f2d2
    • A
      i40e/i40evf: Record ITR register location in the q_vector · a3f9fb5e
      Alexander Duyck 提交于
      The drivers for i40e and i40evf had a reg_idx value stored in the q_vector
      that was going completely unused. I can only assume this was copied over
      from ixgbe and nobody knew how to use it.
      
      I'm going to make use of the value to avoid having to compute the vector
      and thus the register index for multiple paths throughout the drivers.
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com>
      Tested-by: NAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
      a3f9fb5e
    • J
      i40e: fix reported mask for ntuple filters · 40339af3
      Jacob Keller 提交于
      In commit 36777d9f ("i40e: check current configured input set when
      adding ntuple filters") some code was added to report the input set
      mask for a given filter when reporting it to the user.
      
      This code is necessary so that the reported filter correctly displays
      that it is or is not masking certain fields.
      
      Unfortunately the code was incorrect. Development error accidentally
      swapped the mask values for the IPv4 addresses with the L4 port numbers.
      The port numbers are only 16bits wide while IPv4 addresses are 32 bits.
      Unfortunately we assigned only 16 bits to the IPv4 address masks.
      Additionally we assigned 32bit value 0xFFFFFFF to the TCP port numbers.
      This second part does not matter as the value would be truncated to
      16bits regardless, but it is unnecessary.
      
      Fix the reported masks to properly report that the entire field is
      masked.
      
      Fixes: 36777d9f ("i40e: check current configured input set when adding ntuple filters")
      Signed-off-by: NJacob Keller <jacob.e.keller@intel.com>
      Tested-by: NAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
      40339af3
    • J
      i40e: disallow programming multiple filters with same criteria · 443ee71a
      Jacob Keller 提交于
      Our hardware does not allow situations where two filters might conflict
      when matching. Essentially hardware only programs one filter for each
      set of matching criteria. We don't support filters with overlapping
      input sets, because each flow type can only use a single input set.
      
      Additionally, different flow types will never have overlapping matches,
      because of how the hardware parses the flow type before checking
      matching criteria.
      
      For this reason, we do not need or use the location number when
      programming filters to hardware.
      
      In order to avoid confusing scenarios with filters that match the same
      criteria but program the flow to different queues, do not allow multiple
      filters that match identical criteria to be programmed.
      
      This ensures that we avoid odd scenarios when deleting filters, and when
      programming new filters that match the same criteria.
      
      Instead, users that wish to update the criteria for a filter must use
      the same location id, or must delete all the matching filters first.
      Signed-off-by: NJacob Keller <jacob.e.keller@intel.com>
      Tested-by: NAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
      443ee71a
    • J
      i40e: program fragmented IPv4 filter input set · 02b4016b
      Jacob Keller 提交于
      When implementing support for IP_USER_FLOW filters, we correctly
      programmed a filter for both the non fragmented IPv4/Other filter, as
      well as the fragmented IPv4 filters. However, we did not properly
      program the input set for fragmented IPv4 PCTYPE. This meant that the
      filters would almost certainly not match, unless the user specified all
      of the flow types.
      
      Add support to program the fragmented IPv4 filter input set. Since we
      always program these filters together, we'll assume that the two input
      sets must match, and will thus always program the input sets to the same
      value.
      Signed-off-by: NJacob Keller <jacob.e.keller@intel.com>
      Tested-by: NAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
      02b4016b
    • A
      i40e: Fix kdump failure · 69399873
      Avinash Dayanand 提交于
      kdump fails in the system when used in conjunction with Ethernet driver
      X722/X710. This is mainly because when we are resource constrained i.e.
      when we have just one online_cpus, we are enabling VMDq and iWARP. It
      doesn't make sense to enable them with just one CPU and starve kdump
      for lack of IRQs.
      
      So don't enable VMDq or iWARP when we just have a single CPU.
      Signed-off-by: NAvinash Dayanand <avinash.dayanand@intel.com>
      Tested-by: NAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
      69399873
    • J
      i40e: cleanup unnecessary parens · 5056716c
      Jeff Kirsher 提交于
      Clean up unnecessary parenthesis.
      Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
      Tested-by: NAndrew Bowers <andrewx.bowers@intel.com>
      5056716c