1. 26 3月, 2020 1 次提交
    • P
      net: Fix CONFIG_NET_CLS_ACT=n and CONFIG_NFT_FWD_NETDEV={y, m} build · 2c64605b
      Pablo Neira Ayuso 提交于
      net/netfilter/nft_fwd_netdev.c: In function ‘nft_fwd_netdev_eval’:
          net/netfilter/nft_fwd_netdev.c:32:10: error: ‘struct sk_buff’ has no member named ‘tc_redirected’
            pkt->skb->tc_redirected = 1;
                    ^~
          net/netfilter/nft_fwd_netdev.c:33:10: error: ‘struct sk_buff’ has no member named ‘tc_from_ingress’
            pkt->skb->tc_from_ingress = 1;
                    ^~
      
      To avoid a direct dependency with tc actions from netfilter, wrap the
      redirect bits around CONFIG_NET_REDIRECT and move helpers to
      include/linux/skbuff.h. Turn on this toggle from the ifb driver, the
      only existing client of these bits in the tree.
      
      This patch adds skb_set_redirected() that sets on the redirected bit
      on the skbuff, it specifies if the packet was redirect from ingress
      and resets the timestamp (timestamp reset was originally missing in the
      netfilter bugfix).
      
      Fixes: bcfabee1 ("netfilter: nft_fwd_netdev: allow to redirect to ifb via ingress")
      Reported-by: noreply@ellerman.id.au
      Reported-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2c64605b
  2. 21 2月, 2020 1 次提交
  3. 24 1月, 2020 1 次提交
  4. 28 12月, 2019 1 次提交
  5. 26 12月, 2019 1 次提交
  6. 22 11月, 2019 1 次提交
  7. 18 8月, 2019 1 次提交
    • I
      devlink: Add packet trap infrastructure · 0f420b6c
      Ido Schimmel 提交于
      Add the basic packet trap infrastructure that allows device drivers to
      register their supported packet traps and trap groups with devlink.
      
      Each driver is expected to provide basic information about each
      supported trap, such as name and ID, but also the supported metadata
      types that will accompany each packet trapped via the trap. The
      currently supported metadata type is just the input port, but more will
      be added in the future. For example, output port and traffic class.
      
      Trap groups allow users to set the action of all member traps. In
      addition, users can retrieve per-group statistics in case per-trap
      statistics are too narrow. In the future, the trap group object can be
      extended with more attributes, such as policer settings which will limit
      the amount of traffic generated by member traps towards the CPU.
      
      Beside registering their packet traps with devlink, drivers are also
      expected to report trapped packets to devlink along with relevant
      metadata. devlink will maintain packets and bytes statistics for each
      packet trap and will potentially report the trapped packet with its
      metadata to user space via drop monitor netlink channel.
      
      The interface towards the drivers is simple and allows devlink to set
      the action of the trap. Currently, only two actions are supported:
      'trap' and 'drop'. When set to 'trap', the device is expected to provide
      the sole copy of the packet to the driver which will pass it to devlink.
      When set to 'drop', the device is expected to drop the packet and not
      send a copy to the driver. In the future, more actions can be added,
      such as 'mirror'.
      Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0f420b6c
  8. 18 6月, 2019 1 次提交
    • A
      net: ipv4: move tcp_fastopen server side code to SipHash library · c681edae
      Ard Biesheuvel 提交于
      Using a bare block cipher in non-crypto code is almost always a bad idea,
      not only for security reasons (and we've seen some examples of this in
      the kernel in the past), but also for performance reasons.
      
      In the TCP fastopen case, we call into the bare AES block cipher one or
      two times (depending on whether the connection is IPv4 or IPv6). On most
      systems, this results in a call chain such as
      
        crypto_cipher_encrypt_one(ctx, dst, src)
          crypto_cipher_crt(tfm)->cit_encrypt_one(crypto_cipher_tfm(tfm), ...);
            aesni_encrypt
              kernel_fpu_begin();
              aesni_enc(ctx, dst, src); // asm routine
              kernel_fpu_end();
      
      It is highly unlikely that the use of special AES instructions has a
      benefit in this case, especially since we are doing the above twice
      for IPv6 connections, instead of using a transform which can process
      the entire input in one go.
      
      We could switch to the cbcmac(aes) shash, which would at least get
      rid of the duplicated overhead in *some* cases (i.e., today, only
      arm64 has an accelerated implementation of cbcmac(aes), while x86 will
      end up using the generic cbcmac template wrapping the AES-NI cipher,
      which basically ends up doing exactly the above). However, in the given
      context, it makes more sense to use a light-weight MAC algorithm that
      is more suitable for the purpose at hand, such as SipHash.
      
      Since the output size of SipHash already matches our chosen value for
      TCP_FASTOPEN_COOKIE_SIZE, and given that it accepts arbitrary input
      sizes, this greatly simplifies the code as well.
      
      NOTE: Server farms backing a single server IP for load balancing purposes
            and sharing a single fastopen key will be adversely affected by
            this change unless all systems in the pool receive their kernel
            upgrades at the same time.
      Signed-off-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c681edae
  9. 21 5月, 2019 1 次提交
  10. 25 3月, 2019 1 次提交
  11. 27 2月, 2019 1 次提交
  12. 16 2月, 2019 1 次提交
  13. 20 12月, 2018 2 次提交
    • F
      net: convert bridge_nf to use skb extension infrastructure · de8bda1d
      Florian Westphal 提交于
      This converts the bridge netfilter (calling iptables hooks from bridge)
      facility to use the extension infrastructure.
      
      The bridge_nf specific hooks in skb clone and free paths are removed, they
      have been replaced by the skb_ext hooks that do the same as the bridge nf
      allocations hooks did.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      de8bda1d
    • F
      sk_buff: add skb extension infrastructure · df5042f4
      Florian Westphal 提交于
      This adds an optional extension infrastructure, with ispec (xfrm) and
      bridge netfilter as first users.
      objdiff shows no changes if kernel is built without xfrm and br_netfilter
      support.
      
      The third (planned future) user is Multipath TCP which is still
      out-of-tree.
      MPTCP needs to map logical mptcp sequence numbers to the tcp sequence
      numbers used by individual subflows.
      
      This DSS mapping is read/written from tcp option space on receive and
      written to tcp option space on transmitted tcp packets that are part of
      and MPTCP connection.
      
      Extending skb_shared_info or adding a private data field to skb fclones
      doesn't work for incoming skb, so a different DSS propagation method would
      be required for the receive side.
      
      mptcp has same requirements as secpath/bridge netfilter:
      
      1. extension memory is released when the sk_buff is free'd.
      2. data is shared after cloning an skb (clone inherits extension)
      3. adding extension to an skb will COW the extension buffer if needed.
      
      The "MPTCP upstreaming" effort adds SKB_EXT_MPTCP extension to store the
      mapping for tx and rx processing.
      
      Two new members are added to sk_buff:
      1. 'active_extensions' byte (filling a hole), telling which extensions
         are available for this skb.
         This has two purposes.
         a) avoids the need to initialize the pointer.
         b) allows to "delete" an extension by clearing its bit
         value in ->active_extensions.
      
         While it would be possible to store the active_extensions byte
         in the extension struct instead of sk_buff, there is one problem
         with this:
          When an extension has to be disabled, we can always clear the
          bit in skb->active_extensions.  But in case it would be stored in the
          extension buffer itself, we might have to COW it first, if
          we are dealing with a cloned skb.  On kmalloc failure we would
          be unable to turn an extension off.
      
      2. extension pointer, located at the end of the sk_buff.
         If the active_extensions byte is 0, the pointer is undefined,
         it is not initialized on skb allocation.
      
      This adds extra code to skb clone and free paths (to deal with
      refcount/free of extension area) but this replaces similar code that
      manages skb->nf_bridge and skb->sp structs in the followup patches of
      the series.
      
      It is possible to add support for extensions that are not preseved on
      clones/copies.
      
      To do this, it would be needed to define a bitmask of all extensions that
      need copy/cow semantics, and change __skb_ext_copy() to check
      ->active_extensions & SKB_EXT_PRESERVE_ON_CLONE, then just set
      ->active_extensions to 0 on the new clone.
      
      This isn't done here because all extensions that get added here
      need the copy/cow semantics.
      
      v2:
      Allocate entire extension space using kmem_cache.
      Upside is that this allows better tracking of used memory,
      downside is that we will allocate more space than strictly needed in
      most cases (its unlikely that all extensions are active/needed at same
      time for same skb).
      The allocated memory (except the small extension header) is not cleared,
      so no additonal overhead aside from memory usage.
      
      Avoid atomic_dec_and_test operation on skb_ext_put()
      by using similar trick as kfree_skbmem() does with fclone_ref:
      If recount is 1, there is no concurrent user and we can free right away.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      df5042f4
  14. 16 10月, 2018 1 次提交
    • D
      bpf, sockmap: convert to generic sk_msg interface · 604326b4
      Daniel Borkmann 提交于
      Add a generic sk_msg layer, and convert current sockmap and later
      kTLS over to make use of it. While sk_buff handles network packet
      representation from netdevice up to socket, sk_msg handles data
      representation from application to socket layer.
      
      This means that sk_msg framework spans across ULP users in the
      kernel, and enables features such as introspection or filtering
      of data with the help of BPF programs that operate on this data
      structure.
      
      Latter becomes in particular useful for kTLS where data encryption
      is deferred into the kernel, and as such enabling the kernel to
      perform L7 introspection and policy based on BPF for TLS connections
      where the record is being encrypted after BPF has run and came to
      a verdict. In order to get there, first step is to transform open
      coding of scatter-gather list handling into a common core framework
      that subsystems can use.
      
      The code itself has been split and refactored into three bigger
      pieces: i) the generic sk_msg API which deals with managing the
      scatter gather ring, providing helpers for walking and mangling,
      transferring application data from user space into it, and preparing
      it for BPF pre/post-processing, ii) the plain sock map itself
      where sockets can be attached to or detached from; these bits
      are independent of i) which can now be used also without sock
      map, and iii) the integration with plain TCP as one protocol
      to be used for processing L7 application data (later this could
      e.g. also be extended to other protocols like UDP). The semantics
      are the same with the old sock map code and therefore no change
      of user facing behavior or APIs. While pursuing this work it
      also helped finding a number of bugs in the old sockmap code
      that we've fixed already in earlier commits. The test_sockmap
      kselftest suite passes through fine as well.
      
      Joint work with John.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      604326b4
  15. 25 7月, 2018 1 次提交
  16. 29 5月, 2018 1 次提交
    • S
      net: Introduce generic failover module · 30c8bd5a
      Sridhar Samudrala 提交于
      The failover module provides a generic interface for paravirtual drivers
      to register a netdev and a set of ops with a failover instance. The ops
      are used as event handlers that get called to handle netdev register/
      unregister/link change/name change events on slave pci ethernet devices
      with the same mac address as the failover netdev.
      
      This enables paravirtual drivers to use a VF as an accelerated low latency
      datapath. It also allows migration of VMs with direct attached VFs by
      failing over to the paravirtual datapath when the VF is unplugged.
      Signed-off-by: NSridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      30c8bd5a
  17. 24 5月, 2018 1 次提交
    • A
      net: add skeleton of bpfilter kernel module · d2ba09c1
      Alexei Starovoitov 提交于
      bpfilter.ko consists of bpfilter_kern.c (normal kernel module code)
      and user mode helper code that is embedded into bpfilter.ko
      
      The steps to build bpfilter.ko are the following:
      - main.c is compiled by HOSTCC into the bpfilter_umh elf executable file
      - with quite a bit of objcopy and Makefile magic the bpfilter_umh elf file
        is converted into bpfilter_umh.o object file
        with _binary_net_bpfilter_bpfilter_umh_start and _end symbols
        Example:
        $ nm ./bld_x64/net/bpfilter/bpfilter_umh.o
        0000000000004cf8 T _binary_net_bpfilter_bpfilter_umh_end
        0000000000004cf8 A _binary_net_bpfilter_bpfilter_umh_size
        0000000000000000 T _binary_net_bpfilter_bpfilter_umh_start
      - bpfilter_umh.o and bpfilter_kern.o are linked together into bpfilter.ko
      
      bpfilter_kern.c is a normal kernel module code that calls
      the fork_usermode_blob() helper to execute part of its own data
      as a user mode process.
      
      Notice that _binary_net_bpfilter_bpfilter_umh_start - end
      is placed into .init.rodata section, so it's freed as soon as __init
      function of bpfilter.ko is finished.
      As part of __init the bpfilter.ko does first request/reply action
      via two unix pipe provided by fork_usermode_blob() helper to
      make sure that umh is healthy. If not it will kill it via pid.
      
      Later bpfilter_process_sockopt() will be called from bpfilter hooks
      in get/setsockopt() to pass iptable commands into umh via bpfilter.ko
      
      If admin does 'rmmod bpfilter' the __exit code bpfilter.ko will
      kill umh as well.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d2ba09c1
  18. 04 5月, 2018 1 次提交
  19. 01 5月, 2018 1 次提交
  20. 17 4月, 2018 1 次提交
    • J
      page_pool: refurbish version of page_pool code · ff7d6b27
      Jesper Dangaard Brouer 提交于
      Need a fast page recycle mechanism for ndo_xdp_xmit API for returning
      pages on DMA-TX completion time, which have good cross CPU
      performance, given DMA-TX completion time can happen on a remote CPU.
      
      Refurbish my page_pool code, that was presented[1] at MM-summit 2016.
      Adapted page_pool code to not depend the page allocator and
      integration into struct page.  The DMA mapping feature is kept,
      even-though it will not be activated/used in this patchset.
      
      [1] http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf
      
      V2: Adjustments requested by Tariq
       - Changed page_pool_create return codes, don't return NULL, only
         ERR_PTR, as this simplifies err handling in drivers.
      
      V4: many small improvements and cleanups
      - Add DOC comment section, that can be used by kernel-doc
      - Improve fallback mode, to work better with refcnt based recycling
        e.g. remove a WARN as pointed out by Tariq
        e.g. quicker fallback if ptr_ring is empty.
      
      V5: Fixed SPDX license as pointed out by Alexei
      
      V6: Adjustments requested by Eric Dumazet
       - Adjust ____cacheline_aligned_in_smp usage/placement
       - Move rcu_head in struct page_pool
       - Free pages quicker on destroy, minimize resources delayed an RCU period
       - Remove code for forward/backward compat ABI interface
      
      V8: Issues found by kbuild test robot
       - Address sparse should be static warnings
       - Only compile+link when a driver use/select page_pool,
         mlx5 selects CONFIG_PAGE_POOL, although its first used in two patches
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ff7d6b27
  21. 09 1月, 2018 1 次提交
  22. 03 1月, 2018 1 次提交
  23. 28 11月, 2017 1 次提交
  24. 04 9月, 2017 1 次提交
  25. 30 8月, 2017 1 次提交
    • J
      nsh: add GSO support · c411ed85
      Jiri Benc 提交于
      Add a new nsh/ directory. It currently holds only GSO functions but more
      will come: in particular, code shared by openvswitch and tc to manipulate
      NSH headers.
      
      For now, assume there's no hardware support for NSH segmentation. We can
      always introduce netdev->nsh_features later.
      Signed-off-by: NJiri Benc <jbenc@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c411ed85
  26. 29 8月, 2017 2 次提交
  27. 16 6月, 2017 1 次提交
  28. 18 2月, 2017 1 次提交
    • D
      bpf: make jited programs visible in traces · 74451e66
      Daniel Borkmann 提交于
      Long standing issue with JITed programs is that stack traces from
      function tracing check whether a given address is kernel code
      through {__,}kernel_text_address(), which checks for code in core
      kernel, modules and dynamically allocated ftrace trampolines. But
      what is still missing is BPF JITed programs (interpreted programs
      are not an issue as __bpf_prog_run() will be attributed to them),
      thus when a stack trace is triggered, the code walking the stack
      won't see any of the JITed ones. The same for address correlation
      done from user space via reading /proc/kallsyms. This is read by
      tools like perf, but the latter is also useful for permanent live
      tracing with eBPF itself in combination with stack maps when other
      eBPF types are part of the callchain. See offwaketime example on
      dumping stack from a map.
      
      This work tries to tackle that issue by making the addresses and
      symbols known to the kernel. The lookup from *kernel_text_address()
      is implemented through a latched RB tree that can be read under
      RCU in fast-path that is also shared for symbol/size/offset lookup
      for a specific given address in kallsyms. The slow-path iteration
      through all symbols in the seq file done via RCU list, which holds
      a tiny fraction of all exported ksyms, usually below 0.1 percent.
      Function symbols are exported as bpf_prog_<tag>, in order to aide
      debugging and attribution. This facility is currently enabled for
      root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
      is active in any mode. The rationale behind this is that still a lot
      of systems ship with world read permissions on kallsyms thus addresses
      should not get suddenly exposed for them. If that situation gets
      much better in future, we always have the option to change the
      default on this. Likewise, unprivileged programs are not allowed
      to add entries there either, but that is less of a concern as most
      such programs types relevant in this context are for root-only anyway.
      If enabled, call graphs and stack traces will then show a correct
      attribution; one example is illustrated below, where the trace is
      now visible in tooling such as perf script --kallsyms=/proc/kallsyms
      and friends.
      
      Before:
      
        7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
               f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)
      
      After:
      
        7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
        [...]
        7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
               f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      74451e66
  29. 09 2月, 2017 1 次提交
  30. 04 2月, 2017 1 次提交
  31. 25 1月, 2017 1 次提交
    • Y
      net: Introduce psample, a new genetlink channel for packet sampling · 6ae0a628
      Yotam Gigi 提交于
      Add a general way for kernel modules to sample packets, without being tied
      to any specific subsystem. This netlink channel can be used by tc,
      iptables, etc. and allow to standardize packet sampling in the kernel.
      
      For every sampled packet, the psample module adds the following metadata
      fields:
      
      PSAMPLE_ATTR_IIFINDEX - the packets input ifindex, if applicable
      
      PSAMPLE_ATTR_OIFINDEX - the packet output ifindex, if applicable
      
      PSAMPLE_ATTR_ORIGSIZE - the packet's original size, in case it has been
         truncated during sampling
      
      PSAMPLE_ATTR_SAMPLE_GROUP - the packet's sample group, which is set by the
         user who initiated the sampling. This field allows the user to
         differentiate between several samplers working simultaneously and
         filter packets relevant to him
      
      PSAMPLE_ATTR_GROUP_SEQ - sequence counter of last sent packet. The
         sequence is kept for each group
      
      PSAMPLE_ATTR_SAMPLE_RATE - the sampling rate used for sampling the packets
      
      PSAMPLE_ATTR_DATA - the actual packet bits
      
      The sampled packets are sent to the PSAMPLE_NL_MCGRP_SAMPLE multicast
      group. In addition, add the GET_GROUPS netlink command which allows the
      user to see the current sample groups, their refcount and sequence number.
      This command currently supports only netlink dump mode.
      Signed-off-by: NYotam Gigi <yotamg@mellanox.com>
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Reviewed-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Reviewed-by: NSimon Horman <simon.horman@netronome.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6ae0a628
  32. 11 1月, 2017 1 次提交
    • A
      cgroup: move CONFIG_SOCK_CGROUP_DATA to init/Kconfig · 73b35147
      Arnd Bergmann 提交于
      We now 'select SOCK_CGROUP_DATA' but Kconfig complains that this is
      not right when CONFIG_NET is disabled and there is no socket interface:
      
      warning: (CGROUP_BPF) selects SOCK_CGROUP_DATA which has unmet direct dependencies (NET)
      
      I don't know what the correct solution for this is, but simply removing
      the dependency on NET from SOCK_CGROUP_DATA by moving it out of the
      'if NET' section avoids the warning and does not produce other build
      errors.
      
      Fixes: 483c4933 ("cgroup: Fix CGROUP_BPF config")
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      73b35147
  33. 10 1月, 2017 1 次提交
  34. 02 12月, 2016 1 次提交
    • T
      bpf: BPF for lightweight tunnel infrastructure · 3a0af8fd
      Thomas Graf 提交于
      Registers new BPF program types which correspond to the LWT hooks:
        - BPF_PROG_TYPE_LWT_IN   => dst_input()
        - BPF_PROG_TYPE_LWT_OUT  => dst_output()
        - BPF_PROG_TYPE_LWT_XMIT => lwtunnel_xmit()
      
      The separate program types are required to differentiate between the
      capabilities each LWT hook allows:
      
       * Programs attached to dst_input() or dst_output() are restricted and
         may only read the data of an skb. This prevent modification and
         possible invalidation of already validated packet headers on receive
         and the construction of illegal headers while the IP headers are
         still being assembled.
      
       * Programs attached to lwtunnel_xmit() are allowed to modify packet
         content as well as prepending an L2 header via a newly introduced
         helper bpf_skb_change_head(). This is safe as lwtunnel_xmit() is
         invoked after the IP header has been assembled completely.
      
      All BPF programs receive an skb with L3 headers attached and may return
      one of the following error codes:
      
       BPF_OK - Continue routing as per nexthop
       BPF_DROP - Drop skb and return EPERM
       BPF_REDIRECT - Redirect skb to device as per redirect() helper.
                      (Only valid in lwtunnel_xmit() context)
      
      The return codes are binary compatible with their TC_ACT_
      relatives to ease compatibility.
      Signed-off-by: NThomas Graf <tgraf@suug.ch>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3a0af8fd
  35. 18 8月, 2016 1 次提交
    • T
      strparser: Stream parser for messages · 43a0c675
      Tom Herbert 提交于
      This patch introduces a utility for parsing application layer protocol
      messages in a TCP stream. This is a generalization of the mechanism
      implemented of Kernel Connection Multiplexor.
      
      The API includes a context structure, a set of callbacks, utility
      functions, and a data ready function.
      
      A stream parser instance is defined by a strparse structure that
      is bound to a TCP socket. The function to initialize the structure
      is:
      
      int strp_init(struct strparser *strp, struct sock *csk,
                    struct strp_callbacks *cb);
      
      csk is the TCP socket being bound to and cb are the parser callbacks.
      
      The upper layer calls strp_tcp_data_ready when data is ready on the lower
      socket for strparser to process. This should be called from a data_ready
      callback that is set on the socket:
      
      void strp_tcp_data_ready(struct strparser *strp);
      
      A parser is bound to a TCP socket by setting data_ready function to
      strp_tcp_data_ready so that all receive indications on the socket
      go through the parser. This is assumes that sk_user_data is set to
      the strparser structure.
      
      There are four callbacks.
       - parse_msg is called to parse the message (returns length or error).
       - rcv_msg is called when a complete message has been received
       - read_sock_done is called when data_ready function exits
       - abort_parser is called to abort the parser
      
      The input to parse_msg is an skbuff which contains next message under
      construction. The backend processing of parse_msg will parse the
      application layer protocol headers to determine the length of
      the message in the stream. The possible return values are:
      
         >0 : indicates length of successfully parsed message
         0  : indicates more data must be received to parse the message
         -ESTRPIPE : current message should not be processed by the
            kernel, return control of the socket to userspace which
            can proceed to read the messages itself
         other < 0 : Error is parsing, give control back to userspace
            assuming that synchronzation is lost and the stream
            is unrecoverable (application expected to close TCP socket)
      
      In the case of error return (< 0) strparse will stop the parser
      and report and error to userspace. The application must deal
      with the error. To handle the error the strparser is unbound
      from the TCP socket. If the error indicates that the stream
      TCP socket is at recoverable point (ESTRPIPE) then the application
      can read the TCP socket to process the stream. Once the application
      has dealt with the exceptions in the stream, it may again bind the
      socket to a strparser to continue data operations.
      
      Note that ENODATA may be returned to the application. In this case
      parse_msg returned -ESTRPIPE, however strparser was unable to maintain
      synchronization of the stream (i.e. some of the message in question
      was already read by the parser).
      
      strp_pause and strp_unpause are used to provide flow control. For
      instance, if rcv_msg is called but the upper layer can't immediately
      consume the message it can hold the message and pause strparser.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      43a0c675
  36. 20 7月, 2016 1 次提交
    • G
      net/ncsi: Resource management · 2d283bdd
      Gavin Shan 提交于
      NCSI spec (DSP0222) defines several objects: package, channel, mode,
      filter, version and statistics etc. This introduces the data structs
      to represent those objects and implement functions to manage them.
      Also, this introduces CONFIG_NET_NCSI for the newly implemented NCSI
      stack.
      
         * The user (e.g. netdev driver) dereference NCSI device by
           "struct ncsi_dev", which is embedded to "struct ncsi_dev_priv".
           The later one is used by NCSI stack internally.
         * Every NCSI device can have multiple packages simultaneously, up
           to 8 packages. It's represented by "struct ncsi_package" and
           identified by 3-bits ID.
         * Every NCSI package can have multiple channels, up to 32. It's
           represented by "struct ncsi_channel" and identified by 5-bits ID.
         * Every NCSI channel has version, statistics, various modes and
           filters. They are represented by "struct ncsi_channel_version",
           "struct ncsi_channel_stats", "struct ncsi_channel_mode" and
           "struct ncsi_channel_filter" separately.
         * Apart from AEN (Asynchronous Event Notification), the NCSI stack
           works in terms of command and response. This introduces "struct
           ncsi_req" to represent a complete NCSI transaction made of NCSI
           request and response.
      
      link: https://www.dmtf.org/sites/default/files/standards/documents/DSP0222_1.1.0.pdfSigned-off-by: NGavin Shan <gwshan@linux.vnet.ibm.com>
      Acked-by: NJoel Stanley <joel@jms.id.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2d283bdd
  37. 17 5月, 2016 2 次提交
    • D
      bpf: add generic constant blinding for use in jits · 4f3446bb
      Daniel Borkmann 提交于
      This work adds a generic facility for use from eBPF JIT compilers
      that allows for further hardening of JIT generated images through
      blinding constants. In response to the original work on BPF JIT
      spraying published by Keegan McAllister [1], most BPF JITs were
      changed to make images read-only and start at a randomized offset
      in the page, where the rest was filled with trap instructions. We
      have this nowadays in x86, arm, arm64 and s390 JIT compilers.
      Additionally, later work also made eBPF interpreter images read
      only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
      arm, arm64 and s390 archs as well currently. This is done by
      default for mentioned JITs when JITing is enabled. Furthermore,
      we had a generic and configurable constant blinding facility on our
      todo for quite some time now to further make spraying harder, and
      first implementation since around netconf 2016.
      
      We found that for systems where untrusted users can load cBPF/eBPF
      code where JIT is enabled, start offset randomization helps a bit
      to make jumps into crafted payload harder, but in case where larger
      programs that cross page boundary are injected, we again have some
      part of the program opcodes at a page start offset. With improved
      guessing and more reliable payload injection, chances can increase
      to jump into such payload. Elena Reshetova recently wrote a test
      case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
      can leave some more room for payloads. Note that for all this,
      additional bugs in the kernel are still required to make the jump
      (and of course to guess right, to not jump into a trap) and naturally
      the JIT must be enabled, which is disabled by default.
      
      For helping mitigation, the general idea is to provide an option
      bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
      that for cases where JIT should be enabled for performance reasons,
      the generated image can be further hardened with blinding constants
      for unpriviledged users (bpf_jit_harden == 1), with trading off
      performance for these, but not for privileged ones. We also added
      the option of blinding for all users (bpf_jit_harden == 2), which
      is quite helpful for testing f.e. with test_bpf.ko. There are no
      further e.g. hardening levels of bpf_jit_harden switch intended,
      rationale is to have it dead simple to use as on/off. Since this
      functionality would need to be duplicated over and over for JIT
      compilers to use, which are already complex enough, we provide a
      generic eBPF byte-code level based blinding implementation, which is
      then just transparently JITed. JIT compilers need to make only a few
      changes to integrate this facility and can be migrated one by one.
      
      This option is for eBPF JITs and will be used in x86, arm64, s390
      without too much effort, and soon ppc64 JITs, thus that native eBPF
      can be blinded as well as cBPF to eBPF migrations, so that both can
      be covered with a single implementation. The rule for JITs is that
      bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
      and in case blinding is disabled, we follow normally with JITing the
      passed program. In case blinding is enabled and we fail during the
      process of blinding itself, we must return with the interpreter.
      Similarly, in case the JITing process after the blinding failed, we
      return normally to the interpreter with the non-blinded code. Meaning,
      interpreter doesn't change in any way and operates on eBPF code as
      usual. For doing this pre-JIT blinding step, we need to make use of
      a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
      to the JIT and not in any way part of the eBPF architecture. Just like
      in the same way as JITs internally make use of some helper registers
      when emitting code, only that here the helper register is one
      abstraction level higher in eBPF bytecode, but nevertheless in JIT
      phase. That helper register is needed since f.e. manually written
      program can issue loads to all registers of eBPF architecture.
      
      The core concept with the additional register is: blind out all 32
      and 64 bit constants by converting BPF_K based instructions into a
      small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
      is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
      and REG <OP> BPF_REG_AX, so actual operation on the target register
      is translated from BPF_K into BPF_X one that is operating on
      BPF_REG_AX's content. During rewriting phase when blinding, RND is
      newly generated via prandom_u32() for each processed instruction.
      64 bit loads are split into two 32 bit loads to make translation and
      patching not too complex. Only basic thing required by JITs is to
      call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
      pair, and to map BPF_REG_AX into an unused register.
      
      Small bpf_jit_disasm extract from [2] when applied to x86 JIT:
      
      echo 0 > /proc/sys/net/core/bpf_jit_harden
      
        ffffffffa034f5e9 + <x>:
        [...]
        39:   mov    $0xa8909090,%eax
        3e:   mov    $0xa8909090,%eax
        43:   mov    $0xa8ff3148,%eax
        48:   mov    $0xa89081b4,%eax
        4d:   mov    $0xa8900bb0,%eax
        52:   mov    $0xa810e0c1,%eax
        57:   mov    $0xa8908eb4,%eax
        5c:   mov    $0xa89020b0,%eax
        [...]
      
      echo 1 > /proc/sys/net/core/bpf_jit_harden
      
        ffffffffa034f1e5 + <x>:
        [...]
        39:   mov    $0xe1192563,%r10d
        3f:   xor    $0x4989b5f3,%r10d
        46:   mov    %r10d,%eax
        49:   mov    $0xb8296d93,%r10d
        4f:   xor    $0x10b9fd03,%r10d
        56:   mov    %r10d,%eax
        59:   mov    $0x8c381146,%r10d
        5f:   xor    $0x24c7200e,%r10d
        66:   mov    %r10d,%eax
        69:   mov    $0xeb2a830e,%r10d
        6f:   xor    $0x43ba02ba,%r10d
        76:   mov    %r10d,%eax
        79:   mov    $0xd9730af,%r10d
        7f:   xor    $0xa5073b1f,%r10d
        86:   mov    %r10d,%eax
        89:   mov    $0x9a45662b,%r10d
        8f:   xor    $0x325586ea,%r10d
        96:   mov    %r10d,%eax
        [...]
      
      As can be seen, original constants that carry payload are hidden
      when enabled, actual operations are transformed from constant-based
      to register-based ones, making jumps into constants ineffective.
      Above extract/example uses single BPF load instruction over and
      over, but of course all instructions with constants are blinded.
      
      Performance wise, JIT with blinding performs a bit slower than just
      JIT and faster than interpreter case. This is expected, since we
      still get all the performance benefits from JITing and in normal
      use-cases not every single instruction needs to be blinded. Summing
      up all 296 test cases averaged over multiple runs from test_bpf.ko
      suite, interpreter was 55% slower than JIT only and JIT with blinding
      was 8% slower than JIT only. Since there are also some extremes in
      the test suite, I expect for ordinary workloads that the performance
      for the JIT with blinding case is even closer to JIT only case,
      f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
      35 (+ blinding), and 151 (interpreter).
      
      BPF test suite, seccomp test suite, eBPF sample code and various
      bigger networking eBPF programs have been tested with this and were
      running fine. For testing purposes, I also adapted interpreter and
      redirected blinded eBPF image to interpreter and also here all tests
      pass.
      
        [1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
        [2] https://github.com/01org/jit-spray-poc-for-ksp/
        [3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: NElena Reshetova <elena.reshetova@intel.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4f3446bb
    • D
      bpf: split HAVE_BPF_JIT into cBPF and eBPF variant · 6077776b
      Daniel Borkmann 提交于
      Split the HAVE_BPF_JIT into two for distinguishing cBPF and eBPF JITs.
      
      Current cBPF ones:
      
        # git grep -n HAVE_CBPF_JIT arch/
        arch/arm/Kconfig:44:    select HAVE_CBPF_JIT
        arch/mips/Kconfig:18:   select HAVE_CBPF_JIT if !CPU_MICROMIPS
        arch/powerpc/Kconfig:129:       select HAVE_CBPF_JIT
        arch/sparc/Kconfig:35:  select HAVE_CBPF_JIT
      
      Current eBPF ones:
      
        # git grep -n HAVE_EBPF_JIT arch/
        arch/arm64/Kconfig:61:  select HAVE_EBPF_JIT
        arch/s390/Kconfig:126:  select HAVE_EBPF_JIT if PACK_STACK && HAVE_MARCH_Z196_FEATURES
        arch/x86/Kconfig:94:    select HAVE_EBPF_JIT                    if X86_64
      
      Later code also needs this facility to check for eBPF JITs.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6077776b