1. 09 Jan, 2018 10 commits
    • netfilter: move checksum_partial indirection to struct nf_ipv6_ops · f7dcbe2f
      Committed by Pablo Neira Ayuso
      We cannot make a direct call to nf_ip6_checksum_partial() because that
      would result in autoloading the 'ipv6' module because of symbol
      dependencies.  Therefore, define the checksum_partial indirection in
      nf_ipv6_ops, where it really belongs.

      For IPv4, we can indeed make a direct function call, which is faster,
      given that IPv4 is built into the networking code by default. Still,
      CONFIG_INET=n with CONFIG_NETFILTER=y is possible, so define an empty
      inline stub for IPv4 in that case.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      f7dcbe2f
    • netfilter: move checksum indirection to struct nf_ipv6_ops · ef71fe27
      Committed by Pablo Neira Ayuso
      We cannot make a direct call to nf_ip6_checksum() because that would
      result in autoloading the 'ipv6' module because of symbol dependencies.
      Therefore, define the checksum indirection in nf_ipv6_ops, where it
      really belongs.

      For IPv4, we can indeed make a direct function call, which is faster,
      given that IPv4 is built into the networking code by default. Still,
      CONFIG_INET=n with CONFIG_NETFILTER=y is possible, so define an empty
      inline stub for IPv4 in that case.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      ef71fe27
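The indirection described in the two checksum commits above can be pictured with a small user-space sketch. Everything here is hypothetical (`ipv6_ops_sketch`, `core_ip6_checksum`, the placeholder checksum); it only illustrates the pattern of calling through an ops pointer that stays NULL until the optional module registers itself, so the core never needs a link-time symbol from that module.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an ops table such as struct nf_ipv6_ops. */
struct ipv6_ops_sketch {
    uint16_t (*checksum)(const uint8_t *data, size_t len);
};

/* NULL until the (hypothetical) IPv6 module registers its ops. */
static const struct ipv6_ops_sketch *ipv6_ops;

/* Placeholder 16-bit folded sum; not the real IPv6 checksum. */
static uint16_t demo_ip6_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < len; i++)
        sum += data[i];
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

static const struct ipv6_ops_sketch demo_ops = {
    .checksum = demo_ip6_checksum,
};

/* What module load/unload would do in this sketch. */
static void ipv6_module_load(void)   { ipv6_ops = &demo_ops; }
static void ipv6_module_unload(void) { ipv6_ops = NULL; }

/*
 * Core-side wrapper: it depends only on the pointer, never on a symbol
 * exported by the module. Returns -1 when no ops are registered.
 */
static int core_ip6_checksum(const uint8_t *data, size_t len, uint16_t *out)
{
    const struct ipv6_ops_sketch *ops = ipv6_ops;

    if (!ops || !ops->checksum)
        return -1;
    *out = ops->checksum(data, len);
    return 0;
}
```

Calling `core_ip6_checksum()` before `ipv6_module_load()` fails cleanly instead of forcing the module to load, which is the behavior the commit messages argue for.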
    • netfilter: core: only allow one nat hook per hook point · f92b40a8
      Committed by Florian Westphal
      The netfilter NAT core cannot deal with more than one NAT hook per hook
      location (prerouting, input ...), because the NAT hooks install a NAT null
      binding in case the iptables nat table (iptable_nat hooks) or the
      corresponding nftables chain (nft nat hooks) doesn't specify a nat
      transformation.
      
      Null bindings are needed to detect port collisions between NAT-ed and
      non-NAT-ed connections.

      As a result, nftables NAT rules do not work when the iptable_nat module
      is loaded, and vice versa, because a NAT binding has already been
      attached by the time the second NAT hook is consulted.

      The netfilter core is not really the correct location to handle this
      (hooks are just hooks, the core has no notion of what kinds of side
      effects a hook implements), but it's the only place where we can check
      for conflicts between both iptables hooks and nftables hooks without
      adding dependencies.

      So add a nat annotation to hook_ops to describe those hooks that will
      add NAT bindings, and then make the core reject a new one if such a hook
      already exists. The annotation fills a padding hole; in case further
      restrictions appear, we might change this to a 'u8 type' instead of bool.
      
      iptables error if nft nat hook active:
      iptables -t nat -A POSTROUTING -j MASQUERADE
      iptables v1.4.21: can't initialize iptables table `nat': File exists
      Perhaps iptables or your kernel needs to be upgraded.
      
      nftables error if iptables nat table present:
      nft -f /etc/nftables/ipv4-nat
      /usr/etc/nftables/ipv4-nat:3:1-2: Error: Could not process rule: File exists
      table nat {
      ^^
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      f92b40a8
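A minimal user-space sketch of the rule this commit describes: a registration routine that refuses a second NAT-annotated hook at the same hook point. The types, array sizes, and function name are hypothetical; returning `-EEXIST` is an assumption chosen to match the "File exists" messages quoted in the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define SKETCH_NUMHOOKS     5  /* prerouting, input, forward, output, postrouting */
#define SKETCH_MAX_PER_HOOK 8  /* arbitrary limit for this sketch */

/* Hypothetical, reduced version of a hook_ops entry. */
struct hook_ops_sketch {
    unsigned int hooknum;
    bool nat_hook;  /* annotation: this hook installs NAT bindings */
};

static struct hook_ops_sketch registered[SKETCH_NUMHOOKS][SKETCH_MAX_PER_HOOK];
static unsigned int nr_registered[SKETCH_NUMHOOKS];

/* Register a hook; reject a second NAT hook at the same hook point. */
static int register_hook_sketch(const struct hook_ops_sketch *ops)
{
    unsigned int i, n;

    if (ops->hooknum >= SKETCH_NUMHOOKS)
        return -EINVAL;

    n = nr_registered[ops->hooknum];
    if (n >= SKETCH_MAX_PER_HOOK)
        return -ENOSPC;

    if (ops->nat_hook) {
        for (i = 0; i < n; i++)
            if (registered[ops->hooknum][i].nat_hook)
                return -EEXIST;  /* shows up as "File exists" in userspace */
    }

    registered[ops->hooknum][n] = *ops;
    nr_registered[ops->hooknum] = n + 1;
    return 0;
}
```

Non-NAT hooks can still stack freely at the same hook point; only the combination of two NAT-annotated hooks is refused.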
    • netfilter: xtables: add and use xt_request_find_table_lock · 03d13b68
      Committed by Florian Westphal
      Currently we always return -ENOENT to userspace if we can't find
      a particular table, or if the table initialization fails.

      A follow-up patch will make nat table init fail in case nftables already
      registered a nat hook, so this change makes xt_find_table_lock return
      an ERR_PTR to pass on the errno value reported by the table init
      function.

      Add xt_request_find_table_lock as a try_then_request_module replacement
      and use it where needed.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      03d13b68
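The ERR_PTR convention mentioned above encodes a negative errno in the pointer value itself, so a lookup can report why it failed instead of collapsing every failure into one code. The sketch below re-creates the idea in user space; the `sketch_*` helpers, the table type, and `nat_init_result` are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_ERRNO 4095

/* User-space imitations of the ERR_PTR / PTR_ERR / IS_ERR idea. */
static inline void *sketch_err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

static inline long sketch_ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

static inline int sketch_is_err(const void *ptr)
{
    /* The top SKETCH_MAX_ERRNO addresses are reserved for error codes. */
    return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

struct table_sketch {
    const char *name;
};

static struct table_sketch nat_table = { "nat" };

/* Hypothetical: the result of the nat table's init function. */
static int nat_init_result;

/* Lookup that reports the real reason for a failure. */
static struct table_sketch *find_table_sketch(const char *name)
{
    if (strcmp(name, "nat") != 0)
        return sketch_err_ptr(-ENOENT);          /* no such table */
    if (nat_init_result < 0)
        return sketch_err_ptr(nat_init_result);  /* init failed: pass it on */
    return &nat_table;
}
```

With this shape, a caller can distinguish "table not found" from "table init failed with -EEXIST" and hand the precise errno back to userspace.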
    • netfilter: reduce NF_MAX_HOOKS define · 256d94ba
      Committed by Florian Westphal
      This can be the same as NF_INET_NUMHOOKS if we don't support DECNET.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      256d94ba
    • netfilter: don't allocate space for arp/bridge hooks unless needed · 2a95183a
      Committed by Florian Westphal
      There is no need to define hook points if the family isn't supported.
      Because we need these hooks for either nftables, arp/ebtables,
      or the 'call-iptables' hack we have in the bridge layer, add two
      new dependencies, NETFILTER_FAMILY_{ARP,BRIDGE}, and have the
      users select them.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      2a95183a
    • netfilter: don't allocate space for decnet hooks unless needed · bb4badf3
      Committed by Florian Westphal
      There is no need to define hook points if the family isn't supported.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      bb4badf3
    • netfilter: add defines for arp/decnet max hooks · e58f33cc
      Committed by Florian Westphal
      The kernel already has defines for this, but they are in uapi exposed
      headers.
      
      Including these from netns.h causes build errors and also adds unneeded
      dependencies on headers that we don't need.
      
      So move these defines to netfilter_defs.h and place the uapi ones
      in ifndef __KERNEL__ to keep them for userspace.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      e58f33cc
    • netfilter: reduce size of hook entry point locations · b0f38338
      Committed by Florian Westphal
      struct net contains:
      
      struct nf_hook_entries __rcu *hooks[NFPROTO_NUMPROTO][NF_MAX_HOOKS];
      
      which stores the hook entry point locations for the various protocol
      families and the hooks.

      Using an array results in compact C code when doing accesses, i.e.
        x = rcu_dereference(net->nf.hooks[pf][hook]);

      but it also wastes a lot of memory, as most families are
      not used.

      So split the array into those families that are used, which
      are only 5 (instead of 13).  In most cases, the 'pf' argument is
      constant, i.e. gcc removes the switch statement.
      
      struct net before:
       /* size: 5184, cachelines: 81, members: 46 */
      after:
       /* size: 4672, cachelines: 73, members: 46 */
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      b0f38338
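The split described in this commit can be sketched as one small array per supported family plus a switch that selects the right one. All names and array sizes below are illustrative assumptions, not the kernel's definitions; the point is that a compile-time constant `pf` lets the compiler fold the switch away, while unused families cost no storage.

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical family ids for this sketch. */
enum {
    PF_SK_IPV4,
    PF_SK_IPV6,
    PF_SK_ARP,
    PF_SK_BRIDGE,
    PF_SK_DECNET,
    PF_SK_OTHER,  /* any family without hooks */
};

/* One array per family that is actually used; sizes are illustrative. */
struct nf_sketch {
    void *hooks_ipv4[5];
    void *hooks_ipv6[5];
    void *hooks_arp[3];
    void *hooks_bridge[5];
    void *hooks_decnet[7];
};

/* Return the slot for (pf, hook), or NULL if the pair is not supported. */
static void **hook_slot(struct nf_sketch *nf, int pf, unsigned int hook)
{
    switch (pf) {
    case PF_SK_IPV4:
        return hook < ARRAY_LEN(nf->hooks_ipv4) ? &nf->hooks_ipv4[hook] : NULL;
    case PF_SK_IPV6:
        return hook < ARRAY_LEN(nf->hooks_ipv6) ? &nf->hooks_ipv6[hook] : NULL;
    case PF_SK_ARP:
        return hook < ARRAY_LEN(nf->hooks_arp) ? &nf->hooks_arp[hook] : NULL;
    case PF_SK_BRIDGE:
        return hook < ARRAY_LEN(nf->hooks_bridge) ? &nf->hooks_bridge[hook] : NULL;
    case PF_SK_DECNET:
        return hook < ARRAY_LEN(nf->hooks_decnet) ? &nf->hooks_decnet[hook] : NULL;
    default:
        return NULL;
    }
}
```

Compared with a dense `[family][hook]` matrix, the lookup is slightly more verbose, but storage is only paid for the families listed in the struct.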
    • netfilter: core: free hooks with call_rcu · 8c873e21
      Committed by Florian Westphal
      Giuseppe Scrivano says:
        "SELinux, if enabled, registers for each new network namespace 6
          netfilter hooks."
      
      Cost for this is high.  With synchronize_net() removed:
         "The net benefit on an SMP machine with two cores is that creating a
         new network namespace takes -40% of the original time."
      
      This patch replaces synchronize_net+kvfree with call_rcu().
      We store rcu_head at the tail of a structure that has no fixed layout,
      i.e. we cannot use offsetof() to compute the start of the original
      allocation.  Thus store this information right after the rcu head.
      
      We could simplify this by just placing the rcu_head at the start
      of struct nf_hook_entries.  However, this structure is used in the
      packet-processing hot path, so only place what is needed for that
      at the beginning of the struct.
      Reported-by: Giuseppe Scrivano <gscrivan@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      8c873e21
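The layout trick in this commit, a callback head at the tail of a variable-length allocation with the allocation's start stored right after it, can be sketched in user space as follows. The structures are simplified and hypothetical, and the callback is invoked directly where a kernel would pass the head to call_rcu().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct entry_sketch {
    void (*hook)(void);
    void *priv;
};

/* Hot-path data first; the tail follows hooks[] in memory. */
struct entries_sketch {
    uint16_t num;
    struct entry_sketch hooks[];
};

/* Stand-in for struct rcu_head in this sketch. */
struct head_sketch {
    void (*func)(struct head_sketch *head);
};

/* The tail: callback head plus a pointer back to the allocation start. */
struct tail_sketch {
    struct head_sketch head;
    void *allocation;
};

/* The tail's position depends on num, so offsetof() alone cannot find it. */
static size_t tail_offset(uint16_t num)
{
    return offsetof(struct entries_sketch, hooks) +
           (size_t)num * sizeof(struct entry_sketch);
}

static struct tail_sketch *entries_tail(struct entries_sketch *e)
{
    return (struct tail_sketch *)((char *)e + tail_offset(e->num));
}

static struct entries_sketch *entries_alloc(uint16_t num)
{
    struct entries_sketch *e;

    e = calloc(1, tail_offset(num) + sizeof(struct tail_sketch));
    if (!e)
        return NULL;
    e->num = num;
    entries_tail(e)->allocation = e;  /* remembered for the free callback */
    return e;
}

static int freed_count;

/* The callback only receives the head; the tail tells it what to free. */
static void entries_free_cb(struct head_sketch *head)
{
    struct tail_sketch *tail = (struct tail_sketch *)head; /* head is first member */

    free(tail->allocation);
    freed_count++;
}

static void entries_free_deferred(struct entries_sketch *e)
{
    struct tail_sketch *tail = entries_tail(e);

    tail->head.func = entries_free_cb;
    /* A kernel would queue the head with call_rcu(); run it directly here. */
    tail->head.func(&tail->head);
}
```

Keeping the head out of the front of the struct leaves the first bytes for the data the packet path reads, at the cost of one extra stored pointer.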
  2. 06 Jan, 2018 2 commits
    • xdp: generic XDP handling of xdp_rxq_info · e817f856
      Committed by Jesper Dangaard Brouer
      Hook points for xdp_rxq_info:
       * reg  : netif_alloc_rx_queues
       * unreg: netif_free_rx_queues
      
      The net_device has some members (num_rx_queues + real_num_rx_queues)
      and a data area (dev->_rx with struct netdev_rx_queue's) that were
      primarily used for exporting information about RPS (CONFIG_RPS) queues
      to sysfs (CONFIG_SYSFS).

      For generic XDP, extend struct netdev_rx_queue with the xdp_rxq_info,
      and remove some of the CONFIG_SYSFS ifdefs.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      e817f856
    • xdp: base API for new XDP rx-queue info concept · aecd67b6
      Committed by Jesper Dangaard Brouer
      This patch only introduces the core data structures and API functions.
      All XDP-enabled drivers must use the API before this info can be used.

      There is a need for XDP to know more about the RX queue a given XDP
      frame has arrived on, both for the XDP bpf-prog and on the kernel side.

      Instead of extending xdp_buff each time new info is needed, the patch
      creates a separate read-mostly struct xdp_rxq_info that contains this
      info.  We stress this data/cache-line is for read-only info.  This is
      NOT for dynamic per-packet info; use data_meta for such use cases.

      The performance advantage is that this info can be set up at RX-ring
      init time, instead of updating N members in xdp_buff.  A possible (driver
      level) micro-optimization is that the xdp_buff->rxq assignment could be
      done once per XDP/NAPI loop.  The extra pointer deref only happens for
      programs needing access to this info (thus, no slowdown for existing
      use cases).
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      aecd67b6
  3. 05 Jan, 2018 2 commits
  4. 04 Jan, 2018 4 commits
  5. 03 Jan, 2018 8 commits
  6. 31 Dec, 2017 5 commits
  7. 29 Dec, 2017 1 commit
  8. 28 Dec, 2017 2 commits
  9. 27 Dec, 2017 1 commit
    • rtnetlink: Replace implementation of ASSERT_RTNL() macro with WARN_ONCE() · 66364bdf
      Committed by Leon Romanovsky
      The ASSERT_RTNL() macro is actually an open-coded variant of WARN_ONCE()
      with two exceptions. First, it prints the stack on every hit, not only
      once as WARN_ONCE() does. Second, the user can disable WARN_ONCE()
      prints by setting CONFIG_BUG to N.

      Printing the stack dump multiple times is not actually needed, because
      calls without the rtnl lock are programming errors and the user can't do
      anything about them except complain to the mailing list after the first
      occurrence of such a failure.

      A user who disabled BUG/WARN prints did so explicitly, because this
      option is enabled by default in the upstream kernel and in distributions.
      It means that the user doesn't want to see prints about missing locks
      either.

      This patch replaces the open-coded variant with the already existing
      macro and changes the error prints to be once only.
      Reviewed-by: Mark Bloch <markb@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      66364bdf
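The "warn once" behavior this commit switches to can be imitated in user space with a per-call-site static flag. The macro names, the lock flag, and the counter below are hypothetical; this is only a sketch of the idea, not the kernel's WARN_ONCE() implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

static int warnings_printed;     /* counts emitted warnings, for the sketch */
static bool rtnl_locked_sketch;  /* hypothetical stand-in for the lock state */

/* Print the message the first time cond is true at this call site only. */
#define WARN_ONCE_SKETCH(cond, msg)                 \
    do {                                            \
        static bool warned_;                        \
        if ((cond) && !warned_) {                   \
            warned_ = true;                         \
            warnings_printed++;                     \
            fprintf(stderr, "%s", (msg));           \
        }                                           \
    } while (0)

/* Lock assertion built on the warn-once primitive. */
#define ASSERT_LOCKED_SKETCH() \
    WARN_ONCE_SKETCH(!rtnl_locked_sketch, \
                     "lock assertion failed at " __FILE__ "\n")

/* A function that is supposed to run with the lock held. */
static void needs_lock(void)
{
    ASSERT_LOCKED_SKETCH();
}
```

Repeated unlocked calls to `needs_lock()` produce a single report, which matches the argument in the commit message that one stack dump is enough to flag a programming error.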
  10. 22 Dec, 2017 2 commits
    • IB/mlx5: Fix congestion counters in LAG mode · 71a0ff65
      Committed by Majd Dibbiny
      Congestion counters are counted and queried per physical function.
      When working in LAG mode, CNP packets can be sent or received on both
      of the functions, thus congestion counters should be aggregated from
      the two physical functions.
      
      Fixes: e1f24a79 ("IB/mlx5: Support congestion related counters")
      Signed-off-by: Majd Dibbiny <majd@mellanox.com>
      Reviewed-by: Aviv Heller <avivh@mellanox.com>
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      71a0ff65
    • net: reevalulate autoflowlabel setting after sysctl setting · 513674b5
      Committed by Shaohua Li
      sysctl.ipv6.auto_flowlabels is 1 by default. On our hosts, we set it to 2.
      If a sockopt doesn't set autoflowlabel, outgoing packets from the hosts
      are supposed to not include a flowlabel. This is true for normal packets,
      but not for reset packets.

      The reason is that ipv6_pinfo.autoflowlabel is set at sock creation. If
      we later change sysctl.ipv6.auto_flowlabels, ipv6_pinfo.autoflowlabel
      isn't changed, so the sock keeps the old behavior in terms of auto
      flowlabel. Reset packets suffer from this problem because they are
      sent from a special control socket, which is created at boot
      time. Since sysctl.ipv6.auto_flowlabels is 1 by default, the control
      socket will always have its ipv6_pinfo.autoflowlabel set, even after the
      user sets sysctl.ipv6.auto_flowlabels to 2, so reset packets will always
      have a flowlabel. A normal sock created before the sysctl setting suffers
      from the same issue. We can't even turn off autoflowlabel unless we kill
      all socks on the hosts.

      To fix this, if the IPV6_AUTOFLOWLABEL sockopt is used, we use the
      autoflowlabel setting from the user; otherwise we always call
      ip6_default_np_autolabel(), which has the new settings of the sysctl.

      Note, this changes behavior a little bit. Before commit 42240901
      (ipv6: Implement different admin modes for automatic flow labels), the
      autoflowlabel behavior of a sock wasn't sticky, e.g., if the sysctl
      changed, existing connections would change autoflowlabel behavior. After
      that commit, autoflowlabel behavior is sticky for the whole life of the
      sock. With this patch, the behavior is no longer sticky.
      
      Cc: Martin KaFai Lau <kafai@fb.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Tom Herbert <tom@quantonium.net>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      513674b5
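The decision rule in this fix (use the per-socket value only when the socket option was set explicitly, otherwise consult the current sysctl at send time) can be sketched as below. The struct, the global, and the function names are hypothetical; the mapping of sysctl modes to a default (1 and 3 enable, 0 and 2 disable) is my reading of the auto_flowlabels modes and should be treated as an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-socket state. */
struct np_sketch {
    bool autoflowlabel;      /* value chosen by the user, if any */
    bool autoflowlabel_set;  /* true only after an explicit sockopt */
};

/* Stand-in for the sysctl; 1 is the default according to the commit. */
static int sysctl_auto_flowlabels = 1;

/*
 * Default derived from the sysctl mode (assumed meanings):
 * 0 = off, 1 = on unless opted out, 2 = off unless opted in, 3 = forced on.
 */
static bool default_autolabel_sketch(void)
{
    return sysctl_auto_flowlabels == 1 || sysctl_auto_flowlabels == 3;
}

/* Evaluated at send time, so sysctl changes reach existing sockets. */
static bool autoflowlabel_sketch(const struct np_sketch *np)
{
    if (!np->autoflowlabel_set)
        return default_autolabel_sketch();
    return np->autoflowlabel;
}
```

A socket created at boot, such as the control socket used for resets, then follows a later sysctl change because nothing was cached at creation time.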
  11. 21 Dec, 2017 3 commits
    • xfrm: wrap xfrmdev_ops with offload config · 9cb0d21d
      Committed by Shannon Nelson
      There's no reason to define netdev->xfrmdev_ops if
      the offload facility is not CONFIG'd in.
      Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      9cb0d21d
    • bpf: allow for correlation of maps and helpers in dump · 7105e828
      Committed by Daniel Borkmann
      Currently a dump of an xlated prog (post verifier stage) doesn't
      correlate used helpers as well as maps. The prog info lists
      involved map ids, however there's no correlation of where in the
      program they are used as of today. Likewise, bpftool does not
      correlate helper calls with the target functions.
      
      The latter can be done w/o any kernel changes through kallsyms,
      and also has the advantage that this works with inlined helpers
      and BPF calls.
      
      Example, via interpreter:
      
        # tc filter show dev foo ingress
        filter protocol all pref 49152 bpf chain 0
        filter protocol all pref 49152 bpf chain 0 handle 0x1 foo.o:[ingress] \
                            direct-action not_in_hw id 1 tag c74773051b364165   <-- prog id:1
      
        * Output before patch (calls/maps remain unclear):
      
        # bpftool prog dump xlated id 1             <-- dump prog id:1
         0: (b7) r1 = 2
         1: (63) *(u32 *)(r10 -4) = r1
         2: (bf) r2 = r10
         3: (07) r2 += -4
         4: (18) r1 = 0xffff95c47a8d4800
         6: (85) call unknown#73040
         7: (15) if r0 == 0x0 goto pc+18
         8: (bf) r2 = r10
         9: (07) r2 += -4
        10: (bf) r1 = r0
        11: (85) call unknown#73040
        12: (15) if r0 == 0x0 goto pc+23
        [...]
      
        * Output after patch:
      
        # bpftool prog dump xlated id 1
         0: (b7) r1 = 2
         1: (63) *(u32 *)(r10 -4) = r1
         2: (bf) r2 = r10
         3: (07) r2 += -4
         4: (18) r1 = map[id:2]                     <-- map id:2
         6: (85) call bpf_map_lookup_elem#73424     <-- helper call
         7: (15) if r0 == 0x0 goto pc+18
         8: (bf) r2 = r10
         9: (07) r2 += -4
        10: (bf) r1 = r0
        11: (85) call bpf_map_lookup_elem#73424
        12: (15) if r0 == 0x0 goto pc+23
        [...]
      
        # bpftool map show id 2                     <-- show/dump/etc map id:2
        2: hash_of_maps  flags 0x0
              key 4B  value 4B  max_entries 3  memlock 4096B
      
      Example, JITed, same prog:
      
        # tc filter show dev foo ingress
        filter protocol all pref 49152 bpf chain 0
        filter protocol all pref 49152 bpf chain 0 handle 0x1 foo.o:[ingress] \
                        direct-action not_in_hw id 3 tag c74773051b364165 jited
      
        # bpftool prog show id 3
        3: sched_cls  tag c74773051b364165
              loaded_at Dec 19/13:48  uid 0
              xlated 384B  jited 257B  memlock 4096B  map_ids 2
      
        # bpftool prog dump xlated id 3
         0: (b7) r1 = 2
         1: (63) *(u32 *)(r10 -4) = r1
         2: (bf) r2 = r10
         3: (07) r2 += -4
         4: (18) r1 = map[id:2]                      <-- map id:2
         6: (85) call __htab_map_lookup_elem#77408   <-+ inlined rewrite
         7: (15) if r0 == 0x0 goto pc+2                |
         8: (07) r0 += 56                              |
         9: (79) r0 = *(u64 *)(r0 +0)                <-+
        10: (15) if r0 == 0x0 goto pc+24
        11: (bf) r2 = r10
        12: (07) r2 += -4
        [...]
      
      Example, same prog, but kallsyms disabled (in that case we are
      also not allowed to pass any relative offsets, etc, so prog
      becomes pointer sanitized on dump):
      
        # sysctl kernel.kptr_restrict=2
        kernel.kptr_restrict = 2
      
        # bpftool prog dump xlated id 3
         0: (b7) r1 = 2
         1: (63) *(u32 *)(r10 -4) = r1
         2: (bf) r2 = r10
         3: (07) r2 += -4
         4: (18) r1 = map[id:2]
         6: (85) call bpf_unspec#0
         7: (15) if r0 == 0x0 goto pc+2
        [...]
      
      Example, BPF calls via interpreter:
      
        # bpftool prog dump xlated id 1
         0: (85) call pc+2#__bpf_prog_run_args32
         1: (b7) r0 = 1
         2: (95) exit
         3: (b7) r0 = 2
         4: (95) exit
      
      Example, BPF calls via JIT:
      
        # sysctl net.core.bpf_jit_enable=1
        net.core.bpf_jit_enable = 1
        # sysctl net.core.bpf_jit_kallsyms=1
        net.core.bpf_jit_kallsyms = 1
      
        # bpftool prog dump xlated id 1
         0: (85) call pc+2#bpf_prog_3b185187f1855c4c_F
         1: (b7) r0 = 1
         2: (95) exit
         3: (b7) r0 = 2
         4: (95) exit
      
      And finally, an example for tail calls that is now working
      as well wrt correlation:
      
        # bpftool prog dump xlated id 2
        [...]
        10: (b7) r2 = 8
        11: (85) call bpf_trace_printk#-41312
        12: (bf) r1 = r6
        13: (18) r2 = map[id:1]
        15: (b7) r3 = 0
        16: (85) call bpf_tail_call#12
        17: (b7) r1 = 42
        18: (6b) *(u16 *)(r6 +46) = r1
        19: (b7) r0 = 0
        20: (95) exit
      
        # bpftool map show id 1
        1: prog_array  flags 0x0
              key 4B  value 4B  max_entries 1  memlock 4096B
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      7105e828
    • bpf: fix integer overflows · bb7f0f98
      Committed by Alexei Starovoitov
      There were various issues related to the limited size of integers used in
      the verifier:
       - `off + size` overflow in __check_map_access()
       - `off + reg->off` overflow in check_mem_access()
       - `off + reg->var_off.value` overflow or 32-bit truncation of
         `reg->var_off.value` in check_mem_access()
       - 32-bit truncation in check_stack_boundary()
      
      Make sure that any integer math cannot overflow by not allowing
      pointer math with large values.
      
      Also reduce the scope of "scalar op scalar" tracking.
      
      Fixes: f1174f77 ("bpf/verifier: rework value tracking")
      Reported-by: Jann Horn <jannh@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      bb7f0f98
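The class of bug listed in this commit, where `off + size` wraps in 32-bit arithmetic and slips past a bounds check, can be illustrated with a small sketch. The function names and the `SKETCH_MAX_OFF` bound are hypothetical and do not reproduce the verifier's code; the sketch only shows the two defensive ideas named in the message: do the math in a wider type, and refuse pointer offsets that are unreasonably large.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cap on offsets accepted for pointer arithmetic. */
#define SKETCH_MAX_OFF (1 << 29)

/*
 * Bounds check for an access of `size` bytes at `off` inside a value of
 * `value_size` bytes. The sum is computed in 64 bits so it cannot wrap.
 */
static bool map_access_ok(int32_t off, int32_t size, uint32_t value_size)
{
    if (off < 0 || size <= 0)
        return false;
    return (uint64_t)off + (uint64_t)size <= value_size;
}

/* Reject offsets so large that later additions could overflow. */
static bool ptr_offset_ok(int64_t val)
{
    return val < SKETCH_MAX_OFF && val > -SKETCH_MAX_OFF;
}
```

With a plain 32-bit addition, two values near INT32_MAX would wrap to a small or negative sum and pass a naive `off + size <= value_size` test; the widened form rejects them.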