1. 09 Jan, 2018 13 commits
    • netfilter: move reroute indirection to struct nf_ipv6_ops · ce388f45
      Pablo Neira Ayuso committed
      We cannot make a direct call to nf_ip6_reroute() because that would result
      in autoloading the 'ipv6' module because of symbol dependencies.
      Therefore, define the reroute indirection in nf_ipv6_ops, where it really
      belongs.

      For IPv4, we can indeed make a direct function call, which is faster,
      given that IPv4 is built into the networking code by default. Still,
      CONFIG_INET=n with CONFIG_NETFILTER=y is possible, so define an empty
      inline stub for IPv4 in that case.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
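The indirection described in the commit above can be sketched in userspace C. This is only an illustration under stated assumptions: `struct ipv6_ops_sketch`, `core_reroute()` and every other name below are hypothetical and are not the kernel's definitions, and the error code returned when the module is absent is a choice made for this sketch. In a real kernel the IPv6 function would live in a separate module; here both sides share one file to keep the example self-contained.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct nf_ipv6_ops: a table of callbacks
 * that the optional IPv6 module fills in when it loads. */
struct ipv6_ops_sketch {
	int (*reroute)(int packet_id);
};

/* NULL until the (simulated) ipv6 module registers itself. */
static const struct ipv6_ops_sketch *ipv6_ops;

/* IPv4 is assumed to be built in, so core code may call it directly. */
static int ip4_reroute_sketch(int packet_id)
{
	return packet_id >= 0 ? 0 : -EINVAL;
}

/* Belongs to the simulated ipv6 module; core code never names it. */
static int ip6_reroute_sketch(int packet_id)
{
	return packet_id >= 0 ? 0 : -EINVAL;
}

static const struct ipv6_ops_sketch ipv6_ops_impl = {
	.reroute = ip6_reroute_sketch,
};

/* Called from the simulated module's init and exit paths. */
static void ipv6_module_load(void)
{
	assert(ipv6_ops == NULL); /* loading twice is a bug in this sketch */
	ipv6_ops = &ipv6_ops_impl;
}

static void ipv6_module_unload(void)
{
	ipv6_ops = NULL;
}

/* Core entry point: direct call for IPv4, indirect call for IPv6. */
static int core_reroute(int family, int packet_id)
{
	switch (family) {
	case 4:
		return ip4_reroute_sketch(packet_id);
	case 6:
		/* Only a pointer check, no reference to an IPv6 symbol. */
		if (!ipv6_ops)
			return -EAFNOSUPPORT;
		return ipv6_ops->reroute(packet_id);
	default:
		return -EAFNOSUPPORT;
	}
}
```

Because `core_reroute()` only dereferences a pointer, the core would not need to link against the IPv6 function, which is the property the commit message relies on to avoid autoloading the module.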
    • netfilter: move route indirection to struct nf_ipv6_ops · 3f87c08c
      Pablo Neira Ayuso committed
      We cannot make a direct call to nf_ip6_route() because that would result
      in autoloading the 'ipv6' module because of symbol dependencies.
      Therefore, define the route indirection in nf_ipv6_ops, where it really
      belongs.

      For IPv4, we can indeed make a direct function call, which is faster,
      given that IPv4 is built into the networking code by default. Still,
      CONFIG_INET=n with CONFIG_NETFILTER=y is possible, so define an empty
      inline stub for IPv4 in that case.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: remove saveroute indirection in struct nf_afinfo · 7db9a51e
      Pablo Neira Ayuso committed
      This is only used by nf_queue.c, and the function has no symbol
      dependencies on IPv6; it just refers to structure layouts. Therefore,
      we can replace it with a direct function call from where it belongs.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: move checksum_partial indirection to struct nf_ipv6_ops · f7dcbe2f
      Pablo Neira Ayuso committed
      We cannot make a direct call to nf_ip6_checksum_partial() because that
      would result in autoloading the 'ipv6' module because of symbol
      dependencies. Therefore, define the checksum_partial indirection in
      nf_ipv6_ops, where it really belongs.

      For IPv4, we can indeed make a direct function call, which is faster,
      given that IPv4 is built into the networking code by default. Still,
      CONFIG_INET=n with CONFIG_NETFILTER=y is possible, so define an empty
      inline stub for IPv4 in that case.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: move checksum indirection to struct nf_ipv6_ops · ef71fe27
      Pablo Neira Ayuso committed
      We cannot make a direct call to nf_ip6_checksum() because that would
      result in autoloading the 'ipv6' module because of symbol dependencies.
      Therefore, define the checksum indirection in nf_ipv6_ops, where it really
      belongs.

      For IPv4, we can indeed make a direct function call, which is faster,
      given that IPv4 is built into the networking code by default. Still,
      CONFIG_INET=n with CONFIG_NETFILTER=y is possible, so define an empty
      inline stub for IPv4 in that case.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
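The "empty inline stub" mentioned for the CONFIG_INET=n case follows a common compile-time pattern, sketched below. The macro `SKETCH_CONFIG_INET` and the function `ip4_checksum_sketch()` are hypothetical names for illustration; the checksum folding is a generic one's-complement fold and is not the kernel's nf_ip_checksum().

```c
#include <stdint.h>

/* Hypothetical switch standing in for CONFIG_INET; build with
 * -DSKETCH_CONFIG_INET=0 to model a kernel without IPv4. */
#ifndef SKETCH_CONFIG_INET
#define SKETCH_CONFIG_INET 1
#endif

#if SKETCH_CONFIG_INET
/* Real implementation: fold a 32-bit sum into a 16-bit
 * one's-complement checksum. */
static inline uint16_t ip4_checksum_sketch(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
#else
/* Empty stub: callers still compile and link, but no work is done. */
static inline uint16_t ip4_checksum_sketch(uint32_t sum)
{
	(void)sum;
	return 0;
}
#endif
```

Callers use the same function name in both configurations, so they need no `#ifdef` of their own, and the stub variant compiles down to nothing.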
    • netfilter: core: only allow one nat hook per hook point · f92b40a8
      Florian Westphal committed
      The netfilter NAT core cannot deal with more than one NAT hook per hook
      location (prerouting, input ...), because the NAT hooks install a NAT null
      binding in case the iptables nat table (iptable_nat hooks) or the
      corresponding nftables chain (nft nat hooks) doesn't specify a nat
      transformation.
      
      Null bindings are needed to detect port collisions between NAT-ed and
      non-NAT-ed connections.

      This causes nftables NAT rules to not work when the iptable_nat module is
      loaded, and vice versa, because a nat binding has already been attached
      when the second nat hook is consulted.

      The netfilter core is not really the correct location to handle this
      (hooks are just hooks, the core has no notion of what kinds of side
       effects a hook implements), but it's the only place where we can check
      for conflicts between both iptables hooks and nftables hooks without
      adding dependencies.

      So add a nat annotation to hook_ops to describe those hooks that will
      add NAT bindings, and then make the core reject registration if such a
      hook already exists. The annotation fills a padding hole; in case further
      restrictions appear we might change this to a 'u8 type' instead of bool.
      
      iptables error if nft nat hook active:
      iptables -t nat -A POSTROUTING -j MASQUERADE
      iptables v1.4.21: can't initialize iptables table `nat': File exists
      Perhaps iptables or your kernel needs to be upgraded.
      
      nftables error if iptables nat table present:
      nft -f /etc/nftables/ipv4-nat
      /usr/etc/nftables/ipv4-nat:3:1-2: Error: Could not process rule: File exists
      table nat {
      ^^
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
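A minimal userspace sketch of the check described in the commit above: each hook carries a boolean annotation, and registration is rejected when another annotated hook already sits at the same hook point. The "File exists" messages quoted above correspond to EEXIST, which is what this sketch returns. The struct layout, the fixed-size table and all names are assumptions made for illustration, not the kernel's data structures.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NUM_HOOKS    5 /* prerouting, input, forward, output, postrouting */
#define SKETCH_MAX_PER_HOOK 8 /* hypothetical capacity of this sketch */

struct hook_ops_sketch {
	unsigned int hooknum; /* hook point to attach to */
	bool nat_hook;        /* annotation: this hook installs NAT bindings */
};

static const struct hook_ops_sketch *registered[SKETCH_NUM_HOOKS][SKETCH_MAX_PER_HOOK];
static unsigned int hook_count[SKETCH_NUM_HOOKS];

static int register_hook_sketch(const struct hook_ops_sketch *ops)
{
	unsigned int i;

	assert(ops != NULL);
	if (ops->hooknum >= SKETCH_NUM_HOOKS)
		return -EINVAL;
	if (hook_count[ops->hooknum] == SKETCH_MAX_PER_HOOK)
		return -ENOSPC; /* limit exists only in this sketch */

	/* Reject a second NAT hook at the same hook point. */
	if (ops->nat_hook) {
		for (i = 0; i < hook_count[ops->hooknum]; i++)
			if (registered[ops->hooknum][i]->nat_hook)
				return -EEXIST; /* reported as "File exists" */
	}

	registered[ops->hooknum][hook_count[ops->hooknum]++] = ops;
	return 0;
}
```

Hooks without the annotation are unaffected, so a filter hook and a single NAT hook can still share a hook point.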
    • netfilter: xtables: add and use xt_request_find_table_lock · 03d13b68
      Florian Westphal committed
      Currently we always return -ENOENT to userspace if we can't find
      a particular table, or if the table initialization fails.

      A follow-up patch will make nat table init fail in case nftables already
      registered a nat hook, so this change makes xt_find_table_lock return
      an ERR_PTR to pass on the errno value reported by the table init
      function.

      Add xt_request_find_table_lock as a try_then_request_module replacement
      and use it where needed.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
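The ERR_PTR convention that the commit above switches to can be illustrated with a small userspace re-implementation. The helpers below mimic the idea of encoding a negative errno in a pointer value; the lowercase names, the table list and `find_table_sketch()` are hypothetical and only serve the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer, and decode it again. */
static inline void *err_ptr(long error) { return (void *)(intptr_t)error; }
static inline long ptr_err(const void *ptr) { return (long)(intptr_t)ptr; }
static inline bool is_err(const void *ptr)
{
	/* Error pointers occupy the last SKETCH_MAX_ERRNO addresses. */
	return (uintptr_t)ptr >= (uintptr_t)(-SKETCH_MAX_ERRNO);
}

struct table_sketch {
	const char *name;
	int init_error; /* 0, or the negative errno its init reported */
};

static struct table_sketch tables[] = {
	{ "filter", 0 },
	{ "nat", -EEXIST }, /* e.g. init failed: a NAT hook already exists */
};

/* Return the table, or an error pointer carrying the precise reason. */
static struct table_sketch *find_table_sketch(const char *name)
{
	size_t i;

	assert(name != NULL);
	for (i = 0; i < sizeof(tables) / sizeof(tables[0]); i++) {
		if (strcmp(tables[i].name, name) != 0)
			continue;
		if (tables[i].init_error)
			return err_ptr(tables[i].init_error);
		return &tables[i];
	}
	return err_ptr(-ENOENT);
}
```

A caller checks `is_err()` first and then forwards `ptr_err()` to userspace, so a failed init can report its own errno instead of a blanket -ENOENT.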
    • netfilter: reduce NF_MAX_HOOKS define · 256d94ba
      Florian Westphal committed
      This can be the same as NF_INET_NUMHOOKS if we don't support DECNET.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: don't allocate space for arp/bridge hooks unless needed · 2a95183a
      Florian Westphal committed
      There is no need to define hook points if the family isn't supported.
      Because we need these hooks for either nftables, arp/ebtables
      or the 'call-iptables' hack we have in the bridge layer, add two
      new dependencies, NETFILTER_FAMILY_{ARP,BRIDGE}, and have the
      users select them.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: don't allocate space for decnet hooks unless needed · bb4badf3
      Florian Westphal committed
      There is no need to define hook points if the family isn't supported.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: add defines for arp/decnet max hooks · e58f33cc
      Florian Westphal committed
      The kernel already has defines for this, but they are in uapi exposed
      headers.
      
      Including these from netns.h causes build errors and also adds
      dependencies on headers that we don't need.
      
      So move these defines to netfilter_defs.h and place the uapi ones
      in ifndef __KERNEL__ to keep them for userspace.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: reduce size of hook entry point locations · b0f38338
      Florian Westphal committed
      struct net contains:
      
      struct nf_hook_entries __rcu *hooks[NFPROTO_NUMPROTO][NF_MAX_HOOKS];
      
      which stores the hook entry point locations for the various protocol
      families and the hooks.

      Using an array results in compact C code when doing accesses, i.e.
        x = rcu_dereference(net->nf.hooks[pf][hook]);

      but it also wastes a lot of memory, as most families are
      not used.

      So split the array into those families that are used, which
      are only 5 (instead of 13).  In most cases, the 'pf' argument is
      constant, i.e. gcc removes the switch statement.
      
      struct net before:
       /* size: 5184, cachelines: 81, members: 46 */
      after:
       /* size: 4672, cachelines: 73, members: 46 */
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
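The split described in the commit above replaces one two-dimensional array with one array per used family, selected through a switch. The sketch below illustrates that shape only; the enum, the per-family hook counts and the function `hook_slot()` are assumptions for this example and do not reproduce the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

struct hook_entries_sketch; /* opaque in this sketch */

/* Hypothetical family identifiers; FAM_UNUSED models the many
 * families that never register hooks. */
enum family_sketch {
	FAM_IPV4, FAM_IPV6, FAM_ARP, FAM_BRIDGE, FAM_DECNET, FAM_UNUSED
};

/* Illustrative per-family hook counts. */
#define INET_HOOKS   5
#define ARP_HOOKS    3
#define DECNET_HOOKS 7

/* One array per family that is actually used. */
struct netns_nf_sketch {
	struct hook_entries_sketch *hooks_ipv4[INET_HOOKS];
	struct hook_entries_sketch *hooks_ipv6[INET_HOOKS];
	struct hook_entries_sketch *hooks_arp[ARP_HOOKS];
	struct hook_entries_sketch *hooks_bridge[INET_HOOKS];
	struct hook_entries_sketch *hooks_decnet[DECNET_HOOKS];
};

/* Return the slot for (pf, hook), or NULL if the pair is not backed
 * by storage. With a constant pf the compiler can drop the switch. */
static struct hook_entries_sketch **
hook_slot(struct netns_nf_sketch *nf, enum family_sketch pf, unsigned int hook)
{
	assert(nf != NULL);
	switch (pf) {
	case FAM_IPV4:
		return hook < INET_HOOKS ? &nf->hooks_ipv4[hook] : NULL;
	case FAM_IPV6:
		return hook < INET_HOOKS ? &nf->hooks_ipv6[hook] : NULL;
	case FAM_ARP:
		return hook < ARP_HOOKS ? &nf->hooks_arp[hook] : NULL;
	case FAM_BRIDGE:
		return hook < INET_HOOKS ? &nf->hooks_bridge[hook] : NULL;
	case FAM_DECNET:
		return hook < DECNET_HOOKS ? &nf->hooks_decnet[hook] : NULL;
	default:
		return NULL;
	}
}
```

The trade-off is a slightly longer accessor in exchange for not reserving pointer slots for families that never use them.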
    • netfilter: core: free hooks with call_rcu · 8c873e21
      Florian Westphal committed
      Giuseppe Scrivano says:
        "SELinux, if enabled, registers for each new network namespace 6
          netfilter hooks."
      
      Cost for this is high.  With synchronize_net() removed:
         "The net benefit on an SMP machine with two cores is that creating a
         new network namespace takes -40% of the original time."
      
      This patch replaces synchronize_net+kvfree with call_rcu().
      We store rcu_head at the tail of a structure that has no fixed layout,
      i.e. we cannot use offsetof() to compute the start of the original
      allocation.  Thus store this information right after the rcu head.
      
      We could simplify this by just placing the rcu_head at the start
      of struct nf_hook_entries.  However, this structure is used in
      packet processing hotpath, so only place what is needed for that
      at the beginning of the struct.
      Reported-by: Giuseppe Scrivano <gscrivan@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
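The layout trick from the commit above, an rcu head at the tail of a variable-length allocation followed by a pointer back to the allocation start, can be sketched in userspace as follows. All type and function names are hypothetical, and `call_rcu_sketch()` runs the callback immediately, whereas a real call_rcu() would defer it until after a grace period.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for struct rcu_head: just a callback slot. */
struct rcu_head_sketch {
	void (*func)(struct rcu_head_sketch *head);
};

/* Hot-path data stays at the front of the allocation. */
struct hook_entries_sketch {
	unsigned int num_hook_entries;
	void *hooks[]; /* variable length, so the tail has no fixed offset */
};

/* Tail: the rcu head, then a pointer to the start of the allocation. */
struct hook_entries_tail_sketch {
	struct rcu_head_sketch head;
	void *allocation;
};

static unsigned int freed_count; /* lets callers observe the free */

static struct hook_entries_tail_sketch *entries_tail(struct hook_entries_sketch *e)
{
	return (struct hook_entries_tail_sketch *)&e->hooks[e->num_hook_entries];
}

static struct hook_entries_sketch *entries_alloc(unsigned int n)
{
	size_t size = sizeof(struct hook_entries_sketch) + n * sizeof(void *) +
		      sizeof(struct hook_entries_tail_sketch);
	struct hook_entries_sketch *e = calloc(1, size);

	if (!e)
		return NULL;
	e->num_hook_entries = n;
	entries_tail(e)->allocation = e; /* remember where the block starts */
	return e;
}

static void entries_free_cb(struct rcu_head_sketch *head)
{
	/* head is the first member of the tail, so this cast is valid. */
	struct hook_entries_tail_sketch *tail =
		(struct hook_entries_tail_sketch *)head;

	free(tail->allocation);
	freed_count++;
}

/* Sketch only: invokes the callback right away, with no grace period. */
static void call_rcu_sketch(struct rcu_head_sketch *head,
			    void (*func)(struct rcu_head_sketch *))
{
	head->func = func;
	head->func(head);
}

static void entries_free(struct hook_entries_sketch *e)
{
	if (e)
		call_rcu_sketch(&entries_tail(e)->head, entries_free_cb);
}
```

The callback only receives the rcu head, and since the distance from the head back to the start depends on the entry count, the stored `allocation` pointer is what makes the free possible without offsetof().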
  2. 06 Jan, 2018 2 commits
    • xdp: generic XDP handling of xdp_rxq_info · e817f856
      Jesper Dangaard Brouer committed
      Hook points for xdp_rxq_info:
       * reg  : netif_alloc_rx_queues
       * unreg: netif_free_rx_queues
      
      The net_device has some members (num_rx_queues + real_num_rx_queues)
      and a data area (dev->_rx with struct netdev_rx_queue's) that were
      primarily used for exporting information about RPS (CONFIG_RPS) queues
      to sysfs (CONFIG_SYSFS).

      For generic XDP, extend struct netdev_rx_queue with the xdp_rxq_info,
      and remove some of the CONFIG_SYSFS ifdefs.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xdp: base API for new XDP rx-queue info concept · aecd67b6
      Jesper Dangaard Brouer committed
      This patch only introduces the core data structures and API functions.
      All XDP-enabled drivers must use the API before this info can be used.

      XDP needs to know more about the RX queue that a given XDP frame
      arrived on, both for the XDP bpf-prog and for the kernel side.
      
      Instead of extending xdp_buff each time new info is needed, the patch
      creates a separate read-mostly struct xdp_rxq_info, that contains this
      info.  We stress this data/cache-line is for read-only info.  This is
      NOT for dynamic per packet info, use the data_meta for such use-cases.
      
      The performance advantage is that this info can be set up at RX-ring init
      time, instead of updating N members in xdp_buff.  A possible (driver
      level) micro-optimization is that the xdp_buff->rxq assignment could be
      done once per XDP/NAPI loop.  The extra pointer deref only happens for
      programs needing access to this info (thus, no slowdown to existing
      use-cases).
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
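The design described in the commit above, a read-mostly per-queue struct that each packet descriptor merely points to, might look roughly like this in a simplified userspace form. The field names and helper functions are assumptions for illustration; they are not the kernel's struct xdp_rxq_info or its API.

```c
#include <assert.h>
#include <stddef.h>

/* Read-mostly info about one RX queue, filled in once at ring setup. */
struct xdp_rxq_info_sketch {
	unsigned int ifindex;
	unsigned int queue_index;
};

/* Per-packet descriptor: only a pointer to the queue info is added. */
struct xdp_buff_sketch {
	void *data;
	void *data_end;
	const struct xdp_rxq_info_sketch *rxq;
};

/* Hypothetical driver RX ring that owns the queue info. */
struct rx_ring_sketch {
	struct xdp_rxq_info_sketch rxq_info;
};

/* Done once at RX-ring init time, not per packet. */
static void rx_ring_init(struct rx_ring_sketch *ring,
			 unsigned int ifindex, unsigned int queue_index)
{
	assert(ring != NULL);
	ring->rxq_info.ifindex = ifindex;
	ring->rxq_info.queue_index = queue_index;
}

/* Per packet: a single pointer store instead of copying N fields. */
static void xdp_buff_prepare(struct xdp_buff_sketch *xdp,
			     const struct rx_ring_sketch *ring,
			     void *data, size_t len)
{
	xdp->data = data;
	xdp->data_end = (char *)data + len;
	xdp->rxq = &ring->rxq_info;
}

/* Only programs that want the info pay for the extra dereference. */
static unsigned int prog_rx_queue_index(const struct xdp_buff_sketch *xdp)
{
	return xdp->rxq->queue_index;
}
```

Adding a new piece of queue information then means growing the queue-info struct, while the per-packet descriptor and its setup cost stay unchanged.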
  3. 05 Jan, 2018 2 commits
  4. 04 Jan, 2018 4 commits
  5. 03 Jan, 2018 8 commits
  6. 31 Dec, 2017 5 commits
  7. 29 Dec, 2017 1 commit
  8. 28 Dec, 2017 2 commits
  9. 27 Dec, 2017 1 commit
    • rtnetlink: Replace implementation of ASSERT_RTNL() macro with WARN_ONCE() · 66364bdf
      Leon Romanovsky committed
      The ASSERT_RTNL() macro is actually an open-coded variant of WARN_ONCE()
      with two exceptions. First, it prints the stack for multiple hits and not
      only once as WARN_ONCE() does. Second, the user can disable prints of
      WARN_ONCE by setting CONFIG_BUG to N.
      
      The multiple prints of dump stack are actually not needed, because calls
      without rtnl lock are programming errors and user can't do anything
      about them except to complain to the mailing list after first occurrence
      of such failure.
      
      The user who disabled BUG/WARN prints did it explicitly because by default
      in upstream kernel and distributions this option is enabled. It means
      that user doesn't want to see prints about missing locks too.
      
      This patch replaces the open-coded variant with the already existing
      macro and changes the error prints to be once only.
      Reviewed-by: Mark Bloch <markb@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
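The "warn once" behavior that the commit above adopts can be sketched with a static flag per call site. The macros below are simplified userspace stand-ins with hypothetical names; the lock state is modeled by a plain boolean rather than a real rtnl lock, and the message format is illustrative.

```c
#include <stdbool.h>
#include <stdio.h>

static bool rtnl_held;                /* stand-in for the rtnl lock state */
static unsigned int warnings_printed; /* lets callers observe the effect */

/* Print a warning the first time cond is true at this call site,
 * then stay silent for later hits. */
#define WARN_ONCE_SKETCH(cond, msg)                                \
	do {                                                       \
		static bool warned;                                \
		if ((cond) && !warned) {                           \
			warned = true;                             \
			warnings_printed++;                        \
			fprintf(stderr, "WARNING: %s at %s (%d)\n", \
				(msg), __FILE__, __LINE__);        \
		}                                                  \
	} while (0)

#define ASSERT_RTNL_SKETCH() \
	WARN_ONCE_SKETCH(!rtnl_held, "RTNL: assertion failed")

/* Hypothetical function that requires the lock to be held. */
static void change_device_sketch(void)
{
	ASSERT_RTNL_SKETCH();
	/* ... work that needs the lock would go here ... */
}
```

Calling `change_device_sketch()` repeatedly without the lock produces a single warning, which matches the commit's argument that repeated stack dumps add no information.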
  10. 22 Dec, 2017 2 commits
    • IB/mlx5: Fix congestion counters in LAG mode · 71a0ff65
      Majd Dibbiny committed
      Congestion counters are counted and queried per physical function.
      When working in LAG mode, CNP packets can be sent or received on both
      of the functions, thus congestion counters should be aggregated from
      the two physical functions.
      
      Fixes: e1f24a79 ("IB/mlx5: Support congestion related counters")
      Signed-off-by: Majd Dibbiny <majd@mellanox.com>
      Reviewed-by: Aviv Heller <avivh@mellanox.com>
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • net: reevalulate autoflowlabel setting after sysctl setting · 513674b5
      Shaohua Li committed
      sysctl.ip6.auto_flowlabels is 1 by default. In our hosts, we set it to 2.
      If sockopt doesn't set autoflowlabel, outgoing packets from the hosts are
      supposed to not include a flowlabel. This is true for normal packets, but
      not for reset packets.
      
      The reason is that ipv6_pinfo.autoflowlabel is set at sock creation. If we
      later change sysctl.ip6.auto_flowlabels, ipv6_pinfo.autoflowlabel isn't
      changed, so the sock keeps the old behavior in terms of auto
      flowlabel. Reset packets suffer from this problem because they are
      sent from a special control socket, which is created at boot
      time. Since sysctl.ipv6.auto_flowlabels is 1 by default, the control
      socket will always have its ipv6_pinfo.autoflowlabel set, even after the
      user sets sysctl.ipv6.auto_flowlabels to 2, so reset packets will always
      have a flowlabel. A normal sock created before the sysctl setting suffers
      from the same issue. We can't even turn off autoflowlabel unless we kill
      all socks in the hosts.
      
      To fix this, if IPV6_AUTOFLOWLABEL sockopt is used, we use the
      autoflowlabel setting from user, otherwise we always call
      ip6_default_np_autolabel() which has the new settings of sysctl.
      
      Note that this changes behavior a little bit. Before commit 42240901
      (ipv6: Implement different admin modes for automatic flow labels), the
      autoflowlabel behavior of a sock wasn't sticky, e.g. if the sysctl changed,
      an existing connection would change its autoflowlabel behavior. After that
      commit, autoflowlabel behavior is sticky for the whole life of the sock.
      With this patch, the behavior is no longer sticky.
      
      Cc: Martin KaFai Lau <kafai@fb.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Tom Herbert <tom@quantonium.net>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
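The fix described in the commit above amounts to a tri-state decision: honor the socket option if the user set it, otherwise consult the current sysctl value on every use. The sketch below models that logic in userspace. The struct, the mode constants and their interpretation (1 and 3 enable labels by default, 0 and 2 do not) are assumptions made for this illustration rather than a copy of the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed admin modes of the sysctl, for illustration only. */
enum {
	AUTO_OFF = 0,    /* never add flow labels automatically */
	AUTO_OPTOUT = 1, /* on by default, a socket may opt out */
	AUTO_OPTIN = 2,  /* off by default, a socket may opt in */
	AUTO_FORCED = 3  /* always on */
};

static int sysctl_auto_flowlabels = AUTO_OPTOUT; /* default is 1 */

struct sock_sketch {
	bool autoflowlabel;     /* value chosen via the socket option */
	bool autoflowlabel_set; /* true only if the option was used */
};

/* Default derived from the sysctl value at the time of the call. */
static bool default_autolabel(void)
{
	switch (sysctl_auto_flowlabels) {
	case AUTO_OPTOUT:
	case AUTO_FORCED:
		return true;
	default:
		return false;
	}
}

/* An explicit user choice wins; otherwise re-read the sysctl. */
static bool sock_autoflowlabel(const struct sock_sketch *sk)
{
	assert(sk != NULL);
	if (!sk->autoflowlabel_set)
		return default_autolabel();
	return sk->autoflowlabel;
}
```

Because nothing is cached at socket creation, a long-lived socket such as the control socket used for reset packets picks up a later sysctl change on its next packet.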