1. 14 May, 2015 2 commits
    • netfilter: add netfilter ingress hook after handle_ing() under unique static key · e687ad60
      Pablo Neira committed
      This patch adds the Netfilter ingress hook just after the existing tc ingress
      hook, which seems to be the consensus solution.
      
      Note that the Netfilter hook resides under the global static key that enables
      ingress filtering. Nonetheless, Netfilter also has its own static key, to
      minimize the impact on the existing handle_ing().
      
      * Without this patch:
      
      Result: OK: 6216490(c6216338+d152) usec, 100000000 (60byte,0frags)
        16086246pps 7721Mb/sec (7721398080bps) errors: 100000000
      
          42.46%  kpktgend_0   [kernel.kallsyms]   [k] __netif_receive_skb_core
          25.92%  kpktgend_0   [kernel.kallsyms]   [k] kfree_skb
           7.81%  kpktgend_0   [pktgen]            [k] pktgen_thread_worker
           5.62%  kpktgend_0   [kernel.kallsyms]   [k] ip_rcv
           2.70%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_internal
           2.34%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_sk
           1.44%  kpktgend_0   [kernel.kallsyms]   [k] __build_skb
      
      * With this patch:
      
      Result: OK: 6214833(c6214731+d101) usec, 100000000 (60byte,0frags)
        16090536pps 7723Mb/sec (7723457280bps) errors: 100000000
      
          41.23%  kpktgend_0      [kernel.kallsyms]  [k] __netif_receive_skb_core
          26.57%  kpktgend_0      [kernel.kallsyms]  [k] kfree_skb
           7.72%  kpktgend_0      [pktgen]           [k] pktgen_thread_worker
           5.55%  kpktgend_0      [kernel.kallsyms]  [k] ip_rcv
           2.78%  kpktgend_0      [kernel.kallsyms]  [k] netif_receive_skb_internal
           2.06%  kpktgend_0      [kernel.kallsyms]  [k] netif_receive_skb_sk
           1.43%  kpktgend_0      [kernel.kallsyms]  [k] __build_skb
      
      * Without this patch + tc ingress:
      
              tc filter add dev eth4 parent ffff: protocol ip prio 1 \
                      u32 match ip dst 4.3.2.1/32
      
      Result: OK: 9269001(c9268821+d179) usec, 100000000 (60byte,0frags)
        10788648pps 5178Mb/sec (5178551040bps) errors: 100000000
      
          40.99%  kpktgend_0   [kernel.kallsyms]  [k] __netif_receive_skb_core
          17.50%  kpktgend_0   [kernel.kallsyms]  [k] kfree_skb
          11.77%  kpktgend_0   [cls_u32]          [k] u32_classify
           5.62%  kpktgend_0   [kernel.kallsyms]  [k] tc_classify_compat
           5.18%  kpktgend_0   [pktgen]           [k] pktgen_thread_worker
           3.23%  kpktgend_0   [kernel.kallsyms]  [k] tc_classify
           2.97%  kpktgend_0   [kernel.kallsyms]  [k] ip_rcv
           1.83%  kpktgend_0   [kernel.kallsyms]  [k] netif_receive_skb_internal
           1.50%  kpktgend_0   [kernel.kallsyms]  [k] netif_receive_skb_sk
           0.99%  kpktgend_0   [kernel.kallsyms]  [k] __build_skb
      
      * With this patch + tc ingress:
      
              tc filter add dev eth4 parent ffff: protocol ip prio 1 \
                      u32 match ip dst 4.3.2.1/32
      
      Result: OK: 9308218(c9308091+d126) usec, 100000000 (60byte,0frags)
        10743194pps 5156Mb/sec (5156733120bps) errors: 100000000
      
          42.01%  kpktgend_0   [kernel.kallsyms]   [k] __netif_receive_skb_core
          17.78%  kpktgend_0   [kernel.kallsyms]   [k] kfree_skb
          11.70%  kpktgend_0   [cls_u32]           [k] u32_classify
           5.46%  kpktgend_0   [kernel.kallsyms]   [k] tc_classify_compat
           5.16%  kpktgend_0   [pktgen]            [k] pktgen_thread_worker
           2.98%  kpktgend_0   [kernel.kallsyms]   [k] ip_rcv
           2.84%  kpktgend_0   [kernel.kallsyms]   [k] tc_classify
           1.96%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_internal
           1.57%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_sk
      
      Note that the results are very similar before and after.
      
      I can see gcc gets the code under the ingress static key out of the hot path.
      Then, on that cold branch, it generates the code to accommodate the netfilter
      ingress static key. My explanation for this is that this reduces the pressure
      on the instruction cache for non-users as the new code is out of the hot path,
      and it comes with minimal impact for tc ingress users.
      
      Using gcc version 4.8.4 on:
      
      Architecture:          x86_64
      CPU op-mode(s):        32-bit, 64-bit
      Byte Order:            Little Endian
      CPU(s):                8
      [...]
      L1d cache:             16K
      L1i cache:             64K
      L2 cache:              2048K
      L3 cache:              8192K
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
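
The nesting of the two static keys described in this commit can be sketched in userspace C. This is a minimal sketch, not the kernel code: ordinary boolean flags with a `__builtin_expect` hint (GCC or Clang assumed) stand in for static keys, which in the kernel patch the branch at runtime. `struct pkt`, `BLOCKED_DST`, `nf_ingress_needed` and `receive_core()` are hypothetical names chosen for illustration, and the `handle_ing()` and `nf_ingress()` bodies are simplified stand-ins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Branch hint standing in for a static key (assumption: GCC or Clang). */
#define unlikely(x) __builtin_expect(!!(x), 0)

#define BLOCKED_DST 0x04030201u /* hypothetical address dropped by the filter */

struct pkt {
	unsigned int dst;
};

static bool ingress_needed;    /* global key: some ingress user exists */
static bool nf_ingress_needed; /* Netfilter's own key (hypothetical name) */

/* Stand-in for the tc ingress step: returns NULL if it consumes the packet. */
static struct pkt *handle_ing(struct pkt *p)
{
	return p;
}

/* Stand-in for the Netfilter ingress step: negative means "dropped". */
static int nf_ingress(struct pkt *p)
{
	if (!unlikely(nf_ingress_needed))
		return 0;
	return p->dst == BLOCKED_DST ? -1 : 0;
}

/* Returns 0 if the packet continues up the stack, -1 if it was dropped. */
static int receive_core(struct pkt *p)
{
	assert(p != NULL);
	if (unlikely(ingress_needed)) {
		/* Cold path: only ingress users pay for either hook. */
		p = handle_ing(p);
		if (!p)
			return -1;
		if (nf_ingress(p) < 0)
			return -1;
	}
	return 0;
}
```

With both flags off, `receive_core()` never reaches either hook, which mirrors the claim that non-users only pay for one not-taken branch. The Netfilter check is reached only after the global flag is set and the tc step has run.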
    • net: move __skb_tx_hash to dev.c · 5605c762
      Jiri Pirko committed
      The __skb_tx_hash() function has no relation to flow_dissect, so just move it
      to dev.c.
      Signed-off-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 13 May, 2015 3 commits
  3. 12 May, 2015 1 commit
  4. 11 May, 2015 1 commit
    • net: sched: further simplify handle_ing · d2788d34
      Daniel Borkmann committed
      Ingress qdisc has no other purpose than calling into tc_classify()
      that executes attached classifier(s) and action(s).
      
      It has a 1:1 relationship to dev->ingress_queue. Now that commit
      087c1a60 ("net: sched: run ingress qdisc without locks") has removed
      the central ingress lock, one major contention point is gone.
      
      The extra indirection layers, however, are not necessary for calling
      into the ingress qdisc. Measured with pktgen calling locally into
      netif_receive_skb() with a dummy u32 classifier on a single CPU
      (Supermicro X10SLM-F, Xeon E3-1240): ~21.1 Mpps before, ~22.9 Mpps
      after the patch.
      
      We can redirect the private classifier list to the netdev directly,
      without changing any classifier API bits (!), and execute on it from the
      handle_ing() side. The __QDISC_STATE_DEACTIVATE test can be removed:
      the ingress qdisc doesn't have a queue, so dev_deactivate_queue() is
      not applicable either, and ingress_cl_list provides similar behaviour.
      In other words, the ingress qdisc acts like a TCQ_F_BUILTIN qdisc.
      
      One next possible step is the removal of the dev's ingress (dummy)
      netdev_queue, and to only have the list member in the netdevice
      itself.
      
      Note, the filter chain is RCU protected and individual filter elements
      are being kfree'd by sched subsystem after RCU grace period. RCU read
      lock is being held by __netif_receive_skb_core().
      
      Joint work with Alexei Starovoitov.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 30 Apr, 2015 1 commit
    • bridge/nl: remove wrong use of NLM_F_MULTI · 46c264da
      Nicolas Dichtel committed
      NLM_F_MULTI must be used only when an NLMSG_DONE message is sent. In fact,
      NLMSG_DONE is sent only at the end of a dump.
      
      Otherwise, libraries like libnl will wait forever for NLMSG_DONE.
      
      Fixes: e5a55a89 ("net: create generic bridge ops")
      Fixes: 815cccbf ("ixgbe: add setlink, getlink support to ixgbe and ixgbevf")
      CC: John Fastabend <john.r.fastabend@intel.com>
      CC: Sathya Perla <sathya.perla@emulex.com>
      CC: Subbu Seetharaman <subbu.seetharaman@emulex.com>
      CC: Ajit Khaparde <ajit.khaparde@emulex.com>
      CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      CC: intel-wired-lan@lists.osuosl.org
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Scott Feldman <sfeldma@gmail.com>
      CC: Stephen Hemminger <stephen@networkplumber.org>
      CC: bridge@lists.linux-foundation.org
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
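
Why a stray NLM_F_MULTI makes a receiver wait can be sketched with a toy reply parser. This is a minimal sketch under stated assumptions: the two constants carry the same values as in `<linux/netlink.h>` but are repeated so the file builds anywhere, while `struct sketch_nlmsg`, `SKETCH_MSG_DATA` and `consume_reply()` are hypothetical and only model the receiver behaviour the commit attributes to libraries like libnl.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same values as <linux/netlink.h>, repeated so the sketch builds anywhere. */
#ifndef NLM_F_MULTI
#define NLM_F_MULTI 0x2
#endif
#ifndef NLMSG_DONE
#define NLMSG_DONE 0x3
#endif

#define SKETCH_MSG_DATA 16 /* hypothetical payload message type */

struct sketch_nlmsg {
	uint16_t type;
	uint16_t flags;
};

/*
 * Consume one reply from a queue of n messages.
 * Returns how many messages the reply spans, or -1 when the queue runs dry
 * while the receiver still expects NLMSG_DONE (a real receiver would block).
 */
static int consume_reply(const struct sketch_nlmsg *q, size_t n)
{
	size_t i;

	assert(q != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (q[i].type == NLMSG_DONE)
			return (int)(i + 1); /* end of a multipart dump */
		if (!(q[i].flags & NLM_F_MULTI))
			return (int)(i + 1); /* single-part reply is complete */
	}
	return -1;
}
```

A single message without the flag completes immediately. The same message with NLM_F_MULTI set and no trailing NLMSG_DONE yields -1 here, which corresponds to the endless wait described in the commit.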
  6. 23 Apr, 2015 2 commits
  7. 18 Apr, 2015 1 commit
  8. 14 Apr, 2015 2 commits
  9. 11 Apr, 2015 1 commit
  10. 08 Apr, 2015 2 commits
  11. 07 Apr, 2015 1 commit
    • ipv6: protect skb->sk accesses from recursive dereference inside the stack · f60e5990
      hannes@stressinduktion.org committed
      We should not consult skb->sk for output decisions in xmit recursion
      levels > 0 in the stack. Otherwise, local socket settings could influence
      the result of, e.g., the tunnel encapsulation process.
      
      ipv6 does not conform to this in three places:
      
      1) ip6_fragment: we do consult ipv6_npinfo for frag_size
      
      2) sk_mc_loop in ipv6 uses skb->sk and checks if we should
         loop the packet back to the local socket
      
      3) ip6_skb_dst_mtu could query the settings from the user socket and
         force a wrong MTU
      
      Furthermore:
      In sk_mc_loop we could potentially land in WARN_ON(1) if we use a
      PF_PACKET socket on top of an IPv6-backed vxlan device.
      
      Reuse xmit_recursion as we are currently only interested in protecting
      tunnel devices.
      
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
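
The rule "only consult skb->sk at recursion level 0" can be sketched as a small guard. This is a minimal sketch, not the kernel change itself: `struct sketch_sock`, `struct sketch_skb`, `output_sk()` and `frag_limit()` are hypothetical, and a plain global replaces the per-CPU xmit_recursion counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-socket setting that must not leak into nested xmit. */
struct sketch_sock {
	unsigned int frag_size; /* 0 means "not set" */
};

struct sketch_skb {
	const struct sketch_sock *sk;
};

/* Per-CPU in the kernel; a plain global is enough for the sketch. */
static int xmit_recursion;

/* Only the outermost transmit (level 0) may look at the socket. */
static const struct sketch_sock *output_sk(const struct sketch_skb *skb)
{
	assert(skb != NULL);
	return xmit_recursion > 0 ? NULL : skb->sk;
}

/* Fragment size limit: the socket's frag_size applies only at level 0. */
static unsigned int frag_limit(const struct sketch_skb *skb, unsigned int mtu)
{
	const struct sketch_sock *sk = output_sk(skb);

	if (sk && sk->frag_size && sk->frag_size < mtu)
		return sk->frag_size;
	return mtu;
}
```

At level 0 the socket's `frag_size` limits fragmentation. Once the packet re-enters the stack through a tunnel device (level > 0), the same skb falls back to the device MTU.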
  12. 04 Apr, 2015 1 commit
  13. 03 Apr, 2015 2 commits
  14. 30 Mar, 2015 1 commit
  15. 25 Mar, 2015 1 commit
  16. 19 Mar, 2015 3 commits
  17. 18 Mar, 2015 1 commit
  18. 16 Mar, 2015 2 commits
  19. 13 Mar, 2015 2 commits
  20. 10 Mar, 2015 1 commit
  21. 06 Mar, 2015 1 commit
  22. 03 Mar, 2015 1 commit
  23. 21 Feb, 2015 1 commit
    • net: Initialize all members in skb_gro_remcsum_init() · 846cd667
      Geert Uytterhoeven committed
      skb_gro_remcsum_init() initializes the gro_remcsum.delta member only,
      leading to compiler warnings about a possibly uninitialized
      gro_remcsum.offset member:
      
      drivers/net/vxlan.c: In function ‘vxlan_gro_receive’:
      drivers/net/vxlan.c:602: warning: ‘grc.offset’ may be used uninitialized in this function
      net/ipv4/fou.c: In function ‘gue_gro_receive’:
      net/ipv4/fou.c:262: warning: ‘grc.offset’ may be used uninitialized in this function
      
      While these are harmless for now:
        - skb_gro_remcsum_process() sets offset before changing delta,
        - skb_gro_remcsum_cleanup() checks if delta is non-zero before
          accessing offset,
      it's safer to let the initialization function initialize all members.
      Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  24. 12 Feb, 2015 4 commits
    • net: Infrastructure for CHECKSUM_PARTIAL with remote checsum offload · 15e2396d
      Tom Herbert committed
      This patch adds infrastructure so that remote checksum offload can
      set CHECKSUM_PARTIAL instead of calling csum_partial and writing
      the modified checksum field.
      
      Add the skb_remcsum_adjust_partial function to set up an skb for using
      CHECKSUM_PARTIAL with remote checksum offload. Change
      skb_remcsum_process and skb_gro_remcsum_process to take a boolean
      argument that indicates whether CHECKSUM_PARTIAL can be set or the
      checksum needs to be modified using the normal algorithm.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Use more bit fields in napi_gro_cb · baa32ff4
      Tom Herbert committed
      This patch moves the free and same_flow fields to be bit fields
      (2 and 1 bit sized respectively). This frees up some space for u16's.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Clarify meaning of CHECKSUM_PARTIAL for receive path · 6edec0e6
      Tom Herbert committed
      The current meaning of CHECKSUM_PARTIAL for validating checksums
      is that _all_ checksums in the packet are considered valid.
      However, given the way CHECKSUM_PARTIAL is set, only the checksum
      at csum_start+csum_offset and any preceding checksums may
      be considered valid. If there are checksums in the packet after
      csum_offset, it is possible they have not been verified.
      
      This patch changes the CHECKSUM_PARTIAL logic in skb_csum_unnecessary and
      __skb_gro_checksum_validate_needed to only consider checksums
      referring to csum_start and any preceding checksums (with starting
      offset before csum_start) to be verified.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
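
The narrowed meaning of CHECKSUM_PARTIAL on receive can be sketched as a predicate over byte offsets. This is a minimal sketch of the rule as the commit message states it, not the kernel's skb_csum_unnecessary(): the enum, `csum_considered_verified()` and its parameters are hypothetical, and the CHECKSUM_UNNECESSARY case is simplified to "always verified".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the skb->ip_summed states. */
enum sketch_ip_summed {
	SKETCH_CSUM_NONE,
	SKETCH_CSUM_UNNECESSARY,
	SKETCH_CSUM_PARTIAL
};

/*
 * hdr_offset: where the header whose checksum is being validated starts.
 * csum_start: where the CHECKSUM_PARTIAL checksum coverage starts.
 * Both are byte offsets from the start of the packet.
 */
static bool csum_considered_verified(enum sketch_ip_summed summed,
				     int csum_start, int hdr_offset)
{
	assert(csum_start >= 0 && hdr_offset >= 0);
	if (summed == SKETCH_CSUM_UNNECESSARY)
		return true; /* simplification: no checksum levels modelled */
	if (summed == SKETCH_CSUM_PARTIAL)
		return hdr_offset <= csum_start; /* at or before csum_start */
	return false;
}
```

For an encapsulated packet with `csum_start` at offset 42, a header starting at offset 34 or 42 counts as verified, while one starting at offset 50 still needs validation.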
    • net: Fix remcsum in GRO path to not change packet · 26c4f7da
      Tom Herbert committed
      Remote checksum offload processing is currently the same for both
      the GRO and non-GRO path. When the remote checksum offload option
      is encountered, the checksum field referred to is modified in
      the packet. So in the GRO case, the packet is modified in the
      GRO path and then the operation is skipped when the packet goes
      through the normal path based on skb->remcsum_offload. There is
      a problem in that the packet may be modified in the GRO path, but
      then forwarded off host still containing the remote checksum option.
      A remote host will again perform RCO but now the checksum verification
      will fail since GRO RCO already modified the checksum.
      
      To fix this, we ensure that GRO restores a packet to its original
      state before returning. In this model, when GRO processes a remote
      checksum option it still changes the checksum per the algorithm,
      but on return from lower layer processing the checksum is restored
      to its original value.
      
      In this patch we define a gro_remcsum structure, which is passed
      to skb_gro_remcsum_process to save the offset and delta for the checksum
      being changed. After lower layer processing, skb_gro_remcsum_cleanup
      is called to restore the checksum before returning from GRO.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
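
The save-and-restore pattern around the checksum field can be sketched on a plain byte buffer. This is a minimal sketch with loud simplifications: the delta uses wrap-around 16-bit subtraction rather than the kernel's ones' complement checksum arithmetic, and all `sketch_` names, including the replacement checksum value, are hypothetical. Only the shape (init, process records offset and delta, cleanup undoes the change) follows the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct sketch_gro_remcsum {
	int offset;     /* where in the packet the checksum field lives */
	uint16_t delta; /* simplified: new value minus old value, mod 2^16 */
};

/* Initialize every member, so nothing is read uninitialized later. */
static void sketch_remcsum_init(struct sketch_gro_remcsum *grc)
{
	grc->offset = 0;
	grc->delta = 0;
}

/* Overwrite the checksum field and remember how to undo the change. */
static void sketch_remcsum_process(uint8_t *pkt, int offset, uint16_t new_csum,
				   struct sketch_gro_remcsum *grc)
{
	uint16_t old;

	assert(offset >= 0);
	memcpy(&old, pkt + offset, sizeof(old));
	memcpy(pkt + offset, &new_csum, sizeof(new_csum));
	grc->offset = offset; /* offset is set before delta */
	grc->delta = (uint16_t)(new_csum - old);
}

/* Restore the original field; a zero delta means nothing was changed. */
static void sketch_remcsum_cleanup(uint8_t *pkt,
				   const struct sketch_gro_remcsum *grc)
{
	uint16_t cur;

	if (!grc->delta)
		return;
	memcpy(&cur, pkt + grc->offset, sizeof(cur));
	cur = (uint16_t)(cur - grc->delta);
	memcpy(pkt + grc->offset, &cur, sizeof(cur));
}

/* Round trip: the packet changes during processing and is restored after. */
static bool sketch_roundtrip_restores(void)
{
	uint8_t pkt[8] = { 0, 1, 2, 3, 0xab, 0xcd, 6, 7 };
	uint8_t orig[8];
	struct sketch_gro_remcsum grc;
	bool changed;

	memcpy(orig, pkt, sizeof(pkt));
	sketch_remcsum_init(&grc);
	sketch_remcsum_process(pkt, 4, 0x1234, &grc);
	changed = memcmp(orig, pkt, sizeof(pkt)) != 0;
	sketch_remcsum_cleanup(pkt, &grc);
	return changed && memcmp(orig, pkt, sizeof(pkt)) == 0;
}
```

Because cleanup returns early on a zero delta, it never reads `offset` unless process has run, which is the same ordering argument that 846cd667 gives for why the uninitialized-member warnings were harmless.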
  25. 09 Feb, 2015 2 commits
    • net:rfs: adjust table size checking · 93c1af6c
      Eric Dumazet committed
      Make sure root user does not try something stupid.
      
      Also make sure mask field in struct rps_sock_flow_table
      does not share a cache line with the potentially often dirtied
      flow table.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Fixes: 567e4b79 ("net: rfs: add hash collision detection")
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: rfs: add hash collision detection · 567e4b79
      Eric Dumazet committed
      Receive Flow Steering is a nice solution but suffers from
      hash collisions when a mix of connected and unconnected traffic
      is received on the host, when flow hash table is populated.
      
      Also, clearing flow in inet_release() makes RFS not very good
      for short lived flows, as many packets can follow close().
      (FIN , ACK packets, ...)
      
      This patch extends the information stored into global hash table
      to not only include cpu number, but upper part of the hash value.
      
      I use a 32bit value, and dynamically split it in two parts.
      
      For hosts with fewer than 64 possible cpus, this gives 6 bits for the
      cpu number, and 26 (32-6) bits for the upper part of the hash.
      
      Since hash bucket selection uses the low order bits of the hash, we have
      a full hash match if /proc/sys/net/core/rps_sock_flow_entries is big
      enough.
      
      If the hash found in the flow table does not match, we fall back to RPS (if
      it is enabled for the rxqueue).
      
      This means that a packet for a non-connected flow can avoid the
      IPI through an unrelated/victim CPU.
      
      This also means we no longer have to clear the table at socket
      close time, and this helps short lived flows performance.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
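
The split of one 32-bit table entry into a cpu number and the upper hash bits can be sketched as follows. This is a minimal sketch under stated assumptions: the 6-bit cpu field follows the commit's own example, while the 64-entry table and the names `flow_table`, `record_flow()` and `lookup_flow()` are hypothetical and not the kernel's.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_CPU_BITS   6u /* cpu field width from the commit's example */
#define SKETCH_CPU_MASK   ((1u << SKETCH_CPU_BITS) - 1u)
#define SKETCH_TABLE_SIZE 64u /* hypothetical size, must be a power of two */

static uint32_t flow_table[SKETCH_TABLE_SIZE];

/* Store the upper 26 hash bits together with the cpu in one 32-bit entry. */
static void record_flow(uint32_t hash, uint32_t cpu)
{
	assert(cpu <= SKETCH_CPU_MASK);
	flow_table[hash & (SKETCH_TABLE_SIZE - 1u)] =
		(hash & ~SKETCH_CPU_MASK) | cpu;
}

/*
 * Returns the recorded cpu, or -1 when the upper hash bits differ, in which
 * case the caller would fall back to RPS.
 */
static int lookup_flow(uint32_t hash)
{
	uint32_t ident = flow_table[hash & (SKETCH_TABLE_SIZE - 1u)];

	if ((ident ^ hash) & ~SKETCH_CPU_MASK)
		return -1; /* collision: entry belongs to another flow */
	return (int)(ident & SKETCH_CPU_MASK);
}
```

With 64 buckets the bucket index already pins down the low 6 bits of the hash, so comparing the stored upper 26 bits amounts to a full 32-bit match. Two hashes that share a bucket but differ in their upper bits, such as `0xdeadbeef` and `0x1eadbeef`, no longer steer each other's packets.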