1. 17 Jan, 2014 7 commits
    • net: Check skb->rxhash in gro_receive · 0b4cec8c
      Tom Herbert authored
      When initializing a gro_list for a packet, first check the rxhash of
      the incoming skb against that of the skbs in the list. This should be
      a very strong indicator of whether the flow is going to be matched,
      and potentially allows a lot of other checks to be short circuited.
      Use skb_hash_raw so that we don't force the hash to be calculated.
      
      Tested by running netperf 200 TCP_STREAMs between two machines with
      GRO, HW rxhash, and 1G. Saw no performance degradation, slight reduction
      of time in dev_gro_receive.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
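The hash pre-check described in this commit can be sketched in plain C. This is a user-space illustration under stated assumptions, not the kernel code: `struct pkt`, its fields, and `mark_flow_candidates()` are hypothetical stand-ins for the skb, its raw receive hash, and the per-skb same-flow marker.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an skb held on a GRO list. */
struct pkt {
    uint32_t hash;      /* raw receive hash as delivered with the packet */
    int same_flow;      /* 1 while the packet is still a merge candidate */
    struct pkt *next;
};

/*
 * Compare the raw hash of the incoming packet with every held packet.
 * A mismatch rules the held packet out at once, so the later, more
 * expensive header comparisons can be skipped for it.
 * Returns the number of held packets that remain candidates.
 */
static size_t mark_flow_candidates(struct pkt *list, const struct pkt *incoming)
{
    size_t candidates = 0;

    for (struct pkt *p = list; p != NULL; p = p->next) {
        if (p->hash != incoming->hash) {
            p->same_flow = 0;
            continue;
        }
        p->same_flow = 1;
        candidates++;
    }
    return candidates;
}
```

A matching hash does not prove that two packets belong to the same flow, so packets that stay marked would still need the regular per-protocol checks.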
    • packet: use percpu mmap tx frame pending refcount · b0138408
      Daniel Borkmann authored
      In PF_PACKET's packet mmap(), we can avoid using one atomic_inc()
      and one atomic_dec() call in skb destructor and use a percpu
      reference count instead in order to determine if packets are
      still pending to be sent out. Micro-benchmark with [1] that has
      been slightly modified (that is, protocol = 0 in socket(2) and
      bind(2)), example on a rather crappy testing machine; I expect
      it to scale and have even better results on bigger machines:
      
      ./packet_mm_tx -s7000 -m7200 -z700000 em1, avg over 2500 runs:
      
      With patch:    4,022,015 cyc
      Without patch: 4,812,994 cyc
      
      time ./packet_mm_tx -s64 -c10000000 em1 > /dev/null, stable:
      
      With patch:
        real         1m32.241s
        user         0m0.287s
        sys          1m29.316s
      
      Without patch:
        real         1m38.386s
        user         0m0.265s
        sys          1m35.572s
      
      In function tpacket_snd(), it is okay to use packet_read_pending()
      since in fast-path we short-circuit the condition already with
      ph != NULL, since we have next frames to process. In case we have
      MSG_DONTWAIT, we also do not execute this path as need_wait is
      false here anyway, and in case of _no_ MSG_DONTWAIT flag, it is
      okay to call a packet_read_pending(), because when we ever reach
      that path, we're done processing outgoing frames anyway and only
      look if there are skbs still outstanding to be orphaned. We can
      stay lockless in this percpu counter since it's acceptable when we
      reach this path for the sum to be imprecise first, but we'll level
      out at 0 after all pending frames have reached the skb destructor
      eventually through tx reclaim. When people pin a tx process to
      particular CPUs, we expect overflows to happen in the reference
      counter as on one CPU we expect heavy increase; and distributed
      through ksoftirqd on all CPUs a decrease, for example. As
      David Laight points out, since the C language doesn't define the
      result of signed int overflow (i.e. rather than wrap, it is
      allowed to saturate as a possible outcome), we have to use
      unsigned int as reference count. The sum over all CPUs when tx
      is complete will result in 0 again.
      
      The BUG_ON() in tpacket_destruct_skb() we can remove as well. It
      can _only_ be set from inside tpacket_snd() path and we made sure
      to increase tx_ring.pending in any case before we called po->xmit(skb).
      So testing for tx_ring.pending == 0 is not too useful. Instead, it
      would rather have been useful to test if lower layers didn't orphan
      the skb so that we're missing ring slots being put back to
      TP_STATUS_AVAILABLE. But such a bug will be caught in user space
      already as we end up realizing that we do not have any
      TP_STATUS_AVAILABLE slots left anymore. Therefore, we're all set.
      
      Btw, in case of RX_RING path, we do not make use of the pending
      member, therefore we also don't need to use up any percpu memory
      here. Also note that __alloc_percpu() already returns a zero-filled
      percpu area, so initialization is done already.
      
        [1] http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap

      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
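The wraparound argument in this commit message can be illustrated with a small user-space sketch. It assumes a fixed, hypothetical CPU count (`NR_CPUS_SKETCH`) and plain array slots in place of real percpu memory; `struct pending_refcnt` and the `pending_*()` helpers are names invented for the illustration.

```c
#include <stddef.h>

#define NR_CPUS_SKETCH 4   /* hypothetical CPU count for the illustration */

/* One unsigned counter per CPU; each CPU only ever touches its own slot. */
struct pending_refcnt {
    unsigned int cnt[NR_CPUS_SKETCH];
};

/* Sender side: one more frame is in flight. */
static void pending_inc(struct pending_refcnt *r, size_t cpu)
{
    r->cnt[cpu]++;
}

/* Destructor side: may run on a different CPU than the increment did. */
static void pending_dec(struct pending_refcnt *r, size_t cpu)
{
    r->cnt[cpu]--;          /* unsigned: wraps below zero, well defined */
}

/* Sum over all slots; unsigned addition wraps modulo UINT_MAX + 1. */
static unsigned int pending_read(const struct pending_refcnt *r)
{
    unsigned int sum = 0;

    for (size_t i = 0; i < NR_CPUS_SKETCH; i++)
        sum += r->cnt[i];
    return sum;
}
```

If every increment happens on slot 0 and the decrements are spread over the other slots, individual slots wrap around, yet the sum returns to 0 once each increment has been paired with a decrement. That property depends on the counters being unsigned, as the commit message notes.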
    • packet: don't unconditionally schedule() in case of MSG_DONTWAIT · 87a2fd28
      Daniel Borkmann authored
      In tpacket_snd(), when we've discovered a first frame that is
      not in status TP_STATUS_SEND_REQUEST, and return a NULL buffer,
      we exit the send routine in case of MSG_DONTWAIT, since we've
      finished traversing the mmaped send ring buffer and don't care
      about pending frames.
      
      While doing so, we still unconditionally call an expensive
      schedule() in the packet_current_frame() "error" path, which
      is unnecessary in this case since it's enough to just quit
      the function.
      
      Also, in case MSG_DONTWAIT is not set, we should rather test
      for need_resched() first and do schedule() only if necessary
      since meanwhile pending frames could already have finished
      processing and called skb destructor.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
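The two decisions described in this commit message can be written as small predicates. This is a sketch of the reasoning only, assuming `need_wait` means "MSG_DONTWAIT is not set" (the name appears in the neighbouring commit message); the function names and the `pending` parameter are hypothetical.

```c
#include <stdbool.h>

/*
 * Called when no frame in TP_STATUS_SEND_REQUEST was found.
 * Yield the CPU only if the caller is willing to wait and the
 * scheduler has actually asked for a reschedule.
 */
static bool should_yield(bool need_wait, bool resched_needed)
{
    return need_wait && resched_needed;
}

/*
 * Loop condition of the send routine: keep going while there is a
 * frame to process, or while a blocking caller still has frames
 * outstanding that have not reached the skb destructor.
 */
static bool keep_sending(bool have_frame, bool need_wait, unsigned int pending)
{
    return have_frame || (need_wait && pending != 0);
}
```

With MSG_DONTWAIT set, both predicates are false once the ring has no more frames to send, so the routine leaves without yielding.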
    • packet: improve socket create/bind latency in some cases · 902fefb8
      Daniel Borkmann authored
      Most people acquire PF_PACKET sockets with a protocol argument in
      the socket call, e.g. libpcap does so with htons(ETH_P_ALL) for
      all its sockets. Most likely, at some point in time a subsequent
      bind() call will follow, e.g. in libpcap with ...
      
        memset(&sll, 0, sizeof(sll));
        sll.sll_family          = AF_PACKET;
        sll.sll_ifindex         = ifindex;
        sll.sll_protocol        = htons(ETH_P_ALL);
      
      ... as arguments. What happens in the kernel is that already
      in socket() syscall, we install a proto hook via register_prot_hook()
      if our protocol argument is != 0. Yet, in bind() we're almost
      doing the same work by doing a unregister_prot_hook() with an
      expensive synchronize_net() call in case during socket() the proto
      was != 0, plus follow-up register_prot_hook() with a bound device
      to it this time, in order to limit traffic we get.
      
      In the case when the protocol and user supplied device index (== 0)
      do not change from socket() to bind(), we can spare ourselves doing
      the same work twice. Similarly for re-binding to the same device
      and protocol. For these scenarios, we can decrease create/bind
      latency from ~7447us (sock-bind-2 case) to ~89us (sock-bind-1 case)
      with this patch.
      
      Alternatively, for the first case, if people care, they should
      simply create their sockets with proto == 0 argument and define
      the protocol during bind() as this saves a call to synchronize_net()
      as well (sock-bind-3 case).
      
      In all other cases, we're tied to user space behaviour we must not
      change, also since a bind() is not strictly required. Thus, we need
      the synchronize_net() to make sure no asynchronous packet processing
      paths still refer to the previous elements of po->prot_hook.
      
      In case of mmap()ed sockets, the workflow that includes bind() is
      socket() -> setsockopt(<ring>) -> bind(). In that case, a pair of
      {__unregister, register}_prot_hook is being called from setsockopt()
      in order to install the new protocol receive handler. Thus, when
      we call bind and can skip a re-hook, we have already previously
      installed the new handler. For fanout, this is handled entirely
      differently, so we should be good.
      
      Timings on an i7-3520M machine:
      
        * sock-bind-1:   89 us
        * sock-bind-2: 7447 us
        * sock-bind-3:   75 us
      
      sock-bind-1:
        socket(PF_PACKET, SOCK_RAW, htons(ETH_P_IP)) = 3
        bind(3, {sa_family=AF_PACKET, proto=htons(ETH_P_IP), if=all(0),
                 pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0
      
      sock-bind-2:
        socket(PF_PACKET, SOCK_RAW, htons(ETH_P_IP)) = 3
        bind(3, {sa_family=AF_PACKET, proto=htons(ETH_P_IP), if=lo(1),
                 pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0
      
      sock-bind-3:
        socket(PF_PACKET, SOCK_RAW, 0) = 3
        bind(3, {sa_family=AF_PACKET, proto=htons(ETH_P_IP), if=lo(1),
                 pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
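The skip condition described in this commit message boils down to one comparison. The sketch below is a user-space illustration, not the kernel code: `struct hook_state` and `bind_needs_rehook()` are hypothetical names, and the device is represented by its interface index for simplicity.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of what a packet socket is currently hooked to. */
struct hook_state {
    uint16_t proto;   /* protocol in network byte order, 0 = none */
    int ifindex;      /* bound device index, 0 = all devices */
};

/*
 * bind() has to unregister and re-register the protocol hook only
 * when the protocol or the device actually changes.
 */
static bool bind_needs_rehook(const struct hook_state *cur,
                              uint16_t new_proto, int new_ifindex)
{
    return cur->proto != new_proto || cur->ifindex != new_ifindex;
}
```

In the sock-bind-1 trace nothing changes, so the predicate is false and the expensive path is skipped. In sock-bind-2 the device changes, so it is true. In sock-bind-3 it is also true, but according to the commit message no hook was installed by socket() with proto == 0, so the synchronize_net() call is still avoided.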
    • net/ipv4: don't use module_init in non-modular gre_offload · cf172283
      Paul Gortmaker authored
      Recent commit 438e38fa
      ("gre_offload: statically build GRE offloading support") added
      new module_init/module_exit calls to the gre_offload.c file.
      
      The file is obj-y and can't be anything other than built-in.
      Currently it can never be built modular, so using module_init
      as an alias for __initcall can be somewhat misleading.
      
      Fix this up now, so that we can relocate module_init from
      init.h into module.h in the future.  If we don't do this, we'd
      have to add module.h to obviously non-modular code, and that
      would be a worse thing.  We also make the inclusion explicit.
      
      Note that direct use of __initcall is discouraged, vs. one
      of the priority categorized subgroups.  As __initcall gets
      mapped onto device_initcall, our use of device_initcall
      directly in this change means that the runtime impact is
      zero -- it will remain at level 6 in initcall ordering.
      
      As for the module_exit, rather than replace it with __exitcall,
      we simply remove it, since it appears only UML does anything
      with those, and even for UML, there is no relevant cleanup
      to be done here.
      
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: eth_type_trans() should use skb_header_pointer() · 0864c158
      Eric Dumazet authored
      eth_type_trans() can read uninitialized memory as drivers
      do not necessarily pull more than 14 bytes in skb->head before
      calling it.
      
      As David suggested, we can use skb_header_pointer() to
      fix this without breaking some drivers that might not expect
      eth_type_trans() pulling 2 additional bytes.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
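The idea behind skb_header_pointer() is to read a few bytes safely whether or not they sit in the linear part of the buffer. The following is a user-space analogue under simplifying assumptions, not the kernel helper itself: `struct pkt_buf` (a linear head plus a single fragment) and `header_pointer()` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical packet: a linear head plus one extra fragment. */
struct pkt_buf {
    const uint8_t *head;
    size_t head_len;
    const uint8_t *frag;
    size_t frag_len;
};

/*
 * Return a pointer to len bytes at offset. If the bytes are fully in
 * the linear head, point straight at them. Otherwise gather them into
 * the caller's tmp buffer. Return NULL if the packet is too short.
 */
static const void *header_pointer(const struct pkt_buf *p, size_t offset,
                                  size_t len, void *tmp)
{
    size_t total = p->head_len + p->frag_len;

    if (offset > total || len > total - offset)
        return NULL;                       /* not enough data at all */
    if (offset + len <= p->head_len)
        return p->head + offset;           /* fully linear: no copy */

    /* Range spans into, or lies inside, the fragment: copy it out. */
    uint8_t *out = tmp;
    size_t copied = 0;

    if (offset < p->head_len) {
        copied = p->head_len - offset;
        memcpy(out, p->head + offset, copied);
        offset = p->head_len;
    }
    memcpy(out + copied, p->frag + (offset - p->head_len), len - copied);
    return tmp;
}
```

With a 14-byte linear head, a 2-byte read at offset 14 is served from the fragment through the temporary buffer, so the caller never reads past the linear area and the packet itself is left unmodified.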
    • neigh: use NEIGH_VAR_INIT in ndo_neigh_setup functions. · 89740ca7
      Jiri Pirko authored
      When ndo_neigh_setup is called, the bitfield used by NEIGH_VAR_SET is
      not initialized yet. This might cause confusion for the people who use
      NEIGH_VAR_SET in ndo_neigh_setup. So rather introduce NEIGH_VAR_INIT for
      usage in ndo_neigh_setup.
      Signed-off-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 16 Jan, 2014 9 commits
  3. 15 Jan, 2014 20 commits
  4. 14 Jan, 2014 4 commits
    • netfilter: Add dependency on IPV6 for NF_TABLES_INET · 419331d8
      Paul Gortmaker authored
      Commit 1d49144c ("netfilter: nf_tables: add "inet" table for
      IPv4/IPv6") allows creation of non-IPV6 enabled .config files that
      will fail to configure/link as follows:
      
      warning: (NF_TABLES_INET) selects NF_TABLES_IPV6 which has unmet direct dependencies (NET && INET && IPV6 && NETFILTER && NF_TABLES)
      warning: (NF_TABLES_INET) selects NF_TABLES_IPV6 which has unmet direct dependencies (NET && INET && IPV6 && NETFILTER && NF_TABLES)
      warning: (NF_TABLES_INET) selects NF_TABLES_IPV6 which has unmet direct dependencies (NET && INET && IPV6 && NETFILTER && NF_TABLES)
      net/built-in.o: In function `nft_reject_eval':
      nft_reject.c:(.text+0x651e8): undefined reference to `nf_ip6_checksum'
      nft_reject.c:(.text+0x65270): undefined reference to `ip6_route_output'
      nft_reject.c:(.text+0x656c4): undefined reference to `ip6_dst_hoplimit'
      make: *** [vmlinux] Error 1
      
      Since the feature is to allow for a mixed IPV4 and IPV6 table, it
      seems sensible to make it depend on IPV6.
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Acked-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • bridge: move br_net_exit() to br.c · b86f81cc
      WANG Cong authored
      And it can become static.
      
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet_diag: fix inet_diag_dump_icsk() to use correct state for timewait sockets · 70315d22
      Neal Cardwell authored
      Fix inet_diag_dump_icsk() to reflect the fact that both TCP_TIME_WAIT
      and TCP_FIN_WAIT2 connections are represented by inet_timewait_sock
      (not just TIME_WAIT), and for such sockets the tw_substate field holds
      the real state, which can be either TCP_TIME_WAIT or TCP_FIN_WAIT2.
      
      This brings the inet_diag state-matching code in line with the field
      it uses to populate idiag_state. This is also analogous to the info
      exported in /proc/net/tcp, where get_tcp4_sock() exports sk->sk_state
      and get_timewait4_sock() exports tw->tw_substate.
      
      Before fixing this, (a) neither "ss -nemoi" nor "ss -nemoi state
      fin-wait-2" would return a socket in TCP_FIN_WAIT2; and (b) "ss -nemoi
      state time-wait" would also return sockets in state TCP_FIN_WAIT2.
      
      This is an old bug that predates 05dbc7b5 ("tcp/dccp: remove twchain").
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
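The state-matching rule described in this commit message can be sketched as follows. This is a user-space illustration, not the kernel code: `struct sock_sketch`, `diag_state()`, and `diag_matches()` are hypothetical, and the sketch assumes that every timewait-style socket reports TCP_TIME_WAIT in its main state field while `tw_substate` carries the real state. The numeric values follow the kernel's TCP state numbering.

```c
#include <stdbool.h>

/* Subset of the TCP state numbers, using the kernel's values. */
enum { TCP_ESTABLISHED = 1, TCP_FIN_WAIT2 = 5, TCP_TIME_WAIT = 6 };

/* Hypothetical stand-in for a socket seen by the diag dump. */
struct sock_sketch {
    int sk_state;     /* TCP_TIME_WAIT for every timewait-style socket */
    int tw_substate;  /* real state; only used when sk_state is TIME_WAIT */
};

/* State to report and to filter on. */
static int diag_state(const struct sock_sketch *sk)
{
    return sk->sk_state == TCP_TIME_WAIT ? sk->tw_substate : sk->sk_state;
}

/* wanted_states is a bitmask with bit (1 << state) set per wanted state. */
static bool diag_matches(const struct sock_sketch *sk, unsigned int wanted_states)
{
    return (wanted_states & (1u << diag_state(sk))) != 0;
}
```

Filtering on the derived state means a FIN_WAIT2 connection held in a timewait socket matches a fin-wait-2 filter and no longer matches a time-wait filter, which is the behaviour change the commit describes.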
    • net: make dev_set_mtu() honor notification return code · 2315dc91
      Veaceslav Falico authored
      Currently, after changing the MTU for a device, dev_set_mtu() calls
      the NETDEV_CHANGEMTU notification, but doesn't verify its return code,
      which can be NOTIFY_BAD (i.e. some of the net notifier blocks refused
      this change), and continues nevertheless.
      
      To fix this, verify the return code and, if it's an error, revert the
      MTU to the original one, notify again and pass the error code.
      
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      CC: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
      Reviewed-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
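The set, notify, and roll back sequence described in this commit message can be sketched in user space. All names here are hypothetical: `struct dev_sketch`, the `mtu_notifier_fn` callback type, `set_mtu_checked()`, and the sample listener `reject_jumbo()` stand in for the device, the notifier chain, and one notifier block that refuses large MTUs.

```c
#include <errno.h>

/* Hypothetical stand-in for a network device. */
struct dev_sketch {
    int mtu;
};

/* Hypothetical listener: returns 0 to accept, a negative errno to refuse. */
typedef int (*mtu_notifier_fn)(const struct dev_sketch *dev);

/* Sample listener that refuses any MTU above 1500. */
static int reject_jumbo(const struct dev_sketch *dev)
{
    return dev->mtu > 1500 ? -EINVAL : 0;
}

/*
 * Apply the new MTU, ask the listener, and if it refuses, restore the
 * original MTU, notify again so listeners see the rollback, and pass
 * the error back to the caller.
 */
static int set_mtu_checked(struct dev_sketch *dev, int new_mtu,
                           mtu_notifier_fn notify)
{
    int orig_mtu = dev->mtu;
    int err;

    dev->mtu = new_mtu;
    err = notify(dev);
    if (err) {
        dev->mtu = orig_mtu;
        notify(dev);
    }
    return err;
}
```

Starting from an MTU of 1500, a request for 9000 is refused by the sample listener and the device keeps 1500, while a request for 1400 succeeds.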