1. 19 12月, 2013 10 次提交
  2. 18 12月, 2013 14 次提交
    • A
      packet: deliver VLAN TPID to userspace · a0cdfcf3
      Atzm Watanabe 提交于
      This enables userspace to get VLAN TPID as well as the VLAN TCI.
      Signed-off-by: NAtzm Watanabe <atzm@stratosphere.co.jp>
      Acked-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a0cdfcf3
    • A
      packet: fill the gap of TPACKET_ALIGNMENT with zeros · e4d26f4b
      Atzm Watanabe 提交于
      struct tpacket{2,3}_hdr is aligned to a multiple of TPACKET_ALIGNMENT.
      Explicitly defining and zeroing the gap of this makes additional changes
      easier.
      Signed-off-by: NAtzm Watanabe <atzm@stratosphere.co.jp>
      Acked-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e4d26f4b
    • A
      packet: make aligned size of struct tpacket{2,3}_hdr clear · 51846355
      Atzm Watanabe 提交于
      struct tpacket{2,3}_hdr is aligned to a multiple of TPACKET_ALIGNMENT.
      We may add members to them until current aligned size without forcing
      userspace to call getsockopt(..., PACKET_HDRLEN, ...).
      Signed-off-by: NAtzm Watanabe <atzm@stratosphere.co.jp>
      Acked-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      51846355
    • T
      net: Add utility function to copy skb hash · 3df7a74e
      Tom Herbert 提交于
      Adds skb_copy_hash to copy rxhash and l4_rxhash from one skb to another.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3df7a74e
    • T
      net: Add utility functions to clear rxhash · 7539fadc
      Tom Herbert 提交于
      In several places 'skb->rxhash = 0' is being done to clear the
      rxhash value in an skb.  This does not clear l4_rxhash which could
      still be set so that the rxhash wouldn't be recalculated on subsequent
      call to skb_get_rxhash.  This patch adds an explict function to clear
      all the rxhash related information in the skb properly.
      
      skb_clear_hash_if_not_l4 clears the rxhash only if it is not marked as
      l4_rxhash.
      
      Fixed up places where 'skb->rxhash = 0' was being called.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7539fadc
    • T
      net: Change skb_get_rxhash to skb_get_hash · 3958afa1
      Tom Herbert 提交于
      Changing name of function as part of making the hash in skbuff to be
      generic property, not just for receive path.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3958afa1
    • W
      net/hsr: using kfree_rcu() to simplify the code · 1aee6cc2
      Wei Yongjun 提交于
      The callback function of call_rcu() just calls a kfree(), so we
      can use kfree_rcu() instead of call_rcu() + callback function.
      Signed-off-by: NWei Yongjun <yongjun_wei@trendmicro.com.cn>
      Acked-by: NArvid Brodin <arvid.brodin@alten.se>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1aee6cc2
    • B
      neigh: Netlink notification for administrative NUD state change · 53385d2d
      Bob Gilligan 提交于
      The neighbour code sends up an RTM_NEWNEIGH netlink notification if
      the NUD state of a neighbour cache entry is changed by a timer (e.g.
      from REACHABLE to STALE), even if the lladdr of the entry has not
      changed.
      
      But an administrative change to the the NUD state of a neighbour cache
      entry that does not change the lladdr (e.g. via "ip -4 neigh change
      ...  nud ...") does not trigger a netlink notification.  This means
      that netlink listeners will not hear about administrative NUD state
      changes such as from a resolved state to PERMANENT.
      
      This patch changes the neighbor code to generate an RTM_NEWNEIGH
      message when the NUD state of an entry is changed administratively.
      Signed-off-by: NBob Gilligan <gilligan@aristanetworks.com>
      Acked-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      53385d2d
    • E
      pkt_sched: fq: more robust memory allocation · c3bd8549
      Eric Dumazet 提交于
      This patch brings NUMA support and automatic fallback to vmalloc()
      in case kmalloc() failed to allocate FQ hash table.
      
      NUMA support depends on XPS being setup for the device before
      qdisc allocation. After a XPS change, it might be worth creating
      qdisc hierarchy again.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c3bd8549
    • E
      tcp: refine TSO splits · d4589926
      Eric Dumazet 提交于
      While investigating performance problems on small RPC workloads,
      I noticed linux TCP stack was always splitting the last TSO skb
      into two parts (skbs). One being a multiple of MSS, and a small one
      with the Push flag. This split is done even if TCP_NODELAY is set,
      or if no small packet is in flight.
      
      Example with request/response of 4K/4K
      
      IP A > B: . ack 68432 win 2783 <nop,nop,timestamp 6524593 6525001>
      IP A > B: . 65537:68433(2896) ack 69632 win 2783 <nop,nop,timestamp 6524593 6525001>
      IP A > B: P 68433:69633(1200) ack 69632 win 2783 <nop,nop,timestamp 6524593 6525001>
      IP B > A: . ack 68433 win 2768 <nop,nop,timestamp 6525001 6524593>
      IP B > A: . 69632:72528(2896) ack 69633 win 2768 <nop,nop,timestamp 6525001 6524593>
      IP B > A: P 72528:73728(1200) ack 69633 win 2768 <nop,nop,timestamp 6525001 6524593>
      IP A > B: . ack 72528 win 2783 <nop,nop,timestamp 6524593 6525001>
      IP A > B: . 69633:72529(2896) ack 73728 win 2783 <nop,nop,timestamp 6524593 6525001>
      IP A > B: P 72529:73729(1200) ack 73728 win 2783 <nop,nop,timestamp 6524593 6525001>
      
      We can avoid this split by including the Nagle tests at the right place.
      
      Note : If some NIC had trouble sending TSO packets with a partial
      last segment, we would have hit the problem in GRO/forwarding workload already.
      
      tcp_minshall_update() is moved to tcp_output.c and is updated as we might
      feed a TSO packet with a partial last segment.
      
      This patch tremendously improves performance, as the traffic now looks
      like :
      
      IP A > B: . ack 98304 win 2783 <nop,nop,timestamp 6834277 6834685>
      IP A > B: P 94209:98305(4096) ack 98304 win 2783 <nop,nop,timestamp 6834277 6834685>
      IP B > A: . ack 98305 win 2768 <nop,nop,timestamp 6834686 6834277>
      IP B > A: P 98304:102400(4096) ack 98305 win 2768 <nop,nop,timestamp 6834686 6834277>
      IP A > B: . ack 102400 win 2783 <nop,nop,timestamp 6834279 6834686>
      IP A > B: P 98305:102401(4096) ack 102400 win 2783 <nop,nop,timestamp 6834279 6834686>
      IP B > A: . ack 102401 win 2768 <nop,nop,timestamp 6834687 6834279>
      IP B > A: P 102400:106496(4096) ack 102401 win 2768 <nop,nop,timestamp 6834687 6834279>
      IP A > B: . ack 106496 win 2783 <nop,nop,timestamp 6834280 6834687>
      IP A > B: P 102401:106497(4096) ack 106496 win 2783 <nop,nop,timestamp 6834280 6834687>
      IP B > A: . ack 106497 win 2768 <nop,nop,timestamp 6834688 6834280>
      IP B > A: P 106496:110592(4096) ack 106497 win 2768 <nop,nop,timestamp 6834688 6834280>
      
      Before :
      
      lpq83:~# nstat >/dev/null;perf stat ./super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K
      280774
      
       Performance counter stats for './super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K':
      
           205719.049006 task-clock                #    9.278 CPUs utilized
               8,449,968 context-switches          #    0.041 M/sec
               1,935,997 CPU-migrations            #    0.009 M/sec
                 160,541 page-faults               #    0.780 K/sec
         548,478,722,290 cycles                    #    2.666 GHz                     [83.20%]
         455,240,670,857 stalled-cycles-frontend   #   83.00% frontend cycles idle    [83.48%]
         272,881,454,275 stalled-cycles-backend    #   49.75% backend  cycles idle    [66.73%]
         166,091,460,030 instructions              #    0.30  insns per cycle
                                                   #    2.74  stalled cycles per insn [83.39%]
          29,150,229,399 branches                  #  141.699 M/sec                   [83.30%]
           1,943,814,026 branch-misses             #    6.67% of all branches         [83.32%]
      
            22.173517844 seconds time elapsed
      
      lpq83:~# nstat | egrep "IpOutRequests|IpExtOutOctets"
      IpOutRequests                   16851063           0.0
      IpExtOutOctets                  23878580777        0.0
      
      After patch :
      
      lpq83:~# nstat >/dev/null;perf stat ./super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K
      280877
      
       Performance counter stats for './super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K':
      
           107496.071918 task-clock                #    4.847 CPUs utilized
               5,635,458 context-switches          #    0.052 M/sec
               1,374,707 CPU-migrations            #    0.013 M/sec
                 160,920 page-faults               #    0.001 M/sec
         281,500,010,924 cycles                    #    2.619 GHz                     [83.28%]
         228,865,069,307 stalled-cycles-frontend   #   81.30% frontend cycles idle    [83.38%]
         142,462,742,658 stalled-cycles-backend    #   50.61% backend  cycles idle    [66.81%]
          95,227,712,566 instructions              #    0.34  insns per cycle
                                                   #    2.40  stalled cycles per insn [83.43%]
          16,209,868,171 branches                  #  150.795 M/sec                   [83.20%]
             874,252,952 branch-misses             #    5.39% of all branches         [83.37%]
      
            22.175821286 seconds time elapsed
      
      lpq83:~# nstat | egrep "IpOutRequests|IpExtOutOctets"
      IpOutRequests                   11239428           0.0
      IpExtOutOctets                  23595191035        0.0
      
      Indeed, the occupancy of tx skbs (IpExtOutOctets/IpOutRequests) is higher :
      2099 instead of 1417, thus helping GRO to be more efficient when using FQ packet
      scheduler.
      
      Many thanks to Neal for review and ideas.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Tested-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d4589926
    • S
      net: remove dead code for add/del multiple · 477bb933
      stephen hemminger 提交于
      These function to manipulate multiple addresses are not used anywhere
      in current net-next tree. Some out of tree code maybe using these but
      too bad; they should submit their code upstream..
      
      Also, make __hw_addr_flush local since only used by dev_addr_lists.c
      Signed-off-by: NStephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      477bb933
    • S
      net: unix: allow bind to fail on mutex lock · 37ab4fa7
      Sasha Levin 提交于
      This is similar to the set_peek_off patch where calling bind while the
      socket is stuck in unix_dgram_recvmsg() will block and cause a hung task
      spew after a while.
      
      This is also the last place that did a straightforward mutex_lock(), so
      there shouldn't be any more of these patches.
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      37ab4fa7
    • E
      udp: ipv4: do not use sk_dst_lock from softirq context · e47eb5df
      Eric Dumazet 提交于
      Using sk_dst_lock from softirq context is not supported right now.
      
      Instead of adding BH protection everywhere,
      udp_sk_rx_dst_set() can instead use xchg(), as suggested
      by David.
      Reported-by: NFengguang Wu <fengguang.wu@intel.com>
      Fixes: 97502231 ("udp: ipv4: must add synchronization in udp_sk_rx_dst_set()")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e47eb5df
    • F
      net: ovs: use CRC32 accelerated flow hash if available · 500f8087
      Francesco Fusco 提交于
      Currently OVS uses jhash2() for calculating flow hashes in its
      internal flow_hash() function. The performance of the flow_hash()
      function is critical, as the input data can be hundreds of bytes
      long.
      
      OVS is largely deployed in x86_64 based datacenters.  Therefore,
      we argue that the performance critical fast path of OVS should
      exploit underlying CPU features in order to reduce the per packet
      processing costs. We replace jhash2 with the hash implementation
      provided by the kernel hash lib, which exploits the crc32l
      instruction to achieve high performance
      
      Our patch greatly reduces the hash footprint from ~200 cycles of
      jhash2() to around ~90 cycles in case of ovs_flow_hash_crc()
      (measured with rdtsc over maximum length flow keys on an i7 Intel
      CPU).
      
      Additionally, we wrote a microbenchmark to stress the flow table
      performance. The benchmark inserts random flows into the flow
      hash and then performs lookups. Our hash deployed on a CRC32
      capable CPU reduces the lookup for 1000 flows, 100 masks from
      ~10,100us to ~6,700us, for example.
      
      Thus, simply use the newly introduced arch_fast_hash2() as a
      drop-in replacement.
      Signed-off-by: NFrancesco Fusco <ffusco@redhat.com>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NThomas Graf <tgraf@redhat.com>
      Acked-by: NJesse Gross <jesse@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      500f8087
  3. 17 12月, 2013 5 次提交
  4. 16 12月, 2013 1 次提交
  5. 14 12月, 2013 5 次提交
  6. 13 12月, 2013 2 次提交
    • F
      ipv6: fix incorrect type in declaration · 68536053
      Florent Fourcot 提交于
      Introduced by 1397ed35
        "ipv6: add flowinfo for tcp6 pkt_options for all cases"
      Reported-by: Nkbuild test robot <fengguang.wu@intel.com>
      
      V2: fix the title, add empty line after the declaration (Sergei Shtylyov
      feedbacks)
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      68536053
    • J
      net-gro: Prepare GRO stack for the upcoming tunneling support · 299603e8
      Jerry Chu 提交于
      This patch modifies the GRO stack to avoid the use of "network_header"
      and associated macros like ip_hdr() and ipv6_hdr() in order to allow
      an arbitary number of IP hdrs (v4 or v6) to be used in the
      encapsulation chain. This lays the foundation for various IP
      tunneling support (IP-in-IP, GRE, VXLAN, SIT,...) to be added later.
      
      With this patch, the GRO stack traversing now is mostly based on
      skb_gro_offset rather than special hdr offsets saved in skb (e.g.,
      skb->network_header). As a result all but the top layer (i.e., the
      the transport layer) must have hdrs of the same length in order for
      a pkt to be considered for aggregation. Therefore when adding a new
      encap layer (e.g., for tunneling), one must check and skip flows
      (e.g., by setting NAPI_GRO_CB(p)->same_flow to 0) that have a
      different hdr length.
      
      Note that unlike the network header, the transport header can and
      will continue to be set by the GRO code since there will be at
      most one "transport layer" in the encap chain.
      Signed-off-by: NH.K. Jerry Chu <hkchu@google.com>
      Suggested-by: NEric Dumazet <edumazet@google.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      299603e8
  7. 12 12月, 2013 3 次提交