1. 20 December 2016 (1 commit)
    • ipv4: Should use consistent conditional judgement for ip fragment in... · 0a28cfd5
      Committed by zheng li
      ipv4: Should use consistent conditional judgement for ip fragment in __ip_append_data and ip_finish_output
      
      The conditional judgement in __ip_append_data is inconsistent with the
      one in ip_finish_output: the variable length in __ip_append_data only
      covers the application payload plus the UDP header and does not include
      the IP header, while ip_finish_output uses
      (skb->len > ip_skb_dst_mtu(skb)) as its judgement, and skb->len does
      include the IP header.
      
      As a result, UDP datagrams whose payload length falls between
      (MTU - IP header) and MTU are fragmented by ip_fragment even though the
      output device (rt->dst.dev) supports the UFO feature.
      
      Add the IP header length to length in __ip_append_data so that its
      fragmentation judgement is consistent with the one in ip_finish_output.
      Signed-off-by: Zheng Li <james.z.li@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
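
      A hedged sketch of the mismatch and of the fix (not the verbatim kernel
      code; surrounding conditions are omitted and use_ufo() is a placeholder):

      /* ip_finish_output(): skb->len already includes the IP header */
      if (skb->len > ip_skb_dst_mtu(skb))
              return ip_fragment(net, sk, skb, mtu, ip_finish_output2);

      /* __ip_append_data() before the fix: length is only the UDP header
       * plus payload, so a datagram between (MTU - IP header) and MTU does
       * not take the UFO path here, yet still exceeds the MTU check above */
      if (length > mtu && (rt->dst.dev->features & NETIF_F_UFO))
              use_ufo();

      /* after the fix: count the IP header (fragheaderlen) as well, so both
       * checks agree and such datagrams go through UFO instead of being
       * fragmented by ip_fragment() */
      if (length + fragheaderlen > mtu && (rt->dst.dev->features & NETIF_F_UFO))
              use_ufo();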
  2. 18 December 2016 (2 commits)
    • inet: Fix get port to handle zero port number with soreuseport set · 0643ee4f
      Committed by Tom Herbert
      A user may call listen without binding an explicit port (i.e. with a
      zero port number), with the intent that the kernel will assign an
      available port to the socket. In this case inet_csk_get_port does a
      port scan. For such sockets, the user may also set soreuseport with the
      intent of creating more sockets for the port that is selected. The
      problem is that the initial socket being opened could inadvertently
      choose an existing and unrelated port number that was already bound
      with soreuseport.
      
      This patch adds a boolean parameter to inet_bind_conflict that indicates
      whether soreuseport is allowed for the check (in addition to
      sk->sk_reuseport). In calls to inet_bind_conflict from inet_csk_get_port
      the argument is set to true if an explicit port is being looked up (the
      snum argument is nonzero), and to false if a port scan is being done.
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
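
      A hedged sketch of the idea; the parameter name reuseport_ok and the
      simplified call site are assumptions based on the description, not
      copied from the patch:

      static int inet_csk_bind_conflict(const struct sock *sk,
                                        const struct inet_bind_bucket *tb,
                                        bool relax, bool reuseport_ok)
      {
              struct sock *sk2;

              sk_for_each_bound(sk2, &tb->owners) {
                      /* treat another soreuseport socket as "no conflict"
                       * only when the caller asked for this port explicitly */
                      if (sk->sk_reuseport && sk2->sk_reuseport && reuseport_ok &&
                          uid_eq(sock_i_uid((struct sock *)sk), sock_i_uid(sk2)))
                              continue;
                      /* ... remaining conflict checks elided ... */
              }
              return sk2 != NULL;
      }

      /* in inet_csk_get_port(): an explicit bind (snum != 0) may share the
       * port via soreuseport, an ephemeral port scan (snum == 0) must not */
      conflict = inet_csk_bind_conflict(sk, tb, true, snum != 0);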
    • inet: Don't go into port scan when looking for specific bind port · 9af7e923
      Committed by Tom Herbert
      inet_csk_get_port is called with a port number (the snum argument) that
      may be zero or nonzero. If it is zero, the intent is to find an
      available ephemeral port number to bind to. If snum is nonzero, the
      caller is asking to allocate a specific port number, and in that case
      we never want to perform a scan of the ephemeral port range. It is
      conceivable that this can happen if the "goto again" in "tb_found:" is
      taken. This patch adds a check that snum is zero before doing the
      "goto again".
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
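
      A hedged, simplified sketch of the relevant branch (locking and the
      other retry conditions are omitted):

      tb_found:
              if (inet_csk_bind_conflict(sk, tb, true, true)) {
                      /* fall back to scanning the ephemeral range only when
                       * the caller did not ask for a specific port */
                      if (!snum && sk->sk_reuse && --attempts >= 0)
                              goto again;
                      goto fail_unlock;
              }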
  3. 10 December 2016 (4 commits)
    • udp: udp_rmem_release() should touch sk_rmem_alloc later · 02ab0d13
      Committed by Eric Dumazet
      In flood situations, keeping sk_rmem_alloc at a high value
      prevents producers from touching the socket.
      
      It makes sense to lower sk_rmem_alloc only at the end of
      udp_rmem_release(), after the thread draining the receive queue in
      udp_recvmsg() has finished its writes to sk_forward_alloc.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
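
      A hedged sketch of the resulting ordering, simplified from the real
      function:

      static void udp_rmem_release(struct sock *sk, int size, int partial)
      {
              int amt;

              /* update forward_alloc and the protocol memory accounting
               * first, while producers still see a large sk_rmem_alloc and
               * therefore keep away from the socket */
              sk->sk_forward_alloc += size;
              amt = (sk->sk_forward_alloc - partial) & ~(SK_MEM_QUANTUM - 1);
              sk->sk_forward_alloc -= amt;
              if (amt)
                      __sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT);

              /* only now make the freed room visible to producers */
              atomic_sub(size, &sk->sk_rmem_alloc);
      }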
    • udp: add batching to udp_rmem_release() · 6b229cf7
      Committed by Eric Dumazet
      If udp_recvmsg() releases sk_rmem_alloc for every packet it reads, it
      gives producers the opportunity to immediately grab the spinlock and
      desperately try to add another packet, causing false sharing.
      
      We can add a simple heuristic that signals free space to producers in
      batches of roughly 25 % of the queue capacity.
      
      This patch considerably increases performance under
      flood by about 50 %, since the thread draining the queue
      is no longer slowed by false sharing.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
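
      A hedged sketch of the batching heuristic; the field name
      forward_deficit is an assumption for illustration:

      static void udp_rmem_release(struct sock *sk, int size, int partial)
      {
              struct udp_sock *up = udp_sk(sk);

              if (likely(partial)) {
                      /* accumulate freed bytes and only publish them once
                       * roughly a quarter of the receive buffer is pending,
                       * so producers are not signalled on every recvmsg() */
                      up->forward_deficit += size;
                      size = up->forward_deficit;
                      if (size < (sk->sk_rcvbuf >> 2) &&
                          !skb_queue_empty(&sk->sk_receive_queue))
                              return;
              } else {
                      size += up->forward_deficit;
              }
              up->forward_deficit = 0;

              /* ... release 'size' bytes as in the previous patch ... */
      }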
    • udp: copy skb->truesize in the first cache line · c84d9490
      Committed by Eric Dumazet
      In the UDP RX handler, we currently clear skb->dev before the skb is
      added to the receive queue, because the device pointer is no longer
      valid once we exit the RCU section.
      
      Since this first cache line is always hot, let's reuse its space to
      store skb->truesize and thus avoid a cache line miss at
      udp_recvmsg()/udp_skb_destructor time while the receive queue spinlock
      is held.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
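
      A hedged sketch of the idea; the union member name dev_scratch is an
      assumption for illustration:

      struct sk_buff {
              /* ... first (hot) cache line ... */
              union {
                      struct net_device       *dev;
                      /* skb->dev is cleared before the skb sits in the UDP
                       * receive queue, so its slot can carry truesize */
                      unsigned long           dev_scratch;
              };
              /* ... */
      };

      /* enqueue path (softirq): stash truesize where dev used to live */
      skb->dev_scratch = skb->truesize;

      /* dequeue path (udp_recvmsg()/udp_skb_destructor()): read it back
       * without touching the cold cache line holding skb->truesize */
      udp_rmem_release(sk, skb->dev_scratch, 1);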
    • udp: add busylocks in RX path · 4b272750
      Committed by Eric Dumazet
      The idea of busylocks is to let producers grab an extra spinlock to
      relieve pressure on the receive_queue spinlock shared with the consumer.
      
      This behavior is requested only once the socket receive queue is above
      half occupancy.
      
      Under flood, this means that only one producer can be in line
      trying to acquire the receive_queue spinlock.
      
      These busylocks can be allocated in a per-cpu manner, instead of per
      socket (which would consume a cache line per socket).
      
      This patch considerably improves UDP behavior under stress, depending
      on the number of NIC RX queues and/or RPS spread.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
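
      A hedged sketch of the scheme; the real table is sized from the number
      of possible CPUs, while UDP_BUSYLOCKS_BITS and the fixed-size array
      here are illustrative:

      #define UDP_BUSYLOCKS_BITS 6    /* illustrative table size */
      static spinlock_t udp_busylocks[1 << UDP_BUSYLOCKS_BITS];

      static spinlock_t *busylock_acquire(struct sock *sk)
      {
              spinlock_t *busy = &udp_busylocks[hash_ptr(sk, UDP_BUSYLOCKS_BITS)];

              spin_lock(busy);
              return busy;
      }

      /* producer path: serialize producers behind the busylock only once
       * the receive queue is more than half full */
      spinlock_t *busy = NULL;

      if (atomic_read(&sk->sk_rmem_alloc) > (sk->sk_rcvbuf >> 1))
              busy = busylock_acquire(sk);

      spin_lock(&sk->sk_receive_queue.lock);
      /* ... charge memory and queue the skb ... */
      spin_unlock(&sk->sk_receive_queue.lock);

      if (busy)
              spin_unlock(busy);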
  4. 09 December 2016 (2 commits)
    • udp: under rx pressure, try to condense skbs · c8c8b127
      Committed by Eric Dumazet
      Under UDP flood, many softirq producers try to add packets to the UDP
      receive queue, while one user thread burns a cpu trying to dequeue
      packets as fast as possible.
      
      Two parts of the per-packet cost are:
      - copying payload from kernel space to user space,
      - freeing memory pieces associated with skb.
      
      If the socket is under pressure, the softirq handler(s) can try to pull
      the packet payload into skb->head if it fits.
      
      This means the softirq handler(s) can free/reuse the page fragment
      immediately, instead of letting udp_recvmsg() do it hundreds of
      microseconds later, possibly from another node.
      
      Additional gains:
      - We reduce skb->truesize and thus can store more packets per SO_RCVBUF
      - We avoid cache line misses at copyout() time and consume_skb() time,
      and avoid one put_page() with potential alien freeing on NUMA hosts.
      
      This comes at the cost of a copy, bounded by the available tail room,
      which is usually small. (We might have to revisit GRO_MAX_HEAD, which
      looks bigger than necessary.)
      
      This patch gave me about a 5 % increase in throughput in my tests.
      
      The skb_condense() helper could probably be used in other contexts.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
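
      A hedged sketch of the skb_condense() idea (not the verbatim helper;
      the pressure threshold shown is the half-occupancy test described
      above):

      void skb_condense(struct sk_buff *skb)
      {
              if (!skb->data_len)
                      return;

              /* cloned skbs share their frags; leave them alone, and only
               * copy when the paged payload fits in the linear tail room */
              if (skb_cloned(skb) || skb->data_len > skb_tailroom(skb))
                      return;

              /* pull the frags into skb->head so the page fragment can be
               * freed right away by the softirq handler */
              __pskb_pull_tail(skb, skb->data_len);
              skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
      }

      /* UDP RX path: only bother once the socket is under pressure, e.g.
       * when the receive queue is above half of sk_rcvbuf */
      if (atomic_read(&sk->sk_rmem_alloc) > (sk->sk_rcvbuf >> 1))
              skb_condense(skb);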
    • icmp: correct return value of icmp_rcv() · f91c58d6
      Committed by Zhang Shengju
      Currently, icmp_rcv() always returns zero on a packet delivery upcall.
      
      To make its behavior more compliant with the way this API should be
      used, this patch changes it to return NET_RX_SUCCESS when the packet is
      properly handled, and NET_RX_DROP otherwise.
      Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
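
      A hedged sketch of the change; only the return statements are shown:

      int icmp_rcv(struct sk_buff *skb)
      {
              /* ... validate the header and dispatch to the handler ... */
              return NET_RX_SUCCESS;          /* packet properly handled
                                               * (was: return 0) */
      drop:
              kfree_skb(skb);
              return NET_RX_DROP;             /* was: return 0 */
      }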
  5. 07 December 2016 (9 commits)
  6. 06 December 2016 (11 commits)
  7. 05 December 2016 (9 commits)
  8. 04 December 2016 (2 commits)
    • ipv4: fib: Replay events when registering FIB notifier · c3852ef7
      Committed by Ido Schimmel
      Commit b90eb754 ("fib: introduce FIB notification infrastructure")
      introduced a new notification chain to notify listeners (e.g., switchdev
      drivers) about the addition and deletion of routes.
      
      However, by the time a listener registers to the chain the FIB tables
      can already be populated, which means it will have an incomplete view
      of the tables.
      
      Solve that by dumping the FIB tables and replaying the events to the
      passed notification block. The dump itself is done using RCU in order
      not to starve consumers that need RTNL to make progress.
      
      The integrity of the dump is ensured by reading the FIB change sequence
      counter before and after the dump under RTNL. This allows us to avoid
      the problematic situation in which the dumping process sends an
      ENTRY_ADD notification following an ENTRY_DEL generated by another
      process holding RTNL.
      
      Callers of the registration function may pass a callback that is
      executed in case the dump was inconsistent with the current FIB tables.
      
      The number of retries until a consistent dump is achieved is set to a
      fixed number to prevent callers from looping for long periods of time.
      Should the current limit prove problematic in the future, it can easily
      be made configurable via a sysctl.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
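
      A hedged sketch of the registration flow; helper names such as
      fib_seq_sum() and fib_dump() are placeholders for illustration:

      #define FIB_DUMP_MAX_RETRIES 5

      int register_fib_notifier(struct notifier_block *nb,
                                void (*cb)(struct notifier_block *nb))
      {
              int retries = 0;

              do {
                      unsigned int fib_seq = fib_seq_sum();   /* under RTNL */

                      /* walk the existing tables under RCU and replay
                       * ENTRY_ADD events to the new listener */
                      fib_dump(nb);

                      /* no notification was sent while we were dumping:
                       * the listener's view is consistent, register it */
                      if (fib_seq == fib_seq_sum())
                              return atomic_notifier_chain_register(&fib_chain, nb);
              } while (++retries < FIB_DUMP_MAX_RETRIES);

              if (cb)
                      cb(nb);         /* let the caller undo partial state */
              return -EBUSY;
      }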
    • ipv4: fib: Allow for consistent FIB dumping · cacaad11
      Committed by Ido Schimmel
      The next patch will enable listeners of the FIB notification chain to
      request a dump of the FIB tables. However, since RTNL isn't taken during
      the dump, it's possible for the FIB tables to change mid-dump, which
      will result in inconsistency between the listener's table and the
      kernel's.
      
      Allow listeners to detect changes that occurred mid-dump by adding a
      change sequence counter to each net namespace. The counter is
      incremented just before a notification is sent on the FIB chain.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
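
      A hedged sketch; the per-netns field name net->ipv4.fib_seq is an
      assumption based on the description:

      static int call_fib_notifiers(struct net *net,
                                    enum fib_event_type event_type,
                                    struct fib_notifier_info *info)
      {
              /* bump the per-netns counter just before notifying, so a
               * dumper comparing the counter before and after its dump can
               * tell whether the tables changed underneath it */
              net->ipv4.fib_seq++;
              info->net = net;
              return atomic_notifier_call_chain(&fib_chain, event_type, info);
      }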