1. 06 4月, 2016 22 次提交
  2. 05 4月, 2016 18 次提交
    • D
      Merge branch 'tcp-udp-misc' · 15f41e2b
      David S. Miller 提交于
      Eric Dumazet says:
      
      ====================
      net: various udp/tcp changes
      
      First round of patches for linux-4.7
      
      Add a generic facility for sockets to be freed after an RCU grace
      period, if they need to.
      
      Then UDP stack is changed to no longer use SLAB_DESTROY_BY_RCU,
      in order to speedup rx processing for traffic encapsulated in UDP.
      It gives a 17 % speedup for normal UDP reception in stress conditions.
      
      Then TCP listeners are changed to use SOCK_RCU_FREE as well
      to avoid touching sk_refcnt in synflood case :
      I got up to 30 % performance increase for a mono listener.
      
      Then three patches add SK_MEMINFO_DROPS to sock_diag
      and add per socket rx drops accounting to TCP.
      
      Last patch adds rate limiting on ACK sent on behalf of SYN_RECV
      to better resist to SYNFLOOD targeting one or few flows.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      15f41e2b
    • E
      tcp: rate limit ACK sent by SYN_RECV request sockets · 4ce7e93c
      Eric Dumazet 提交于
      Attackers like to use SYNFLOOD targeting one 5-tuple, as they
      hit a single RX queue (and cpu) on the victim.
      
      If they use random sequence numbers in their SYN, we detect
      they do not match the expected window and send back an ACK.
      
      This patch adds a rate limitation, so that the effect of such
      attacks is limited to ingress only.
      
      We roughly double our ability to absorb such attacks.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Maciej Żenczykowski <maze@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4ce7e93c
    • E
      ipv4: tcp: set SOCK_USE_WRITE_QUEUE for ip_send_unicast_reply() · a9d6532b
      Eric Dumazet 提交于
      TCP uses per cpu 'sockets' to send some packets :
      - RST packets ( tcp_v4_send_reset()) )
      - ACK packets for SYN_RECV and TIMEWAIT sockets
      
      By setting SOCK_USE_WRITE_QUEUE flag, we tell sock_wfree()
      to not call sk_write_space() since these internal sockets
      do not care.
      
      This gives a small performance improvement, merely by allowing
      cpu to properly predict the sock_wfree() conditional branch,
      and avoiding one atomic operation.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a9d6532b
    • E
      tcp: increment sk_drops for listeners · 9caad864
      Eric Dumazet 提交于
      Goal: packets dropped by a listener are accounted for.
      
      This adds tcp_listendrop() helper, and clears sk_drops in sk_clone_lock()
      so that children do not inherit their parent drop count.
      
      Note that we no longer increment LINUX_MIB_LISTENDROPS counter when
      sending a SYNCOOKIE, since the SYN packet generated a SYNACK.
      We already have a separate LINUX_MIB_SYNCOOKIESSENT
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9caad864
    • E
      tcp: increment sk_drops for dropped rx packets · 532182cd
      Eric Dumazet 提交于
      Now ss can report sk_drops, we can instruct TCP to increment
      this per socket counter when it drops an incoming frame, to refine
      monitoring and debugging.
      
      Following patch takes care of listeners drops.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      532182cd
    • E
      sock_diag: add SK_MEMINFO_DROPS · 15239302
      Eric Dumazet 提交于
      Reporting sk_drops to user space was available for UDP
      sockets using /proc interface.
      
      Add this to sock_diag, so that we can have the same information
      available to ss users, and we'll be able to add sk_drops
      indications for TCP sockets as well.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      15239302
    • E
      tcp/dccp: do not touch listener sk_refcnt under synflood · 3b24d854
      Eric Dumazet 提交于
      When a SYNFLOOD targets a non SO_REUSEPORT listener, multiple
      cpus contend on sk->sk_refcnt and sk->sk_wmem_alloc changes.
      
      By letting listeners use SOCK_RCU_FREE infrastructure,
      we can relax TCP_LISTEN lookup rules and avoid touching sk_refcnt
      
      Note that we still use SLAB_DESTROY_BY_RCU rules for other sockets,
      only listeners are impacted by this change.
      
      Peak performance under SYNFLOOD is increased by ~33% :
      
      On my test machine, I could process 3.2 Mpps instead of 2.4 Mpps
      
      Most consuming functions are now skb_set_owner_w() and sock_wfree()
      contending on sk->sk_wmem_alloc when cooking SYNACK and freeing them.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3b24d854
    • E
      inet: reqsk_alloc() needs to take care of dead listeners · 3a5d1c0e
      Eric Dumazet 提交于
      We'll soon no longer take a refcount on listeners,
      so reqsk_alloc() can not assume a listener refcount is not
      zero. We need to use atomic_inc_not_zero()
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3a5d1c0e
    • E
      tcp/dccp: use rcu locking in inet_diag_find_one_icsk() · 2d331915
      Eric Dumazet 提交于
      RX packet processing holds rcu_read_lock(), so we can remove
      pairs of rcu_read_lock()/rcu_read_unlock() in lookup functions
      if inet_diag also holds rcu before calling them.
      
      This is needed anyway as __inet_lookup_listener() and
      inet6_lookup_listener() will soon no longer increment
      refcount on the found listener.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2d331915
    • E
      tcp/dccp: remove BH disable/enable in lookup · ee3cf32a
      Eric Dumazet 提交于
      Since linux 2.6.29, lookups only use rcu locking.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ee3cf32a
    • E
      udp: no longer use SLAB_DESTROY_BY_RCU · ca065d0c
      Eric Dumazet 提交于
      Tom Herbert would like not touching UDP socket refcnt for encapsulated
      traffic. For this to happen, we need to use normal RCU rules, with a grace
      period before freeing a socket. UDP sockets are not short lived in the
      high usage case, so the added cost of call_rcu() should not be a concern.
      
      This actually removes a lot of complexity in UDP stack.
      
      Multicast receives no longer need to hold a bucket spinlock.
      
      Note that ip early demux still needs to take a reference on the socket.
      
      Same remark for functions used by xt_socket and xt_PROXY netfilter modules,
      but this might be changed later.
      
      Performance for a single UDP socket receiving flood traffic from
      many RX queues/cpus.
      
      Simple udp_rx using simple recvfrom() loop :
      438 kpps instead of 374 kpps : 17 % increase of the peak rate.
      
      v2: Addressed Willem de Bruijn feedback in multicast handling
       - keep early demux break in __udp4_lib_demux_lookup()
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <tom@herbertland.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Tested-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ca065d0c
    • E
      net: add SOCK_RCU_FREE socket flag · a4298e45
      Eric Dumazet 提交于
      We want a generic way to insert an RCU grace period before socket
      freeing for cases where RCU_SLAB_DESTROY_BY_RCU is adding too
      much overhead.
      
      SLAB_DESTROY_BY_RCU strict rules force us to take a reference
      on the socket sk_refcnt, and it is a performance problem for UDP
      encapsulation, or TCP synflood behavior, as many CPUs might
      attempt the atomic operations on a shared sk_refcnt
      
      UDP sockets and TCP listeners can set SOCK_RCU_FREE so that their
      lookup can use traditional RCU rules, without refcount changes.
      They can set the flag only once hashed and visible by other cpus.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <tom@herbertland.com>
      Tested-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a4298e45
    • D
      Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · 43e2dfb2
      David S. Miller 提交于
      Jeff Kirsher says:
      
      ====================
      10GbE Intel Wired LAN Driver Updates 2016-04-04
      
      This series contains updates to ixgbe and ixgbevf.
      
      Pavel Tikhomirov fixes a typo where we were incrementing transmit stats
      instead of receive stats on the receive side.
      
      Emil updates the ixgbevf driver to use bit operations for setting and
      checking the adapter state.
      
      Chas Williams adds the new NDO trust feature check so that the VF guest
      has the ability to set the unicast address of the interface, if it is a
      trusted VF.
      
      Alex cleans up the driver to that the only time we add a PF entry to the
      VLVF is either for VLAN 0 or if the PF has requested a VLAN that a VF
      is already using.  Also adds support for generic transmit checksums,
      giving the added advantage is that we can support inner checksum offloads
      for tunnels and MPLS while still being able to transparently insert
      VLAN tags.  Lastly, changed ixgbe so that we can use the ethtool
      rx-vlan-filter flag to toggle receive VLAN filtering on and off.
      
      Mark cleans up the ixgbe driver by making all op structures that do not
      change constants.  Also fixed flow control for Xeon D KR backplanes, since
      we cannot use auto-negotiation to determine the mode, we have to use
      whatever the user configured.
      
      Sowmini Varadhan updates ixgbe to use eth_platform_get_mac_address()
      instead of the arch specific solution that was added by a previous
      commit.
      
      Don fixed an issue where it was possible that a system reset could occur
      when we were holding the SWFW semaphore lock, which the next time the
      driver loaded would see it incorrectly as locked.
      
      v2: updated patch 8 of the series to include a minor flags issue where
          we had lost NETIF_F_HW_TC and we were setting NETIF_F_SCTP_CRC in
          two different areas, when we only needed/wanted it in one spot.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      43e2dfb2
    • D
      Merge branch 'mv88e6131-hw-bridging-6185' · 6e338048
      David S. Miller 提交于
      Vivien Didelot says:
      
      ====================
      net: dsa: mv88e6131: HW bridging support for 6185
      
      All packets passing through a switch of the 6185 family are currently all
      directed to the CPU port. This means that port bridging is software driven.
      
      To enable hardware bridging for this switch family, we need to implement the
      port mapping operations, the FDB operations, and optionally the VLAN operations
      (for 802.1Q and VLAN filtering aware systems).
      
      However this family only has 256 FDBs indexed by 8-bit identifiers, opposed to
      4096 FDBs with 12-bit identifiers for other families such as 6352. It also
      doesn't have dedicated FID registers for ATU and VTU operations.
      
      This patchset fixes these differences, and enable hardware bridging for 6185.
      
      Changes v1 -> v2:
       - Describe the different numbers of databases and prefer a feature-based logic
         over the current ID/family-based logic.
      ====================
      Tested-by: NAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6e338048
    • V
      net: dsa: mv88e6131: enable hardware bridging · 26892ffc
      Vivien Didelot 提交于
      By adding support for bridge operations, FDB operations, and optionally
      VLAN operations (for 802.1Q and VLAN filtering aware systems), the
      switch bridges ports correctly, the CPU is able to populate the hardware
      address databases, and thus hardware bridging becomes functional within
      the 88E6185 family of switches.
      Signed-off-by: NVivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      26892ffc
    • V
      net: dsa: mv88e6xxx: map destination addresses for 6185 · f93dd042
      Vivien Didelot 提交于
      The 88E6185 switch also has a MapDA bit in its Port Control 2 register.
      When this bit is cleared, all frames are sent out to the CPU port.
      
      Set this bit to rely on address databases (ATU) hits and direct frames
      out of the correct ports, and thus allow hardware bridging.
      Signed-off-by: NVivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f93dd042
    • V
      net: dsa: mv88e6xxx: support 256 databases · 11ea809f
      Vivien Didelot 提交于
      The 6185 family of devices has only 256 address databases. Their 8-bit
      FID for ATU and VTU operations are split into ATU Control and ATU/VTU
      Operation registers.
      Signed-off-by: NVivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      11ea809f
    • V
      net: dsa: mv88e6xxx: variable number of databases · f74df0be
      Vivien Didelot 提交于
      Marvell switch chips have different number of address databases.
      
      The code currently only supports models with 4096 databases. Such switch
      has dedicated FID registers for ATU and VTU operations. Models with
      fewer databases have their FID split in several registers.
      
      List them all but only support models with 4096 databases at the moment.
      Signed-off-by: NVivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f74df0be