1. 07 Apr 2016 (1 commit)
  2. 06 Apr 2016 (2 commits)
  3. 05 Apr 2016 (10 commits)
    • tcp: rate limit ACK sent by SYN_RECV request sockets · 4ce7e93c
      Committed by Eric Dumazet
      Attackers like to use SYNFLOOD targeting one 5-tuple, as they
      hit a single RX queue (and cpu) on the victim.
      
      If they use random sequence numbers in their SYNs, we detect
      that they do not match the expected window and send back an ACK.
      
      This patch adds a rate limitation, so that the effect of such
      attacks is limited to ingress only.
      
      We roughly double our ability to absorb such attacks.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Maciej Żenczykowski <maze@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
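
      A minimal sketch of where such a rate limit can sit, modeled on the
      out-of-window ACK limiter the kernel already used for established
      sockets; the helper and MIB names are assumptions from that era's
      code, not a quote of this patch:

      /* In tcp_check_req(): only answer an out-of-window segment with
       * an ACK when the rate limiter allows it (assumed helper names).
       */
      if (!tcp_oow_rate_limited(sock_net(sk), skb,
                                LINUX_MIB_TCPACKSKIPPEDSYNRECV,
                                &tcp_rsk(req)->last_oow_ack_time))
              req->rsk_ops->send_ack(sk, skb, req);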
    • ipv4: tcp: set SOCK_USE_WRITE_QUEUE for ip_send_unicast_reply() · a9d6532b
      Committed by Eric Dumazet
      TCP uses per-cpu 'sockets' to send some packets:
      - RST packets (tcp_v4_send_reset())
      - ACK packets for SYN_RECV and TIMEWAIT sockets
      
      By setting the SOCK_USE_WRITE_QUEUE flag, we tell sock_wfree()
      not to call sk_write_space(), since these internal sockets
      do not care.
      
      This gives a small performance improvement, by allowing the
      cpu to properly predict the sock_wfree() conditional branch
      and by avoiding one atomic operation.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
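
      A simplified sketch of the mechanism, assuming the per-cpu control
      sockets set the flag once at init time; the sock_wfree() shape below
      approximates that era's net/core/sock.c and is not a quote:

      /* At per-cpu control-socket init time: */
      sock_set_flag(sk, SOCK_USE_WRITE_QUEUE);

      /* In sock_wfree(), simplified: */
      void sock_wfree(struct sk_buff *skb)
      {
              struct sock *sk = skb->sk;
              unsigned int len = skb->truesize;

              if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE)) {
                      /* normal sockets: uncharge and wake any writer */
                      atomic_sub(len - 1, &sk->sk_wmem_alloc);
                      sk->sk_write_space(sk);
                      len = 1;
              }
              if (atomic_sub_and_test(len, &sk->sk_wmem_alloc))
                      __sk_free(sk);
      }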
    • tcp: increment sk_drops for listeners · 9caad864
      Committed by Eric Dumazet
      Goal: packets dropped by a listener are accounted for.
      
      This adds a tcp_listendrop() helper, and clears sk_drops in sk_clone_lock()
      so that children do not inherit their parent's drop count.
      
      Note that we no longer increment the LINUX_MIB_LISTENDROPS counter when
      sending a SYNCOOKIE, since the SYN packet generated a SYNACK; we already
      have a separate LINUX_MIB_SYNCOOKIESSENT counter.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
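
      The helper itself is tiny; a sketch consistent with the description
      above (the stats macro name follows that era's convention and is an
      assumption):

      /* Account a packet dropped by a listener: bump the per-socket
       * counter that ss can report, plus the global MIB counter.
       */
      static inline void tcp_listendrop(const struct sock *sk)
      {
              atomic_inc(&((struct sock *)sk)->sk_drops);
              NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
      }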
    • tcp: increment sk_drops for dropped rx packets · 532182cd
      Committed by Eric Dumazet
      Now that ss can report sk_drops, we can instruct TCP to increment
      this per-socket counter when it drops an incoming frame, to refine
      monitoring and debugging.
      
      A following patch takes care of listener drops.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp/dccp: do not touch listener sk_refcnt under synflood · 3b24d854
      Committed by Eric Dumazet
      When a SYNFLOOD targets a non-SO_REUSEPORT listener, multiple
      cpus contend on sk->sk_refcnt and sk->sk_wmem_alloc changes.

      By letting listeners use the SOCK_RCU_FREE infrastructure,
      we can relax the TCP_LISTEN lookup rules and avoid touching sk_refcnt.

      Note that we still use SLAB_DESTROY_BY_RCU rules for other sockets;
      only listeners are impacted by this change.
      
      Peak performance under SYNFLOOD is increased by ~33%:
      on my test machine, I could process 3.2 Mpps instead of 2.4 Mpps.

      The most cycle-consuming functions are now skb_set_owner_w() and
      sock_wfree(), contending on sk->sk_wmem_alloc when cooking SYNACKs
      and freeing them.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
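
      A sketch of the idea, assuming the shape of that era's sk_destruct():
      with the flag set, freeing is deferred by an RCU grace period, so a
      lookup running under rcu_read_lock() may inspect a listener without
      taking a reference:

      void sk_destruct(struct sock *sk)
      {
              if (sock_flag(sk, SOCK_RCU_FREE))
                      call_rcu(&sk->sk_rcu, __sk_destruct);
              else
                      __sk_destruct(&sk->sk_rcu);
      }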
    • tcp/dccp: use rcu locking in inet_diag_find_one_icsk() · 2d331915
      Committed by Eric Dumazet
      RX packet processing holds rcu_read_lock(), so we can remove
      pairs of rcu_read_lock()/rcu_read_unlock() in lookup functions
      if inet_diag also holds rcu before calling them.
      
      This is needed anyway, as __inet_lookup_listener() and
      inet6_lookup_listener() will soon no longer increment the
      refcount of the found listener.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: no longer use SLAB_DESTROY_BY_RCU · ca065d0c
      Committed by Eric Dumazet
      Tom Herbert would like to avoid touching the UDP socket refcnt for
      encapsulated traffic. For this to happen, we need to use normal RCU
      rules, with a grace period before freeing a socket. UDP sockets are
      not short-lived in the high-usage case, so the added cost of
      call_rcu() should not be a concern.
      
      This actually removes a lot of complexity in the UDP stack.
      
      Multicast receives no longer need to hold a bucket spinlock.
      
      Note that ip early demux still needs to take a reference on the socket.
      
      Same remark for functions used by xt_socket and xt_PROXY netfilter modules,
      but this might be changed later.
      
      Performance for a single UDP socket receiving flood traffic from
      many RX queues/cpus:

      A simple udp_rx program using a plain recvfrom() loop:
      438 kpps instead of 374 kpps, a 17% increase in peak rate.
      
      v2: Addressed Willem de Bruijn feedback in multicast handling
       - keep early demux break in __udp4_lib_demux_lookup()
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <tom@herbertland.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Tested-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
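
      A receive-path sketch under the new rules (the lookup function name
      and signature are abbreviated from that era's net/ipv4/udp.c; treat
      them as assumptions):

      int ret;
      struct sock *sk;

      rcu_read_lock();
      sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, udptable);
      ret = sk ? udp_queue_rcv_skb(sk, skb) : -ENOENT; /* no sock_put() */
      rcu_read_unlock();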
    • sock: enable timestamping using control messages · c14ac945
      Committed by Soheil Hassas Yeganeh
      Currently, SO_TIMESTAMPING can only be enabled using setsockopt.
      This is very costly when users want to sample writes to gather
      tx timestamps.
      
      Add support for enabling SO_TIMESTAMPING via control messages, using
      the tsflags field of `struct sockcm_cookie` (introduced in the previous
      patches in this series) to set the tx_flags of the last skb created in
      a sendmsg. With this patch, the timestamp recording bits in the skbuff's
      tx_flags are overridden if SO_TIMESTAMPING is passed in a cmsg.
      
      Please note that this is only effective for overriding the timestamp
      recording flags. Users should enable timestamp reporting (e.g.,
      SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using
      socket options, and then ask for SOF_TIMESTAMPING_TX_* via control
      messages per sendmsg to sample timestamps for each write.
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
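
      A userspace sketch of the resulting API: enable reporting once with
      setsockopt(), then request a TX timestamp for one write via a cmsg
      (send_sampled is a hypothetical helper name):

      #include <string.h>
      #include <sys/socket.h>
      #include <sys/uio.h>
      #include <linux/net_tstamp.h>

      static ssize_t send_sampled(int fd, const void *buf, size_t len)
      {
              char control[CMSG_SPACE(sizeof(__u32))] = {0};
              struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
              struct msghdr msg = {
                      .msg_iov = &iov, .msg_iovlen = 1,
                      .msg_control = control,
                      .msg_controllen = sizeof(control),
              };
              struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
              __u32 tsflags = SOF_TIMESTAMPING_TX_SOFTWARE;

              /* per-write recording request, overriding sk_tsflags */
              cmsg->cmsg_level = SOL_SOCKET;
              cmsg->cmsg_type = SO_TIMESTAMPING;
              cmsg->cmsg_len = CMSG_LEN(sizeof(tsflags));
              memcpy(CMSG_DATA(cmsg), &tsflags, sizeof(tsflags));
              return sendmsg(fd, &msg, 0);
      }

      Per the commit message, the reporting flags (SOF_TIMESTAMPING_SOFTWARE |
      SOF_TIMESTAMPING_OPT_ID) would still be enabled once via setsockopt();
      the cmsg only overrides the per-write recording flags.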
    • ipv4: process socket-level control messages in IPv4 · 24025c46
      Committed by Soheil Hassas Yeganeh
      Process socket-level control messages by invoking
      __sock_cmsg_send in ip_cmsg_send for control messages on
      the SOL_SOCKET layer.
      
      This makes sure that whenever ip_cmsg_send is called in udp, icmp,
      and raw, we also process socket-level control messages.
      
      Note that this commit interprets new control messages that were
      previously ignored; it does not change the behavior of existing
      IPv4 control messages.
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
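
      A sketch of the dispatch in ip_cmsg_send(), with names assumed from
      that era's code (the sockcm_cookie living inside the ipcm_cookie):

      struct cmsghdr *cmsg;
      int err;

      for_each_cmsghdr(cmsg, msg) {
              if (cmsg->cmsg_level == SOL_SOCKET) {
                      /* socket-level cmsg: hand off to the generic helper */
                      err = __sock_cmsg_send(sk, msg, cmsg, &ipc->sockc);
                      if (err)
                              return err;
                      continue;
              }
              if (cmsg->cmsg_level != SOL_IP)
                      continue;
              /* ... existing IP_PKTINFO/IP_TTL/IP_TOS handling ... */
      }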
    • tcp: use one bit in TCP_SKB_CB to mark ACK timestamps · 6b084928
      Committed by Soheil Hassas Yeganeh
      Currently, to avoid a cache line miss when accessing skb_shinfo,
      tcp_ack_tstamp skips sockets that do not have the
      SOF_TIMESTAMPING_TX_ACK bit set in sk_tsflags. This is
      implemented based on an implicit assumption that
      SOF_TIMESTAMPING_TX_ACK is set via socket options for the
      duration that ACK timestamps are needed.
      
      To implement per-write timestamps, this check should be
      removed and replaced with a per-packet alternative that
      quickly skips packets missing ACK timestamp marks without
      a cache-line miss.
      
      To enable per-packet marking without a cache line miss, use
      one bit in TCP_SKB_CB to mark whether an SKB might need an
      ACK tx timestamp or not. Further checks in tcp_ack_tstamp are not
      modified and work as before.
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
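
      A sketch of the mark and the fast-path check (the bitfield name is
      assumed from that era's tcp.h):

      struct tcp_skb_cb {
              /* ... */
              __u8 txstamp_ack:1;  /* record TX timestamp at ACK? */
              /* ... */
      };

      /* tcp_ack_tstamp() can then bail out without touching the
       * skb_shinfo() cache line:
       */
      if (likely(!TCP_SKB_CB(skb)->txstamp_ack))
              return;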
  4. 03 Apr 2016 (2 commits)
    • netlink: use nla_get_in_addr and nla_put_in_addr for ipv4 address · 7822ce73
      Committed by Haishuang Yan
      Now that nla_get_in_addr and nla_put_in_addr are implemented,
      use them where appropriate.
      Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
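
      A before/after sketch (IFA_LOCAL is just an illustrative attribute
      type):

      __be32 addr;

      addr = nla_get_be32(attr);               /* before: raw accessor */
      addr = nla_get_in_addr(attr);            /* after: IPv4-typed    */

      nla_put_be32(skb, IFA_LOCAL, addr);      /* before */
      nla_put_in_addr(skb, IFA_LOCAL, addr);   /* after  */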
    • tcp: remove cwnd moderation after recovery · 23492623
      Committed by Yuchung Cheng
      For non-SACK connections, cwnd is lowered to inflight plus 3 packets
      when the recovery ends. This is an optional feature in the NewReno
      RFC 2582 to reduce the potential burst when cwnd is "re-opened"
      after recovery and inflight is low.
      
      This feature is questionably effective because of PRR: when
      the recovery ends (i.e., snd_una == high_seq), NewReno holds the
      CA_Recovery state for another round trip to prevent false fast
      retransmits. But if the inflight is low, PRR will overwrite the
      moderated cwnd in tcp_cwnd_reduction() later regardless. So if a
      receiver responds with bogus ACKs (i.e., acking future data) to
      speed up transfer after recovery, it can only induce a burst of up
      to a window's worth of data packets by acking up to SND.NXT. A
      restart from (short) idle or receiving stretched ACKs can cause
      such bursts as well.
      
      On the other hand, if the recovery ends because the sender
      detects the losses were spurious (e.g., due to reordering), this
      feature unconditionally lowers a reverted cwnd even though nothing
      was lost.

      On principle, the loss recovery module should not update cwnd.
      Furthermore, pacing is much more effective at reducing bursts.
      Hence this patch removes the cwnd moderation feature.
      
      v2 changes: revised commit message on bogus ACKs and burst, and
                  missing signature
      Signed-off-by: Matt Mathis <mattmathis@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
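
      For reference, a sketch of the shape of the removed moderation,
      reconstructed from the description above (with tcp_max_burst()
      providing the 3-packet allowance; treat names as assumptions):

      static void tcp_moderate_cwnd(struct tcp_sock *tp)
      {
              /* clamp cwnd to inflight plus a small burst allowance */
              tp->snd_cwnd = min(tp->snd_cwnd,
                                 tcp_packets_in_flight(tp) +
                                 tcp_max_burst(tp));
              tp->snd_cwnd_stamp = tcp_time_stamp;
      }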
  5. 31 Mar 2016 (1 commit)
    • gro: Allow tunnel stacking in the case of FOU/GUE · c3483384
      Committed by Alexander Duyck
      This patch should fix the issues seen with a recent fix to prevent
      tunnel-in-tunnel frames from being generated with GRO.  The fix itself
      is correct for now, as long as we do not add any devices that support
      NETIF_F_GSO_GRE_CSUM.  When such a device is added, it could break
      things, because the outer transport header points to the outer UDP
      header and not the GRE header as would be expected.
      
      Fixes: fac8e0f5 ("tunnels: Don't apply GRO to multiple layers of encapsulation.")
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 28 Mar 2016 (6 commits)
  7. 24 Mar 2016 (1 commit)
  8. 23 Mar 2016 (2 commits)
  9. 22 Mar 2016 (1 commit)
    • net: ipv4: Fix truncated timestamp returned by inet_current_timestamp() · 3ba9d300
      Committed by Deepa Dinamani
      The millisecond timestamp returned by the function is
      converted to network byte order by a call to htons(), but
      htons() returns a __be16 while a __be32 is required here.
      
      This was identified by the sparse warning from the buildbot:
      net/ipv4/af_inet.c:1405:16: sparse: incorrect type in return
      			    expression (different base types)
      net/ipv4/af_inet.c:1405:16: expected restricted __be32
      net/ipv4/af_inet.c:1405:16: got restricted __be16 [usertype] <noident>
      
      Change the function to use htonl() to return the correct __be32 type
      instead so that the millisecond value doesn't get truncated.
      Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: James Morris <jmorris@namei.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Fixes: 822c8685 ("net: ipv4: Convert IP network timestamps to be y2038 safe")
      Reported-by: Fengguang Wu <fengguang.wu@intel.com> [0-day test robot]
      Signed-off-by: David S. Miller <davem@davemloft.net>
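
      A sketch of the fixed helper, assuming the surrounding code of that
      era (milliseconds since midnight UTC, per the IP timestamp option):

      __be32 inet_current_timestamp(void)
      {
              u32 secs, msecs;
              struct timespec64 ts;

              ktime_get_real_ts64(&ts);
              secs = (u32)ts.tv_sec % (24 * 60 * 60);
              msecs = secs * MSEC_PER_SEC + ts.tv_nsec / NSEC_PER_MSEC;
              return htonl(msecs); /* was htons(): truncated to 16 bits */
      }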
  10. 21 Mar 2016 (3 commits)
    • tunnels: Remove encapsulation offloads on decap. · a09a4c8d
      Committed by Jesse Gross
      If a packet is either locally encapsulated or processed through GRO
      it is marked with the offloads that it requires. However, when it is
      decapsulated these tunnel offload indications are not removed. This
      means that if we receive an encapsulated TCP packet, aggregate it with
      GRO, decapsulate, and retransmit the resulting frame on a NIC that does
      not support encapsulation, we won't be able to take advantage of hardware
      offloads even though it is just a simple TCP packet at this point.
      
      This fixes the problem by stripping off encapsulation offload indications
      when packets are decapsulated.
      
      The performance impact of this bug is significant. In a test where a
      Geneve-encapsulated TCP stream is sent to a hypervisor, GRO'ed,
      decapsulated, and bridged to a VM, performance improves by 60%
      (5 Gbps -> 8 Gbps) as a result of avoiding unnecessary segmentation
      at the VM tap interface.
      Reported-by: Ramu Ramamurthy <sramamur@linux.vnet.ibm.com>
      Fixes: 68c33163 ("v4 GRE: Add TCP segmentation offload for GRE")
      Signed-off-by: Jesse Gross <jesse@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
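
      A simplified sketch of the stripping step at decap time (mask and
      field names are assumptions from that era's tunnel code; the real
      helper also adjusts checksum state):

      if (skb->encapsulation) {
              /* drop all tunnel GSO bits so the inner packet looks
               * like plain TCP to the egress device
               */
              skb_shinfo(skb)->gso_type &=
                      ~(NETIF_F_GSO_ENCAP_ALL >> NETIF_F_GSO_SHIFT);
              skb->encapsulation = 0;
      }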
    • tunnels: Don't apply GRO to multiple layers of encapsulation. · fac8e0f5
      Committed by Jesse Gross
      When drivers express support for TSO of encapsulated packets, they
      only mean that they can do it for one layer of encapsulation.
      Supporting additional levels would require, at a minimum, updating
      more IP length fields, and drivers are unaware of this.
      
      No encapsulation device expresses support for handling offloaded
      encapsulated packets, so we won't generate these types of frames
      in the transmit path. However, GRO doesn't have a check for
      multiple levels of encapsulation and will attempt to build them.
      
      UDP tunnel GRO actually does prevent this situation but it only
      handles multiple UDP tunnels stacked on top of each other. This
      generalizes that solution to prevent any kind of tunnel stacking
      that would cause problems.
      
      Fixes: bf5a755f ("net-gre-gro: Add GRE support to the GRO stack")
      Signed-off-by: Jesse Gross <jesse@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
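
      A sketch of the generalized guard in the IP GRO receive path (the
      flag name follows that era's GRO control block; treat it as an
      assumption):

      /* refuse to aggregate a second level of encapsulation */
      if (NAPI_GRO_CB(skb)->encap_mark)
              goto out; /* already inside a tunnel: don't merge */
      NAPI_GRO_CB(skb)->encap_mark = 1;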
    • ipip: Properly mark ipip GRO packets as encapsulated. · b8cba75b
      Committed by Jesse Gross
      ipip-encapsulated packets can be merged together by GRO, but the
      result does not have the proper GSO type set and is not even marked
      as being encapsulated at all. Later retransmission of these packets
      will likely fail if the device does not support ipip offloads. This
      is similar to the issue resolved for IPv6 sit in feec0cb3
      ("ipv6: gro: support sit protocol").
      Reported-by: Patrick Boutilier <boutilpj@ednet.ns.ca>
      Fixes: 9667e9bb ("ipip: Add gro callbacks to ipip offload")
      Tested-by: Patrick Boutilier <boutilpj@ednet.ns.ca>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jesse Gross <jesse@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
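
      A sketch of the gro_complete fix (shape assumed from that era's
      net/ipv4/af_inet.c):

      static int ipip_gro_complete(struct sk_buff *skb, int nhoff)
      {
              /* mark the merged skb as an ipip GSO packet */
              skb->encapsulation = 1;
              skb_shinfo(skb)->gso_type |= SKB_GSO_IPIP;
              return inet_gro_complete(skb, nhoff);
      }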
  11. 19 Mar 2016 (1 commit)
  12. 18 Mar 2016 (1 commit)
  13. 16 Mar 2016 (1 commit)
    • tags: Fix DEFINE_PER_CPU expansions · 25528213
      Committed by Peter Zijlstra
      $ make tags
        GEN     tags
      ctags: Warning: drivers/acpi/processor_idle.c:64: null expansion of name pattern "\1"
      ctags: Warning: drivers/xen/events/events_2l.c:41: null expansion of name pattern "\1"
      ctags: Warning: kernel/locking/lockdep.c:151: null expansion of name pattern "\1"
      ctags: Warning: kernel/rcu/rcutorture.c:133: null expansion of name pattern "\1"
      ctags: Warning: kernel/rcu/rcutorture.c:135: null expansion of name pattern "\1"
      ctags: Warning: kernel/workqueue.c:323: null expansion of name pattern "\1"
      ctags: Warning: net/ipv4/syncookies.c:53: null expansion of name pattern "\1"
      ctags: Warning: net/ipv6/syncookies.c:44: null expansion of name pattern "\1"
      ctags: Warning: net/rds/page.c:45: null expansion of name pattern "\1"
      
      These are all the result of the DEFINE_PER_CPU pattern:

        scripts/tags.sh:200:	'/\<DEFINE_PER_CPU([^,]*, *\([[:alnum:]_]*\)/\1/v/'
        scripts/tags.sh:201:	'/\<DEFINE_PER_CPU_SHARED_ALIGNED([^,]*, *\([[:alnum:]_]*\)/\1/v/'

      The below cures them. All except the workqueue one are within reasonable
      distance of the 80-character limit. TJ, do you have any preference on how
      to fix the wq one, or shall we just not care that it's too long?
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
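
      The tags pattern only captures the variable name when the whole
      macro invocation sits on one line, so the cure is to un-wrap the
      declarations (hypothetical variable name for illustration):

      /* before: the regex captures nothing for \1 */
      static DEFINE_PER_CPU(struct list_head,
                            pending_work_items);

      /* after: one line, the tag is generated as expected */
      static DEFINE_PER_CPU(struct list_head, pending_work_items);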
  14. 15 Mar 2016 (3 commits)
    • net: diag: add a scheduling point in inet_diag_dump_icsk() · acffb584
      Committed by Eric Dumazet
      On loaded TCP servers, looking at millions of sockets can hold the
      cpu for many seconds if the lookup condition is very narrow
      (e.g., ss dst 1.2.3.4).

      Better to add a cond_resched() to allow other processes to access
      the cpu.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
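
      A sketch of the scheduling point in the dump loop (bucket-walk
      details elided; the table-size constant is assumed from that era's
      listener hash):

      for (i = s_i; i < INET_LHTABLE_SIZE; i++) {
              /* long filtered walks must not monopolize the cpu */
              cond_resched();
              /* ... walk one listener bucket under its lock ... */
      }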
    • netfilter: Allow calling into nat helper without skb_dst. · 26461905
      Committed by Jarno Rajahalme
      The NAT checksum recalculation code assumes the existence of skb_dst,
      which becomes a problem for a later patch in the series ("openvswitch:
      Interface with NAT.").  Simplify this by removing the check on
      skb_dst, as the checksum will be dealt with later in the stack.
      Suggested-by: Pravin Shelar <pshelar@nicira.com>
      Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • tcp: Add RFC4898 tcpEStatsPerfDataSegsOut/In · a44d6eac
      Committed by Martin KaFai Lau
      Per RFC 4898, these count segments sent/received that contain a
      positive-length data segment (including retransmitted segments
      carrying data).  Unlike tcpi_segs_out/in, tcpi_data_segs_out/in
      excludes segments carrying no data (e.g., pure ACKs).
      
      The patch also updates segs_in in tcp_fastopen_add_skb()
      so that the segs_in >= data_segs_in property is kept.
      
      Together with retransmission data, tcpi_data_segs_out
      gives a better signal on the retransmit rate.
      
      v6: Rebase on the latest net-next
      
      v5: Eric pointed out that checking skb->len is still needed in
      tcp_fastopen_add_skb() because skb can carry a FIN without data.
      Hence, instead of open coding segs_in and data_segs_in, tcp_segs_in()
      helper is used.  Comment is added to the fastopen case to explain why
      segs_in has to be reset and tcp_segs_in() has to be called before
      __skb_pull().
      
      v4: Add comment to the changes in tcp_fastopen_add_skb()
      and also add remark on this case in the commit message.
      
      v3: Add const modifier to the skb parameter in tcp_segs_in()
      
      v2: Rework based on recent fix by Eric:
      commit a9d99ce2 ("tcp: fix tcpi_segs_in after connection establishment")
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Cc: Chris Rapier <rapier@psc.edu>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Marcelo Ricardo Leitner <mleitner@redhat.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
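
      A sketch of the tcp_segs_in() helper described above: every received
      skb counts its gso segments, and counts them again as data segments
      only when it carries payload beyond the TCP header (a FIN-only skb
      has skb->len == tcp_hdrlen(skb)):

      static inline void tcp_segs_in(struct tcp_sock *tp,
                                     const struct sk_buff *skb)
      {
              u16 segs_in = max_t(u16, 1, skb_shinfo(skb)->gso_segs);

              tp->segs_in += segs_in;
              if (skb->len > tcp_hdrlen(skb))
                      tp->data_segs_in += segs_in;
      }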
  15. 14 Mar 2016 (4 commits)
  16. 10 Mar 2016 (1 commit)