1. 11 Sep 2013, 1 commit
  2. 05 Sep 2013, 1 commit
  3. 31 Aug 2013, 1 commit
  4. 30 Aug 2013, 3 commits
    • net: packet: document available fanout policies · 7ec06da8
      Daniel Borkmann committed
      Update the documentation to list the available fanout policies.
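
      For illustration (an editor's sketch, not part of this patch), a
      process joins a fanout group by packing a group id and a policy
      into the PACKET_FANOUT argument:

      /* Join fanout group 7 with the hash policy; needs CAP_NET_RAW. */
      #include <linux/if_ether.h>
      #include <linux/if_packet.h>
      #include <netinet/in.h>
      #include <sys/socket.h>

      int join_fanout(void)
      {
          int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
          if (fd < 0)
              return -1;
          int arg = 7 | (PACKET_FANOUT_HASH << 16); /* group id | policy */
          return setsockopt(fd, SOL_PACKET, PACKET_FANOUT, &arg, sizeof(arg));
      }
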
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7ec06da8
    • tcp: TSO packets automatic sizing · 95bd09eb
      Eric Dumazet committed
      After hearing many people complain over the past years that TSO is
      bursty or even buggy, we are proud to present automatic sizing of TSO
      packets.
      
      One part of the problem is that tcp_tso_should_defer() uses a heuristic
      relying on upcoming ACKs instead of a timer; more generally, big TSO
      packets make little sense at low rates, as they tend to create
      micro-bursts on the network, and the general consensus is to reduce
      the amount of buffering.
      
      This patch introduces a per-socket field, sk_pacing_rate, that
      approximates the current sending rate and lets us size TSO packets so
      that we try to send one packet every millisecond.
      
      This field could be set by other transports.
      
      The patch has no impact on high-speed flows, where large TSO packets
      make sense to reach line rate.
      
      For other flows, it enables better packet scheduling and ACK clocking.
      
      It also increases the performance of TCP flows in lossy environments.
      
      A new sysctl (tcp_min_tso_segs) is added to specify the minimum
      number of segments per TSO packet (the default is 2).
      
      A follow-up patch will provide a new packet scheduler (FQ), using
      sk_pacing_rate as an input to perform optional per-flow pacing.
      
      This explains why we chose to set sk_pacing_rate to twice the current
      rate, allowing 'slow start' ramp-up.
      
      sk_pacing_rate = 2 * cwnd * mss / srtt
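
      A back-of-the-envelope user-space sketch of the sizing math above
      (an editor's illustration with made-up names, not kernel code;
      assumes a 64-bit unsigned long):

      #include <stdio.h>

      int main(void)
      {
          /* Toy inputs: rate in bytes/sec, srtt in microseconds. */
          unsigned long cwnd = 10, mss = 1448, srtt_us = 20000; /* 20 ms */

          /* sk_pacing_rate = 2 * cwnd * mss / srtt */
          unsigned long pacing_rate = 2 * cwnd * mss * 1000000UL / srtt_us;

          /* Aim for one TSO packet per ms, but never fewer than
           * tcp_min_tso_segs (default 2) segments per packet. */
          unsigned long segs = (pacing_rate / 1000) / mss;
          if (segs < 2)
              segs = 2;

          printf("pacing_rate=%lu B/s, tso goal=%lu segs\n", pacing_rate, segs);
          return 0;
      }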
      
      v2: Neal Cardwell reported a suspicious deferral of the last two
      segments on an initial write of 10 MSS; tcp_tso_should_defer() was
      changed to take tp->xmit_size_goal_segs into account.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      95bd09eb
    • ipv6: drop fragmented ndisc packets by default (RFC 6980) · b800c3b9
      Hannes Frederic Sowa committed
      This patch implements RFC 6980: drop fragmented ndisc packets by
      default. If a fragmented ndisc packet is received, the user is
      informed that it is possible to disable the check.
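
      A minimal sketch of disabling the check from user space, assuming the
      per-interface suppress_frag_ndisc procfs knob this patch introduces
      (needs root):

      #include <stdio.h>

      int main(void)
      {
          /* 0 disables the RFC 6980 drop; the default is on. */
          FILE *f = fopen("/proc/sys/net/ipv6/conf/all/suppress_frag_ndisc", "w");
          if (!f)
              return 1;
          fputs("0\n", f);
          return fclose(f) ? 1 : 0;
      }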
      
      Cc: Fernando Gont <fernando@gont.com.ar>
      Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b800c3b9
  5. 28 Aug 2013, 1 commit
  6. 24 Aug 2013, 1 commit
    • openvswitch: Mega flow implementation · 03f0d916
      Andy Zhou committed
      Add wildcarded flow support to the kernel datapath.
      
      Wildcarded flows can improve OVS flow setup performance by avoiding
      the need to send new flows that match existing wildcard flows to the
      user space program. The exact performance boost largely depends on
      the wildcarded flow hit rate.
      
      In the case where all new flows hit wildcard flows, the flow setup
      rate is within 5% of that of the Linux bridge module.
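
      A toy sketch of the wildcarded-matching idea, with made-up structures
      rather than the OVS ones: a packet key matches an entry if the bytes
      selected by the mask are equal.

      #include <stdbool.h>
      #include <stddef.h>
      #include <stdint.h>

      struct toy_flow {
          uint8_t key[8];
          uint8_t mask[8]; /* 0xff = exact-match bit, 0x00 = wildcard bit */
      };

      static bool flow_matches(const struct toy_flow *f, const uint8_t *pkt)
      {
          for (size_t i = 0; i < sizeof(f->key); i++)
              if ((pkt[i] & f->mask[i]) != (f->key[i] & f->mask[i]))
                  return false;
          return true;
      }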
      
      Pravin has made significant contributions to this patch, including
      API cleanups and bug fixes.
      Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: Andy Zhou <azhou@nicira.com>
      Signed-off-by: Jesse Gross <jesse@nicira.com>
      03f0d916
  7. 14 Aug 2013, 1 commit
    • ipv6: make unsolicited report intervals configurable for mld · fc4eba58
      Hannes Frederic Sowa committed
      Commits cab70040 ("net: igmp: Reduce Unsolicited report interval to
      1s when using IGMPv3") and 2690048c ("net: igmp: Allow user-space
      configuration of igmp unsolicited report interval") by William Manley
      made IGMP unsolicited report intervals configurable per interface and
      corrected the resend interval of unsolicited IGMPv3 report messages
      to 1s.
      
      The same needs to be done for IPv6:
      
      MLDv1 (RFC2710 7.10.): 10 seconds
      MLDv2 (RFC3810 9.11.): 1 second
      
      Both intervals are configurable via new procfs knobs
      mldv1_unsolicited_report_interval and mldv2_unsolicited_report_interval.
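
      A minimal sketch that reads the two knobs back for one interface
      (eth0 and the exact paths are assumptions; values are in
      milliseconds):

      #include <stdio.h>

      static long read_knob(const char *path)
      {
          long val = -1;
          FILE *f = fopen(path, "r");
          if (f) {
              if (fscanf(f, "%ld", &val) != 1)
                  val = -1;
              fclose(f);
          }
          return val;
      }

      int main(void)
      {
          printf("mldv1: %ld ms\n", read_knob(
              "/proc/sys/net/ipv6/conf/eth0/mldv1_unsolicited_report_interval"));
          printf("mldv2: %ld ms\n", read_knob(
              "/proc/sys/net/ipv6/conf/eth0/mldv2_unsolicited_report_interval"));
          return 0;
      }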
      
      (also added .force_mld_version to ipv6_devconf_dflt to bring structs in
      line without semantic changes)
      
      v2:
      a) Joined the documentation updates for the IPv4 and IPv6 MLD/IGMP
         unsolicited_report_interval procfs knobs.
      b) Incorporated stylistic feedback from William Manley.
      
      v3:
      a) Added the new DEVCONF_* values to the end of the enum (thanks to
         David Miller).
      
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: William Manley <william.manley@youview.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fc4eba58
  8. 10 Aug 2013, 2 commits
  9. 01 Aug 2013, 1 commit
  10. 31 Jul 2013, 2 commits
  11. 25 Jul 2013, 2 commits
    • tcp: TCP_NOTSENT_LOWAT socket option · c9bee3b7
      Eric Dumazet committed
      The idea of this patch is to add an optional limit on the number of
      unsent bytes in TCP sockets, to reduce kernel memory usage.
      
      The TCP receiver might announce a big window, and TCP sender
      autotuning might allow a large number of bytes in the write queue,
      but this has little performance benefit if a large part of this
      buffering is wasted:
      
      The write queue needs to be large only to deal with a large BDP, not
      necessarily to cope with scheduling delays (incoming ACKs make room
      for the application to queue more bytes).
      
      For most workloads, a value of 128 KB or less is enough to give
      applications time to react to POLLOUT events (or to be woken up from
      a blocking sendmsg()).
      
      This patch adds two ways to set the limit:
      
      1) The per-socket option TCP_NOTSENT_LOWAT.
      
      2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets not
      using the TCP_NOTSENT_LOWAT socket option (or setting it to zero).
      The default value is UINT_MAX (0xFFFFFFFF), meaning the limit has no
      effect.
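
      A minimal per-socket usage sketch (an editor's illustration; the
      fallback option value 25 matches linux/tcp.h at the time of this
      patch):

      #include <netinet/in.h>
      #include <netinet/tcp.h>
      #include <sys/socket.h>

      #ifndef TCP_NOTSENT_LOWAT
      #define TCP_NOTSENT_LOWAT 25
      #endif

      /* Cap unsent data on one connected TCP socket at 128 KB. */
      int cap_unsent(int fd)
      {
          unsigned int lowat = 128 * 1024;
          return setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
                            &lowat, sizeof(lowat));
      }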
      
      This changes poll()/select()/epoll() to report POLLOUT only if the
      number of unsent bytes is below tp->notsent_lowat.
      
      Note this might increase the number of sendmsg()/sendfile() calls
      when using non-blocking sockets, and the number of context switches
      for blocking sockets.
      
      Note this is not related to SO_SNDLOWAT, which is defined as:
       "Specify the minimum number of bytes in the buffer until
       the socket layer will pass the data to the protocol."
      
      Tested:
      
      netperf sessions, and watching /proc/net/protocols "memory" column for TCP
      
      With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory
      used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458)
      
      lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
      lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
      TCPv6     1880      2   45458   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
      TCP       1696    508   45458   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
      
      lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
      lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
      TCPv6     1880      2   20567   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
      TCP       1696    508   20567   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
      
      Using 128 KB has no adverse effect on the throughput or CPU usage of
      a single flow, although context switches increase.
      
      A bonus is that we hold the socket lock for a shorter time, which
      should improve ACK processing latency.
      
      lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
      lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
      OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
      Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
      Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
      Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
      Final       Final                                             %     Method %      Method
      1651584     6291456     16384  20.00   17447.90   10^6bits/s  3.13  S      -1.00  U      0.353   -1.000  usec/KB
      
       Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':
      
                 412,514 context-switches
      
           200.034645535 seconds time elapsed
      
      lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
      lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
      OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
      Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
      Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
      Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
      Final       Final                                             %     Method %      Method
      1593240     6291456     16384  20.00   17321.16   10^6bits/s  3.35  S      -1.00  U      0.381   -1.000  usec/KB
      
       Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':
      
               2,675,818 context-switches
      
           200.029651391 seconds time elapsed
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c9bee3b7
    • net: sctp: trivial: update mailing list address · 91705c61
      Daniel Borkmann committed
      The SCTP mailing list address for patches and questions is
      linux-sctp@vger.kernel.org, not lksctp-developers@lists.sourceforge.net
      anymore. Therefore, update all occurrences.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Acked-by: Vlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      91705c61
  12. 10 Jul 2013, 1 commit
  13. 26 Jun 2013, 4 commits
    • ipvs: add sync_persist_mode flag · 4d0c875d
      Julian Anastasov committed
      Add a sync_persist_mode flag to reduce sync traffic
      by syncing only persistent templates.
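
      A minimal sketch of enabling the flag from user space (needs root
      and the IPVS sysctl tree):

      #include <fcntl.h>
      #include <unistd.h>

      int main(void)
      {
          int fd = open("/proc/sys/net/ipv4/vs/sync_persist_mode", O_WRONLY);
          if (fd < 0)
              return 1;
          int ok = write(fd, "1\n", 2) == 2; /* 1 = sync only templates */
          close(fd);
          return ok ? 0 : 1;
      }
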
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Tested-by: Aleksey Chudov <aleksey.chudov@gmail.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
      4d0c875d
    • bonding: add an option to fail when any of arp_ip_target is inaccessible · 8599b52e
      Veaceslav Falico committed
      Currently, we fail only when all of the IPs in arp_ip_target are gone.
      However, in some situations we might need to fail even if a single
      host from arp_ip_target becomes unavailable.
      
      All situations, obviously, rely on the idea that we need a
      *completely* functional network, with all interfaces/addresses
      working correctly.
      
      One real-world example might be VLANs on top of a bond (hybrid port).
      If the bond and the VLANs have IPs assigned and their peers are
      monitored via arp_ip_target, then in case of switch misconfiguration
      (trunk/access port), slave driver malfunction, or tagged/untagged
      traffic dropped on the way, we will be able to switch to another
      slave.
      
      More generally, any configuration that requires access to all
      arp_ip_targets needs this.
      
      This patch adds this possibility via a new parameter, arp_all_targets
      (available both as a module parameter and as a sysfs knob). It can be
      set to:
      
      	0 or any (the default) - which works exactly as it's working now -
      	the slave is up if any of the arp_ip_targets are up.
      
      	1 or all - the slave is up if all of the arp_ip_targets are up.
      
      This parameter can be changed on the fly (via sysfs), and requires the mode
      to be active-backup and arp_validate to be enabled (it obeys the
      arp_validate config on which slaves to validate).
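
      A minimal sketch of flipping the knob at runtime via sysfs (bond0 is
      an assumption; needs root):

      #include <stdio.h>

      int main(void)
      {
          const char *path = "/sys/class/net/bond0/bonding/arp_all_targets";
          char cur[32] = "";
          FILE *f = fopen(path, "w");

          if (!f)
              return 1;
          fputs("all\n", f); /* "all"/1 = strict; "any"/0 = old behavior */
          if (fclose(f) != 0)
              return 1;

          f = fopen(path, "r"); /* read back the current setting */
          if (f && fgets(cur, sizeof(cur), f))
              printf("arp_all_targets is now: %s", cur);
          if (f)
              fclose(f);
          return 0;
      }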
      
      Internally it's done through:
      
      1) Add target_last_arp_rx[BOND_MAX_ARP_TARGETS] array to slave struct. It's
         an array of jiffies, meaning that slave->target_last_arp_rx[i] is the
         last time we've received arp from bond->params.arp_targets[i] on this
         slave.
      
      2) If we successfully validate an arp from bond->params.arp_targets[i] in
         bond_validate_arp() - update the slave->target_last_arp_rx[i] with the
         current jiffies value.
      
      3) When getting slave's last_rx via slave_last_rx(), we return the oldest
         time when we've received an arp from any address in
         bond->params.arp_targets[].
      
      If arp_all_targets == 0, we still work the same way as before.
      
      Also, update the documentation to reflect the new parameter.
      
      v3->v4:
      Kill the forgotten rtnl_unlock(), rephrase the documentation part to
      be clearer, and don't fail setting arp_all_targets if arp_validate is
      not set - it has no effect anyway, but allowing it makes setup easier.
      Also, print a warning if the last arp_ip_target is removed while
      arp_interval is on but arp_validate is not set.
      
      v2->v3:
      Use _bh spinlock, remove useless rtnl_lock() and use jiffies for new
      arp_ip_target last arp, instead of slave_last_rx(). On bond_enslave(),
      use the same initialization value for target_last_arp_rx[] as is used
      for the default last_arp_rx, to avoid useless interface flaps.
      
      Also, instead of failing to remove the last arp_ip_target just print a
      warning - otherwise it might break existing scripts.
      
      v1->v2:
      Correctly handle adding/removing hosts in arp_ip_target - we need to
      shift/initialize all slaves' target_last_arp_rx. Also, don't fail
      module loading on arp_all_targets misconfiguration, just disable it,
      and make some minor style fixes.
      Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8599b52e
    • bonding: doc: some details on backup slave arp validation · d7d35c68
      Veaceslav Falico committed
      Add some details to the bonding documentation on how backup slave
      ARP validation works.
      Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d7d35c68
    • doc: fix some syntax errors in netlink mmap sample code · 76237576
      Cong Wang committed
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Cong Wang <amwang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      76237576
  14. 24 Jun 2013, 1 commit
  15. 13 Jun 2013, 1 commit
    • net: add doc for ip_early_demux sysctl · e3d73bce
      Cong Wang committed
      Commit 6648bd7e (ipv4: Add sysctl knob to control early socket demux)
      introduced this sysctl but forgot to document it in
      Documentation/networking/ip-sysctl.txt. This patch adds it.
      
      Basically, I took the text from the descriptions of commit 41063e9d
      (ipv4: Early TCP socket demux.) and the above commit.
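
      A minimal sketch of toggling the knob; set_sysctl() is an editor's
      helper, not a kernel or libc API (needs root):

      #include <stdio.h>

      static int set_sysctl(const char *path, const char *val)
      {
          FILE *f = fopen(path, "w");
          if (!f)
              return -1;
          int rc = (fputs(val, f) >= 0) ? 0 : -1;
          if (fclose(f) != 0)
              rc = -1;
          return rc;
      }

      int main(void)
      {
          /* Boolean knob; the default is 1 (enabled). */
          return set_sysctl("/proc/sys/net/ipv4/ip_early_demux", "0\n") ? 1 : 0;
      }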
      
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Cong Wang <amwang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e3d73bce
  16. 08 Jun 2013, 2 commits
  17. 05 Jun 2013, 1 commit
  18. 28 May 2013, 3 commits
  19. 24 May 2013, 1 commit
  20. 25 Apr 2013, 1 commit
  21. 20 Apr 2013, 1 commit
  22. 09 Apr 2013, 2 commits
  23. 31 Mar 2013, 1 commit
  24. 27 Mar 2013, 1 commit
  25. 21 Mar 2013, 2 commits
    • tcp: implement RFC5682 F-RTO · e33099f9
      Yuchung Cheng committed
      This patch implements F-RTO (forward RTO recovery):
      
      When the first retransmission after timeout is acknowledged, F-RTO
      sends new data instead of old data. If the next ACK acknowledges
      some never-retransmitted data, then the timeout was spurious and the
      congestion state is reverted.  Otherwise if the next ACK selectively
      acknowledges the new data, then the timeout was genuine and the
      loss recovery continues. This idea applies to recurring timeouts
      as well. While F-RTO sends different data during timeout recovery,
      it does not (and should not) change the congestion control.
      
      The implementation follows the three steps of the SACK-enhanced
      algorithm (section 3) in RFC5682. Step 1 is in tcp_enter_loss();
      steps 2 and 3 are in tcp_process_loss(). The basic version is not
      supported, because the SACK-enhanced version also works for non-SACK
      connections.
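
      A toy sketch of the step-2/3 decision described above (an editor's
      illustration, not kernel code):

      #include <stdbool.h>

      enum frto_verdict { TIMEOUT_GENUINE, TIMEOUT_SPURIOUS, KEEP_PROBING };

      /* After the post-RTO retransmit is ACKed, new data is sent (step 2).
       * The next ACK then decides (step 3): */
      static enum frto_verdict frto_step3(bool acks_never_retransmitted_data,
                                          bool sacks_the_new_data)
      {
          if (acks_never_retransmitted_data)
              return TIMEOUT_SPURIOUS; /* revert congestion state */
          if (sacks_the_new_data)
              return TIMEOUT_GENUINE;  /* continue loss recovery */
          return KEEP_PROBING;         /* not enough evidence yet */
      }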
      
      The new implementation is functionally at parity with the old F-RTO
      implementation except for one case where it increases undo events:
      in addition to the RFC algorithm, a spurious timeout may be detected
      without sending data in step 2, as long as the SACKs confirm that not
      all of the original data were dropped. When this happens, the sender
      will undo the cwnd and perhaps enter fast recovery instead. This
      additional check increases F-RTO undo events by 5x on Google web
      servers compared to the prior implementation, since the sender often
      does not have new data to send for HTTP.
      
      Note that F-RTO may detect a spurious timeout before timestamp-based
      Eifel does.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e33099f9
    • tcp: refactor F-RTO · 9b44190d
      Yuchung Cheng committed
      This patch series refactors the F-RTO feature (RFC4138/5682).
      
      This is to simplify loss recovery processing. The existing F-RTO
      was developed during the experimental stage (RFC4138) and has
      many experimental features. It takes a separate code path from
      the traditional timeout processing by overloading CA_Disorder
      instead of using the CA_Loss state. This complicates CA_Disorder
      state handling because it is also used for handling dubious ACKs
      and undos. While the algorithm in the RFC does not change the
      congestion control, the implementation intercepts congestion control
      in various places (e.g., frto_cwnd in tcp_ack()).
      
      The new code implements the newer F-RTO RFC5682 using the CA_Loss
      processing path. F-RTO becomes a small extension of the timeout
      processing and interfaces with the congestion control and Eifel undo
      modules. It lets the congestion control module determine
      independently how much to send. F-RTO only chooses what to send in
      order to detect spurious retransmission. If a timeout is found
      spurious, it invokes existing Eifel undo algorithms such as DSACK or
      TCP timestamp based detection.
      
      The first patch removes all F-RTO code except sysctl_tcp_frto, which
      is left for the new implementation. Since CA_EVENT_FRTO is removed,
      TCP Westwood now computes ssthresh on the regular timeout event,
      CA_EVENT_LOSS.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9b44190d
  26. 19 Mar 2013, 1 commit
    • ipvs: add backup_only flag to avoid loops · 0c12582f
      Julian Anastasov committed
      Dmitry Akindinov reported a problem where SYNs loop between the
      master and the backup server when the backup server is used as a real
      server in DR mode and has IPVS rules to function as a director.
      
      Even when the backup function is enabled, we continue to forward
      traffic and schedule new connections while the current master is
      using the backup server as a real server. While this is not a problem
      for NAT, with the DR and TUN methods the backup server cannot
      determine whether a request comes from a client or from the director.
      
      To avoid such loops, add a new sysctl flag, backup_only. It can be
      needed for DR/TUN setups that do not need the backup and director
      functions at the same time. When the backup function is enabled, we
      stop any forwarding and pass the traffic to the local stack (real
      server mode); that is, the flag disables the director function while
      the backup function is enabled.
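
      A minimal sketch of enabling the flag on the backup director (needs
      root; path per the IPVS sysctl tree):

      #include <stdio.h>

      int main(void)
      {
          FILE *f = fopen("/proc/sys/net/ipv4/vs/backup_only", "w");
          if (!f) {
              perror("backup_only");
              return 1;
          }
          fputs("1\n", f); /* stop forwarding; act as real server only */
          return fclose(f) ? 1 : 0;
      }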
      
      For setups that enable the backup function for some virtual services
      and the director function for others, a more complex solution would
      be needed to support DR/TUN mode, perhaps assigning a
      per-virtual-service syncid value so that we can differentiate the
      requests.
      Reported-by: Dmitry Akindinov <dimak@stalker.com>
      Tested-by: German Myzovsky <lawyer@sipnet.ru>
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
      0c12582f
  27. 18 Mar 2013, 1 commit
    • tcp: Remove TCPCT · 1a2c6181
      Christoph Paasch committed
      TCPCT uses option number 253, which is reserved for experimental use
      and should not be used in production environments.
      Further, TCPCT does not fully implement RFC 6013.
      
      As a nice side effect, removing TCPCT increases TCP's performance for
      very short flows:
      
      Doing an ApacheBench run with -c 100 -n 100000, sending HTTP requests
      for files of 1 KB size.
      
      before this patch:
      	average (among 7 runs) of 20845.5 Requests/Second
      after:
      	average (among 7 runs) of 21403.6 Requests/Second
      Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1a2c6181