1. 29 Jul 2013: 2 commits
  2. 28 Jul 2013: 4 commits
  3. 27 Jul 2013: 1 commit
  4. 25 Jul 2013: 7 commits
    • net: Make devnet_rename_seq static · 18afa4b0
      Committed by Thomas Gleixner
      No users outside net/core/dev.c.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      18afa4b0
    • tcp: TCP_NOTSENT_LOWAT socket option · c9bee3b7
      Committed by Eric Dumazet
      The idea of this patch is to add an optional limit on the number of
      unsent bytes in TCP sockets, to reduce kernel memory usage.
      
      The TCP receiver might announce a big window, and TCP sender autotuning
      might allow a large number of bytes in the write queue, but this brings
      little performance benefit if a large part of that buffering is wasted:

      the write queue needs to be large only to deal with a large BDP, not
      necessarily to cope with scheduling delays (incoming ACKs make room
      for the application to queue more bytes).
      
      For most workloads, a value of 128 KB or less is enough to give
      applications time to react to POLLOUT events
      (or to be awakened from a blocking sendmsg()).
      
      This patch adds two ways to set the limit:
      
      1) Per socket option TCP_NOTSENT_LOWAT
      
      2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
      not using the TCP_NOTSENT_LOWAT socket option (or setting a zero value).
      The default value is UINT_MAX (0xFFFFFFFF), meaning this has no effect.
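
      For illustration only (not part of the original patch), a minimal
      userspace sketch of the per-socket option; the fallback #define and
      the helper name are ours, and a connected TCP socket fd is assumed:

      #include <sys/socket.h>
      #include <netinet/in.h>
      #include <netinet/tcp.h>
      #include <stdio.h>

      #ifndef TCP_NOTSENT_LOWAT
      #define TCP_NOTSENT_LOWAT 25	/* value used by the Linux uapi header */
      #endif

      /* Cap unsent bytes on this socket at `bytes`; POLLOUT is then only
       * reported while the unsent backlog stays below that limit. */
      static int set_notsent_lowat(int fd, unsigned int bytes)
      {
      	if (setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
      		       &bytes, sizeof(bytes)) < 0) {
      		perror("setsockopt(TCP_NOTSENT_LOWAT)");
      		return -1;
      	}
      	return 0;
      }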
      
      This changes poll()/select()/epoll() to report POLLOUT
      only if the number of unsent bytes is below tp->notsent_lowat.
      
      Note this might increase the number of sendmsg()/sendfile() calls
      when using non-blocking sockets,
      and the number of context switches for blocking sockets.
      
      Note this is not related to SO_SNDLOWAT, which is defined as:
       Specify the minimum number of bytes in the buffer until
       the socket layer will pass the data to the protocol.
      
      Tested:
      
      netperf sessions, and watching /proc/net/protocols "memory" column for TCP
      
      With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory
      used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458)
      
      lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
      lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
      TCPv6     1880      2   45458   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
      TCP       1696    508   45458   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
      
      lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
      lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
      TCPv6     1880      2   20567   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
      TCP       1696    508   20567   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
      
      Using 128 KB has no bad effect on the throughput or CPU usage
      of a single flow, although there is an increase in context switches.

      A bonus is that we hold the socket lock for a shorter amount
      of time, which should improve latencies of ACK processing.
      
      lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
      lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
      OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
      Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
      Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
      Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
      Final       Final                                             %     Method %      Method
      1651584     6291456     16384  20.00   17447.90   10^6bits/s  3.13  S      -1.00  U      0.353   -1.000  usec/KB
      
       Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':
      
                 412,514 context-switches
      
           200.034645535 seconds time elapsed
      
      lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
      lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
      OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
      Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
      Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
      Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
      Final       Final                                             %     Method %      Method
      1593240     6291456     16384  20.00   17321.16   10^6bits/s  3.35  S      -1.00  U      0.381   -1.000  usec/KB
      
       Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':
      
               2,675,818 context-switches
      
           200.029651391 seconds time elapsed
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c9bee3b7
    • net: add sk_stream_is_writeable() helper · 64dc6130
      Committed by Eric Dumazet
      Several call sites use the following hardcoded condition:
      
      sk_stream_wspace(sk) >= sk_stream_min_wspace(sk)
      
      Let's use a helper, because TCP_NOTSENT_LOWAT support will change this
      condition for TCP sockets.
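
      As a sketch of the idea (a userspace stand-in, not the kernel code or
      its real struct sock), the helper just wraps that comparison in one
      place so TCP can later redefine what "writeable" means:

      #include <stdbool.h>

      /* Stub carrying only the two quantities the check needs. */
      struct stream_sock_stub {
      	int wspace;	/* free space left in the write queue */
      	int min_wspace;	/* low-water mark for reporting writeability */
      };

      static bool sk_stream_is_writeable(const struct stream_sock_stub *sk)
      {
      	return sk->wspace >= sk->min_wspace;
      }
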
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      64dc6130
    • net: sctp: trivial: update mailing list address · 91705c61
      Committed by Daniel Borkmann
      The SCTP mailing list address for patches and questions is now
      linux-sctp@vger.kernel.org, not lksctp-developers@lists.sourceforge.net.
      Therefore, update all occurrences.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Acked-by: Vlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      91705c61
    • net: trans_rdma: remove unused function · 59ea52dc
      Committed by Andi Shyti
      This patch gets rid of the following warning:
      
      net/9p/trans_rdma.c:594:12: warning: ‘rdma_cancelled’ defined but not used [-Wunused-function]
       static int rdma_cancelled(struct p9_client *client, struct p9_req_t *req)
      
      The rdma_cancelled function is not called anywhere in the kernel.
      Signed-off-by: Andi Shyti <andi@etezian.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      59ea52dc
    • net: ipv6 eliminate parameter "int addrlen" in function fib6_add_1 · 9225b230
      Committed by fan.du
      The "int addrlen" in fib6_add_1 is rebundant, as we can get it from
      parameter "struct in6_addr *addr" once we modified its type.
      And also fix some coding style issues in fib6_add_1
      Signed-off-by: Fan Du <fan.du@windriver.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9225b230
    • net ipv6: Remove redundant rt6i_nsiblings initialization · 86a37def
      Committed by fan.du
      Setting rt->rt6i_nsiblings to zero is redundant, because the memset
      above already zeroed the rest of rt, excluding the first dst member.
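
      The pattern being relied on looks roughly like this (illustrative
      struct and helper names, not the actual rt6_info layout):

      #include <stddef.h>
      #include <string.h>

      struct rt_stub {
      	long dst;	/* first member, deliberately left untouched */
      	int  nsiblings;
      	int  flags;
      };

      static void rt_init(struct rt_stub *rt)
      {
      	/* Zero everything after the first member in one go; an explicit
      	 * "rt->nsiblings = 0;" afterwards would then be redundant. */
      	memset((char *)rt + offsetof(struct rt_stub, nsiblings), 0,
      	       sizeof(*rt) - offsetof(struct rt_stub, nsiblings));
      }
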
      Signed-off-by: Fan Du <fan.du@windriver.com>
      Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: James Morris <jmorris@namei.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      86a37def
  5. 24 Jul 2013: 4 commits
  6. 23 Jul 2013: 5 commits
    • tcp: use RTT from SACK for RTO · ed08495c
      Committed by Yuchung Cheng
      If RTT is not available because Karn's check has failed or no
      new packet is acked, use the RTT measured from SACK to estimate
      the RTO. The sender can continue to estimate the RTO during loss
      recovery or reordering events upon receiving non-partial ACKs.

      This also changes when the RTO is re-armed. Previously it was
      only re-armed when some data was cumulatively acknowledged (i.e.,
      SND.UNA advances), but now it is re-armed whenever the RTT estimator
      is updated. This is particularly useful to reduce spurious
      timeouts under buffer bloat, including on cellular carriers [1], and
      for RTT estimation on reordering events.
      
      [1] "An In-depth Study of LTE: Effect of Network Protocol and
       Application Behavior on Performance", In Proc. of SIGCOMM 2013
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ed08495c
    • tcp: measure RTT from new SACK · 59c9af42
      Committed by Yuchung Cheng
      Take an RTT sample if an ACK selectively acks some sequences that
      have never been retransmitted. Karn's algorithm does not apply
      even if that ACK (s)acks other retransmitted sequences, because it
      must have been generated by an original, but perhaps out-of-order,
      packet. There is no ambiguity. When multiple blocks are newly
      sacked because of ACK losses, the earliest block is used to
      measure RTT, similar to cumulative ACKs.
      
      Such RTT samples allow the sender to estimate the RTO during loss
      recovery and packet reordering events. They are still useful even with
      TCP timestamps, because during these events SND.UNA may
      not advance, preventing RTT samples from TS ECR (hence the FLAG_ACKED
      check before calling tcp_ack_update_rtt()). Therefore this new
      RTT source is complementary to the existing ACK and TS RTT mechanisms.
      
      This patch does not update the RTO. It is done in the next patch.
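
      A rough sketch of the sampling rule described above (our own
      simplification in userspace C, not the kernel code):

      #include <stdbool.h>

      struct sacked_range_stub {
      	bool retransmitted;	/* was any part of the range ever retransmitted? */
      	long xmit_time_us;	/* original transmit time of the range */
      };

      /* Return an RTT sample in microseconds, or -1 when Karn's rule
       * forbids one because the sample would be ambiguous. */
      static long rtt_sample_from_sack(const struct sacked_range_stub *r,
      				 long now_us)
      {
      	if (r->retransmitted)
      		return -1;	/* cannot tell which transmission was SACKed */
      	return now_us - r->xmit_time_us;
      }
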
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      59c9af42
    • tcp: prefer packet timing to TS-ECR for RTT · 5b08e47c
      Committed by Yuchung Cheng
      Prefer packet timings to TS-ECR for RTT measurements when both
      sources are available. That's because broken middleboxes and remote
      peers can return packets with corrupted TS ECR fields. Similarly, most
      congestion controls that require RTT signals favor timing-based
      sources as well. Also check for bad TS ECR values to avoid RTT
      blow-ups; this has happened on production web servers.
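
      The preference can be pictured with a sketch like the following
      (the fallback order matches the description above, but the helper
      name and the sanity cap are ours, not the kernel's):

      /* Pick an RTT source: local packet timing first, TS ECR only as a
       * fallback, and reject echo-based values that would blow up the RTO. */
      static long choose_rtt_us(long packet_timing_us, long ts_ecr_rtt_us)
      {
      	const long sane_max_us = 60L * 1000 * 1000;	/* illustrative 60s cap */

      	if (packet_timing_us > 0)
      		return packet_timing_us;
      	if (ts_ecr_rtt_us <= 0 || ts_ecr_rtt_us > sane_max_us)
      		return -1;	/* corrupted or missing TS ECR: take no sample */
      	return ts_ecr_rtt_us;
      }
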
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5b08e47c
    • tcp: consolidate SYNACK RTT sampling · 375fe02c
      Committed by Yuchung Cheng
      The first patch consolidates SYNACK and other RTT measurements to use a
      central function, tcp_ack_update_rtt(). A (small) bonus is that SYNACK
      RTT measurement now happens after the PAWS check, potentially reducing
      the impact of RTO seeding on bad TCP timestamp values.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      375fe02c
    • net: Provide a generic socket error queue delivery method for Tx time stamps. · cb820f8e
      Committed by Richard Cochran
      This patch moves the private error queue delivery function from the
      af_packet code to the core socket method. In this way, network layers
      only needing the error queue for transmit time stamping can share common
      code.
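
      For context, a hedged userspace sketch of consuming TX timestamps from
      the socket error queue (the delivery path this patch generalizes);
      error handling is trimmed and the helper names are ours:

      #include <stdio.h>
      #include <string.h>
      #include <time.h>
      #include <sys/socket.h>
      #include <linux/net_tstamp.h>
      #include <linux/errqueue.h>

      static void enable_tx_timestamps(int fd)
      {
      	int val = SOF_TIMESTAMPING_TX_SOFTWARE | SOF_TIMESTAMPING_SOFTWARE;

      	setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));
      }

      static void read_tx_timestamp(int fd)
      {
      	char data[256], ctrl[512];
      	struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
      	struct msghdr msg = {
      		.msg_iov = &iov, .msg_iovlen = 1,
      		.msg_control = ctrl, .msg_controllen = sizeof(ctrl),
      	};
      	struct cmsghdr *cm;
      	struct timespec ts[3];	/* [0] software, [2] hardware */

      	if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
      		return;
      	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
      		if (cm->cmsg_level == SOL_SOCKET &&
      		    cm->cmsg_type == SCM_TIMESTAMPING) {
      			memcpy(ts, CMSG_DATA(cm), sizeof(ts));
      			printf("tx sw timestamp: %ld.%09ld\n",
      			       (long)ts[0].tv_sec, ts[0].tv_nsec);
      		}
      	}
      }
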
      Signed-off-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cb820f8e
  7. 20 Jul 2013: 1 commit
  8. 19 Jul 2013: 3 commits
    • vlan: fix a race in egress prio management · 3e3aac49
      Committed by Eric Dumazet
      egress_priority_map[] hash table updates are protected by rtnl,
      and we never remove elements until the device is dismantled.

      We have to make sure that before inserting a new element in the hash
      table, all its fields are committed to memory, or else another CPU
      could find corrupt values and crash.
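
      The required publication order can be sketched with a C11 userspace
      analogue (the kernel patch uses its own barrier/RCU-style primitives;
      the struct and helper below are illustrative only):

      #include <stdatomic.h>
      #include <stdlib.h>

      struct prio_entry_stub {
      	unsigned int vlan_qos;		/* must never be seen half-initialized */
      	struct prio_entry_stub *next;
      };

      /* Fully initialize the entry first, then publish it with a release
       * store so a reader that observes the new head also sees its fields. */
      static int publish_prio_entry(_Atomic(struct prio_entry_stub *) *head,
      			      unsigned int vlan_qos)
      {
      	struct prio_entry_stub *e = malloc(sizeof(*e));

      	if (!e)
      		return -1;
      	e->vlan_qos = vlan_qos;
      	e->next = atomic_load_explicit(head, memory_order_relaxed);
      	atomic_store_explicit(head, e, memory_order_release);
      	return 0;
      }
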
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3e3aac49
    • vlan: mask vlan prio bits · d4b812de
      Committed by Eric Dumazet
      In commit 48cc32d3
      ("vlan: don't deliver frames for unknown vlans to protocols")
      Florian made sure we set pkt_type to PACKET_OTHERHOST
      if the vlan id is set and we could find a vlan device for this
      particular id.
      
      But we also have a problem if prio bits are set.
      
      Steinar reported an issue on a router receiving IPv6 frames with a
      vlan tag of 4000 (id 0, prio 2), and tunneled into a sit device,
      because skb->vlan_tci is set.
      
      The forwarded frame is completely corrupted: we can see (8100:4000)
      being inserted in the middle of the IPv6 source address:
      
      16:48:00.780413 IP6 2001:16d8:8100:4000:ee1c:0:9d9:bc87 >
      9f94:4d95:2001:67c:29f4::: ICMP6, unknown icmp6 type (0), length 64
             0x0000:  0000 0029 8000 c7c3 7103 0001 a0ae e651
             0x0010:  0000 0000 ccce 0b00 0000 0000 1011 1213
             0x0020:  1415 1617 1819 1a1b 1c1d 1e1f 2021 2223
             0x0030:  2425 2627 2829 2a2b 2c2d 2e2f 3031 3233
      
      It seems we are not really ready to properly cope with this right now.
      
      We can probably do better in future kernels:
      vlan_get_ingress_priority() should be a netdev property instead of
      a per-vlan_dev one.

      For stable kernels, let's clear vlan_tci to fix the bugs.
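
      For reference, a small sketch of the 802.1Q TCI layout involved (the
      mask and shift mirror the standard bit positions; the helper names
      are ours):

      #include <stdint.h>

      #define VLAN_VID_MASK	0x0fff	/* bits 0-11: VLAN id         */
      #define VLAN_PRIO_SHIFT	13	/* bits 13-15: priority (PCP) */

      /* A TCI of 0x4000 has vid 0 but prio 2, which is why testing the raw
       * vlan_tci for "a vlan is present" misfires when only prio bits are set. */
      static inline uint16_t tci_vid(uint16_t tci)
      {
      	return tci & VLAN_VID_MASK;
      }

      static inline uint16_t tci_prio(uint16_t tci)
      {
      	return tci >> VLAN_PRIO_SHIFT;
      }
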
      Reported-by: Steinar H. Gunderson <sesse@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d4b812de
    • pkt_sched: sch_qfq: remove a source of high packet delay/jitter · 87f40dd6
      Committed by Paolo Valente
      QFQ+ inherits from QFQ a design choice that may cause a high packet
      delay/jitter and a severe short-term unfairness. As QFQ, QFQ+ uses a
      special quantity, the system virtual time, to track the service
      provided by the ideal system it approximates. When a packet is
      dequeued, this quantity must be incremented by the size of the packet,
      divided by the sum of the weights of the aggregates waiting to be
      served. Tracking this sum correctly is a non-trivial task, because, to
      preserve tight service guarantees, the decrement of this sum must be
      delayed in a special way [1]: this sum can be decremented only once
      its value would also decrease in the ideal system approximated by
      QFQ+. For efficiency, QFQ+ keeps track only of the 'instantaneous'
      weight sum, increased and decreased immediately as the weight of an
      aggregate changes, and as an aggregate is created or destroyed (which,
      in its turn, happens as a consequence of some class being
      created/destroyed/changed). However, to avoid the problems caused to
      service guarantees by these immediate decreases, QFQ+ increments the
      system virtual time using the maximum value allowed for the weight
      sum, 2^10, in place of the dynamic, instantaneous value. The
      instantaneous value of the weight sum is used only to check whether a
      request of weight increase or a class creation can be satisfied.
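
      In formula form (our notation, not from the patch): on dequeue of a
      packet of length L, the scheduler performs

      	V \leftarrow V + \frac{L}{\sum_{i \in B} w_i}

      where B is the set of aggregates waiting to be served; as described
      above, QFQ+ instead used the fixed upper bound 2^{10} as the
      denominator, which this patch replaces with the instantaneous sum.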
      
      Unfortunately, the problems caused by this choice are worse than the
      temporary degradation of the service guarantees that may occur, when a
      class is changed or destroyed, if the instantaneous value of the
      weight sum was used to update the system virtual time. In fact, the
      fraction of the link bandwidth guaranteed by QFQ+ to each aggregate is
      equal to the ratio between the weight of the aggregate and the sum of
      the weights of the competing aggregates. The packet delay guaranteed
      to the aggregate is instead inversely proportional to the guaranteed
      bandwidth. By using the maximum possible value, and not the actual
      value of the weight sum, QFQ+ provides each aggregate with the worst
      possible service guarantees, and not with service guarantees related
      to the actual set of competing aggregates. To see the consequences of
      this fact, consider the following simple example.
      
      Suppose that only the following aggregates are backlogged, i.e., that
      only the classes in the following aggregates have packets to transmit:
      one aggregate with weight 10, say A, and ten aggregates with weight 1,
      say B1, B2, ..., B10. In particular, suppose that these aggregates are
      always backlogged. Given the weight distribution, the smoothest and
      fairest service order would be:
      A B1 A B2 A B3 A B4 A B5 A B6 A B7 A B8 A B9 A B10 A B1 A B2 ...
      
      QFQ+ would provide exactly this optimal service if it used the actual
      value for the weight sum instead of the maximum possible value, i.e.,
      11 instead of 2^10. In contrast, since QFQ+ uses the latter value, it
      serves aggregates as follows (easy to prove and to reproduce
      experimentally):
      A B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 A A A A A A A A A A B1 B2 ... B10 A A ...
      
      By replacing 10 with N in the above example, and by increasing N, one
      can increase at will the maximum packet delay and the jitter
      experienced by the classes in aggregate A.
      
      This patch addresses this issue by just using the above
      'instantaneous' value of the weight sum, instead of the maximum
      possible value, when updating the system virtual time.  After the
      instantaneous weight sum is decreased, QFQ+ may deviate from the ideal
      service for a time interval on the order of the time to serve one
      maximum-size packet for each backlogged class. The worst-case extent
      of the deviation exhibited by QFQ+ during this time interval [1] is
      basically the same as of the deviation described above (but, without
      this patch, QFQ+ suffers from such a deviation all the time). Finally,
      this patch modifies the comment to the function qfq_slot_insert, to
      make it coherent with the fact that the weight sum used by QFQ+ can
      now be lower than the maximum possible value.
      
      [1] P. Valente, "Extending WF2Q+ to support a dynamic traffic mix",
      Proceedings of AAA-IDEA'05, June 2005.
      Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      87f40dd6
  9. 17 Jul 2013: 5 commits
  10. 15 Jul 2013: 2 commits
    • svcrdma: underflow issue in decode_write_list() · b2781e10
      Committed by Dan Carpenter
      My static checker marks everything from ntohl() as untrusted and it
      complains we could have an underflow problem doing:
      
      	return (u32 *)&ary->wc_array[nchunks];
      
      Also, on 32-bit systems the upper bound check could overflow.
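
      A generic sketch of the overflow-safe form of such a check (types and
      names are ours, not the svcrdma ones):

      #include <stddef.h>
      #include <stdint.h>

      /* Validate an untrusted element count before doing pointer arithmetic
       * with it: dividing the buffer size avoids the multiplication that
       * could wrap on 32-bit, and rejects counts that do not fit. */
      static const char *array_end(const char *buf, size_t buf_len,
      			     size_t elem_size, uint32_t nelems)
      {
      	if (elem_size == 0 || nelems > buf_len / elem_size)
      		return NULL;	/* would overflow or run past the buffer */
      	return buf + (size_t)nelems * elem_size;
      }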
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      b2781e10
    • net: delete __cpuinit usage from all net files · 013dbb32
      Committed by Paul Gortmaker
      The __cpuinit type of throwaway sections might have made sense
      some time ago when RAM was more constrained, but now the savings
      do not offset the cost and complications.  The fix in
      commit 5e427ec2 ("x86: Fix bit corruption at CPU resume time")
      is a good example of the nasty type of bugs that can be created
      with improper use of the various __init prefixes.
      
      After a discussion on LKML[1] it was decided that cpuinit should go
      the way of devinit and be phased out.  Once all the users are gone,
      we can then finally remove the macros themselves from linux/init.h.
      
      This removes all the net/* uses of the __cpuinit macros
      from all C files.
      
      [1] https://lkml.org/lkml/2013/5/20/589
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: netdev@vger.kernel.org
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      013dbb32
  11. 14 Jul 2013: 3 commits
  12. 13 Jul 2013: 3 commits