提交 · a3c910d2e796ce878231350b904105261a8cf09c · OpenHarmony / kernel_linux

24 6月, 2013 1 次提交

tcp: doc : fix the syncookies default value · a3c910d2

由 Shan Wei 提交于 6月 21, 2013

syncookies is on for default since in commit e994b7c9
(tcp: Don't make syn cookies initial setting depend on CONFIG_SYSCTL).

And fix a typo of CONFIG_SYN_COOKIES.
Signed-off-by: NShan Wei <davidshan@tencent.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a3c910d2

21 3月, 2013 2 次提交

tcp: implement RFC5682 F-RTO · e33099f9

由 Yuchung Cheng 提交于 3月 20, 2013

This patch implements F-RTO (foward RTO recovery):

When the first retransmission after timeout is acknowledged, F-RTO
sends new data instead of old data. If the next ACK acknowledges
some never-retransmitted data, then the timeout was spurious and the
congestion state is reverted.  Otherwise if the next ACK selectively
acknowledges the new data, then the timeout was genuine and the
loss recovery continues. This idea applies to recurring timeouts
as well. While F-RTO sends different data during timeout recovery,
it does not (and should not) change the congestion control.

The implementaion follows the three steps of SACK enhanced algorithm
(section 3) in RFC5682. Step 1 is in tcp_enter_loss(). Step 2 and
3 are in tcp_process_loss().  The basic version is not supported
because SACK enhanced version also works for non-SACK connections.

The new implementation is functionally in parity with the old F-RTO
implementation except the one case where it increases undo events:
In addition to the RFC algorithm, a spurious timeout may be detected
without sending data in step 2, as long as the SACK confirms not
all the original data are dropped. When this happens, the sender
will undo the cwnd and perhaps enter fast recovery instead. This
additional check increases the F-RTO undo events by 5x compared
to the prior implementation on Google Web servers, since the sender
often does not have new data to send for HTTP.

Note F-RTO may detect spurious timeout before Eifel with timestamps
does so.
Signed-off-by: NYuchung Cheng <ycheng@google.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e33099f9

tcp: refactor F-RTO · 9b44190d

由 Yuchung Cheng 提交于 3月 20, 2013

The patch series refactor the F-RTO feature (RFC4138/5682).

This is to simplify the loss recovery processing. Existing F-RTO
was developed during the experimental stage (RFC4138) and has
many experimental features.  It takes a separate code path from
the traditional timeout processing by overloading CA_Disorder
instead of using CA_Loss state. This complicates CA_Disorder state
handling because it's also used for handling dubious ACKs and undos.
While the algorithm in the RFC does not change the congestion control,
the implementation intercepts congestion control in various places
(e.g., frto_cwnd in tcp_ack()).

The new code implements newer F-RTO RFC5682 using CA_Loss processing
path.  F-RTO becomes a small extension in the timeout processing
and interfaces with congestion control and Eifel undo modules.
It lets congestion control (module) determines how many to send
independently.  F-RTO only chooses what to send in order to detect
spurious retranmission. If timeout is found spurious it invokes
existing Eifel undo algorithms like DSACK or TCP timestamp based
detection.

The first patch removes all F-RTO code except the sysctl_tcp_frto is
left for the new implementation.  Since CA_EVENT_FRTO is removed, TCP
westwood now computes ssthresh on regular timeout CA_EVENT_LOSS event.
Signed-off-by: NYuchung Cheng <ycheng@google.com>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9b44190d

18 3月, 2013 1 次提交

tcp: Remove TCPCT · 1a2c6181

由 Christoph Paasch 提交于 3月 17, 2013

TCPCT uses option-number 253, reserved for experimental use and should
not be used in production environments.
Further, TCPCT does not fully implement RFC 6013.

As a nice side-effect, removing TCPCT increases TCP's performance for
very short flows:

Doing an apache-benchmark with -c 100 -n 100000, sending HTTP-requests
for files of 1KB size.

before this patch:
	average (among 7 runs) of 20845.5 Requests/Second
after:
	average (among 7 runs) of 21403.6 Requests/Second
Signed-off-by: NChristoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1a2c6181

15 3月, 2013 1 次提交

Documentation: fix neigh/default/gc_thresh1 default value. · b66c66dc

由 Li RongQing 提交于 3月 14, 2013

The default value is 128, not 256
	#grep gc_thresh1 net/ -rI
	net/decnet/dn_neigh.c:	.gc_thresh1 =			128,
	net/ipv6/ndisc.c:	.gc_thresh1 =	 128,
	net/ipv4/arp.c:	.gc_thresh1	= 128,
Signed-off-by: NLi RongQing <roy.qing.li@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b66c66dc

12 3月, 2013 1 次提交

tcp: Tail loss probe (TLP) · 6ba8a3b1

由 Nandita Dukkipati 提交于 3月 11, 2013

This patch series implement the Tail loss probe (TLP) algorithm described
in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The
first patch implements the basic algorithm.

TLP's goal is to reduce tail latency of short transactions. It achieves
this by converting retransmission timeouts (RTOs) occuring due
to tail losses (losses at end of transactions) into fast recovery.
TLP transmits one packet in two round-trips when a connection is in
Open state and isn't receiving any ACKs. The transmitted packet, aka
loss probe, can be either new or a retransmission. When there is tail
loss, the ACK from a loss probe triggers FACK/early-retransmit based
fast recovery, thus avoiding a costly RTO. In the absence of loss,
there is no change in the connection state.

PTO stands for probe timeout. It is a timer event indicating
that an ACK is overdue and triggers a loss probe packet. The PTO value
is set to max(2*SRTT, 10ms) and is adjusted to account for delayed
ACK timer when there is only one oustanding packet.

TLP Algorithm

On transmission of new data in Open state:
  -> packets_out > 1: schedule PTO in max(2*SRTT, 10ms).
  -> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
  -> PTO = min(PTO, RTO)

Conditions for scheduling PTO:
  -> Connection is in Open state.
  -> Connection is either cwnd limited or no new data to send.
  -> Number of probes per tail loss episode is limited to one.
  -> Connection is SACK enabled.

When PTO fires:
  new_segment_exists:
    -> transmit new segment.
    -> packets_out++. cwnd remains same.

  no_new_packet:
    -> retransmit the last segment.
       Its ACK triggers FACK or early retransmit based recovery.

ACK path:
  -> rearm RTO at start of ACK processing.
  -> reschedule PTO if need be.

In addition, the patch includes a small variation to the Early Retransmit
(ER) algorithm, such that ER and TLP together can in principle recover any
N-degree of tail loss through fast recovery. TLP is controlled by the same
sysctl as ER, tcp_early_retrans sysctl.
tcp_early_retrans==0; disables TLP and ER.
		 ==1; enables RFC5827 ER.
		 ==2; delayed ER.
		 ==3; TLP and delayed ER. [DEFAULT]
		 ==4; TLP only.

The TLP patch series have been extensively tested on Google Web servers.
It is most effective for short Web trasactions, where it reduced RTOs by 15%
and improved HTTP response time (average by 6%, 99th percentile by 10%).
The transmitted probes account for <0.5% of the overall transmissions.
Signed-off-by: NNandita Dukkipati <nanditad@google.com>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Acked-by: NYuchung Cheng <ycheng@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6ba8a3b1

06 2月, 2013 1 次提交

tcp: remove Appropriate Byte Count support · ca2eb567

由 Stephen Hemminger 提交于 2月 05, 2013

TCP Appropriate Byte Count was added by me, but later disabled.
There is no point in maintaining it since it is a potential source
of bugs and Linux already implements other better window protection
heuristics.
Signed-off-by: NStephen Hemminger <stephen@networkplumber.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ca2eb567

23 1月, 2013 1 次提交

neigh: Keep neighbour cache entries if number of them is small enough. · 2724680b

由 YOSHIFUJI Hideaki / 吉藤英明提交于 1月 22, 2013

Since we have removed NCE (Neighbour Cache Entry) reference from
routing entries, the only refcnt holders of an NCE are its timer
(if running) and its owner table, in usual cases.  As a result,
neigh_periodic_work() purges NCEs over and over again even for
gateways.

It does not make sense to purge entries, if number of them is
very small, so keep them.  The minimum number of entries to keep
is specified by gc_thresh1.
Signed-off-by: NYOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2724680b

11 1月, 2013 1 次提交

doc: Clarify behavior when sysctl tcp_ecn = 1 · 3d55b323

由 Vijay Subramanian 提交于 1月 09, 2013

Recent commit (commit 7e3a2dc5 doc: make the description of how tcp_ecn
works more explicit and clear ) clarified the behavior of tcp_ecn sysctl
variable but description is inconsistent. When requested by incoming conections,
ECN is enabled with not just tcp_ecn = 2 but also with tcp_ecn = 1.

This patch makes it clear that with tcp_ecn = 1, ECN is enabled when requested
by incoming connections.

Also fix spelling of 'incoming'.
Signed-off-by: NVijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3d55b323

05 1月, 2013 2 次提交

ip-sysctl: fix spelling errors · 3b09adcb

由 stephen hemminger 提交于 1月 03, 2013

Signed-off-by: NStephen Hemminger <shemminger@vyatta.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3b09adcb

ipv6: document ndisc_notify in networking/ip-sysctl.txt · db2b620a

由 Hannes Frederic Sowa 提交于 1月 01, 2013

I slipped in a new sysctl without proper documentation. I would like to
make up for this now.
Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

db2b620a

11 12月, 2012 1 次提交

doc: Tighten-up and clarify description of tcp_fin_timeout · d825da2e

由 Rick Jones 提交于 12月 10, 2012

The description for tcp_fin_timeout should be tigher and more clear.

In addition to being tighter, we should make the spelling of the
state name consistent with what utilities report, remove the now
dated reference to 2.2 and put the default in the consistent place.
Signed-off-by: NRick Jones <rick.jones2@hp.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d825da2e

08 12月, 2012 1 次提交

net: doc : use more suitable word 'unexpected' to replace 'secluded' · 5d248c49

由 Shan Wei 提交于 12月 06, 2012

 'secluded' is used to describe places, not suitable here.
Suggested-by: NBen Hutchings <bhutchings@solarflare.com>
Signed-off-by: NShan Wei <davidshan@tencent.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5d248c49

06 12月, 2012 1 次提交

net: doc: add default value for neighbour parameters · cc868028

由 Shan Wei 提交于 12月 04, 2012

Signed-off-by: NShan Wei <davidshan@tencent.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cc868028

30 11月, 2012 1 次提交

doc: make the description of how tcp_ecn works more explicit and clear · 7e3a2dc5

由 Rick Jones 提交于 11月 28, 2012

Make the description of how tcp_ecn works a bit more explicit and clear.
Signed-off-by: NRick Jones <rick.jones2@hp.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7e3a2dc5

26 10月, 2012 1 次提交

sctp: Make hmac algorithm selection for cookie generation dynamic · 3c68198e

由 Neil Horman 提交于 10月 24, 2012

Currently sctp allows for the optional use of md5 of sha1 hmac algorithms to
generate cookie values when establishing new connections via two build time
config options.  Theres no real reason to make this a static selection.  We can
add a sysctl that allows for the dynamic selection of these algorithms at run
time, with the default value determined by the corresponding crypto library
availability.
This comes in handy when, for example running a system in FIPS mode, where use
of md5 is disallowed, but SHA1 is permitted.

Note: This new sysctl has no corresponding socket option to select the cookie
hmac algorithm.  I chose not to implement that intentionally, as RFC 6458
contains no option for this value, and I opted not to pollute the socket option
namespace.

Change notes:
v2)
	* Updated subject to have the proper sctp prefix as per Dave M.
	* Replaced deafult selection options with new options that allow
	  developers to explicitly select available hmac algs at build time
	  as per suggestion by Vlad Y.
Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
CC: Vlad Yasevich <vyasevich@gmail.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: netdev@vger.kernel.org
Acked-by: NVlad Yasevich <vyasevich@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3c68198e

01 9月, 2012 2 次提交

tcp: TCP Fast Open Server - header & support functions · 10467163

由 Jerry Chu 提交于 8月 31, 2012

This patch adds all the necessary data structure and support
functions to implement TFO server side. It also documents a number
of flags for the sysctl_tcp_fastopen knob, and adds a few Linux
extension MIBs.

In addition, it includes the following:

1. a new TCP_FASTOPEN socket option an application must call to
supply a max backlog allowed in order to enable TFO on its listener.

2. A number of key data structures:
"fastopen_rsk" in tcp_sock - for a big socket to access its
request_sock for retransmission and ack processing purpose. It is
non-NULL iff 3WHS not completed.

"fastopenq" in request_sock_queue - points to a per Fast Open
listener data structure "fastopen_queue" to keep track of qlen (# of
outstanding Fast Open requests) and max_qlen, among other things.

"listener" in tcp_request_sock - to point to the original listener
for book-keeping purpose, i.e., to maintain qlen against max_qlen
as part of defense against IP spoofing attack.

3. various data structure and functions, many in tcp_fastopen.c, to
support server side Fast Open cookie operations, including
/proc/sys/net/ipv4/tcp_fastopen_key to allow manual rekeying.
Signed-off-by: NH.K. Jerry Chu <hkchu@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

10467163

tcp: Increase timeout for SYN segments · 6c9ff979

由 Alex Bergmann 提交于 8月 31, 2012

Commit 9ad7c049 ("tcp: RFC2988bis + taking RTT sample from 3WHS for
the passive open side") changed the initRTO from 3secs to 1sec in
accordance to RFC6298 (former RFC2988bis). This reduced the time till
the last SYN retransmission packet gets sent from 93secs to 31secs.

RFC1122 is stating that the retransmission should be done for at least 3
minutes, but this seems to be quite high.

  "However, the values of R1 and R2 may be different for SYN
  and data segments.  In particular, R2 for a SYN segment MUST
  be set large enough to provide retransmission of the segment
  for at least 3 minutes.  The application can close the
  connection (i.e., give up on the open attempt) sooner, of
  course."

This patch increases the value of TCP_SYN_RETRIES to the value of 6,
providing a retransmission window of 63secs.

The comments for SYN and SYNACK retries have also been updated to
describe the current settings. The same goes for the documentation file
"Documentation/networking/ip-sysctl.txt".
Signed-off-by: NAlexander Bergmann <alex@linlab.net>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6c9ff979

31 7月, 2012 1 次提交

ipv4: remove rt_cache_rebuild_count · 0c7462a2

由 Eric Dumazet 提交于 7月 30, 2012

After IP route cache removal, rt_cache_rebuild_count is no longer
used.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0c7462a2

23 7月, 2012 1 次提交

sctp: Implement quick failover draft from tsvwg · 5aa93bcf

由 Neil Horman 提交于 7月 21, 2012

I've seen several attempts recently made to do quick failover of sctp transports
by reducing various retransmit timers and counters.  While its possible to
implement a faster failover on multihomed sctp associations, its not
particularly robust, in that it can lead to unneeded retransmits, as well as
false connection failures due to intermittent latency on a network.

Instead, lets implement the new ietf quick failover draft found here:
http://tools.ietf.org/html/draft-nishida-tsvwg-sctp-failover-05

This will let the sctp stack identify transports that have had a small number of
errors, and avoid using them quickly until their reliability can be
re-established.  I've tested this out on two virt guests connected via multiple
isolated virt networks and believe its in compliance with the above draft and
works well.
Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
CC: Vlad Yasevich <vyasevich@gmail.com>
CC: Sridhar Samudrala <sri@us.ibm.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: linux-sctp@vger.kernel.org
CC: joe@perches.com
Acked-by: NVlad Yasevich <vyasevich@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5aa93bcf

20 7月, 2012 2 次提交

net-tcp: Fast Open client - cookie-less mode · 67da22d2

由 Yuchung Cheng 提交于 7月 19, 2012

In trusted networks, e.g., intranet, data-center, the client does not
need to use Fast Open cookie to mitigate DoS attacks. In cookie-less
mode, sendmsg() with MSG_FASTOPEN flag will send SYN-data regardless
of cookie availability.
Signed-off-by: NYuchung Cheng <ycheng@google.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

67da22d2

net-tcp: Fast Open client - sendmsg(MSG_FASTOPEN) · cf60af03

由 Yuchung Cheng 提交于 7月 19, 2012

sendmsg() (or sendto()) with MSG_FASTOPEN is a combo of connect(2)
and write(2). The application should replace connect() with it to
send data in the opening SYN packet.

For blocking socket, sendmsg() blocks until all the data are buffered
locally and the handshake is completed like connect() call. It
returns similar errno like connect() if the TCP handshake fails.

For non-blocking socket, it returns the number of bytes queued (and
transmitted in the SYN-data packet) if cookie is available. If cookie
is not available, it transmits a data-less SYN packet with Fast Open
cookie request option and returns -EINPROGRESS like connect().

Using MSG_FASTOPEN on connecting or connected socket will result in
simlar errno like repeating connect() calls. Therefore the application
should only use this flag on new sockets.

The buffer size of sendmsg() is independent of the MSS of the connection.
Signed-off-by: NYuchung Cheng <ycheng@google.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cf60af03

17 7月, 2012 1 次提交

tcp: implement RFC 5961 3.2 · 282f23c6

由 Eric Dumazet 提交于 7月 17, 2012

Implement the RFC 5691 mitigation against Blind
Reset attack using RST bit.

Idea is to validate incoming RST sequence,
to match RCV.NXT value, instead of previouly accepted
window : (RCV.NXT <= SEG.SEQ < RCV.NXT+RCV.WND)

If sequence is in window but not an exact match, send
a "challenge ACK", so that the other part can resend an
RST with the appropriate sequence.

Add a new sysctl, tcp_challenge_ack_limit, to limit
number of challenge ACK sent per second.

Add a new SNMP counter to count number of challenge acks sent.
(netstat -s | grep TCPChallengeACK)
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Kiran Kumar Kella <kkiran@broadcom.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

282f23c6

12 7月, 2012 1 次提交

tcp: TCP Small Queues · 46d3ceab

由 Eric Dumazet 提交于 7月 11, 2012

This introduce TSQ (TCP Small Queues)

TSQ goal is to reduce number of TCP packets in xmit queues (qdisc &
device queues), to reduce RTT and cwnd bias, part of the bufferbloat
problem.

sk->sk_wmem_alloc not allowed to grow above a given limit,
allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a
given time.

TSO packets are sized/capped to half the limit, so that we have two
TSO packets in flight, allowing better bandwidth use.

As a side effect, setting the limit to 40000 automatically reduces the
standard gso max limit (65536) to 40000/2 : It can help to reduce
latencies of high prio packets, having smaller TSO packets.

This means we divert sock_wfree() to a tcp_wfree() handler, to
queue/send following frames when skb_orphan() [2] is called for the
already queued skbs.

Results on my dev machines (tg3/ixgbe nics) are really impressive,
using standard pfifo_fast, and with or without TSO/GSO.

Without reduction of nominal bandwidth, we have reduction of buffering
per bulk sender :
< 1ms on Gbit (instead of 50ms with TSO)
< 8ms on 100Mbit (instead of 132 ms)

I no longer have 4 MBytes backlogged in qdisc by a single netperf
session, and both side socket autotuning no longer use 4 Mbytes.

As skb destructor cannot restart xmit itself ( as qdisc lock might be
taken at this point ), we delegate the work to a tasklet. We use one
tasklest per cpu for performance reasons.

If tasklet finds a socket owned by the user, it sets TSQ_OWNED flag.
This flag is tested in a new protocol method called from release_sock(),
to eventually send new segments.

[1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
[2] skb_orphan() is usually called at TX completion time,
  but some drivers call it in their start_xmit() handler.
  These drivers should at least use BQL, or else a single TCP
  session can still fill the whole NIC TX ring, since TSQ will
  have no effect.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Dave Taht <dave.taht@bufferbloat.net>
Cc: Tom Herbert <therbert@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

46d3ceab

01 7月, 2012 1 次提交
- D
  ipv4: Clarify in docs that accept_local requires rp_filter. · c801e3cc
  由 David S. Miller 提交于 6月 30, 2012
```
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
  c801e3cc
13 6月, 2012 1 次提交

ipv4: Add interface option to enable routing of 127.0.0.0/8 · d0daebc3

由 Thomas Graf 提交于 6月 12, 2012

Routing of 127/8 is tradtionally forbidden, we consider
packets from that address block martian when routing and do
not process corresponding ARP requests.

This is a sane default but renders a huge address space
practically unuseable.

The RFC states that no address within the 127/8 block should
ever appear on any network anywhere but it does not forbid
the use of such addresses outside of the loopback device in
particular. For example to address a pool of virtual guests
behind a load balancer.

This patch adds a new interface option 'route_localnet'
enabling routing of the 127/8 address block and processing
of ARP requests on a specific interface.

Note that for the feature to work, the default local route
covering 127/8 dev lo needs to be removed.

Example:
  $ sysctl -w net.ipv4.conf.eth0.route_localnet=1
  $ ip route del 127.0.0.0/8 dev lo table local
  $ ip addr add 127.1.0.1/16 dev eth0
  $ ip route flush cache

V2: Fix invalid check to auto flush cache (thanks davem)
Signed-off-by: NThomas Graf <tgraf@suug.ch>
Acked-by: NNeil Horman <nhorman@tuxdriver.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d0daebc3

09 5月, 2012 1 次提交

netfilter: bridge: optionally set indev to vlan · 4981682c

由 Pablo Neira Ayuso 提交于 5月 08, 2012

if net.bridge.bridge-nf-filter-vlan-tagged sysctl is enabled, bridge
netfilter removes the vlan header temporarily and then feeds the packet
to ip(6)tables.

When the new "bridge-nf-pass-vlan-input-device" sysctl is on
(default off), then bridge netfilter will also set the
in-interface to the vlan interface; if such an interface exists.

This is needed to make iptables REDIRECT target work with
"vlan-on-top-of-bridge" setups and to allow use of "iptables -i" to
match the vlan device name.

Also update Documentation with current brnf default settings.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Acked-by: NBart De Schuymer <bdschuym@pandora.be>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

4981682c

03 5月, 2012 2 次提交

tcp: change tcp_adv_win_scale and tcp_rmem[2] · b49960a0

由 Eric Dumazet 提交于 5月 02, 2012

tcp_adv_win_scale default value is 2, meaning we expect a good citizen
skb to have skb->len / skb->truesize ratio of 75% (3/4)

In 2.6 kernels we (mis)accounted for typical MSS=1460 frame :
1536 + 64 + 256 = 1856 'estimated truesize', and 1856 * 3/4 = 1392.
So these skbs were considered as not bloated.

With recent truesize fixes, a typical MSS=1460 frame truesize is now the
more precise :
2048 + 256 = 2304. But 2304 * 3/4 = 1728.
So these skb are not good citizen anymore, because 1460 < 1728

(GRO can escape this problem because it build skbs with a too low
truesize.)

This also means tcp advertises a too optimistic window for a given
allocated rcvspace : When receiving frames, sk_rmem_alloc can hit
sk_rcvbuf limit and we call tcp_prune_queue()/tcp_collapse() too often,
especially when application is slow to drain its receive queue or in
case of losses (netperf is fast, scp is slow). This is a major latency
source.

We should adjust the len/truesize ratio to 50% instead of 75%

This patch :

1) changes tcp_adv_win_scale default to 1 instead of 2

2) increase tcp_rmem[2] limit from 4MB to 6MB to take into account
better truesize tracking and to allow autotuning tcp receive window to
reach same value than before. Note that same amount of kernel memory is
consumed compared to 2.6 kernels.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b49960a0

tcp: early retransmit · eed530b6

由 Yuchung Cheng 提交于 5月 02, 2012

This patch implements RFC 5827 early retransmit (ER) for TCP.
It reduces DUPACK threshold (dupthresh) if outstanding packets are
less than 4 to recover losses by fast recovery instead of timeout.

While the algorithm is simple, small but frequent network reordering
makes this feature dangerous: the connection repeatedly enter
false recovery and degrade performance. Therefore we implement
a mitigation suggested in the appendix of the RFC that delays
entering fast recovery by a small interval, i.e., RTT/4. Currently
ER is conservative and is disabled for the rest of the connection
after the first reordering event. A large scale web server
experiment on the performance impact of ER is summarized in
section 6 of the paper "Proportional Rate Reduction for TCP”,
IMC 2011. http://conferences.sigcomm.org/imc/2011/docs/p155.pdf

Note that Linux has a similar feature called THIN_DUPACK. The
differences are THIN_DUPACK do not mitigate reorderings and is only
used after slow start. Currently ER is disabled if THIN_DUPACK is
enabled. I would be happy to merge THIN_DUPACK feature with ER if
people think it's a good idea.

ER is enabled by sysctl_tcp_early_retrans:
  0: Disables ER

  1: Reduce dupthresh to packets_out - 1 when outstanding packets < 4.

  2: (Default) reduce dupthresh like mode 1. In addition, delay
     entering fast recovery by RTT/4.

Note: mode 2 is implemented in the third part of this patch series.
Signed-off-by: NYuchung Cheng <ycheng@google.com>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

eed530b6

27 4月, 2012 1 次提交

net: doc: merge /proc/sys/net/core/* documents into one place · c60f6aa8

由 Shan Wei 提交于 4月 26, 2012

All parameter descriptions in /proc/sys/net/core/* now is separated
two places. So, merge them into Documentation/sysctl/net.txt.
Signed-off-by: NShan Wei <davidshan@tencent.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c60f6aa8

04 4月, 2012 1 次提交

TCP: update ip_local_port_range documentation · 5d6bd861

由 Fernando Luis Vazquez Cao 提交于 4月 03, 2012

The explanation of ip_local_port_range in
Documentation/networking/ip-sysctl.txt contains several factual
errors:

- The default value of ip_local_port_range does not depend on the
  amount of memory available in the system.
- tcp_tw_recycle is not enabled by default.
- 1024-4999 is not the default value.
- Etc.

Clean up the mess.
Signed-off-by: NFernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5d6bd861

07 12月, 2011 1 次提交

ipv4:correct description for tcp_max_syn_backlog · 99b53bdd

由 Peter Pan(潘卫平) 提交于 12月 05, 2011

Since commit c5ed63d6(tcp: fix three tcp sysctls tuning),
sysctl_max_syn_backlog is determined by tcp_hashinfo->ehash_mask,
and the minimal value is 128, and it will increase in proportion to the
memory of machine.
The original description for tcp_max_syn_backlog and sysctl_max_syn_backlog
are out of date.

Changelog:
V2: update description for sysctl_max_syn_backlog
Signed-off-by: NWeiping Pan <panweiping3@gmail.com>
Reviewed-by: NShan Wei <shanwei88@gmail.com>
Acked-by: NNeil Horman <nhorman@tuxdriver.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

99b53bdd

01 12月, 2011 1 次提交

tcp: inherit listener congestion control for passive cnx · d8a6e65f

由 Eric Dumazet 提交于 11月 30, 2011

Rick Jones reported that TCP_CONGESTION sockopt performed on a listener
was ignored for its children sockets : right after accept() the
congestion control for new socket is the system default one.

This seems an oversight of the initial design (quoted from Stephen)

Based on prior investigation and patch from Rick.
Reported-by: NRick Jones <rick.jones2@hp.com>
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
CC: Yuchung Cheng <ycheng@google.com>
Tested-by: NRick Jones <rick.jones2@hp.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d8a6e65f

14 11月, 2011 1 次提交

neigh: new unresolved queue limits · 8b5c171b

由 Eric Dumazet 提交于 11月 09, 2011

Le mercredi 09 novembre 2011 à 16:21 -0500, David Miller a écrit :
> From: David Miller <davem@davemloft.net>
> Date: Wed, 09 Nov 2011 16:16:44 -0500 (EST)
>
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Wed, 09 Nov 2011 12:14:09 +0100
> >
> >> unres_qlen is the number of frames we are able to queue per unresolved
> >> neighbour. Its default value (3) was never changed and is responsible
> >> for strange drops, especially if IP fragments are used, or multiple
> >> sessions start in parallel. Even a single tcp flow can hit this limit.
> >  ...
> >
> > Ok, I've applied this, let's see what happens :-)
>
> Early answer, build fails.
>
> Please test build this patch with DECNET enabled and resubmit.  The
> decnet neigh layer still refers to the removed ->queue_len member.
>
> Thanks.

Ouch, this was fixed on one machine yesterday, but not the other one I
used this morning, sorry.

[PATCH V5 net-next] neigh: new unresolved queue limits

unres_qlen is the number of frames we are able to queue per unresolved
neighbour. Its default value (3) was never changed and is responsible
for strange drops, especially if IP fragments are used, or multiple
sessions start in parallel. Even a single tcp flow can hit this limit.

$ arp -d 192.168.20.108 ; ping -c 2 -s 8000 192.168.20.108
PING 192.168.20.108 (192.168.20.108) 8000(8028) bytes of data.
8008 bytes from 192.168.20.108: icmp_seq=2 ttl=64 time=0.322 ms
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8b5c171b

09 11月, 2011 1 次提交

net: min_pmtu default is 552 · 20db93c3

由 Eric Dumazet 提交于 11月 08, 2011

Small fix in Documentation, since min_pmtu is 512 + 20 + 20 = 552
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

20db93c3

30 9月, 2011 1 次提交

net: Documentation: Fix type of variables · 605b91c8

由 Roy.Li 提交于 9月 28, 2011

Signed-off-by: NRoy.Li <rongqing.li@windriver.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

605b91c8

17 9月, 2011 1 次提交

ipv6: Send ICMPv6 RSes only when RAs are accepted · 026359bc

由 Tore Anderson 提交于 8月 28, 2011

This patch improves the logic determining when to send ICMPv6 Router
Solicitations, so that they are 1) always sent when the kernel is
accepting Router Advertisements, and 2) never sent when the kernel is
not accepting RAs. In other words, the operational setting of the
"accept_ra" sysctl is used.

The change also makes the special "Hybrid Router" forwarding mode
("forwarding" sysctl set to 2) operate exactly the same as the standard
Router mode (forwarding=1). The only difference between the two was
that RSes was being sent in the Hybrid Router mode only. The sysctl
documentation describing the special Hybrid Router mode has therefore
been removed.

Rationale for the change:

Currently, the value of forwarding sysctl is the only thing determining
whether or not to send RSes. If it has the value 0 or 2, they are sent,
otherwise they are not. This leads to inconsistent behaviour in the
following cases:

* accept_ra=0, forwarding=0
* accept_ra=0, forwarding=2
* accept_ra=1, forwarding=2
* accept_ra=2, forwarding=1

In the first three cases, the kernel will send RSes, even though it will
not accept any RAs received in reply. In the last case, it will not send
any RSes, even though it will accept and process any RAs received. (Most
routers will send unsolicited RAs periodically, so suppressing RSes in
the last case will merely delay auto-configuration, not prevent it.)

Also, it is my opinion that having the forwarding sysctl control RS
sending behaviour (completely independent of whether RAs are being
accepted or not) is simply not what most users would intuitively expect
to be the case.
Signed-off-by: NTore Anderson <tore@fud.no>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

026359bc

23 8月, 2011 1 次提交

net: Documentation: RFC 2553bis is now RFC 3493 · d5c073ca

由 Geoffrey Thomas 提交于 8月 22, 2011

Signed-off-by: NGeoffrey Thomas <geofft@mit.edu>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d5c073ca

09 7月, 2011 1 次提交

net: Fix default in docs for tcp_orphan_retries. · 06b8fc5d

由 David S. Miller 提交于 7月 08, 2011

Default should be listed at 8 instead of 7.
Reported-by: NDenys Fedoryshchenko <denys@visp.net.lb>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

06b8fc5d

05 7月, 2011 1 次提交

Update documented default values for various TCP/UDP tunables · 6539fefd

由 Max Matveev 提交于 6月 21, 2011

tcp_rmem and tcp_wmem use 1 page as default value for the minimum
amount of memory to be used, same as udp_wmem_min and udp_rmem_min.
Pages are different size on different architectures - use the right
units when describing the defaults.
Reviewed-by: NShan Wei <shanwei@cn.fujitsu.com>
Acked-by: NNeil Horman <nhorman@tuxdriver.com>
Signed-off-by: NMax Matveev <makc@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6539fefd

OpenHarmony / kernel_linux 上一次同步 3 年多

OpenHarmony / kernel_linux
上一次同步 3 年多