提交 · 0a5ebb8000c5362be368df9d197943deb06b6916 · openanolis / cloud-kernel

11 5月, 2011 1 次提交
- D
  ipv4: Pass explicit daddr arg to ip_send_reply(). · 0a5ebb80
  由 David S. Miller 提交于 5月 09, 2011
```
This eliminates an access to rt->rt_src.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
  0a5ebb80
09 5月, 2011 3 次提交

D
tcp: Use cork flow info instead of rt->rt_dst in tcp_v4_get_peer() · c5216cc7
由 David S. Miller 提交于 5月 06, 2011
```
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
c5216cc7

ipv4: Use inet_csk_route_child_sock() in DCCP and TCP. · 0e734419

由 David S. Miller 提交于 5月 08, 2011

Operation order is now transposed, we first create the child
socket then we try to hook up the route.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0e734419

tcp: Use cork flow in tcp_v4_connect() · da905bd1

由 David S. Miller 提交于 5月 06, 2011

Since this is invoked from inet_stream_connect() the socket is locked
and therefore this usage is safe.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

da905bd1

29 4月, 2011 3 次提交

ipv4: Get route daddr from flow key in tcp_v4_connect(). · d4fb3d74

由 David S. Miller 提交于 4月 28, 2011

Now that output route lookups update the flow with
destination address selection, we can fetch it from
fl4->daddr instead of rt->rt_dst
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d4fb3d74

ipv4: Fetch route saddr from flow key in tcp_v4_connect(). · 4071cfff

由 David S. Miller 提交于 4月 28, 2011

Now that output route lookups update the flow with
source address selection, we can fetch it from
fl4->saddr instead of rt->rt_src
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4071cfff

inet: add RCU protection to inet->opt · f6d8bd05

由 Eric Dumazet 提交于 4月 21, 2011

We lack proper synchronization to manipulate inet->opt ip_options

Problem is ip_make_skb() calls ip_setup_cork() and
ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options),
without any protection against another thread manipulating inet->opt.

Another thread can change inet->opt pointer and free old one under us.

Use RCU to protect inet->opt (changed to inet->inet_opt).

Instead of handling atomic refcounts, just copy ip_options when
necessary, to avoid cache line dirtying.

We cant insert an rcu_head in struct ip_options since its included in
skb->cb[], so this patch is large because I had to introduce a new
ip_options_rcu structure.
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f6d8bd05

28 4月, 2011 1 次提交

ipv4: Sanitize and simplify ip_route_{connect,newports}() · 2d7192d6

由 David S. Miller 提交于 4月 26, 2011

These functions are used together as a unit for route resolution
during connect().  They address the chicken-and-egg problem that
exists when ports need to be allocated during connect() processing,
yet such port allocations require addressing information from the
routing code.

It's currently more heavy handed than it needs to be, and in
particular we allocate and initialize a flow object twice.

Let the callers provide the on-stack flow object.  That way we only
need to initialize it once in the ip_route_connect() call.

Later, if ip_route_newports() needs to do anything, it re-uses that
flow object as-is except for the ports which it updates before the
route re-lookup.

Also, describe why this set of facilities are needed and how it works
in a big comment.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
Reviewed-by: NEric Dumazet <eric.dumazet@gmail.com>

2d7192d6

23 4月, 2011 1 次提交

inet: constify ip headers and in6_addr · b71d1d42

由 Eric Dumazet 提交于 4月 22, 2011

Add const qualifiers to structs iphdr, ipv6hdr and in6_addr pointers
where possible, to make code intention more obvious.
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b71d1d42

03 3月, 2011 1 次提交
- D
  ipv4: Make output route lookup return rtable directly. · b23dd4fe
  由 David S. Miller 提交于 3月 02, 2011
```
Instead of on the stack.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
  b23dd4fe
02 3月, 2011 1 次提交
- D
  ipv4: Can final ip_route_connect() arg to boolean "can_sleep". · abdf7e72
  由 David S. Miller 提交于 3月 01, 2011
```
Since that's what the current vague "flags" thing means.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
  abdf7e72
25 2月, 2011 1 次提交

ipv4: Rearrange how ip_route_newports() gets port keys. · dca8b089

由 David S. Miller 提交于 2月 24, 2011

ip_route_newports() is the only place in the entire kernel that
cares about the port members in the routing cache entry's lookup
flow key.

Therefore the only reason we store an entire flow inside of the
struct rtentry is for this one special case.

Rewrite ip_route_newports() such that:

1) The caller passes in the original port values, so we don't need
   to use the rth->fl.fl_ip_{s,d}port values to remember them.

2) The lookup flow is constructed by hand instead of being copied
   from the routing cache entry's flow.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

dca8b089

21 2月, 2011 1 次提交

tcp: Remove debug macro of TCP_CHECK_TIMER · 089c3482

由 Shan Wei 提交于 2月 19, 2011

Now, TCP_CHECK_TIMER is not used for debuging, it does nothing.
And, it has been there for several years, maybe 6 years.

Remove it to keep code clearer.
Signed-off-by: NShan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

089c3482

11 2月, 2011 1 次提交

inetpeer: Abstract address representation further. · 7a71ed89

由 David S. Miller 提交于 2月 09, 2011

Future changes will add caching information, and some of
these new elements will be addresses.

Since the family is implicit via the ->daddr.family member,
replicating the family in ever address we store is entirely
redundant.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7a71ed89

25 1月, 2011 1 次提交

tcp: fix bug in listening_get_next() · fd0273c5

由 Eric Dumazet 提交于 1月 24, 2011

commit a8b690f9 (tcp: Fix slowness in read /proc/net/tcp)
introduced a bug in handling of SYN_RECV sockets.

st->offset represents number of sockets found since beginning of
listening_hash[st->bucket].

We should not reset st->offset when iterating through
syn_table[st->sbucket], or else if more than ~25 sockets (if
PAGE_SIZE=4096) are in SYN_RECV state, we exit from listening_get_next()
with a too small st->offset

Next time we enter tcp_seek_last_pos(), we are not able to seek past
already found sockets.
Reported-by: NPK <runningdoglackey@yahoo.com>
CC: Tom Herbert <therbert@google.com>
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fd0273c5

24 12月, 2010 1 次提交

tcp: fix listening_get_next() · 1bde5ac4

由 Eric Dumazet 提交于 12月 23, 2010

Alexey Vlasov found /proc/net/tcp could sometime loop and display
millions of sockets in LISTEN state.

In 2.6.29, when we converted TCP hash tables to RCU, we left two
sk_next() calls in listening_get_next().

We must instead use sk_nulls_next() to properly detect an end of chain.
Reported-by: NAlexey Vlasov <renton@renton.name>
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1bde5ac4

14 12月, 2010 1 次提交

net: Abstract default ADVMSS behind an accessor. · 0dbaee3b

由 David S. Miller 提交于 12月 13, 2010

Make all RTAX_ADVMSS metric accesses go through a new helper function,
dst_metric_advmss().

Leave the actual default metric as "zero" in the real metric slot,
and compute the actual default value dynamically via a new dst_ops
AF specific callback.

For stacked IPSEC routes, we use the advmss of the path which
preserves existing behavior.

Unlike ipv4/ipv6, DecNET ties the advmss to the mtu and thus updates
advmss on pmtu updates.  This inconsistency in advmss handling
results in more raw metric accesses than I wish we ended up with.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0dbaee3b

02 12月, 2010 1 次提交

timewait_sock: Create and use getpeer op. · ccb7c410

由 David S. Miller 提交于 12月 01, 2010

The only thing AF-specific about remembering the timestamp
for a time-wait TCP socket is getting the peer.

Abstract that behind a new timewait_sock_ops vector.

Support for real IPV6 sockets is not filled in yet, but
curiously this makes timewait recycling start to work
for v4-mapped ipv6 sockets.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ccb7c410

01 12月, 2010 3 次提交

inet: Turn ->remember_stamp into ->get_peer in connection AF ops. · 3f419d2d

由 David S. Miller 提交于 11月 29, 2010

Then we can make a completely generic tcp_remember_stamp()
that uses ->get_peer() as a helper, minimizing the AF specific
code and minimizing the eventual code duplication when we implement
the ipv6 side of TW recycling.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3f419d2d

D
inetpeer: Make inet_getpeer() take an inet_peer_adress_t pointer. · b534ecf1
由 David S. Miller 提交于 11月 30, 2010
```
And make an inet_getpeer_v4() helper, update callers.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
b534ecf1

inetpeer: Introduce inet_peer_address_t. · 582a72da

由 David S. Miller 提交于 11月 30, 2010

Currently only the v4 aspect is used, but this will change.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

582a72da

28 11月, 2010 1 次提交

netns: Don't leak others' openreq-s in proc · 8475ef9f

由 Pavel Emelyanov 提交于 11月 22, 2010

The /proc/net/tcp leaks openreq sockets from other namespaces.
Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8475ef9f

13 11月, 2010 1 次提交

tcp: Don't change unlocked socket state in tcp_v4_err(). · 8f49c270

由 David S. Miller 提交于 11月 12, 2010

Alexey Kuznetsov noticed a regression introduced by
commit f1ecd5d9
("Revert Backoff [v3]: Revert RTO on ICMP destination unreachable")

The RTO and timer modification code added to tcp_v4_err()
doesn't check sock_owned_by_user(), which if true means we
don't have exclusive access to the socket and therefore cannot
modify it's critical state.

Just skip this new code block if sock_owned_by_user() is true
and eliminate the now superfluous sock_owned_by_user() code
block contained within.
Reported-by: NAlexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
CC: Damian Lukowski <damian@tvk.rwth-aachen.de>
Acked-by: NEric Dumazet <eric.dumazet@gmail.com>

8f49c270

21 10月, 2010 1 次提交

tproxy: fix hash locking issue when using port redirection in __inet_inherit_port() · 093d2823

由 Balazs Scheidler 提交于 10月 21, 2010

When __inet_inherit_port() is called on a tproxy connection the wrong locks are
held for the inet_bind_bucket it is added to. __inet_inherit_port() made an
implicit assumption that the listener's port number (and thus its bind bucket).
Unfortunately, if you're using the TPROXY target to redirect skbs to a
transparent proxy that assumption is not true anymore and things break.

This patch adds code to __inet_inherit_port() so that it can handle this case
by looking up or creating a new bind bucket for the child socket and updates
callers of __inet_inherit_port() to gracefully handle __inet_inherit_port()
failing.

Reported by and original patch from Stephen Buck <stephen.buck@exinda.com>.
See http://marc.info/?t=128169268200001&r=1&w=2 for the original discussion.
Signed-off-by: NKOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: NPatrick McHardy <kaber@trash.net>

093d2823

01 9月, 2010 1 次提交

gro: unexport tcp4_gro_receive and tcp4_gro_complete · 1639ab6f

由 Eric Dumazet 提交于 8月 31, 2010

tcp4_gro_receive() and tcp4_gro_complete() dont need to be exported.
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1639ab6f

13 7月, 2010 2 次提交

inet, inet6: make tcp_sendmsg() and tcp_sendpage() through inet_sendmsg() and inet_sendpage() · 7ba42910

由 Changli Gao 提交于 7月 10, 2010

a new boolean flag no_autobind is added to structure proto to avoid the autobind
calls when the protocol is TCP. Then sock_rps_record_flow() is called int the
TCP's sendmsg() and sendpage() pathes.
Signed-off-by: NChangli Gao <xiaosuo@gmail.com>
----
 include/net/inet_common.h |    4 ++++
 include/net/sock.h        |    1 +
 include/net/tcp.h         |    8 ++++----
 net/ipv4/af_inet.c        |   15 +++++++++------
 net/ipv4/tcp.c            |   11 +++++------
 net/ipv4/tcp_ipv4.c       |    3 +++
 net/ipv6/af_inet6.c       |    8 ++++----
 net/ipv6/tcp_ipv6.c       |    3 +++
 8 files changed, 33 insertions(+), 20 deletions(-)
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7ba42910

net/ipv4: EXPORT_SYMBOL cleanups · 4bc2f18b

由 Eric Dumazet 提交于 7月 09, 2010

CodingStyle cleanups

EXPORT_SYMBOL should immediately follow the symbol declaration.
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4bc2f18b

27 6月, 2010 1 次提交

syncookies: add support for ECN · 172d69e6

由 Florian Westphal 提交于 6月 21, 2010

Allows use of ECN when syncookies are in effect by encoding ecn_ok
into the syn-ack tcp timestamp.

While at it, remove a uneeded #ifdef CONFIG_SYN_COOKIES.
With CONFIG_SYN_COOKIES=nm want_cookie is ifdef'd to 0 and gcc
removes the "if (0)".
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

172d69e6

17 6月, 2010 1 次提交

inetpeer: restore small inet_peer structures · 317fe0e6

由 Eric Dumazet 提交于 6月 16, 2010

Addition of rcu_head to struct inet_peer added 16bytes on 64bit arches.

Thats a bit unfortunate, since old size was exactly 64 bytes.

This can be solved, using an union between this rcu_head an four fields,
that are normally used only when a refcount is taken on inet_peer.
rcu_head is used only when refcnt=-1, right before structure freeing.

Add a inet_peer_refcheck() function to check this assertion for a while.

We can bring back SLAB_HWCACHE_ALIGN qualifier in kmem cache creation.
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

317fe0e6

11 6月, 2010 1 次提交

net-next: remove useless union keyword · d8d1f30b

由 Changli Gao 提交于 6月 10, 2010

remove useless union keyword in rtable, rt6_info and dn_route.

Since there is only one member in a union, the union keyword isn't useful.
Signed-off-by: NChangli Gao <xiaosuo@gmail.com>
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d8d1f30b

07 6月, 2010 1 次提交

tcp: Fix slowness in read /proc/net/tcp · a8b690f9

由 Tom Herbert 提交于 6月 07, 2010

This patch address a serious performance issue in reading the
TCP sockets table (/proc/net/tcp).

Reading the full table is done by a number of sequential read
operations.  At each read operation, a seek is done to find the
last socket that was previously read.  This seek operation requires
that the sockets in the table need to be counted up to the current
file position, and to count each of these requires taking a lock for
each non-empty bucket.  The whole algorithm is O(n^2).

The fix is to cache the last bucket value, offset within the bucket,
and the file position returned by the last read operation.   On the
next sequential read, the bucket and offset are used to find the
last read socket immediately without needing ot scan the previous
buckets  the table.  This algorithm t read the whole table is O(n).

The improvement offered by this patch is easily show by performing
cat'ing /proc/net/tcp on a machine with a lot of connections.  With
about 182K connections in the table, I see the following:

- Without patch
time cat /proc/net/tcp > /dev/null

real	1m56.729s
user	0m0.214s
sys	1m56.344s

- With patch
time cat /proc/net/tcp > /dev/null

real	0m0.894s
user	0m0.290s
sys	0m0.594s
Signed-off-by: NTom Herbert <therbert@google.com>
Acked-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a8b690f9

05 6月, 2010 3 次提交

syncookies: avoid unneeded tcp header flag double check · af9b4738

由 Florian Westphal 提交于 6月 03, 2010

caller: if (!th->rst && !th->syn && th->ack)
callee: if (!th->ack)

make the caller only check for !syn (common for 3whs), and move
the !rst / ack test to the callee.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

af9b4738

syncookies: make v4/v6 synflood warning behaviour the same · 2a1d4bd4

由 Florian Westphal 提交于 6月 03, 2010

both syn_flood_warning functions print a message, but
ipv4 version only prints a warning if CONFIG_SYN_COOKIES=y.

Make the v4 one behave like the v6 one.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2a1d4bd4

rps: tcp: fix rps_sock_flow_table table updates · ca55158c

由 Eric Dumazet 提交于 6月 03, 2010

I believe a moderate SYN flood attack can corrupt RFS flow table
(rps_sock_flow_table), making RPS/RFS much less effective.

Even in a normal situation, server handling short lived sessions suffer
from bad steering for the first data packet of a session, if another SYN
packet is received for another session.

We do following action in tcp_v4_rcv() :

	sock_rps_save_rxhash(sk, skb->rxhash);

We should _not_ do this if sk is a LISTEN socket, as about each
packet received on a LISTEN socket has a different rxhash than
previous one.
 -> RPS_NO_CPU markers are spread all over rps_sock_flow_table.

Also, it makes sense to protect sk->rxhash field changes with socket
lock (We currently can change it even if user thread owns the lock
and might use rxhash)

This patch moves sock_rps_save_rxhash() to a sock locked section,
and only for non LISTEN sockets.
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ca55158c

16 5月, 2010 1 次提交

net: Introduce sk_route_nocaps · a465419b

由 Eric Dumazet 提交于 5月 16, 2010

TCP-MD5 sessions have intermittent failures, when route cache is
invalidated. ip_queue_xmit() has to find a new route, calls
sk_setup_caps(sk, &rt->u.dst), destroying the 

sk->sk_route_caps &= ~NETIF_F_GSO_MASK

that MD5 desperately try to make all over its way (from
tcp_transmit_skb() for example)

So we send few bad packets, and everything is fine when
tcp_transmit_skb() is called again for this socket.

Since ip_queue_xmit() is at a lower level than TCP-MD5, I chose to use a
socket field, sk_route_nocaps, containing bits to mask on sk_route_caps.
Reported-by: NBhaskar Dutta <bhaskie@gmail.com>
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a465419b

28 4月, 2010 1 次提交

net: Make RFS socket operations not be inet specific. · c58dc01b

由 David S. Miller 提交于 4月 27, 2010

Idea from Eric Dumazet.

As for placement inside of struct sock, I tried to choose a place
that otherwise has a 32-bit hole on 64-bit systems.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
Acked-by: NEric Dumazet <eric.dumazet@gmail.com>

c58dc01b

21 4月, 2010 1 次提交

net: Fix various endianness glitches · 0eae88f3

由 Eric Dumazet 提交于 4月 20, 2010

Sparse can help us find endianness bugs, but we need to make some
cleanups to be able to more easily spot real bugs.
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0eae88f3

17 4月, 2010 1 次提交

rfs: Receive Flow Steering · fec5e652

由 Tom Herbert 提交于 4月 16, 2010

This patch implements receive flow steering (RFS).  RFS steers
received packets for layer 3 and 4 processing to the CPU where
the application for the corresponding flow is running.  RFS is an
extension of Receive Packet Steering (RPS).

The basic idea of RFS is that when an application calls recvmsg
(or sendmsg) the application's running CPU is stored in a hash
table that is indexed by the connection's rxhash which is stored in
the socket structure.  The rxhash is passed in skb's received on
the connection from netif_receive_skb.  For each received packet,
the associated rxhash is used to look up the CPU in the hash table,
if a valid CPU is set then the packet is steered to that CPU using
the RPS mechanisms.

The convolution of the simple approach is that it would potentially
allow OOO packets.  If threads are thrashing around CPUs or multiple
threads are trying to read from the same sockets, a quickly changing
CPU value in the hash table could cause rampant OOO packets--
we consider this a non-starter.

To avoid OOO packets, this solution implements two types of hash
tables: rps_sock_flow_table and rps_dev_flow_table.

rps_sock_table is a global hash table.  Each entry is just a CPU
number and it is populated in recvmsg and sendmsg as described above.
This table contains the "desired" CPUs for flows.

rps_dev_flow_table is specific to each device queue.  Each entry
contains a CPU and a tail queue counter.  The CPU is the "current"
CPU for a matching flow.  The tail queue counter holds the value
of a tail queue counter for the associated CPU's backlog queue at
the time of last enqueue for a flow matching the entry.

Each backlog queue has a queue head counter which is incremented
on dequeue, and so a queue tail counter is computed as queue head
count + queue length.  When a packet is enqueued on a backlog queue,
the current value of the queue tail counter is saved in the hash
entry of the rps_dev_flow_table.

And now the trick: when selecting the CPU for RPS (get_rps_cpu)
the rps_sock_flow table and the rps_dev_flow table for the RX queue
are consulted.  When the desired CPU for the flow (found in the
rps_sock_flow table) does not match the current CPU (found in the
rps_dev_flow table), the current CPU is changed to the desired CPU
if one of the following is true:

- The current CPU is unset (equal to RPS_NO_CPU)
- Current CPU is offline
- The current CPU's queue head counter >= queue tail counter in the
rps_dev_flow table.  This checks if the queue tail has advanced
beyond the last packet that was enqueued using this table entry.
This guarantees that all packets queued using this entry have been
dequeued, thus preserving in order delivery.

Making each queue have its own rps_dev_flow table has two advantages:
1) the tail queue counters will be written on each receive, so
keeping the table local to interrupting CPU s good for locality.  2)
this allows lockless access to the table-- the CPU number and queue
tail counter need to be accessed together under mutual exclusion
from netif_receive_skb, we assume that this is only called from
device napi_poll which is non-reentrant.

This patch implements RFS for TCP and connected UDP sockets.
It should be usable for other flow oriented protocols.

There are two configuration parameters for RFS.  The
"rps_flow_entries" kernel init parameter sets the number of
entries in the rps_sock_flow_table, the per rxqueue sysfs entry
"rps_flow_cnt" contains the number of entries in the rps_dev_flow
table for the rxqueue.  Both are rounded to power of two.

The obvious benefit of RFS (over just RPS) is that it achieves
CPU locality between the receive processing for a flow and the
applications processing; this can result in increased performance
(higher pps, lower latency).

The benefits of RFS are dependent on cache hierarchy, application
load, and other factors.  On simple benchmarks, we don't necessarily
see improvement and sometimes see degradation.  However, for more
complex benchmarks and for applications where cache pressure is
much higher this technique seems to perform very well.

Below are some benchmark results which show the potential benfit of
this patch.  The netperf test has 500 instances of netperf TCP_RR
test with 1 byte req. and resp.  The RPC test is an request/response
test similar in structure to netperf RR test ith 100 threads on
each host, but does more work in userspace that netperf.

e1000e on 8 core Intel
   No RFS or RPS		104K tps at 30% CPU
   No RFS (best RPS config):    290K tps at 63% CPU
   RFS				303K tps at 61% CPU

RPC test	tps	CPU%	50/90/99% usec latency	Latency StdDev
  No RFS/RPS	103K	48%	757/900/3185		4472.35
  RPS only:	174K	73%	415/993/2468		491.66
  RFS		223K	73%	379/651/1382		315.61
Signed-off-by: NTom Herbert <therbert@google.com>
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fec5e652

12 4月, 2010 2 次提交

inet: Remove unused send_check length argument · bb296246

由 Herbert Xu 提交于 4月 11, 2010

inet: Remove unused send_check length argument

This patch removes the unused length argument from the send_check
function in struct inet_connection_sock_af_ops.
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Tested-by: NYinghai <yinghai.lu@oracle.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bb296246

tcp: Handle CHECKSUM_PARTIAL for SYNACK packets for IPv4 · 419f9f89

由 Herbert Xu 提交于 4月 11, 2010

tcp: Handle CHECKSUM_PARTIAL for SYNACK packets for IPv4

This patch moves the common code between tcp_v4_send_check and
tcp_v4_gso_send_check into a new function __tcp_v4_send_check.

It then uses the new function in tcp_v4_send_synack so that it
handles CHECKSUM_PARTIAL properly.
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Tested-by: NYinghai <yinghai.lu@oracle.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

419f9f89

openanolis / cloud-kernel 大约 1 年 前同步成功

openanolis / cloud-kernel
大约 1 年前同步成功