Commits · 1ed5c48f231cd00eac0b3d2350ac61e3c825063e · openanolis / cloud-kernel

12 Jul, 2012 9 commits

net: Remove checks for dst_ops->redirect being NULL. · 1ed5c48f

Committed by David S. Miller on Jul 12, 2012

No longer necessary.
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Add dummy dst_ops->redirect method where needed. · b587ee3b

Committed by David S. Miller on Jul 12, 2012

Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: Use icmpv6_notify() to propagate redirect, instead of rt6_redirect(). · b94f1c09

Committed by David S. Miller on Jul 12, 2012

And delete rt6_redirect(), since it is no longer used.
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: Add redirect support to all protocol icmp error handlers. · ec18d9a2

Committed by David S. Miller on Jul 12, 2012

Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: Add ip6_redirect() and ip6_sk_redirect() helper functions. · 3a5ad2ee

Committed by David S. Miller on Jul 12, 2012

Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: Pull main logic of rt6_redirect() into rt6_do_redirect(). · 6e157b6a

Committed by David S. Miller on Jul 12, 2012

Hook it into dst_ops->redirect as well.
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: Move bulk of redirect handling into rt6_redirect(). · e8599ff4

Committed by David S. Miller on Jul 11, 2012

This sets things up so that we can have the protocol error handlers
call down into the ipv6 route code for redirects just as ipv4 already
does.
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: Export ndisc option parsing from ndisc.c · 30f2a5f3

Committed by David S. Miller on Jul 11, 2012

This is going to be used internally by the rt6 redirect code.
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: TCP Small Queues · 46d3ceab

Committed by Eric Dumazet on Jul 11, 2012

This introduces TSQ (TCP Small Queues).

The goal of TSQ is to reduce the number of TCP packets in xmit queues
(qdisc & device queues), to reduce RTT and cwnd bias, part of the
bufferbloat problem.

sk->sk_wmem_alloc is not allowed to grow above a given limit,
allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a
given time.

TSO packets are sized/capped to half the limit, so that we have two
TSO packets in flight, allowing better bandwidth use.

As a side effect, setting the limit to 40000 automatically reduces the
standard gso max limit (65536) to 40000/2 : It can help to reduce
latencies of high prio packets, having smaller TSO packets.

This means we divert sock_wfree() to a tcp_wfree() handler, to
queue/send following frames when skb_orphan() [2] is called for the
already queued skbs.

Results on my dev machines (tg3/ixgbe nics) are really impressive,
using standard pfifo_fast, and with or without TSO/GSO.

Without reduction of nominal bandwidth, we have reduction of buffering
per bulk sender :
< 1ms on Gbit (instead of 50ms with TSO)
< 8ms on 100Mbit (instead of 132 ms)

I no longer have 4 MBytes backlogged in qdisc by a single netperf
session, and socket autotuning on both sides no longer uses 4 MBytes.

As the skb destructor cannot restart xmit itself (the qdisc lock might
be taken at this point), we delegate the work to a tasklet. We use one
tasklet per cpu for performance reasons.

If tasklet finds a socket owned by the user, it sets TSQ_OWNED flag.
This flag is tested in a new protocol method called from release_sock(),
to eventually send new segments.

[1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
[2] skb_orphan() is usually called at TX completion time,
  but some drivers call it in their start_xmit() handler.
  These drivers should at least use BQL, or else a single TCP
  session can still fill the whole NIC TX ring, since TSQ will
  have no effect.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Dave Taht <dave.taht@bufferbloat.net>
Cc: Tom Herbert <therbert@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
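
The throttling idea described in the TSQ commit can be sketched as a small user-space model. This is only an illustration under stated assumptions, not the kernel implementation: `toy_sock`, `toy_xmit()` and `toy_tx_complete()` are hypothetical names, `wmem_alloc` stands in for `sk->sk_wmem_alloc`, and `limit` stands in for the `tcp_limit_output_bytes` tunable.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
struct toy_sock {
    size_t wmem_alloc;  /* bytes currently sitting in qdisc/device queues */
    size_t limit;       /* plays the role of tcp_limit_output_bytes */
    bool   throttled;   /* set when the sender had to stop early */
};

/* Try to hand one packet of len bytes to the lower layers. */
static bool toy_xmit(struct toy_sock *sk, size_t len)
{
    if (sk->wmem_alloc >= sk->limit) {
        sk->throttled = true;   /* stop and wait for a TX completion */
        return false;
    }
    sk->wmem_alloc += len;
    return true;
}

/*
 * Called when the device frees a packet (the role the commit gives to
 * the tcp_wfree() destructor). Returns true if the sender should be
 * resumed; the commit defers that resume to a tasklet.
 */
static bool toy_tx_complete(struct toy_sock *sk, size_t len)
{
    sk->wmem_alloc -= len;
    if (sk->throttled && sk->wmem_alloc < sk->limit) {
        sk->throttled = false;
        return true;
    }
    return false;
}
```

With a limit of 131072 bytes, two 64 KB packets are accepted, a third is refused until one of the first two completes, which matches the "two TSO packets in flight" behaviour the message describes.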

11 Jul, 2012 7 commits

net: Fix (nearly-)kernel-doc comments for various functions · 2c53040f

Committed by Ben Hutchings on Jul 10, 2012

Fix incorrect start markers, wrapped summary lines, missing section
breaks, incorrect separators, and some name mismatches.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rtnetlink: Remove ts/tsage args to rtnl_put_cacheinfo(). · 87a50699

Committed by David S. Miller on Jul 10, 2012

Nobody provides non-zero values any longer.
Signed-off-by: David S. Miller <davem@davemloft.net>

inet: Kill FLOWI_FLAG_PRECOW_METRICS. · 3e12939a

Committed by David S. Miller on Jul 10, 2012

No longer needed.  TCP writes metrics, but now in its own special
cache that does not dirty the route metrics.  Therefore there is no
longer any reason to pre-cow metrics in this way.
Signed-off-by: David S. Miller <davem@davemloft.net>

inet: Minimize use of cached route inetpeer. · 1d861aa4

Committed by David S. Miller on Jul 10, 2012

Only use it in the absolutely required cases:

1) COW'ing metrics

2) ipv4 PMTU

3) ipv4 redirects

Signed-off-by: David S. Miller <davem@davemloft.net>

inet: Remove ->get_peer() method. · 16d18399

Committed by David S. Miller on Jul 10, 2012

No longer used.
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: Move timestamps from inetpeer to metrics cache. · 81166dd6

Committed by David S. Miller on Jul 10, 2012

With help from Lin Ming.
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: Abstract back handling peer aliveness test into helper function. · ab92bb2f

Committed by David S. Miller on Jul 09, 2012

Signed-off-by: David S. Miller <davem@davemloft.net>

06 Jul, 2012 3 commits

ipv6: fix a bad cast in ip6_dst_lookup_tail() · c56bf6fe

Committed by Eric Dumazet on Jul 06, 2012

Fix a bug in ip6_dst_lookup_tail(), where typeof(dst) is
"struct dst_entry **", not "struct dst_entry *"
Reported-by: Fengguang Wu <wfg@linux.intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
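
The class of bug named in the ip6_dst_lookup_tail() fix can be illustrated with simplified stand-in types. This is a sketch, not the patched kernel code: the two structs below are reduced to a couple of fields, `toy_dst_to_rt6()` is a hypothetical helper, and it is an assumption that the cast in question converts to `struct rt6_info *`.

```c
#include <stddef.h>

/* Simplified stand-ins; the real kernel structures have many more fields. */
struct dst_entry { int obsolete; };
struct rt6_info  { struct dst_entry dst; int rt6i_flags; };

/*
 * dst is an in/out parameter of type "struct dst_entry **", so it has
 * to be dereferenced before the cast.
 */
static struct rt6_info *toy_dst_to_rt6(struct dst_entry **dst)
{
    /*
     * Wrong: (struct rt6_info *)dst would reinterpret the caller's
     * pointer variable itself as a route.
     * Right: cast the dst_entry that the pointer refers to. This is
     * valid here because dst is the first member of struct rt6_info.
     */
    return (struct rt6_info *)*dst;
}
```

The compiler accepts both casts silently, which is why this kind of mistake tends to be caught by static analysis or at run time rather than at build time.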

ipv6: remove redundant declarations · 883dd4fb

Committed by Eric Dumazet on Jul 05, 2012

Remove redundant declarations; they belong in include/net/tcp.h
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: Initialize the neighbour pointer of rt6_info on allocation · a2de86f6

Committed by Steffen Klassert on Jul 05, 2012

git commit 97cac082 (ipv6: Store route neighbour in rt6_info struct)
added a neighbour pointer to rt6_info. Currently we don't initialize
this pointer at allocation time. We assume this pointer to be valid
if it is not a null pointer, so initialize it on allocation.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

05 Jul, 2012 6 commits

ipv6: remove unnecessary codes in tcp_ipv6.c · 43264e0b

Committed by RongQing.Li on Jul 01, 2012

opt always equals np->opts, so it is meaningless to define opt, and
check if opt does not equal np->opts and then try to free opt.
Signed-off-by: RongQing.Li <roy.qing.li@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: Store route neighbour in rt6_info struct. · 97cac082

Committed by David S. Miller on Jul 02, 2012

This makes for a simplified conversion away from dst_get_neighbour*().

All code outside of ipv6 will use neigh lookups via dst_neigh_lookup*().
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Pass neighbours and dest address into NETEVENT_REDIRECT events. · 1d248b1c

Committed by David S. Miller on Jul 03, 2012

Signed-off-by: David S. Miller <davem@davemloft.net>

net: Add optional SKB arg to dst_ops->neigh_lookup(). · f894cbf8

Committed by David S. Miller on Jul 02, 2012

Causes the handler to use the daddr in the ipv4/ipv6 header when
the route gateway is unspecified (local subnet).
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Do delayed neigh confirmation. · 5110effe

Committed by David S. Miller on Jul 02, 2012

When a dst_confirm() happens, mark the confirmation as pending in the
dst.  Then on the next packet out, when we have the neigh in-hand, do
the update.

This removes dst_confirm()'s dependency on the dst having an
attached neigh.

While we're here, remove the explicit 'dst' NULL check, all except 2
or 3 call sites ensure it's not NULL.  So just fix those cases up.
Signed-off-by: David S. Miller <davem@davemloft.net>
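
The deferred-confirmation pattern described above can be sketched in a few lines. This is a minimal model under assumptions, not the kernel's code: `toy_dst`, `toy_neigh`, `toy_dst_confirm()` and `toy_output()` are hypothetical names, and `now` stands in for whatever timestamp the caller uses.

```c
#include <stdbool.h>

/* Hypothetical, simplified types for illustration only. */
struct toy_neigh { unsigned long confirmed; };  /* last confirmation time */
struct toy_dst   { bool pending_confirm; };     /* confirmation not yet applied */

/* Confirmation no longer needs a neighbour: it only records the intent. */
static void toy_dst_confirm(struct toy_dst *dst)
{
    dst->pending_confirm = true;
}

/* On the next packet out, the neighbour is in hand, so apply the update. */
static void toy_output(struct toy_dst *dst, struct toy_neigh *n,
                       unsigned long now)
{
    if (dst->pending_confirm) {
        dst->pending_confirm = false;
        n->confirmed = now;
    }
    /* ... the packet would be transmitted via n here ... */
}
```

The design choice is that the confirming side only needs the dst, while the neighbour is touched solely on the output path where it is already available.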

netfilter: nf_conntrack: generalize nf_ct_l4proto_net · 08911475

Committed by Pablo Neira Ayuso on Jun 29, 2012

This patch generalizes nf_ct_l4proto_net by splitting it into chunks and
moving the corresponding protocol part to where it really belongs.

To clarify, note that we follow two different approaches to support per-net
depending on whether it's a built-in or run-time loadable protocol tracker.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Gao feng <gaofeng@cn.fujitsu.com>

29 Jun, 2012 5 commits

ipv6_tunnel: Allow receiving packets on the fallback tunnel if they pass sanity checks · d0087b29

Committed by Ville Nuorvala on Jun 28, 2012

At Facebook, we do Layer-3 DSR via IP-in-IP tunneling. Our load balancers wrap
an extra IP header on incoming packets so they can be routed to the backend.
In the v4 tunnel driver, when these packets fall on the default tunl0 device,
the behavior is to decapsulate them and drop them back on the stack. So our
setup is that tunl0 has the VIP and eth0 has (obviously) the backend's real
address.

In IPv6 we do the same thing, but the v6 tunnel driver didn't have this same
behavior - if you didn't have an explicit tunnel setup, it would drop the
packet.

This patch brings that v4 feature to the v6 driver.

The same IPv6 address checks are performed as with any normal tunnel,
but as the fallback tunnel endpoint addresses are unspecified, the checks
must be performed on a per-packet basis, rather than at tunnel
configuration time.

[Patch description modified by phil@ipom.com]
Signed-off-by: Ville Nuorvala <ville.nuorvala@gmail.com>
Tested-by: Phil Dibowitz <phil@ipom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: plug dst leak in tcp_v6_conn_request() · 9f10d3f6

Committed by Neal Cardwell on Jun 28, 2012

The code in tcp_v6_conn_request() was implicitly assuming that
tcp_v6_send_synack() would take care of dst_release(), much as
tcp_v4_send_synack() already does. This resulted in
tcp_v6_conn_request() leaking a dst if sysctl_tw_recycle is enabled.

This commit restructures tcp_v6_send_synack() so that it accepts a dst
pointer and takes care of releasing the dst that is passed in, to plug
the leak and avoid future surprises by bringing the IPv6 behavior in
line with the IPv4 side.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: use inet6_csk_route_req() in tcp_v6_send_synack() · 9494218f

Committed by Neal Cardwell on Jun 28, 2012

With the recent change (earlier in this patch series) to set
flowi6_oif to treq->iif in inet6_csk_route_req(), the dst lookup in
these two functions is now identical, so tcp_v6_send_synack() can now
just call inet6_csk_route_req(), to reduce code duplication and keep
things closer to the IPv4 side, which is structured this way.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: pass fl6 to inet6_csk_route_req() · 3840a06e

Committed by Neal Cardwell on Jun 28, 2012

This commit changes inet6_csk_route_req() so that it uses a pointer to
a struct flowi6, rather than allocating its own on the stack. This
brings its behavior in line with its IPv4 cousin,
inet_csk_route_req(), and allows a follow-on patch to fix a dst leak.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: fix inet6_csk_route_req() for link-local addresses · 9247869e

Committed by Neal Cardwell on Jun 28, 2012

Fix inet6_csk_route_req() to use as the flowi6_oif the treq->iif,
which is correctly fixed up in tcp_v6_conn_request() to handle the
case of link-local addresses. This brings it in line with the
tcp_v6_send_synack() code, which is already correctly using the
treq->iif in this way.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

28 Jun, 2012 4 commits

net: skb_free_datagram_locked() doesnt drop all packets · 22911fc5

Committed by Eric Dumazet on Jun 27, 2012

dropwatch wrongly diagnoses all received UDP packets as drops.

This patch removes trace_kfree_skb() done in skb_free_datagram_locked().

Locations calling skb_free_datagram_locked() should do it on their own.

As a result, drops are accounted on the right function.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ip6mr: Do not use RTA_PUT() macros · 74a0bd7d

Committed by Thomas Graf on Jun 26, 2012

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

netfilter: nf_ct_icmpv6: add icmpv6_kmemdup_sysctl_table function · 8fc02781

Committed by Gao feng on Jun 21, 2012

Split sysctl function into smaller chunks to clean up code and prepare
patches to reduce ifdef pollution.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_conntrack: prepare l4proto->init_net cleanup · f1caad27

Committed by Gao feng on Jun 21, 2012

l4proto->init contains quite redundant code. We can simplify this
by adding a new parameter l3proto.

This patch prepares that code simplification.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

26 Jun, 2012 3 commits

net/ipv6/route.c: packets originating on device match lo · 4dc27d1c

Committed by David McCullough on Jun 25, 2012

Fix to allow IPv6 packets originating locally to match rules with the "iif"
set to "lo".  This allows IPv6 rule matching to work the same as it does for
IPv4.  From the iproute2 man page:

   iif NAME
          select the incoming device to match.  If the interface is
          loopback, the rule only matches packets originating from this
          host.  This means that you may create separate routing tables
          for forwarded and local packets and, hence, completely
          segregate them.

Signed-off-by: David McCullough <david_mccullough@mcafee.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: heed result of security_inet_conn_request() in tcp_v6_conn_request() · 437c5b53

Committed by Neal Cardwell on Jun 23, 2012

If security_inet_conn_request() returns non-zero then TCP/IPv6 should
drop the request, just as in TCP/IPv4 and DCCP in both IPv4 and IPv6.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: fib: fix fib dump restart · fa809e2f

Committed by Eric Dumazet on Jun 25, 2012

Commit 2bec5a36 (ipv6: fib: fix crash when changing large fib
while dumping it) introduced the ability to restart the dump at the tree
root, but failed to correctly skip the count of already dumped entries.
The code didn't match Patrick's intent.

We must skip exactly the number of already dumped entries.

Note that like other /proc/net files or netlink producers, we could
still dump some duplicate entries.
Reported-by: Debabrata Banerjee <dbavatar@gmail.com>
Reported-by: Josh Hunt <johunt@akamai.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
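
The "skip exactly the number of already dumped entries" rule can be sketched with a flat array standing in for the fib tree. This is a toy model, not the fib6 walker: `toy_dump()` and its parameters are hypothetical, and a real walk would traverse tree nodes rather than array slots.

```c
#include <stddef.h>

/*
 * Restart a dump from the beginning of entries[0..n): skip the `skip`
 * entries emitted by earlier passes, then emit at most `budget` more
 * into out[]. Returns the skip count to use for the next pass.
 */
static size_t toy_dump(const int *entries, size_t n, size_t skip,
                       size_t budget, int *out, size_t *out_len)
{
    size_t seen = 0, emitted = 0;

    for (size_t i = 0; i < n && emitted < budget; i++) {
        if (seen++ < skip)
            continue;               /* already dumped in an earlier pass */
        out[emitted++] = entries[i];
    }
    *out_len = emitted;
    return skip + emitted;
}
```

If the underlying set changes between passes, a count-based skip can still repeat or miss entries, which is consistent with the message's note that duplicates remain possible.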

20 Jun, 2012 1 commit

inet: Sanitize inet{,6} protocol demux. · f9242b6b

Committed by David S. Miller on Jun 19, 2012

Don't pretend that inet_protos[] and inet6_protos[] are hashes, they
are just straight arrays.  Remove all unnecessary hash masking.

Document MAX_INET_PROTOS.

Use RAW_HTABLE_SIZE when appropriate.
Reported-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
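
The point about straight arrays can be illustrated with a small table indexed directly by protocol number. This is a sketch under assumptions: `toy_protos`, `toy_add_protocol()` and `toy_lookup()` are hypothetical names, and the table size of 256 is assumed here so that every 8-bit protocol number has its own slot.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_PROTOS 256  /* assumed: one slot per 8-bit protocol number */

struct toy_proto { const char *name; };

static const struct toy_proto *toy_protos[TOY_MAX_PROTOS];

/* Register a handler; the protocol number is the index, no masking needed. */
static int toy_add_protocol(const struct toy_proto *p, uint8_t protocol)
{
    if (toy_protos[protocol])
        return -1;              /* slot already taken */
    toy_protos[protocol] = p;
    return 0;
}

/* Demux is a plain array load. */
static const struct toy_proto *toy_lookup(uint8_t protocol)
{
    return toy_protos[protocol];
}
```

Because a `uint8_t` index can never exceed 255, an expression such as `protocol & (TOY_MAX_PROTOS - 1)` would be a no-op in this model, which is the kind of masking the commit describes as unnecessary.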

19 Jun, 2012 1 commit

ipv6: Move ipv6 proc file registration to end of init order · d189634e

Committed by Thomas Graf on Jun 18, 2012

/proc/net/ipv6_route reflects the contents of fib_table_hash. The proc
handler is installed in ip6_route_net_init() whereas fib_table_hash is
allocated in fib6_net_init() _after_ the proc handler has been installed.

This opens up a short time frame to access fib_table_hash with its pants
down.

Move the registration of the proc files to a later point in the init
order to avoid the race.

Tested :-)
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

16 Jun, 2012 1 commit

netfilter: add user-space connection tracking helper infrastructure · 12f7a505

Committed by Pablo Neira Ayuso on May 13, 2012

There are good reasons to support helpers in user-space instead:

* Rapid connection tracking helper development, as developing code
  in user-space is usually faster.

* Reliability: A buggy helper does not crash the kernel. Moreover,
  we can monitor the helper process and restart it in case of problems.

* Security: Avoid complex string matching and mangling in kernel-space
  running in privileged mode. Going further, we can even think about
  running user-space helpers as a non-root process.

* Extensibility: It allows the development of very specific helpers (most
  likely non-standard proprietary protocols) that are very likely not to be
  accepted for mainline inclusion in the form of kernel-space connection
  tracking helpers.

This patch adds the infrastructure to allow the implementation of
user-space conntrack helpers by means of the new nfnetlink subsystem
`nfnetlink_cthelper' and the existing queueing infrastructure
(nfnetlink_queue).

I had to add the new hook NF_IP6_PRI_CONNTRACK_HELPER to register
ipv[4|6]_helper which results from splitting ipv[4|6]_confirm into
two pieces. This change is required not to break NAT sequence
adjustment and conntrack confirmation for traffic that is enqueued
to our user-space conntrack helpers.

Basic operation, in a few steps:

1) Register user-space helper by means of `nfct':

 nfct helper add ftp inet tcp

 [ It must be a valid existing helper supported by conntrack-tools ]

2) Add rules to enable the FTP user-space helper which is
   used to track traffic going to TCP port 21.

For locally generated packets:

 iptables -I OUTPUT -t raw -p tcp --dport 21 -j CT --helper ftp

For non-locally generated packets:

 iptables -I PREROUTING -t raw -p tcp --dport 21 -j CT --helper ftp

3) Run the test conntrackd in helper mode (see example files under
   doc/helper/conntrackd.conf):

 conntrackd

4) Generate FTP traffic. If everything is OK, conntrackd should
   create expectations (you can check that with `conntrack'):

 conntrack -E expect

    [NEW] 301 proto=6 src=192.168.1.136 dst=130.89.148.12 sport=0 dport=54037 mask-src=255.255.255.255 mask-dst=255.255.255.255 sport=0 dport=65535 master-src=192.168.1.136 master-dst=130.89.148.12 sport=57127 dport=21 class=0 helper=ftp
[DESTROY] 301 proto=6 src=192.168.1.136 dst=130.89.148.12 sport=0 dport=54037 mask-src=255.255.255.255 mask-dst=255.255.255.255 sport=0 dport=65535 master-src=192.168.1.136 master-dst=130.89.148.12 sport=57127 dport=21 class=0 helper=ftp

This confirms that our test helper is receiving packets including the
conntrack information, and adding expectations in kernel-space.

The user-space helper can also store its private tracking information
in the conntrack structure in the kernel via the CTA_HELP_INFO. The
kernel will consider this a binary blob whose layout is unknown. This
information will be included in the information that is transferred
to user-space via glue code that integrates nfnetlink_queue and
ctnetlink.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
