提交 · f9242b6b28d61295f2bf7e8adfb1060b382e5381 · OpenHarmony / kernel_linux

20 6月, 2012 1 次提交

inet: Sanitize inet{,6} protocol demux. · f9242b6b

由 David S. Miller 提交于 6月 19, 2012

Don't pretend that inet_protos[] and inet6_protos[] are hashes, thay
are just a straight arrays.  Remove all unnecessary hash masking.

Document MAX_INET_PROTOS.

Use RAW_HTABLE_SIZE when appropriate.
Reported-by: NBen Hutchings <bhutchings@solarflare.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f9242b6b

18 6月, 2012 1 次提交

ipv4: Cap ADVMSS metric in the FIB rather than the routing cache. · 6fac2625

由 David S. Miller 提交于 6月 17, 2012

It makes no sense to execute this limit test every time we create a
routing cache entry.

We can't simply error out on these things since we've silently
accepted and truncated them forever.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6fac2625

16 6月, 2012 3 次提交

netfilter: add user-space connection tracking helper infrastructure · 12f7a505

由 Pablo Neira Ayuso 提交于 5月 13, 2012

There are good reasons to supports helpers in user-space instead:

* Rapid connection tracking helper development, as developing code
  in user-space is usually faster.

* Reliability: A buggy helper does not crash the kernel. Moreover,
  we can monitor the helper process and restart it in case of problems.

* Security: Avoid complex string matching and mangling in kernel-space
  running in privileged mode. Going further, we can even think about
  running user-space helpers as a non-root process.

* Extensibility: It allows the development of very specific helpers (most
  likely non-standard proprietary protocols) that are very likely not to be
  accepted for mainline inclusion in the form of kernel-space connection
  tracking helpers.

This patch adds the infrastructure to allow the implementation of
user-space conntrack helpers by means of the new nfnetlink subsystem
`nfnetlink_cthelper' and the existing queueing infrastructure
(nfnetlink_queue).

I had to add the new hook NF_IP6_PRI_CONNTRACK_HELPER to register
ipv[4|6]_helper which results from splitting ipv[4|6]_confirm into
two pieces. This change is required not to break NAT sequence
adjustment and conntrack confirmation for traffic that is enqueued
to our user-space conntrack helpers.

Basic operation, in a few steps:

1) Register user-space helper by means of `nfct':

 nfct helper add ftp inet tcp

 [ It must be a valid existing helper supported by conntrack-tools ]

2) Add rules to enable the FTP user-space helper which is
   used to track traffic going to TCP port 21.

For locally generated packets:

 iptables -I OUTPUT -t raw -p tcp --dport 21 -j CT --helper ftp

For non-locally generated packets:

 iptables -I PREROUTING -t raw -p tcp --dport 21 -j CT --helper ftp

3) Run the test conntrackd in helper mode (see example files under
   doc/helper/conntrackd.conf

 conntrackd

4) Generate FTP traffic going, if everything is OK, then conntrackd
   should create expectations (you can check that with `conntrack':

 conntrack -E expect

    [NEW] 301 proto=6 src=192.168.1.136 dst=130.89.148.12 sport=0 dport=54037 mask-src=255.255.255.255 mask-dst=255.255.255.255 sport=0 dport=65535 master-src=192.168.1.136 master-dst=130.89.148.12 sport=57127 dport=21 class=0 helper=ftp
[DESTROY] 301 proto=6 src=192.168.1.136 dst=130.89.148.12 sport=0 dport=54037 mask-src=255.255.255.255 mask-dst=255.255.255.255 sport=0 dport=65535 master-src=192.168.1.136 master-dst=130.89.148.12 sport=57127 dport=21 class=0 helper=ftp

This confirms that our test helper is receiving packets including the
conntrack information, and adding expectations in kernel-space.

The user-space helper can also store its private tracking information
in the conntrack structure in the kernel via the CTA_HELP_INFO. The
kernel will consider this a binary blob whose layout is unknown. This
information will be included in the information that is transfered
to user-space via glue code that integrates nfnetlink_queue and
ctnetlink.
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

12f7a505

netfilter: nfnetlink_queue: add NAT TCP sequence adjustment if packet mangled · 8c88f87c

由 Pablo Neira Ayuso 提交于 6月 07, 2012

User-space programs that receive traffic via NFQUEUE may mangle packets.
If NAT is enabled, this usually puzzles sequence tracking, leading to
traffic disruptions.

With this patch, nfnl_queue will make the corresponding NAT TCP sequence
adjustment if:

1) The packet has been mangled,
2) the NFQA_CFG_F_CONNTRACK flag has been set, and
3) NAT is detected.

There are some records on the Internet complaning about this issue:
http://stackoverflow.com/questions/260757/packet-mangling-utilities-besides-iptables

By now, we only support TCP since we have no helpers for DCCP or SCTP.
Better to add this if we ever have some helper over those layer 4 protocols.
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

8c88f87c

netfilter: nf_ct_helper: implement variable length helper private data · 1afc5679

由 Pablo Neira Ayuso 提交于 6月 07, 2012

This patch uses the new variable length conntrack extensions.

Instead of using union nf_conntrack_help that contain all the
helper private data information, we allocate variable length
area to store the private helper data.

This patch includes the modification of all existing helpers.
It also includes a couple of include header to avoid compilation
warnings.
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

1afc5679

15 6月, 2012 1 次提交

ipv4: Handle PMTU in all ICMP error handlers. · 36393395

由 David S. Miller 提交于 6月 14, 2012

With ip_rt_frag_needed() removed, we have to explicitly update PMTU
information in every ICMP error handler.

Create two helper functions to facilitate this.

1) ipv4_sk_update_pmtu()

   This updates the PMTU when we have a socket context to
   work with.

2) ipv4_update_pmtu()

   Raw version, used when no socket context is available.  For this
   interface, we essentially just pass in explicit arguments for
   the flow identity information we would have extracted from the
   socket.

   And you'll notice that ipv4_sk_update_pmtu() is simply implemented
   in terms of ipv4_update_pmtu()

Note that __ip_route_output_key() is used, rather than something like
ip_route_output_flow() or ip_route_output_key().  This is because we
absolutely do not want to end up with a route that does IPSEC
encapsulation and the like.  Instead, we only want the route that
would get us to the node described by the outermost IP header.
Reported-by: NSteffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

36393395

13 6月, 2012 2 次提交

net-next: add dev_loopback_xmit() to avoid duplicate code · 95603e22

由 Michel Machado 提交于 6月 12, 2012

Add dev_loopback_xmit() in order to deduplicate functions
ip_dev_loopback_xmit() (in net/ipv4/ip_output.c) and
ip6_dev_loopback_xmit() (in net/ipv6/ip6_output.c).

I was about to reinvent the wheel when I noticed that
ip_dev_loopback_xmit() and ip6_dev_loopback_xmit() do exactly what I
need and are not IP-only functions, but they were not available to reuse
elsewhere.

ip6_dev_loopback_xmit() does not have line "skb_dst_force(skb);", but I
understand that this is harmless, and should be in dev_loopback_xmit().
Signed-off-by: NMichel Machado <michel@digirati.com.br>
CC: "David S. Miller" <davem@davemloft.net>
CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
CC: James Morris <jmorris@namei.org>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Patrick McHardy <kaber@trash.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jpirko@redhat.com>
CC: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
CC: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

95603e22

ipv4: Add interface option to enable routing of 127.0.0.0/8 · d0daebc3

由 Thomas Graf 提交于 6月 12, 2012

Routing of 127/8 is tradtionally forbidden, we consider
packets from that address block martian when routing and do
not process corresponding ARP requests.

This is a sane default but renders a huge address space
practically unuseable.

The RFC states that no address within the 127/8 block should
ever appear on any network anywhere but it does not forbid
the use of such addresses outside of the loopback device in
particular. For example to address a pool of virtual guests
behind a load balancer.

This patch adds a new interface option 'route_localnet'
enabling routing of the 127/8 address block and processing
of ARP requests on a specific interface.

Note that for the feature to work, the default local route
covering 127/8 dev lo needs to be removed.

Example:
  $ sysctl -w net.ipv4.conf.eth0.route_localnet=1
  $ ip route del 127.0.0.0/8 dev lo table local
  $ ip addr add 127.1.0.1/16 dev eth0
  $ ip route flush cache

V2: Fix invalid check to auto flush cache (thanks davem)
Signed-off-by: NThomas Graf <tgraf@suug.ch>
Acked-by: NNeil Horman <nhorman@tuxdriver.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d0daebc3

11 6月, 2012 6 次提交

inet: Avoid potential NULL peer dereference. · 7b34ca2a

由 David S. Miller 提交于 6月 11, 2012

We handle NULL in rt{,6}_set_peer but then our caller will try to pass
that NULL pointer into inet_putpeer() which isn't ready for it.

Fix this by moving the NULL check one level up, and then remove the
now unnecessary NULL check from inetpeer_ptr_set_peer().
Reported-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7b34ca2a

D
inet: Use FIB table peer roots in routes. · 8b96d22d
由 David S. Miller 提交于 6月 11, 2012
```
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
8b96d22d
D
inet: Add inetpeer tree roots to the FIB tables. · 8e773277
由 David S. Miller 提交于 6月 11, 2012
```
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
8e773277

inet: Add family scope inetpeer flushes. · b48c80ec

由 David S. Miller 提交于 6月 10, 2012

This implementation can deal with having many inetpeer roots, which is
a necessary prerequisite for per-FIB table rooted peer tables.

Each family (AF_INET, AF_INET6) has a sequence number which we bump
when we get a family invalidation request.

Each peer lookup cheaply checks whether the flush sequence of the
root we are using is out of date, and if so flushes it and updates
the sequence number.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b48c80ec

ipv4: Kill ip_rt_frag_needed(). · 46517008

由 David S. Miller 提交于 6月 10, 2012

There is zero point to this function.

It's only real substance is to perform an extremely outdated BSD4.2
ICMP check, which we can safely remove.  If you really have a MTU
limited link being routed by a BSD4.2 derived system, here's a nickel
go buy yourself a real router.

The other actions of ip_rt_frag_needed(), checking and conditionally
updating the peer, are done by the per-protocol handlers of the ICMP
event.

TCP, UDP, et al. have a handler which will receive this event and
transmit it back into the associated route via dst_ops->update_pmtu().

This simplification is important, because it eliminates the one place
where we do not have a proper route context in which to make an
inetpeer lookup.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

46517008

inet: Hide route peer accesses behind helpers. · 97bab73f

由 David S. Miller 提交于 6月 09, 2012

We encode the pointer(s) into an unsigned long with one state bit.

The state bit is used so we can store the inetpeer tree root to use
when resolving the peer later.

Later the peer roots will be per-FIB table, and this change works to
facilitate that.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

97bab73f

10 6月, 2012 4 次提交

inet: Pass inetpeer root into inet_getpeer*() interfaces. · c0efc887

由 David S. Miller 提交于 6月 09, 2012

Otherwise we reference potentially non-existing members when
ipv6 is disabled.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c0efc887

inet: Consolidate inetpeer_invalidate_tree() interfaces. · 56a6b248

由 David S. Miller 提交于 6月 09, 2012

We only need one interface for this operation, since we always know
which inetpeer root we want to flush.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

56a6b248

D
inet: Initialize per-netns inetpeer roots in net/ipv{4,6}/route.c · c3426b47
由 David S. Miller 提交于 6月 09, 2012
```
Instead of net/ipv4/inetpeer.c
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
c3426b47

[PATCH] tcp: Cache inetpeer in timewait socket, and only when necessary. · 2397849b

由 David S. Miller 提交于 6月 09, 2012

Since it's guarenteed that we will access the inetpeer if we're trying
to do timewait recycling and TCP options were enabled on the
connection, just cache the peer in the timewait socket.

In the future, inetpeer lookups will be context dependent (per routing
realm), and this helps facilitate that as well.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2397849b

09 6月, 2012 4 次提交

tcp: Get rid of inetpeer special cases. · 4670fd81

由 David S. Miller 提交于 6月 09, 2012

The get_peer method TCP uses is full of special cases that make no
sense accommodating, and it also gets in the way of doing more
reasonable things here.

First of all, if the socket doesn't have a usable cached route, there
is no sense in trying to optimize timewait recycling.

Likewise for the case where we have IP options, such as SRR enabled,
that make the IP header destination address (and thus the destination
address of the route key) differ from that of the connection's
destination address.

Just return a NULL peer in these cases, and thus we're also able to
get rid of the clumsy inetpeer release logic.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4670fd81

inet: Create and use rt{,6}_get_peer_create(). · fbfe95a4

由 David S. Miller 提交于 6月 08, 2012

There's a lot of places that open-code rt{,6}_get_peer() only because
they want to set 'create' to one.  So add an rt{,6}_get_peer_create()
for their sake.

There were also a few spots open-coding plain rt{,6}_get_peer() and
those are transformed here as well.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fbfe95a4

inetpeer: add parameter net for inet_getpeer_v4,v6 · 54db0cc2

由 Gao feng 提交于 6月 08, 2012

add struct net as a parameter of inet_getpeer_v[4,6],
use net to replace &init_net.

and modify some places to provide net for inet_getpeer_v[4,6]
Signed-off-by: NGao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

54db0cc2

inetpeer: add namespace support for inetpeer · c8a627ed

由 Gao feng 提交于 6月 08, 2012

now inetpeer doesn't support namespace,the information will
be leaking across namespace.

this patch move the global vars v4_peers and v6_peers to
netns_ipv4 and netns_ipv6 as a field peers.

add struct pernet_operations inetpeer_ops to initial pernet
inetpeer data.

and change family_to_base and inet_getpeer to support namespace.
Signed-off-by: NGao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c8a627ed

08 6月, 2012 1 次提交

snmp: fix OutOctets counter to include forwarded datagrams · 2d8dbb04

由 Vincent Bernat 提交于 6月 05, 2012

RFC 4293 defines ipIfStatsOutOctets (similar definition for
ipSystemStatsOutOctets):

   The total number of octets in IP datagrams delivered to the lower
   layers for transmission.  Octets from datagrams counted in
   ipIfStatsOutTransmits MUST be counted here.

And ipIfStatsOutTransmits:

   The total number of IP datagrams that this entity supplied to the
   lower layers for transmission.  This includes datagrams generated
   locally and those forwarded by this entity.

Therefore, IPSTATS_MIB_OUTOCTETS must be incremented when incrementing
IPSTATS_MIB_OUTFORWDATAGRAMS.

IP_UPD_PO_STATS is not used since ipIfStatsOutRequests must not
include forwarded datagrams:

   The total number of IP datagrams that local IP user-protocols
   (including ICMP) supplied to IP in requests for transmission.  Note
   that this counter does not include any datagrams counted in
   ipIfStatsOutForwDatagrams.
Signed-off-by: NVincent Bernat <bernat@luffy.cx>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2d8dbb04

07 6月, 2012 8 次提交

netfilter: ipv4, defrag: switch hook PFs to nfproto · 89a48e35

由 Alban Crequy 提交于 5月 14, 2012

This patch is a cleanup. Use NFPROTO_* for consistency with other
netfilter code.
Signed-off-by: NAlban Crequy <alban.crequy@collabora.co.uk>
Reviewed-by: NJavier Martinez Canillas <javier.martinez@collabora.co.uk>
Reviewed-by: NVincent Sanders <vincent.sanders@collabora.co.uk>

89a48e35

netfilter: nf_conntrack: add namespace support for cttimeout · 8264deb8

由 Gao feng 提交于 5月 28, 2012

This patch adds namespace support for cttimeout.
Acked-by: NEric W. Biederman <ebiederm@xmission.com>
Signed-off-by: NGao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

8264deb8

netfilter: nf_conntrack: remove now unused sysctl for nf_conntrack_l[3|4]proto · e76d0af5

由 Pablo Neira Ayuso 提交于 6月 05, 2012

Since the sysctl data for l[3|4]proto now resides in pernet nf_proto_net.
We can now remove this unused fields from struct nf_contrack_l[3,4]proto.
Acked-by: NEric W. Biederman <ebiederm@xmission.com>
Signed-off-by: NGao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

e76d0af5

netfilter: nf_ct_ipv4: add namespace support · 3ea04dd3

由 Gao feng 提交于 5月 28, 2012

This patch adds namespace support for IPv4 protocol tracker.
Acked-by: NEric W. Biederman <ebiederm@xmission.com>
Signed-off-by: NGao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

3ea04dd3

netfilter: nf_ct_icmp: add namespace support · 4b626b9c

由 Gao feng 提交于 5月 28, 2012

This patch adds namespace support for ICMP protocol tracker.
Acked-by: NEric W. Biederman <ebiederm@xmission.com>
Signed-off-by: NGao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

4b626b9c

netfilter: nf_conntrack: prepare namespace support for l3 protocol trackers · 524a53e5

由 Gao feng 提交于 5月 28, 2012

This patch prepares the namespace support for layer 3 protocol trackers.
Basically, this modifies the following interfaces:

* nf_ct_l3proto_[un]register_sysctl.
* nf_conntrack_l3proto_[un]register.

We add a new nf_ct_l3proto_net is used to get the pernet data of l3proto.

This adds rhe new struct nf_ip_net that is used to store the sysctl header
and l3proto_ipv4,l4proto_tcp(6),l4proto_udp(6),l4proto_icmp(v6) because the
protos such tcp and tcp6 use the same data,so making nf_ip_net as a field
of netns_ct is the easiest way to manager it.

This patch also adds init_net to struct nf_conntrack_l3proto to initial
the layer 3 protocol pernet data.
Acked-by: NEric W. Biederman <ebiederm@xmission.com>
Signed-off-by: NGao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

524a53e5

netfilter: nf_conntrack: prepare namespace support for l4 protocol trackers · 2c352f44

由 Gao feng 提交于 5月 28, 2012

This patch prepares the namespace support for layer 4 protocol trackers.
Basically, this modifies the following interfaces:

* nf_ct_[un]register_sysctl
* nf_conntrack_l4proto_[un]register

to include the namespace parameter. We still use init_net in this patch
to prepare the ground for follow-up patches for each layer 4 protocol
tracker.

We add a new net_id field to struct nf_conntrack_l4proto that is used
to store the pernet_operations id for each layer 4 protocol tracker.

Note that AF_INET6's protocols do not need to do sysctl compat. Thus,
we only register compat sysctl when l4proto.l3proto != AF_INET6.
Acked-by: NEric W. Biederman <ebiederm@xmission.com>
Signed-off-by: NGao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

2c352f44

inetpeer: fix a race in inetpeer_gc_worker() · 55432d2b

由 Eric Dumazet 提交于 6月 05, 2012

commit 5faa5df1 (inetpeer: Invalidate the inetpeer tree along with
the routing cache) added a race :

Before freeing an inetpeer, we must respect a RCU grace period, and make
sure no user will attempt to increase refcnt.

inetpeer_invalidate_tree() waits for a RCU grace period before inserting
inetpeer tree into gc_list and waking the worker. At that time, no
concurrent lookup can find a inetpeer in this tree.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Acked-by: NSteffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

55432d2b

04 6月, 2012 4 次提交

net: Remove casts to same type · e3192690

由 Joe Perches 提交于 6月 03, 2012

Adding casts of objects to the same type is unnecessary
and confusing for a human reader.

For example, this cast:

	int y;
	int *p = (int *)&y;

I used the coccinelle script below to find and remove these
unnecessary casts.  I manually removed the conversions this
script produces of casts with __force and __user.

@@
type T;
T *p;
@@

-	(T *)p
+	p
Signed-off-by: NJoe Perches <joe@perches.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e3192690

net: use consume_skb() in place of kfree_skb() · 5d0ba55b

由 Eric Dumazet 提交于 6月 04, 2012

Remove some dropwatch/drop_monitor false positives.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5d0ba55b

tcp: tcp_make_synack() consumes dst parameter · 4aea39c1

由 Eric Dumazet 提交于 6月 03, 2012

tcp_make_synack() clones the dst, and callers release it.

We can avoid two atomic operations per SYNACK if tcp_make_synack()
consumes dst instead of cloning it.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4aea39c1

tcp: tcp_make_synack() can use alloc_skb() · 90ba9b19

由 Eric Dumazet 提交于 6月 03, 2012

There is no value using sock_wmalloc() in tcp_make_synack().

A listener socket only sends SYNACK packets, they are not queued in a
socket queue, only in Qdisc and device layers, so the number of in
flight packets is limited in these layers. We used sock_wmalloc() with
the %force parameter set to 1 to ignore socket limits anyway.

This patch removes two atomic operations per SYNACK packet.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

90ba9b19

02 6月, 2012 2 次提交

tcp: reflect SYN queue_mapping into SYNACK packets · fff32699

由 Eric Dumazet 提交于 6月 01, 2012

While testing how linux behaves on SYNFLOOD attack on multiqueue device
(ixgbe), I found that SYNACK messages were dropped at Qdisc level
because we send them all on a single queue.

Obvious choice is to reflect incoming SYN packet @queue_mapping to
SYNACK packet.

Under stress, my machine could only send 25.000 SYNACK per second (for
200.000 incoming SYN per second). NIC : ixgbe with 16 rx/tx queues.

After patch, not a single SYNACK is dropped.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fff32699

tcp: do not create inetpeer on SYNACK message · 7433819a

由 Eric Dumazet 提交于 5月 31, 2012

Another problem on SYNFLOOD/DDOS attack is the inetpeer cache getting
larger and larger, using lots of memory and cpu time.

tcp_v4_send_synack()
->inet_csk_route_req()
 ->ip_route_output_flow()
  ->rt_set_nexthop()
   ->rt_init_metrics()
    ->inet_getpeer( create = true)

This is a side effect of commit a4daad6b (net: Pre-COW metrics for
TCP) added in 2.6.39

Possible solution :

Instruct inet_csk_route_req() to remove FLOWI_FLAG_PRECOW_METRICS

Before patch :

# grep peer /proc/slabinfo
inet_peer_cache   4175430 4175430    192   42    2 : tunables    0    0    0 : slabdata  99415  99415      0

Samples: 41K of event 'cycles', Event count (approx.): 30716565122
+  20,24%      ksoftirqd/0  [kernel.kallsyms]           [k] inet_getpeer
+   8,19%      ksoftirqd/0  [kernel.kallsyms]           [k] peer_avl_rebalance.isra.1
+   4,81%      ksoftirqd/0  [kernel.kallsyms]           [k] sha_transform
+   3,64%      ksoftirqd/0  [kernel.kallsyms]           [k] fib_table_lookup
+   2,36%      ksoftirqd/0  [ixgbe]                     [k] ixgbe_poll
+   2,16%      ksoftirqd/0  [kernel.kallsyms]           [k] __ip_route_output_key
+   2,11%      ksoftirqd/0  [kernel.kallsyms]           [k] kernel_map_pages
+   2,11%      ksoftirqd/0  [kernel.kallsyms]           [k] ip_route_input_common
+   2,01%      ksoftirqd/0  [kernel.kallsyms]           [k] __inet_lookup_established
+   1,83%      ksoftirqd/0  [kernel.kallsyms]           [k] md5_transform
+   1,75%      ksoftirqd/0  [kernel.kallsyms]           [k] check_leaf.isra.9
+   1,49%      ksoftirqd/0  [kernel.kallsyms]           [k] ipt_do_table
+   1,46%      ksoftirqd/0  [kernel.kallsyms]           [k] hrtimer_interrupt
+   1,45%      ksoftirqd/0  [kernel.kallsyms]           [k] kmem_cache_alloc
+   1,29%      ksoftirqd/0  [kernel.kallsyms]           [k] inet_csk_search_req
+   1,29%      ksoftirqd/0  [kernel.kallsyms]           [k] __netif_receive_skb
+   1,16%      ksoftirqd/0  [kernel.kallsyms]           [k] copy_user_generic_string
+   1,15%      ksoftirqd/0  [kernel.kallsyms]           [k] kmem_cache_free
+   1,02%      ksoftirqd/0  [kernel.kallsyms]           [k] tcp_make_synack
+   0,93%      ksoftirqd/0  [kernel.kallsyms]           [k] _raw_spin_lock_bh
+   0,87%      ksoftirqd/0  [kernel.kallsyms]           [k] __call_rcu
+   0,84%      ksoftirqd/0  [kernel.kallsyms]           [k] rt_garbage_collect
+   0,84%      ksoftirqd/0  [kernel.kallsyms]           [k] fib_rules_lookup
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7433819a

30 5月, 2012 1 次提交

memcg: decrement static keys at real destroy time · 3f134619

由 Glauber Costa 提交于 5月 29, 2012

We call the destroy function when a cgroup starts to be removed, such as
by a rmdir event.

However, because of our reference counters, some objects are still
inflight.  Right now, we are decrementing the static_keys at destroy()
time, meaning that if we get rid of the last static_key reference, some
objects will still have charges, but the code to properly uncharge them
won't be run.

This becomes a problem specially if it is ever enabled again, because now
new charges will be added to the staled charges making keeping it pretty
much impossible.

We just need to be careful with the static branch activation: since there
is no particular preferred order of their activation, we need to make sure
that we only start using it after all call sites are active.  This is
achieved by having a per-memcg flag that is only updated after
static_key_slow_inc() returns.  At this time, we are sure all sites are
active.

This is made per-memcg, not global, for a reason: it also has the effect
of making socket accounting more consistent.  The first memcg to be
limited will trigger static_key() activation, therefore, accounting.  But
all the others will then be accounted no matter what.  After this patch,
only limited memcgs will have its sockets accounted.

[akpm@linux-foundation.org: move enum sock_flag_bits into sock.h,
                            document enum sock_flag_bits,
                            convert memcg_proto_active() and memcg_proto_activated() to test_bit(),
                            redo tcp_update_limit() comment to 80 cols]
Signed-off-by: NGlauber Costa <glommer@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: NDavid Miller <davem@davemloft.net>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3f134619

27 5月, 2012 1 次提交

xfrm: take net hdr len into account for esp payload size calculation · 91657eaf

由 Benjamin Poirier 提交于 5月 24, 2012

Corrects the function that determines the esp payload size. The calculations
done in esp{4,6}_get_mtu() lead to overlength frames in transport mode for
certain mtu values and suboptimal frames for others.

According to what is done, mainly in esp{,6}_output() and tcp_mtu_to_mss(),
net_header_len must be taken into account before doing the alignment
calculation.
Signed-off-by: NBenjamin Poirier <bpoirier@suse.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

91657eaf

24 5月, 2012 1 次提交

tcp: take care of overlaps in tcp_try_coalesce() · 1ca7ee30

由 Eric Dumazet 提交于 5月 23, 2012

Sergio Correia reported following warning :

WARNING: at net/ipv4/tcp.c:1301 tcp_cleanup_rbuf+0x4f/0x110()

WARN(skb && !before(tp->copied_seq, TCP_SKB_CB(skb)->end_seq),
     "cleanup rbuf bug: copied %X seq %X rcvnxt %X\n",
     tp->copied_seq, TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt);

It appears TCP coalescing, and more specifically commit b081f85c
(net: implement tcp coalescing in tcp_queue_rcv()) should take care of
possible segment overlaps in receive queue. This was properly done in
the case of out_or_order_queue by the caller.

For example, segment at tail of queue have sequence 1000-2000, and we
add a segment with sequence 1500-2500.
This can happen in case of retransmits.

In this case, just don't do the coalescing.
Reported-by: NSergio Correia <lists@uece.net>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Tested-by: NSergio Correia <lists@uece.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1ca7ee30

OpenHarmony / kernel_linux 上一次同步 3 年多

OpenHarmony / kernel_linux
上一次同步 3 年多