Commits · ee9952831cfd0bbe834f4a26489d7dce74582e37 · OpenHarmony / kernel_linux

22 Apr, 2012 3 commits

Committed by Pavel Emelyanov on Apr 19, 2012

This includes (according to the previous description):

* TCP_REPAIR sockoption

This one just puts the socket in/out of the repair mode.
Allowed for CAP_NET_ADMIN and for closed/established sockets only.
When repair mode is turned off and the socket happens to be in
the established state the window probe is sent to the peer to
'unlock' the connection.

* TCP_REPAIR_QUEUE sockoption

This one sets the queue which we're about to repair. The
'no-queue' is set by default.

* TCP_QUEUE_SEQ sockoption

Sets the write_seq/rcv_nxt of a selected repaired queue.
Allowed for TCP_CLOSE-d sockets only. When the socket changes
its state the other seq-s are changed by the kernel according
to the protocol rules (most of the existing code is actually
reused).

* Ability to forcibly bind a socket to a port

The sk->sk_reuse is set to SK_FORCE_REUSE.

* Immediate connect modification

The connect syscall initializes the connection, then directly jumps
to the code which finalizes it.

* Silent close modification

The close just aborts the connection (similar to SO_LINGER with 0
time) but without sending any FIN/RST-s to peer.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ee995283
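
A minimal userspace sketch of the three socket options described in the commit above. The numeric option values are the ones defined in the Linux UAPI header `linux/tcp.h`; they are spelled out because Python's `socket` module may not export them. The helper name `prepare_send_queue` is hypothetical, and the call only succeeds on Linux with CAP_NET_ADMIN, so failure is reported rather than raised.

```python
import socket
import struct

# Option numbers from the Linux UAPI header linux/tcp.h (assumed to be
# unavailable as named constants in Python's socket module).
TCP_REPAIR = 19
TCP_REPAIR_QUEUE = 20
TCP_QUEUE_SEQ = 21
TCP_SEND_QUEUE = 2  # queue id; 0 is the default "no queue", 1 is the receive queue


def prepare_send_queue(sock, write_seq):
    """Try to put a closed TCP socket into repair mode and seed write_seq.

    Returns True on success and False if the kernel refuses, for example
    because the caller lacks CAP_NET_ADMIN or the platform is not Linux.
    """
    try:
        # Enter repair mode (allowed for closed/established sockets only).
        sock.setsockopt(socket.IPPROTO_TCP, TCP_REPAIR, 1)
        # Select the queue that is about to be repaired.
        sock.setsockopt(socket.IPPROTO_TCP, TCP_REPAIR_QUEUE, TCP_SEND_QUEUE)
        # The sequence number is an unsigned 32-bit value, so pack it explicitly.
        sock.setsockopt(socket.IPPROTO_TCP, TCP_QUEUE_SEQ,
                        struct.pack("I", write_seq & 0xFFFFFFFF))
    except OSError:
        return False
    return True


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print("repair mode prepared:", prepare_send_queue(s, 1000))
```

Run unprivileged, the helper is expected to return False; a checkpoint/restore tool would run with the capability and continue with bind() and connect() on the repaired socket.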

tcp: Move code around · 370816ae

Committed by Pavel Emelyanov on Apr 19, 2012

This is just a preparation patch, which makes the code needed for
TCP repair ready for use.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

370816ae

sock: Introduce named constants for sk_reuse · 4a17fd52

Committed by Pavel Emelyanov on Apr 19, 2012

Name them in a "backward compatible" manner, i.e. reuse or not
are still 1 and 0 respectively. The reuse value of 2 means that
the socket with it will forcibly reuse everyone else's port.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

4a17fd52
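
The "backward compatible" numbering described above can be illustrated with a small model. Only SK_FORCE_REUSE is named in this log; the names SK_NO_REUSE and SK_CAN_REUSE are given as I recall them from the kernel header and should be checked against the source. The helper `treats_as_reuse` is hypothetical.

```python
from enum import IntEnum


class SkReuse(IntEnum):
    # Values 0 and 1 keep their old meaning; 2 is the new "forced" mode.
    SK_NO_REUSE = 0      # name assumed, value from the commit message
    SK_CAN_REUSE = 1     # name assumed, value from the commit message
    SK_FORCE_REUSE = 2   # named in the TCP repair commit


def treats_as_reuse(value):
    """Model of old code that tested sk_reuse as a plain boolean."""
    return bool(value)


for mode in SkReuse:
    print(mode.name, int(mode), treats_as_reuse(mode))
```

Because the new value is non-zero, code that only tests the field for truth keeps working without changes, which is the point of the numbering.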

21 Apr, 2012 7 commits

net: Delete all remaining instances of ctl_path · a5347fe3

Committed by Eric W. Biederman on Apr 19, 2012

We don't use struct ctl_path anymore so delete the exported constants.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a5347fe3

net: Convert all sysctl registrations to register_net_sysctl · ec8f23ce

Committed by Eric W. Biederman on Apr 19, 2012

This results in code with less boilerplate that is a bit easier
to read.

Additionally, this stops us from using compatibility code in the sysctl
core, hastening the day when the compatibility code can be removed.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ec8f23ce

net: Convert nf_conntrack_proto to use register_net_sysctl · f99e8f71

Committed by Eric W. Biederman on Apr 19, 2012

There isn't much advantage here except that string paths are a bit
easier to read, and converting everything to them allows me to kill off
ctl_path.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f99e8f71

net ipv4: Convert devinet to use register_net_sysctl · 8607ddb8

Committed by Eric W. Biederman on Apr 19, 2012

Using an ASCII path to register_net_sysctl as opposed to the slightly
awkward ctl_path allows for much simpler code.

We no longer need to malloc dev_name to keep it alive for the lifetime of
our sysctl registration; instead we can use a small temporary buffer on
the stack.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8607ddb8

net ipv4: Remove the unneeded registration of an empty net/ipv4/neigh · 4e5ca785

Committed by Eric W. Biederman on Apr 19, 2012

sysctl no longer requires explicit creation of directories.  The neigh
directory is always populated with at least a default entry so this
won't cause any user visible changes.

Delete the ipv4_path and the ipv4_skeleton; these are no longer needed.

Directly register the ipv4_route_table.

And since I am an idiot, remove the header definitions that I should
have removed in the previous patch.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4e5ca785

net: Move all of the network sysctls without a namespace into init_net. · 5dd3df10

Committed by Eric W. Biederman on Apr 19, 2012

This makes it clearer which sysctls are relative to your current network
namespace.

This makes it a little less error prone by not exposing sysctls for the
initial network namespace in other namespaces.

This is the same way we handle all of our other network interfaces to
userspace and I can't honestly remember why we didn't do this for
sysctls right from the start.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5dd3df10

net: Kill register_sysctl_rotable · 43444757

Committed by Eric W. Biederman on Apr 19, 2012

register_sysctl_rotable never caught on as an interesting way to
register sysctls.  My take on the situation is that what we want are
sysctls that we can only see in the initial network namespace.  What we
have implemented with register_sysctl_rotable are sysctls that we can
see in all of the network namespaces and can only change in the initial
network namespace.

That is a very silly way to go.  Just register the network sysctls
in the initial network namespace and we don't have any weird special
cases to deal with.

The sysctls affected are:
/proc/sys/net/ipv4/ipfrag_secret_interval
/proc/sys/net/ipv4/ipfrag_max_dist
/proc/sys/net/ipv6/ip6frag_secret_interval
/proc/sys/net/ipv6/mld_max_msf

I really don't expect anyone will miss them if they can't read them in a
child user namespace.

CC: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

43444757

20 Apr, 2012 1 commit

ipv4: dont drop packet in defrag but consume it · cbf8f7bb

Committed by Eric Dumazet on Apr 19, 2012

When defragmentation is finalized, we clone a packet and kfree_skb() it.

Call consume_skb() to not confuse dropwatch, since it's not a drop.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cbf8f7bb

19 Apr, 2012 1 commit

net: fix compile error of leaking kmemleak.h header · 7426a564

Committed by Shan Wei on Apr 18, 2012

net/core/sysctl_net_core.c: In function ‘sysctl_core_init’:
net/core/sysctl_net_core.c:259: error: implicit declaration of function ‘kmemleak_not_leak’

with same error in net/ipv4/route.c
Signed-off-by: Shan Wei <davidshan@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7426a564

18 Apr, 2012 1 commit

net/ipv4:Remove two memleak reports by kmemleak_not_leak. · 7f593881

Committed by majianpeng on Apr 16, 2012

Signed-off-by: majianpeng <majianpeng@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7f593881

16 Apr, 2012 2 commits

net: cleanup unsigned to unsigned int · 95c96174

Committed by Eric Dumazet on Apr 15, 2012

Use of "unsigned int" is preferred to bare "unsigned" in net tree.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

95c96174

ipv4: fix checkpatch errors · 5e73ea1a

Committed by Daniel Baluta on Apr 15, 2012

Fix checkpatch errors of the following type:
	* ERROR: "foo * bar" should be "foo *bar"
	* ERROR: "(foo*)" should be "(foo *)"
Signed-off-by: Daniel Baluta <dbaluta@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5e73ea1a

15 Apr, 2012 5 commits

tcp: Remove redundant code entering quickack mode · a8cb05b2

Committed by Vijay Subramanian on Apr 13, 2012

tcp_enter_quickack_mode() already calls tcp_incr_quickack() and sets
icsk->icsk_ack.ato to TCP_ATO_MIN. This patch removes the duplication.
Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Reviewed-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a8cb05b2

tcp: bind() use stronger condition for bind_conflict · aacd9289

Committed by Alex Copot on Apr 12, 2012

We must try harder to get unique (addr, port) pairs when
doing port autoselection for sockets with SO_REUSEADDR
option set.

We achieve this by adding a relaxation parameter to
inet_csk_bind_conflict. When 'relax' parameter is off
we return a conflict whenever the current searched
pair (addr, port) is not unique.

This tries to address the problems reported in patch:
	8d238b25
	Revert "tcp: bind() fix when many ports are bound"

Tests were run for creating and binding(0) many sockets
on 100 IPs. The results are, on average:

	* 60000 sockets, 600 ports / IP:
		* 0.210 s, 620 (IP, port) duplicates without patch
		* 0.219 s, no duplicates with patch
	* 100000 sockets, 1000 ports / IP:
		* 0.371 s, 1720 duplicates without patch
		* 0.373 s, no duplicates with patch
	* 200000 sockets, 2000 ports / IP:
		* 0.766 s, 6900 duplicates without patch
		* 0.768 s, no duplicates with patch
	* 500000 sockets, 5000 ports / IP:
		* 2.227 s, 41500 duplicates without patch
		* 2.284 s, no duplicates with patch
Signed-off-by: Alex Copot <alex.mihai.c@gmail.com>
Signed-off-by: Daniel Baluta <dbaluta@ixiacom.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

aacd9289
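
The 'relax' parameter described in the commit above can be sketched with a simplified model. This is not the kernel's inet_csk_bind_conflict: the `Bound` record, the exact-address matching and the two-pass `autoselect_port` loop are assumptions made for illustration, and wildcard addresses and bound devices are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bound:
    """An already bound socket in the model (hypothetical record type)."""
    addr: str
    port: int
    reuse: bool              # SO_REUSEADDR set
    listening: bool = False


def bind_conflict(bound, addr, port, reuse, relax):
    """Return True if binding (addr, port) conflicts with an existing socket."""
    for other in bound:
        if other.port != port or other.addr != addr:
            continue
        both_reuse = reuse and other.reuse and not other.listening
        if not both_reuse:
            return True      # classic conflict, regardless of relax
        if not relax:
            return True      # strict mode: the (addr, port) pair must be unique
    return False


def autoselect_port(bound, addr, reuse, ports):
    """Prefer a unique pair; fall back to a relaxed match only if none exists."""
    for relax in (False, True):
        for port in ports:
            if not bind_conflict(bound, addr, port, reuse, relax):
                return port
    return None


existing = [Bound("10.0.0.1", 5000, reuse=True)]
print(autoselect_port(existing, "10.0.0.1", True, [5000, 5001]))
```

With a free port available the strict pass picks it, so the duplicates counted in the benchmark above would not occur in this model; only when every candidate is taken does the relaxed pass share a port.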

inet: makes syn_ack_timeout mandatory · c72e1183

Committed by Eric Dumazet on Apr 12, 2012

There are two struct request_sock_ops providers, tcp and dccp.

inet_csk_reqsk_queue_prune() can avoid testing whether syn_ack_timeout
is NULL if we make it mandatory (non-NULL) for both providers.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Cc: dccp@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>

c72e1183

tcp: RFC6298 supersedes RFC2988bis · fd4f2cea

Committed by Eric Dumazet on Apr 12, 2012

Updates some comments to track RFC6298
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: H.K. Jerry Chu <hkchu@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fd4f2cea

tunnel: implement 64 bits statistics · 87b6d218

Committed by stephen hemminger on Apr 12, 2012

Convert the per-cpu statistics kept for GRE, IPIP, and SIT tunnels
to use 64 bit statistics.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

87b6d218

14 Apr, 2012 1 commit

udp: intoduce udp_encap_needed static_key · 447167bf

Committed by Eric Dumazet on Apr 11, 2012

Most machines don't use UDP encapsulation (L2TP).

Adds a static_key so that udp_queue_rcv_skb() doesn't have to perform a
test if L2TP never set up the encap_rcv on a socket.

The idea for this patch came after Simon Horman's proposal to add a hook
on TCP as well.

If static_key is not yet enabled, the fast path does a single JMP.

When static_key is enabled, the JMP destination is patched to reach the real
encap_type/encap_rcv logic, possibly adding cache misses.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Simon Horman <horms@verge.net.au>
Cc: dev@openvswitch.org
Signed-off-by: David S. Miller <davem@davemloft.net>

447167bf

11 Apr, 2012 3 commits

tcp: avoid order-1 allocations on wifi and tx path · a21d4572

Committed by Eric Dumazet on Apr 10, 2012

Marc Merlin reported many order-1 allocation failures in the TX path on his
wireless setup, which don't make any sense with an MTU=1500 network and
non-SG-capable hardware.

After investigation, it turns out TCP uses sk_stream_alloc_skb() and
used as a convention skb_tailroom(skb) to know how many bytes of data
payload could be put in this skb (for non SG capable devices)

Note : these skb used kmalloc-4096 (MTU=1500 + MAX_HEADER +
sizeof(struct skb_shared_info) being above 2048)

Later, the mac80211 layer needs to add some bytes at the tail of the skb
(IEEE80211_ENCRYPT_TAILROOM = 18 bytes) and, since no more tailroom is
available, has to call pskb_expand_head() and request order-1
allocations.

This patch changes sk_stream_alloc_skb() so that only
sk->sk_prot->max_header bytes of headroom are reserved, and use a new
skb field, avail_size to hold the data payload limit.

This way, order-0 allocations done by TCP stack can leave more than 2 KB
of tailroom and no more allocation is performed in mac80211 layer (or
any layer needing some tailroom)

avail_size is unioned with mark/dropcount, since mark will be set later
in IP stack for output packets. Therefore, skb size is unchanged.
Reported-by: Marc MERLIN <marc@merlins.org>
Tested-by: Marc MERLIN <marc@merlins.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a21d4572

tcp: fix tcp_rcv_rtt_update() use of an unscaled RTT sample · 18a223e0

Committed by Neal Cardwell on Apr 10, 2012

Fix a code path in tcp_rcv_rtt_update() that was comparing scaled and
unscaled RTT samples.

The intent in the code was to only use the 'm' measurement if it was a
new minimum.  However, since 'm' had not yet been shifted left 3 bits
but 'new_sample' had, this comparison would nearly always succeed,
leading us to erroneously set our receive-side RTT estimate to the 'm'
sample when that sample could be nearly 8x too high to use.

The overall effect is to often cause the receive-side RTT estimate to
be significantly too large (up to 40% too large for brief periods in
my tests).
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

18a223e0
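
The scaled versus unscaled comparison described above can be reproduced with plain integers. This is a model of the arithmetic only, not the kernel function; the names `update_buggy` and `update_fixed` are hypothetical, and the stored estimate is assumed to be kept shifted left by 3 bits as the message states.

```python
def update_buggy(stored, m):
    """stored is the estimate scaled by 8 (<< 3); m is an unscaled sample."""
    if m < stored:        # bug: unscaled m is compared with the scaled estimate
        stored = m << 3
    return stored


def update_fixed(stored, m):
    m <<= 3               # bring the sample to the same scale before comparing
    if m < stored:        # only a genuinely smaller sample replaces the estimate
        stored = m
    return stored


stored = 10 << 3          # current minimum RTT of 10 units, stored as 80
sample = 70               # new sample, seven times larger than the minimum

print(update_buggy(stored, sample) >> 3)   # estimate after the buggy path
print(update_fixed(stored, sample) >> 3)   # estimate after the fixed path
```

In the buggy path any sample below 8x the current estimate passes the "new minimum" test, which matches the "nearly 8x too high" figure in the message.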

tcp: restore correct limit · 5fb84b14

Committed by Eric Dumazet on Apr 10, 2012

Commit c43b874d (tcp: properly initialize tcp memory limits) tried
to fix a regression added in commits 4acb4190 & 3dc43e3e,
but still got it wrong.

The result is that machines with a low amount of memory have a too small
tcp_rmem[2] value and slow TCP receives: the per-socket limit is 1/1024 of
memory instead of 1/128 as in old kernels, so the receive window is capped
to small values.

Fix this to match the comment and previous behavior.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5fb84b14
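
To put the two fractions from the commit above side by side, here is a short worked calculation. The 256 MiB machine is a hypothetical example, and the model assumes the cap is simply total memory divided by the stated fraction; any other clamps the kernel applies to tcp_rmem[2] are left out.

```python
MIB = 1 << 20


def per_socket_cap(total_memory_bytes, fraction_shift):
    """Cap = total memory divided by 2**fraction_shift."""
    return total_memory_bytes >> fraction_shift


low_memory = 256 * MIB                       # hypothetical small machine
regressed = per_socket_cap(low_memory, 10)   # 1/1024 of memory
restored = per_socket_cap(low_memory, 7)     # 1/128 of memory

print(regressed // 1024, "KiB vs", restored // 1024, "KiB")
```

On such a machine the regressed cap is 256 KiB while the restored one is 2 MiB, an eightfold difference in the largest receive buffer a socket can grow to.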

10 Apr, 2012 2 commits

netfilter: nf_ct_ipv4: packets with wrong ihl are invalid · 07153c6e

Committed by Jozsef Kadlecsik on Apr 03, 2012

It was reported that the Linux kernel sometimes logs:

klogd: [2629147.402413] kernel BUG at net/netfilter/nf_conntrack_proto_tcp.c:447!
klogd: [1072212.887368] kernel BUG at net/netfilter/nf_conntrack_proto_tcp.c:392

ipv4_get_l4proto() in nf_conntrack_l3proto_ipv4.c and tcp_error() in
nf_conntrack_proto_tcp.c should catch malformed packets, so the errors
at the indicated lines - TCP options parsing - should not happen.
However, tcp_error() relies on the "dataoff" offset to the TCP header,
calculated by ipv4_get_l4proto().  But ipv4_get_l4proto() does not check
bogus ihl values in IPv4 packets, which then can slip through tcp_error()
and get caught at the TCP options parsing routines.

The patch fixes ipv4_get_l4proto() by invalidating packets with bogus
ihl value.

The patch closes netfilter bugzilla id 771.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

07153c6e
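
The kind of check the commit above adds can be sketched as follows. This is an illustrative userspace model, not the code in ipv4_get_l4proto(): the function name `ipv4_header_len` is hypothetical, and the version test is an extra sanity check added here.

```python
def ipv4_header_len(packet):
    """Return the IPv4 header length in bytes, or None for a bogus header."""
    if len(packet) < 20:              # shorter than the minimal IPv4 header
        return None
    version = packet[0] >> 4
    ihl = packet[0] & 0x0F            # header length in 32-bit words
    header_len = ihl * 4
    if version != 4 or header_len < 20 or header_len > len(packet):
        return None                   # treat the packet as invalid
    return header_len


good = bytes([0x45]) + bytes(19)      # version 4, ihl 5, 20 bytes in total
bogus = bytes([0x42]) + bytes(19)     # ihl 2 would place TCP inside the IP header

print(ipv4_header_len(good), ipv4_header_len(bogus))
```

Rejecting the packet at this point means a later stage never computes a transport header offset from an impossible ihl value.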

netfilter: nf_ct_ipv4: handle invalid IPv4 and IPv6 packets consistently · 8430eac2

Committed by Jozsef Kadlecsik on Apr 09, 2012

IPv6 conntrack marked invalid packets as INVALID and let the user
drop those by an explicit rule, while IPv4 conntrack dropped such
packets itself.

IPv4 conntrack is changed so that it marks such packets as INVALID and
lets the user drop them.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

8430eac2

06 Apr, 2012 2 commits

tcp: tcp_sendpages() should call tcp_push() once · 35f9c09f

Committed by Eric Dumazet on Apr 05, 2012

commit 2f533844 (tcp: allow splice() to build full TSO packets) added
a regression for splice() calls using SPLICE_F_MORE.

We need to call tcp_push() at the end of the last page processed in
tcp_sendpages(), or else transmits can be deferred and future sends
stall.

Add a new internal flag, MSG_SENDPAGE_NOTLAST, acting like MSG_MORE, but
with different semantics.

For all sendpage() providers, it's a transparent change. Only
sock_sendpage() and tcp_sendpages() can differentiate the two different
flags provided by pipe_to_sendpage().
Reported-by: Tom Herbert <therbert@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: H.K. Jerry Chu <hkchu@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

35f9c09f
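
A simplified model of how the two flags in the commit above might be combined per page. The flag values are the ones I recall from the kernel's socket header, and the functions `page_flags` and `count_pushes` are hypothetical stand-ins for the roles of pipe_to_sendpage() and tcp_sendpages(), not their actual code.

```python
MSG_MORE = 0x8000               # user-visible "more data coming" hint
MSG_SENDPAGE_NOTLAST = 0x20000  # internal: more pages follow in this call


def page_flags(n_pages, splice_f_more):
    """Flags handed to sendpage() for each page of one splice() call."""
    flags = []
    for index in range(n_pages):
        page = MSG_MORE if splice_f_more else 0
        if index < n_pages - 1:
            page |= MSG_SENDPAGE_NOTLAST   # every page except the last
        flags.append(page)
    return flags


def count_pushes(flags):
    """Push once per page that is not marked as 'not last'."""
    return sum(1 for page in flags if not page & MSG_SENDPAGE_NOTLAST)


for user_more in (False, True):
    per_page = page_flags(3, user_more)
    print([hex(page) for page in per_page], count_pushes(per_page))
```

Keeping the internal marker separate from MSG_MORE lets the last page trigger exactly one push even when the caller passed SPLICE_F_MORE, which is the stall this commit addresses.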

netdma: adding alignment check for NETDMA ops · a2bd1140

Committed by Dave Jiang on Apr 04, 2012

This is the fallout from adding a memcpy alignment workaround for certain
IOATDMA hardware. NetDMA will only use a DMA engine that can handle
byte-aligned ops.
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

a2bd1140

05 Apr, 2012 2 commits

net: replace continue with break to reduce unnecessary loop in xxx_xmarksources · ce713ee5

Committed by RongQing.Li on Apr 05, 2012

The conditional which decides to skip inactive filters does not
change with the loop index, so it is unnecessary to
check it many times.
Signed-off-by: RongQing.Li <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ce713ee5

net/route: export symbol ip_tos2prio · d4a96865

Committed by Amir Vadai on Apr 04, 2012

Need to export this to enable drivers to use rt_tos2priority()
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d4a96865

04 Apr, 2012 1 commit

tcp: allow splice() to build full TSO packets · 2f533844

Committed by Eric Dumazet on Apr 03, 2012

vmsplice()/splice(pipe, socket) calls do_tcp_sendpages() one page at a
time, adding at most 4096 bytes to an skb (assuming PAGE_SIZE=4096).

The call to tcp_push() at the end of do_tcp_sendpages() forces an
immediate xmit when the pipe is not already filled, and tso_fragment() tries
to split these skbs into MSS multiples.

4096 bytes are usually split in a skb with 2 MSS, and a remaining
sub-mss skb (assuming MTU=1500)

This makes slow start suboptimal because many small frames are sent to
qdisc/driver layers instead of big ones (constrained by cwnd and packets
in flight of course)

In fact, applications using sendmsg() (adding an additional memory copy)
instead of vmsplice()/splice()/sendfile() are a bit faster because of
this anomaly, especially if serving small files in environments with
large initial [c]wnd.

Call tcp_push() only if MSG_MORE is not set in the flags parameter.

This bit is automatically provided by splice() internals for every page
except the last one, or on all pages if the user specified the
SPLICE_F_MORE splice() flag.

In some workloads, this can reduce number of sent logical packets by an
order of magnitude, making zero-copy TCP actually faster than
one-copy :)
Reported-by: Tom Herbert <therbert@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: H.K. Jerry Chu <hkchu@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2f533844

02 Apr, 2012 2 commits

netfilter: ipv4: Stop using NLA_PUT*(). · d317e4f6

Committed by David S. Miller on Apr 01, 2012

These macros contain a hidden goto, and are thus extremely error
prone and make code hard to audit.
Signed-off-by: David S. Miller <davem@davemloft.net>

d317e4f6

ipv4: Stop using NLA_PUT*(). · f3756b79

Committed by David S. Miller on Apr 01, 2012

These macros contain a hidden goto, and are thus extremely error
prone and make code hard to audit.
Signed-off-by: David S. Miller <davem@davemloft.net>

f3756b79

29 Mar, 2012 1 commit

Remove all #inclusions of asm/system.h · 9ffc93f2

Committed by David Howells on Mar 28, 2012

Remove all #inclusions of asm/system.h preparatory to splitting and killing
it. Performed with the following command:

perl -p -i -e 's!^#\s*include\s*<asm/system[.]h>.*\n!!' `grep -Irl '^#\s*include\s*<asm/system[.]h>' *`
Signed-off-by: David Howells <dhowells@redhat.com>

9ffc93f2

28 Mar, 2012 1 commit

net/ipv4: fix IPv4 multicast over network namespaces · 4e7b2f14

Committed by Benjamin LaHaise on Mar 27, 2012

When using multicast over a local bridge feeding a number of LXC guests
using veth, the LXC guests are unable to get a response from other guests
when pinging 224.0.0.1. Multicast packets did not appear to be getting
delivered to the network namespaces of the guest hosts, and further
inspection showed that the incoming route was pointing to the loopback
device of the host, not the guest. This led to the wrong network namespace
being picked up by sockets (like ICMP). Fix this by using the correct
network namespace when creating the inbound route entry.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

4e7b2f14

23 Mar, 2012 2 commits

bonding: remove entries for master_ip and vlan_ip and query devices instead · eaddcd76

Committed by Andy Gospodarek on Mar 22, 2012

The following patch aimed to resolve an issue where secondary, tertiary,
etc. addresses added to bond interfaces could overwrite the
bond->master_ip and vlan_ip values.

        commit 917fbdb3
        Author: Henrik Saavedra Persson <henrik.e.persson@ericsson.com>
        Date:   Wed Nov 23 23:37:15 2011 +0000

            bonding: only use primary address for ARP

That patch was good because it prevented bonds using ARP monitoring from
sending frames with an invalid source IP address.  Unfortunately, it
didn't always work as expected.

When using an ioctl (like ifconfig does) to set the IP address and
netmask, 2 separate ioctls are actually called to set the IP and netmask
if the mask chosen doesn't match the standard mask for that class of
address.  The first ioctl did not have a mask that matched the one in
the primary address and would still cause the device address to be
overwritten.  The second ioctl that was called to set the mask would
then be detected as secondary and ignored, but the damage was already done.

This was not an issue when using an application that used netlink
sockets as the setting of IP and netmask came down at once.  The
inconsistent behavior between those two interfaces was something that
needed to be resolved.

While I was thinking about how I wanted to resolve this, Ralf Zeidler
came with a patch that resolved this on a RHEL kernel by keeping a full
shadow of the entries in dev->ifa_list for the bonding device and vlan
devices in the bonding driver.  I didn't like the duplication of the
list as I want to see the 'bonding' struct and code shrink rather than
grow, but liked the general idea.

As the Subject indicates this patch drops the master_ip and vlan_ip
elements from the 'bonding' and 'vlan_entry' structs, respectively.
This can be done because a device's address-list is now traversed to
determine the optimal source IP address for ARP requests and for checks
to see if the bonding device has a particular IP address.  This code
could all have been contained inside the bonding driver, but it made more
sense to me to EXPORT and call inet_confirm_addr since it did exactly
what was needed.

I tested this and a backported patch and everything works as expected.
Ralf also helped with verification of the backported patch.

Thanks to Ralf for all his help on this.

v2: Whitespace and organizational changes based on suggestions from Jay
Vosburgh and Dave Miller.

v3: Fixup incorrect usage of rcu_read_unlock based on Dave Miller's
suggestion.
Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
CC: Ralf Zeidler <ralf.zeidler@nsn.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

eaddcd76

netfilter: remove forward module param confusion. · 523f610e

Committed by Rusty Russell on Mar 22, 2012

It used to be an int, and it got changed to a bool parameter at least
7 years ago.  It happens that NF_ACCEPT and NF_DROP are 0 and 1, so
this works, but it's unclear, and the check that it's in range is not
required.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

523f610e

20 Mar, 2012 2 commits

tcp: reduce out_of_order memory use · c8628155

Committed by Eric Dumazet on Mar 18, 2012

With increasing receive window sizes, but speed of light not improved
that much, out of order queue can contain a huge number of skbs, waiting
to be moved to receive_queue when missing packets can fill the holes.

Some devices happen to use fat skbs (truesize of 4096 + sizeof(struct
sk_buff)) to store regular (MTU <= 1500) frames. This makes highly
probable sk_rmem_alloc hits sk_rcvbuf limit, which can be 4Mbytes in
many cases.

When limit is hit, tcp stack calls tcp_collapse_ofo_queue(), a true
latency killer and cpu cache blower.

Attempting to coalesce each time we add a frame to the ofo queue
keeps memory use tight and in many cases avoids the
tcp_collapse() thing later.

Tested on various wireless setups (b43, ath9k, ...) known to use big skb
truesize, this patch removed the "packets collapsed in receive queue due
to low socket buffer" I had before.

This also reduced average memory used by tcp sockets.

With help from Neal Cardwell.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: H.K. Jerry Chu <hkchu@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c8628155

tcp: introduce tcp_data_queue_ofo · e86b2919

Committed by Eric Dumazet on Mar 18, 2012

Split tcp_data_queue() in two parts for better readability.

tcp_data_queue_ofo() is responsible for queueing incoming skb into out
of order queue.

Change code layout so that the skb_set_owner_r() is performed only if
skb is not dropped.

This is a preliminary patch before "reduce out_of_order memory use"
following patch.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: H.K. Jerry Chu <hkchu@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e86b2919

17 Mar, 2012 1 commit

arp: allow arp processing to honor per interface arp_accept sysctl · 124d37e9

Committed by Neil Horman on Mar 15, 2012

I found recently that the arp_process function, which handles all of our received
arp frames, is using the IPV4_DEVCONF_ALL macro to check the state of the arp_accept
flag. This seems wrong, as it implies that either none or all of the network
interfaces accept gratuitous arps. This patch corrects that, allowing
per-interface arp_accept configuration to deviate from the all setting. Note
this also brings us into line with the way the arp_filter setting is handled
during arp_process execution.

Tested this myself on my home network, and confirmed it works as expected.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

124d37e9
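
The behavioural difference described in the commit above can be shown with two boolean settings. The OR combination of the "all" and per-interface values is an assumption about how the per-interface check is evaluated, and both function names are hypothetical.

```python
def arp_accept_before(conf_all, conf_dev):
    """Old behaviour: only the 'all' setting is consulted."""
    return bool(conf_all)


def arp_accept_after(conf_all, conf_dev):
    """New behaviour (assumed OR semantics): an interface can opt in."""
    return bool(conf_all or conf_dev)


# 'all' is off, but one interface has arp_accept enabled.
print(arp_accept_before(0, 1), arp_accept_after(0, 1))
```

With the old check the interface setting had no effect; with the new one a single interface can accept gratuitous ARP while the global default stays off.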

OpenHarmony / kernel_linux · last synced more than 3 years ago