1. 26 Aug 2010, 1 commit
    • tcp: fix three tcp sysctls tuning · c5ed63d6
      Committed by Eric Dumazet
      As discovered by Anton Blanchard, current code to autotune 
      tcp_death_row.sysctl_max_tw_buckets, sysctl_tcp_max_orphans and
      sysctl_max_syn_backlog makes little sense.
      
      The bigger a page is, the smaller tcp_max_orphans becomes: 4096 on a 512GB
      machine in Anton's case.
      
      (tcp_hashinfo.bhash_size * sizeof(struct inet_bind_hashbucket))
      is much bigger if spinlock debugging is on. It's wrong to select bigger
      limits in this case (where kernel structures are also bigger).
      
      bhash_size max is 65536, and we get this value even for small machines. 
      
      A better basis is the size of the ehash table; this also makes the code
      shorter and more obvious.
      
      Based on a patch from Anton, and another from David.
      Reported-and-tested-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c5ed63d6
  2. 25 Aug 2010, 1 commit
  3. 24 Aug 2010, 1 commit
  4. 18 Aug 2010, 1 commit
  5. 08 Aug 2010, 1 commit
  6. 03 Aug 2010, 2 commits
  7. 02 Aug 2010, 4 commits
  8. 31 Jul 2010, 1 commit
  9. 23 Jul 2010, 4 commits
  10. 22 Jul 2010, 1 commit
  11. 20 Jul 2010, 1 commit
  12. 16 Jul 2010, 1 commit
  13. 15 Jul 2010, 1 commit
  14. 13 Jul 2010, 2 commits
  15. 09 Jul 2010, 1 commit
    • gre: propagate ipv6 transport class · dd4ba83d
      Committed by Stephen Hemminger
      This patch makes an IPv6-over-IPv4 GRE tunnel propagate the traffic
      class field from the underlying IPv6 header to the IPv4 Type Of Service
      field. Without the patch, all IPv6 packets in the tunnel look the same to QoS.

      This assumes that the IPv6 traffic class is exactly the same
      as the IPv4 TOS. Not sure if that is always the case?  Maybe some
      bits need to be masked off.

      The mask and shift to get tclass are copied from ipv6/datagram.c.
      Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dd4ba83d
  16. 08 Jul 2010, 1 commit
  17. 06 Jul 2010, 1 commit
  18. 05 Jul 2010, 3 commits
  19. 01 Jul 2010, 2 commits
    • fragment: add fast path for in-order fragments · d6bebca9
      Committed by Changli Gao
      add fast path for in-order fragments
      
      As fragments are sent in order by most OSes, such as Windows, Darwin and
      FreeBSD, it is likely that new fragments arrive at the end of the
      inet_frag_queue.  In the fast path, we check whether the skb at the end
      of the inet_frag_queue is the prev we expect.
      Signed-off-by: Changli Gao <xiaosuo@gmail.com>
      ----
       include/net/inet_frag.h |    1 +
       net/ipv4/ip_fragment.c  |   12 ++++++++++++
       net/ipv6/reassembly.c   |   11 +++++++++++
       3 files changed, 24 insertions(+)
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d6bebca9
    • snmp: 64bit ipstats_mib for all arches · 4ce3c183
      Committed by Eric Dumazet
      /proc/net/snmp and /proc/net/netstat expose SNMP counters.
      
      The width of these counters is either 32 or 64 bits, depending on the
      size of "unsigned long" in the kernel.

      This means a user program parsing these files must already be prepared
      to deal with 64-bit values, regardless of whether the program itself is
      32-bit or 64-bit.

      This patch introduces 64-bit snmp values for the IPSTAT mib, where some
      counters can wrap pretty fast if they are 32 bits wide.
      
      # netstat -s|egrep "InOctets|OutOctets"
          InOctets: 244068329096
          OutOctets: 244069348848
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4ce3c183
  20. 29 Jun 2010, 2 commits
  21. 28 Jun 2010, 2 commits
  22. 27 Jun 2010, 2 commits
  23. 26 Jun 2010, 2 commits
  24. 25 Jun 2010, 1 commit
    • tcp: do not send reset to already closed sockets · 565b7b2d
      Committed by Konstantin Khorenko
      I've found that tcp_close() can be called for an already closed
      socket, but it still sends a reset in this case (tcp_send_active_reset()),
      which seems to be incorrect.  Moreover, the reset packet is sent
      with a different source port, as the original port number has already
      been cleared on the socket.  Besides that, incrementing the stat counter
      for LINUX_MIB_TCPABORTONCLOSE also does not look correct in this case.

      Initially this issue was found on a 2.6.18-x RHEL5 kernel, but the same
      seems to be true for the current mainstream kernel (checked on
      2.6.35-rc3).  Please correct me if I missed something.
      
      How that happens:
      
      1) the server receives a packet for socket in TCP_CLOSE_WAIT state
         that triggers a tcp_reset():
      
      Call Trace:
       <IRQ>  [<ffffffff8025b9b9>] tcp_reset+0x12f/0x1e8
       [<ffffffff80046125>] tcp_rcv_state_process+0x1c0/0xa08
       [<ffffffff8003eb22>] tcp_v4_do_rcv+0x310/0x37a
       [<ffffffff80028bea>] tcp_v4_rcv+0x74d/0xb43
       [<ffffffff8024ef4c>] ip_local_deliver_finish+0x0/0x259
       [<ffffffff80037131>] ip_local_deliver+0x200/0x2f4
       [<ffffffff8003843c>] ip_rcv+0x64c/0x69f
       [<ffffffff80021d89>] netif_receive_skb+0x4c4/0x4fa
       [<ffffffff80032eca>] process_backlog+0x90/0xec
       [<ffffffff8000cc50>] net_rx_action+0xbb/0x1f1
       [<ffffffff80012d3a>] __do_softirq+0xf5/0x1ce
       [<ffffffff8001147a>] handle_IRQ_event+0x56/0xb0
       [<ffffffff8006334c>] call_softirq+0x1c/0x28
       [<ffffffff80070476>] do_softirq+0x2c/0x85
       [<ffffffff80070441>] do_IRQ+0x149/0x152
       [<ffffffff80062665>] ret_from_intr+0x0/0xa
       <EOI>  [<ffffffff80008a2e>] __handle_mm_fault+0x6cd/0x1303
       [<ffffffff80008903>] __handle_mm_fault+0x5a2/0x1303
       [<ffffffff80033a9d>] cache_free_debugcheck+0x21f/0x22e
       [<ffffffff8006a263>] do_page_fault+0x49a/0x7dc
       [<ffffffff80066487>] thread_return+0x89/0x174
       [<ffffffff800c5aee>] audit_syscall_exit+0x341/0x35c
       [<ffffffff80062e39>] error_exit+0x0/0x84
      
      tcp_rcv_state_process()
      ...  // (sk_state == TCP_CLOSE_WAIT here)
      ...
              /* step 2: check RST bit */
              if(th->rst) {
                      tcp_reset(sk);
                      goto discard;
              }
      ...
      ---------------------------------
      tcp_rcv_state_process
       tcp_reset
        tcp_done
         tcp_set_state(sk, TCP_CLOSE);
           inet_put_port
            __inet_put_port
             inet_sk(sk)->num = 0;
      
         sk->sk_shutdown = SHUTDOWN_MASK;
      
      2) After that the process (socket owner) tries to write something to
         that socket and "inet_autobind" sets a _new_ (which differs from
         the original!) port number for the socket:
      
       Call Trace:
        [<ffffffff80255a12>] inet_bind_hash+0x33/0x5f
        [<ffffffff80257180>] inet_csk_get_port+0x216/0x268
        [<ffffffff8026bcc9>] inet_autobind+0x22/0x8f
        [<ffffffff80049140>] inet_sendmsg+0x27/0x57
        [<ffffffff8003a9d9>] do_sock_write+0xae/0xea
        [<ffffffff80226ac7>] sock_writev+0xdc/0xf6
        [<ffffffff800680c7>] _spin_lock_irqsave+0x9/0xe
        [<ffffffff8001fb49>] __pollwait+0x0/0xdd
        [<ffffffff8008d533>] default_wake_function+0x0/0xe
        [<ffffffff800a4f10>] autoremove_wake_function+0x0/0x2e
        [<ffffffff800f0b49>] do_readv_writev+0x163/0x274
        [<ffffffff80066538>] thread_return+0x13a/0x174
        [<ffffffff800145d8>] tcp_poll+0x0/0x1c9
        [<ffffffff800c56d3>] audit_syscall_entry+0x180/0x1b3
        [<ffffffff800f0dd0>] sys_writev+0x49/0xe4
        [<ffffffff800622dd>] tracesys+0xd5/0xe0
      
      3) sendmsg fails at last with -EPIPE (=> 'write' returns -EPIPE in userspace):
      
      F: tcp_sendmsg1 -EPIPE: sk=ffff81000bda00d0, sport=49847, old_state=7, new_state=7, sk_err=0, sk_shutdown=3
      
      Call Trace:
       [<ffffffff80027557>] tcp_sendmsg+0xcb/0xe87
       [<ffffffff80033300>] release_sock+0x10/0xae
       [<ffffffff8016f20f>] vgacon_cursor+0x0/0x1a7
       [<ffffffff8026bd32>] inet_autobind+0x8b/0x8f
       [<ffffffff8003a9d9>] do_sock_write+0xae/0xea
       [<ffffffff80226ac7>] sock_writev+0xdc/0xf6
       [<ffffffff800680c7>] _spin_lock_irqsave+0x9/0xe
       [<ffffffff8001fb49>] __pollwait+0x0/0xdd
       [<ffffffff8008d533>] default_wake_function+0x0/0xe
       [<ffffffff800a4f10>] autoremove_wake_function+0x0/0x2e
       [<ffffffff800f0b49>] do_readv_writev+0x163/0x274
       [<ffffffff80066538>] thread_return+0x13a/0x174
       [<ffffffff800145d8>] tcp_poll+0x0/0x1c9
       [<ffffffff800c56d3>] audit_syscall_entry+0x180/0x1b3
       [<ffffffff800f0dd0>] sys_writev+0x49/0xe4
       [<ffffffff800622dd>] tracesys+0xd5/0xe0
      
      tcp_sendmsg()
      ...
              /* Wait for a connection to finish. */
              if ((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) {
                      int old_state = sk->sk_state;
                      if ((err = sk_stream_wait_connect(sk, &timeo)) != 0) {
      if (f_d && (err == -EPIPE)) {
              printk("F: tcp_sendmsg1 -EPIPE: sk=%p, sport=%u, old_state=%d, new_state=%d, "
                      "sk_err=%d, sk_shutdown=%d\n",
                      sk, ntohs(inet_sk(sk)->sport), old_state, sk->sk_state,
                      sk->sk_err, sk->sk_shutdown);
              dump_stack();
      }
                              goto out_err;
                      }
              }
      ...
      
      4) Then the process (socket owner) understands that it's time to close
         that socket and does that (and thus triggers sending reset packet):
      
      Call Trace:
      ...
       [<ffffffff80032077>] dev_queue_xmit+0x343/0x3d6
       [<ffffffff80034698>] ip_output+0x351/0x384
       [<ffffffff80251ae9>] dst_output+0x0/0xe
       [<ffffffff80036ec6>] ip_queue_xmit+0x567/0x5d2
       [<ffffffff80095700>] vprintk+0x21/0x33
       [<ffffffff800070f0>] check_poison_obj+0x2e/0x206
       [<ffffffff80013587>] poison_obj+0x36/0x45
       [<ffffffff8025dea6>] tcp_send_active_reset+0x15/0x14d
       [<ffffffff80023481>] dbg_redzone1+0x1c/0x25
       [<ffffffff8025dea6>] tcp_send_active_reset+0x15/0x14d
       [<ffffffff8000ca94>] cache_alloc_debugcheck_after+0x189/0x1c8
       [<ffffffff80023405>] tcp_transmit_skb+0x764/0x786
       [<ffffffff8025df8a>] tcp_send_active_reset+0xf9/0x14d
       [<ffffffff80258ff1>] tcp_close+0x39a/0x960
       [<ffffffff8026be12>] inet_release+0x69/0x80
       [<ffffffff80059b31>] sock_release+0x4f/0xcf
       [<ffffffff80059d4c>] sock_close+0x2c/0x30
       [<ffffffff800133c9>] __fput+0xac/0x197
       [<ffffffff800252bc>] filp_close+0x59/0x61
       [<ffffffff8001eff6>] sys_close+0x85/0xc7
       [<ffffffff800622dd>] tracesys+0xd5/0xe0
      
      So, in brief:
      
      * a received packet for a socket in the TCP_CLOSE_WAIT state triggers
        tcp_reset(), which clears inet_sk(sk)->num and puts the socket into
        the TCP_CLOSE state

      * an attempt to write to that socket forces inet_autobind() to get a
        new port (but the write itself fails with -EPIPE)

      * tcp_close() called for a socket in the TCP_CLOSE state sends an active
        reset via the socket with the newly allocated port
      
      This adds an additional check in tcp_close() for already closed
      sockets. We do not want to send anything to closed sockets.
      Signed-off-by: Konstantin Khorenko <khorenko@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      565b7b2d
  25. 24 Jun 2010, 1 commit