Commits · a1ab77f97ed03f5dae66ae4c64375beffab83772 · openanolis / cloud-kernel

09 Nov, 2009 5 commits

ipv4: udp: Optimise multicast reception · 1240d137

Committed by Eric Dumazet on Nov 08, 2009

The UDP multicast rx path is a bit complex and can hold a spinlock
for a long time.

Using a small (32 or 64 entries) stack of socket pointers can help
to perform expensive operations (skb_clone(), udp_queue_rcv_skb())
outside of the lock, in most cases.

It's also a base for a future RCU conversion of multicast reception.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Lucian Adrian Grijincu <lgrijincu@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

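The batching idea in the multicast commit above can be sketched in plain userspace C. This is only an illustration, not the kernel code: `fake_sock`, `lock_held`, `work_under_lock`, `deliver()` and `mcast_deliver()` are hypothetical names, the "lock" is a flag rather than a spinlock, and the stack is kept at 4 entries so the overflow case is easy to see. It assumes that an overflowing stack is flushed while the lock is still held, which would explain the "in most cases" wording.

```c
#include <assert.h>
#include <stddef.h>

#define STACK_MAX 4 /* the commit mentions 32 or 64; kept tiny here */

struct fake_sock {
    int delivered;
    struct fake_sock *next;
};

static int lock_held;       /* stands in for the hash chain spinlock */
static int work_under_lock; /* counts deliveries made while "locked" */

/* Stands in for the expensive per-socket work (clone + queue). */
static void deliver(struct fake_sock *sk)
{
    if (lock_held)
        work_under_lock++;
    sk->delivered++;
}

static void flush_stack(struct fake_sock **stack, size_t count)
{
    for (size_t i = 0; i < count; i++)
        deliver(stack[i]);
}

/* Collect matching sockets under the lock, deliver after releasing it. */
static size_t mcast_deliver(struct fake_sock *chain)
{
    struct fake_sock *stack[STACK_MAX];
    size_t count = 0, total = 0;

    lock_held = 1;
    for (struct fake_sock *sk = chain; sk; sk = sk->next) {
        if (count == STACK_MAX) {
            /* Stack full: flush early, still under the lock. */
            flush_stack(stack, count);
            total += count;
            count = 0;
        }
        stack[count++] = sk;
    }
    lock_held = 0;

    /* Common case: all expensive work happens here, unlocked. */
    flush_stack(stack, count);
    total += count;
    return total;
}
```

With a chain shorter than the stack, no delivery happens under the lock; only the overflow case pays that cost.
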
ipv4: udp: optimize unicast RX path · 5051ebd2

Committed by Eric Dumazet on Nov 08, 2009

We first locate the (local port) hash chain head.
If few sockets are in this chain, we proceed with the previous lookup algo.

If too many sockets are listed, we take a look at the secondary
(port, address) hash chain we added in the previous patch.

We choose the shortest chain and proceed with an RCU lookup on the elected chain.

But, if we chose the (port, address) chain and fail to find a socket on the given
address, we must try another lookup on the (port, INADDR_ANY) chain to find a
socket not bound to a particular IP.

-> No extra cost for typical setups, where the first lookup will probably
be performed.

RCU lookups everywhere, we don't acquire a spinlock.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

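The chain election described in the unicast commit above might look roughly like the following userspace sketch. It is not the kernel implementation: `sock_entry`, `chain`, `SHORT_CHAIN` and `udp_lookup_sketch()` are hypothetical, chains are plain arrays, hashing and RCU are left out, and the threshold of 10 for "few sockets" is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ADDR_ANY 0u
#define SHORT_CHAIN 10u /* hypothetical threshold for "few sockets" */

struct sock_entry {
    uint32_t addr; /* bound local address, ADDR_ANY if unbound */
    uint16_t port; /* bound local port */
};

struct chain {
    const struct sock_entry *socks;
    unsigned int count; /* per-chain counter, as added to udp_hslot */
};

/* Old-style scan of the (port) chain: exact address wins, else wildcard. */
static const struct sock_entry *scan_port_chain(const struct chain *c,
                                                uint32_t daddr, uint16_t dport)
{
    const struct sock_entry *wildcard = NULL;

    for (unsigned int i = 0; i < c->count; i++) {
        const struct sock_entry *s = &c->socks[i];

        if (s->port != dport)
            continue;
        if (s->addr == daddr)
            return s;
        if (s->addr == ADDR_ANY && !wildcard)
            wildcard = s;
    }
    return wildcard;
}

/* Scan of a (port, address) chain: only exact matches count. */
static const struct sock_entry *scan_exact(const struct chain *c,
                                           uint32_t addr, uint16_t port)
{
    for (unsigned int i = 0; i < c->count; i++)
        if (c->socks[i].addr == addr && c->socks[i].port == port)
            return &c->socks[i];
    return NULL;
}

static const struct sock_entry *udp_lookup_sketch(const struct chain *by_port,
                                                  const struct chain *by_addr,
                                                  const struct chain *by_any,
                                                  uint32_t daddr, uint16_t dport)
{
    const struct sock_entry *sk;

    /* Short primary chain, or primary not longer: keep the old lookup. */
    if (by_port->count <= SHORT_CHAIN || by_port->count <= by_addr->count)
        return scan_port_chain(by_port, daddr, dport);

    /* Otherwise try (port, address), then fall back to (port, INADDR_ANY). */
    sk = scan_exact(by_addr, daddr, dport);
    if (!sk)
        sk = scan_exact(by_any, ADDR_ANY, dport);
    return sk;
}
```

The fallback is what keeps sockets bound to INADDR_ANY reachable when the (port, address) chain is elected.
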
udp: secondary hash on (local port, local address) · 512615b6

Committed by Eric Dumazet on Nov 08, 2009

Extends udp_table to contain a secondary hash table.

The socket anchor for this second hash is free, because UDP
doesn't use skc_bind_node: we define a union to hold
both skc_bind_node & a new hlist_nulls_node udp_portaddr_node.

udp_lib_get_port() inserts sockets into the second hash chain
(additional cost of one atomic op)

udp_lib_unhash() deletes sockets from the second hash chain
(additional cost of one atomic op)

Note: No spinlock lockdep annotation is needed, because the
lock for the secondary hash chain is always taken after the
lock for the primary hash chain.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

udp: split sk_hash into two u16 hashes · d4cada4a

Committed by Eric Dumazet on Nov 08, 2009

Union sk_hash with two u16 hashes for udp (no extra memory taken)

One 16-bit hash on the (local port) value (the previous udp 'hash')

One 16-bit hash on the (local address, local port) values, initialized
but not yet used. This second hash uses the Jenkins hash for better
distribution.

Because the 'port' is xored later, a partial hash is performed
on local address + net_hash_mix(net)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

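As a rough userspace illustration of the layout described above, the sketch below overlays two 16-bit hashes on one 32-bit field and shows how a partial hash over the address can be combined with the port by xor afterwards. The names `sock_hash_sketch`, `mix_addr()` and `portaddr_hash()` are hypothetical, and `mix_addr()` is a placeholder mixer, not the Jenkins hash the commit refers to.

```c
#include <assert.h>
#include <stdint.h>

struct sock_hash_sketch {
    union {
        uint32_t hash;         /* single 32-bit hash for other users */
        uint16_t u16hashes[2]; /* [0]: port hash, [1]: (address, port) hash */
    } u;
};

/* The union costs no extra memory compared with the single u32. */
_Static_assert(sizeof(struct sock_hash_sketch) == sizeof(uint32_t),
               "union must not grow the struct");

/* Placeholder mixer over address and a per-namespace value (assumption). */
static uint32_t mix_addr(uint32_t addr, uint32_t net_mix)
{
    uint32_t h = addr ^ net_mix;

    h ^= h >> 16;
    h *= 0x45d9f3bu;
    h ^= h >> 16;
    return h;
}

/* The port is xored in later, so the partial hash can be computed once. */
static uint16_t portaddr_hash(uint32_t partial, uint16_t port)
{
    return (uint16_t)(partial ^ port);
}
```

Computing the address part once and xoring the port in later means the same partial value can serve several ports on one address.
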
udp: add a counter into udp_hslot · fdcc8aa9

Committed by Eric Dumazet on Nov 08, 2009

Adds a counter in udp_hslot to keep an accurate count
of sockets present in the chain.

This will permit the upcoming UDP lookup algo to choose
the shortest chain when the secondary hash is added.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

08 Nov, 2009 1 commit

net: Support specifying the network namespace upon device creation. · 81adee47

Committed by Eric W. Biederman on Nov 08, 2009

There is no good reason to not support userspace specifying the
network namespace during device creation, and it makes it easier
to create a network device and pass it to a child network namespace
with a well known name.

We have to be careful to ensure that the target network namespace
for the new device exists through the life of the call.  To keep
that logic clear I have factored out the network namespace grabbing
logic into rtnl_link_get_net.

In addition we need to continue to pass the source network namespace
to the rtnl_link_ops.newlink method so that we can find the base
device source network namespace.

Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

06 Nov, 2009 5 commits

netfilter: nf_nat: fix NAT issue in 2.6.30.4+ · f9dd09c7

Committed by Jozsef Kadlecsik on Nov 06, 2009

Vitezslav Samel discovered that since 2.6.30.4+ active FTP cannot work
over NAT. The "cause" of the problem was a fix of unacknowledged data
detection with NAT (commit a3a9f79e).
However, that fix actually uncovered a long-standing bug in TCP conntrack:
when NAT was enabled, we simply updated the max of the right edge of
the segments we have seen (td_end) by the offset NAT produced when
changing IP/port in the data. However, we did not update the other parameter
(td_maxend) which is affected by the NAT offset. Thus that could drift
away from the correct value, which resulted in breaking active FTP.

The patch below fixes the issue by *not* updating the conntrack parameters
from NAT, but instead taking into account the NAT offsets in conntrack in a
consistent way. (Updating from NAT would be harder and more expensive because
it'd need to re-calculate parameters we already calculated in conntrack.)

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

ip_frag: dont touch device refcount · 69df9d59

Committed by Eric Dumazet on Nov 05, 2009

When sending fragmentation expiration ICMP V4/V6 messages,
we can avoid touching the device refcount, thanks to RCU.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: check kern before calling security subsystem · c84b3268

Committed by Eric Paris on Nov 05, 2009

Before calling capable(CAP_NET_RAW), check if this operation is on behalf
of the kernel or on behalf of userspace.  Do not do the security check if
it is on behalf of the kernel.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: pass kern to net_proto_family create function · 3f378b68

Committed by Eric Paris on Nov 05, 2009

The generic __sock_create function has a kern argument which allows the
security system to make decisions based on whether a socket is being created
by the kernel or by userspace. This patch passes that flag to the
net_proto_family specific create function, so it can do the same thing.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: drop capability from protocol definitions · 13f18aa0

Committed by Eric Paris on Nov 05, 2009

struct can_proto had a capability field which wasn't ever used.  It is
dropped entirely.

struct inet_protosw had a capability field which can be more clearly
expressed in the code by just checking if sock->type == SOCK_RAW.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

05 Nov, 2009 3 commits

tcp: Use defaults when no route options are available · 6a2a2d6b

Committed by Gilad Ben-Yossef on Nov 04, 2009

Trying to parse the options of a SYN packet that we have
no route entry for should just use global defaults
for route entry options.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Tested-by: Valdis.Kletnieks@vt.edu
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: Do not call IPv4 specific func in tcp_check_req · 05eaade2

Committed by Gilad Ben-Yossef on Nov 04, 2009

Calling the IPv4 specific inet_csk_route_req in tcp_check_req
is a bad idea and crashes the machine on IPv6 connections, as reported
by Valdis Kletnieks.

Also, all we are really interested in is the timestamp
option in the header, so calling tcp_parse_options()
with the "estab" flag set to false is overkill, as
it tries to parse half a dozen other TCP options.

We know whether timestamp should be enabled or not
using data from request_sock.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Tested-by: Valdis.Kletnieks@vt.edu
Signed-off-by: David S. Miller <davem@davemloft.net>

net: net/ipv4/devinet.c cleanups · 9f9354b9

Committed by Eric Dumazet on Nov 04, 2009

As pointed out by Stephen Rothwell, commit c6d14c84 added a warning:

net/ipv4/devinet.c: In function 'inet_select_addr':
net/ipv4/devinet.c:902: warning: label 'out' defined but not used

Delete the unused 'out' label and do some cleanups as well.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

04 Nov, 2009 1 commit

net: Introduce for_each_netdev_rcu() iterator · c6d14c84

Committed by Eric Dumazet on Nov 04, 2009

Adds RCU management to the list of netdevices.

Convert some for_each_netdev() users to the RCU version, if
it can avoid read_lock-ing dev_base_lock.

I.e.:

	read_lock(&dev_base_lock);
	for_each_netdev(net, dev)
		some_action();
	read_unlock(&dev_base_lock);

becomes:

	rcu_read_lock();
	for_each_netdev_rcu(net, dev)
		some_action();
	rcu_read_unlock();

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

02 Nov, 2009 2 commits

icmp: icmp_send() can avoid a dev_put() · 685c7944

Committed by Eric Dumazet on Nov 01, 2009

We can avoid touching the device refcount in icmp_send()
by using dev_get_by_index_rcu().

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: inetdev_by_index() switch to RCU · c148fc2e

Committed by Eric Dumazet on Nov 01, 2009

Use dev_get_by_index_rcu() instead of __dev_get_by_index() and
the dev_base_lock rwlock.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

31 Oct, 2009 2 commits

gre: Fix dev_addr clobbering for gretap · 2e9526b3

Committed by Herbert Xu on Oct 30, 2009

Nathan Neulinger noticed that gretap devices get their MAC address
from the local IP address, which results in invalid MAC addresses
half of the time.

This is because gretap is still using the tunnel netdev ops rather
than the correct tap netdev ops struct.

This patch also fixes changelink to not clobber the MAC address
for the gretap case.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Tested-by: Nathan Neulinger <nneul@mst.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: fix sk_forward_alloc corruption · 9d410c79

Committed by Eric Dumazet on Oct 30, 2009

On UDP sockets, we must call skb_free_datagram() with the socket locked,
or risk sk_forward_alloc corruption. This requirement is not respected
in SUNRPC.

Add a convenient helper, skb_free_datagram_locked(), and use it in SUNRPC.

Reported-by: Francis Moreau <francis.moro@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

30 Oct, 2009 1 commit

net: Fix RPF to work with policy routing · b0c110ca

Committed by jamal on Oct 18, 2009

Reverse path filtering does not use the mark when looking up
policy routing. This fixes it.

Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>

29 Oct, 2009 10 commits

net,socket: introduce DECLARE_SOCKADDR helper to catch overflow at build time · 38bfd8f5

Committed by Cyrill Gorcunov on Oct 29, 2009

proto_ops->getname implies copying protocol specific data
into a storage unit (particularly into __kernel_sockaddr_storage).
So when we implement new protocol support we should keep such
a detail in mind (which is easy to forget about).

Let's introduce a DECLARE_SOCKADDR helper which checks at build
time that the storage unit is not overflowed.

Finally, inet_getname is switched to use DECLARE_SOCKADDR
(to show an example of usage).

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

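A build-time size check of the kind described above can be sketched in standard C11 as follows. This is not the kernel macro: `DECLARE_ADDR_SKETCH`, `storage_sketch` and `small_addr` are hypothetical names, and `_Static_assert` is used here as a stand-in for whatever build-time check the real helper relies on.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical generic storage that every protocol address must fit in. */
struct storage_sketch {
    unsigned short family;
    char data[126];
};

/* Hypothetical protocol-specific address, smaller than the storage. */
struct small_addr {
    unsigned short family;
    char data[14];
};

/*
 * Declare `dst` as a `type` pointer aliasing `src`, and fail the build
 * if the pointed-to type is larger than the generic storage.
 */
#define DECLARE_ADDR_SKETCH(type, dst, src)                          \
    _Static_assert(sizeof(*(type)0) <= sizeof(struct storage_sketch), \
                   "address type overflows the storage unit");        \
    type dst = (type)(src)
```

A usage might be `DECLARE_ADDR_SKETCH(struct small_addr *, sa, &storage);`. Declaring a pointer to a type larger than `storage_sketch` would then stop the compile instead of overflowing at run time.
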
net: Cleanup redundant tests on unsigned · 65a1c4ff

Committed by roel kluin on Oct 23, 2009

optlen is unsigned so the `< 0' test is never true.

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

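To see why such a test is dead code, consider this small standalone sketch (the helper names `signed_len_invalid()` and `unsigned_len_invalid()` are hypothetical). A negative value passed as an unsigned length wraps to a very large positive one, so only an upper-bound check can reject it.

```c
#include <assert.h>
#include <limits.h>

/* With a signed length, a negative value can be detected directly. */
static int signed_len_invalid(int optlen)
{
    return optlen < 0;
}

/*
 * With an unsigned length, -1 arrives as UINT_MAX. A `< 0` comparison
 * could never be true here, so an upper bound is checked instead.
 */
static int unsigned_len_invalid(unsigned int optlen, unsigned int max)
{
    return optlen > max;
}
```
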
Allow disabling of DSACK TCP option per route · dc343475

Committed by Gilad Ben-Yossef on Oct 28, 2009

Add and use the no DSACK bit in the features field.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Signed-off-by: Ori Finkelman <ori@comsleep.com>
Signed-off-by: Yony Amit <yony@comsleep.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Allow to turn off TCP window scale opt per route · 345cda2f

Committed by Gilad Ben-Yossef on Oct 28, 2009

Add and use the no window scale bit in the features field.

Note that this is not the same as setting a window scale of 0,
as would happen with a window limit on the route.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Signed-off-by: Ori Finkelman <ori@comsleep.com>
Signed-off-by: Yony Amit <yony@comsleep.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Allow disabling TCP timestamp options per route · cda42ebd

Committed by Gilad Ben-Yossef on Oct 28, 2009

Implement querying and acting upon the no timestamp bit in the features
field.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Signed-off-by: Ori Finkelman <ori@comsleep.com>
Signed-off-by: Yony Amit <yony@comsleep.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Add the no SACK route option feature · 1aba721e

Committed by Gilad Ben-Yossef on Oct 28, 2009

Implement querying and acting upon the no SACK bit in the features
field.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Signed-off-by: Ori Finkelman <ori@comsleep.com>
Signed-off-by: Yony Amit <yony@comsleep.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Allow tcp_parse_options to consult dst entry · 022c3f7d

Committed by Gilad Ben-Yossef on Oct 28, 2009

We need tcp_parse_options to be aware of dst_entry to
take into account per dst_entry TCP options settings.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Signed-off-by: Ori Finkelman <ori@comsleep.com>
Signed-off-by: Yony Amit <yony@comsleep.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Only parse time stamp TCP option in time wait sock · f55017a9

Committed by Gilad Ben-Yossef on Oct 28, 2009

Since we only use tcp_parse_options here to check for the existence
of the TCP timestamp option in the header, it is better to call it with
the "established" flag on.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Signed-off-by: Ori Finkelman <ori@comsleep.com>
Signed-off-by: Yony Amit <yony@comsleep.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipmr: Optimize multiple unregistration · d17fa6fa

Committed by Eric Dumazet on Oct 28, 2009

Speed up module unloading by factorizing synchronize_rcu() calls.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

AF_RAW: Augment raw_send_hdrinc to expand skb to fit iphdr->ihl (v2) · 55888dfb

Committed by Neil Horman on Oct 28, 2009

Augment raw_send_hdrinc to correct for incorrect ip header length values

A series of oopses was reported to me recently. Apparently when using AF_RAW
sockets to send data to peers that were reachable via ipsec encapsulation,
people could panic or BUG halt their systems.

I've tracked the problem down to user space sending an invalid ip header over an
AF_RAW socket with IP_HDRINCL set to 1.

Basically what happens is that userspace sends down an ip frame that includes
only the header (no data), but sets the ip header ihl value to a large number,
one that is larger than the total amount of data passed to the sendmsg call. In
raw_send_hdrinc, we allocate an skb based on the size of the data in the msghdr
that was passed in, but assume the data is all valid. Later during ipsec
encapsulation, xfrm4_transport_output moves the entire frame back in the skbuff
to provide headroom for the ipsec headers. During this operation, the
skb->transport_header is repointed to a spot computed by
skb->network_header + the ip header length (ihl). Since so little data was
passed in relative to the value of ihl provided by the raw socket, we point
transport header to an unknown location, resulting in various crashes.

The fix for this is pretty straightforward: simply validate the value of
iph->ihl when sending over a raw socket. If (iph->ihl*4U) > user data buffer
size, drop the frame and return -EINVAL. I just confirmed this fixes the
reported crashes.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

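The validation described above amounts to comparing the header length claimed by the packet with the amount of data actually supplied. A minimal userspace sketch, assuming a raw byte buffer rather than an skb (the function name `check_ihl()` is hypothetical):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * buf points at a user-supplied IPv4 header, len is the number of bytes
 * the caller actually passed. The low nibble of the first byte is the
 * header length (ihl), counted in 32-bit words.
 */
static int check_ihl(const uint8_t *buf, size_t len)
{
    unsigned int ihl;

    if (len < 1)
        return -EINVAL;

    ihl = buf[0] & 0x0fu;

    /* Header claims more bytes than were supplied: reject the frame. */
    if (ihl * 4U > len)
        return -EINVAL;

    return 0;
}
```

For example, a 20-byte buffer starting with 0x45 (ihl = 5, i.e. 20 bytes) passes, while the same buffer starting with 0x4f (ihl = 15, i.e. 60 bytes) is rejected.
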
28 Oct, 2009 3 commits

net: Corrected spelling error heurestics->heuristics · ea84e555

Committed by Andreas Petlund on Oct 27, 2009

Corrected a spelling error in a function name.

Signed-off-by: Andreas Petlund <apetlund@simula.no>
Signed-off-by: David S. Miller <davem@davemloft.net>

gre: Optimize multiple unregistration · eef6dd65

Committed by Eric Dumazet on Oct 27, 2009

Speed up module unloading by factorizing synchronize_rcu() calls.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipip: Optimize multiple unregistration · 0694c4c0

Committed by Eric Dumazet on Oct 27, 2009

Speed up module unloading by factorizing synchronize_rcu() calls.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

24 Oct, 2009 2 commits

gre: convert hash tables locking to RCU · 8d5b2c08

Committed by Eric Dumazet on Oct 23, 2009

GRE tunnels use one rwlock to protect their hash tables.

This locking scheme can be converted to RCU for free, since a netdevice
must already wait for an RCU grace period at dismantle time.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipip: convert hash tables locking to RCU · 8f95dd63

Committed by Eric Dumazet on Oct 23, 2009

IPIP tunnels use one rwlock to protect their hash tables.

This locking scheme can be converted to RCU for free, since a netdevice
must already wait for an RCU grace period at dismantle time.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

23 Oct, 2009 1 commit

net: use WARN() for the WARN_ON in commit · c62f4c45

Committed by Arjan van de Ven on Oct 22, 2009

Commit b6b39e8f (tcp: Try to catch MSG_PEEK bug) added a printk()
to the WARN_ON() that's in tcp.c. This patch changes this combination
to WARN(); the advantage of WARN() is that the printk message shows up
inside the message, so that kerneloops.org will collect the message.

In addition, this gets rid of an extra if() statement.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

21 Oct, 2009 1 commit

net: Fix for dst_negative_advice · ea94ff3b

Committed by Krishna Kumar on Oct 19, 2009

dst_negative_advice() should check for changed dst and reset
sk_tx_queue_mapping accordingly. Pass sock to the callers of
dst_negative_advice.

(sk_reset_txq is defined just for use by dst_negative_advice. The
only way I could find to get around this is to move dst_negative_()
from dst.h to dst.c, include sock.h in dst.c, etc)

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

20 Oct, 2009 3 commits

tcp: Try to catch MSG_PEEK bug · b6b39e8f

Committed by Herbert Xu on Oct 19, 2009

This patch tries to print out more information when we hit the
MSG_PEEK bug in tcp_recvmsg.  It's been around since at least
2005 and it's about time that we finally fix it.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

IP: Cleanups · 0eae750e

Committed by John Dykstra on Oct 19, 2009

Use symbols instead of magic constants while checking PMTU discovery
setsockopt.

Remove redundant test in ip_rt_frag_needed() (done by caller).

Signed-off-by: John Dykstra <john.dykstra1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Fix IP_MULTICAST_IF · 55b80503

Committed by Eric Dumazet on Oct 19, 2009

ipv4/ipv6 setsockopt(IP_MULTICAST_IF) have dubious __dev_get_by_index() calls.

This function should be called only with RTNL or dev_base_lock held, or a reader
could see a corrupt hash chain and eventually enter an endless loop.

The fix is to call dev_get_by_index()/dev_put().

If this happens to be performance critical, we could define a new dev_exist_by_index()
function to avoid touching the dev refcount.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

openanolis / cloud-kernel synced successfully about 1 year ago
