提交 · 755d0e77ac9c8d125388922dc33434ed5b2ebe80 · openanolis / cloud-kernel

22 3月, 2010 1 次提交

net: rtnetlink: ignore NETDEV_PRE_TYPE_CHANGE in rtnetlink_event() · 755d0e77

由 Patrick McHardy 提交于 3月 19, 2010

Ignore the new NETDEV_PRE_TYPE_CHANGE event in rtnetlink_event() since
there have been no changes userspace needs to be notified of.

Also add a comment to the netdev notifier event definitions to remind
people to update the exclusion list when adding new event types.
Signed-off-by: NPatrick McHardy <kaber@trash.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

755d0e77

21 3月, 2010 13 次提交

ipv6: Fix bug in ipv6_chk_same_addr(). · 3e81c6da

由 David S. Miller 提交于 3月 20, 2010

hlist_for_each_entry(p...) will not necessarily initialize 'p'
to anything if the hlist is empty.  GCC notices this and emits
a warning.

Just return true explicitly when we hit a match, and return
false is we fall out of the loop without one.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3e81c6da

ipv6: Reduce timer events for addrconf_verify(). · b2db7564

由 YOSHIFUJI Hideaki 提交于 3月 20, 2010

This patch reduces timer events while keeping accuracy by rounding
our timer and/or batching several address validations in addrconf_verify().

addrconf_verify() is called at earliest timeout among interface addresses'
timeouts, but at maximum ADDR_CHECK_FREQUENCY (120 secs).

In most cases, all of timeouts of interface addresses are long enough
(e.g. several hours or days vs 2 minutes), this timer is usually called
every ADDR_CHECK_FREQUENCY, and it is okay to be lazy.
(Note this timer could be eliminated if all code paths which modifies
variables related to timeouts call us manually, but it is another story.)

However, in other least but important cases, we try keeping accuracy.

When the real interface address timeout is coming, and the timeout
is just before the rounded timeout, we accept some error.

When a timeout has been reached, we also try batching other several
events in very near future.
Signed-off-by: NYOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b2db7564

IPv6: addrconf cleanup addrconf_verify · 88949cf4

由 stephen hemminger 提交于 3月 17, 2010

The variable regen_advance is only used in the privacy case.
Move it to simplify code and eliminate ifdef's
Signed-off-by: NStephen Hemminger <shemminger@vyatta.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

88949cf4

addrconf: checkpatch fixes · e21e8467

由 Stephen Hemminger 提交于 3月 20, 2010

Fix some of the checkpatch complaints.
Signed-off-by: NStephen Hemminger <shemminger@vyatta.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e21e8467

IPv6: addrconf cleanups · bcdd553f

由 Stephen Hemminger 提交于 3月 20, 2010

Some minor stuff, reformat comments and add whitespace for clarity
Signed-off-by: NStephen Hemminger <shemminger@vyatta.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bcdd553f

ipv6: convert idev_list to list macros · 502a2ffd

由 stephen hemminger 提交于 3月 17, 2010

Convert to list macro's for the list of addresses per interface
in IPv6.
Signed-off-by: NStephen Hemminger <shemminger@vyatta.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

502a2ffd

ipv6: user better hash for addrconf · 3a88a81d

由 stephen hemminger 提交于 3月 17, 2010

The existing hash function has a couple of issues:
  * it is hardwired to 16 for IN6_ADDR_HSIZE
  * limited to 256 and callers using int
  * use jhash2 rather than some old BSD algorithm

No need for random seed since this is local only (based on assigned
addresses) table.
Signed-off-by: NStephen Hemminger <shemminger@vyatta.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3a88a81d

IPv6: convert addrconf hash list to RCU · 5c578aed

由 stephen hemminger 提交于 3月 17, 2010

Convert from reader/writer lock to RCU and spinlock for addrconf
hash list.

Adds an additional helper macro for hlist_for_each_entry_continue_rcu
to handle the continue case.
Signed-off-by: NStephen Hemminger <shemminger@vyatta.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5c578aed

ipv6: convert addrconf list to hlist · c2e21293

由 stephen hemminger 提交于 3月 17, 2010

Using hash list macros, simplifies code and helps later RCU.

This patch includes some initialization that is not strictly necessary,
since an empty hlist node/list is all zero; and list is in BSS
and node is allocated with kzalloc.
Signed-off-by: NStephen Hemminger <shemminger@vyatta.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c2e21293

ipv6: convert temporary address list to list macros · 372e6c8f

由 stephen hemminger 提交于 3月 17, 2010

Use list macros instead of open coded linked list.
Signed-off-by: NStephen Hemminger <shemminger@vyatta.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

372e6c8f

netfilter: ctnetlink: fix reliable event delivery if message building fails · 37b7ef72

由 Pablo Neira Ayuso 提交于 3月 16, 2010

This patch fixes a bug that allows to lose events when reliable
event delivery mode is used, ie. if NETLINK_BROADCAST_SEND_ERROR
and NETLINK_RECV_NO_ENOBUFS socket options are set.
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

37b7ef72

netlink: fix NETLINK_RECV_NO_ENOBUFS in netlink_set_err() · 1a50307b

由 Pablo Neira Ayuso 提交于 3月 18, 2010

Currently, ENOBUFS errors are reported to the socket via
netlink_set_err() even if NETLINK_RECV_NO_ENOBUFS is set. However,
that should not happen. This fixes this problem and it changes the
prototype of netlink_set_err() to return the number of sockets that
have set the NETLINK_RECV_NO_ENOBUFS socket option. This return
value is used in the next patch in these bugfix series.
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1a50307b

NET_DMA: free skbs periodically · 73852e81

由 Steven J. Magnani 提交于 3月 16, 2010

Under NET_DMA, data transfer can grind to a halt when userland issues a
large read on a socket with a high RCVLOWAT (i.e., 512 KB for both).
This appears to be because the NET_DMA design queues up lots of memcpy
operations, but doesn't issue or wait for them (and thus free the
associated skbs) until it is time for tcp_recvmesg() to return.
The socket hangs when its TCP window goes to zero before enough data is
available to satisfy the read.

Periodically issue asynchronous memcpy operations, and free skbs for ones
that have completed, to prevent sockets from going into zero-window mode.
Signed-off-by: NSteven J. Magnani <steve@digidescorp.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

73852e81

20 3月, 2010 5 次提交

tcp: Fix tcp_mark_head_lost() with packets == 0 · 6830c25b

由 Lennart Schulte 提交于 3月 17, 2010

A packet is marked as lost in case packets == 0, although nothing should be done.
This results in a too early retransmitted packet during recovery in some cases.
This small patch fixes this issue by returning immediately.
Signed-off-by: NLennart Schulte <lennart.schulte@nets.rwth-aachen.de>
Signed-off-by: NArnd Hannemann <hannemann@nets.rwth-aachen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6830c25b

net: ipmr/ip6mr: fix potential out-of-bounds vif_table access · a50436f2

由 Patrick McHardy 提交于 3月 17, 2010

mfc_parent of cache entries is used to index into the vif_table and is
initialised from mfcctl->mfcc_parent. This can take values of to 2^16-1,
while the vif_table has only MAXVIFS (32) entries. The same problem
affects ip6mr.

Refuse invalid values to fix a potential out-of-bounds access. Unlike
the other validity checks, this is checked in ipmr_mfc_add() instead of
the setsockopt handler since its unused in the delete path and might be
uninitialized.
Signed-off-by: NPatrick McHardy <kaber@trash.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a50436f2

TCP: check min TTL on received ICMP packets · 97e3ecd1

由 stephen hemminger 提交于 3月 18, 2010

This adds RFC5082 checks for TTL on received ICMP packets.
It adds some security against spoofed ICMP packets
disrupting GTSM protected sessions.
Signed-off-by: NStephen Hemminger <shemminger@vyatta.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

97e3ecd1

ipv6: Remove redundant dst NULL check in ip6_dst_check · 10414444

由 Herbert Xu 提交于 3月 18, 2010

As the only path leading to ip6_dst_check makes an indirect call
through dst->ops, dst cannot be NULL in ip6_dst_check.

This patch removes this check in case it misleads people who
come across this code.
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

10414444

ipv4: check rt_genid in dst_check · d11a4dc1

由 Timo Teräs 提交于 3月 18, 2010

Xfrm_dst keeps a reference to ipv4 rtable entries on each
cached bundle. The only way to renew xfrm_dst when the underlying
route has changed, is to implement dst_check for this. This is
what ipv6 side does too.

The problems started after 87c1e12b
("ipsec: Fix bogus bundle flowi") which fixed a bug causing xfrm_dst
to not get reused, until that all lookups always generated new
xfrm_dst with new route reference and path mtu worked. But after the
fix, the old routes started to get reused even after they were expired
causing pmtu to break (well it would occationally work if the rtable
gc had run recently and marked the route obsolete causing dst_check to
get called).
Signed-off-by: NTimo Teras <timo.teras@iki.fi>
Acked-by: NHerbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d11a4dc1

19 3月, 2010 6 次提交

net: Potential null skb->dev dereference · 0641e4fb

由 Eric Dumazet 提交于 3月 18, 2010

When doing "ifenslave -d bond0 eth0", there is chance to get NULL
dereference in netif_receive_skb(), because dev->master suddenly becomes
NULL after we tested it.

We should use ACCESS_ONCE() to avoid this (or rcu_dereference())
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0641e4fb

tcp: Fix OOB POLLIN avoidance. · b634f875

由 Alexandra Kossovsky 提交于 3月 18, 2010

From: Alexandra.Kossovsky@oktetlabs.ru

Fixes kernel bugzilla #15541
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b634f875

net: forbid underlaying devices to change its type · 1c01fe14

由 Jiri Pirko 提交于 3月 10, 2010

It's not desired for underlaying devices to change type. At the time,
there is for example possible to have bond with changed type from
Ethernet to Infiniband as a port of a bridge. This patch fixes this.
Signed-off-by: NJiri Pirko <jpirko@redhat.com>
Signed-off-by: NJay Vosburgh <fubar@us.ibm.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1c01fe14

bonding: check return value of nofitier when changing type · 3ca5b404

由 Jiri Pirko 提交于 3月 10, 2010

This patch adds the possibility to refuse the bonding type change for
other subsystems (such as for example bridge, vlan, etc.)
Signed-off-by: NJiri Pirko <jpirko@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3ca5b404

net: rename notifier defines for netdev type change · 93d9b7d7

由 Jiri Pirko 提交于 3月 10, 2010

Since generally there could be more netdevices changing type other
than bonding, making this event type name "bonding-unrelated"
Signed-off-by: NJiri Pirko <jpirko@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

93d9b7d7

rps: Fixed build with CONFIG_SMP not enabled. · 1e94d72f

由 Tom Herbert 提交于 3月 18, 2010

Signed-off-by: NTom Herbert <therbert@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1e94d72f

17 3月, 2010 15 次提交

net: convert multiple drivers to use netdev_for_each_mc_addr, part7 · ff6e2163

由 Jiri Pirko 提交于 3月 01, 2010

In mlx4, using char * to store mc address in private structure instead.
Signed-off-by: NJiri Pirko <jpirko@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ff6e2163

tipc: Allow retransmission of cloned buffers · ca509101

由 Neil Horman 提交于 3月 15, 2010

Forward port commit
fc477e160af086f6e30c3d4fdf5f5c000d29beb5
from git://tipc.cslab.ericsson.net/pub/git/people/allan/tipc.git

Origional commit message:

Allow retransmission of cloned buffers

This patch fixes an issue with TIPC's message retransmission logic
that prevented retransmission of clone sk_buffs.  Originally intended
as a means of avoiding wasted work in retransmitting messages that
were still on the driver's outbound queue, it also prevented TIPC
from retransmitting messages through other means -- such as the
secondary bearer of the broadcast link, or another interface in a
set of bonded interfaces.  This fix removes existing checks for
cloned sk_buffs that prevented such retransmission.
Origionally-Signed-off-by: NAllan Stephens <allan.stephens@windriver.com>
Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ca509101

tipc: Increase frequency of load distribution over broadcast link · 1a624832

由 Neil Horman 提交于 3月 15, 2010

Forward port commit 29eb572941501c40ac6e62dbc5043bf9ee76ee56
from git://tipc.cslab.ericsson.net/pub/git/people/allan/tipc.git

Origional commit message:
Increase frequency of load distribution over broadcast link

This patch enhances the behavior of TIPC's broadcast link so that it
alternates between redundant bearers (if available) after every
message sent, rather than after every 10 messages.  This change helps
to speed up delivery of retransmitted messages by ensuring that
they are not sent repeatedly over a bearer that is no longer working,
but not yet recognized as failed.

Tested by myself in the latest net-2.6 tree using the tipc sanity test suite
Origionally-signed-off-by: NAllan Stephens <allan.stephens@windriver.com>
Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>

bcast.c |   35 ++++++++++++++---------------------
1 file changed, 14 insertions(+), 21 deletions(-)
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1a624832

net: core: add IFLA_STATS64 support · 10708f37

由 Jan Engelhardt 提交于 3月 11, 2010

`ip -s link` shows interface counters truncated to 32 bit. This is
because interface statistics are transported only in 32-bit quantity
to userspace. This commit adds a new IFLA_STATS64 attribute that
exports them in full 64 bit.

References: http://lkml.indiana.edu/hypermail/linux/kernel/0307.3/0215.htmlSigned-off-by: NJan Engelhardt <jengelh@medozas.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

10708f37

J
net: tcp: make veno selectable as default congestion module · 6ce1a6df
由 Jan Engelhardt 提交于 3月 11, 2010
```
Signed-off-by: NJan Engelhardt <jengelh@medozas.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
6ce1a6df
J
net: tcp: make hybla selectable as default congestion module · dd2acaa7
由 Jan Engelhardt 提交于 3月 11, 2010
```
Signed-off-by: NJan Engelhardt <jengelh@medozas.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
dd2acaa7

net: remove rcu locking from fib_rules_event() · 2fb3573d

由 Eric Dumazet 提交于 3月 09, 2010

We hold RTNL at this point and dont use RCU variants of list traversals,
we dont need rcu_read_lock()/rcu_read_unlock()
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2fb3573d

bridge: per-cpu packet statistics (v3) · 14bb4789

由 stephen hemminger 提交于 3月 02, 2010

The shared packet statistics are a potential source of slow down
on bridged traffic. Convert to per-cpu array, but only keep those
statistics which change per-packet.
Signed-off-by: NStephen Hemminger <shemminger@vyatta.com>
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

14bb4789

rps: Receive Packet Steering · 0a9627f2

由 Tom Herbert 提交于 3月 16, 2010

This patch implements software receive side packet steering (RPS).  RPS
distributes the load of received packet processing across multiple CPUs.

Problem statement: Protocol processing done in the NAPI context for received
packets is serialized per device queue and becomes a bottleneck under high
packet load.  This substantially limits pps that can be achieved on a single
queue NIC and provides no scaling with multiple cores.

This solution queues packets early on in the receive path on the backlog queues
of other CPUs.   This allows protocol processing (e.g. IP and TCP) to be
performed on packets in parallel.   For each device (or each receive queue in
a multi-queue device) a mask of CPUs is set to indicate the CPUs that can
process packets. A CPU is selected on a per packet basis by hashing contents
of the packet header (e.g. the TCP or UDP 4-tuple) and using the result to index
into the CPU mask.  The IPI mechanism is used to raise networking receive
softirqs between CPUs.  This effectively emulates in software what a multi-queue
NIC can provide, but is generic requiring no device support.

Many devices now provide a hash over the 4-tuple on a per packet basis
(e.g. the Toeplitz hash).  This patch allow drivers to set the HW reported hash
in an skb field, and that value in turn is used to index into the RPS maps.
Using the HW generated hash can avoid cache misses on the packet when
steering it to a remote CPU.

The CPU mask is set on a per device and per queue basis in the sysfs variable
/sys/class/net/<device>/queues/rx-<n>/rps_cpus.  This is a set of canonical
bit maps for receive queues in the device (numbered by <n>).  If a device
does not support multi-queue, a single variable is used for the device (rx-0).

Generally, we have found this technique increases pps capabilities of a single
queue device with good CPU utilization.  Optimal settings for the CPU mask
seem to depend on architectures and cache hierarcy.  Below are some results
running 500 instances of netperf TCP_RR test with 1 byte req. and resp.
Results show cumulative transaction rate and system CPU utilization.

e1000e on 8 core Intel
   Without RPS: 108K tps at 33% CPU
   With RPS:    311K tps at 64% CPU

forcedeth on 16 core AMD
   Without RPS: 156K tps at 15% CPU
   With RPS:    404K tps at 49% CPU

bnx2x on 16 core AMD
   Without RPS  567K tps at 61% CPU (4 HW RX queues)
   Without RPS  738K tps at 96% CPU (8 HW RX queues)
   With RPS:    854K tps at 76% CPU (4 HW RX queues)

Caveats:
- The benefits of this patch are dependent on architecture and cache hierarchy.
Tuning the masks to get best performance is probably necessary.
- This patch adds overhead in the path for processing a single packet.  In
a lightly loaded server this overhead may eliminate the advantages of
increased parallelism, and possibly cause some relative performance degradation.
We have found that masks that are cache aware (share same caches with
the interrupting CPU) mitigate much of this.
- The RPS masks can be changed dynamically, however whenever the mask is changed
this introduces the possibility of generating out of order packets.  It's
probably best not change the masks too frequently.
Signed-off-by: NTom Herbert <therbert@google.com>

 include/linux/netdevice.h |   32 ++++-
 include/linux/skbuff.h    |    3 +
 net/core/dev.c            |  335 +++++++++++++++++++++++++++++++++++++--------
 net/core/net-sysfs.c      |  225 ++++++++++++++++++++++++++++++-
 net/core/skbuff.c         |    2 +
 5 files changed, 538 insertions(+), 59 deletions(-)
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0a9627f2

RDS: Enable per-cpu workqueue threads · 768bbedf

由 Tina Yang 提交于 3月 11, 2010

Create per-cpu workqueue threads instead of a single
krdsd thread. This is a step towards better scalability.
Signed-off-by: NAndy Grover <andy.grover@oracle.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

768bbedf

RDS: Do not call set_page_dirty() with irqs off · 561c7df6

由 Andy Grover 提交于 3月 11, 2010

set_page_dirty() unconditionally re-enables interrupts, so
if we call it with irqs off, they will be on after the call,
and that's bad. This patch moves the call after we've re-enabled
interrupts in send_drop_to(), so it's safe.

Also, add BUG_ONs to let us know if we ever do call set_page_dirty
with interrupts off.
Signed-off-by: NAndy Grover <andy.grover@oracle.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

561c7df6

RDS: Properly unmap when getting a remote access error · 450d06c0

由 Sherman Pun 提交于 3月 11, 2010

If the RDMA op has aborted with a remote access error,
in addition to what we already do (tell userspace it has
completed with an error) also unmap it and put() the rm.

Otherwise, hangs may occur on arches that track maps and
will not exit without proper cleanup.
Signed-off-by: NAndy Grover <andy.grover@oracle.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

450d06c0

RDS: only put sockets that have seen congestion on the poll_waitq · b98ba52f

由 Andy Grover 提交于 3月 11, 2010

rds_poll_waitq's listeners will be awoken if we receive a congestion
notification. Bad performance may result because *all* polled sockets
contend for this single lock. However, it should not be necessary to
wake pollers when a congestion update arrives if they have never
experienced congestion, and not putting these on the waitq will
hopefully greatly reduce contention.
Signed-off-by: NAndy Grover <andy.grover@oracle.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b98ba52f

RDS: Fix locking in rds_send_drop_to() · 550a8002

由 Tina Yang 提交于 3月 11, 2010

It seems rds_send_drop_to() called
__rds_rdma_send_complete(rs, rm, RDS_RDMA_CANCELED)
with only rds_sock lock, but not rds_message lock. It raced with
other threads that is attempting to modify the rds_message as well,
such as from within rds_rdma_send_complete().
Signed-off-by: NTina Yang <tina.yang@oracle.com>
Signed-off-by: NAndy Grover <andy.grover@oracle.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

550a8002

RDS: Turn down alarming reconnect messages · 97069788

由 Andy Grover 提交于 3月 11, 2010

RDS's error messages when a connection goes down are a little
extreme. A connection may go down, and it will be re-established,
and everything is fine. This patch links these messages through
rdsdebug(), instead of to printk directly.
Signed-off-by: NAndy Grover <andy.grover@oracle.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

97069788

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功