提交 · 68835aba4d9b74e2f94106d13b6a4bddc447c4c8 · openanolis / cloud-kernel

10 12月, 2010 2 次提交

net: optimize INET input path further · 68835aba

由 Eric Dumazet 提交于 11月 30, 2010

Followup of commit b178bb3d (net: reorder struct sock fields)

Optimize INET input path a bit further, by :

1) moving sk_refcnt close to sk_lock.

This reduces number of dirtied cache lines by one on 64bit arches (and
64 bytes cache line size).

2) moving inet_daddr & inet_rcv_saddr at the beginning of sk

(same cache line than hash / family / bound_dev_if / nulls_node)

This reduces number of accessed cache lines in lookups by one, and dont
increase size of inet and timewait socks.
inet and tw sockets now share same place-holder for these fields.

Before patch :

offsetof(struct sock, sk_refcnt) = 0x10
offsetof(struct sock, sk_lock) = 0x40
offsetof(struct sock, sk_receive_queue) = 0x60
offsetof(struct inet_sock, inet_daddr) = 0x270
offsetof(struct inet_sock, inet_rcv_saddr) = 0x274

After patch :

offsetof(struct sock, sk_refcnt) = 0x44
offsetof(struct sock, sk_lock) = 0x48
offsetof(struct sock, sk_receive_queue) = 0x68
offsetof(struct inet_sock, inet_daddr) = 0x0
offsetof(struct inet_sock, inet_rcv_saddr) = 0x4

compute_score() (udp or tcp) now use a single cache line per ignored
item, instead of two.
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

68835aba

net: Abstract away all dst_entry metrics accesses. · defb3519

由 David S. Miller 提交于 12月 08, 2010

Use helper functions to hide all direct accesses, especially writes,
to dst_entry metrics values.

This will allow us to:

1) More easily change how the metrics are stored.

2) Implement COW for metrics.

In particular this will help us put metrics into the inetpeer
cache if that is what we end up doing.  We can make the _metrics
member a pointer instead of an array, initially have it point
at the read-only metrics in the FIB, and then on the first set
grab an inetpeer entry and point the _metrics member there.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
Acked-by: NEric Dumazet <eric.dumazet@gmail.com>

defb3519

09 12月, 2010 3 次提交

tcp: Replace time wait bucket msg by counter · 67631510

由 Tom Herbert 提交于 12月 08, 2010

Rather than printing the message to the log, use a mib counter to keep
track of the count of occurences of time wait bucket overflow.  Reduces
spam in logs.
Signed-off-by: NTom Herbert <therbert@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

67631510

filter: constify sk_run_filter() · 62ab0812

由 Eric Dumazet 提交于 12月 06, 2010

sk_run_filter() doesnt write on skb, change its prototype to reflect
this.

Fix two af_packet comments.
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

62ab0812

net: RCU conversion of dev_getbyhwaddr() and arp_ioctl() · 941666c2

由 Eric Dumazet 提交于 12月 05, 2010

Le dimanche 05 décembre 2010 à 09:19 +0100, Eric Dumazet a écrit :

> Hmm..
>
> If somebody can explain why RTNL is held in arp_ioctl() (and therefore
> in arp_req_delete()), we might first remove RTNL use in arp_ioctl() so
> that your patch can be applied.
>
> Right now it is not good, because RTNL wont be necessarly held when you
> are going to call arp_invalidate() ?

While doing this analysis, I found a refcount bug in llc, I'll send a
patch for net-2.6

Meanwhile, here is the patch for net-next-2.6

Your patch then can be applied after mine.

Thanks

[PATCH] net: RCU conversion of dev_getbyhwaddr() and arp_ioctl()

dev_getbyhwaddr() was called under RTNL.

Rename it to dev_getbyhwaddr_rcu() and change all its caller to now use
RCU locking instead of RTNL.

Change arp_ioctl() to use RCU instead of RTNL locking.

Note: this fix a dev refcount bug in llc
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

941666c2

07 12月, 2010 6 次提交

dccp: Policy-based packet dequeueing infrastructure · 871a2c16

由 Tomasz Grobelny 提交于 12月 04, 2010

This patch adds a generic infrastructure for policy-based dequeueing of
TX packets and provides two policies:
 * a simple FIFO policy (which is the default) and
 * a priority based policy (set via socket options).
Both policies honour the tx_qlen sysctl for the maximum size of the write
queue (can be overridden via socket options).

The priority policy uses skb->priority internally to assign an u32 priority
identifier, using the same ranking as SO_PRIORITY. The skb->priority field
is set to 0 when the packet leaves DCCP. The priority is supplied as ancillary
data using cmsg(3), the patch also provides the requisite parsing routines.
Signed-off-by: NTomasz Grobelny <tomasz@grobelny.oswiecenia.net>
Signed-off-by: NGerrit Renker <gerrit@erg.abdn.ac.uk>

871a2c16

__in_dev_get_rtnl() can use rtnl_dereference() · 06a9701f

由 Eric Dumazet 提交于 12月 01, 2010

If caller holds RTNL, we dont need a memory barrier
(smp_read_barrier_depends) included in rcu_dereference().

Just use rtnl_dereference() to properly document the assertions.
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

06a9701f

filter: add SKF_AD_RXHASH and SKF_AD_CPU · da2033c2

由 Eric Dumazet 提交于 11月 30, 2010

Add SKF_AD_RXHASH and SKF_AD_CPU to filter ancillary mechanism,
to be able to build advanced filters.

This can help spreading packets on several sockets with a fast
selection, after RPS dispatch to N cpus for example, or to catch a
percentage of flows in one queue.

tcpdump -s 500 "cpu = 1" :

[0] ld CPU
[1] jeq #1  jt 2  jf 3
[2] ret #500
[3] ret #0

# take 12.5 % of flows (average)
tcpdump -s 1000 "rxhash & 7 = 2" :

[0] ld RXHASH
[1] and #7
[2] jeq #2  jt 3  jf 4
[3] ret #1000
[4] ret #0
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Cc: Rui <wirelesser@gmail.com>
Acked-by: NChangli Gao <xiaosuo@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

da2033c2

usbnet: changes for upcoming cdc_ncm driver · 073285fd

由 Alexey Orishko 提交于 11月 29, 2010

Changes:
include/linux/usb/usbnet.h:
- a new flag to indicate driver's capability to accumulate IP packets in Tx
 direction and extract several packets from single skb in Rx direction.
drivers/net/usb/usbnet.c:
- the procedure of counting packets in usbnet was updated due to the
 accumulating of IP packets in the driver
- no short packets are sent if indicated by the flag in driver_info
 structure
Signed-off-by: NAlexey Orishko <alexey.orishko@stericsson.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

073285fd

tg3: Move EEE definitions into mdio.h · 3110f5f5

由 Matt Carlson 提交于 12月 06, 2010

In commit 52b02d04 entitled "tg3: Add
EEE support", Ben Hutchings had commented that the EEE advertisement
register will be in a standard location.  This patch moves that
definition into mdio.h and changes the code to use it.
Signed-off-by: NMatt Carlson <mcarlson@broadcom.com>
Reviewed-by: NBenjamin Li <benli@broadcom.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3110f5f5

filter: fix sk_filter rcu handling · 46bcf14f

由 Eric Dumazet 提交于 12月 06, 2010

Pavel Emelyanov tried to fix a race between sk_filter_(de|at)tach and
sk_clone() in commit 47e958ea

Problem is we can have several clones sharing a common sk_filter, and
these clones might want to sk_filter_attach() their own filters at the
same time, and can overwrite old_filter->rcu, corrupting RCU queues.

We can not use filter->rcu without being sure no other thread could do
the same thing.

Switch code to a more conventional ref-counting technique : Do the
atomic decrement immediately and queue one rcu call back when last
reference is released.
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

46bcf14f

03 12月, 2010 5 次提交

tipc: Remove obsolete native API files and exports · d265fef6

由 Allan Stephens 提交于 11月 30, 2010

As part of the removal of TIPC's native API support it is no longer
necessary for TIPC to export symbols for routines that can be called
by kernel-based applications, nor for it to have header files that
kernel-based applications can include to access the declarations for
those routines. This commit eliminates the exporting of symbols by
TIPC and migrates the contents of each obsolete native API include
file into its corresponding non-native API equivalent.

The code which was migrated in this commit was migrated intact, in
that there are no technical changes combined with the relocation.
Signed-off-by: NAllan Stephens <Allan.Stephens@windriver.com>
Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d265fef6

net: kill unused macros from head file · dca9b240

由 Shan Wei 提交于 12月 01, 2010

These macros have been defined for several years since v2.6.12-rc2（tracing by git）,
but never be used. So remove them.
Signed-off-by: NShan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

dca9b240

net: snmp: fix the wrong ICMP_MIB_MAX value · a9527a3b

由 Shan Wei 提交于 12月 01, 2010

__ICMP_MIB_MAX is equal to the total number of icmp mib,
So no need to add 1. This wastes 4/8 bytes memory.

Change it to be same as ICMP6_MIB_MAX, TCP_MIB_MAX, UDP_MIB_MAX.
Signed-off-by: NShan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a9527a3b

D
ipv6: Create inet6_csk_route_req(). · ae4694b2
由 David S. Miller 提交于 12月 02, 2010
```
Brother of ipv4's inet_csk_route_req().
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
ae4694b2

ipv6: Add rt6_get_peer() helper. · 15c05425

由 David S. Miller 提交于 12月 02, 2010

To go along side ipv4's rt_get_peer().
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

15c05425

02 12月, 2010 4 次提交

timewait_sock: Create and use getpeer op. · ccb7c410

由 David S. Miller 提交于 12月 01, 2010

The only thing AF-specific about remembering the timestamp
for a time-wait TCP socket is getting the peer.

Abstract that behind a new timewait_sock_ops vector.

Support for real IPV6 sockets is not filled in yet, but
curiously this makes timewait recycling start to work
for v4-mapped ipv6 sockets.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ccb7c410

D
inetpeer: Fix incorrect comment about inetpeer struct size. · 4399ce40
由 David S. Miller 提交于 12月 01, 2010
```
Now with ipv6 support it is no longer less than 64 bytes.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
4399ce40
D
inetpeer: Kill use of inet_peer_address_t typedef. · 8790ca17
由 David S. Miller 提交于 12月 01, 2010
```
They are verboten these days.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
8790ca17

net sched: use xps information for qdisc NUMA affinity · f2cd2d3e

由 Eric Dumazet 提交于 11月 29, 2010

Allocate qdisc memory according to NUMA properties of cpus included in
xps map.

To be effective, qdisc should be (re)setup after changes
of /sys/class/net/eth<n>/queues/tx-<n>/xps_cpus

I added a numa_node field in struct netdev_queue, containing NUMA node
if all cpus included in xps_cpus share same node, else -1.
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f2cd2d3e

01 12月, 2010 5 次提交

inet: Turn ->remember_stamp into ->get_peer in connection AF ops. · 3f419d2d

由 David S. Miller 提交于 11月 29, 2010

Then we can make a completely generic tcp_remember_stamp()
that uses ->get_peer() as a helper, minimizing the AF specific
code and minimizing the eventual code duplication when we implement
the ipv6 side of TW recycling.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3f419d2d

D
ipv6: Add infrastructure to bind inet_peer objects to routes. · b3419363
由 David S. Miller 提交于 11月 30, 2010
```
They are only allowed on cached ipv6 routes.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
b3419363

inetpeer: Add inet_getpeer_v6() · 672f007d

由 David S. Miller 提交于 11月 30, 2010

Now that all of the infrastructure is in place, we can add
the ipv6 shorthand for peer creation.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

672f007d

D
inetpeer: Make inet_getpeer() take an inet_peer_adress_t pointer. · b534ecf1
由 David S. Miller 提交于 11月 30, 2010
```
And make an inet_getpeer_v4() helper, update callers.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
b534ecf1

inetpeer: Introduce inet_peer_address_t. · 582a72da

由 David S. Miller 提交于 11月 30, 2010

Currently only the v4 aspect is used, but this will change.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

582a72da

30 11月, 2010 3 次提交

af_unix: limit recursion level · 25888e30

由 Eric Dumazet 提交于 11月 25, 2010

Its easy to eat all kernel memory and trigger NMI watchdog, using an
exploit program that queues unix sockets on top of others.

lkml ref : http://lkml.org/lkml/2010/11/25/8

This mechanism is used in applications, one choice we have is to have a
recursion limit.

Other limits might be needed as well (if we queue other types of files),
since the passfd mechanism is currently limited by socket receive queue
sizes only.

Add a recursion_level to unix socket, allowing up to 4 levels.

Each time we send an unix socket through sendfd mechanism, we copy its
recursion level (plus one) to receiver. This recursion level is cleared
when socket receive queue is emptied.
Reported-by: NМарк Коренберг <socketpair@gmail.com>
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

25888e30

xps: add __rcu annotations · a4177869

由 Eric Dumazet 提交于 11月 28, 2010

Avoid sparse warnings : add __rcu annotations and use
rcu_dereference_protected() where necessary.
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a4177869

sctp: kill unused macros in head file · 49b4a654

由 Shan Wei 提交于 11月 29, 2010

1. SCTP_CMD_NUM_VERBS,SCTP_CMD_MAX
These two macros have never been used for several years since v2.6.12-rc2.

2.sctp_port_rover,sctp_port_alloc_lock
The commit 063930 abandoned global variables of port_rover and port_alloc_lock,
but still keep two macros to refer to them.
So, remove them now.

commit 06393009
Author: Stephen Hemminger <shemminger@linux-foundation.org>
Date:   Wed Oct 10 17:30:18 2007 -0700

    [SCTP]: port randomization
Signed-off-by: NShan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

49b4a654

29 11月, 2010 5 次提交

xps: Add CONFIG_XPS · bf264145

由 Tom Herbert 提交于 11月 26, 2010

This patch adds XPS_CONFIG option to enable and disable XPS.  This is
done in the same manner as RPS_CONFIG.  This is also fixes build
failure in XPS code when SMP is not enabled.
Signed-off-by: NTom Herbert <therbert@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bf264145

xfrm: fix gre key endianess · aa285b17

由 Timo Teräs 提交于 11月 23, 2010

fl->fl_gre_key is network byte order contrary to fl->fl_icmp_*.
Make xfrm_flowi_{s|d}port return network byte order values for gre
key too.
Signed-off-by: NTimo Teräs <timo.teras@iki.fi>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

aa285b17

X25 remove bkl in subscription ioctls · 5595a1a5

由 andrew hendry 提交于 11月 25, 2010

Signed-off-by: NAndrew Hendry <andrew.hendry@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5595a1a5

net: add netif_tx_queue_frozen_or_stopped · 5a0d2268

由 Eric Dumazet 提交于 11月 23, 2010

When testing struct netdev_queue state against FROZEN bit, we also test
XOFF bit. We can test both bits at once and save some cycles.
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5a0d2268

sctp: kill unused macro definition · 5584b807

由 Shan Wei 提交于 11月 22, 2010

These macros have been existed for several years since v2.6.12-rc2.
But they never be used. So remove them now.
Signed-off-by: NShan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5584b807

28 11月, 2010 1 次提交

rtnl: make link af-specific updates atomic · cf7afbfe

由 Thomas Graf 提交于 11月 22, 2010

As David pointed out correctly, updates to af-specific attributes
are currently not atomic. If multiple changes are requested and
one of them fails, previous updates may have been applied already
leaving the link behind in a undefined state.

This patch splits the function parse_link_af() into two functions
validate_link_af() and set_link_at(). validate_link_af() is placed
to validate_linkmsg() check for errors as early as possible before
any changes to the link have been made. set_link_af() is called to
commit the changes later.

This method is not fail proof, while it is currently sufficient
to make set_link_af() inerrable and thus 100% atomic, the
validation function method will not be able to detect all error
scenarios in the future, there will likely always be errors
depending on states which are f.e. not protected by rtnl_mutex
and thus may change between validation and setting.

Also, instead of silently ignoring unknown address families and
config blocks for address families which did not register a set
function the errors EAFNOSUPPORT respectively EOPNOSUPPORT are
returned to avoid comitting 4 out of 5 update requests without
notifying the user.
Signed-off-by: NThomas Graf <tgraf@infradead.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cf7afbfe

25 11月, 2010 6 次提交

cfg80211: allow using CQM event to notify packet loss · c063dbf5

由 Johannes Berg 提交于 11月 24, 2010

This adds the ability for drivers to use CQM events
to notify about packet loss for specific stations
(which could be the AP for the managed mode case).
Since the threshold might be determined by the
driver (it isn't passed in right now) it will be
passed out of the driver to userspace in the event.
Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
Signed-off-by: NJohn W. Linville <linville@tuxdriver.com>

c063dbf5

cfg80211: Add documentation for antenna ops · 79b1c460

由 Bruno Randolf 提交于 11月 24, 2010

Signed-off-by: NBruno Randolf <br1@einfach.org>
Signed-off-by: NJohn W. Linville <linville@tuxdriver.com>

79b1c460

cfg80211/mac80211: improve ad-hoc multicast rate handling · dd5b4cc7

由 Felix Fietkau 提交于 11月 22, 2010

- store the multicast rate as an index instead of the rate value
  (reduces cpu overhead in a hotpath)
- validate the rate values (must match a bitrate in at least one sband)
Signed-off-by: NFelix Fietkau <nbd@openwrt.org>
Signed-off-by: NJohn W. Linville <linville@tuxdriver.com>

dd5b4cc7

Revert "nl80211/mac80211: Report signal average" · ccb14354

由 John W. Linville 提交于 11月 24, 2010

This reverts commit 86107fd1.

This patch inadvertantly changed the userland ABI.
Signed-off-by: NJohn W. Linville <linville@tuxdriver.com>

ccb14354

xps: Transmit Packet Steering · 1d24eb48

由 Tom Herbert 提交于 11月 21, 2010

This patch implements transmit packet steering (XPS) for multiqueue
devices.  XPS selects a transmit queue during packet transmission based
on configuration.  This is done by mapping the CPU transmitting the
packet to a queue.  This is the transmit side analogue to RPS-- where
RPS is selecting a CPU based on receive queue, XPS selects a queue
based on the CPU (previously there was an XPS patch from Eric
Dumazet, but that might more appropriately be called transmit completion
steering).

Each transmit queue can be associated with a number of CPUs which will
use the queue to send packets.  This is configured as a CPU mask on a
per queue basis in:

/sys/class/net/eth<n>/queues/tx-<n>/xps_cpus

The mappings are stored per device in an inverted data structure that
maps CPUs to queues.  In the netdevice structure this is an array of
num_possible_cpu structures where each structure holds and array of
queue_indexes for queues which that CPU can use.

The benefits of XPS are improved locality in the per queue data
structures.  Also, transmit completions are more likely to be done
nearer to the sending thread, so this should promote locality back
to the socket on free (e.g. UDP).  The benefits of XPS are dependent on
cache hierarchy, application load, and other factors.  XPS would
nominally be configured so that a queue would only be shared by CPUs
which are sharing a cache, the degenerative configuration woud be that
each CPU has it's own queue.

Below are some benchmark results which show the potential benfit of
this patch.  The netperf test has 500 instances of netperf TCP_RR test
with 1 byte req. and resp.

bnx2x on 16 core AMD
   XPS (16 queues, 1 TX queue per CPU)  1234K at 100% CPU
   No XPS (16 queues)                   996K at 100% CPU
Signed-off-by: NTom Herbert <therbert@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1d24eb48

xps: Improvements in TX queue selection · 3853b584

由 Tom Herbert 提交于 11月 21, 2010

In dev_pick_tx, don't do work in calculating queue
index or setting
the index in the sock unless the device has more than one queue.  This
allows the sock to be set only with a queue index of a multi-queue
device which is desirable if device are stacked like in a tunnel.

We also allow the mapping of a socket to queue to be changed.  To
maintain in order packet transmission a flag (ooo_okay) has been
added to the sk_buff structure.  If a transport layer sets this flag
on a packet, the transmit queue can be changed for the socket.
Presumably, the transport would set this if there was no possbility
of creating OOO packets (for instance, there are no packets in flight
for the socket).  This patch includes the modification in TCP output
for setting this flag.
Signed-off-by: NTom Herbert <therbert@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3853b584

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功