提交 · 54309fa60b5f57b90c1842176f6045e665d21142 · OpenHarmony / kernel_linux

06 5月, 2013 2 次提交

netpoll: inverted down_trylock() test · a3dbbc2b

由 Dan Carpenter 提交于 5月 06, 2013

The return value is reversed from mutex_trylock().
Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a3dbbc2b

rps_dev_flow_table_release(): no need to delay vfree() · 243198d0

由 Al Viro 提交于 5月 05, 2013

The same story as with fib_trie patch - vfree() from RCU callbacks
is legitimate now.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

243198d0

03 5月, 2013 2 次提交
- B
  net: vlan,ethtool: netdev_features_t is more than 32 bit · b29d3145
  由 Bjørn Mork 提交于 5月 01, 2013
```
Signed-off-by: NBjørn Mork <bjorn@mork.no>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
  b29d3145
- P
  net: use netdev_features_t in skb_needs_linearize() · 6708c9e5
  由 Patrick McHardy 提交于 5月 01, 2013
```
Signed-off-by: NPatrick McHardy <kaber@trash.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
  6708c9e5
02 5月, 2013 3 次提交

proc: Supply a function to remove a proc entry by PDE · a8ca16ea

由 David Howells 提交于 4月 12, 2013

Supply a function (proc_remove()) to remove a proc entry (and any subtree
rooted there) by proc_dir_entry pointer rather than by name and (optionally)
root dir entry pointer.  This allows us to eliminate all remaining pde->name
accesses outside of procfs.
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Acked-by: NGrant Likely <grant.likely@linaro.or>
cc: linux-acpi@vger.kernel.org
cc: openipmi-developer@lists.sourceforge.net
cc: devicetree-discuss@lists.ozlabs.org
cc: linux-pci@vger.kernel.org
cc: netdev@vger.kernel.org
cc: netfilter-devel@vger.kernel.org
cc: alsa-devel@alsa-project.org
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

a8ca16ea

proc: Split the namespace stuff out into linux/proc_ns.h · 0bb80f24

由 David Howells 提交于 4月 12, 2013

Split the proc namespace stuff out into linux/proc_ns.h.
Signed-off-by: NDavid Howells <dhowells@redhat.com>
cc: netdev@vger.kernel.org
cc: Serge E. Hallyn <serge.hallyn@ubuntu.com>
cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

0bb80f24

netpoll: convert mutex into a semaphore · bd7c4b60

由 Neil Horman 提交于 4月 30, 2013

Bart Van Assche recently reported a warning to me:

<IRQ>  [<ffffffff8103d79f>] warn_slowpath_common+0x7f/0xc0
[<ffffffff8103d7fa>] warn_slowpath_null+0x1a/0x20
[<ffffffff814761dd>] mutex_trylock+0x16d/0x180
[<ffffffff813968c9>] netpoll_poll_dev+0x49/0xc30
[<ffffffff8136a2d2>] ? __alloc_skb+0x82/0x2a0
[<ffffffff81397715>] netpoll_send_skb_on_dev+0x265/0x410
[<ffffffff81397c5a>] netpoll_send_udp+0x28a/0x3a0
[<ffffffffa0541843>] ? write_msg+0x53/0x110 [netconsole]
[<ffffffffa05418bf>] write_msg+0xcf/0x110 [netconsole]
[<ffffffff8103eba1>] call_console_drivers.constprop.17+0xa1/0x1c0
[<ffffffff8103fb76>] console_unlock+0x2d6/0x450
[<ffffffff8104011e>] vprintk_emit+0x1ee/0x510
[<ffffffff8146f9f6>] printk+0x4d/0x4f
[<ffffffffa0004f1d>] scsi_print_command+0x7d/0xe0 [scsi_mod]

This resulted from my commit ca99ca14 which introduced a mutex_trylock
operation in a path that could execute in interrupt context.  When mutex
debugging is enabled, the above warns the user when we are in fact
exectuting in interrupt context
interrupt context.

After some discussion, It seems that a semaphore is the proper mechanism to use
here.  While mutexes are defined to be unusable in interrupt context, no such
condition exists for semaphores (save for the fact that the non blocking api
calls, like up and down_trylock must be used when in irq context).
Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
Reported-by: NBart Van Assche <bvanassche@acm.org>
CC: Bart Van Assche <bvanassche@acm.org>
CC: David Miller <davem@davemloft.net>
CC: netdev@vger.kernel.org
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bd7c4b60

30 4月, 2013 7 次提交

unix/dgram: fix peeking with an offset larger than data in queue · 39cc8613

由 Benjamin Poirier 提交于 4月 29, 2013

Currently, peeking on a unix datagram socket with an offset larger than len of
the data in the sk receive queue returns immediately with bogus data. That's
because *off is not reset between each skb_queue_walk().

This patch fixes this so that the behavior is the same as peeking with no
offset on an empty queue: the caller blocks.
Signed-off-by: NBenjamin Poirier <bpoirier@suse.de>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

39cc8613

unix/dgram: peek beyond 0-sized skbs · add05ad4

由 Benjamin Poirier 提交于 4月 29, 2013

"77c1090f net: fix infinite loop in __skb_recv_datagram()" (v3.8) introduced a
regression:
After that commit, recv can no longer peek beyond a 0-sized skb in the queue.
__skb_recv_datagram() instead stops at the first skb with len == 0 and results
in the system call failing with -EFAULT via skb_copy_datagram_iovec().

When peeking at an offset with 0-sized skb(s), each one of those is received
only once, in sequence. The offset starts moving forward again after receiving
datagrams with len > 0.
Signed-off-by: NBenjamin Poirier <bpoirier@suse.de>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

add05ad4

net: Use consume_skb() to free gso segmented skb · 0c772159

由 Sridhar Samudrala 提交于 4月 29, 2013

Use consume_skb() to free the original skb that is successfully transmitted
as gso segmented skbs so that it is not treated as a drop due to an error.
Signed-off-by: NSridhar Samudrala <sri@us.ibm.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0c772159

net/core: remove duplicate statements by do-while loop · 70e3ba72

由 Akinobu Mita 提交于 4月 29, 2013

Remove duplicate statements by using do-while loop instead of while loop.

- A;
- while (e) {
+ do {
	A;
- }
+ } while (e);
Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

70e3ba72

net/core: rename random32() to prandom_u32() · 33d7c5e5

由 Akinobu Mita 提交于 4月 29, 2013

Use preferable function name which implies using a pseudo-random
number generator.
Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

33d7c5e5

net: defer net_secret[] initialization · aebda156

由 Eric Dumazet 提交于 4月 29, 2013

Instead of feeding net_secret[] at boot time, defer the init
at the point first socket is created.

This permits some platforms to use better entropy sources than
the ones available at boot time.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

aebda156

sock_diag: allow to dump bpf filters · e8d9612c

由 Nicolas Dichtel 提交于 4月 25, 2013

This patch allows to dump BPF filters attached to a socket with
SO_ATTACH_FILTER.
Note that we check CAP_SYS_ADMIN before allowing to dump this info.

For now, only AF_PACKET sockets use this feature.
Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e8d9612c

25 4月, 2013 3 次提交

net: fix address check in rtnl_fdb_del · 37fe0660

由 Vlad Yasevich 提交于 4月 23, 2013

Commit 6681712d
	vxlan: generalize forwarding tables

relaxed the address checks in rtnl_fdb_del() to use is_zero_ether_addr().
This allows users to add multicast addresses using the fdb API.  However,
the check in rtnl_fdb_del() still uses a more strict
is_valid_ether_addr() which rejects multicast addresses.  Thus it
is possible to add an fdb that can not be later removed.
Relax the check in rtnl_fdb_del() as well.
Signed-off-by: NVlad Yasevich <vyasevic@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

37fe0660

net: remove redundant code in dev_hard_start_xmit() · e12472dc

由 Eric Dumazet 提交于 4月 22, 2013

This reverts commit 068a2de5 (net: release dst entry while
cache-hot for GSO case too)

Before GSO packet segmentation, we already take care of skb->dst if it
can be released.

There is no point adding extra test for every segment in the gso loop.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e12472dc

packet: tx timestamping on tpacket ring · 2e31396f

由 Willem de Bruijn 提交于 4月 23, 2013

When transmit timestamping is enabled at the socket level, record a
timestamp on packets written to a PACKET_TX_RING. Tx timestamps are
always looped to the application over the socket error queue. Software
timestamps are also written back into the packet frame header in the
packet ring.
Reported-by: NPaul Chavent <paul.chavent@onera.fr>
Signed-off-by: NWillem de Bruijn <willemb@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2e31396f

23 4月, 2013 1 次提交

net: Remove return value from list_netdevice() · 53759be9

由 dingtianhong 提交于 4月 17, 2013

The return value from list_netdevice() is not used and no need, so remove it.
Signed-off-by: NDing Tianhong <dingtianhong@huawei.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

53759be9

20 4月, 2013 6 次提交

net: rate-limit warn-bad-offload splats. · c846ad9b

由 Ben Greear 提交于 4月 19, 2013

If one does do something unfortunate and allow a
bad offload bug into the kernel, this the
skb_warn_bad_offload can effectively live-lock the
system, filling the logs with the same error over
and over.

Add rate limitation to this so that box remains otherwise
functional in this case.
Signed-off-by: NBen Greear <greearb@candelatech.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c846ad9b

D
net: Add missing netdev feature strings for NETIF_F_HW_VLAN_STAG_* · 2d6577f1
由 David S. Miller 提交于 4月 19, 2013
```
Noticed by Ben Hutchings.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
2d6577f1

net: add function to allocate sk_buff head without data area · 0ebd0ac5

由 Patrick McHardy 提交于 4月 17, 2013

Add a function to allocate a sk_buff head without any data. This will
be used by memory mapped netlink to attach data from the mmaped area
to the skb.

Additionally change skb_release_all() to check whether the skb has a
data area to allow the skb destructor to clear the data pointer in case
only a head has been allocated.
Signed-off-by: NPatrick McHardy <kaber@trash.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0ebd0ac5

net: vlan: add 802.1ad support · 8ad227ff

由 Patrick McHardy 提交于 4月 19, 2013

Add support for 802.1ad VLAN devices. This mainly consists of checking for
ETH_P_8021AD in addition to ETH_P_8021Q in a couple of places and check
offloading capabilities based on the used protocol.

Configuration is done using "ip link":

# ip link add link eth0 eth0.1000 \
	type vlan proto 802.1ad id 1000
# ip link add link eth0.1000 eth0.1000.1000 \
	type vlan proto 802.1q id 1000

52:54:00:12:34:56 > 92:b1:54:28:e4:8c, ethertype 802.1Q (0x8100), length 106: vlan 1000, p 0, ethertype 802.1Q, vlan 1000, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto ICMP (1), length 84)
    20.1.0.2 > 20.1.0.1: ICMP echo request, id 3003, seq 8, length 64
92:b1:54:28:e4:8c > 52:54:00:12:34:56, ethertype 802.1Q-QinQ (0x88a8), length 106: vlan 1000, p 0, ethertype 802.1Q, vlan 1000, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 47944, offset 0, flags [none], proto ICMP (1), length 84)
    20.1.0.1 > 20.1.0.2: ICMP echo reply, id 3003, seq 8, length 64
Signed-off-by: NPatrick McHardy <kaber@trash.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8ad227ff

net: vlan: add protocol argument to packet tagging functions · 86a9bad3

由 Patrick McHardy 提交于 4月 19, 2013

Add a protocol argument to the VLAN packet tagging functions. In case of HW
tagging, we need that protocol available in the ndo_start_xmit functions,
so it is stored in a new field in the skb. The new field fits into a hole
(on 64 bit) and doesn't increase the sks's size.
Signed-off-by: NPatrick McHardy <kaber@trash.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

86a9bad3

net: vlan: rename NETIF_F_HW_VLAN_* feature flags to NETIF_F_HW_VLAN_CTAG_* · f646968f

由 Patrick McHardy 提交于 4月 19, 2013

Rename the hardware VLAN acceleration features to include "CTAG" to indicate
that they only support CTAGs. Follow up patches will introduce 802.1ad
server provider tagging (STAGs) and require the distinction for hardware not
supporting acclerating both.
Signed-off-by: NPatrick McHardy <kaber@trash.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f646968f

17 4月, 2013 1 次提交

neighbour: Convert NEIGH_PRINTK to neigh_dbg · d5d427cd

由 Joe Perches 提交于 4月 15, 2013

Update debugging messages to a more current style.

Emit these debugging messages at KERN_DEBUG instead
of KERN_DEFAULT.

Add and use neigh_dbg(level, fmt, ...) macro
Add dynamic_debug capability via pr_debug
Convert embedded function names to "%s: ", __func__
Signed-off-by: NJoe Perches <joe@perches.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d5d427cd

16 4月, 2013 1 次提交

net: add dev_uc_sync_multiple() and dev_mc_sync_multiple() api · 4cd729b0

由 Vlad Yasevich 提交于 4月 15, 2013

The current implementation of dev_uc_sync/unsync() assumes that there is
a strict 1-to-1 relationship between the source and destination of the sync.
In other words, once an address has been synced to a destination device, it
will not be synced to any other device through the sync API.
However, there are some virtual devices that aggreate a number of lower
devices and need to sync addresses to all of them.  The current
API falls short there.

This patch introduces a new dev_uc_sync_multiple() api that can be called
in the above circumstances and allows sync to work for every invocation.

CC: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: NVlad Yasevich <vyasevic@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4cd729b0

12 4月, 2013 1 次提交

Revert "netprio_cgroup: make local table static" · 6c677985

由 David S. Miller 提交于 4月 12, 2013

This reverts commit 763eff57.

It causes build regressions, as per Stephen Rothwell:

====================
After merging the final tree, today's linux-next build (powerpc
allyesconfig) failed like this:

net/core/netprio_cgroup.c:250:29: error: static declaration of 'net_prio_subsys' follows non-static declaration
include/linux/cgroup_subsys.h:71:1: note: previous declaration of 'net_prio_subsys' was here
====================
Reported-by: NStephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6c677985

11 4月, 2013 1 次提交

netprio_cgroup: make local table static · 763eff57

由 stephen hemminger 提交于 4月 10, 2013

Minor sparse warning
Signed-off-by: NStephen Hemminger <stephen@networkplumber.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

763eff57

10 4月, 2013 3 次提交

procfs: new helper - PDE_DATA(inode) · d9dda78b

由 Al Viro 提交于 3月 31, 2013

The only part of proc_dir_entry the code outside of fs/proc
really cares about is PDE(inode)->data.  Provide a helper
for that; static inline for now, eventually will be moved
to fs/proc, along with the knowledge of struct proc_dir_entry
layout.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

d9dda78b

netprio_cgroup: remove task_struct parameter from sock_update_netprio() · 6ffd4641

由 Zefan Li 提交于 4月 08, 2013

The callers always pass current to sock_update_netprio().
Signed-off-by: NLi Zefan <lizefan@huawei.com>
Acked-by: NNeil Horman <nhorman@tuxdriver.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6ffd4641

cls_cgroup: remove task_struct parameter from sock_update_classid() · 211d2f97

由 Zefan Li 提交于 4月 08, 2013

The callers always pass current to sock_update_classid().
Signed-off-by: NLi Zefan <lizefan@huawei.com>
Acked-by: NNeil Horman <nhorman@tuxdriver.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

211d2f97

09 4月, 2013 1 次提交

rtnetlink: Call nlmsg_parse() with correct header length · 88c5b5ce

由 Michael Riesch 提交于 4月 08, 2013

Signed-off-by: NMichael Riesch <michael.riesch@omicron.at>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jiri Benc <jbenc@redhat.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-kernel@vger.kernel.org
Acked-by: NMark Rustad <mark.d.rustad@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

88c5b5ce

08 4月, 2013 1 次提交

scm: Stop passing struct cred · 6b0ee8c0

由 Eric W. Biederman 提交于 4月 03, 2013

Now that uids and gids are completely encapsulated in kuid_t
and kgid_t we no longer need to pass struct cred which allowed
us to test both the uid and the user namespace for equality.

Passing struct cred potentially allows us to pass the entire group
list as BSD does but I don't believe the cost of cache line misses
justifies retaining code for a future potential application.
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6b0ee8c0

06 4月, 2013 1 次提交

netfilter: don't reset nf_trace in nf_reset() · 124dff01

由 Patrick McHardy 提交于 4月 05, 2013

Commit 130549fe ("netfilter: reset nf_trace in nf_reset") added code
to reset nf_trace in nf_reset(). This is wrong and unnecessary.

nf_reset() is used in the following cases:

- when passing packets up the the socket layer, at which point we want to
  release all netfilter references that might keep modules pinned while
  the packet is queued. nf_trace doesn't matter anymore at this point.

- when encapsulating or decapsulating IPsec packets. We want to continue
  tracing these packets after IPsec processing.

- when passing packets through virtual network devices. Only devices on
  that encapsulate in IPv4/v6 matter since otherwise nf_trace is not
  used anymore. Its not entirely clear whether those packets should
  be traced after that, however we've always done that.

- when passing packets through virtual network devices that make the
  packet cross network namespace boundaries. This is the only cases
  where we clearly want to reset nf_trace and is also what the
  original patch intended to fix.

Add a new function nf_reset_trace() and use it in dev_forward_skb() to
fix this properly.
Signed-off-by: NPatrick McHardy <kaber@trash.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

124dff01

05 4月, 2013 1 次提交

net: count hw_addr syncs so that unsync works properly. · 4543fbef

由 Vlad Yasevich 提交于 4月 02, 2013

A few drivers use dev_uc_sync/unsync to synchronize the
address lists from master down to slave/lower devices.  In
some cases (bond/team) a single address list is synched down
to multiple devices.  At the time of unsync, we have a leak
in these lower devices, because "synced" is treated as a
boolean and the address will not be unsynced for anything after
the first device/call.

Treat "synced" as a count (same as refcount) and allow all
unsync calls to work.
Signed-off-by: NVlad Yasevich <vyasevic@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4543fbef

03 4月, 2013 1 次提交

net: fix smatch warnings inside datagram_poll · 8facd5fb

由 Jacob Keller 提交于 4月 02, 2013

Commit 7d4c04fc ("net: add option to enable
error queue packets waking select") has an issue due to operator precedence
causing the bit-wise OR to bind to the sock_flags call instead of the result of
the terniary conditional. This fixes the *_poll functions to work properly. The
old code results in "mask |= POLLPRI" instead of what was intended, which is to
only include POLLPRI when the socket option is enabled.
Signed-off-by: NJacob Keller <jacob.e.keller@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8facd5fb

02 4月, 2013 1 次提交

net: add skb_dst_set_noref_force · 932bc4d7

由 Julian Anastasov 提交于 3月 21, 2013

Rename skb_dst_set_noref to __skb_dst_set_noref and
add force flag as suggested by David Miller. The new wrapper
skb_dst_set_noref_force will force dst entries that are not
cached to be attached as skb dst without taking reference
as long as provided dst is reclaimed after RCU grace period.
Signed-off-by: NJulian Anastasov <ja@ssi.bg>
Signed-off by: Hans Schillstrom <hans@schillstrom.com>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Signed-off-by: NSimon Horman <horms@verge.net.au>

932bc4d7

01 4月, 2013 1 次提交

net: add option to enable error queue packets waking select · 7d4c04fc

由 Keller, Jacob E 提交于 3月 28, 2013

Currently, when a socket receives something on the error queue it only wakes up
the socket on select if it is in the "read" list, that is the socket has
something to read. It is useful also to wake the socket if it is in the error
list, which would enable software to wait on error queue packets without waking
up for regular data on the socket. The main use case is for receiving
timestamped transmit packets which return the timestamp to the socket via the
error queue. This enables an application to select on the socket for the error
queue only instead of for the regular traffic.

-v2-
* Added the SO_SELECT_ERR_QUEUE socket option to every architechture specific file
* Modified every socket poll function that checks error queue
Signed-off-by: NJacob Keller <jacob.e.keller@intel.com>
Cc: Jeffrey Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Matthew Vick <matthew.vick@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7d4c04fc

30 3月, 2013 2 次提交

net: add a synchronize_net() in netdev_rx_handler_unregister() · 00cfec37

由 Eric Dumazet 提交于 3月 29, 2013

commit 35d48903 (bonding: fix rx_handler locking) added a race
in bonding driver, reported by Steven Rostedt who did a very good
diagnosis :

<quoting Steven>

I'm currently debugging a crash in an old 3.0-rt kernel that one of our
customers is seeing. The bug happens with a stress test that loads and
unloads the bonding module in a loop (I don't know all the details as
I'm not the one that is directly interacting with the customer). But the
bug looks to be something that may still be present and possibly present
in mainline too. It will just be much harder to trigger it in mainline.

In -rt, interrupts are threads, and can schedule in and out just like
any other thread. Note, mainline now supports interrupt threads so this
may be easily reproducible in mainline as well. I don't have the ability
to tell the customer to try mainline or other kernels, so my hands are
somewhat tied to what I can do.

But according to a core dump, I tracked down that the eth irq thread
crashed in bond_handle_frame() here:

        slave = bond_slave_get_rcu(skb->dev);
        bond = slave->bond; <--- BUG

the slave returned was NULL and accessing slave->bond caused a NULL
pointer dereference.

Looking at the code that unregisters the handler:

void netdev_rx_handler_unregister(struct net_device *dev)
{

        ASSERT_RTNL();
        RCU_INIT_POINTER(dev->rx_handler, NULL);
        RCU_INIT_POINTER(dev->rx_handler_data, NULL);
}

Which is basically:
        dev->rx_handler = NULL;
        dev->rx_handler_data = NULL;

And looking at __netif_receive_skb() we have:

        rx_handler = rcu_dereference(skb->dev->rx_handler);
        if (rx_handler) {
                if (pt_prev) {
                        ret = deliver_skb(skb, pt_prev, orig_dev);
                        pt_prev = NULL;
                }
                switch (rx_handler(&skb)) {

My question to all of you is, what stops this interrupt from happening
while the bonding module is unloading?  What happens if the interrupt
triggers and we have this:

        CPU0                    CPU1
        ----                    ----
  rx_handler = skb->dev->rx_handler

                        netdev_rx_handler_unregister() {
                           dev->rx_handler = NULL;
                           dev->rx_handler_data = NULL;

  rx_handler()
   bond_handle_frame() {
    slave = skb->dev->rx_handler;
    bond = slave->bond; <-- NULL pointer dereference!!!

What protection am I missing in the bond release handler that would
prevent the above from happening?

</quoting Steven>

We can fix bug this in two ways. First is adding a test in
bond_handle_frame() and others to check if rx_handler_data is NULL.

A second way is adding a synchronize_net() in
netdev_rx_handler_unregister() to make sure that a rcu protected reader
has the guarantee to see a non NULL rx_handler_data.

The second way is better as it avoids an extra test in fast path.
Reported-by: NSteven Rostedt <rostedt@goodmis.org>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Jiri Pirko <jpirko@redhat.com>
Cc: Paul E. McKenney <paulmck@us.ibm.com>
Acked-by: NSteven Rostedt <rostedt@goodmis.org>
Reviewed-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

00cfec37

net: rtnetlink: fdb dflt dump must set idx used for cb->arg[0] · 91f3e7b1

由 John Fastabend 提交于 3月 29, 2013

In rtnl_fdb_dump() when the fdb_dump ndo op is not populated
we never set the idx value so that cb->arg[0] is always 0.
Resulting in a endless loop of messages.

Introduced with this commit,

commit 090096bf
Author: Vlad Yasevich <vyasevic@redhat.com>
Date:   Wed Mar 6 15:39:42 2013 +0000

    net: generic fdb support for drivers without ndo_fdb_<op>

CC: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

91f3e7b1

OpenHarmony / kernel_linux 上一次同步 大约 4 年

OpenHarmony / kernel_linux
上一次同步大约 4 年