提交 · 815f236d622721b54f3963ba59dad98b02cdeabf · openeuler / Kernel

08 6月, 2013 5 次提交

macvtap: add TUNSETQUEUE ioctl · 815f236d

由 Jason Wang 提交于 6月 05, 2013

This patch adds TUNSETQUEUE ioctl to let userspace can temporarily disable or
enable a queue of macvtap. This is used to be compatible at API layer of tuntap
to simplify the userspace to manage the queues. This is done through introducing
a linked list to track all taps while using vlan->taps array to only track
active taps.
Signed-off-by: NJason Wang <jasowang@redhat.com>
Acked-by: NMichael S. Tsirkin <mst@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

815f236d

macvtap: eliminate linear search · 376b1aab

由 Jason Wang 提交于 6月 05, 2013

Linear search were used in both get_slot() and macvtap_get_queue(), this is
because:

- macvtap didn't reshuffle the array of taps when create or destroy a queue, so
  when adding a new queue, macvtap must do linear search to find a location for
  the new queue. This will also complicate the TUNSETQUEUE implementation for
  multiqueue API.
- the queue itself didn't track the queue index, so the we must do a linear
  search in the array to find the location of a existed queue.

The solution is straightforward: reshuffle the array and introduce a queue_index
to macvtap_queue.
Signed-off-by: NJason Wang <jasowang@redhat.com>
Acked-by: NMichael S. Tsirkin <mst@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

376b1aab

macvtap: introduce macvtap_get_vlan() · 8f475a31

由 Jason Wang 提交于 6月 05, 2013

Factor out the device holding logic to a macvtap_get_vlan(), this will be also
used by multiqueue API.
Signed-off-by: NJason Wang <jasowang@redhat.com>
Acked-by: NMichael S. Tsirkin <mst@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8f475a31

macvtap: do not add self to waitqueue if doing a nonblock read · 89cee917

由 Jason Wang 提交于 6月 05, 2013

There's no need to add self to waitqueue if doing a nonblock read. This could
help to avoid the spinlock contention.
Acked-by: NMichael S. Tsirkin <mst@redhat.com>
Signed-off-by: NJason Wang <jasowang@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

89cee917

macvtap: fix a possible race between queue selection and changing queues · ed0483fa

由 Jason Wang 提交于 6月 05, 2013

Complier may generate codes that re-read the vlan->numvtaps during
macvtap_get_queue(). This may lead a race if vlan->numvtaps were changed in the
same time and which can lead unexpected result (e.g. very huge value).

We need prevent the compiler from generating such codes by adding an
ACCESS_ONCE() to make sure vlan->numvtaps were only read once.
Acked-by: NMichael S. Tsirkin <mst@redhat.com>
Signed-off-by: NJason Wang <jasowang@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ed0483fa

29 5月, 2013 1 次提交

net: pass info struct via netdevice notifier · 351638e7

由 Jiri Pirko 提交于 5月 28, 2013

So far, only net_device * could be passed along with netdevice notifier
event. This patch provides a possibility to pass custom structure
able to provide info that event listener needs to know.
Signed-off-by: NJiri Pirko <jiri@resnulli.us>

v2->v3: fix typo on simeth
	shortened dev_getter
	shortened notifier_info struct name
v1->v2: fix notifier_call parameter in call_netdevice_notifier()
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

351638e7

28 3月, 2013 1 次提交

net: switch to use skb_probe_transport_header() · 40893fd0

由 Jason Wang 提交于 3月 26, 2013

Switch to use the new help skb_probe_transport_header() to do the l4 header
probing for untrusted sources. For packets with partial csum, the header should
already been set by skb_partial_csum_set().

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: NJason Wang <jasowang@redhat.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

40893fd0

27 3月, 2013 1 次提交

macvtap: set transport header before passing skb to lower device · 9b4d669b

由 Jason Wang 提交于 3月 25, 2013

Set the transport header for 1) some drivers (e.g ixgbe) needs l4 header 2)
precise packet length estimation (introduced in 1def9238) needs l4 header to
compute header length.

For the packets with partial checksum, the patch just set the transport header
to csum_start. Otherwise tries to use skb_flow_dissect() to get l4 offset, if it
fails, just pretend no l4 header.

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: NJason Wang <jasowang@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9b4d669b

28 2月, 2013 1 次提交

macvtap: convert to idr_alloc() · ec09ebc1

由 Tejun Heo 提交于 2月 27, 2013

Convert to the much saner new idr interface.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

ec09ebc1

14 2月, 2013 1 次提交

net: Fix possible wrong checksum generation. · c9af6db4

由 Pravin B Shelar 提交于 2月 11, 2013

Patch cef401de (net: fix possible wrong checksum
generation) fixed wrong checksum calculation but it broke TSO by
defining new GSO type but not a netdev feature for that type.
net_gso_ok() would not allow hardware checksum/segmentation
offload of such packets without the feature.

Following patch fixes TSO and wrong checksum. This patch uses
same logic that Eric Dumazet used. Patch introduces new flag
SKBTX_SHARED_FRAG if at least one frag can be modified by
the user. but SKBTX_SHARED_FRAG flag is kept in skb shared
info tx_flags rather than gso_type.

tx_flags is better compared to gso_type since we can have skb with
shared frag without gso packet. It does not link SHARED_FRAG to
GSO, So there is no need to define netdev feature for this.
Signed-off-by: NPravin B Shelar <pshelar@nicira.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c9af6db4

28 1月, 2013 1 次提交

net: fix possible wrong checksum generation · cef401de

由 Eric Dumazet 提交于 1月 25, 2013

Pravin Shelar mentioned that GSO could potentially generate
wrong TX checksum if skb has fragments that are overwritten
by the user between the checksum computation and transmit.

He suggested to linearize skbs but this extra copy can be
avoided for normal tcp skbs cooked by tcp_sendmsg().

This patch introduces a new SKB_GSO_SHARED_FRAG flag, set
in skb_shinfo(skb)->gso_type if at least one frag can be
modified by the user.

Typical sources of such possible overwrites are {vm}splice(),
sendfile(), and macvtap/tun/virtio_net drivers.

Tested:

$ netperf -H 7.7.8.84
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
7.7.8.84 () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00    3959.52

$ netperf -H 7.7.8.84 -t TCP_SENDFILE
TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 ()
port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00    3216.80

Performance of the SENDFILE is impacted by the extra allocation and
copy, and because we use order-0 pages, while the TCP_STREAM uses
bigger pages.
Reported-by: NPravin Shelar <pshelar@nicira.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cef401de

13 8月, 2012 1 次提交

macvtap: rcu_dereference outside read-lock section · 3a7f8c34

由 Denis Efremov 提交于 8月 11, 2012

rcu_dereference occurs in update section. Replacement by
rcu_dereference_protected in order to prevent lockdep
complaint.

Found by Linux Driver Verification project (linuxtesting.org)
Signed-off-by: NDenis Efremov <yefremov.denis@gmail.com>
Acked-by: NMichael S. Tsirkin <mst@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3a7f8c34

08 6月, 2012 1 次提交

macvtap: use prepare_to_wait/finish_wait to ensure mb · ccf7e72b

由 Hong zhi guo 提交于 6月 06, 2012

instead of raw assignment to current->state
Signed-off-by: NHong Zhiguo <honkiko@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ccf7e72b

12 5月, 2012 1 次提交

macvtap: restore vlan header on user read · f09e2249

由 Basil Gor 提交于 5月 03, 2012

Ethernet vlan header is not on the packet and kept in the skb->vlan_tci
when it comes from lower dev. This patch inserts vlan header in user
buffer during skb copy on user read.
Signed-off-by: NBasil Gor <basil.gor@gmail.com>
Acked-by: NMichael S. Tsirkin <mst@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f09e2249

02 5月, 2012 5 次提交

macvtap: zerocopy: validate vectors before building skb · b92946e2

由 Jason Wang 提交于 5月 02, 2012

There're several reasons that the vectors need to be validated:

- Return error when caller provides vectors whose num is greater than UIO_MAXIOV.
- Linearize part of skb when userspace provides vectors grater than MAX_SKB_FRAGS.
- Return error when userspace provides vectors whose total length may exceed
- MAX_SKB_FRAGS * PAGE_SIZE.
Signed-off-by: NJason Wang <jasowang@redhat.com>
Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>

b92946e2

macvtap: zerocopy: set SKBTX_DEV_ZEROCOPY only when skb is built successfully · 01d6657b

由 Jason Wang 提交于 5月 02, 2012

Current the SKBTX_DEV_ZEROCOPY is set unconditionally after
zerocopy_sg_from_iovec(), this would lead NULL pointer when macvtap
fails to build zerocopy skb because destructor_arg was not
initialized. Solve this by set this flag after the skb were built
successfully.
Signed-off-by: NJason Wang <jasowang@redhat.com>
Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>

01d6657b

macvtap: zerocopy: put page when fail to get all requested user pages · 02ce04bb

由 Jason Wang 提交于 5月 02, 2012

When get_user_pages_fast() fails to get all requested pages, we could not use
kfree_skb() to free it as it has not been put in the skb fragments. So we need
to call put_page() instead.
Signed-off-by: NJason Wang <jasowang@redhat.com>
Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>

02ce04bb

macvtap: zerocopy: fix truesize underestimation · 4ef67ebe

由 Jason Wang 提交于 5月 02, 2012

As the skb fragment were pinned/built from user pages, we should
account the page instead of length for truesize.
Signed-off-by: NJason Wang <jasowang@redhat.com>
Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>

4ef67ebe

macvtap: zerocopy: fix offset calculation when building skb · 3afc9621

由 Jason Wang 提交于 5月 02, 2012

This patch fixes the offset calculation when building skb:

- offset1 were used as skb data offset not vector offset
- reset offset to zero only when we advance to next vector
Signed-off-by: NJason Wang <jasowang@redhat.com>
Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>

3afc9621

14 2月, 2012 1 次提交

security: trim security.h · 40401530

由 Al Viro 提交于 2月 13, 2012

Trim security.h
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NJames Morris <jmorris@namei.org>

40401530

21 12月, 2011 1 次提交

macvtap: Fix macvtap_get_queue to use rxhash first · ef0002b5

由 Krishna Kumar 提交于 11月 23, 2011

It was reported that the macvtap device selects a
Acked-by: NMichael S. Tsirkin <mst@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ef0002b5

24 11月, 2011 1 次提交

net: treewide use of RCU_INIT_POINTER · 2cfa5a04

由 Eric Dumazet 提交于 11月 23, 2011

rcu_assign_pointer(ptr, NULL) can be safely replaced by
RCU_INIT_POINTER(ptr, NULL)

(old rcu_assign_pointer() macro was testing the NULL value and could
omit the smp_wmb(), but this had to be removed because of compiler
warnings)
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2cfa5a04

21 10月, 2011 5 次提交

macvtap: Fix the minor device number allocation · e09eff7f

由 Eric W. Biederman 提交于 10月 20, 2011

On systems that create and delete lots of dynamic devices the
31bit linux ifindex fails to fit in the 16bit macvtap minor,
resulting in unusable macvtap devices.  I have systems running
automated tests that that hit this condition in just a few days.

Use a linux idr allocator to track which mavtap minor numbers
are available and and to track the association between macvtap
minor numbers and macvtap network devices.

Remove the unnecessary unneccessary check to see if the network
device we have found is indeed a macvtap device.  With macvtap
specific data structures it is impossible to find any other
kind of networking device.

Increase the macvtap minor range from 65536 to the full 20 bits
that is supported by linux device numbers.  It doesn't solve the
original problem but there is no penalty for a larger minor
device range.
Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e09eff7f

macvtap: Rewrite macvtap_newlink so the error handling works. · 9bf1907f

由 Eric W. Biederman 提交于 10月 20, 2011

Place macvlan_common_newlink at the end of macvtap_newlink because
failing in newlink after registering your network device is not
supported.

Move device_create into a netdevice creation notifier.   The network device
notifier is the only hook that is called after the network device has been
registered with the device layer and before register_network_device returns
success.
Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9bf1907f

macvtap: Don't leak unreceived packets when we delete a macvtap device. · 2259fef0

由 Eric W. Biederman 提交于 10月 20, 2011

To avoid leaking packets in the receive queue.  Add a socket destructor
that will run whenever destroy a macvtap socket.
Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2259fef0

macvtap: Fix macvtap_open races in the zero copy enable code. · 047af9cf

由 Eric W. Biederman 提交于 10月 20, 2011

To see if it is appropriate to enable the macvtap zero copy feature
don't test the lowerdev network device flags. Instead test the
macvtap network device flags which are a direct copy of the lowerdev
flags. This is important because nothing holds a reference to lowerdev
and on a very bad day we lowerdev could be a pointer to stale memory.
Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

047af9cf

macvtap: Close a race between macvtap_open and macvtap_dellink. · 99f34b38

由 Eric W. Biederman 提交于 10月 20, 2011

There is a small window in macvtap_open between looking up a
networking device and calling macvtap_set_queue in which
macvtap_del_queues called from macvtap_dellink.   After
calling macvtap_del_queues it is totally incorrect to
allow macvtap_set_queue to proceed so prevent success by
reporting that all of the available queues are in use.
Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

99f34b38

21 9月, 2011 1 次提交

macvtap: fix the uninitialized var using in macvtap_alloc_skb() · 653fc915

由 Jason Wang 提交于 9月 18, 2011

Commit d1b08284 use new frag API but would leave f to be used
uninitialized, this patch fix it.
Signed-off-by: NJason Wang <jasowang@redhat.com>
Acked-by: NMichael S. Tsirkin <mst@redhat.com>
Acked-by: NIan Campbell <ian.campbell@citrix.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

653fc915

16 9月, 2011 1 次提交

macvtap: convert to SKB paged frag API. · d1b08284

由 Ian Campbell 提交于 8月 31, 2011

Signed-off-by: NIan Campbell <ian.campbell@citrix.com>
Cc: netdev@vger.kernel.org
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d1b08284

07 7月, 2011 1 次提交

macvtap: macvtapTX zero-copy support · 97bc3633

由 Shirley Ma 提交于 7月 06, 2011

Only 128 bytes is copied, the rest of data is DMA mapped directly from
userspace.
Signed-off-by: NShirley Ma <xma@...ibm.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

97bc3633

12 6月, 2011 1 次提交

virtio_net: introduce VIRTIO_NET_HDR_F_DATA_VALID · 10a8d94a

由 Jason Wang 提交于 6月 10, 2011

There's no need for the guest to validate the checksum if it have been
validated by host nics. So this patch introduces a new flag -
VIRTIO_NET_HDR_F_DATA_VALID which is used to bypass the checksum
examing in guest. The backend (tap/macvtap) may set this flag when
met skbs with CHECKSUM_UNNECESSARY to save cpu utilization.

No feature negotiation is needed as old driver just ignore this flag.

Iperf shows 12%-30% performance improvement for UDP traffic. For TCP,
when gro is on no difference as it produces skb with partial
checksum. But when gro is disabled, 20% or even higher improvement
could be measured by netperf.
Signed-off-by: NJason Wang <jasowang@redhat.com>
Acked-by: NMichael S. Tsirkin <mst@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

10a8d94a

08 3月, 2011 1 次提交

drivers/net/macvtap: fix error check · ce3c8692

由 Nicolas Kaiser 提交于 3月 04, 2011

'len' is unsigned of type size_t and can't be negative.
Signed-off-by: NNicolas Kaiser <nikai@nikai.net>
Acked-by: NArnd Bergmann <arnd@arndb.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ce3c8692

28 1月, 2011 1 次提交

drivers/net: remove some rcu sparse warnings · 13707f9e

由 Eric Dumazet 提交于 1月 26, 2011

Add missing __rcu annotations and helpers.
minor : Fix some rcu_dereference() calls in macvtap
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Acked-by: NArnd Bergmann <arnd@arndb.de>
Acked-by: NMichael Chan <mchan@broadcom.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

13707f9e

14 1月, 2011 1 次提交

net: remove dev_txq_stats_fold() · 1ac9ad13

由 Eric Dumazet 提交于 1月 12, 2011

After recent changes, (percpu stats on vlan/tunnels...), we dont need
anymore per struct netdev_queue tx_bytes/tx_packets/tx_dropped counters.

Only remaining users are ixgbe, sch_teql, gianfar & macvlan :

1) ixgbe can be converted to use existing tx_ring counters.

2) macvlan incremented txq->tx_dropped, it can use the
dev->stats.tx_dropped counter.

3) sch_teql : almost revert ab35cd4b (Use net_device internal stats)
    Now we have ndo_get_stats64(), use it, even for "unsigned long"
fields (No need to bring back a struct net_device_stats)

4) gianfar adds a stats structure per tx queue to hold
tx_bytes/tx_packets

This removes a lockdep warning (and possible lockup) in rndis gadget,
calling dev_get_stats() from hard IRQ context.

Ref: http://www.spinics.net/lists/netdev/msg149202.htmlReported-by: NNeil Jones <neiljay@gmail.com>
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
CC: Jarek Poplawski <jarkao2@gmail.com>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
CC: Sandeep Gopalpet <sandeep.kumar@freescale.com>
CC: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1ac9ad13

17 12月, 2010 1 次提交

net: Use skb_checksum_start_offset() · 55508d60

由 Michał Mirosław 提交于 12月 14, 2010

Replace skb->csum_start - skb_headroom(skb) with skb_checksum_start_offset().

Note for usb/smsc95xx: skb->data - skb->head == skb_headroom(skb).
Signed-off-by: NMichał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

55508d60

17 8月, 2010 1 次提交

macvtap: Implement multiqueue for macvtap driver · 1565c7c1

由 Krishna Kumar 提交于 8月 04, 2010

Implement multiqueue facility for macvtap driver. The idea is that
a macvtap device can be opened multiple times and the fd's can be
used to register eg, as backend for vhost.
Signed-off-by: NKrishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1565c7c1

23 7月, 2010 1 次提交

macvtap: Limit packet queue length · 8a35747a

由 Herbert Xu 提交于 7月 21, 2010

Mark Wagner reported OOM symptoms when sending UDP traffic over
a macvtap link to a kvm receiver.

This appears to be caused by the fact that macvtap packet queues
are unlimited in length.  This means that if the receiver can't
keep up with the rate of flow, then we will hit OOM. Of course
it gets worse if the OOM killer then decides to kill the receiver.

This patch imposes a cap on the packet queue length, in the same
way as the tuntap driver, using the device TX queue length.

Please note that macvtap currently has no way of giving congestion
notification, that means the software device TX queue cannot be
used and packets will always be dropped once the macvtap driver
queue fills up.

This shouldn't be a great problem for the scenario where macvtap
is used to feed a kvm receiver, as the traffic is most likely
external in origin so congestion notification can't be applied
anyway.

Of course, if anybody decides to complain about guest-to-guest
UDP packet loss down the track, then we may have to revisit this.

Incidentally, this patch also fixes a real memory leak when
macvtap_get_queue fails.

Chris Wright noticed that for this patch to work, we need a
non-zero TX queue length.  This patch includes his work to change
the default macvtap TX queue length to 500.
Reported-by: NMark Wagner <mwagner@redhat.com>
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Acked-by: NChris Wright <chrisw@sous-sol.org>
Acked-by: NArnd Bergmann <arnd@arndb.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8a35747a

11 7月, 2010 1 次提交

macvtap: Use dev_t for macvtap_major. · 1ebed71a

由 David S. Miller 提交于 7月 10, 2010

Reported-by: N"Robert P. J. Day" <rpjday@crashcourse.ca>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1ebed71a

04 5月, 2010 1 次提交

macvtap: add ioctl to modify vnet header size · 55afbd08

由 Michael S. Tsirkin 提交于 4月 29, 2010

This adds TUNSETVNETHDRSZ/TUNGETVNETHDRSZ support
to macvtap.
Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
Acked-by: NArnd Bergmann <arnd@arndb.de>
Acked-by: NDavid S. Miller <davem@davemloft.net>

55afbd08

02 5月, 2010 1 次提交

net: sock_def_readable() and friends RCU conversion · 43815482

由 Eric Dumazet 提交于 4月 29, 2010

sk_callback_lock rwlock actually protects sk->sk_sleep pointer, so we
need two atomic operations (and associated dirtying) per incoming
packet.

RCU conversion is pretty much needed :

1) Add a new structure, called "struct socket_wq" to hold all fields
that will need rcu_read_lock() protection (currently: a
wait_queue_head_t and a struct fasync_struct pointer).

[Future patch will add a list anchor for wakeup coalescing]

2) Attach one of such structure to each "struct socket" created in
sock_alloc_inode().

3) Respect RCU grace period when freeing a "struct socket_wq"

4) Change sk_sleep pointer in "struct sock" by sk_wq, pointer to "struct
socket_wq"

5) Change sk_sleep() function to use new sk->sk_wq instead of
sk->sk_sleep

6) Change sk_has_sleeper() to wq_has_sleeper() that must be used inside
a rcu_read_lock() section.

7) Change all sk_has_sleeper() callers to :
  - Use rcu_read_lock() instead of read_lock(&sk->sk_callback_lock)
  - Use wq_has_sleeper() to eventually wakeup tasks.
  - Use rcu_read_unlock() instead of read_unlock(&sk->sk_callback_lock)

8) sock_wake_async() is modified to use rcu protection as well.

9) Exceptions :
  macvtap, drivers/net/tun.c, af_unix use integrated "struct socket_wq"
instead of dynamically allocated ones. They dont need rcu freeing.

Some cleanups or followups are probably needed, (possible
sk_callback_lock conversion to a spinlock for example...).
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

43815482

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功