Commits · 3b3a5b0aab5b9ad345d4beb9a364a7dd02c23d40 · openeuler / Kernel

14 May, 2015 5 commits

packet: rollover huge flows before small flows · 3b3a5b0a

Committed by Willem de Bruijn on May 12, 2015

Migrate flows from a socket to another socket in the fanout group not
only when the socket is full. Start migrating huge flows early, to
divert possible 4-tuple attacks without affecting normal traffic.

Introduce fanout_flow_is_huge(). This detects huge flows, which are
defined as taking up more than half the load. It does so cheaply, by
storing the rxhashes of the N most recent packets. If over half of
these are the same rxhash as the current packet, then drop it. This
only protects against 4-tuple attacks. N is chosen to fit all data in
a single cache line.
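
The detection described above can be sketched in user-space C. This is a minimal illustration, not the kernel code: `struct flow_history`, `flow_is_huge()`, the 64-byte cache line and the round-robin overwrite slot are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed 64-byte cache line: room for 16 hashes of 4 bytes each. */
#define HISTORY_LEN (64 / sizeof(uint32_t))

struct flow_history {
    uint32_t hashes[HISTORY_LEN]; /* rxhashes of recently seen packets */
    size_t next;                  /* slot to overwrite next (hypothetical) */
};

/*
 * Return true if more than half of the recorded hashes equal rxhash,
 * then record rxhash in the history.
 */
bool flow_is_huge(struct flow_history *h, uint32_t rxhash)
{
    size_t i, count = 0;

    assert(h != NULL);
    for (i = 0; i < HISTORY_LEN; i++)
        if (h->hashes[i] == rxhash)
            count++;

    h->hashes[h->next] = rxhash;
    h->next = (h->next + 1) % HISTORY_LEN;

    return count > HISTORY_LEN / 2;
}
```

With a zeroed history, a single flow starts being reported as huge once more than half of the slots hold its hash. A hash seen only occasionally never reaches that threshold.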

Tested:
Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input.

lpbb5:/export/hda3/willemb# ./bench_rollover -l 1000 -r -s
cpu         rx       rx.k     drop.k   rollover     r.huge   r.failed
  0         14         14          0          0          0          0
  1         20         20          0          0          0          0
  2         16         16          0          0          0          0
  3    6168824    6168824          0    4867721    4867721          0
  4    4867741    4867741          0          0          0          0
  5         12         12          0          0          0          0
  6         15         15          0          0          0          0
  7         17         17          0          0          0          0

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3b3a5b0a

packet: rollover lock contention avoidance · 2ccdbaa6

Committed by Willem de Bruijn on May 12, 2015

Rollover has to call packet_rcv_has_room on sockets in the fanout
group to find a socket to migrate to. This operation is expensive
especially if the packet sockets use rings, when a lock has to be
acquired.

Avoid pounding on the lock by all sockets by temporarily marking a
socket as "under memory pressure" when such pressure is detected.
While set, only the socket owner may call packet_rcv_has_room on the
socket. Once it detects normal conditions, it clears the flag. The
socket is not used as a victim by any other socket in the meantime.

Under reasonably balanced load, each socket writer frequently calls
packet_rcv_has_room and clears its own pressure field. As a backup
for when the socket is rarely written to, also clear the flag on
reading (packet_recvmsg, packet_poll) if this can be done cheaply
(i.e., without calling packet_rcv_has_room). This is only for
edge cases.
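
The pressure flag can be sketched as follows. This is a simplified, single-threaded model under stated assumptions: `struct demo_sock`, `demo_has_room()` and `demo_pick_victim()` are hypothetical names, the `probes` counter exists only to make the saved work visible, and the real code additionally clears the flag on the read path.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_sock {
    size_t used;          /* bytes queued (stand-in for real buffer state) */
    size_t capacity;      /* total buffer size */
    atomic_bool pressure; /* set while the socket is known to be full */
    unsigned probes;      /* number of expensive room checks, for illustration */
};

/* Expensive check: in the kernel this may need the receive queue lock. */
bool demo_has_room(struct demo_sock *s)
{
    bool has_room;

    assert(s != NULL);
    s->probes++;
    has_room = s->used < s->capacity;
    /* Record the latest observation so other sockets can skip the probe. */
    atomic_store(&s->pressure, !has_room);
    return has_room;
}

/* Rollover path: skip sockets flagged as under pressure without probing. */
struct demo_sock *demo_pick_victim(struct demo_sock *socks, size_t n, size_t self)
{
    size_t i;

    assert(socks != NULL && self < n);
    for (i = 1; i < n; i++) {
        struct demo_sock *cand = &socks[(self + i) % n];

        if (atomic_load(&cand->pressure))
            continue; /* cheap flag read, no lock taken */
        if (demo_has_room(cand))
            return cand;
    }
    return NULL;
}
```

Once a full socket has been probed, later rollover attempts read only the flag. The flag is cleared again when the owner calls `demo_has_room()` after draining its queue.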

Tested:
  Ran bench_rollover: a process with 8 sockets in a single fanout
  group, each pinned to a single cpu that receives one nic recv
  interrupt. RPS and RFS are disabled. The benchmark uses packet
  rx_ring, which has to take a lock when determining whether a
  socket has room.

  Sent 3.5 Mpps of UDP traffic with sufficient entropy to spread
  uniformly across the packet sockets (and inserted an iptables
  rule to drop in PREROUTING to avoid protocol stack processing).

  Without this patch, all sockets try to migrate traffic to
  neighbors, causing lock contention when searching for a
  non-empty neighbor. The lock accounts for the top 9 entries.

    perf record -a -g sleep 5

    -  17.82%   bench_rollover  [kernel.kallsyms]    [k] _raw_spin_lock
       - _raw_spin_lock
          - 99.00% spin_lock
    	 + 81.77% packet_rcv_has_room.isra.41
    	 + 18.23% tpacket_rcv
          + 0.84% packet_rcv_has_room.isra.41
    +   5.20%      ksoftirqd/6  [kernel.kallsyms]    [k] _raw_spin_lock
    +   5.15%      ksoftirqd/1  [kernel.kallsyms]    [k] _raw_spin_lock
    +   5.14%      ksoftirqd/2  [kernel.kallsyms]    [k] _raw_spin_lock
    +   5.12%      ksoftirqd/7  [kernel.kallsyms]    [k] _raw_spin_lock
    +   5.12%      ksoftirqd/5  [kernel.kallsyms]    [k] _raw_spin_lock
    +   5.10%      ksoftirqd/4  [kernel.kallsyms]    [k] _raw_spin_lock
    +   4.66%      ksoftirqd/0  [kernel.kallsyms]    [k] _raw_spin_lock
    +   4.45%      ksoftirqd/3  [kernel.kallsyms]    [k] _raw_spin_lock
    +   1.55%   bench_rollover  [kernel.kallsyms]    [k] packet_rcv_has_room.isra.41

  On net-next with this patch, this lock contention is no longer a
  top entry. Most time is spent in the actual read function. Next up
  are other locks:

    +  15.52%  bench_rollover  bench_rollover     [.] reader
    +   4.68%         swapper  [kernel.kallsyms]  [k] memcpy_erms
    +   2.77%         swapper  [kernel.kallsyms]  [k] packet_lookup_frame.isra.51
    +   2.56%     ksoftirqd/1  [kernel.kallsyms]  [k] memcpy_erms
    +   2.16%         swapper  [kernel.kallsyms]  [k] tpacket_rcv
    +   1.93%         swapper  [kernel.kallsyms]  [k] mlx4_en_process_rx_cq

  Looking closer at the remaining _raw_spin_lock, the cost of probing
  in rollover is now comparable to the cost of taking the lock later
  in tpacket_rcv.

    -   1.51%         swapper  [kernel.kallsyms]  [k] _raw_spin_lock
       - _raw_spin_lock
          + 33.41% packet_rcv_has_room
          + 28.15% tpacket_rcv
          + 19.54% enqueue_to_backlog
          + 6.45% __free_pages_ok
          + 2.78% packet_rcv_fanout
          + 2.13% fanout_demux_rollover
          + 2.01% netif_receive_skb_internal

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2ccdbaa6

packet: rollover only to socket with headroom · 9954729b

Committed by Willem de Bruijn on May 12, 2015

Only migrate flows to sockets that have sufficient headroom, where
sufficient is defined as having at least 25% empty space.

The kernel has three different buffer types: a regular socket, a ring
with frames (TPACKET_V[12]) or a ring with blocks (TPACKET_V3). The
latter two do not expose a read pointer to the kernel, so headroom is
not computed easily. All three need a different implementation to
estimate free space.
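
For the simplest of the three cases, a regular socket buffer, the 25% rule can be sketched as below. `struct demo_rcvbuf` and `demo_has_headroom()` are hypothetical names and the byte accounting is simplified. The two ring variants need their own estimates and are not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a regular (non-ring) socket receive buffer. */
struct demo_rcvbuf {
    int rcvbuf;     /* configured buffer size in bytes */
    int rmem_alloc; /* bytes currently charged to the buffer */
};

/*
 * True if at least 25% of the buffer would still be free after
 * queuing a packet of pkt_size bytes.
 */
bool demo_has_headroom(const struct demo_rcvbuf *b, int pkt_size)
{
    int avail;

    assert(b != NULL && b->rcvbuf > 0);
    avail = b->rcvbuf - b->rmem_alloc - pkt_size;
    return avail >= b->rcvbuf / 4;
}
```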

Tested:
Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input.

bench_rollover has as many sockets as there are NIC receive queues
in the system. Each socket is owned by a process that is pinned to
one of the receive cpus. RFS is disabled. RPS is enabled with an
identity mapping (cpu x -> cpu x), to count drops with softnettop.

lpbb5:/export/hda3/willemb# ./bench_rollover -r -l 1000 -s
Press [Enter] to exit

cpu         rx       rx.k     drop.k   rollover     r.huge   r.failed
  0         16         16          0          0          0          0
  1         21         21          0          0          0          0
  2    5227502    5227502          0          0          0          0
  3         18         18          0          0          0          0
  4    6083289    6083289          0    5227496          0          0
  5         22         22          0          0          0          0
  6         21         21          0          0          0          0
  7          9          9          0          0          0          0

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9954729b

packet: rollover prepare: per-socket state · 0648ab70

Committed by Willem de Bruijn on May 12, 2015

Replace rollover state per fanout group with state per socket. Future
patches will add fields to the new structure.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0648ab70

packet: rollover prepare: move code out of callsites · ad377cab

Committed by Willem de Bruijn on May 12, 2015

packet_rcv_fanout calls fanout_demux_rollover twice. Move all rollover
logic into the callee to simplify these callsites, especially with
upcoming changes.

The main difference between the two callsites is that the FLAG
variant tests whether the socket previously selected by another
mode (RR, RND, HASH, ..) has room before migrating flows, whereas the
rollover mode has no original socket to test.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ad377cab
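
The difference between the two callsites in ad377cab can be sketched with one function and a `try_self` switch. This is a simplified model, not the kernel function: `demo_rollover()` is a hypothetical name, the `has_room` array stands in for packet_rcv_has_room, and the choice to fall back to `idx` when no socket has room is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Pick the socket index that should receive a packet.
 * has_room[i] says whether socket i can accept it.
 * try_self = true models the FLAG variant, which first tests the
 * socket idx chosen by another mode. try_self = false models the
 * rollover mode, which has no original socket to test.
 */
size_t demo_rollover(const bool *has_room, size_t num, size_t idx, bool try_self)
{
    size_t i;

    assert(has_room != NULL && num > 0 && idx < num);

    if (try_self && has_room[idx])
        return idx;

    /* Walk the other sockets in ring order, starting after idx. */
    for (i = 1; i < num; i++) {
        size_t cand = (idx + i) % num;

        if (has_room[cand])
            return cand;
    }
    return idx; /* nobody has room: keep the original choice */
}
```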

11 May, 2015 2 commits

net: Pass kern from net_proto_family.create to sk_alloc · 11aa9c28

Committed by Eric W. Biederman on May 08, 2015

In preparation for changing how struct net is refcounted
on kernel sockets, pass the knowledge that we are creating
a kernel socket from sock_create_kern through to sk_alloc.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

11aa9c28

af_packet / TX_RING not fully non-blocking (w/ MSG_DONTWAIT). · fbf33a28

Committed by Kretschmer, Mathias on May 08, 2015

This patch fixes an issue where the send(MSG_DONTWAIT) call
on a TX_RING is not fully non-blocking in cases where the device's sndBuf is
full. We pass nonblock=true to sock_alloc_send_skb() and return any
error code that occurs (most likely EAGAIN) to the caller. As the fast path
stays as it is, we keep the unlikely() around skb == NULL.

Signed-off-by: Mathias Kretschmer <mathias.kretschmer@fokus.fraunhofer.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

fbf33a28
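
From the caller's side, the behavior fixed in fbf33a28 means a send with MSG_DONTWAIT has to be prepared for EAGAIN. The sketch below is generic socket code, not specific to TX_RING. `demo_send_nowait()` and its result enum are hypothetical names. For a TX_RING socket the caller would typically pass a NULL buffer and length 0, since the frames live in the mapped ring.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

enum demo_send_result { DEMO_SENT, DEMO_WOULD_BLOCK, DEMO_ERROR };

/* Attempt one non-blocking send and classify the outcome. */
enum demo_send_result demo_send_nowait(int fd, const void *buf, size_t len)
{
    ssize_t n;

    assert(fd >= 0);
    n = send(fd, buf, len, MSG_DONTWAIT);
    if (n >= 0)
        return DEMO_SENT;
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return DEMO_WOULD_BLOCK; /* retry later, e.g. after poll() */
    return DEMO_ERROR;
}
```

A caller that receives `DEMO_WOULD_BLOCK` can wait for writability with poll() and retry instead of stalling inside send().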

24 Mar, 2015 2 commits

af_packet: pass checksum validation status to the user · 682f048b

Committed by Alexander Drozdov on Mar 23, 2015

Introduce TP_STATUS_CSUM_VALID tp_status flag to tell the
af_packet user that at least the transport header checksum
has already been validated.

For now, the flag may be set for incoming packets only.

Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

682f048b
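
A ring reader could use the flag from 682f048b roughly as sketched below. The `DEMO_` constants are local copies believed to mirror the values in `<linux/if_packet.h>`, redefined so that the sketch compiles without kernel headers. `struct demo_frame_hdr` and `demo_classify_csum()` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local copies of the tp_status bits (assumed to match the uapi header). */
#define DEMO_TP_STATUS_CSUMNOTREADY (1u << 3)
#define DEMO_TP_STATUS_CSUM_VALID   (1u << 7)

/* Minimal stand-in for a ring frame header; only tp_status is modeled. */
struct demo_frame_hdr {
    uint32_t tp_status;
};

enum demo_csum { DEMO_CSUM_UNKNOWN, DEMO_CSUM_NOT_READY, DEMO_CSUM_VALID };

enum demo_csum demo_classify_csum(const struct demo_frame_hdr *hdr)
{
    assert(hdr != NULL);
    if (hdr->tp_status & DEMO_TP_STATUS_CSUM_VALID)
        return DEMO_CSUM_VALID;     /* kernel already validated it */
    if (hdr->tp_status & DEMO_TP_STATUS_CSUMNOTREADY)
        return DEMO_CSUM_NOT_READY; /* checksum not filled in yet */
    return DEMO_CSUM_UNKNOWN;       /* the reader must verify it itself */
}
```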

af_packet: make tpacket_rcv to not set status value before run_filter · 68c2e5de

Committed by Alexander Drozdov on Mar 23, 2015

It is just an optimization. We don't need the value of the status variable
if the packet is filtered.

Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

68c2e5de

10 Mar, 2015 1 commit

net: delete stale packet_mclist entries · 82f17091

Committed by Francesco Ruggeri on Mar 09, 2015

When an interface is deleted from a net namespace the ifindex in the
corresponding entries in PF_PACKET sockets' mclists becomes stale.
This can create inconsistencies if later an interface with the same ifindex
is moved from a different namespace (not that unlikely since ifindexes are
per-namespace).
In particular we saw problems with dev->promiscuity, resulting
in "promiscuity touches roof, set promiscuity failed. promiscuity
feature of device might be broken" warnings and EOVERFLOW failures of
setsockopt(PACKET_ADD_MEMBERSHIP).
This patch deletes the mclist entries for interfaces that are deleted.
Since this now causes setsockopt(PACKET_DROP_MEMBERSHIP) to fail with
EADDRNOTAVAIL if called after the interface is deleted, also make
packet_mc_drop not fail.

Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

82f17091
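
The cleanup in 82f17091 amounts to unlinking every list entry that refers to the deleted interface. A minimal user-space sketch of that list walk, with hypothetical names (`struct demo_mclist`, `demo_mclist_add()`, `demo_mclist_purge()`) and without the promiscuity bookkeeping the kernel performs per entry:

```c
#include <assert.h>
#include <stdlib.h>

struct demo_mclist {
    struct demo_mclist *next;
    int ifindex;
};

/* Push a new entry at the front of the list and return the new head. */
struct demo_mclist *demo_mclist_add(struct demo_mclist *head, int ifindex)
{
    struct demo_mclist *ml = malloc(sizeof(*ml));

    if (ml == NULL)
        abort();
    ml->ifindex = ifindex;
    ml->next = head;
    return ml;
}

/* Unlink and free every entry for ifindex; return how many were removed. */
int demo_mclist_purge(struct demo_mclist **head, int ifindex)
{
    int removed = 0;

    assert(head != NULL);
    while (*head != NULL) {
        struct demo_mclist *ml = *head;

        if (ml->ifindex == ifindex) {
            *head = ml->next; /* unlink without tracking a prev pointer */
            free(ml);
            removed++;
        } else {
            head = &ml->next;
        }
    }
    return removed;
}
```

Purging an index that is no longer present returns 0 rather than an error. This mirrors the commit's decision that packet_mc_drop should not fail after the interface is gone.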

03 Mar, 2015 1 commit

net: Remove iocb argument from sendmsg and recvmsg · 1b784140

Committed by Ying Xue on Mar 02, 2015

Now that TIPC no longer depends on the iocb argument in its internal
implementations of the sendmsg() and recvmsg() hooks defined in the proto
structure, no user of the iocb argument remains.
We can therefore drop the redundant iocb argument from all
implementations of both sendmsg() and recvmsg() in the entire
networking stack.

Cc: Christoph Hellwig <hch@lst.de>
Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1b784140

02 Mar, 2015 3 commits

net: add common accessor for setting dropcount on packets · 3bc3b96f

Committed by Eyal Birger on Mar 01, 2015

As part of an effort to move skb->dropcount to skb->cb[], use
a common function in order to set dropcount in struct sk_buff.

Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3bc3b96f

net: use common macro for asserting skb->cb[] available size in protocol families · b4772ef8

Committed by Eyal Birger on Mar 01, 2015

As part of an effort to move skb->dropcount to skb->cb[], use a common
macro in protocol families using skb->cb[] for ancillary data to
validate available room in skb->cb[].

Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b4772ef8

net: packet: use sockaddr_ll fields as storage for skb original length in recvmsg path · 2472d761

Committed by Eyal Birger on Mar 01, 2015

As part of an effort to move skb->dropcount to skb->cb[], 4 bytes
of additional room are needed in skb->cb[] in packet sockets.

Store the skb original length in the first two fields of sockaddr_ll
(sll_family and sll_protocol) as they can be derived from the skb when
needed.

Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2472d761
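
The space-saving trick in 2472d761 can be sketched with a union. `struct demo_sockaddr_ll` is a local mirror of only the leading fields of sockaddr_ll (16-bit family, 16-bit protocol, 32-bit ifindex, an assumed layout). `union demo_cb` and the two helpers are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the leading sockaddr_ll fields (assumed layout). */
struct demo_sockaddr_ll {
    uint16_t sll_family;
    uint16_t sll_protocol;
    int32_t  sll_ifindex;
    /* remaining fields omitted */
};

union demo_cb {
    uint32_t origlen;           /* valid while the packet is queued */
    struct demo_sockaddr_ll ll; /* valid once the fields are restored */
};

/* The alias only works if origlen ends before sll_ifindex begins. */
_Static_assert(sizeof(uint32_t) <= offsetof(struct demo_sockaddr_ll, sll_ifindex),
               "origlen must not overlap sll_ifindex");

/* Receive path: remember the original length in the aliased bytes. */
void demo_store_origlen(union demo_cb *cb, uint32_t len)
{
    assert(cb != NULL);
    cb->origlen = len;
}

/* recvmsg path: read the length, then rebuild the overwritten fields. */
uint32_t demo_restore_addr(union demo_cb *cb, uint16_t family, uint16_t protocol)
{
    uint32_t len;

    assert(cb != NULL);
    len = cb->origlen;
    cb->ll.sll_family = family;
    cb->ll.sll_protocol = protocol;
    return len;
}
```

The length has to be read before the two fields are rewritten, since all three share the same four bytes.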

25 Feb, 2015 1 commit

af_packet: don't pass empty blocks for PACKET_V3 · 41a50d62

Committed by Alexander Drozdov on Feb 24, 2015

Before da413eec ("packet: Fixed TPACKET V3 to signal poll when block is
closed rather than every packet"), poll listening on an af_packet socket was
not signaled if there were no packets to process. After that patch, poll is
signaled every time the block retire timer expires. That happens because
af_packet closes the current block on timeout even if the block is empty.

Passing empty blocks to the user not only wastes CPU but also wastes ring
buffer space, increasing the probability of packet drops with small timeouts.

Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com>
Cc: Dan Collins <dan@dcollins.co.nz>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Guy Harris <guy@alum.mit.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>

41a50d62

22 Feb, 2015 1 commit

af_packet: allow packets defragmentation not only for hash fanout type · 3f34b24a

Committed by Alexander Drozdov on Feb 20, 2015

Packet defragmentation was introduced for PACKET_FANOUT_HASH only,
see 7736d33f ("packet: Add pre-defragmentation support for ipv4
fanouts").

It may be useful to have defragmentation enabled regardless of
fanout type. Without that, the AF_PACKET user may have to:
1. Collect fragments from different rings
2. Defragment by itself

Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3f34b24a
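
With 3f34b24a, the defragmentation flag can be combined with any fanout mode. The sketch below only builds the option value. The `DEMO_` constants are local copies believed to mirror `<linux/if_packet.h>`, and `demo_fanout_arg()` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the fanout constants (assumed to match the uapi header). */
#define DEMO_PACKET_FANOUT_HASH        0
#define DEMO_PACKET_FANOUT_LB          1
#define DEMO_PACKET_FANOUT_CPU         2
#define DEMO_PACKET_FANOUT_FLAG_DEFRAG 0x8000

/*
 * Build the 32-bit PACKET_FANOUT option value: group id in the low
 * 16 bits, mode plus flags in the high 16 bits.
 */
uint32_t demo_fanout_arg(uint16_t group_id, uint16_t mode, uint16_t flags)
{
    assert(mode <= 0xff); /* the mode occupies the low byte of the upper half */
    return (uint32_t)group_id | ((uint32_t)(mode | flags) << 16);
}
```

The resulting value would be passed to `setsockopt(fd, SOL_PACKET, PACKET_FANOUT, &arg, sizeof(arg))` on each socket that joins the group.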

14 Jan, 2015 1 commit

net: rename vlan_tx_* helpers since "tx" is misleading there · df8a39de

Committed by Jiri Pirko on Jan 13, 2015

The same macros are used for rx as well. So rename them.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>

df8a39de

13 Jan, 2015 1 commit

packet: make packet too small warning match condition · eee2f04b

Committed by Willem de Bruijn on Jan 08, 2015

The expression in ll_header_truncated() tests less than or equal, but
the warning prints less than. Update the warning.

Reported-by: Jouni Malinen <jkmalinen@gmail.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

eee2f04b

12 Jan, 2015 1 commit

packet: bail out of packet_snd() if L2 header creation fails · 46d2cfb1

Committed by Christoph Jaeger on Jan 11, 2015

Due to a misplaced parenthesis, the expression

  (unlikely(offset) < 0),

which expands to

  (__builtin_expect(!!(offset), 0) < 0),

never evaluates to true. Therefore, when sending packets with
PF_PACKET/SOCK_DGRAM, packet_snd() does not abort as intended
if the creation of the layer 2 header fails.

Spotted by Coverity - CID 1259975 ("Operands don't affect result").

Fixes: 9c707762 ("packet: make packet_snd fail on len smaller than l2 header")
Signed-off-by: Christoph Jaeger <cj@linux.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

46d2cfb1
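
The effect of the misplaced parenthesis in 46d2cfb1 can be reproduced in a few lines. The `unlikely()` definition is the expansion quoted in the commit message and relies on a GCC/Clang builtin. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Expansion quoted in the commit message (GCC/Clang builtin). */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Misplaced parenthesis: unlikely(offset) is 0 or 1, never below 0. */
bool demo_check_buggy(int offset)
{
    return unlikely(offset) < 0;
}

/* Intended form: the comparison sits inside the branch hint. */
bool demo_check_fixed(int offset)
{
    return unlikely(offset < 0);
}

/* A failed header creation is modeled as offset = -1. */
void demo_compare(void)
{
    assert(!demo_check_buggy(-1)); /* the error goes unnoticed */
    assert(demo_check_fixed(-1));  /* the error is caught */
    assert(!demo_check_fixed(14)); /* a valid offset passes */
}
```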

23 Dec, 2014 1 commit

packet: Fixed TPACKET V3 to signal poll when block is closed rather than every packet · da413eec

Committed by Dan Collins on Dec 19, 2014

Make TPACKET_V3 signal poll when a block is closed rather than for every
packet. A side effect is that poll will be signaled when the block retire
timer expires, which didn't previously happen. The issue was visible when
sending packets at a very low frequency such that all blocks are retired
before packets are received by TPACKET_V3. This caused avoidable packet
loss. The fix ensures that the signal is sent when blocks are closed,
which covers the normal path where the block is filled as well as the
path where the timer expires. The case where a block is filled without
moving to the next block (i.e. all blocks are full) will still cause poll
to be signaled.

Signed-off-by: Dan Collins <dan@dcollins.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>

da413eec

10 Dec, 2014 1 commit

put iov_iter into msghdr · c0371da6

Committed by Al Viro on Nov 24, 2014

Note that the code _using_ ->msg_iter at that point will be very
unhappy with anything other than unshifted iovec-backed iov_iter.
We still need to convert users to proper primitives.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

c0371da6

09 Dec, 2014 1 commit

af_packet: virtio 1.0 stubs · dc9e5153

Committed by Michael S. Tsirkin on Nov 23, 2014

This merely fixes sparse warnings, without actually
adding support for the new APIs.

Still working out the best way to enable the new
functionality.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

dc9e5153

25 Nov, 2014 1 commit

af_packet: fix sparse warning · 6e58040b

Committed by Michael S. Tsirkin on Nov 24, 2014

af_packet produces lots of these:
	net/packet/af_packet.c:384:39: warning: incorrect type in return expression (different modifiers)
	net/packet/af_packet.c:384:39:    expected struct page [pure] *
	net/packet/af_packet.c:384:39:    got struct page *

This seems to be because sparse does not realize that _pure
refers to the function, not the returned pointer.

Tweak the code slightly to avoid the warning.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6e58040b

24 Nov, 2014 3 commits

switch AF_PACKET and AF_UNIX to skb_copy_datagram_from_iter() · 8feb2fb2

Committed by Al Viro on Nov 06, 2014

... and kill skb_copy_datagram_iovec()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

8feb2fb2

new helper: memcpy_to_msg() · 7eab8d9e

Committed by Al Viro on Apr 06, 2014

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

7eab8d9e

new helper: memcpy_from_msg() · 6ce8e9ce

Committed by Al Viro on Apr 06, 2014

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

6ce8e9ce

22 Nov, 2014 1 commit

packet: make packet_snd fail on len smaller than l2 header · 9c707762

Committed by Willem de Bruijn on Nov 19, 2014

When sending packets out with PF_PACKET, SOCK_RAW, ensure that the
packet is at least as long as the device's expected link layer header.
This check already exists in tpacket_snd, but not in packet_snd.
Also rate limit the warning in tpacket_snd.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9c707762

06 Nov, 2014 1 commit

net: Add and use skb_copy_datagram_msg() helper. · 51f3d02b

Committed by David S. Miller on Nov 05, 2014

This encapsulates all of the skb_copy_datagram_iovec() callers
with call argument signature "skb, offset, msghdr->msg_iov, length".

When we move to iov_iters in the networking, the iov_iter object will
sit in the msghdr.

Having a helper like this means there will be fewer places to touch
during that transformation.

Based upon descriptions and patch from Al Viro.

Signed-off-by: David S. Miller <davem@davemloft.net>

51f3d02b

02 Sep, 2014 2 commits

net: Pass a "more" indication down into netdev_start_xmit() code paths. · fa2dbdc2

Committed by David S. Miller on Aug 29, 2014

For now it will always be false.

Signed-off-by: David S. Miller <davem@davemloft.net>

fa2dbdc2

net: Do txq_trans_update() in netdev_start_xmit() · 10b3ad8c

Committed by David S. Miller on Aug 29, 2014

That way we don't have to audit every call site to make sure it is
doing this properly.

Signed-off-by: David S. Miller <davem@davemloft.net>

10b3ad8c

30 Aug, 2014 1 commit

net: add skb_get_tx_queue() helper · 10c51b56

Committed by Daniel Borkmann on Aug 27, 2014

Replace occurrences of skb_get_queue_mapping() and follow-up
netdev_get_tx_queue() with an actual helper function.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

10c51b56

25 Aug, 2014 1 commit

net: Add ops->ndo_xmit_flush() · 4798248e

Committed by David S. Miller on Aug 22, 2014

Signed-off-by: David S. Miller <davem@davemloft.net>

4798248e

22 Aug, 2014 1 commit

packet: handle too big packets for PACKET_V3 · dc808110

Committed by Eric Dumazet on Aug 15, 2014

af_packet can currently overwrite kernel memory by out of bound
accesses, because it assumed a [new] block can always hold one frame.

This is not generally the case, even if most existing tools do it right.

This patch clamps too long frames as the API permits, and issues a one-time
error on syslog.

[  394.357639] tpacket_rcv: packet too big, clamped from 5042 to 3966. macoff=82

In this example, packet header tp_snaplen was set to 3966,
and tp_len was set to 5042 (skb->len)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: f6fb8f10 ("af-packet: TPACKET_V3 flexible buffer implementation.")
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dc808110
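
The clamping in dc808110 can be sketched as a small pure function. `demo_clamp_snaplen()` is a hypothetical name. The frame limit of 4048 bytes used in the note below is inferred from the quoted log line (3966 + 82) and is not stated in the commit message.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Clamp the captured length so that macoff + snaplen fits within
 * max_frame_len bytes of the block.
 */
uint32_t demo_clamp_snaplen(uint32_t snaplen, uint32_t macoff,
                            uint32_t max_frame_len)
{
    assert(max_frame_len > 0);
    if (macoff >= max_frame_len)
        return 0; /* no room for packet data at all */
    if (snaplen > max_frame_len - macoff)
        return max_frame_len - macoff; /* truncate to what fits */
    return snaplen;
}
```

With the inferred limit, `demo_clamp_snaplen(5042, 82, 4048)` yields the 3966 bytes reported in the log, while tp_len would still carry the original 5042.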

30 Jul, 2014 1 commit

packet: remove deprecated syststamp timestamp · 68a360e8

Committed by Willem de Bruijn on Jul 25, 2014

No device driver will ever return an skb_shared_info structure with
syststamp non-zero, so remove the branch that tests for this and
optionally marks the packet timestamp as TP_STATUS_TS_SYS_HARDWARE.

Do not remove the definition TP_STATUS_TS_SYS_HARDWARE, as processes
may refer to it.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

68a360e8

16 Jul, 2014 1 commit

packet: remove unnecessary break after return · fe8c0f4a

Committed by Fabian Frederick on Jul 14, 2014

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>

fe8c0f4a

12 Apr, 2014 1 commit

net: Fix use after free by removing length arg from sk_data_ready callbacks. · 676d2369

Committed by David S. Miller on Apr 11, 2014

Several spots in the kernel perform a sequence like:

	skb_queue_tail(&sk->sk_receive_queue, skb);
	sk->sk_data_ready(sk, skb->len);

But at the moment we place the SKB onto the socket receive queue it
can be consumed and freed up.  So this skb->len access potentially
touches freed memory.

Furthermore, the skb->len can be modified by the consumer so it is
possible that the value isn't accurate.

And finally, no actual implementation of this callback uses
the length argument.  And since nobody actually cared about its
value, lots of call sites pass arbitrary values in such as '0' and
even '1'.

So just remove the length argument from the callback, that way there
is no confusion whatsoever and all of these use-after-free cases get
fixed as a side effect.

Based upon a patch by Eric Dumazet and his suggestion to audit this
issue tree-wide.

Signed-off-by: David S. Miller <davem@davemloft.net>

676d2369

04 Apr, 2014 2 commits

packet: fix packet_direct_xmit for BQL enabled drivers · 8e2f1a63

Committed by Daniel Borkmann on Apr 02, 2014

Currently, in packet_direct_xmit() we test the assigned netdevice queue
for netif_xmit_frozen_or_stopped() before doing an ndo_start_xmit().

This can have the side-effect that BQL enabled drivers which make use
of netdev_tx_sent_queue() internally, set __QUEUE_STATE_STACK_XOFF from
within the stack and would not fully fill the device's TX ring from
packet sockets with PACKET_QDISC_BYPASS enabled.

Instead, use a test without BQL bit so that bursts can be absorbed
into the NIC's TX ring. Fix and code suggested by Eric Dumazet, thanks!

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8e2f1a63

packet: report tx_dropped in packet_direct_xmit · 0f97ede4

Committed by Daniel Borkmann on Apr 02, 2014

Since commit 015f0688 ("net: net: add a core netdev->tx_dropped
counter"), we can now account for TX drops from within the core
stack instead of drivers.

Therefore, fix packet_direct_xmit() and increase drop count when we
encounter a problem before driver's xmit function was called (we do
not want to doubly account for it).

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0f97ede4

29 Mar, 2014 1 commit

packet: respect devices with LLTX flag in direct xmit · 43279500

Committed by Daniel Borkmann on Mar 27, 2014

Quite often it can be useful to test with dummy or similar
devices as a blackhole sink for skbs. Such devices are only
equipped with a single txq, but marked as NETIF_F_LLTX as
they do not require locking their internal queues on xmit
(or implement locking themselves). Therefore, rather use
HARD_TX_{UN,}LOCK API, so that NETIF_F_LLTX will be respected.

trafgen mmap/TX_RING example against dummy device with config
foo: { fill(0xff, 64) } results in the following performance
improvements for such scenarios on an ordinary Core i7/2.80GHz:

Before:

Performance counter stats for 'trafgen -i foo -o du0 -n100000000' (10 runs):

160,975,944,159 instructions:k # 0.55 insns per cycle ( +- 0.09% )
293,319,390,278 cycles:k # 0.000 GHz ( +- 0.35% )
192,501,104 branch-misses:k ( +- 1.63% )
831 context-switches:k ( +- 9.18% )
7 cpu-migrations:k ( +- 7.40% )
69,382 cache-misses:k # 0.010 % of all cache refs ( +- 2.18% )
671,552,021 cache-references:k ( +- 1.29% )

22.856401569 seconds time elapsed ( +- 0.33% )

After:

Performance counter stats for 'trafgen -i foo -o du0 -n100000000' (10 runs):

133,788,739,692 instructions:k # 0.92 insns per cycle ( +- 0.06% )
145,853,213,256 cycles:k # 0.000 GHz ( +- 0.17% )
59,867,100 branch-misses:k ( +- 4.72% )
384 context-switches:k ( +- 3.76% )
6 cpu-migrations:k ( +- 6.28% )
70,304 cache-misses:k # 0.077 % of all cache refs ( +- 1.73% )
90,879,408 cache-references:k ( +- 1.35% )

11.719372413 seconds time elapsed ( +- 0.24% )

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

43279500

27 Mar, 2014 1 commit

net: Rename skb->rxhash to skb->hash · 61b905da

Committed by Tom Herbert on Mar 24, 2014

The packet hash can be considered a property of the packet, not just
on RX path.

This patch changes name of rxhash and l4_rxhash skbuff fields to be
hash and l4_hash respectively. This includes changing uses of the
field in the code which don't call the access functions.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

61b905da
