1. 25 Apr 2014, 2 commits
  2. 23 Apr 2014, 1 commit
  3. 12 Apr 2014, 1 commit
    • net: Fix use after free by removing length arg from sk_data_ready callbacks. · 676d2369
      Committed by David S. Miller
      Several spots in the kernel perform a sequence like:
      
      	skb_queue_tail(&sk->sk_receive_queue, skb);
      	sk->sk_data_ready(sk, skb->len);
      
      But the moment we place the SKB onto the socket receive queue, it
      can be consumed and freed up. So this skb->len access is
      potentially a read of freed-up memory.
      
      Furthermore, the skb->len can be modified by the consumer so it is
      possible that the value isn't accurate.
      
      And finally, no actual implementation of this callback uses the
      length argument. And since nobody cared about its value, many
      call sites pass in arbitrary values such as '0' or even '1'.
      
      So just remove the length argument from the callback, that way there
      is no confusion whatsoever and all of these use-after-free cases get
      fixed as a side effect.
      
      Based upon a patch by Eric Dumazet and his suggestion to audit this
      issue tree-wide.
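      
      For reference, a minimal sketch of the resulting callback change
      (the old prototype carried the byte count the call sites above
      passed along):
      
      	/* before: length argument, potentially read from a freed skb */
      	void (*sk_data_ready)(struct sock *sk, int bytes);
      	sk->sk_data_ready(sk, skb->len);
      
      	/* after: no length argument, nothing to compute after queueing */
      	void (*sk_data_ready)(struct sock *sk);
      	sk->sk_data_ready(sk);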
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 04 Apr 2014, 2 commits
  5. 29 Mar 2014, 1 commit
    • packet: respect devices with LLTX flag in direct xmit · 43279500
      Committed by Daniel Borkmann
      Quite often it can be useful to test with dummy or similar
      devices as a blackhole sink for skbs. Such devices are only
      equipped with a single txq, but marked as NETIF_F_LLTX as
      they do not require locking their internal queues on xmit
      (or implement locking themselves). Therefore, use the
      HARD_TX_{UN,}LOCK API instead, so that NETIF_F_LLTX is respected.
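      
      A condensed sketch of the resulting direct-xmit sequence
      (paraphrased; ops stands for dev->netdev_ops and txq for the
      selected tx queue):
      
      	local_bh_disable();
      	/* HARD_TX_LOCK skips the txq lock for NETIF_F_LLTX devices */
      	HARD_TX_LOCK(dev, txq, smp_processor_id());
      	if (!netif_xmit_frozen_or_stopped(txq)) {
      		ret = ops->ndo_start_xmit(skb, dev);
      		if (ret == NETDEV_TX_OK)
      			txq_trans_update(txq);
      	}
      	HARD_TX_UNLOCK(dev, txq);
      	local_bh_enable();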
      
      trafgen mmap/TX_RING example against dummy device with config
      foo: { fill(0xff, 64) } results in the following performance
      improvements for such scenarios on an ordinary Core i7/2.80GHz:
      
      Before:
      
       Performance counter stats for 'trafgen -i foo -o du0 -n100000000' (10 runs):
      
         160,975,944,159 instructions:k            #    0.55  insns per cycle          ( +-  0.09% )
         293,319,390,278 cycles:k                  #    0.000 GHz                      ( +-  0.35% )
             192,501,104 branch-misses:k                                               ( +-  1.63% )
                     831 context-switches:k                                            ( +-  9.18% )
                       7 cpu-migrations:k                                              ( +-  7.40% )
                  69,382 cache-misses:k            #    0.010 % of all cache refs      ( +-  2.18% )
             671,552,021 cache-references:k                                            ( +-  1.29% )
      
            22.856401569 seconds time elapsed                                          ( +-  0.33% )
      
      After:
      
       Performance counter stats for 'trafgen -i foo -o du0 -n100000000' (10 runs):
      
         133,788,739,692 instructions:k            #    0.92  insns per cycle          ( +-  0.06% )
         145,853,213,256 cycles:k                  #    0.000 GHz                      ( +-  0.17% )
              59,867,100 branch-misses:k                                               ( +-  4.72% )
                     384 context-switches:k                                            ( +-  3.76% )
                       6 cpu-migrations:k                                              ( +-  6.28% )
                  70,304 cache-misses:k            #    0.077 % of all cache refs      ( +-  1.73% )
              90,879,408 cache-references:k                                            ( +-  1.35% )
      
            11.719372413 seconds time elapsed                                          ( +-  0.24% )
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 27 Mar 2014, 1 commit
  7. 01 Mar 2014, 1 commit
    • packet: allow to transmit +4 byte in TX_RING slot for VLAN case · 52f1454f
      Committed by Daniel Borkmann
      Commit 57f89bfa ("network: Allow af_packet to transmit +4 bytes
      for VLAN packets.") added the possibility for non-mmaped frames
      to carry an extra 4 bytes for the VLAN header, so the MTU
      increases from 1500 to 1504 bytes, for example.
      
      Commit cbd89acb ("af_packet: fix for sending VLAN frames via
      packet_mmap") attempted to fix that for the mmap part but was
      reverted as it caused regressions while using eth_type_trans()
      on output path.
      
      Let's just act analogously to 57f89bfa and add similar logic to
      TX_RING. We presume size_max as overcharged by +4 bytes and
      later on, after the skb has been built by tpacket_fill_skb(),
      check for an ETH_P_8021Q header on packets larger than the
      normal MTU. This can easily be reproduced with a slightly
      modified trafgen in mmap(2) mode, test cases:
      
       { fill(0xff, 12) const16(0x8100) fill(0xff, <1504|1505>) }
       { fill(0xff, 12) const16(0x0806) fill(0xff, <1500|1501>) }
      
      Note that we need to do the test right after tpacket_fill_skb(),
      as sockets can have PACKET_LOSS set, in which case we would not
      fail but instead just continue to traverse the ring.
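      
      A condensed sketch of the post-fill check (paraphrased from the
      patch):
      
      	tp_len = tpacket_fill_skb(po, skb, ph, dev, size_max, proto,
      				  addr, hlen);
      	if (likely(tp_len >= 0) &&
      	    tp_len > dev->mtu + dev->hard_header_len) {
      		struct ethhdr *ehdr;
      
      		/* only the +4 byte VLAN overshoot is tolerated */
      		skb_reset_mac_header(skb);
      		ehdr = eth_hdr(skb);
      		if (ehdr->h_proto != htons(ETH_P_8021Q))
      			tp_len = -EMSGSIZE;
      	}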
      Reported-by: Mathias Kretschmer <mathias.kretschmer@fokus.fraunhofer.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Cc: Ben Greear <greearb@candelatech.com>
      Cc: Phil Sutter <phil@nwl.cc>
      Tested-by: Mathias Kretschmer <mathias.kretschmer@fokus.fraunhofer.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 19 Feb 2014, 1 commit
  9. 17 Feb 2014, 1 commit
    • packet: check for ndo_select_queue during queue selection · 0fd5d57b
      Committed by Daniel Borkmann
      Mathias reported that on an AMD Geode LX embedded board (ALiX)
      with ath9k driver PACKET_QDISC_BYPASS, introduced in commit
      d346a3fa ("packet: introduce PACKET_QDISC_BYPASS socket
      option"), triggers a WARN_ON() coming from the driver itself
      via 066dae93 ("ath9k: rework tx queue selection and fix
      queue stopping/waking").
      
      The reason this happened is that the ndo_select_queue() callback
      is not invoked from the direct xmit path, i.e. for the ieee80211
      subsystem, which sets the queue and TID (similar to the 802.1d
      tag) that is put into the frame through 802.11e (WMM, QoS). If
      that is not set, the pending frame counter for e.g. ath9k can
      get messed up.
      
      So the WARN_ON() in ath9k is absolutely legitimate. Generally,
      the hw queue selection in ieee80211 depends on the type of
      traffic, and priorities are set according to ieee80211_ac_numbers
      mapping; working in a similar way as DiffServ only on a lower
      layer, so that the AP can favour frames that have "real-time"
      requirements like voice or video data frames.
      
      Therefore, check for the presence of ndo_select_queue() in the
      netdev ops and, if available, invoke it with a fallback handler
      to __packet_pick_tx_queue(), so that drivers such as bnx2x,
      ixgbe, or mlx4 can still select a hw queue for transmission in
      relation to the current CPU while e.g. the ieee80211 subsystem
      can make its own choices.
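      
      A minimal sketch of the resulting selection logic (close to the
      patch; netdev_cap_txqueue() clamps an out-of-range driver
      answer):
      
      	static u16 packet_pick_tx_queue(struct net_device *dev,
      					struct sk_buff *skb)
      	{
      		const struct net_device_ops *ops = dev->netdev_ops;
      		u16 queue_index;
      
      		if (ops->ndo_select_queue) {
      			queue_index = ops->ndo_select_queue(dev, skb, NULL,
      							    __packet_pick_tx_queue);
      			queue_index = netdev_cap_txqueue(dev, queue_index);
      		} else {
      			queue_index = __packet_pick_tx_queue(dev, skb);
      		}
      
      		return queue_index;
      	}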
      Reported-by: Mathias Kretschmer <mathias.kretschmer@fokus.fraunhofer.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 23 Jan 2014, 1 commit
  11. 22 Jan 2014, 3 commits
  12. 19 Jan 2014, 1 commit
  13. 17 Jan 2014, 3 commits
    • packet: use percpu mmap tx frame pending refcount · b0138408
      Committed by Daniel Borkmann
      In PF_PACKET's packet mmap(), we can avoid using one atomic_inc()
      and one atomic_dec() call in skb destructor and use a percpu
      reference count instead in order to determine if packets are
      still pending to be sent out. Micro-benchmark with [1], slightly
      modified (that is, protocol = 0 in socket(2) and bind(2));
      example on a rather crappy testing machine. I expect it to scale
      and have even better results on bigger machines:
      
      ./packet_mm_tx -s7000 -m7200 -z700000 em1, avg over 2500 runs:
      
      With patch:    4,022,015 cyc
      Without patch: 4,812,994 cyc
      
      time ./packet_mm_tx -s64 -c10000000 em1 > /dev/null, stable:
      
      With patch:
        real         1m32.241s
        user         0m0.287s
        sys          1m29.316s
      
      Without patch:
        real         1m38.386s
        user         0m0.265s
        sys          1m35.572s
      
      In function tpacket_snd(), it is okay to use packet_read_pending()
      since in fast-path we short-circuit the condition already with
      ph != NULL, as we have further frames to process. In case we have
      MSG_DONTWAIT, we also do not execute this path as need_wait is
      false here anyway, and in case of _no_ MSG_DONTWAIT flag, it is
      okay to call packet_read_pending(), because whenever we reach
      that path, we're done processing outgoing frames anyway and only
      look if there are skbs still outstanding to be orphaned. We can
      stay lockless in this percpu counter since it's acceptable when we
      reach this path for the sum to be imprecise first, but we'll level
      out at 0 after all pending frames have reached the skb destructor
      eventually through tx reclaim. When people pin a tx process to
      particular CPUs, we expect overflows to happen in the reference
      counter, as one CPU sees heavy increments while the decrements
      are distributed through ksoftirqd across all CPUs, for example. As
      David Laight points out, since the C language doesn't define the
      result of signed int overflow (i.e. rather than wrap, it is
      allowed to saturate as a possible outcome), we have to use
      unsigned int as reference count. The sum over all CPUs when tx
      is complete will result in 0 again.
      
      The BUG_ON() in tpacket_destruct_skb() can be removed as well. It
      can _only_ be set from inside the tpacket_snd() path and we made
      sure to increase tx_ring.pending in any case before we called po->xmit(skb).
      So testing for tx_ring.pending == 0 is not too useful. Instead, it
      would rather have been useful to test if lower layers didn't orphan
      the skb so that we're missing ring slots being put back to
      TP_STATUS_AVAILABLE. But such a bug will be caught in user space
      already as we end up realizing that we do not have any
      TP_STATUS_AVAILABLE slots left anymore. Therefore, we're all set.
      
      Btw, in case of RX_RING path, we do not make use of the pending
      member, therefore we also don't need to use up any percpu memory
      here. Also note that __alloc_percpu() already returns a zero-filled
      percpu area, so initialization is done already.
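      
      A sketch of the summation helper (essentially as described
      above):
      
      	static unsigned int packet_read_pending(const struct packet_ring_buffer *rb)
      	{
      		unsigned int refcnt = 0;
      		int cpu;
      
      		/* per-CPU deltas may wrap individually; the unsigned
      		 * sum levels out at 0 once tx reclaim is complete
      		 */
      		for_each_possible_cpu(cpu)
      			refcnt += *per_cpu_ptr(rb->pending_refcnt, cpu);
      
      		return refcnt;
      	}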
      
        [1] http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap
      
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • packet: don't unconditionally schedule() in case of MSG_DONTWAIT · 87a2fd28
      Committed by Daniel Borkmann
      In tpacket_snd(), when we discover the first frame that is not
      in status TP_STATUS_SEND_REQUEST and get back a NULL buffer, we
      exit the send routine in the MSG_DONTWAIT case, since we have
      finished traversing the mmaped send ring buffer and don't care
      about pending frames.
      
      While doing so, we still unconditionally call an expensive
      schedule() in the packet_current_frame() "error" path, which
      is unnecessary in this case since it's enough to just quit
      the function.
      
      Also, in case MSG_DONTWAIT is not set, we should rather test
      for need_resched() first and do schedule() only if necessary
      since in the meantime pending frames could already have finished
      processing and called the skb destructor.
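      
      The change boils down to the following guard in tpacket_snd()
      (paraphrased):
      
      	if (unlikely(ph == NULL)) {
      		/* with !need_wait the loop condition ends the send;
      		 * otherwise only yield the CPU when actually required
      		 */
      		if (need_wait && need_resched())
      			schedule();
      		continue;
      	}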
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • packet: improve socket create/bind latency in some cases · 902fefb8
      Committed by Daniel Borkmann
      Most people acquire PF_PACKET sockets with a protocol argument in
      the socket call, e.g. libpcap does so with htons(ETH_P_ALL) for
      all its sockets. Most likely, at some point in time a subsequent
      bind() call will follow, e.g. in libpcap with ...
      
        memset(&sll, 0, sizeof(sll));
        sll.sll_family          = AF_PACKET;
        sll.sll_ifindex         = ifindex;
        sll.sll_protocol        = htons(ETH_P_ALL);
      
      ... as arguments. What happens in the kernel is that already in
      the socket() syscall, we install a proto hook via
      register_prot_hook() if our protocol argument is != 0. Yet, in
      bind() we do almost the same work again: an unregister_prot_hook()
      with an expensive synchronize_net() call (in case the proto was
      != 0 during socket()), plus a follow-up register_prot_hook(),
      this time with a device bound to it in order to limit the
      traffic we get.
      
      When the protocol and the user-supplied device index (== 0) do
      not change from socket() to bind(), we can spare ourselves doing
      the same work twice. Similarly for re-binding to the same device
      and protocol. For these scenarios, this patch decreases
      create/bind latency from ~7447us (sock-bind-2 case) to ~89us
      (sock-bind-1 case).
      
      Alternatively, for the first case, if people care, they should
      simply create their sockets with proto == 0 argument and define
      the protocol during bind() as this saves a call to synchronize_net()
      as well (sock-bind-3 case).
      
      In all other cases, we're tied to user space behaviour we must not
      change, also since a bind() is not strictly required. Thus, we need
      the synchronize_net() to make sure no asynchronous packet processing
      paths still refer to the previous elements of po->prot_hook.
      
      In case of mmap()ed sockets, the workflow that includes bind() is
      socket() -> setsockopt(<ring>) -> bind(). In that case, a pair of
      {__unregister, register}_prot_hook is being called from setsockopt()
      in order to install the new protocol receive handler. Thus, when
      we call bind and can skip a re-hook, we have already previously
      installed the new handler. For fanout, this is handled entirely
      differently, so we should be good.
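      
      A condensed, paraphrased sketch of the rehook avoidance in
      packet_do_bind() (locking and error handling omitted; not the
      literal diff):
      
      	if (po->running &&
      	    po->prot_hook.type == proto &&
      	    po->prot_hook.dev == dev)
      		goto out;			/* nothing changed, skip rehook */
      
      	__unregister_prot_hook(sk, true);	/* implies synchronize_net() */
      	po->num = proto;
      	po->prot_hook.type = proto;
      	po->prot_hook.dev = dev;
      	register_prot_hook(sk);			/* rearm with new proto/device */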
      
      Timings on an i7-3520M machine:
      
        * sock-bind-1:   89 us
        * sock-bind-2: 7447 us
        * sock-bind-3:   75 us
      
      sock-bind-1:
        socket(PF_PACKET, SOCK_RAW, htons(ETH_P_IP)) = 3
        bind(3, {sa_family=AF_PACKET, proto=htons(ETH_P_IP), if=all(0),
                 pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0
      
      sock-bind-2:
        socket(PF_PACKET, SOCK_RAW, htons(ETH_P_IP)) = 3
        bind(3, {sa_family=AF_PACKET, proto=htons(ETH_P_IP), if=lo(1),
                 pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0
      
      sock-bind-3:
        socket(PF_PACKET, SOCK_RAW, 0) = 3
        bind(3, {sa_family=AF_PACKET, proto=htons(ETH_P_IP), if=lo(1),
                 pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 01 Jan 2014, 1 commit
  15. 18 Dec 2013, 4 commits
  16. 14 Dec 2013, 1 commit
    • packet: fix using smp_processor_id() in preemptible code · 1cbac010
      Committed by Li Zhong
      This patch fixes the following warning by replacing
      smp_processor_id() with raw_smp_processor_id():
      
      [   11.120893] BUG: using smp_processor_id() in preemptible [00000000] code: arping/3510
      [   11.120913] caller is .packet_sendmsg+0xc14/0xe68
      [   11.120920] CPU: 13 PID: 3510 Comm: arping Not tainted 3.13.0-rc3-next-20131211-dirty #1
      [   11.120926] Call Trace:
      [   11.120932] [c0000001f803f6f0] [c0000000000138dc] .show_stack+0x110/0x25c (unreliable)
      [   11.120942] [c0000001f803f7e0] [c00000000083dd24] .dump_stack+0xa0/0x37c
      [   11.120951] [c0000001f803f870] [c000000000493fd4] .debug_smp_processor_id+0xfc/0x12c
      [   11.120959] [c0000001f803f900] [c0000000007eba78] .packet_sendmsg+0xc14/0xe68
      [   11.120968] [c0000001f803fa80] [c000000000700968] .sock_sendmsg+0xa0/0xe0
      [   11.120975] [c0000001f803fbf0] [c0000000007014d8] .SyS_sendto+0x100/0x148
      [   11.120983] [c0000001f803fd60] [c0000000006fff10] .SyS_socketcall+0x1c4/0x2e8
      [   11.120990] [c0000001f803fe30] [c00000000000a1e4] syscall_exit+0x0/0x9c
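      
      The fix itself is a one-liner in the queue pick helper; since
      the result is only a hint, a stale CPU id is harmless (sketch):
      
      	static u16 packet_pick_tx_queue(struct net_device *dev,
      					struct sk_buff *skb)
      	{
      		/* raw_ variant is legal in preemptible context */
      		return (u16) raw_smp_processor_id() % dev->real_num_tx_queues;
      	}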
      Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 10 Dec 2013, 2 commits
    • packet: introduce PACKET_QDISC_BYPASS socket option · d346a3fa
      Committed by Daniel Borkmann
      This patch introduces a PACKET_QDISC_BYPASS socket option, that
      allows for using a similar xmit() function as in pktgen instead
      of taking the dev_queue_xmit() path. This can be very useful when
      PF_PACKET applications need to be used in a pktgen-like
      scenario, but with a full, flexible packet payload that needs to
      be provided, for example.
      
      By default, nothing changes in behaviour for normal PF_PACKET TX
      users, so everything stays as is for applications. New users,
      however, can now set PACKET_QDISC_BYPASS if needed, to i)
      prevent their own packets from reentering packet_rcv() and ii)
      push the frame directly to the driver.
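      
      From user space, enabling the bypass is a single setsockopt(2)
      call:
      
      	int one = 1;
      
      	/* opt out of the qdisc layer for this PF_PACKET socket */
      	setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS,
      		   &one, sizeof(one));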
      
      In doing so we can increase pps (here 64 byte packets) for
      PF_PACKET a bit:
      
        # CPUs -- QDISC_BYPASS   -- qdisc path -- qdisc path[**]
        1 CPU  ==  1,509,628 pps --  1,208,708 --  1,247,436
        2 CPUs ==  3,198,659 pps --  2,536,012 --  1,605,779
        3 CPUs ==  4,787,992 pps --  3,788,740 --  1,735,610
        4 CPUs ==  6,173,956 pps --  4,907,799 --  1,909,114
        5 CPUs ==  7,495,676 pps --  5,956,499 --  2,014,422
        6 CPUs ==  9,001,496 pps --  7,145,064 --  2,155,261
        7 CPUs == 10,229,776 pps --  8,190,596 --  2,220,619
        8 CPUs == 11,040,732 pps --  9,188,544 --  2,241,879
        9 CPUs == 12,009,076 pps -- 10,275,936 --  2,068,447
       10 CPUs == 11,380,052 pps -- 11,265,337 --  1,578,689
       11 CPUs == 11,672,676 pps -- 11,845,344 --  1,297,412
       [...]
       20 CPUs == 11,363,192 pps -- 11,014,933 --  1,245,081
      
       [**]: qdisc path with packet_rcv(), which is probably how most
             people use it (hopefully not anymore if not needed)
      
      The test was done using a modified trafgen, sending a simple
      static 64 bytes packet, on all CPUs.  The trick in the fast
      "qdisc path" case, is to avoid reentering packet_rcv() by
      setting the RAW socket protocol to zero, like:
      socket(PF_PACKET, SOCK_RAW, 0);
      
      Tradeoffs are documented as well in this patch, clearly, if
      queues are busy, we will drop more packets, tc disciplines are
      ignored, and these packets are not visible to taps anymore. For
      a pktgen like scenario, we argue that this is acceptable.
      
      The pointer to the xmit function has been placed in a packet
      socket structure hole between cached_dev and prot_hook, which is
      hot anyway as we're working on cached_dev in each send path.
      
      Done in joint work together with Jesper Dangaard Brouer.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • packet: fix send path when running with proto == 0 · 66e56cd4
      Committed by Daniel Borkmann
      Commit e40526cb introduced a cached dev pointer, that gets
      hooked into register_prot_hook(), __unregister_prot_hook() to
      update the device used for the send path.
      
      We need to fix this up, as otherwise this will not work with
      sockets created with protocol = 0 and with sll_protocol = 0
      passed via sockaddr_ll when doing the bind.
      
      So instead, assign the pointer directly. The compiler can inline
      these helper functions automagically.
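      
      For illustration, the helpers in question reduce to plain RCU
      pointer stores (sketch along the lines of the patch):
      
      	static void packet_cached_dev_assign(struct packet_sock *po,
      					     struct net_device *dev)
      	{
      		rcu_assign_pointer(po->cached_dev, dev);
      	}
      
      	static void packet_cached_dev_reset(struct packet_sock *po)
      	{
      		RCU_INIT_POINTER(po->cached_dev, NULL);
      	}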
      
      While at it, also mark the cached dev fast-path as likely(), and
      document this variant of socket creation, as it seems it is not
      widely used (it seems not even the author of TX_RING was aware
      of it in his reference example [1]). Tested with the reproducer
      from e40526cb.
      
       [1] http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap#Example
      
      Fixes: e40526cb ("packet: fix use after free race in send path when dev is released")
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Tested-by: Salam Noureddine <noureddine@aristanetworks.com>
      Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 07 Dec 2013, 1 commit
  19. 30 Nov 2013, 1 commit
  20. 22 Nov 2013, 1 commit
    • packet: fix use after free race in send path when dev is released · e40526cb
      Committed by Daniel Borkmann
      Salam reported a use after free bug in PF_PACKET that occurs when
      we're sending out frames on a socket bound device and suddenly the
      net device is being unregistered. It appears that commit 827d9780
      introduced a possible race condition between {t,}packet_snd() and
      packet_notifier(). In the case of a bound socket, packet_notifier()
      can drop the last reference to the net_device and {t,}packet_snd()
      might end up suddenly sending a packet over a freed net_device.
      
      To avoid reverting 827d9780 and thus introducing a performance
      regression compared to the current state of things, we decided to
      hold a cached RCU protected pointer to the net device and maintain
      it on write side via bind spin_lock protected register_prot_hook()
      and __unregister_prot_hook() calls.
      
      In the {t,}packet_snd() path, we access this pointer under
      rcu_read_lock through packet_cached_dev_get(), which takes a
      reference to the device to prevent it from being freed through
      packet_notifier() while we're in the send path. This is okay to
      do as dev_put()/dev_hold() are per-cpu counters, so this should
      not be a performance issue. Also, the code simplifies a bit as
      we don't need need_rls_dev anymore.
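      
      A sketch of the read-side helper described above:
      
      	static struct net_device *packet_cached_dev_get(struct packet_sock *po)
      	{
      		struct net_device *dev;
      
      		rcu_read_lock();
      		dev = rcu_dereference(po->cached_dev);
      		if (likely(dev))
      			dev_hold(dev);	/* pin the device for the send path */
      		rcu_read_unlock();
      
      		return dev;
      	}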
      
      Fixes: 827d9780 ("af-packet: Use existing netdev reference for bound sockets.")
      Reported-by: Salam Noureddine <noureddine@aristanetworks.com>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Salam Noureddine <noureddine@aristanetworks.com>
      Cc: Ben Greear <greearb@candelatech.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  21. 21 Nov 2013, 1 commit
    • net: rework recvmsg handler msg_name and msg_namelen logic · f3d33426
      Committed by Hannes Frederic Sowa
      This patch now always passes msg->msg_namelen as 0. recvmsg handlers must
      set msg_namelen to the proper size <= sizeof(struct sockaddr_storage)
      to return msg_name to the user.
      
      This prevents numerous uninitialized memory leaks we had in the
      recvmsg handlers and makes it harder for new code to accidentally leak
      uninitialized memory.
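      
      Sketched, the contract for a handler that returns an address now
      looks like this (hypothetical PF_PACKET-style handler; the
      sockaddr type varies per family):
      
      	/* handler fills msg_name only when it has an address */
      	if (msg->msg_name) {
      		DECLARE_SOCKADDR(struct sockaddr_ll *, sll, msg->msg_name);
      
      		/* ... fill *sll with the peer address ... */
      		msg->msg_namelen = sizeof(*sll);
      	}
      	/* otherwise msg_namelen stays 0, as preset by the core */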
      
      Optimize for the case recvfrom is called with NULL as address. We don't
      need to copy the address at all, so set it to NULL before invoking the
      recvmsg handler. We can do so because all the recvmsg handlers
      must cope with the case that a plain read() is called on them;
      read() also sets msg_name to NULL.
      
      Also document these changes in include/linux/net.h as suggested by David
      Miller.
      
      Changes since RFC:
      
      Set msg->msg_name = NULL if user specified a NULL in msg_name but had a
      non-null msg_namelen in verify_iovec/verify_compat_iovec. This doesn't
      affect sendto as it would bail out earlier while trying to copy-in the
      address. It also more naturally reflects the logic by the callers of
      verify_iovec.
      
      With this change in place I could remove "
      if (!uaddr || msg_sys->msg_namelen == 0)
      	msg->msg_name = NULL
      ".
      
      This change does not alter the user visible error logic as we ignore
      msg_namelen as long as msg_name is NULL.
      
      Also remove two unnecessary curly brackets in ___sys_recvmsg and change
      comments to netdev style.
      
      Cc: David Miller <davem@davemloft.net>
      Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 30 Aug 2013, 2 commits
  23. 21 Aug 2013, 1 commit
  24. 10 Aug 2013, 1 commit
    • net: attempt high order allocations in sock_alloc_send_pskb() · 28d64271
      Committed by Eric Dumazet
      Adding paged frags skbs to af_unix sockets introduced a performance
      regression on large sends because of additional page allocations, even
      if each skb could carry at least 100% more payload than before.
      
      We can instruct sock_alloc_send_pskb() to attempt high order
      allocations.
      
      Most of the time, it does a single page allocation instead of 8.
      
      I added an additional parameter to sock_alloc_send_pskb() to let
      other users opt in to this new feature in follow-up patches.
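      
      The resulting signature gains a max_page_order argument (sketch;
      callers that do not care simply pass 0):
      
      	struct sk_buff *sock_alloc_send_pskb(struct sock *sk,
      					     unsigned long header_len,
      					     unsigned long data_len,
      					     int noblock, int *errcode,
      					     int max_page_order);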
      
      Tested:
      
      Before patch :
      
      $ netperf -t STREAM_STREAM
      STREAM STREAM TEST
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       2304  212992  212992    10.00    46861.15
      
      After patch :
      
      $ netperf -t STREAM_STREAM
      STREAM STREAM TEST
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       2304  212992  212992    10.00    57981.11
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  25. 08 Aug 2013, 1 commit
  26. 03 Aug 2013, 3 commits
  27. 23 Jul 2013, 1 commit