提交 · 30e502a34b8b21fae2c789da102bd9f6e99fef83 · openeuler / Kernel

29 9月, 2014 15 次提交

net: tcp: add flag for ca to indicate that ECN is required · 30e502a3

由 Daniel Borkmann 提交于 9月 26, 2014

This patch adds a flag to TCP congestion algorithms that allows
for requesting to mark IPv4/IPv6 sockets with transport as ECN
capable, that is, ECT(0), when required by a congestion algorithm.

It is currently used and needed in DataCenter TCP (DCTCP), as it
requires both peers to assert ECT on all IP packets sent - it
uses ECN feedback (i.e. CE, Congestion Encountered information)
from switches inside the data center to derive feedback to the
end hosts.

Therefore, simply add a new flag to icsk_ca_ops. Note that DCTCP's
algorithm/behaviour slightly diverges from RFC3168, therefore this
is only (!) enabled iff the assigned congestion control ops module
has requested this. By that, we can tightly couple this logic really
only to the provided congestion control ops.

Joint work with Florian Westphal and Glenn Judd.
Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NGlenn Judd <glenn.judd@morganstanley.com>
Acked-by: NStephen Hemminger <stephen@networkplumber.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

30e502a3

net: tcp: assign tcp cong_ops when tcp sk is created · 55d8694f

由 Florian Westphal 提交于 9月 26, 2014

Split assignment and initialization from one into two functions.

This is required by followup patches that add Datacenter TCP
(DCTCP) congestion control algorithm - we need to be able to
determine if the connection is moderated by DCTCP before the
3WHS has finished.

As we walk the available congestion control list during the
assignment, we are always guaranteed to have Reno present as
it's fixed compiled-in. Therefore, since we're doing the
early assignment, we don't have a real use for the Reno alias
tcp_init_congestion_ops anymore and can thus remove it.

Actual usage of the congestion control operations are being
made after the 3WHS has finished, in some cases however we
can access get_info() via diag if implemented, therefore we
need to zero out the private area for those modules.

Joint work with Daniel Borkmann and Glenn Judd.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
Signed-off-by: NGlenn Judd <glenn.judd@morganstanley.com>
Acked-by: NStephen Hemminger <stephen@networkplumber.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

55d8694f

net: sched: cls_rcvp, complete rcu conversion · 53dfd501

由 John Fastabend 提交于 9月 26, 2014

This completes the cls_rsvp conversion to RCU safe
copy, update semantics.

As a result all cases of tcf_exts_change occur on
empty lists now.
Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

53dfd501

net_sched: fix another regression in cls_tcindex · 68f6a7c6

由 WANG Cong 提交于 9月 25, 2014

Clearly the following change is not expected:

	-       if (!cp.perfect && !cp.h)
	-               cp.alloc_hash = cp.hash;
	+       if (!cp->perfect && cp->h)
	+               cp->alloc_hash = cp->hash;

Fixes: commit 331b7292 ("net: sched: RCU cls_tcindex")
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
Acked-by: NJohn Fastabend <john.r.fastabend@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

68f6a7c6

net_sched: fix errno in tcindex_set_parms() · 02c5e844

由 WANG Cong 提交于 9月 25, 2014

When kmemdup() fails, we should return -ENOMEM.

Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
Acked-by: NJohn Fastabend <john.r.fastabend@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

02c5e844

arp: Do not perturb drop profiles with ignored ARP packets · 825bae5d

由 Rick Jones 提交于 9月 25, 2014

We do not wish to disturb dropwatch or perf drop profiles with an ARP
we will ignore.
Signed-off-by: NRick Jones <rick.jones2@hp.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

825bae5d

net_sched: remove the first parameter from tcf_exts_destroy() · 18d0264f

由 WANG Cong 提交于 9月 25, 2014

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
Acked-by: NJamal Hadi Salim <hadi@mojatatu.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

18d0264f

net: dsa: allow switches driver to implement get/set EEE · 7905288f

由 Florian Fainelli 提交于 9月 24, 2014

Allow switches driver to query and enable/disable EEE on a per-port
basis by implementing the ethtool_{get,set}_eee settings and delegating
these operations to the switch driver.

set_eee() will need to coordinate with the PHY driver to make sure that
EEE is enabled, the link-partner supports it and the auto-negotiation
result is satisfactory.
Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7905288f

net: dsa: allow enabling and disable switch ports · b2f2af21

由 Florian Fainelli 提交于 9月 24, 2014

Whenever a per-port network device is used/unused, invoke the switch
driver port_enable/port_disable callbacks to allow saving as much power
as possible by disabling unused parts of the switch (RX/TX logic, memory
arrays, PHYs...). We supply a PHY device argument to make sure the
switch driver can act on the PHY device if needed (like putting/taking
the PHY out of deep low power mode).
Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b2f2af21

net: dsa: start and stop the PHY state machine · f7f1de51

由 Florian Fainelli 提交于 9月 24, 2014

dsa_slave_open() should start the PHY library state machine for its PHY
interface, and dsa_slave_close() should stop the PHY library state
machine accordingly.
Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f7f1de51

tcp: use tcp_flags in tcp_data_queue() · 155c6e1a

由 Peter Pan(潘卫平) 提交于 9月 24, 2014

This patch is a cleanup which follows the idea in commit e11ecddf (tcp: use
TCP_SKB_CB(skb)->tcp_flags in input path),
and it may reduce register pressure since skb->cb[] access is fast,
bacause skb is probably in a register.

v2: remove variable th
v3: reword the changelog
Signed-off-by: NWeiping Pan <panweiping3@gmail.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

155c6e1a

tcp: change tcp_skb_pcount() location · cd7d8498

由 Eric Dumazet 提交于 9月 24, 2014

Our goal is to access no more than one cache line access per skb in
a write or receive queue when doing the various walks.

After recent TCP_SKB_CB() reorganizations, it is almost done.

Last part is tcp_skb_pcount() which currently uses
skb_shinfo(skb)->gso_segs, which is a terrible choice, because it needs
3 cache lines in current kernel (skb->head, skb->end, and
shinfo->gso_segs are all in 3 different cache lines, far from skb->cb)

This very simple patch reuses space currently taken by tcp_tw_isn
only in input path, as tcp_skb_pcount is only needed for skb stored in
write queue.

This considerably speeds up tcp_ack(), granted we avoid shinfo->tx_flags
to get SKBTX_ACK_TSTAMP, which seems possible.

This also speeds up all sack processing in general.

This speeds up tcp_sendmsg() because it no longer has to access/dirty
shinfo.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cd7d8498

tcp: better TCP_SKB_CB layout to reduce cache line misses · 971f10ec

由 Eric Dumazet 提交于 9月 27, 2014

TCP maintains lists of skb in write queue, and in receive queues
(in order and out of order queues)

Scanning these lists both in input and output path usually requires
access to skb->next, TCP_SKB_CB(skb)->seq, and TCP_SKB_CB(skb)->end_seq

These fields are currently in two different cache lines, meaning we
waste lot of memory bandwidth when these queues are big and flows
have either packet drops or packet reorders.

We can move TCP_SKB_CB(skb)->header at the end of TCP_SKB_CB, because
this header is not used in fast path. This allows TCP to search much faster
in the skb lists.

Even with regular flows, we save one cache line miss in fast path.

Thanks to Christoph Paasch for noticing we need to cleanup
skb->cb[] (IPCB/IP6CB) before entering IP stack in tx path,
and that I forgot IPCB use in tcp_v4_hnd_req() and tcp_v4_save_options().
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

971f10ec

ipv6: add a struct inet6_skb_parm param to ipv6_opt_accepted() · a224772d

由 Eric Dumazet 提交于 9月 27, 2014

ipv6_opt_accepted() assumes IP6CB(skb) holds the struct inet6_skb_parm
that it needs. Lets not assume this, as TCP stack might use a different
place.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a224772d

ipv4: rename ip_options_echo to __ip_options_echo() · 24a2d43d

由 Eric Dumazet 提交于 9月 27, 2014

ip_options_echo() assumes struct ip_options is provided in &IPCB(skb)->opt
Lets break this assumption, but provide a helper to not change all call points.

ip_send_unicast_reply() gets a new struct ip_options pointer.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

24a2d43d

27 9月, 2014 5 次提交

net : optimize skb_release_data() · ff04a771

由 Eric Dumazet 提交于 9月 23, 2014

Cache skb_shinfo(skb) in a variable to avoid computing it multiple
times.

Reorganize the tests to remove one indentation level.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ff04a771

net/openvswitch: remove dup comment in vport.h · 8280bf00

由 Wang Sheng-Hui 提交于 9月 23, 2014

Remove the duplicated comment
"/* The following definitions are for users of the vport subsytem: */"
in vport.h
Signed-off-by: NWang Sheng-Hui <shhuiw@gmail.com>
Acked-by: NPravin B Shelar <pshelar@nicira.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8280bf00

net: optimise inet_proto_csum_replace4() · 58e3cac5

由 LEROY Christophe 提交于 9月 23, 2014

csum_partial() is a generic function which is not optimised for small fixed
length calculations, and its use requires to store "from" and "to" values in
memory while we already have them available in registers. This also has impact,
especially on RISC processors. In the same spirit as the change done by
Eric Dumazet on csum_replace2(), this patch rewrites inet_proto_csum_replace4()
taking into account RFC1624.

I spotted during a NATted tcp transfert that csum_partial() is one of top 5
consuming functions (around 8%), and the second user of csum_partial() is
inet_proto_csum_replace4().
Signed-off-by: NChristophe Leroy <christophe.leroy@c-s.fr>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

58e3cac5

net: introduce __skb_header_release() · f4a775d1

由 Eric Dumazet 提交于 9月 22, 2014

While profiling TCP stack, I noticed one useless atomic operation
in tcp_sendmsg(), caused by skb_header_release().

It turns out all current skb_header_release() users have a fresh skb,
that no other user can see, so we can avoid one atomic operation.

Introduce __skb_header_release() to clearly document this.

This gave me a 1.5 % improvement on TCP_RR workload.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f4a775d1

net: Change netdev_<level> logging functions to return void · 6ea754eb

由 Joe Perches 提交于 9月 22, 2014

No caller or macro uses the return value so make all
the functions return void.
Signed-off-by: NJoe Perches <joe@perches.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6ea754eb

26 9月, 2014 4 次提交

net: sched: use pinned timers · 4a8e320c

由 Eric Dumazet 提交于 9月 20, 2014

While using a MQ + NETEM setup, I had confirmation that the default
timer migration ( /proc/sys/kernel/timer_migration ) is killing us.

Installing this on a receiver side of a TCP_STREAM test, (NIC has 8 TX
queues) :

EST="est 1sec 4sec"
for ETH in eth1
do
 tc qd del dev $ETH root 2>/dev/null
 tc qd add dev $ETH root handle 1: mq
 tc qd add dev $ETH parent 1:1 $EST netem limit 70000 delay 6ms
 tc qd add dev $ETH parent 1:2 $EST netem limit 70000 delay 8ms
 tc qd add dev $ETH parent 1:3 $EST netem limit 70000 delay 10ms
 tc qd add dev $ETH parent 1:4 $EST netem limit 70000 delay 12ms
 tc qd add dev $ETH parent 1:5 $EST netem limit 70000 delay 14ms
 tc qd add dev $ETH parent 1:6 $EST netem limit 70000 delay 16ms
 tc qd add dev $ETH parent 1:7 $EST netem limit 80000 delay 18ms
 tc qd add dev $ETH parent 1:8 $EST netem limit 90000 delay 20ms
done

We can see that timers get migrated into a single cpu, presumably idle
at the time timers are set up.
Then all qdisc dequeues run from this cpu and huge lock contention
happens. This single cpu is stuck in softirq mode and cannot dequeue
fast enough.

    39.24%  [kernel]          [k] _raw_spin_lock
     2.65%  [kernel]          [k] netem_enqueue
     1.80%  [kernel]          [k] netem_dequeue
     1.63%  [kernel]          [k] copy_user_enhanced_fast_string
     1.45%  [kernel]          [k] _raw_spin_lock_bh

By pinning qdisc timers on the cpu running the qdisc, we respect proper
XPS setting and remove this lock contention.

     5.84%  [kernel]          [k] netem_enqueue
     4.83%  [kernel]          [k] _raw_spin_lock
     2.92%  [kernel]          [k] copy_user_enhanced_fast_string

Current Qdiscs that benefit from this change are :

	netem, cbq, fq, hfsc, tbf, htb.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4a8e320c

net: Remove gso_send_check as an offload callback · 53e50398

由 Tom Herbert 提交于 9月 20, 2014

The send_check logic was only interesting in cases of TCP offload and
UDP UFO where the checksum needed to be initialized to the pseudo
header checksum. Now we've moved that logic into the related
gso_segment functions so gso_send_check is no longer needed.
Signed-off-by: NTom Herbert <therbert@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

53e50398

udp: move logic out of udp[46]_ufo_send_check · f71470b3

由 Tom Herbert 提交于 9月 20, 2014

In udp[46]_ufo_send_check the UDP checksum initialized to the pseudo
header checksum. We can move this logic into udp[46]_ufo_fragment.
After this change udp[64]_ufo_send_check is a no-op.
Signed-off-by: NTom Herbert <therbert@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f71470b3

tcp: move logic out of tcp_v[64]_gso_send_check · d020f8f7

由 Tom Herbert 提交于 9月 20, 2014

In tcp_v[46]_gso_send_check the TCP checksum is initialized to the
pseudo header checksum using __tcp_v[46]_send_check. We can move this
logic into new tcp[46]_gso_segment functions to be done when
ip_summed != CHECKSUM_PARTIAL (ip_summed == CHECKSUM_PARTIAL should be
the common case, possibly always true when taking GSO path). After this
change tcp_v[46]_gso_send_check is no-op.
Signed-off-by: NTom Herbert <therbert@google.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d020f8f7

24 9月, 2014 2 次提交

tcp: add coalescing attempt in tcp_ofo_queue() · bd1e75ab

由 Eric Dumazet 提交于 9月 19, 2014

In order to make TCP more resilient in presence of reorders, we need
to allow coalescing to happen when skbs from out of order queue are
transferred into receive queue. LRO/GRO can be completely canceled
in some pathological cases, like per packet load balancing on aggregated
links.

I had to move tcp_try_coalesce() up in the file above tcp_ofo_queue()
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bd1e75ab

icmp: add a global rate limitation · 4cdf507d

由 Eric Dumazet 提交于 9月 19, 2014

Current ICMP rate limiting uses inetpeer cache, which is an RBL tree
protected by a lock, meaning that hosts can be stuck hard if all cpus
want to check ICMP limits.

When say a DNS or NTP server process is restarted, inetpeer tree grows
quick and machine comes to its knees.

iptables can not help because the bottleneck happens before ICMP
messages are even cooked and sent.

This patch adds a new global limitation, using a token bucket filter,
controlled by two new sysctl :

icmp_msgs_per_sec - INTEGER
    Limit maximal number of ICMP packets sent per second from this host.
    Only messages whose type matches icmp_ratemask are
    controlled by this limit.
    Default: 1000

icmp_msgs_burst - INTEGER
    icmp_msgs_per_sec controls number of ICMP packets sent per second,
    while icmp_msgs_burst controls the burst size of these packets.
    Default: 50

Note that if we really want to send millions of ICMP messages per
second, we might extend idea and infra added in commit 04ca6973
("ip: make IP identifiers less predictable") :
add a token bucket in the ip_idents hash and no longer rely on inetpeer.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4cdf507d

23 9月, 2014 12 次提交

ipv4: do not use this_cpu_ptr() in preemptible context · a35165ca

由 Eric Dumazet 提交于 9月 22, 2014

this_cpu_ptr() in preemptible context is generally bad

Sep 22 05:05:55 br kernel: [   94.608310] BUG: using smp_processor_id()
in
preemptible [00000000] code: ip/2261
Sep 22 05:05:55 br kernel: [   94.608316] caller is
tunnel_dst_set.isra.28+0x20/0x60 [ip_tunnel]
Sep 22 05:05:55 br kernel: [   94.608319] CPU: 3 PID: 2261 Comm: ip Not
tainted
3.17.0-rc5 #82

We can simply use raw_cpu_ptr(), as preemption is safe in these
contexts.

Should fix https://bugzilla.kernel.org/show_bug.cgi?id=84991Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: NJoe <joe9mail@gmail.com>
Fixes: 9a4aa9af ("ipv4: Use percpu Cache route in IP tunnels")
Acked-by: NTom Herbert <therbert@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a35165ca

net: sched: fix compile warning in cls_u32 · a2aeb02a

由 Eric Dumazet 提交于 9月 22, 2014

$ grep CONFIG_CLS_U32_MARK .config
# CONFIG_CLS_U32_MARK is not set

net/sched/cls_u32.c: In function 'u32_change':
net/sched/cls_u32.c:852:1: warning: label 'errout' defined but not used
[-Wunused-label]
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a2aeb02a

tcp: avoid possible arithmetic overflows · fcdd1cf4

由 Eric Dumazet 提交于 9月 22, 2014

icsk_rto is a 32bit field, and icsk_backoff can reach 15 by default,
or more if some sysctl (eg tcp_retries2) are changed.

Better use 64bit to perform icsk_rto << icsk_backoff operations

As Joe Perches suggested, add a helper for this.

Yuchung spotted the tcp_v4_err() case.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fcdd1cf4

ipv6: mld: answer mldv2 queries with mldv1 reports in mldv1 fallback · 35f7aa53

由 Daniel Borkmann 提交于 9月 20, 2014

RFC2710 (MLDv1), section 3.7. says:

The length of a received MLD message is computed by taking the
IPv6 Payload Length value and subtracting the length of any IPv6
extension headers present between the IPv6 header and the MLD
message. If that length is greater than 24 octets, that indicates
that there are other fields present *beyond* the fields described
above, perhaps belonging to a *future backwards-compatible* version
of MLD. An implementation of the version of MLD specified in this
document *MUST NOT* send an MLD message longer than 24 octets and
MUST ignore anything past the first 24 octets of a received MLD
message.

RFC3810 (MLDv2), section 8.2.1. states for *listeners* regarding
presence of MLDv1 routers:

In order to be compatible with MLDv1 routers, MLDv2 hosts MUST
operate in version 1 compatibility mode. [...] When Host
Compatibility Mode is MLDv2, a host acts using the MLDv2 protocol
on that interface. When Host Compatibility Mode is MLDv1, a host
acts in MLDv1 compatibility mode, using *only* the MLDv1 protocol,
on that interface. [...]

While section 8.3.1. specifies *router* behaviour regarding presence
of MLDv1 routers:

MLDv2 routers may be placed on a network where there is at least
one MLDv1 router. The following requirements apply:

If an MLDv1 router is present on the link, the Querier MUST use
the *lowest* version of MLD present on the network. This must be
administratively assured. Routers that desire to be compatible
with MLDv1 MUST have a configuration option to act in MLDv1 mode;
if an MLDv1 router is present on the link, the system administrator
must explicitly configure all MLDv2 routers to act in MLDv1 mode.
When in MLDv1 mode, the Querier MUST send periodic General Queries
truncated at the Multicast Address field (i.e., 24 bytes long),
and SHOULD also warn about receiving an MLDv2 Query (such warnings
must be rate-limited). The Querier MUST also fill in the Maximum
Response Delay in the Maximum Response Code field, i.e., the
exponential algorithm described in section 5.1.3. is not used. [...]

That means that we should not get queries from different versions of
MLD. When there's a MLDv1 router present, MLDv2 enforces truncation
and MRC == MRD (both fields are overlapping within the 24 octet range).

Section 8.3.2. specifies behaviour in the presence of MLDv1 multicast
address *listeners*:

MLDv2 routers may be placed on a network where there are hosts
that have not yet been upgraded to MLDv2. In order to be compatible
with MLDv1 hosts, MLDv2 routers MUST operate in version 1 compatibility
mode. MLDv2 routers keep a compatibility mode per multicast address
record. The compatibility mode of a multicast address is determined
from the Multicast Address Compatibility Mode variable, which can be
in one of the two following states: MLDv1 or MLDv2.

The Multicast Address Compatibility Mode of a multicast address
record is set to MLDv1 whenever an MLDv1 Multicast Listener Report is
*received* for that multicast address. At the same time, the Older
Version Host Present timer for the multicast address is set to Older
Version Host Present Timeout seconds. The timer is re-set whenever a
new MLDv1 Report is received for that multicast address. If the Older
Version Host Present timer expires, the router switches back to
Multicast Address Compatibility Mode of MLDv2 for that multicast
address. [...]

That means, what can happen is the following scenario, that hosts can
act in MLDv1 compatibility mode when they previously have received an
MLDv1 query (or, simply operate in MLDv1 mode-only); and at the same
time, an MLDv2 router could start up and transmits MLDv2 startup query
messages while being unaware of the current operational mode.

Given RFC2710, section 3.7 we would need to answer to that with an MLDv1
listener report, so that the router according to RFC3810, section 8.3.2.
would receive that and internally switch to MLDv1 compatibility as well.

Right now, I believe since the initial implementation of MLDv2, Linux
hosts would just silently drop such MLDv2 queries instead of replying
with an MLDv1 listener report, which would prevent a MLDv2 router going
into fallback mode (until it receives other MLDv1 queries).

Since the mapping of MRC to MRD in exactly such cases can make use of
the exponential algorithm from 5.1.3, we cannot [strictly speaking] be
aware in MLDv1 of the encoding in MRC, it seems also not mentioned by
the RFC. Since encodings are the same up to 32767, assume in such a
situation this value as a hard upper limit we would clamp. We have asked
one of the RFC authors on that regard, and he mentioned that there seem
not to be any implementations that make use of that exponential algorithm
on startup messages. In any case, this patch fixes this MLD
interoperability issue.
Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

35f7aa53

net: rfkill: gpio: Fix clock status · fa5c107c

由 Loic Poulain 提交于 9月 16, 2014

Clock is disabled when the device is blocked.
So, clock_enabled is the logical negation of "blocked".
Signed-off-by: NLoic Poulain <loic.poulain@intel.com>
Signed-off-by: NJohn W. Linville <linville@tuxdriver.com>

fa5c107c

net: sched: cls_u32 changes to knode must appear atomic to readers · de5df632

由 John Fastabend 提交于 9月 19, 2014

Changes to the cls_u32 classifier must appear atomic to the
readers. Before this patch if a change is requested for both
the exts and ifindex, first the ifindex is updated then the
exts with tcf_exts_change(). This opens a small window where
a reader can have a exts chain with an incorrect ifindex. This
violates the the RCU semantics.

Here we resolve this by always passing u32_set_parms() a copy
of the tc_u_knode to work on and then inserting it into the hash
table after the updates have been successfully applied.

Tested with the following short script:

#tc filter add dev p3p2 parent 8001:0 protocol ip prio 99 handle 1: \
	       u32 divisor 256

#tc filter add dev p3p2 parent 8001:0 protocol ip prio 99 \
	       u32 link 1: hashkey mask ffffff00 at 12    \
	       match ip src 192.168.8.0/2

#tc filter add dev p3p2 parent 8001:0 protocol ip prio 102    \
	       handle 1::10 u32 classid 1:2 ht 1: 	      \
	       match ip src 192.168.8.0/8 match ip tos 0x0a 1e

#tc filter change dev p3p2 parent 8001:0 protocol ip prio 102 \
		 handle 1::10 u32 classid 1:2 ht 1:        \
		 match ip src 1.1.0.0/8 match ip tos 0x0b 1e

CC: Eric Dumazet <edumazet@google.com>
CC: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

de5df632

net: cls_u32: fix missed pcpu_success free_percpu · a1ddcfee

由 John Fastabend 提交于 9月 19, 2014

This fixes a missed free_percpu in the unwind code path and when
keys are destroyed.
Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a1ddcfee

udp: Need to make ip6_udp_tunnel.c have GPL license · 3fcb95a8

由 Tom Herbert 提交于 9月 22, 2014

Unable to load various tunneling modules without this:

[   80.679049] fou: Unknown symbol udp_sock_create6 (err 0)
[   91.439939] ip6_udp_tunnel: Unknown symbol ip6_local_out (err 0)
[   91.439954] ip6_udp_tunnel: Unknown symbol __put_net (err 0)
[   91.457792] vxlan: Unknown symbol udp_sock_create6 (err 0)
[   91.457831] vxlan: Unknown symbol udp_tunnel6_xmit_skb (err 0)
Signed-off-by: NTom Herbert <therbert@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3fcb95a8

net: keep original skb which only needs header checking during software GSO · cecda693

由 Jason Wang 提交于 9月 19, 2014

Commit ce93718f ("net: Don't keep
around original SKB when we software segment GSO frames") frees the
original skb after software GSO even for dodgy gso skbs. This breaks
the stream throughput from untrusted sources, since only header
checking was done during software GSO instead of a true
segmentation. This patch fixes this by freeing the original gso skb
only when it was really segmented by software.

Fixes ce93718f ("net: Don't keep
around original SKB when we software segment GSO frames.")

Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NJason Wang <jasowang@redhat.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cecda693

net: dsa: add {get, set}_wol callbacks to slave devices · 19e57c4e

由 Florian Fainelli 提交于 9月 18, 2014

Allow switch drivers to implement per-port Wake-on-LAN getter and
setters.
Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

19e57c4e

net: dsa: allow switch drivers to implement suspend/resume hooks · 24462549

由 Florian Fainelli 提交于 9月 18, 2014

Add an abstraction layer to suspend/resume switch devices, doing the
following split:

- suspend/resume the slave network devices and their corresponding PHY
  devices
- suspend/resume the switch hardware using switch driver callbacks
Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

24462549

net: sched: shrink struct qdisc_skb_cb to 28 bytes · 25711786

由 Eric Dumazet 提交于 9月 18, 2014

We cannot make struct qdisc_skb_cb bigger without impacting IPoIB,
or increasing skb->cb[] size.

Commit e0f31d84 ("flow_keys: Record IP layer protocol in
skb_flow_dissect()") broke IPoIB.

Only current offender is sch_choke, and this one do not need an
absolutely precise flow key.

If we store 17 bytes of flow key, its more than enough. (Its the actual
size of flow_keys if it was a packed structure, but we might add new
fields at the end of it later)
Signed-off-by: NEric Dumazet <edumazet@google.com>
Fixes: e0f31d84 ("flow_keys: Record IP layer protocol in skb_flow_dissect()")
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

25711786

20 9月, 2014 2 次提交

udp_tunnel: Only build ip6_udp_tunnel.c when IPV6 is selected · 6d967f87

由 Andy Zhou 提交于 9月 19, 2014

Functions supplied in ip6_udp_tunnel.c are only needed when IPV6 is
selected. When IPV6 is not selected, those functions are stubbed out
in udp_tunnel.h.

==================================================================
 net/ipv6/ip6_udp_tunnel.c:15:5: error: redefinition of 'udp_sock_create6'
     int udp_sock_create6(struct net *net, struct udp_port_cfg *cfg,
 In file included from net/ipv6/ip6_udp_tunnel.c:9:0:
      include/net/udp_tunnel.h:36:19: note: previous definition of 'udp_sock_create6' was here
       static inline int udp_sock_create6(struct net *net, struct udp_port_cfg *cfg,
==================================================================

Fixes:  fd384412 udp_tunnel: Seperate ipv6 functions into its own file
Reported-by: Nkbuild test robot <fengguang.wu@intel.com>
Signed-off-by: NAndy Zhou <azhou@nicira.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6d967f87

openvswitch: restore OVS_FLOW_CMD_NEW notifications · 9b67aa4a

由 Samuel Gauthier 提交于 9月 18, 2014

Since commit fb5d1e9e ("openvswitch: Build flow cmd netlink reply only if needed."),
the new flows are not notified to the listeners of OVS_FLOW_MCGROUP.

This commit fixes the problem by using the genl function, ie
genl_has_listerners() instead of netlink_has_listeners().
Signed-off-by: NSamuel Gauthier <samuel.gauthier@6wind.com>
Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: NPravin B Shelar <pshelar@nicira.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9b67aa4a

openeuler / Kernel 大约 1 年 前同步成功

openeuler / Kernel
大约 1 年前同步成功