Commits · a0c02161ecfc2f40a0837926efac5376bc6fd6d3 · openanolis / cloud-kernel

30 Jan, 2017 1 commit

net: dsa: variable number of ports · a0c02161

Authored by Vivien Didelot on Jan 27, 2017

Change the ports[DSA_MAX_PORTS] array of the dsa_switch structure for a
zero-length array, allocated at the same time as the dsa_switch
structure itself. A dsa_switch_alloc() helper is provided for that.

This commit brings no functional change yet since we pass DSA_MAX_PORTS
as the number of ports for the moment. Future patches can update the DSA
drivers separately to support a dynamic number of ports.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
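
The single-allocation idea described above can be sketched in userspace C. This is only an illustration: `struct sw`, `struct port` and `sw_alloc()` are hypothetical stand-ins, not the kernel's `dsa_switch`, `dsa_port` or `dsa_switch_alloc()`, and the sketch uses a C99 flexible array member where the commit message speaks of a zero-length array.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct port {
	int index;
};

struct sw {
	size_t num_ports;
	struct port ports[];	/* flexible array member, sized at allocation */
};

/* Allocate the switch and its ports in one zero-initialised block. */
static struct sw *sw_alloc(size_t n)
{
	struct sw *s = calloc(1, sizeof(*s) + n * sizeof(s->ports[0]));

	if (!s)
		return NULL;
	s->num_ports = n;
	for (size_t i = 0; i < n; i++)
		s->ports[i].index = (int)i;
	return s;
}
```

A caller would use `sw_alloc(4)` to get a switch with four ports and release everything with a single `free()`.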

a0c02161

28 Jan, 2017 1 commit

net: adjust skb->truesize in pskb_expand_head() · 158f323b

Authored by Eric Dumazet on Jan 27, 2017

Slava Shwartsman reported a warning in skb_try_coalesce(), when we
detect skb->truesize is completely wrong.

In his case, the issue came from IPv6 reassembly coping with malicious
datagrams, that forced various pskb_may_pull() to reallocate a bigger
skb->head than the one allocated by NIC driver before entering GRO
layer.

Current code does not change skb->truesize, leaving this burden to
callers if they care enough.

Blindly changing skb->truesize in pskb_expand_head() is not
easy, as some producers might track skb->truesize, for example
in the xmit path for back pressure feedback (sk->sk_wmem_alloc).

We can detect the cases where it should be safe to change
skb->truesize :

1) skb is not attached to a socket.
2) If it is attached to a socket, destructor is sock_edemux()

My audit gave only two callers doing their own skb->truesize
manipulation.

I had to remove skb parameter in sock_edemux macro when
CONFIG_INET is not set to avoid a compile error.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Slava Shwartsman <slavash@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
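
The two "safe" cases listed in the message can be modelled with a small userspace sketch. Everything here is a hypothetical stand-in (`struct buf`, `edemux_destructor()`, `buf_account_expand()`); it mirrors the stated conditions only and is not the actual change to pskb_expand_head().

```c
#include <stddef.h>

struct sock;				/* opaque in this sketch */
struct buf;
typedef void (*destructor_fn)(struct buf *);

/* Hypothetical model of the three skb fields the message talks about. */
struct buf {
	struct sock *sk;		/* owning socket, or NULL */
	destructor_fn destructor;
	unsigned int truesize;
};

/* Stand-in for sock_edemux(): a destructor that does no memory accounting. */
static void edemux_destructor(struct buf *b)
{
	(void)b;
}

/* Adjust truesize after the head grew from osize to size, when that is safe. */
static void buf_account_expand(struct buf *b, unsigned int osize,
			       unsigned int size)
{
	if (!b->sk || b->destructor == edemux_destructor)
		b->truesize += size - osize;
}
```

A buffer attached to a socket with any other destructor keeps its old truesize, which matches the concern about producers that track the value themselves.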

158f323b

27 Jan, 2017 9 commits

tcp: don't annotate mark on control socket from tcp_v6_send_response() · 92e55f41

Authored by Pablo Neira on Jan 26, 2017

Unlike ipv4, this control socket is shared by all cpus so we cannot use
it as scratchpad area to annotate the mark that we pass to ip6_xmit().

Add a new parameter to ip6_xmit() to indicate the mark. The SCTP socket
family caches the flowi6 structure in the sctp_transport structure, so
we cannot use it to carry the mark unless we later on reset it back, which
I discarded since it looks ugly to me.

Fixes: bf99b4de ("tcp: fix mark propagation with fwmark_reflect enabled")
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

92e55f41

net/ipv6: support more tunnel interfaces for EUI64 link-local generation · 45ce0fd1

Authored by Felix Jia on Jan 26, 2017

Signed-off-by: Felix Jia <felix.jia@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>

45ce0fd1

net/ipv6: allow sysctl to change link-local address generation mode · d35a00b8

Authored by Felix Jia on Jan 26, 2017

The address generation mode for IPv6 link-local can only be configured
by netlink messages. This patch adds the ability to change the address
generation mode via sysctl.

v1 -> v2
Removed the rtnl lock and switch to use RCU lock to iterate through
the netdev list.

v2 -> v3
Removed the addrgenmode variable from the idev structure and use the
sysctl storage for the flag.

Simplified the logic for sysctl handling by removing support
for the all operation.

Added support for more types of tunnel interfaces for link-local
address generation.

Based the patches on net-next.

v3 -> v4
Removed unnecessary whitespace changes.
Signed-off-by: Felix Jia <felix.jia@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>

d35a00b8

net: ipv6: ignore null_entry on route dumps · 1f17e2f2

Authored by David Ahern on Jan 26, 2017

lkp-robot reported a BUG:
[   10.151226] BUG: unable to handle kernel NULL pointer dereference at 00000198
[   10.152525] IP: rt6_fill_node+0x164/0x4b8
[   10.153307] *pdpt = 0000000012ee5001 *pde = 0000000000000000
[   10.153309]
[   10.154492] Oops: 0000 [#1]
[   10.154987] CPU: 0 PID: 909 Comm: netifd Not tainted 4.10.0-rc4-00722-g41e8c70e-dirty #10
[   10.156482] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[   10.158254] task: d0deb000 task.stack: d0e0c000
[   10.159059] EIP: rt6_fill_node+0x164/0x4b8
[   10.159780] EFLAGS: 00010296 CPU: 0
[   10.160404] EAX: 00000000 EBX: d10c2358 ECX: c1f7c6cc EDX: c1f6ff44
[   10.161469] ESI: 00000000 EDI: c2059900 EBP: d0e0dc4c ESP: d0e0dbe4
[   10.162534]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
[   10.163482] CR0: 80050033 CR2: 00000198 CR3: 10d94660 CR4: 000006b0
[   10.164535] Call Trace:
[   10.164993]  ? paravirt_sched_clock+0x9/0xd
[   10.165727]  ? sched_clock+0x9/0xc
[   10.166329]  ? sched_clock_cpu+0x19/0xe9
[   10.166991]  ? lock_release+0x13e/0x36c
[   10.167652]  rt6_dump_route+0x4c/0x56
[   10.168276]  fib6_dump_node+0x1d/0x3d
[   10.168913]  fib6_walk_continue+0xab/0x167
[   10.169611]  fib6_walk+0x2a/0x40
[   10.170182]  inet6_dump_fib+0xfb/0x1e0
[   10.170855]  netlink_dump+0xcd/0x21f

This happens when the loopback device is set down and an IPv6 fib route
dump is requested.

ip6_null_entry is the root of all ipv6 fib tables making it integrated
into the table and hence passed to the ipv6 route dump code. The
null_entry route uses the loopback device for dst.dev but may not have
rt6i_idev set because of the order in which initializations are done --
ip6_route_net_init is run before addrconf_init has initialized the
loopback device. Fixing the initialization order is a much bigger problem
with no obvious solution thus far.

The BUG is triggered when the loopback is set down and the netif_running
check added by a1a22c12 fails. The fill_node descends to checking
rt->rt6i_idev for ignore_routes_with_linkdown and since rt6i_idev is
NULL it faults.

The null_entry route should not be processed in a dump request. Catch
and ignore. This check is done in rt6_dump_route as it is the highest
place in the callchain with knowledge of both the route and the network
namespace.

Fixes: a1a22c12 ("net: ipv6: Keep nexthop of multipath route on admin down")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1f17e2f2

net: ipv6: remove skb_reserve in getroute · 3b7b2b0a

Authored by David Ahern on Jan 26, 2017

Remove skb_reserve and skb_reset_mac_header from inet6_rtm_getroute. The
allocated skb is not passed through the routing engine (like it is for
IPv4) and has not since the beginning of git time.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3b7b2b0a

net: dsa: Move ports assignment closer to error checking · bc1727d2

Authored by Florian Fainelli on Jan 26, 2017

Move the assignment of ports in _dsa_register_switch() closer to where
it is checked, no functional change. Re-order declarations to
preserve the inverted christmas tree style.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bc1727d2

net: dsa: Suffix function manipulating device_node with _dn · 3512a8e9

Authored by Florian Fainelli on Jan 26, 2017

Make it clear that these functions take a device_node structure pointer.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3512a8e9

net: dsa: Make most functions take a dsa_port argument · 293784a8

Authored by Florian Fainelli on Jan 26, 2017

In preparation for allowing platform data, and therefore no valid
device_node pointer, make most DSA functions take a pointer to a
dsa_port structure whenever possible. While at it, introduce a
dsa_port_is_valid() helper function which checks whether port->dn is
NULL or not at the moment.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

293784a8

net: dsa: Pass device pointer to dsa_register_switch · 55ed0ce0

Authored by Florian Fainelli on Jan 26, 2017

In preparation for allowing dsa_register_switch() to be supplied with
device/platform data, pass down a struct device pointer instead of a
struct device_node.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

55ed0ce0

26 Jan, 2017 17 commits

batman-adv: Treat NET_XMIT_CN as transmit successfully · c3370518

Authored by Gao Feng on Nov 21, 2016

The tc could return NET_XMIT_CN as one congestion notification, but
it does not mean the packet is lost. Other modules like ipvlan,
macvlan, and others treat NET_XMIT_CN as success too.

So batman-adv should handle NET_XMIT_CN also as NET_XMIT_SUCCESS.
Signed-off-by: Gao Feng <gfree.wind@gmail.com>
[sven@narfation.org: Moved NET_XMIT_CN handling to batadv_send_skb_packet]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
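
The mapping the message describes can be sketched as a tiny helper. The enum values below are assumed to mirror the kernel's NET_XMIT_* codes, and `xmit_result()` is a hypothetical name, not the function touched by the patch.

```c
/* Assumed to mirror NET_XMIT_SUCCESS, NET_XMIT_DROP and NET_XMIT_CN. */
enum {
	XMIT_SUCCESS = 0,
	XMIT_DROP = 1,
	XMIT_CN = 2,
};

/* Fold a congestion notification into success; pass other codes through. */
static int xmit_result(int rc)
{
	return rc == XMIT_CN ? XMIT_SUCCESS : rc;
}
```

With this shape a caller only has to compare the result against success, and a congestion notification no longer looks like a lost packet.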

c3370518

batman-adv: Remove one condition check in batadv_route_unicast_packet · 0843f197

Authored by Gao Feng on Nov 21, 2016

Collecting some statements in the first condition block removes one
condition check.
Signed-off-by: Gao Feng <gfree.wind@gmail.com>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>

0843f197

batman-adv: Remove unused variable in batadv_tt_local_set_flags · 269cee62

Authored by Sven Eckelmann on Dec 17, 2016

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>

269cee62

batman-adv: update copyright years for 2017 · ac79cbb9

Authored by Sven Eckelmann on Jan 01, 2017

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>

ac79cbb9

batman-adv: don't add loop detect macs to TT · d3e9768a

Authored by Simon Wunderlich on Nov 24, 2016

The bridge loop avoidance (BLA) feature of batman-adv sends packets to
probe for Mesh/LAN packet loops. Those packets are not sent by real
clients and should therefore not be added to the translation table (TT).
Signed-off-by: Simon Wunderlich <simon.wunderlich@open-mesh.com>

d3e9768a

bridge: move maybe_deliver_addr() inside #ifdef · 5b9d6b15

Authored by Arnd Bergmann on Jan 25, 2017

The only caller of this new function is inside of an #ifdef checking
for CONFIG_BRIDGE_IGMP_SNOOPING, so we should move the implementation
there too, in order to avoid this harmless warning:

net/bridge/br_forward.c:177:13: error: 'maybe_deliver_addr' defined but not used [-Werror=unused-function]

Fixes: 6db6f0ea ("bridge: multicast to unicast")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

5b9d6b15

net: dsa: Bring back device detaching in dsa_slave_suspend() · f154be24

Authored by Florian Fainelli on Jan 25, 2017

Commit 448b4482 ("net: dsa: Add lockdep class to tx queues to avoid
lockdep splat") removed the netif_device_detach() call done in
dsa_slave_suspend() which is necessary, and paired with a corresponding
netif_device_attach(), bring it back.

Fixes: 448b4482 ("net: dsa: Add lockdep class to tx queues to avoid lockdep splat")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

f154be24

net/tcp-fastopen: make connect()'s return case more consistent with non-TFO · 3979ad7e

Authored by Willy Tarreau on Jan 25, 2017

Without TFO, any subsequent connect() call after a successful one returns
-1 EISCONN. The last API update ensured that __inet_stream_connect() can
return -1 EINPROGRESS in response to sendmsg() when TFO is in use to
indicate that the connection is now in progress. Unfortunately since this
function is used both for connect() and sendmsg(), it has the undesired
side effect of making connect() now return -1 EINPROGRESS as well after
a successful call, while at the same time poll() returns POLLOUT. This
can confuse some applications which happen to call connect() and to
check for -1 EISCONN to ensure the connection is usable, and for which
EINPROGRESS indicates a need to poll, causing a loop.

This problem was encountered in haproxy where a call to connect() is
precisely used in certain cases to confirm a connection's readiness.
While arguably haproxy's behaviour should be improved here, it seems
important to aim at a more robust behaviour when the goal of the new
API is to make it easier to implement TFO in existing applications.

This patch simply ensures that we preserve the same semantics as in
the non-TFO case on the connect() syscall when using TFO, while still
returning -1 EINPROGRESS on sendmsg(). For this we simply tell
__inet_stream_connect() whether we're doing a regular connect() or in
fact connecting for a sendmsg() call.

Cc: Wei Wang <weiwan@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>

3979ad7e

net/tcp-fastopen: Add new API support · 19f6d3f3

Authored by Wei Wang on Jan 23, 2017

This patch adds a new socket option, TCP_FASTOPEN_CONNECT, as an
alternative way to perform Fast Open on the active side (client). Prior
to this patch, a client needs to replace the connect() call with
sendto(MSG_FASTOPEN). This can be cumbersome for applications that want
to use Fast Open: these socket operations are often done in lower layer
libraries used by many other applications. Changing these libraries
and/or the socket call sequences is not trivial. A more convenient
approach is to perform Fast Open by simply enabling a socket option when
the socket is created without changing the sequence of other socket calls:
  s = socket()
    create a new socket
  setsockopt(s, IPPROTO_TCP, TCP_FASTOPEN_CONNECT …);
    newly introduced sockopt
    If set, new functionality described below will be used.
    Return ENOTSUPP if TFO is not supported or not enabled in the
    kernel.

  connect()
    With cookie present, return 0 immediately.
    With no cookie, initiate 3WHS with TFO cookie-request option and
    return -1 with errno = EINPROGRESS.

  write()/sendmsg()
    With cookie present, send out SYN with data and return the number of
    bytes buffered.
    With no cookie, and 3WHS not yet completed, return -1 with errno =
    EINPROGRESS.
    No MSG_FASTOPEN flag is needed.

  read()
    Return -1 with errno = EWOULDBLOCK/EAGAIN if connect() is called but
    write() is not called yet.
    Return -1 with errno = EWOULDBLOCK/EAGAIN if connection is
    established but no msg is received yet.
    Return number of bytes read if socket is established and there is
    msg received.

The new API simplifies life for applications that always perform a write()
immediately after a successful connect(). Such applications can now take
advantage of Fast Open by merely making one new setsockopt() call at the time
of creating the socket. Nothing else about the application's socket call
sequence needs to change.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
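
From the application side, the socket-creation step described above might look like the following sketch. `tfo_socket()` is a hypothetical helper name; the fallback definition of TCP_FASTOPEN_CONNECT is only there for C libraries whose headers predate the option, and the value is taken from the Linux UAPI header.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef TCP_FASTOPEN_CONNECT
#define TCP_FASTOPEN_CONNECT 30	/* value from the Linux UAPI header */
#endif

/*
 * Create a TCP socket with Fast Open on connect() enabled.
 * Returns the descriptor, or -1 if the socket or the option is unavailable.
 */
static int tfo_socket(void)
{
	int on = 1;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;
	if (setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN_CONNECT,
		       &on, sizeof(on)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

After this call the application keeps its existing connect() and write() sequence; no MSG_FASTOPEN flag is involved. A caller that gets -1 can fall back to a plain socket.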

19f6d3f3

net: Remove __sk_dst_reset() in tcp_v6_connect() · 25776aa9

Authored by Wei Wang on Jan 23, 2017

Remove __sk_dst_reset() in the failure handling because __sk_dst_reset()
will eventually get called when sk is released. No need to handle it in
the protocol specific connect call.
This is also to make the code path consistent with ipv4.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

25776aa9

net/tcp-fastopen: refactor cookie check logic · 065263f4

Authored by Wei Wang on Jan 23, 2017

Refactor the cookie check logic in tcp_send_syn_data() into a function.
This function will be called elsewhere in later changes.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

065263f4

tcp: correct memory barrier usage in tcp_check_space() · 56d80622

Authored by Jason Baron on Jan 24, 2017

sock_reset_flag() maps to __clear_bit() not the atomic version clear_bit().
Thus, we need smp_mb(), smp_mb__after_atomic() is not sufficient.

Fixes: 3c715127 ("tcp: add memory barriers to write space paths")
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

56d80622

tcp: reduce skb overhead in selected places · 60b1af33

Authored by Eric Dumazet on Jan 24, 2017

tcp_add_backlog() can use skb_condense() helper to get better
gains and less SKB_TRUESIZE() magic. This only happens when socket
backlog has to be used.

Some attacks involve specially crafted out of order tiny TCP packets,
clogging the ofo queue of (many) sockets.
Then later, expensive collapse happens, trying to copy all these skbs
into single ones.
This unfortunately does not work if each skb has no neighbor in TCP
sequence order.

By using skb_condense() if the skb could not be coalesced to a prior
one, we defeat this kind of threat, potentially saving 4K per skb
(or more, since this is one page fragment).

A typical NAPI driver allocates gro packets with GRO_MAX_HEAD bytes
in skb->head, meaning the copy done by skb_condense() is limited to
about 200 bytes.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

60b1af33

tipc: uninitialized return code in tipc_setsockopt() · a08ef476

Authored by Dan Carpenter on Jan 24, 2017

We shuffled some code around and added some new case statements here and
now "res" isn't initialized on all paths.

Fixes: 01fd12bb ("tipc: make replicast a user selectable option")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
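
The class of bug described here is easy to reproduce in a small sketch. The option names and `set_option()` below are hypothetical and do not correspond to the real tipc_setsockopt() cases; the point is only that a switch with a case that never assigns the result needs the result initialised up front.

```c
#include <errno.h>

/* Hypothetical option numbers, for illustration only. */
enum {
	OPT_IMPORTANCE = 1,
	OPT_MCAST_BROADCAST = 2,
};

static int set_option(int opt, int value, int *state)
{
	int res = 0;	/* initialised: some cases below never assign it */

	switch (opt) {
	case OPT_IMPORTANCE:
		if (value < 0)
			res = -EINVAL;
		else
			*state = value;
		break;
	case OPT_MCAST_BROADCAST:
		*state = 0;	/* this path never touches res */
		break;
	default:
		res = -EINVAL;
	}
	return res;
}
```

Without the initialiser, the second case would return whatever happened to be on the stack.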

a08ef476

net sched actions: Add support for user cookies · 1045ba77

Authored by Jamal Hadi Salim on Jan 24, 2017

Introduce optional 128-bit action cookie.
Like all other cookie schemes in the networking world (e.g. in protocols
like http or existing kernel fib protocol field, etc) the idea is to save
user state that when retrieved serves as a correlator. The kernel
_should not_ interpret it.  The user can store whatever they wish in the
128 bits.

Sample exercise (showing variable length use of cookie)

.. create an accept action with cookie a1b2c3d4
sudo $TC actions add action ok index 1 cookie a1b2c3d4

.. dump all gact actions..
sudo $TC -s actions ls action gact

    action order 0: gact action pass
     random type none pass val 0
     index 1 ref 1 bind 0 installed 5 sec used 5 sec
    Action statistics:
    Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0
    cookie a1b2c3d4

.. bind the accept action to a filter..
sudo $TC filter add dev lo parent ffff: protocol ip prio 1 \
u32 match ip dst 127.0.0.1/32 flowid 1:1 action gact index 1

... send some traffic..
$ ping 127.0.0.1 -c 3
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.020 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.027 ms
64 bytes from 127.0.0.1: icmp_seq=3 ttl=64 time=0.038 ms
Signed-off-by: David S. Miller <davem@davemloft.net>

1045ba77

sctp: sctp gso should set feature with NETIF_F_SG when calling skb_segment · 5207f399

Authored by Xin Long on Jan 24, 2017

Now sctp gso puts segments into skb's frag_list, then processes these
segments in skb_segment. But skb_segment handles them only when sg is
enabled, as it's in the same branch with skb's frags.

Almost all NICs support sg, apart from some old ones. But since commit
1e16aa3d ("net: gso: use feature flag argument in all protocol gso
handlers"), features &= skb->dev->hw_enc_features, and
xfrm_output_gso calls skb_segment with features = 0, which means sctp
gso would call skb_segment with sg = 0, and skb_segment would not work
as expected.

This patch is to fix it by setting features param with NETIF_F_SG when
calling skb_segment so that it can go the right branch to process the
skb's frag_list.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5207f399

sctp: sctp_addr_id2transport should verify the addr before looking up assoc · 6f29a130

Authored by Xin Long on Jan 24, 2017

sctp_addr_id2transport is a function for sockopt to look up assoc by
address. As the address is from userspace, it can be a v4-mapped v6
address. But in sctp protocol stack, it always handles a v4-mapped
v6 address as a v4 address. So it's necessary to convert it to a v4
address before looking up assoc by address.

This patch is to fix it by calling sctp_verify_addr in which it can do
this conversion before calling sctp_endpoint_lookup_assoc, just like
what sctp_sendmsg and __sctp_connect do for the address from users.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
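
The conversion the message refers to can be illustrated in userspace with the standard socket headers. `unmap_v4mapped()` is a hypothetical helper written for this sketch; it is not sctp_verify_addr() and makes no claim about how the kernel implements the step.

```c
#include <arpa/inet.h>		/* inet_pton(), used by callers of the helper */
#include <netinet/in.h>
#include <string.h>

/*
 * If src is a v4-mapped v6 address (::ffff:a.b.c.d), copy the embedded
 * IPv4 address into *out and return 1. Otherwise return 0.
 */
static int unmap_v4mapped(const struct in6_addr *src, struct in_addr *out)
{
	if (!IN6_IS_ADDR_V4MAPPED(src))
		return 0;
	/* The IPv4 address is the last four bytes, already in network order. */
	memcpy(&out->s_addr, &src->s6_addr[12], sizeof(out->s_addr));
	return 1;
}
```

Parsing `::ffff:192.0.2.1` with inet_pton() and passing it through the helper yields the same bytes as parsing `192.0.2.1` directly, so a lookup keyed on v4 addresses finds the entry.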

6f29a130

25 Jan, 2017 12 commits

lwtunnel: Fix oops on state free after encap module unload · 85c81401

Authored by Robert Shearman on Jan 24, 2017

When attempting to free lwtunnel state after the module for the encap
has been unloaded an oops occurs:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: lwtstate_free+0x18/0x40
[..]
task: ffff88003e372380 task.stack: ffffc900001fc000
RIP: 0010:lwtstate_free+0x18/0x40
RSP: 0018:ffff88003fd83e88 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88002bbb3380 RCX: ffff88000c91a300
[..]
Call Trace:
 <IRQ>
 free_fib_info_rcu+0x195/0x1a0
 ? rt_fibinfo_free+0x50/0x50
 rcu_process_callbacks+0x2d3/0x850
 ? rcu_process_callbacks+0x296/0x850
 __do_softirq+0xe4/0x4cb
 irq_exit+0xb0/0xc0
 smp_apic_timer_interrupt+0x3d/0x50
 apic_timer_interrupt+0x93/0xa0
[..]
Code: e8 6e c6 fc ff 89 d8 5b 5d c3 bb de ff ff ff eb f4 66 90 66 66 66 66 90 55 48 89 e5 53 0f b7 07 48 89 fb 48 8b 04 c5 00 81 d5 81 <48> 8b 40 08 48 85 c0 74 13 ff d0 48 8d 7b 20 be 20 00 00 00 e8

The problem is that once the module for the encap has been unloaded, the
corresponding ops is removed and is thus NULL here.

Modules implementing lwtunnel ops should not be allowed to unload
while there is state alive using those ops, so grab the module
reference for the ops on creating lwtunnel state and of course release
the reference when freeing the state.

Fixes: 1104d9ba ("lwtunnel: Add destroy state operation")
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
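
The "grab a reference on create, release it on free" rule can be modelled in a few lines of userspace C. All names here (`struct encap_module`, `state_build()`, `state_free()`, `module_unload()`) are hypothetical and only imitate the idea; the kernel uses its own module reference counting.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model of a module that provides encap ops. */
struct encap_module {
	int refcnt;
	bool live;
};

struct encap_state {
	struct encap_module *owner;
};

/* Creating state pins the module that owns the ops. */
static struct encap_state *state_build(struct encap_module *m)
{
	struct encap_state *st;

	if (!m->live)
		return NULL;
	st = malloc(sizeof(*st));
	if (!st)
		return NULL;
	m->refcnt++;
	st->owner = m;
	return st;
}

/* Freeing state drops the pin. */
static void state_free(struct encap_state *st)
{
	st->owner->refcnt--;
	free(st);
}

/* Unload is refused while any state still references the module. */
static bool module_unload(struct encap_module *m)
{
	if (m->refcnt > 0)
		return false;
	m->live = false;
	return true;
}
```

In this model the ops can never disappear underneath live state, which is the situation that produced the oops.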

85c81401

net: Specify the owning module for lwtunnel ops · 88ff7334

Authored by Robert Shearman on Jan 24, 2017

Modules implementing lwtunnel ops should not be allowed to unload
while there is state alive using those ops, so specify the owning
module for all lwtunnel ops.
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

88ff7334

tipc: fix cleanup at module unload · 35e22e49

Authored by Parthasarathy Bhuvaragan on Jan 24, 2017

In tipc_server_stop(), we iterate over the connections with limiting
factor as server's idr_in_use. We ignore the fact that this variable
is decremented in tipc_close_conn(), leading to premature exit.

In this commit, we iterate until we have no connections left.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Tested-by: John Thompson <thompa.atl@gmail.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
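
The premature exit is a general loop pitfall and can be shown with a small model. `struct server`, `close_conn()`, `stop_buggy()` and `stop_fixed()` are hypothetical names for this sketch; they imitate the shape of the problem, not the TIPC code itself.

```c
#include <stdbool.h>

#define MAX_CONN 8

struct server {
	bool open[MAX_CONN];
	int in_use;		/* decremented by close_conn() */
};

static void close_conn(struct server *s, int id)
{
	if (s->open[id]) {
		s->open[id] = false;
		s->in_use--;
	}
}

/* Buggy shape: the loop bound shrinks while id grows. */
static void stop_buggy(struct server *s)
{
	for (int id = 0; id < s->in_use; id++)
		close_conn(s, id);
}

/* Fixed shape: keep scanning until no connection is left. */
static void stop_fixed(struct server *s)
{
	for (int id = 0; s->in_use > 0 && id < MAX_CONN; id++)
		close_conn(s, id);
}
```

With four open connections the buggy loop stops after closing two of them, because the counter and the index meet in the middle; the fixed loop closes all four.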

35e22e49

tipc: ignore requests when the connection state is not CONNECTED · 4c887aa6

Authored by Parthasarathy Bhuvaragan on Jan 24, 2017

In tipc_conn_sendmsg(), we first queue the request to the outqueue
followed by the connection state check. If the connection is not
connected, we should not queue this message.

In this commit, we reject the messages if the connection state is
not CF_CONNECTED.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Tested-by: John Thompson <thompa.atl@gmail.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4c887aa6

tipc: fix nametbl_lock soft lockup at module exit · 9dc3abdd

Authored by Parthasarathy Bhuvaragan on Jan 24, 2017

Commit 333f7962 ("tipc: fix a race condition leading to
subscriber refcnt bug") reveals a soft lockup while acquiring
nametbl_lock.

Before commit 333f7962, we call tipc_conn_shutdown() from
tipc_close_conn() in the context of tipc_topsrv_stop(). In that
context, we are allowed to grab the nametbl_lock.

Commit 333f7962 moved tipc_conn_release (renamed from
tipc_conn_shutdown) to the connection refcount cleanup. This allows
either tipc_nametbl_withdraw() or tipc_topsrv_stop() to perform the
cleanup.

Since tipc_exit_net() first calls tipc_topsrv_stop() and then
tipc_nametbl_withdraw(), the chances increase for the latter to
perform the connection cleanup.

The soft lockup occurs in the call chain of tipc_nametbl_withdraw(),
when it performs the tipc_conn_kref_release() as it tries to grab
nametbl_lock again while holding it already.
tipc_nametbl_withdraw() grabs nametbl_lock
  tipc_nametbl_remove_publ()
    tipc_subscrp_report_overlap()
      tipc_subscrp_send_event()
        tipc_conn_sendmsg()
          << if (con->flags != CF_CONNECTED) we do conn_put(),
             triggering the cleanup as refcount=0. >>
          tipc_conn_kref_release
            tipc_sock_release
              tipc_conn_release
                tipc_subscrb_delete
                  tipc_subscrp_delete
                    tipc_nametbl_unsubscribe << Soft Lockup >>

The previous changes in this series fix the race conditions fixed
by commit 333f7962. Hence we can now revert the commit.

Fixes: 333f7962 ("tipc: fix a race condition leading to subscriber refcnt bug")
Reported-and-Tested-by: John Thompson <thompa.atl@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9dc3abdd

tipc: fix connection refcount error · fc0adfc8

Authored by Parthasarathy Bhuvaragan on Jan 24, 2017

Until now, the generic server framework maintains the connection
id's per subscriber in server's conn_idr. At tipc_close_conn, we
remove the connection id from the server list, but the connection is
valid until we call the refcount cleanup. Hence we have a window
where the server allocates the same connection to a new subscriber
leading to inconsistent reference count. We have another refcount
warning when we grab the refcount in tipc_conn_lookup() for connections
with the CF_CONNECTED flag not set. This usually occurs at shutdown
when we stop the topology server and withdraw TIPC_CFG_SRV
publication thereby triggering a withdraw message to subscribers.

In this commit, we:
1. remove the connection from the server list at refcount cleanup.
2. grab the refcount for a connection only if CF_CONNECTED is set.
Tested-by: John Thompson <thompa.atl@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fc0adfc8

tipc: add subscription refcount to avoid invalid delete · d094c4d5

Authored by Parthasarathy Bhuvaragan on Jan 24, 2017

Until now, the subscribers keep track of the subscriptions using
reference count at subscriber level. At subscription cancel or
subscriber delete, we delete the subscription only if the timer
was pending for the subscription. This approach is incorrect as:
1. del_timer() is not SMP safe, if on CPU0 the check for pending
   timer returns true but CPU1 might schedule the timer callback
   thereby deleting the subscription. Thus when CPU0 is scheduled,
   it deletes an invalid subscription.
2. We export tipc_subscrp_report_overlap(), which accesses the
   subscription pointer multiple times. Meanwhile the subscription
   timer can expire thereby freeing the subscription and we might
   continue to access the subscription pointer leading to memory
   violations.

In this commit, we introduce subscription refcount to avoid deleting
an invalid subscription.
Reported-and-Tested-by: John Thompson <thompa.atl@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d094c4d5

tipc: fix nametbl_lock soft lockup at node/link events · 93f955aa

Authored by Parthasarathy Bhuvaragan on Jan 24, 2017

We trigger a soft lockup as we grab nametbl_lock twice if the node
has a pending node up/down or link up/down event while:
- we process an incoming named message in tipc_named_rcv() and
  perform a tipc_update_nametbl().
- we have pending backlog items in the name distributor queue
  during a nametable update using tipc_nametbl_publish() or
  tipc_nametbl_withdraw().

The associated call chains are:
tipc_named_rcv() Grabs nametbl_lock
   tipc_update_nametbl() (publish/withdraw)
     tipc_node_subscribe()/unsubscribe()
       tipc_node_write_unlock()
          << lockup occurs if an outstanding node/link event
             exists, as we grab nametbl_lock again >>

tipc_nametbl_withdraw() Grab nametbl_lock
  tipc_named_process_backlog()
    tipc_update_nametbl()
      << rest as above >>

The function tipc_node_write_unlock(), in addition to releasing the
lock processes the outstanding node/link up/down events. To do this,
we need to grab the nametbl_lock again leading to the lockup.

In this commit we fix the soft lockup by introducing a fast variant of
node_unlock(), where we just release the lock. We adapt the
node_subscribe()/node_unsubscribe() to use the fast variants.
Reported-and-Tested-by: John Thompson <thompa.atl@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

93f955aa

netfilter: nf_tables: bump set->ndeact on set flush · b2c11e4b

Authored by Pablo Neira Ayuso on Jan 24, 2017

Add missing set->ndeact update on each deactivated element from the set
flush path. Otherwise, sets with fixed size break after flush since
accounting breaks.

 # nft add set x y { type ipv4_addr\; size 2\; }
 # nft add element x y { 1.1.1.1 }
 # nft add element x y { 1.1.1.2 }
 # nft flush set x y
 # nft add element x y { 1.1.1.1 }
 <cmdline>:1:1-28: Error: Could not process rule: Too many open files in system

Fixes: 8411b644 ("netfilter: nf_tables: support for set flushing")
Reported-by: Elise Lennion <elise.lennion@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

b2c11e4b

netfilter: nf_tables: deconstify walk callback function · de70185d

Authored by Pablo Neira Ayuso on Jan 24, 2017

The flush operation needs to modify set and element objects, so let's
deconstify this.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

de70185d

netfilter: nf_tables: fix set->nelems counting with no NLM_F_EXCL · 35d0ac90

Authored by Pablo Neira Ayuso on Jan 24, 2017

If the element exists and no NLM_F_EXCL is specified, do not bump
set->nelems, otherwise we leak one set element slot. This problem
amplifies if the set is full since the abort path always decrements the
counter for the -ENFILE case too, giving one spare extra slot.

Fix this by moving set->nelems update to nft_add_set_elem() after
successful element insertion. Moreover, remove the element if the set is
full so there is no need to rely on the abort path to undo things
anymore.

Fixes: c016c7e4 ("netfilter: nf_tables: honor NLM_F_EXCL flag in set element insertion")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
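
The accounting rule ("only count an element once it has really been inserted") can be sketched with a toy bounded set. `struct set`, `set_has()` and `set_add()` are hypothetical; the sketch checks for a full set before inserting, which is simpler than the insert-then-remove approach the message describes, and the `excl` flag stands in for NLM_F_EXCL.

```c
#include <errno.h>
#include <stdbool.h>

#define SLOTS 16

struct set {
	unsigned int size;	/* maximum number of elements, 0 = unlimited */
	unsigned int nelems;	/* elements actually stored */
	int elems[SLOTS];
};

static bool set_has(const struct set *s, int v)
{
	for (unsigned int i = 0; i < s->nelems; i++)
		if (s->elems[i] == v)
			return true;
	return false;
}

static int set_add(struct set *s, int v, bool excl)
{
	if (set_has(s, v))
		return excl ? -EEXIST : 0;	/* duplicate: no counter bump */
	if (s->nelems >= SLOTS || (s->size && s->nelems >= s->size))
		return -ENFILE;			/* full: nothing to undo */
	s->elems[s->nelems++] = v;		/* count only after insertion */
	return 0;
}
```

Re-adding an existing element without the exclusive flag succeeds and leaves the counter alone, so no slot leaks.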

35d0ac90

netfilter: nft_log: restrict the log prefix length to 127 · 5ce6b04c

Authored by Liping Zhang on Jan 22, 2017

First, log prefix will be truncated to NF_LOG_PREFIXLEN-1, i.e. 127,
at nf_log_packet(), so the extra part is useless.

Second, after adding a log rule with a very long prefix, we will
fail to dump the nft rules after this _special_ one, but actually,
they do exist. For example:
  # name_65000=$(printf "%0.sQ" {1..65000})
  # nft add rule filter output log prefix "$name_65000"
  # nft add rule filter output counter
  # nft add rule filter output counter
  # nft list chain filter output
  table ip filter {
      chain output {
          type filter hook output priority 0; policy accept;
      }
  }

So now, restrict the log prefix length to NF_LOG_PREFIXLEN-1.

Fixes: 96518518 ("netfilter: add nftables")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
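
The length restriction amounts to a simple bound check, sketched below. `LOG_PREFIXLEN` is assumed to equal NF_LOG_PREFIXLEN (128, so 127 usable characters, as the message states); `check_log_prefix()` and the choice of -EINVAL are hypothetical and say nothing about how or with which error code the kernel rejects the attribute.

```c
#include <errno.h>
#include <string.h>

#define LOG_PREFIXLEN 128	/* assumed equal to NF_LOG_PREFIXLEN */

/* Accept prefixes of at most LOG_PREFIXLEN - 1 characters. */
static int check_log_prefix(const char *prefix)
{
	return strlen(prefix) > LOG_PREFIXLEN - 1 ? -EINVAL : 0;
}
```

Rejecting an over-long prefix when the rule is added avoids storing a rule that cannot be dumped later.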

5ce6b04c

openanolis / cloud-kernel synced successfully almost 2 years ago
