提交 · fb23403536eabe81ee90d32cb3051030b871d988 · openanolis / cloud-kernel

12 2月, 2018 1 次提交

vfs: do bulk POLL* -> EPOLL* replacement · a9a08845

由 Linus Torvalds 提交于 2月 11, 2018

This is the mindless scripted replacement of kernel use of POLL*
variables as described by Al, done by this script:

    for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
        L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
        for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
    done

with de-mangling cleanups yet to come.

NOTE! On almost all architectures, the EPOLL* constants have the same
values as the POLL* constants do.  But they keyword here is "almost".
For various bad reasons they aren't the same, and epoll() doesn't
actually work quite correctly in some cases due to this on Sparc et al.

The next patch from Al will sort out the final differences, and we
should be all done.
Scripted-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a9a08845

08 2月, 2018 1 次提交

tcp: tracepoint: only call trace_tcp_send_reset with full socket · 5c487bb9

由 Song Liu 提交于 2月 06, 2018

tracepoint tcp_send_reset requires a full socket to work. However, it
may be called when in TCP_TIME_WAIT:

        case TCP_TW_RST:
                tcp_v6_send_reset(sk, skb);
                inet_twsk_deschedule_put(inet_twsk(sk));
                goto discard_it;

To avoid this problem, this patch checks the socket with sk_fullsock()
before calling trace_tcp_send_reset().

Fixes: c24b14c4 ("tcp: add tracepoint trace_tcp_send_reset")
Signed-off-by: NSong Liu <songliubraving@fb.com>
Reviewed-by: NLawrence Brakmo <brakmo@fb.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5c487bb9

07 2月, 2018 3 次提交

netfilter: nf_tables: fix flowtable free · b408c5b0

由 Pablo Neira Ayuso 提交于 2月 06, 2018

Every flow_offload entry is added into the table twice. Because of this,
rhashtable_free_and_destroy can't be used, since it would call kfree for
each flow_offload object twice.

This patch cleans up the flowtable via nf_flow_table_iterate() to
schedule removal of entries by setting on the dying bit, then there is
an explicitly invocation of the garbage collector to release resources.

Based on patch from Felix Fietkau.
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

b408c5b0

net: erspan: fix erspan config overwrite · 39f57f67

由 William Tu 提交于 2月 05, 2018

When an erspan tunnel device receives an erpsan packet with different
tunnel metadata (ex: version, index, hwid, direction), existing code
overwrites the tunnel device's erspan configuration with the received
packet's metadata.  The patch fixes it.

Fixes: 1a66a836 ("gre: add collect_md mode to ERSPAN tunnel")
Fixes: f551c91d ("net: erspan: introduce erspan v2 for ip_gre")
Fixes: ef7baf5e ("ip6_gre: add ip6 erspan collect_md mode")
Fixes: 94d7d8f2 ("ip6_gre: add erspan v2 support")
Signed-off-by: NWilliam Tu <u9012063@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

39f57f67

net: erspan: fix metadata extraction · 3df19283

由 William Tu 提交于 2月 05, 2018

Commit d350a823 ("net: erspan: create erspan metadata uapi header")
moves the erspan 'version' in front of the 'struct erspan_md2' for
later extensibility reason.  This breaks the existing erspan metadata
extraction code because the erspan_md2 then has a 4-byte offset
to between the erspan_metadata and erspan_base_hdr.  This patch
fixes it.

Fixes: 1a66a836 ("gre: add collect_md mode to ERSPAN tunnel")
Fixes: ef7baf5e ("ip6_gre: add ip6 erspan collect_md mode")
Fixes: 1d7e2ed2 ("net: erspan: refactor existing erspan code")
Signed-off-by: NWilliam Tu <u9012063@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3df19283

06 2月, 2018 1 次提交

net: add a UID to use for ULP socket assignment · b11a632c

由 John Fastabend 提交于 2月 05, 2018

Create a UID field and enum that can be used to assign ULPs to
sockets. This saves a set of string comparisons if the ULP id
is known.

For sockmap, which is added in the next patches, a ULP is used to
hook into TCP sockets close state. In this case the ULP being added
is done at map insert time and the ULP is known and done on the kernel
side. In this case the named lookup is not needed. Because we don't
want to expose psock internals to user space socket options a user
visible flag is also added. For TLS this is set for BPF it will be
cleared.

Alos remove pr_notice, user gets an error code back and should check
that rather than rely on logs.
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

b11a632c

03 2月, 2018 1 次提交

Revert "defer call to mem_cgroup_sk_alloc()" · edbe69ef

由 Roman Gushchin 提交于 2月 02, 2018

This patch effectively reverts commit 9f1c2674 ("net: memcontrol:
defer call to mem_cgroup_sk_alloc()").

Moving mem_cgroup_sk_alloc() to the inet_csk_accept() completely breaks
memcg socket memory accounting, as packets received before memcg
pointer initialization are not accounted and are causing refcounting
underflow on socket release.

Actually the free-after-use problem was fixed by
commit c0576e39 ("net: call cgroup_sk_alloc() earlier in
sk_clone_lock()") for the cgroup pointer.

So, let's revert it and call mem_cgroup_sk_alloc() just before
cgroup_sk_alloc(). This is safe, as we hold a reference to the socket
we're cloning, and it holds a reference to the memcg.

Also, let's drop BUG_ON(mem_cgroup_is_root()) check from
mem_cgroup_sk_alloc(). I see no reasons why bumping the root
memcg counter is a good reason to panic, and there are no realistic
ways to hit it.
Signed-off-by: NRoman Gushchin <guro@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

edbe69ef

02 2月, 2018 2 次提交

netfilter: flowtable infrastructure depends on NETFILTER_INGRESS · 6be3bcd7

由 Pablo Neira Ayuso 提交于 1月 31, 2018

config NF_FLOW_TABLE depends on NETFILTER_INGRESS. If users forget to
enable this toggle, flowtable registration fails with EOPNOTSUPP.

Moreover, turn 'select NF_FLOW_TABLE' in every flowtable family flavour
into dependency instead, otherwise this new dependency on
NETFILTER_INGRESS causes a warning. This also allows us to remove the
explicit dependency between family flowtables <-> NF_TABLES and
NF_CONNTRACK, given they depend on the NF_FLOW_TABLE core that already
expresses the general dependencies for this new infrastructure.

Moreover, NF_FLOW_TABLE_INET depends on NF_FLOW_TABLE_IPV4 and
NF_FLOWTABLE_IPV6, which already depends on NF_FLOW_TABLE. So we can get
rid of direct dependency with NF_FLOW_TABLE.

In general, let's avoid 'select', it just makes things more complicated.
Reported-by: NJohn Crispin <john@phrozen.org>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

6be3bcd7

net: igmp: add a missing rcu locking section · e7aadb27

由 Eric Dumazet 提交于 2月 01, 2018

Newly added igmpv3_get_srcaddr() needs to be called under rcu lock.

Timer callbacks do not ensure this locking.

=============================
WARNING: suspicious RCU usage
4.15.0+ #200 Not tainted
-----------------------------
./include/linux/inetdevice.h:216 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
3 locks held by syzkaller616973/4074:
 #0:  (&mm->mmap_sem){++++}, at: [<00000000bfce669e>] __do_page_fault+0x32d/0xc90 arch/x86/mm/fault.c:1355
 #1:  ((&im->timer)){+.-.}, at: [<00000000619d2f71>] lockdep_copy_map include/linux/lockdep.h:178 [inline]
 #1:  ((&im->timer)){+.-.}, at: [<00000000619d2f71>] call_timer_fn+0x1c6/0x820 kernel/time/timer.c:1316
 #2:  (&(&im->lock)->rlock){+.-.}, at: [<000000005f833c5c>] spin_lock_bh include/linux/spinlock.h:315 [inline]
 #2:  (&(&im->lock)->rlock){+.-.}, at: [<000000005f833c5c>] igmpv3_send_report+0x98/0x5b0 net/ipv4/igmp.c:600

stack backtrace:
CPU: 0 PID: 4074 Comm: syzkaller616973 Not tainted 4.15.0+ #200
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x194/0x257 lib/dump_stack.c:53
 lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4592
 __in_dev_get_rcu include/linux/inetdevice.h:216 [inline]
 igmpv3_get_srcaddr net/ipv4/igmp.c:329 [inline]
 igmpv3_newpack+0xeef/0x12e0 net/ipv4/igmp.c:389
 add_grhead.isra.27+0x235/0x300 net/ipv4/igmp.c:432
 add_grec+0xbd3/0x1170 net/ipv4/igmp.c:565
 igmpv3_send_report+0xd5/0x5b0 net/ipv4/igmp.c:605
 igmp_send_report+0xc43/0x1050 net/ipv4/igmp.c:722
 igmp_timer_expire+0x322/0x5c0 net/ipv4/igmp.c:831
 call_timer_fn+0x228/0x820 kernel/time/timer.c:1326
 expire_timers kernel/time/timer.c:1363 [inline]
 __run_timers+0x7ee/0xb70 kernel/time/timer.c:1666
 run_timer_softirq+0x4c/0x70 kernel/time/timer.c:1692
 __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
 invoke_softirq kernel/softirq.c:365 [inline]
 irq_exit+0x1cc/0x200 kernel/softirq.c:405
 exiting_irq arch/x86/include/asm/apic.h:541 [inline]
 smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
 apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:938

Fixes: a46182b0 ("net: igmp: Use correct source address on IGMPv3 reports")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: Nsyzbot <syzkaller@googlegroups.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e7aadb27

01 2月, 2018 2 次提交

inet: Avoid unitialized variable warning in inet_unhash() · 0ba98718

由 Geert Uytterhoeven 提交于 2月 01, 2018

With gcc-4.1.2:

    net/ipv4/inet_hashtables.c: In function ‘inet_unhash’:
    net/ipv4/inet_hashtables.c:628: warning: ‘ilb’ may be used uninitialized in this function

While this is a false positive, it can easily be avoided by using the
pointer itself as the canary variable.
Signed-off-by: NGeert Uytterhoeven <geert@linux-m68k.org>
Acked-by: NArnd Bergmann <arnd@arndb.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0ba98718

tcp_bbr: fix pacing_gain to always be unity when using lt_bw · 3aff3b4b

由 Neal Cardwell 提交于 1月 31, 2018

This commit fixes the pacing_gain to remain at BBR_UNIT (1.0) when
using lt_bw and returning from the PROBE_RTT state to PROBE_BW.

Previously, when using lt_bw, upon exiting PROBE_RTT and entering
PROBE_BW the bbr_reset_probe_bw_mode() code could sometimes randomly
end up with a cycle_idx of 0 and hence have bbr_advance_cycle_phase()
set a pacing gain above 1.0. In such cases this would result in a
pacing rate that is 1.25x higher than intended, potentially resulting
in a high loss rate for a little while until we stop using the lt_bw a
bit later.

This commit is a stable candidate for kernels back as far as 4.9.

Fixes: 0f8782ea ("tcp_bbr: add BBR congestion control")
Signed-off-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NYuchung Cheng <ycheng@google.com>
Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
Reported-by: NBeyers Cronje <bcronje@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3aff3b4b

31 1月, 2018 3 次提交

netfilter: on sockopt() acquire sock lock only in the required scope · 3f34cfae

由 Paolo Abeni 提交于 1月 30, 2018

Syzbot reported several deadlocks in the netfilter area caused by
rtnl lock and socket lock being acquired with a different order on
different code paths, leading to backtraces like the following one:

======================================================
WARNING: possible circular locking dependency detected
4.15.0-rc9+ #212 Not tainted
------------------------------------------------------
syzkaller041579/3682 is trying to acquire lock:
  (sk_lock-AF_INET6){+.+.}, at: [<000000008775e4dd>] lock_sock
include/net/sock.h:1463 [inline]
  (sk_lock-AF_INET6){+.+.}, at: [<000000008775e4dd>]
do_ipv6_setsockopt.isra.8+0x3c5/0x39d0 net/ipv6/ipv6_sockglue.c:167

but task is already holding lock:
  (rtnl_mutex){+.+.}, at: [<000000004342eaa9>] rtnl_lock+0x17/0x20
net/core/rtnetlink.c:74

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (rtnl_mutex){+.+.}:
        __mutex_lock_common kernel/locking/mutex.c:756 [inline]
        __mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893
        mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
        rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74
        register_netdevice_notifier+0xad/0x860 net/core/dev.c:1607
        tee_tg_check+0x1a0/0x280 net/netfilter/xt_TEE.c:106
        xt_check_target+0x22c/0x7d0 net/netfilter/x_tables.c:845
        check_target net/ipv6/netfilter/ip6_tables.c:538 [inline]
        find_check_entry.isra.7+0x935/0xcf0
net/ipv6/netfilter/ip6_tables.c:580
        translate_table+0xf52/0x1690 net/ipv6/netfilter/ip6_tables.c:749
        do_replace net/ipv6/netfilter/ip6_tables.c:1165 [inline]
        do_ip6t_set_ctl+0x370/0x5f0 net/ipv6/netfilter/ip6_tables.c:1691
        nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
        nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115
        ipv6_setsockopt+0x115/0x150 net/ipv6/ipv6_sockglue.c:928
        udpv6_setsockopt+0x45/0x80 net/ipv6/udp.c:1422
        sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2978
        SYSC_setsockopt net/socket.c:1849 [inline]
        SyS_setsockopt+0x189/0x360 net/socket.c:1828
        entry_SYSCALL_64_fastpath+0x29/0xa0

-> #0 (sk_lock-AF_INET6){+.+.}:
        lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3914
        lock_sock_nested+0xc2/0x110 net/core/sock.c:2780
        lock_sock include/net/sock.h:1463 [inline]
        do_ipv6_setsockopt.isra.8+0x3c5/0x39d0 net/ipv6/ipv6_sockglue.c:167
        ipv6_setsockopt+0xd7/0x150 net/ipv6/ipv6_sockglue.c:922
        udpv6_setsockopt+0x45/0x80 net/ipv6/udp.c:1422
        sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2978
        SYSC_setsockopt net/socket.c:1849 [inline]
        SyS_setsockopt+0x189/0x360 net/socket.c:1828
        entry_SYSCALL_64_fastpath+0x29/0xa0

other info that might help us debug this:

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(rtnl_mutex);
                                lock(sk_lock-AF_INET6);
                                lock(rtnl_mutex);
   lock(sk_lock-AF_INET6);

  *** DEADLOCK ***

1 lock held by syzkaller041579/3682:
  #0:  (rtnl_mutex){+.+.}, at: [<000000004342eaa9>] rtnl_lock+0x17/0x20
net/core/rtnetlink.c:74

The problem, as Florian noted, is that nf_setsockopt() is always
called with the socket held, even if the lock itself is required only
for very tight scopes and only for some operation.

This patch addresses the issues moving the lock_sock() call only
where really needed, namely in ipv*_getorigdst(), so that nf_setsockopt()
does not need anymore to acquire both locks.

Fixes: 22265a5c ("netfilter: xt_TEE: resolve oif using netdevice notifiers")
Reported-by: syzbot+a4c2dc980ac1af699b36@syzkaller.appspotmail.com
Suggested-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

3f34cfae

tcp_nv: fix potential integer overflow in tcpnv_acked · e4823fbd

由 Gustavo A. R. Silva 提交于 1月 30, 2018

Add suffix ULL to constant 80000 in order to avoid a potential integer
overflow and give the compiler complete information about the proper
arithmetic to use. Notice that this constant is used in a context that
expects an expression of type u64.

The current cast to u64 effectively applies to the whole expression
as an argument of type u64 to be passed to div64_u64, but it does
not prevent it from being evaluated using 32-bit arithmetic instead
of 64-bit arithmetic.

Also, once the expression is properly evaluated using 64-bit arithmentic,
there is no need for the parentheses and the external cast to u64.

Addresses-Coverity-ID: 1357588 ("Unintentional integer overflow")
Signed-off-by: NGustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e4823fbd

netfilter: ipt_CLUSTERIP: fix out-of-bounds accesses in clusterip_tg_check() · 1a38956c

由 Dmitry Vyukov 提交于 1月 30, 2018

Commit 136e92bb switched local_nodes from an array to a bitmask
but did not add proper bounds checks. As the result
clusterip_config_init_nodelist() can both over-read
ipt_clusterip_tgt_info.local_nodes and over-write
clusterip_config.local_nodes.

Add bounds checks for both.

Fixes: 136e92bb ("[NETFILTER] CLUSTERIP: use a bitmap to store node responsibility data")
Signed-off-by: NDmitry Vyukov <dvyukov@google.com>
Reported-by: Nsyzbot <syzkaller@googlegroups.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

1a38956c

30 1月, 2018 3 次提交

ipmr: Fix ptrdiff_t print formatting · 91e6dd82

由 James Hogan 提交于 1月 30, 2018

ipmr_vif_seq_show() prints the difference between two pointers with the
format string %2zd (z for size_t), however the correct format string is
%2td instead (t for ptrdiff_t).

The same bug in ip6mr_vif_seq_show() was already fixed long ago by
commit d430a227 ("bogus format in ip6mr").
Signed-off-by: NJames Hogan <jhogan@kernel.org>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: netdev@vger.kernel.org
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

91e6dd82

tcp: release sk_frag.page in tcp_disconnect · 9b42d55a

由 Li RongQing 提交于 1月 26, 2018

socket can be disconnected and gets transformed back to a listening
socket, if sk_frag.page is not released, which will be cloned into
a new socket by sk_clone_lock, but the reference count of this page
is increased, lead to a use after free or double free issue
Signed-off-by: NLi RongQing <lirongqing@baidu.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9b42d55a

ipv4: Get the address of interface correctly. · 30e948a3

由 Tonghao Zhang 提交于 1月 28, 2018

When using ioctl to get address of interface, we can't
get it anymore. For example, the command is show as below.

	# ifconfig eth0

In the patch ("03aef17b"), the devinet_ioctl does not
return a suitable value, even though we can find it in
the kernel. Then fix it now.

Fixes: 03aef17b ("devinet_ioctl(): take copyin/copyout to caller")
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NTonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

30e948a3

26 1月, 2018 7 次提交

net/ipv4: Allow send to local broadcast from a socket bound to a VRF · 9515a2e0

由 David Ahern 提交于 1月 24, 2018

Message sends to the local broadcast address (255.255.255.255) require
uc_index or sk_bound_dev_if to be set to an egress device. However,
responses or only received if the socket is bound to the device. This
is overly constraining for processes running in an L3 domain. This
patch allows a socket bound to the VRF device to send to the local
broadcast address by using IP_UNICAST_IF to set the egress interface
with packet receipt handled by the VRF binding.

Similar to IP_MULTICAST_IF, relax the constraint on setting
IP_UNICAST_IF if a socket is bound to an L3 master device. In this
case allow uc_index to be set to an enslaved if sk_bound_dev_if is
an L3 master device and is the master device for the ifindex.

In udp and raw sendmsg, allow uc_index to override the oif if
uc_index master device is oif (ie., the oif is an L3 master and the
index is an L3 slave).
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9515a2e0

net: erspan: use bitfield instead of mask and offset · c69de58b

由 William Tu 提交于 1月 25, 2018

Originally the erspan fields are defined as a group into a __be16 field,
and use mask and offset to access each field.  This is more costly due to
calling ntohs/htons.  The patch changes it to use bitfields.
Signed-off-by: NWilliam Tu <u9012063@gmail.com>
Acked-by: NPravin B Shelar <pshelar@ovn.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c69de58b

bpf: Add BPF_SOCK_OPS_STATE_CB · d4487491

由 Lawrence Brakmo 提交于 1月 25, 2018

Adds support for calling sock_ops BPF program when there is a TCP state
change. Two arguments are used; one for the old state and another for
the new state.

There is a new enum in include/uapi/linux/bpf.h that exports the TCP
states that prepends BPF_ to the current TCP state names. If it is ever
necessary to change the internal TCP state values (other than adding
more to the end), then it will become necessary to convert from the
internal TCP state value to the BPF value before calling the BPF
sock_ops function. There are a set of compile checks added in tcp.c
to detect if the internal and BPF values differ so we can make the
necessary fixes.

New op: BPF_SOCK_OPS_STATE_CB.
Signed-off-by: NLawrence Brakmo <brakmo@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

d4487491

bpf: Add BPF_SOCK_OPS_RETRANS_CB · a31ad29e

由 Lawrence Brakmo 提交于 1月 25, 2018

Adds support for calling sock_ops BPF program when there is a
retransmission. Three arguments are used; one for the sequence number,
another for the number of segments retransmitted, and the last one for
the return value of tcp_transmit_skb (0 => success).
Does not include syn-ack retransmissions.

New op: BPF_SOCK_OPS_RETRANS_CB.
Signed-off-by: NLawrence Brakmo <brakmo@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

a31ad29e

bpf: Add sock_ops RTO callback · f89013f6

由 Lawrence Brakmo 提交于 1月 25, 2018

Adds an optional call to sock_ops BPF program based on whether the
BPF_SOCK_OPS_RTO_CB_FLAG is set in bpf_sock_ops_flags.
The BPF program is passed 2 arguments: icsk_retransmits and whether the
RTO has expired.
Signed-off-by: NLawrence Brakmo <brakmo@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

f89013f6

bpf: Support passing args to sock_ops bpf function · de525be2

由 Lawrence Brakmo 提交于 1月 25, 2018

Adds support for passing up to 4 arguments to sock_ops bpf functions. It
reusues the reply union, so the bpf_sock_ops structures are not
increased in size.
Signed-off-by: NLawrence Brakmo <brakmo@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

de525be2

net: don't call update_pmtu unconditionally · f15ca723

由 Nicolas Dichtel 提交于 1月 25, 2018

Some dst_ops (e.g. md_dst_ops)) doesn't set this handler. It may result to:
"BUG: unable to handle kernel NULL pointer dereference at           (null)"

Let's add a helper to check if update_pmtu is available before calling it.

Fixes: 52a589d5 ("geneve: update skb dst pmtu on tx path")
Fixes: a93bf0ff ("vxlan: update skb dst pmtu on tx path")
CC: Roman Kapl <code@rkapl.cz>
CC: Xin Long <lucien.xin@gmail.com>
Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f15ca723

25 1月, 2018 6 次提交

net: tcp: close sock if net namespace is exiting · 4ee806d5

由 Dan Streetman 提交于 1月 18, 2018

When a tcp socket is closed, if it detects that its net namespace is
exiting, close immediately and do not wait for FIN sequence.

For normal sockets, a reference is taken to their net namespace, so it will
never exit while the socket is open. However, kernel sockets do not take a
reference to their net namespace, so it may begin exiting while the kernel
socket is still open. In this case if the kernel socket is a tcp socket,
it will stay open trying to complete its close sequence. The sock's dst(s)
hold a reference to their interface, which are all transferred to the
namespace's loopback interface when the real interfaces are taken down.
When the namespace tries to take down its loopback interface, it hangs
waiting for all references to the loopback interface to release, which
results in messages like:

unregister_netdevice: waiting for lo to become free. Usage count = 1

These messages continue until the socket finally times out and closes.
Since the net namespace cleanup holds the net_mutex while calling its
registered pernet callbacks, any new net namespace initialization is
blocked until the current net namespace finishes exiting.

After this change, the tcp socket notices the exiting net namespace, and
closes immediately, releasing its dst(s) and their reference to the
loopback interface, which lets the net namespace continue exiting.

Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=97811Signed-off-by: NDan Streetman <ddstreet@canonical.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4ee806d5

A
ipconfig: use dev_set_mtu() · 6a88fbe7
由 Al Viro 提交于 10月 01, 2017
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
6a88fbe7

ip_rt_ioctl(): take copyin to caller · ca25c300

由 Al Viro 提交于 7月 01, 2017

Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

ca25c300

devinet_ioctl(): take copyin/copyout to caller · 03aef17b

由 Al Viro 提交于 7月 01, 2017

Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

03aef17b

net: separate SIOCGIFCONF handling from dev_ioctl() · 36fd633e

由 Al Viro 提交于 6月 26, 2017

Only two of dev_ioctl() callers may pass SIOCGIFCONF to it.
Separating that codepath from the rest of dev_ioctl() allows both
to simplify dev_ioctl() itself (all other cases work with struct ifreq *)
*and* seriously simplify the compat side of that beast: all it takes
is passing to inet_gifconf() an extra argument - the size of individual
records (sizeof(struct ifreq) or sizeof(struct compat_ifreq)).  With
dev_ifconf() called directly from sock_do_ioctl()/compat_dev_ifconf()
that's easy to arrange.

As the result, compat side of SIOCGIFCONF doesn't need any
allocations, copy_in_user() back and forth, etc.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

36fd633e

ip_tunnel: Use mark in skb by default · 5c38bd1b

由 Thomas Winter 提交于 1月 23, 2018

This allows marks set by connmark in iptables
to be used for route lookups.
Signed-off-by: NThomas Winter <thomas.winter@alliedtelesis.co.nz>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5c38bd1b

23 1月, 2018 3 次提交

xfrm: Fix eth_hdr(skb)->h_proto to reflect inner IP version · 5efec5c6

由 Yossi Kuperman 提交于 1月 23, 2018

IPSec tunnel mode supports encapsulation of IPv4 over IPv6 and vice-versa.

The outer IP header is stripped and the inner IP inherits the original
Ethernet header. Tcpdump fails to properly decode the inner packet in
case that h_proto is different than the inner IP version.

Fix h_proto to reflect the inner IP version.
Signed-off-by: NYossi Kuperman <yossiku@mellanox.com>
Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>

5efec5c6

net: igmp: fix source address check for IGMPv3 reports · ad23b750

由 Felix Fietkau 提交于 1月 19, 2018

Commit "net: igmp: Use correct source address on IGMPv3 reports"
introduced a check to validate the source address of locally generated
IGMPv3 packets.
Instead of checking the local interface address directly, it uses
inet_ifa_match(fl4->saddr, ifa), which checks if the address is on the
local subnet (or equal to the point-to-point address if used).

This breaks for point-to-point interfaces, so check against
ifa->ifa_local directly.

Cc: Kevin Cernekee <cernekee@chromium.org>
Fixes: a46182b0 ("net: igmp: Use correct source address on IGMPv3 reports")
Reported-by: NSebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: NFelix Fietkau <nbd@nbd.name>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ad23b750

gso: validate gso_type in GSO handlers · 121d57af

由 Willem de Bruijn 提交于 1月 19, 2018

Validate gso_type during segmentation as SKB_GSO_DODGY sources
may pass packets where the gso_type does not match the contents.

Syzkaller was able to enter the SCTP gso handler with a packet of
gso_type SKB_GSO_TCPV4.

On entry of transport layer gso handlers, verify that the gso_type
matches the transport protocol.

Fixes: 90017acc ("sctp: Add GSO support")
Link: http://lkml.kernel.org/r/<001a1137452496ffc305617e5fe0@google.com>
Reported-by: syzbot+fee64147a25aecd48055@syzkaller.appspotmail.com
Signed-off-by: NWillem de Bruijn <willemb@google.com>
Acked-by: NJason Wang <jasowang@redhat.com>
Reviewed-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

121d57af

20 1月, 2018 3 次提交

tcp: avoid min RTT bloat by skipping RTT from delayed-ACK in BBR · e4286603

由 Yuchung Cheng 提交于 1月 17, 2018

A persistent connection may send tiny amount of data (e.g. health-check)
for a long period of time. BBR's windowed min RTT filter may only see
RTT samples from delayed ACKs causing BBR to grossly over-estimate
the path delay depending how much the ACK was delayed at the receiver.

This patch skips RTT samples that are likely coming from delayed ACKs. Note
that it is possible the sender never obtains a valid measure to set the
min RTT. In this case BBR will continue to set cwnd to initial window
which seems fine because the connection is thin stream.
Signed-off-by: NYuchung Cheng <ycheng@google.com>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
Acked-by: NPriyaranjan Jha <priyarjha@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e4286603

tcp: avoid min-RTT overestimation from delayed ACKs · eb36be0f

由 Yuchung Cheng 提交于 1月 17, 2018

This patch avoids having TCP sender or congestion control
overestimate the min RTT by orders of magnitude. This happens when
all the samples in the windowed filter are one-packet transfer
like small request and health-check like chit-chat, which is farily
common for applications using persistent connections. This patch
tries to conservatively labels and skip RTT samples obtained from
this type of workload.
Signed-off-by: NYuchung Cheng <ycheng@google.com>
Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

eb36be0f

netfilter: remove messages print and boot/module load time · e5531166

由 Pablo Neira Ayuso 提交于 1月 19, 2018

Several reasons for this:

* Several modules maintain internal version numbers, that they print at
  boot/module load time, that are not exposed to userspace, as a
  primitive mechanism to make revision number control from the earlier
  days of Netfilter.

* IPset shows the protocol version at boot/module load time, instead
  display this via module description, as Jozsef suggested.

* Remove copyright notice at boot/module load time in two spots, the
  Netfilter codebase is a collective development effort, if we would
  have to display copyrights for each contributor at boot/module load
  time for each extensions we have, we would probably fill up logs with
  lots of useless information - from a technical standpoint.

So let's be consistent and remove them all.
Acked-by: NFlorian Westphal <fw@strlen.de>
Acked-by: NJozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

e5531166

19 1月, 2018 4 次提交

netfilter: nf_nat_snmp_basic: use asn1 decoder library · cc2d5863

由 Taehee Yoo 提交于 1月 08, 2018

The basic SNMP ALG parse snmp ASN.1 payload
however, since 2012 linux kernel provide ASN.1 decoder library.
If we use ASN.1 decoder in the /lib/asn1_decoder.c, we can remove
about 1000 line of ASN.1 parsing routine.

To use asn1_decoder.c, we should write mib file(nf_nat_snmp_basic.asn1)
then /script/asn1_compiler.c makes *-asn1.c and *-asn1.h file
at the compiletime.(nf_nat_snmp_basic-asn1.c, nf_nat_snmp_basic-asn1.h)
The nf_nat_snmp_basic.asn1 is made by RFC1155, RFC1157, RFC1902, RFC1905,
RFC2578, RFC3416. of course that mib file supports only the basic SNMP ALG.

Previous SNMP ALG mangles only first octet of IPv4 address.
but after this patch, the SNMP ALG mangles whole IPv4 Address.
And SNMPv3 is not supported.

I tested with snmp commands such ans snmpd, snmpwalk, snmptrap.
Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

cc2d5863

netfilter: nf_nat_snmp_basic: use nf_ct_helper_log · bea588b0

由 Taehee Yoo 提交于 1月 08, 2018

Use nf_ct_helper_log to write log message.
Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

bea588b0

netfilter: nf_nat_snmp_basic: replace ctinfo with dir. · 8b8f0813

由 Taehee Yoo 提交于 1月 08, 2018

The snmp_translate() receives ctinfo data to get dir value only.
because of caller already has dir value, we just replace ctinfo with dir.
Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

8b8f0813

netfilter: nf_nat_snmp_basic: remove debug parameter · e29e5ddc

由 Taehee Yoo 提交于 1月 08, 2018

To see debug message of nf_nat_snmp_basic, we should set debug value
when we insert this module. but it is inconvenient and only using of
the dynamic debugging is enough to debug.

This patch just removes debug code. then in the next patch, debugging code
will be added.
Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

e29e5ddc

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功