1. 30 8月, 2018 1 次提交
  2. 23 8月, 2018 1 次提交
    • A
      net_sched: fix unused variable warning in stmmac · 191672ca
      Arnd Bergmann 提交于
      The new tcf_exts_for_each_action() macro doesn't reference its
      arguments when CONFIG_NET_CLS_ACT is disabled, which leads to
      a harmless warning in at least one driver:
      
      drivers/net/ethernet/stmicro/stmmac/stmmac_tc.c: In function 'tc_fill_actions':
      drivers/net/ethernet/stmicro/stmmac/stmmac_tc.c:64:6: error: unused variable 'i' [-Werror=unused-variable]
      
      Adding a cast to void lets us avoid this kind of warning.
      To be on the safe side, do it for all three arguments, not
      just the one that caused the warning.
      
      Fixes: 244cd96a ("net_sched: remove list_head from tc_action")
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      191672ca
  3. 22 8月, 2018 4 次提交
  4. 17 8月, 2018 3 次提交
    • D
      tcp, ulp: add alias for all ulp modules · 037b0b86
      Daniel Borkmann 提交于
      Lets not turn the TCP ULP lookup into an arbitrary module loader as
      we only intend to load ULP modules through this mechanism, not other
      unrelated kernel modules:
      
        [root@bar]# cat foo.c
        #include <sys/types.h>
        #include <sys/socket.h>
        #include <linux/tcp.h>
        #include <linux/in.h>
      
        int main(void)
        {
            int sock = socket(PF_INET, SOCK_STREAM, 0);
            setsockopt(sock, IPPROTO_TCP, TCP_ULP, "sctp", sizeof("sctp"));
            return 0;
        }
      
        [root@bar]# gcc foo.c -O2 -Wall
        [root@bar]# lsmod | grep sctp
        [root@bar]# ./a.out
        [root@bar]# lsmod | grep sctp
        sctp                 1077248  4
        libcrc32c              16384  3 nf_conntrack,nf_nat,sctp
        [root@bar]#
      
      Fix it by adding module alias to TCP ULP modules, so probing module
      via request_module() will be limited to tcp-ulp-[name]. The existing
      modules like kTLS will load fine given tcp-ulp-tls alias, but others
      will fail to load:
      
        [root@bar]# lsmod | grep sctp
        [root@bar]# ./a.out
        [root@bar]# lsmod | grep sctp
        [root@bar]#
      
      Sockmap is not affected from this since it's either built-in or not.
      
      Fixes: 734942cc ("tcp: ULP infrastructure")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      037b0b86
    • F
      netfilter: nf_tables: fix register ordering · d209df3e
      Florian Westphal 提交于
      We must register nfnetlink ops last, as that exposes nf_tables to
      userspace.  Without this, we could theoretically get nfnetlink request
      before net->nft state has been initialized.
      
      Fixes: 99633ab2 ("netfilter: nf_tables: complete net namespace support")
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      d209df3e
    • T
      netfilter: nft_set: fix allocation size overflow in privsize callback. · 4ef360dd
      Taehee Yoo 提交于
      In order to determine allocation size of set, ->privsize is invoked.
      At this point, both desc->size and size of each data structure of set
      are used. desc->size means number of element that is given by user.
      desc->size is u32 type. so that upperlimit of set element is 4294967295.
      but return type of ->privsize is also u32. hence overflow can occurred.
      
      test commands:
         %nft add table ip filter
         %nft add set ip filter hash1 { type ipv4_addr \; size 4294967295 \; }
         %nft list ruleset
      
      splat looks like:
      [ 1239.202910] kasan: CONFIG_KASAN_INLINE enabled
      [ 1239.208788] kasan: GPF could be caused by NULL-ptr deref or user memory access
      [ 1239.217625] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
      [ 1239.219329] CPU: 0 PID: 1603 Comm: nft Not tainted 4.18.0-rc5+ #7
      [ 1239.229091] RIP: 0010:nft_hash_walk+0x1d2/0x310 [nf_tables_set]
      [ 1239.229091] Code: 84 d2 7f 10 4c 89 e7 89 44 24 38 e8 d8 5a 17 e0 8b 44 24 38 48 8d 7b 10 41 0f b6 0c 24 48 89 fa 48 89 fe 48 c1 ea 03 83 e6 07 <42> 0f b6 14 3a 40 38 f2 7f 1a 84 d2 74 16
      [ 1239.229091] RSP: 0018:ffff8801118cf358 EFLAGS: 00010246
      [ 1239.229091] RAX: 0000000000000000 RBX: 0000000000020400 RCX: 0000000000000001
      [ 1239.229091] RDX: 0000000000004082 RSI: 0000000000000000 RDI: 0000000000020410
      [ 1239.229091] RBP: ffff880114d5a988 R08: 0000000000007e94 R09: ffff880114dd8030
      [ 1239.229091] R10: ffff880114d5a988 R11: ffffed00229bb006 R12: ffff8801118cf4d0
      [ 1239.229091] R13: ffff8801118cf4d8 R14: 0000000000000000 R15: dffffc0000000000
      [ 1239.229091] FS:  00007f5a8fe0b700(0000) GS:ffff88011b600000(0000) knlGS:0000000000000000
      [ 1239.229091] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1239.229091] CR2: 00007f5a8ecc27b0 CR3: 000000010608e000 CR4: 00000000001006f0
      [ 1239.229091] Call Trace:
      [ 1239.229091]  ? nft_hash_remove+0xf0/0xf0 [nf_tables_set]
      [ 1239.229091]  ? memset+0x1f/0x40
      [ 1239.229091]  ? __nla_reserve+0x9f/0xb0
      [ 1239.229091]  ? memcpy+0x34/0x50
      [ 1239.229091]  nf_tables_dump_set+0x9a1/0xda0 [nf_tables]
      [ 1239.229091]  ? __kmalloc_reserve.isra.29+0x2e/0xa0
      [ 1239.229091]  ? nft_chain_hash_obj+0x630/0x630 [nf_tables]
      [ 1239.229091]  ? nf_tables_commit+0x2c60/0x2c60 [nf_tables]
      [ 1239.229091]  netlink_dump+0x470/0xa20
      [ 1239.229091]  __netlink_dump_start+0x5ae/0x690
      [ 1239.229091]  nft_netlink_dump_start_rcu+0xd1/0x160 [nf_tables]
      [ 1239.229091]  nf_tables_getsetelem+0x2e5/0x4b0 [nf_tables]
      [ 1239.229091]  ? nft_get_set_elem+0x440/0x440 [nf_tables]
      [ 1239.229091]  ? nft_chain_hash_obj+0x630/0x630 [nf_tables]
      [ 1239.229091]  ? nf_tables_dump_obj_done+0x70/0x70 [nf_tables]
      [ 1239.229091]  ? nla_parse+0xab/0x230
      [ 1239.229091]  ? nft_get_set_elem+0x440/0x440 [nf_tables]
      [ 1239.229091]  nfnetlink_rcv_msg+0x7f0/0xab0 [nfnetlink]
      [ 1239.229091]  ? nfnetlink_bind+0x1d0/0x1d0 [nfnetlink]
      [ 1239.229091]  ? debug_show_all_locks+0x290/0x290
      [ 1239.229091]  ? sched_clock_cpu+0x132/0x170
      [ 1239.229091]  ? find_held_lock+0x39/0x1b0
      [ 1239.229091]  ? sched_clock_local+0x10d/0x130
      [ 1239.229091]  netlink_rcv_skb+0x211/0x320
      [ 1239.229091]  ? nfnetlink_bind+0x1d0/0x1d0 [nfnetlink]
      [ 1239.229091]  ? netlink_ack+0x7b0/0x7b0
      [ 1239.229091]  ? ns_capable_common+0x6e/0x110
      [ 1239.229091]  nfnetlink_rcv+0x2d1/0x310 [nfnetlink]
      [ 1239.229091]  ? nfnetlink_rcv_batch+0x10f0/0x10f0 [nfnetlink]
      [ 1239.229091]  ? netlink_deliver_tap+0x829/0x930
      [ 1239.229091]  ? lock_acquire+0x265/0x2e0
      [ 1239.229091]  netlink_unicast+0x406/0x520
      [ 1239.509725]  ? netlink_attachskb+0x5b0/0x5b0
      [ 1239.509725]  ? find_held_lock+0x39/0x1b0
      [ 1239.509725]  netlink_sendmsg+0x987/0xa20
      [ 1239.509725]  ? netlink_unicast+0x520/0x520
      [ 1239.509725]  ? _copy_from_user+0xa9/0xc0
      [ 1239.509725]  __sys_sendto+0x21a/0x2c0
      [ 1239.509725]  ? __ia32_sys_getpeername+0xa0/0xa0
      [ 1239.509725]  ? retint_kernel+0x10/0x10
      [ 1239.509725]  ? sched_clock_cpu+0x132/0x170
      [ 1239.509725]  ? find_held_lock+0x39/0x1b0
      [ 1239.509725]  ? lock_downgrade+0x540/0x540
      [ 1239.509725]  ? up_read+0x1c/0x100
      [ 1239.509725]  ? __do_page_fault+0x763/0x970
      [ 1239.509725]  ? retint_user+0x18/0x18
      [ 1239.509725]  __x64_sys_sendto+0x177/0x180
      [ 1239.509725]  do_syscall_64+0xaa/0x360
      [ 1239.509725]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [ 1239.509725] RIP: 0033:0x7f5a8f468e03
      [ 1239.509725] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb d0 0f 1f 84 00 00 00 00 00 83 3d 49 c9 2b 00 00 75 13 49 89 ca b8 2c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8
      [ 1239.509725] RSP: 002b:00007ffd78d0b778 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
      [ 1239.509725] RAX: ffffffffffffffda RBX: 00007ffd78d0c890 RCX: 00007f5a8f468e03
      [ 1239.509725] RDX: 0000000000000034 RSI: 00007ffd78d0b7e0 RDI: 0000000000000003
      [ 1239.509725] RBP: 00007ffd78d0b7d0 R08: 00007f5a8f15c160 R09: 000000000000000c
      [ 1239.509725] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffd78d0b7e0
      [ 1239.509725] R13: 0000000000000034 R14: 00007f5a8f9aff60 R15: 00005648040094b0
      [ 1239.509725] Modules linked in: nf_tables_set nf_tables nfnetlink ip_tables x_tables
      [ 1239.670713] ---[ end trace 39375adcda140f11 ]---
      [ 1239.676016] RIP: 0010:nft_hash_walk+0x1d2/0x310 [nf_tables_set]
      [ 1239.682834] Code: 84 d2 7f 10 4c 89 e7 89 44 24 38 e8 d8 5a 17 e0 8b 44 24 38 48 8d 7b 10 41 0f b6 0c 24 48 89 fa 48 89 fe 48 c1 ea 03 83 e6 07 <42> 0f b6 14 3a 40 38 f2 7f 1a 84 d2 74 16
      [ 1239.705108] RSP: 0018:ffff8801118cf358 EFLAGS: 00010246
      [ 1239.711115] RAX: 0000000000000000 RBX: 0000000000020400 RCX: 0000000000000001
      [ 1239.719269] RDX: 0000000000004082 RSI: 0000000000000000 RDI: 0000000000020410
      [ 1239.727401] RBP: ffff880114d5a988 R08: 0000000000007e94 R09: ffff880114dd8030
      [ 1239.735530] R10: ffff880114d5a988 R11: ffffed00229bb006 R12: ffff8801118cf4d0
      [ 1239.743658] R13: ffff8801118cf4d8 R14: 0000000000000000 R15: dffffc0000000000
      [ 1239.751785] FS:  00007f5a8fe0b700(0000) GS:ffff88011b600000(0000) knlGS:0000000000000000
      [ 1239.760993] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1239.767560] CR2: 00007f5a8ecc27b0 CR3: 000000010608e000 CR4: 00000000001006f0
      [ 1239.775679] Kernel panic - not syncing: Fatal exception
      [ 1239.776630] Kernel Offset: 0x1f000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
      [ 1239.776630] Rebooting in 5 seconds..
      
      Fixes: 20a69341 ("netfilter: nf_tables: add netlink set API")
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      4ef360dd
  5. 15 8月, 2018 1 次提交
  6. 13 8月, 2018 4 次提交
  7. 12 8月, 2018 6 次提交
  8. 11 8月, 2018 4 次提交
    • M
      bpf: Enable BPF_PROG_TYPE_SK_REUSEPORT bpf prog in reuseport selection · 8217ca65
      Martin KaFai Lau 提交于
      This patch allows a BPF_PROG_TYPE_SK_REUSEPORT bpf prog to select a
      SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY introduced in
      the earlier patch.  "bpf_run_sk_reuseport()" will return -ECONNREFUSED
      when the BPF_PROG_TYPE_SK_REUSEPORT prog returns SK_DROP.
      The callers, in inet[6]_hashtable.c and ipv[46]/udp.c, are modified to
      handle this case and return NULL immediately instead of continuing the
      sk search from its hashtable.
      
      It re-uses the existing SO_ATTACH_REUSEPORT_EBPF setsockopt to attach
      BPF_PROG_TYPE_SK_REUSEPORT.  The "sk_reuseport_attach_bpf()" will check
      if the attaching bpf prog is in the new SK_REUSEPORT or the existing
      SOCKET_FILTER type and then check different things accordingly.
      
      One level of "__reuseport_attach_prog()" call is removed.  The
      "sk_unhashed() && ..." and "sk->sk_reuseport_cb" tests are pushed
      back to "reuseport_attach_prog()" in sock_reuseport.c.  sock_reuseport.c
      seems to have more knowledge on those test requirements than filter.c.
      In "reuseport_attach_prog()", after new_prog is attached to reuse->prog,
      the old_prog (if any) is also directly freed instead of returning the
      old_prog to the caller and asking the caller to free.
      
      The sysctl_optmem_max check is moved back to the
      "sk_reuseport_attach_filter()" and "sk_reuseport_attach_bpf()".
      As of other bpf prog types, the new BPF_PROG_TYPE_SK_REUSEPORT is only
      bounded by the usual "bpf_prog_charge_memlock()" during load time
      instead of bounded by both bpf_prog_charge_memlock and sysctl_optmem_max.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      8217ca65
    • M
      bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT · 2dbb9b9e
      Martin KaFai Lau 提交于
      This patch adds a BPF_PROG_TYPE_SK_REUSEPORT which can select
      a SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY.  Like other
      non SK_FILTER/CGROUP_SKB program, it requires CAP_SYS_ADMIN.
      
      BPF_PROG_TYPE_SK_REUSEPORT introduces "struct sk_reuseport_kern"
      to store the bpf context instead of using the skb->cb[48].
      
      At the SO_REUSEPORT sk lookup time, it is in the middle of transiting
      from a lower layer (ipv4/ipv6) to a upper layer (udp/tcp).  At this
      point,  it is not always clear where the bpf context can be appended
      in the skb->cb[48] to avoid saving-and-restoring cb[].  Even putting
      aside the difference between ipv4-vs-ipv6 and udp-vs-tcp.  It is not
      clear if the lower layer is only ipv4 and ipv6 in the future and
      will it not touch the cb[] again before transiting to the upper
      layer.
      
      For example, in udp_gro_receive(), it uses the 48 byte NAPI_GRO_CB
      instead of IP[6]CB and it may still modify the cb[] after calling
      the udp[46]_lib_lookup_skb().  Because of the above reason, if
      sk->cb is used for the bpf ctx, saving-and-restoring is needed
      and likely the whole 48 bytes cb[] has to be saved and restored.
      
      Instead of saving, setting and restoring the cb[], this patch opts
      to create a new "struct sk_reuseport_kern" and setting the needed
      values in there.
      
      The new BPF_PROG_TYPE_SK_REUSEPORT and "struct sk_reuseport_(kern|md)"
      will serve all ipv4/ipv6 + udp/tcp combinations.  There is no protocol
      specific usage at this point and it is also inline with the current
      sock_reuseport.c implementation (i.e. no protocol specific requirement).
      
      In "struct sk_reuseport_md", this patch exposes data/data_end/len
      with semantic similar to other existing usages.  Together
      with "bpf_skb_load_bytes()" and "bpf_skb_load_bytes_relative()",
      the bpf prog can peek anywhere in the skb.  The "bind_inany" tells
      the bpf prog that the reuseport group is bind-ed to a local
      INANY address which cannot be learned from skb.
      
      The new "bind_inany" is added to "struct sock_reuseport" which will be
      used when running the new "BPF_PROG_TYPE_SK_REUSEPORT" bpf prog in order
      to avoid repeating the "bind INANY" test on
      "sk_v6_rcv_saddr/sk->sk_rcv_saddr" every time a bpf prog is run.  It can
      only be properly initialized when a "sk->sk_reuseport" enabled sk is
      adding to a hashtable (i.e. during "reuseport_alloc()" and
      "reuseport_add_sock()").
      
      The new "sk_select_reuseport()" is the main helper that the
      bpf prog will use to select a SO_REUSEPORT sk.  It is the only function
      that can use the new BPF_MAP_TYPE_REUSEPORT_ARRAY.  As mentioned in
      the earlier patch, the validity of a selected sk is checked in
      run time in "sk_select_reuseport()".  Doing the check in
      verification time is difficult and inflexible (consider the map-in-map
      use case).  The runtime check is to compare the selected sk's reuseport_id
      with the reuseport_id that we want.  This helper will return -EXXX if the
      selected sk cannot serve the incoming request (e.g. reuseport_id
      not match).  The bpf prog can decide if it wants to do SK_DROP as its
      discretion.
      
      When the bpf prog returns SK_PASS, the kernel will check if a
      valid sk has been selected (i.e. "reuse_kern->selected_sk != NULL").
      If it does , it will use the selected sk.  If not, the kernel
      will select one from "reuse->socks[]" (as before this patch).
      
      The SK_DROP and SK_PASS handling logic will be in the next patch.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      2dbb9b9e
    • M
      net: Add ID (if needed) to sock_reuseport and expose reuseport_lock · 736b4602
      Martin KaFai Lau 提交于
      A later patch will introduce a BPF_MAP_TYPE_REUSEPORT_ARRAY which
      allows a SO_REUSEPORT sk to be added to a bpf map.  When a sk
      is removed from reuse->socks[], it also needs to be removed from
      the bpf map.  Also, when adding a sk to a bpf map, the bpf
      map needs to ensure it is indeed in a reuse->socks[].
      Hence, reuseport_lock is needed by the bpf map to ensure its
      map_update_elem() and map_delete_elem() operations are in-sync with
      the reuse->socks[].  The BPF_MAP_TYPE_REUSEPORT_ARRAY map will only
      acquire the reuseport_lock after ensuring the adding sk is already
      in a reuseport group (i.e. reuse->socks[]).  The map_lookup_elem()
      will be lockless.
      
      This patch also adds an ID to sock_reuseport.  A later patch
      will introduce BPF_PROG_TYPE_SK_REUSEPORT which allows
      a bpf prog to select a sk from a bpf map.  It is inflexible to
      statically enforce a bpf map can only contain the sk belonging to
      a particular reuse->socks[] (i.e. same IP:PORT) during the bpf
      verification time. For example, think about the the map-in-map situation
      where the inner map can be dynamically changed in runtime and the outer
      map may have inner maps belonging to different reuseport groups.
      Hence, when the bpf prog (in the new BPF_PROG_TYPE_SK_REUSEPORT
      type) selects a sk,  this selected sk has to be checked to ensure it
      belongs to the requesting reuseport group (i.e. the group serving
      that IP:PORT).
      
      The "sk->sk_reuseport_cb" pointer cannot be used for this checking
      purpose because the pointer value will change after reuseport_grow().
      Instead of saving all checking conditions like the ones
      preced calling "reuseport_add_sock()" and compare them everytime a
      bpf_prog is run, a 32bits ID is introduced to survive the
      reuseport_grow().  The ID is only acquired if any of the
      reuse->socks[] is added to the newly introduced
      "BPF_MAP_TYPE_REUSEPORT_ARRAY" map.
      
      If "BPF_MAP_TYPE_REUSEPORT_ARRAY" is not used,  the changes in this
      patch is a no-op.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      736b4602
    • M
      tcp: Avoid TCP syncookie rejected by SO_REUSEPORT socket · 40a1227e
      Martin KaFai Lau 提交于
      Although the actual cookie check "__cookie_v[46]_check()" does
      not involve sk specific info, it checks whether the sk has recent
      synq overflow event in "tcp_synq_no_recent_overflow()".  The
      tcp_sk(sk)->rx_opt.ts_recent_stamp is updated every second
      when it has sent out a syncookie (through "tcp_synq_overflow()").
      
      The above per sk "recent synq overflow event timestamp" works well
      for non SO_REUSEPORT use case.  However, it may cause random
      connection request reject/discard when SO_REUSEPORT is used with
      syncookie because it fails the "tcp_synq_no_recent_overflow()"
      test.
      
      When SO_REUSEPORT is used, it usually has multiple listening
      socks serving TCP connection requests destinated to the same local IP:PORT.
      There are cases that the TCP-ACK-COOKIE may not be received
      by the same sk that sent out the syncookie.  For example,
      if reuse->socks[] began with {sk0, sk1},
      1) sk1 sent out syncookies and tcp_sk(sk1)->rx_opt.ts_recent_stamp
         was updated.
      2) the reuse->socks[] became {sk1, sk2} later.  e.g. sk0 was first closed
         and then sk2 was added.  Here, sk2 does not have ts_recent_stamp set.
         There are other ordering that will trigger the similar situation
         below but the idea is the same.
      3) When the TCP-ACK-COOKIE comes back, sk2 was selected.
         "tcp_synq_no_recent_overflow(sk2)" returns true. In this case,
         all syncookies sent by sk1 will be handled (and rejected)
         by sk2 while sk1 is still alive.
      
      The userspace may create and remove listening SO_REUSEPORT sockets
      as it sees fit.  E.g. Adding new thread (and SO_REUSEPORT sock) to handle
      incoming requests, old process stopping and new process starting...etc.
      With or without SO_ATTACH_REUSEPORT_[CB]BPF,
      the sockets leaving and joining a reuseport group makes picking
      the same sk to check the syncookie very difficult (if not impossible).
      
      The later patches will allow bpf prog more flexibility in deciding
      where a sk should be located in a bpf map and selecting a particular
      SO_REUSEPORT sock as it sees fit.  e.g. Without closing any sock,
      replace the whole bpf reuseport_array in one map_update() by using
      map-in-map.  Getting the syncookie check working smoothly across
      socks in the same "reuse->socks[]" is important.
      
      A partial solution is to set the newly added sk's ts_recent_stamp
      to the max ts_recent_stamp of a reuseport group but that will require
      to iterate through reuse->socks[]  OR
      pessimistically set it to "now - TCP_SYNCOOKIE_VALID" when a sk is
      joining a reuseport group.  However, neither of them will solve the
      existing sk getting moved around the reuse->socks[] and that
      sk may not have ts_recent_stamp updated, unlikely under continuous
      synflood but not impossible.
      
      This patch opts to treat the reuseport group as a whole when
      considering the last synq overflow timestamp since
      they are serving the same IP:PORT from the userspace
      (and BPF program) perspective.
      
      "synq_overflow_ts" is added to "struct sock_reuseport".
      The tcp_synq_overflow() and tcp_synq_no_recent_overflow()
      will update/check reuse->synq_overflow_ts if the sk is
      in a reuseport group.  Similar to the reuseport decision in
      __inet_lookup_listener(), both sk->sk_reuseport and
      sk->sk_reuseport_cb are tested for SO_REUSEPORT usage.
      Update on "synq_overflow_ts" happens at roughly once
      every second.
      
      A synflood test was done with a 16 rx-queues and 16 reuseport sockets.
      No meaningful performance change is observed.  Before and
      after the change is ~9Mpps in IPv4.
      
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      40a1227e
  9. 10 8月, 2018 2 次提交
  10. 08 8月, 2018 3 次提交
  11. 07 8月, 2018 4 次提交
  12. 06 8月, 2018 1 次提交
  13. 05 8月, 2018 1 次提交
  14. 04 8月, 2018 2 次提交
  15. 02 8月, 2018 3 次提交