1. 05 6月, 2019 1 次提交
  2. 04 6月, 2019 3 次提交
    • E
      net: fix use-after-free in kfree_skb_list · b7034146
      Eric Dumazet 提交于
      syzbot reported nasty use-after-free [1]
      
      Lets remove frag_list field from structs ip_fraglist_iter
      and ip6_fraglist_iter. This seens not needed anyway.
      
      [1] :
      BUG: KASAN: use-after-free in kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706
      Read of size 8 at addr ffff888085a3cbc0 by task syz-executor303/8947
      
      CPU: 0 PID: 8947 Comm: syz-executor303 Not tainted 5.2.0-rc2+ #12
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
       __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
       kasan_report+0x12/0x20 mm/kasan/common.c:614
       __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
       kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706
       ip6_fragment+0x1ef4/0x2680 net/ipv6/ip6_output.c:882
       __ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144
       ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156
       NF_HOOK_COND include/linux/netfilter.h:294 [inline]
       ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179
       dst_output include/net/dst.h:433 [inline]
       ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179
       ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796
       ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816
       rawv6_push_pending_frames net/ipv6/raw.c:617 [inline]
       rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947
       inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
       sock_sendmsg_nosec net/socket.c:652 [inline]
       sock_sendmsg+0xd7/0x130 net/socket.c:671
       ___sys_sendmsg+0x803/0x920 net/socket.c:2292
       __sys_sendmsg+0x105/0x1d0 net/socket.c:2330
       __do_sys_sendmsg net/socket.c:2339 [inline]
       __se_sys_sendmsg net/socket.c:2337 [inline]
       __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
       do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x44add9
      Code: e8 7c e6 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 1b 05 fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f826f33bce8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 00000000006e7a18 RCX: 000000000044add9
      RDX: 0000000000000000 RSI: 0000000020000240 RDI: 0000000000000005
      RBP: 00000000006e7a10 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006e7a1c
      R13: 00007ffcec4f7ebf R14: 00007f826f33c9c0 R15: 20c49ba5e353f7cf
      
      Allocated by task 8947:
       save_stack+0x23/0x90 mm/kasan/common.c:71
       set_track mm/kasan/common.c:79 [inline]
       __kasan_kmalloc mm/kasan/common.c:489 [inline]
       __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
       kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:497
       slab_post_alloc_hook mm/slab.h:437 [inline]
       slab_alloc_node mm/slab.c:3269 [inline]
       kmem_cache_alloc_node+0x131/0x710 mm/slab.c:3579
       __alloc_skb+0xd5/0x5e0 net/core/skbuff.c:199
       alloc_skb include/linux/skbuff.h:1058 [inline]
       __ip6_append_data.isra.0+0x2a24/0x3640 net/ipv6/ip6_output.c:1519
       ip6_append_data+0x1e5/0x320 net/ipv6/ip6_output.c:1688
       rawv6_sendmsg+0x1467/0x35e0 net/ipv6/raw.c:940
       inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
       sock_sendmsg_nosec net/socket.c:652 [inline]
       sock_sendmsg+0xd7/0x130 net/socket.c:671
       ___sys_sendmsg+0x803/0x920 net/socket.c:2292
       __sys_sendmsg+0x105/0x1d0 net/socket.c:2330
       __do_sys_sendmsg net/socket.c:2339 [inline]
       __se_sys_sendmsg net/socket.c:2337 [inline]
       __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
       do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Freed by task 8947:
       save_stack+0x23/0x90 mm/kasan/common.c:71
       set_track mm/kasan/common.c:79 [inline]
       __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
       kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
       __cache_free mm/slab.c:3432 [inline]
       kmem_cache_free+0x86/0x260 mm/slab.c:3698
       kfree_skbmem net/core/skbuff.c:625 [inline]
       kfree_skbmem+0xc5/0x150 net/core/skbuff.c:619
       __kfree_skb net/core/skbuff.c:682 [inline]
       kfree_skb net/core/skbuff.c:699 [inline]
       kfree_skb+0xf0/0x390 net/core/skbuff.c:693
       kfree_skb_list+0x44/0x60 net/core/skbuff.c:708
       __dev_xmit_skb net/core/dev.c:3551 [inline]
       __dev_queue_xmit+0x3034/0x36b0 net/core/dev.c:3850
       dev_queue_xmit+0x18/0x20 net/core/dev.c:3914
       neigh_direct_output+0x16/0x20 net/core/neighbour.c:1532
       neigh_output include/net/neighbour.h:511 [inline]
       ip6_finish_output2+0x1034/0x2550 net/ipv6/ip6_output.c:120
       ip6_fragment+0x1ebb/0x2680 net/ipv6/ip6_output.c:863
       __ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144
       ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156
       NF_HOOK_COND include/linux/netfilter.h:294 [inline]
       ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179
       dst_output include/net/dst.h:433 [inline]
       ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179
       ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796
       ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816
       rawv6_push_pending_frames net/ipv6/raw.c:617 [inline]
       rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947
       inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
       sock_sendmsg_nosec net/socket.c:652 [inline]
       sock_sendmsg+0xd7/0x130 net/socket.c:671
       ___sys_sendmsg+0x803/0x920 net/socket.c:2292
       __sys_sendmsg+0x105/0x1d0 net/socket.c:2330
       __do_sys_sendmsg net/socket.c:2339 [inline]
       __se_sys_sendmsg net/socket.c:2337 [inline]
       __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
       do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The buggy address belongs to the object at ffff888085a3cbc0
       which belongs to the cache skbuff_head_cache of size 224
      The buggy address is located 0 bytes inside of
       224-byte region [ffff888085a3cbc0, ffff888085a3cca0)
      The buggy address belongs to the page:
      page:ffffea0002168f00 refcount:1 mapcount:0 mapping:ffff88821b6f63c0 index:0x0
      flags: 0x1fffc0000000200(slab)
      raw: 01fffc0000000200 ffffea00027bbf88 ffffea0002105b88 ffff88821b6f63c0
      raw: 0000000000000000 ffff888085a3c080 000000010000000c 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff888085a3ca80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
       ffff888085a3cb00: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc
      >ffff888085a3cb80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
                                                 ^
       ffff888085a3cc00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff888085a3cc80: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
      
      Fixes: 0feca619 ("net: ipv6: add skbuff fraglist splitter")
      Fixes: c8b17be0 ("net: ipv4: add skbuff fraglist splitter")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Acked-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b7034146
    • E
      tcp: use this_cpu_read(*X) instead of *this_cpu_ptr(X) · 5472c3c6
      Eric Dumazet 提交于
      this_cpu_read(*X) is slightly faster than *this_cpu_ptr(X)
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5472c3c6
    • E
      ipv4: icmp: use this_cpu_read() in icmp_sk() · 046386ca
      Eric Dumazet 提交于
      this_cpu_read(*X) is faster than *this_cpu_ptr(X)
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      046386ca
  3. 03 6月, 2019 5 次提交
  4. 01 6月, 2019 3 次提交
  5. 31 5月, 2019 11 次提交
    • W
      net: correct zerocopy refcnt with udp MSG_MORE · 100f6d8e
      Willem de Bruijn 提交于
      TCP zerocopy takes a uarg reference for every skb, plus one for the
      tcp_sendmsg_locked datapath temporarily, to avoid reaching refcnt zero
      as it builds, sends and frees skbs inside its inner loop.
      
      UDP and RAW zerocopy do not send inside the inner loop so do not need
      the extra sock_zerocopy_get + sock_zerocopy_put pair. Commit
      52900d22288ed ("udp: elide zerocopy operation in hot path") introduced
      extra_uref to pass the initial reference taken in sock_zerocopy_alloc
      to the first generated skb.
      
      But, sock_zerocopy_realloc takes this extra reference at the start of
      every call. With MSG_MORE, no new skb may be generated to attach the
      extra_uref to, so refcnt is incorrectly 2 with only one skb.
      
      Do not take the extra ref if uarg && !tcp, which implies MSG_MORE.
      Update extra_uref accordingly.
      
      This conditional assignment triggers a false positive may be used
      uninitialized warning, so have to initialize extra_uref at define.
      
      Changes v1->v2: fix typo in Fixes SHA1
      
      Fixes: 52900d22 ("udp: elide zerocopy operation in hot path")
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Diagnosed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      100f6d8e
    • J
      net: don't clear sock->sk early to avoid trouble in strparser · 2b81f816
      Jakub Kicinski 提交于
      af_inet sets sock->sk to NULL which trips strparser over:
      
      BUG: kernel NULL pointer dereference, address: 0000000000000012
      PGD 0 P4D 0
      Oops: 0000 [#1] SMP PTI
      CPU: 7 PID: 0 Comm: swapper/7 Not tainted 5.2.0-rc1-00139-g14629453a6d3 #21
      RIP: 0010:tcp_peek_len+0x10/0x60
      RSP: 0018:ffffc02e41c54b98 EFLAGS: 00010246
      RAX: 0000000000000000 RBX: ffff9cf924c4e030 RCX: 0000000000000051
      RDX: 0000000000000000 RSI: 000000000000000c RDI: ffff9cf97128f480
      RBP: ffff9cf9365e0300 R08: ffff9cf94fe7d2c0 R09: 0000000000000000
      R10: 000000000000036b R11: ffff9cf939735e00 R12: ffff9cf91ad9ae40
      R13: ffff9cf924c4e000 R14: ffff9cf9a8fcbaae R15: 0000000000000020
      FS: 0000000000000000(0000) GS:ffff9cf9af7c0000(0000) knlGS:0000000000000000
      CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000012 CR3: 000000013920a003 CR4: 00000000003606e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
       <IRQ>
       strp_data_ready+0x48/0x90
       tls_data_ready+0x22/0xd0 [tls]
       tcp_rcv_established+0x569/0x620
       tcp_v4_do_rcv+0x127/0x1e0
       tcp_v4_rcv+0xad7/0xbf0
       ip_protocol_deliver_rcu+0x2c/0x1c0
       ip_local_deliver_finish+0x41/0x50
       ip_local_deliver+0x6b/0xe0
       ? ip_protocol_deliver_rcu+0x1c0/0x1c0
       ip_rcv+0x52/0xd0
       ? ip_rcv_finish_core.isra.20+0x380/0x380
       __netif_receive_skb_one_core+0x7e/0x90
       netif_receive_skb_internal+0x42/0xf0
       napi_gro_receive+0xed/0x150
       nfp_net_poll+0x7a2/0xd30 [nfp]
       ? kmem_cache_free_bulk+0x286/0x310
       net_rx_action+0x149/0x3b0
       __do_softirq+0xe3/0x30a
       ? handle_irq_event_percpu+0x6a/0x80
       irq_exit+0xe8/0xf0
       do_IRQ+0x85/0xd0
       common_interrupt+0xf/0xf
       </IRQ>
      RIP: 0010:cpuidle_enter_state+0xbc/0x450
      
      To avoid this issue set sock->sk after sk_prot->close.
      My grepping and testing did not discover any code which
      would depend on the current behaviour.
      
      Fixes: c46234eb ("tls: RX path for ktls")
      Reported-by: NDavid Beckett <david.beckett@netronome.com>
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: NDirk van der Merwe <dirk.vandermerwe@netronome.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2b81f816
    • P
      net: ipv4: place control buffer handling away from fragmentation iterators · 19c3401a
      Pablo Neira Ayuso 提交于
      Deal with the IPCB() area away from the iterators.
      
      The bridge codebase has its own control buffer layout, move specific
      IP control buffer handling into the IPv4 codepath.
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      19c3401a
    • P
      net: ipv4: split skbuff into fragments transformer · 065ff79f
      Pablo Neira Ayuso 提交于
      This patch exposes a new API to refragment a skbuff. This allows you to
      split either a linear skbuff or to force the refragmentation of an
      existing fraglist using a different mtu. The API consists of:
      
      * ip_frag_init(), that initializes the internal state of the transformer.
      * ip_frag_next(), that allows you to fetch the next fragment. This function
        internally allocates the skbuff that represents the fragment, it pushes
        the IPv4 header, and it also copies the payload for each fragment.
      
      The ip_frag_state object stores the internal state of the splitter.
      
      This code has been extracted from ip_do_fragment(). Symbols are also
      exported to allow to reuse this iterator from the bridge codepath to
      build its own refragmentation routine by reusing the existing codebase.
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      065ff79f
    • P
      net: ipv4: add skbuff fraglist splitter · c8b17be0
      Pablo Neira Ayuso 提交于
      This patch adds the skbuff fraglist splitter. This API provides an
      iterator to transform the fraglist into single skbuff objects, it
      consists of:
      
      * ip_fraglist_init(), that initializes the internal state of the
        fraglist splitter.
      * ip_fraglist_prepare(), that restores the IPv4 header on the
        fragments.
      * ip_fraglist_next(), that retrieves the fragment from the fraglist and
        it updates the internal state of the splitter to point to the next
        fragment skbuff in the fraglist.
      
      The ip_fraglist_iter object stores the internal state of the iterator.
      
      This code has been extracted from ip_do_fragment(). Symbols are also
      exported to allow to reuse this iterator from the bridge codepath to
      build its own refragmentation routine by reusing the existing codebase.
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c8b17be0
    • J
      tcp: add support for optional TFO backup key to net.ipv4.tcp_fastopen_key · aa1236cd
      Jason Baron 提交于
      Add the ability to add a backup TFO key as:
      
      # echo "x-x-x-x,x-x-x-x" > /proc/sys/net/ipv4/tcp_fastopen_key
      
      The key before the comma acks as the primary TFO key and the key after the
      comma is the backup TFO key. This change is intended to be backwards
      compatible since if only one key is set, userspace will simply read back
      that single key as follows:
      
      # echo "x-x-x-x" > /proc/sys/net/ipv4/tcp_fastopen_key
      # cat /proc/sys/net/ipv4/tcp_fastopen_key
      x-x-x-x
      Signed-off-by: NJason Baron <jbaron@akamai.com>
      Signed-off-by: NChristoph Paasch <cpaasch@apple.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aa1236cd
    • J
      tcp: add support to TCP_FASTOPEN_KEY for optional backup key · 0f1ce023
      Jason Baron 提交于
      Add support for get/set of an optional backup key via TCP_FASTOPEN_KEY, in
      addition to the current 'primary' key. The primary key is used to encrypt
      and decrypt TFO cookies, while the backup is only used to decrypt TFO
      cookies. The backup key is used to maximize successful TFO connections when
      TFO keys are rotated.
      
      Currently, TCP_FASTOPEN_KEY allows a single 16-byte primary key to be set.
      This patch now allows a 32-byte value to be set, where the first 16 bytes
      are used as the primary key and the second 16 bytes are used for the backup
      key. Similarly, for getsockopt(), we can receive a 32-byte value as output
      if requested. If a 16-byte value is used to set the primary key via
      TCP_FASTOPEN_KEY, then any previously set backup key will be removed.
      Signed-off-by: NJason Baron <jbaron@akamai.com>
      Signed-off-by: NChristoph Paasch <cpaasch@apple.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0f1ce023
    • J
      tcp: add backup TFO key infrastructure · 9092a76d
      Jason Baron 提交于
      We would like to be able to rotate TFO keys while minimizing the number of
      client cookies that are rejected. Currently, we have only one key which can
      be used to generate and validate cookies, thus if we simply replace this
      key clients can easily have cookies rejected upon rotation.
      
      We propose having the ability to have both a primary key and a backup key.
      The primary key is used to generate as well as to validate cookies.
      The backup is only used to validate cookies. Thus, keys can be rotated as:
      
      1) generate new key
      2) add new key as the backup key
      3) swap the primary and backup key, thus setting the new key as the primary
      
      We don't simply set the new key as the primary key and move the old key to
      the backup slot because the ip may be behind a load balancer and we further
      allow for the fact that all machines behind the load balancer will not be
      updated simultaneously.
      
      We make use of this infrastructure in subsequent patches.
      Suggested-by: NIgor Lubashev <ilubashe@akamai.com>
      Signed-off-by: NJason Baron <jbaron@akamai.com>
      Signed-off-by: NChristoph Paasch <cpaasch@apple.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9092a76d
    • C
      tcp: introduce __tcp_fastopen_cookie_gen_cipher() · 483642e5
      Christoph Paasch 提交于
      Restructure __tcp_fastopen_cookie_gen() to take a 'struct crypto_cipher'
      argument and rename it as __tcp_fastopen_cookie_gen_cipher(). Subsequent
      patches will provide different ciphers based on which key is being used for
      the cookie generation.
      Signed-off-by: NChristoph Paasch <cpaasch@apple.com>
      Signed-off-by: NJason Baron <jbaron@akamai.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      483642e5
    • Y
      ipv4: tcp_input: fix stack out of bounds when parsing TCP options. · 9609dad2
      Young Xiao 提交于
      The TCP option parsing routines in tcp_parse_options function could
      read one byte out of the buffer of the TCP options.
      
      1         while (length > 0) {
      2                 int opcode = *ptr++;
      3                 int opsize;
      4
      5                 switch (opcode) {
      6                 case TCPOPT_EOL:
      7                         return;
      8                 case TCPOPT_NOP:        /* Ref: RFC 793 section 3.1 */
      9                         length--;
      10                        continue;
      11                default:
      12                        opsize = *ptr++; //out of bound access
      
      If length = 1, then there is an access in line2.
      And another access is occurred in line 12.
      This would lead to out-of-bound access.
      
      Therefore, in the patch we check that the available data length is
      larger enough to pase both TCP option code and size.
      Signed-off-by: NYoung Xiao <92siuyang@gmail.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9609dad2
    • H
      inet: frags: Remove unnecessary smp_store_release/READ_ONCE · 32707c4d
      Herbert Xu 提交于
      The smp_store_release call in fqdir_exit cannot protect the setting
      of fqdir->dead as claimed because its memory barrier is only
      guaranteed to be one-way and the barrier precedes the setting of
      fqdir->dead.
      
      IOW it doesn't provide any barriers between fq->dir and the following
      hash table destruction.
      
      In fact, the code is safe anyway because call_rcu does provide both
      the memory barrier as well as a guarantee that when the destruction
      work starts executing all RCU readers will see the updated value for
      fqdir->dead.
      
      Therefore this patch removes the unnecessary smp_store_release call
      as well as the corresponding READ_ONCE on the read-side in order to
      not confuse future readers of this code.  Comments have been added
      in their places.
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      32707c4d
  6. 29 5月, 2019 7 次提交
    • D
      nexthop: Add support for nexthop groups · 430a0491
      David Ahern 提交于
      Allow the creation of nexthop groups which reference other nexthop
      objects to create multipath routes:
      
                            +--------------+
         +------------+   +--------------+ |
         | nh  nh_grp --->| nh_grp_entry |-+
         +------------+   +---------|----+
           ^                |       |    +------------+
           +----------------+       +--->| nh, weight |
              nh_parent                  +------------+
      
      A group entry points to a nexthop with a weight for that hop within the
      group. The nexthop has a list_head, grp_list, for tracking which groups
      it is a member of and the group entry has a reference back to the parent.
      The grp_list is used when a nexthop is deleted - to efficiently remove
      it from groups using it.
      
      If a nexthop group spec is given, no other attributes can be set. Each
      nexthop id in a group spec must already exist.
      
      Similar to single nexthops, the specification of a nexthop group can be
      updated so that data is managed with rcu locking.
      
      Add path selection function to account for multiple paths and add
      ipv{4,6}_good_nh helpers to know that if a neighbor entry exists it is
      in a good state.
      
      Update NETDEV event handling to rebalance multipath nexthop groups if
      a nexthop is deleted due to a link event (down or unregister).
      
      When a nexthop is removed any groups using it are updated. Groups using a
      nexthop a tracked via a grp_list.
      
      Nexthop dumps can be limited to groups only by adding NHA_GROUPS to the
      request.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      430a0491
    • D
      nexthop: Add support for lwt encaps · b513bd03
      David Ahern 提交于
      Add support for NHA_ENCAP and NHA_ENCAP_TYPE. Leverages the existing code
      for lwtunnel within fib_nh_common, so the only change needed is handling
      the attributes in the nexthop code.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b513bd03
    • D
      nexthop: Add support for IPv6 gateways · 53010f99
      David Ahern 提交于
      Handle IPv6 gateway in a nexthop spec. If nh_family is set to AF_INET6,
      NHA_GATEWAY is expected to be an IPv6 address. Add ipv6 option to gw in
      nh_config to hold the address, add fib6_nh to nh_info to leverage the
      ipv6 initialization and cleanup code. Update nh_fill_node to dump the v6
      address.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      53010f99
    • D
      nexthop: Add support for IPv4 nexthops · 597cfe4f
      David Ahern 提交于
      Add support for IPv4 nexthops. If nh_family is set to AF_INET, then
      NHA_GATEWAY is expected to be an IPv4 address.
      
      Register for netdev events to be notified of admin up/down changes as
      well as deletes. A hash table is used to track nexthop per devices to
      quickly convert device events to the affected nexthops.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      597cfe4f
    • D
      net: Initial nexthop code · ab84be7e
      David Ahern 提交于
      Barebones start point for nexthops. Implementation for RTM commands,
      notifications, management of rbtree for holding nexthops by id, and
      kernel side data structures for nexthops and nexthop config.
      
      Nexthops are maintained in an rbtree sorted by id. Similar to routes,
      nexthops are configured per namespace using netns_nexthop struct added
      to struct net.
      
      Nexthop notifications are sent when a nexthop is added or deleted,
      but NOT if the delete is due to a device event or network namespace
      teardown (which also involves device events). Applications are
      expected to use the device down event to flush nexthops and any
      routes used by the nexthops.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ab84be7e
    • E
      inet: frags: fix use-after-free read in inet_frag_destroy_rcu · dc93f46b
      Eric Dumazet 提交于
      As caught by syzbot [1], the rcu grace period that is respected
      before fqdir_rwork_fn() proceeds and frees fqdir is not enough
      to prevent inet_frag_destroy_rcu() being run after the freeing.
      
      We need a proper rcu_barrier() synchronization to replace
      the one we had in inet_frags_fini()
      
      We also have to fix a potential problem at module removal :
      inet_frags_fini() needs to make sure that all queued work queues
      (fqdir_rwork_fn) have completed, otherwise we might
      call kmem_cache_destroy() too soon and get another use-after-free.
      
      [1]
      BUG: KASAN: use-after-free in inet_frag_destroy_rcu+0xd9/0xe0 net/ipv4/inet_fragment.c:201
      Read of size 8 at addr ffff88806ed47a18 by task swapper/1/0
      
      CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.2.0-rc1+ #2
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
       __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
       kasan_report+0x12/0x20 mm/kasan/common.c:614
       __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
       inet_frag_destroy_rcu+0xd9/0xe0 net/ipv4/inet_fragment.c:201
       __rcu_reclaim kernel/rcu/rcu.h:222 [inline]
       rcu_do_batch kernel/rcu/tree.c:2092 [inline]
       invoke_rcu_callbacks kernel/rcu/tree.c:2310 [inline]
       rcu_core+0xba5/0x1500 kernel/rcu/tree.c:2291
       __do_softirq+0x25c/0x94c kernel/softirq.c:293
       invoke_softirq kernel/softirq.c:374 [inline]
       irq_exit+0x180/0x1d0 kernel/softirq.c:414
       exiting_irq arch/x86/include/asm/apic.h:536 [inline]
       smp_apic_timer_interrupt+0x13b/0x550 arch/x86/kernel/apic/apic.c:1068
       apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:806
       </IRQ>
      RIP: 0010:native_safe_halt+0xe/0x10 arch/x86/include/asm/irqflags.h:61
      Code: ff ff 48 89 df e8 f2 95 8c fa eb 82 e9 07 00 00 00 0f 00 2d e4 45 4b 00 f4 c3 66 90 e9 07 00 00 00 0f 00 2d d4 45 4b 00 fb f4 <c3> 90 55 48 89 e5 41 57 41 56 41 55 41 54 53 e8 8e 18 42 fa e8 99
      RSP: 0018:ffff8880a98e7d78 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
      RAX: 1ffffffff1164e11 RBX: ffff8880a98d4340 RCX: 0000000000000000
      RDX: dffffc0000000000 RSI: 0000000000000006 RDI: ffff8880a98d4bbc
      RBP: ffff8880a98e7da8 R08: ffff8880a98d4340 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
      R13: ffffffff88b27078 R14: 0000000000000001 R15: 0000000000000000
       arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:571
       default_idle_call+0x36/0x90 kernel/sched/idle.c:94
       cpuidle_idle_call kernel/sched/idle.c:154 [inline]
       do_idle+0x377/0x560 kernel/sched/idle.c:263
       cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:354
       start_secondary+0x34e/0x4c0 arch/x86/kernel/smpboot.c:267
       secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:243
      
      Allocated by task 8877:
       save_stack+0x23/0x90 mm/kasan/common.c:71
       set_track mm/kasan/common.c:79 [inline]
       __kasan_kmalloc mm/kasan/common.c:489 [inline]
       __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
       kasan_kmalloc+0x9/0x10 mm/kasan/common.c:503
       kmem_cache_alloc_trace+0x151/0x750 mm/slab.c:3555
       kmalloc include/linux/slab.h:547 [inline]
       kzalloc include/linux/slab.h:742 [inline]
       fqdir_init include/net/inet_frag.h:115 [inline]
       ipv6_frags_init_net+0x48/0x460 net/ipv6/reassembly.c:513
       ops_init+0xb3/0x410 net/core/net_namespace.c:130
       setup_net+0x2d3/0x740 net/core/net_namespace.c:316
       copy_net_ns+0x1df/0x340 net/core/net_namespace.c:439
       create_new_namespaces+0x400/0x7b0 kernel/nsproxy.c:107
       unshare_nsproxy_namespaces+0xc2/0x200 kernel/nsproxy.c:206
       ksys_unshare+0x440/0x980 kernel/fork.c:2692
       __do_sys_unshare kernel/fork.c:2760 [inline]
       __se_sys_unshare kernel/fork.c:2758 [inline]
       __x64_sys_unshare+0x31/0x40 kernel/fork.c:2758
       do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Freed by task 17:
       save_stack+0x23/0x90 mm/kasan/common.c:71
       set_track mm/kasan/common.c:79 [inline]
       __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
       kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
       __cache_free mm/slab.c:3432 [inline]
       kfree+0xcf/0x220 mm/slab.c:3755
       fqdir_rwork_fn+0x33/0x40 net/ipv4/inet_fragment.c:154
       process_one_work+0x989/0x1790 kernel/workqueue.c:2269
       worker_thread+0x98/0xe40 kernel/workqueue.c:2415
       kthread+0x354/0x420 kernel/kthread.c:255
       ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
      
      The buggy address belongs to the object at ffff88806ed47a00
       which belongs to the cache kmalloc-512 of size 512
      The buggy address is located 24 bytes inside of
       512-byte region [ffff88806ed47a00, ffff88806ed47c00)
      The buggy address belongs to the page:
      page:ffffea0001bb51c0 refcount:1 mapcount:0 mapping:ffff8880aa400940 index:0x0
      flags: 0x1fffc0000000200(slab)
      raw: 01fffc0000000200 ffffea000282a788 ffffea0001bb53c8 ffff8880aa400940
      raw: 0000000000000000 ffff88806ed47000 0000000100000006 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff88806ed47900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff88806ed47980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      >ffff88806ed47a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                  ^
       ffff88806ed47a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff88806ed47b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      
      Fixes: 3c8fc878 ("inet: frags: rework rhashtable dismantle")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dc93f46b
    • E
      inet: frags: uninline fqdir_init() · 6b73d197
      Eric Dumazet 提交于
      fqdir_init() is not fast path and is getting bigger.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6b73d197
  7. 27 5月, 2019 9 次提交
    • C
      ipv4: remove redundant assignment to n · df801522
      Colin Ian King 提交于
      The pointer n is being assigned a value however this value is
      never read in the code block and the end of the code block
      continues to the next loop iteration. Clean up the code by
      removing the redundant assignment.
      
      Fixes: 1bff1a0c ("ipv4: Add function to send route updates")
      Addresses-Coverity: ("Unused value")
      Signed-off-by: NColin Ian King <colin.king@canonical.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      df801522
    • E
      inet: frags: rework rhashtable dismantle · 3c8fc878
      Eric Dumazet 提交于
      syszbot found an interesting use-after-free [1] happening
      while IPv4 fragment rhashtable was destroyed at netns dismantle.
      
      While no insertions can possibly happen at the time a dismantling
      netns is destroying this rhashtable, timers can still fire and
      attempt to remove elements from this rhashtable.
      
      This is forbidden, since rhashtable_free_and_destroy() has
      no synchronization against concurrent inserts and deletes.
      
      Add a new fqdir->dead flag so that timers do not attempt
      a rhashtable_remove_fast() operation.
      
      We also have to respect an RCU grace period before starting
      the rhashtable_free_and_destroy() from process context,
      thus we use rcu_work infrastructure.
      
      This is a refinement of a prior rough attempt to fix this bug :
      https://marc.info/?l=linux-netdev&m=153845936820900&w=2
      
      Since the rhashtable cleanup is now deferred to a work queue,
      netns dismantles should be slightly faster.
      
      [1]
      BUG: KASAN: use-after-free in __read_once_size include/linux/compiler.h:194 [inline]
      BUG: KASAN: use-after-free in rhashtable_last_table+0x162/0x180 lib/rhashtable.c:212
      Read of size 8 at addr ffff8880a6497b70 by task kworker/0:0/5
      
      CPU: 0 PID: 5 Comm: kworker/0:0 Not tainted 5.2.0-rc1+ #2
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Workqueue: events rht_deferred_worker
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
       __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
       kasan_report+0x12/0x20 mm/kasan/common.c:614
       __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
       __read_once_size include/linux/compiler.h:194 [inline]
       rhashtable_last_table+0x162/0x180 lib/rhashtable.c:212
       rht_deferred_worker+0x111/0x2030 lib/rhashtable.c:411
       process_one_work+0x989/0x1790 kernel/workqueue.c:2269
       worker_thread+0x98/0xe40 kernel/workqueue.c:2415
       kthread+0x354/0x420 kernel/kthread.c:255
       ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
      
      Allocated by task 32687:
       save_stack+0x23/0x90 mm/kasan/common.c:71
       set_track mm/kasan/common.c:79 [inline]
       __kasan_kmalloc mm/kasan/common.c:489 [inline]
       __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
       kasan_kmalloc+0x9/0x10 mm/kasan/common.c:503
       __do_kmalloc_node mm/slab.c:3620 [inline]
       __kmalloc_node+0x4e/0x70 mm/slab.c:3627
       kmalloc_node include/linux/slab.h:590 [inline]
       kvmalloc_node+0x68/0x100 mm/util.c:431
       kvmalloc include/linux/mm.h:637 [inline]
       kvzalloc include/linux/mm.h:645 [inline]
       bucket_table_alloc+0x90/0x480 lib/rhashtable.c:178
       rhashtable_init+0x3f4/0x7b0 lib/rhashtable.c:1057
       inet_frags_init_net include/net/inet_frag.h:109 [inline]
       ipv4_frags_init_net+0x182/0x410 net/ipv4/ip_fragment.c:683
       ops_init+0xb3/0x410 net/core/net_namespace.c:130
       setup_net+0x2d3/0x740 net/core/net_namespace.c:316
       copy_net_ns+0x1df/0x340 net/core/net_namespace.c:439
       create_new_namespaces+0x400/0x7b0 kernel/nsproxy.c:107
       unshare_nsproxy_namespaces+0xc2/0x200 kernel/nsproxy.c:206
       ksys_unshare+0x440/0x980 kernel/fork.c:2692
       __do_sys_unshare kernel/fork.c:2760 [inline]
       __se_sys_unshare kernel/fork.c:2758 [inline]
       __x64_sys_unshare+0x31/0x40 kernel/fork.c:2758
       do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Freed by task 7:
       save_stack+0x23/0x90 mm/kasan/common.c:71
       set_track mm/kasan/common.c:79 [inline]
       __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
       kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
       __cache_free mm/slab.c:3432 [inline]
       kfree+0xcf/0x220 mm/slab.c:3755
       kvfree+0x61/0x70 mm/util.c:460
       bucket_table_free+0x69/0x150 lib/rhashtable.c:108
       rhashtable_free_and_destroy+0x165/0x8b0 lib/rhashtable.c:1155
       inet_frags_exit_net+0x3d/0x50 net/ipv4/inet_fragment.c:152
       ipv4_frags_exit_net+0x73/0x90 net/ipv4/ip_fragment.c:695
       ops_exit_list.isra.0+0xaa/0x150 net/core/net_namespace.c:154
       cleanup_net+0x3fb/0x960 net/core/net_namespace.c:553
       process_one_work+0x989/0x1790 kernel/workqueue.c:2269
       worker_thread+0x98/0xe40 kernel/workqueue.c:2415
       kthread+0x354/0x420 kernel/kthread.c:255
       ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
      
      The buggy address belongs to the object at ffff8880a6497b40
       which belongs to the cache kmalloc-1k of size 1024
      The buggy address is located 48 bytes inside of
       1024-byte region [ffff8880a6497b40, ffff8880a6497f40)
      The buggy address belongs to the page:
      page:ffffea0002992580 refcount:1 mapcount:0 mapping:ffff8880aa400ac0 index:0xffff8880a64964c0 compound_mapcount: 0
      flags: 0x1fffc0000010200(slab|head)
      raw: 01fffc0000010200 ffffea0002916e88 ffffea000218fe08 ffff8880aa400ac0
      raw: ffff8880a64964c0 ffff8880a6496040 0000000100000005 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff8880a6497a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff8880a6497a80: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
      >ffff8880a6497b00: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
                                                                   ^
       ffff8880a6497b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff8880a6497c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      
      Fixes: 648700f7 ("inet: frags: use rhashtables for reassembly units")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3c8fc878
    • E
      net: dynamically allocate fqdir structures · 4907abc6
      Eric Dumazet 提交于
      Following patch will add rcu grace period before fqdir
      rhashtable destruction, so we need to dynamically allocate
      fqdir structures to not force expensive synchronize_rcu() calls
      in netns dismantle path.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4907abc6
    • E
      net: add a net pointer to struct fqdir · a39aca67
      Eric Dumazet 提交于
      fqdir will soon be dynamically allocated.
      
      We need to reach the struct net pointer from fqdir,
      so add it, and replace the various container_of() constructs
      by direct access to the new field.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a39aca67
    • E
      net: rename inet_frags_init_net() to fdir_init() · 9cce45f2
      Eric Dumazet 提交于
      And pass an extra parameter, since we will soon
      dynamically allocate fqdir structures.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9cce45f2
    • E
      ipv4: no longer reference init_net in ip4_frags_ns_ctl_table[] · 8dfdb313
      Eric Dumazet 提交于
      (struct net *)->ipv4.fqdir will soon be a pointer, so make
      sure ip4_frags_ns_ctl_table[] does not reference init_net.
      
      ip4_frags_ns_ctl_register() can perform the needed initialization
      for all netns.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8dfdb313
    • E
      net: rename struct fqdir fields · 803fdd99
      Eric Dumazet 提交于
      Rename the @frags fields from structs netns_ipv4, netns_ipv6,
      netns_nf_frag and netns_ieee802154_lowpan to @fqdir
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      803fdd99
    • E
      89fb9005
    • E
      inet: rename netns_frags to fqdir · 6ce3b4dc
      Eric Dumazet 提交于
      1) struct netns_frags is renamed to struct fqdir
        This structure is really holding many frag queues in a hash table.
      
      2) (struct inet_frag_queue)->net field is renamed to fqdir
        since net is generally associated to a 'struct net' pointer
        in networking stack.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6ce3b4dc
  8. 26 5月, 2019 1 次提交