1. 15 6月, 2019 2 次提交
    • E
      tcp: add tcp_tx_skb_cache sysctl · 0b7d7f6b
      Eric Dumazet 提交于
      Feng Tang reported a performance regression after introduction
      of per TCP socket tx/rx caches, for TCP over loopback (netperf)
      
      There is high chance the regression is caused by a change on
      how well the 32 KB per-thread page (current->task_frag) can
      be recycled, and lack of pcp caches for order-3 pages.
      
      I could not reproduce the regression myself, cpus all being
      spinning on the mm spinlocks for page allocs/freeing, regardless
      of enabling or disabling the per tcp socket caches.
      
      It seems best to disable the feature by default, and let
      admins enabling it.
      
      MM layer either needs to provide scalable order-3 pages
      allocations, or could attempt a trylock on zone->lock if
      the caller only attempts to get a high-order page and is
      able to fallback to order-0 ones in case of pressure.
      
      Tests run on a 56 cores host (112 hyper threads)
      
      -	35.49%	netperf 		 [kernel.vmlinux]	  [k] queued_spin_lock_slowpath
         - 35.49% queued_spin_lock_slowpath
      	  - 18.18% get_page_from_freelist
      		 - __alloc_pages_nodemask
      			- 18.18% alloc_pages_current
      				 skb_page_frag_refill
      				 sk_page_frag_refill
      				 tcp_sendmsg_locked
      				 tcp_sendmsg
      				 inet_sendmsg
      				 sock_sendmsg
      				 __sys_sendto
      				 __x64_sys_sendto
      				 do_syscall_64
      				 entry_SYSCALL_64_after_hwframe
      				 __libc_send
      	  + 17.31% __free_pages_ok
      +	31.43%	swapper 		 [kernel.vmlinux]	  [k] intel_idle
      +	 9.12%	netperf 		 [kernel.vmlinux]	  [k] copy_user_enhanced_fast_string
      +	 6.53%	netserver		 [kernel.vmlinux]	  [k] copy_user_enhanced_fast_string
      +	 0.69%	netserver		 [kernel.vmlinux]	  [k] queued_spin_lock_slowpath
      +	 0.68%	netperf 		 [kernel.vmlinux]	  [k] skb_release_data
      +	 0.52%	netperf 		 [kernel.vmlinux]	  [k] tcp_sendmsg_locked
      	 0.46%	netperf 		 [kernel.vmlinux]	  [k] _raw_spin_lock_irqsave
      
      Fixes: 472c2e07 ("tcp: add one skb cache for tx")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NFeng Tang <feng.tang@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0b7d7f6b
    • E
      tcp: add tcp_rx_skb_cache sysctl · ede61ca4
      Eric Dumazet 提交于
      Instead of relying on rps_needed, it is safer to use a separate
      static key, since we do not want to enable TCP rx_skb_cache
      by default. This feature can cause huge increase of memory
      usage on hosts with millions of sockets.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ede61ca4
  2. 12 6月, 2019 1 次提交
  3. 10 6月, 2019 2 次提交
    • Y
      tcp: fix undo spurious SYNACK in passive Fast Open · fcc2202a
      Yuchung Cheng 提交于
      Commit 794200d6 ("tcp: undo cwnd on Fast Open spurious SYNACK
      retransmit") may cause tcp_fastretrans_alert() to warn about pending
      retransmission in Open state. This is triggered when the Fast Open
      server both sends data and has spurious SYNACK retransmission during
      the handshake, and the data packets were lost or reordered.
      
      The root cause is a bit complicated:
      
      (1) Upon receiving SYN-data: a full socket is created with
          snd_una = ISN + 1 by tcp_create_openreq_child()
      
      (2) On SYNACK timeout the server/sender enters CA_Loss state.
      
      (3) Upon receiving the final ACK to complete the handshake, sender
          does not mark FLAG_SND_UNA_ADVANCED since (1)
      
          Sender then calls tcp_process_loss since state is CA_loss by (2)
      
      (4) tcp_process_loss() does not invoke undo operations but instead
          mark REXMIT_LOST to force retransmission
      
      (5) tcp_rcv_synrecv_state_fastopen() calls tcp_try_undo_loss(). It
          changes state to CA_Open but has positive tp->retrans_out
      
      (6) Next ACK triggers the WARN_ON in tcp_fastretrans_alert()
      
      The step that goes wrong is (4) where the undo operation should
      have been invoked because the ACK successfully acknowledged the
      SYN sequence. This fixes that by specifically checking undo
      when the SYN-ACK sequence is acknowledged. Then after
      tcp_process_loss() the state would be further adjusted based
      in tcp_fastretrans_alert() to avoid triggering the warning in (6).
      
      Fixes: 794200d6 ("tcp: undo cwnd on Fast Open spurious SYNACK retransmit")
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fcc2202a
    • E
      net: ipv4: fib_semantics: fix uninitialized variable · c3fee640
      Enrico Weigelt 提交于
      fix an uninitialized variable:
      
        CC      net/ipv4/fib_semantics.o
      net/ipv4/fib_semantics.c: In function 'fib_check_nh_v4_gw':
      net/ipv4/fib_semantics.c:1027:12: warning: 'err' may be used uninitialized in this function [-Wmaybe-uninitialized]
         if (!tbl || err) {
                  ^~
      Signed-off-by: NEnrico Weigelt <info@metux.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c3fee640
  4. 07 6月, 2019 1 次提交
    • D
      bpf: fix unconnected udp hooks · 983695fa
      Daniel Borkmann 提交于
      Intention of cgroup bind/connect/sendmsg BPF hooks is to act transparently
      to applications as also stated in original motivation in 7828f20e ("Merge
      branch 'bpf-cgroup-bind-connect'"). When recently integrating the latter
      two hooks into Cilium to enable host based load-balancing with Kubernetes,
      I ran into the issue that pods couldn't start up as DNS got broken. Kubernetes
      typically sets up DNS as a service and is thus subject to load-balancing.
      
      Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API
      is currently insufficient and thus not usable as-is for standard applications
      shipped with most distros. To break down the issue we ran into with a simple
      example:
      
        # cat /etc/resolv.conf
        nameserver 147.75.207.207
        nameserver 147.75.207.208
      
      For the purpose of a simple test, we set up above IPs as service IPs and
      transparently redirect traffic to a different DNS backend server for that
      node:
      
        # cilium service list
        ID   Frontend            Backend
        1    147.75.207.207:53   1 => 8.8.8.8:53
        2    147.75.207.208:53   1 => 8.8.8.8:53
      
      The attached BPF program is basically selecting one of the backends if the
      service IP/port matches on the cgroup hook. DNS breaks here, because the
      hooks are not transparent enough to applications which have built-in msg_name
      address checks:
      
        # nslookup 1.1.1.1
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        [...]
        ;; connection timed out; no servers could be reached
      
        # dig 1.1.1.1
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        [...]
      
        ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
        ;; global options: +cmd
        ;; connection timed out; no servers could be reached
      
      For comparison, if none of the service IPs is used, and we tell nslookup
      to use 8.8.8.8 directly it works just fine, of course:
      
        # nslookup 1.1.1.1 8.8.8.8
        1.1.1.1.in-addr.arpa	name = one.one.one.one.
      
      In order to fix this and thus act more transparent to the application,
      this needs reverse translation on recvmsg() side. A minimal fix for this
      API is to add similar recvmsg() hooks behind the BPF cgroups static key
      such that the program can track state and replace the current sockaddr_in{,6}
      with the original service IP. From BPF side, this basically tracks the
      service tuple plus socket cookie in an LRU map where the reverse NAT can
      then be retrieved via map value as one example. Side-note: the BPF cgroups
      static key should be converted to a per-hook static key in future.
      
      Same example after this fix:
      
        # cilium service list
        ID   Frontend            Backend
        1    147.75.207.207:53   1 => 8.8.8.8:53
        2    147.75.207.208:53   1 => 8.8.8.8:53
      
      Lookups work fine now:
      
        # nslookup 1.1.1.1
        1.1.1.1.in-addr.arpa    name = one.one.one.one.
      
        Authoritative answers can be found from:
      
        # dig 1.1.1.1
      
        ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
        ;; global options: +cmd
        ;; Got answer:
        ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550
        ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
      
        ;; OPT PSEUDOSECTION:
        ; EDNS: version: 0, flags:; udp: 512
        ;; QUESTION SECTION:
        ;1.1.1.1.                       IN      A
      
        ;; AUTHORITY SECTION:
        .                       23426   IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400
      
        ;; Query time: 17 msec
        ;; SERVER: 147.75.207.207#53(147.75.207.207)
        ;; WHEN: Tue May 21 12:59:38 UTC 2019
        ;; MSG SIZE  rcvd: 111
      
      And from an actual packet level it shows that we're using the back end
      server when talking via 147.75.207.20{7,8} front end:
      
        # tcpdump -i any udp
        [...]
        12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
        12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
        12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
        12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
        [...]
      
      In order to be flexible and to have same semantics as in sendmsg BPF
      programs, we only allow return codes in [1,1] range. In the sendmsg case
      the program is called if msg->msg_name is present which can be the case
      in both, connected and unconnected UDP.
      
      The former only relies on the sockaddr_in{,6} passed via connect(2) if
      passed msg->msg_name was NULL. Therefore, on recvmsg side, we act in similar
      way to call into the BPF program whenever a non-NULL msg->msg_name was
      passed independent of sk->sk_state being TCP_ESTABLISHED or not. Note
      that for TCP case, the msg->msg_name is ignored in the regular recvmsg
      path and therefore not relevant.
      
      For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE,
      the hook is not called. This is intentional as it aligns with the same
      semantics as in case of TCP cgroup BPF hooks right now. This might be
      better addressed in future through a different bpf_attach_type such
      that this case can be distinguished from the regular recvmsg paths,
      for example.
      
      Fixes: 1cedee13 ("bpf: Hooks for sys_sendmsg")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAndrey Ignatov <rdna@fb.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NMartynas Pumputis <m@lambda.lt>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      983695fa
  5. 06 6月, 2019 1 次提交
    • X
      ipv4: not do cache for local delivery if bc_forwarding is enabled · 0a90478b
      Xin Long 提交于
      With the topo:
      
          h1 ---| rp1            |
                |     route  rp3 |--- h3 (192.168.200.1)
          h2 ---| rp2            |
      
      If rp1 bc_forwarding is set while rp2 bc_forwarding is not, after
      doing "ping 192.168.200.255" on h1, then ping 192.168.200.255 on
      h2, and the packets can still be forwared.
      
      This issue was caused by the input route cache. It should only do
      the cache for either bc forwarding or local delivery. Otherwise,
      local delivery can use the route cache for bc forwarding of other
      interfaces.
      
      This patch is to fix it by not doing cache for local delivery if
      all.bc_forwarding is enabled.
      
      Note that we don't fix it by checking route cache local flag after
      rt_cache_valid() in "local_input:" and "ip_mkroute_input", as the
      common route code shouldn't be touched for bc_forwarding.
      
      Fixes: 5cbf777c ("route: add support for directed broadcast forwarding")
      Reported-by: NJianlin Shi <jishi@redhat.com>
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0a90478b
  6. 05 6月, 2019 1 次提交
    • T
      udp: only choose unbound UDP socket for multicast when not in a VRF · 82ba25c6
      Tim Beale 提交于
      By default, packets received in another VRF should not be passed to an
      unbound socket in the default VRF. This patch updates the IPv4 UDP
      multicast logic to match the unicast VRF logic (in compute_score()),
      as well as the IPv6 mcast logic (in __udp_v6_is_mcast_sock()).
      
      The particular case I noticed was DHCP discover packets going
      to the 255.255.255.255 address, which are handled by
      __udp4_lib_mcast_deliver(). The previous code meant that running
      multiple different DHCP server or relay agent instances across VRFs
      did not work correctly - any server/relay agent in the default VRF
      received DHCP discover packets for all other VRFs.
      
      Fixes: 6da5b0f0 ("net: ensure unbound datagram socket to be chosen when not in a VRF")
      Signed-off-by: NTim Beale <timbeale@catalyst.net.nz>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      82ba25c6
  7. 04 6月, 2019 1 次提交
  8. 31 5月, 2019 4 次提交
    • W
      net: correct zerocopy refcnt with udp MSG_MORE · 100f6d8e
      Willem de Bruijn 提交于
      TCP zerocopy takes a uarg reference for every skb, plus one for the
      tcp_sendmsg_locked datapath temporarily, to avoid reaching refcnt zero
      as it builds, sends and frees skbs inside its inner loop.
      
      UDP and RAW zerocopy do not send inside the inner loop so do not need
      the extra sock_zerocopy_get + sock_zerocopy_put pair. Commit
      52900d22288ed ("udp: elide zerocopy operation in hot path") introduced
      extra_uref to pass the initial reference taken in sock_zerocopy_alloc
      to the first generated skb.
      
      But, sock_zerocopy_realloc takes this extra reference at the start of
      every call. With MSG_MORE, no new skb may be generated to attach the
      extra_uref to, so refcnt is incorrectly 2 with only one skb.
      
      Do not take the extra ref if uarg && !tcp, which implies MSG_MORE.
      Update extra_uref accordingly.
      
      This conditional assignment triggers a false positive may be used
      uninitialized warning, so have to initialize extra_uref at define.
      
      Changes v1->v2: fix typo in Fixes SHA1
      
      Fixes: 52900d22 ("udp: elide zerocopy operation in hot path")
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Diagnosed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      100f6d8e
    • J
      net: don't clear sock->sk early to avoid trouble in strparser · 2b81f816
      Jakub Kicinski 提交于
      af_inet sets sock->sk to NULL which trips strparser over:
      
      BUG: kernel NULL pointer dereference, address: 0000000000000012
      PGD 0 P4D 0
      Oops: 0000 [#1] SMP PTI
      CPU: 7 PID: 0 Comm: swapper/7 Not tainted 5.2.0-rc1-00139-g14629453a6d3 #21
      RIP: 0010:tcp_peek_len+0x10/0x60
      RSP: 0018:ffffc02e41c54b98 EFLAGS: 00010246
      RAX: 0000000000000000 RBX: ffff9cf924c4e030 RCX: 0000000000000051
      RDX: 0000000000000000 RSI: 000000000000000c RDI: ffff9cf97128f480
      RBP: ffff9cf9365e0300 R08: ffff9cf94fe7d2c0 R09: 0000000000000000
      R10: 000000000000036b R11: ffff9cf939735e00 R12: ffff9cf91ad9ae40
      R13: ffff9cf924c4e000 R14: ffff9cf9a8fcbaae R15: 0000000000000020
      FS: 0000000000000000(0000) GS:ffff9cf9af7c0000(0000) knlGS:0000000000000000
      CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000012 CR3: 000000013920a003 CR4: 00000000003606e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
       <IRQ>
       strp_data_ready+0x48/0x90
       tls_data_ready+0x22/0xd0 [tls]
       tcp_rcv_established+0x569/0x620
       tcp_v4_do_rcv+0x127/0x1e0
       tcp_v4_rcv+0xad7/0xbf0
       ip_protocol_deliver_rcu+0x2c/0x1c0
       ip_local_deliver_finish+0x41/0x50
       ip_local_deliver+0x6b/0xe0
       ? ip_protocol_deliver_rcu+0x1c0/0x1c0
       ip_rcv+0x52/0xd0
       ? ip_rcv_finish_core.isra.20+0x380/0x380
       __netif_receive_skb_one_core+0x7e/0x90
       netif_receive_skb_internal+0x42/0xf0
       napi_gro_receive+0xed/0x150
       nfp_net_poll+0x7a2/0xd30 [nfp]
       ? kmem_cache_free_bulk+0x286/0x310
       net_rx_action+0x149/0x3b0
       __do_softirq+0xe3/0x30a
       ? handle_irq_event_percpu+0x6a/0x80
       irq_exit+0xe8/0xf0
       do_IRQ+0x85/0xd0
       common_interrupt+0xf/0xf
       </IRQ>
      RIP: 0010:cpuidle_enter_state+0xbc/0x450
      
      To avoid this issue set sock->sk after sk_prot->close.
      My grepping and testing did not discover any code which
      would depend on the current behaviour.
      
      Fixes: c46234eb ("tls: RX path for ktls")
      Reported-by: NDavid Beckett <david.beckett@netronome.com>
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: NDirk van der Merwe <dirk.vandermerwe@netronome.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2b81f816
    • Y
      ipv4: tcp_input: fix stack out of bounds when parsing TCP options. · 9609dad2
      Young Xiao 提交于
      The TCP option parsing routines in tcp_parse_options function could
      read one byte out of the buffer of the TCP options.
      
      1         while (length > 0) {
      2                 int opcode = *ptr++;
      3                 int opsize;
      4
      5                 switch (opcode) {
      6                 case TCPOPT_EOL:
      7                         return;
      8                 case TCPOPT_NOP:        /* Ref: RFC 793 section 3.1 */
      9                         length--;
      10                        continue;
      11                default:
      12                        opsize = *ptr++; //out of bound access
      
      If length = 1, then there is an access in line2.
      And another access is occurred in line 12.
      This would lead to out-of-bound access.
      
      Therefore, in the patch we check that the available data length is
      larger enough to pase both TCP option code and size.
      Signed-off-by: NYoung Xiao <92siuyang@gmail.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9609dad2
    • T
      treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 · 2874c5fd
      Thomas Gleixner 提交于
      Based on 1 normalized pattern(s):
      
        this program is free software you can redistribute it and or modify
        it under the terms of the gnu general public license as published by
        the free software foundation either version 2 of the license or at
        your option any later version
      
      extracted by the scancode license scanner the SPDX license identifier
      
        GPL-2.0-or-later
      
      has been chosen to replace the boilerplate/reference in 3029 file(s).
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NAllison Randal <allison@lohutok.net>
      Cc: linux-spdx@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.deSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2874c5fd
  9. 26 5月, 2019 1 次提交
  10. 23 5月, 2019 2 次提交
    • E
      ipv4/igmp: fix build error if !CONFIG_IP_MULTICAST · 903869bd
      Eric Dumazet 提交于
      ip_sf_list_clear_all() needs to be defined even if !CONFIG_IP_MULTICAST
      
      Fixes: 3580d04a ("ipv4/igmp: fix another memory leak in igmpv3_del_delrec()")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      903869bd
    • E
      ipv4/igmp: fix another memory leak in igmpv3_del_delrec() · 3580d04a
      Eric Dumazet 提交于
      syzbot reported memory leaks [1] that I have back tracked to
      a missing cleanup from igmpv3_del_delrec() when
      (im->sfmode != MCAST_INCLUDE)
      
      Add ip_sf_list_clear_all() and kfree_pmc() helpers to explicitely
      handle the cleanups before freeing.
      
      [1]
      
      BUG: memory leak
      unreferenced object 0xffff888123e32b00 (size 64):
        comm "softirq", pid 0, jiffies 4294942968 (age 8.010s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 e0 00 00 01 00 00 00 00  ................
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<000000006105011b>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 [inline]
          [<000000006105011b>] slab_post_alloc_hook mm/slab.h:439 [inline]
          [<000000006105011b>] slab_alloc mm/slab.c:3326 [inline]
          [<000000006105011b>] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553
          [<000000004bba8073>] kmalloc include/linux/slab.h:547 [inline]
          [<000000004bba8073>] kzalloc include/linux/slab.h:742 [inline]
          [<000000004bba8073>] ip_mc_add1_src net/ipv4/igmp.c:1961 [inline]
          [<000000004bba8073>] ip_mc_add_src+0x36b/0x400 net/ipv4/igmp.c:2085
          [<00000000a46a65a0>] ip_mc_msfilter+0x22d/0x310 net/ipv4/igmp.c:2475
          [<000000005956ca89>] do_ip_setsockopt.isra.0+0x1795/0x1930 net/ipv4/ip_sockglue.c:957
          [<00000000848e2d2f>] ip_setsockopt+0x3b/0xb0 net/ipv4/ip_sockglue.c:1246
          [<00000000b9db185c>] udp_setsockopt+0x4e/0x90 net/ipv4/udp.c:2616
          [<000000003028e438>] sock_common_setsockopt+0x38/0x50 net/core/sock.c:3130
          [<0000000015b65589>] __sys_setsockopt+0x98/0x120 net/socket.c:2078
          [<00000000ac198ef0>] __do_sys_setsockopt net/socket.c:2089 [inline]
          [<00000000ac198ef0>] __se_sys_setsockopt net/socket.c:2086 [inline]
          [<00000000ac198ef0>] __x64_sys_setsockopt+0x26/0x30 net/socket.c:2086
          [<000000000a770437>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:301
          [<00000000d3adb93b>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: 9c8bb163 ("igmp, mld: Fix memory leak in igmpv3/mld_del_delrec()")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Hangbin Liu <liuhangbin@gmail.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3580d04a
  11. 21 5月, 2019 6 次提交
  12. 20 5月, 2019 1 次提交
  13. 17 5月, 2019 1 次提交
  14. 16 5月, 2019 2 次提交
  15. 15 5月, 2019 1 次提交
  16. 14 5月, 2019 1 次提交
    • J
      bpf: sockmap remove duplicate queue free · c42253cc
      John Fastabend 提交于
      In tcp bpf remove we free the cork list and purge the ingress msg
      list. However we do this before the ref count reaches zero so it
      could be possible some other access is in progress. In this case
      (tcp close and/or tcp_unhash) we happen to also hold the sock
      lock so no path exists but lets fix it otherwise it is extremely
      fragile and breaks the reference counting rules. Also we already
      check the cork list and ingress msg queue and free them once the
      ref count reaches zero so its wasteful to check twice.
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      c42253cc
  17. 10 5月, 2019 1 次提交
  18. 09 5月, 2019 1 次提交
  19. 06 5月, 2019 2 次提交
  20. 05 5月, 2019 3 次提交
  21. 04 5月, 2019 1 次提交
    • D
      ipmr_base: Do not reset index in mr_table_dump · 7fcd1e03
      David Ahern 提交于
      e is the counter used to save the location of a dump when an
      skb is filled. Once the walk of the table is complete, mr_table_dump
      needs to return without resetting that index to 0. Dump of a specific
      table is looping because of the reset because there is no way to
      indicate the walk of the table is done.
      
      Move the reset to the caller so the dump of each table starts at 0,
      but the loop counter is maintained if a dump fills an skb.
      
      Fixes: e1cedae1 ("ipmr: Refactor mr_rtm_dumproute")
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7fcd1e03
  22. 02 5月, 2019 2 次提交
    • E
      udp: fix GRO packet of death · 4dd2b82d
      Eric Dumazet 提交于
      syzbot was able to crash host by sending UDP packets with a 0 payload.
      
      TCP does not have this issue since we do not aggregate packets without
      payload.
      
      Since dev_gro_receive() sets gso_size based on skb_gro_len(skb)
      it seems not worth trying to cope with padded packets.
      
      BUG: KASAN: slab-out-of-bounds in skb_gro_receive+0xf5f/0x10e0 net/core/skbuff.c:3826
      Read of size 16 at addr ffff88808893fff0 by task syz-executor612/7889
      
      CPU: 0 PID: 7889 Comm: syz-executor612 Not tainted 5.1.0-rc7+ #96
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187
       kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
       __asan_report_load16_noabort+0x14/0x20 mm/kasan/generic_report.c:133
       skb_gro_receive+0xf5f/0x10e0 net/core/skbuff.c:3826
       udp_gro_receive_segment net/ipv4/udp_offload.c:382 [inline]
       call_gro_receive include/linux/netdevice.h:2349 [inline]
       udp_gro_receive+0xb61/0xfd0 net/ipv4/udp_offload.c:414
       udp4_gro_receive+0x763/0xeb0 net/ipv4/udp_offload.c:478
       inet_gro_receive+0xe72/0x1110 net/ipv4/af_inet.c:1510
       dev_gro_receive+0x1cd0/0x23c0 net/core/dev.c:5581
       napi_gro_frags+0x36b/0xd10 net/core/dev.c:5843
       tun_get_user+0x2f24/0x3fb0 drivers/net/tun.c:1981
       tun_chr_write_iter+0xbd/0x156 drivers/net/tun.c:2027
       call_write_iter include/linux/fs.h:1866 [inline]
       do_iter_readv_writev+0x5e1/0x8e0 fs/read_write.c:681
       do_iter_write fs/read_write.c:957 [inline]
       do_iter_write+0x184/0x610 fs/read_write.c:938
       vfs_writev+0x1b3/0x2f0 fs/read_write.c:1002
       do_writev+0x15e/0x370 fs/read_write.c:1037
       __do_sys_writev fs/read_write.c:1110 [inline]
       __se_sys_writev fs/read_write.c:1107 [inline]
       __x64_sys_writev+0x75/0xb0 fs/read_write.c:1107
       do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x441cc0
      Code: 05 48 3d 01 f0 ff ff 0f 83 9d 09 fc ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 83 3d 51 93 29 00 00 75 14 b8 14 00 00 00 0f 05 <48> 3d 01 f0 ff ff 0f 83 74 09 fc ff c3 48 83 ec 08 e8 ba 2b 00 00
      RSP: 002b:00007ffe8c716118 EFLAGS: 00000246 ORIG_RAX: 0000000000000014
      RAX: ffffffffffffffda RBX: 00007ffe8c716150 RCX: 0000000000441cc0
      RDX: 0000000000000001 RSI: 00007ffe8c716170 RDI: 00000000000000f0
      RBP: 0000000000000000 R08: 000000000000ffff R09: 0000000000a64668
      R10: 0000000020000040 R11: 0000000000000246 R12: 000000000000c2d9
      R13: 0000000000402b50 R14: 0000000000000000 R15: 0000000000000000
      
      Allocated by task 5143:
       save_stack+0x45/0xd0 mm/kasan/common.c:75
       set_track mm/kasan/common.c:87 [inline]
       __kasan_kmalloc mm/kasan/common.c:497 [inline]
       __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:470
       kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:505
       slab_post_alloc_hook mm/slab.h:437 [inline]
       slab_alloc mm/slab.c:3393 [inline]
       kmem_cache_alloc+0x11a/0x6f0 mm/slab.c:3555
       mm_alloc+0x1d/0xd0 kernel/fork.c:1030
       bprm_mm_init fs/exec.c:363 [inline]
       __do_execve_file.isra.0+0xaa3/0x23f0 fs/exec.c:1791
       do_execveat_common fs/exec.c:1865 [inline]
       do_execve fs/exec.c:1882 [inline]
       __do_sys_execve fs/exec.c:1958 [inline]
       __se_sys_execve fs/exec.c:1953 [inline]
       __x64_sys_execve+0x8f/0xc0 fs/exec.c:1953
       do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Freed by task 5351:
       save_stack+0x45/0xd0 mm/kasan/common.c:75
       set_track mm/kasan/common.c:87 [inline]
       __kasan_slab_free+0x102/0x150 mm/kasan/common.c:459
       kasan_slab_free+0xe/0x10 mm/kasan/common.c:467
       __cache_free mm/slab.c:3499 [inline]
       kmem_cache_free+0x86/0x260 mm/slab.c:3765
       __mmdrop+0x238/0x320 kernel/fork.c:677
       mmdrop include/linux/sched/mm.h:49 [inline]
       finish_task_switch+0x47b/0x780 kernel/sched/core.c:2746
       context_switch kernel/sched/core.c:2880 [inline]
       __schedule+0x81b/0x1cc0 kernel/sched/core.c:3518
       preempt_schedule_irq+0xb5/0x140 kernel/sched/core.c:3745
       retint_kernel+0x1b/0x2d
       arch_local_irq_restore arch/x86/include/asm/paravirt.h:767 [inline]
       kmem_cache_free+0xab/0x260 mm/slab.c:3766
       anon_vma_chain_free mm/rmap.c:134 [inline]
       unlink_anon_vmas+0x2ba/0x870 mm/rmap.c:401
       free_pgtables+0x1af/0x2f0 mm/memory.c:394
       exit_mmap+0x2d1/0x530 mm/mmap.c:3144
       __mmput kernel/fork.c:1046 [inline]
       mmput+0x15f/0x4c0 kernel/fork.c:1067
       exec_mmap fs/exec.c:1046 [inline]
       flush_old_exec+0x8d9/0x1c20 fs/exec.c:1279
       load_elf_binary+0x9bc/0x53f0 fs/binfmt_elf.c:864
       search_binary_handler fs/exec.c:1656 [inline]
       search_binary_handler+0x17f/0x570 fs/exec.c:1634
       exec_binprm fs/exec.c:1698 [inline]
       __do_execve_file.isra.0+0x1394/0x23f0 fs/exec.c:1818
       do_execveat_common fs/exec.c:1865 [inline]
       do_execve fs/exec.c:1882 [inline]
       __do_sys_execve fs/exec.c:1958 [inline]
       __se_sys_execve fs/exec.c:1953 [inline]
       __x64_sys_execve+0x8f/0xc0 fs/exec.c:1953
       do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The buggy address belongs to the object at ffff88808893f7c0
       which belongs to the cache mm_struct of size 1496
      The buggy address is located 600 bytes to the right of
       1496-byte region [ffff88808893f7c0, ffff88808893fd98)
      The buggy address belongs to the page:
      page:ffffea0002224f80 count:1 mapcount:0 mapping:ffff88821bc40ac0 index:0xffff88808893f7c0 compound_mapcount: 0
      flags: 0x1fffc0000010200(slab|head)
      raw: 01fffc0000010200 ffffea00025b4f08 ffffea00027b9d08 ffff88821bc40ac0
      raw: ffff88808893f7c0 ffff88808893e440 0000000100000001 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff88808893fe80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
       ffff88808893ff00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      >ffff88808893ff80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
                                                                   ^
       ffff888088940000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
       ffff888088940080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      
      Fixes: e20cf8d3 ("udp: implement GRO for plain UDP sockets.")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4dd2b82d
    • S
      ipv4: ip_do_fragment: Preserve skb_iif during fragmentation · d2f0c961
      Shmulik Ladkani 提交于
      Previously, during fragmentation after forwarding, skb->skb_iif isn't
      preserved, i.e. 'ip_copy_metadata' does not copy skb_iif from given
      'from' skb.
      
      As a result, ip_do_fragment's creates fragments with zero skb_iif,
      leading to inconsistent behavior.
      
      Assume for example an eBPF program attached at tc egress (post
      forwarding) that examines __sk_buff->ingress_ifindex:
       - the correct iif is observed if forwarding path does not involve
         fragmentation/refragmentation
       - a bogus iif is observed if forwarding path involves
         fragmentation/refragmentatiom
      
      Fix, by preserving skb_iif during 'ip_copy_metadata'.
      Signed-off-by: NShmulik Ladkani <shmulik.ladkani@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d2f0c961
  23. 01 5月, 2019 2 次提交