1. 15 Jun 2019 (5 commits)
    • net: add high_order_alloc_disable sysctl/static key · ce27ec60
      By Eric Dumazet
      Since linux-3.7 (commit 5640f768, "net: use a per task frag
      allocator"), TCP sendmsg() has preferred order-3 allocations.
      
      While this gives good results for most cases, we had reports
      that heavy use of TCP over loopback was hitting spinlock
      contention in page allocation/freeing.
      
      This commit adds a sysctl so that admins can opt in to order-0
      allocations. Hopefully the mm layer will optimize order-3
      allocations in the future, since they can give us a nice boost
      (see the first 8 lines of the benchmark below).
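      A minimal C sketch of the mechanism behind the sysctl, condensed
      from the shape of skb_page_frag_refill(); an approximation of the
      commit, not the exact diff:

        /* static key flipped by net.core.high_order_alloc_disable */
        DEFINE_STATIC_KEY_FALSE(net_high_order_alloc_disable_key);

        bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag,
                                  gfp_t gfp)
        {
                /* prefer an order-3 (32KB) page unless the admin opted out */
                if (SKB_FRAG_PAGE_ORDER &&
                    !static_branch_unlikely(&net_high_order_alloc_disable_key)) {
                        pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
                                                  __GFP_COMP | __GFP_NOWARN |
                                                  __GFP_NORETRY,
                                                  SKB_FRAG_PAGE_ORDER);
                        if (likely(pfrag->page)) {
                                pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
                                return true;
                        }
                }
                /* order-0 fallback, always taken when the key is enabled */
                pfrag->page = alloc_page(gfp);
                if (likely(pfrag->page)) {
                        pfrag->size = PAGE_SIZE;
                        return true;
                }
                return false;
        }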
      
      The following benchmark shows a win when more than 8 TCP_STREAM
      threads are running (on a 56-core x86 server in my tests):
      
      for thr in {1..30}
      do
       sysctl -wq net.core.high_order_alloc_disable=0
       T0=`./super_netperf $thr -H 127.0.0.1 -l 15`
       sysctl -wq net.core.high_order_alloc_disable=1
       T1=`./super_netperf $thr -H 127.0.0.1 -l 15`
       echo $thr:$T0:$T1
      done
      
      1: 49979: 37267
      2: 98745: 76286
      3: 141088: 110051
      4: 177414: 144772
      5: 197587: 173563
      6: 215377: 208448
      7: 241061: 234087
      8: 267155: 263373
      9: 295069: 297402
      10: 312393: 335213
      11: 340462: 368778
      12: 371366: 403954
      13: 412344: 443713
      14: 426617: 473580
      15: 474418: 507861
      16: 503261: 538539
      17: 522331: 563096
      18: 532409: 567084
      19: 550824: 605240
      20: 525493: 641988
      21: 564574: 665843
      22: 567349: 690868
      23: 583846: 710917
      24: 588715: 736306
      25: 603212: 763494
      26: 604083: 792654
      27: 602241: 796450
      28: 604291: 797993
      29: 611610: 833249
      30: 577356: 841062
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ce27ec60
    • tcp: add tcp_tx_skb_cache sysctl · 0b7d7f6b
      By Eric Dumazet
      Feng Tang reported a performance regression after the introduction
      of per-TCP-socket tx/rx caches, for TCP over loopback (netperf).
      
      There is a high chance the regression is caused by a change in how
      well the 32 KB per-thread page (current->task_frag) can be recycled,
      and by the lack of pcp caches for order-3 pages.
      
      I could not reproduce the regression myself, all CPUs being stuck
      spinning on the mm spinlocks for page allocs/freeing, regardless
      of the per-TCP-socket caches being enabled or disabled.
      
      It seems best to disable the feature by default, and let
      admins enable it.
      
      The MM layer either needs to provide scalable order-3 page
      allocations, or could attempt a trylock on zone->lock if the
      caller only opportunistically requests a high-order page and is
      able to fall back to order-0 ones in case of pressure.
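      Purely as an illustration of that suggestion (this helper does not
      exist; the name and shape are assumptions, not kernel code):

        /* hypothetical mm helper: take zone->lock opportunistically for
         * a high-order request; on contention return NULL so the caller
         * can fall back to order-0 instead of spinning on the lock */
        static struct page *try_alloc_high_order(struct zone *zone,
                                                 unsigned int order)
        {
                struct page *page = NULL;
                unsigned long flags;

                if (!spin_trylock_irqsave(&zone->lock, flags))
                        return NULL;    /* contended: caller falls back */
                /* ... pull an order-N page off the zone free lists ... */
                spin_unlock_irqrestore(&zone->lock, flags);
                return page;
        }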
      
      Tests run on a 56-core host (112 hyper-threads):
      
      -	35.49%	netperf 		 [kernel.vmlinux]	  [k] queued_spin_lock_slowpath
         - 35.49% queued_spin_lock_slowpath
      	  - 18.18% get_page_from_freelist
      		 - __alloc_pages_nodemask
      			- 18.18% alloc_pages_current
      				 skb_page_frag_refill
      				 sk_page_frag_refill
      				 tcp_sendmsg_locked
      				 tcp_sendmsg
      				 inet_sendmsg
      				 sock_sendmsg
      				 __sys_sendto
      				 __x64_sys_sendto
      				 do_syscall_64
      				 entry_SYSCALL_64_after_hwframe
      				 __libc_send
      	  + 17.31% __free_pages_ok
      +	31.43%	swapper 		 [kernel.vmlinux]	  [k] intel_idle
      +	 9.12%	netperf 		 [kernel.vmlinux]	  [k] copy_user_enhanced_fast_string
      +	 6.53%	netserver		 [kernel.vmlinux]	  [k] copy_user_enhanced_fast_string
      +	 0.69%	netserver		 [kernel.vmlinux]	  [k] queued_spin_lock_slowpath
      +	 0.68%	netperf 		 [kernel.vmlinux]	  [k] skb_release_data
      +	 0.52%	netperf 		 [kernel.vmlinux]	  [k] tcp_sendmsg_locked
      	 0.46%	netperf 		 [kernel.vmlinux]	  [k] _raw_spin_lock_irqsave
      
      Fixes: 472c2e07 ("tcp: add one skb cache for tx")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Feng Tang <feng.tang@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0b7d7f6b
    • tcp: add tcp_rx_skb_cache sysctl · ede61ca4
      By Eric Dumazet
      Instead of relying on rps_needed, it is safer to use a separate
      static key, since we do not want to enable TCP rx_skb_cache
      by default. This feature can cause a huge increase in memory
      usage on hosts with millions of sockets.
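      A short sketch of the dedicated-static-key pattern described above;
      the key name matches this commit, the surrounding rx-path code is
      an approximation:

        /* off by default; flipped only via the new sysctl handler */
        DEFINE_STATIC_KEY_FALSE(tcp_rx_skb_cache_key);

        /* in the rx path: recycle sk->sk_rx_skb_cache only when enabled */
        if (static_branch_unlikely(&tcp_rx_skb_cache_key)) {
                /* ... try to reuse the cached skb ... */
        }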
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ede61ca4
    • net: sched: flower: don't call synchronize_rcu() on mask creation · 99815f50
      By Vlad Buslov
      The current flower mask creation code assumes that the temporary mask
      used when inserting a new filter is stack allocated. To prevent a race
      condition with the data path, synchronize_rcu() is called every time
      fl_create_new_mask() replaces the temporary stack-allocated mask. As
      reported by Jiri, this increases the runtime of creating 20000 flower
      classifiers from 4 seconds to 163 seconds. However, this design is no
      longer necessary since the temporary mask was converted to be
      dynamically allocated by commit 2cddd201 ("net/sched: cls_flower:
      allocate mask dynamically in fl_change()").
      
      Remove the synchronize_rcu() calls from the mask creation code.
      Instead, refactor fl_change() to always deallocate the temporary
      mask after an rcu grace period.
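      A sketch of the deferred-free pattern this refers to; the rcu_head
      field and callback name are assumptions, not the literal patch:

        /* free the temporary mask after a grace period instead of
         * blocking every filter insertion in synchronize_rcu() */
        static void fl_mask_free_rcu(struct rcu_head *head)
        {
                struct fl_flow_mask *mask = container_of(head,
                                                         struct fl_flow_mask,
                                                         rcu);
                kfree(mask);
        }

        call_rcu(&mask->rcu, fl_mask_free_rcu);  /* returns immediately */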
      
      Fixes: 195c234d ("net: sched: flower: handle concurrent mask insertion")
      Reported-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Tested-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      99815f50
    • sctp: Free cookie before we memdup a new one · ce950f10
      By Neil Horman
      Based on comments from Xin, even after the fixes for our recent
      syzbot report of cookie memory leaks, it's possible to get a resend
      of an INIT chunk, which would lead to us leaking cookie memory.
      
      To ensure that we don't leak cookie memory, free any previously
      allocated cookie first.
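      A minimal sketch of that ordering; variable and label names are
      assumptions:

        /* drop any cookie kept from a previous INIT before duplicating
         * the new one, so a retransmitted INIT cannot leak the old
         * allocation */
        kfree(asoc->peer.cookie);
        asoc->peer.cookie = kmemdup(cookie, cookie_len, gfp);
        if (!asoc->peer.cookie)
                goto clean_up;          /* assumed error path */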
      
      Change notes
      v1->v2
      update subsystem tag in subject (davem)
      repeat kfree check for peer_random and peer_hmacs (xin)
      
      v2->v3
      net->sctp
      also free peer_chunks
      
      v3->v4
      fix subject tags
      
      v4->v5
      remove cut line
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      Reported-by: syzbot+f7e9153b037eac9b1df8@syzkaller.appspotmail.com
      CC: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      CC: Xin Long <lucien.xin@gmail.com>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: netdev@vger.kernel.org
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ce950f10
  2. 14 Jun 2019 (8 commits)
  3. 13 Jun 2019 (4 commits)
  4. 12 Jun 2019 (2 commits)
    • net: openvswitch: do not free vport if register_netdevice() fails · 309b6697
      By Taehee Yoo
      To create an internal vport, internal_dev_create() is used, which
      calls register_netdevice() internally. If register_netdevice() fails,
      it calls dev->priv_destructor() to free the netdev's private data;
      here, that private data is the vport itself.

      Hence internal_dev_create() must neither free nor use the vport after
      register_netdevice() has failed.
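      A sketch of the rule the fix enforces (shape only, not the exact
      diff; the remaining cleanup shown is an assumption):

        err = register_netdevice(internal_dev);
        if (err) {
                /* priv_destructor() already ran and freed the vport:
                 * do not call ovs_vport_free() or touch it again here */
                free_netdev(internal_dev);  /* assumed remaining cleanup */
                return ERR_PTR(err);
        }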
      
      Test command
          ovs-dpctl add-dp bonding_masters
      
      Splat looks like:
      [ 1035.667767] kasan: GPF could be caused by NULL-ptr deref or user memory access
      [ 1035.675958] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
      [ 1035.676916] CPU: 1 PID: 1028 Comm: ovs-vswitchd Tainted: G    B             5.2.0-rc3+ #240
      [ 1035.676916] RIP: 0010:internal_dev_create+0x2e5/0x4e0 [openvswitch]
      [ 1035.676916] Code: 48 c1 ea 03 80 3c 02 00 0f 85 9f 01 00 00 4c 8b 23 48 b8 00 00 00 00 00 fc ff df 49 8d bc 24 60 05 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 86 01 00 00 49 8b bc 24 60 05 00 00 e8 e4 68 f4
      [ 1035.713720] RSP: 0018:ffff88810dcb7578 EFLAGS: 00010206
      [ 1035.713720] RAX: dffffc0000000000 RBX: ffff88810d13fe08 RCX: ffffffff84297704
      [ 1035.713720] RDX: 00000000000000ac RSI: 0000000000000000 RDI: 0000000000000560
      [ 1035.713720] RBP: 00000000ffffffef R08: fffffbfff0d3b881 R09: fffffbfff0d3b881
      [ 1035.713720] R10: 0000000000000001 R11: fffffbfff0d3b880 R12: 0000000000000000
      [ 1035.768776] R13: 0000607ee460b900 R14: ffff88810dcb7690 R15: ffff88810dcb7698
      [ 1035.777709] FS:  00007f02095fc980(0000) GS:ffff88811b400000(0000) knlGS:0000000000000000
      [ 1035.777709] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1035.777709] CR2: 00007ffdf01d2f28 CR3: 0000000108258000 CR4: 00000000001006e0
      [ 1035.777709] Call Trace:
      [ 1035.777709]  ovs_vport_add+0x267/0x4f0 [openvswitch]
      [ 1035.777709]  new_vport+0x15/0x1e0 [openvswitch]
      [ 1035.777709]  ovs_vport_cmd_new+0x567/0xd10 [openvswitch]
      [ 1035.777709]  ? ovs_dp_cmd_dump+0x490/0x490 [openvswitch]
      [ 1035.777709]  ? __kmalloc+0x131/0x2e0
      [ 1035.777709]  ? genl_family_rcv_msg+0xa54/0x1030
      [ 1035.777709]  genl_family_rcv_msg+0x63a/0x1030
      [ 1035.777709]  ? genl_unregister_family+0x630/0x630
      [ 1035.841681]  ? debug_show_all_locks+0x2d0/0x2d0
      [ ... ]
      
      Fixes: cf124db5 ("net: Fix inconsistent teardown and release of private netdev state.")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Reviewed-by: Greg Rose <gvrose8192@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      309b6697
    • net: correct udp zerocopy refcnt also when zerocopy only on append · 522924b5
      By Willem de Bruijn
      The below patch fixes an incorrect zerocopy refcnt increment when
      appending with MSG_MORE to an existing zerocopy udp skb.
      
        send(.., MSG_ZEROCOPY | MSG_MORE);	// refcnt 1
        send(.., MSG_ZEROCOPY | MSG_MORE);	// refcnt still 1 (bar frags)
      
      But it missed that zerocopy need not be passed on the first send. The
      right test for whether the uarg is newly allocated, and thus holds the
      extra refcnt of 1, is not !skb, but !skb_zcopy (see the sketch after
      the example below).
      
        send(.., MSG_MORE);			// <no uarg>
        send(.., MSG_ZEROCOPY);		// refcnt 1
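      A sketch of the corrected check, approximating the fix described
      above:

        uarg = sock_zerocopy_realloc(sk, length, skb_zcopy(skb));
        if (!uarg)
                return -ENOBUFS;
        /* the extra reference exists only when the uarg is fresh, i.e.
         * the existing tail skb carried no zerocopy state yet */
        extra_uref = !skb_zcopy(skb);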
      
      Fixes: 100f6d8e ("net: correct zerocopy refcnt with udp MSG_MORE")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      522924b5
  5. 10 Jun 2019 (5 commits)
  6. 08 Jun 2019 (2 commits)
  7. 07 Jun 2019 (5 commits)
    • bpf: fix unconnected udp hooks · 983695fa
      By Daniel Borkmann
      The intention of the cgroup bind/connect/sendmsg BPF hooks is to act
      transparently to applications, as also stated in the original
      motivation in 7828f20e ("Merge branch 'bpf-cgroup-bind-connect'").
      When recently integrating the latter two hooks into Cilium to enable
      host-based load-balancing with Kubernetes, I ran into the issue that
      pods couldn't start up because DNS was broken. Kubernetes typically
      sets up DNS as a service and is thus subject to load-balancing.
      
      Upon further debugging, it turns out that the cgroupv2 sendmsg BPF
      hook API is currently insufficient and thus not usable as-is for
      standard applications shipped with most distros. To break down the
      issue we ran into, a simple example:
      
        # cat /etc/resolv.conf
        nameserver 147.75.207.207
        nameserver 147.75.207.208
      
      For the purpose of a simple test, we set up above IPs as service IPs and
      transparently redirect traffic to a different DNS backend server for that
      node:
      
        # cilium service list
        ID   Frontend            Backend
        1    147.75.207.207:53   1 => 8.8.8.8:53
        2    147.75.207.208:53   1 => 8.8.8.8:53
      
      The attached BPF program is basically selecting one of the backends if the
      service IP/port matches on the cgroup hook. DNS breaks here, because the
      hooks are not transparent enough to applications which have built-in msg_name
      address checks:
      
        # nslookup 1.1.1.1
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        [...]
        ;; connection timed out; no servers could be reached
      
        # dig 1.1.1.1
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        [...]
      
        ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
        ;; global options: +cmd
        ;; connection timed out; no servers could be reached
      
      For comparison, if none of the service IPs is used, and we tell nslookup
      to use 8.8.8.8 directly it works just fine, of course:
      
        # nslookup 1.1.1.1 8.8.8.8
        1.1.1.1.in-addr.arpa	name = one.one.one.one.
      
      In order to fix this and thus act more transparently to the
      application, this needs reverse translation on the recvmsg() side. A
      minimal fix for this API is to add similar recvmsg() hooks behind the
      BPF cgroups static key such that the program can track state and
      replace the current sockaddr_in{,6} with the original service IP.
      From the BPF side, this basically tracks the service tuple plus
      socket cookie in an LRU map, where the reverse NAT can then be
      retrieved via the map value, as one example. Side note: the BPF
      cgroups static key should be converted to a per-hook static key in
      the future.
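      An illustrative recvmsg4 program for the new hook. The map layout is
      a simplification (keyed by backend address rather than the service
      tuple plus socket cookie mentioned above), and all struct/map names
      are assumptions:

        #include <linux/bpf.h>
        #include "bpf_helpers.h"

        struct svc_addr {
                __be32 ip;
                __u32  port;    /* network byte order, as ctx->user_port */
        };

        struct bpf_map_def SEC("maps") rev_nat = {
                .type        = BPF_MAP_TYPE_LRU_HASH,
                .key_size    = sizeof(struct svc_addr),
                .value_size  = sizeof(struct svc_addr),
                .max_entries = 4096,
        };

        SEC("cgroup/recvmsg4")
        int reverse_translate(struct bpf_sock_addr *ctx)
        {
                struct svc_addr key = {
                        .ip   = ctx->user_ip4,  /* backend that replied */
                        .port = ctx->user_port,
                };
                struct svc_addr *orig = bpf_map_lookup_elem(&rev_nat, &key);

                if (orig) {
                        /* restore the service address the app expects */
                        ctx->user_ip4  = orig->ip;
                        ctx->user_port = orig->port;
                }
                return 1;       /* only allowed return code for this hook */
        }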
      
      Same example after this fix:
      
        # cilium service list
        ID   Frontend            Backend
        1    147.75.207.207:53   1 => 8.8.8.8:53
        2    147.75.207.208:53   1 => 8.8.8.8:53
      
      Lookups work fine now:
      
        # nslookup 1.1.1.1
        1.1.1.1.in-addr.arpa    name = one.one.one.one.
      
        Authoritative answers can be found from:
      
        # dig 1.1.1.1
      
        ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
        ;; global options: +cmd
        ;; Got answer:
        ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550
        ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
      
        ;; OPT PSEUDOSECTION:
        ; EDNS: version: 0, flags:; udp: 512
        ;; QUESTION SECTION:
        ;1.1.1.1.                       IN      A
      
        ;; AUTHORITY SECTION:
        .                       23426   IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400
      
        ;; Query time: 17 msec
        ;; SERVER: 147.75.207.207#53(147.75.207.207)
        ;; WHEN: Tue May 21 12:59:38 UTC 2019
        ;; MSG SIZE  rcvd: 111
      
      And at the actual packet level it shows that we're using the backend
      server when talking via the 147.75.207.20{7,8} frontends:
      
        # tcpdump -i any udp
        [...]
        12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
        12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
        12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
        12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
        [...]
      
      In order to be flexible and to have the same semantics as in sendmsg
      BPF programs, we only allow return codes in the [1,1] range. In the
      sendmsg case the program is called if msg->msg_name is present, which
      can be the case in both connected and unconnected UDP.

      The former (connected UDP) relies on the sockaddr_in{,6} passed via
      connect(2) only if the passed msg->msg_name was NULL. Therefore, on
      the recvmsg side, we act in a similar way and call into the BPF
      program whenever a non-NULL msg->msg_name was passed, independent of
      sk->sk_state being TCP_ESTABLISHED or not. Note that in the TCP case,
      msg->msg_name is ignored in the regular recvmsg path and is therefore
      not relevant.
      
      For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE,
      the hook is not called. This is intentional as it aligns with the same
      semantics as in case of TCP cgroup BPF hooks right now. This might be
      better addressed in future through a different bpf_attach_type such
      that this case can be distinguished from the regular recvmsg paths,
      for example.
      
      Fixes: 1cedee13 ("bpf: Hooks for sys_sendmsg")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Martynas Pumputis <m@lambda.lt>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      983695fa
    • pktgen: do not sleep with the thread lock held. · 720f1de4
      By Paolo Abeni
      Currently, the process issuing a "start" command on the pktgen procfs
      interface acquires the pktgen thread lock and never releases it until
      all pktgen threads have completed. This can block indefinitely any
      other pktgen command and any (even unrelated) netdevice removal, as
      the pktgen netdev notifier acquires the same lock.
      
      The issue is demonstrated by the following script, reported by Matteo:
      
      ip -b - <<'EOF'
      	link add type dummy
      	link add type veth
      	link set dummy0 up
      EOF
      modprobe pktgen
      echo reset >/proc/net/pktgen/pgctrl
      {
      	echo rem_device_all
      	echo add_device dummy0
      } >/proc/net/pktgen/kpktgend_0
      echo count 0 >/proc/net/pktgen/dummy0
      echo start >/proc/net/pktgen/pgctrl &
      sleep 1
      rmmod veth
      
      Fix the above by releasing the thread lock around the sleep call.
      
      Additionally, we must prevent racing with a forceful rmmod, as the
      thread lock no longer protects against it. Instead, acquire a
      self-reference before waiting for any thread. As a side effect,
      running

      rmmod pktgen

      while some thread is running now fails with a "module in use" error;
      before this patch, such a command hung indefinitely.
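      A condensed sketch of the two changes (call sites approximated):

        /* 1) do not sleep with pktgen_thread_lock held */
        mutex_unlock(&pktgen_thread_lock);
        schedule_timeout_interruptible(msecs_to_jiffies(10));
        mutex_lock(&pktgen_thread_lock);

        /* 2) pin the module while waiting, so a forceful rmmod fails
         * with "module in use" instead of racing with running threads */
        if (!try_module_get(THIS_MODULE))
                return;
        /* ... wait for all threads to start ... */
        module_put(THIS_MODULE);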
      
      Note: the issue predates the commit reported in the fixes tag, but
      this fix can't be applied before the mentioned commit.
      
      v1 -> v2:
       - no need to check for thread existence after flipping the lock,
         pktgen threads are freed only at net exit time
      
      Fixes: 6146e6a4 ("[PKTGEN]: Removes thread_{un,}lock() macros.")
      Reported-and-tested-by: Matteo Croce <mcroce@redhat.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      720f1de4
    • net: rds: fix memory leak in rds_ib_flush_mr_pool · 85cb9287
      By Zhu Yanjun
      When the following tests run for several hours, the problem occurs.
      
      Server:
          rds-stress -r 1.1.1.16 -D 1M
      Client:
          rds-stress -r 1.1.1.14 -s 1.1.1.16 -D 1M -T 30
      
      The following will occur.
      
      "
      Starting up....
      tsks   tx/s   rx/s  tx+rx K/s    mbi K/s    mbo K/s tx us/c   rtt us  cpu %
        1      0      0       0.00       0.00       0.00    0.00 0.00 -1.00
        1      0      0       0.00       0.00       0.00    0.00 0.00 -1.00
        1      0      0       0.00       0.00       0.00    0.00 0.00 -1.00
        1      0      0       0.00       0.00       0.00    0.00 0.00 -1.00
      "
      From vmcore, we can find that clean_list is NULL.
      
      From the source code, rds_mr_flushd calls rds_ib_mr_pool_flush_worker.
      Then rds_ib_mr_pool_flush_worker calls
      "
       rds_ib_flush_mr_pool(pool, 0, NULL);
      "
      Then in function
      "
      int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
                               int free_all, struct rds_ib_mr **ibmr_ret)
      "
      ibmr_ret is NULL.
      
      In the source code,
      "
      ...
      list_to_llist_nodes(pool, &unmap_list, &clean_nodes, &clean_tail);
      if (ibmr_ret)
              *ibmr_ret = llist_entry(clean_nodes, struct rds_ib_mr, llnode);
      
      /* more than one entry in llist nodes */
      if (clean_nodes->next)
              llist_add_batch(clean_nodes->next, clean_tail, &pool->clean_list);
      ...
      "
      When ibmr_ret is NULL, llist_entry is not executed, and
      clean_nodes->next instead of clean_nodes is added to clean_list. So
      the first clean_nodes entry is discarded and can never be used again.
      The workqueue runs periodically, so more and more entries are
      discarded this way. Eventually clean_list becomes NULL, and the
      problem above occurs.
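      A sketch of the corrected list handling, approximating the fix: strip
      the first node only when a caller asked for one, then batch whatever
      remains:

        list_to_llist_nodes(pool, &unmap_list, &clean_nodes, &clean_tail);
        if (ibmr_ret) {
                *ibmr_ret = llist_entry(clean_nodes, struct rds_ib_mr,
                                        llnode);
                clean_nodes = clean_nodes->next;
        }
        /* put every remaining node back, not just clean_nodes->next */
        if (clean_nodes)
                llist_add_batch(clean_nodes, clean_tail, &pool->clean_list);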
      
      Fixes: 1bc144b6 ("net, rds, Replace xlist in net/rds/xlist.h with llist")
      Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
      Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      85cb9287
    • ipv6: fix EFAULT on sendto with icmpv6 and hdrincl · b9aa52c4
      By Olivier Matz
      The following code returns EFAULT (Bad address):
      
        s = socket(AF_INET6, SOCK_RAW, IPPROTO_ICMPV6);
        setsockopt(s, SOL_IPV6, IPV6_HDRINCL, &one, sizeof(one)); /* one = 1 */
        sendto(ipv6_icmp6_packet, addr);   /* returns -1, errno = EFAULT */
      
      The IPv4 equivalent code works. A workaround is to use IPPROTO_RAW
      instead of IPPROTO_ICMPV6.
      
      The failure happens because 2 bytes are eaten from the msghdr by
      rawv6_probe_proto_opt() starting from commit 19e3c66b ("ipv6
      equivalent of "ipv4: Avoid reading user iov twice after
      raw_probe_proto_opt""), but at that time it was not a problem because
      IPV6_HDRINCL was not yet introduced.
      
      Only eat these 2 bytes if hdrincl == 0.
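      A sketch of the guard in rawv6_sendmsg(), as described above:

        if (!hdrincl) {
                /* reads (and eats 2 bytes of) the user payload; must be
                 * skipped when userspace supplies the full header */
                err = rawv6_probe_proto_opt(&fl6, msg);
                if (err)
                        goto out;
        }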
      
      Fixes: 715f504b ("ipv6: add IPV6_HDRINCL option for raw sockets")
      Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
      Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b9aa52c4
    • ipv6: use READ_ONCE() for inet->hdrincl as in ipv4 · 59e3e4b5
      By Olivier Matz
      As was done in commit 8f659a03 ("net: ipv4: fix for a race
      condition in raw_sendmsg") and commit 20b50d79 ("net: ipv4: emulate
      READ_ONCE() on ->hdrincl bit-field in raw_sendmsg()") for ipv4, copy
      the value of inet->hdrincl into a local variable, to avoid
      introducing a race condition in the next commit.
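      A sketch of the pattern, mirroring the ipv4 commits cited above:

        int hdrincl;

        /* hdrincl should be READ_ONCE(inet->hdrincl), but inet->hdrincl
         * is a bit-field and READ_ONCE() cannot be applied to it
         * directly: copy it first, then emulate READ_ONCE() on the copy */
        hdrincl = inet->hdrincl;
        hdrincl = READ_ONCE(hdrincl);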
      Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      59e3e4b5
  8. 06 Jun 2019 (5 commits)
    • Revert "fib_rules: return 0 directly if an exactly same rule exists when NLM_F_EXCL not supplied" · 4970b42d
      By Hangbin Liu
      This reverts commit e9919a24.
      
      Nathan reported that the new behaviour breaks Android, as Android
      just adds new rules and deletes old ones.

      If we return 0 without adding the duplicate rule, Android will remove
      the newly added rule, causing the system to soft-reboot.
      
      Fixes: e9919a24 ("fib_rules: return 0 directly if an exactly same rule exists when NLM_F_EXCL not supplied")
      Reported-by: Nathan Chancellor <natechancellor@gmail.com>
      Reported-by: Yaro Slav <yaro330@gmail.com>
      Reported-by: Maciej Żenczykowski <zenczykowski@gmail.com>
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
      Tested-by: Nathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4970b42d
    • ethtool: fix potential userspace buffer overflow · 0ee4e769
      By Vivien Didelot
      ethtool_get_regs() allocates a buffer of size ops->get_regs_len()
      and passes it to the kernel driver via ops->get_regs() for filling.

      There is no restriction on what kernel drivers can or cannot do with
      the open ethtool_regs structure. They usually set regs->version and
      ignore regs->len, or set it to the same size as ops->get_regs_len().
      
      But if userspace allocates a smaller buffer for the registers dump,
      we would cause a userspace buffer overflow in the final copy_to_user()
      call, which uses the regs.len value potentially reset by the driver.
      
      To fix this, make this case obvious and store regs.len before calling
      ops->get_regs(), to only copy as much data as requested by userspace,
      up to the value returned by ops->get_regs_len().
      
      While at it, remove the redundant check for non-null regbuf.
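      A condensed sketch of the resulting flow in ethtool_get_regs()
      (variable names approximated):

        reglen = ops->get_regs_len(dev);
        if (regs.len < reglen)
                reglen = regs.len;      /* userspace asked for less */

        ops->get_regs(dev, &regs, regbuf);  /* may rewrite regs.len */

        /* copy the pre-validated amount, not the driver's value */
        if (copy_to_user(useraddr + offsetof(struct ethtool_regs, data),
                         regbuf, reglen))
                return -EFAULT;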
      Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
      Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0ee4e769
    • Fix memory leak in sctp_process_init · 0a8dd9f6
      By Neil Horman
      syzbot found the following leak in sctp_process_init:
      BUG: memory leak
      unreferenced object 0xffff88810ef68400 (size 1024):
        comm "syz-executor273", pid 7046, jiffies 4294945598 (age 28.770s)
        hex dump (first 32 bytes):
          1d de 28 8d de 0b 1b e3 b5 c2 f9 68 fd 1a 97 25  ..(........h...%
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<00000000a02cebbd>] kmemleak_alloc_recursive include/linux/kmemleak.h:55
      [inline]
          [<00000000a02cebbd>] slab_post_alloc_hook mm/slab.h:439 [inline]
          [<00000000a02cebbd>] slab_alloc mm/slab.c:3326 [inline]
          [<00000000a02cebbd>] __do_kmalloc mm/slab.c:3658 [inline]
          [<00000000a02cebbd>] __kmalloc_track_caller+0x15d/0x2c0 mm/slab.c:3675
          [<000000009e6245e6>] kmemdup+0x27/0x60 mm/util.c:119
          [<00000000dfdc5d2d>] kmemdup include/linux/string.h:432 [inline]
          [<00000000dfdc5d2d>] sctp_process_init+0xa7e/0xc20
      net/sctp/sm_make_chunk.c:2437
          [<00000000b58b62f8>] sctp_cmd_process_init net/sctp/sm_sideeffect.c:682
      [inline]
          [<00000000b58b62f8>] sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1384
      [inline]
          [<00000000b58b62f8>] sctp_side_effects net/sctp/sm_sideeffect.c:1194
      [inline]
          [<00000000b58b62f8>] sctp_do_sm+0xbdc/0x1d60 net/sctp/sm_sideeffect.c:1165
          [<0000000044e11f96>] sctp_assoc_bh_rcv+0x13c/0x200
      net/sctp/associola.c:1074
          [<00000000ec43804d>] sctp_inq_push+0x7f/0xb0 net/sctp/inqueue.c:95
          [<00000000726aa954>] sctp_backlog_rcv+0x5e/0x2a0 net/sctp/input.c:354
          [<00000000d9e249a8>] sk_backlog_rcv include/net/sock.h:950 [inline]
          [<00000000d9e249a8>] __release_sock+0xab/0x110 net/core/sock.c:2418
          [<00000000acae44fa>] release_sock+0x37/0xd0 net/core/sock.c:2934
          [<00000000963cc9ae>] sctp_sendmsg+0x2c0/0x990 net/sctp/socket.c:2122
          [<00000000a7fc7565>] inet_sendmsg+0x64/0x120 net/ipv4/af_inet.c:802
          [<00000000b732cbd3>] sock_sendmsg_nosec net/socket.c:652 [inline]
          [<00000000b732cbd3>] sock_sendmsg+0x54/0x70 net/socket.c:671
          [<00000000274c57ab>] ___sys_sendmsg+0x393/0x3c0 net/socket.c:2292
          [<000000008252aedb>] __sys_sendmsg+0x80/0xf0 net/socket.c:2330
          [<00000000f7bf23d1>] __do_sys_sendmsg net/socket.c:2339 [inline]
          [<00000000f7bf23d1>] __se_sys_sendmsg net/socket.c:2337 [inline]
          [<00000000f7bf23d1>] __x64_sys_sendmsg+0x23/0x30 net/socket.c:2337
          [<00000000a8b4131f>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:3
      
      The problem was that the peer.cookie value points to an skb-allocated
      area on the first pass through this function, at which point it is
      overwritten with a heap-allocated value. But in certain cases, where
      a COOKIE_ECHO chunk is included in the packet, a second pass through
      sctp_process_init is made, in which the cookie value is re-allocated,
      leaking the first allocation.

      The fix is to always heap-allocate the cookie value, and free it when
      we are done using it.
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      Reported-by: syzbot+f7e9153b037eac9b1df8@syzkaller.appspotmail.com
      CC: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: netdev@vger.kernel.org
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0a8dd9f6
    • net: rds: fix memory leak when unloading rds_rdma · b50e0587
      By Zhu Yanjun
      With KASAN enabled, after several rds connections are created and
      "rmmod rds_rdma" is run, the following appears:
      
      "
      BUG rds_ib_incoming (Not tainted): Objects remaining
      in rds_ib_incoming on __kmem_cache_shutdown()
      
      Call Trace:
       dump_stack+0x71/0xab
       slab_err+0xad/0xd0
       __kmem_cache_shutdown+0x17d/0x370
       shutdown_cache+0x17/0x130
       kmem_cache_destroy+0x1df/0x210
       rds_ib_recv_exit+0x11/0x20 [rds_rdma]
       rds_ib_exit+0x7a/0x90 [rds_rdma]
       __x64_sys_delete_module+0x224/0x2c0
       ? __ia32_sys_delete_module+0x2c0/0x2c0
       do_syscall_64+0x73/0x190
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      "
      This is an rds connection memory leak. The root cause: when
      "rmmod rds_rdma" is run, rds_ib_remove_one calls rds_ib_dev_shutdown
      to drop the rds connections, which in turn calls rds_conn_drop as
      below.
      "
      rds_conn_path_drop(&conn->c_path[0], false);
      "
      Here, destroy is set to false.
      void rds_conn_path_drop(struct rds_conn_path *cp, bool destroy)
      {
              atomic_set(&cp->cp_state, RDS_CONN_ERROR);
      
              rcu_read_lock();
              if (!destroy && rds_destroy_pending(cp->cp_conn)) {
                      rcu_read_unlock();
                      return;
              }
              queue_work(rds_wq, &cp->cp_down_w);
              rcu_read_unlock();
      }
      In the above function, destroy is false, so rds_destroy_pending is
      consulted and the rds connections are not moved to ib_nodev_conns.
      The fix is to set destroy to true so the connections are moved to
      ib_nodev_conns. In rds_ib_unregister_client, flush_workqueue is
      called to make rds_wq finish shutting down the rds connections, and
      rds_ib_destroy_nodev_conns is called to shut them down for good.
      Then rds_ib_recv_exit is called to destroy the slabs.
      
      void rds_ib_recv_exit(void)
      {
              kmem_cache_destroy(rds_ib_incoming_slab);
              kmem_cache_destroy(rds_ib_frag_slab);
      }
      With this, the above slab memory leak no longer occurs.
      
      From tests:
      256 rds connections
      [root@ca-dev14 ~]# time rmmod rds_rdma
      
      real    0m16.522s
      user    0m0.000s
      sys     0m8.152s
      512 rds connections
      [root@ca-dev14 ~]# time rmmod rds_rdma
      
      real    0m32.054s
      user    0m0.000s
      sys     0m15.568s
      
      So removing rds_rdma with 256 rds connections takes about 16 seconds,
      and with 512 rds connections about 32 seconds.
      From ftrace, when one rds connection is destroyed:
      
      "
       19)               |  rds_conn_destroy [rds]() {
       19)   7.782 us    |    rds_conn_path_drop [rds]();
       15)               |  rds_shutdown_worker [rds]() {
       15)               |    rds_conn_shutdown [rds]() {
       15)   1.651 us    |      rds_send_path_reset [rds]();
       15)   7.195 us    |    }
       15) + 11.434 us   |  }
       19)   2.285 us    |    rds_cong_remove_conn [rds]();
       19) * 24062.76 us |  }
      "
      So when many rds connections are destroyed,
      rds_ib_destroy_nodev_conns takes most of the time.
      Suggested-by: Håkon Bugge <haakon.bugge@oracle.com>
      Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b50e0587
    • ipv4: don't cache for local delivery if bc_forwarding is enabled · 0a90478b
      By Xin Long
      With the topo:
      
          h1 ---| rp1            |
                |     route  rp3 |--- h3 (192.168.200.1)
          h2 ---| rp2            |
      
      If bc_forwarding is set on rp1 but not on rp2, then after doing
      "ping 192.168.200.255" on h1 and then on h2, the packets from h2 are
      still forwarded.
      
      This issue was caused by the input route cache. A cache entry should
      serve either bc forwarding or local delivery, not both; otherwise,
      local delivery can reuse a route cached for bc forwarding on another
      interface.
      
      This patch is to fix it by not doing cache for local delivery if
      all.bc_forwarding is enabled.
      
      Note that we don't fix it by checking route cache local flag after
      rt_cache_valid() in "local_input:" and "ip_mkroute_input", as the
      common route code shouldn't be touched for bc_forwarding.
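      One plausible shape of the check (a sketch; the exact placement in
      ip_route_input_slow() is approximated):

        /* don't cache the local-delivery route while directed-broadcast
         * forwarding is enabled anywhere, so a cached local route can't
         * be reused for bc forwarding on another interface */
        do_cache = true;
        if (IPV4_DEVCONF_ALL(net, BC_FORWARDING))
                do_cache = false;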
      
      Fixes: 5cbf777c ("route: add support for directed broadcast forwarding")
      Reported-by: Jianlin Shi <jishi@redhat.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0a90478b
  9. 05 Jun 2019 (4 commits)