1. 09 July 2019 (6 commits)
    • xprtrdma: Reduce context switching due to Local Invalidation · d8099fed
      Committed by Chuck Lever
      Since commit ba69cd12 ("xprtrdma: Remove support for FMR memory
      registration"), FRWR is the only supported memory registration mode.
      
      We can take advantage of the asynchronous nature of FRWR's LOCAL_INV
      Work Requests to get rid of the completion wait by having the
      LOCAL_INV completion handler take care of DMA unmapping MRs and
      waking the upper layer RPC waiter.
      
      This eliminates two context switches when local invalidation is
      necessary. As a side benefit, we will no longer need the per-xprt
      deferred completion work queue.
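
      Roughly, the result looks like the sketch below. The helper and field
      names are illustrative, paraphrased from the description above rather
      than quoted from the patch:

      static void frwr_wc_localinv_done(struct ib_cq *cq, struct ib_wc *wc)
      {
              struct rpcrdma_frwr *frwr =
                      container_of(wc->wr_cqe, struct rpcrdma_frwr, fr_cqe);
              struct rpcrdma_mr *mr = container_of(frwr, struct rpcrdma_mr, frwr);

              /* DMA unmap the MR right here in the completion handler ... */
              rpcrdma_mr_unmap_and_put(mr);
              /* ... and wake the RPC waiter directly: no deferred work
               * queue, no extra context switches */
              rpcrdma_complete_rqst(frwr->fr_req->rl_reply);
      }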
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      d8099fed
    • xprtrdma: Add mechanism to place MRs back on the free list · 40088f0e
      Committed by Chuck Lever
      When a marshal operation fails, any MRs that were already set up for
      that request are recycled. Recycling releases MRs and creates new
      ones, which is expensive.
      
      Since commit f2877623 ("xprtrdma: Chain Send to FastReg WRs")
      was merged, recycling FRWRs is unnecessary. This is because before
      that commit, frwr_map had already posted FAST_REG Work Requests,
      so ownership of the MRs had already been passed to the NIC and thus
      dealing with them had to be delayed until they completed.
      
      Since that commit, however, FAST_REG WRs are posted at the same time
      as the Send WR. This means that if marshaling fails, we are certain
      the MRs are safe to simply unmap and place back on the free list
      because neither the Send nor the FAST_REG WRs have been posted yet.
      The kernel still has ownership of the MRs at this point.
      
      This reduces the total number of MRs that the xprt has to create
      under heavy workloads and makes the marshaling logic less brittle.
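
      The unwind path then reduces to a sketch like this (names paraphrased;
      rpcrdma_mr_pop and rpcrdma_mr_unmap_and_put stand in for whatever list
      helpers the xprt uses):

      /* Called on marshal failure, before any WR has been posted: the
       * kernel still owns the MRs, so just unmap them and put them back
       * on the free list instead of destroying and re-creating them. */
      static void frwr_reset(struct rpcrdma_req *req)
      {
              struct rpcrdma_mr *mr;

              while ((mr = rpcrdma_mr_pop(&req->rl_registered)))
                      rpcrdma_mr_unmap_and_put(mr);
      }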
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      40088f0e
    • xprtrdma: Remove fr_state · 84756894
      Committed by Chuck Lever
      Now that both the Send and Receive completions are handled in
      process context, it is safe to DMA unmap and return MRs to the
      free or recycle lists directly in the completion handlers.
      
      Doing this means a VALID or FLUSHED MR can no longer appear on an
      xprt's MR free list, so rpcrdma_frwr no longer needs to track each
      MR's registration state.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      84756894
    • xprtrdma: Remove the RPCRDMA_REQ_F_PENDING flag · 5809ea4f
      Committed by Chuck Lever
      Commit 9590d083 ("xprtrdma: Use xprt_pin_rqst in
      rpcrdma_reply_handler") pins incoming RPC/RDMA replies so they
      can be left in the pending requests queue while they are being
      processed without introducing a race between ->buf_free and the
      transport's reply handler. Therefore RPCRDMA_REQ_F_PENDING is no
      longer necessary.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      5809ea4f
    • xprtrdma: Fix occasional transport deadlock · 05eb06d8
      Committed by Chuck Lever
      Under high I/O workloads, I've noticed that an RPC/RDMA transport
      occasionally deadlocks (IOPS goes to zero, and doesn't recover).
      Diagnosis shows that the sendctx queue is empty, but when sendctxs
      are returned to the queue, the xprt_write_space wake-up never
      occurs. The wake-up logic in rpcrdma_sendctx_put_locked is racy.
      
      I noticed that both EMPTY_SCQ and XPRT_WRITE_SPACE are implemented
      via an atomic bit. Just one of those is sufficient. Removing
      EMPTY_SCQ in favor of the generic bit mechanism makes the deadlock
      un-reproducible.
      
      Without EMPTY_SCQ, rpcrdma_buffer::rb_flags is no longer used and
      is therefore removed.
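
      The pattern, in miniature (illustrative only, not the xprtrdma code):
      one atomic bit is set by the starved sender and atomically claimed by
      whoever returns a sendctx, so the wake-up cannot fall into a window
      between a check and a sleep:

      /* sender side: out of sendctxs, record that we are waiting */
      static void sendctx_queue_empty(struct rpc_xprt *xprt)
      {
              set_bit(XPRT_WRITE_SPACE, &xprt->state);
      }

      /* completion side: test_and_clear_bit atomically decides whether a
       * waiter exists and claims the exclusive right to wake it */
      static void sendctx_returned(struct rpc_xprt *xprt)
      {
              if (test_and_clear_bit(XPRT_WRITE_SPACE, &xprt->state))
                      xprt_write_space(xprt);
      }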
      
      Unfortunately this patch does not apply cleanly to stable. If
      needed, someone will have to port it and test it.
      
      Fixes: 2fad6592 ("xprtrdma: Wait on empty sendctx queue")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      05eb06d8
    • xprtrdma: Replace use of xdr_stream_pos in rpcrdma_marshal_req · 1310051c
      Committed by Chuck Lever
      This is a latent bug. xdr_stream_pos works by subtracting
      xdr_stream::nwords from xdr_buf::len. But xdr_stream::nwords is not
      initialized by xdr_init_encode().
      
      It works today only because all fields in rpcrdma_req::rl_stream
      are initialized to zero by rpcrdma_req_create, making the
      subtraction in xdr_stream_pos always a no-op.
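
      For reference, xdr_stream_pos() has roughly this shape (paraphrased
      from net/sunrpc/xdr.c of that era):

      static unsigned int xdr_stream_pos(const struct xdr_stream *xdr)
      {
              /* nwords is never set by xdr_init_encode(), so this is
               * only correct if the whole stream started out zeroed */
              return (unsigned int)(XDR_QUADLEN(xdr->buf->len) - xdr->nwords) << 2;
      }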
      
      I found this issue via code inspection. It was introduced by commit
      39f4cd9e ("xprtrdma: Harden chunk list encoding against send
      buffer overflow"), but the code has changed enough since then that
      this fix can't be automatically applied to stable.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      1310051c
  2. 03 July 2019 (1 commit)
  3. 07 June 2019 (4 commits)
    • pktgen: do not sleep with the thread lock held. · 720f1de4
      Committed by Paolo Abeni
      Currently, the process issuing a "start" command on the pktgen procfs
      interface acquires the pktgen thread lock and never releases it until
      all pktgen threads have completed. This can block any other pktgen
      command, and any (even unrelated) netdevice removal, indefinitely, as
      the pktgen netdev notifier acquires the same lock.
      
      The issue is demonstrated by the following script, reported by Matteo:
      
      ip -b - <<'EOF'
      	link add type dummy
      	link add type veth
      	link set dummy0 up
      EOF
      modprobe pktgen
      echo reset >/proc/net/pktgen/pgctrl
      {
      	echo rem_device_all
      	echo add_device dummy0
      } >/proc/net/pktgen/kpktgend_0
      echo count 0 >/proc/net/pktgen/dummy0
      echo start >/proc/net/pktgen/pgctrl &
      sleep 1
      rmmod veth
      
      Fix the above by releasing the thread lock around the sleep call.
      
      Additionally, we must prevent racing with a forceful rmmod, as the
      thread lock no longer protects against it. Instead, acquire a
      self-reference before waiting for any thread. As a side effect, running

      rmmod pktgen

      while some thread is running now fails with a "module in use" error;
      before this patch such a command hung indefinitely.
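
      The resulting wait loop looks roughly like this (a sketch assembled
      from the description above; thread_is_running() is pktgen's existing
      helper, the rest is paraphrased, not the verbatim diff):

      static int pktgen_wait_thread_run(struct pktgen_thread *t)
      {
              while (thread_is_running(t)) {
                      /* sleeping with pktgen_thread_lock held was the bug */
                      mutex_unlock(&pktgen_thread_lock);
                      msleep_interruptible(100);
                      mutex_lock(&pktgen_thread_lock);

                      if (signal_pending(current))
                              return 0;
              }
              return 1;
      }

      static int pktgen_wait_all_threads_run(struct pktgen_net *pn)
      {
              struct pktgen_thread *t;
              int sig = 1;

              /* self-reference: pins the module while we sleep without
               * the lock, so a forceful rmmod fails with "module in use" */
              if (!try_module_get(THIS_MODULE))
                      return -ENODEV;

              mutex_lock(&pktgen_thread_lock);
              list_for_each_entry(t, &pn->pktgen_threads, th_list) {
                      sig = pktgen_wait_thread_run(t);
                      if (sig == 0)
                              break;
              }
              mutex_unlock(&pktgen_thread_lock);

              module_put(THIS_MODULE);
              return sig;
      }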
      
      Note: the issue predates the commit reported in the fixes tag, but
      this fix can't be applied before the mentioned commit.
      
      v1 -> v2:
       - no need to check for thread existence after flipping the lock,
         pktgen threads are freed only at net exit time
      
      Fixes: 6146e6a4 ("[PKTGEN]: Removes thread_{un,}lock() macros.")
      Reported-and-tested-by: Matteo Croce <mcroce@redhat.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      720f1de4
    • net: rds: fix memory leak in rds_ib_flush_mr_pool · 85cb9287
      Committed by Zhu Yanjun
      When the following tests run for several hours, the problem occurs.
      
      Server:
          rds-stress -r 1.1.1.16 -D 1M
      Client:
          rds-stress -r 1.1.1.14 -s 1.1.1.16 -D 1M -T 30
      
      The following will occur.
      
      "
      Starting up....
      tsks   tx/s   rx/s  tx+rx K/s    mbi K/s    mbo K/s tx us/c   rtt us  cpu %
        1      0      0       0.00       0.00       0.00    0.00 0.00 -1.00
        1      0      0       0.00       0.00       0.00    0.00 0.00 -1.00
        1      0      0       0.00       0.00       0.00    0.00 0.00 -1.00
        1      0      0       0.00       0.00       0.00    0.00 0.00 -1.00
      "
      From vmcore, we can see that clean_list is NULL.
      
      From the source code, rds_mr_flushd calls rds_ib_mr_pool_flush_worker,
      which then calls
      "
       rds_ib_flush_mr_pool(pool, 0, NULL);
      "
      Then in function
      "
      int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
                               int free_all, struct rds_ib_mr **ibmr_ret)
      "
      ibmr_ret is NULL.
      
      In the source code,
      "
      ...
      list_to_llist_nodes(pool, &unmap_list, &clean_nodes, &clean_tail);
      if (ibmr_ret)
              *ibmr_ret = llist_entry(clean_nodes, struct rds_ib_mr, llnode);
      
      /* more than one entry in llist nodes */
      if (clean_nodes->next)
              llist_add_batch(clean_nodes->next, clean_tail, &pool->clean_list);
      ...
      "
      When ibmr_ret is NULL, llist_entry is not executed, and
      clean_nodes->next rather than clean_nodes is added to clean_list.
      So the first node, clean_nodes, is discarded and can never be used
      again. The workqueue runs periodically, so more and more clean_nodes
      are discarded, and eventually clean_list becomes NULL. Then the
      problem occurs.
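
      The fix therefore consumes the first node only when a caller asked for
      one, and batches whatever remains onto clean_list (a sketch based on
      the description above):

      list_to_llist_nodes(pool, &unmap_list, &clean_nodes, &clean_tail);
      if (ibmr_ret) {
              *ibmr_ret = llist_entry(clean_nodes, struct rds_ib_mr, llnode);
              clean_nodes = clean_nodes->next;
      }
      /* everything not handed to the caller goes back on clean_list,
       * so no node is ever silently dropped */
      if (clean_nodes)
              llist_add_batch(clean_nodes, clean_tail, &pool->clean_list);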
      
      Fixes: 1bc144b6 ("net, rds, Replace xlist in net/rds/xlist.h with llist")
      Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
      Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      85cb9287
    • ipv6: fix EFAULT on sendto with icmpv6 and hdrincl · b9aa52c4
      Committed by Olivier Matz
      The following code returns EFAULT (Bad address):
      
        int one = 1;

        s = socket(AF_INET6, SOCK_RAW, IPPROTO_ICMPV6);
        setsockopt(s, SOL_IPV6, IPV6_HDRINCL, &one, sizeof(one));
        sendto(s, ipv6_icmp6_packet, sizeof(ipv6_icmp6_packet), 0,
               (struct sockaddr *)&addr, sizeof(addr)); /* -1, errno = EFAULT */
      
      The IPv4 equivalent code works. A workaround is to use IPPROTO_RAW
      instead of IPPROTO_ICMPV6.
      
      The failure happens because 2 bytes are eaten from the msghdr by
      rawv6_probe_proto_opt() starting from commit 19e3c66b ("ipv6
      equivalent of "ipv4: Avoid reading user iov twice after
      raw_probe_proto_opt""), but at that time it was not a problem because
      IPV6_HDRINCL was not yet introduced.
      
      Only eat these 2 bytes if hdrincl == 0.
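
      In rawv6_sendmsg() the change amounts to gating the probe (context
      paraphrased, not the verbatim diff):

      if (!hdrincl) {
              /* peeks at the first 2 bytes of the payload to learn the
               * checksum offset; must not consume them when the user is
               * supplying the complete IPv6/ICMPv6 header */
              err = rawv6_probe_proto_opt(&rfv, &fl6);
              if (err)
                      goto out;
      }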
      
      Fixes: 715f504b ("ipv6: add IPV6_HDRINCL option for raw sockets")
      Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
      Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b9aa52c4
    • ipv6: use READ_ONCE() for inet->hdrincl as in ipv4 · 59e3e4b5
      Committed by Olivier Matz
      As it was done in commit 8f659a03 ("net: ipv4: fix for a race
      condition in raw_sendmsg") and commit 20b50d79 ("net: ipv4: emulate
      READ_ONCE() on ->hdrincl bit-field in raw_sendmsg()") for ipv4, copy the
      value of inet->hdrincl into a local variable to avoid introducing a race
      condition in the next commit.
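
      The resulting idiom mirrors the ipv4 commits cited above:

      /* hdrincl should be READ_ONCE(inet->hdrincl), but READ_ONCE() does
       * not work on bit-fields; copying into a local int and re-reading
       * that yields the same one-shot snapshot */
      int hdrincl = inet->hdrincl;

      hdrincl = READ_ONCE(hdrincl);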
      Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      59e3e4b5
  4. 06 June 2019 (5 commits)
    • Revert "fib_rules: return 0 directly if an exactly same rule exists when NLM_F_EXCL not supplied" · 4970b42d
      Committed by Hangbin Liu
      This reverts commit e9919a24.
      
      Nathan reported that the new behaviour breaks Android, as Android just
      adds new rules and deletes old ones.

      If we return 0 without adding duplicate rules, Android will remove the
      newly added rules, causing the system to soft-reboot.
      
      Fixes: e9919a24 ("fib_rules: return 0 directly if an exactly same rule exists when NLM_F_EXCL not supplied")
      Reported-by: Nathan Chancellor <natechancellor@gmail.com>
      Reported-by: Yaro Slav <yaro330@gmail.com>
      Reported-by: Maciej Żenczykowski <zenczykowski@gmail.com>
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
      Tested-by: Nathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4970b42d
    • ethtool: fix potential userspace buffer overflow · 0ee4e769
      Committed by Vivien Didelot
      ethtool_get_regs() allocates a buffer of size ops->get_regs_len(),
      and passes it to the kernel driver via ops->get_regs() for filling.

      There is no restriction on what the kernel drivers can or cannot do
      with the open ethtool_regs structure. They usually set regs->version
      and ignore regs->len, or set it to the same size as ops->get_regs_len().
      
      But if userspace allocates a smaller buffer for the registers dump,
      we would cause a userspace buffer overflow in the final copy_to_user()
      call, which uses the regs.len value potentially reset by the driver.
      
      To fix this, make this case obvious and store regs.len before calling
      ops->get_regs(), to only copy as much data as requested by userspace,
      up to the value returned by ops->get_regs_len().
      
      While at it, remove the redundant check for non-null regbuf.
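
      The fixed flow looks roughly like this (a paraphrased sketch of
      ethtool_get_regs(); reglen caps whatever reaches userspace):

      reglen = ops->get_regs_len(dev);
      if (regs.len < reglen)
              reglen = regs.len;      /* honor the smaller user buffer */

      ops->get_regs(dev, &regs, regbuf);      /* may rewrite regs.len */

      if (copy_to_user(useraddr, &regs, sizeof(regs)))
              goto out;
      useraddr += offsetof(struct ethtool_regs, data);
      /* copy at most what userspace allocated, never the driver's regs.len */
      if (copy_to_user(useraddr, regbuf, reglen))
              goto out;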
      Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
      Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0ee4e769
    • Fix memory leak in sctp_process_init · 0a8dd9f6
      Committed by Neil Horman
      syzbot found the following leak in sctp_process_init
      BUG: memory leak
      unreferenced object 0xffff88810ef68400 (size 1024):
        comm "syz-executor273", pid 7046, jiffies 4294945598 (age 28.770s)
        hex dump (first 32 bytes):
          1d de 28 8d de 0b 1b e3 b5 c2 f9 68 fd 1a 97 25  ..(........h...%
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<00000000a02cebbd>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 [inline]
          [<00000000a02cebbd>] slab_post_alloc_hook mm/slab.h:439 [inline]
          [<00000000a02cebbd>] slab_alloc mm/slab.c:3326 [inline]
          [<00000000a02cebbd>] __do_kmalloc mm/slab.c:3658 [inline]
          [<00000000a02cebbd>] __kmalloc_track_caller+0x15d/0x2c0 mm/slab.c:3675
          [<000000009e6245e6>] kmemdup+0x27/0x60 mm/util.c:119
          [<00000000dfdc5d2d>] kmemdup include/linux/string.h:432 [inline]
          [<00000000dfdc5d2d>] sctp_process_init+0xa7e/0xc20 net/sctp/sm_make_chunk.c:2437
          [<00000000b58b62f8>] sctp_cmd_process_init net/sctp/sm_sideeffect.c:682 [inline]
          [<00000000b58b62f8>] sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1384 [inline]
          [<00000000b58b62f8>] sctp_side_effects net/sctp/sm_sideeffect.c:1194 [inline]
          [<00000000b58b62f8>] sctp_do_sm+0xbdc/0x1d60 net/sctp/sm_sideeffect.c:1165
          [<0000000044e11f96>] sctp_assoc_bh_rcv+0x13c/0x200 net/sctp/associola.c:1074
          [<00000000ec43804d>] sctp_inq_push+0x7f/0xb0 net/sctp/inqueue.c:95
          [<00000000726aa954>] sctp_backlog_rcv+0x5e/0x2a0 net/sctp/input.c:354
          [<00000000d9e249a8>] sk_backlog_rcv include/net/sock.h:950 [inline]
          [<00000000d9e249a8>] __release_sock+0xab/0x110 net/core/sock.c:2418
          [<00000000acae44fa>] release_sock+0x37/0xd0 net/core/sock.c:2934
          [<00000000963cc9ae>] sctp_sendmsg+0x2c0/0x990 net/sctp/socket.c:2122
          [<00000000a7fc7565>] inet_sendmsg+0x64/0x120 net/ipv4/af_inet.c:802
          [<00000000b732cbd3>] sock_sendmsg_nosec net/socket.c:652 [inline]
          [<00000000b732cbd3>] sock_sendmsg+0x54/0x70 net/socket.c:671
          [<00000000274c57ab>] ___sys_sendmsg+0x393/0x3c0 net/socket.c:2292
          [<000000008252aedb>] __sys_sendmsg+0x80/0xf0 net/socket.c:2330
          [<00000000f7bf23d1>] __do_sys_sendmsg net/socket.c:2339 [inline]
          [<00000000f7bf23d1>] __se_sys_sendmsg net/socket.c:2337 [inline]
          [<00000000f7bf23d1>] __x64_sys_sendmsg+0x23/0x30 net/socket.c:2337
          [<00000000a8b4131f>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:3
      
      The problem was that the peer.cookie value pointed to an skb-allocated
      area on the first pass through this function, at which point it was
      overwritten with a heap-allocated value. But in certain cases, where a
      COOKIE_ECHO chunk is included in the packet, a second pass is made
      through sctp_process_init, in which the cookie value is re-allocated,
      leaking the first allocation.
      
      The fix is to always allocate the cookie value and free it when we are
      done using it.
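
      In sketch form (paraphrased from the sentence above, not the verbatim
      diff; clean_up is the function's existing error label):

      /* sctp_process_init(): give every pass its own heap copy */
      asoc->peer.cookie = kmemdup(cookie, asoc->peer.cookie_len, gfp);
      if (!asoc->peer.cookie)
              goto clean_up;

      /* consumer/teardown side: release it once it has been used */
      kfree(asoc->peer.cookie);
      asoc->peer.cookie = NULL;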
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      Reported-by: syzbot+f7e9153b037eac9b1df8@syzkaller.appspotmail.com
      CC: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: netdev@vger.kernel.org
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0a8dd9f6
    • net: rds: fix memory leak when unload rds_rdma · b50e0587
      Committed by Zhu Yanjun
      When KASAN is enabled, after several rds connections are created and
      "rmmod rds_rdma" is then run, the following appears.
      
      "
      BUG rds_ib_incoming (Not tainted): Objects remaining
      in rds_ib_incoming on __kmem_cache_shutdown()
      
      Call Trace:
       dump_stack+0x71/0xab
       slab_err+0xad/0xd0
       __kmem_cache_shutdown+0x17d/0x370
       shutdown_cache+0x17/0x130
       kmem_cache_destroy+0x1df/0x210
       rds_ib_recv_exit+0x11/0x20 [rds_rdma]
       rds_ib_exit+0x7a/0x90 [rds_rdma]
       __x64_sys_delete_module+0x224/0x2c0
       ? __ia32_sys_delete_module+0x2c0/0x2c0
       do_syscall_64+0x73/0x190
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      "
      This is an rds connection memory leak. The root cause: when
      "rmmod rds_rdma" is run, rds_ib_remove_one calls rds_ib_dev_shutdown
      to drop the rds connections, which in turn calls rds_conn_drop as
      below.
      "
      rds_conn_path_drop(&conn->c_path[0], false);
      "
      In the above, destroy is set to false.
      void rds_conn_path_drop(struct rds_conn_path *cp, bool destroy)
      {
              atomic_set(&cp->cp_state, RDS_CONN_ERROR);
      
              rcu_read_lock();
              if (!destroy && rds_destroy_pending(cp->cp_conn)) {
                      rcu_read_unlock();
                      return;
              }
              queue_work(rds_wq, &cp->cp_down_w);
              rcu_read_unlock();
      }
      With destroy set to false, rds_destroy_pending is consulted, and the
      rds connections are not moved to ib_nodev_conns. So the fix sets
      destroy to true to move the rds connections to ib_nodev_conns. In
      rds_ib_unregister_client, flush_workqueue is called to let rds_wq
      finish shutting down the rds connections. rds_ib_destroy_nodev_conns
      then shuts the connections down for good, and finally rds_ib_recv_exit
      is called to destroy the slab caches.
      
      void rds_ib_recv_exit(void)
      {
              kmem_cache_destroy(rds_ib_incoming_slab);
              kmem_cache_destroy(rds_ib_frag_slab);
      }
      With this, the slab memory leak above no longer occurs.
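
      Condensed into code, the fix described above is (a sketch assembled
      from this text, not the verbatim diff):

      /* rds_ib_dev_shutdown(): destroy == true, so the connections are
       * moved to ib_nodev_conns instead of being skipped */
      rds_conn_path_drop(&conn->c_path[0], true);

      /* unload path: let rds_wq finish the shutdown work, tear down the
       * parked connections, and only then destroy the slab caches */
      flush_workqueue(rds_wq);
      rds_ib_destroy_nodev_conns();
      rds_ib_recv_exit();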
      
      From tests,
      256 rds connections
      [root@ca-dev14 ~]# time rmmod rds_rdma
      
      real    0m16.522s
      user    0m0.000s
      sys     0m8.152s
      512 rds connections
      [root@ca-dev14 ~]# time rmmod rds_rdma
      
      real    0m32.054s
      user    0m0.000s
      sys     0m15.568s
      
      To rmmod rds_rdma with 256 rds connections, about 16 seconds are needed.
      And with 512 rds connections, about 32 seconds are needed.
      From ftrace, when one rds connection is destroyed,
      
      "
       19)               |  rds_conn_destroy [rds]() {
       19)   7.782 us    |    rds_conn_path_drop [rds]();
       15)               |  rds_shutdown_worker [rds]() {
       15)               |    rds_conn_shutdown [rds]() {
       15)   1.651 us    |      rds_send_path_reset [rds]();
       15)   7.195 us    |    }
       15) + 11.434 us   |  }
       19)   2.285 us    |    rds_cong_remove_conn [rds]();
       19) * 24062.76 us |  }
      "
      So when many rds connections are destroyed, rds_ib_destroy_nodev_conns
      consumes most of the time.
      Suggested-by: Håkon Bugge <haakon.bugge@oracle.com>
      Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b50e0587
    • ipv4: not do cache for local delivery if bc_forwarding is enabled · 0a90478b
      Committed by Xin Long
      With the topo:
      
          h1 ---| rp1            |
                |     route  rp3 |--- h3 (192.168.200.1)
          h2 ---| rp2            |
      
      If bc_forwarding is set on rp1 but not on rp2, then after running
      "ping 192.168.200.255" on h1 and then on h2, the packets from h2 are
      still forwarded.
      
      This issue is caused by the input route cache. A route should be
      cached for either bc forwarding or local delivery, but not shared
      between them; otherwise, local delivery can reuse a route cached for
      bc forwarding on another interface.

      This patch fixes it by not caching the route for local delivery when
      all.bc_forwarding is enabled.
      
      Note that we don't fix it by checking route cache local flag after
      rt_cache_valid() in "local_input:" and "ip_mkroute_input", as the
      common route code shouldn't be touched for bc_forwarding.
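
      In the input-route slow path, the idea reduces to roughly the
      following (labels and the IN_DEV_BFORWARD() test paraphrased from
      net/ipv4/route.c):

      if (res->type == RTN_BROADCAST) {
              if (IN_DEV_BFORWARD(in_dev))
                      goto make_route;        /* directed-bcast forwarding */
              do_cache = false;               /* local delivery: no cache */
              goto brd_input;
      }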
      
      Fixes: 5cbf777c ("route: add support for directed broadcast forwarding")
      Reported-by: Jianlin Shi <jishi@redhat.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0a90478b
  5. 05 June 2019 (21 commits)
  6. 03 June 2019 (1 commit)
  7. 01 June 2019 (1 commit)
    • net: dsa: sja1105: Don't store frame type in skb->cb · e8d67fa5
      Committed by Vladimir Oltean
      Due to a confusion I thought that eth_type_trans() was called by the
      network stack whereas it can actually be called by network drivers to
      figure out the skb protocol and next packet_type handlers.
      
      In light of the above, it is not safe to store the frame type from the
      DSA tagger's .filter callback (first entry point on RX path), since GRO
      is yet to be invoked on the received traffic.  Hence it is very likely
      that the skb->cb will actually get overwritten between eth_type_trans()
      and the actual DSA packet_type handler.
      
      Of course, what this patch fixes is the actual overwriting of the
      SJA1105_SKB_CB(skb)->type field from the GRO layer, which made all
      frames be seen as SJA1105_FRAME_TYPE_NORMAL (0).
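
      The robust alternative is to re-derive the type at the point of use,
      from data GRO cannot clobber: the frame itself. A sketch (constant
      names follow the sja1105 tagger but are quoted from memory):

      static bool sja1105_is_link_local(const struct sk_buff *skb)
      {
              const struct ethhdr *hdr = eth_hdr(skb);
              u64 dmac = ether_addr_to_u64(hdr->h_dest);

              /* link-local management frames carry a fixed DMAC prefix,
               * so no per-skb state needs to survive until the handler */
              return (dmac & SJA1105_LINKLOCAL_FILTER_A_MASK) ==
                     SJA1105_LINKLOCAL_FILTER_A;
      }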
      
      Fixes: 227d07a0 ("net: dsa: sja1105: Add support for traffic through standalone ports")
      Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e8d67fa5
  8. 31 May 2019 (1 commit)
    • net: correct zerocopy refcnt with udp MSG_MORE · 100f6d8e
      Committed by Willem de Bruijn
      TCP zerocopy takes a uarg reference for every skb, plus one for the
      tcp_sendmsg_locked datapath temporarily, to avoid reaching refcnt zero
      as it builds, sends and frees skbs inside its inner loop.
      
      UDP and RAW zerocopy do not send inside the inner loop so do not need
      the extra sock_zerocopy_get + sock_zerocopy_put pair. Commit
      52900d22288ed ("udp: elide zerocopy operation in hot path") introduced
      extra_uref to pass the initial reference taken in sock_zerocopy_alloc
      to the first generated skb.
      
      But, sock_zerocopy_realloc takes this extra reference at the start of
      every call. With MSG_MORE, no new skb may be generated to attach the
      extra_uref to, so refcnt is incorrectly 2 with only one skb.
      
      Do not take the extra ref if uarg && !tcp, which implies MSG_MORE.
      Update extra_uref accordingly.
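
      The heart of the fix is one line in the append-data path (context
      paraphrased from __ip_append_data()):

      uarg = sock_zerocopy_realloc(sk, length, skb_zcopy(skb));
      if (!uarg)
              return -ENOBUFS;
      /* hold the extra ref only when this is a brand-new uarg, not an
       * MSG_MORE continuation that reuses the skb's existing one */
      extra_uref = !skb_zcopy(skb);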
      
      This conditional assignment triggers a false-positive "may be used
      uninitialized" warning, so extra_uref has to be initialized at its
      definition.
      
      Changes v1->v2: fix typo in Fixes SHA1
      
      Fixes: 52900d22 ("udp: elide zerocopy operation in hot path")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Diagnosed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      100f6d8e