1. 07 5月, 2018 7 次提交
  2. 05 5月, 2018 1 次提交
  3. 04 5月, 2018 29 次提交
    • S
      smc: add support for splice() · 9014db20
      Stefan Raspl 提交于
      Provide an implementation for splice() when we are using SMC. See
      smc_splice_read() for further details.
      Signed-off-by: NStefan Raspl <raspl@linux.ibm.com>
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com><
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9014db20
    • S
      smc: allocate RMBs as compound pages · 2ef4f27a
      Stefan Raspl 提交于
      Preparatory work for splice() support.
      Signed-off-by: NStefan Raspl <raspl@linux.ibm.com>
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com><
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2ef4f27a
    • S
      smc: make smc_rx_wait_data() generic · b51fa1b1
      Stefan Raspl 提交于
      Turn smc_rx_wait_data into a generic function that can be used at various
      instances to wait on traffic to complete with varying criteria.
      Signed-off-by: NStefan Raspl <raspl@linux.ibm.com>
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com><
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b51fa1b1
    • S
      smc: simplify abort logic · c8b8ec8e
      Stefan Raspl 提交于
      Some of the conditions to exit recv() are common in two pathes - cleaning up
      code by moving the check up so we have it only once.
      Signed-off-by: NStefan Raspl <raspl@linux.ibm.com>
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com><
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c8b8ec8e
    • D
      bpf: add skb_load_bytes_relative helper · 4e1ec56c
      Daniel Borkmann 提交于
      This adds a small BPF helper similar to bpf_skb_load_bytes() that
      is able to load relative to mac/net header offset from the skb's
      linear data. Compared to bpf_skb_load_bytes(), it takes a fifth
      argument namely start_header, which is either BPF_HDR_START_MAC
      or BPF_HDR_START_NET. This allows for a more flexible alternative
      compared to LD_ABS/LD_IND with negative offset. It's enabled for
      tc BPF programs as well as sock filter program types where it's
      mainly useful in reuseport programs to ease access to lower header
      data.
      
      Reference: https://lists.iovisor.org/pipermail/iovisor-dev/2017-March/000698.htmlSigned-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      4e1ec56c
    • D
      bpf: implement ld_abs/ld_ind in native bpf · e0cea7ce
      Daniel Borkmann 提交于
      The main part of this work is to finally allow removal of LD_ABS
      and LD_IND from the BPF core by reimplementing them through native
      eBPF instead. Both LD_ABS/LD_IND were carried over from cBPF and
      keeping them around in native eBPF caused way more trouble than
      actually worth it. To just list some of the security issues in
      the past:
      
        * fdfaf64e ("x86: bpf_jit: support negative offsets")
        * 35607b02 ("sparc: bpf_jit: fix loads from negative offsets")
        * e0ee9c12 ("x86: bpf_jit: fix two bugs in eBPF JIT compiler")
        * 07aee943 ("bpf, sparc: fix usage of wrong reg for load_skb_regs after call")
        * 6d59b7db ("bpf, s390x: do not reload skb pointers in non-skb context")
        * 87338c8e ("bpf, ppc64: do not reload skb pointers in non-skb context")
      
      For programs in native eBPF, LD_ABS/LD_IND are pretty much legacy
      these days due to their limitations and more efficient/flexible
      alternatives that have been developed over time such as direct
      packet access. LD_ABS/LD_IND only cover 1/2/4 byte loads into a
      register, the load happens in host endianness and its exception
      handling can yield unexpected behavior. The latter is explained
      in depth in f6b1b3bf ("bpf: fix subprog verifier bypass by
      div/mod by 0 exception") with similar cases of exceptions we had.
      In native eBPF more recent program types will disable LD_ABS/LD_IND
      altogether through may_access_skb() in verifier, and given the
      limitations in terms of exception handling, it's also disabled
      in programs that use BPF to BPF calls.
      
      In terms of cBPF, the LD_ABS/LD_IND is used in networking programs
      to access packet data. It is not used in seccomp-BPF but programs
      that use it for socket filtering or reuseport for demuxing with
      cBPF. This is mostly relevant for applications that have not yet
      migrated to native eBPF.
      
      The main complexity and source of bugs in LD_ABS/LD_IND is coming
      from their implementation in the various JITs. Most of them keep
      the model around from cBPF times by implementing a fastpath written
      in asm. They use typically two from the BPF program hidden CPU
      registers for caching the skb's headlen (skb->len - skb->data_len)
      and skb->data. Throughout the JIT phase this requires to keep track
      whether LD_ABS/LD_IND are used and if so, the two registers need
      to be recached each time a BPF helper would change the underlying
      packet data in native eBPF case. At least in eBPF case, available
      CPU registers are rare and the additional exit path out of the
      asm written JIT helper makes it also inflexible since not all
      parts of the JITer are in control from plain C. A LD_ABS/LD_IND
      implementation in eBPF therefore allows to significantly reduce
      the complexity in JITs with comparable performance results for
      them, e.g.:
      
      test_bpf             tcpdump port 22             tcpdump complex
      x64      - before    15 21 10                    14 19  18
               - after      7 10 10                     7 10  15
      arm64    - before    40 91 92                    40 91 151
               - after     51 64 73                    51 62 113
      
      For cBPF we now track any usage of LD_ABS/LD_IND in bpf_convert_filter()
      and cache the skb's headlen and data in the cBPF prologue. The
      BPF_REG_TMP gets remapped from R8 to R2 since it's mainly just
      used as a local temporary variable. This allows to shrink the
      image on x86_64 also for seccomp programs slightly since mapping
      to %rsi is not an ereg. In callee-saved R8 and R9 we now track
      skb data and headlen, respectively. For normal prologue emission
      in the JITs this does not add any extra instructions since R8, R9
      are pushed to stack in any case from eBPF side. cBPF uses the
      convert_bpf_ld_abs() emitter which probes the fast path inline
      already and falls back to bpf_skb_load_helper_{8,16,32}() helper
      relying on the cached skb data and headlen as well. R8 and R9
      never need to be reloaded due to bpf_helper_changes_pkt_data()
      since all skb access in cBPF is read-only. Then, for the case
      of native eBPF, we use the bpf_gen_ld_abs() emitter, which calls
      the bpf_skb_load_helper_{8,16,32}_no_cache() helper unconditionally,
      does neither cache skb data and headlen nor has an inlined fast
      path. The reason for the latter is that native eBPF does not have
      any extra registers available anyway, but even if there were, it
      avoids any reload of skb data and headlen in the first place.
      Additionally, for the negative offsets, we provide an alternative
      bpf_skb_load_bytes_relative() helper in eBPF which operates
      similarly as bpf_skb_load_bytes() and allows for more flexibility.
      Tested myself on x64, arm64, s390x, from Sandipan on ppc64.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      e0cea7ce
    • D
      bpf: migrate ebpf ld_abs/ld_ind tests to test_verifier · 93731ef0
      Daniel Borkmann 提交于
      Remove all eBPF tests involving LD_ABS/LD_IND from test_bpf.ko. Reason
      is that the eBPF tests from test_bpf module do not go via BPF verifier
      and therefore any instruction rewrites from verifier cannot take place.
      
      Therefore, move them into test_verifier which runs out of user space,
      so that verfier can rewrite LD_ABS/LD_IND internally in upcoming patches.
      It will have the same effect since runtime tests are also performed from
      there. This also allows to finally unexport bpf_skb_vlan_{push,pop}_proto
      and keep it internal to core kernel.
      
      Additionally, also add further cBPF LD_ABS/LD_IND test coverage into
      test_bpf.ko suite.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      93731ef0
    • D
      bpf: prefix cbpf internal helpers with bpf_ · b390134c
      Daniel Borkmann 提交于
      No change in functionality, just remove the '__' prefix and replace it
      with a 'bpf_' prefix instead. We later on add a couple of more helpers
      for cBPF and keeping the scheme with '__' is suboptimal there.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      b390134c
    • M
      xsk: statistics support · af75d9e0
      Magnus Karlsson 提交于
      In this commit, a new getsockopt is added: XDP_STATISTICS. This is
      used to obtain stats from the sockets.
      
      v2: getsockopt now returns size of stats structure.
      Signed-off-by: NMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      af75d9e0
    • M
      xsk: support for Tx · 35fcde7f
      Magnus Karlsson 提交于
      Here, Tx support is added. The user fills the Tx queue with frames to
      be sent by the kernel, and let's the kernel know using the sendmsg
      syscall.
      Signed-off-by: NMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      35fcde7f
    • M
      dev: packet: make packet_direct_xmit a common function · 865b03f2
      Magnus Karlsson 提交于
      The new dev_direct_xmit will be used by AF_XDP in later commits.
      Signed-off-by: NMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      865b03f2
    • M
      xsk: add Tx queue setup and mmap support · f6145903
      Magnus Karlsson 提交于
      Another setsockopt (XDP_TX_QUEUE) is added to let the process allocate
      a queue, where the user process can pass frames to be transmitted by
      the kernel.
      
      The mmapping of the queue is done using the XDP_PGOFF_TX_QUEUE offset.
      Signed-off-by: NMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      f6145903
    • M
      xsk: add umem completion queue support and mmap · fe230832
      Magnus Karlsson 提交于
      Here, we add another setsockopt for registered user memory (umem)
      called XDP_UMEM_COMPLETION_QUEUE. Using this socket option, the
      process can ask the kernel to allocate a queue (ring buffer) and also
      mmap it (XDP_UMEM_PGOFF_COMPLETION_QUEUE) into the process.
      
      The queue is used to explicitly pass ownership of umem frames from the
      kernel to user process. This will be used by the TX path to tell user
      space that a certain frame has been transmitted and user space can use
      it for something else, if it wishes.
      Signed-off-by: NMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      fe230832
    • B
      xsk: wire up XDP_SKB side of AF_XDP · 02671e23
      Björn Töpel 提交于
      This commit wires up the xskmap to XDP_SKB layer.
      Signed-off-by: NBjörn Töpel <bjorn.topel@intel.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      02671e23
    • B
      xsk: wire up XDP_DRV side of AF_XDP · 1b1a251c
      Björn Töpel 提交于
      This commit wires up the xskmap to XDP_DRV layer.
      Signed-off-by: NBjörn Töpel <bjorn.topel@intel.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      1b1a251c
    • B
      bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP · fbfc504a
      Björn Töpel 提交于
      The xskmap is yet another BPF map, very much inspired by
      dev/cpu/sockmap, and is a holder of AF_XDP sockets. A user application
      adds AF_XDP sockets into the map, and by using the bpf_redirect_map
      helper, an XDP program can redirect XDP frames to an AF_XDP socket.
      
      Note that a socket that is bound to certain ifindex/queue index will
      *only* accept XDP frames from that netdev/queue index. If an XDP
      program tries to redirect from a netdev/queue index other than what
      the socket is bound to, the frame will not be received on the socket.
      
      A socket can reside in multiple maps.
      
      v3: Fixed race and simplified code.
      v2: Removed one indirection in map lookup.
      Signed-off-by: NBjörn Töpel <bjorn.topel@intel.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      fbfc504a
    • B
      xsk: add Rx receive functions and poll support · c497176c
      Björn Töpel 提交于
      Here the actual receive functions of AF_XDP are implemented, that in a
      later commit, will be called from the XDP layers.
      
      There's one set of functions for the XDP_DRV side and another for
      XDP_SKB (generic).
      
      A new XDP API, xdp_return_buff, is also introduced.
      
      Adding xdp_return_buff, which is analogous to xdp_return_frame, but
      acts upon an struct xdp_buff. The API will be used by AF_XDP in future
      commits.
      
      Support for the poll syscall is also implemented.
      
      v2: xskq_validate_id did not update cons_tail.
          The entries variable was calculated twice in xskq_nb_avail.
          Squashed xdp_return_buff commit.
      Signed-off-by: NBjörn Töpel <bjorn.topel@intel.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      c497176c
    • M
      xsk: add support for bind for Rx · 965a9909
      Magnus Karlsson 提交于
      Here, the bind syscall is added. Binding an AF_XDP socket, means
      associating the socket to an umem, a netdev and a queue index. This
      can be done in two ways.
      
      The first way, creating a "socket from scratch". Create the umem using
      the XDP_UMEM_REG setsockopt and an associated fill queue with
      XDP_UMEM_FILL_QUEUE. Create the Rx queue using the XDP_RX_QUEUE
      setsockopt. Call bind passing ifindex and queue index ("channel" in
      ethtool speak).
      
      The second way to bind a socket, is simply skipping the
      umem/netdev/queue index, and passing another already setup AF_XDP
      socket. The new socket will then have the same umem/netdev/queue index
      as the parent so it will share the same umem. You must also set the
      flags field in the socket address to XDP_SHARED_UMEM.
      
      v2: Use PTR_ERR instead of passing error variable explicitly.
      Signed-off-by: NMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      965a9909
    • B
      xsk: add Rx queue setup and mmap support · b9b6b68e
      Björn Töpel 提交于
      Another setsockopt (XDP_RX_QUEUE) is added to let the process allocate
      a queue, where the kernel can pass completed Rx frames from the kernel
      to user process.
      
      The mmapping of the queue is done using the XDP_PGOFF_RX_QUEUE offset.
      Signed-off-by: NBjörn Töpel <bjorn.topel@intel.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      b9b6b68e
    • M
      xsk: add umem fill queue support and mmap · 423f3832
      Magnus Karlsson 提交于
      Here, we add another setsockopt for registered user memory (umem)
      called XDP_UMEM_FILL_QUEUE. Using this socket option, the process can
      ask the kernel to allocate a queue (ring buffer) and also mmap it
      (XDP_UMEM_PGOFF_FILL_QUEUE) into the process.
      
      The queue is used to explicitly pass ownership of umem frames from the
      user process to the kernel. These frames will in a later patch be
      filled in with Rx packet data by the kernel.
      
      v2: Fixed potential crash in xsk_mmap.
      Signed-off-by: NMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      423f3832
    • B
      xsk: add user memory registration support sockopt · c0c77d8f
      Björn Töpel 提交于
      In this commit the base structure of the AF_XDP address family is set
      up. Further, we introduce the abilty register a window of user memory
      to the kernel via the XDP_UMEM_REG setsockopt syscall. The memory
      window is viewed by an AF_XDP socket as a set of equally large
      frames. After a user memory registration all frames are "owned" by the
      user application, and not the kernel.
      
      v2: More robust checks on umem creation and unaccount on error.
          Call set_page_dirty_lock on cleanup.
          Simplified xdp_umem_reg.
      Co-authored-by: NMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: NMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: NBjörn Töpel <bjorn.topel@intel.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      c0c77d8f
    • B
      net: initial AF_XDP skeleton · 68e8b849
      Björn Töpel 提交于
      Buildable skeleton of AF_XDP without any functionality. Just what it
      takes to register a new address family.
      Signed-off-by: NBjörn Töpel <bjorn.topel@intel.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      68e8b849
    • E
      dccp: fix tasklet usage · a8d7aa17
      Eric Dumazet 提交于
      syzbot reported a crash in tasklet_action_common() caused by dccp.
      
      dccp needs to make sure socket wont disappear before tasklet handler
      has completed.
      
      This patch takes a reference on the socket when arming the tasklet,
      and moves the sock_put() from dccp_write_xmit_timer() to dccp_write_xmitlet()
      
      kernel BUG at kernel/softirq.c:514!
      invalid opcode: 0000 [#1] SMP KASAN
      Dumping ftrace buffer:
         (ftrace buffer empty)
      Modules linked in:
      CPU: 1 PID: 17 Comm: ksoftirqd/1 Not tainted 4.17.0-rc3+ #30
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:tasklet_action_common.isra.19+0x6db/0x700 kernel/softirq.c:515
      RSP: 0018:ffff8801d9b3faf8 EFLAGS: 00010246
      dccp_close: ABORT with 65423 bytes unread
      RAX: 1ffff1003b367f6b RBX: ffff8801daf1f3f0 RCX: 0000000000000000
      RDX: ffff8801cf895498 RSI: 0000000000000004 RDI: 0000000000000000
      RBP: ffff8801d9b3fc40 R08: ffffed0039f12a95 R09: ffffed0039f12a94
      dccp_close: ABORT with 65423 bytes unread
      R10: ffffed0039f12a94 R11: ffff8801cf8954a3 R12: 0000000000000000
      R13: ffff8801d9b3fc18 R14: dffffc0000000000 R15: ffff8801cf895490
      FS:  0000000000000000(0000) GS:ffff8801daf00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000001b2bc28000 CR3: 00000001a08a9000 CR4: 00000000001406e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       tasklet_action+0x1d/0x20 kernel/softirq.c:533
       __do_softirq+0x2e0/0xaf5 kernel/softirq.c:285
      dccp_close: ABORT with 65423 bytes unread
       run_ksoftirqd+0x86/0x100 kernel/softirq.c:646
       smpboot_thread_fn+0x417/0x870 kernel/smpboot.c:164
       kthread+0x345/0x410 kernel/kthread.c:238
       ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
      Code: 48 8b 85 e8 fe ff ff 48 8b 95 f0 fe ff ff e9 94 fb ff ff 48 89 95 f0 fe ff ff e8 81 53 6e 00 48 8b 95 f0 fe ff ff e9 62 fb ff ff <0f> 0b 48 89 cf 48 89 8d e8 fe ff ff e8 64 53 6e 00 48 8b 8d e8
      RIP: tasklet_action_common.isra.19+0x6db/0x700 kernel/softirq.c:515 RSP: ffff8801d9b3faf8
      
      Fixes: dc841e30 ("dccp: Extend CCID packet dequeueing interface")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Cc: Gerrit Renker <gerrit@erg.abdn.ac.uk>
      Cc: dccp@vger.kernel.org
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a8d7aa17
    • S
      smc: fix sendpage() call · bda27ff5
      Stefan Raspl 提交于
      The sendpage() call grabs the sock lock before calling the default
      implementation - which tries to grab it once again.
      Signed-off-by: NStefan Raspl <raspl@linux.ibm.com>
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com><
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bda27ff5
    • K
      net/smc: handle unregistered buffers · a6920d1d
      Karsten Graul 提交于
      When smc_wr_reg_send() fails then tag (regerr) the affected buffer and
      free it in smc_buf_unuse().
      Signed-off-by: NKarsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: NUrsula Braun <ubraun@linux.ibm.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a6920d1d
    • K
      net/smc: call consolidation · e63a5f8c
      Karsten Graul 提交于
      Consolidate the call to smc_wr_reg_send() in a new function.
      No functional changes.
      Signed-off-by: NKarsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: NUrsula Braun <ubraun@linux.ibm.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e63a5f8c
    • P
      net: bridge: Notify about !added_by_user FDB entries · 161d82de
      Petr Machata 提交于
      Do not automatically bail out on sending notifications about activity on
      non-user-added FDB entries. Instead, notify about this activity except
      for cases where the activity itself originates in a notification, to
      avoid sending duplicate notifications.
      Signed-off-by: NPetr Machata <petrm@mellanox.com>
      Acked-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Acked-by: NIvan Vecera <ivecera@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      161d82de
    • P
      switchdev: Add fdb.added_by_user to switchdev notifications · 816a3bed
      Petr Machata 提交于
      The following patch enables sending notifications also for events on FDB
      entries that weren't added by the user. Give the drivers the information
      necessary to distinguish between the two origins of FDB entries.
      
      To maintain the current behavior, have switchdev-implementing drivers
      bail out on notifications about non-user-added FDB entries. In case of
      mlxsw driver, allow a call to mlxsw_sp_span_respin() so that SPAN over
      bridge catches up with the changed FDB.
      Signed-off-by: NPetr Machata <petrm@mellanox.com>
      Reviewed-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Acked-by: NIvan Vecera <ivecera@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      816a3bed
    • N
      net: bridge: avoid duplicate notification on up/down/change netdev events · faa1cd82
      Nikolay Aleksandrov 提交于
      While handling netdevice events, br_device_event() sometimes uses
      br_stp_(disable|enable)_port which unconditionally send a notification,
      but then a second notification for the same event is sent at the end of
      the br_device_event() function. To avoid sending duplicate notifications
      in such cases, check if one has already been sent (i.e.
      br_stp_enable/disable_port have been called).
      The patch is based on a change by Satish Ashok.
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      faa1cd82
  4. 03 5月, 2018 3 次提交
    • E
      tcp: restore autocorking · 114f39fe
      Eric Dumazet 提交于
      When adding rb-tree for TCP retransmit queue, we inadvertently broke
      TCP autocorking.
      
      tcp_should_autocork() should really check if the rtx queue is not empty.
      
      Tested:
      
      Before the fix :
      $ nstat -n;./netperf -H 10.246.7.152 -Cc -- -m 500;nstat | grep AutoCork
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.152 () port 0 AF_INET
      Recv   Send    Send                          Utilization       Service Demand
      Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
      Size   Size    Size     Time     Throughput  local    remote   local   remote
      bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
      
      540000 262144    500    10.00      2682.85   2.47     1.59     3.618   2.329
      TcpExtTCPAutoCorking            33                 0.0
      
      // Same test, but forcing TCP_NODELAY
      $ nstat -n;./netperf -H 10.246.7.152 -Cc -- -D -m 500;nstat | grep AutoCork
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.152 () port 0 AF_INET : nodelay
      Recv   Send    Send                          Utilization       Service Demand
      Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
      Size   Size    Size     Time     Throughput  local    remote   local   remote
      bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
      
      540000 262144    500    10.00      1408.75   2.44     2.96     6.802   8.259
      TcpExtTCPAutoCorking            1                  0.0
      
      After the fix :
      $ nstat -n;./netperf -H 10.246.7.152 -Cc -- -m 500;nstat | grep AutoCork
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.152 () port 0 AF_INET
      Recv   Send    Send                          Utilization       Service Demand
      Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
      Size   Size    Size     Time     Throughput  local    remote   local   remote
      bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
      
      540000 262144    500    10.00      5472.46   2.45     1.43     1.761   1.027
      TcpExtTCPAutoCorking            361293             0.0
      
      // With TCP_NODELAY option
      $ nstat -n;./netperf -H 10.246.7.152 -Cc -- -D -m 500;nstat | grep AutoCork
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.152 () port 0 AF_INET : nodelay
      Recv   Send    Send                          Utilization       Service Demand
      Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
      Size   Size    Size     Time     Throughput  local    remote   local   remote
      bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
      
      540000 262144    500    10.00      5454.96   2.46     1.63     1.775   1.174
      TcpExtTCPAutoCorking            315448             0.0
      
      Fixes: 75c119af ("tcp: implement rb-tree based retransmit queue")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NMichael Wenig <mwenig@vmware.com>
      Tested-by: NMichael Wenig <mwenig@vmware.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NMichael Wenig <mwenig@vmware.com>
      Tested-by: NMichael Wenig <mwenig@vmware.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      114f39fe
    • S
      ip6_gre: correct the function name in ip6gre_tnl_addr_conflict() comment · 7ccbdff1
      Sun Lianwen 提交于
      The function name is wrong in ip6gre_tnl_addr_conflict() comment, which
      use ip6_tnl_addr_conflict instead of ip6gre_tnl_addr_conflict.
      Signed-off-by: NSun Lianwen <sunlw.fnst@cn.fujitsu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7ccbdff1
    • E
      rds: do not leak kernel memory to user land · eb80ca47
      Eric Dumazet 提交于
      syzbot/KMSAN reported an uninit-value in put_cmsg(), originating
      from rds_cmsg_recv().
      
      Simply clear the structure, since we have holes there, or since
      rx_traces might be smaller than RDS_MSG_RX_DGRAM_TRACE_MAX.
      
      BUG: KMSAN: uninit-value in copy_to_user include/linux/uaccess.h:184 [inline]
      BUG: KMSAN: uninit-value in put_cmsg+0x600/0x870 net/core/scm.c:242
      CPU: 0 PID: 4459 Comm: syz-executor582 Not tainted 4.16.0+ #87
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:17 [inline]
       dump_stack+0x185/0x1d0 lib/dump_stack.c:53
       kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
       kmsan_internal_check_memory+0x135/0x1e0 mm/kmsan/kmsan.c:1157
       kmsan_copy_to_user+0x69/0x160 mm/kmsan/kmsan.c:1199
       copy_to_user include/linux/uaccess.h:184 [inline]
       put_cmsg+0x600/0x870 net/core/scm.c:242
       rds_cmsg_recv net/rds/recv.c:570 [inline]
       rds_recvmsg+0x2db5/0x3170 net/rds/recv.c:657
       sock_recvmsg_nosec net/socket.c:803 [inline]
       sock_recvmsg+0x1d0/0x230 net/socket.c:810
       ___sys_recvmsg+0x3fb/0x810 net/socket.c:2205
       __sys_recvmsg net/socket.c:2250 [inline]
       SYSC_recvmsg+0x298/0x3c0 net/socket.c:2262
       SyS_recvmsg+0x54/0x80 net/socket.c:2257
       do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      Fixes: 3289025a ("RDS: add receive message trace used by application")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Cc: linux-rdma <linux-rdma@vger.kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eb80ca47