1. 21 Sep 2019, 1 commit
  2. 16 Sep 2019, 3 commits
  3. 14 Sep 2019, 1 commit
    • ip: support SO_MARK cmsg · c6af0c22
      Committed by Willem de Bruijn
      Enable setting skb->mark for UDP and RAW sockets using cmsg.
      
      This is analogous to existing support for TOS, TTL, txtime, etc.
      
      Packet sockets already support this as of commit c7d39e32
      ("packet: support per-packet fwmark for af_packet sendmsg").
      
      Similar to other fields, this is implemented as follows:
      1. initialize the sockcm_cookie.mark from socket option sk_mark
      2. optionally overwrite this in ip_cmsg_send/ip6_datagram_send_ctl
      3. initialize inet_cork.mark from sockcm_cookie.mark
      4. initialize each (usually just one) skb->mark from inet_cork.mark
      
      Step 1 is handled in one location for most protocols by ipcm_init_sk
      as of commit 35178206 ("ipv4: ipcm_cookie initializers").
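
      As an illustration, a minimal userspace sketch of what this
      enables (the helper below is hypothetical; only the
      SOL_SOCKET/SO_MARK cmsg itself is what this commit adds, and
      setting the mark still requires CAP_NET_ADMIN):

      #include <stdint.h>
      #include <string.h>
      #include <sys/socket.h>
      #include <netinet/in.h>

      /* Send one UDP datagram whose skb->mark is set to 42 via cmsg. */
      static ssize_t send_marked(int fd, const struct sockaddr_in *dst,
      			   const void *buf, size_t len)
      {
      	char cbuf[CMSG_SPACE(sizeof(uint32_t))] = {};
      	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
      	struct msghdr msg = {
      		.msg_name	= (void *)dst,
      		.msg_namelen	= sizeof(*dst),
      		.msg_iov	= &iov,
      		.msg_iovlen	= 1,
      		.msg_control	= cbuf,
      		.msg_controllen	= sizeof(cbuf),
      	};
      	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

      	cmsg->cmsg_level = SOL_SOCKET;
      	cmsg->cmsg_type	 = SO_MARK;
      	cmsg->cmsg_len	 = CMSG_LEN(sizeof(uint32_t));
      	memcpy(CMSG_DATA(cmsg), &(uint32_t){ 42 }, sizeof(uint32_t));

      	return sendmsg(fd, &msg, 0);
      }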
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 13 Sep 2019, 1 commit
    • netfilter: fix coding-style errors. · b0edba2a
      Committed by Jeremy Sowden
      Several header-files, Kconfig files and Makefiles have trailing
      white-space.  Remove it.
      
      In netfilter/Kconfig, indent the type of CONFIG_NETFILTER_NETLINK_ACCT
      correctly.
      
      There are semicolons at the end of two function definitions in
      include/net/netfilter/nf_conntrack_acct.h and
      include/net/netfilter/nf_conntrack_ecache.h. Remove them.
      
      Fix indentation in nf_conntrack_l4proto.h.
      Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  5. 12 Sep 2019, 2 commits
    • tcp: force a PSH flag on TSO packets · 051ba674
      Committed by Eric Dumazet
      When TCP sends a TSO packet, adding a PSH flag on it
      reduces the sojourn time of the aggregated packet in GRO receivers.
      
      This is particularly the case under pressure, since RX queues
      receive packets for many concurrent flows.
      
      A sender can give GRO engines a hint about when it is
      appropriate to flush a super-packet, especially when pacing
      is in the picture, since the next packet is probably delayed
      by one ms.
      
      Having fewer packets in the GRO engine reduces the chance
      of LRU eviction or inflated RTT, and reduces GRO cost.
      
      We found recently that we must not set the PSH flag on
      individual full-size MSS segments [1]:
      
       Under pressure (CWR state), it is better to let the packet
       sit for a small delay (depending on NAPI logic) so that the
       ACK packet is delayed, and thus the next packet we send is
       also delayed a bit. Eventually the bottleneck queue can
       be drained. DCTCP flows with CWND=1 have demonstrated
       the issue.
      
      This patch slows down the aggregate traffic without
      involving high-resolution timers on senders and/or
      receivers.
      
      It has been used at Google for about four years,
      and has been discussed at various networking conferences.
      
      [1] segments smaller than MSS already have PSH flag set
          by tcp_sendmsg() / tcp_mark_push(), unless MSG_MORE
          has been requested by the user.
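
      A sketch of the mechanism (based on this description; the exact
      diff may differ): when TSO segment counts are initialized, any
      skb spanning more than one MSS gets the PSH flag:

      static void tcp_set_skb_tso_segs(struct sk_buff *skb, unsigned int mss_now)
      {
      	if (skb->len <= mss_now) {
      		/* Single-MSS packet: leave flags untouched, see [1]. */
      		tcp_skb_pcount_set(skb, 1);
      		TCP_SKB_CB(skb)->tcp_gso_size = 0;
      	} else {
      		/* TSO super-packet: hint GRO receivers to flush it. */
      		TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_PSH;
      		tcp_skb_pcount_set(skb, DIV_ROUND_UP(skb->len, mss_now));
      		TCP_SKB_CB(skb)->tcp_gso_size = mss_now;
      	}
      }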
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Tariq Toukan <tariqt@mellanox.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: fix tcp_ecn_withdraw_cwr() to clear TCP_ECN_QUEUE_CWR · af38d07e
      Committed by Neal Cardwell
      Fix tcp_ecn_withdraw_cwr() to clear the correct bit:
      TCP_ECN_QUEUE_CWR.
      
      Rationale: basically, TCP_ECN_DEMAND_CWR is a bit that is purely about
      the behavior of data receivers, and deciding whether to reflect
      incoming IP ECN CE marks as outgoing TCP th->ece marks. The
      TCP_ECN_QUEUE_CWR bit is purely about the behavior of data senders,
      and deciding whether to send CWR. The tcp_ecn_withdraw_cwr() function
      is only called from tcp_undo_cwnd_reduction() by data senders during
      an undo, so it should zero the sender-side state,
      TCP_ECN_QUEUE_CWR. It does not make sense to stop the reflection of
      incoming CE bits on incoming data packets just because outgoing
      packets were spuriously retransmitted.
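
      A sketch of the resulting helper (per the description above):

      static void tcp_ecn_withdraw_cwr(struct tcp_sock *tp)
      {
      	/* Clear the sender-side bit; previously this cleared the
      	 * receiver-side TCP_ECN_DEMAND_CWR bit by mistake.
      	 */
      	tp->ecn_flags &= ~TCP_ECN_QUEUE_CWR;
      }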
      
      The bug has been reproduced with packetdrill to manifest in a scenario
      with RFC3168 ECN, with an incoming data packet with CE bit set and
      carrying a TCP timestamp value that causes cwnd undo. Before this fix,
      the IP CE bit was ignored and not reflected in the TCP ECE header bit,
      and the sender sent a TCP CWR ('W') bit on the next outgoing data packet,
      even though the cwnd reduction had been undone.  After this fix, the
      sender properly reflects the CE bit and does not set the W bit.
      
      Note: the bug actually predates 2005 git history; this Fixes footer is
      chosen to be the oldest SHA1 I have tested (from Sep 2007) for which
      the patch applies cleanly (since before this commit the code was in a
      .h file).
      
      Fixes: bdf1ee5d ("[TCP]: Move code from tcp_ecn.h to tcp*.c and tcp.h & remove it")
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 07 Sep 2019, 2 commits
  7. 05 Sep 2019, 1 commit
    • net: Properly update v4 routes with v6 nexthop · 7bdf4de1
      Committed by Donald Sharp
      When creating a v4 route that uses a v6 nexthop from a nexthop
      group, allow the kernel to properly send the nexthop as v6 via the
      RTA_VIA attribute.
      
      Broken behavior:
      
      $ ip nexthop add via fe80::9 dev eth0
      $ ip nexthop show
      id 1 via fe80::9 dev eth0 scope link
      $ ip route add 4.5.6.7/32 nhid 1
      $ ip route show
      default via 10.0.2.2 dev eth0
      4.5.6.7 nhid 1 via 254.128.0.0 dev eth0
      10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15
      $
      
      Fixed behavior:
      
      $ ip nexthop add via fe80::9 dev eth0
      $ ip nexthop show
      id 1 via fe80::9 dev eth0 scope link
      $ ip route add 4.5.6.7/32 nhid 1
      $ ip route show
      default via 10.0.2.2 dev eth0
      4.5.6.7 nhid 1 via inet6 fe80::9 dev eth0
      10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15
      $
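
      For illustration, a hedged sketch of emitting a v6 gateway as
      RTA_VIA (the helper name is hypothetical; struct rtvia is the
      uapi type carried by RTA_VIA):

      static int put_via_inet6(struct sk_buff *skb, const struct in6_addr *gw)
      {
      	struct nlattr *nla;
      	struct rtvia *via;

      	/* RTA_VIA carries a family tag plus raw address bytes, so a
      	 * 16-byte IPv6 gateway fits, unlike 4-byte RTA_GATEWAY.
      	 */
      	nla = nla_reserve(skb, RTA_VIA, sizeof(*via) + sizeof(*gw));
      	if (!nla)
      		return -EMSGSIZE;
      	via = nla_data(nla);
      	via->rtvia_family = AF_INET6;
      	memcpy(via->rtvia_addr, gw, sizeof(*gw));
      	return 0;
      }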
      
      v2, v3: Addresses code review comments from David Ahern
      
      Fixes: dcb1ecb5 ("ipv4: Prepare for fib6_nh from a nexthop object")
      Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 01 Sep 2019, 1 commit
  9. 31 Aug 2019, 1 commit
  10. 29 Aug 2019, 1 commit
  11. 28 Aug 2019, 1 commit
  12. 25 Aug 2019, 2 commits
    • net: route dump netlink NLM_F_MULTI flag missing · e93fb3e9
      Committed by John Fastabend
      An excerpt from netlink(7) man page,
      
        In multipart messages (multiple nlmsghdr headers with associated payload
        in one byte stream) the first and all following headers have the
        NLM_F_MULTI flag set, except for the last  header  which  has the type
        NLMSG_DONE.
      
      but, after commit ee28906f ("ipv4: Dump route exceptions if
      requested"), the NLM_F_MULTI flag is missing in the middle of a FIB
      dump. The result is that user space applications following the man
      page excerpt above may get confused and may stop parsing the
      message, believing something went wrong.
      
      In the golang netlink lib [0], the library logic stops parsing,
      believing the message is not a multipart message. Found this
      running Cilium [1] against net-next while adding a feature to
      auto-detect routes. I noticed that with multiple route tables we
      could no longer detect the default routes on net tree kernels,
      because the library logic was not returning them.
      
      Fix this by handling the fib_dump_info_fnhe() case the same way the
      fib_dump_info() handles it by passing the flags argument through the
      call chain and adding a flags argument to rt_fill_info().
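
      A hedged sketch of the shape of that change (signature heavily
      simplified; the real rt_fill_info() takes many more arguments):

      static int rt_fill_info(struct net *net, struct sk_buff *skb,
      			u32 portid, u32 seq, unsigned int flags)
      {
      	struct nlmsghdr *nlh;

      	/* Dump callers now pass NLM_F_MULTI in flags, so every header
      	 * of the multipart stream carries it, as netlink(7) requires.
      	 */
      	nlh = nlmsg_put(skb, portid, seq, RTM_NEWROUTE,
      			sizeof(struct rtmsg), flags);
      	if (!nlh)
      		return -EMSGSIZE;
      	/* ... fill the rtmsg and route attributes as before ... */
      	nlmsg_end(skb, nlh);
      	return 0;
      }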
      
      Tested with the Cilium stack, and auto-detection of routes works
      again. Also annotated the libs to dump netlink msgs and inspected
      them; the NLM_F_MULTI and NLMSG_DONE flags look correct after this.
      
      Note: in inet_rtm_getroute(), pass rt_fill_info() '0' for flags, the
      same as is done for fib_dump_info(), so this looks correct to me.
      
      [0] https://github.com/vishvananda/netlink/
      [1] https://github.com/cilium/
      
      Fixes: ee28906f ("ipv4: Dump route exceptions if requested")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4/icmp: fix rt dst dev null pointer dereference · e2c69393
      Committed by Hangbin Liu
      In __icmp_send() there is a possibility that rt->dst.dev is NULL,
      e.g. with tunnel collect_md mode, which will cause a kernel crash.
      Here is what the code path looks like, for GRE:
      
      - ip6gre_tunnel_xmit
        - ip6gre_xmit_ipv4
          - __gre6_xmit
            - ip6_tnl_xmit
              - if skb->len - t->tun_hlen - eth_hlen > mtu; return -EMSGSIZE
          - icmp_send
            - net = dev_net(rt->dst.dev); <-- here
      
      The reason is that __metadata_dst_init() initializes dst->dev to
      NULL by default. We cannot fix it in __metadata_dst_init(), as no
      dev is supplied there. On the other hand, the only reason we need
      rt->dst.dev is to get the net. So we can just try to get it from
      skb->dev when rt->dst.dev is NULL.
      
      v4: Julian Anastasov reminded me that skb->dev could also be NULL.
      We'd better still use dst.dev and add a check to avoid the crash.
      
      v3: No changes.
      
      v2: fix the issue in __icmp_send() instead of updating shared dst dev
      in {ip_md, ip6}_tunnel_xmit.
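
      A sketch of the resulting check in __icmp_send() (per the
      description above; the exact code may differ):

      	/* rt->dst.dev may be NULL for metadata dsts, and skb_in->dev
      	 * may be NULL too, so check both before deriving the net.
      	 */
      	if (rt->dst.dev)
      		net = dev_net(rt->dst.dev);
      	else if (skb_in->dev)
      		net = dev_net(skb_in->dev);
      	else
      		goto out;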
      
      Fixes: c8b34e68 ("ip_tunnel: Add tnl_update_pmtu in ip_md_tunnel_xmit")
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Reviewed-by: Julian Anastasov <ja@ssi.bg>
      Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 23 Aug 2019, 1 commit
  14. 22 Aug 2019, 1 commit
  15. 21 Aug 2019, 1 commit
  16. 20 Aug 2019, 1 commit
  17. 10 Aug 2019, 3 commits
  18. 09 Aug 2019, 2 commits
    • net/tls: prevent skb_orphan() from leaking TLS plain text with offload · 41477662
      Committed by Jakub Kicinski
      sk_validate_xmit_skb() and drivers depend on the sk member of
      struct sk_buff to identify segments requiring encryption.
      Any operation which removes or does not preserve the original TLS
      socket such as skb_orphan() or skb_clone() will cause clear text
      leaks.
      
      Make the TCP socket underlying an offloaded TLS connection
      mark all skbs as decrypted, if TLS TX is in offload mode.
      Then in sk_validate_xmit_skb() catch skbs which have no socket
      (or a socket with no validation) and decrypted flag set.
      
      Note that CONFIG_SOCK_VALIDATE_XMIT, CONFIG_TLS_DEVICE and
      sk->sk_validate_xmit_skb are slightly interchangeable right now,
      they all imply TLS offload. The new checks are guarded by
      CONFIG_TLS_DEVICE because that's the option guarding the
      sk_buff->decrypted member.
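
      A sketch of the resulting validation (simplified; details of the
      real sk_validate_xmit_skb() helper may differ):

      static inline struct sk_buff *sk_validate_xmit_skb(struct sk_buff *skb,
      						   struct net_device *dev)
      {
      #ifdef CONFIG_SOCK_VALIDATE_XMIT
      	struct sock *sk = skb->sk;

      	if (sk && sk_fullsock(sk) && sk->sk_validate_xmit_skb) {
      		skb = sk->sk_validate_xmit_skb(sk, dev, skb);
      #ifdef CONFIG_TLS_DEVICE
      	} else if (unlikely(skb->decrypted)) {
      		/* Plain text with no TLS socket to encrypt it: drop. */
      		kfree_skb(skb);
      		skb = NULL;
      #endif
      	}
      #endif
      	return skb;
      }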
      
      A second, smaller issue with orphaning is that it breaks
      the guarantee that packets will be delivered to device
      queues in-order. All TLS offload drivers depend on that
      scheduling property. This means skb_orphan_partial()'s
      trick of preserving partial socket references will cause
      issues in the drivers. We need a full orphan, and as a
      result netem delay/throttling will cause all TLS offload
      skbs to be dropped.
      
      Reusing the sk_buff->decrypted flag also protects from leaking
      clear text when an incoming, decrypted skb is redirected (e.g.
      by TC).
      
      See commit 0608c69c ("bpf: sk_msg, sock{map|hash} redirect
      through ULP") for justification why the internal flag is safe.
      The only location which could leak the flag in is tcp_bpf_sendmsg(),
      which is taken care of by clearing the previously unused bit.
      
      v2:
       - remove superfluous decrypted mark copy (Willem);
       - remove the stale doc entry (Boris);
       - rely entirely on EOR marking to prevent coalescing (Boris);
       - use an internal sendpages flag instead of marking the socket
         (Boris).
      v3 (Willem):
       - reorganize the can_skb_orphan_partial() condition;
       - fix the flag leak-in through tcp_bpf_sendmsg.
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: frags: re-introduce skb coalescing for local delivery · 891584f4
      Committed by Guillaume Nault
      Before commit d4289fcc ("net: IP6 defrag: use rbtrees for IPv6
      defrag"), a netperf UDP_STREAM test[0] using big IPv6 datagrams (thus
      generating many fragments) and running over an IPsec tunnel, reported
      more than 6Gbps throughput. After that patch, the same test gets only
      9Mbps when receiving on a be2net nic (driver can make a big difference
      here, for example, ixgbe doesn't seem to be affected).
      
      By reusing the IPv4 defragmentation code, IPv6 lost fragment coalescing
      (IPv4 fragment coalescing was dropped by commit 14fe22e3 ("Revert
      "ipv4: use skb coalescing in defragmentation"")).
      
      Without fragment coalescing, be2net runs out of Rx ring entries and
      starts to drop frames (ethtool reports rx_drops_no_frags errors). Since
      the netperf traffic is only composed of UDP fragments, any lost packet
      prevents reassembly of the full datagram. Therefore, fragments which
      have no possibility to ever get reassembled pile up in the reassembly
      queue, until the memory accounting exceeds the threshold. At that point
      no fragment is accepted anymore, which effectively discards all
      netperf traffic.
      
      When the reassembly timeout expires, some stale fragments are
      removed from the reassembly queue, so a few packets can be
      received, reassembled and delivered to the netperf receiver. But
      the nic still drops frames and soon the reassembly queue gets
      filled again with stale fragments. These long periods during which
      no datagram can be received explain why the performance drop is
      so significant.
      
      Re-introducing fragment coalescing is enough to restore the
      initial performance (6.6Gbps with be2net): the driver doesn't drop
      frames anymore (no more rx_drops_no_frags errors) and the
      reassembly engine works at full speed.
      
      This patch is quite conservative and only coalesces skbs for local
      IPv4 and IPv6 delivery (in order to avoid changing skb geometry when
      forwarding). Coalescing could be extended in the future if need be, as
      more scenarios would probably benefit from it.
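
      For illustration, a hedged sketch of per-fragment coalescing at
      reassembly time, built on the existing skb_try_coalesce() helper
      (the wrapper below is hypothetical):

      static void reasm_coalesce(struct sk_buff *head, struct sk_buff *fp)
      {
      	bool stolen;
      	int delta;

      	/* Merge the fragment's data into the head skb so the final
      	 * datagram occupies far fewer skbs (and Rx ring entries are
      	 * released much sooner).
      	 */
      	if (skb_try_coalesce(head, fp, &stolen, &delta))
      		kfree_skb_partial(fp, stolen);
      	/* otherwise, chain fp on the frag_list as before */
      }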
      
      [0]: Test configuration
      Sender:
      ip xfrm policy flush
      ip xfrm state flush
      ip xfrm state add src fc00:1::1 dst fc00:2::1 proto esp spi 0x1000 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:1::1 dst fc00:2::1
      ip xfrm policy add src fc00:1::1 dst fc00:2::1 dir in tmpl src fc00:1::1 dst fc00:2::1 proto esp mode transport action allow
      ip xfrm state add src fc00:2::1 dst fc00:1::1 proto esp spi 0x1001 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:2::1 dst fc00:1::1
      ip xfrm policy add src fc00:2::1 dst fc00:1::1 dir out tmpl src fc00:2::1 dst fc00:1::1 proto esp mode transport action allow
      netserver -D -L fc00:2::1
      
      Receiver:
      ip xfrm policy flush
      ip xfrm state flush
      ip xfrm state add src fc00:2::1 dst fc00:1::1 proto esp spi 0x1001 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:2::1 dst fc00:1::1
      ip xfrm policy add src fc00:2::1 dst fc00:1::1 dir in tmpl src fc00:2::1 dst fc00:1::1 proto esp mode transport action allow
      ip xfrm state add src fc00:1::1 dst fc00:2::1 proto esp spi 0x1000 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:1::1 dst fc00:2::1
      ip xfrm policy add src fc00:1::1 dst fc00:2::1 dir out tmpl src fc00:1::1 dst fc00:2::1 proto esp mode transport action allow
      netperf -H fc00:2::1 -f k -P 0 -L fc00:1::1 -l 60 -t UDP_STREAM -I 99,5 -i 5,5 -T5,5 -6
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Acked-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. 04 Aug 2019, 1 commit
  20. 31 Jul 2019, 3 commits
  21. 26 Jul 2019, 1 commit
  22. 25 Jul 2019, 1 commit
  23. 23 Jul 2019, 1 commit
  24. 22 Jul 2019, 2 commits
    • bpf: sockmap/tls, close can race with map free · 95fa1454
      Committed by John Fastabend
      When a map free is called and, in parallel, a socket is closed, we
      have two paths that can potentially reset the socket prot ops: the
      bpf close() path and the map free path. This creates a problem as
      to which prot ops should be used from the socket close side.
      
      If the map_free side completes first, then we want to call the
      original lowest-level ops. However, if the tls path runs first, we
      want to call the sockmap ops. Additionally, there was no locking
      around prot updates in the TLS code paths, so the prot ops could
      be changed multiple times, once from the TLS path and again from
      the sockmap side, potentially leaving the ops pointed at either
      TLS or sockmap after the psock and/or tls context had already been
      destroyed.
      
      To fix this race, first only update the ops inside the callback
      lock, so that TLS, sockmap and the lowest level all agree on the
      prot state. Second, add a ULP callback update() so that lower
      layers can inform the upper layer when they are being removed,
      allowing the upper layer to reset its prot ops.
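
      A hedged sketch of the new callback (shape per the description;
      the surrounding struct members are elided):

      struct tcp_ulp_ops {
      	int	(*init)(struct sock *sk);
      	void	(*release)(struct sock *sk);
      	/* New: a lower layer being torn down (e.g. sockmap on map
      	 * free) hands the ULP above it the prot ops to fall back to.
      	 */
      	void	(*update)(struct sock *sk, struct proto *p);
      	/* ... */
      };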
      
      This gets us close to allowing sockmap and tls to be stacked
      in arbitrary order, but that patch is saved for the *next trees.
      
      v4:
       - make sure we don't free things for device;
       - remove the checks which swap the callbacks back
         only if TLS is at the top.
      
      Reported-by: syzbot+06537213db7ba2745c4a@syzkaller.appspotmail.com
      Fixes: 02c558b2 ("bpf: sockmap, support for msg_peek in sk_msg with redirect ingress")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • tcp: be more careful in tcp_fragment() · b617158d
      Committed by Eric Dumazet
      Some applications set tiny SO_SNDBUF values and expect
      TCP to just work. Recent patches to address CVE-2019-11478
      broke them in case of losses, since retransmits might
      be prevented.
      
      We should allow these flows to make progress.
      
      This patch allows the first and last skb in retransmit queue
      to be split even if memory limits are hit.
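
      A sketch of the relaxed check in tcp_fragment() (per the
      description; the exact condition may differ):

      	limit = sk->sk_sndbuf + 2 * SKB_TRUESIZE(GSO_MAX_SIZE);
      	/* Still refuse to fragment when far over the limit, but
      	 * exempt the head and tail of the retransmit queue so
      	 * tiny-SO_SNDBUF flows can keep making progress.
      	 */
      	if (unlikely((sk->sk_wmem_queued >> 1) > limit &&
      		     tcp_queue != TCP_FRAG_IN_WRITE_QUEUE &&
      		     skb != tcp_rtx_queue_head(sk) &&
      		     skb != tcp_rtx_queue_tail(sk))) {
      		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG);
      		return -ENOMEM;
      	}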
      
      It also adds some headroom, due to the fact that tcp_sendmsg()
      and tcp_sendpage() might overshoot sk_wmem_queued by about one full
      TSO skb (64KB size). Note this allowance was already present
      in stable backports for kernels < 4.15.
      
      Note for < 4.15 backports:
       tcp_rtx_queue_tail() will probably look like:
      
      static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk)
      {
      	struct sk_buff *skb = tcp_send_head(sk);
      
      	return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk);
      }
      
      Fixes: f070ef2a ("tcp: tcp_fragment() should apply sane memory limits")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Andrew Prout <aprout@ll.mit.edu>
      Tested-by: Andrew Prout <aprout@ll.mit.edu>
      Tested-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Tested-by: Michal Kubecek <mkubecek@suse.cz>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Christoph Paasch <cpaasch@apple.com>
      Cc: Jonathan Looney <jtl@netflix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  25. 19 Jul 2019, 3 commits
  26. 18 Jul 2019, 1 commit
    • fib: relax source validation check for loopback packets · 66f82095
      Committed by Cong Wang
      In a rare case where we redirect local packets from veth to lo,
      these packets fail to pass the source validation when rp_filter
      is turned on, as the tracing shows:
      
        <...>-311708 [040] ..s1 7951180.957825: fib_table_lookup: table 254 oif 0 iif 1 src 10.53.180.130 dst 10.53.180.130 tos 0 scope 0 flags 0
        <...>-311708 [040] ..s1 7951180.957826: fib_table_lookup_nh: nexthop dev eth0 oif 4 src 10.53.180.130
      
      So, the fib table lookup returns eth0 as the nexthop, even though
      the packets are local and should nonetheless be routed to
      loopback; without this patch, they can't pass the dev match check
      in fib_info_nh_uses_dev().
      
      It should be safe to relax this check for this special case, as
      normally packets coming out of the loopback device still have
      skb_dst, so they won't even hit this slow path.
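
      A sketch of the relaxation (per the description; placement inside
      __fib_validate_source() may differ):

      	dev_match = fib_info_nh_uses_dev(res.fi, dev);
      	/* Also accept lo for local routes, so locally generated
      	 * packets redirected to loopback pass rp_filter.
      	 */
      	if (!dev_match && res.type == RTN_LOCAL && dev == net->loopback_dev)
      		dev_match = true;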
      
      Cc: Julian Anastasov <ja@ssi.bg>
      Cc: David Ahern <dsahern@gmail.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  27. 16 Jul 2019, 1 commit