1. 10 Nov 2020 (4 commits)
  2. 08 Nov 2020 (1 commit)
  3. 07 Nov 2020 (13 commits)
  4. 05 Nov 2020 (1 commit)
    • tcp: propagate MPTCP skb extensions on xmit splits · 5a369ca6
      Committed by Paolo Abeni
      When the TCP stack splits a packet on the write queue, the tail
      half currently loses the associated skb extensions, and will not
      carry the DSM on the wire.
      
      The above does not cause functional problems and is allowed by
      the RFC, but it interacts badly with GRO and RX coalescing, as
      possible candidates for aggregation will carry different TCP options.
      
      This change tries to improve the MPTCP behavior by propagating the
      skb extensions on split.
      
      Additionally, we must prevent the MPTCP stack from updating the
      mapping after the split occurs: that would both violate the RFC and
      fool the reader.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      5a369ca6
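      A minimal sketch of the idea (not the exact upstream diff; the helper name
      and the skb_ext_copy() call are assumptions for illustration): when the
      write-queue split creates the tail half, copy the MPTCP extension over so
      the DSM still reaches the wire.

          /* Illustrative sketch: propagate the MPTCP extension when the TCP
           * split carves @skb into @skb + @buff.  Helper name and
           * skb_ext_copy() are assumed here, not confirmed by this log.
           */
          static inline void mptcp_skb_ext_copy(struct sk_buff *to,
                                                const struct sk_buff *from)
          {
                  if (skb_ext_exist(from, SKB_EXT_MPTCP))
                          skb_ext_copy(to, from);
          }

          /* ...called from the split path once the new tail skb is set up: */
          mptcp_skb_ext_copy(buff, skb);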
  5. 03 Nov 2020 (1 commit)
  6. 01 Nov 2020 (2 commits)
  7. 31 Oct 2020 (4 commits)
  8. 30 Oct 2020 (1 commit)
    • netfilter: use actual socket sk rather than skb sk when routing harder · 46d6c5ae
      Committed by Jason A. Donenfeld
      If netfilter changes the packet mark when mangling, the packet is
      rerouted using the route_me_harder set of functions. Prior to this
      commit, there's one big difference between route_me_harder and the
      ordinary initial routing functions, described in the comment above
      __ip_queue_xmit():
      
         /* Note: skb->sk can be different from sk, in case of tunnels */
         int __ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl,
      
      That function goes on to correctly make use of sk->sk_bound_dev_if,
      rather than skb->sk->sk_bound_dev_if. And indeed the comment is true: a
      tunnel will receive a packet in ndo_start_xmit with an initial skb->sk.
      It will make some transformations to that packet, and then it will send
      the encapsulated packet out of a *new* socket. That new socket will
      basically always have a different sk_bound_dev_if (otherwise there'd be
      a routing loop). So for the purposes of routing the encapsulated packet,
      the routing information as it pertains to the socket should come from
      that socket's sk, rather than the packet's original skb->sk. For that
      reason __ip_queue_xmit() and related functions all do the right thing.
      
      One might argue that all tunnels should just call skb_orphan(skb) before
      transmitting the encapsulated packet into the new socket. But tunnels do
      *not* do this -- and this is wisely avoided in skb_scrub_packet() too --
      because features like TSQ rely on skb->destructor() being called when
      that buffer space is truly available again. Calling skb_orphan(skb) too
      early would result in buffers filling up unnecessarily and accounting
      info being all wrong. Instead, additional routing must take into account
      the new sk, just as __ip_queue_xmit() notes.
      
      So, this commit addresses the problem by fishing the correct sk out of
      state->sk -- it's already set properly in the call to nf_hook() in
      __ip_local_out(), which receives the sk as part of its normal
      functionality. So we make sure to plumb state->sk through the various
      route_me_harder functions, and then make correct use of it following the
      example of __ip_queue_xmit().
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Reviewed-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      46d6c5ae
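      A hedged sketch of the shape of the fix (signatures simplified, not the
      literal upstream patch): give the reroute helper an explicit socket
      argument and fill it from the hook state, so routing uses the
      encapsulating socket's sk_bound_dev_if rather than skb->sk's.

          /* Illustrative: the reroute helper takes the socket explicitly... */
          int ip_route_me_harder(struct net *net, struct sock *sk,
                                 struct sk_buff *skb, unsigned int addr_type);

          /* ...and callers pass the sk that nf_hook() recorded in the hook
           * state (set in __ip_local_out()), not skb->sk:
           */
          err = ip_route_me_harder(state->net, state->sk, skb, RTN_UNSPEC);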
  9. 24 Oct 2020 (1 commit)
  10. 23 Oct 2020 (1 commit)
  11. 20 Oct 2020 (1 commit)
    • nexthop: Fix performance regression in nexthop deletion · df6afe2f
      Committed by Ido Schimmel
      While insertion of 16k nexthops all using the same netdev ('dummy10')
      takes less than a second, deletion takes about 130 seconds:
      
      # time -p ip -b nexthop.batch
      real 0.29
      user 0.01
      sys 0.15
      
      # time -p ip link set dev dummy10 down
      real 131.03
      user 0.06
      sys 0.52
      
      This is because of repeated calls to synchronize_rcu() whenever a
      nexthop is removed from a nexthop group:
      
      # /usr/share/bcc/tools/offcputime -p `pgrep -nx ip` -K
      ...
          b'finish_task_switch'
          b'schedule'
          b'schedule_timeout'
          b'wait_for_completion'
          b'__wait_rcu_gp'
          b'synchronize_rcu.part.0'
          b'synchronize_rcu'
          b'__remove_nexthop'
          b'remove_nexthop'
          b'nexthop_flush_dev'
          b'nh_netdev_event'
          b'raw_notifier_call_chain'
          b'call_netdevice_notifiers_info'
          b'__dev_notify_flags'
          b'dev_change_flags'
          b'do_setlink'
          b'__rtnl_newlink'
          b'rtnl_newlink'
          b'rtnetlink_rcv_msg'
          b'netlink_rcv_skb'
          b'rtnetlink_rcv'
          b'netlink_unicast'
          b'netlink_sendmsg'
          b'____sys_sendmsg'
          b'___sys_sendmsg'
          b'__sys_sendmsg'
          b'__x64_sys_sendmsg'
          b'do_syscall_64'
          b'entry_SYSCALL_64_after_hwframe'
          -                ip (277)
              126554955
      
      Since nexthops are always deleted under RTNL, synchronize_net() can be
      used instead. It will call synchronize_rcu_expedited() which only blocks
      for several microseconds as opposed to multiple milliseconds like
      synchronize_rcu().
      
      With this patch deletion of 16k nexthops takes less than a second:
      
      # time -p ip link set dev dummy10 down
      real 0.12
      user 0.00
      sys 0.04
      
      Tested with fib_nexthops.sh which includes torture tests that prompted
      the initial change:
      
      # ./fib_nexthops.sh
      ...
      Tests passed: 134
      Tests failed:   0
      
      Fixes: 90f33bff ("nexthops: don't modify published nexthop groups")
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Link: https://lore.kernel.org/r/20201016172914.643282-1-idosch@idosch.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      df6afe2f
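      The essence of the change, sketched (not the literal diff): since the
      removal path runs under RTNL, the grace-period wait can use
      synchronize_net(), which upgrades to the expedited variant when RTNL is
      held.

          /* Before: one full RCU grace period per nexthop removed from a group */
          synchronize_rcu();

          /* After: under RTNL, synchronize_net() calls synchronize_rcu_expedited(),
           * so each removal waits microseconds instead of milliseconds.
           */
          synchronize_net();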
  12. 17 Oct 2020 (1 commit)
  13. 15 Oct 2020 (1 commit)
    • ipv4/icmp: l3mdev: Perform icmp error route lookup on source device routing table (v2) · e1e84eb5
      Committed by Mathieu Desnoyers
      As per RFC792, ICMP errors should be sent to the source host.
      
      However, in configurations with Virtual Routing and Forwarding tables,
      looking up which routing table to use is currently done by using the
      destination net_device.
      
      commit 9d1a6c4e ("net: icmp_route_lookup should use rt dev to
      determine L3 domain") changes the interface passed to
      l3mdev_master_ifindex() and inet_addr_type_dev_table() from skb_in->dev
      to skb_dst(skb_in)->dev. This effectively uses the destination device
      rather than the source device for choosing which routing table should be
      used to lookup where to send the ICMP error.
      
      Therefore, if the source and destination interfaces are within separate
      VRFs, or one in the global routing table and the other in a VRF, looking
      up the source host in the destination interface's routing table will
      fail if the destination interface's routing table contains no route to
      the source host.
      
      One observable effect of this issue is that traceroute does not work in
      the following cases:
      
      - Route leaking between global routing table and VRF
      - Route leaking between VRFs
      
      Preferably use the source device's routing table when sending ICMP error
      messages. If no source device is set, fall back to the destination
      device's routing table. Otherwise, use the main routing table (index 0).
      
      [ It has been pointed out that a similar issue may exist with ICMP
        errors triggered when forwarding between network namespaces. It would
        be worthwhile to investigate, but is outside of the scope of this
        investigation. ]
      
      [ It has also been pointed out that a similar issue exists with
        unreachable / fragmentation needed messages, which can be triggered by
        changing the MTU of eth1 in r1 to 1400 and running:
      
        ip netns exec h1 ping -s 1450 -Mdo -c1 172.16.2.2
      
        Some investigation points to raw_icmp_error() and raw_err() as being
        involved in this last scenario. The focus of this patch is TTL expired
        ICMP messages, which go through icmp_route_lookup.
        Investigation of failure modes related to raw_icmp_error() is beyond
        this investigation's scope. ]
      
      Fixes: 9d1a6c4e ("net: icmp_route_lookup should use rt dev to determine L3 domain")
      Link: https://tools.ietf.org/html/rfc792
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      e1e84eb5
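      A hedged sketch of the selection order described above (the helper name is
      illustrative): prefer the source device, fall back to the destination
      device, and leave NULL to mean the main table.

          /* Illustrative: pick the device whose L3 domain / routing table is
           * used to route the ICMP error back towards the source host.
           */
          static struct net_device *icmp_route_lookup_dev(const struct sk_buff *skb_in)
          {
                  if (skb_in->dev)
                          return skb_in->dev;              /* source device */
                  if (skb_dst(skb_in))
                          return skb_dst(skb_in)->dev;     /* destination device */
                  return NULL;                             /* main table (index 0) */
          }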
  14. 14 Oct 2020 (4 commits)
  15. 11 Oct 2020 (1 commit)
  16. 09 Oct 2020 (1 commit)
    • xfrm: interface: fix the priorities for ipip and ipv6 tunnels · 7fe94612
      Committed by Xin Long
      As Nicolas noticed in his setup, when the xfrm_interface module is
      installed, the standard IP tunnels break when receiving packets.
      
      This is caused by the xfrm interface's IP tunnel handlers having a higher
      priority: they process incoming packets via xfrm_input(), which drops the
      packets and returns 0 when anything goes wrong.
      
      Rather than changing xfrm_input(), this patch adjusts the priority of the
      IP tunnel handlers in the xfrm interface, so that packets reach xfrmi's
      handlers after the others', as the other handlers do not drop packets
      they cannot process.
      
      Note that IPCOMP also defines its own IPIP tunnel handler and it calls
      xfrm_input() as well, so we must make its priority lower than xfrmi's,
      which means having xfrmi loaded would still break IPCOMP. We may seek
      another way to fix it in xfrm_input() in the future.
      Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Tested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Fixes: da9bbf05 ("xfrm: interface: support IPIP and IPIP6 tunnels processing with .cb_handler")
      Fixes: d7b360c2 ("xfrm: interface: support IP6IP6 and IP6IP tunnels processing with .cb_handler")
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      7fe94612
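      A sketch of what the priority adjustment looks like (the priority value
      and the field layout follow the upstream style but are illustrative, not
      the exact upstream numbers): register xfrmi's IPIP handler below the
      plain IP tunnel and IPCOMP handlers so those get the packet first.

          static struct xfrm_tunnel xfrmi_ipip_handler __read_mostly = {
                  .handler     = xfrmi4_rcv_tunnel,
                  .cb_handler  = xfrmi_rcv_cb,
                  .err_handler = xfrmi4_err,
                  .priority    = -2,   /* illustrative: below ipip and ipcomp */
          };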
  17. 06 Oct 2020 (2 commits)
    • ipv4: use dev_sw_netstats_rx_add() · 560b50cf
      Committed by Fabian Frederick
      Use the new helper for updating the device's software netstats.
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      560b50cf
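      For context, a sketch of what the helper replaces in receive paths (the
      hand-written form is shown for illustration):

          /* Open-coded per-CPU stats update that the helper consolidates: */
          struct pcpu_sw_netstats *tstats = this_cpu_ptr(dev->tstats);

          u64_stats_update_begin(&tstats->syncp);
          tstats->rx_packets++;
          tstats->rx_bytes += skb->len;
          u64_stats_update_end(&tstats->syncp);

          /* With the new helper this collapses to a single call: */
          dev_sw_netstats_rx_add(dev, skb->len);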
    • tcp: fix receive window update in tcp_add_backlog() · 86bccd03
      Committed by Eric Dumazet
      We got reports from GKE customers of flows being reset by netfilter
      conntrack unless nf_conntrack_tcp_be_liberal is set to 1.
      
      Traces seemed to suggest that ACK packets were dropped by the
      packet capture, or more likely that ACKs were received in the
      wrong order.
      
       wscale=7, SYN and SYNACK not shown here.
      
       This ACK allows the sender to send 1871*128 bytes from seq 51359321 :
       New right edge of the window -> 51359321+1871*128=51598809
      
       09:17:23.389210 IP A > B: Flags [.], ack 51359321, win 1871, options [nop,nop,TS val 10 ecr 999], length 0
      
       09:17:23.389212 IP B > A: Flags [.], seq 51422681:51424089, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 1408
       09:17:23.389214 IP A > B: Flags [.], ack 51422681, win 1376, options [nop,nop,TS val 10 ecr 999], length 0
       09:17:23.389253 IP B > A: Flags [.], seq 51424089:51488857, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 64768
       09:17:23.389272 IP A > B: Flags [.], ack 51488857, win 859, options [nop,nop,TS val 10 ecr 999], length 0
       09:17:23.389275 IP B > A: Flags [.], seq 51488857:51521241, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 32384
      
       The receiver now allows the sender to send 606*128=77568 bytes from seq 51521241 :
       New right edge of the window -> 51521241+606*128=51598809
      
       09:17:23.389296 IP A > B: Flags [.], ack 51521241, win 606, options [nop,nop,TS val 10 ecr 999], length 0
      
       09:17:23.389308 IP B > A: Flags [.], seq 51521241:51553625, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 32384
      
       It seems the sender exceeds RWIN allowance, since 51611353 > 51598809
      
       09:17:23.389346 IP B > A: Flags [.], seq 51553625:51611353, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 57728
       09:17:23.389356 IP B > A: Flags [.], seq 51611353:51618393, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 7040
      
       09:17:23.389367 IP A > B: Flags [.], ack 51611353, win 0, options [nop,nop,TS val 10 ecr 999], length 0
      
       netfilter conntrack is not happy and sends RST
      
       09:17:23.389389 IP A > B: Flags [R], seq 92176528, win 0, length 0
       09:17:23.389488 IP B > A: Flags [R], seq 174478967, win 0, length 0
      
       Now imagine the ACKs were delivered out of order and tcp_add_backlog() set the window based on the wrong packet.
       New right edge of the window -> 51521241+859*128=51631193
      
      Normally the TCP stack handles OOO packets just fine, but it
      turns out tcp_add_backlog() does not: it can update the window
      field of the aggregated packet even if the ACK sequence
      of the last received packet is too old.
      
      Many thanks to Alexandre Ferrieux for independently reporting the issue
      and suggesting a fix.
      
      Fixes: 4f693b55 ("tcp: implement coalescing on backlog queue")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Alexandre Ferrieux <alexandre.ferrieux@orange.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      86bccd03
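      A hedged sketch of the kind of guard this implies in the coalescing path
      (not the literal upstream diff): only refresh the aggregated packet's
      window when the newly arrived ACK is at least as recent as the one
      already recorded on the tail.

          /* Inside tcp_add_backlog() coalescing, with 'tail' the aggregated skb,
           * 'thtail' its TCP header and 'th' the header of the new segment:
           */
          if (!before(TCP_SKB_CB(skb)->ack_seq, TCP_SKB_CB(tail)->ack_seq)) {
                  TCP_SKB_CB(tail)->ack_seq = TCP_SKB_CB(skb)->ack_seq;
                  thtail->window = th->window;   /* only newer ACKs move the window */
          }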