1. 24 Jan, 2021 1 commit
  2. 21 Jan, 2021 1 commit
    • tcp: Fix potential use-after-free due to double kfree() · c89dffc7
      Kuniyuki Iwashima authored
      On receiving an ACK with a valid SYN cookie, cookie_v4_check() allocates
      struct request_sock and may then allocate inet_rsk(req)->ireq_opt. After
      that, tcp_v4_syn_recv_sock() allocates struct sock and copies ireq_opt to
      inet_sk(sk)->inet_opt. Normally, tcp_v4_syn_recv_sock() inserts the full
      socket into ehash and sets ireq_opt to NULL. Otherwise,
      tcp_v4_syn_recv_sock() has to reset inet_opt to NULL and free the full
      socket.

      The commit 01770a16 ("tcp: fix race condition when creating child
      sockets from syncookies") added a new path, in which more than one core
      can create a full socket for the same SYN cookie. Currently, the core that
      loses the race frees the full socket without resetting inet_opt, so
      both sock_put() and reqsk_put() call kfree() on the same memory:
      
        sock_put
          sk_free
            __sk_free
              sk_destruct
                __sk_destruct
                  sk->sk_destruct/inet_sock_destruct
                    kfree(rcu_dereference_protected(inet->inet_opt, 1));
      
        reqsk_put
          reqsk_free
            __reqsk_free
              req->rsk_ops->destructor/tcp_v4_reqsk_destructor
                kfree(rcu_dereference_protected(inet_rsk(req)->ireq_opt, 1));
      
      A kmalloc() call between the two kfree() calls can lead to a use-after-free,
      so this patch fixes it by setting inet_opt to NULL before sock_put().
      
      As a side note, this kind of issue does not happen for IPv6. This is
      because tcp_v6_syn_recv_sock() clones both ipv6_opt and pktopts which
      correspond to ireq_opt in IPv4.
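
The ownership problem described above can be sketched with a toy allocator. This is a minimal Python model, not kernel code: the `Heap` class and `lose_race()` are hypothetical names, and the two `kfree()` calls only stand in for the sock_put() and reqsk_put() paths shown in the call chains.

```python
class Heap:
    """Toy allocator that counts double frees instead of corrupting memory."""

    def __init__(self):
        self.live = set()
        self.double_frees = 0
        self.next_addr = 1

    def kmalloc(self):
        addr = self.next_addr
        self.next_addr += 1
        self.live.add(addr)
        return addr

    def kfree(self, addr):
        if addr is None:          # kfree(NULL) is a no-op
            return
        if addr not in self.live:
            self.double_frees += 1
            return
        self.live.remove(addr)


def lose_race(heap, reset_inet_opt):
    """Model the core that loses the race and tears down both objects."""
    ireq_opt = heap.kmalloc()     # options owned by the request socket
    inet_opt = ireq_opt           # the full socket copies the same pointer
    if reset_inet_opt:
        inet_opt = None           # the fix: drop the alias before sock_put()
    heap.kfree(inet_opt)          # stands in for sock_put() -> inet_sock_destruct()
    heap.kfree(ireq_opt)          # stands in for reqsk_put() -> tcp_v4_reqsk_destructor()
    return heap.double_frees
```

Without the reset the model records one double free; with it, the second path is the only one that frees the options.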
      
      Fixes: 01770a16 ("tcp: fix race condition when creating child sockets from syncookies")
      CC: Ricardo Dias <rdias@singlestore.com>
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20210118055920.82516-1-kuniyu@amazon.co.jp
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  3. 20 Jan, 2021 4 commits
    • tcp: fix TCP socket rehash stats mis-accounting · 9c30ae83
      Yuchung Cheng authored
      The previous commit 32efcc06 ("tcp: export count for rehash attempts")
      would mis-account rehashing SNMP and socket stats:
      
        a. During handshake of an active open, only counts the first
           SYN timeout
      
        b. After handshake of passive and active open, stop updating
           after (roughly) TCP_RETRIES1 recurring RTOs
      
        c. After the socket aborts, over count timeout_rehash by 1
      
      This patch fixes this by checking the rehash result from sk_rethink_txhash.
      
      Fixes: 32efcc06 ("tcp: export count for rehash attempts")
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Link: https://lore.kernel.org/r/20210119192619.1848270-1-ycheng@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: do not mess with cloned skbs in tcp_add_backlog() · b160c285
      Eric Dumazet authored
      Heiner Kallweit reported that some skbs were sent with
      the following invalid GSO properties:
      - gso_size > 0
      - gso_type == 0

      This was triggering a WARN_ON_ONCE() in rtl8169_tso_csum_v2.
      
      Juerg Haefliger was able to reproduce a similar issue using
      a lan78xx NIC and a workload mixing TCP incoming traffic
      and forwarded packets.
      
      The problem is that tcp_add_backlog() is writing
      over gso_segs and gso_size even if the incoming packet will not
      be coalesced to the backlog tail packet.
      
      While skb_try_coalesce() would bail out if tail packet is cloned,
      this overwriting would lead to corruptions of other packets
      cooked by lan78xx, sharing a common super-packet.
      
      The strategy used by lan78xx is to use a big skb, and split
      it into all received packets using skb_clone() to avoid copies.
      The drawback of this strategy is that all the small skb share a common
      struct skb_shared_info.
      
      This patch rewrites TCP gso_size/gso_segs handling to only
      happen on the tail skb, since skb_try_coalesce() made sure
      it was not cloned.
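
To illustrate why writing GSO fields on a cloned skb is harmful, here is a small Python model of clones sharing one skb_shared_info. It is a sketch under simplifying assumptions: `SharedInfo`, `Skb`, `carve()`, and the two `add_backlog_*()` functions are hypothetical names, and `try_coalesce()` only models the "tail is cloned" bail-out mentioned above.

```python
class SharedInfo:
    """Stands in for struct skb_shared_info (GSO fields only)."""

    def __init__(self):
        self.gso_size = 0
        self.gso_segs = 0
        self.gso_type = 0


class Skb:
    def __init__(self, length, shinfo=None, cloned=False):
        self.len = length
        self.shinfo = shinfo if shinfo is not None else SharedInfo()
        self.cloned = cloned


def carve(super_shinfo, length):
    """Model a driver handing out clones that share one shared info."""
    return Skb(length, shinfo=super_shinfo, cloned=True)


def try_coalesce(tail, skb):
    """Simplified coalescing: refuse when there is no tail or it is cloned."""
    if tail is None or tail.cloned:
        return False
    tail.len += skb.len
    return True


def add_backlog_old(tail, skb):
    # Old behaviour: write the incoming skb's GSO fields up front,
    # even if no coalescing happens afterwards.
    skb.shinfo.gso_size = skb.shinfo.gso_size or skb.len
    skb.shinfo.gso_segs = skb.shinfo.gso_segs or 1
    return try_coalesce(tail, skb)


def add_backlog_new(tail, skb):
    # New behaviour: keep the values in locals and write only to the
    # tail's shared info once coalescing has succeeded.
    gso_size = skb.shinfo.gso_size or skb.len
    gso_segs = skb.shinfo.gso_segs or 1
    if tail is None:
        return False
    tail_size = tail.shinfo.gso_size or tail.len
    tail_segs = tail.shinfo.gso_segs or 1
    if not try_coalesce(tail, skb):
        return False
    tail.shinfo.gso_size = max(gso_size, tail_size)
    tail.shinfo.gso_segs = min(gso_segs + tail_segs, 0xFFFF)
    return True
```

In the old variant, a sibling packet carved from the same super-packet ends up with gso_size > 0 and gso_type == 0, which is the invalid combination reported above.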
      
      Fixes: 4f693b55 ("tcp: implement coalescing on backlog queue")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Bisected-by: Juerg Haefliger <juergh@canonical.com>
      Tested-by: Juerg Haefliger <juergh@canonical.com>
      Reported-by: Heiner Kallweit <hkallweit1@gmail.com>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=209423
      Link: https://lore.kernel.org/r/20210119164900.766957-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • netfilter: rpfilter: mask ecn bits before fib lookup · 2e5a6266
      Guillaume Nault authored
      RT_TOS() only masks one of the two ECN bits. Therefore rpfilter_mt()
      treats Not-ECT or ECT(1) packets in a different way than those with
      ECT(0) or CE.
      
      Reproducer:
      
        Create two netns, connected with a veth:
        $ ip netns add ns0
        $ ip netns add ns1
        $ ip link add name veth01 netns ns0 type veth peer name veth10 netns ns1
        $ ip -netns ns0 link set dev veth01 up
        $ ip -netns ns1 link set dev veth10 up
        $ ip -netns ns0 address add 192.0.2.10/32 dev veth01
        $ ip -netns ns1 address add 192.0.2.11/32 dev veth10
      
        Add a route to ns1 in ns0:
        $ ip -netns ns0 route add 192.0.2.11/32 dev veth01
      
        In ns1, only packets with TOS 4 can be routed to ns0:
        $ ip -netns ns1 route add 192.0.2.10/32 tos 4 dev veth10
      
        Ping from ns0 to ns1 works regardless of the ECN bits, as long as TOS
        is 4:
        $ ip netns exec ns0 ping -Q 4 192.0.2.11   # TOS 4, Not-ECT
          ... 0% packet loss ...
        $ ip netns exec ns0 ping -Q 5 192.0.2.11   # TOS 4, ECT(1)
          ... 0% packet loss ...
        $ ip netns exec ns0 ping -Q 6 192.0.2.11   # TOS 4, ECT(0)
          ... 0% packet loss ...
        $ ip netns exec ns0 ping -Q 7 192.0.2.11   # TOS 4, CE
          ... 0% packet loss ...
      
        Now use the iptables rpfilter module in ns1:
        $ ip netns exec ns1 iptables-legacy -t raw -A PREROUTING -m rpfilter --invert -j DROP
      
        Not-ECT and ECT(1) packets still pass:
        $ ip netns exec ns0 ping -Q 4 192.0.2.11   # TOS 4, Not-ECT
          ... 0% packet loss ...
        $ ip netns exec ns0 ping -Q 5 192.0.2.11   # TOS 4, ECT(1)
          ... 0% packet loss ...
      
        But ECT(0) and CE packets are dropped:
        $ ip netns exec ns0 ping -Q 6 192.0.2.11   # TOS 4, ECT(0)
          ... 100% packet loss ...
        $ ip netns exec ns0 ping -Q 7 192.0.2.11   # TOS 4, CE
          ... 100% packet loss ...
      
      After this patch, rpfilter doesn't drop ECT(0) and CE packets anymore.
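
The bit arithmetic behind the reproducer can be checked with a few lines of Python. This is only a model: the constants mirror the kernel's IPTOS_TOS_MASK (0x1E) and IPTOS_RT_MASK definitions as I understand them, and the helper names `rt_tos()` and `rt_masked()` are hypothetical.

```python
IPTOS_TOS_MASK = 0x1E                             # keeps bit 1, one of the two ECN bits
INET_ECN_MASK = 0x03                              # the two ECN bits
IPTOS_RT_MASK = IPTOS_TOS_MASK & ~INET_ECN_MASK   # 0x1C, both ECN bits cleared


def rt_tos(tos):
    """Model of RT_TOS(): clears only the lowest ECN bit."""
    return tos & IPTOS_TOS_MASK


def rt_masked(tos):
    """Model of masking with IPTOS_RT_MASK: clears both ECN bits."""
    return tos & IPTOS_RT_MASK


# ping -Q values from the reproducer: Not-ECT, ECT(1), ECT(0), CE on top of TOS 4
for q in (4, 5, 6, 7):
    print(q, rt_tos(q), rt_masked(q))
```

With RT_TOS(), the values 6 and 7 stay at 6 and no longer match the "tos 4" route, whereas masking with IPTOS_RT_MASK maps all four values to 4.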
      
      Fixes: 8f97339d ("netfilter: add ipv4 reverse path filter match")
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • udp: mask TOS bits in udp_v4_early_demux() · 8d2b51b0
      Guillaume Nault authored
      udp_v4_early_demux() is the only function that calls
      ip_mc_validate_source() with a TOS that hasn't been masked with
      IPTOS_RT_MASK.
      
      This results in different behaviours for incoming multicast UDPv4
      packets, depending on if ip_mc_validate_source() is called from the
      early-demux path (udp_v4_early_demux) or from the regular input path
      (ip_route_input_noref).
      
      ECN would normally not be used with UDP multicast packets, so the
      practical consequences should be limited on that side. However,
      IPTOS_RT_MASK is used so that the high-order bits of the TOS are masked
      as well, to align with the non-early-demux path behaviour.
      
      Reproducer:
      
        Set up two netns, connected with a veth:
        $ ip netns add ns0
        $ ip netns add ns1
        $ ip -netns ns0 link set dev lo up
        $ ip -netns ns1 link set dev lo up
        $ ip link add name veth01 netns ns0 type veth peer name veth10 netns ns1
        $ ip -netns ns0 link set dev veth01 up
        $ ip -netns ns1 link set dev veth10 up
        $ ip -netns ns0 address add 192.0.2.10 peer 192.0.2.11/32 dev veth01
        $ ip -netns ns1 address add 192.0.2.11 peer 192.0.2.10/32 dev veth10
      
        In ns0, add route to multicast address 224.0.2.0/24 using source
        address 198.51.100.10:
        $ ip -netns ns0 address add 198.51.100.10/32 dev lo
        $ ip -netns ns0 route add 224.0.2.0/24 dev veth01 src 198.51.100.10
      
        In ns1, define route to 198.51.100.10, only for packets with TOS 4:
        $ ip -netns ns1 route add 198.51.100.10/32 tos 4 dev veth10
      
        Also activate rp_filter in ns1, so that incoming packets not matching
        the above route get dropped:
        $ ip netns exec ns1 sysctl -wq net.ipv4.conf.veth10.rp_filter=1
      
        Now try to receive packets on 224.0.2.11:
        $ ip netns exec ns1 socat UDP-RECVFROM:1111,ip-add-membership=224.0.2.11:veth10,ignoreeof -
      
        In ns0, send packet to 224.0.2.11 with TOS 4 and ECT(0) (that is,
        tos 6 for socat):
        $ echo test0 | ip netns exec ns0 socat - UDP-DATAGRAM:224.0.2.11:1111,bind=:1111,tos=6
      
        The "test0" message is properly received by socat in ns1, because
        early-demux has no cached dst to use, so source address validation
        is done by ip_route_input_mc(), which receives a TOS that has the
        ECN bits masked.
      
        Now send another packet to 224.0.2.11, still with TOS 4 and ECT(0):
        $ echo test1 | ip netns exec ns0 socat - UDP-DATAGRAM:224.0.2.11:1111,bind=:1111,tos=6
      
        The "test1" message isn't received by socat in ns1, because, now,
        early-demux has a cached dst to use and calls ip_mc_validate_source()
        immediately, without masking the ECN bits.
      
      Fixes: bc044e8d ("udp: perform source validation for mcast early demux")
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  4. 19 Jan, 2021 1 commit
  5. 12 Jan, 2021 1 commit
    • esp: avoid unneeded kmap_atomic call · 9bd6b629
      Willem de Bruijn authored
      esp(6)_output_head uses skb_page_frag_refill to allocate a buffer for
      the esp trailer.
      
      It accesses the page with kmap_atomic to handle highmem. But
      skb_page_frag_refill can return compound pages, of which
      kmap_atomic only maps the first underlying page.
      
      skb_page_frag_refill does not return highmem, because flag
      __GFP_HIGHMEM is not set. ESP uses it in the same manner as TCP.
      That also does not call kmap_atomic, but directly uses page_address,
      in skb_copy_to_page_nocache. Do the same for ESP.
      
      This issue has become easier to trigger with recent kmap local
      debugging feature CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP.
      
      Fixes: cac2661c ("esp4: Avoid skb_cow_data whenever possible")
      Fixes: 03e2a30f ("esp6: Avoid skb_cow_data whenever possible")
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  6. 08 Jan, 2021 5 commits
    • nexthop: Bounce NHA_GATEWAY in FDB nexthop groups · b19218b2
      Petr Machata authored
      The function nh_check_attr_group() is called to validate nexthop groups.
      The intention of that code seems to have been to bounce all attributes
      above NHA_GROUP_TYPE except for NHA_FDB. However instead it bounces all
      these attributes except when NHA_FDB attribute is present--then it accepts
      them.
      
      NHA_FDB validation that takes place before, in rtm_to_nh_config(), already
      bounces NHA_OIF, NHA_BLACKHOLE, NHA_ENCAP and NHA_ENCAP_TYPE. Yet further
      back, NHA_GROUPS and NHA_MASTER are bounced unconditionally.
      
      But that still leaves NHA_GATEWAY as an attribute that would be accepted in
      FDB nexthop groups (with no meaning), so long as it keeps the address
      family as unspecified:
      
       # ip nexthop add id 1 fdb via 127.0.0.1
       # ip nexthop add id 10 fdb via default group 1
      
      The nexthop code is still relatively new and likely not used very broadly,
      and the FDB bits are newer still. Even though there is a reproducer out
      there, it relies on improbable gateway arguments "via default", "via
      all" or "via any". Given all this, I believe it is OK to reformulate the
      condition to do the right thing and bounce NHA_GATEWAY.
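
The difference between the intended and the actual condition can be sketched as follows. This is a Python model rather than the kernel's nh_check_attr_group(): the attribute numbers are illustrative placeholders, and `check_old()` / `check_new()` are hypothetical names for the behaviour before and after the reformulated condition.

```python
# Illustrative attribute numbers; only their relative order matters here.
NHA_GROUP_TYPE = 3
NHA_GATEWAY = 6
NHA_FDB = 11
NHA_LIMIT = NHA_FDB + 1


def check_old(attrs):
    """Accepts every extra attribute as soon as NHA_FDB is present."""
    for i in range(NHA_GROUP_TYPE + 1, NHA_LIMIT):
        if i not in attrs:
            continue
        if NHA_FDB in attrs:      # skips the check for *all* attributes
            continue
        return False              # bounce
    return True


def check_new(attrs):
    """Bounces every attribute above NHA_GROUP_TYPE except NHA_FDB itself."""
    for i in range(NHA_GROUP_TYPE + 1, NHA_LIMIT):
        if i not in attrs:
            continue
        if i == NHA_FDB:          # only NHA_FDB is exempt
            continue
        return False              # bounce
    return True
```

In this model, a group carrying both NHA_FDB and NHA_GATEWAY passes the old check but is rejected by the new one, while a plain FDB group is still accepted.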
      
      Fixes: 38428d68 ("nexthop: support for fdb ecmp nexthops")
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • nexthop: Unlink nexthop group entry in error path · 7b01e53e
      Ido Schimmel authored
      In case of error, remove the nexthop group entry from the list to which
      it was previously added.
      
      Fixes: 430a0491 ("nexthop: Add support for nexthop groups")
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • nexthop: Fix off-by-one error in error path · 07e61a97
      Ido Schimmel authored
      A reference was not taken for the current nexthop entry, so do not try
      to put it in the error path.
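
The off-by-one can be pictured with a list of reference counts. The following Python sketch is a model only; `build_group()` and its parameters are hypothetical and do not mirror the kernel function's signature.

```python
def build_group(n_entries, fail_at, buggy):
    """Take one reference per entry; on failure at index fail_at, unwind.

    Returns the net reference count change per entry.
    """
    refs = [0] * n_entries
    for i in range(n_entries):
        if i == fail_at:
            # The failure happens before a reference on entry i is taken.
            start = i if buggy else i - 1
            for j in range(start, -1, -1):
                refs[j] -= 1      # put
            return refs
        refs[i] += 1              # get
    return refs
```

With `buggy=True` the entry that failed ends up at -1, that is, a reference is dropped that was never taken. Starting the unwind one entry earlier leaves every count balanced.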
      
      Fixes: 430a0491 ("nexthop: Add support for nexthop groups")
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ip: always refragment ip defragmented packets · bb4cc1a1
      Florian Westphal authored
      Conntrack reassembly records the largest fragment size seen in IPCB.
      However, when this gets forwarded/transmitted, fragmentation will only
      be forced if one of the fragmented packets had the DF bit set.
      
      In that case, a flag in IPCB will force fragmentation even if the
      MTU is large enough.
      
      This should work fine, but this breaks with ip tunnels.
      Consider a client that sends a UDP datagram of size X to another host.

      The client fragments the datagram, so two packets, of size y and z, are
      sent. The DF bit is not set on any of these packets.

      Middlebox netfilter reassembles those packets back into a single size-X
      packet, before the routing decision.

      The packet-size-vs-mtu checks in ip_forward are irrelevant, because the DF
      bit isn't set. At output time, ip refragmentation is skipped as well
      because X is still smaller than the mtu of the output device.

      If the transmit device is an ip tunnel, the packet size increases to
      X+overhead.
      
      Also, tunnel might be configured to force DF bit on outer header.
      
      In this case, packet will be dropped (exceeds MTU) and an ICMP error is
      generated back to sender.
      
      But sender already respects the announced MTU, all the packets that
      it sent did fit the announced mtu.
      
      Force refragmentation as per original sizes unconditionally so ip tunnel
      will encapsulate the fragments instead.
      
      The only other solution I see is to place ip refragmentation in
      the ip_tunnel code to handle this case.
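
A numeric example may make the scenario easier to follow. The Python sketch below is a deliberately simplified model: all sizes are hypothetical, IP header bytes are ignored, a single `mtu` value is used both for the output-time check and for the encapsulated packet, and `tunnel_output()` is an illustrative name, not a kernel function.

```python
def fragment(length, limit):
    """Split length into pieces of at most limit bytes."""
    sizes = []
    while length > 0:
        piece = min(length, limit)
        sizes.append(piece)
        length -= piece
    return sizes


def tunnel_output(length, frag_max_size, df_seen, mtu, overhead, always_refragment):
    """Return (piece size, fits after encapsulation) for each transmitted piece."""
    refragment = length > mtu or df_seen
    if always_refragment and frag_max_size > 0:
        refragment = True         # behaviour described by this patch
    if refragment:
        limit = min(mtu, frag_max_size) if frag_max_size else mtu
        pieces = fragment(length, limit)
    else:
        pieces = [length]
    return [(p, p + overhead <= mtu) for p in pieces]


# Reassembled datagram of 1450 bytes, largest original fragment 1000 bytes,
# no DF bit seen, MTU 1500, tunnel overhead 100 (all values made up).
print(tunnel_output(1450, 1000, False, 1500, 100, always_refragment=False))
print(tunnel_output(1450, 1000, False, 1500, 100, always_refragment=True))
```

In the first call the single 1450-byte packet grows to 1550 bytes after encapsulation and no longer fits. In the second call the packet is split according to the recorded fragment size, so each encapsulated piece fits.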
      
      Fixes: d6b915e2 ("ip_fragment: don't forward defragmented DF packet")
      Reported-by: Christian Perle <christian.perle@secunet.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: fix pmtu check in nopmtudisc mode · 50c66167
      Florian Westphal authored
      For some reason ip_tunnel insists on setting the DF bit anyway when the
      inner header has the DF bit set, EVEN if the tunnel was configured with
      'nopmtudisc'.
      
      This means that the script added in the previous commit
      cannot be made to work by adding the 'nopmtudisc' flag to the
      ip tunnel configuration. Doing so breaks connectivity even for the
      without-conntrack/netfilter scenario.
      
      When nopmtudisc is set, the tunnel will skip the mtu check, so no
      icmp error is sent to client. Then, because inner header has DF set,
      the outer header gets added with DF bit set as well.
      
      IP stack then sends an error to itself because the packet exceeds
      the device MTU.
      
      Fixes: 23a3647b ("ip_tunnels: Use skb-len to PMTU check.")
      Cc: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  7. 29 Dec, 2020 2 commits
    • erspan: fix version 1 check in gre_parse_header() · 085c7c4e
      Cong Wang authored
      Both version 0 and version 1 use ETH_P_ERSPAN, but version 0 does not
      have an erspan header. So the check in gre_parse_header() is wrong,
      we have to distinguish version 1 from version 0.
      
      We can just check the gre header length like is_erspan_type1().
      
      Fixes: cb73ee40 ("net: ip_gre: use erspan key field for tunnel lookup")
      Reported-by: syzbot+f583ce3d4ddf9836b27a@syzkaller.appspotmail.com
      Cc: William Tu <u9012063@gmail.com>
      Cc: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
      Signed-off-by: Cong Wang <cong.wang@bytedance.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst() · 21fdca22
      Guillaume Nault authored
      RT_TOS() only clears one of the ECN bits. Therefore, when
      fib_compute_spec_dst() resorts to a fib lookup, it can return
      different results depending on the value of the second ECN bit.
      
      For example, ECT(0) and ECT(1) packets could be treated differently.
      
        $ ip netns add ns0
        $ ip netns add ns1
        $ ip link add name veth01 netns ns0 type veth peer name veth10 netns ns1
        $ ip -netns ns0 link set dev lo up
        $ ip -netns ns1 link set dev lo up
        $ ip -netns ns0 link set dev veth01 up
        $ ip -netns ns1 link set dev veth10 up
      
        $ ip -netns ns0 address add 192.0.2.10/24 dev veth01
        $ ip -netns ns1 address add 192.0.2.11/24 dev veth10
      
        $ ip -netns ns1 address add 192.0.2.21/32 dev lo
        $ ip -netns ns1 route add 192.0.2.10/32 tos 4 dev veth10 src 192.0.2.21
        $ ip netns exec ns1 sysctl -wq net.ipv4.icmp_echo_ignore_broadcasts=0
      
      With TOS 4 and ECT(1), ns1 replies using source address 192.0.2.21
      (ping uses -Q to set all TOS and ECN bits):
      
        $ ip netns exec ns0 ping -c 1 -b -Q 5 192.0.2.255
        [...]
        64 bytes from 192.0.2.21: icmp_seq=1 ttl=64 time=0.544 ms
      
      But with TOS 4 and ECT(0), ns1 replies using source address 192.0.2.11
      because the "tos 4" route isn't matched:
      
        $ ip netns exec ns0 ping -c 1 -b -Q 6 192.0.2.255
        [...]
        64 bytes from 192.0.2.11: icmp_seq=1 ttl=64 time=0.597 ms
      
      After this patch the ECN bits don't affect the result anymore:
      
        $ ip netns exec ns0 ping -c 1 -b -Q 6 192.0.2.255
        [...]
        64 bytes from 192.0.2.21: icmp_seq=1 ttl=64 time=0.591 ms
      
      Fixes: 35ebf65e ("ipv4: Create and use fib_compute_spec_dst() helper.")
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 18 Dec, 2020 1 commit
  9. 15 Dec, 2020 2 commits
  10. 13 Dec, 2020 1 commit
    • inet: frags: batch fqdir destroy works · 0b9b2414
      SeongJae Park authored
      On a few of our systems, I found frequent 'unshare(CLONE_NEWNET)' calls
      make the number of active slab objects including 'sock_inode_cache' type
      rapidly and continuously increase.  As a result, memory pressure occurs.
      
      In more detail, I made an artificial reproducer that resembles the
      workload in which we found the problem and reproduces it faster.  It
      merely repeats 'unshare(CLONE_NEWNET)' 50,000 times in a loop.  It takes
      about 2 minutes.  On a 40 CPU cores / 70GB DRAM machine, the available
      memory decreased continuously and quickly (about 120MB per second,
      15GB in total within the 2 minutes).  Note that the issue doesn't
      reproduce on every machine.  On my 6 CPU cores machine, the problem
      didn't reproduce.
      
      'cleanup_net()' and 'fqdir_work_fn()' are functions that deallocate the
      relevant memory objects.  They are asynchronously invoked by the work
      queues and internally use 'rcu_barrier()' to ensure safe destructions.
      'cleanup_net()' works in a batched manner in a single thread worker,
      while 'fqdir_work_fn()' works for each 'fqdir_exit()' call in the
      'system_wq'.  Therefore, 'fqdir_work_fn()' was called frequently under the
      workload and made the contention for 'rcu_barrier()' high.  In more
      detail, the global mutex, 'rcu_state.barrier_mutex' became the
      bottleneck.

      This commit avoids such contention by doing the 'rcu_barrier()' and
      subsequent lightweight works in a batched manner, similar to that of
      'cleanup_net()'.  The fqdir hashtable destruction, which is done before
      the 'rcu_barrier()', is still allowed to run in parallel for fast
      processing, but this commit makes it use a dedicated work queue
      instead of the 'system_wq', to make sure that the number of threads is
      bounded.
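
The effect of batching on the number of barrier calls can be sketched with a simple counter model. This Python sketch is not the kernel implementation: `BatchedDestroyer`, `per_exit_barriers()`, and the "worker runs every 1000 exits" cadence are hypothetical, chosen only to show how one barrier can cover a whole batch.

```python
def per_exit_barriers(n_exits):
    """Unbatched model: every fqdir_exit() pays for its own rcu_barrier()."""
    return n_exits


class BatchedDestroyer:
    """Batched model: exits are queued and one barrier covers the whole queue."""

    def __init__(self):
        self.pending = []
        self.barriers = 0
        self.freed = 0

    def fqdir_exit(self, fqdir):
        self.pending.append(fqdir)

    def run_worker(self):
        batch, self.pending = self.pending, []
        if not batch:
            return
        self.barriers += 1          # a single barrier for the batch
        self.freed += len(batch)    # then the lightweight per-object frees


destroyer = BatchedDestroyer()
for i in range(50000):
    destroyer.fqdir_exit(i)
    if (i + 1) % 1000 == 0:         # assumed worker cadence
        destroyer.run_worker()
```

Under these assumptions the batched variant issues 50 barriers for 50,000 exits, compared with 50,000 in the unbatched model.
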
      Signed-off-by: SeongJae Park <sjpark@amazon.de>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20201211112405.31158-1-sjpark@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  11. 11 Dec, 2020 1 commit
  12. 10 Dec, 2020 2 commits
  13. 09 Dec, 2020 1 commit
    • tcp: select sane initial rcvq_space.space for big MSS · 72d05c00
      Eric Dumazet authored
      Before commit a337531b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
      small tcp_rmem[1] values were overridden by tcp_fixup_rcvbuf() to accommodate various MSS.
      
      This is no longer the case, and Hazem Mohamed Abuelfotoh reported
      that DRS would not work for MTU 9000 endpoints receiving regular (1500 bytes) frames.
      
      Root cause is that tcp_init_buffer_space() uses tp->rcv_wnd for upper limit
      of rcvq_space.space computation, while it can select later a smaller
      value for tp->rcv_ssthresh and tp->window_clamp.
      
      ss -temoi on receiver would show :
      
      skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) rcv_space:62496 rcv_ssthresh:56596
      
      This means that TCP cannot increase its window in tcp_grow_window(),
      and that DRS can never kick in.
      
      Fix this by making sure that rcvq_space.space is not bigger than number of bytes
      that can be held in TCP receive queue.
      
      People unable/unwilling to change their kernel can work around this issue by
      selecting a bigger tcp_rmem[1] value as in :
      
      echo "4096 196608 6291456" >/proc/sys/net/ipv4/tcp_rmem
      
      Based on an initial report and patch from Hazem Mohamed Abuelfotoh
       https://lore.kernel.org/netdev/20201204180622.14285-1-abuehaze@amazon.com/
      
      Fixes: a337531b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
      Fixes: 041a14d2 ("tcp: start receiver buffer autotuning sooner")
      Reported-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 08 Dec, 2020 1 commit
    • netfilter: x_tables: Switch synchronization to RCU · cc00bcaa
      Subash Abhinov Kasiviswanathan authored
      When running concurrent iptables rules replacement with data, the per CPU
      sequence count is checked after the assignment of the new information.
      The sequence count is used to synchronize with the packet path without the
      use of any explicit locking. If there are any packets in the packet path using
      the table information, the sequence count is incremented to an odd value and
      is incremented to an even value after packet processing completes.
      
      The new table value assignment is followed by a write memory barrier so every
      CPU should see the latest value. If the packet path has started with the old
      table information, the sequence counter will be odd and the iptables
      replacement will wait till the sequence count is even prior to freeing the
      old table info.
      
      However, this assumes that the new table information assignment and the memory
      barrier is actually executed prior to the counter check in the replacement
      thread. If the CPU decides to execute the assignment later, as there is no user of
      the table information prior to the sequence check, the packet path on another
      CPU may use the old table information. The replacement thread would then free
      the table information under it, leading to a use-after-free in the packet
      processing context:
      
      Unable to handle kernel NULL pointer dereference at virtual
      address 000000000000008e
      pc : ip6t_do_table+0x5d0/0x89c
      lr : ip6t_do_table+0x5b8/0x89c
      ip6t_do_table+0x5d0/0x89c
      ip6table_filter_hook+0x24/0x30
      nf_hook_slow+0x84/0x120
      ip6_input+0x74/0xe0
      ip6_rcv_finish+0x7c/0x128
      ipv6_rcv+0xac/0xe4
      __netif_receive_skb+0x84/0x17c
      process_backlog+0x15c/0x1b8
      napi_poll+0x88/0x284
      net_rx_action+0xbc/0x23c
      __do_softirq+0x20c/0x48c
      
      This could be fixed by forcing instruction order after the new table
      information assignment or by switching to RCU for the synchronization.
      
      Fixes: 80055dab ("netfilter: x_tables: make xt_replace_table wait until old rules are not used anymore")
      Reported-by: Sean Tranchetti <stranche@codeaurora.org>
      Reported-by: kernel test robot <lkp@intel.com>
      Suggested-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  15. 07 Dec, 2020 1 commit
  16. 05 Dec, 2020 9 commits
  17. 04 Dec, 2020 3 commits
    • bpf: Remove hard-coded btf_vmlinux assumption from BPF verifier · 22dc4a0f
      Andrii Nakryiko authored
      Remove a permeating assumption throughout the BPF verifier of vmlinux BTF. Instead,
      wherever BTF type IDs are involved, also track the instance of struct btf that
      goes along with the type ID. This allows gradually adding support for kernel
      module BTFs and using/tracking module types across BPF helper calls and
      registers.
      
      This patch also renames btf_id() function to btf_obj_id() to minimize naming
      clash with using btf_id to denote BTF *type* ID, rather than BTF *object*'s ID.
      
      Also, although btf_vmlinux can't get destructed and thus doesn't need
      refcounting, module BTFs need that, so apply BTF refcounting universally when
      a BPF program is using BTF-powered attachment (tp_btf, fentry/fexit, etc). This
      makes for simpler clean up code.
      
      Now that BTF type ID is not enough to uniquely identify a BTF type, extend BPF
      trampoline key to include BTF object ID. To differentiate that from target
      program BPF ID, set 31st bit of type ID. BTF type IDs (at least currently) are
      not allowed to take full 32 bits, so there is no danger of confusing that bit
      with a valid BTF type ID.
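
One way to picture the extended trampoline key is as a 64-bit value built from an ID and the BTF type ID. The Python sketch below is an illustration only: placing the program or object ID in the upper 32 bits is an assumption of this sketch, and `trampoline_key()` is a hypothetical helper. Only the use of bit 31 of the type ID as a marker comes from the text above.

```python
BTF_OBJ_FLAG = 1 << 31   # marker bit; valid BTF type IDs are assumed to stay below it


def trampoline_key(btf_type_id, tgt_prog_id=None, btf_obj_id=None):
    """Build a 64-bit key from a target program ID or a BTF object ID."""
    assert btf_type_id < BTF_OBJ_FLAG
    if tgt_prog_id is not None:
        # Attaching to another BPF program: program ID plus plain type ID.
        return (tgt_prog_id << 32) | btf_type_id
    # Attaching to a kernel or module type: object ID plus flagged type ID.
    return (btf_obj_id << 32) | BTF_OBJ_FLAG | btf_type_id


prog_key = trampoline_key(7, tgt_prog_id=5)
btf_key = trampoline_key(7, btf_obj_id=5)
```

Even when a program ID and a BTF object ID happen to share the same number, the flagged bit keeps the two keys distinct.
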
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20201203204634.1325171-10-andrii@kernel.org
    • bpf: Adds support for setting window clamp · cb811109
      Prankur gupta authored
      Adds a new bpf_setsockopt option for TCP sockets, TCP_BPF_WINDOW_CLAMP,
      which sets the maximum receiver window size. It will be useful for
      limiting the receiver window based on RTT.
      Signed-off-by: Prankur gupta <prankgup@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20201202213152.435886-2-prankgup@fb.com
    • tcp: merge 'init_req' and 'route_req' functions · 7ea851d1
      Florian Westphal authored
      The Multipath-TCP standard (RFC 8684) says that an MPTCP host should send
      a TCP reset if the token in a MP_JOIN request is unknown.
      
      At this time we don't do this: the 3WHS completes and the 'new subflow'
      is reset afterwards.  There are two ways to allow MPTCP to send the
      reset.
      
      1. override 'send_synack' callback and emit the rst from there.
         The drawback is that the request socket gets inserted into the
         listeners queue just to get removed again right away.
      
      2. Send the reset from the 'route_req' function instead.
         This avoids the 'add&remove request socket', but route_req lacks the
         skb that is required to send the TCP reset.
      
      Instead of just adding the skb to that function for MPTCP sake alone,
      Paolo suggested to merge init_req and route_req functions.
      
      This saves one indirection from syn processing path and provides the skb
      to the merged function at the same time.
      
      'send reset on unknown mptcp join token' is added in next patch.
      Suggested-by: Paolo Abeni <pabeni@redhat.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  18. 03 Dec, 2020 1 commit
  19. 01 Dec, 2020 1 commit
  20. 29 Nov, 2020 1 commit
    • ipv4: Fix tos mask in inet_rtm_getroute() · 1ebf1790
      Guillaume Nault authored
      When inet_rtm_getroute() was converted to use the RCU variants of
      ip_route_input() and ip_route_output_key(), the TOS parameters
      stopped being masked with IPTOS_RT_MASK before doing the route lookup.
      
      As a result, "ip route get" can return a different route than what
      would be used when sending real packets.
      
      For example:
      
          $ ip route add 192.0.2.11/32 dev eth0
          $ ip route add unreachable 192.0.2.11/32 tos 2
          $ ip route get 192.0.2.11 tos 2
          RTNETLINK answers: No route to host
      
      But, packets with TOS 2 (ECT(0) if interpreted as an ECN bit) would
      actually be routed using the first route:
      
          $ ping -c 1 -Q 2 192.0.2.11
          PING 192.0.2.11 (192.0.2.11) 56(84) bytes of data.
          64 bytes from 192.0.2.11: icmp_seq=1 ttl=64 time=0.173 ms
      
          --- 192.0.2.11 ping statistics ---
          1 packets transmitted, 1 received, 0% packet loss, time 0ms
          rtt min/avg/max/mdev = 0.173/0.173/0.173/0.000 ms
      
      This patch re-applies IPTOS_RT_MASK in inet_rtm_getroute(), to
      return results consistent with real route lookups.
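
The mismatch between "ip route get" and real traffic can be reproduced in a toy lookup. The Python sketch below is a simplified model, not the kernel FIB: `lookup()` and the route tuples are hypothetical, and the matching rule (a route with TOS 0 matches any TOS, more specific TOS values are tried first) is an assumption of the model.

```python
IPTOS_RT_MASK = 0x1C   # TOS bits used for routing, both ECN bits cleared

# (route TOS, result) for 192.0.2.11/32, as in the example above
ROUTES = [
    (0, "dev eth0"),
    (2, "unreachable"),
]


def lookup(routes, tos):
    """Return the first route whose TOS is 0 (wildcard) or equals tos."""
    for route_tos, result in sorted(routes, key=lambda r: -r[0]):
        if route_tos and route_tos != tos:
            continue
        return result
    return None


unmasked = lookup(ROUTES, 2)                   # what "ip route get ... tos 2" hit
masked = lookup(ROUTES, 2 & IPTOS_RT_MASK)     # what real packets use
```

In this model the unmasked lookup returns the unreachable route, while the masked lookup returns the eth0 route, matching the ping result shown above.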
      
      Fixes: 3765d35e ("net: ipv4: Convert inet_rtm_getroute to rcu versions of route lookup")
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/b2d237d08317ca55926add9654a48409ac1b8f5b.1606412894.git.gnault@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>