1. 12 Mar, 2021 6 commits
    • nexthop: Implement notifiers for resilient nexthop groups · 7c37c7e0
      Petr Machata committed
      Implement the following notifications towards drivers:
      
      - NEXTHOP_EVENT_REPLACE, when a resilient nexthop group is created.
      
      - NEXTHOP_EVENT_BUCKET_REPLACE any time there is a change in assignment of
        next hops to hash table buckets. That includes replacements, deletions,
        and delayed upkeep cycles. Some bucket notifications can be vetoed by the
        driver, to make it possible to propagate bucket busy-ness flags from the
        HW back to the algorithm. Some are however forced, e.g. if a next hop is
        deleted, all buckets that use this next hop simply must be migrated,
        whether the HW wishes so or not.
      
      - NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE, before a resilient nexthop group is
        replaced. Usually the driver will get the bucket notifications as well,
        and could veto those. But in some cases, a bucket may not be migrated
        immediately, but during delayed upkeep, and that is too late to roll the
        transaction back. This notification allows the driver to take a look and
        veto the new proposed group up front, before anything is committed.
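
The veto rule for bucket notifications described above can be modelled in a few lines. This is a userspace sketch, not the kernel interface: `struct bucket_event`, `driver_bucket_replace()` and `migrate_bucket()` are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one bucket migration offered to a driver. */
struct bucket_event {
    bool hw_busy;  /* hardware saw recent traffic on the bucket */
    bool force;    /* migration must happen, e.g. the next hop was deleted */
};

/* Driver side: returns 0 to accept the migration, -1 to veto it. */
int driver_bucket_replace(const struct bucket_event *ev)
{
    if (ev->hw_busy && !ev->force)
        return -1;  /* veto: report the bucket as busy to the algorithm */
    return 0;       /* idle bucket, or a forced migration */
}

/* Core side: returns true if the bucket ends up migrated. */
bool migrate_bucket(const struct bucket_event *ev)
{
    int err = driver_bucket_replace(ev);

    /* In this model a forced event is never refused. */
    assert(!(ev->force && err));
    return err == 0;
}
```

A busy bucket is left alone unless the event is forced, which mirrors the case of a deleted next hop whose buckets must move regardless of what the hardware reports.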

      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nexthop: Add implementation of resilient next-hop groups · 283a72a5
      Petr Machata committed
      At this moment, there is only one type of next-hop group: an mpath group,
      which implements the hash-threshold algorithm.
      
      To select a next hop, the hash-threshold algorithm first assigns a range of
      hashes to each next hop in the group, and then selects the next hop by
      comparing the SKB hash with the individual ranges. When a next hop is
      removed from the group, the ranges are recomputed, which leads to
      reassignment of parts of hash space from one next hop to another. While
      there will usually be some overlap between the previous and the new
      distribution, some traffic flows change the next hop that they resolve to.
      That causes problems e.g. as established TCP connections are reset, because
      the traffic is forwarded to a server that is not familiar with the
      connection.
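
The range-based selection and its side effect on removal can be sketched as follows. This is a simplified userspace model under stated assumptions: `struct nh` and `hash_threshold_select()` are hypothetical names, and the way the 32-bit hash space is split by weight is one plausible scheme, not necessarily the kernel's exact arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: each next hop owns a contiguous slice of the
 * 32-bit hash space, sized in proportion to its weight. */
struct nh {
    int id;
    unsigned int weight;
};

/* Return the id of the next hop whose range contains `hash`. */
int hash_threshold_select(const struct nh *nhs, size_t n, uint32_t hash)
{
    uint64_t total = 0, acc = 0;
    size_t i;

    assert(n > 0);
    for (i = 0; i < n; i++)
        total += nhs[i].weight;
    assert(total > 0);

    for (i = 0; i < n; i++) {
        acc += nhs[i].weight;
        /* Exclusive upper bound of the range owned by next hop i. */
        uint64_t upper = (acc << 32) / total;

        if ((uint64_t)hash < upper)
            return nhs[i].id;
    }
    return nhs[n - 1].id; /* not reached: the last bound is 2^32 */
}
```

With three equal-weight next hops 1, 2 and 3, the hash 2500000000 resolves to next hop 2. After next hop 1 is removed, the same hash resolves to next hop 3, even though next hop 2 is still in the group. That is the kind of flow disruption the text describes.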
      
      Resilient hashing is a technique to address the above problem. A resilient
      next-hop group has another layer of indirection between the group itself
      and its constituent next hops: a hash table. The selection algorithm uses a
      straightforward modulo operation to choose a hash bucket, and then reads
      the next hop that this bucket contains, and forwards traffic there.
      
      This indirection brings an important feature. In the hash-threshold
      algorithm, the range of hashes associated with a next hop must be
      continuous. With a hash table, mapping between the hash table buckets and
      the individual next hops is arbitrary. Therefore when a next hop is deleted
      the buckets that held it are simply reassigned to other next hops. When
      weights of next hops in a group are altered, it may be possible to choose a
      subset of buckets that are currently not used for forwarding traffic, and
      use those to satisfy the new next-hop distribution demands, keeping the
      "busy" buckets intact. This way, established flows ideally keep being
      forwarded to the same endpoints through the same paths as before the
      next-hop group change.
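
A minimal sketch of the bucket indirection, assuming a tiny fixed-size table: `struct res_table`, `res_select()` and `res_remove_nh()` are hypothetical names, and handing all freed buckets to a single replacement is a simplification of what a weighted group would do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_BUCKETS 8  /* illustrative size; real tables are far larger */

/* Each bucket names the next hop that currently owns it. */
struct res_table {
    int bucket_nh[NUM_BUCKETS];
};

/* Selection is a plain modulo followed by one table read. */
int res_select(const struct res_table *t, uint32_t hash)
{
    return t->bucket_nh[hash % NUM_BUCKETS];
}

/* When next hop `dead` goes away, hand only its buckets to `repl`.
 * Every other bucket, and thus every other flow, is left untouched.
 * Returns the number of buckets that changed owner. */
size_t res_remove_nh(struct res_table *t, int dead, int repl)
{
    size_t i, moved = 0;

    assert(dead != repl);
    for (i = 0; i < NUM_BUCKETS; i++) {
        if (t->bucket_nh[i] == dead) {
            t->bucket_nh[i] = repl;
            moved++;
        }
    }
    return moved;
}
```

Because the mapping from bucket to next hop is arbitrary, a hash that landed on a surviving next hop before the removal still lands on it afterwards.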
      
      In a nutshell, the algorithm works as follows. Each next hop has a number
      of buckets that it wants to have, according to its weight and the number of
      buckets in the hash table. In case of an event that might cause bucket
      allocation change, the numbers for individual next hops are updated,
      similarly to how ranges are updated for mpath group next hops. Following
      that, a new "upkeep" algorithm runs, and for idle buckets that belong to a
      next hop that is currently occupying more buckets than it wants (it is
      "overweight"), it migrates the buckets to one of the next hops that has
      fewer buckets than it wants (it is "underweight"). If, after this, there
      are still underweight next hops, another upkeep run is scheduled for a
      future time.
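
One upkeep pass, as outlined above, might look like the following userspace sketch. The structures, the fixed sizes and the "first underweight next hop wins" policy are all assumptions made for illustration; the kernel code differs in detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_NH 2
#define NUM_BUCKETS 4

struct bucket {
    size_t nh;   /* index of the owning next hop */
    bool idle;   /* no recent traffic, so safe to migrate */
};

struct group {
    size_t wants[NUM_NH];  /* buckets each next hop should own */
    size_t count[NUM_NH];  /* buckets each next hop owns now */
    struct bucket buckets[NUM_BUCKETS];
};

/* Find some next hop that owns fewer buckets than it wants. */
static bool find_underweight(const struct group *g, size_t *out)
{
    for (size_t i = 0; i < NUM_NH; i++) {
        if (g->count[i] < g->wants[i]) {
            *out = i;
            return true;
        }
    }
    return false;
}

/* One upkeep pass. Returns true if underweight next hops remain, in
 * which case the caller would schedule another pass for later. */
bool upkeep(struct group *g)
{
    size_t to;

    for (size_t i = 0; i < NUM_BUCKETS; i++) {
        struct bucket *b = &g->buckets[i];

        if (!b->idle || g->count[b->nh] <= g->wants[b->nh])
            continue;  /* busy bucket, or owner is not overweight */
        if (!find_underweight(g, &to))
            break;     /* every next hop has what it wants */
        g->count[b->nh]--;
        g->count[to]++;
        b->nh = to;
    }
    return find_underweight(g, &to);
}
```

Only idle buckets of overweight next hops are moved, so a pass can finish with demands still unmet. The return value models the decision to schedule another run.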
      
      Chances are there are not enough "idle" buckets to satisfy the new demands.
      The algorithm has knobs to select both what it means for a bucket to be
      idle, and whether and when to forcefully migrate buckets if the number of
      idle buckets stays insufficient.
      
      There are three users of the resilient data structures.
      
      - The forwarding code accesses them under RCU, and does not modify them
        except for updating the time a selected bucket was last used.
      
      - Netlink code, running under RTNL, which may modify the data.
      
      - The delayed upkeep code, which may modify the data. This runs unlocked,
        and mutual exclusion between the RTNL code and the delayed upkeep is
        maintained by canceling the delayed work synchronously before the RTNL
        code touches anything. Later it restarts the delayed work if necessary.
      
      The RTNL code has to implement next-hop group replacement, next hop
      removal, etc. For removal, the mpath code uses a neat trick of having a
      backup next hop group structure, doing the necessary changes offline, and
      then RCU-swapping them in. However, the hash tables for resilient hashing
      are about an order of magnitude larger than the groups themselves (the size
      might be e.g. 4K entries), and it was felt that keeping two of them would be
      overkill. Both the primary next-hop group and the spare therefore use the
      same resilient table, and writers are careful to keep all references valid
      for the forwarding code. The hash table references next-hop group entries
      from the next-hop group that is currently in the primary role (i.e. not
      spare). During the transition from primary to spare, the table references a
      mix of both the primary group and the spare. When a next hop is deleted,
      the corresponding buckets are not set to NULL, but instead marked as empty,
      so that the pointer is valid and can be used by the forwarding code. The
      buckets are then migrated to a new next-hop group entry during upkeep. The
      only times that the hash table is invalid are the very beginning and very
      end of its lifetime. Between those points, it is always kept valid.
      
      This patch introduces the core support code itself. It does not handle
      notifications towards drivers, which are kept as if the group were an mpath
      one. It does not handle netlink either. The only bit currently exposed to
      user space is the new next-hop group type, and that is currently bounced.
      There is therefore no way to actually access this code.

      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nexthop: Add netlink defines and enumerators for resilient NH groups · 710ec562
      Ido Schimmel committed
      - RTM_NEWNEXTHOP et al. that handle resilient groups will have a new nested
        attribute, NHA_RES_GROUP, whose elements are attributes NHA_RES_GROUP_*.
      
      - RTM_NEWNEXTHOPBUCKET et al. is a suite of new messages that will
        currently serve only for dumping of individual buckets of resilient next
        hop groups. For nexthop group buckets, these messages will carry a nested
        attribute NHA_RES_BUCKET, whose elements are attributes NHA_RES_BUCKET_*.
      
        There are several reasons why a new suite of messages is created for
        nexthop buckets instead of overloading the information on the existing
        RTM_{NEW,DEL,GET}NEXTHOP messages.
      
        First, a nexthop group can contain a large number of nexthop buckets (4k
        is not unheard of). This imposes limits on the amount of information that
        can be encoded for each nexthop bucket given a netlink message is limited
        to 64k bytes.
      
        Second, while RTM_NEWNEXTHOPBUCKET is only used for notifications at
        this point, in the future it can be extended to provide user space with
        control over nexthop buckets configuration.
      
      - The new group type is NEXTHOP_GRP_TYPE_RES. Note that nexthop code is
        adjusted to bounce groups with that type for now.

      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nexthop: Add a dedicated flag for multipath next-hop groups · 90e1a9e2
      Petr Machata committed
      With the introduction of resilient nexthop groups, there will be two types
      of multipath groups: the current hash-threshold "mpath" ones, and resilient
      groups. Both are multipath, but to determine the fact, the system needs to
      consider two flags. This might prove costly in the datapath. Therefore,
      introduce a new flag, that should be set for next-hop groups that have more
      than one nexthop, and should be considered multipath.

      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nexthop: __nh_notifier_single_info_init(): Make nh_info an argument · 96a85625
      Petr Machata committed
      The cited function currently uses rtnl_dereference() to get nh_info from a
      handed-in nexthop. However, under the resilient hashing scheme, this
      function will not always be called under RTNL; sometimes the mutual
      exclusion will be achieved differently. Therefore move the nh_info
      extraction from the function to its callers to make it possible to use a
      different synchronization guarantee.

      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nexthop: Pass nh_config to replace_nexthop() · 597f48e4
      Petr Machata committed
      Currently, replace assumes that the new group that is given is a
      fully-formed object. But mpath groups really only have one attribute, and
      that is the constituent next hop configuration. This may not be universally
      true. From the usability perspective, it is desirable to allow the replace
      operation to adjust just the constituent next hop configuration and leave
      the group attributes as such intact.
      
      But the object that keeps track of whether an attribute was or was not
      given is the nh_config object, not the next hop or next-hop group. To allow
      (selective) attribute updates during NH group replacement, propagate `cfg'
      to replace_nexthop() and further to replace_nexthop_grp().

      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 11 Mar, 2021 1 commit
  3. 06 Mar, 2021 1 commit
  4. 05 Mar, 2021 2 commits
    • cipso,calipso: resolve a number of problems with the DOI refcounts · ad5d07f4
      Paul Moore committed
      The current CIPSO and CALIPSO refcounting scheme for the DOI
      definitions is a bit flawed in that we:
      
      1. Don't correctly match gets/puts in netlbl_cipsov4_list().
      2. Decrement the refcount on each attempt to remove the DOI from the
         DOI list, only removing it from the list once the refcount drops
         to zero.
      
      This patch fixes these problems by adding the missing "puts" to
      netlbl_cipsov4_list() and introduces a more conventional, i.e.
      not-buggy, refcounting mechanism to the DOI definitions.  Upon the
      addition of a DOI to the DOI list, it is initialized with a refcount
      of one; removing a DOI from the list unlinks it and drops the
      refcount by one. "Gets" and "puts" behave as expected with respect
      to refcounts, increasing and decreasing the DOI's refcount by one.
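
The conventional scheme described above can be sketched in userspace as follows. `struct doi_def` here is a hypothetical stand-in with plain integers instead of the kernel's refcount type, and the `freed` flag only marks where the real code would release the object.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a DOI definition. */
struct doi_def {
    int refcount;
    bool on_list;
    bool freed;
};

/* Adding to the list sets the refcount to one: the list's own reference. */
void doi_add(struct doi_def *d)
{
    d->refcount = 1;
    d->on_list = true;
    d->freed = false;
}

void doi_get(struct doi_def *d)
{
    assert(!d->freed);
    d->refcount++;
}

void doi_put(struct doi_def *d)
{
    assert(d->refcount > 0);
    if (--d->refcount == 0)
        d->freed = true;  /* last reference gone */
}

/* Removal unlinks immediately and drops only the list's reference.
 * Returns -1 if the DOI was already removed, so repeated removal
 * attempts cannot decrement the count again. */
int doi_remove(struct doi_def *d)
{
    if (!d->on_list)
        return -1;
    d->on_list = false;
    doi_put(d);
    return 0;
}
```

A second removal attempt fails without touching the count, which is the property the old scheme lacked. The object goes away only when the last outstanding "get" is balanced by a "put".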
      
      Fixes: b1edeb10 ("netlabel: Replace protocol/NetLabel linking with refrerence counts")
      Fixes: d7cce015 ("netlabel: Add support for removing a CALIPSO DOI.")
      Reported-by: syzbot+9ec037722d2603a9f52e@syzkaller.appspotmail.com
      Signed-off-by: Paul Moore <paul@paul-moore.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nexthop: Do not flush blackhole nexthops when loopback goes down · 76c03bf8
      Ido Schimmel committed
      As far as user space is concerned, blackhole nexthops do not have a
      nexthop device and therefore should not be affected by the
      administrative or carrier state of any netdev.
      
      However, when the loopback netdev goes down all the blackhole nexthops
      are flushed. This happens because internally the kernel associates
      blackhole nexthops with the loopback netdev.
      
      This behavior is both confusing to those not familiar with kernel
      internals and also diverges from the legacy API where blackhole IPv4
      routes are not flushed when the loopback netdev goes down:
      
       # ip route add blackhole 198.51.100.0/24
       # ip link set dev lo down
       # ip route show 198.51.100.0/24
       blackhole 198.51.100.0/24
      
      Blackhole IPv6 routes are flushed, but at least user space knows that
      they are associated with the loopback netdev:
      
       # ip -6 route show 2001:db8:1::/64
       blackhole 2001:db8:1::/64 dev lo metric 1024 pref medium
      
      Fix this by only flushing blackhole nexthops when the loopback netdev is
      unregistered.
      
      Fixes: ab84be7e ("net: Initial nexthop code")
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reported-by: Donald Sharp <sharpd@nvidia.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 02 Mar, 2021 3 commits
    • tcp: add sanity tests to TCP_QUEUE_SEQ · 8811f4a9
      Eric Dumazet committed
      Qingyu Li reported a syzkaller bug where the repro
      changes RCV SEQ _after_ restoring data in the receive queue.
      
      mprotect(0x4aa000, 12288, PROT_READ)    = 0
      mmap(0x1ffff000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x1ffff000
      mmap(0x20000000, 16777216, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x20000000
      mmap(0x21000000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x21000000
      socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 3
      setsockopt(3, SOL_TCP, TCP_REPAIR, [1], 4) = 0
      connect(3, {sa_family=AF_INET6, sin6_port=htons(0), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = 0
      setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0
      sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="0x0000000000000003\0\0", iov_len=20}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20
      setsockopt(3, SOL_TCP, TCP_REPAIR, [0], 4) = 0
      setsockopt(3, SOL_TCP, TCP_QUEUE_SEQ, [128], 4) = 0
      recvfrom(3, NULL, 20, 0, NULL, NULL)    = -1 ECONNRESET (Connection reset by peer)
      
      syslog shows:
      [  111.205099] TCP recvmsg seq # bug 2: copied 80, seq 0, rcvnxt 80, fl 0
      [  111.207894] WARNING: CPU: 1 PID: 356 at net/ipv4/tcp.c:2343 tcp_recvmsg_locked+0x90e/0x29a0
      
      This should not be allowed. TCP_QUEUE_SEQ should only be used
      when queues are empty.
      
      This patch fixes this case, and the tx path as well.
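
The rule "only when queues are empty" can be modelled as below. This is a reduced userspace sketch, not the kernel code: `struct sock_model` and `set_queue_seq()` are hypothetical, and returning `-EPERM` on refusal is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

enum repair_queue { RECV_QUEUE, SEND_QUEUE };

/* Hypothetical, heavily reduced socket state. */
struct sock_model {
    enum repair_queue queue;  /* queue selected for repair */
    bool tx_queue_empty;      /* nothing queued for (re)transmission */
    uint32_t rcv_nxt;         /* next sequence expected from the peer */
    uint32_t copied_seq;      /* next sequence to hand to user space */
    uint32_t write_seq;       /* next sequence to be written */
};

/* Returns 0 on success, or -EPERM if the selected queue holds data. */
int set_queue_seq(struct sock_model *sk, uint32_t val)
{
    assert(sk != 0);
    if (sk->queue == SEND_QUEUE) {
        if (!sk->tx_queue_empty)
            return -EPERM;
        sk->write_seq = val;
        return 0;
    }
    /* Receive side: unread data exists when rcv_nxt != copied_seq. */
    if (sk->rcv_nxt != sk->copied_seq)
        return -EPERM;
    sk->rcv_nxt = val;
    sk->copied_seq = val;
    return 0;
}
```

In the reported sequence, 20 bytes were restored into the receive queue before the sequence number was changed. In this model that request is refused and the counters stay consistent.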
      
      Fixes: ee995283 ("tcp: Initial repair mode")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=212005
      Reported-by: Qingyu Li <ieatmuttonchuan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inetpeer: use div64_ul() and clamp_val() calculate inet_peer_threshold · 8bd2a055
      Yejune Deng committed
      In inet_initpeers(), struct inet_peer on IA32 nowadays uses 128 bytes.
      Get rid of the cascade and use div64_ul() and clamp_val() for a calculation
      that will not need to be adjusted in the future, as suggested by Eric Dumazet.
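
The divide-then-clamp idea can be sketched in userspace as follows. `clamp_u64()` stands in for the kernel's `clamp_val()`, plain division stands in for `div64_ul()`, and the `fraction`, `lo` and `hi` parameters are illustrative assumptions rather than the values used by the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp v into [lo, hi]; a userspace stand-in for clamp_val(). */
static uint64_t clamp_u64(uint64_t v, uint64_t lo, uint64_t hi)
{
    assert(lo <= hi);
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Derive a peer-count threshold from the amount of RAM.
 * `fraction`, `lo` and `hi` are illustrative parameters. */
uint64_t peer_threshold(uint64_t ram_bytes, uint64_t entry_bytes,
                        uint64_t fraction, uint64_t lo, uint64_t hi)
{
    assert(entry_bytes > 0 && fraction > 0);
    /* One division replaces a cascade of "if RAM <= X" steps. */
    uint64_t entries = ram_bytes / (fraction * entry_bytes);

    return clamp_u64(entries, lo, hi);
}
```

Because the result scales with both the RAM size and the entry size, a change in `sizeof(struct inet_peer)` no longer requires retuning a table of thresholds.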

      Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Yejune Deng <yejune.deng@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: always use icmp{,v6}_ndo_send from ndo_start_xmit · 4372339e
      Jason A. Donenfeld committed
      There were a few remaining tunnel drivers that didn't receive the prior
      conversion to icmp{,v6}_ndo_send. Knowing now that this could lead to
      memory corruption (see ee576c47 ("net: icmp: pass zeroed opts from
      icmp{,v6}_ndo_send before sending") for details), it is even more
      imperative to have these all converted. So this commit goes through the
      remaining cases that I could find and does a boring translation to the
      ndo variety.
      
      The Fixes: line below is the merge that originally added icmp{,v6}_
      ndo_send and converted the first batch of icmp{,v6}_send users. The
      rationale then for the change applies equally to this patch. It's just
      that these drivers were left out of the initial conversion because these
      network devices are hiding in net/ rather than in drivers/net/.
      
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: David Ahern <dsahern@kernel.org>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Fixes: 803381f9 ("Merge branch 'icmp-account-for-NAT-when-sending-icmps-from-ndo-layer'")
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 01 Mar, 2021 1 commit
    • net: Fix gro aggregation for udp encaps with zero csum · 89e5c58f
      Daniel Borkmann committed
      We noticed a GRO issue for UDP-based encaps such as vxlan/geneve when the
      csum for the UDP header itself is 0. In that case, GRO aggregation does
      not take place on the phys dev, but instead is deferred to the vxlan/geneve
      driver (see trace below).
      
      The reason is essentially that GRO aggregation bails out in udp_gro_receive()
      in such a case when drivers (ice, i40e, others) marked the skb with
      CHECKSUM_UNNECESSARY. For non-zero csums, 2abb7cdc ("udp: Add support for doing
      checksum unnecessary conversion") promotes those skbs to CHECKSUM_COMPLETE
      and the napi context has csum_valid set. This is however not the case for a
      zero UDP csum (here: csum_cnt is still 0 and csum_valid continues to be false).
      
      At the same time 57c67ff4 ("udp: additional GRO support") added matches
      on !uh->check ^ !uh2->check as part of determining candidates for aggregation,
      so it certainly is expected to handle zero csums in udp_gro_receive(). The
      purpose of the check added via 662880f4 ("net: Allow GRO to use and set
      levels of checksum unnecessary") seems to be to catch a bad csum and stop
      aggregation right away.
      
      One way to fix aggregation in the zero case is to only perform the !csum_valid
      check in udp_gro_receive() if uh->check is in fact non-zero.
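
The change in the bail-out condition can be illustrated with two predicates. This is a deliberately reduced model: `struct gro_state` keeps only two of the fields named in the text, and the real condition in udp_gro_receive() involves further terms that are omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the per-packet state named in the text. */
struct gro_state {
    uint16_t uh_check;  /* UDP checksum field from the header */
    bool csum_valid;    /* checksum has already been validated */
};

/* Before: any packet without a validated checksum was refused. */
bool gro_bails_before(const struct gro_state *s)
{
    assert(s != 0);
    return !s->csum_valid;
}

/* After: validation is only demanded when a checksum is present. */
bool gro_bails_after(const struct gro_state *s)
{
    assert(s != 0);
    return s->uh_check != 0 && !s->csum_valid;
}
```

A packet with a zero checksum and `csum_valid` unset is refused by the first predicate and accepted by the second, while packets carrying a non-zero checksum are treated the same by both.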
      
      Before:
      
        [...]
        swapper     0 [008]   731.946506: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497100400 len=1500   (1)
        swapper     0 [008]   731.946507: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497100200 len=1500
        swapper     0 [008]   731.946507: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497101100 len=1500
        swapper     0 [008]   731.946508: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497101700 len=1500
        swapper     0 [008]   731.946508: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497101b00 len=1500
        swapper     0 [008]   731.946508: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497100600 len=1500
        swapper     0 [008]   731.946508: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497100f00 len=1500
        swapper     0 [008]   731.946509: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497100a00 len=1500
        swapper     0 [008]   731.946516: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497100500 len=1500
        swapper     0 [008]   731.946516: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497100700 len=1500
        swapper     0 [008]   731.946516: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497101d00 len=1500   (2)
        swapper     0 [008]   731.946517: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497101000 len=1500
        swapper     0 [008]   731.946517: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497101c00 len=1500
        swapper     0 [008]   731.946517: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497101400 len=1500
        swapper     0 [008]   731.946518: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497100e00 len=1500
        swapper     0 [008]   731.946518: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497101600 len=1500
        swapper     0 [008]   731.946521: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497100800 len=774
        swapper     0 [008]   731.946530: net:netif_receive_skb: dev=test_vxlan skbaddr=0xffff966497100400 len=14032 (1)
        swapper     0 [008]   731.946530: net:netif_receive_skb: dev=test_vxlan skbaddr=0xffff966497101d00 len=9112  (2)
        [...]
      
        # netperf -H 10.55.10.4 -t TCP_STREAM -l 20
        MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.55.10.4 () port 0 AF_INET : demo
        Recv   Send    Send
        Socket Socket  Message  Elapsed
        Size   Size    Size     Time     Throughput
        bytes  bytes   bytes    secs.    10^6bits/sec
      
         87380  16384  16384    20.01    13129.24
      
      After:
      
        [...]
        swapper     0 [026]   521.862641: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff93ab0d479000 len=11286 (1)
        swapper     0 [026]   521.862643: net:netif_receive_skb: dev=test_vxlan skbaddr=0xffff93ab0d479000 len=11236 (1)
        swapper     0 [026]   521.862650: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff93ab0d478500 len=2898  (2)
        swapper     0 [026]   521.862650: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff93ab0d479f00 len=8490  (3)
        swapper     0 [026]   521.862653: net:netif_receive_skb: dev=test_vxlan skbaddr=0xffff93ab0d478500 len=2848  (2)
        swapper     0 [026]   521.862653: net:netif_receive_skb: dev=test_vxlan skbaddr=0xffff93ab0d479f00 len=8440  (3)
        [...]
      
        # netperf -H 10.55.10.4 -t TCP_STREAM -l 20
        MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.55.10.4 () port 0 AF_INET : demo
        Recv   Send    Send
        Socket Socket  Message  Elapsed
        Size   Size    Size     Time     Throughput
        bytes  bytes   bytes    secs.    10^6bits/sec
      
         87380  16384  16384    20.01    24576.53
      
      Fixes: 57c67ff4 ("udp: additional GRO support")
      Fixes: 662880f4 ("net: Allow GRO to use and set levels of checksum unnecessary")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Cc: Tom Herbert <tom@herbertland.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/r/20210226212248.8300-1-daniel@iogearbox.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  7. 27 Feb, 2021 2 commits
  8. 24 Feb, 2021 2 commits
    • net: remove cmsg restriction from io_uring based send/recvmsg calls · e5493796
      Jens Axboe committed
      No need to restrict these anymore, as the worker threads are direct
      clones of the original task. Hence we know for a fact that we can
      support anything that the regular task can.
      
      Since the only user of proto_ops->flags was to flag PROTO_CMSG_DATA_ONLY,
      kill the member and the flag definition too.

      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • net: icmp: pass zeroed opts from icmp{,v6}_ndo_send before sending · ee576c47
      Jason A. Donenfeld committed
      The icmp{,v6}_send functions make all sorts of use of skb->cb, casting
      it with IPCB or IP6CB, assuming the skb to have come directly from the
      inet layer. But when the packet comes from the ndo layer, especially
      when forwarded, there's no telling what might be in skb->cb at that
      point. As a result, the icmp sending code risks reading bogus memory
      contents, which can result in nasty stack overflows such as this one
      reported by a user:
      
          panic+0x108/0x2ea
          __stack_chk_fail+0x14/0x20
          __icmp_send+0x5bd/0x5c0
          icmp_ndo_send+0x148/0x160
      
      In icmp_send, skb->cb is cast with IPCB and an ip_options struct is read
      from it. The optlen parameter there is of particular note, as it can
      induce writes beyond bounds. There are quite a few ways that can happen
      in __ip_options_echo. For example:
      
          // sptr/skb are attacker-controlled skb bytes
          sptr = skb_network_header(skb);
          // dptr/dopt points to stack memory allocated by __icmp_send
          dptr = dopt->__data;
          // sopt is the corrupt skb->cb in question
          if (sopt->rr) {
              optlen  = sptr[sopt->rr+1]; // corrupt skb->cb + skb->data
              soffset = sptr[sopt->rr+2]; // corrupt skb->cb + skb->data
              // this now writes potentially attacker-controlled data,
              // overflowing the stack:
              memcpy(dptr, sptr+sopt->rr, optlen);
          }
      
      In the icmpv6_send case, the story is similar, but not as dire, as only
      IP6CB(skb)->iif and IP6CB(skb)->dsthao are used. The dsthao case is
      worse than the iif case, but it is passed to ipv6_find_tlv, which does
      a bit of bounds checking on the value.
      
      This is easy to simulate by doing a `memset(skb->cb, 0x41,
      sizeof(skb->cb));` before calling icmp{,v6}_ndo_send, and it's only by
      good fortune and the rarity of icmp sending from that context that we've
      avoided reports like this until now. For example, in KASAN:
      
          BUG: KASAN: stack-out-of-bounds in __ip_options_echo+0xa0e/0x12b0
          Write of size 38 at addr ffff888006f1f80e by task ping/89
          CPU: 2 PID: 89 Comm: ping Not tainted 5.10.0-rc7-debug+ #5
          Call Trace:
           dump_stack+0x9a/0xcc
           print_address_description.constprop.0+0x1a/0x160
           __kasan_report.cold+0x20/0x38
           kasan_report+0x32/0x40
           check_memory_region+0x145/0x1a0
           memcpy+0x39/0x60
           __ip_options_echo+0xa0e/0x12b0
           __icmp_send+0x744/0x1700
      
      Actually, out of the 4 drivers that do this, only gtp zeroed the cb for
      the v4 case, while the rest did not. So this commit actually removes the
      gtp-specific zeroing, while putting the code where it belongs in the
      shared infrastructure of icmp{,v6}_ndo_send.
      
      This commit fixes the issue by passing an empty IPCB or IP6CB along to
      the functions that actually do the work. For the icmp_send, this was
      already trivial, thanks to __icmp_send providing the plumbing function.
      For icmpv6_send, this required a tiny bit of refactoring to make it
      behave like the v4 case, after which it was straightforward.
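
The shape of the fix, passing explicitly zeroed options instead of trusting the packet's scratch area, can be sketched as below. Everything here is a hypothetical userspace model: `struct pkt`, `struct ip_opts`, `send_error()` and `ndo_send_error()` are illustrative names, and the 48-byte scratch size is an assumption.

```c
#include <assert.h>
#include <string.h>

#define CB_SIZE 48  /* illustrative size of the per-packet scratch area */

struct pkt {
    /* Scratch area owned by whichever layer currently holds the packet. */
    unsigned char cb[CB_SIZE];
};

/* Hypothetical parsed options, normally overlaid on pkt.cb. */
struct ip_opts {
    unsigned char rr;  /* offset of a record-route option, 0 if none */
};

/* Worker: takes the options explicitly and never reads pkt->cb.
 * Returns 1 if it would echo options taken from the packet. */
int send_error(const struct pkt *p, const struct ip_opts *opt)
{
    assert(p != NULL && opt != NULL);
    return opt->rr ? 1 : 0;
}

/* ndo-layer entry point: cb may hold anything, so pass zeroed options. */
int ndo_send_error(const struct pkt *p)
{
    struct ip_opts opt;

    memset(&opt, 0, sizeof(opt));
    return send_error(p, &opt);
}
```

With the scratch area filled with 0x41 bytes, as in the simulation mentioned above, the ndo entry point still reports no options, whereas interpreting the scratch area as options would report a bogus record-route offset.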
      
      Fixes: a2b78e9b ("sunvnet: generate ICMP PTMUD messages for smaller port MTUs")
      Reported-by: SinYu <liuxyon@gmail.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Link: https://lore.kernel.org/netdev/CAF=yD-LOF116aHub6RMe8vB8ZpnrrnoTdqhobEx+bvoA8AsP0w@mail.gmail.com/T/
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Link: https://lore.kernel.org/r/20210223131858.72082-1-Jason@zx2c4.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  9. 16 Feb, 2021 1 commit
  10. 13 Feb, 2021 1 commit
  11. 12 Feb, 2021 4 commits
    • tcp: Sanitize CMSG flags and reserved args in tcp_zerocopy_receive. · 3c5a2fd0
      Arjun Roy committed
      Explicitly define reserved field and require it and any subsequent
      fields to be zero-valued for now. Additionally, limit the valid CMSG
      flags that tcp_zerocopy_receive accepts.
      
      Fixes: 7eeba170 ("tcp: Add receive timestamp support for receive zerocopy.")
      Signed-off-by: Arjun Roy <arjunroy@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Suggested-by: David Ahern <dsahern@gmail.com>
      Suggested-by: Leon Romanovsky <leon@kernel.org>
      Suggested-by: Jakub Kicinski <kuba@kernel.org>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipconfig: avoid use-after-free in ic_close_devs · f68cbaed
      Vladimir Oltean committed
      Due to the fact that ic_dev->dev is kept open in ic_close_devs, I had
      thought that ic_dev would not be freed either. That is not the case:
      instead, "everybody dies" when ipconfig cleans up, and only the
      net_device behind ic_dev->dev remains allocated, not ic_dev itself.
      
      This is a problem because in ic_close_devs, for every net device that
      we're about to close, we compare it against the list of lower interfaces
      of ic_dev, to figure out whether we should close it or not. But since
      ic_dev itself is subject to freeing, this means that at some point in
      the middle of the list of ipconfig interfaces, ic_dev will have been
      freed, and we would be still attempting to iterate through its list of
      lower interfaces while checking whether to bring down the remaining
      ipconfig interfaces.
      
      There are multiple ways to avoid the use-after-free: we could delay
      freeing ic_dev until the very end (outside the while loop). Or an even
      simpler one: we can observe that we don't need ic_dev when iterating
      through its lowers, only ic_dev->dev, a structure which is never freed.
      So, by keeping ic_dev->dev in a variable assigned prior to freeing
      ic_dev, we can avoid all use-after-free issues.
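
The "save the long-lived pointer first" approach can be sketched in userspace like this. The two structures are hypothetical reductions that merely borrow the names from the text, and `close_devs()` only counts matches where the real code would decide whether to bring a device down.

```c
#include <assert.h>
#include <stdlib.h>

struct net_device { int id; };  /* outlives the list entries */

struct ic_device {
    struct ic_device *next;
    struct net_device *dev;
};

/* Free every list entry while still consulting the selected entry's
 * net_device. Returns how many entries referred to that device. */
int close_devs(struct ic_device *head, struct ic_device *selected)
{
    /* Save the long-lived pointer before any entry is freed. */
    struct net_device *selected_dev = selected ? selected->dev : NULL;
    int matches = 0;

    while (head) {
        struct ic_device *d = head;

        head = d->next;
        if (d->dev == selected_dev)  /* never touches `selected` itself */
            matches++;
        free(d);
    }
    return matches;
}
```

Even when the selected entry is the first one to be freed, later iterations only use the saved `selected_dev` pointer, so no freed entry is dereferenced.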
      
      Fixes: 46acf7bd ("Revert "net: ipv4: handle DSA enabled master network devices"")
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      tcp: add some entropy in __inet_hash_connect() · c579bd1b
      Eric Dumazet committed
      Even when implementing RFC 6056 3.3.4 (Algorithm 4: Double-Hash
      Port Selection Algorithm), a patient attacker could still be able
      to collect enough state from an otherwise idle host.
      
      The idea of this patch is to inject some noise in the
      cases where __inet_hash_connect() finds a candidate on the
      first attempt.
      
      This noise should not significantly reduce collision
      avoidance, and should be zero if the connection table
      is already well used.
      
      Note that this is not implementing RFC 6056 3.3.5
      because we think Algorithm 5 could hurt typical
      workloads.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: David Dworken <ddworken@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c579bd1b
    • E
      tcp: change source port randomization at connect() time · 190cc824
      Eric Dumazet committed
      RFC 6056 (Recommendations for Transport-Protocol Port Randomization)
      provides a good summary of why source port selection needs extra care.
      
      David Dworken reminded us that Linux implements Algorithm 3
      as described in RFC 6056 3.3.3.
      
      Quoting David:
         In the context of the web, this creates an interesting info leak where
         websites can count how many TCP connections a user's computer is
         establishing over time. For example, this allows a website to count
         exactly how many subresources a third party website loaded.
         This also allows:
         - Distinguishing between different users behind a VPN based on
             distinct source port ranges.
         - Tracking users over time across multiple networks.
         - Covert communication channels between different browsers/browser
             profiles running on the same computer
         - Tracking what applications are running on a computer based on
             the pattern of how fast source ports are getting incremented.
      
      Section 3.3.4 describes an enhancement that reduces an
      attacker's ability to use the basic information currently
      stored in the shared 'u32 hint'.
      
      This change also decreases collision rate when
      multiple applications need to connect() to
      different destinations.
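
For orientation, the double-hash idea from RFC 6056 section 3.3.4 can be sketched in plain C as below. This is a userspace illustration and not the kernel's `__inet_hash_connect()`: the struct, the function names, the table length, the port range and `toy_hash()` are all assumptions made for the example, and `toy_hash()` is a placeholder mixer, not a keyed cryptographic hash. The sketch shows one hash picking a starting offset and a second hash picking which counter in a small table is incremented, so that unrelated destinations do not all advance one shared counter.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_LENGTH  16u     /* assumed size, chosen for the example */
#define MIN_EPHEMERAL 32768u  /* assumed ephemeral port range */
#define MAX_EPHEMERAL 60999u

struct port_selector {
    uint32_t table[TABLE_LENGTH]; /* per-bucket counters; a real system
                                     would randomize these at boot */
    uint32_t key1, key2;          /* secrets for the two hashes */
};

/* Placeholder mixing function. NOT cryptographic; illustration only. */
static uint32_t toy_hash(uint32_t a, uint32_t b, uint32_t c, uint32_t key)
{
    uint32_t h = key ^ 0x9e3779b9u;

    h = (h ^ a) * 0x85ebca6bu;
    h = (h ^ b) * 0xc2b2ae35u;
    h = (h ^ c) * 0x27d4eb2fu;
    return h ^ (h >> 15);
}

/* Callback deciding whether a candidate port is usable. */
typedef bool (*port_ok_fn)(uint16_t port, void *ctx);

/* Example callback that accepts every candidate. */
static bool always_free(uint16_t port, void *ctx)
{
    (void)port;
    (void)ctx;
    return true;
}

/* Returns a port in [MIN_EPHEMERAL, MAX_EPHEMERAL], or 0 if none is usable. */
static uint16_t select_port(struct port_selector *s, uint32_t laddr,
                            uint32_t raddr, uint16_t rport,
                            port_ok_fn ok, void *ctx)
{
    const uint32_t num = MAX_EPHEMERAL - MIN_EPHEMERAL + 1;
    /* First hash: where in the port range this destination starts. */
    uint32_t offset = toy_hash(laddr, raddr, rport, s->key1) % num;
    /* Second hash: which counter this destination shares. */
    uint32_t index = toy_hash(laddr, raddr, rport, s->key2) % TABLE_LENGTH;
    uint32_t count;

    for (count = num; count > 0; count--) {
        uint32_t port = MIN_EPHEMERAL + (offset + s->table[index]) % num;

        s->table[index]++;
        if (ok((uint16_t)port, ctx))
            return (uint16_t)port;
    }
    return 0;
}
```

With a zeroed table and a callback that accepts everything, two consecutive calls for the same destination return adjacent ports, while a destination that hashes to a different table slot does not observe that increment.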
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: David Dworken <ddworken@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      190cc824
  12. 09 Feb, 2021 2 commits
    • A
      IPv4: Extend 'fib_notify_on_flag_change' sysctl · 648106c3
      Amit Cohen committed
      Add the value '2' to 'fib_notify_on_flag_change' to allow sending
      notifications only for failed route installation.
      
      A separate value is added for such notifications because there are fewer
      of them, so they do not impact performance, and some users will find them
      more important.
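
The resulting three-way behaviour can be modelled with a short C sketch. It is an illustration, not the kernel implementation: the enum names, `struct route_hw_flags` and `should_notify()` are hypothetical, and the meanings of values 0 (no notifications) and 1 (notify on any hardware flag change) are assumed from the pre-existing sysctl rather than stated in this commit message.

```c
#include <stdbool.h>

/* Sysctl values; the enum names are hypothetical. */
enum notify_mode {
    NOTIFY_OFF         = 0, /* assumed: never notify */
    NOTIFY_ALL         = 1, /* assumed: notify on any hardware flag change */
    NOTIFY_FAILED_ONLY = 2, /* new: notify only when offload failure changes */
};

/* Simplified view of a route's hardware state; not a kernel struct. */
struct route_hw_flags {
    bool offload;
    bool trap;
    bool offload_failed;
};

/* Decide whether a change from `old` to `cur` should produce a notification. */
static bool should_notify(int mode, struct route_hw_flags old,
                          struct route_hw_flags cur)
{
    bool failed_changed = old.offload_failed != cur.offload_failed;
    bool any_changed = failed_changed ||
                       old.offload != cur.offload ||
                       old.trap != cur.trap;

    switch (mode) {
    case NOTIFY_ALL:
        return any_changed;
    case NOTIFY_FAILED_ONLY:
        return failed_changed;
    default:
        return false;
    }
}
```

In this model, a route that is offloaded successfully produces a notification under value 1 but not under value 2, which is why value 2 generates far less traffic on a healthy system.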
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      648106c3
    • A
      IPv4: Add "offload failed" indication to routes · 36c5100e
      Amit Cohen committed
      After installing a route to the kernel, user space receives an
      acknowledgment, which means the route was installed in the kernel, but not
      necessarily in hardware.
      
      The asynchronous nature of route installation in hardware can lead to a
      routing daemon advertising a route before it was actually installed in
      hardware. This can result in packet loss or mis-routed packets until the
      route is installed in hardware.
      
      To avoid such cases, a previous patch set added the ability to emit
      RTM_NEWROUTE notifications whenever the RTM_F_OFFLOAD/RTM_F_TRAP flags
      are changed. This behavior is controlled by a sysctl.
      
      With the above mentioned behavior, it is possible to know from user-space
      if the route was offloaded, but if the offload fails there is no indication
      to user-space. Following a failure, a routing daemon will wait indefinitely
      for a notification that will never come.
      
      This patch adds an "offload_failed" indication to IPv4 routes, so that
      users will have better visibility into the offload process.
      
      'struct fib_alias' and 'struct fib_rt_info' are extended with a new field
      that indicates whether route offload failed. Note that the new field uses
      a previously unused bit, so the size of the structs does not increase.
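
The size claim can be illustrated with two simplified bit-field structs. These are hypothetical stand-ins, not the real `struct fib_alias` or `struct fib_rt_info`, and the layout of bit-fields is implementation-defined; the sketch assumes a compiler such as GCC or Clang that packs `uint8_t` bit-fields into a single byte.

```c
#include <stdint.h>

/* Hypothetical flag byte before the change: two flags, six spare bits. */
struct flags_before {
    uint8_t offload:1,
            trap:1,
            unused:6;
};

/* After the change: one spare bit is turned into offload_failed. */
struct flags_after {
    uint8_t offload:1,
            trap:1,
            offload_failed:1,
            unused:5;
};

/* Compile-time check that taking a spare bit did not grow the struct. */
_Static_assert(sizeof(struct flags_before) == sizeof(struct flags_after),
               "adding offload_failed must not change the struct size");
```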
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      36c5100e
  13. 07 Feb, 2021 1 commit
    • V
      Revert "net: ipv4: handle DSA enabled master network devices" · 46acf7bd
      Vladimir Oltean committed
      This reverts commit 728c0208.
      
      Since 2015, DSA has gained more integration with the network stack, so
      we can now have the same functionality without explicitly open-coding
      for it:
      - It now opens the DSA master netdevice automatically whenever a user
        netdevice is opened.
      - The master and switch interfaces are coupled in an upper/lower
        hierarchy using the netdev adjacency lists.
      
      In the nfsroot example below, the interface chosen by autoconfig was
      swp3, and every interface except that and the DSA master, eth1, was
      brought down afterwards:
      
      [    8.714215] mscc_felix 0000:00:00.5 swp0 (uninitialized): PHY [0000:00:00.3:10] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
      [    8.978041] mscc_felix 0000:00:00.5 swp1 (uninitialized): PHY [0000:00:00.3:11] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
      [    9.246134] mscc_felix 0000:00:00.5 swp2 (uninitialized): PHY [0000:00:00.3:12] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
      [    9.486203] mscc_felix 0000:00:00.5 swp3 (uninitialized): PHY [0000:00:00.3:13] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
      [    9.512827] mscc_felix 0000:00:00.5: configuring for fixed/internal link mode
      [    9.521047] mscc_felix 0000:00:00.5: Link is Up - 2.5Gbps/Full - flow control off
      [    9.530382] device eth1 entered promiscuous mode
      [    9.535452] DSA: tree 0 setup
      [    9.539777] printk: console [netcon0] enabled
      [    9.544504] netconsole: network logging started
      [    9.555047] fsl_enetc 0000:00:00.2 eth1: configuring for fixed/internal link mode
      [    9.562790] fsl_enetc 0000:00:00.2 eth1: Link is Up - 1Gbps/Full - flow control off
      [    9.564661] 8021q: adding VLAN 0 to HW filter on device bond0
      [    9.637681] fsl_enetc 0000:00:00.0 eth0: PHY [0000:00:00.0:02] driver [Qualcomm Atheros AR8031/AR8033] (irq=POLL)
      [    9.655679] fsl_enetc 0000:00:00.0 eth0: configuring for inband/sgmii link mode
      [    9.666611] mscc_felix 0000:00:00.5 swp0: configuring for inband/qsgmii link mode
      [    9.676216] 8021q: adding VLAN 0 to HW filter on device swp0
      [    9.682086] mscc_felix 0000:00:00.5 swp1: configuring for inband/qsgmii link mode
      [    9.690700] 8021q: adding VLAN 0 to HW filter on device swp1
      [    9.696538] mscc_felix 0000:00:00.5 swp2: configuring for inband/qsgmii link mode
      [    9.705131] 8021q: adding VLAN 0 to HW filter on device swp2
      [    9.710964] mscc_felix 0000:00:00.5 swp3: configuring for inband/qsgmii link mode
      [    9.719548] 8021q: adding VLAN 0 to HW filter on device swp3
      [    9.747811] Sending DHCP requests ..
      [   12.742899] mscc_felix 0000:00:00.5 swp1: Link is Up - 1Gbps/Full - flow control rx/tx
      [   12.743828] mscc_felix 0000:00:00.5 swp0: Link is Up - 1Gbps/Full - flow control off
      [   12.747062] IPv6: ADDRCONF(NETDEV_CHANGE): swp1: link becomes ready
      [   12.755216] fsl_enetc 0000:00:00.0 eth0: Link is Up - 1Gbps/Full - flow control rx/tx
      [   12.766603] IPv6: ADDRCONF(NETDEV_CHANGE): swp0: link becomes ready
      [   12.783188] mscc_felix 0000:00:00.5 swp2: Link is Up - 1Gbps/Full - flow control rx/tx
      [   12.785354] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
      [   12.799535] IPv6: ADDRCONF(NETDEV_CHANGE): swp2: link becomes ready
      [   13.803141] mscc_felix 0000:00:00.5 swp3: Link is Up - 1Gbps/Full - flow control rx/tx
      [   13.811646] IPv6: ADDRCONF(NETDEV_CHANGE): swp3: link becomes ready
      [   15.452018] ., OK
      [   15.470336] IP-Config: Got DHCP answer from 10.0.0.1, my address is 10.0.0.39
      [   15.477887] IP-Config: Complete:
      [   15.481330]      device=swp3, hwaddr=00:04:9f:05:de:0a, ipaddr=10.0.0.39, mask=255.255.255.0, gw=10.0.0.1
      [   15.491846]      host=10.0.0.39, domain=(none), nis-domain=(none)
      [   15.498429]      bootserver=10.0.0.1, rootserver=10.0.0.1, rootpath=
      [   15.498481]      nameserver0=8.8.8.8
      [   15.627542] fsl_enetc 0000:00:00.0 eth0: Link is Down
      [   15.690903] mscc_felix 0000:00:00.5 swp0: Link is Down
      [   15.745216] mscc_felix 0000:00:00.5 swp1: Link is Down
      [   15.800498] mscc_felix 0000:00:00.5 swp2: Link is Down
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      46acf7bd
  14. 05 Feb, 2021 2 commits
  15. 04 Feb, 2021 4 commits
  16. 03 Feb, 2021 4 commits
  17. 02 Feb, 2021 2 commits
  18. 30 Jan, 2021 1 commit