1. 14 Aug, 2021 7 commits
  2. 13 Aug, 2021 1 commit
    • vsock/virtio: avoid potential deadlock when vsock device remove · 49b0b6ff
      Committed by Longpeng(Mike)
      There is a potential deadlock when removing the vsock device or
      processing the RESET event:
      
        vsock_for_each_connected_socket:
            spin_lock_bh(&vsock_table_lock) ----------- (1)
            ...
                virtio_vsock_reset_sock:
                    lock_sock(sk) --------------------- (2)
            ...
            spin_unlock_bh(&vsock_table_lock)
      
      lock_sock() may voluntarily schedule when the 'sk' is owned by
      another thread at the same time, in which case we receive a
      "scheduling while atomic" warning.
      
      Even worse, if the next task (selected by the scheduler) tries to
      release a 'sk', it needs to acquire vsock_table_lock, so a deadlock
      occurs and the system ends up in a softlockup state.
        Call trace:
         queued_spin_lock_slowpath
         vsock_remove_bound
         vsock_remove_sock
         virtio_transport_release
         __vsock_release
         vsock_release
         __sock_release
         sock_close
         __fput
         ____fput
      
      So we should not take sk_lock in this case, which matches the
      behavior of vhost_vsock and vmci.
      
      Fixes: 0ea9e1d3 ("VSOCK: Introduce virtio_transport.ko")
      Cc: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
      Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
      Link: https://lore.kernel.org/r/20210812053056.1699-1-longpeng2@huawei.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  3. 12 Aug, 2021 11 commits
    • net: dsa: tag_8021q: don't broadcast during setup/teardown · 724395f4
      Committed by Vladimir Oltean
      Currently, on my board with multiple sja1105 switches in disjoint trees
      described in commit f66a6a69 ("net: dsa: permit cross-chip bridging
      between all trees in the system"), rebooting the board triggers the
      following benign warnings:
      
      [   12.345566] sja1105 spi2.0: port 0 failed to notify tag_8021q VLAN 1088 deletion: -ENOENT
      [   12.353804] sja1105 spi2.0: port 0 failed to notify tag_8021q VLAN 2112 deletion: -ENOENT
      [   12.362019] sja1105 spi2.0: port 1 failed to notify tag_8021q VLAN 1089 deletion: -ENOENT
      [   12.370246] sja1105 spi2.0: port 1 failed to notify tag_8021q VLAN 2113 deletion: -ENOENT
      [   12.378466] sja1105 spi2.0: port 2 failed to notify tag_8021q VLAN 1090 deletion: -ENOENT
      [   12.386683] sja1105 spi2.0: port 2 failed to notify tag_8021q VLAN 2114 deletion: -ENOENT
      
      Basically switch 1 calls dsa_tag_8021q_unregister, and switch 1's TX and
      RX VLANs cannot be found on switch 2's CPU port.
      
      But why would switch 2 even attempt to delete switch 1's TX and RX
      tag_8021q VLANs from its CPU port? Well, because we use dsa_broadcast,
      and it is supposed that it had added those VLANs in the first place
      (because in dsa_port_tag_8021q_vlan_match, all CPU ports match
      regardless of their tree index or switch index).
      
      The two trees probe asynchronously, and when switch 1 probed, it called
      dsa_broadcast which did not notify the tree of switch 2, because that
      tree had not probed yet. But during unbind, switch 2's tree _is_
      probed, so it _is_ notified of the deletion.
      
      Before jumping to introduce a synchronization mechanism between the
      probing across disjoint switch trees, let's take a step back and see
      whether we _need_ to do that in the first place.
      
      The RX and TX VLANs of switch 1 would be needed on switch 2's CPU port
      only if switch 1 and 2 were part of a cross-chip bridge. And
      dsa_tag_8021q_bridge_join takes care precisely of that (but if probing
      was synchronous, the bridge_join would just end up bumping the VLANs'
      refcount, because they are already installed by the setup path).
      
      Since by the time the ports are bridged, all DSA trees are already set
      up, and we don't need the tag_8021q VLANs of one switch installed on the
      other switches during probe time, the answer is that we don't need to
      fix the synchronization issue.
      
      So make the setup and teardown code paths call dsa_port_notify, which
      notifies only the local tree, and make the bridge code paths call
      dsa_broadcast, which lets the other trees know as well.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: print more information when a cross-chip notifier fails · ab97462b
      Committed by Vladimir Oltean
      Currently this error message does not say a lot:
      
      [   32.693498] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT
      [   32.699725] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT
      [   32.705931] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT
      [   32.712139] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT
      [   32.718347] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT
      [   32.724554] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT
      
      but in this form, it is immediately obvious (at least to me) what the
      problem is, even without further looking at the code:
      
      [   12.345566] sja1105 spi2.0: port 0 failed to notify tag_8021q VLAN 1088 deletion: -ENOENT
      [   12.353804] sja1105 spi2.0: port 0 failed to notify tag_8021q VLAN 2112 deletion: -ENOENT
      [   12.362019] sja1105 spi2.0: port 1 failed to notify tag_8021q VLAN 1089 deletion: -ENOENT
      [   12.370246] sja1105 spi2.0: port 1 failed to notify tag_8021q VLAN 2113 deletion: -ENOENT
      [   12.378466] sja1105 spi2.0: port 2 failed to notify tag_8021q VLAN 1090 deletion: -ENOENT
      [   12.386683] sja1105 spi2.0: port 2 failed to notify tag_8021q VLAN 2114 deletion: -ENOENT
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • pktgen: Add output for imix results · 769afb3f
      Committed by Nick Richardson
      The bps for imix mode is calculated by:
      sum(imix_entry.size) / time_elapsed
      
      The actual counts of each imix_entry are displayed under the
      "Current:" section of the interface output in the following format:
      imix_size_counts: size_1,count_1 size_2,count_2 ... size_n,count_n
      
      Example (count = 200000):
      imix_weights: 256,1 859,3 205,2
      imix_size_counts: 256,32082 859,99796 205,68122
      Result: OK: 17992362(c17964678+d27684) usec, 200000 (859byte,0frags)
        11115pps 47Mb/sec (47977140bps) errors: 0
      
      Summary of changes:
      Calculate bps based on imix counters when in IMIX mode.
      Add output for IMIX counters.
      Signed-off-by: Nick Richardson <richardsonnick@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
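As a rough illustration of the throughput arithmetic in the imix results commit, the C sketch below sums size times count over the imix entries and divides by the elapsed time. The struct and function names are hypothetical and are not taken from pktgen. It also assumes that the formula in the message means the total bytes sent across all entries, converted to bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-entry counter; the names are illustrative only. */
struct imix_count {
    uint64_t size;   /* packet size in bytes */
    uint64_t count;  /* packets sent with this size */
};

/* Total bytes sent across all imix entries. */
static uint64_t imix_total_bytes(const struct imix_count *c, size_t n)
{
    uint64_t bytes = 0;

    for (size_t i = 0; i < n; i++)
        bytes += c[i].size * c[i].count;
    return bytes;
}

/* Bits per second over an elapsed time given in microseconds. */
static uint64_t imix_bps(const struct imix_count *c, size_t n,
                         uint64_t elapsed_usec)
{
    assert(elapsed_usec > 0);
    return imix_total_bytes(c, n) * 8 * 1000000 / elapsed_usec;
}
```

With the counts from the example (256,32082 859,99796 205,68122) and 17992362 usec elapsed, this gives 107902766 bytes and roughly 47 Mb/sec, in line with the quoted result line.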
    • pktgen: Add imix distribution bins · 90149031
      Committed by Nick Richardson
      In order to represent the distribution of imix packet sizes, a
      pre-computed data structure is used. It features 100 (IMIX_PRECISION)
      "bins". Contiguous ranges of these bins represent the respective
      packet size of each imix entry. This is done to avoid the overhead of
      selecting the correct imix packet size based on the corresponding weights.
      
      Example:
      imix_weights 40,7 576,4 1500,1
      total_weight = 7 + 4 + 1 = 12
      
      pkt_size 40 occurs 7/total_weight = 58% of the time
      pkt_size 576 occurs 4/total_weight = 33% of the time
      pkt_size 1500 occurs 1/total_weight = 9% of the time
      
      We generate a random number in the range 0-99 and select the
      corresponding packet size based on the specified weights.
      E.g. random number = 358723895 % 100 = 95
      Selects the packet size corresponding to index:95 in the pre-computed
      imix_distribution array.
      
      For the weights above, the pre-computed imix_distribution array will
      look like the following:
      0        ->  0 (index of imix_entry.size == 40)
      1        ->  0 (index of imix_entry.size == 40)
      2        ->  0 (index of imix_entry.size == 40)
      [...]    ->  0 (index of imix_entry.size == 40)
      57       ->  0 (index of imix_entry.size == 40)
      58       ->  1 (index of imix_entry.size == 576)
      [...]    ->  1 (index of imix_entry.size == 576)
      90       ->  1 (index of imix_entry.size == 576)
      91       ->  2 (index of imix_entry.size == 1500)
      [...]    ->  2 (index of imix_entry.size == 1500)
      99       ->  2 (index of imix_entry.size == 1500)
      
      Create and use "bin" representation of the imix distribution.
      Signed-off-by: Nick Richardson <richardsonnick@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
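The bin table from the imix distribution commit can be reproduced with a short C sketch. This is a minimal version under stated assumptions, not the pktgen code itself: the struct layout and function name are hypothetical, each entry's share is rounded down, and the last entry absorbs whatever bins remain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IMIX_PRECISION 100  /* number of bins, as named in the commit message */

/* Hypothetical entry layout for illustration. */
struct imix_entry {
    uint32_t size;    /* packet size in bytes */
    uint32_t weight;  /* relative weight */
};

/* Fill dist[] so that dist[bin] is the index of the entry owning that bin. */
static void fill_imix_distribution(const struct imix_entry *e, size_t n,
                                   uint8_t dist[IMIX_PRECISION])
{
    uint64_t total = 0, cum = 0;
    size_t bin = 0;

    assert(n > 0 && n <= IMIX_PRECISION);
    for (size_t i = 0; i < n; i++)
        total += e[i].weight;
    assert(total > 0);

    /* Every entry but the last gets floor(weight * 100 / total) bins. */
    for (size_t i = 0; i + 1 < n; i++) {
        cum += (uint64_t)e[i].weight * IMIX_PRECISION / total;
        while (bin < cum)
            dist[bin++] = (uint8_t)i;
    }
    /* The last entry takes the remaining bins, including rounding slack. */
    while (bin < IMIX_PRECISION)
        dist[bin++] = (uint8_t)(n - 1);
}
```

For imix_weights 40,7 576,4 1500,1 this assigns bins 0-57 to entry 0, bins 58-90 to entry 1 and bins 91-99 to entry 2. Selecting a size is then a single array lookup on a random value modulo IMIX_PRECISION.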
    • pktgen: Parse internet mix (imix) input · 52a62f86
      Committed by Nick Richardson
      Adds "imix_weights" command for specifying internet mix distribution.
      
      The command is in this format:
      "imix_weights size_1,weight_1 size_2,weight_2 ... size_n,weight_n"
      where the probability that packet size_i is picked is:
      weight_i / (weight_1 + weight_2 + .. + weight_n)
      
      The user may provide up to 100 imix entries (size_i,weight_i) in this
      command.
      
      The user specified imix entries will be displayed in the "Params"
      section of the interface output.
      
      clone_skb values greater than 0 are not supported in IMIX mode.
      
      Summary of changes:
      Add flag for enabling internet mix mode.
      Add command (imix_weights) for internet mix input.
      Return -ENOTSUPP when clone_skb > 0 in IMIX mode.
      Display imix_weights in Params.
      Create data structures to store imix entries and distribution.
      Signed-off-by: Nick Richardson <richardsonnick@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
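To make the imix_weights argument format concrete, here is a small user-space C sketch that parses the "size,weight" pairs and evaluates the selection probability weight_i / (weight_1 + ... + weight_n). It is not the kernel parser: the function names, the struct and the use of sscanf are assumptions chosen for illustration, and MAX_IMIX_ENTRIES simply mirrors the limit of 100 stated in the message.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_IMIX_ENTRIES 100  /* limit stated in the commit message */

/* Hypothetical entry layout for illustration. */
struct imix_entry {
    unsigned long size;
    unsigned long weight;
};

/*
 * Parse "size_1,weight_1 size_2,weight_2 ..." into out[].
 * Returns the number of entries, or -1 on malformed input or
 * when more than MAX_IMIX_ENTRIES pairs are given.
 */
static int parse_imix_weights(const char *s, struct imix_entry *out)
{
    int n = 0;

    assert(s != NULL && out != NULL);
    for (;;) {
        unsigned long size, weight;
        int used = 0;
        int got = sscanf(s, " %lu,%lu%n", &size, &weight, &used);

        if (got == EOF)
            return n;  /* nothing but whitespace left */
        if (got != 2 || n == MAX_IMIX_ENTRIES)
            return -1;
        out[n].size = size;
        out[n].weight = weight;
        n++;
        s += used;     /* continue after the pair just consumed */
    }
}

/* Probability of picking entry i: weight_i / sum of all weights. */
static double imix_probability(const struct imix_entry *e, int n, int i)
{
    unsigned long total = 0;

    for (int k = 0; k < n; k++)
        total += e[k].weight;
    return total ? (double)e[i].weight / (double)total : 0.0;
}
```

Parsing "40,7 576,4 1500,1" yields three entries, and the first one is picked with probability 7/12. An incomplete pair such as a trailing "576" is rejected.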
    • Revert "tipc: Return the correct errno code" · 86704993
      Committed by Hoang Le
      This reverts commit 0efea3c6 because:
      - Returning -ENOBUFS is fine on socket buffer allocation failure.
      - The change has a side effect in the calling path
      tipc_node_xmit()->tipc_link_xmit(), which checks the returned error code.
      
      Fixes: 0efea3c6 ("tipc: Return the correct errno code")
      Acked-by: Jon Maloy <jmaloy@redhat.com>
      Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: vlan: fix global vlan option range dumping · 6c4110d9
      Committed by Nikolay Aleksandrov
      When consecutive vlans have equal global options, we compress them into
      a range to save space and reduce processing time. To get the proper
      range end id, we need to update range_end whenever the options are
      equal; otherwise we get ranges whose end vlan id is the same as the
      start.
      
      Fixes: 743a53d9 ("net: bridge: vlan: add support for dumping global vlan options")
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Link: https://lore.kernel.org/r/20210810092139.11700-1-razor@blackwall.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
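The range compression in the bridge vlan commit can be pictured with the following C sketch. It is a simplified model rather than the bridge code: the structs, the single opts value standing in for a vlan's global options, and the function name are all hypothetical. The line that extends the end of the current range corresponds to the range_end update the message refers to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-vlan record; opts stands in for the global options. */
struct vlan_opt {
    uint16_t vid;
    uint32_t opts;
};

struct vlan_range {
    uint16_t start;
    uint16_t end;
    uint32_t opts;
};

/*
 * Compress vlans sorted by vid into ranges of consecutive ids with equal
 * options. out[] must have room for n ranges. Returns the range count.
 */
static size_t compress_vlan_ranges(const struct vlan_opt *v, size_t n,
                                   struct vlan_range *out)
{
    size_t nr = 0;

    assert(n == 0 || (v != NULL && out != NULL));
    for (size_t i = 0; i < n; i++) {
        if (nr > 0 && out[nr - 1].opts == v[i].opts &&
            (uint32_t)out[nr - 1].end + 1 == v[i].vid) {
            /* Equal options on the next id: move the range end forward. */
            out[nr - 1].end = v[i].vid;
            continue;
        }
        /* Otherwise start a new single-vlan range. */
        out[nr].start = v[i].vid;
        out[nr].end = v[i].vid;
        out[nr].opts = v[i].opts;
        nr++;
    }
    return nr;
}
```

For vlans 1-3 with one option value and vlans 4-5 with another, this produces the ranges 1-3 and 4-5. Omitting the end update would instead leave a range whose end equals its start, which is the symptom the message describes.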
    • mctp: Specify route types, require rtm_type in RTM_*ROUTE messages · 83f0a0b7
      Committed by Jeremy Kerr
      This change adds a 'type' attribute to routes, which can be parsed from
      an RTM_NEWROUTE message. This will help to distinguish local vs. peer
      routes in a future change.
      
      This means userspace will need to set a correct rtm_type in RTM_NEWROUTE
      and RTM_DELROUTE messages; we currently only accept RTN_UNICAST.
      Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
      Link: https://lore.kernel.org/r/20210810023834.2231088-1-jk@codeconstruct.com.au
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: igmp: increase size of mr_ifc_count · b69dd5b3
      Committed by Eric Dumazet
      Some arches support cmpxchg() on 4-byte and 8-byte only.
      Increase mr_ifc_count width to 32bit to fix this problem.
      
      Fixes: 4a2b285e ("net: igmp: fix data-race in igmp_ifc_timer_expire()")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Guenter Roeck <linux@roeck-us.net>
      Link: https://lore.kernel.org/r/20210811195715.3684218-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp_bbr: fix u32 wrap bug in round logic if bbr_init() called after 2B packets · 6de035fe
      Committed by Neal Cardwell
      Currently if BBR congestion control is initialized after more than 2B
      packets have been delivered, depending on the phase of the
      tp->delivered counter the tracking of BBR round trips can get stuck.
      
      The bug arises because if tp->delivered is between 2^31 and 2^32 at
      the time the BBR congestion control module is initialized, then the
      initialization of bbr->next_rtt_delivered to 0 will cause the logic to
      believe that the end of the round trip is still billions of packets in
      the future. More specifically, the following check will fail
      repeatedly:
      
        !before(rs->prior_delivered, bbr->next_rtt_delivered)
      
      and thus the connection will take up to 2B packets delivered before
      that check will pass and the connection will set:
      
        bbr->round_start = 1;
      
      This could cause many mechanisms in BBR to fail to trigger, for
      example bbr_check_full_bw_reached() would likely never exit STARTUP.
      
      This bug is 5 years old and has not been observed, and as a practical
      matter this would likely rarely trigger, since it would require
      transferring at least 2B packets, or likely more than 3 terabytes of
      data, before switching congestion control algorithms to BBR.
      
      This patch is a stable candidate for kernels as far back as v4.9,
      when tcp_bbr.c was added.
      
      Fixes: 0f8782ea ("tcp_bbr: add BBR congestion control")
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Reviewed-by: Yuchung Cheng <ycheng@google.com>
      Reviewed-by: Kevin Yang <yyd@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20210811024056.235161-1-ncardwell@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
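The u32 wrap in the tcp_bbr round logic is easy to reproduce in user space. The C sketch below models the wraparound-aware comparison on the kernel's before() helper and evaluates the quoted check with plain integers. The function round_ended and the demo are hypothetical. Seeding next_rtt_delivered with the current delivered count, as the demo does in its second assertion, is one way to avoid the problem; the commit text quoted here does not spell out the actual change, so treat that part as an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-aware "seq1 is before seq2" for 32-bit counters,
 * modelled on the kernel's before() helper. */
static bool before(uint32_t seq1, uint32_t seq2)
{
    return (int32_t)(seq1 - seq2) < 0;
}

/* The round-start check from the commit message, with plain arguments. */
static bool round_ended(uint32_t prior_delivered, uint32_t next_rtt_delivered)
{
    return !before(prior_delivered, next_rtt_delivered);
}

/* Hypothetical demo of the failure mode and of one possible remedy. */
void bbr_wrap_demo(void)
{
    uint32_t delivered = 3000000000u;  /* between 2^31 and 2^32 */

    /* next_rtt_delivered == 0: the counter looks "before" 0, check fails. */
    assert(!round_ended(delivered, 0));
    /* Seeded with the current counter: the check passes immediately. */
    assert(round_ended(delivered, delivered));
}
```

The comparison treats any difference of 2^31 or more as negative, so a counter that already sits in the upper half of the u32 range appears to be behind a target of 0 until it wraps.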
    • net: linkwatch: fix failure to restore device state across suspend/resume · 6922110d
      Committed by Willy Tarreau
      After migrating my laptop from 4.19-LTS to 5.4-LTS a while ago I noticed
      that my Ethernet port to which a bond and a VLAN interface are attached
      appeared to remain up after resuming from suspend with the cable unplugged
      (and that problem still persists with 5.10-LTS).
      
      The sequence of events is as follows:
      
        - the network driver (e1000e here) prepares to suspend, calls e1000e_down()
          which calls netif_carrier_off() to signal that the link is going down.
        - netif_carrier_off() adds a link_watch event to the list of events for
          this device
        - the device is completely stopped.
        - the machine suspends
        - the cable is unplugged and the machine brought to another location
        - the machine is resumed
        - the queued linkwatch events are processed for the device
        - the device doesn't yet have the __LINK_STATE_PRESENT bit and its events
          are silently dropped
        - the device is resumed with its link down
        - the upper VLAN and bond interfaces are never notified that the link had
          been turned down and remain up
        - the only way to provoke a change is to physically connect the machine
          to a port and possibly unplug it.
      
      The state after resume looks like this:
        $ ip -br li | egrep 'bond|eth'
        bond0            UP             e8:6a:64:64:64:64 <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP>
        eth0             DOWN           e8:6a:64:64:64:64 <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP>
        eth0.2@eth0      UP             e8:6a:64:64:64:64 <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP>
      
      Placing an explicit call to netdev_state_change() either in the suspend
      or the resume code in the NIC driver worked around this but the solution
      is not satisfying.
      
      The real issue is that link_watch loses events that it ought not to
      lose. The test for the device being present was added by commit
      124eee3f ("net: linkwatch: add check for netdevice being present to
      linkwatch_do_dev") in 4.20 to avoid an access to devices that are not
      present.
      
      Instead of dropping events, this patch proceeds slightly differently by
      postponing their handling so that they happen after the device is fully
      resumed.
      
      Fixes: 124eee3f ("net: linkwatch: add check for netdevice being present to linkwatch_do_dev")
      Link: https://lists.openwall.net/netdev/2018/03/15/62
      Cc: Heiner Kallweit <hkallweit1@gmail.com>
      Cc: Geert Uytterhoeven <geert+renesas@glider.be>
      Cc: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      Link: https://lore.kernel.org/r/20210809160628.22623-1-w@1wt.eu
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  4. 11 Aug, 2021 21 commits