1. 02 November 2021 (2 commits)
    • net: avoid double accounting for pure zerocopy skbs · f1a456f8
      Committed by Talal Ahmad
      Track skbs with only zerocopy data and avoid charging them to kernel
      memory to correctly account the memory utilization for msg_zerocopy.
      All of the data in such skbs is held in user pages, which are already
      accounted to the user. Before this change, they were charged again in
      the kernel in __zerocopy_sg_from_iter. The in-kernel charging is
      excessive because the data is not copied into skb frags, and it can
      push the kernel into a memory-pressure state that adversely impacts
      all sockets in the system. Mark pure zerocopy skbs with a
      SKBFL_PURE_ZEROCOPY flag and remove the charge/uncharge for data in
      such skbs.
      
      Initially, an skb is marked pure zerocopy when it is empty and on the
      zerocopy path. The skb can then change from a pure zerocopy skb to a
      mixed-data skb (zerocopy and copied data) if it is at the tail of the
      write queue, there is room available in it, and non-zerocopy data is
      sent in the next sendmsg call. At that point sk_mem_charge is done for
      the pure zerocopied data and the pure zerocopy flag is cleared. We
      found that this happens very rarely on workloads that pass MSG_ZEROCOPY.
      
      A pure zerocopy skb could later be coalesced into a normal skb if the
      two are next to each other in the queue, but this patch prevents that
      coalescing. This avoids the complexity of charging when an skb
      downgrades from pure zerocopy to mixed, and such a downgrade is also rare.
      
      In sk_wmem_free_skb, if the skb is a pure zerocopy skb, an
      sk_mem_uncharge of SKB_TRUESIZE(MAX_TCP_HEADER) is done to balance the
      sk_mem_charge performed in tcp_skb_entail for an skb without data.
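      A minimal sketch of the free-path accounting described above, assuming the
      helper shape implied by the commit text (the companion rename in the
      following entry calls this helper tcp_wmem_free_skb; the exact kernel code
      may differ):

         /* Illustrative sketch only, not the verbatim kernel change. */
         static inline bool skb_zcopy_pure(const struct sk_buff *skb)
         {
             return skb_shinfo(skb)->flags & SKBFL_PURE_ZEROCOPY;
         }

         static inline void tcp_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
         {
             sk_wmem_queued_add(sk, -skb->truesize);
             if (!skb_zcopy_pure(skb))
                 /* normal skb: return the full truesize charge */
                 sk_mem_uncharge(sk, skb->truesize);
             else
                 /* pure zerocopy skb: only the SKB_TRUESIZE(MAX_TCP_HEADER)
                  * charged in tcp_skb_entail has to be returned
                  */
                 sk_mem_uncharge(sk, SKB_TRUESIZE(MAX_TCP_HEADER));
             __kfree_skb(skb);
         }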
      
      Testing with the msg_zerocopy.c benchmark between two hosts (100G NICs)
      with zerocopy showed that before this patch the 'sock' variable in
      memory.stat for cgroup2, which tracks the sum of sk_forward_alloc,
      sk_rmem_alloc and sk_wmem_queued, is around 1822720, and with this
      change it is 0. This is because zerocopy data is no longer charged to
      sk_forward_alloc, and it shows that kernel memory utilization is lowered.
      Signed-off-by: Talal Ahmad <talalahmad@google.com>
      Acked-by: Arjun Roy <arjunroy@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: rename sk_wmem_free_skb · 03271f3a
      Committed by Talal Ahmad
      sk_wmem_free_skb() is only used by TCP.
      
      Rename it to make this clear, and move its declaration to
      include/net/tcp.h.
      Signed-off-by: Talal Ahmad <talalahmad@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Arjun Roy <arjunroy@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 01 November 2021 (9 commits)
    • amt: add mld report message handler · b75f7095
      Committed by Taehee Yoo
      In the previous patch, an IGMP report handler was added. That handler
      can be used for MLD too, so the common code is reused to parse MLD
      report messages.
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • amt: add multicast(IGMP) report message handler · bc54e49c
      Committed by Taehee Yoo
      The amt 'Relay' interface manages multicast groups (IGMP/MLD) and
      sources. In order to do that, it needs functions to parse IGMP/MLD
      report messages. So this adds the logic for parsing IGMP report
      messages and saving them in their own data structures (a rough sketch
      of these follows this entry).
      
         struct amt_group_node means one group(igmp/mld).
         struct amt_source_node means one source.
      
      The same source can't exist twice in the same group. The same group
      can exist more than once in the same tunnel, because the host address
      is managed as well.
      
      The group information is used when forwarding multicast data.
      If there are no groups in a specific tunnel, the Relay doesn't forward
      the data to it.
      
      Although the Relay manages sources, it doesn't support the source
      filtering feature; sources are tracked only to manage groups more
      accurately.
      
      In the next patch, the MLD part will be added.
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
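      A rough sketch of the bookkeeping the two structures above imply; the field
      names here are illustrative assumptions, not the actual layout in
      drivers/net/amt.c:

         /* Illustrative only: field names are assumptions. */
         struct amt_source_node {
             struct hlist_node node;     /* linked into the group's source list */
             union { __be32 ip4; struct in6_addr ip6; } source_addr;
         };

         struct amt_group_node {
             struct hlist_node node;     /* linked into the tunnel's group table */
             union { __be32 ip4; struct in6_addr ip6; } group_addr;
             union { __be32 ip4; struct in6_addr ip6; } host_addr; /* reporting host */
             bool v6;                    /* MLD (true) vs IGMP (false) */
             struct hlist_head sources;  /* struct amt_source_node entries */
         };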
    • amt: add data plane of amt interface · cbc21dc1
      Committed by Taehee Yoo
      Before forwarding multicast traffic, the amt interface establishes a
      tunnel between the gateway and the relay. In order to establish it,
      amt defines several message types, and the message flow looks like the below.
      
                            Gateway                  Relay
                            -------                  -----
                               :        Request        :
                           [1] |           N           |
                               |---------------------->|
                               |    Membership Query   | [2]
                               |    N,MAC,gADDR,gPORT  |
                               |<======================|
                           [3] |   Membership Update   |
                               |   ({G:INCLUDE({S})})  |
                               |======================>|
                               |                       |
          ---------------------:-----------------------:---------------------
         |                     |                       |                     |
         |                     |    *Multicast Data    |  *IP Packet(S,G)    |
         |                     |      gADDR,gPORT      |<-----------------() |
         |    *IP Packet(S,G)  |<======================|                     |
         | ()<-----------------|                       |                     |
         |                     |                       |                     |
          ---------------------:-----------------------:---------------------
                               ~                       ~
                               ~        Request        ~
                           [4] |           N'          |
                               |---------------------->|
                               |   Membership Query    | [5]
                               | N',MAC',gADDR',gPORT' |
                               |<======================|
                           [6] |                       |
                               |       Teardown        |
                               |   N,MAC,gADDR,gPORT   |
                               |---------------------->|
                               |                       | [7]
                               |   Membership Update   |
                               |  ({G:INCLUDE({S})})   |
                               |======================>|
                               |                       |
          ---------------------:-----------------------:---------------------
         |                     |                       |                     |
         |                     |    *Multicast Data    |  *IP Packet(S,G)    |
         |                     |     gADDR',gPORT'     |<-----------------() |
         |    *IP Packet (S,G) |<======================|                     |
         | ()<-----------------|                       |                     |
         |                     |                       |                     |
          ---------------------:-----------------------:---------------------
                               |                       |
                               :                       :
      
       1. Discovery
        - Sent by Gateway to Relay
        - To find the Relay's unique IP address
       2. Advertisement
        - Sent by Relay to Gateway
        - Contains the unique IP address
       3. Request
        - Sent by Gateway to Relay
        - Solicits a 'Query' message
       4. Query
        - Sent by Relay to Gateway
        - Contains a General Query message
       5. Update
        - Sent by Gateway to Relay
        - Contains a report message
       6. Multicast Data
        - Sent by Relay to Gateway
        - Encapsulated multicast traffic
       7. Teardown
        - Not supported at this time
       
       Except for the Teardown message, all of these message types are
       supported (an illustrative enumeration of the types follows this entry).
      
      In the next patch, IGMP/MLD logic will be added.
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
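      For reference, a sketch of the message types listed above as a C
      enumeration, following the RFC 7450 numbering; the identifier names are
      illustrative assumptions, not necessarily those used by the driver:

         /* Illustrative: values follow RFC 7450, names are assumptions. */
         enum amt_msg_type {
             AMT_MSG_DISCOVERY         = 1,
             AMT_MSG_ADVERTISEMENT     = 2,
             AMT_MSG_REQUEST           = 3,
             AMT_MSG_MEMBERSHIP_QUERY  = 4,
             AMT_MSG_MEMBERSHIP_UPDATE = 5,
             AMT_MSG_MULTICAST_DATA    = 6,
             AMT_MSG_TEARDOWN          = 7,  /* not handled by this patch */
         };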
    • amt: add control plane of amt interface · b9022b53
      Committed by Taehee Yoo
      It adds definitions and control plane code for AMT. This is very
      similar to UDP tunneling interfaces such as gtp, vxlan, etc.
      In the next patch, data plane code will be added.
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
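      Since the control plane is described as being close to the other UDP tunnel
      drivers, here is a minimal sketch of the usual UDP tunnel socket setup;
      amt_encap_rcv and the encap type value are placeholders, not the driver's
      actual names:

         #include <net/udp_tunnel.h>

         /* Placeholder receive hook: the real driver parses the AMT header here. */
         static int amt_encap_rcv(struct sock *sk, struct sk_buff *skb)
         {
             kfree_skb(skb);
             return 0;   /* packet consumed */
         }

         static int amt_create_sock(struct net *net, __be16 port, struct socket **sockp)
         {
             struct udp_tunnel_sock_cfg tunnel_cfg = { };
             struct udp_port_cfg udp_conf = { };
             int err;

             udp_conf.family = AF_INET;
             udp_conf.local_udp_port = port;

             err = udp_sock_create(net, &udp_conf, sockp);
             if (err)
                 return err;

             tunnel_cfg.encap_type = 1;              /* placeholder encap type */
             tunnel_cfg.encap_rcv = amt_encap_rcv;   /* per-packet rx hook */
             setup_udp_tunnel_sock(net, *sockp, &tunnel_cfg);
             return 0;
         }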
    • ethtool: don't drop the rtnl_lock half way thru the ioctl · 1af0a094
      Committed by Jakub Kicinski
      devlink compat code needs to drop rtnl_lock to take
      devlink->lock to ensure correct lock ordering.
      
      This is problematic because we're not strictly guaranteed
      that the netdev will not disappear after we re-lock.
      It may open a possibility of nested ->begin / ->complete
      calls.
      
      Instead of calling into devlink under rtnl_lock take
      a ref on the devlink instance and make the call after
      we've dropped rtnl_lock.
      
      We (continue to) assume that netdevs have an implicit
      reference on the devlink returned from ndo_get_devlink_port.
      
      Note that ndo_get_devlink_port will now get called
      under rtnl_lock. That should be fine since none of
      the drivers seem to be taking serious locks inside
      ndo_get_devlink_port.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
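      A minimal sketch of the take-a-reference-then-drop-rtnl pattern described
      above; the helper names are illustrative, not necessarily the actual
      net/ethtool/ioctl.c code:

         /* Resolve the devlink instance while rtnl_lock is held by the caller. */
         static struct devlink *netdev_to_devlink_get(struct net_device *dev)
         {
             struct devlink_port *port;

             if (!dev->netdev_ops->ndo_get_devlink_port)
                 return NULL;
             port = dev->netdev_ops->ndo_get_devlink_port(dev);
             return port ? devlink_try_get(port->devlink) : NULL;
         }

         static void ethtool_compat_example(struct net_device *dev)
         {
             struct devlink *devlink;

             rtnl_lock();
             devlink = netdev_to_devlink_get(dev);   /* ref taken under rtnl */
             rtnl_unlock();

             if (!devlink)
                 return;
             /* devlink compat operation runs here, without rtnl held */
             devlink_put(devlink);
         }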
    • devlink: expose get/put functions · 46db1b77
      Committed by Jakub Kicinski
      Allow those who hold implicit reference on a devlink instance
      to try to take a full ref on it. This will be used from netdev
      code which has an implicit ref because of driver call ordering.
      
      Note that after recent changes devlink_unregister() may happen
      before netdev unregister, but devlink_free() should still happen
      after, so we are safe to try, but we can't just refcount_inc()
      and assume it's not zero.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
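      A sketch of what such a "try get" looks like in terms of the refcount API;
      the devlink internals shown are assumptions based on the description, not
      necessarily the exact implementation:

         /* Return the instance with a reference held, or NULL if the refcount
          * already dropped to zero (instance being torn down).
          */
         struct devlink *devlink_try_get(struct devlink *devlink)
         {
             if (refcount_inc_not_zero(&devlink->refcount))
                 return devlink;
             return NULL;
         }

         void devlink_put(struct devlink *devlink)
         {
             if (refcount_dec_and_test(&devlink->refcount))
                 complete(&devlink->comp);   /* assumption: teardown waits on this */
         }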
    • net: dsa: populate supported_interfaces member · c07c6e8e
      Committed by Marek Behún
      Add a new DSA switch operation, phylink_get_interfaces, which should
      fill in which PHY_INTERFACE_MODE_* are supported by a given port.
      
      Use this before phylink_create() to fill phylink's supported_interfaces
      member, allowing phylink to determine which PHY_INTERFACE_MODEs are
      supported.
      Signed-off-by: Marek Behún <kabel@kernel.org>
      [tweaked patch and description to add more complete support -- rmk]
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
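      A minimal sketch of a driver filling the bitmap via the new operation; the
      port-to-mode mapping is invented for illustration and the op signature may
      differ slightly:

         /* Illustrative driver implementation. */
         static void example_phylink_get_interfaces(struct dsa_switch *ds, int port,
                                                    unsigned long *interfaces)
         {
             if (port == 5)      /* assumed CPU port */
                 __set_bit(PHY_INTERFACE_MODE_RGMII_ID, interfaces);
             else
                 __set_bit(PHY_INTERFACE_MODE_INTERNAL, interfaces);
         }

         static const struct dsa_switch_ops example_ops = {
             /* ... other mandatory ops ... */
             .phylink_get_interfaces = example_phylink_get_interfaces,
         };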
    • netfilter: nft_payload: support for inner header matching / mangling · c46b38dc
      Committed by Pablo Neira Ayuso
      Allow matching and mangling of inner headers / payload data after the
      transport header. There is a new field in the pktinfo structure that
      stores the inner header offset, which is calculated only when requested.
      Only TCP and UDP are supported at this stage.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
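      A sketch of how the lazily computed inner header offset can be derived for
      the two supported transports; this shows the general idea, not the exact
      nft_payload code:

         /* Compute the inner (post-transport-header) offset from the transport
          * header offset and the L4 protocol; negative errno if unknown.
          */
         static int example_inner_offset(const struct sk_buff *skb, u8 l4proto, int thoff)
         {
             switch (l4proto) {
             case IPPROTO_TCP: {
                 struct tcphdr _tcph;
                 const struct tcphdr *th;

                 th = skb_header_pointer(skb, thoff, sizeof(_tcph), &_tcph);
                 if (!th)
                     return -ENOENT;
                 return thoff + th->doff * 4;    /* doff is in 32-bit words */
             }
             case IPPROTO_UDP:
                 return thoff + sizeof(struct udphdr);
             default:
                 return -EOPNOTSUPP;             /* only TCP and UDP at this stage */
             }
         }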
    • netfilter: nf_tables: convert pktinfo->tprot_set to flags field · b5bdc6f9
      Committed by Pablo Neira Ayuso
      Generalize the boolean field to store more flags on the pktinfo structure.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  3. 29 October 2021 (2 commits)
  4. 28 October 2021 (7 commits)
  5. 27 October 2021 (2 commits)
  6. 26 October 2021 (11 commits)
    • mctp: Implement extended addressing · 99ce45d5
      Committed by Jeremy Kerr
      This change allows an extended address struct - struct sockaddr_mctp_ext
      - to be passed to sendmsg/recvmsg. This allows userspace to specify
      output ifindex and physical address information (for sendmsg) or receive
      the input ifindex/physaddr for incoming messages (for recvmsg). This is
      typically used by userspace for MCTP address discovery and assignment
      operations.
      
      The extended addressing facility is conditional on a new sockopt:
      MCTP_OPT_ADDR_EXT; userspace must explicitly enable addressing before
      the kernel will consume/populate the extended address data.
      
      Includes a fix for an uninitialised var:
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
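      A hedged userspace sketch of the extended addressing flow described above;
      the constant and field names follow the description and should be verified
      against the linux/mctp.h UAPI header:

         #include <sys/socket.h>
         #include <linux/mctp.h>
         #include <string.h>

         static ssize_t send_with_ext_addr(int sd, int ifindex,
                                           const unsigned char *lladdr, int lladdr_len,
                                           const void *buf, size_t len)
         {
             struct sockaddr_mctp_ext ext = { 0 };
             int one = 1;

             /* Opt in first: the kernel only consumes/populates the extended
              * address data after this sockopt is enabled.
              */
             if (setsockopt(sd, SOL_MCTP, MCTP_OPT_ADDR_EXT, &one, sizeof(one)) < 0)
                 return -1;

             ext.smctp_base.smctp_family = AF_MCTP;
             ext.smctp_base.smctp_addr.s_addr = 9;   /* example destination EID */
             ext.smctp_base.smctp_type = 1;          /* example message type */
             ext.smctp_base.smctp_tag = MCTP_TAG_OWNER;
             ext.smctp_ifindex = ifindex;            /* force the output interface */
             ext.smctp_halen = lladdr_len;           /* physical (hardware) address */
             memcpy(ext.smctp_haddr, lladdr, lladdr_len);

             return sendto(sd, buf, len, 0, (struct sockaddr *)&ext, sizeof(ext));
         }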
    • tcp: rename sk_stream_alloc_skb · f8dd3b8d
      Committed by Eric Dumazet
      sk_stream_alloc_skb() is only used by TCP.
      
      Rename it to make this clear, and move its declaration
      to include/net/tcp.h.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: annotate data-race in neigh_output() · d18785e2
      Committed by Eric Dumazet
      neigh_output() reads n->nud_state and hh->hh_len locklessly.
      
      This is fine, but we need to add annotations and document this.
      
      We evaluate skip_cache first to avoid reading these fields
      if the cache has to be bypassed.
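      For reference, a sketch of the annotated helper this describes; it is close
      to, but not necessarily identical to, the resulting include/net/neighbour.h
      code:

         static inline int neigh_output(struct neighbour *n, struct sk_buff *skb,
                                        bool skip_cache)
         {
             const struct hh_cache *hh = &n->hh;

             /* n->nud_state and hh->hh_len can change concurrently; evaluate
              * skip_cache first so they are not even read when the cached
              * header must be bypassed.
              */
             if (!skip_cache &&
                 (READ_ONCE(n->nud_state) & NUD_CONNECTED) &&
                 READ_ONCE(hh->hh_len))
                 return neigh_hh_output(hh, skb);

             return n->output(n, skb);
         }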
      
      syzbot report:
      
      BUG: KCSAN: data-race in __neigh_event_send / ip_finish_output2
      
      write to 0xffff88810798a885 of 1 bytes by interrupt on cpu 1:
       __neigh_event_send+0x40d/0xac0 net/core/neighbour.c:1128
       neigh_event_send include/net/neighbour.h:444 [inline]
       neigh_resolve_output+0x104/0x410 net/core/neighbour.c:1476
       neigh_output include/net/neighbour.h:510 [inline]
       ip_finish_output2+0x80a/0xaa0 net/ipv4/ip_output.c:221
       ip_finish_output+0x3b5/0x510 net/ipv4/ip_output.c:309
       NF_HOOK_COND include/linux/netfilter.h:296 [inline]
       ip_output+0xf3/0x1a0 net/ipv4/ip_output.c:423
       dst_output include/net/dst.h:450 [inline]
       ip_local_out+0x164/0x220 net/ipv4/ip_output.c:126
       __ip_queue_xmit+0x9d3/0xa20 net/ipv4/ip_output.c:525
       ip_queue_xmit+0x34/0x40 net/ipv4/ip_output.c:539
       __tcp_transmit_skb+0x142a/0x1a00 net/ipv4/tcp_output.c:1405
       tcp_transmit_skb net/ipv4/tcp_output.c:1423 [inline]
       tcp_xmit_probe_skb net/ipv4/tcp_output.c:4011 [inline]
       tcp_write_wakeup+0x4a9/0x810 net/ipv4/tcp_output.c:4064
       tcp_send_probe0+0x2c/0x2b0 net/ipv4/tcp_output.c:4079
       tcp_probe_timer net/ipv4/tcp_timer.c:398 [inline]
       tcp_write_timer_handler+0x394/0x520 net/ipv4/tcp_timer.c:626
       tcp_write_timer+0xb9/0x180 net/ipv4/tcp_timer.c:642
       call_timer_fn+0x2e/0x1d0 kernel/time/timer.c:1421
       expire_timers+0x135/0x240 kernel/time/timer.c:1466
       __run_timers+0x368/0x430 kernel/time/timer.c:1734
       run_timer_softirq+0x19/0x30 kernel/time/timer.c:1747
       __do_softirq+0x12c/0x26e kernel/softirq.c:558
       invoke_softirq kernel/softirq.c:432 [inline]
       __irq_exit_rcu kernel/softirq.c:636 [inline]
       irq_exit_rcu+0x4e/0xa0 kernel/softirq.c:648
       sysvec_apic_timer_interrupt+0x69/0x80 arch/x86/kernel/apic/apic.c:1097
       asm_sysvec_apic_timer_interrupt+0x12/0x20
       native_safe_halt arch/x86/include/asm/irqflags.h:51 [inline]
       arch_safe_halt arch/x86/include/asm/irqflags.h:89 [inline]
       acpi_safe_halt drivers/acpi/processor_idle.c:109 [inline]
       acpi_idle_do_entry drivers/acpi/processor_idle.c:553 [inline]
       acpi_idle_enter+0x258/0x2e0 drivers/acpi/processor_idle.c:688
       cpuidle_enter_state+0x2b4/0x760 drivers/cpuidle/cpuidle.c:237
       cpuidle_enter+0x3c/0x60 drivers/cpuidle/cpuidle.c:351
       call_cpuidle kernel/sched/idle.c:158 [inline]
       cpuidle_idle_call kernel/sched/idle.c:239 [inline]
       do_idle+0x1a3/0x250 kernel/sched/idle.c:306
       cpu_startup_entry+0x15/0x20 kernel/sched/idle.c:403
       secondary_startup_64_no_verify+0xb1/0xbb
      
      read to 0xffff88810798a885 of 1 bytes by interrupt on cpu 0:
       neigh_output include/net/neighbour.h:507 [inline]
       ip_finish_output2+0x79a/0xaa0 net/ipv4/ip_output.c:221
       ip_finish_output+0x3b5/0x510 net/ipv4/ip_output.c:309
       NF_HOOK_COND include/linux/netfilter.h:296 [inline]
       ip_output+0xf3/0x1a0 net/ipv4/ip_output.c:423
       dst_output include/net/dst.h:450 [inline]
       ip_local_out+0x164/0x220 net/ipv4/ip_output.c:126
       __ip_queue_xmit+0x9d3/0xa20 net/ipv4/ip_output.c:525
       ip_queue_xmit+0x34/0x40 net/ipv4/ip_output.c:539
       __tcp_transmit_skb+0x142a/0x1a00 net/ipv4/tcp_output.c:1405
       tcp_transmit_skb net/ipv4/tcp_output.c:1423 [inline]
       tcp_xmit_probe_skb net/ipv4/tcp_output.c:4011 [inline]
       tcp_write_wakeup+0x4a9/0x810 net/ipv4/tcp_output.c:4064
       tcp_send_probe0+0x2c/0x2b0 net/ipv4/tcp_output.c:4079
       tcp_probe_timer net/ipv4/tcp_timer.c:398 [inline]
       tcp_write_timer_handler+0x394/0x520 net/ipv4/tcp_timer.c:626
       tcp_write_timer+0xb9/0x180 net/ipv4/tcp_timer.c:642
       call_timer_fn+0x2e/0x1d0 kernel/time/timer.c:1421
       expire_timers+0x135/0x240 kernel/time/timer.c:1466
       __run_timers+0x368/0x430 kernel/time/timer.c:1734
       run_timer_softirq+0x19/0x30 kernel/time/timer.c:1747
       __do_softirq+0x12c/0x26e kernel/softirq.c:558
       invoke_softirq kernel/softirq.c:432 [inline]
       __irq_exit_rcu kernel/softirq.c:636 [inline]
       irq_exit_rcu+0x4e/0xa0 kernel/softirq.c:648
       sysvec_apic_timer_interrupt+0x69/0x80 arch/x86/kernel/apic/apic.c:1097
       asm_sysvec_apic_timer_interrupt+0x12/0x20
       native_safe_halt arch/x86/include/asm/irqflags.h:51 [inline]
       arch_safe_halt arch/x86/include/asm/irqflags.h:89 [inline]
       acpi_safe_halt drivers/acpi/processor_idle.c:109 [inline]
       acpi_idle_do_entry drivers/acpi/processor_idle.c:553 [inline]
       acpi_idle_enter+0x258/0x2e0 drivers/acpi/processor_idle.c:688
       cpuidle_enter_state+0x2b4/0x760 drivers/cpuidle/cpuidle.c:237
       cpuidle_enter+0x3c/0x60 drivers/cpuidle/cpuidle.c:351
       call_cpuidle kernel/sched/idle.c:158 [inline]
       cpuidle_idle_call kernel/sched/idle.c:239 [inline]
       do_idle+0x1a3/0x250 kernel/sched/idle.c:306
       cpu_startup_entry+0x15/0x20 kernel/sched/idle.c:403
       rest_init+0xee/0x100 init/main.c:734
       arch_call_rest_init+0xa/0xb
       start_kernel+0x5e4/0x669 init/main.c:1142
       secondary_startup_64_no_verify+0xb1/0xbb
      
      value changed: 0x20 -> 0x01
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.15.0-rc6-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: multicast: calculate csum of looped-back and forwarded packets · 9122a70a
      Committed by Cyril Strejc
      During testing of a user-space application which transmits UDP
      multicast datagrams and uses multicast routing to send the UDP
      datagrams out of defined network interfaces, I found that a multicast
      router does not fill in the UDP checksum of locally produced,
      looped-back and forwarded UDP datagrams if the original output NIC the
      datagrams are sent to has UDP TX checksum offload enabled.
      
      The datagrams are sent out malformed on the NIC they have been
      forwarded to.
      
      It is because:
      
      1. If TX checksum offload is enabled on the output NIC, UDP checksum
         is not calculated by kernel and is not filled into skb data.
      
      2. dev_loopback_xmit(), which is called solely by
         ip_mc_finish_output(), sets skb->ip_summed = CHECKSUM_UNNECESSARY
         unconditionally.
      
      3. Since 35fc92a9 ("[NET]: Allow forwarding of ip_summed except
         CHECKSUM_COMPLETE"), the ip_summed value is preserved during
         forwarding.
      
      4. If ip_summed != CHECKSUM_PARTIAL, checksum is not calculated during
         a packet egress.
      
      The minimum fix in dev_loopback_xmit():
      
      1. Preserves skb->ip_summed CHECKSUM_PARTIAL. This is the
         case when the original output NIC has TX checksum offload enabled.
         The effects are:
      
           a) If the forwarding destination interface supports TX checksum
              offloading, the NIC driver is responsible to fill-in the
              checksum.
      
           b) If the forwarding destination interface does NOT support TX
              checksum offloading, checksums are filled-in by kernel before
              skb is submitted to the NIC driver.
      
           c) For local delivery, checksum validation is skipped as in the
              case of CHECKSUM_UNNECESSARY, thanks to skb_csum_unnecessary().
      
       2. Translates ip_summed CHECKSUM_NONE to CHECKSUM_UNNECESSARY. This
          means that for CHECKSUM_NONE the behavior is unmodified; it is there
          to skip checksum validation on local delivery of a looped-back
          packet (see the sketch after this entry).
      Signed-off-by: Cyril Strejc <cyril.strejc@skoda.cz>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
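      A minimal sketch of the dev_loopback_xmit() change as described by the two
      points above; this is illustrative, and the actual diff may be expressed
      differently:

         /* Keep CHECKSUM_PARTIAL so the checksum still gets filled in on the
          * forwarding path; anything else behaves as before and skips
          * local-delivery validation via CHECKSUM_UNNECESSARY.
          */
         static void loopback_fixup_ip_summed(struct sk_buff *skb)
         {
             if (skb->ip_summed != CHECKSUM_PARTIAL)
                 skb->ip_summed = CHECKSUM_UNNECESSARY;
         }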
    • ipv4: guard IP_MINTTL with a static key · 020e71a3
      Committed by Eric Dumazet
      RFC 5082 IP_MINTTL option is rarely used on hosts.
      
      Add a static key to remove useless code from the TCP fast path, and to
      avoid a potential cache line miss fetching inet_sk(sk)->min_ttl.
      
      Note that once the ip4_min_ttl static key has been enabled,
      it stays enabled until the next boot.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
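      A sketch of the static key pattern this relies on; the receive-path snippet
      is simplified relative to the actual drop path:

         /* Illustrative use of the static-branch API described above. */
         DEFINE_STATIC_KEY_FALSE(ip4_min_ttl);

         /* setsockopt(IP_MINTTL) side: flip the key once; it stays on until reboot. */
         static void example_set_min_ttl(struct sock *sk, int val)
         {
             if (val)
                 static_branch_enable(&ip4_min_ttl);
             WRITE_ONCE(inet_sk(sk)->min_ttl, val);
         }

         /* TCP receive fast path: the min_ttl load is skipped entirely while the
          * key is disabled, i.e. while no socket has ever used IP_MINTTL.
          */
         static bool example_ttl_ok(const struct sock *sk, const struct sk_buff *skb)
         {
             if (static_branch_unlikely(&ip4_min_ttl) &&
                 unlikely(ip_hdr(skb)->ttl < READ_ONCE(inet_sk(sk)->min_ttl)))
                 return false;
             return true;
         }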
    • ipv6: guard IPV6_MINHOPCOUNT with a static key · 790eb673
      Committed by Eric Dumazet
      RFC 5082 IPV6_MINHOPCOUNT is rarely used on hosts.
      
      Add a static key to remove useless code from the TCP fast path, and to
      avoid a potential cache line miss fetching tcp_inet6_sk(sk)->min_hopcount.
      
      Note that once the ip6_min_hopcount static key has been enabled,
      it stays enabled until the next boot.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: annotate accesses to sk->sk_rx_queue_mapping · 09b89846
      Committed by Eric Dumazet
      sk->sk_rx_queue_mapping can be modified locklessly,
      add a couple of READ_ONCE()/WRITE_ONCE() to document this fact.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: avoid dirtying sk->sk_rx_queue_mapping · 342159ee
      Committed by Eric Dumazet
      sk_rx_queue_mapping is located in a cache line that should be kept read-mostly.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
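      A sketch combining the idea of the two sk_rx_queue_mapping patches above:
      annotate the lockless accesses and skip the write when the value is
      unchanged, so the read-mostly cache line is not dirtied (simplified relative
      to the real sk_rx_queue_set()):

         static inline void example_rx_queue_set(struct sock *sk,
                                                 const struct sk_buff *skb)
         {
             if (skb_rx_queue_recorded(skb)) {
                 u16 rx_queue = skb_get_rx_queue(skb);

                 /* Lockless readers exist, hence READ_ONCE/WRITE_ONCE; only
                  * write when the mapping actually changes, to avoid dirtying
                  * a read-mostly cache line.
                  */
                 if (unlikely(READ_ONCE(sk->sk_rx_queue_mapping) != rx_queue))
                     WRITE_ONCE(sk->sk_rx_queue_mapping, rx_queue);
             }
         }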
    • net: avoid dirtying sk->sk_napi_id · 2b13af8a
      Committed by Eric Dumazet
      sk_napi_id is located in a cache line that can be kept read-mostly.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ipv6: move inet6_sk(sk)->rx_dst_cookie to sk->sk_rx_dst_cookie · ef57c161
      Committed by Eric Dumazet
      Increase cache locality by moving rx_dst_cookie next to sk->sk_rx_dst.
      
      This removes one or two cache line misses in IPv6 early demux (TCP/UDP).
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: move inet->rx_dst_ifindex to sk->sk_rx_dst_ifindex · 0c0a5ef8
      Committed by Eric Dumazet
      Increase cache locality by moving rx_dst_ifindex next to sk->sk_rx_dst
      
      This is part of an effort to reduce cache line misses in TCP fast path.
      
      This removes one cache line miss in early demux.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  7. 25 October 2021 (4 commits)
    • net/tls: tls_crypto_context add supported algorithms context · 39d8fb96
      Committed by Tianjia Zhang
      TLS already supports the SM4 GCM/CCM algorithms. It is also necessary
      to add support for these two algorithms in tls_crypto_context, to avoid
      potential issues caused by forced type conversion.
      Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
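      A sketch of what the change amounts to: the per-cipher union in the crypto
      context grows members for the SM4 variants so no forced cast is needed; the
      member names are assumptions modeled on the existing AES entries:

         /* Illustrative shape of the union after the change; see net/tls.h. */
         union tls_crypto_context {
             struct tls_crypto_info info;
             union {
                 struct tls12_crypto_info_aes_gcm_128 aes_gcm_128;
                 struct tls12_crypto_info_aes_gcm_256 aes_gcm_256;
                 struct tls12_crypto_info_chacha20_poly1305 chacha20_poly1305;
                 struct tls12_crypto_info_sm4_gcm sm4_gcm;   /* added */
                 struct tls12_crypto_info_sm4_ccm sm4_ccm;   /* added */
             };
         };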
    • cfg80211: fix management registrations locking · 09b1d5dc
      Committed by Johannes Berg
      The management registrations locking was broken: the list was locked
      per wdev, but cfg80211_mgmt_registrations_update() iterated it without
      holding all the correct spinlocks, causing list corruption.
      
      Rather than trying to fix it with fine-grained locking, just
      move the lock to the wiphy/rdev (still need the list on each
      wdev), we already need to hold the wdev lock to change it, so
      there's no contention on the lock in any case. This trivially
      fixes the bug since we hold one wdev's lock already, and now
      will hold the lock that protects all lists.
      
      Cc: stable@vger.kernel.org
      Reported-by: Jouni Malinen <j@w1.fi>
      Fixes: 6cd536fe ("cfg80211: change internal management frame registration API")
      Link: https://lore.kernel.org/r/20211025133111.5cf733eab0f4.I7b0abb0494ab712f74e2efcd24bb31ac33f7eee9@changeid
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
    • net: dsa: introduce locking for the address lists on CPU and DSA ports · 338a3a47
      Committed by Vladimir Oltean
      Now that the rtnl_mutex is going away for dsa_port_{host_,}fdb_{add,del},
      no one is serializing access to the address lists that DSA keeps for the
      purpose of reference counting on shared ports (CPU and cascade ports).
      
      It can happen for one dsa_switch_do_fdb_del to do list_del on a dp->fdbs
      element while another dsa_switch_do_fdb_{add,del} is traversing dp->fdbs.
      We need to avoid that.
      
      Currently dp->mdbs is not at risk, because dsa_switch_do_mdb_{add,del}
      still runs under the rtnl_mutex. But it would be nice if it would not
      depend on that being the case. So let's introduce a mutex per port (the
      address lists are per port too) and share it between dp->mdbs and
      dp->fdbs.
      
      The place where we put the locking is interesting. It could be tempting
      to put a DSA-level lock which still serializes calls to
      .port_fdb_{add,del}, but it would still not avoid concurrency with other
      driver code paths that are currently under rtnl_mutex (.port_fdb_dump,
      .port_fast_age). So it would add a very false sense of security (and
      adding a global switch-wide lock in DSA to resynchronize with the
      rtnl_lock is also counterproductive and hard).
      
      So the locking is intentionally done only where the dp->fdbs and dp->mdbs
      lists are traversed. That means, from a driver perspective, that
      .port_fdb_add will be called with the dp->addr_lists_lock mutex held on
      the CPU port, but not held on user ports. This is done so that driver
      writers are not encouraged to rely on any guarantee offered by
      dp->addr_lists_lock.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
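      A sketch of where the new per-port lock sits in the reference-counting
      helpers described above; it is simplified relative to the real
      dsa_switch_do_fdb_add and abbreviates error handling:

         static int example_do_fdb_add(struct dsa_port *dp,
                                       const unsigned char *addr, u16 vid)
         {
             struct dsa_mac_addr *a;
             int err = 0;

             /* Serializes traversal of dp->fdbs (and dp->mdbs) now that
              * rtnl_mutex no longer does.
              */
             mutex_lock(&dp->addr_lists_lock);

             a = dsa_mac_addr_find(&dp->fdbs, addr, vid);
             if (a) {
                 refcount_inc(&a->refcount);
                 goto out;
             }

             a = kzalloc(sizeof(*a), GFP_KERNEL);
             if (!a) {
                 err = -ENOMEM;
                 goto out;
             }

             err = dp->ds->ops->port_fdb_add(dp->ds, dp->index, addr, vid);
             if (err) {
                 kfree(a);
                 goto out;
             }

             ether_addr_copy(a->addr, addr);
             a->vid = vid;
             refcount_set(&a->refcount, 1);
             list_add_tail(&a->list, &dp->fdbs);
         out:
             mutex_unlock(&dp->addr_lists_lock);
             return err;
         }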
    • Revert "Merge branch 'dsa-rtnl'" · 2d7e73f0
      Committed by David S. Miller
      This reverts commit 965e6b26, reversing
      changes made to 4d98bb0d.
  8. 24 October 2021 (1 commit)
    • net: dsa: introduce locking for the address lists on CPU and DSA ports · d3bd8924
      Committed by Vladimir Oltean
      Now that the rtnl_mutex is going away for dsa_port_{host_,}fdb_{add,del},
      no one is serializing access to the address lists that DSA keeps for the
      purpose of reference counting on shared ports (CPU and cascade ports).
      
      It can happen for one dsa_switch_do_fdb_del to do list_del on a dp->fdbs
      element while another dsa_switch_do_fdb_{add,del} is traversing dp->fdbs.
      We need to avoid that.
      
      Currently dp->mdbs is not at risk, because dsa_switch_do_mdb_{add,del}
      still runs under the rtnl_mutex. But it would be nice if it would not
      depend on that being the case. So let's introduce a mutex per port (the
      address lists are per port too) and share it between dp->mdbs and
      dp->fdbs.
      
      The place where we put the locking is interesting. It could be tempting
      to put a DSA-level lock which still serializes calls to
      .port_fdb_{add,del}, but it would still not avoid concurrency with other
      driver code paths that are currently under rtnl_mutex (.port_fdb_dump,
      .port_fast_age). So it would add a very false sense of security (and
      adding a global switch-wide lock in DSA to resynchronize with the
      rtnl_lock is also counterproductive and hard).
      
      So the locking is intentionally done only where the dp->fdbs and dp->mdbs
      lists are traversed. That means, from a driver perspective, that
      .port_fdb_add will be called with the dp->addr_lists_lock mutex held on
      the CPU port, but not held on user ports. This is done so that driver
      writers are not encouraged to rely on any guarantee offered by
      dp->addr_lists_lock.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 23 October 2021 (1 commit)
  10. 21 October 2021 (1 commit)