1. 23 Sep, 2021 2 commits
    • V
      net: dsa: sja1105: break dependency between dsa_port_is_sja1105 and switch driver · f5aef424
      Vladimir Oltean committed
      It's nice to be able to test a tagging protocol with dsa_loop, but not
      at the cost of losing the ability of building the tagging protocol and
      switch driver as modules, because as things stand, there is a circular
      dependency between the two. Tagging protocol drivers cannot depend on
      switch drivers, that is a hard fact.
      
      The reasoning behind the blamed patch was that accessing dp->priv should
      first make sure that the structure behind that pointer is what we really
      think it is.
      
      Currently the "sja1105" and "sja1110" tagging protocols only operate
      with the sja1105 switch driver, just like any other tagging protocol and
      switch combination. The only way to mix and match them is by modifying
      the code, and this applies to dsa_loop as well (by default that uses
      DSA_TAG_PROTO_NONE). So while in principle there is an issue, in
      practice there isn't one.
      
      Until we extend dsa_loop to allow user space configuration, treat the
      problem as a non-issue and just say that DSA ports found by tag_sja1105
      are always sja1105 ports, which is in fact true. But keep the
      dsa_port_is_sja1105 function so that it's easy to patch it during
      testing, and rely on dead code elimination.
      
      Fixes: 994d2cbb ("net: dsa: tag_sja1105: be dsa_loop-safe")
      Link: https://lore.kernel.org/netdev/20210908220834.d7gmtnwrorhharna@skbuf/
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f5aef424
    • V
      net: dsa: move sja1110_process_meta_tstamp inside the tagging protocol driver · 6d709cad
      Vladimir Oltean committed
      The problem is that DSA tagging protocols really must not depend on the
      switch driver, because this creates a circular dependency at insmod
      time, and the switch driver will effectively not load when the tagging
      protocol driver is missing.
      
      The code was structured in the way it was for a reason, though. The DSA
      driver-facing API for PTP timestamping relies on the assumption that
      two-step TX timestamps are provided by the hardware in an out-of-band
      manner, typically by raising an interrupt and making that timestamp
      available inside some sort of FIFO which is to be accessed over
      SPI/MDIO/etc.
      
      So the API puts .port_txtstamp into dsa_switch_ops, because it is
      expected that the switch driver needs to save some state (like put the
      skb into a queue until its TX timestamp arrives).
      
      On SJA1110, TX timestamps are provided by the switch as Ethernet
      packets, so this makes them be received and processed by the tagging
      protocol driver. This in itself is great, because the timestamps are
      full 64-bit and do not require reconstruction, and since Ethernet is the
      fastest I/O method available to/from the switch, PTP timestamps arrive
      very quickly, no matter how bottlenecked the SPI connection is, because
      SPI interaction is not needed at all.
      
      DSA's code structure and strict isolation between the tagging protocol
      driver and the switch driver break the natural code organization.
      
      When the tagging protocol driver receives a packet which is classified
      as a metadata packet containing timestamps, it passes those timestamps
      one by one to the switch driver, which then proceeds to compare them
      based on the recorded timestamp ID that was generated in .port_txtstamp.
      
      The communication between the tagging protocol and the switch driver is
      done through a method exported by the switch driver, sja1110_process_meta_tstamp.
      To satisfy build requirements, we force a dependency to build the
      tagging protocol driver as a module when the switch driver is a module.
      However, as explained in the first paragraph, that causes the circular
      dependency.
      
      To solve this, move the skb queue from struct sja1105_private :: struct
      sja1105_ptp_data to struct sja1105_private :: struct sja1105_tagger_data.
      The latter is a data structure for which hacks have already been put
      into place to be able to create persistent storage per switch that is
      accessible from the tagging protocol driver (see sja1105_setup_ports).
      
      With the skb queue directly accessible from the tagging protocol driver,
      we can now move sja1110_process_meta_tstamp into the tagging driver
      itself, and avoid exporting a symbol.
      
      Fixes: 566b18c8 ("net: dsa: sja1105: implement TX timestamping for SJA1110")
      Link: https://lore.kernel.org/netdev/20210908220834.d7gmtnwrorhharna@skbuf/
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6d709cad
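The matching step described in this commit (a metadata packet carries a timestamp plus the ID recorded at .port_txtstamp time, and the pending skb with that ID is completed) can be sketched in plain C. This is a simplified model, not the driver code: `struct pending_tx`, `demo_queue_tx` and `demo_process_meta_tstamp` are hypothetical names, a fixed array stands in for the skb queue, and only the timestamp ID is matched.

```c
#include <stddef.h>
#include <stdint.h>

#define PENDING_MAX 8

/* Hypothetical stand-in for a queued skb clone awaiting its TX timestamp */
struct pending_tx {
	int in_use;
	uint8_t ts_id;		/* ID recorded when the frame was queued */
	uint64_t tstamp;	/* filled in when the meta frame arrives */
};

/* Record a frame that expects a timestamp; returns its slot or -1 if full */
static int demo_queue_tx(struct pending_tx *q, uint8_t ts_id)
{
	size_t i;

	for (i = 0; i < PENDING_MAX; i++) {
		if (!q[i].in_use) {
			q[i].in_use = 1;
			q[i].ts_id = ts_id;
			q[i].tstamp = 0;
			return (int)i;
		}
	}
	return -1;
}

/* Called for each timestamp found in a meta frame: complete the pending
 * entry with the same ID and release its slot; -1 if nothing matches */
static int demo_process_meta_tstamp(struct pending_tx *q, uint8_t ts_id,
				    uint64_t tstamp)
{
	size_t i;

	for (i = 0; i < PENDING_MAX; i++) {
		if (q[i].in_use && q[i].ts_id == ts_id) {
			q[i].tstamp = tstamp;
			q[i].in_use = 0;
			return (int)i;
		}
	}
	return -1;
}
```

Because both helpers only touch the queue, they can live on the tagging side once the queue itself is reachable from there, which is the point of the move described above.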
  2. 22 Sep, 2021 2 commits
  3. 21 Sep, 2021 3 commits
  4. 20 Sep, 2021 3 commits
  5. 19 Sep, 2021 4 commits
  6. 18 Sep, 2021 4 commits
    • F
      mptcp: add MPTCP_SUBFLOW_ADDRS getsockopt support · c11c5906
      Florian Westphal committed
      This retrieves the address pairs of all subflows currently
      active for a given mptcp connection.
      
      It re-uses the same meta-header as for MPTCP_TCPINFO.
      
      A new structure is provided to hold the subflow
      address data:
      
      struct mptcp_subflow_addrs {
      	union {
      		__kernel_sa_family_t sa_family;
      		struct sockaddr sa_local;
      		struct sockaddr_in sin_local;
      		struct sockaddr_in6 sin6_local;
      		struct sockaddr_storage ss_local;
      	};
      	union {
      		struct sockaddr sa_remote;
      		struct sockaddr_in sin_remote;
      		struct sockaddr_in6 sin6_remote;
      		struct sockaddr_storage ss_remote;
      	};
      };
      
      Usage of the new getsockopt is very similar to
      MPTCP_TCPINFO one.
      
      Userspace allocates a
      'struct mptcp_subflow_data', followed by one or
      more 'struct mptcp_subflow_addrs', then inits the
      mptcp_subflow_data structure as follows:
      
      struct mptcp_subflow_addrs *sf_addr;
      struct mptcp_subflow_data *addr;
      socklen_t olen = sizeof(*addr) + (8 * sizeof(*sf_addr));
      
      addr = malloc(olen);
      addr->size_subflow_data = sizeof(*addr);
      addr->num_subflows = 0;
      addr->size_kernel = 0;
      addr->size_user = sizeof(struct mptcp_subflow_addrs);
      
      sf_addr = (struct mptcp_subflow_addrs *)(addr + 1);
      
      and then retrieves the endpoint addresses via:
      ret = getsockopt(fd, SOL_MPTCP, MPTCP_SUBFLOW_ADDRS,
      		 addr, &olen);
      
      If the call succeeds, the kernel will have added up to 8
      endpoint addresses after the 'mptcp_subflow_data' header.
      
      Userspace needs to re-check 'olen' value to detect how
      many bytes have been filled in by the kernel.
      
      Userspace can check addr->num_subflows to discover when
      there were more subflows than available data space.
      Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c11c5906
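The 'olen' re-check mentioned in this commit can be sketched as two small helpers. This is a minimal sketch under assumptions: `struct subflow_hdr` is a hypothetical local mirror of 'struct mptcp_subflow_data' (so the code builds without <linux/mptcp.h>), and it assumes the returned length is the header size plus a whole number of size_user elements.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical local mirror of 'struct mptcp_subflow_data' */
struct subflow_hdr {
	uint32_t size_subflow_data;	/* header size set by userspace */
	uint32_t num_subflows;		/* set by the kernel */
	uint32_t size_kernel;		/* set by the kernel */
	uint32_t size_user;		/* size of one element after the header */
};

/* Number of complete elements found after the header, derived from the
 * length that getsockopt() wrote back into olen */
static size_t subflow_entries_filled(const struct subflow_hdr *hdr, size_t olen)
{
	if (hdr->size_user == 0 || olen < hdr->size_subflow_data)
		return 0;
	return (olen - hdr->size_subflow_data) / hdr->size_user;
}

/* Nonzero when the kernel reported more subflows than fit in the buffer */
static int subflow_data_truncated(const struct subflow_hdr *hdr, size_t olen)
{
	return hdr->num_subflows > subflow_entries_filled(hdr, olen);
}
```

A caller would pass the header it received and the updated 'olen', then walk only the first subflow_entries_filled() elements of the address array.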
    • F
      mptcp: add MPTCP_TCPINFO getsockopt support · 06f15cee
      Florian Westphal committed
      Allow users to retrieve TCP_INFO data of all subflows.
      
      Users need to pre-initialize a meta header that has to be
      prepended to the data buffer that will be filled with the tcp info data.
      
      The meta header looks like this:
      
      struct mptcp_subflow_data {
       __u32 size_subflow_data;/* size of this structure in userspace */
       __u32 num_subflows;	/* must be 0, set by kernel */
       __u32 size_kernel;	/* must be 0, set by kernel */
       __u32 size_user;	/* size of one element in data[] */
      } __attribute__((aligned(8)));
      
      size_subflow_data has to be set to 'sizeof(struct mptcp_subflow_data)'.
      This allows extending the mptcp_subflow_data structure later on without
      breaking backwards compatibility.
      
      If the structure is extended later on, kernel knows where the
      userspace-provided meta header ends, even if userspace uses an older
      (smaller) version of the structure.
      
      num_subflows must be set to 0. If the getsockopt request succeeds (return
      value is 0), it will be updated to contain the number of active subflows
      for the given logical connection.
      
      size_kernel must be set to 0. If the getsockopt request is successful,
      it will contain the size of the 'struct tcp_info' as known by the kernel.
      This is informational only.
      
      size_user must be set to 'sizeof(struct tcp_info)'.
      
      This allows the kernel to only fill in the space reserved/expected by
      userspace.
      
      Example:
      
      struct my_tcp_info {
        struct mptcp_subflow_data d;
        struct tcp_info ti[2];
      };
      struct my_tcp_info ti;
      socklen_t olen;
      
      memset(&ti, 0, sizeof(ti));
      
      ti.d.size_subflow_data = sizeof(struct mptcp_subflow_data);
      ti.d.size_user = sizeof(struct tcp_info);
      olen = sizeof(ti);
      
      ret = getsockopt(fd, SOL_MPTCP, MPTCP_TCPINFO, &ti, &olen);
      if (ret < 0)
      	die_perror("getsockopt MPTCP_TCPINFO");
      
      mptcp_subflow_data.num_subflows is populated with the number of
      subflows that exist on the kernel side for the logical mptcp connection.
      
      This allows userspace to re-try with a larger tcp_info array if the number
      of subflows was larger than the available space in the ti[] array.
      
      olen has to be set to the number of bytes that userspace has allocated to
      receive the kernel data.  It will be updated to contain the real number
      of bytes that have been copied by the kernel.
      
      In the above example, if the number of subflows was 1, olen is equal to
      'sizeof(struct mptcp_subflow_data) + sizeof(struct tcp_info)'.
      For 2 or more subflows olen is equal to 'sizeof(struct my_tcp_info)'.
      
      If there was more data that could not be copied due to lack of space
      in the option buffer, userspace can detect this by checking
      mptcp_subflow_data->num_subflows.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      06f15cee
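The retry described in this commit (compare num_subflows with the space provided, then call again with a larger array) comes down to one size computation. A minimal sketch, with the hypothetical helper name `tcpinfo_retry_len` and plain sizes as parameters so that it does not depend on kernel headers:

```c
#include <stddef.h>

/* Returns 0 when 'capacity' elements were enough for 'num_subflows',
 * otherwise the buffer length (header plus one element per subflow)
 * to allocate before retrying the getsockopt() call */
static size_t tcpinfo_retry_len(size_t hdr_size, size_t elem_size,
				size_t capacity, size_t num_subflows)
{
	if (num_subflows <= capacity)
		return 0;
	return hdr_size + elem_size * num_subflows;
}
```

Before retrying, num_subflows and size_kernel have to be reset to 0, as the commit message requires for every request. The subflow count can also change between the two calls, so a caller may want to loop rather than retry once.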
    • F
      mptcp: add MPTCP_INFO getsockopt · 55c42fa7
      Florian Westphal committed
      It's not compatible with the multipath-tcp.org kernel one.
      
      1. The out-of-tree implementation defines a different 'struct mptcp_info',
         with embedded __user addresses for additional data such as
         endpoint addresses.
      
      2. Mat Martineau points out that embedded __user addresses don't work
      with BPF_CGROUP_RUN_PROG_GETSOCKOPT() which assumes that copying in
      optsize bytes from optval provides all data that got copied to userspace.
      
      This provides mptcp_info data for the given mptcp socket.
      
      Userspace sets optlen to the size of the structure it expects.
      The kernel updates it to contain the number of bytes that it copied.
      
      This allows appending more information to the structure later.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      55c42fa7
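The optlen contract in this commit (userspace states the size it expects, the kernel reports how many bytes it copied) is what lets the structure grow later. The sketch below models it with a hypothetical helper, `copy_versioned`, and assumes the copy length is the smaller of the two sizes; the commit message does not spell that detail out.

```c
#include <stddef.h>
#include <string.h>

/* Copy at most the smaller of the two structure sizes and return the
 * number of bytes copied, which is the value the caller would see
 * written back as optlen */
static size_t copy_versioned(void *dst, size_t dst_len,
			     const void *src, size_t src_len)
{
	size_t n = dst_len < src_len ? dst_len : src_len;

	memcpy(dst, src, n);
	return n;
}
```

Under that assumption an old binary with a short structure gets a truncated but valid prefix, and a new binary on an old kernel can tell from the returned length which trailing fields were not filled in.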
    • F
      mptcp: add new mptcp_fill_diag helper · 61bc6e82
      Florian Westphal committed
      Will be re-used from getsockopt path.
      Since diag can be a module, we can't export the helper from diag, it
      needs to be moved to core.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      61bc6e82
  7. 17 Sep, 2021 1 commit
  8. 16 Sep, 2021 3 commits
    • T
      net/tls: support SM4 GCM/CCM algorithm · 227b9644
      Tianjia Zhang committed
      The RFC8998 specification defines the use of the ShangMi algorithm
      cipher suites in TLS 1.3, and also supports the GCM/CCM mode using
      the SM4 algorithm.
      Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      227b9644
    • V
      net: dsa: flush switchdev workqueue before tearing down CPU/DSA ports · a57d8c21
      Vladimir Oltean committed
      Sometimes when unbinding the mv88e6xxx driver on Turris MOX, these error
      messages appear:
      
      mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete be:79:b4:9e:9e:96 vid 1 from fdb: -2
      mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete be:79:b4:9e:9e:96 vid 0 from fdb: -2
      mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete d8:58:d7:00:ca:6d vid 100 from fdb: -2
      mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete d8:58:d7:00:ca:6d vid 1 from fdb: -2
      mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete d8:58:d7:00:ca:6d vid 0 from fdb: -2
      
      (and similarly for other ports)
      
      What happens is that DSA has a policy "even if there are bugs, let's at
      least not leak memory" and dsa_port_teardown() clears the dp->fdbs and
      dp->mdbs lists, which are supposed to be empty.
      
      But when that cleanup code is deleted, the warnings go away.
      
      => the FDB and MDB lists (used for refcounting on shared ports, aka CPU
      and DSA ports) will eventually be empty, but are not empty by the time
      we tear down those ports. Aka we are deleting them too soon.
      
      The addresses that DSA complains about are host-trapped addresses: the
      local addresses of the ports, and the MAC address of the bridge device.
      
      The problem is that offloading those entries happens from a deferred
      work item scheduled by the SWITCHDEV_FDB_DEL_TO_DEVICE handler, and this
      races with the teardown of the CPU and DSA ports where the refcounting
      is kept.
      
      In fact, not only does it race, but fundamentally speaking, if we iterate
      through the port list linearly, we might end up tearing down the shared
      ports even before we delete a DSA user port which has a bridge upper.
      
      So as it turns out, we need to first tear down the user ports (and the
      unused ones, for no better place of doing that), then the shared ports
      (the CPU and DSA ports). In between, we need to ensure that all work
      items scheduled by our switchdev handlers (which only run for user
      ports, hence the reason why we tear them down first) have finished.
      
      Fixes: 161ca59d ("net: dsa: reference count the MDB entries at the cross-chip notifier level")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Link: https://lore.kernel.org/r/20210914134726.2305133-1-vladimir.oltean@nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      a57d8c21
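The ordering this commit arrives at (user and unused ports first, then a flush of the deferred work, then the shared CPU and DSA ports) can be modelled as a two-pass walk over the port list. This is a toy model, not the DSA core: `struct demo_port` and `demo_teardown` are hypothetical, and the flush is represented only by the point in the sequence where it would happen.

```c
#include <stddef.h>

enum port_type { PORT_UNUSED, PORT_USER, PORT_CPU, PORT_DSA };

/* Hypothetical stand-in for a DSA port */
struct demo_port {
	enum port_type type;
	int teardown_seq;	/* 1-based position in the teardown order */
};

static int port_is_shared(const struct demo_port *p)
{
	return p->type == PORT_CPU || p->type == PORT_DSA;
}

/* Tear down in two passes and return how many ports were already gone
 * when the deferred work would be flushed */
static int demo_teardown(struct demo_port *ports, size_t n)
{
	int seq = 0, flushed_after;
	size_t i;

	/* Pass 1: user ports and unused ports */
	for (i = 0; i < n; i++)
		if (!port_is_shared(&ports[i]))
			ports[i].teardown_seq = ++seq;

	/* The workqueue flush belongs here: every deferred FDB/MDB
	 * deletion queued on behalf of user ports must have finished */
	flushed_after = seq;

	/* Pass 2: shared ports, which hold the refcounted lists */
	for (i = 0; i < n; i++)
		if (port_is_shared(&ports[i]))
			ports[i].teardown_seq = ++seq;

	return flushed_after;
}
```

A single linear pass would tear down a CPU port listed first before any user port, which is the situation the commit message warns about.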
    • V
      net: dsa: destroy the phylink instance on any error in dsa_slave_phy_setup · 6a52e733
      Vladimir Oltean committed
      DSA supports connecting to a phy-handle, and has a fallback to a non-OF
      based method of connecting to an internal PHY on the switch's own MDIO
      bus, if no phy-handle and no fixed-link nodes were present.
      
      The -ENODEV error code from the first attempt (phylink_of_phy_connect)
      is what triggers the second attempt (phylink_connect_phy).
      
      However, when the first attempt returns a different error code than
      -ENODEV, this results in an imbalance of calls to phylink_create and
      phylink_destroy by the time we exit the function. The phylink instance
      is leaked.
      
      There are many other error codes that can be returned by
      phylink_of_phy_connect. For example, phylink_validate returns -EINVAL.
      So this is a practical issue too.
      
      Fixes: aab9c406 ("net: dsa: Plug in PHYLINK support")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Link: https://lore.kernel.org/r/20210914134331.2303380-1-vladimir.oltean@nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      6a52e733
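The control flow this fix aims for can be shown with simulated return codes. In this sketch `struct fake_link` and `demo_phy_setup` are hypothetical, and the results of the two connect attempts are passed in as parameters instead of calling phylink; the comments name the real function each step stands in for.

```c
#include <errno.h>

/* Hypothetical stand-in for the phylink instance owned by the port */
struct fake_link {
	int created;
};

static int demo_phy_setup(struct fake_link *link, int of_connect_ret,
			  int fallback_ret)
{
	int ret;

	link->created = 1;		/* stands in for phylink_create() */

	ret = of_connect_ret;		/* phylink_of_phy_connect() */
	if (ret == -ENODEV)
		ret = fallback_ret;	/* fallback: phylink_connect_phy() */

	if (ret)
		link->created = 0;	/* phylink_destroy() on ANY error */

	return ret;
}
```

The key property is that the destroy step is keyed on the final result, not on which attempt failed, so an -EINVAL from the first attempt no longer skips the cleanup.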
  9. 15 Sep, 2021 3 commits
  10. 14 Sep, 2021 9 commits
    • Y
      skbuff: inline page_frag_alloc_align() · 32e3573f
      Yajun Deng committed
      __alloc_frag_align() is short and only called by two functions,
      so inline page_frag_alloc_align() to reduce the call overhead.
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      32e3573f
    • H
      ethtool: prevent endless loop if eeprom size is smaller than announced · b9bbc4c1
      Heiner Kallweit committed
      It shouldn't happen, but it can happen that the readable eeprom size is smaller
      than announced. Then we would be stuck in an endless loop here because
      after reaching the actual end reads return eeprom.len = 0. I faced this
      issue when making a mistake in driver development. Detect this scenario
      and return an error.
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b9bbc4c1
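The failure mode in this commit is a chunked read loop that never advances once the device starts returning zero bytes. The sketch below simulates it with hypothetical helpers (`fake_read`, `demo_read_eeprom`); the -EIO error code is an assumption of the sketch, since the commit message only says "return an error".

```c
#include <errno.h>
#include <stddef.h>

/* Simulated device read: returns how many bytes are really available
 * at 'offset', at most 'want' */
static size_t fake_read(size_t actual_size, size_t offset, size_t want)
{
	if (offset >= actual_size)
		return 0;
	return actual_size - offset < want ? actual_size - offset : want;
}

/* Read 'announced' bytes in chunks; bail out instead of spinning when a
 * read makes no progress. Returns the bytes read or a negative errno. */
static long demo_read_eeprom(size_t announced, size_t actual, size_t chunk)
{
	size_t done = 0;

	while (done < announced) {
		size_t want = announced - done < chunk ? announced - done : chunk;
		size_t got = fake_read(actual, done, want);

		if (got == 0)
			return -EIO;	/* device is shorter than announced */
		done += got;
	}
	return (long)done;
}
```

Without the zero-length check, the 256/100 case in the test below would loop forever at offset 100.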
    • E
      Revert "Revert "ipv4: fix memory leaks in ip_cmsg_send() callers"" · d198b277
      Eric Dumazet committed
      This reverts commit d7807a9a.
      
      As mentioned in https://lkml.org/lkml/2021/9/13/1819, the
      5-year-old commit 91948309 ("ipv4: fix memory leaks in ip_cmsg_send() callers")
      was a correct fix.
      
        ip_cmsg_send() can loop over multiple cmsghdr()
      
        If IP_RETOPTS has been successful, but a following cmsghdr generates an error,
        we do not free ipc.opt
      
        If IP_RETOPTS is not successful, we have freed the allocated temporary space,
        not the one currently in ipc.opt.
      
      Sure, code could be refactored, but let's not bring back old bugs.
      
      Fixes: d7807a9a ("Revert "ipv4: fix memory leaks in ip_cmsg_send() callers"")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yajun Deng <yajun.deng@linux.dev>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d198b277
    • Z
      tcp: fix tp->undo_retrans accounting in tcp_sacktag_one() · 4f884f39
      zhenggy committed
      Commit 10d3be56 ("tcp-tso: do not split TSO packets at retransmit
      time") may directly retransmit a multi-segment TSO/GSO packet without
      splitting it. Since that commit, we can no longer assume that a
      retransmitted packet is a single segment.
      
      This patch fixes the tp->undo_retrans accounting in tcp_sacktag_one()
      by using the actual segment count (pcount) of the retransmitted packet.
      
      Before that commit (10d3be56), the assumption underlying the
      tp->undo_retrans-- seems correct.
      
      Fixes: 10d3be56 ("tcp-tso: do not split TSO packets at retransmit time")
      Signed-off-by: zhenggy <zhenggy@chinatelecom.cn>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4f884f39
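The accounting change described in this commit replaces a decrement by one with a decrement by the packet's segment count. A minimal sketch with the hypothetical helper `undo_retrans_after_sack`; clamping at zero is an assumption of the sketch, added so the counter cannot go negative.

```c
/* New value of the undo_retrans counter after a retransmitted packet
 * made of 'pcount' segments has been SACKed */
static int undo_retrans_after_sack(int undo_retrans, int pcount)
{
	int left = undo_retrans - pcount;

	return left > 0 ? left : 0;
}
```

With the old single decrement, a SACKed 3-segment retransmission would leave the counter two too high.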
    • E
      net-caif: avoid user-triggerable WARN_ON(1) · 550ac9c1
      Eric Dumazet committed
      syzbot triggers this warning, which looks like something
      we can easily prevent.
      
      If we initialize priv->list_field in chnl_net_init(),
      then always use list_del_init(), we can remove robust_list_del()
      completely.
      
      WARNING: CPU: 0 PID: 3233 at net/caif/chnl_net.c:67 robust_list_del net/caif/chnl_net.c:67 [inline]
      WARNING: CPU: 0 PID: 3233 at net/caif/chnl_net.c:67 chnl_net_uninit+0xc9/0x2e0 net/caif/chnl_net.c:375
      Modules linked in:
      CPU: 0 PID: 3233 Comm: syz-executor.3 Not tainted 5.14.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:robust_list_del net/caif/chnl_net.c:67 [inline]
      RIP: 0010:chnl_net_uninit+0xc9/0x2e0 net/caif/chnl_net.c:375
      Code: 89 eb e8 3a a3 ba f8 48 89 d8 48 c1 e8 03 42 80 3c 28 00 0f 85 bf 01 00 00 48 81 fb 00 14 4e 8d 48 8b 2b 75 d0 e8 17 a3 ba f8 <0f> 0b 5b 5d 41 5c 41 5d e9 0a a3 ba f8 4c 89 e3 e8 02 a3 ba f8 4c
      RSP: 0018:ffffc90009067248 EFLAGS: 00010202
      RAX: 0000000000008780 RBX: ffffffff8d4e1400 RCX: ffffc9000fd34000
      RDX: 0000000000040000 RSI: ffffffff88bb6e49 RDI: 0000000000000003
      RBP: ffff88802cd9ee08 R08: 0000000000000000 R09: ffffffff8d0e6647
      R10: ffffffff88bb6dc2 R11: 0000000000000000 R12: ffff88803791ae08
      R13: dffffc0000000000 R14: 00000000e600ffce R15: ffff888073ed3480
      FS:  00007fed10fa0700(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000001b2c322000 CR3: 00000000164a6000 CR4: 00000000001506e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       register_netdevice+0xadf/0x1500 net/core/dev.c:10347
       ipcaif_newlink+0x4c/0x260 net/caif/chnl_net.c:468
       __rtnl_newlink+0x106d/0x1750 net/core/rtnetlink.c:3458
       rtnl_newlink+0x64/0xa0 net/core/rtnetlink.c:3506
       rtnetlink_rcv_msg+0x413/0xb80 net/core/rtnetlink.c:5572
       netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2504
       netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline]
       netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1340
       netlink_sendmsg+0x86d/0xdb0 net/netlink/af_netlink.c:1929
       sock_sendmsg_nosec net/socket.c:704 [inline]
       sock_sendmsg+0xcf/0x120 net/socket.c:724
       __sys_sendto+0x21c/0x320 net/socket.c:2036
       __do_sys_sendto net/socket.c:2048 [inline]
       __se_sys_sendto net/socket.c:2044 [inline]
       __x64_sys_sendto+0xdd/0x1b0 net/socket.c:2044
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fixes: cc36a070 ("net-caif: add CAIF netdevice")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      550ac9c1
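Initializing the list field up front and always deleting with list_del_init() is safe because a self-linked node can be "deleted" any number of times. The sketch below is a userland re-implementation for illustration only: the `demo_list_*` names are hypothetical and merely mirror the behaviour of the kernel's circular doubly-linked list helpers.

```c
/* Minimal circular doubly-linked list node */
struct demo_list {
	struct demo_list *next, *prev;
};

/* A node that is not on any list points to itself */
static void demo_list_init(struct demo_list *n)
{
	n->next = n;
	n->prev = n;
}

static void demo_list_add(struct demo_list *n, struct demo_list *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* Unlink and re-initialize; harmless on a node that is already unlinked */
static void demo_list_del_init(struct demo_list *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	demo_list_init(n);
}

static int demo_list_empty(const struct demo_list *head)
{
	return head->next == head;
}
```

In this model the uninit path can delete unconditionally, even when registration failed before the node was ever added, so no search-and-warn helper is needed.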
    • K
      net/smc: add generic netlink support for system EID · 3c572145
      Karsten Graul committed
      With SMC-Dv2 users can configure if the static system EID should be used
      during CLC handshake, or if only user EIDs are allowed.
      Add generic netlink support to enable and disable the system EID, and
      to retrieve the system EID and its current enabled state.
      Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
      Reviewed-by: Guvenc Gulce <guvenc@linux.ibm.com>
      Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3c572145
    • K
      net/smc: keep static copy of system EID · 11a26c59
      Karsten Graul committed
      The system EID is retrieved using a registered ISM device each time
      it is needed. This adds unnecessary complexity at all places where
      the system EID is needed but no ISM device is at hand.
      Simplify the code and save the system EID in a static variable in
      smc_ism.c.
      Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
      Reviewed-by: Guvenc Gulce <guvenc@linux.ibm.com>
      Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      11a26c59
    • K
      net/smc: add support for user defined EIDs · fa086662
      Karsten Graul committed
      SMC-Dv2 allows users to define EIDs, which makes it possible to create
      separate name spaces so users can cluster their SMC-Dv2 connections.
      Add support for user defined EIDs and extend the generic netlink
      interface so users can add, remove and dump EIDs.
      Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
      Reviewed-by: Guvenc Gulce <guvenc@linux.ibm.com>
      Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fa086662
    • D
      bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode · 8520e224
      Daniel Borkmann committed
      Fix cgroup v1 interference when non-root cgroup v2 BPF programs are used.
      Back in the day, commit bd1060a1 ("sock, cgroup: add sock->sk_cgroup")
      embedded per-socket cgroup information into sock->sk_cgrp_data and in order
      to save 8 bytes in struct sock made both mutually exclusive, that is, when
      cgroup v1 socket tagging (e.g. net_cls/net_prio) is used, then cgroup v2
      falls back to the root cgroup in sock_cgroup_ptr() (&cgrp_dfl_root.cgrp).
      
      The assumption made was "there is no reason to mix the two and this is in line
      with how legacy and v2 compatibility is handled" as stated in bd1060a1.
      However, with Kubernetes more widely supporting cgroups v2 as well nowadays,
      this assumption no longer holds, and the possibility of the v1/v2 mixed mode
      with the v2 root fallback being hit becomes a real security issue.
      
      Many of the cgroup v2 BPF programs are also used for policy enforcement, just
      to pick _one_ example, that is, to programmatically deny socket related system
      calls like connect(2) or bind(2). A v2 root fallback would implicitly cause
      a policy bypass for the affected Pods.
      
      In production environments, we have recently seen this case due to various
      circumstances: i) a different 3rd party agent and/or ii) a container runtime
      such as [0] in the user's environment configuring legacy cgroup v1 net_cls
      tags, which implicitly triggered the mentioned root fallback. Another case is
      Kubernetes projects like kind [1] which create Kubernetes nodes in a container
      and also add cgroup namespaces to the mix, meaning programs which are attached
      to the cgroup v2 root of the cgroup namespace get attached to a non-root
      cgroup v2 path from init namespace point of view. And the latter's root is
      out of reach for agents on a kind Kubernetes node to configure. Meaning, any
      entity on the node setting cgroup v1 net_cls tag will trigger the bypass
      despite cgroup v2 BPF programs attached to the namespace root.
      
      Generally, this mutual exclusiveness does not hold anymore in today's user
      environments and makes cgroup v2 usage from BPF side fragile and unreliable.
      This fix adds proper struct cgroup pointer for the cgroup v2 case to struct
      sock_cgroup_data in order to address these issues; this implicitly also fixes
      the tradeoffs being made back then with regards to races and refcount leaks
      as stated in bd1060a1, and removes the fallback, so that cgroup v2 BPF
      programs always operate as expected.
      
        [0] https://github.com/nestybox/sysbox/
        [1] https://kind.sigs.k8s.io/
      
      Fixes: bd1060a1 ("sock, cgroup: add sock->sk_cgroup")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Stanislav Fomichev <sdf@google.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Link: https://lore.kernel.org/bpf/20210913230759.2313-1-daniel@iogearbox.net
      8520e224
  11. 13 Sep, 2021 5 commits
  12. 11 Sep, 2021 1 commit