1. 08 Dec, 2020 2 commits
    • mwl8k: switch from 'pci_' to 'dma_' API · 01b660b8
      Christophe JAILLET authored
      The wrappers in include/linux/pci-dma-compat.h should go away.
      
      The patch has been generated with the coccinelle script below and has been
      hand modified to replace GFP_ with a correct flag.
      It has been compile tested.
      
      When memory is allocated in 'mwl8k_rxq_init()' and 'mwl8k_txq_init()'
      GFP_KERNEL can be used because this flag is already used in a 'kcalloc()'
      call, just a few lines below.
      
      When memory is allocated in 'mwl8k_firmware_load_success()' GFP_KERNEL can
      be used because this flag is already used within 'ieee80211_register_hw()'
      which is called just a few lines below.
      
      @@
      @@
      -    PCI_DMA_BIDIRECTIONAL
      +    DMA_BIDIRECTIONAL
      
      @@
      @@
      -    PCI_DMA_TODEVICE
      +    DMA_TO_DEVICE
      
      @@
      @@
      -    PCI_DMA_FROMDEVICE
      +    DMA_FROM_DEVICE
      
      @@
      @@
      -    PCI_DMA_NONE
      +    DMA_NONE
      
      @@
      expression e1, e2, e3;
      @@
      -    pci_alloc_consistent(e1, e2, e3)
      +    dma_alloc_coherent(&e1->dev, e2, e3, GFP_)
      
      @@
      expression e1, e2, e3;
      @@
      -    pci_zalloc_consistent(e1, e2, e3)
      +    dma_alloc_coherent(&e1->dev, e2, e3, GFP_)
      
      @@
      expression e1, e2, e3, e4;
      @@
      -    pci_free_consistent(e1, e2, e3, e4)
      +    dma_free_coherent(&e1->dev, e2, e3, e4)
      
      @@
      expression e1, e2, e3, e4;
      @@
      -    pci_map_single(e1, e2, e3, e4)
      +    dma_map_single(&e1->dev, e2, e3, e4)
      
      @@
      expression e1, e2, e3, e4;
      @@
      -    pci_unmap_single(e1, e2, e3, e4)
      +    dma_unmap_single(&e1->dev, e2, e3, e4)
      
      @@
      expression e1, e2, e3, e4, e5;
      @@
      -    pci_map_page(e1, e2, e3, e4, e5)
      +    dma_map_page(&e1->dev, e2, e3, e4, e5)
      
      @@
      expression e1, e2, e3, e4;
      @@
      -    pci_unmap_page(e1, e2, e3, e4)
      +    dma_unmap_page(&e1->dev, e2, e3, e4)
      
      @@
      expression e1, e2, e3, e4;
      @@
      -    pci_map_sg(e1, e2, e3, e4)
      +    dma_map_sg(&e1->dev, e2, e3, e4)
      
      @@
      expression e1, e2, e3, e4;
      @@
      -    pci_unmap_sg(e1, e2, e3, e4)
      +    dma_unmap_sg(&e1->dev, e2, e3, e4)
      
      @@
      expression e1, e2, e3, e4;
      @@
      -    pci_dma_sync_single_for_cpu(e1, e2, e3, e4)
      +    dma_sync_single_for_cpu(&e1->dev, e2, e3, e4)
      
      @@
      expression e1, e2, e3, e4;
      @@
      -    pci_dma_sync_single_for_device(e1, e2, e3, e4)
      +    dma_sync_single_for_device(&e1->dev, e2, e3, e4)
      
      @@
      expression e1, e2, e3, e4;
      @@
      -    pci_dma_sync_sg_for_cpu(e1, e2, e3, e4)
      +    dma_sync_sg_for_cpu(&e1->dev, e2, e3, e4)
      
      @@
      expression e1, e2, e3, e4;
      @@
      -    pci_dma_sync_sg_for_device(e1, e2, e3, e4)
      +    dma_sync_sg_for_device(&e1->dev, e2, e3, e4)
      
      @@
      expression e1, e2;
      @@
      -    pci_dma_mapping_error(e1, e2)
      +    dma_mapping_error(&e1->dev, e2)
      
      @@
      expression e1, e2;
      @@
      -    pci_set_dma_mask(e1, e2)
      +    dma_set_mask(&e1->dev, e2)
      
      @@
      expression e1, e2;
      @@
      -    pci_set_consistent_dma_mask(e1, e2)
      +    dma_set_coherent_mask(&e1->dev, e2)
      Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
      Link: https://lore.kernel.org/r/20201129150844.1466214-1-christophe.jaillet@wanadoo.fr
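The renaming rules in the script above can be illustrated with a small text-rewriting model. This is only a sketch: it uses regular expressions instead of Coccinelle's semantic matching, covers a hypothetical subset of the rules, and deliberately leaves out the allocation rules, because those need a hand-chosen GFP_ flag as the commit message explains. The names `DIRECTIONS`, `FUNCTIONS` and `convert` are made up for this example.

```python
import re

# Direction constants renamed by the first four rules of the script.
DIRECTIONS = {
    "PCI_DMA_BIDIRECTIONAL": "DMA_BIDIRECTIONAL",
    "PCI_DMA_TODEVICE": "DMA_TO_DEVICE",
    "PCI_DMA_FROMDEVICE": "DMA_FROM_DEVICE",
    "PCI_DMA_NONE": "DMA_NONE",
}

# A subset of the function renames; each takes the pci_dev as first argument.
FUNCTIONS = {
    "pci_map_single": "dma_map_single",
    "pci_unmap_single": "dma_unmap_single",
    "pci_dma_mapping_error": "dma_mapping_error",
    "pci_free_consistent": "dma_free_coherent",
}


def convert(line: str) -> str:
    """Apply the renames to one line of C source (toy model, not Coccinelle)."""
    for old, new in FUNCTIONS.items():
        # Only a plain identifier or member chain is handled as first argument.
        pattern = re.compile(r"\b" + old + r"\(\s*([\w.>-]+)\s*,")
        line = pattern.sub(lambda m, new=new: f"{new}(&{m.group(1)}->dev,", line)
    for old, new in DIRECTIONS.items():
        line = re.sub(r"\b" + old + r"\b", new, line)
    return line
```

For instance, `convert("addr = pci_map_single(priv->pdev, skb->data, len, PCI_DMA_TODEVICE);")` rewrites both the call and the direction constant. Real conversions should still go through Coccinelle, which understands C expressions rather than text.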
    • rtw88: pci: Add prototypes for .probe, .remove and .shutdown · 2e86ef41
      Lee Jones authored
      Also strip out other duplicates from driver specific headers.
      
      Ensure 'main.h' is explicitly included in 'pci.h' since the latter
      uses some defines from the former.  It avoids issues like:
      
       from drivers/net/wireless/realtek/rtw88/rtw8822be.c:5:
       drivers/net/wireless/realtek/rtw88/pci.h:209:28: error: ‘RTK_MAX_TX_QUEUE_NUM’ undeclared here (not in a function); did you mean ‘RTK_MAX_RX_DESC_NUM’?
       209 | DECLARE_BITMAP(tx_queued, RTK_MAX_TX_QUEUE_NUM);
       | ^~~~~~~~~~~~~~~~~~~~
      
      Fixes the following W=1 kernel build warning(s):
      
       drivers/net/wireless/realtek/rtw88/pci.c:1488:5: warning: no previous prototype for ‘rtw_pci_probe’ [-Wmissing-prototypes]
       1488 | int rtw_pci_probe(struct pci_dev *pdev,
       | ^~~~~~~~~~~~~
       drivers/net/wireless/realtek/rtw88/pci.c:1568:6: warning: no previous prototype for ‘rtw_pci_remove’ [-Wmissing-prototypes]
       1568 | void rtw_pci_remove(struct pci_dev *pdev)
       | ^~~~~~~~~~~~~~
       drivers/net/wireless/realtek/rtw88/pci.c:1590:6: warning: no previous prototype for ‘rtw_pci_shutdown’ [-Wmissing-prototypes]
       1590 | void rtw_pci_shutdown(struct pci_dev *pdev)
       | ^~~~~~~~~~~~~~~~
      
      Cc: Yan-Hsuan Chuang <yhchuang@realtek.com>
      Cc: Kalle Valo <kvalo@codeaurora.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: linux-wireless@vger.kernel.org
      Cc: netdev@vger.kernel.org
      Signed-off-by: Lee Jones <lee.jones@linaro.org>
      Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
      Link: https://lore.kernel.org/r/20201126133152.3211309-18-lee.jones@linaro.org
  2. 07 Dec, 2020 8 commits
  3. 06 Dec, 2020 7 commits
  4. 05 Dec, 2020 23 commits
    • net/nfc/nci: Support NCI 2.x initial sequence · bcd684aa
      Bongsu Jeon authored
      Implement the NCI 2.x initial sequence to support NCI 2.x NFCC.
      Since NCI 2.0, the CORE_RESET and CORE_INIT sequences have changed.
      If NFCEE supports NCI 2.x, then the NCI 2.x initial sequence will work.
      
      In NCI 1.0, Initial sequence and payloads are as below:
      (DH)                     (NFCC)
       |  -- CORE_RESET_CMD --> |
       |  <-- CORE_RESET_RSP -- |
       |  -- CORE_INIT_CMD -->  |
       |  <-- CORE_INIT_RSP --  |
       CORE_RESET_RSP payloads are Status, NCI version, Configuration Status.
       CORE_INIT_CMD payloads are empty.
       CORE_INIT_RSP payloads are Status, NFCC Features,
          Number of Supported RF Interfaces, Supported RF Interface,
          Max Logical Connections, Max Routing table Size,
          Max Control Packet Payload Size, Max Size for Large Parameters,
          Manufacturer ID, Manufacturer Specific Information.
      
      In NCI 2.0, Initial Sequence and Parameters are as below:
      (DH)                     (NFCC)
       |  -- CORE_RESET_CMD --> |
       |  <-- CORE_RESET_RSP -- |
       |  <-- CORE_RESET_NTF -- |
       |  -- CORE_INIT_CMD -->  |
       |  <-- CORE_INIT_RSP --  |
       CORE_RESET_RSP payloads are Status.
       CORE_RESET_NTF payloads are Reset Trigger,
          Configuration Status, NCI Version, Manufacturer ID,
          Manufacturer Specific Information Length,
          Manufacturer Specific Information.
       CORE_INIT_CMD payloads are Feature1, Feature2.
       CORE_INIT_RSP payloads are Status, NFCC Features,
          Max Logical Connections, Max Routing Table Size,
          Max Control Packet Payload Size,
          Max Data Packet Payload Size of the Static HCI Connection,
          Number of Credits of the Static HCI Connection,
          Max NFC-V RF Frame Size, Number of Supported RF Interfaces,
          Supported RF Interfaces.
      Signed-off-by: Bongsu Jeon <bongsu.jeon@samsung.com>
      Link: https://lore.kernel.org/r/20201202223147.3472-1-bongsu.jeon@samsung.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
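The two message sequences and the CORE_RESET_NTF field list above can be sketched as follows. This is a minimal model, not the kernel's parser: it assumes each listed field before the manufacturer-specific information occupies one octet, and the names `CoreResetNtf`, `parse_core_reset_ntf` and `init_sequence` are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CoreResetNtf:
    reset_trigger: int
    config_status: int
    nci_version: int
    manufacturer_id: int
    manufacturer_info: bytes


def parse_core_reset_ntf(payload: bytes) -> CoreResetNtf:
    """Split a CORE_RESET_NTF payload into the fields listed in the commit."""
    if len(payload) < 5:
        raise ValueError("CORE_RESET_NTF payload too short")
    info_len = payload[4]  # Manufacturer Specific Information Length
    info = payload[5:5 + info_len]
    if len(info) != info_len:
        raise ValueError("truncated manufacturer specific information")
    return CoreResetNtf(payload[0], payload[1], payload[2], payload[3], info)


def init_sequence(nci_major_version: int) -> list[str]:
    """Messages exchanged between DH and NFCC during initialization."""
    seq = ["CORE_RESET_CMD", "CORE_RESET_RSP"]
    if nci_major_version >= 2:
        # NCI 2.x moves version and configuration data into a notification.
        seq.append("CORE_RESET_NTF")
    seq += ["CORE_INIT_CMD", "CORE_INIT_RSP"]
    return seq
```

The only structural difference between the two sequences is the extra CORE_RESET_NTF, which is why a driver has to wait for that notification before it can learn the NCI version of a 2.x controller.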
    • selftests: forwarding: Add MPLS L2VPN test · 41fdfffd
      Guillaume Nault authored
      Connect hosts H1 and H2 using two intermediate encapsulation routers
      (LER1 and LER2). These routers encapsulate traffic from the hosts,
      including the original Ethernet header, into MPLS.
      
      Use ping to test reachability between H1 and H2.
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Link: https://lore.kernel.org/r/625f5c1aafa3a8085f8d3e082d680a82e16ffbaa.1606918980.git.gnault@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: bna: remove trailing semicolon in macro definition · 0911d463
      Tom Rix authored
      The macro use will already have a semicolon.
      Clean up escaped newlines.
      Signed-off-by: Tom Rix <trix@redhat.com>
      Link: https://lore.kernel.org/r/20201202163622.3733506-1-trix@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tipc: support 128bit node identity for peer removing · 43fcd906
      Hoang Le authored
      Add support for removing a specific down node by its 128-bit
      node identifier, as an alternative to the legacy 32-bit node address.
      
      example:
      $ tipc peer remove identity <1001002|16777777>
      Acked-by: Jon Maloy <jmaloy@redhat.com>
      Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
      Link: https://lore.kernel.org/r/20201203035045.4564-1-hoang.h.le@dektech.com.au
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • nfp: Replace zero-length array with flexible-array member · 7f356166
      Simon Horman authored
      There is a regular need in the kernel to provide a way to declare having a
      dynamically sized set of trailing elements in a structure. Kernel code
      should always use "flexible array members"[1] for these cases. The older
      style of one-element or zero-length arrays should no longer be used[2].
      
      [1] https://en.wikipedia.org/wiki/Flexible_array_member
      [2] https://www.kernel.org/doc/html/v5.9/process/deprecated.html#zero-length-and-one-element-arrays
      
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Signed-off-by: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: Louis Peens <louis.peens@netronome.com>
      Link: https://lore.kernel.org/r/20201204125601.24876-1-simon.horman@netronome.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • nfc: s3fwrn5: skip the NFC bootloader mode · 4fb7b98c
      Bongsu Jeon authored
      If there isn't a proper NFC firmware image, Bootloader mode will be
      skipped.
      Signed-off-by: Bongsu Jeon <bongsu.jeon@samsung.com>
      Reviewed-by: Krzysztof Kozlowski <krzk@kernel.org>
      Link: https://lore.kernel.org/r/20201203225257.2446-1-bongsu.jeon@samsung.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'perf-optimizations-for-tcp-recv-zerocopy' · 43be3a3c
      Jakub Kicinski authored
      Arjun Roy says:
      
      ====================
      Perf. optimizations for TCP Recv. Zerocopy
      
      This patchset contains several optimizations for TCP Recv. Zerocopy.
      
      Summarized:
      1. It is possible that a read payload is not exactly page aligned -
      that there may exist "straggler" bytes that we cannot map into the
      caller's address space cleanly. For this, we allow the caller to
      provide as argument a "hybrid copy buffer", turning
      getsockopt(TCP_ZEROCOPY_RECEIVE) into a "hybrid" operation that allows
      the caller to avoid a subsequent recvmsg() call to read the
      stragglers.
      
      2. Similarly, for "small" read payloads that are either below the size
      of a page, or small enough that remapping pages is not a performance
      win - we allow the user to short-circuit the remapping operations
      entirely and simply copy into the buffer provided.
      
      Some of the patches in the middle of this set are refactors to support
      this "short-circuiting" optimization.
      
      3. We allow the user to provide a hint that performing a page zap
      operation (and the accompanying TLB shootdown) may not be necessary,
      for the provided region that the kernel will attempt to map pages
      into. This allows us to avoid this expensive operation while holding
      the socket lock, which provides a significant performance advantage.
      
      With all of these changes combined, "medium" sized receive traffic
      (multiple tens to a few hundreds of KB) sees significant efficiency gains
      when using TCP receive zerocopy instead of regular recvmsg(). For
      example, with RPC-style traffic with 32KB messages, there is a roughly
      15% efficiency improvement when using zerocopy. Without these changes,
      there is a roughly 60-70% efficiency reduction with such messages when
      employing zerocopy.
      ====================
      
      Link: https://lore.kernel.org/r/20201202225349.935284-1-arjunroy.kdev@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net-zerocopy: Defer vm zap unless actually needed. · 94ab9eb9
      Arjun Roy authored
      Zapping pages is required only if we are calling vm_insert_page into a
      region where pages had previously been mapped. Receive zerocopy allows
      reusing such regions, and hitherto called zap_page_range() before
      calling vm_insert_page() in that range.
      
      zap_page_range() can also be triggered from userspace with
      madvise(MADV_DONTNEED). If userspace is configured to call this before
      reusing a segment, or if there was nothing mapped at this virtual
      address to begin with, we can avoid calling zap_page_range() under the
      socket lock. That said, if userspace does not do that, then we are
      still responsible for calling zap_page_range().
      
      This patch adds a flag that the user can use to hint to the kernel
      that a zap is not required. If the flag is not set, or if an older
      user application does not have a flags field at all, then the kernel
      calls zap_page_range as before. Also, if the flag is set but a zap is
      still required, the kernel performs that zap as necessary. Thus
      incorrectly indicating that a zap can be avoided does not change the
      correctness of operation. It also increases the batchsize for
      vm_insert_pages and prefetches the page struct for the batch since
      we're about to bump the refcount.
      
      An alternative mechanism could be to not have a flag, assume by
      default a zap is not needed, and fall back to zapping if needed.
      However, this would harm performance for older applications for which
      a zap is necessary, and thus we implement it with an explicit flag
      so newer applications can opt in.
      
      When using RPC-style traffic with medium sized (tens of KB) RPCs, this
      change yields an efficiency improvement of about 30% for QPS/CPU usage.
      Signed-off-by: Arjun Roy <arjunroy@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
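The zap decision described in this commit message reduces to a small truth table. The sketch below is a model of that description only; `should_zap` and its parameter names are hypothetical and do not correspond to kernel symbols.

```python
def should_zap(hint_no_zap: bool, region_has_mapped_pages: bool) -> bool:
    """Model of when the kernel zaps the target region before inserting pages."""
    if not hint_no_zap:
        # No hint (or an older application without a flags field):
        # zap unconditionally, as before the patch.
        return True
    # Hint given: zap only if pages are in fact still mapped, so a wrong
    # hint costs performance but never correctness.
    return region_has_mapped_pages
```

The only combination that skips the zap is a hint plus a region that really is empty, for example after userspace has called madvise(MADV_DONTNEED) on it.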
    • net-zerocopy: Set zerocopy hint when data is copied · 0c3936d3
      Arjun Roy authored
      Set zerocopy hint, even when falling back to copy, so that the
      pending data can be efficiently received using zerocopy when
      possible.
      Signed-off-by: Arjun Roy <arjunroy@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net-zerocopy: Introduce short-circuit small reads. · f21a3c48
      Arjun Roy authored
      Sometimes, we may call tcp receive zerocopy when inq is 0,
      or inq < PAGE_SIZE, or inq is generally small enough that
      it is cheaper to copy rather than remap pages.
      
      In these cases, we may want to either return early (inq=0) or
      attempt to use the provided copy buffer to simply copy
      the received data.
      
      This allows us to save both system call overhead and
      the latency of acquiring mmap_sem in read mode for cases where
      it would be useless to do so.
      
      This patchset enables this behaviour by:
      1. Returning quickly if inq is 0.
      2. Attempting to perform a regular copy if a hybrid copybuffer is
         provided and it is large enough to absorb all available bytes.
      3. Returning quickly if no such buffer was provided and there are fewer
         than PAGE_SIZE bytes available.
      
      For small RPC ping-pong workloads, normally we would have
      1 getsockopt(), 1 recvmsg() and 1 sendmsg() call per RPC. With this
      change, we remove the recvmsg() call entirely, reducing the syscall
      overhead by about 33%. In testing with small (hundreds of bytes)
      RPC traffic, this yields a syscall reduction of about 33% and
      an efficiency gain of about 3-5% when defined as QPS/CPU Util.
      Signed-off-by: Arjun Roy <arjunroy@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
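The three short-circuit cases listed in this commit message can be sketched as a decision function. This is a model under stated assumptions: `PAGE_SIZE` is fixed at 4096, "no such buffer" is read as "no copy buffer large enough for all available bytes", and the function name and the returned labels are invented for illustration.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


def zerocopy_receive_plan(inq: int, copybuf_len: int) -> str:
    """Pick an action given the queued bytes (inq) and the hybrid copy buffer size."""
    if inq == 0:
        return "return-early"  # case 1: nothing to read
    if inq <= copybuf_len:
        return "copy"          # case 2: the buffer absorbs all available bytes
    if inq < PAGE_SIZE:
        return "return-hint"   # case 3: too small to remap, caller should recvmsg()
    return "remap"             # otherwise take the page remapping path
```

For a small RPC of a few hundred bytes with a 1 KB copy buffer the plan is "copy", which is what removes the recvmsg() call and gives the roughly one-third reduction in syscalls (three calls per RPC down to two) mentioned above.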
    • net-zerocopy: Fast return if inq < PAGE_SIZE · 936ced41
      Arjun Roy authored
      Sometimes, we may call tcp receive zerocopy when inq is 0,
      or inq < PAGE_SIZE, in which case we cannot remap pages. In this case,
      simply return the appropriate hint for regular copying without taking
      mmap_sem.
      Signed-off-by: Arjun Roy <arjunroy@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net-zerocopy: Refactor frag-is-remappable test. · 98917cf0
      Arjun Roy authored
      Refactor frag-is-remappable test for tcp receive zerocopy. This is
      part of a patch set that introduces short-circuited hybrid copies
      for small receive operations, which results in roughly 33% fewer
      syscalls for small RPC scenarios.
      Signed-off-by: Arjun Roy <arjunroy@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net-zerocopy: Refactor skb frag fast-forward op. · 7fba5309
      Arjun Roy authored
      Refactor skb frag fast-forwarding for tcp receive zerocopy. This is
      part of a patch set that introduces short-circuited hybrid copies
      for small receive operations, which results in roughly 33% fewer
      syscalls for small RPC scenarios.
      
      skb_advance_to_frag(), given a skb and an offset into the skb,
      iterates from the first frag for the skb until we're at the frag
      specified by the offset. Assuming the offset provided refers to how
      many bytes in the skb are already read, the returned frag points to
      the next frag we may read from, while offset_frag is set to the number
      of bytes from this frag that we have already read.
      
      If frag is not null and offset_frag is equal to 0, then we may be able
      to map this frag's page into the process address space with
      vm_insert_page(). However, if offset_frag is not equal to 0, then we
      cannot do so.
      Signed-off-by: Arjun Roy <arjunroy@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
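The fast-forward logic described for skb_advance_to_frag() can be modelled on a plain list of frag sizes. This is a simplified sketch: it ignores the skb's linear header area and frag lists, represents frags only by their byte sizes, and returns an index instead of a pointer. The names `advance_to_frag` and `can_remap` are hypothetical.

```python
def advance_to_frag(frag_sizes, offset):
    """Return (frag_index, offset_frag) after `offset` bytes were already read.

    frag_index == len(frag_sizes) means every frag has been consumed.
    """
    index = 0
    while offset:
        if index == len(frag_sizes):
            raise ValueError("offset lies beyond the last frag")
        if frag_sizes[index] > offset:
            # The offset falls inside this frag: part of it was already read.
            return index, offset
        offset -= frag_sizes[index]
        index += 1
    return index, 0


def can_remap(frag_sizes, offset):
    """A frag is a remap candidate only if reading would start at its first byte."""
    index, offset_frag = advance_to_frag(frag_sizes, offset)
    return index < len(frag_sizes) and offset_frag == 0
```

With frags of 4096, 4096 and 100 bytes, an offset of 4096 lands exactly on the second frag (a remap candidate), while an offset of 5000 lands 904 bytes into it, so that data has to be copied instead. A candidate may still fail other checks that this model leaves out.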
    • net-tcp: Introduce tcp_recvmsg_locked(). · 2cd81161
      Arjun Roy authored
      Refactor tcp_recvmsg() by splitting it into locked and unlocked
      portions. Callers already holding the socket lock and not using
      ERRQUEUE/cmsg/busy polling can simply call tcp_recvmsg_locked().
      This is in preparation for a short-circuit copy performed by
      TCP receive zerocopy for small (< PAGE_SIZE, or otherwise requested
      by the user) reads.
      Signed-off-by: Arjun Roy <arjunroy@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net-zerocopy: Copy straggler unaligned data for TCP Rx. zerocopy. · 18fb76ed
      Arjun Roy authored
      When TCP receive zerocopy does not successfully map the entire
      requested space, it outputs a 'hint' that the caller should recvmsg().
      
      Augment zerocopy to accept a user buffer that it tries to copy this
      hint into - if it is possible to copy the entire hint, it will do so.
      This elides a recvmsg() call for received traffic that isn't exactly
      page-aligned in size.
      
      This was tested with RPC-style traffic of arbitrary sizes. Normally,
      each received message required at least one getsockopt() call, and one
      recvmsg() call for the remaining unaligned data.
      
      With this change, almost all of the recvmsg() calls are eliminated,
      leading to a savings of about 25%-50% in number of system calls
      for RPC-style workloads.
      Signed-off-by: Arjun Roy <arjunroy@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'seg6-add-support-for-srv6-end-dt4-dt6-behavior' · 4be986c8
      Jakub Kicinski authored
      Andrea Mayer says:
      
      ====================
      seg6: add support for SRv6 End.DT4/DT6 behavior
      
      This patchset provides support for the SRv6 End.DT4 and End.DT6 (VRF mode)
      behaviors.
      
      The SRv6 End.DT4 behavior is used to implement multi-tenant IPv4 L3 VPNs. It
      decapsulates the received packets and performs IPv4 routing lookup in the
      routing table of the tenant. The SRv6 End.DT4 Linux implementation leverages a
      VRF device in order to force the routing lookup into the associated routing
      table.
      The SRv6 End.DT4 behavior is defined in the SRv6 Network Programming [1].
      
      The Linux kernel already offers an implementation of the SRv6 End.DT6 behavior
      which allows us to set up IPv6 L3 VPNs over SRv6 networks. This new
      implementation of DT6 is based on the same VRF infrastructure already exploited
      for implementing the SRv6 End.DT4 behavior. The aim of the new SRv6 End.DT6 in
      VRF mode consists in simplifying the construction of IPv6 L3 VPN services in
      the multi-tenant environment.
      Currently, the two SRv6 End.DT6 implementations (legacy and VRF mode)
      coexist seamlessly and can be chosen according to the context and the user
      preferences.
      
      - Patch 1 is needed to solve a pre-existing issue with tunneled packets
        when a sniffer is attached;
      
      - Patch 2 improves the management of the seg6local attributes used by the
        SRv6 behaviors;
      
      - Patch 3 adds support for optional attributes in SRv6 behaviors;
      
      - Patch 4 introduces two callbacks used for customizing the
        creation/destruction of a SRv6 behavior;
      
      - Patch 5 is the core patch that adds support for the SRv6 End.DT4
        behavior;
      
      - Patch 6 introduces the VRF support for SRv6 End.DT6 behavior;
      
      - Patch 7 adds the selftest for SRv6 End.DT4 behavior;
      
      - Patch 8 adds the selftest for SRv6 End.DT6 (VRF mode) behavior.
      
      Regarding iproute2, the support for the new "vrftable" attribute, required by
      both SRv6 End.DT4 and End.DT6 (VRF mode) behaviors, is provided in a different
      patchset that will follow shortly.
      
      I would like to thank David Ahern for his support during the development of
      this patchset.
      
      [1] https://tools.ietf.org/html/draft-ietf-spring-srv6-network-programming
      ====================
      
      Link: https://lore.kernel.org/r/20201202130517.4967-1-andrea.mayer@uniroma2.it
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests: add selftest for the SRv6 End.DT6 (VRF) behavior · 2bc03553
      Andrea Mayer authored
      This selftest is designed for evaluating the new SRv6 End.DT6 (VRF) behavior
      used, in this example, for implementing IPv6 L3 VPN use cases.
      Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
      Signed-off-by: Paolo Lungaroni <paolo.lungaroni@cnit.it>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests: add selftest for the SRv6 End.DT4 behavior · 2195444e
      Andrea Mayer authored
      This selftest is designed for evaluating the new SRv6 End.DT4 behavior
      used, in this example, for implementing IPv4 L3 VPN use cases.
      Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • seg6: add VRF support for SRv6 End.DT6 behavior · 20a081b7
      Andrea Mayer authored
      SRv6 End.DT6 is defined in the SRv6 Network Programming [1].
      
      The Linux kernel already offers an implementation of the SRv6
      End.DT6 behavior which permits IPv6 L3 VPNs over SRv6 networks. This
      implementation is not particularly suitable in contexts where we need to
      deploy IPv6 L3 VPNs among different tenants which share the same network
      address schemes. The underlying problem lies in the fact that the
      current version of DT6 (called legacy DT6 from now on) needs a complex
      configuration to be applied on routers which requires ad-hoc routes and
      routing policy rules to ensure the correct isolation of tenants.
      
      Consequently, a new implementation of DT6 has been introduced with the
      aim of simplifying the construction of IPv6 L3 VPN services in the
      multi-tenant environment using SRv6 networks. To accomplish this task,
      we reused the same VRF infrastructure and SRv6 core components already
      exploited for implementing the SRv6 End.DT4 behavior.
      
      Currently the two End.DT6 implementations coexist seamlessly and can be
      used depending on the context and the user preferences. So, in order to
      support both versions of DT6 a new attribute (vrftable) has been
      introduced which allows us to differentiate the implementation of the
      behavior to be used.
      
      A SRv6 End.DT6 legacy behavior is still instantiated using a command
      like the following one:
      
       $ ip -6 route add 2001:db8::1 encap seg6local action End.DT6 table 100 dev eth0
      
      To instantiate SRv6 End.DT6 in VRF mode, the command is just as
      straightforward:
      
       $ ip -6 route add 2001:db8::1 encap seg6local action End.DT6 vrftable 100 dev eth0
      
      Obviously as in the case of SRv6 End.DT4, the VRF strict_mode parameter
      must be set (net.vrf.strict_mode=1) and the VRF associated with table
      100 must exist.
      
      Please note that the instances of SRv6 End.DT6 legacy and End.DT6 VRF
      mode can coexist in the same system/configuration without problems.
      
      [1] https://tools.ietf.org/html/draft-ietf-spring-srv6-network-programming
      Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • seg6: add support for the SRv6 End.DT4 behavior · 664d6f86
      Andrea Mayer authored
      SRv6 End.DT4 is defined in the SRv6 Network Programming [1].
      
      The SRv6 End.DT4 behavior is used to implement IPv4 L3VPN use cases in
      multi-tenant environments. It decapsulates the received packets and
      performs an IPv4 routing lookup in the routing table of the tenant.
      
      The SRv6 End.DT4 Linux implementation leverages a VRF device in order to
      force the routing lookup into the associated routing table.
      
      To make the End.DT4 work properly, it must be guaranteed that the routing
      table used for routing lookup operations is bound to one and only one
      VRF during the tunnel creation. Such a constraint has to be enforced by
      enabling the VRF strict_mode sysctl parameter, i.e.:
       $ sysctl -wq net.vrf.strict_mode=1
      
      At JANOG44, LINE corporation presented their multi-tenant DC architecture
      using SRv6 [2]. In the slides, they reported that the Linux kernel is
      missing the support of SRv6 End.DT4 behavior.
      
      The SRv6 End.DT4 behavior can be instantiated using a command similar to
      the following:
      
       $ ip route add 2001:db8::1 encap seg6local action End.DT4 vrftable 100 dev eth0
      
      We introduce the "vrftable" extension in iproute2 in a following patch.
      
      [1] https://tools.ietf.org/html/draft-ietf-spring-srv6-network-programming
      [2] https://speakerdeck.com/line_developers/line-data-center-networking-with-srv6
      Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • seg6: add callbacks for customizing the creation/destruction of a behavior · cfdf64a0
      Andrea Mayer authored
      We introduce two callbacks used for customizing the creation/destruction of
      a SRv6 behavior. Such callbacks are defined in the new struct
      seg6_local_lwtunnel_ops and hereafter we provide a brief description of
      them:
      
       - build_state(...): used for calling the custom constructor of the
         behavior during its initialization phase and after all the attributes
         have been parsed successfully;
      
       - destroy_state(...): used for calling the custom destructor of the
         behavior before it is completely destroyed.
      Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • seg6: add support for optional attributes in SRv6 behaviors · 0a3021f1
      Andrea Mayer authored
      Before this patch, each SRv6 behavior specifies a set of required
      attributes that must be provided by the userspace application when such
      behavior is going to be instantiated. If at least one of the required
      attributes is not provided, the creation of the behavior fails.
      
      The SRv6 behavior framework lacks a way to manage optional attributes.
      By definition, an optional attribute for a SRv6 behavior consists of an
      attribute which may or may not be provided by the userspace. Therefore,
      if an optional attribute is missing (and thus not supplied by the user)
      the creation of the behavior goes ahead without any issue.
      
      This patch explicitly differentiates the required attributes from the
      optional attributes. In particular, each behavior can declare a set of
      required attributes and a set of optional ones.
      
      The semantics of the required attributes remain *totally* unaffected by
      this patch. The introduction of the optional attributes does NOT affect
      the backward compatibility of the existing SRv6 behaviors.
      
      It is essential to note that if an (optional or required) attribute is
      supplied to a SRv6 behavior which does not expect it, the behavior
      simply discards such attribute without generating any error or warning.
      This operating mode remained unchanged both before and after the
      introduction of the optional attributes extension.
      
      The optional attributes are one of the key components used to implement
      the SRv6 End.DT6 behavior based on the Virtual Routing and Forwarding
      (VRF) framework. The optional attributes make possible the coexistence
      of the already existing SRv6 End.DT6 implementation with the new SRv6
      End.DT6 VRF-based implementation without breaking any backward
      compatibility. Further details on the SRv6 End.DT6 behavior (VRF mode)
      are reported in subsequent patches.
      
      From the userspace point of view, the support for optional attributes does
      NOT require any changes to userspace applications (e.g. iproute2)
      unless new attributes (required or optional) are needed.
      Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
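The required/optional split described in this commit message can be modelled with bitmasks. This is a sketch of the described semantics only: the attribute bit names (`ATTR_TABLE`, `ATTR_VRFTABLE`, `ATTR_SRH`) and the function `attrs_to_parse` are hypothetical and are not the kernel's identifiers.

```python
# Hypothetical attribute bits for illustration only.
ATTR_TABLE = 1 << 0
ATTR_VRFTABLE = 1 << 1
ATTR_SRH = 1 << 2


def attrs_to_parse(required: int, optional: int, supplied: int) -> int:
    """Return the mask of supplied attributes that the behavior will parse."""
    missing = required & ~supplied
    if missing:
        # A missing required attribute makes creation of the behavior fail.
        raise ValueError(f"missing required attributes: {missing:#x}")
    # Optional attributes are parsed only when present; attributes the
    # behavior does not expect are silently discarded.
    return supplied & (required | optional)
```

A behavior that declares no required attributes and `ATTR_TABLE | ATTR_VRFTABLE` as optional accepts either one, which is the kind of flexibility that lets the legacy and VRF-based End.DT6 variants share one action.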
    • seg6: improve management of behavior attributes · 964adce5
      Andrea Mayer authored
      Depending on the attribute (e.g. SEG6_LOCAL_SRH, SEG6_LOCAL_TABLE, etc.),
      the parse() callback performs some validity checks on the provided input
      and updates the tunnel state (slwt) with the result of the parsing
      operation. However, an attribute may also need to reserve some additional
      resources (e.g. allocating memory or setting up an eBPF program) in the parse()
      callback to complete the parsing operation.
      
      The parse() callbacks are invoked by the parse_nla_action() for each
      attribute belonging to a specific behavior. Given a behavior with N
      attributes, if the parsing of the i-th attribute fails, the
      parse_nla_action() returns immediately with an error. Nonetheless, the
      resources acquired while parsing the first i-1 attributes are not freed
      by the parse_nla_action().
      
      Attributes which acquire resources must release them *in an explicit way*
      in both the seg6_local_{build/destroy}_state(). However, adding a new
      attribute of this type requires changes to
      seg6_local_{build/destroy}_state() to release the resources correctly.
      
      The seg6local infrastructure still lacks a simple and structured way to
      release the resources acquired in the parse() operations.
      
      We introduced a new callback in the struct seg6_action_param named
      destroy(). This callback releases any resource which may have been acquired
      in the parse() counterpart. Each attribute may or may not implement the
      destroy() callback depending on whether it needs to free some acquired
      resources.
      
      The destroy() callback comes with several advantages:
      
       1) we can have as many attributes as we want for a given behavior with no
          need to explicitly free the taken resources;
      
       2) As in case of the seg6_local_build_state(), the
          seg6_local_destroy_state() does not need to handle the release of
          resources directly. Indeed, it calls the destroy_attrs() function which
          is in charge of calling the destroy() callback for every set attribute.
          We do not need to patch seg6_local_{build/destroy}_state() anymore as
          we add new attributes;
      
       3) the code is more readable and better structured. Indeed, all the
          information needed to handle a given attribute is contained in only
          one place;
      
       4) it facilitates the integration with new features introduced in further
          patches.
      Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
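The parse()/destroy() pairing described in this commit message can be sketched as follows. This is a user-space model, not the kernel code: `AttrOps`, `parse_attrs`, `destroy_attrs` and the two example attributes are hypothetical, and a Python set stands in for resources such as allocated memory or a loaded eBPF program.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AttrOps:
    name: str
    parse: Callable[[object, dict], None]
    destroy: Optional[Callable[[dict], None]] = None  # only if resources are held


def destroy_attrs(parsed, state):
    """Call destroy() for every attribute that was parsed and defines one."""
    for ops in parsed:
        if ops.destroy is not None:
            ops.destroy(state)


def parse_attrs(attrs, values, state):
    """Parse attributes in order; on failure release what was acquired so far."""
    parsed = []
    for ops in attrs:
        try:
            ops.parse(values[ops.name], state)
        except ValueError:
            destroy_attrs(parsed, state)
            raise
        parsed.append(ops)
    return parsed


def parse_prog(value, state):
    state["resources"].add("prog")  # stands in for setting up an eBPF program


def destroy_prog(state):
    state["resources"].discard("prog")


def parse_table(value, state):
    if value < 0:
        raise ValueError("invalid table id")
    state["table"] = value


BEHAVIOR = [AttrOps("prog", parse_prog, destroy_prog), AttrOps("table", parse_table)]
```

Because each attribute carries its own destroy(), adding a new resource-holding attribute means adding one entry to the table; neither the rollback path nor the teardown path needs to change.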