1. 21 Oct, 2021 4 commits
    • net: mscc: ocelot: allow a config where all bridge VLANs are egress-untagged · 0da1a1c4
      Vladimir Oltean committed
      At present, the ocelot driver accepts a single egress-untagged bridge
      VLAN, meaning that this sequence of operations:
      
      ip link add br0 type bridge vlan_filtering 1
      ip link set swp0 master br0
      bridge vlan add dev swp0 vid 2 pvid untagged
      
      fails because the bridge automatically installs VID 1 as a pvid & untagged
      VLAN, and vid 2 would be the second untagged VLAN on this port. It is
      necessary to delete VID 1 before proceeding to add VID 2.
      
      This limitation comes from the fact that we operate the port tag, when
      it has an egress-untagged VID, in the OCELOT_PORT_TAG_NATIVE mode.
      The ocelot switches do not have full flexibility and can either have one
      single VID as egress-untagged, or all of them.
      
      There are use cases for having all VLANs as egress-untagged as well, and
      this patch adds support for that.
      
      The change rewrites ocelot_port_set_native_vlan() into a more generic
      ocelot_port_manage_port_tag() function. Because the software bridge's
      state, transmitted to us via switchdev, can become very complex, we
      don't attempt to track all possible state transitions, but instead take
      a more declarative approach and just make ocelot_port_manage_port_tag()
      figure out which mode to operate in:
      
      - port is VLAN-unaware: the classified VLAN (internal, unrelated to the
                              802.1Q header) is not inserted into packets on egress
      - port is VLAN-aware:
        - port has tagged VLANs:
          -> port has no untagged VLAN: set up as pure trunk
          -> port has one untagged VLAN: set up as trunk port + native VLAN
          -> port has more than one untagged VLAN: this is an invalid config
             which is rejected by ocelot_vlan_prepare
        - port has no tagged VLANs
          -> set up as pure egress-untagged port
      
      We don't keep the number of tagged and untagged VLANs, we just count the
      structures we keep.
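
The decision tree above can be sketched as a pure function of the port's current state. This is a minimal illustration, not the driver code: the mode names and the `choose_port_tag_mode()` helper are hypothetical, and the real driver derives the two counts by walking its VLAN structures.

```python
from enum import Enum


class PortTagMode(Enum):
    """Hypothetical names for the egress tagging modes described above."""
    UNTAGGED = "no 802.1Q tag inserted on egress"
    TRUNK = "every VLAN egresses tagged"
    TRUNK_NATIVE = "every VLAN tagged except one native VLAN"


def choose_port_tag_mode(vlan_aware: bool, num_tagged: int,
                         num_untagged: int) -> PortTagMode:
    """Pick the mode from the current state only, not from its history."""
    if not vlan_aware:
        # The classified VLAN is internal and is not inserted on egress.
        return PortTagMode.UNTAGGED
    if num_tagged == 0:
        # Any number of untagged VLANs is fine when nothing is tagged.
        return PortTagMode.UNTAGGED
    if num_untagged == 0:
        return PortTagMode.TRUNK
    if num_untagged == 1:
        return PortTagMode.TRUNK_NATIVE
    # Tagged VLANs mixed with several untagged ones cannot be expressed.
    raise ValueError("tagged VLANs with more than one untagged VLAN")
```

Because the function only looks at the present counts, the order in which the bridge added or removed VLANs does not matter, which is the point of the declarative approach.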
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mscc: ocelot: convert the VLAN masks to a list · 90e0aa8d
      Vladimir Oltean committed
      First and foremost, the driver currently allocates a constant sized
      4K * u32 (16KB memory) array for the VLAN masks. However, a typical
      application might not need so many VLANs, so if we dynamically allocate
      the memory as needed, we might actually save some space.
      
      Secondly, we'll need to keep more advanced bookkeeping of the VLANs we
      have, notably we'll have to check how many untagged and how many tagged
      VLANs we have. This will have to stay in a structure, and allocating
      another 16 KB array for that is again a bit too much.
      
      So refactor the bridge VLANs into a linked list of structures.
      
      The hook points inside the driver are ocelot_vlan_member_add() and
      ocelot_vlan_member_del(), which previously used to operate on the
      ocelot->vlan_mask[vid] array element.
      
      ocelot_vlan_member_add() and ocelot_vlan_member_del() used to call
      ocelot_vlan_member_set() to commit to the ocelot->vlan_mask.
      Additionally, we had two calls to ocelot_vlan_member_set() from outside
      those callers, and those were directly from ocelot_vlan_init().
      Those calls do not set up bridging service VLANs, instead they:
      
      - clear the VLAN table on reset
      - set the port pvid to the value used by this driver for VLAN-unaware
        standalone port operation (VID 0)
      
      So now, when we have a structure which represents actual bridge VLANs,
      VID 0 doesn't belong in that structure, since it is not part of the
      bridging layer.
      
      So delete the middle man, ocelot_vlan_member_set(), and let
      ocelot_vlan_init() call ocelot_vlant_set_mask() directly, which forgoes
      any data structure and writes directly to hardware, which is all that we
      need.
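
A rough model of the list-based bookkeeping, assuming each entry stores a VID, a port membership mask and a per-port untagged mask. The class and field names here are illustrative assumptions and not the driver's definitions; a Python list stands in for the kernel linked list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BridgeVlan:
    vid: int
    portmask: int = 0   # bit N set: port N is a member of this VLAN
    untagged: int = 0   # bit N set: port N egresses this VLAN untagged


class BridgeVlanList:
    """Entries exist only for VLANs in use, unlike a fixed 4096-slot array."""

    def __init__(self) -> None:
        self.vlans: List[BridgeVlan] = []

    def find(self, vid: int) -> Optional[BridgeVlan]:
        for vlan in self.vlans:
            if vlan.vid == vid:
                return vlan
        return None

    def member_add(self, port: int, vid: int, untagged: bool) -> None:
        vlan = self.find(vid)
        if vlan is None:
            vlan = BridgeVlan(vid)
            self.vlans.append(vlan)
        vlan.portmask |= 1 << port
        if untagged:
            vlan.untagged |= 1 << port
        else:
            vlan.untagged &= ~(1 << port)

    def member_del(self, port: int, vid: int) -> None:
        vlan = self.find(vid)
        if vlan is None:
            return
        vlan.portmask &= ~(1 << port)
        vlan.untagged &= ~(1 << port)
        if vlan.portmask == 0:
            # No members left: free the entry.
            self.vlans.remove(vlan)

    def num_untagged(self, port: int) -> int:
        return sum(1 for v in self.vlans
                   if v.portmask & (1 << port) and v.untagged & (1 << port))

    def num_tagged(self, port: int) -> int:
        return sum(1 for v in self.vlans
                   if v.portmask & (1 << port) and not v.untagged & (1 << port))
```

Counting tagged and untagged VLANs per port is then a walk over the entries, so no separate counters have to be kept in sync.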
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mscc: ocelot: add a type definition for REW_TAG_CFG_TAG_CFG · 62a22bcb
      Vladimir Oltean committed
      This is a cosmetic patch which clarifies what the port tagging options
      are for Ocelot switches.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • fq_codel: generalise ce_threshold marking for subset of traffic · dfcb63ce
      Toke Høiland-Jørgensen committed
      Commit e72aeb9e ("fq_codel: implement L4S style ce_threshold_ect1
      marking") expanded the ce_threshold feature of FQ-CoDel so it can
      be applied to a subset of the traffic, using the ECT(1) bit of the ECN
      field as the classifier. However, hard-coding ECT(1) as the only
      classifier for this feature seems limiting, so let's expand it to be more
      general.
      
      To this end, change the parameter from a ce_threshold_ect1 boolean, to a
      one-byte selector/mask pair (ce_threshold_{selector,mask}) which is applied
      to the whole diffserv/ECN field in the IP header. This makes it possible to
      classify packets by any value in either the ECN field or the diffserv
      field. In particular, setting a selector of INET_ECN_ECT_1 and a mask of
      INET_ECN_MASK corresponds to the functionality before this patch, and a
      mask of ~INET_ECN_MASK allows using the selector as a straight-forward
      match against a diffserv code point:
      
       # apply ce_threshold to ECT(1) traffic
       tc qdisc replace dev eth0 root fq_codel ce_threshold 1ms ce_threshold_selector 0x1/0x3
      
       # apply ce_threshold to ECN-capable traffic marked as diffserv AF22
       tc qdisc replace dev eth0 root fq_codel ce_threshold 1ms ce_threshold_selector 0x50/0xfc
      
      Regardless of the selector chosen, the normal rules for ECN-marking of
      packets still apply, i.e., the flow must still declare itself ECN-capable
      by setting one of the bits in the ECN field to get marked at all.
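
The selector/mask test can be sketched as follows. The helper names are hypothetical; the constants mirror the ECN codepoint values (ECT(1) = 0x1, two-bit ECN mask = 0x3), and the logic is a simplified model of the rule described above, not the qdisc code.

```python
INET_ECN_MASK = 0x3    # low two bits of the diffserv/ECN byte
INET_ECN_ECT_1 = 0x1   # ECT(1) codepoint


def selector_matches(dsfield: int, selector: int, mask: int) -> bool:
    """True if the masked diffserv/ECN byte equals the selector."""
    return (dsfield & mask) == selector


def eligible_for_ce_threshold(dsfield: int, selector: int, mask: int) -> bool:
    """The packet must match the selector and also be ECN-capable."""
    ecn_capable = (dsfield & INET_ECN_MASK) != 0
    return ecn_capable and selector_matches(dsfield, selector, mask)
```

With selector 0x50 and mask 0xfc (AF22), a packet carrying 0x52 (AF22 plus ECT(0)) qualifies, while 0x50 (AF22, not ECN-capable) matches the selector but is still not marked.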
      
      v2:
      - Add tc usage examples to patch description
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20211019174709.69081-1-toke@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 20 Oct, 2021 2 commits
  3. 19 Oct, 2021 5 commits
  4. 18 Oct, 2021 10 commits
  5. 16 Oct, 2021 6 commits
  6. 15 Oct, 2021 13 commits
    • soc: fsl: dpio: add Net DIM integration · 69651bd8
      Ioana Ciornei committed
      Use the generic dynamic interrupt moderation (dim) framework to
      implement adaptive interrupt coalescing on Rx. With the per-packet
      interrupt scheme, a high interrupt rate has been noted for moderate
      traffic flows leading to high CPU utilization.
      
      The dpio driver exports new functions to enable/disable adaptive IRQ
      coalescing on a DPIO object, to query the state or to update Net DIM
      with a new set of bytes and frames dequeued.
      Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • soc: fsl: dpio: add support for irq coalescing per software portal · ed1d2143
      Ioana Ciornei committed
      In DPAA2-based SoCs, IRQ coalescing support per software portal has two
      configurable parameters:
       - the IRQ timeout period (QBMAN_CINH_SWP_ITPR): how many units of 256
         QBMAN cycles need to pass until a dequeue interrupt is asserted.
       - the IRQ threshold (QBMAN_CINH_SWP_DQRR_ITR): how many dequeue
         responses in the DQRR ring would generate an IRQ.
      
      Add support for setting up and querying these IRQ coalescing related
      parameters.
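
Since the timeout period is counted in units of 256 QBMAN cycles, a holdoff time has to be converted using the QBMAN clock frequency. The arithmetic below is only a sketch of that conversion: the function name is hypothetical, and register width limits and rounding policy in the real driver are not modelled.

```python
def holdoff_us_to_itp(holdoff_us: int, qbman_clk_hz: int) -> int:
    """Convert a holdoff in microseconds to units of 256 QBMAN cycles."""
    # Number of QBMAN clock cycles that elapse during the holdoff.
    cycles = holdoff_us * qbman_clk_hz // 1_000_000
    # The timeout period register counts in steps of 256 cycles.
    return cycles // 256
```

For example, with an assumed 500 MHz QBMAN clock, a 100 microsecond holdoff is 50000 cycles, which rounds down to 195 units.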
      Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • soc: fsl: dpio: extract the QBMAN clock frequency from the attributes · 2cf0b6fe
      Ioana Ciornei committed
      Through the dpio_get_attributes() firmware call the dpio driver has
      access to the QBMAN clock frequency. Extend the structure which holds
      the firmware's response so that we can have access to this information.
      
      This will be needed in the next patches which also add support for
      interrupt coalescing which needs to be configured based on the
      frequency.
      Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • fq_codel: implement L4S style ce_threshold_ect1 marking · e72aeb9e
      Eric Dumazet committed
      Add TCA_FQ_CODEL_CE_THRESHOLD_ECT1 boolean option to select Low Latency,
      Low Loss, Scalable Throughput (L4S) style marking, along with ce_threshold.
      
      If enabled, only packets with ECT(1) can be transformed to CE
      if their sojourn time is above the ce_threshold.
      
      Note that this new option does not change rules for codel law.
      In particular, if TCA_FQ_CODEL_ECN is left enabled (this is
      the default when fq_codel qdisc is created), ECT(0) packets can
      still get CE if codel law (as governed by limit/target) decides so.
      
      Section 4.3.b of current draft [1] states:
      
      b.  A scheduler with per-flow queues such as FQ-CoDel or FQ-PIE can
          be used for L4S.  For instance within each queue of an FQ-CoDel
          system, as well as a CoDel AQM, there is typically also ECN
          marking at an immediate (unsmoothed) shallow threshold to support
          use in data centres (see Sec.5.2.7 of [RFC8290]).  This can be
          modified so that the shallow threshold is solely applied to
          ECT(1) packets.  Then if there is a flow of non-ECN or ECT(0)
          packets in the per-flow-queue, the Classic AQM (e.g.  CoDel) is
          applied; while if there is a flow of ECT(1) packets in the queue,
          the shallower (typically sub-millisecond) threshold is applied.
      
      Tested:
      
      tc qd replace dev eth1 root fq_codel ce_threshold_ect1 50usec
      
      netperf ... -t TCP_STREAM -- -K dctcp
      
      tc -s -d qd sh dev eth1
      qdisc fq_codel 8022: root refcnt 32 limit 10240p flows 1024 quantum 9212 target 5ms ce_threshold_ect1 49us interval 100ms memory_limit 32Mb ecn drop_batch 64
       Sent 14388596616 bytes 9543449 pkt (dropped 0, overlimits 0 requeues 152013)
       backlog 0b 0p requeues 152013
        maxpacket 68130 drop_overlimit 0 new_flow_count 95678 ecn_mark 0 ce_mark 7639
        new_flows_len 0 old_flows_len 0
      
      [1] L4S current draft:
      https://datatracker.ietf.org/doc/html/draft-ietf-tsvwg-l4s-arch

      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Ingemar Johansson S <ingemar.s.johansson@ericsson.com>
      Cc: Tom Henderson <tomh@tomh.org>
      Cc: Bob Briscoe <in@bobbriscoe.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add skb_get_dsfield() helper · 70e939dd
      Eric Dumazet committed
      skb_get_dsfield(skb) gets dsfield from skb, or -1
      if an error was found.
      
      This is basically a wrapper around ipv4_get_dsfield()
      and ipv6_get_dsfield().
      
      Used by following patch for fq_codel.
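
As an illustration of what such a wrapper has to do, the sketch below extracts the diffserv/ECN byte from a raw IP header. It is a simplified model: `get_dsfield()` is a hypothetical name, and it dispatches on the version nibble of the header, which is an assumption of this sketch rather than a description of how the kernel helper selects the protocol.

```python
def get_dsfield(header: bytes) -> int:
    """Return the diffserv/ECN byte of an IPv4 or IPv6 header, or -1."""
    if not header:
        return -1
    version = header[0] >> 4
    if version == 4:
        if len(header) < 20:       # shorter than a minimal IPv4 header
            return -1
        return header[1]           # the former TOS byte
    if version == 6:
        if len(header) < 40:       # shorter than the fixed IPv6 header
            return -1
        # Traffic class straddles the first two bytes: 4 bits in each.
        return ((header[0] & 0x0F) << 4) | (header[1] >> 4)
    return -1
```

The -1 return value lets a caller distinguish "no usable IP header" from a legitimate dsfield of 0.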
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Ingemar Johansson S <ingemar.s.johansson@ericsson.com>
      Cc: Tom Henderson <tomh@tomh.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: switch orphan_count to bare per-cpu counters · 19757ceb
      Eric Dumazet committed
      Use of percpu_counter structure to track count of orphaned
      sockets is causing problems on modern hosts with 256 cpus
      or more.
      
      Stefan Bach reported a serious spinlock contention in real workloads,
      that I was able to reproduce with a netfilter rule dropping
      incoming FIN packets.
      
          53.56%  server  [kernel.kallsyms]      [k] queued_spin_lock_slowpath
                  |
                  ---queued_spin_lock_slowpath
                     |
                      --53.51%--_raw_spin_lock_irqsave
                                |
                                 --53.51%--__percpu_counter_sum
                                           tcp_check_oom
                                           |
                                           |--39.03%--__tcp_close
                                           |          tcp_close
                                           |          inet_release
                                           |          inet6_release
                                           |          sock_close
                                           |          __fput
                                           |          ____fput
                                           |          task_work_run
                                           |          exit_to_usermode_loop
                                           |          do_syscall_64
                                           |          entry_SYSCALL_64_after_hwframe
                                           |          __GI___libc_close
                                           |
                                            --14.48%--tcp_out_of_resources
                                                      tcp_write_timeout
                                                      tcp_retransmit_timer
                                                      tcp_write_timer_handler
                                                      tcp_write_timer
                                                      call_timer_fn
                                                      expire_timers
                                                      __run_timers
                                                      run_timer_softirq
                                                      __softirqentry_text_start
      
      As explained in commit cf86a086 ("net/dst: use a smaller percpu_counter
      batch for dst entries accounting"), default batch size is too big
      for the default value of tcp_max_orphans (262144).
      
      But even if we reduce batch sizes, there would still be cases
      where the estimated count of orphans is beyond the limit,
      and where tcp_too_many_orphans() has to call the expensive
      percpu_counter_sum_positive().
      
      One solution is to use plain per-cpu counters, and have
      a timer to periodically refresh this cache.
      
      Updating this cache every 100ms seems about right, since TCP pressure
      state does not change radically over shorter periods.
      
      percpu_counter was nice 15 years ago while hosts had less
      than 16 cpus, not anymore by current standards.
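
The idea of plain per-cpu counters with a periodically refreshed cache can be modelled roughly as below. This is a single-threaded sketch with hypothetical names; in the kernel the refresh would be driven by a timer, whereas here `refresh()` is called explicitly.

```python
class OrphanCount:
    """Per-CPU counters plus a cached total that readers consult."""

    def __init__(self, num_cpus: int) -> None:
        self.per_cpu = [0] * num_cpus
        self.cached_total = 0

    def inc(self, cpu: int) -> None:
        # Touches only this CPU's slot, so no shared lock is needed.
        self.per_cpu[cpu] += 1

    def dec(self, cpu: int) -> None:
        # A socket may be released on a different CPU than it was
        # orphaned on, so an individual slot can go negative.
        self.per_cpu[cpu] -= 1

    def refresh(self) -> None:
        # The expensive walk over all CPUs happens only here.
        self.cached_total = max(sum(self.per_cpu), 0)

    def read(self) -> int:
        # Cheap and possibly stale by up to one refresh period.
        return self.cached_total
```

Readers on the hot path pay for a single load instead of a locked sum over every CPU, at the cost of seeing a value that may lag by one refresh interval.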
      
      v2: Fix the build issue for CONFIG_CRYPTO_DEV_CHELSIO_TLS=m,
          reported by kernel test robot <lkp@intel.com>
          Remove unused socket argument from tcp_too_many_orphans()
      
      Fixes: dd24c001 ("net: Use a percpu_counter for orphan_count")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Stefan Bach <sfb@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • page_pool: disable dma mapping support for 32-bit arch with 64-bit DMA · d00e60ee
      Yunsheng Lin committed
      32-bit architectures with 64-bit DMA seem to be rare these days,
      and supporting them means page pool carries a lot of code and
      complexity for systems that possibly do not exist.

      So disable DMA mapping support for such systems. If drivers
      really want to work on such systems, they have to implement
      their own DMA-mapping fallback tracking outside page_pool.
      Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: of: fix stub of_net helpers for CONFIG_NET=n · 8b017fbe
      Arnd Bergmann committed
      Moving the of_net code from drivers/of/ to net/core means we
      no longer stub out the helpers when networking is disabled,
      which leads to a randconfig build failure with at least one
      ARM platform that calls this from non-networking code:
      
      arm-linux-gnueabi-ld: arch/arm/mach-mvebu/kirkwood.o: in function `kirkwood_dt_eth_fixup':
      kirkwood.c:(.init.text+0x54): undefined reference to `of_get_mac_address'
      
      Restore the way this worked before by changing that #ifdef
      check back to testing for both CONFIG_OF and CONFIG_NET.
      
      Fixes: e330fb14 ("of: net: move of_net under net/")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Link: https://lore.kernel.org/r/20211014090055.2058949-1-arnd@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • netfilter: ebtables: allow use of ebt_do_table as hookfn · f0d6764f
      Florian Westphal committed
      This is possible now that the xt_table structure is passed via *priv.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: ip6tables: allow use of ip6t_do_table as hookfn · 44b5990e
      Florian Westphal committed
      This is possible now that the xt_table structure is passed via *priv.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: arp_tables: allow use of arpt_do_table as hookfn · e8d225b6
      Florian Westphal committed
      This is possible now that the xt_table structure is passed in via *priv.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: iptables: allow use of ipt_do_table as hookfn · 8844e010
      Florian Westphal committed
      This is possible now that the xt_table structure is passed in via *priv.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: Introduce egress hook · 42df6e1d
      Lukas Wunner committed
      Support classifying packets with netfilter on egress to satisfy user
      requirements such as:
      * outbound security policies for containers (Laura)
      * filtering and mangling intra-node Direct Server Return (DSR) traffic
        on a load balancer (Laura)
      * filtering locally generated traffic coming in through AF_PACKET,
        such as local ARP traffic generated for clustering purposes or DHCP
        (Laura; the AF_PACKET plumbing is contained in a follow-up commit)
      * L2 filtering from ingress and egress for AVB (Audio Video Bridging)
        and gPTP with nftables (Pablo)
      * in the future: in-kernel NAT64/NAT46 (Pablo)
      
      The egress hook introduced herein complements the ingress hook added by
      commit e687ad60 ("netfilter: add netfilter ingress hook after
      handle_ing() under unique static key").  A patch for nftables to hook up
      egress rules from user space has been submitted separately, so users may
      immediately take advantage of the feature.
      
      Alternatively or in addition to netfilter, packets can be classified
      with traffic control (tc).  On ingress, packets are classified first by
      tc, then by netfilter.  On egress, the order is reversed for symmetry.
      Conceptually, tc and netfilter can be thought of as layers, with
      netfilter layered above tc.
      
      Traffic control is capable of redirecting packets to another interface
      (man 8 tc-mirred).  E.g., an ingress packet may be redirected from the
      host namespace to a container via a veth connection:
      tc ingress (host) -> tc egress (veth host) -> tc ingress (veth container)
      
      In this case, netfilter egress classifying is not performed when leaving
      the host namespace!  That's because the packet is still on the tc layer.
      If tc redirects the packet to a physical interface in the host namespace
      such that it leaves the system, the packet is never subjected to
      netfilter egress classifying.  That is only logical since it hasn't
      passed through netfilter ingress classifying either.
      
      Packets can alternatively be redirected at the netfilter layer using
      nft fwd.  Such a packet *is* subjected to netfilter egress classifying
      since it has reached the netfilter layer.
      
      Internally, the skb->nf_skip_egress flag controls whether netfilter is
      invoked on egress by __dev_queue_xmit().  Because __dev_queue_xmit() may
      be called recursively by tunnel drivers such as vxlan, the flag is
      reverted to false after sch_handle_egress().  This ensures that
      netfilter is applied both on the overlay and underlying network.
      
      Interaction between tc and netfilter is possible by setting and querying
      skb->mark.
      
      If netfilter egress classifying is not enabled on any interface, it is
      patched out of the data path by way of a static_key and doesn't make a
      performance difference that is discernible from noise:
      
      Before:             1537 1538 1538 1537 1538 1537 Mb/sec
      After:              1536 1534 1539 1539 1539 1540 Mb/sec
      Before + tc accept: 1418 1418 1418 1419 1419 1418 Mb/sec
      After  + tc accept: 1419 1424 1418 1419 1422 1420 Mb/sec
      Before + tc drop:   1620 1619 1619 1619 1620 1620 Mb/sec
      After  + tc drop:   1616 1624 1625 1624 1622 1619 Mb/sec
      
      When netfilter egress classifying is enabled on at least one interface,
      a minimal performance penalty is incurred for every egress packet, even
      if the interface it's transmitted over doesn't have any netfilter egress
      rules configured.  That is caused by checking dev->nf_hooks_egress
      against NULL.
      
      Measurements were performed on a Core i7-3615QM.  Commands to reproduce:
      ip link add dev foo type dummy
      ip link set dev foo up
      modprobe pktgen
      echo "add_device foo" > /proc/net/pktgen/kpktgend_3
      samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i foo -n 400000000 -m "11:11:11:11:11:11" -d 1.1.1.1
      
      Accept all traffic with tc:
      tc qdisc add dev foo clsact
      tc filter add dev foo egress bpf da bytecode '1,6 0 0 0,'
      
      Drop all traffic with tc:
      tc qdisc add dev foo clsact
      tc filter add dev foo egress bpf da bytecode '1,6 0 0 2,'
      
      Apply this patch when measuring packet drops to avoid errors in dmesg:
      https://lore.kernel.org/netdev/a73dda33-57f4-95d8-ea51-ed483abd6a7a@iogearbox.net/

      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Cc: Laura García Liébana <nevola@gmail.com>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>