1. 27 Jan 2020, 6 commits
    • devlink: add macro for "fw.roce" · 41c0d917
      Committed by Vasundhara Volam
      Add definition and documentation for the new generic info "fw.roce".
      
      v2: Remove board.nvm_cfg since fw.psid is similar.
      
      Cc: Jiri Pirko <jiri@mellanox.com>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: Support UDP fraglist GRO/GSO. · 9fd1ff5d
      Committed by Steffen Klassert
      This patch extends UDP GRO to support fraglist GRO/GSO
      by using the previously introduced infrastructure.
      If the feature is enabled, all UDP packets go through
      fraglist GRO (both local input and forwarding).
      
      After validating the checksum, we mark ip_summed as
      CHECKSUM_UNNECESSARY for fraglist GRO packets to
      make sure that the checksum is not touched.
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: fix ops->bind_class() implementations · 2e24cd75
      Committed by Cong Wang
      The current implementations of ops->bind_class() merely
      search for the classid and update the class in struct tcf_result,
      without invoking either cl_ops->bind_tcf() or
      cl_ops->unbind_tcf(). This breaks their design, because qdiscs
      like cbq also use them to count filters. This is why syzbot triggered
      the warning in cbq_destroy_class().
      
      In order to fix this, we have to call cl_ops->bind_tcf() and
      cl_ops->unbind_tcf() like the filter binding path. This patch does
      so by refactoring out two helper functions __tcf_bind_filter()
      and __tcf_unbind_filter(), which are lockless and accept a Qdisc
      pointer, then teaching each implementation to call them correctly.
      
      Note that we merely pass the Qdisc pointer to each filter as an
      opaque pointer; filters only need to pass it down to the helper
      functions without understanding it at all.
      
      Fixes: 07d79fc7 ("net_sched: add reverse binding for tc class")
      Reported-and-tested-by: syzbot+0a0596220218fcb603a8@syzkaller.appspotmail.com
      Reported-and-tested-by: syzbot+63bdb6006961d8c917c6@syzkaller.appspotmail.com
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nf_tables: Add set type for arbitrary concatenation of ranges · 3c4287f6
      Committed by Stefano Brivio
      This new set type allows for intervals in concatenated fields.
      Concatenations are expressed in the usual way, that is, as a simple
      byte concatenation with each single field padded to 32 bits, and
      ranges are given by specifying start and end elements, each
      containing the full concatenation of the start or end values for
      the single fields.
      
      For each field, ranges are expanded into their composing netmasks,
      which are inserted as rules in per-field lookup tables. Bits to be
      classified are divided into 4-bit groups, and for each group the
      lookup table contains 2^4 = 16 buckets, one for each possible value
      of a bit group. This approach was inspired by the Grouper
      algorithm:
      	http://www.cse.usf.edu/~ligatti/projects/grouper/
      
      Matching is performed by a sequence of AND operations between
      bucket values, with buckets selected according to the value of
      packet bits, for each group. The result of this sequence tells
      us which rules matched for a given field.
      
      In order to concatenate several ranged fields, per-field rules
      are mapped using mapping arrays, one per field, that specify
      which rules should be considered while matching the next field.
      The mapping array for the last field contains a reference to
      the element originally inserted.
      
      The notes in nft_set_pipapo.c cover the algorithm in deeper
      detail.
      
      A pure hash-based approach is of no use here, as ranges need
      to be classified. An implementation based on "proxying" the
      existing red-black tree set type, creating a tree for each
      field, was considered, but deemed impractical due to the fact
      that elements would need to be shared between trees, at least
      as long as we want to keep UAPI changes to a minimum.
      
      A stand-alone implementation of this algorithm is available at:
      	https://pipapo.lameexcu.se
      together with notes about possible future optimisations
      (in pipapo.c).
      
      This algorithm was designed with data locality in mind, and can
      be highly optimised for SIMD instruction sets, as the bulk of
      the matching work is done with repetitive, simple bitwise
      operations.
      
      At this point, without further optimisations, nft_concat_range.sh
      reports, for one AMD Epyc 7351 thread (2.9GHz, 512 KiB L1D$, 8 MiB
      L2$):
      
      TEST: performance
        net,port                                                      [ OK ]
          baseline (drop from netdev hook):              10190076pps
          baseline hash (non-ranged entries):             6179564pps
          baseline rbtree (match on first field only):    2950341pps
          set with  1000 full, ranged entries:            2304165pps
        port,net                                                      [ OK ]
          baseline (drop from netdev hook):              10143615pps
          baseline hash (non-ranged entries):             6135776pps
          baseline rbtree (match on first field only):    4311934pps
          set with   100 full, ranged entries:            4131471pps
        net6,port                                                     [ OK ]
          baseline (drop from netdev hook):               9730404pps
          baseline hash (non-ranged entries):             4809557pps
          baseline rbtree (match on first field only):    1501699pps
          set with  1000 full, ranged entries:            1092557pps
        port,proto                                                    [ OK ]
          baseline (drop from netdev hook):              10812426pps
          baseline hash (non-ranged entries):             6929353pps
          baseline rbtree (match on first field only):    3027105pps
          set with 30000 full, ranged entries:             284147pps
        net6,port,mac                                                 [ OK ]
          baseline (drop from netdev hook):               9660114pps
          baseline hash (non-ranged entries):             3778877pps
          baseline rbtree (match on first field only):    3179379pps
          set with    10 full, ranged entries:            2082880pps
        net6,port,mac,proto                                           [ OK ]
          baseline (drop from netdev hook):               9718324pps
          baseline hash (non-ranged entries):             3799021pps
          baseline rbtree (match on first field only):    1506689pps
          set with  1000 full, ranged entries:             783810pps
        net,mac                                                       [ OK ]
          baseline (drop from netdev hook):              10190029pps
          baseline hash (non-ranged entries):             5172218pps
          baseline rbtree (match on first field only):    2946863pps
          set with  1000 full, ranged entries:            1279122pps
      
      v4:
       - fix build for 32-bit architectures: 64-bit division needs
         div_u64() (kbuild test robot <lkp@intel.com>)
      v3:
       - rework interface for field length specification,
         NFT_SET_SUBKEY disappears and information is stored in
         description
       - remove scratch area to store closing element of ranges,
         as elements now come with an actual attribute to specify
         the upper range limit (Pablo Neira Ayuso)
       - also remove pointer to 'start' element from mapping table,
         closing key is now accessible via extension data
       - use bytes right away instead of bits for field lengths,
         this way we can also double the inner loop of the lookup
         function to take care of upper and lower bits in a single
         iteration (minor performance improvement)
       - make it clearer that set operations are actually atomic
         API-wise, but we can't e.g. implement flush() as one-shot
         action
       - fix type for 'dup' in nft_pipapo_insert(), check for
         duplicates only in the next generation, and in general take
         care of differentiating generation mask cases depending on
         the operation (Pablo Neira Ayuso)
       - report C implementation matching rate in commit message, so
         that AVX2 implementation can be compared (Pablo Neira Ayuso)
      v2:
       - protect access to scratch maps in nft_pipapo_lookup() with
         local_bh_disable/enable() (Florian Westphal)
       - drop rcu_read_lock/unlock() from nft_pipapo_lookup(), it's
         already implied (Florian Westphal)
       - explain why partial allocation failures don't need handling
         in pipapo_realloc_scratch(), rename 'm' to clone and update
         related kerneldoc to make it clear we're not operating on
         the live copy (Florian Westphal)
       - add explicit check for priv->start_elem in
         nft_pipapo_insert() to avoid ending up in nft_pipapo_walk()
         with a NULL start element, and also zero it out in every
         operation that might make it invalid, so that insertion
         doesn't proceed with an invalid element (Florian Westphal)
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nf_tables: Support for sets with multiple ranged fields · f3a2181e
      Committed by Stefano Brivio
      Introduce a new nested netlink attribute, NFTA_SET_DESC_CONCAT, used
      to specify the length of each field in a set concatenation.
      
      This allows set implementations to support concatenation of multiple
      ranged items, as they can divide the input key into matching data for
      every single field. Such set implementations would be selected as
      they specify support for NFT_SET_INTERVAL and allow desc->field_count
      to be greater than one. Explicitly disallow this for nft_set_rbtree.
      
      In order to specify the interval for a set entry, userspace would
      include the field lengths in NFTA_SET_DESC_CONCAT attributes, and
      pass the range endpoints as two separate keys, represented by the
      attributes NFTA_SET_ELEM_KEY and NFTA_SET_ELEM_KEY_END.
      
      While at it, export the number of 32-bit registers available for
      packet matching, as nftables will need this to know the maximum
      number of field lengths that can be specified.
      
      For example, "packets with an IPv4 address between 192.0.2.0 and
      192.0.2.42, with destination port between 22 and 25", can be
      expressed as two concatenated elements:
      
        NFTA_SET_ELEM_KEY:            192.0.2.0 . 22
        NFTA_SET_ELEM_KEY_END:        192.0.2.42 . 25
      
      and the NFTA_SET_DESC_CONCAT attribute would contain:
      
        NFTA_LIST_ELEM
          NFTA_SET_FIELD_LEN:		4
        NFTA_LIST_ELEM
          NFTA_SET_FIELD_LEN:		2
      
      v4: No changes
      v3: Complete rework, NFTA_SET_DESC_CONCAT instead of NFTA_SET_SUBKEY
      v2: No changes
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nf_tables: add NFTA_SET_ELEM_KEY_END attribute · 7b225d0b
      Committed by Pablo Neira Ayuso
      Add NFTA_SET_ELEM_KEY_END attribute to convey the closing element of the
      interval between kernel and userspace.
      
      This patch also adds the NFT_SET_EXT_KEY_END extension to store the
      closing element value in this interval.
      
      v4: No changes
      v3: New patch
      
      [sbrivio: refactor error paths and labels; add corresponding
        nft_set_ext_type for new key; rebase]
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  2. 25 Jan 2020, 3 commits
    • net: sched: Make TBF Qdisc offloadable · ef6aadcc
      Committed by Petr Machata
      Invoke ndo_setup_tc as appropriate to signal initialization/replacement,
      destruction and dumping of the TBF Qdisc.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: do not inherit inet proto ops · e42f1ac6
      Committed by Florian Westphal
      We need to initialise the struct ourselves; otherwise we expose
      TCP-specific callbacks such as tcp_splice_read, which then trigger
      a splat because the socket is an MPTCP one:
      
      BUG: KASAN: slab-out-of-bounds in tcp_mstamp_refresh+0x80/0xa0 net/ipv4/tcp_output.c:57
      Write of size 8 at addr ffff888116aa21d0 by task syz-executor.0/5478
      
      CPU: 1 PID: 5478 Comm: syz-executor.0 Not tainted 5.5.0-rc6 #3
      Call Trace:
       tcp_mstamp_refresh+0x80/0xa0 net/ipv4/tcp_output.c:57
       tcp_rcv_space_adjust+0x72/0x7f0 net/ipv4/tcp_input.c:612
       tcp_read_sock+0x622/0x990 net/ipv4/tcp.c:1674
       tcp_splice_read+0x20b/0xb40 net/ipv4/tcp.c:791
       do_splice+0x1259/0x1560 fs/splice.c:1205
      
      To prevent a build error with IPv6, add the recvmsg/sendmsg function
      declarations to ipv6.h. The functions are already accessible "thanks"
      to retpoline-related work, but they are currently only made visible
      by socket.c-specific INDIRECT_CALLABLE macros.
      Reported-by: Christoph Paasch <cpaasch@apple.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netfilter: nf_tables: autoload modules from the abort path · eb014de4
      Committed by Pablo Neira Ayuso
      This patch introduces a list of pending module requests. This new module
      list is composed of nft_module_request objects that contain the module
      name and a status field that tells whether the module has already been
      loaded (the 'done' field).
      
      In the first pass, from the preparation phase, the netlink command finds
      that a module is missing from this list. Then, a module request is
      allocated and added to this list, and nft_request_module() returns
      -EAGAIN. This triggers the abort path with the autoload parameter set
      from nfnetlink; request_module() is called and the module request enters
      the 'done' state. Since the mutex is released when loading modules from
      the abort phase, the module list is zapped so that this iteration occurs
      over a local list. Therefore, the request_module() calls happen when
      object lists are in a consistent state (after fully aborting the
      transaction) and the commit list is empty.
      
      On the second pass, the netlink command will find that it already tried
      to load the module, so it does not request it again and
      nft_request_module() returns 0. Then, a lookup is done to find the
      object that the command was missing. If the module was successfully
      loaded, the command proceeds normally since it finds the missing object
      in place, otherwise -ENOENT is reported to userspace.
      
      This patch also updates nfnetlink to include the reason to enter the
      abort phase, which is required for this new autoload module rationale.
      
      Fixes: ec7470b8 ("netfilter: nf_tables: store transaction list locally while requesting module")
      Reported-by: syzbot+29125d208b3dae9a7019@syzkaller.appspotmail.com
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  3. 24 Jan 2020, 6 commits
  4. 23 Jan 2020, 8 commits
  5. 22 Jan 2020, 1 commit
  6. 19 Jan 2020, 3 commits
  7. 17 Jan 2020, 1 commit
  8. 16 Jan 2020, 8 commits
  9. 15 Jan 2020, 4 commits
    • cfg80211: Fix radar event during another phy CAC · 26ec17a1
      Committed by Orr Mazor
      If a radar event of type CAC_FINISHED or RADAR_DETECTED
      happens while another phy is performing CAC, we might need
      to cancel that CAC.
      
      If radar is detected on a channel on which another phy is
      currently doing CAC, the CAC should be canceled there.
      
      If, for example, two phys are doing CAC on the same channels,
      or on compatible channels, once one of them finishes its
      CAC, the other might need to cancel its CAC, since it is no
      longer relevant.
      
      To fix that, this commit adds a callback and implements it in
      mac80211 to end CAC.
      This commit also adds a call to said callback if, after a radar
      event, we see that the CAC is no longer relevant.
      Signed-off-by: Orr Mazor <Orr.Mazor@tandemg.com>
      Reviewed-by: Sergey Matyukevich <sergey.matyukevich.os@quantenna.com>
      Link: https://lore.kernel.org/r/20191222145449.15792-1-Orr.Mazor@tandemg.com
      [slightly reformat/reword commit message]
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
    • ipv6: Add "offload" and "trap" indications to routes · bb3c4ab9
      Committed by Ido Schimmel
      In a similar fashion to the previous patch, add "offload" and "trap"
      indications to IPv6 routes.
      
      This is done by using two unused bits in 'struct fib6_info' to hold
      these indications. Capable drivers are expected to set these when
      processing the various in-kernel route notifications.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: Add "offload" and "trap" indications to routes · 90b93f1b
      Committed by Ido Schimmel
      When performing L3 offload, routes and nexthops are usually programmed
      into two different tables in the underlying device. Therefore, the fact
      that a nexthop resides in hardware does not necessarily mean that all
      the associated routes also reside in hardware and vice-versa.
      
      While the kernel can signal to user space the presence of a nexthop in
      hardware (via 'RTNH_F_OFFLOAD'), it does not have a corresponding flag
      for routes. In addition, the fact that a route resides in hardware does
      not necessarily mean that the traffic is offloaded. For example,
      unreachable routes (i.e., 'RTN_UNREACHABLE') are programmed to trap
      packets to the CPU so that the kernel will be able to generate the
      appropriate ICMP error packet.
      
      This patch adds "offload" and "trap" indications to IPv4 routes, so
      that users will have better visibility into the offload process.
      
      'struct fib_alias' is extended with two new fields that indicate whether
      the route resides in hardware and whether it is offloading traffic from
      the kernel or trapping packets to it. Note that the new fields are added
      in the existing 6-byte hole, so the struct still fits in a single
      cache line [1].
      
      Capable drivers are expected to invoke fib_alias_hw_flags_set() with the
      route's key in order to set the flags.
      
      The indications are dumped to user space via new flags (i.e.,
      'RTM_F_OFFLOAD' and 'RTM_F_TRAP') in the 'rtm_flags' field of the
      ancillary header.
      
      v2:
      * Make use of 'struct fib_rt_info' in fib_alias_hw_flags_set()
      
      [1]
      struct fib_alias {
              struct hlist_node  fa_list;                      /*     0    16 */
              struct fib_info *          fa_info;              /*    16     8 */
              u8                         fa_tos;               /*    24     1 */
              u8                         fa_type;              /*    25     1 */
              u8                         fa_state;             /*    26     1 */
              u8                         fa_slen;              /*    27     1 */
              u32                        tb_id;                /*    28     4 */
              s16                        fa_default;           /*    32     2 */
              u8                         offload:1;            /*    34: 0  1 */
              u8                         trap:1;               /*    34: 1  1 */
              u8                         unused:6;             /*    34: 2  1 */
      
              /* XXX 5 bytes hole, try to pack */
      
              struct callback_head rcu __attribute__((__aligned__(8))); /*    40    16 */
      
              /* size: 56, cachelines: 1, members: 12 */
              /* sum members: 50, holes: 1, sum holes: 5 */
              /* sum bitfield members: 8 bits (1 bytes) */
              /* forced alignments: 1, forced holes: 1, sum forced holes: 5 */
              /* last cacheline: 56 bytes */
      } __attribute__((__aligned__(8)));
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: Encapsulate function arguments in a struct · 1e301fd0
      Committed by Ido Schimmel
      fib_dump_info() is used to prepare RTM_{NEW,DEL}ROUTE netlink messages
      using the passed arguments. Currently, the function takes 11 arguments,
      6 of which are attributes of the route being dumped (e.g., prefix, TOS).
      
      The next patch will need the function to also dump to user space an
      indication of whether the route is present in hardware. Instead of
      passing yet another argument, change the function to take a struct
      containing the different route attributes.
      
      v2:
      * Name last argument of fib_dump_info()
      * Move 'struct fib_rt_info' to include/net/ip_fib.h so that it could
        later be passed to fib_alias_hw_flags_set()
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>