1. 07 Apr, 2020 1 commit
  2. 31 Mar, 2020 22 commits
    • devlink: Allow setting of packet trap group parameters · c064875a
      Ido Schimmel authored
      The previous patch allowed device drivers to publish their default
      binding between packet trap policers and packet trap groups. However,
      some users might not be content with this binding and would like to
      change it.
      
      In case user space passed a packet trap policer identifier when setting
      a packet trap group, invoke the appropriate device driver callback and
      pass the new policer identifier.
      
      v2:
      * Check for presence of 'DEVLINK_ATTR_TRAP_POLICER_ID' in
        devlink_trap_group_set() and bail if not present
      * Add extack error message in case trap group was partially modified
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
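The flow described in this commit (skip the policer update when no policer identifier was passed, otherwise call into the driver, and report a partial update on failure) can be sketched as a small Python model. This is only an illustration: `trap_group_set`, `TrapGroupSetError`, the dictionary keys and the error text are hypothetical names chosen here, not the kernel's identifiers.

```python
class TrapGroupSetError(Exception):
    """Raised when the (hypothetical) trap group update fails."""


def trap_group_set(group, attrs, driver_set_policer):
    """Apply a user request to a trap group.

    group              -- dict with "name", "action" and "policer_id" keys
    attrs              -- dict of attributes supplied by user space
    driver_set_policer -- callable(group_name, policer_id); raises on failure
    """
    modified = False
    if "action" in attrs:
        group["action"] = attrs["action"]
        modified = True

    # Bail out early when no policer identifier was passed.
    if "policer_id" not in attrs:
        return

    policer_id = attrs["policer_id"]
    try:
        driver_set_policer(group["name"], policer_id)
    except Exception as err:
        message = "trap group set failed"
        if modified:
            # Tell the user that the request was only partially applied.
            message += ", but some changes were already committed"
        raise TrapGroupSetError(message) from err
    group["policer_id"] = policer_id
```

The binding stored in `group` is updated only after the driver callback succeeded, so a failed request never leaves a policer identifier that the device does not use.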
    • devlink: Add packet trap group parameters support · f9f54392
      Ido Schimmel authored
      Packet trap groups are used to aggregate logically related packet traps.
      Currently, these groups allow user space to batch operations such as
      setting the trap action of all member traps.
      
      In order to prevent the CPU from being overwhelmed by too many trapped
      packets, it is desirable to bind a packet trap policer to these groups.
      For example, this makes it possible to limit all the packets that
      encountered an exception during routing to 10Kpps.
      
      Allow device drivers to bind default packet trap policers to packet trap
      groups when the latter are registered with devlink.
      
      The next patch will enable user space to change this default binding.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • devlink: Add packet trap policers support · 1e8c6619
      Ido Schimmel authored
      Devices capable of offloading the kernel's datapath and performing
      functions such as bridging and routing must also be able to send (trap)
      specific packets to the kernel (i.e., the CPU) for processing.
      
      For example, a device acting as a multicast-aware bridge must be able to
      trap IGMP membership reports to the kernel for processing by the bridge
      module.
      
      In most cases, the underlying device is capable of handling packet rates
      that are several orders of magnitude higher compared to those that can
      be handled by the CPU.
      
      Therefore, in order to prevent the underlying device from overwhelming
      the CPU, devices usually include packet trap policers that are able to
      police the trapped packets to rates that can be handled by the CPU.
      
      This patch allows capable device drivers to register their supported
      packet trap policers with devlink. User space can then tune the
      parameters of these policers (currently, rate and burst size) and read
      from the device the number of packets that were dropped by the policer,
      if supported.
      
      Subsequent patches in the series will allow device drivers to create a
      default binding between these policers and packet trap groups and allow
      user space to change the binding.
      
      v2:
      * Add 'strict_start_type' in devlink policy
      * Have device drivers provide max/min rate/burst size for each policer.
        Use them to check validity of user provided parameters
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
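The v2 note above says that drivers provide minimum and maximum rate and burst size for each policer and that these limits are used to validate user input. A minimal Python sketch of that check follows; the `TrapPolicer` class, its field names and `policer_set` are hypothetical and only model the idea.

```python
from dataclasses import dataclass


@dataclass
class TrapPolicer:
    """Hypothetical model of a policer as registered by a driver."""
    policer_id: int
    rate: int        # packets per second
    burst: int       # packets
    min_rate: int
    max_rate: int
    min_burst: int
    max_burst: int


def policer_set(policer, rate=None, burst=None):
    """Validate user-provided values against the driver limits, then apply.

    Parameters left as None keep their current value.
    """
    new_rate = policer.rate if rate is None else rate
    new_burst = policer.burst if burst is None else burst

    if not policer.min_rate <= new_rate <= policer.max_rate:
        raise ValueError("policer rate out of range")
    if not policer.min_burst <= new_burst <= policer.max_burst:
        raise ValueError("policer burst size out of range")

    # Only update state once both values passed validation.
    policer.rate, policer.burst = new_rate, new_burst
```

Validating both values before touching the policer keeps a rejected request from changing only one of the two parameters.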
    • bpf: Implement bpf_prog replacement for an active bpf_cgroup_link · 0c991ebc
      Andrii Nakryiko authored
      Add a new operation (LINK_UPDATE), which allows replacing the active
      bpf_prog under a given bpf_link. Currently this is only supported for
      bpf_cgroup_link, but it will be extended to other kinds of bpf_links in
      follow-up patches.
      
      For bpf_cgroup_link, implemented functionality matches existing semantics for
      direct bpf_prog attachment (including BPF_F_REPLACE flag). User can either
      unconditionally set new bpf_prog regardless of which bpf_prog is currently
      active under given bpf_link, or, optionally, can specify expected active
      bpf_prog. If active bpf_prog doesn't match expected one, no changes are
      performed, old bpf_link stays intact and attached, operation returns
      a failure.
      
      The cgroup_bpf_replace() operation resolves the race between auto-detachment
      and bpf_prog update in the same fashion as is done for bpf_link detachment,
      except that in this case the update has no way of succeeding because the
      target cgroup is marked as dying, so an error is returned.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200330030001.2312810-3-andriin@fb.com
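The update semantics described in this commit (unconditional replacement, optional check of the expected active program, and failure when the cgroup is already gone) can be modelled in a few lines of Python. The names `CgroupLinkModel`, `link_update` and `LinkUpdateError` are hypothetical; the real interface is the bpf() syscall, and the actual error codes are not modelled here.

```python
class LinkUpdateError(Exception):
    """Raised when the modelled update cannot be performed."""


class CgroupLinkModel:
    def __init__(self, prog, cgroup):
        self.prog = prog
        self.cgroup = cgroup  # set to None once the cgroup is dying


def link_update(link, new_prog, expected_old_prog=None):
    """Replace the program under a link, optionally compare-and-swap style."""
    if link.cgroup is None:
        # Auto-detachment already happened; the update cannot succeed.
        raise LinkUpdateError("target cgroup is gone")
    if expected_old_prog is not None and link.prog != expected_old_prog:
        # Nothing is changed; the link stays attached with its old program.
        raise LinkUpdateError("active program does not match expected one")
    link.prog = new_prog
```

Passing `expected_old_prog` lets two updaters detect that they raced with each other, because the second one sees a program it did not expect and leaves the link untouched.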
    • bpf: Implement bpf_link-based cgroup BPF program attachment · af6eea57
      Andrii Nakryiko authored
      Implement new sub-command to attach cgroup BPF programs and return FD-based
      bpf_link back on success. bpf_link, once attached to cgroup, cannot be
      replaced, except by owner having its FD. Cgroup bpf_link supports only
      BPF_F_ALLOW_MULTI semantics. Both link-based and prog-based BPF_F_ALLOW_MULTI
      attachments can be freely intermixed.
      
      To prevent bpf_cgroup_link from keeping cgroup alive past the point when no
      BPF program can be executed, implement auto-detachment of link. When
      cgroup_bpf_release() is called, all attached bpf_links are forced to release
      cgroup refcounts, but they leave bpf_link otherwise active and allocated, as
      well as still owning underlying bpf_prog. This is because user-space might
      still have FDs open and active, so bpf_link as a user-referenced object can't
      be freed yet. Once last active FD is closed, bpf_link will be freed and
      underlying bpf_prog refcount will be dropped. But cgroup refcount won't be
      touched, because cgroup is released already.
      
      The inherent race between bpf_cgroup_link release (from closing last FD) and
      cgroup_bpf_release() is resolved by both operations taking cgroup_mutex. So
      the only additional check required is when bpf_cgroup_link attempts to detach
      itself from cgroup. At that time we need to check whether there is still
      cgroup associated with that link. And if not, exit with success, because
      bpf_cgroup_link was already successfully detached.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Roman Gushchin <guro@fb.com>
      Link: https://lore.kernel.org/bpf/20200330030001.2312810-2-andriin@fb.com
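The auto-detachment scheme above can be illustrated with a small Python model in which one lock plays the role of cgroup_mutex. The classes and the plain integer refcount are simplifications invented for this sketch; only the function names `cgroup_bpf_release` and the idea of checking for an already cleared cgroup come from the commit message.

```python
import threading

cgroup_mutex = threading.Lock()  # stands in for the kernel's cgroup_mutex


class CgroupModel:
    def __init__(self):
        self.links = []
        self.refcount = 1


class LinkModel:
    def __init__(self, cgroup, prog):
        self.cgroup = cgroup
        self.prog = prog
        with cgroup_mutex:
            cgroup.links.append(self)
            cgroup.refcount += 1  # the link pins the cgroup


def cgroup_bpf_release(cgroup):
    """Force every link to drop its cgroup reference; links stay allocated."""
    with cgroup_mutex:
        for link in cgroup.links:
            link.cgroup = None
            cgroup.refcount -= 1
        cgroup.links.clear()


def link_release(link):
    """Called when the last FD is closed."""
    with cgroup_mutex:
        cgroup = link.cgroup
        if cgroup is not None:  # not auto-detached yet
            cgroup.links.remove(link)
            cgroup.refcount -= 1
            link.cgroup = None
    link.prog = None  # the program reference is dropped in either case
```

Because both paths take the same lock, whichever runs second sees `link.cgroup` already cleared and does not touch the cgroup refcount again.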
    • NFS: Ensure security label is set for root inode · 779df6a5
      Scott Mayhew authored
      When using NFSv4.2, the security label for the root inode should be set
      via a call to nfs_setsecurity() during the mount process, otherwise the
      inode will appear as unlabeled for up to acdirmin seconds.  Currently
      the label for the root inode is allocated, retrieved, and freed entirely
      within nfs4_proc_get_root().
      
      Add a field for the label to the nfs_fattr struct, and allocate & free
      the label in nfs_get_root(), where we also add a call to
      nfs_setsecurity().  Note that for the call to nfs_setsecurity() to
      succeed, it's necessary to also move the logic calling
      security_sb_{set,clone}_security() from nfs_get_tree_common() down into
      nfs_get_root()... otherwise the SBLABEL_MNT flag will not be set in the
      super_block's security flags and nfs_setsecurity() will silently fail.
      Reported-by: Richard Haines <richard_c_haines@btinternet.com>
      Signed-off-by: Scott Mayhew <smayhew@redhat.com>
      Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
      Tested-by: Stephen Smalley <sds@tycho.nsa.gov>
      [PM: fixed 80-char line width problems]
      Signed-off-by: Paul Moore <paul@paul-moore.com>
    • bpf: Verifier, do explicit ALU32 bounds tracking · 3f50f132
      John Fastabend authored
      It is not possible for the current verifier to track ALU32 and JMP ops
      correctly. This can result in the verifier aborting with errors even though
      the program should be verifiable. BPF code that hits this can work around
      it by changing int variables to 64-bit types, marking variables volatile,
      etc. But this is all very ugly, so it would be better to avoid these tricks.
      
      But the main reason to address this now is that do_refine_retval_range()
      was assuming return values could not be negative. Once we fix this, code
      that was previously working will no longer work. See the
      do_refine_retval_range() patch for details. And we don't want to suddenly
      cause programs that used to work to fail.
      
      The simplest example code snippet that illustrates the problem is likely
      this,
      
       53: w8 = w0                    // r8 <- [0, S32_MAX],
                                      // w8 <- [-S32_MIN, X]
       54: w8 <s 0                    // r8 <- [0, U32_MAX]
                                      // w8 <- [0, X]
      
      The expected 64-bit and 32-bit bounds after each line are shown on the
      right. The current issue is that, without the w* bounds, we are forced to
      use the worst case bound of [0, U32_MAX]. To resolve this type of case,
      where a jmp32 creates 32-bit bounds that diverge from the 64-bit bounds,
      we add explicit 32-bit register bounds s32_{min|max}_value and
      u32_{min|max}_value. Then, when the branch_taken logic creates new bounds,
      we can track the 32-bit bounds explicitly.
      
      The next case we observed is ALU ops after the jmp32,
      
       53: w8 = w0                    // r8 <- [0, S32_MAX],
                                      // w8 <- [-S32_MIN, X]
       54: w8 <s 0                    // r8 <- [0, U32_MAX]
                                      // w8 <- [0, X]
       55: w8 += 1                    // r8 <- [0, U32_MAX+1]
                                      // w8 <- [0, X+1]
      
      In order to keep the bounds accurate at this point we also need to track
      ALU32 ops. To do this we add explicit ALU32 logic for each of the ALU
      ops, mov, add, sub, etc.
      
      Finally, there is a question of how and when to merge bounds. The cases
      are enumerated here:
      
      1. MOV ALU32   - zext 32-bit -> 64-bit
      2. MOV ALU64   - copy 64-bit -> 32-bit
      3. op  ALU32   - zext 32-bit -> 64-bit
      4. op  ALU64   - n/a
      5. jmp ALU32   - 64-bit: var32_off | upper_32_bits(var64_off)
      6. jmp ALU64   - 32-bit: (>> (<< var64_off))
      
      Details for each case,
      
      For "MOV ALU32" BPF arch zero extends so we simply copy the bounds
      from 32-bit into 64-bit ensuring we truncate var_off and 64-bit
      bounds correctly. See zext_32_to_64.
      
      For "MOV ALU64" copy all bounds including 32-bit into new register. If
      the src register had 32-bit bounds the dst register will as well.
      
      For "op ALU32" zero extend 32-bit into 64-bit the same as move,
      see zext_32_to_64.
      
      For "op ALU64" calculate both 32-bit and 64-bit bounds; no merging
      is done here, with one special case. When RSH or ARSH is
      done we can't simply ignore shifting bits from the 64-bit reg into the
      32-bit subreg. So currently we just push bounds from 64-bit into 32-bit.
      This will be correct in the sense that they will represent a valid
      state of the register. However, we could lose some accuracy if an
      ARSH follows a jmp32 operation. We can handle this special
      case in a follow-up series.
      
      For "jmp ALU32" mark the 64-bit reg unknown and recalculate 64-bit bounds
      from tnum by setting var_off to ((<<(>>var_off)) | var32_off). We
      special case the situation where the 64-bit bounds have zeroed upper
      32 bits, at which point we can simply copy the 32-bit bounds into the
      64-bit register. This catches a common compiler trick where the upper
      32 bits are zeroed and then 32-bit ops are used, followed by a 64-bit
      compare or 64-bit op on a pointer. See __reg_combine_32_into_64().
      
      For "jmp ALU64" cast the bounds of the 64bit to their 32-bit
      counterpart. For example s32_min_value = (s32)reg->smin_value. For
      tnum use only the lower 32bits via, (>>(<<var_off)). See
      __reg_combine_64_into_32().
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/158560419880.10843.11448220440809118343.stgit@john-Precision-5820-Tower
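To make the idea of separate 32-bit bounds concrete, here is a simplified Python model for unsigned ranges only. It is a conservative sketch written for this note and is not the verifier's code: signed bounds and tnum tracking are left out, and the helper names `u64_to_u32_bounds`, `zext_u32_bounds` and `add32_bounds` are hypothetical.

```python
U32_MAX = (1 << 32) - 1


def u64_to_u32_bounds(umin, umax):
    """Bounds of the low 32 bits, given unsigned 64-bit bounds."""
    if umin > umax:
        raise ValueError("invalid range")
    # If every value in the range shares the same upper 32 bits, the low
    # halves are ordered the same way as the full values.
    if (umin >> 32) == (umax >> 32):
        return umin & U32_MAX, umax & U32_MAX
    return 0, U32_MAX  # otherwise nothing is known about the subregister


def zext_u32_bounds(u32_min, u32_max):
    """A 32-bit op zero-extends, so the 64-bit range equals the 32-bit one."""
    return u32_min, u32_max


def add32_bounds(a, b):
    """Bounds of a 32-bit add of two ranges given as (min, max) tuples."""
    lo, hi = a[0] + b[0], a[1] + b[1]
    if hi > U32_MAX:
        return 0, U32_MAX  # a wrap-around is possible: give up precision
    return lo, hi
```

With these helpers the `w8 += 1` step from the example keeps a useful range: `add32_bounds((0, X), (1, 1))` evaluates to `(1, X + 1)` as long as `X + 1` still fits in 32 bits, which is tighter than falling back to the full `[0, U32_MAX]` range.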
    • bpf: Don't refcount LISTEN sockets in sk_assign() · 7ae215d2
      Joe Stringer authored
      Avoid taking a reference on listen sockets by checking the socket type
      in sk_assign() and in the corresponding skb_steal_sock() code in the
      transport layer, and by ensuring that the prefetch free (sock_pfree)
      function uses the same logic to check whether the socket is refcounted.
      Suggested-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Joe Stringer <joe@wand.net.nz>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20200329225342.16317-4-joe@wand.net.nz
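The key point of this commit is that the code taking the reference and the code dropping it must use the same predicate. A toy Python model of that invariant is shown below; `SockModel`, the dictionary used as an skb, and the rule "listeners are not refcounted" are simplifications assumed for the sketch rather than the kernel's actual data structures.

```python
class SockModel:
    def __init__(self, listening):
        self.listening = listening
        self.refcnt = 1


def sk_is_refcounted(sk):
    """Modelled predicate: listen sockets are used without a reference."""
    return not sk.listening


def sk_assign(skb, sk):
    """Attach a socket to the packet, taking a reference only if needed."""
    if sk_is_refcounted(sk):
        sk.refcnt += 1
    skb["sk"] = sk


def sock_pfree(skb):
    """Release the prefetched socket using the same predicate."""
    sk = skb.pop("sk")
    if sk_is_refcounted(sk):
        sk.refcnt -= 1
```

If the two functions disagreed about which sockets are refcounted, the count would either leak or underflow; sharing one predicate keeps it balanced for both kinds of socket.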
    • net: Track socket refcounts in skb_steal_sock() · 71489e21
      Joe Stringer authored
      Refactor the UDP/TCP handlers slightly to allow skb_steal_sock() to make
      the determination of whether the socket is reference counted in the case
      where it is prefetched by earlier logic such as early_demux.
      Signed-off-by: Joe Stringer <joe@wand.net.nz>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20200329225342.16317-3-joe@wand.net.nz
    • bpf: Add socket assign support · cf7fbe66
      Joe Stringer authored
      Add support for TPROXY via a new bpf helper, bpf_sk_assign().
      
      This helper requires the BPF program to discover the socket via a call
      to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
      helper takes its own reference to the socket in addition to any existing
      reference that may or may not currently be obtained for the duration of
      BPF processing. For the destination socket to receive the traffic, the
      traffic must be routed towards that socket via a local route. The
      simplest example route is below, but in practice you may want to route
      traffic more narrowly (e.g., by CIDR):
      
        $ ip route add local default dev lo
      
      This patch avoids trying to introduce an extra bit into the skb->sk, as
      that would require more invasive changes to all code interacting with
      the socket to ensure that the bit is handled correctly, such as all
      error-handling cases along the path from the helper in BPF through to
      the orphan path in the input. Instead, we opt to use the destructor
      variable to switch on the prefetch of the socket.
      Signed-off-by: Joe Stringer <joe@wand.net.nz>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20200329225342.16317-2-joe@wand.net.nz
    • net: phylink: add separate pcs operations structure · 4c0d6d3a
      Russell King authored
      Add a separate set of PCS operations, which MAC drivers can use to
      couple phylink with their associated MAC PCS layer.  The PCS
      operations include:
      
      - pcs_get_state() - reads the link up/down, resolved speed, duplex
         and pause from the PCS.
      - pcs_config() - configures the PCS for the specified mode, PHY
         interface type, and setting the advertisement.
      - pcs_an_restart() - restarts 802.3 in-band negotiation with the
         link partner
      - pcs_link_up() - informs the PCS that link has come up, and the
         parameters of the link. Link parameters are used to program the
         PCS for fixed speed and non-inband modes.
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phylink: rename 'ops' to 'mac_ops' · e7765d63
      Russell King authored
      Rename the bland 'ops' member of struct phylink to be a more
      descriptive 'mac_ops' - this is necessary as we're about to introduce
      another set of operations.
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phylink: change phylink_mii_c22_pcs_set_advertisement() prototype · 0bd27406
      Russell King authored
      Change phylink_mii_c22_pcs_set_advertisement() to take only the PHY
      interface and advertisement mask, rather than the full phylink state.
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • qed: Fix use after free in qed_chain_free · 8063f761
      Yuval Basson authored
      The qed_chain data structure was modified in
      commit 1a4a6975 ("qed: Chain support for external PBL") to support
      receiving an external pbl (due to iWARP FW requirements).
      The pages pointed to by the pbl are allocated in qed_chain_alloc
      and their virtual addresses are stored in a virtual address array to
      enable accessing and freeing the data. The physical addresses, however,
      weren't stored and were accessed directly from the external pbl
      during free.
      
      The destroy-qp flow frees the external pbl before the chain is freed.
      When the chain is then freed, it tries to access the already freed
      external pbl, leading to a use-after-free. Therefore we need to store
      the physical addresses in addition to the virtual addresses in a
      new data structure.
      
      Fixes: 1a4a6975 ("qed: Chain support for external PBL")
      Signed-off-by: Michal Kalderon <mkalderon@marvell.com>
      Signed-off-by: Yuval Bason <ybason@marvell.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
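The fix pattern described here is to keep the chain's own copy of every address it needs at free time, instead of reading it back from a table owned by someone else. The following Python model illustrates that; `PageAllocator` and `ChainModel` are hypothetical stand-ins and do not mirror the driver's structures.

```python
class PageAllocator:
    """Toy allocator that tracks live pages by physical address."""

    def __init__(self):
        self.live = {}
        self._next_phys = 0x1000

    def alloc_page(self):
        virt = object()
        phys = self._next_phys
        self._next_phys += 0x1000
        self.live[phys] = virt
        return virt, phys

    def free_page(self, virt, phys):
        if self.live.get(phys) is not virt:
            raise RuntimeError("freeing a page that is not live")
        del self.live[phys]


class ChainModel:
    """Records both the virtual and the physical address of each page."""

    def __init__(self):
        self.pages = []  # (virt, phys) pairs owned by the chain itself

    def alloc(self, allocator, external_pbl, count):
        for _ in range(count):
            virt, phys = allocator.alloc_page()
            external_pbl.append(phys)         # the table the caller owns
            self.pages.append((virt, phys))   # the chain's private copy

    def free(self, allocator):
        # No access to the external table is needed at this point.
        for virt, phys in self.pages:
            allocator.free_page(virt, phys)
        self.pages.clear()
```

In this model the caller may discard `external_pbl` before calling `free()`, which corresponds to the ordering in the destroy-qp flow that triggered the bug.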
    • net: dsa: felix: add port policers · fc411eaa
      Vladimir Oltean authored
      This patch is a trivial passthrough towards the ocelot library, which
      has supported port policers since commit 2c1d029a ("net: mscc: ocelot:
      Implement port policers via tc command").
      
      Some data structure conversion between the DSA core and the Ocelot
      library is necessary, for policer parameters.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: add port policers · 34297176
      Vladimir Oltean authored
      The approach taken to pass the port policer methods on to drivers is
      pragmatic. It is similar to the port mirroring implementation (in that
      the DSA core does all of the filter block interaction and only passes
      simple operations for the driver to implement) and dissimilar to how
      flow-based policers are going to be implemented (where the driver has
      full control over the flow_cls_offload data structure).
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mscc: ocelot: add action of police on vcap_is2 · c9a7fe12
      Xiaoliang Yang authored
      Ocelot has 384 policers that can be allocated to ingress ports,
      QoS classes per port, and VCAP IS2 entries. ocelot_police.c
      supports configuring policers that can be bound to the police
      action of VCAP IS2. We allocate policers starting from the maximum
      pol_id and decrease the pol_id each time a new vcap_is2 entry with
      a police action is added.
      Signed-off-by: Xiaoliang Yang <xiaoliang.yang_1@nxp.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
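The top-down allocation described in this commit can be sketched as a simple counter. The class `Is2PolicerPool` and its parameters are hypothetical; in particular, which identifier is the highest usable one and where the range reserved for other users ends are left to the caller, because the commit message does not state them.

```python
class Is2PolicerPool:
    """Hands out policer ids from the top of the range downwards."""

    def __init__(self, highest_id, lowest_id=0):
        self.next_id = highest_id
        self.lowest_id = lowest_id  # ids below this are left to other users

    def alloc(self):
        if self.next_id < self.lowest_id:
            raise RuntimeError("no free policer for a new IS2 entry")
        pol_id = self.next_id
        self.next_id -= 1
        return pol_id
```

Allocating from the top keeps the low identifiers free for policers with a fixed assignment, so the two users of the shared pool only collide when it is nearly exhausted.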
    • devlink: Add auto dump flag to health reporter · 48bb52c8
      Eran Ben Elisha authored
      On low-memory systems, run-time dumps can consume too much memory. Give
      the administrator the ability to disable auto dumps per reporter as part
      of the error flow handling routine.
      
      This attribute is not relevant while executing
      DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET.
      
      By default, auto dump is activated for any reporter that has a dump method,
      as part of the reporter registration to devlink.
      Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • devlink: Implicitly set auto recover flag when registering health reporter · ba7d16c7
      Eran Ben Elisha authored
      When health reporter is registered to devlink, devlink will implicitly set
      auto recover if and only if the reporter has a recover method. No reason
      to explicitly get the auto recover flag from the driver.
      
      Remove this flag from all drivers that called
      devlink_health_reporter_create.
      
      All existing health reporters set auto recovery to true if they have a
      recover method.
      
      The administrator can still unset auto recover via a netlink command, as
      was possible prior to this patch.
      Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
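The two health reporter commits in this list share one rule: a flag defaults to enabled exactly when the reporter implements the matching method, and the administrator can clear it later. A compact Python model of that rule follows. `HealthReporterModel` is a hypothetical name, and running the dump before the recovery in `report()` is an assumption of this sketch.

```python
class HealthReporterModel:
    def __init__(self, recover=None, dump=None):
        self.recover = recover
        self.dump = dump
        # Defaults are derived from the callbacks the driver provides.
        self.auto_recover = recover is not None
        self.auto_dump = dump is not None

    def report(self, error):
        """Error flow: optionally capture a dump, then optionally recover."""
        if self.auto_dump and self.dump is not None:
            self.dump(error)
        if self.auto_recover and self.recover is not None:
            self.recover(error)
```

Clearing `auto_dump` on a low-memory system leaves recovery untouched, and an explicit dump request would still call `dump` directly, since the flag only affects the error flow.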
    • ptp: Avoid deadlocks in the programmable pin code. · 62582a7e
      Richard Cochran authored
      The PTP Hardware Clock (PHC) subsystem offers an API for configuring
      programmable pins.  User space sets or gets the settings using ioctls,
      and drivers verify dialed settings via a callback.  Drivers may also
      query pin settings by calling the ptp_find_pin() method.
      
      Although the core subsystem protects concurrent access to the pin
      settings, the implementation places illogical restrictions on how
      drivers may call ptp_find_pin().  When enabling an auxiliary function
      via the .enable(on=1) callback, drivers may invoke the pin finding
      method, but when disabling with .enable(on=0) drivers are not
      permitted to do so.  With the exception of the mv88e6xxx, all of the
      PHC drivers do respect this restriction, but still the locking pattern
      is both confusing and unnecessary.
      
      This patch changes the locking implementation to allow PHC drivers to
      freely call ptp_find_pin() from their .enable() and .verify()
      callbacks.
      
      V2 ChangeLog:
      - fixed spelling in the kernel doc
      - add Vladimir's tested by tag
      Signed-off-by: Richard Cochran <richardcochran@gmail.com>
      Reported-by: Yangbo Lu <yangbo.lu@nxp.com>
      Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: expose HW stats types per action used by drivers · 93a129eb
      Jiri Pirko authored
      It may be up to the driver (in case ANY HW stats is passed) to select
      which type of HW stats it is going to use. Add infrastructure to
      expose this information to the user.
      
      $ tc filter add dev enp3s0np1 ingress proto ip handle 1 pref 1 flower dst_ip 192.168.1.1 action drop
      $ tc -s filter show dev enp3s0np1 ingress
      filter protocol ip pref 1 flower chain 0
      filter protocol ip pref 1 flower chain 0 handle 0x1
        eth_type ipv4
        dst_ip 192.168.1.1
        in_hw in_hw_count 2
              action order 1: gact action drop
               random type none pass val 0
               index 1 ref 1 bind 1 installed 10 sec used 10 sec
              Action statistics:
              Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
              backlog 0b 0p requeues 0
              used_hw_stats immediate     <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: introduce nla_put_bitfield32() helper and use it · 8953b077
      Jiri Pirko authored
      Introduce a helper that takes a value and a selector, packs them
      into a struct, and puts the struct into a netlink message.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
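As an illustration of what such a helper packs, the payload of a bitfield32 attribute is two 32-bit words, the value followed by the selector, in host byte order. The Python sketch below builds only that 8-byte payload; the function name `pack_bitfield32` is hypothetical, and the netlink attribute header that the kernel helper also writes is not modelled.

```python
import struct


def pack_bitfield32(value, selector):
    """Pack value and selector as two native-endian u32 fields, value first."""
    return struct.pack("=II", value, selector)


def unpack_bitfield32(payload):
    """Inverse of pack_bitfield32(); returns (value, selector)."""
    return struct.unpack("=II", payload)
```

The selector marks which bits of the value are meaningful, so a receiver can tell a bit that was explicitly cleared from one the sender did not address.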
  3. 30 Mar, 2020 17 commits