1. 26 Nov, 2016 32 commits
    • mlxsw: Create a different trap group list for each device · 117b0dad
      Nogah Frankel committed
      Trap groups can be used to control traps priority, both in terms of
      which trap "wins" if a packet matches two traps (priority) and in terms
      of packets from which trap group will be scheduled to the cpu first (tc).
      They can also be used to set rate limiters (policers) on them (will be
      added in the next patches).
      
      Currently, we support two trap groups. In Spectrum we want a better
      resolution, so every protocol / flow will have a different trap group,
      so we can control its parameters separately. Once the policers are
      implemented, it will also allow us to limit the rate of each protocol by
      itself.

      This patch changes the trap group list to include:
      * the emad trap group, which is shared for all the devices.
      * Switchx2's trap groups, which are a copy of the current trap groups.
      * Spectrum's new trap groups, in order to match the above guidelines.
      (Switchib is using only the emad trap group, so it requires no changes).
      
      This patch also includes new configuration for Spectrum's trap groups,
      with primary priority order within them.
      Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum: Add BGP trap · 616d8040
      Nogah Frankel committed
      Add a trap for the BGP protocol, which was previously trapped by the
      generic trap for IP2ME. This trap will allow us to have better control
      (over priority and rate) of the traffic.
      Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: Change trap groups setting · 579c82e4
      Nogah Frankel committed
      Trap groups have many options which we currently set to default values.
      In the next patches we will use many of them with non-default values.
      
      Some of these options have no default value, so this patch makes them
      parameters of the trap group set function. Others almost always use the
      same values, so the set function applies these defaults. In the rare
      cases where other values are needed, they can be set directly (using the
      macros for fields in registers).
      
      Parameters without default value:
      TC - the traffic class for packets that hit this trap group.
          (old default is the max tc)
      priority - if one packet hits multiple trap groups, the group with the
      	   higher priority will "catch" it. (old default is 0)
      policer - limit rate policer (old default is disabled)
      
      Default parameters:
      swid - switch id, relevant for the emad trap only, ignored on Spectrum.
             (new default is 0)
      rdq - CPU receive descriptor queue (new default is identical to trap
            group id)
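
The split between explicit and defaulted parameters can be sketched roughly as follows. This is a minimal illustration, not the driver's code: struct trap_group_cfg, trap_group_pack(), trap_group_rdq_set() and POLICER_DISABLED are hypothetical names, and the struct is a plain stand-in for the real register payload.

```c
#include <stdint.h>

/* Hypothetical stand-in for the trap group register payload; the field
 * names follow the commit message, not the real register layout. */
struct trap_group_cfg {
	uint8_t group_id;   /* which trap group is being configured */
	uint8_t priority;   /* the higher priority group "catches" the packet */
	uint8_t tc;         /* traffic class for packets hitting this group */
	uint8_t policer_id; /* rate limiter, or POLICER_DISABLED */
	uint8_t swid;       /* switch id, defaulted to 0 */
	uint8_t rdq;        /* CPU receive descriptor queue, defaulted to group id */
};

#define POLICER_DISABLED 0xff /* assumed sentinel meaning "no policer" */

/* priority, tc and policer have no default, so the caller passes them;
 * swid and rdq are filled in with the defaults described above. */
void trap_group_pack(struct trap_group_cfg *cfg, uint8_t group_id,
		     uint8_t priority, uint8_t tc, uint8_t policer_id)
{
	cfg->group_id = group_id;
	cfg->priority = priority;
	cfg->tc = tc;
	cfg->policer_id = policer_id;
	cfg->swid = 0;
	cfg->rdq = group_id;
}

/* Rare case: override a defaulted field directly after packing. */
void trap_group_rdq_set(struct trap_group_cfg *cfg, uint8_t rdq)
{
	cfg->rdq = rdq;
}
```

A caller that needs a non-default rdq would pack first and then call the setter, which mirrors the "set directly" escape hatch mentioned above.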
      Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: resources: Add max trap groups resource · 23432cb8
      Nogah Frankel committed
      Add the max number of trap groups to the resource query.
      Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: core: Change emad trap group settings · 9d87fcea
      Nogah Frankel committed
      Currently, the emad trap init is done in the core. In the future we will
      want to change the trap groups according to device type.
      This commit creates a driver function that sets up the trap group for the
      emad, so that devices can change it later. It also changes the emad
      registration to use the new generic functions.
      Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: Add option to choose trap group · 0fb78a4e
      Nogah Frankel committed
      Currently, we set the trap group to a pre-determined option, based on
      whether it is an rx or event trap.
      This commit adds the possibility to choose the trap group, so it can be set
      to different values in the following patches.
      Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: Change trap set function · d570b7ee
      Nogah Frankel committed
      Change the trap setting function so that instead of determining the trap
      group by trap id, it gets it as a parameter (so later we can have different
      trap groups for Spectrum and Switchx2).
      Add an "is_ctrl" parameter to the trap setting function. It controls whether
      the trapped packets wait in a designated control buffer or in their
      default one. This parameter is ignored by Switchx2 and Switchib.
      Add these parameters to the traps array in Spectrum, Switchx2 and
      Switchib.
      Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: switchib: Use generic listener struct for events · 85d5c9cd
      Nogah Frankel committed
      Change the event handling in Switchib to be compatible with Spectrum and
      Switchx2.
      Use the generic listener struct for the events. Init and fini them in a
      loop (and not by calling each event by its name).
      Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: switchx2: Use generic listener struct for events · 6bf08b53
      Nogah Frankel committed
      Change the events to use the generic listener struct.
      Merge the event list into the trap list, so the same functions will
      handle both.
      Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum: Use generic listener struct for events · 4544913e
      Nogah Frankel committed
      Change the events to use the generic listener struct.
      Merge the event list into the trap list, so the same functions will
      handle both.
      Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: core: Introduce generic macro for event · fb9012d9
      Nogah Frankel committed
      Create a macro for creating the generic listener struct for events,
      similar to the one for rx traps.
      Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: switchx2: Use generic listener struct for rx traps · 2332d8c7
      Nogah Frankel committed
      Reorganize the traps to use the new generic listener struct and
      functions. Use macros to shorten the traps list.
      Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum: Use generic listener struct for rx traps · 14eeda99
      Nogah Frankel committed
      Replace the old rx listener struct definitions with the generic ones.
      Use the new generic registering / unregistering functions for them.
      Add some macros to organize the trap list.
      Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: core: Expose generic macros for rx trap · b63da93d
      Nogah Frankel committed
      In Spectrum, there is a macro to arrange the traps list.
      This macro is useful for everyone who is using rx traps.
      Create a similar macro in core.h for creating the generic listener struct
      for rx traps.
      Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: core: Create a generic function to register / unregister traps · 0791051c
      Nogah Frankel committed
      We have 2 types of HW traps to handle, rx traps and events.
      The registration workflow for both is very similar, so it makes sense
      to create one function to handle both.

      This patch creates a struct to hold the data for both cases. It also
      creates registration and unregistration functions that take this
      generic struct as input.
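
One way to picture the generic struct is a tagged union with a single register / unregister pair. The sketch below is only an illustration under assumed names (struct listener, struct registry, trap_register(), trap_unregister() and the callback typedefs are all hypothetical); the real core keeps separate listener lists and takes driver-specific arguments.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback types for the two kinds of HW traps. */
typedef void (*rx_func_t)(const void *pkt, void *priv);
typedef void (*event_func_t)(const char *payload, void *priv);

/* One struct describes either kind of trap; is_event selects the union arm. */
struct listener {
	unsigned int trap_id;
	bool is_event;
	union {
		rx_func_t rx;
		event_func_t event;
	} u;
};

#define MAX_LISTENERS 8

/* Toy registry standing in for the core's listener bookkeeping. */
struct registry {
	const struct listener *slots[MAX_LISTENERS];
	size_t count;
};

/* Single entry point for both rx traps and events. Returns 0, or -1 if full. */
int trap_register(struct registry *reg, const struct listener *l)
{
	if (reg->count == MAX_LISTENERS)
		return -1;
	reg->slots[reg->count++] = l;
	return 0;
}

/* Removes the listener if present. Returns 0, or -1 if it was not registered. */
int trap_unregister(struct registry *reg, const struct listener *l)
{
	for (size_t i = 0; i < reg->count; i++) {
		if (reg->slots[i] == l) {
			/* fill the hole with the last entry */
			reg->slots[i] = reg->slots[--reg->count];
			return 0;
		}
	}
	return -1;
}
```

With this shape, a driver can keep rx traps and events in one array and init / fini them in a single loop.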
      Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum: Remove unused traps · ee4a60d8
      Nogah Frankel committed
      Since commit 99724c18 ("mlxsw: spectrum: Introduce support for
      router interfaces") we no longer rely on flooding traffic to the CPU in
      order to trap packets intended for the host itself. Therefore, the FDB
      MC trap can be removed.
      Remove traps for protocols that are not supported yet.
      Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5: remove a duplicate condition · eafa6abd
      Dan Carpenter committed
      We verified that MLX5_FLOW_CONTEXT_ACTION_COUNT was set on the first
      line of the function so we don't need to check again here.
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'thunderx-new-features' · 0ebc5b62
      David S. Miller committed
      Sunil Goutham says:

      ====================
      net: thunderx: Support for 80xx, RED, PFC etc.

      This patch series adds support for SLM modules present on 80xx
      silicon, enables random early discard, backpressure generation,
      PFC and some ethtool changes to display supported link modes etc.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: thunderx: Pause frame support · 430da208
      Sunil Goutham committed
      Enable pause frames on both the Rx and Tx side, configure the pause
      interval etc. Support for enabling/disabling pause frames
      on Rx/Tx via ethtool has also been added.
      Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: thunderx: Configure RED and backpressure levels · d5b2d7a7
      Sunil Goutham committed
      This patch enables moving average calculation of Rx pkt's resources
      and configures RED and backpressure levels for both CQ and RBDR.
      Also initialize SQ's CQ_LIMIT properly.
      Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: thunderx: 80xx BGX0 configuration changes · 5271156b
      Sunil Goutham committed
      On 80xx only one lane of DLM0 and DLM1 (of BGX0) can be used,
      so even though the lmac count may be 2, LMAC1 should use the
      serdes lane of DLM1. Since it's not possible to distinguish
      80xx from 81xx as the PCI devids are the same, this patch adds this
      config support by relying on what firmware configures the
      lmacs with.
      Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • phy: fix error case of phy_led_triggers_(un)register · a7dac9f9
      Woojung Huh committed
      When phy_init_hw() fails in phy_attach_direct():
      - phy_detach() calls phy_led_triggers_unregister() without a
        previous call to phy_led_triggers_register().
      - phy_led_triggers_register() is still called, causing a memory leak.

      Fixes: 2e0bc452 ("net: phy: leds: add support for led triggers on phy link state change")
      Signed-off-by: Woojung Huh <woojung.huh@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: properly flush delay-freed skbs · f52dffe0
      Eric Dumazet committed
      Typical NAPI drivers use napi_consume_skb(skb) at TX completion time.
      This puts the skb in a special percpu queue, napi_alloc_cache, to get bulk
      frees.
      
      It turns out the queue is not flushed and hits the NAPI_SKB_CACHE_SIZE
      limit quite often, with skbs that were queued hundreds of usec earlier.
      I measured this can take ~6000 nsec to perform one flush.
      
      __kfree_skb_flush() can be called from two points right now:
      
      1) From net_tx_action(), but only for skbs that were queued to
      sd->completion_queue.
      
       -> Irrelevant for NAPI drivers in normal operation.
      
      2) From net_rx_action(), but only under high stress or if RPS/RFS has a
      pending action.
      
      This patch changes net_rx_action() to perform the flush in all cases and
      after more urgent operations have happened (like kicking remote CPUs for
      RPS/RFS).
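
The behaviour change can be modelled in user space roughly like this. It is a toy sketch, not kernel code: struct free_cache, cache_defer_free(), cache_flush(), rx_action() and the CACHE_SIZE value are all hypothetical stand-ins for the percpu cache and the softirq handler.

```c
#include <stddef.h>

#define CACHE_SIZE 64 /* plays the role of NAPI_SKB_CACHE_SIZE; value assumed */

/* Toy per-CPU cache of buffers waiting for a bulk free. */
struct free_cache {
	void *items[CACHE_SIZE];
	size_t count;
	size_t flushes; /* number of bulk frees performed */
	size_t freed;   /* total buffers released */
};

/* Release everything queued so far in one batch. */
void cache_flush(struct free_cache *c)
{
	if (c->count == 0)
		return;
	c->freed += c->count; /* a real implementation would bulk-free here */
	c->count = 0;
	c->flushes++;
}

/* Queue a buffer; on its own the cache only flushes once it is full. */
void cache_defer_free(struct free_cache *c, void *buf)
{
	c->items[c->count++] = buf;
	if (c->count == CACHE_SIZE)
		cache_flush(c);
}

/* Model of one receive softirq pass after the patch: do the more urgent
 * work first, then flush unconditionally. */
void rx_action(struct free_cache *c, void (*urgent_work)(void))
{
	if (urgent_work)
		urgent_work();
	cache_flush(c);
}
```

Without the unconditional flush at the end of rx_action(), buffers queued by cache_defer_free() would sit in the cache until it filled up, which is the stale-skb situation described above.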
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'cgroup-bpf' · ca89fa77
      David S. Miller committed
      Daniel Mack says:
      
      ====================
      Add eBPF hooks for cgroups
      
      This is v9 of the patch set to allow eBPF programs for network
      filtering and accounting to be attached to cgroups, so that they apply
      to all sockets of all tasks placed in that cgroup. The logic can also
      be extended for other cgroup-based eBPF logic.
      
      Again, only minor details are updated in this version.
      
      Changes from v8:
      
      * Move the egress hooks into ip_finish_output() and ip6_finish_output()
        so they run after the netfilter hooks. For IPv4 multicast, add a new
        ip_mc_finish_output() callback that is invoked on success by
        netfilter, and call the eBPF program from there.
      
      Changes from v7:
      
      * Replace the static inline function cgroup_bpf_run_filter() with
        two specific macros for ingress and egress.  This addresses David
        Miller's concern regarding skb->sk vs. sk in the egress path.
        Thanks a lot to Daniel Borkmann and Alexei Starovoitov for the
        suggestions.
      
      Changes from v6:
      
      * Rebased to 4.9-rc2
      
      * Add EXPORT_SYMBOL(__cgroup_bpf_run_filter). The kbuild test robot
        now succeeds in building this version of the patch set.
      
      * Switch from bpf_prog_run_save_cb() to bpf_prog_run_clear_cb() to not
        tamper with the contents of skb->cb[]. Pointed out by Daniel
        Borkmann.
      
      * Use sk_to_full_sk() in the egress path, as suggested by Daniel
        Borkmann.
      
      * Renamed BPF_PROG_TYPE_CGROUP_SOCKET to BPF_PROG_TYPE_CGROUP_SKB, as
        requested by David Ahern.
      
      * Added Alexei's Acked-by tags.
      
      Changes from v5:
      
      * The eBPF programs now operate on L3 rather than on L2 of the packets,
        and the egress hooks were moved from __dev_queue_xmit() to
        ip*_output().
      
      * For BPF_PROG_TYPE_CGROUP_SOCKET, disallow direct access to the skb
        through BPF_LD_[ABS|IND] instructions, but hook up the
        bpf_skb_load_bytes() access helper instead. Thanks to Daniel Borkmann
        for the help.
      
      Changes from v4:
      
      * Plug an skb leak when dropping packets due to eBPF verdicts in
        __dev_queue_xmit(). Spotted by Daniel Borkmann.
      
      * Check for sk_fullsock(sk) in __cgroup_bpf_run_filter() so we don't
        operate on timewait or request sockets. Suggested by Daniel Borkmann.
      
      * Add missing @parent parameter in kerneldoc of __cgroup_bpf_update().
        Spotted by Rami Rosen.
      
      * Include linux/jump_label.h from bpf-cgroup.h to fix a kbuild error.
      
      Changes from v3:
      
      * Dropped the _FILTER suffix from BPF_PROG_TYPE_CGROUP_SOCKET_FILTER,
        renamed BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS to
        BPF_CGROUP_INET_{IN,E}GRESS and alias BPF_MAX_ATTACH_TYPE to
        __BPF_MAX_ATTACH_TYPE, as suggested by Daniel Borkmann.
      
      * Dropped the attach_flags member from the anonymous struct for BPF
        attach operations in union bpf_attr. They can be added later on via
        CHECK_ATTR. Requested by Daniel Borkmann and Alexei.
      
      * Release old_prog at the end of __cgroup_bpf_update rather than at
        the beginning to fix a race gap between program updates and their
        users. Spotted by Daniel Borkmann.
      
      * Plugged an skb leak when dropping packets on the egress path.
        Spotted by Daniel Borkmann.
      
      * Add cgroups@vger.kernel.org to the loop, as suggested by Rami Rosen.
      
      * Some minor coding style adaptations not worth mentioning in particular.
      
      Changes from v2:
      
      * Fixed the RCU locking details Tejun pointed out.
      
      * Assert bpf_attr.flags == 0 in BPF_PROG_DETACH syscall handler.
      
      Changes from v1:
      
      * Moved all bpf specific cgroup code into its own file, and stub
        out related functions for !CONFIG_CGROUP_BPF as static inline nops.
        This way, the call sites are not cluttered with #ifdef guards while
        the feature remains compile-time configurable.
      
      * Implemented the new scheme proposed by Tejun. Per cgroup, store one
        set of pointers that are pinned to the cgroup, and one for the
        programs that are effective. When a program is attached or detached,
        the change is propagated to all the cgroup's descendants. If a
        subcgroup has its own pinned program, skip the whole subbranch in
        order to allow delegation models.
      
      * The hookup for egress packets is now done from __dev_queue_xmit().
      
      * A static key is now used in both the ingress and egress fast paths
        to keep performance penalties close to zero if the feature is
        not in use.
      
      * Overall cleanup to make the accessors use the program arrays.
        This should make it much easier to add new program types, which
        will then automatically follow the pinned vs. effective logic.
      
      * Fixed locking issues, as pointed out by Eric Dumazet and Alexei
        Starovoitov. Changes to the program array are now done with
        xchg() and are protected by cgroup_mutex.
      
      * eBPF programs are now expected to return 1 to let the packet pass,
        not >= 0. Pointed out by Alexei.
      
      * Operation is now limited to INET sockets, so local AF_UNIX sockets
        are not affected. The enum members are renamed accordingly. In case
        other socket families should be supported, this can be extended in
        the future.
      
      * The sample program learned to support both ingress and egress, and
        can now optionally make the eBPF program drop packets by making it
        return 0.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • samples: bpf: add userspace example for attaching eBPF programs to cgroups · d8c5b17f
      Daniel Mack committed
      Add a simple userspace program to demonstrate the new API to attach eBPF
      programs to cgroups. This is what it does:
      
       * Create arraymap in kernel with 4 byte keys and 8 byte values
      
       * Load eBPF program
      
         The eBPF program accesses the map passed in to store two pieces of
         information. The number of invocations of the program, which maps
         to the number of packets received, is stored to key 0. Key 1 is
         incremented on each iteration by the number of bytes stored in
         the skb.
      
       * Detach any eBPF program previously attached to the cgroup
      
       * Attach the new program to the cgroup using BPF_PROG_ATTACH
      
       * Once a second, read map[0] and map[1] to see how many packets and
         bytes were seen on any socket of tasks in the given cgroup.
      
      The program takes a cgroup path as 1st argument, and either "ingress"
      or "egress" as 2nd. Optionally, "drop" can be passed as 3rd argument,
      which will make the generated eBPF program return 0 instead of 1, so
      the kernel will drop the packet.
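
The accounting done by the generated program can be modelled in plain C as below. This is only a user-space analogy: struct array_map and account_packet() are hypothetical names, and the real sample builds the program from eBPF instructions and reads the map through the bpf(2) syscall.

```c
#include <stdint.h>

/* Toy array map: 4-byte keys index 8-byte values, as in the sample. */
enum { MAP_KEY_PACKETS = 0, MAP_KEY_BYTES = 1, MAP_ENTRIES = 2 };

struct array_map {
	uint64_t values[MAP_ENTRIES];
};

/* Stand-in for the generated program: count the packet under key 0, add
 * its length under key 1, then return the verdict (1 = pass, 0 = drop). */
int account_packet(struct array_map *map, uint32_t skb_len, int drop)
{
	map->values[MAP_KEY_PACKETS] += 1;
	map->values[MAP_KEY_BYTES] += skb_len;
	return drop ? 0 : 1;
}
```

Note that the counters are updated even when the packet is dropped, which matches a program that does its accounting before returning the verdict.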
      
      libbpf gained two new wrappers for the new syscall commands.
      Signed-off-by: Daniel Mack <daniel@zonque.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipv4, ipv6: run cgroup eBPF egress programs · 33b48679
      Daniel Mack committed
      If the cgroup associated with the sending socket has eBPF
      programs installed, run them from ip_output(), ip6_output() and
      ip_mc_output(). In these functions we have two socket contexts
      as per 7026b1dd ("netfilter: Pass socket pointer down through
      okfn()."). We explicitly need to use sk instead of skb->sk here,
      since otherwise the same program would run multiple times on egress
      when encap devices are involved, which is not desired in our case.
      
      eBPF programs used in this context are expected to return 1 to
      let the packet pass, or any other value to drop it. The programs have
      access to the skb through bpf_skb_load_bytes(), and the payload starts
      at the network headers (L3).
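
The return value convention can be sketched as a small wrapper. Everything here is hypothetical (filter_prog_t, run_filter(), prog_pass(), prog_other()), and mapping a drop to -EPERM is an assumption made for the illustration, not something this commit message states.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical program type: takes the packet, returns a verdict. */
typedef int (*filter_prog_t)(const void *pkt);

/* Example programs: one passes everything, one returns a non-1 value. */
int prog_pass(const void *pkt)  { (void)pkt; return 1; }
int prog_other(const void *pkt) { (void)pkt; return 2; }

/* Only a return value of exactly 1 lets the packet through. With no
 * program attached the packet passes. The -EPERM code is assumed. */
int run_filter(filter_prog_t prog, const void *pkt)
{
	if (!prog)
		return 0;
	return prog(pkt) == 1 ? 0 : -EPERM;
}
```

The point of the strict comparison is that values such as 2 or -1 are treated as a drop, not as "non-zero means pass".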
      
      Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
      for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
      the feature is unused.
      Signed-off-by: Daniel Mack <daniel@zonque.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: filter: run cgroup eBPF ingress programs · c11cd3a6
      Daniel Mack committed
      If the cgroup associated with the receiving socket has eBPF
      programs installed, run them from sk_filter_trim_cap().

      eBPF programs used in this context are expected to return 1 to
      let the packet pass, or any other value to drop it. The programs have
      access to the skb through bpf_skb_load_bytes(), and the payload starts
      at the network headers (L3).
      
      Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
      for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
      the feature is unused.
      Signed-off-by: Daniel Mack <daniel@zonque.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands · f4324551
      Daniel Mack committed
      Extend the bpf(2) syscall by two new commands, BPF_PROG_ATTACH and
      BPF_PROG_DETACH which allow attaching and detaching eBPF programs
      to a target.
      
      On the API level, the target could be anything that has an fd in
      userspace, hence the name of the field in union bpf_attr is called
      'target_fd'.
      
      When called with BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS, the target is
      expected to be a valid file descriptor of a cgroup v2 directory which
      has the bpf controller enabled. These are the only use-cases
      implemented by this patch at this point, but more can be added.
      
      If a program of the given type already exists in the given cgroup,
      the program is swapped atomically, so userspace does not have to drop
      an existing program first before installing a new one, which would
      otherwise leave a gap in which no program is attached.
      
      For more information on the propagation logic to subcgroups, please
      refer to the bpf cgroup controller implementation.
      
      The API is guarded by CAP_NET_ADMIN.
      Signed-off-by: Daniel Mack <daniel@zonque.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cgroup: add support for eBPF programs · 30070984
      Daniel Mack committed
      This patch adds two sets of eBPF program pointers to struct cgroup:
      one for programs that are directly pinned to a cgroup, and one for
      programs that are effective for it.
      
      To illustrate the logic behind that, assume the following example
      cgroup hierarchy.
      
        A - B - C
              \ D - E
      
      If only B has a program attached, it will be effective for B, C, D
      and E. If D then attaches a program itself, that will be effective for
      both D and E, and the program in B will only affect B and C. Only one
      program of a given type is effective for a cgroup.
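
The pinned vs. effective rule in this example can be expressed as a walk up the hierarchy. The sketch below is a simplification with hypothetical names (struct cg, cg_effective()): it computes the effective program on lookup, whereas the patch stores an effective pointer per cgroup and propagates changes to descendants at attach time.

```c
#include <stddef.h>

/* Toy cgroup node: "pinned" is the program attached directly to this
 * cgroup (NULL if none). Programs are represented by a name only. */
struct cg {
	const struct cg *parent;
	const char *pinned;
};

/* The effective program is the one pinned on the nearest cgroup on the
 * path to the root (starting with the cgroup itself) that has one. */
const char *cg_effective(const struct cg *cg)
{
	for (; cg; cg = cg->parent)
		if (cg->pinned)
			return cg->pinned;
	return NULL;
}
```

For the hierarchy above, pinning a program on B makes it effective for B, C, D and E; pinning another one on D then makes that one effective for D and E only.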
      
      Attaching and detaching programs will be done through the bpf(2)
      syscall. For now, ingress and egress inet socket filtering are the
      only supported use-cases.
      Signed-off-by: Daniel Mack <daniel@zonque.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add new prog type for cgroup socket filtering · 0e33661d
      Daniel Mack committed
      This program type is similar to BPF_PROG_TYPE_SOCKET_FILTER, except that
      it does not allow BPF_LD_[ABS|IND] instructions and hooks up the
      bpf_skb_load_bytes() helper.

      Programs of this type will be attached to cgroups for network filtering
      and accounting.
      Signed-off-by: Daniel Mack <daniel@zonque.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cxgb4: fix memory leak on txq_info · 619228d8
      Colin Ian King committed
      Currently if txq_info->uldtxq cannot be allocated then
      txq_info->txq is being kfree'd (which is redundant because it
      is NULL) instead of txq_info. Fix this by instead kfree'ing
      txq_info.
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 25 Nov, 2016 8 commits
    • tuntap: remove unnecessary sk_receive_queue length check during xmit · 436acceb
      Jason Wang committed
      After commit 1576d986 ("tun: switch to use skb array for tx"),
      sk_receive_queue is no longer used. So remove the unnecessary
      sk_receive_queue length check during xmit.
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • samples/bpf: fix bpf loader · db6a71dd
      Alexei Starovoitov committed
      llvm can emit relocations into sections other than program code
      (like debug info sections). Ignore them when parsing the ELF file.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • samples/bpf: fix sockex2 example · d2b024d3
      Alexei Starovoitov committed
      Since llvm commit "Do not expand UNDEF SDNode during insn selection lowering",
      llvm generates code that uses uninitialized registers in cases
      where the C code actually uses uninitialized data.
      So this sockex2 example is technically broken.
      Fix it by fully initializing the on-stack variable.
      Also increase the verifier buffer limit, since verifier output
      may not fit in 64k for this sockex2 code, depending on the llvm version.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlx4: reorganize struct mlx4_en_tx_ring · e3f42f84
      Eric Dumazet committed
      Goal is to reorganize this critical structure to increase performance.
      
      ndo_start_xmit() should only dirty one cache line, and access as few
      cache lines as possible.
      
      Add sp_ (Slow Path) prefix to fields that are not used in fast path,
      to make clear what is going on.
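
The general idea of grouping fields by who writes them can be checked with offsetof() on a small example. This is a sketch with a hypothetical struct tx_ring and helper cacheline_of(), assuming 64-byte cache lines and a C11 compiler; it is not the mlx4 layout.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64 /* assumed cache line size */

/* Hypothetical ring: completion-side fields first, transmit-side fields
 * on their own cache line, then rarely used state with an sp_ prefix. */
struct tx_ring {
	/* written at TX completion */
	uint32_t cons;
	uint32_t last_nr_txbb;

	/* written by the transmit path; starts a new cache line */
	_Alignas(CACHELINE) uint32_t prod;
	uint64_t bytes;
	uint64_t packets;

	/* slow path: touched at setup and teardown only */
	_Alignas(CACHELINE) uint16_t sp_stride;
	uint16_t sp_cqn;
};

/* Index of the cache line that a given byte offset falls into. */
size_t cacheline_of(size_t offset)
{
	return offset / CACHELINE;
}
```

Comparing cacheline_of(offsetof(struct tx_ring, prod)) with the value for packets is a cheap way to confirm that the fields dirtied by the transmit path share one line; pahole gives the same information for a real build.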
      
      After this patch, pahole reports something much better, as all the
      fields needed by ndo_start_xmit() are packed into two cache lines instead
      of seven or eight:
      
      struct mlx4_en_tx_ring {
      	u32                        last_nr_txbb;         /*     0   0x4 */
      	u32                        cons;                 /*   0x4   0x4 */
      	long unsigned int          wake_queue;           /*   0x8   0x8 */
      	struct netdev_queue *      tx_queue;             /*  0x10   0x8 */
      	u32                        (*free_tx_desc)(struct mlx4_en_priv *, struct mlx4_en_tx_ring *, int, u8, u64, int); /*  0x18   0x8 */
      	struct mlx4_en_rx_ring *   recycle_ring;         /*  0x20   0x8 */
      
      	/* XXX 24 bytes hole, try to pack */
      
      	/* --- cacheline 1 boundary (64 bytes) --- */
      	u32                        prod;                 /*  0x40   0x4 */
      	unsigned int               tx_dropped;           /*  0x44   0x4 */
      	long unsigned int          bytes;                /*  0x48   0x8 */
      	long unsigned int          packets;              /*  0x50   0x8 */
      	long unsigned int          tx_csum;              /*  0x58   0x8 */
      	long unsigned int          tso_packets;          /*  0x60   0x8 */
      	long unsigned int          xmit_more;            /*  0x68   0x8 */
      	struct mlx4_bf             bf;                   /*  0x70  0x18 */
      	/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
      	__be32                     doorbell_qpn;         /*  0x88   0x4 */
      	__be32                     mr_key;               /*  0x8c   0x4 */
      	u32                        size;                 /*  0x90   0x4 */
      	u32                        size_mask;            /*  0x94   0x4 */
      	u32                        full_size;            /*  0x98   0x4 */
      	u32                        buf_size;             /*  0x9c   0x4 */
      	void *                     buf;                  /*  0xa0   0x8 */
      	struct mlx4_en_tx_info *   tx_info;              /*  0xa8   0x8 */
      	int                        qpn;                  /*  0xb0   0x4 */
      	u8                         queue_index;          /*  0xb4   0x1 */
      	bool                       bf_enabled;           /*  0xb5   0x1 */
      	bool                       bf_alloced;           /*  0xb6   0x1 */
      	u8                         hwtstamp_tx_type;     /*  0xb7   0x1 */
      	u8 *                       bounce_buf;           /*  0xb8   0x8 */
      	/* --- cacheline 3 boundary (192 bytes) --- */
      	long unsigned int          queue_stopped;        /*  0xc0   0x8 */
      	struct mlx4_hwq_resources  sp_wqres;             /*  0xc8  0x58 */
      	/* --- cacheline 4 boundary (256 bytes) was 32 bytes ago --- */
      	struct mlx4_qp             sp_qp;                /* 0x120  0x30 */
      	/* --- cacheline 5 boundary (320 bytes) was 16 bytes ago --- */
      	struct mlx4_qp_context     sp_context;           /* 0x150  0xf8 */
      	/* --- cacheline 9 boundary (576 bytes) was 8 bytes ago --- */
      	cpumask_t                  sp_affinity_mask;     /* 0x248  0x20 */
      	enum mlx4_qp_state         sp_qp_state;          /* 0x268   0x4 */
      	u16                        sp_stride;            /* 0x26c   0x2 */
      	u16                        sp_cqn;               /* 0x26e   0x2 */
      
      	/* size: 640, cachelines: 10, members: 36 */
      	/* sum members: 600, holes: 1, sum holes: 24 */
      	/* padding: 16 */
      };
      
      Instead of this silly placement:
      
      struct mlx4_en_tx_ring {
      	u32                        last_nr_txbb;         /*     0   0x4 */
      	u32                        cons;                 /*   0x4   0x4 */
      	long unsigned int          wake_queue;           /*   0x8   0x8 */
      
      	/* XXX 48 bytes hole, try to pack */
      
      	/* --- cacheline 1 boundary (64 bytes) --- */
      	u32                        prod;                 /*  0x40   0x4 */
      
      	/* XXX 4 bytes hole, try to pack */
      
      	long unsigned int          bytes;                /*  0x48   0x8 */
      	long unsigned int          packets;              /*  0x50   0x8 */
      	long unsigned int          tx_csum;              /*  0x58   0x8 */
      	long unsigned int          tso_packets;          /*  0x60   0x8 */
      	long unsigned int          xmit_more;            /*  0x68   0x8 */
      	unsigned int               tx_dropped;           /*  0x70   0x4 */
      
      	/* XXX 4 bytes hole, try to pack */
      
      	struct mlx4_bf             bf;                   /*  0x78  0x18 */
      	/* --- cacheline 2 boundary (128 bytes) was 16 bytes ago --- */
      	long unsigned int          queue_stopped;        /*  0x90   0x8 */
      	cpumask_t                  affinity_mask;        /*  0x98  0x10 */
      	struct mlx4_qp             qp;                   /*  0xa8  0x30 */
      	/* --- cacheline 3 boundary (192 bytes) was 24 bytes ago --- */
      	struct mlx4_hwq_resources  wqres;                /*  0xd8  0x58 */
      	/* --- cacheline 4 boundary (256 bytes) was 48 bytes ago --- */
      	u32                        size;                 /* 0x130   0x4 */
      	u32                        size_mask;            /* 0x134   0x4 */
      	u16                        stride;               /* 0x138   0x2 */
      
      	/* XXX 2 bytes hole, try to pack */
      
      	u32                        full_size;            /* 0x13c   0x4 */
      	/* --- cacheline 5 boundary (320 bytes) --- */
      	u16                        cqn;                  /* 0x140   0x2 */
      
      	/* XXX 2 bytes hole, try to pack */
      
      	u32                        buf_size;             /* 0x144   0x4 */
      	__be32                     doorbell_qpn;         /* 0x148   0x4 */
      	__be32                     mr_key;               /* 0x14c   0x4 */
      	void *                     buf;                  /* 0x150   0x8 */
      	struct mlx4_en_tx_info *   tx_info;              /* 0x158   0x8 */
      	struct mlx4_en_rx_ring *   recycle_ring;         /* 0x160   0x8 */
      	u32                        (*free_tx_desc)(struct mlx4_en_priv *, struct mlx4_en_tx_ring *, int, u8, u64, int); /* 0x168   0x8 */
      	u8 *                       bounce_buf;           /* 0x170   0x8 */
      	struct mlx4_qp_context     context;              /* 0x178  0xf8 */
      	/* --- cacheline 9 boundary (576 bytes) was 48 bytes ago --- */
      	int                        qpn;                  /* 0x270   0x4 */
      	enum mlx4_qp_state         qp_state;             /* 0x274   0x4 */
      	u8                         queue_index;          /* 0x278   0x1 */
      	bool                       bf_enabled;           /* 0x279   0x1 */
      	bool                       bf_alloced;           /* 0x27a   0x1 */
      
      	/* XXX 5 bytes hole, try to pack */
      
      	/* --- cacheline 10 boundary (640 bytes) --- */
      	struct netdev_queue *      tx_queue;             /* 0x280   0x8 */
      	int                        hwtstamp_tx_type;     /* 0x288   0x4 */
      
      	/* size: 704, cachelines: 11, members: 36 */
      	/* sum members: 587, holes: 6, sum holes: 65 */
      	/* padding: 52 */
      };
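The pahole summaries above show where the saving comes from: the old layout has 6 holes totalling 65 bytes (size 704), while the new one has a single 24-byte hole (size 640). A minimal sketch of the same effect, using two hypothetical structs (`ring_sparse` and `ring_packed` are illustrative names, not the mlx4 definitions) and assuming a typical LP64 ABI such as x86-64 where 64-bit members are 8-byte aligned:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: 32-bit and 64-bit members interleaved.
 * Where 64-bit members need 8-byte alignment, the compiler inserts
 * a 4-byte hole after each 32-bit member. */
struct ring_sparse {
	uint32_t prod;
	uint64_t bytes;
	uint32_t tx_dropped;
	uint64_t packets;
};

/* Same members, with the two 32-bit fields placed side by side,
 * mirroring how prod and tx_dropped end up adjacent in the new layout. */
struct ring_packed {
	uint32_t prod;
	uint32_t tx_dropped;
	uint64_t bytes;
	uint64_t packets;
};

/* Sum of the member sizes, identical for both structs. */
#define RING_MEMBER_BYTES (2 * sizeof(uint32_t) + 2 * sizeof(uint64_t))

/* Bytes lost to holes and padding in each layout. */
size_t sparse_wasted_bytes(void)
{
	return sizeof(struct ring_sparse) - RING_MEMBER_BYTES;
}

size_t packed_wasted_bytes(void)
{
	return sizeof(struct ring_packed) - RING_MEMBER_BYTES;
}
```

Running pahole on an object file built with debug info reports such holes directly; the helpers here only compute the total from `sizeof`.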
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e3f42f84
    • F
      ethtool: Protect {get, set}_phy_tunable with PHY device mutex · 4b65246b
      Florian Fainelli committed
      PHY drivers should be able to rely on the caller of {get,set}_tunable
      having acquired the PHY device mutex, to serialize both against
      concurrent calls of these functions and against PHY state machine
      changes. All ethtool PHY-level functions do this, except
      {get,set}_tunable, so we make them consistent here as well.
      
      We need to update the Microsemi PHY driver in the same commit to avoid
      introducing either deadlocks, or lack of proper locking.
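The locking rule described above can be sketched in userspace with a pthread mutex standing in for the PHY device mutex. This is only an illustration under stated assumptions: `phy_dev_sketch`, `drv_*_tunable` and `core_*_tunable` are hypothetical names, not the kernel's ethtool or phylib API. The caller takes the lock and the driver callbacks assume it is already held; if a callback also took the same non-recursive lock, it would deadlock, which is why caller and driver have to change together.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for a PHY device: one lock, one tunable. */
struct phy_dev_sketch {
	pthread_mutex_t lock;   /* plays the role of the PHY device mutex */
	int downshift;          /* example tunable value */
};

/* Driver-level callbacks: they rely on the caller holding dev->lock
 * and must not take it themselves. */
static int drv_set_tunable(struct phy_dev_sketch *dev, int val)
{
	dev->downshift = val;
	return 0;
}

static int drv_get_tunable(struct phy_dev_sketch *dev, int *val)
{
	*val = dev->downshift;
	return 0;
}

/* Core-level entry points: acquire the lock around the driver call,
 * so concurrent get/set calls are serialized. */
int core_set_tunable(struct phy_dev_sketch *dev, int val)
{
	int ret;

	pthread_mutex_lock(&dev->lock);
	ret = drv_set_tunable(dev, val);
	pthread_mutex_unlock(&dev->lock);
	return ret;
}

int core_get_tunable(struct phy_dev_sketch *dev, int *val)
{
	int ret;

	pthread_mutex_lock(&dev->lock);
	ret = drv_get_tunable(dev, val);
	pthread_mutex_unlock(&dev->lock);
	return ret;
}
```

Any other code path that changes the device state would take the same lock, which is what gives the callbacks a consistent view.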
      
      Fixes: 968ad9da ("ethtool: Implements ETHTOOL_PHY_GTUNABLE/ETHTOOL_PHY_STUNABLE")
      Fixes: 310d9ad5 ("net: phy: Add downshift get/set support in Microsemi PHYs driver")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Allan W. Nielsen <allan.nielsen@microsemi.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4b65246b
    • D
      Merge branch 'mlx5-next' · fab96ec8
      David S. Miller committed
      Saeed Mahameed says:
      
      ====================
      Mellanox 100G mlx5 SRIOV switchdev update
      
      This series from Roi and Or further enhances the new SRIOV switchdev mode.
      
      Roi's patches deal with allowing users to configure through devlink
      the level of inline headers that the VF should be setting in order for
      the eswitch HW to do proper matching. We also enforce that the matching
      required for offloaded TC rules is aligned with that level on the PF driver.
      
      Or's patches deal with allowing the user to control the VF operational
      link state through admin directives on the mlx5 VF rep link. Also in this series
      is an implementation of HW and SW counters for the mlx5 VF rep, which is aligned
      with the design set by commit a5ea31f5 'Merge branch net-offloaded-stats'.
      
      v1 --> v2:
      * constified the net-device param of get offloaded stats ndo in mlxsw
        (pointed by 0-day screaming on us...)
      * added Or's Reviewed-by tags for Roi's patches
      
      This series was generated against commit
      e796f49d ("net: ieee802154: constify ieee802154_ops structures")
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fab96ec8
    • R
      net/mlx5e: Enforce min inline mode when offloading flows · de0af0bf
      Roi Dayan committed
      A flow should be offloaded only if the matches are
      allowed according to min inline mode.
      Signed-off-by: Roi Dayan <roid@mellanox.com>
      Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      de0af0bf
    • R
      net/mlx5: E-Switch, Add control for inline mode · bffaa916
      Roi Dayan committed
      Implement devlink show and set of HW inline-mode.
      The supported modes are none, link, network and transport.
      We currently support one mode for all vports, so set is done on all vports.
      When the eswitch is first initialized, the inline-mode is queried from the FW.
      Signed-off-by: Roi Dayan <roid@mellanox.com>
      Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bffaa916