提交 · 42b18f605feaf7aa1825b35656bb7d6fdc132b45 · openanolis / cloud-kernel

16 4月, 2016 7 次提交

tipc: refactor function tipc_link_timeout() · 42b18f60

由 Jon Paul Maloy 提交于 4月 15, 2016

The function tipc_link_timeout() is unnecessary complex, and can
easily be made more readable.

We do that with this commit. The only functional change is that we
remove a redundant test for whether the broadcast link is up or not.
Acked-by: NYing Xue <ying.xue@windriver.com>
Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

42b18f60

tipc: reduce transmission rate of reset messages when link is down · 88e8ac70

由 Jon Paul Maloy 提交于 4月 15, 2016

When a link is down, it will continuously try to re-establish contact
with the peer by sending out a RESET or an ACTIVATE message at each
timeout interval. The default value for this interval is currently
375 ms. This is wasteful, and may become a problem in very large
clusters with dozens or hundreds of nodes being down simultaneously.

We now introduce a simple backoff algorithm for these cases. The
first five messages are sent at default rate; thereafter a message
is sent only each 16th timer interval.

This will cover the vast majority of link recycling cases, since the
endpoint starting last will transmit at the higher speed, and the link
should normally be established well be before the rate needs to be
reduced.

The only case where we will see a degradation of link re-establishment
times is when the endpoints remain intact, and a glitch in the
transmission media is causing the link reset. We will then experience
a worst-case re-establishing time of 6 seconds, something we deem
acceptable.
Acked-by: NYing Xue <ying.xue@windriver.com>
Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

88e8ac70

tipc: guarantee peer bearer id exchange after reboot · 634696b1

由 Jon Paul Maloy 提交于 4月 15, 2016

When a link endpoint is going down locally, e.g., because its interface
is being stopped, it will spontaneously send out a RESET message to
its peer, informing it about this fact. This saves the peer from
detecting the failure via probing, and hence gives both speedier and
less resource consuming failure detection on the peer side.

According to the link FSM, a receiver of a RESET message, ignoring the
reason for it, must now consider the sender ready to come back up, and
starts periodically sending out ACTIVATE messages to the peer in order
to re-establish the link. Also, according to the FSM, the receiver of
an ACTIVATE message can now go directly to state ESTABLISHED and start
sending regular traffic packets. This is a well-proven and robust FSM.

However, in the case of a reboot, there is a small possibilty that link
endpoint on the rebooted node may have been re-created with a new bearer
identity between the moment it sent its (pre-boot) RESET and the moment
it receives the ACTIVATE from the peer. The new bearer identity cannot
be known by the peer according to this scenario, since traffic headers
don't convey such information. This is a problem, because both endpoints
need to know the correct value of the peer's bearer id at any moment in
time in order to be able to produce correct link events for their users.

The only way to guarantee this is to enforce a full setup message
exchange (RESET + ACTIVATE) even after the reboot, since those messages
carry the bearer idientity in their header.

In this commit we do this by introducing and setting a "stopping" bit in
the header of the spontaneously generated RESET messages, informing the
peer that the sender will not be immediately ready to re-establish the
link. A receiver seeing this bit must act as if this were a locally
detected connectivity failure, and hence has to go through a full two-
way setup message exchange before any link can be re-established.

Although never reported, this problem seems to have always been around.

This protocol addition is fully backwards compatible.
Acked-by: NYing Xue <ying.xue@windriver.com>
Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

634696b1

Merge branch 'mlxsw-next' · 25fb0b6c

由 David S. Miller 提交于 4月 15, 2016

Jiri Pirko says:

====================
mlxsw: spectrum_buffers: couple of cosmetic patches

As suggested by David Laight
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

25fb0b6c

mlxsw: spectrum_buffers: Use MLXSW_SP_PB_UNUSED define for unused pb · b94cdabb

由 Jiri Pirko 提交于 4月 15, 2016

Suggested-by: NDavid Laight <David.Laight@ACULAB.COM>
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b94cdabb

mlxsw: spectrum_buffers: Use designated initializers for mlxsw_sp_pbs · ce78f020

由 Jiri Pirko 提交于 4月 15, 2016

Suggested-by: NDavid Laight <David.Laight@ACULAB.COM>
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ce78f020

devlink: fix sb register stub in case devlink is disabled · de33efd0

由 Jiri Pirko 提交于 4月 15, 2016

Reported-by: Nkbuild test robot <fengguang.wu@intel.com>
Fixes: bf797471 ("devlink: add shared buffer configuration")
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

de33efd0

15 4月, 2016 33 次提交

tun: use per cpu variables for stats accounting · 608b9977

由 Paolo Abeni 提交于 4月 13, 2016

Currently the tun device accounting uses dev->stats without applying any
kind of protection, regardless that accounting happens in preemptible
process context.
This patch move the tun stats to a per cpu data structure, and protect
the updates with  u64_stats_update_begin()/u64_stats_update_end() or
this_cpu_inc according to the stat type. The per cpu stats are
aggregated by the newly added ndo_get_stats64 ops.
Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

608b9977

Merge branch 'bpf-ARG_PTR_TO_RAW_STACK' · 548aacdd

由 David S. Miller 提交于 4月 14, 2016

Merge branch 'bpf-ARG_PTR_TO_RAW_STACK'

Daniel Borkmann says:

====================
BPF updates

This series adds a new verifier argument type called
ARG_PTR_TO_RAW_STACK and converts related helpers to make
use of it. Basic idea is that we can save init of stack
memory when the helper function is guaranteed to fully
fill out the passed buffer in every path. Series also adds
test cases and converts samples. For more details, please
see individual patches.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

548aacdd

bpf, samples: add test cases for raw stack · 3f2050e2

由 Daniel Borkmann 提交于 4月 13, 2016

This adds test cases mostly around ARG_PTR_TO_RAW_STACK to check the
verifier behaviour.

  [...]
  #84 raw_stack: no skb_load_bytes OK
  #85 raw_stack: skb_load_bytes, no init OK
  #86 raw_stack: skb_load_bytes, init OK
  #87 raw_stack: skb_load_bytes, spilled regs around bounds OK
  #88 raw_stack: skb_load_bytes, spilled regs corruption OK
  #89 raw_stack: skb_load_bytes, spilled regs corruption 2 OK
  #90 raw_stack: skb_load_bytes, spilled regs + data OK
  #91 raw_stack: skb_load_bytes, invalid access 1 OK
  #92 raw_stack: skb_load_bytes, invalid access 2 OK
  #93 raw_stack: skb_load_bytes, invalid access 3 OK
  #94 raw_stack: skb_load_bytes, invalid access 4 OK
  #95 raw_stack: skb_load_bytes, invalid access 5 OK
  #96 raw_stack: skb_load_bytes, invalid access 6 OK
  #97 raw_stack: skb_load_bytes, large access OK
  Summary: 98 PASSED, 0 FAILED
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3f2050e2

bpf, samples: don't zero data when not needed · 02413cab

由 Daniel Borkmann 提交于 4月 13, 2016

Remove the zero initialization in the sample programs where appropriate.
Note that this is an optimization which is now possible, old programs
still doing the zero initialization are just fine as well. Also, make
sure we don't have padding issues when we don't memset() the entire
struct anymore.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

02413cab

bpf: convert relevant helper args to ARG_PTR_TO_RAW_STACK · 074f528e

由 Daniel Borkmann 提交于 4月 13, 2016

This patch converts all helpers that can use ARG_PTR_TO_RAW_STACK as argument
type. For tc programs this is bpf_skb_load_bytes(), bpf_skb_get_tunnel_key(),
bpf_skb_get_tunnel_opt(). For tracing, this optimizes bpf_get_current_comm()
and bpf_probe_read(). The check in bpf_skb_load_bytes() for MAX_BPF_STACK can
also be removed since the verifier already makes sure we stay within bounds
on stack buffers.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

074f528e

bpf, verifier: add ARG_PTR_TO_RAW_STACK type · 435faee1

由 Daniel Borkmann 提交于 4月 13, 2016

When passing buffers from eBPF stack space into a helper function, we have
ARG_PTR_TO_STACK argument type for helpers available. The verifier makes sure
that such buffers are initialized, within boundaries, etc.

However, the downside with this is that we have a couple of helper functions
such as bpf_skb_load_bytes() that fill out the passed buffer in the expected
success case anyway, so zero initializing them prior to the helper call is
unneeded/wasted instructions in the eBPF program that can be avoided.

Therefore, add a new helper function argument type called ARG_PTR_TO_RAW_STACK.
The idea is to skip the STACK_MISC check in check_stack_boundary() and color
the related stack slots as STACK_MISC after we checked all call arguments.

Helper functions using ARG_PTR_TO_RAW_STACK must make sure that every path of
the helper function will fill the provided buffer area, so that we cannot leak
any uninitialized stack memory. This f.e. means that error paths need to
memset() the buffers, but the expected fast-path doesn't have to do this
anymore.

Since there's no such helper needing more than at most one ARG_PTR_TO_RAW_STACK
argument, we can keep it simple and don't need to check for multiple areas.
Should in future such a use-case really appear, we have check_raw_mode() that
will make sure we implement support for it first.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

435faee1

bpf, verifier: add bpf_call_arg_meta for passing meta data · 33ff9823

由 Daniel Borkmann 提交于 4月 13, 2016

Currently, when the verifier checks calls in check_call() function, we
call check_func_arg() for all 5 arguments e.g. to make sure expected types
are correct. In some cases, we collect meta data (here: map pointer) to
perform additional checks such as checking stack boundary on key/value
sizes for subsequent arguments. As we're going to extend the meta data,
add a generic struct bpf_call_arg_meta that we can use for passing into
check_func_arg().
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

33ff9823

sctp: add support for RPS and RFS · 486bdee0

由 Marcelo Ricardo Leitner 提交于 4月 12, 2016

This patch adds what's missing to properly support RPS and RFS on SCTP,
as some of it is already implemented in common calls.

Having support for RPS and RFS allows better scaling specially because
not all NICs support hashing SCTP headers.

Save the hash right when we dequeue a skb from inqueue so we do it only
once per skb instead of per chunk. New sockets will then inherit the
hash through sctp_copy_sock().
Signed-off-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

486bdee0

net: validate_xmit_skb() changes · d21fd63e

由 Eric Dumazet 提交于 4月 12, 2016

skbs given to validate_xmit_skb() should not have a next
pointer anymore.

Also if a packet is dropped, increment dev->tx_dropped
__dev_queue_xmit() no longer has to change tx_dropped in this case.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d21fd63e

packet: uses kfree_skb() for errors. · da37845f

由 Weongyo Jeong 提交于 4月 14, 2016

consume_skb() isn't for error cases that kfree_skb() is more proper
one.  At this patch, it fixed tpacket_rcv() and packet_rcv() to be
consistent for error or non-error cases letting perf trace its event
properly.
Signed-off-by: NWeongyo Jeong <weongyo.linux@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

da37845f

tipc: fix a race condition leading to subscriber refcnt bug · 333f7962

由 Parthasarathy Bhuvaragan 提交于 4月 12, 2016

Until now, the requests sent to topology server are queued
to a workqueue by the generic server framework.
These messages are processed by worker threads and trigger the
registered callbacks.
To reduce latency on uniprocessor systems, explicit rescheduling
is performed using cond_resched() after MAX_RECV_MSG_COUNT(25)
messages.

This implementation on SMP systems leads to an subscriber refcnt
error as described below:
When a worker thread yields by calling cond_resched() in a SMP
system, a new worker is created on another CPU to process the
pending workitem. Sometimes the sleeping thread wakes up before
the new thread finishes execution.
This breaks the assumption on ordering and being single threaded.
The fault is more frequent when MAX_RECV_MSG_COUNT is lowered.

If the first thread was processing subscription create and the
second thread processing close(), the close request will free
the subscriber and the create request oops as follows:

[31.224137] WARNING: CPU: 2 PID: 266 at include/linux/kref.h:46 tipc_subscrb_rcv_cb+0x317/0x380         [tipc]
[31.228143] CPU: 2 PID: 266 Comm: kworker/u8:1 Not tainted 4.5.0+ #97
[31.228377] Workqueue: tipc_rcv tipc_recv_work [tipc]
[...]
[31.228377] Call Trace:
[31.228377]  [<ffffffff812fbb6b>] dump_stack+0x4d/0x72
[31.228377]  [<ffffffff8105a311>] __warn+0xd1/0xf0
[31.228377]  [<ffffffff8105a3fd>] warn_slowpath_null+0x1d/0x20
[31.228377]  [<ffffffffa0098067>] tipc_subscrb_rcv_cb+0x317/0x380 [tipc]
[31.228377]  [<ffffffffa00a4984>] tipc_receive_from_sock+0xd4/0x130 [tipc]
[31.228377]  [<ffffffffa00a439b>] tipc_recv_work+0x2b/0x50 [tipc]
[31.228377]  [<ffffffff81071925>] process_one_work+0x145/0x3d0
[31.246554] ---[ end trace c3882c9baa05a4fd ]---
[31.248327] BUG: spinlock bad magic on CPU#2, kworker/u8:1/266
[31.249119] BUG: unable to handle kernel NULL pointer dereference at 0000000000000428
[31.249323] IP: [<ffffffff81099d0c>] spin_dump+0x5c/0xe0
[31.249323] PGD 0
[31.249323] Oops: 0000 [#1] SMP

In this commit, we
- rename tipc_conn_shutdown() to tipc_conn_release().
- move connection release callback execution from tipc_close_conn()
  to a new function tipc_sock_release(), which is executed before
  we free the connection.
Thus we release the subscriber during connection release procedure
rather than connection shutdown procedure.
Signed-off-by: NParthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: NYing Xue <ying.xue@windriver.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

333f7962

Merge branch 'gro-fixed-id-gso-partial' · edd93cd7

由 David S. Miller 提交于 4月 14, 2016

Alexander Duyck says:

====================
GRO Fixed IPv4 ID support and GSO partial support

This patch series sets up a few different things.

First it adds support for GRO of frames with a fixed IP ID value.  This
will allow us to perform GRO for frames that go through things like an IPv6
to IPv4 header translation.

The second item we add is support for segmenting frames that are generated
this way.  Most devices only support an incrementing IP ID value, and in
the case of TCP the IP ID can be ignored in many cases since the DF bit
should be set.  So we can technically segment these frames using existing
TSO if we are willing to allow the IP ID to be mangled.  As such I have
added a matching feature for the new form of GRO/GSO called TCP IPv4 ID
mangling.  With this enabled we can assemble and disassemble a frame with
the sequence number fixed and the only ill effect will be that the IPv4 ID
will be altered which may or may not have any noticeable effect.  As such I
have defaulted the feature to disabled.

The third item this patch series adds is support for partial GSO
segmentation.  Partial GSO segmentation allows us to split a large frame
into two pieces.  The first piece will have an even multiple of MSS worth
of data and the headers before the one pointed to by csum_start will have
been updated so that they are correct for if the data payload had already
been segmented.  By doing this we can do things such as precompute the
outer header checksums for a frame to be segmented allowing us to perform
TSO on devices that don't support tunneling, or tunneling with outer header
checksums.

This patch set is based on the net-next tree, but I included "net: remove
netdevice gso_min_segs" in my tree as I assume it is likely to be applied
before this patch set will and I wanted to avoid a merge conflict.

v2: Fixed items reported by Jesse Gross
	fixed missing GSO flag in MPLS check
	adding DF check for MANGLEID
    Moved extra GSO feature checks into gso_features_check
    Rebased batches to account for "net: remove netdevice gso_min_segs"

Driver patches from the first patch set should still be compatible.  However
I do have a few changes in them so I will submit a v2 of those to Jeff
Kirsher once these patches are accepted into net-next.

Example driver patches for i40e, ixgbe, and igb:
https://patchwork.ozlabs.org/patch/608221/
https://patchwork.ozlabs.org/patch/608224/
https://patchwork.ozlabs.org/patch/608225/
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

edd93cd7

Documentation: Add documentation for TSO and GSO features · f7a6272b

由 Alexander Duyck 提交于 4月 10, 2016

This document is a starting point for defining the TSO and GSO features.
The whole thing is starting to get a bit messy so I wanted to make sure we
have notes somwhere to start describing what does and doesn't work.
Signed-off-by: NAlexander Duyck <aduyck@mirantis.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f7a6272b

GSO: Support partial segmentation offload · 802ab55a

由 Alexander Duyck 提交于 4月 10, 2016

This patch adds support for something I am referring to as GSO partial.
The basic idea is that we can support a broader range of devices for
segmentation if we use fixed outer headers and have the hardware only
really deal with segmenting the inner header.  The idea behind the naming
is due to the fact that everything before csum_start will be fixed headers,
and everything after will be the region that is handled by hardware.

With the current implementation it allows us to add support for the
following GSO types with an inner TSO_MANGLEID or TSO6 offload:
NETIF_F_GSO_GRE
NETIF_F_GSO_GRE_CSUM
NETIF_F_GSO_IPIP
NETIF_F_GSO_SIT
NETIF_F_UDP_TUNNEL
NETIF_F_UDP_TUNNEL_CSUM

In the case of hardware that already supports tunneling we may be able to
extend this further to support TSO_TCPV4 without TSO_MANGLEID if the
hardware can support updating inner IPv4 headers.
Signed-off-by: NAlexander Duyck <aduyck@mirantis.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

802ab55a

GRO: Add support for TCP with fixed IPv4 ID field, limit tunnel IP ID values · 1530545e

由 Alexander Duyck 提交于 4月 10, 2016

This patch does two things.

First it allows TCP to aggregate TCP frames with a fixed IPv4 ID field.  As
a result we should now be able to aggregate flows that were converted from
IPv6 to IPv4.  In addition this allows us more flexibility for future
implementations of segmentation as we may be able to use a fixed IP ID when
segmenting the flow.

The second thing this does is that it places limitations on the outer IPv4
ID header in the case of tunneled frames.  Specifically it forces the IP ID
to be incrementing by 1 unless the DF bit is set in the outer IPv4 header.
This way we can avoid creating overlapping series of IP IDs that could
possibly be fragmented if the frame goes through GRO and is then
resegmented via GSO.
Signed-off-by: NAlexander Duyck <aduyck@mirantis.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1530545e

GSO: Add GSO type for fixed IPv4 ID · cbc53e08

由 Alexander Duyck 提交于 4月 10, 2016

This patch adds support for TSO using IPv4 headers with a fixed IP ID
field.  This is meant to allow us to do a lossless GRO in the case of TCP
flows that use a fixed IP ID such as those that convert IPv6 header to IPv4
headers.

In addition I am adding a feature that for now I am referring to TSO with
IP ID mangling.  Basically when this flag is enabled the device has the
option to either output the flow with incrementing IP IDs or with a fixed
IP ID regardless of what the original IP ID ordering was.  This is useful
in cases where the DF bit is set and we do not care if the original IP ID
value is maintained.
Signed-off-by: NAlexander Duyck <aduyck@mirantis.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cbc53e08

ethtool: Add support for toggling any of the GSO offloads · 518f213d

由 Alexander Duyck 提交于 4月 10, 2016

The strings were missing for several of the GSO offloads that are
available.  This patch provides the missing strings so that we can toggle
or query any of them via the ethtool command.
Signed-off-by: NAlexander Duyck <aduyck@mirantis.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

518f213d

Merge branch 'mlxsw-devlink-shared-buffers' · cb689269

由 David S. Miller 提交于 4月 14, 2016

Jiri Pirko says:

====================
devlink + mlxsw: add support for config and control of shared buffers

ASICs implement shared buffer for packet forwarding purposes and enable
flexible partitioning of the shared buffer for different flows and ports,
enabling non-blocking progress of different flows as well as separation
of lossy traffic from loss-less traffic when using Per-Priority Flow
Control (PFC). The shared buffer optimizes the buffer utilization for better
absorption of packet bursts.

This patchset implements API which is based on the model SAI uses. That is
aligned with multiple ASIC vendors so this API should be vendor neutral.

Userspace counterpart patchset for devlink iproute2 tool can be found here:
https://github.com/jpirko/iproute2_mlxsw/tree/devlink_sb

Couple of examples of usage:

switch$ devlink sb help
Usage: devlink sb show [ DEV [ sb SB_INDEX ] ]
       devlink sb pool show [ DEV [ sb SB_INDEX ] pool POOL_INDEX ]
       devlink sb pool set DEV [ sb SB_INDEX ] pool POOL_INDEX
                           size POOL_SIZE thtype { static | dynamic }
       devlink sb port pool show [ DEV/PORT_INDEX [ sb SB_INDEX ]
                                   pool POOL_INDEX ]
       devlink sb port pool set DEV/PORT_INDEX [ sb SB_INDEX ]
                                pool POOL_INDEX th THRESHOLD
       devlink sb tc bind show [ DEV/PORT_INDEX [ sb SB_INDEX ] tc TC_INDEX ]
       devlink sb tc bind set DEV/PORT_INDEX [ sb SB_INDEX ] tc TC_INDEX
                              type { ingress | egress } pool POOL_INDEX
                              th THRESHOLD
       devlink sb occupancy show { DEV | DEV/PORT_INDEX } [ sb SB_INDEX ]
       devlink sb occupancy snapshot DEV [ sb SB_INDEX ]
       devlink sb occupancy clearmax DEV [ sb SB_INDEX ]

switch$ devlink sb show
pci/0000:03:00.0: sb 0 size 16777216 ing_pools 4 eg_pools 4 ing_tcs 8 eg_tcs 8

switch$ devlink sb pool show
pci/0000:03:00.0: sb 0 pool 0 type ingress size 12400032 thtype dynamic
pci/0000:03:00.0: sb 0 pool 1 type ingress size 0 thtype dynamic
pci/0000:03:00.0: sb 0 pool 2 type ingress size 0 thtype dynamic
pci/0000:03:00.0: sb 0 pool 3 type ingress size 200064 thtype dynamic
pci/0000:03:00.0: sb 0 pool 4 type egress size 13220064 thtype dynamic
pci/0000:03:00.0: sb 0 pool 5 type egress size 0 thtype dynamic
pci/0000:03:00.0: sb 0 pool 6 type egress size 0 thtype dynamic
pci/0000:03:00.0: sb 0 pool 7 type egress size 0 thtype dynamic

switch$ devlink sb port pool show sw0p7 pool 0
sw0p7: sb 0 pool 0 threshold 16

switch$ sudo devlink sb port pool set sw0p7 pool 0 th 15

switch$ devlink sb port pool show sw0p7 pool 0
sw0p7: sb 0 pool 0 threshold 15

switch$ devlink sb tc bind show sw0p7 tc 0 type ingress
sw0p7: sb 0 tc 0 type ingress pool 0 threshold 10

switch$ sudo devlink sb tc bind set sw0p7 tc 0 type ingress pool 0 th 9

switch$ devlink sb tc bind show sw0p7 tc 0 type ingress
sw0p7: sb 0 tc 0 type ingress pool 0 threshold 9

switch$ sudo devlink sb occupancy snapshot pci/0000:03:00.0

switch$ devlink sb occupancy show sw0p7
sw0p7:
  pool: 0:      82944/3217344 1:          0/0       2:          0/0       3:          0/0
        4:          0/384     5:          0/0       6:          0/0       7:          0/0
  itc:  0(0):   96768/3217344 1(0):       0/0       2(0):       0/0       3(0):       0/0
        4(0):       0/0       5(0):       0/0       6(0):       0/0       7(0):       0/0
  etc:  0(4):       0/384     1(4):       0/0       2(4):       0/0       3(4):       0/0
        4(4):       0/0       5(4):       0/0       6(4):       0/0       7(4):       0/0

switch$ sudo devlink sb occupancy clearmax pci/0000:03:00.0
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cb689269

mlxsw: spectrum_buffers: Implement occupancy monitoring · 2d0ed39f

由 Jiri Pirko 提交于 4月 14, 2016

Implement occupancy API introduced in devlink and mlxsw core. This is
done by accessing SBPM register for Port-Pool and SBSR for Port-TC
current and max occupancy values. Max clear is implemented using the
same registers.
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2d0ed39f

mlxsw: core: Introduce support for asynchronous EMAD register access · caf7297e

由 Jiri Pirko 提交于 4月 14, 2016

So far it was possible to have one EMAD register access at a time,
locked by mutex. This patch extends this interface to allow multiple
EMAD register accesses to be in fly at once. That allows faster
processing on firmware side avoiding unused time in between EMADs.
Measured speedup is ~30% for shared occupancy snapshot operation.
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

caf7297e

mlxsw: core: Add mlxsw specific workqueue and use it for FDB notif. processing · dd9bdb04

由 Jiri Pirko 提交于 4月 14, 2016

Follow-up patch is going to need to use delayed work as well and
frequently. The FDB notification processing is already using that and
also quite frequently. It makes sense to create separate workqueue just
for mlxsw driver in this case and do not pollute system_wq.
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

dd9bdb04

mlxsw: reg: Extend SBPM register for occupancy control · 42a7f1d7

由 Jiri Pirko 提交于 4月 14, 2016

Since it is not possible to get and clear Port-Pool occupancy data using
SBSR register, there's a need to implement that using SBPM.
Extend pack helper and add unpack helper to get occupancy values.
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

42a7f1d7

mlxsw: reg: Add Shared Buffer Status register definition · 26176def

由 Jiri Pirko 提交于 4月 14, 2016

This register allows to query HW for current and maximal buffer usage.
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

26176def

mlxsw: core: Add devlink shared buffer occupancy callbacks · 1ceecc88

由 Jiri Pirko 提交于 4月 14, 2016

Add middle layer in mlxsw core code to forward shared buffer occupancy
calls into specific ASIC drivers.
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1ceecc88

mlxsw: spectrum_buffers: Implement shared buffer configuration · 0f433fa0

由 Jiri Pirko 提交于 4月 14, 2016

Implement previously introduced mlxsw core shared buffer API.
For Spectrum, that is done utilizing registers SBPR, SBCM and SBPM.
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0f433fa0

mlxsw: core: Add mlxsw_core_port_driver_priv helper · 325f2f19

由 Jiri Pirko 提交于 4月 14, 2016

Needed in following patch.
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

325f2f19

mlxsw: spectrum_buffers: Get max_buff defaults into limits exposed to user · c30a53c7

由 Jiri Pirko 提交于 4月 14, 2016

Although the device supports max_buff magic values 0 and 0xff, these are
not exposed to the user via devlink.
Therefore, adjust the default values to be within configurable range.
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c30a53c7

mlxsw: spectrum_buffers: Change initialization of PG 9 · bc872506

由 Jiri Pirko 提交于 4月 14, 2016

As explained in commit ff6551ec ("mlxsw: spectrum: Correctly
configure headroom size") control packets are directed to priority group
buffer 9 (PG9) in the ports' headroom buffers.

Since we don't want to drop control packets in case they can't be
admitted to the switch's shared buffer we bind PG9 to a different
ingress pool from the one used by all other PGs.

Unlike other PGs, we currently don't expose the binding between PG9 to a
pool and leave it fixed.
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bc872506

mlxsw: spectrum_buffers: Remove eg pool 3 default init and CPU port TC binding to it · 5408f7cb

由 Jiri Pirko 提交于 4月 14, 2016

Since there is no congestion control for CPU port traffic, we can change
the CPU port TC binding to pool 0 with min_buff and max_buff zeroed.
Remove initialization for pool egress pool 3 since it is no longer used
by dafault.
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5408f7cb

mlxsw: spectrum_buffers: Cache shared buffer configuration · 078f9c71

由 Jiri Pirko 提交于 4月 14, 2016

In order to achieve faster dumping of current setting and also in order
to provide possibility to get pool mode without a need to query hardware,
do cache the configuration in driver.
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

078f9c71

mlxsw: spectrum_buffers: Rename "pool" to "pr" in initialization · aa99bc70

由 Jiri Pirko 提交于 4月 14, 2016

Be consintent with rest of the registers (pm, cm) and use "pr" here.
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

aa99bc70

mlxsw: spectrum_buffers: Push out indexes and direction out of SB structs · b11c3b40

由 Jiri Pirko 提交于 4月 14, 2016

Structs are in arrays so use array index as pool/tc/prio index. With
that, there is need to maintain separate arrays for ingress and egress.
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b11c3b40

mlxsw: spectrum_buffers: Push out shared buffer register writes · 94266e32

由 Jiri Pirko 提交于 4月 14, 2016

Pushed them into helper functions.
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

94266e32

openanolis / cloud-kernel 接近 2 年 前同步成功

openanolis / cloud-kernel
接近 2 年前同步成功