提交 · 81713d3788d2e6bc005f15ee1c59d0eb06050a6b · openanolis / cloud-kernel

15 2月, 2017 19 次提交

IB/mlx5: Add implicit MR support · 81713d37

由 Artemy Kovalyov 提交于 1月 18, 2017

Add implicit MR, covering entire user address space.
The MR is implemented as an indirect KSM MR consisting of
1GB direct MRs.
Pages and direct MRs are added/removed to MR by ODP.
Signed-off-by: NArtemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: NLeon Romanovsky <leon@kernel.org>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

81713d37

IB/mlx5: Expose MR cache for mlx5_ib · 49780d42

由 Artemy Kovalyov 提交于 1月 18, 2017

Allow other parts of mlx5_ib to use MR cache mechanism.
* Add new functions mlx5_mr_cache_alloc and mlx5_mr_cache_free
* Traditional MTT MKey buckets are limited by MAX_UMR_CACHE_ENTRY
  Additinal buckets may be added above.
Signed-off-by: NArtemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: NLeon Romanovsky <leon@kernel.org>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

49780d42

IB/mlx5: Add null_mkey access · 94990b49

由 Artemy Kovalyov 提交于 1月 18, 2017

Add mlx5_cmd_null_mkey() function to access null_mkey information
from firmware.
Signed-off-by: NArtemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: NLeon Romanovsky <leon@kernel.org>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

94990b49

IB/umem: Indicate that process is being terminated · d9d0674c

由 Artemy Kovalyov 提交于 1月 18, 2017

When process is killed while pagefault operation still in progress -
function will fail. In this specific case we don't want any warnings in
dmesg to avoid log analyzers false alerts. So we need distinct error
code for this case.
Signed-off-by: NArtemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: NLeon Romanovsky <leon@kernel.org>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

d9d0674c

IB/umem: Update on demand page (ODP) support · d07d1d70

由 Artemy Kovalyov 提交于 1月 18, 2017

Currently ODP MR may explicitly register virtual address space area
of limited length.
This change allows MR to cover entire process virtual address space
dynamicaly adding/removing translation entries to device MTT.

Add following changes to support implicit MR:
* Allow umem to be zero size to back-up implicit MR.
* Add new function ib_alloc_odp_umem() to add virtual memory regions
  to implicit MR dynamically on demand.
* Add new function rbt_ib_umem_lookup() to find dynamically added
  virtual memory regions.
* Expose function rbt_ib_umem_for_each_in_range() to other modules and
  make it safe
Signed-off-by: NArtemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: NLeon Romanovsky <leon@kernel.org>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

d07d1d70

IB/core: Add implicit MR flag · 25bf14d6

由 Artemy Kovalyov 提交于 1月 18, 2017

Add flag IB_ODP_SUPPORT_IMPLICIT indicating implicit MR supported.
Signed-off-by: NArtemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: NLeon Romanovsky <leon@kernel.org>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

25bf14d6

IB/mlx5: Support creation of a WQ with scatter FCS offload · 4be6da1e

由 Noa Osherovich 提交于 1月 18, 2017

Add support for creation of a WQ with scatter FCS capability, if
this capability is supported by the hardware.
Signed-off-by: NNoa Osherovich <noaos@mellanox.com>
Reviewed-by: NMajd Dibbiny <majd@mellanox.com>
Reviewed-by: NYishai Hadas <yishaih@mellanox.com>
Signed-off-by: NLeon Romanovsky <leon@kernel.org>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

4be6da1e

IB/mlx5: Enable QP creation with cvlan offload · e4cc4fa7

由 Noa Osherovich 提交于 1月 18, 2017

Enable creating a RAW Ethernet QP with cvlan stripping offload when
it's supported by the hardware.
Signed-off-by: NNoa Osherovich <noaos@mellanox.com>
Reviewed-by: NMaor Gottlieb <maorg@mellanox.com>
Reviewed-by: NYishai Hadas <yishaih@mellanox.com>
Signed-off-by: NLeon Romanovsky <leon@kernel.org>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

e4cc4fa7

IB/mlx5: Enable WQ creation and modification with cvlan offload · b1f74a84

由 Noa Osherovich 提交于 1月 18, 2017

Allow creating a WQ with cvlan stripping considering device's
capabilities. The default value was fixed to disable vlan stripping
till was asked explicitly.

In addition, allow modification of a WQ to turn on/off this property.
Signed-off-by: NNoa Osherovich <noaos@mellanox.com>
Reviewed-by: NMaor Gottlieb <maorg@mellanox.com>
Reviewed-by: NYishai Hadas <yishaih@mellanox.com>
Signed-off-by: NLeon Romanovsky <leon@kernel.org>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

b1f74a84

IB/mlx5: Expose vlan offloads capabilities · e8161334

由 Noa Osherovich 提交于 1月 18, 2017

Check device's capabilities and report which raw packet capabilities
are supported.
Signed-off-by: NNoa Osherovich <noaos@mellanox.com>
Reviewed-by: NMaor Gottlieb <maorg@mellanox.com>
Reviewed-by: NYishai Hadas <yishaih@mellanox.com>
Signed-off-by: NLeon Romanovsky <leon@kernel.org>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

e8161334

IB/uverbs: Enable QP creation with cvlan offload · 9e1b161f

由 Noa Osherovich 提交于 1月 18, 2017

Enable user applications to create a QP with cvlan stripping offload.
Signed-off-by: NNoa Osherovich <noaos@mellanox.com>
Reviewed-by: NMaor Gottlieb <maorg@mellanox.com>
Reviewed-by: NYishai Hadas <yishaih@mellanox.com>
Signed-off-by: NLeon Romanovsky <leon@kernel.org>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

9e1b161f

IB/uverbs: Enable WQ creation and modification with cvlan offload · af1cb95d

由 Noa Osherovich 提交于 1月 18, 2017

Enable user space application via WQ creation and modification to
turn on and off cvlan offload.
Signed-off-by: NNoa Osherovich <noaos@mellanox.com>
Reviewed-by: NMaor Gottlieb <maorg@mellanox.com>
Reviewed-by: NYishai Hadas <yishaih@mellanox.com>
Signed-off-by: NLeon Romanovsky <leon@kernel.org>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

af1cb95d

IB/uverbs: Expose vlan offloads capabilities · 5f23d426

由 Noa Osherovich 提交于 1月 18, 2017

Expose raw packet capabilities to user space as part of query device.
Signed-off-by: NNoa Osherovich <noaos@mellanox.com>
Reviewed-by: NMaor Gottlieb <maorg@mellanox.com>
Reviewed-by: NYishai Hadas <yishaih@mellanox.com>
Signed-off-by: NLeon Romanovsky <leon@kernel.org>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

5f23d426

IB/core: Add scatter FCS flag to use in WQ creation · 27b0df11

由 Noa Osherovich 提交于 1月 18, 2017

Add a new creation flag to set the scatter FCS capability of a WQ.
Signed-off-by: NNoa Osherovich <noaos@mellanox.com>
Reviewed-by: NMajd Dibbiny <majd@mellanox.com>
Reviewed-by: NYishai Hadas <yishaih@mellanox.com>
Signed-off-by: NLeon Romanovsky <leon@kernel.org>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

27b0df11

IB/core: Enable QP creation with cvlan offload · 9c2b270e

由 Noa Osherovich 提交于 1月 18, 2017

Add a QP creation flag to support cvlan stripping, it's applicable
for RAW Ethernet QP.
Signed-off-by: NNoa Osherovich <noaos@mellanox.com>
Reviewed-by: NMaor Gottlieb <maorg@mellanox.com>
Reviewed-by: NYishai Hadas <yishaih@mellanox.com>
Signed-off-by: NLeon Romanovsky <leon@kernel.org>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

9c2b270e

IB/core: Enable WQ creation and modification with cvlan offload · 10bac72b

由 Noa Osherovich 提交于 1月 18, 2017

Enable WQ creation and modification with cvlan stripping offload.
This includes:
- Adding WQ creation flags.
- Extending modify WQ to get flags and flags mask to enable turning
  it on and off.
Signed-off-by: NNoa Osherovich <noaos@mellanox.com>
Reviewed-by: NMaor Gottlieb <maorg@mellanox.com>
Reviewed-by: NYishai Hadas <yishaih@mellanox.com>
Signed-off-by: NLeon Romanovsky <leon@kernel.org>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

10bac72b

IB/core: Expose vlan offloads capabilities · ebaaee25

由 Noa Osherovich 提交于 1月 18, 2017

Expose raw packet capabilities in the core layer to enable a device
to report it.
Two existing capabilities, scatter FCS and IP CSUM were added to this
field for a better user experience by exposing the raw packet caps
from one location.
This field will serve also for future capabilities for raw packet QP.

A new capability was introduced - cvlan stripping, which is the
device's ability to remove cvlan tag from an incoming packet and
report it in the matching work completion.
Signed-off-by: NNoa Osherovich <noaos@mellanox.com>
Reviewed-by: NMaor Gottlieb <maorg@mellanox.com>
Reviewed-by: NYishai Hadas <yishaih@mellanox.com>
Signed-off-by: NLeon Romanovsky <leon@kernel.org>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

ebaaee25

IB/mlx5: Add port counter support for Receive WQs · 23a6964e

由 Majd Dibbiny 提交于 1月 18, 2017

Counters weren't updated due to Receive WQs' traffic since the
counter-id was not associated with the RQ.

Added support for associating the q-counter-id with the Receive WQ.
The attachment is done only when changing WQ's state from RESET to
READY in modify-WQ command.

FW support is required for the above, without this support
Receive WQ counters will not count.
Signed-off-by: NMajd Dibbiny <majd@mellanox.com>
Signed-off-by: NLeon Romanovsky <leon@kernel.org>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

23a6964e

IB/mlx5: Expose Q counters groups only if they are supported by FW · 7c16f477

由 Kamal Heib 提交于 1月 18, 2017

This patch modify the Q counters implementation, so each one of the
three Q counters groups will be exposed by the driver only if they are
supported by the firmware.
Signed-off-by: NKamal Heib <kamalh@mellanox.com>
Reviewed-by: NMark Bloch <markb@mellanox.com>
Signed-off-by: NLeon Romanovsky <leon@kernel.org>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

7c16f477

14 2月, 2017 12 次提交

IB/mlx5: Replace ENOTSUPP usage with EOPNOTSUPP · 1ffd3a26

由 Leon Romanovsky 提交于 1月 18, 2017

Flow steering is supposed to return EOPNOTSUPP error
for unsupported fields and not ENOTSUPP error.
Signed-off-by: NLeon Romanovsky <leonro@mellanox.com>
Signed-off-by: NLeon Romanovsky <leon@kernel.org>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

1ffd3a26

IB/mlx5: Add flow tag support · 2ac693f9

由 Moses Reuben 提交于 1月 18, 2017

Set flow tag in flow table entry, when IB_FLOW_SPEC_ACTION_TAG
is part of the flow specifications.

Flow tag doesn't support multicast flows, so it's passing to
hardware only when used.
Signed-off-by: NMoses Reuben <mosesr@mellanox.com>
Signed-off-by: NLeon Romanovsky <leon@kernel.org>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

2ac693f9

IB/uverbs: Add support for flow tag · 94e03f11

由 Moses Reuben 提交于 1月 18, 2017

The struct ib_uverbs_flow_spec_action_tag associates a tag_id with the
flow defined by any number of other flow_spec entries which can reference
L2, L3, and L4 packet contents.

Use of ib_uverbs_flow_spec_action_tag allows the consumer to identify
the set of rules which where matched by
the packet by examining the tag_id in the CQE.
Signed-off-by: NMoses Reuben <mosesr@mellanox.com>
Signed-off-by: NLeon Romanovsky <leon@kernel.org>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

94e03f11

IB/core: Introduce flow tag specification · 460d0198

由 Moses Reuben 提交于 1月 18, 2017

This specification identifies flow with a specific tag-id.
This tag-id will be reported in the CQE.
Signed-off-by: NMoses Reuben <mosesr@mellanox.com>
Signed-off-by: NLeon Romanovsky <leon@kernel.org>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

460d0198

net/mlx5: Consolidate flow rules regardless their flow tag · e04a0183

由 Maor Gottlieb 提交于 1月 18, 2017

Flow rules with same match criteria and value should be mapped to
the same flow table entry regardless the flow tag identifier.

Flow tag is part of flow table entry context and not of the
destination, therefore we should return error when we try to add
destination to flow table entry with different flow tag.
Signed-off-by: NMaor Gottlieb <maorg@mellanox.com>
Signed-off-by: NLeon Romanovsky <leon@kernel.org>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

e04a0183

IB/mlx5: Remove deprecated module parameter · 5abb0da9

由 Leon Romanovsky 提交于 1月 18, 2017

Commit 9603b61d ("mlx5: Move pci device handling from mlx5_ib
to mlx5_core") moved prof_sel module parameter from mlx5_ib to mlx5_core
and marked it as deprecated in 2014.

Three years after deprecation, it is time to remove the deprecated
module parameter.
Signed-off-by: NLeon Romanovsky <leonro@mellanox.com>
Reviewed-by: NJack Morgenstein <jackm@mellanox.com>
Signed-off-by: NLeon Romanovsky <leon@kernel.org>
Reviewed-by: NYuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

5abb0da9

IB/mlx5: Assign DSCP for R-RoCE QPs Address Path · ed88451e

由 Majd Dibbiny 提交于 1月 18, 2017

For Routable RoCE QPs, the DSCP should be set in the QP's
address path.

The DSCP's value is derived from the traffic class.

Fixes: 2811ba51 ("IB/mlx5: Add RoCE fields to Address Vector")
Cc: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: NMajd Dibbiny <majd@mellanox.com>
Reviewed-by: NMoni Shoua <monis@mellanox.com>
Signed-off-by: NLeon Romanovsky <leon@kernel.org>
Reviewed-by: NYuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

ed88451e

IB/mlx5: Avoid SMP MADs from VFs · 1e0e50b6

由 Maor Gottlieb 提交于 1月 18, 2017

According to the device specification, we need to check that the
has_smi bit is set in vport context before allowing send SMP
MADs from VF.

Fixes: e126ba97 ('mlx5: Add driver for Mellanox Connect-IB adapters')
Signed-off-by: NMaor Gottlieb <maorg@mellanox.com>
Reviewed-by: NEli Cohen <eli@mellanox.com>
Signed-off-by: NLeon Romanovsky <leon@kernel.org>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

1e0e50b6

IB/mlx5: Add additional checks before processing MADs · c43f1112

由 Maor Gottlieb 提交于 1月 18, 2017

Check the has_smi bit in vport context and class version of MADs
before allowing MADs processing to take place.
MAD_IFC SMI commands can be executed only if smi bit is set.

Fixes: e126ba97 ('mlx5: Add driver for Mellanox Connect-IB adapters')
Signed-off-by: NMaor Gottlieb <maorg@mellanox.com>
Signed-off-by: NParvi Kaustubhi <parvik@mellanox.com>
Reviewed-by: NEli Cohen <eli@mellanox.com>
Signed-off-by: NLeon Romanovsky <leon@kernel.org>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

c43f1112

IB/mlx5: Verify that Q counters are supported · 45bded2c

由 Kamal Heib 提交于 1月 18, 2017

Make sure that the Q counters are supported by the FW before trying
to allocate/deallocte them, this will avoid driver load failure when
they aren't supported by the FW.

Fixes: 0837e86a ('IB/mlx5: Add per port counters')
Cc: <stable@vger.kernel.org> # v4.7+
Signed-off-by: NKamal Heib <kamalh@mellanox.com>
Reviewed-by: NMark Bloch <markb@mellanox.com>
Signed-off-by: NLeon Romanovsky <leon@kernel.org>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

45bded2c

IB/mlx5: Return error for unsupported signature type · 12bbf1ea

由 Leon Romanovsky 提交于 1月 18, 2017

In case of unsupported singature, we returned positive
value, while the better approach is to return -EINVAL.

In addition, in this change, the error print is enriched
to provide an actual supplied signature type.

Fixes: e6631814 ("IB/mlx5: Support IB_WR_REG_SIG_MR")
Cc: Sagi Grimberg <sagi@grimberg.me>
Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: NLeon Romanovsky <leonro@mellanox.com>
Signed-off-by: NLeon Romanovsky <leon@kernel.org>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

12bbf1ea

IB/mlx5: Fix out-of-bound access · 0fd27a88

由 Leon Romanovsky 提交于 1月 18, 2017

When we initialize buffer to create SRQ in kernel,
the number of pages was less than actually used in
following mlx5_fill_page_array().

Fixes: e126ba97 ("mlx5: Add driver for Mellanox Connect-IB adapters")
Cc: <stable@vger.kernel.org> # v3.10+
Signed-off-by: NLeon Romanovsky <leonro@mellanox.com>
Reviewed-by: NEli Cohen <eli@mellanox.com>
Signed-off-by: NLeon Romanovsky <leon@kernel.org>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

0fd27a88

10 1月, 2017 9 次提交

Merge tag 'mlx5-4kuar-for-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux · bda65b42

由 David S. Miller 提交于 1月 09, 2017

Saeed Mahameed says:

====================
mlx5 4K UAR

The following series of patches optimizes the usage of the UAR area which is
contained within the BAR 0-1. Previous versions of the firmware and the driver
assumed each system page contains a single UAR. This patch set will query the
firmware for a new capability that if published, means that the firmware can
support UARs of fixed 4K regardless of system page size. In the case of
powerpc, where page size equals 64KB, this means we can utilize 16 UARs per
system page. Since user space processes by default consume eight UARs per
context this means that with this change a process will need a single system
page to fulfill that requirement and in fact make use of more UARs which is
better in terms of performance.

In addition to optimizing user-space processes, we introduce an allocator
that can be used by kernel consumers to allocate blue flame registers
(which are areas within a UAR that are used to write doorbells). This provides
further optimization on using the UAR area since the Ethernet driver makes
use of a single blue flame register per system page and now it will use two
blue flame registers per 4K.

The series also makes changes to naming conventions and now the terms used in
the driver code match the terms used in the PRM (programmers reference manual).
Thus, what used to be called UUAR (micro UAR) is now called BFREG (blue flame
register).

In order to support compatibility between different versions of
library/driver/firmware, the library has now means to notify the kernel driver
that it supports the new scheme and the kernel can notify the library if it
supports this extension. So mixed versions of libraries can run concurrently
without any issues.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bda65b42

tcp: make TCP_INFO more consistent · b369e7fd

由 Eric Dumazet 提交于 1月 09, 2017

tcp_get_info() has to lock the socket, so lets lock it
for an extended critical section, so that various fields
have consistent values.

This solves an annoying issue that some applications
reported when multiple counters are updated during one
particular rx/rx event, and TCP_INFO was called from
another cpu.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b369e7fd

Merge branch 'bpf-verifier-improvements' · c22e5c12

由 David S. Miller 提交于 1月 09, 2017

Alexei Starovoitov says:

====================
bpf: verifier improvements

A number of bpf verifier improvements from Gianluca.
See individual patches for details.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c22e5c12

bpf: rename ARG_PTR_TO_STACK · 39f19ebb

由 Alexei Starovoitov 提交于 1月 09, 2017

since ARG_PTR_TO_STACK is no longer just pointer to stack
rename it to ARG_PTR_TO_MEM and adjust comment.
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

39f19ebb

bpf: allow helpers access to variable memory · 06c1c049

由 Gianluca Borello 提交于 1月 09, 2017

Currently, helpers that read and write from/to the stack can do so using
a pair of arguments of type ARG_PTR_TO_STACK and ARG_CONST_STACK_SIZE.
ARG_CONST_STACK_SIZE accepts a constant register of type CONST_IMM, so
that the verifier can safely check the memory access. However, requiring
the argument to be a constant can be limiting in some circumstances.

Since the current logic keeps track of the minimum and maximum value of
a register throughout the simulated execution, ARG_CONST_STACK_SIZE can
be changed to also accept an UNKNOWN_VALUE register in case its
boundaries have been set and the range doesn't cause invalid memory
accesses.

One common situation when this is useful:

int len;
char buf[BUFSIZE]; /* BUFSIZE is 128 */

if (some_condition)
	len = 42;
else
	len = 84;

some_helper(..., buf, len & (BUFSIZE - 1));

The compiler can often decide to assign the constant values 42 or 48
into a variable on the stack, instead of keeping it in a register. When
the variable is then read back from stack into the register in order to
be passed to the helper, the verifier will not be able to recognize the
register as constant (the verifier is not currently tracking all
constant writes into memory), and the program won't be valid.

However, by allowing the helper to accept an UNKNOWN_VALUE register,
this program will work because the bitwise AND operation will set the
range of possible values for the UNKNOWN_VALUE register to [0, BUFSIZE),
so the verifier can guarantee the helper call will be safe (assuming the
argument is of type ARG_CONST_STACK_SIZE_OR_ZERO, otherwise one more
check against 0 would be needed). Custom ranges can be set not only with
ALU operations, but also by explicitly comparing the UNKNOWN_VALUE
register with constants.

Another very common example happens when intercepting system call
arguments and accessing user-provided data of variable size using
bpf_probe_read(). One can load at runtime the user-provided length in an
UNKNOWN_VALUE register, and then read that exact amount of data up to a
compile-time determined limit in order to fit into the proper local
storage allocated on the stack, without having to guess a suboptimal
access size at compile time.

Also, in case the helpers accepting the UNKNOWN_VALUE register operate
in raw mode, disable the raw mode so that the program is required to
initialize all memory, since there is no guarantee the helper will fill
it completely, leaving possibilities for data leak (just relevant when
the memory used by the helper is the stack, not when using a pointer to
map element value or packet). In other words, ARG_PTR_TO_RAW_STACK will
be treated as ARG_PTR_TO_STACK.
Signed-off-by: NGianluca Borello <g.borello@gmail.com>
Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

06c1c049

bpf: allow adjusted map element values to spill · f0318d01

由 Gianluca Borello 提交于 1月 09, 2017

commit 48461135 ("bpf: allow access into map value arrays")
introduces the ability to do pointer math inside a map element value via
the PTR_TO_MAP_VALUE_ADJ register type.

The current support doesn't handle the case where a PTR_TO_MAP_VALUE_ADJ
is spilled into the stack, limiting several use cases, especially when
generating bpf code from a compiler.

Handle this case by explicitly enabling the register type
PTR_TO_MAP_VALUE_ADJ to be spilled. Also, make sure that min_value and
max_value are reset just for BPF_LDX operations that don't result in a
restore of a spilled register from stack.
Signed-off-by: NGianluca Borello <g.borello@gmail.com>
Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f0318d01

bpf: allow helpers access to map element values · 5722569b

由 Gianluca Borello 提交于 1月 09, 2017

Enable helpers to directly access a map element value by passing a
register type PTR_TO_MAP_VALUE (or PTR_TO_MAP_VALUE_ADJ) to helper
arguments ARG_PTR_TO_STACK or ARG_PTR_TO_RAW_STACK.

This enables several use cases. For example, a typical tracing program
might want to capture pathnames passed to sys_open() with:

struct trace_data {
	char pathname[PATHLEN];
};

SEC("kprobe/sys_open")
void bpf_sys_open(struct pt_regs *ctx)
{
	struct trace_data data;
	bpf_probe_read(data.pathname, sizeof(data.pathname), ctx->di);

	/* consume data.pathname, for example via
	 * bpf_trace_printk() or bpf_perf_event_output()
	 */
}

Such a program could easily hit the stack limit in case PATHLEN needs to
be large or more local variables need to exist, both of which are quite
common scenarios. Allowing direct helper access to map element values,
one could do:

struct bpf_map_def SEC("maps") scratch_map = {
	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
	.key_size = sizeof(u32),
	.value_size = sizeof(struct trace_data),
	.max_entries = 1,
};

SEC("kprobe/sys_open")
int bpf_sys_open(struct pt_regs *ctx)
{
	int id = 0;
	struct trace_data *p = bpf_map_lookup_elem(&scratch_map, &id);
	if (!p)
		return;
	bpf_probe_read(p->pathname, sizeof(p->pathname), ctx->di);

	/* consume p->pathname, for example via
	 * bpf_trace_printk() or bpf_perf_event_output()
	 */
}

And wouldn't risk exhausting the stack.

Code changes are loosely modeled after commit 6841de8b ("bpf: allow
helpers access the packet directly"). Unlike with PTR_TO_PACKET, these
changes just work with ARG_PTR_TO_STACK and ARG_PTR_TO_RAW_STACK (not
ARG_PTR_TO_MAP_KEY, ARG_PTR_TO_MAP_VALUE, ...): adding those would be
trivial, but since there is not currently a use case for that, it's
reasonable to limit the set of changes.

Also, add new tests to make sure accesses to map element values from
helpers never go out of boundary, even when adjusted.
Signed-off-by: NGianluca Borello <g.borello@gmail.com>
Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5722569b

bpf: split check_mem_access logic for map values · dbcfe5f7

由 Gianluca Borello 提交于 1月 09, 2017

Move the logic to check memory accesses to a PTR_TO_MAP_VALUE_ADJ from
check_mem_access() to a separate helper check_map_access_adj(). This
enables to use those checks in other parts of the verifier as well,
where boundaries on PTR_TO_MAP_VALUE_ADJ might need to be checked, for
example when checking helper function arguments. The same thing is
already happening for other types such as PTR_TO_PACKET and its
check_packet_access() helper.

The code has been copied verbatim, with the only difference of removing
the "off += reg->max_value" statement and moving the sum into the call
statement to check_map_access(), as that was only needed due to the
earlier common check_map_access() call.
Signed-off-by: NGianluca Borello <g.borello@gmail.com>
Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

dbcfe5f7

Merge branch 'net-smc' · f3a3e248

由 David S. Miller 提交于 1月 09, 2017

Ursula Braun says:

====================
net/smc: Shared Memory Communications - RDMA

here is now V4 of the SMC-R patches having processed your feedback from end
of November. The most important change is the replacement of sysfs by a
generic netlink solution in patch 04. And I tried to get rid of the __packed
attributes. There are still a few usages left due to SMC-R protocol defined
structures.

V4 changes:
The order of patches 03 and 04 for pnet table management and SMC IB-client
establishing has been exchanged, since pnet table management is now built on
top of smc_ib_devices.
Patch 01: Use EXPORT_SYMBOL_GPL().
Patch 02: Define "use_fallback" as bool.
          Get rid of useless smc_sock fields clearing in smc_sock_alloc(),
          since sk_alloc() clears out the memory.
Patch 03: Postpone smc_ib_remember_port_attr() call till ib_device is
          mentioned in the pnet table.
Patch 04: Replace sysfs-usage by a generic netlink approach for pnet table
          configuration.
          Change layout of pnet table entries to reference net_device and
          ib_device instead of dealing with names of net_devices and
          ib_devices.
Patch 05: Adapt "use_fallback" usages to new type bool.
          Get rid of useless smc_sock fields clearing in smc_sock_alloc()
          Avoid __packed where possible.
          Check if clc responses are not too big.
Patch 09: Postpone smc_setup_per_ibdev till the first connection with this
          ib_device is really created.
Patch 11: Get rid of __packed usage.

V3 changes:
Patch 05: Remove unneeded DEFINE_WAIT
Patch 06: Improve synchronization of link group creation
Patch 07: Rename peer_rmbe_len into peer_rmbe_size to be more consistent
Patch 09: Avoid calls of ib_get_memory_region with IB_ACCESS_LOCAL_WRITE,
          use new default local_dma_lkey from protection domain as lkey
          instead.
          Remove no longer needed function smc_ib_dereg_memory_region().
Patch 14: Switch to state ACTIVE only if still in state INIT.
          Return 0 for recvmsg invoked in a socket closing state.
          Allow getname call in state APPCLOSEWAIT1
          Do not trigger destruction of a socket-in-error queued in accept
          queue.
          During cleanup of accept queue, make sure sockets are destructed,
          and sockets in fallback mode are handled appropriately.
          When freeing sndbufs/rmbs, remove them from their list and free
          the entry.
          Use add_wait_queue() and remove_wait_queue() in close wait
          functions.
          If actively closing a socket in state for PEERFINCLOSEWAIT, keep
          this state.
          If passively closing a socket while bytes are to be received, move
          to state APPCLOSEWAIT1.
          If actively aborting a socket, skip sending the close_abort flag,
          since RDMA communication is no longer possible.
          When terminating a link group, do not schedule link group freeing a
          2nd time, since already done when unregistering the last remaining
          connection.
Patch 15: Introduce smc_diag module for monitoring SMC protocol sockets.
          This replaces the old patch 0015 dealing with procfs.

V2 changes:
Patch 0002: Add SMC versions for family key strings in net/core/sock.c.
Patch 0006: initialize rb_tree.
Patch 0007: Get rid of unneeded use of xchg() in smc_sndbuf_unuse() and
            smc_rmb_unuse().
Patch 0008: Correct error checking logic for ib_function calls.
            Define struct smc_link field wr_tx_id as atomic_long_t.
            Use "do_div" instead of "%" to be architecture-independent.
Patch 0009: Correct error checking logic for ib_function calls.
Patch 0011: Remove xchg() calls in cursor handling. Use atomic64_t for cursor
            overlays on 64-bit architectures. If not available, use plain u64
            and add locking for cursor reading and writing.
            Implement smc_curs_add() without modulo operator "%".
Patch 0012: Remove xchg() calls in cursor handling.
            Implement smc_tx_rdma_writes() without module operator "%".
Patch 0013: Remove xchg() calls in cursor handling.
Patch 0014: Return type bool in smc_wr_tx_has_pending().
            Remove unneeded semicolon in smc_close_shutdown_write().
            Call smc_close_active() in non-fallback case only.
            Get rid of duplicate schedule of sock_put_work().
            Take nested sock_lock in smc_listen_work().
            Start close stream_wait in case of prepared sends only.
Patch 0015: Remove unneeded socket ref_count in smc_proc_seq_show().
            Take lock before list_empty check in smc_proc_sock_list_del().

These patches are the initial part of the implementation of the
"Shared Memory Communications-RDMA" (SMC-R) protocol as defined in
RFC7609 [1]. While SMC-R does not aim to replace TCP,
it taps a wealth of existing data center TCP socket applications
to become more efficient without the need for rewriting them.
SMC-R uses RDMA over Converged Ethernet (RoCE) to save CPU consumption.
For instance, when running 10 parallel connections with uperf, we measured
a decrease of 60% in CPU consumption with SMC-R compared to TCP/IP
(with throughput and latency comparable;
measured on x86_64 with the same RoCE card and port).

SMC-R does not require an RDMA communication manager (RDMA CM).

SMC-R inherits TCP qualities such as reliable connections, host-based
firewall packet filtering (on connection establishment) and unmodified
application of communication encryption such as TLS (transport layer
security) or SSL (secure sockets layer). Since original TCP is used to
establish SMC-R connections, load balancers and packet inspection based
on TCP/IP connection establishment continue to work for SMC-R.

On the other hand, using SMC-R implies:
- either involving a preload library when invoking the unchanged TCP-application
  or slightly modifying the source by simply changing the socket family in
  the socket() call
- accepting extra overhead and latency in connection establishment due to
  SMC Connection Layer Control (CLC) handshake
- explicit coupling of RoCE ports with Ethernet ports
- not routable as currently built on RoCE V1
- bypassing of packet-based networking features
    - filtering (netfilter)
    - sniffing (libpcap, packet sockets, (E)BPF)
    - traffic control (scheduling, shaping)
- bypassing of IP-header based socket options
- bypassing of memory buffer (pressure) management
- unusable together with IPsec

Overview of the SMC-R Protocol described in informational RFC 7609

SMC-R is an open protocol that provides RDMA capabilities over RoCE
transparently for applications exploiting TCP sockets.
A new socket protocol family PF_SMC is introduced.
There are no changes required to applications using the sockets API for TCP
stream sockets other than the specification of the new socket family AF_SMC.
Unmodified applications can be used by means of a dynamic preload shared
library which rewrites the socket API call
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) into
socket(AF_SMC,  SOCK_STREAM, IPPROTO_TCP).
SMC-R re-uses the address family AF_INET for all addressing purposes around
struct sockaddr.

SMC-R system architecture layers:

+=============================================================================+
|                                      | unmodified TCP application           |
| native SMC application               +--------------------------------------+
|                                      | dynamic preload shared library       |
+=============================================================================+
|                                 SMC socket                                  |
+-----------------------------------------------------------------------------+
|                    | TCP socket (for connection establishment and fallback) |
| IB verbs           +--------------------------------------------------------+
|                    | IP                                                     |
+--------------------+--------------------------------------------------------+
| RoCE device driver | some network device driver                             |
+=============================================================================+

Terms:

A link group is determined by an ordered peer pair of TCP client and TCP server
(IP addresses and subnet). Reversed client server roles cause an own link group.
A link is a logical point-to-point connection based on an
infiniband reliable connected queue pair (RC-QP) between two RoCE ports
(MACs and GIDs) of a peer pair.
A link group can have 1..8 links for failover and load balancing.
This initial Linux implementation always has 1 link per link group.
Each link group on a peer can have 1..255 remote memory buffers (RMBs).
If more RMBs are needed, a peer can open another link group
(this initial Linux implementation) or fall back to TCP.
Each RMB has its own particular size and its own (R)DMA mapping and credentials
(rtoken consisting of rkey and RDMA "virtual address").
This initial Linux implementation uses physically contiguous memory for RMBs
but we are working towards scattered memory because of memory fragmentation.
Each RMB has 1..255 RMB elements (RMBEs) of equal size
to provide multiplexing of connections within an RMB.
An RMBE is the RDMA Write destination organized as wrapping ring buffer
for data transmit of a particular connection in one direction
(duplex by means of mirror symmetry as with TCP).
This initial Linux implementation always has 1 RMBE per RMB
and thus an individual RMB for each connection.

SMC-R connection establishment with subsequent data transfer:

   CLIENT                                                   SERVER

TCP three-way handshake:
                         regular TCP SYN
      -------------------------------------------------------->
                       regular TCP SYN ACK
      <--------------------------------------------------------
                         regular TCP ACK
      -------------------------------------------------------->

SMC Connection Layer Control (CLC) handshake
exchanges RDMA credentials between peers:
             via above TCP connection: SMC CLC Proposal
      -------------------------------------------------------->
              via above TCP connection: SMC CLC Accept
      <--------------------------------------------------------
             via above TCP connection: SMC CLC Confirm
      -------------------------------------------------------->

SMC Link Layer Control (LLC) (only once per link, i.e. 1st conn. of link group):
                 RoCE RC-QP: SMC LLC Confirm Link
      <========================================================
             RoCE RC-QP: SMC LLC Confirm Link response
      ========================================================>

SMC data transmission (incl. SMC Connection Data Control (CDC) message):
                       RoCE RC-QP: RDMA Write
      ========================================================>
             RoCE RC-QP: SMC CDC message (flow control)
      ========================================================>
                          ...

                       RoCE RC-QP: RDMA Write
      <========================================================
             RoCE RC-QP: SMC CDC message (flow control)
      <========================================================
                          ...

Data flow within an established connection:

+----------------------------------------------------------------------------
|            SENDER
| sendmsg()
|    |
|    | produces into sndbuf [sender's process context]
|    v
| +--------+
| | sndbuf | [ring buffer]
| +--------+
|    |
|    | consumes from sndbuf and produces into receiver's RMBE [any context]
|    | by sending RDMA Write followed by SMC CDC message over RoCE RC-QP
|    |
+----|-----------------------------------------------------------------------
     |
+----|-----------------------------------------------------------------------
|    v       RECEIVER
| +------+
| | RMBE | [ring buffer, can have size different from sender's sndbuf]
| |      | [RMBE represents rcvbuf, no further de-coupling as on sender side]
| +------+
|    |
|    | consumes from RMBE [receiver's process context]
|    v
| recvmsg()
+----------------------------------------------------------------------------

Flow control ("cursor" updates) by means of SMC CDC messages:

               SENDER                            RECEIVER

        sends updates via CDC-------------+   sends updates via CDC
        on consuming from sndbuf          |   on consuming from RMBE
        and producing into RMBE           |   by means of recvmsg()
                                          |            |
                                          |            |
      +-----------------------------------|------------+
      |                                   |
   +--v-------------------------+      +--v-----------------------+
   | receiver's consumer cursor |      | sender's producer cursor----+
   +----------------|-----------+      +--------------------------+  |
                    |                                                |
                    |                        receiver's RMBE         |
                    |                  +--------------------------+  |
                    |                  |                          |  |
                    +--------------------------------+            |  |
                                       |             |            |  |
                                       |             v            |  |
                                       |             +------------|  |
                                       |-------------+////////////|  |
                                       |//RDMA data written by////|  |
                                       |////sender that is////////|  |
                                       |/available to be consumed/|  |
                                       |///////// +---------------|  |
                                       |----------+^              |  |
                                       |           |              |  |
                                       |           +-----------------+
                                       |                          |
                                       +--------------------------+

Sending updates of the producer cursor is immediate for low latency;
something like Nagle's algorithm (absence of TCP_NODELAY) is optional and
currently not part of this initial Linux implementation.
Sending updates of the consumer cursor is conditional to avoid the
silly window syndrome.

Normal connection termination:

Normal connection termination starts transitioning from socket state
ACTIVE via either "Active Close" or "Passive Close".

shutdown rdwr               +-----------------+
or close,   +-------------->|  INIT / CLOSED  |<-------------+
send PeerCon|nClosed        +-----------------+              | PeerConnClosed
            |                       |                        | received
            |            connection | established            |
            |                       V                        |
    +----------------+     +-----------------+     +----------------+
    |AppFinCloseWait |     |     ACTIVE      |     |PeerFinCloseWait|
    +----------------+     +-----------------+     +----------------+
            |                   |         |                   |
            |     Active Close: |         |Passive Close:     |
            |     close or      |         |PeerConnClosed or  |
            |     shutdown wr or|         |PeerDoneWriting    |
            |     shutdown rdwr |         |received           |

    |                   V         V                   |
 PeerConnClo|sed    +--------------+   +-------------+        | close or
 received   +--<----|PeerCloseWait1|   |AppCloseWait1|--->----+ shutdown rdwr,
            |       +--------------+   +-------------+        | send
            |  PeerDoneWri|ting                | shutdown wr, | PeerConnClosed
            |  received   |            send Pee|rDoneWriting  |
            |             V                    V              |
            |       +--------------+   +-------------+        |
            +--<----|PeerCloseWait2|   |AppCloseWait2|--->----+
                    +--------------+   +-------------+

In state CLOSED, the socket can be destructed only, once the application has
issued a close().

Abnormal connection termination:

                            +-----------------+
            +-------------->|  INIT / CLOSED  |<-------------+
            |               +-----------------+              |
            |                                                |
            |           +-----------------------+            |
            |           |     Any state         |            |
 PeerConnAbo|rt         | (before setting       |            | send
 received   |           |  PeerConnClosed       |            | PeerConnAbort
            |           |  indicator in         |            |
            |           |  peer's RMBE)         |            |
            |           +-----------------------+            |
            |                   |         |                  |
            |     Active Abort: |         | Passive Abort:   |
            |     problem,      |         | PeerConnAbort    |
            |     send          |         | received,        |
            |     PeerConnAbort,|         | ECONNRESET       |
            |     ECONNABORTED  |         |                  |
            |                   V         V                  |
            |       +--------------+   +--------------+      |
            +-------|PeerAbortWait |   | ProcessAbort |------+
                    +--------------+   +--------------+

Implementation notes beyond RFC 7609:

A PNET table in sysfs provides the mapping between network device names and
RoCE Infiniband device names for the transparent switch of data communication.
A PNET table can contain an arbitrary number of PNETIDs.
Each PNETID contains exactly one (Ethernet) network device name
and one or more RoCE Infiniband device names.
Each device name can only exist in at most one PNETID (no overlapping).
This initial Linux implementation allows at most one RoCE Infiniband device
name per PNETID.
After a new TCP connection is established, the network device name
used for egress traffic with the TCP connection's local source IP address
is used as key to lookup the unique PNETID, and the RoCE Infiniband device
of this PNETID is used to switch data communication from TCP to RDMA
during SMC CLC handshake.

Problem determination:

A protocol dissector is available with upstream wireshark for formatting
SMC-R related RoCE LAN traffic.
[https://code.wireshark.org/review/gitweb?p=wireshark.git;a=blob;f=epan/dissectors/packet-smcr.c]

We are working on enhancing the Linux implementation to cover:

- Improve default socket closing asynchronicity
- Address corner cases with many parallel connections
- Tracing
- Integrated load balancing and fail-over within a link group
- Splice and sendpage support
- IPv6 addressing support
- Keepalive, Cork
- Namespaces support
- Urgent data
- More socket options
- Diagnostics
- Statistics support
- SNMP support

References:

[1] SMC-R Informational RFC: http://www.rfc-editor.org/info/rfc7609
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f3a3e248

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功