提交 · 05b595e9c44acaca94192c6db430a489c1b212a7 · openeuler / Kernel

10 9月, 2020 3 次提交

devlink: Introduce external controller flag · 05b595e9

由 Parav Pandit 提交于 9月 09, 2020

A devlink eswitch port may represent PCI PF/VF ports of a controller.

A controller either located on same system or it can be an external
controller located in host where such NIC is plugged in.

Add the ability for driver to specify if a port is for external
controller.

Use such flag in the mlx5_core driver.

An example of an external controller having VF1 of PF0 belong to
controller 1.

$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev ens2f0pf0vf1 flavour pcivf pfnum 0 vfnum 1 external true splittable false
  function:
    hw_addr 00:00:00:00:00:00
$ devlink port show pci/0000:06:00.0/2 -jp
{
    "port": {
        "pci/0000:06:00.0/2": {
            "type": "eth",
            "netdev": "ens2f0pf0vf1",
            "flavour": "pcivf",
            "pfnum": 0,
            "vfnum": 1,
            "external": true,
            "splittable": false,
            "function": {
                "hw_addr": "00:00:00:00:00:00"
            }
        }
    }
}
Signed-off-by: NParav Pandit <parav@nvidia.com>
Reviewed-by: NJiri Pirko <jiri@nvidia.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

05b595e9

devlink: Move structure comments outside of structure · ff03e63a

由 Parav Pandit 提交于 9月 09, 2020

To add more fields to the PCI PF and VF port attributes, follow standard
structure comment format.
Signed-off-by: NParav Pandit <parav@nvidia.com>
Reviewed-by: NJiri Pirko <jiri@nvidia.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ff03e63a

devlink: Add comment block for missing port attributes · 2efbe6ae

由 Parav Pandit 提交于 9月 09, 2020

Add comment block for physical, PF and VF port attributes.
Signed-off-by: NParav Pandit <parav@nvidia.com>
Reviewed-by: NJiri Pirko <jiri@nvidia.com>
Reviewed-by: NRoi Dayan <roid@nvidia.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2efbe6ae

08 9月, 2020 1 次提交

netfilter: nf_tables: add userdata support for nft_object · b131c964

由 Jose M. Guisado Gomez 提交于 9月 08, 2020

Enables storing userdata for nft_object. Initially this will store an
optional comment but can be extended in the future as needed.

Adds new attribute NFTA_OBJ_USERDATA to nft_object.
Signed-off-by: NJose M. Guisado Gomez <guigom@riseup.net>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

b131c964

01 9月, 2020 15 次提交

ipvs: remove dependency on ip6_tables · 144b0a0e

由 Yaroslav Bolyukin 提交于 8月 29, 2020

This dependency was added because ipv6_find_hdr was in iptables specific
code but is no longer required

Fixes: f8f62675 ("ipv6: Move ipv6_find_hdr() out of Netfilter code.")
Fixes: 63dca2c0 ("ipvs: Fix faulty IPv6 extension header handling in IPVS")
Signed-off-by: NYaroslav Bolyukin <iam@lach.pw>
Acked-by: NJulian Anastasov <ja@ssi.bg>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

144b0a0e

net: ipv4: remove unused arg exact_dif in compute_score · 34e1ec31

由 Miaohe Lin 提交于 8月 31, 2020

The arg exact_dif is not used anymore, remove it. inet_exact_dif_match()
is no longer needed after the above is removed, so remove it too.
Signed-off-by: NMiaohe Lin <linmiaohe@huawei.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

34e1ec31

net: clean up codestyle · 5af68891

由 Miaohe Lin 提交于 8月 29, 2020

This is a pure codestyle cleanup patch. No functional change intended.
Signed-off-by: NMiaohe Lin <linmiaohe@huawei.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5af68891

ipv6: add ipv6_fragment hook in ipv6_stub · 1d97898b

由 wenxu 提交于 8月 28, 2020

Add ipv6_fragment to ipv6_stub to avoid calling netfilter when
access ip6_fragment.
Signed-off-by: Nwenxu <wenxu@ucloud.cn>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1d97898b

xsk: Add shared umem support between queue ids · b5aea28d

由 Magnus Karlsson 提交于 8月 28, 2020

Add support to share a umem between queue ids on the same
device. This mode can be invoked with the XDP_SHARED_UMEM bind
flag. Previously, sharing was only supported within the same
queue id and device, and you shared one set of fill and
completion rings. However, note that when sharing a umem between
queue ids, you need to create a fill ring and a completion ring
and tie them to the socket before you do the bind with the
XDP_SHARED_UMEM flag. This so that the single-producer
single-consumer semantics can be upheld.
Signed-off-by: NMagnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NBjörn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-12-git-send-email-magnus.karlsson@intel.com

b5aea28d

xsk: i40e: ice: ixgbe: mlx5: Test for dma_need_sync earlier for better performance · 9647c57b

由 Magnus Karlsson 提交于 8月 28, 2020

Test for dma_need_sync earlier to increase
performance. xsk_buff_dma_sync_for_cpu() takes an xdp_buff as
parameter and from that the xsk_buff_pool reference is dug out. Perf
shows that this dereference causes a lot of cache misses. But as the
buffer pool is now sent down to the driver at zero-copy initialization
time, we might as well use this pointer directly, instead of going via
the xsk_buff and we can do so already in xsk_buff_dma_sync_for_cpu()
instead of in xp_dma_sync_for_cpu. This gets rid of these cache
misses.

Throughput increases with 3% for the xdpsock l2fwd sample application
on my machine.
Signed-off-by: NMagnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NBjörn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-11-git-send-email-magnus.karlsson@intel.com

9647c57b

xsk: Rearrange internal structs for better performance · 8ef4e27e

由 Magnus Karlsson 提交于 8月 28, 2020

Rearrange the xdp_sock, xdp_umem and xsk_buff_pool structures so
that they get smaller and align better to the cache lines. In the
previous commits of this patch set, these structs have been
reordered with the focus on functionality and simplicity, not
performance. This patch improves throughput performance by around
3%.
Signed-off-by: NMagnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NBjörn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-10-git-send-email-magnus.karlsson@intel.com

8ef4e27e

xsk: Enable sharing of dma mappings · 921b6869

由 Magnus Karlsson 提交于 8月 28, 2020

Enable the sharing of dma mappings by moving them out from the buffer
pool. Instead we put each dma mapped umem region in a list in the umem
structure. If dma has already been mapped for this umem and device, it
is not mapped again and the existing dma mappings are reused.
Signed-off-by: NMagnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NBjörn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-9-git-send-email-magnus.karlsson@intel.com

921b6869

xsk: Move addrs from buffer pool to umem · 7f7ffa4e

由 Magnus Karlsson 提交于 8月 28, 2020

Replicate the addrs pointer in the buffer pool to the umem. This mapping
will be the same for all buffer pools sharing the same umem. In the
buffer pool we leave the addrs pointer for performance reasons.
Signed-off-by: NMagnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NBjörn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-8-git-send-email-magnus.karlsson@intel.com

7f7ffa4e

xsk: Move xsk_tx_list and its lock to buffer pool · a5aa8e52

由 Magnus Karlsson 提交于 8月 28, 2020

Move the xsk_tx_list and the xsk_tx_list_lock from the umem to
the buffer pool. This so that we in a later commit can share the
umem between multiple HW queues. There is one xsk_tx_list per
device and queue id, so it should be located in the buffer pool.
Signed-off-by: NMagnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NBjörn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-7-git-send-email-magnus.karlsson@intel.com

a5aa8e52

xsk: Move queue_id, dev and need_wakeup to buffer pool · c2d3d6a4

由 Magnus Karlsson 提交于 8月 28, 2020

Move queue_id, dev, and need_wakeup from the umem to the
buffer pool. This so that we in a later commit can share the umem
between multiple HW queues. There is one buffer pool per dev and
queue id, so these variables should belong to the buffer pool, not
the umem. Need_wakeup is also something that is set on a per napi
level, so there is usually one per device and queue id. So move
this to the buffer pool too.
Signed-off-by: NMagnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NBjörn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-6-git-send-email-magnus.karlsson@intel.com

c2d3d6a4

xsk: Move fill and completion rings to buffer pool · 7361f9c3

由 Magnus Karlsson 提交于 8月 28, 2020

Move the fill and completion rings from the umem to the buffer
pool. This so that we in a later commit can share the umem
between multiple HW queue ids. In this case, we need one fill and
completion ring per queue id. As the buffer pool is per queue id
and napi id this is a natural place for it and one umem
struture can be shared between these buffer pools.
Signed-off-by: NMagnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NBjörn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-5-git-send-email-magnus.karlsson@intel.com

7361f9c3

xsk: Create and free buffer pool independently from umem · 1c1efc2a

由 Magnus Karlsson 提交于 8月 28, 2020

Create and free the buffer pool independently from the umem. Move
these operations that are performed on the buffer pool from the
umem create and destroy functions to new create and destroy
functions just for the buffer pool. This so that in later commits
we can instantiate multiple buffer pools per umem when sharing a
umem between HW queues and/or devices. We also erradicate the
back pointer from the umem to the buffer pool as this will not
work when we introduce the possibility to have multiple buffer
pools per umem.
Signed-off-by: NMagnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NBjörn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-4-git-send-email-magnus.karlsson@intel.com

1c1efc2a

xsk: i40e: ice: ixgbe: mlx5: Rename xsk zero-copy driver interfaces · c4655761

由 Magnus Karlsson 提交于 8月 28, 2020

Rename the AF_XDP zero-copy driver interface functions to better
reflect what they do after the replacement of umems with buffer
pools in the previous commit. Mostly it is about replacing the
umem name from the function names with xsk_buff and also have
them take the a buffer pool pointer instead of a umem. The
various ring functions have also been renamed in the process so
that they have the same naming convention as the internal
functions in xsk_queue.h. This so that it will be clearer what
they do and also for consistency.
Signed-off-by: NMagnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NBjörn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-3-git-send-email-magnus.karlsson@intel.com

c4655761

xsk: i40e: ice: ixgbe: mlx5: Pass buffer pool to driver instead of umem · 1742b3d5

由 Magnus Karlsson 提交于 8月 28, 2020

Replace the explicit umem reference passed to the driver in AF_XDP
zero-copy mode with the buffer pool instead. This in preparation for
extending the functionality of the zero-copy mode so that umems can be
shared between queues on the same netdev and also between netdevs. In
this commit, only an umem reference has been added to the buffer pool
struct. But later commits will add other entities to it. These are
going to be entities that are different between different queue ids
and netdevs even though the umem is shared between them.
Signed-off-by: NMagnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NBjörn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-2-git-send-email-magnus.karlsson@intel.com

1742b3d5

29 8月, 2020 1 次提交

netfilter: nf_tables: add userdata attributes to nft_table · 7a81575b

由 Jose M. Guisado Gomez 提交于 8月 20, 2020

Enables storing userdata for nft_table. Field udata points to user data
and udlen store its length.

Adds new attribute flag NFTA_TABLE_USERDATA
Signed-off-by: NJose M. Guisado Gomez <guigom@riseup.net>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

7a81575b

27 8月, 2020 2 次提交

mac80211: rename csa counters to countdown counters · 8552a434

由 John Crispin 提交于 8月 11, 2020

We want to reuse the functions and structs for other counters such as BSS
color change. Rename them to more generic names.
Signed-off-by: NJohn Crispin <john@phrozen.org>
Link: https://lore.kernel.org/r/20200811080107.3615705-2-john@phrozen.orgSigned-off-by: NJohannes Berg <johannes.berg@intel.com>

8552a434

nl80211: add support for setting fixed HE rate/gi/ltf · eb89a6a6

由 Miles Hu 提交于 8月 04, 2020

This patch adds the nl80211 structs, definitions, policies and parsing
code required to pass fixed HE rate, GI and LTF settings.
Signed-off-by: NMiles Hu <milehu@codeaurora.org>
Signed-off-by: NJohn Crispin <john@phrozen.org>
Link: https://lore.kernel.org/r/20200804081630.2013619-1-john@phrozen.org
[fix comment]
Signed-off-by: NJohannes Berg <johannes.berg@intel.com>

eb89a6a6

26 8月, 2020 6 次提交

inet: remove inet_sk_copy_descendant() · 24da7990

由 Eric Dumazet 提交于 8月 26, 2020

This is no longer used, SCTP now uses a private helper.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

24da7990

bpf: Allow local storage to be used from LSM programs · 30897832

由 KP Singh 提交于 8月 25, 2020

Adds support for both bpf_{sk, inode}_storage_{get, delete} to be used
in LSM programs. These helpers are not used for tracing programs
(currently) as their usage is tied to the life-cycle of the object and
should only be used where the owning object won't be freed (when the
owning object is passed as an argument to the LSM hook). Thus, they
are safer to use in LSM hooks than tracing. Usage of local storage in
tracing programs will probably follow a per function based whitelist
approach.

Since the UAPI helper signature for bpf_sk_storage expect a bpf_sock,
it, leads to a compilation warning for LSM programs, it's also updated
to accept a void * pointer instead.
Signed-off-by: NKP Singh <kpsingh@google.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NMartin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200825182919.1118197-7-kpsingh@chromium.org

30897832

bpf: Split bpf_local_storage to bpf_sk_storage · 450af8d0

由 KP Singh 提交于 8月 25, 2020

A purely mechanical change:

	bpf_sk_storage.c = bpf_sk_storage.c + bpf_local_storage.c
	bpf_sk_storage.h = bpf_sk_storage.h + bpf_local_storage.h
Signed-off-by: NKP Singh <kpsingh@google.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NMartin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200825182919.1118197-5-kpsingh@chromium.org

450af8d0

bpf: Generalize bpf_sk_storage · f836a56e

由 KP Singh 提交于 8月 25, 2020

Refactor the functionality in bpf_sk_storage.c so that concept of
storage linked to kernel objects can be extended to other objects like
inode, task_struct etc.

Each new local storage will still be a separate map and provide its own
set of helpers. This allows for future object specific extensions and
still share a lot of the underlying implementation.

This includes the changes suggested by Martin in:

  https://lore.kernel.org/bpf/20200725013047.4006241-1-kafai@fb.com/

adding new map operations to support bpf_local_storage maps:

* storages for different kernel objects to optionally have different
  memory charging strategy (map_local_storage_charge,
  map_local_storage_uncharge)
* Functionality to extract the storage pointer from a pointer to the
  owning object (map_owner_storage_ptr)
Co-developed-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NKP Singh <kpsingh@google.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200825182919.1118197-4-kpsingh@chromium.org

f836a56e

bpf: Generalize caching for sk_storage. · 4cc9ce4e

由 KP Singh 提交于 8月 25, 2020

Provide the a ability to define local storage caches on a per-object
type basis. The caches and caching indices for different objects should
not be inter-mixed as suggested in:

  https://lore.kernel.org/bpf/20200630193441.kdwnkestulg5erii@kafai-mbp.dhcp.thefacebook.com/

  "Caching a sk-storage at idx=0 of a sk should not stop an
  inode-storage to be cached at the same idx of a inode."
Signed-off-by: NKP Singh <kpsingh@google.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NMartin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200825182919.1118197-3-kpsingh@chromium.org

4cc9ce4e

bpf: Renames in preparation for bpf_local_storage · 1f00d375

由 KP Singh 提交于 8月 25, 2020

A purely mechanical change to split the renaming from the actual
generalization.

Flags/consts:

  SK_STORAGE_CREATE_FLAG_MASK	BPF_LOCAL_STORAGE_CREATE_FLAG_MASK
  BPF_SK_STORAGE_CACHE_SIZE	BPF_LOCAL_STORAGE_CACHE_SIZE
  MAX_VALUE_SIZE		BPF_LOCAL_STORAGE_MAX_VALUE_SIZE

Structs:

  bucket			bpf_local_storage_map_bucket
  bpf_sk_storage_map		bpf_local_storage_map
  bpf_sk_storage_data		bpf_local_storage_data
  bpf_sk_storage_elem		bpf_local_storage_elem
  bpf_sk_storage		bpf_local_storage

The "sk" member in bpf_local_storage is also updated to "owner"
in preparation for changing the type to void * in a subsequent patch.

Functions:

  selem_linked_to_sk			selem_linked_to_storage
  selem_alloc				bpf_selem_alloc
  __selem_unlink_sk			bpf_selem_unlink_storage_nolock
  __selem_link_sk			bpf_selem_link_storage_nolock
  selem_unlink_sk			__bpf_selem_unlink_storage
  sk_storage_update			bpf_local_storage_update
  __sk_storage_lookup			bpf_local_storage_lookup
  bpf_sk_storage_map_free		bpf_local_storage_map_free
  bpf_sk_storage_map_alloc		bpf_local_storage_map_alloc
  bpf_sk_storage_map_alloc_check	bpf_local_storage_map_alloc_check
  bpf_sk_storage_map_check_btf		bpf_local_storage_map_check_btf
Signed-off-by: NKP Singh <kpsingh@google.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NMartin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200825182919.1118197-2-kpsingh@chromium.org

1f00d375

25 8月, 2020 7 次提交

tcp: bpf: Optionally store mac header in TCP_SAVE_SYN · 267cf9fa

由 Martin KaFai Lau 提交于 8月 20, 2020

This patch is adapted from Eric's patch in an earlier discussion [1].

The TCP_SAVE_SYN currently only stores the network header and
tcp header.  This patch allows it to optionally store
the mac header also if the setsockopt's optval is 2.

It requires one more bit for the "save_syn" bit field in tcp_sock.
This patch achieves this by moving the syn_smc bit next to the is_mptcp.
The syn_smc is currently used with the TCP experimental option.  Since
syn_smc is only used when CONFIG_SMC is enabled, this patch also puts
the "IS_ENABLED(CONFIG_SMC)" around it like the is_mptcp did
with "IS_ENABLED(CONFIG_MPTCP)".

The mac_hdrlen is also stored in the "struct saved_syn"
to allow a quick offset from the bpf prog if it chooses to start
getting from the network header or the tcp header.

[1]: https://lore.kernel.org/netdev/CANn89iLJNWh6bkH7DNhy_kmcAexuUCccqERqe7z2QsvPhGrYPQ@mail.gmail.com/Suggested-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20200820190123.2886935-1-kafai@fb.com

267cf9fa

bpf: tcp: Allow bpf prog to write and parse TCP header option · 0813a841

由 Martin KaFai Lau 提交于 8月 20, 2020

[ Note: The TCP changes here is mainly to implement the bpf
pieces into the bpf_skops_*() functions introduced
in the earlier patches. ]

The earlier effort in BPF-TCP-CC allows the TCP Congestion Control
algorithm to be written in BPF. It opens up opportunities to allow
a faster turnaround time in testing/releasing new congestion control
ideas to production environment.

The same flexibility can be extended to writing TCP header option.
It is not uncommon that people want to test new TCP header option
to improve the TCP performance. Another use case is for data-center
that has a more controlled environment and has more flexibility in
putting header options for internal only use.

For example, we want to test the idea in putting maximum delay
ACK in TCP header option which is similar to a draft RFC proposal [1].

This patch introduces the necessary BPF API and use them in the
TCP stack to allow BPF_PROG_TYPE_SOCK_OPS program to parse
and write TCP header options. It currently supports most of
the TCP packet except RST.

Supported TCP header option:
───────────────────────────
This patch allows the bpf-prog to write any option kind.
Different bpf-progs can write its own option by calling the new helper
bpf_store_hdr_opt(). The helper will ensure there is no duplicated
option in the header.

By allowing bpf-prog to write any option kind, this gives a lot of
flexibility to the bpf-prog. Different bpf-prog can write its
own option kind. It could also allow the bpf-prog to support a
recently standardized option on an older kernel.

Sockops Callback Flags:
──────────────────────
The bpf program will only be called to parse/write tcp header option
if the following newly added callback flags are enabled
in tp->bpf_sock_ops_cb_flags:
BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG
BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG

A few words on the PARSE CB flags. When the above PARSE CB flags are
turned on, the bpf-prog will be called on packets received
at a sk that has at least reached the ESTABLISHED state.
The parsing of the SYN-SYNACK-ACK will be discussed in the
"3 Way HandShake" section.

The default is off for all of the above new CB flags, i.e. the bpf prog
will not be called to parse or write bpf hdr option. There are
details comment on these new cb flags in the UAPI bpf.h.

sock_ops->skb_data and bpf_load_hdr_opt()
─────────────────────────────────────────
sock_ops->skb_data and sock_ops->skb_data_end covers the whole
TCP header and its options. They are read only.

The new bpf_load_hdr_opt() helps to read a particular option "kind"
from the skb_data.

Please refer to the comment in UAPI bpf.h. It has details
on what skb_data contains under different sock_ops->op.

3 Way HandShake
───────────────
The bpf-prog can learn if it is sending SYN or SYNACK by reading the
sock_ops->skb_tcp_flags.

* Passive side

When writing SYNACK (i.e. sock_ops->op == BPF_SOCK_OPS_WRITE_HDR_OPT_CB),
the received SYN skb will be available to the bpf prog. The bpf prog can
use the SYN skb (which may carry the header option sent from the remote bpf
prog) to decide what bpf header option should be written to the outgoing
SYNACK skb. The SYN packet can be obtained by getsockopt(TCP_BPF_SYN*).
More on this later. Also, the bpf prog can learn if it is in syncookie
mode (by checking sock_ops->args[0] == BPF_WRITE_HDR_TCP_SYNACK_COOKIE).

The bpf prog can store the received SYN pkt by using the existing
bpf_setsockopt(TCP_SAVE_SYN). The example in a later patch does it.
[ Note that the fullsock here is a listen sk, bpf_sk_storage
is not very useful here since the listen sk will be shared
by many concurrent connection requests.

Extending bpf_sk_storage support to request_sock will add weight
to the minisock and it is not necessary better than storing the
whole ~100 bytes SYN pkt. ]

When the connection is established, the bpf prog will be called
in the existing PASSIVE_ESTABLISHED_CB callback. At that time,
the bpf prog can get the header option from the saved syn and
then apply the needed operation to the newly established socket.
The later patch will use the max delay ack specified in the SYN
header and set the RTO of this newly established connection
as an example.

The received ACK (that concludes the 3WHS) will also be available to
the bpf prog during PASSIVE_ESTABLISHED_CB through the sock_ops->skb_data.
It could be useful in syncookie scenario. More on this later.

There is an existing getsockopt "TCP_SAVED_SYN" to return the whole
saved syn pkt which includes the IP[46] header and the TCP header.
A few "TCP_BPF_SYN*" getsockopt has been added to allow specifying where to
start getting from, e.g. starting from TCP header, or from IP[46] header.

The new getsockopt(TCP_BPF_SYN*) will also know where it can get
the SYN's packet from:
- (a) the just received syn (available when the bpf prog is writing SYNACK)
and it is the only way to get SYN during syncookie mode.
or
- (b) the saved syn (available in PASSIVE_ESTABLISHED_CB and also other
existing CB).

The bpf prog does not need to know where the SYN pkt is coming from.
The getsockopt(TCP_BPF_SYN*) will hide this details.

Similarly, a flags "BPF_LOAD_HDR_OPT_TCP_SYN" is also added to
bpf_load_hdr_opt() to read a particular header option from the SYN packet.

* Fastopen

Fastopen should work the same as the regular non fastopen case.
This is a test in a later patch.

* Syncookie

For syncookie, the later example patch asks the active
side's bpf prog to resend the header options in ACK. The server
can use bpf_load_hdr_opt() to look at the options in this
received ACK during PASSIVE_ESTABLISHED_CB.

* Active side

The bpf prog will get a chance to write the bpf header option
in the SYN packet during WRITE_HDR_OPT_CB. The received SYNACK
pkt will also be available to the bpf prog during the existing
ACTIVE_ESTABLISHED_CB callback through the sock_ops->skb_data
and bpf_load_hdr_opt().

* Turn off header CB flags after 3WHS

If the bpf prog does not need to write/parse header options
beyond the 3WHS, the bpf prog can clear the bpf_sock_ops_cb_flags
to avoid being called for header options.
Or the bpf-prog can select to leave the UNKNOWN_HDR_OPT_CB_FLAG on
so that the kernel will only call it when there is option that
the kernel cannot handle.

[1]: draft-wang-tcpm-low-latency-opt-00
https://tools.ietf.org/html/draft-wang-tcpm-low-latency-opt-00Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200820190104.2885895-1-kafai@fb.com

0813a841

bpf: tcp: Add bpf_skops_hdr_opt_len() and bpf_skops_write_hdr_opt() · 331fca43

由 Martin KaFai Lau 提交于 8月 20, 2020

The bpf prog needs to parse the SYN header to learn what options have
been sent by the peer's bpf-prog before writing its options into SYNACK.
This patch adds a "syn_skb" arg to tcp_make_synack() and send_synack().
This syn_skb will eventually be made available (as read-only) to the
bpf prog.  This will be the only SYN packet available to the bpf
prog during syncookie.  For other regular cases, the bpf prog can
also use the saved_syn.

When writing options, the bpf prog will first be called to tell the
kernel its required number of bytes.  It is done by the new
bpf_skops_hdr_opt_len().  The bpf prog will only be called when the new
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG is set in tp->bpf_sock_ops_cb_flags.
When the bpf prog returns, the kernel will know how many bytes are needed
and then update the "*remaining" arg accordingly.  4 byte alignment will
be included in the "*remaining" before this function returns.  The 4 byte
aligned number of bytes will also be stored into the opts->bpf_opt_len.
"bpf_opt_len" is a newly added member to the struct tcp_out_options.

Then the new bpf_skops_write_hdr_opt() will call the bpf prog to write the
header options.  The bpf prog is only called if it has reserved spaces
before (opts->bpf_opt_len > 0).

The bpf prog is the last one getting a chance to reserve header space
and writing the header option.

These two functions are half implemented to highlight the changes in
TCP stack.  The actual codes preparing the bpf running context and
invoking the bpf prog will be added in the later patch with other
necessary bpf pieces.
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20200820190052.2885316-1-kafai@fb.com

331fca43

bpf: tcp: Add bpf_skops_established() · 72be0fe6

由 Martin KaFai Lau 提交于 8月 20, 2020

In tcp_init_transfer(), it currently calls the bpf prog to give it a
chance to handle the just "ESTABLISHED" event (e.g. do setsockopt
on the newly established sk). Right now, it is done by calling the
general purpose tcp_call_bpf().

In the later patch, it also needs to pass the just-received skb which
concludes the 3 way handshake. E.g. the SYNACK received at the active side.
The bpf prog can then learn some specific header options written by the
peer's bpf-prog and potentially do setsockopt on the newly established sk.
Thus, instead of reusing the general purpose tcp_call_bpf(), a new function
bpf_skops_established() is added to allow passing the "skb" to the bpf
prog. The actual skb passing from bpf_skops_established() to the bpf prog
will happen together in a later patch which has the necessary bpf pieces.

A "skb" arg is also added to tcp_init_transfer() such that
it can then be passed to bpf_skops_established().

Calling the new bpf_skops_established() instead of tcp_call_bpf()
should be a noop in this patch.
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200820190039.2884750-1-kafai@fb.com

72be0fe6

tcp: bpf: Add TCP_BPF_RTO_MIN for bpf_setsockopt · ca584ba0

由 Martin KaFai Lau 提交于 8月 20, 2020

This patch adds bpf_setsockopt(TCP_BPF_RTO_MIN) to allow bpf prog
to set the min rto of a connection.  It could be used together
with the earlier patch which has added bpf_setsockopt(TCP_BPF_DELACK_MAX).

A later selftest patch will communicate the max delay ack in a
bpf tcp header option and then the receiving side can use
bpf_setsockopt(TCP_BPF_RTO_MIN) to set a shorter rto.
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200820190027.2884170-1-kafai@fb.com

ca584ba0

tcp: bpf: Add TCP_BPF_DELACK_MAX setsockopt · 2b8ee4f0

由 Martin KaFai Lau 提交于 8月 20, 2020

This change is mostly from an internal patch and adapts it from sysctl
config to the bpf_setsockopt setup.

The bpf_prog can set the max delay ack by using
bpf_setsockopt(TCP_BPF_DELACK_MAX).  This max delay ack can be communicated
to its peer through bpf header option.  The receiving peer can then use
this max delay ack and set a potentially lower rto by using
bpf_setsockopt(TCP_BPF_RTO_MIN) which will be introduced
in the next patch.

Another later selftest patch will also use it like the above to show
how to write and parse bpf tcp header option.
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200820190021.2884000-1-kafai@fb.com

2b8ee4f0

tcp: Use a struct to represent a saved_syn · 70a217f1

由 Martin KaFai Lau 提交于 8月 20, 2020

The TCP_SAVE_SYN has both the network header and tcp header.
The total length of the saved syn packet is currently stored in
the first 4 bytes (u32) of an array and the actual packet data is
stored after that.

A later patch will add a bpf helper that allows to get the tcp header
alone from the saved syn without the network header.  It will be more
convenient to have a direct offset to a specific header instead of
re-parsing it.  This requires to separately store the network hdrlen.
The total header length (i.e. network + tcp) is still needed for the
current usage in getsockopt.  Although this total length can be obtained
by looking into the tcphdr and then get the (th->doff << 2), this patch
chooses to directly store the tcp hdrlen in the second four bytes of
this newly created "struct saved_syn".  By using a new struct, it can
give a readable name to each individual header length.
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200820190014.2883694-1-kafai@fb.com

70a217f1

24 8月, 2020 1 次提交

ipv6: ndisc: adjust ndisc_ifinfo_sysctl_change prototype · 3622adb0

由 Tobias Klauser 提交于 8月 24, 2020

Commit 32927393 ("sysctl: pass kernel pointers to ->proc_handler")
changed ndisc_ifinfo_sysctl_change to take a kernel pointer. Adjust its
prototype in net/ndisc.h as well to fix the following sparse warning:

net/ipv6/ndisc.c:1838:5: error: symbol 'ndisc_ifinfo_sysctl_change' redeclared with different type (incompatible argument 3 (different address spaces)):
net/ipv6/ndisc.c:1838:5:    int extern [addressable] [signed] [toplevel] ndisc_ifinfo_sysctl_change( ... )
net/ipv6/ndisc.c: note: in included file (through include/net/ipv6.h):
./include/net/ndisc.h:496:5: note: previously declared as:
./include/net/ndisc.h:496:5:    int extern [addressable] [signed] [toplevel] ndisc_ifinfo_sysctl_change( ... )
net/ipv6/ndisc.c: note: in included file (through include/net/ip6_route.h):

Fixes: 32927393 ("sysctl: pass kernel pointers to ->proc_handler")
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: NTobias Klauser <tklauser@distanz.ch>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3622adb0

22 8月, 2020 1 次提交

netfilter: nf_tables: fix destination register zeroing · 1e105e6a

由 Florian Westphal 提交于 8月 20, 2020

Following bug was reported via irc:
nft list ruleset
   set knock_candidates_ipv4 {
      type ipv4_addr . inet_service
      size 65535
      elements = { 127.0.0.1 . 123,
                   127.0.0.1 . 123 }
      }
 ..
   udp dport 123 add @knock_candidates_ipv4 { ip saddr . 123 }
   udp dport 123 add @knock_candidates_ipv4 { ip saddr . udp dport }

It should not have been possible to add a duplicate set entry.

After some debugging it turned out that the problem is the immediate
value (123) in the second-to-last rule.

Concatenations use 32bit registers, i.e. the elements are 8 bytes each,
not 6 and it turns out the kernel inserted

inet firewall @knock_candidates_ipv4
        element 0100007f ffff7b00  : 0 [end]
        element 0100007f 00007b00  : 0 [end]

Note the non-zero upper bits of the first element.  It turns out that
nft_immediate doesn't zero the destination register, but this is needed
when the length isn't a multiple of 4.

Furthermore, the zeroing in nft_payload is broken.  We can't use
[len / 4] = 0 -- if len is a multiple of 4, index is off by one.

Skip zeroing in this case and use a conditional instead of (len -1) / 4.

Fixes: 49499c3e ("netfilter: nf_tables: switch registers to 32 bit addressing")
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

1e105e6a

21 8月, 2020 1 次提交

rxrpc: Make rxrpc_kernel_get_srtt() indicate validity · 1d4adfaf

由 David Howells 提交于 8月 20, 2020

Fix rxrpc_kernel_get_srtt() to indicate the validity of the returned
smoothed RTT.  If we haven't had any valid samples yet, the SRTT isn't
useful.

Fixes: c410bf01 ("rxrpc: Fix the excessive initial retransmission timeout")
Signed-off-by: NDavid Howells <dhowells@redhat.com>

1d4adfaf

19 8月, 2020 2 次提交

ipv6: some fixes for ipv6_dev_find() · 4ef1a7cb

由 Xin Long 提交于 8月 17, 2020

This patch is to do 3 things for ipv6_dev_find():

  As David A. noticed,

  - rt6_lookup() is not really needed. Different from __ip_dev_find(),
    ipv6_dev_find() doesn't have a compatibility problem, so remove it.

  As Hideaki suggested,

  - "valid" (non-tentative) check for the address is also needed.
    ipv6_chk_addr() calls ipv6_chk_addr_and_flags(), which will
    traverse the address hash list, but it's heavy to be called
    inside ipv6_dev_find(). This patch is to reuse the code of
    ipv6_chk_addr_and_flags() for ipv6_dev_find().

  - dev parameter is passed into ipv6_dev_find(), as link-local
    addresses from user space has sin6_scope_id set and the dev
    lookup needs it.

Fixes: 81f6cb31 ("ipv6: add ipv6_dev_find()")
Suggested-by: NYOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
Reported-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NXin Long <lucien.xin@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4ef1a7cb

netlink: make NLA_BINARY validation more flexible · 8aa26c57

由 Johannes Berg 提交于 8月 18, 2020

Add range validation for NLA_BINARY, allowing validation of any
combination of combination minimum or maximum lengths, using the
existing NLA_POLICY_RANGE()/NLA_POLICY_FULL_RANGE() macros, just
like for integers where the value is checked.

Also make NLA_POLICY_EXACT_LEN(), NLA_POLICY_EXACT_LEN_WARN()
and NLA_POLICY_MIN_LEN() special cases of this, removing the old
types NLA_EXACT_LEN and NLA_MIN_LEN.

This allows us to save some code where both minimum and maximum
lengths are requires, currently the policy only allows maximum
(NLA_BINARY), minimum (NLA_MIN_LEN) or exact (NLA_EXACT_LEN), so
a range of lengths cannot be accepted and must be checked by the
code that consumes the attributes later.

Also, this allows advertising the correct ranges in the policy
export to userspace. Here, NLA_MIN_LEN and NLA_EXACT_LEN already
were special cases of NLA_BINARY with min and min/max length
respectively.
Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8aa26c57

openeuler / Kernel 接近 2 年 前同步成功

openeuler / Kernel
接近 2 年前同步成功