Commits · 3733be14a32bae288b61ed28341e593baba983af · openanolis / cloud-kernel

02 Oct, 2017 4 commits

ipv4: Namespaceify tcp_fastopen_blackhole_timeout knob · 3733be14

Committed by Haishuang Yan on Sep 27, 2017

Applications in different namespaces might require a different time period,
in seconds, for disabling Fast Open on active TCP sockets.

Tested:
Simulate a situation in which the server's data gets dropped after the
three-way handshake (3WHS).
C ---- syn-data ---> S
C <--- syn/ack ----- S
C ---- ack --------> S
S (accept & write)
C?  X <- data ------ S
	[retry and timeout]

Then print the TCPFastOpenBlackhole counter from netstat. The counter
increased as expected when the firewall blackhole issue was detected and
active TFO was disabled.
# cat /proc/net/netstat | awk '{print $91}'
TCPFastOpenBlackhole
1
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3733be14
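
The test in the tcp_fastopen_blackhole_timeout commit above reads the counter by column position (`$91`), which depends on how many counters a given kernel exports. A minimal sketch of looking the counter up by name instead is shown below. The function name `parse_netstat` and the `SAMPLE` text are illustrative assumptions, not part of the commit; on a real system the text would come from /proc/net/netstat.

```python
def parse_netstat(text):
    """Return {prefix: {counter_name: value}} from /proc/net/netstat-style text."""
    stats = {}
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # The file comes in pairs of lines: one line of counter names and one
    # line of values, both starting with the same prefix (e.g. "TcpExt:").
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        value_prefix, numbers = values.split(":", 1)
        if prefix != value_prefix:
            raise ValueError("header and value lines do not match")
        stats[prefix] = {
            name: int(number)
            for name, number in zip(names.split(), numbers.split())
        }
    return stats


# Hypothetical, shortened sample; a real file has many more counters.
SAMPLE = (
    "TcpExt: SyncookiesSent TCPFastOpenActive TCPFastOpenBlackhole\n"
    "TcpExt: 0 3 1\n"
    "IpExt: InNoRoutes InOctets\n"
    "IpExt: 0 12345\n"
)

stats = parse_netstat(SAMPLE)
print(stats["TcpExt"]["TCPFastOpenBlackhole"])
```

Looking the counter up by name keeps the check valid even if counters are added before it in a later kernel.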

ipv4: Namespaceify tcp_fastopen_key knob · 43713848

Committed by Haishuang Yan on Sep 27, 2017

Applications in different namespaces might require a tcp_fastopen_key
that is set independently of the host.

David Miller pointed out that the tcp_fastopen_key context leaks unless it
is released during netns teardown, so add the release action in the
exit_batch path.

Tested:
1. Container namespace:
# cat /proc/sys/net/ipv4/tcp_fastopen_key
2817fff2-f803cf97-eadfd1f3-78c0992b

cookie key in tcp syn packets:
Fast Open Cookie
    Kind: TCP Fast Open Cookie (34)
    Length: 10
    Fast Open Cookie: 1e5dd82a8c492ca9

2. Host:
# cat /proc/sys/net/ipv4/tcp_fastopen_key
107d7c5f-68eb2ac7-02fb06e6-ed341702

cookie key in tcp syn packets:
Fast Open Cookie
    Kind: TCP Fast Open Cookie (34)
    Length: 10
    Fast Open Cookie: e213c02bf0afbc8a
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

43713848
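
The keys printed in the tcp_fastopen_key test above appear as four dash-separated 32-bit hex words. A small sketch of validating that format and confirming that the container and host keys differ is given below. The helper name `parse_fastopen_key` and the strict pattern are assumptions made for illustration; the sketch does not reproduce how the kernel parses or stores the key.

```python
import re

# Four groups of up to eight hex digits, separated by dashes.
KEY_RE = re.compile(r"^[0-9a-fA-F]{1,8}(-[0-9a-fA-F]{1,8}){3}$")


def parse_fastopen_key(text):
    """Split an 'xxxxxxxx-xxxxxxxx-xxxxxxxx-xxxxxxxx' key into four integers."""
    text = text.strip()
    if not KEY_RE.match(text):
        raise ValueError("expected four dash-separated hex words")
    return [int(word, 16) for word in text.split("-")]


# Values copied from the test output in the commit message.
container = parse_fastopen_key("2817fff2-f803cf97-eadfd1f3-78c0992b")
host = parse_fastopen_key("107d7c5f-68eb2ac7-02fb06e6-ed341702")

print(container != host)
```

With per-namespace keys, the two lists differ, which is consistent with the different Fast Open cookies seen in the SYN packets.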

ipv4: Remove the 'publish' logic in tcp_fastopen_init_key_once · dd000598

Committed by Haishuang Yan on Sep 27, 2017

The 'publish' logic is not necessary after commit dfea2aa6 ("tcp:
Do not call tcp_fastopen_reset_cipher from interrupt context"), because
tcp_fastopen_cookie_gen no longer calls tcp_fastopen_init_key_once.
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dd000598

ipv4: Namespaceify tcp_fastopen knob · e1cfcbe8

Committed by Haishuang Yan on Sep 27, 2017

Applications in different namespaces might need to enable the TCP Fast Open
feature independently of the host.

This patch series continues making more of the TCP Fast Open related
sysctl knobs be per net-namespace.
Reported-by: Luca BRUNO <lucab@debian.org>
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e1cfcbe8

01 Oct, 2017 9 commits

net: dsa: remove tag ops from the switch tree · aa193d9b

Committed by Vivien Didelot on Sep 29, 2017

Now that the dsa_ptr is a dsa_port instance, there is no need to keep
the tag operations in the dsa_switch_tree structure. Remove it.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

aa193d9b

net: dsa: change dsa_ptr for a dsa_port · 2f657a60

Committed by Vivien Didelot on Sep 29, 2017

With DSA, a master net device (CPU-facing interface) has a dsa_ptr
pointer from which a dsa_switch_tree hangs. This is not correct because a
master interface is wired to a dedicated switch port, and because we can
theoretically have several master interfaces pointing to several CPU
ports of the same switch fabric.

Change the master interface's dsa_ptr for the CPU dsa_port pointer.
This is a step towards supporting multiple CPU ports.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2f657a60

net: dsa: prepare master receive hot path · 3e41f93b

Committed by Vivien Didelot on Sep 29, 2017

In preparation to make DSA master devices point to their corresponding
CPU port instead of the whole tree, add copies of dst and rcv in the
dsa_port structure so that we keep fast access in the receive hot path.

Also keep the copies at the beginning of the dsa_port structure in order
to ensure they are available in cacheline 1.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3e41f93b

net: dsa: add tagging ops to port · 15240248

Committed by Vivien Didelot on Sep 29, 2017

The DSA tagging protocol operations are specific to each CPU port,
thus the dsa_device_ops pointer belongs to the dsa_port structure.

From now on assign a slave's xmit copy from its CPU port tagging
operations. This will ease the future support for multiple CPU ports.

Also keep the tag_ops at the beginning of the dsa_port structure so that
we ensure copies for hot path are in cacheline 1.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

15240248

net: dsa: use temporary dsa_device_ops variable · 62fc9587

Committed by Vivien Didelot on Sep 29, 2017

When resolving the DSA tagging protocol used by a CPU switch, use a
temporary "tag_ops" variable to store the dsa_device_ops instead of
using directly dst->tag_ops. This will make the future patches moving
this pointer around easier to read.

There are no functional changes.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

62fc9587

net: dsa: use cpu_dp in master code · 7ec764ee

Committed by Vivien Didelot on Sep 29, 2017

Make it clear that the master device is linked to a CPU port by using
"cpu_dp" for the dsa_port variable in master.c instead of "port", then
use a "port" variable to describe the port index, as usually seen in
other places of DSA core.

This will make the future patch touching dsa_ptr more readable. There are
no functional changes.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7ec764ee

net: dsa: add master helper to look up slaves · 3775b1b7

Committed by Vivien Didelot on Sep 29, 2017

The DSA tagging code does not need to know about the DSA architecture;
it only needs to return the slave device corresponding to the source
port index (and eventually the source device index for cascade-capable
switches) parsed from the frame received on the master device.

For this purpose, provide an inline dsa_master_get_slave helper which
validates the device and port indexes and looks up the slave device.

This makes the tagging rcv functions more concise and robust, and also
makes dsa_get_cpu_port obsolete.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3775b1b7

net_sched: remove redundant assignment to ret · b1c49d14

Committed by Colin Ian King on Sep 29, 2017

The assignment of -EINVAL to variable ret is redundant, as it is
overwritten either on the following error exit paths or by the
return value of the following call to basic_set_parms.
Fix this up by removing it. Cleans up clang warning message:

net/sched/cls_basic.c:185:2: warning: Value stored to 'err' is never read

Fixes: 1d8134fe ("net_sched: use idr to allocate basic filter handles")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b1c49d14

net: ipmr: make function ipmr_notifier_init static · ef739d8a

Committed by Colin Ian King on Sep 29, 2017

The function ipmr_notifier_init is local to the source and does
not need to be in global scope, so make it static.

Cleans up sparse warning:
warning: symbol 'ipmr_notifier_init' was not declared. Should it be static?
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ef739d8a

30 Sep, 2017 2 commits

net-ipv6: add support for sockopt(SOL_IPV6, IPV6_FREEBIND) · 84e14fe3

Committed by Maciej Żenczykowski on Sep 26, 2017

So far we've been relying on sockopt(SOL_IP, IP_FREEBIND) being usable
even on IPv6 sockets.

However, it turns out it is perfectly reasonable to want to set freebind
on an AF_INET6 SOCK_RAW socket - but there is no way to set any SOL_IP
socket option on such a socket (they're all blindly errored out).

One use case for this is to allow spoofing src ip on a raw socket
via sendmsg cmsg.

Tested:
  built, and booted
  # python
  >>> import socket
  >>> SOL_IP = socket.SOL_IP
  >>> SOL_IPV6 = socket.IPPROTO_IPV6
  >>> IP_FREEBIND = 15
  >>> IPV6_FREEBIND = 78
  >>> s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM, 0)
  >>> s.getsockopt(SOL_IP, IP_FREEBIND)
  0
  >>> s.getsockopt(SOL_IPV6, IPV6_FREEBIND)
  0
  >>> s.setsockopt(SOL_IPV6, IPV6_FREEBIND, 1)
  >>> s.getsockopt(SOL_IP, IP_FREEBIND)
  1
  >>> s.getsockopt(SOL_IPV6, IPV6_FREEBIND)
  1
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

84e14fe3

net: ipv6: send NS for DAD when link operationally up · 1f372c7b

Committed by Mike Manning on Sep 25, 2017

The NS for DAD are sent on admin up as long as a valid qdisc is found.
A race condition exists by which these packets will not egress the
interface if the operational state of the lower device is not yet up.
The solution is to delay DAD until the link is operationally up
according to RFC2863. Rather than only doing this, follow the existing
code checks by deferring IPv6 device initialization altogether. The fix
allows DAD on devices like tunnels that are controlled by a userspace
control plane. The fix has no impact on regular deployments, but means
that there is no IPv6 connectivity until the port has been opened in
the case of port-based network access control, which should be
desirable.
Signed-off-by: Mike Manning <mmanning@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1f372c7b

29 Sep, 2017 12 commits

net: ipv4: remove fib_info arg to fib_check_nh · fa8fefaa

Committed by David Ahern on Sep 27, 2017

fib_check_nh does not use the fib_info arg; remove it.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fa8fefaa

net: ipv4: remove fib_weight · c7c3e591

Committed by David Ahern on Sep 27, 2017

fib_weight in fib_info is set but not used. Remove it and the
helpers for setting it.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c7c3e591

tcp: fix under-evaluated ssthresh in TCP Vegas · cf5d74b8

Committed by Hoang Tran on Sep 27, 2017

With the commit 76174004 (tcp: do not slow start when cwnd equals
ssthresh), the comparison to the reduced cwnd in tcp_vegas_ssthresh() would
under-evaluate the ssthresh.
Signed-off-by: Hoang Tran <hoang.tran@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>

cf5d74b8

net: bridge: add per-port group_fwd_mask with less restrictions · 5af48b59

Committed by Nikolay Aleksandrov on Sep 27, 2017

We need to be able to transparently forward most link-local frames via
tunnels (e.g. vxlan, qinq). Currently the bridge's group_fwd_mask has a
mask which restricts the forwarding of STP and LACP, but we need to be able
to forward these over tunnels and control that forwarding on a per-port
basis. Thus, add a new per-port group_fwd_mask option which only disallows
forwarding of MAC pause frames (they're always dropped anyway).
The patch does not change the current default situation - all of the others
are still restricted unless configured for forwarding.
We have successfully tested this patch with LACP and STP forwarding over
VxLAN and qinq tunnels.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5af48b59
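
The per-port group_fwd_mask commit above distinguishes a bridge-wide mask, which may not enable STP or LACP forwarding, from a per-port mask, which only refuses MAC pause frames. A toy model of that decision follows. It rests on assumptions not stated in the commit message: that each mask bit corresponds to the last octet of a 01:80:C2:00:00:0X group address (STP = 0x00, MAC pause = 0x01, LACP = 0x02), and that the two masks are OR-ed together. All names in the sketch are hypothetical.

```python
LINK_LOCAL_PREFIX = bytes.fromhex("0180c20000")  # 01:80:C2:00:00:0X

# Assumed last-octet values of the relevant group addresses.
STP, MAC_PAUSE, LACP = 0x00, 0x01, 0x02

# Bits a bridge-wide mask may not set (assumed) and bits a per-port mask
# may not set (only MAC pause, per the commit message).
BRIDGE_RESTRICTED = (1 << STP) | (1 << MAC_PAUSE) | (1 << LACP)
PORT_RESTRICTED = 1 << MAC_PAUSE


def fwd_bit(mac):
    """Return the mask bit for a 01:80:C2:00:00:0X destination address."""
    raw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(raw) != 6 or raw[:5] != LINK_LOCAL_PREFIX or raw[5] > 0x0F:
        raise ValueError("not a 01:80:C2:00:00:0X address")
    return 1 << raw[5]


def may_forward(mac, bridge_mask, port_mask):
    """Decide whether a link-local frame would be forwarded in this model."""
    if bridge_mask & BRIDGE_RESTRICTED or port_mask & PORT_RESTRICTED:
        raise ValueError("mask sets a restricted bit")
    return bool((bridge_mask | port_mask) & fwd_bit(mac))


# LACP is forwarded only once the per-port mask enables it.
print(may_forward("01:80:c2:00:00:02", 0, 0))
print(may_forward("01:80:c2:00:00:02", 0, 1 << LACP))
```

In this model the default masks of zero keep the existing behaviour, matching the commit's statement that the default situation does not change.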

rtnetlink: rtnl_have_link_slave_info doesn't need rtnl · 4c82a95e

Committed by Florian Westphal on Sep 26, 2017

It can be switched to RCU.
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

4c82a95e

rtnetlink: add helpers to dump netnsid information · b1e66b9a

Committed by Florian Westphal on Sep 26, 2017

Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

b1e66b9a

rtnetlink: add helpers to dump vf information · 250fc3df

Committed by Florian Westphal on Sep 26, 2017

Similar to earlier patches, split out more parts of this function to
better see what is happening and where we assume rtnl is locked.
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

250fc3df

rtnetlink: add helper to put master and link ifindexes · 79110a04

Committed by Florian Westphal on Sep 26, 2017

rtnl_fill_ifinfo currently requires caller to hold the rtnl mutex.
Unfortunately the function is quite large which makes it harder to see
which spots require the lock, which spots assume it and which ones could
do without.

Add helpers to factor out the ifindex dumping; one of them can use RCU to
avoid the rtnl dependency.
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

79110a04

net_sched: use idr to allocate u32 filter handles · e7614370

Committed by Cong Wang on Sep 25, 2017

Instead of calling u32_lookup_ht() in a loop to find
an unused handle, just switch to the idr API to allocate
new handles. u32 filters are special as the handle
could contain a hash table id and a key id, so we
need two IDRs, one to allocate each of them.

Cc: Chris Mi <chrism@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e7614370

net_sched: use idr to allocate basic filter handles · 1d8134fe

Committed by Cong Wang on Sep 25, 2017

Instead of calling basic_get() in a loop to find
an unused handle, just switch to the idr API to allocate
new handles.

Cc: Chris Mi <chrism@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1d8134fe
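
The three "use idr to allocate ... filter handles" commits above replace a probe-in-a-loop search for a free handle with a dedicated ID allocator. The sketch below is a toy Python model of that idea only: it is not the kernel idr API, and `HandleAllocator` is a hypothetical name. It hands out the lowest free handle without probing every candidate.

```python
import heapq


class HandleAllocator:
    """Toy allocator that always returns the lowest free handle >= start."""

    def __init__(self, start=1):
        self._next = start  # lowest handle that has never been handed out
        self._freed = []    # min-heap of released handles, all < self._next
        self._used = set()

    def alloc(self):
        if self._freed:
            # Any released handle is lower than every never-used handle.
            handle = heapq.heappop(self._freed)
        else:
            handle = self._next
            self._next += 1
        self._used.add(handle)
        return handle

    def free(self, handle):
        if handle not in self._used:
            raise KeyError(handle)
        self._used.remove(handle)
        heapq.heappush(self._freed, handle)


allocator = HandleAllocator()
first = [allocator.alloc() for _ in range(3)]  # handles 1, 2, 3
allocator.free(2)
reused = allocator.alloc()                     # lowest free handle again
fresh = allocator.alloc()
print(first, reused, fresh)
```

For u32 filters, which per the commit message need separate hash table and key ids, two such allocators would be kept side by side in this model.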

net_sched: use idr to allocate bpf filter handles · 76cf546c

Committed by Cong Wang on Sep 25, 2017

Instead of calling cls_bpf_get() in a loop to find
an unused handle, just switch to the idr API to allocate
new handles.

Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Chris Mi <chrism@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

76cf546c

inetpeer: speed up inetpeer_invalidate_tree() · 8f1975e3

Committed by Eric Dumazet on Sep 25, 2017

As measured in my prior patch ("sch_netem: faster rb tree removal"),
rbtree_postorder_for_each_entry_safe() is nice looking but much slower
than using rb_next() directly, except when tree is small enough
to fit in CPU caches (then the cost is the same).

From: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8f1975e3

28 Sep, 2017 5 commits

net: mroute: Check if rule is a default rule · 478e4c2f

Committed by Yotam Gigi on Sep 27, 2017

When the ipmr starts, it adds one default FIB rule that matches all packets
and sends them to the DEFAULT (multicast) FIB table. A more complex rule
can be added by the user to specify that, for a specific interface, a packet
should be looked up in either an arbitrary table or according to the l3mdev
of the interface.

For drivers that are willing to offload the ipmr logic into hardware but
don't want to offload all the FIB rules functionality, provide a function
that can indicate whether the FIB rule is the default multicast rule, in
which case only one routing table is needed.

This way, a driver can register to the FIB notification chain, get
notifications about FIB rules added and trigger some kind of an internal
abort mechanism when a non-default rule is added by the user.
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

478e4c2f

net: ipmr: Add MFC offload indication · c7c0bbea

Committed by Yotam Gigi on Sep 27, 2017

Allow drivers registered to the FIB notification chain to indicate whether a
multicast MFC route is offloaded or not, similarly to unicast routes. The
indication of whether a route is offloaded is done using the mfc_flags
field on an mfc_cache struct, and the information is sent to userspace
via the RTNetlink interface only.

Currently, MFC routes are either offloaded or not, thus there is no need to
add per-VIF offload indication.
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c7c0bbea

ipmr: Send FIB notifications on MFC and VIF entries · b362053a

Committed by Yotam Gigi on Sep 27, 2017

Use the newly introduced notification chain to send events upon VIF and MFC
addition and deletion. The MFC notifications are sent only on resolved MFC
entries, as unresolved entries cannot be offloaded.
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b362053a

ipmr: Add FIB notification access functions · 4d65b948

Committed by Yotam Gigi on Sep 27, 2017

Make the ipmr module register as a FIB notifier. To do that, implement both
the ipmr_seq_read and ipmr_dump ops.

The ipmr_seq_read op returns a sequence counter that is incremented on
every notification related operation done by the ipmr. To implement that,
add a sequence counter in the netns_ipv4 struct and increment it whenever an
MFC route or VIF is added or deleted. The sequence operations are
protected by the RTNL lock.

The ipmr_dump iterates the list of MFC routes and the list of VIF entries
and sends notifications about them. The entries dump is done under RCU;
the VIF dump takes the mrt_lock as well, because the vif->dev field can
change under RCU.
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4d65b948

ipmr: Add reference count to MFC entries · 310ebbba

Committed by Yotam Gigi on Sep 27, 2017

The next commits will introduce MFC notifications through the atomic
fib_notification chain, thus allowing modules to be aware of MFC entries.

Due to the fact that modules may need to hold a reference to an MFC entry,
add reference count to MFC entries to prevent them from being freed while
these modules use them.

The reference counting is done only on resolved MFC entries currently.
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

310ebbba

27 Sep, 2017 8 commits

net: dsa: use phy_ethtool_nway_reset · 69b2c162

Committed by Vivien Didelot on Sep 26, 2017

Use phy_ethtool_nway_reset now that dsa_slave_nway_reset does exactly
the same.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

69b2c162

net: dsa: use phy_ethtool_set_link_ksettings · aa62a8ca

Committed by Vivien Didelot on Sep 26, 2017

Use phy_ethtool_set_link_ksettings now that dsa_slave_set_link_ksettings
does exactly the same.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

aa62a8ca

net: dsa: use phy_ethtool_get_link_ksettings · 771df31a

Committed by Vivien Didelot on Sep 26, 2017

Use phy_ethtool_get_link_ksettings now that dsa_slave_get_link_ksettings
does exactly the same.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

771df31a

net: dsa: use slave device phydev · 0115dcd1

Committed by Vivien Didelot on Sep 26, 2017

There is no need to store a phy_device in dsa_slave_priv since
net_device already provides one. Simply s/p->phy/dev->phydev/.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0115dcd1

net: dsa: return -ENODEV if there is no slave PHY · f4344e0a

Committed by Vivien Didelot on Sep 26, 2017

Instead of returning -EOPNOTSUPP when a slave device has no PHY,
directly return -ENODEV as ethtool and phylib do.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f4344e0a

bpf: add meta pointer for direct access · de8f3a83

Committed by Daniel Borkmann on Sep 25, 2017

This work enables generic transfer of metadata from XDP into skb. The
basic idea is that we can make use of the fact that the resulting skb
must be linear and already comes with a larger headroom for supporting
bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
for adjusting a new pointer called xdp->data_meta. Thus, the packet has
a flexible and programmable room for meta data, followed by the actual
packet data. struct xdp_buff is therefore laid out such that we first point
to data_hard_start, then data_meta directly prepended to data, followed
by data_end marking the end of the packet. bpf_xdp_adjust_head() takes into
account whether we have meta data already prepended and if so, memmove()s
this along with the given offset provided there's enough room.

xdp->data_meta is optional and programs are not required to use it. The
rationale is that when we process the packet in XDP (e.g. as DoS filter),
we can push further meta data along with it for the XDP_PASS case, and
give the guarantee that a clsact ingress BPF program on the same device
can pick this up for further post-processing. Since we work with skb
there, we can also set skb->mark, skb->priority or other skb meta data
out of BPF, thus having this scratch space generic and programmable
allows for more flexibility than defining a direct 1:1 transfer of
potentially new XDP members into skb (it's also more efficient as we
don't need to initialize/handle each of such new members). The facility
also works together with GRO aggregation. The scratch space at the head
of the packet can be a multiple of 4 bytes, up to 32 bytes large. Drivers not
yet supporting xdp->data_meta can simply be set up with xdp->data_meta
as xdp->data + 1 as bpf_xdp_adjust_meta() will detect this and bail out,
such that the subsequent match against xdp->data for later access is
guaranteed to fail.

The verifier treats xdp->data_meta/xdp->data the same way as we treat
xdp->data/xdp->data_end pointer comparisons. The requirement for doing
the compare against xdp->data is that it hasn't been modified from its
original address we got from ctx access. It may have a range marking
already from prior successful xdp->data/xdp->data_end pointer comparisons
though.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

de8f3a83
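
The "bpf: add meta pointer for direct access" commit above describes a buffer layout of data_hard_start, then data_meta, then data, then data_end, with a meta area that is a multiple of 4 bytes and at most 32 bytes, and with data_meta set to data + 1 by drivers that do not support it. The sketch below is a toy Python model of those rules using integer offsets. `XdpBuff`, `adjust_meta` and the boolean return value are illustrative assumptions; this is not the kernel helper, and it assumes a negative offset grows the meta area.

```python
from dataclasses import dataclass

MAX_META = 32  # bytes, upper bound given in the commit message


@dataclass
class XdpBuff:
    data_hard_start: int
    data_meta: int
    data: int
    data_end: int


def adjust_meta(xdp, offset):
    """Move data_meta by offset; return True on success, False otherwise."""
    if xdp.data_meta > xdp.data:
        # Driver signalled "no meta support" by placing data_meta past data.
        return False
    meta = xdp.data_meta + offset
    if meta < xdp.data_hard_start or meta > xdp.data:
        return False
    meta_len = xdp.data - meta
    if meta_len % 4 or meta_len > MAX_META:
        return False
    xdp.data_meta = meta
    return True


buf = XdpBuff(data_hard_start=0, data_meta=256, data=256, data_end=320)
ok_8 = adjust_meta(buf, -8)         # reserve 8 bytes of meta data
bad_align = adjust_meta(buf, -3)    # would give 11 bytes, not a multiple of 4
too_large = adjust_meta(buf, -32)   # would give 40 bytes, above the limit

unsupported = XdpBuff(data_hard_start=0, data_meta=257, data=256, data_end=320)
no_support = adjust_meta(unsupported, -4)
print(ok_8, bad_align, too_large, no_support, buf.data_meta)
```

A failed call leaves data_meta untouched in this model, so a later bounds comparison against data still reflects the last valid meta area.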

bpf: rename bpf_compute_data_end into bpf_compute_data_pointers · 6aaae2b6

Committed by Daniel Borkmann on Sep 25, 2017

Just do the rename into bpf_compute_data_pointers() as we'll add
one more pointer here to recompute.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6aaae2b6

kcm: Remove redundant unlikely() · d9db5e36

Committed by Tobias Klauser on Sep 26, 2017

IS_ERR() already implies unlikely(), so it can be omitted.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

d9db5e36
