Commits · bc95cd8e8b2fc779b96ed4d7a2608c6a0e8dc240 · openanolis / cloud-kernel

25 Apr, 2017 4 commits

openvswitch: Add eventmask support to CT action. · 12064551

Committed by Jarno Rajahalme on Apr 21, 2017

Add a new optional conntrack action attribute OVS_CT_ATTR_EVENTMASK,
which can be used in conjunction with the commit flag
(OVS_CT_ATTR_COMMIT) to set the mask of bits specifying which
conntrack events (IPCT_*) should be delivered via the Netfilter
netlink multicast groups.  Default behavior depends on the system
configuration, but typically a lot of events are delivered.  This can be
very chatty for the NFNLGRP_CONNTRACK_UPDATE group, even if only some
types of events are of interest.

Netfilter core init_conntrack() adds the event cache extension, so we
only need to set the ctmask value.  However, if the system is
configured without support for events, the setting will be skipped due
to extension not being found.
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Reviewed-by: Greg Rose <gvrose8192@gmail.com>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
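
As an illustration of what such an event mask might look like, the sketch below builds a bitmask in which bit n stands for conntrack event number n. The `SKETCH_*` names and the three event numbers are local stand-ins assumed to mirror the first entries of the kernel's IPCT_* enumeration; they are not taken from this commit and should be checked against the uapi headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical local copies of three event numbers; assumed to match
 * IPCT_NEW, IPCT_RELATED and IPCT_DESTROY in the kernel headers. */
enum sketch_ct_event {
    SKETCH_IPCT_NEW = 0,
    SKETCH_IPCT_RELATED = 1,
    SKETCH_IPCT_DESTROY = 2,
    SKETCH_IPCT_COUNT
};

/* One bit per event type: bit n selects event number n. */
static uint32_t sketch_event_bit(enum sketch_ct_event ev)
{
    assert(ev < SKETCH_IPCT_COUNT);
    return UINT32_C(1) << ev;
}

/* Mask that asks only for connection creation and teardown events. */
static uint32_t sketch_new_and_destroy_mask(void)
{
    return sketch_event_bit(SKETCH_IPCT_NEW) |
           sketch_event_bit(SKETCH_IPCT_DESTROY);
}

/* True if an event of type ev passes the given mask. */
static bool sketch_event_wanted(uint32_t mask, enum sketch_ct_event ev)
{
    return (mask & sketch_event_bit(ev)) != 0;
}
```

With a mask like this, updates of other types would be filtered out, which is the kind of reduction in NFNLGRP_CONNTRACK_UPDATE traffic the commit message describes.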

packet: add PACKET_FANOUT_FLAG_UNIQUEID to assign new fanout group id. · 4a69a864

Committed by Mike Maloney on Apr 21, 2017

Fanout uses a per net global namespace. A process that intends to create
a new fanout group can accidentally join an existing group. It is not
possible to detect this.

Add socket option PACKET_FANOUT_FLAG_UNIQUEID.  When specified the
supplied fanout group id must be set to 0, and the kernel chooses an id
that is not already in use.  This is an ephemeral flag so that
other sockets can be added to this group using setsockopt, but NOT
specifying this flag.  The current getsockopt(..., PACKET_FANOUT, ...)
can be used to retrieve the new group id.

We assume that there are not a lot of fanout groups and that this is not
a high frequency call.

The method assigns ids starting at zero and increases until it finds an
unused id.  It keeps track of the last assigned id, and uses it as a
starting point to find new ids.
Signed-off-by: Mike Maloney <maloney@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
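
The id search described above can be sketched as a small user-space model. This is not the kernel's code: the `sketch_*` names, the boolean table, and the 16-bit id space are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ID_SPACE 65536u /* assumption: ids fit in 16 bits */

/* Hypothetical per-namespace state: which ids are taken, and where the
 * next search should start. */
struct sketch_fanout_ns {
    bool in_use[SKETCH_ID_SPACE];
    uint16_t next_id;
};

/* Scan upward from the remembered starting point, wrapping around at
 * the end of the id space, until a free id is found.  Returns false
 * only if every id is already taken. */
static bool sketch_alloc_unique_id(struct sketch_fanout_ns *ns, uint16_t *out)
{
    assert(ns != NULL && out != NULL);
    uint16_t id = ns->next_id;

    for (uint32_t tried = 0; tried < SKETCH_ID_SPACE; tried++, id++) {
        if (!ns->in_use[id]) {
            ns->in_use[id] = true;
            ns->next_id = (uint16_t)(id + 1); /* start here next time */
            *out = id;
            return true;
        }
    }
    return false;
}
```

A linear scan like this is only reasonable under the assumption stated in the commit message: few fanout groups and infrequent calls.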

VSOCK: Add vsockmon device · 0b2e6644

Committed by Gerard Garcia on Apr 21, 2017

Add vsockmon virtual network device that receives packets from the vsock
transports and exposes them to user space.

Based on the nlmon device.
Signed-off-by: Gerard Garcia <ggarcia@deic.uab.cat>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

VSOCK: Add vsockmon tap functions · 531b3748

Committed by Gerard Garcia on Apr 21, 2017

Add tap functions that can be used by the vsock transports to
deliver packets to vsockmon virtual network devices.
Signed-off-by: Gerard Garcia <ggarcia@deic.uab.cat>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

23 Apr, 2017 1 commit

net/devlink: Add E-Switch encapsulation control · f43e9b06

Committed by Roi Dayan on Sep 25, 2016

This is an e-switch global knob to enable HW support for applying
encapsulation/decapsulation to VF traffic as part of SRIOV e-switch offloading.

The actual encap/decap is carried out (along with the matching and other actions)
per offloaded e-switch rule, e.g., as done when offloading the TC tunnel key action.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

22 Apr, 2017 5 commits

net: Remove NET_CORE_BUDGET_USECS from sysctl binary interface. · 1f4407e2

Committed by David S. Miller on Apr 21, 2017

We are not supposed to add new entries to this thing
any more.

Thanks to Eric Dumazet for noticing this.
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipv6: RTF_PCPU should not be settable from userspace · 557c44be

Committed by David Ahern on Apr 19, 2017

Andrey reported a fault in the IPv6 route code:

kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Modules linked in:
CPU: 1 PID: 4035 Comm: a.out Not tainted 4.11.0-rc7+ #250
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff880069809600 task.stack: ffff880062dc8000
RIP: 0010:ip6_rt_cache_alloc+0xa6/0x560 net/ipv6/route.c:975
RSP: 0018:ffff880062dced30 EFLAGS: 00010206
RAX: dffffc0000000000 RBX: ffff8800670561c0 RCX: 0000000000000006
RDX: 0000000000000003 RSI: ffff880062dcfb28 RDI: 0000000000000018
RBP: ffff880062dced68 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ffff880062dcfb28 R14: dffffc0000000000 R15: 0000000000000000
FS:  00007feebe37e7c0(0000) GS:ffff88006cb00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000205a0fe4 CR3: 000000006b5c9000 CR4: 00000000000006e0
Call Trace:
 ip6_pol_route+0x1512/0x1f20 net/ipv6/route.c:1128
 ip6_pol_route_output+0x4c/0x60 net/ipv6/route.c:1212
...

Andrey's syzkaller program passes rtmsg.rtmsg_flags with the RTF_PCPU bit
set. Flags passed to the kernel are blindly copied to the allocated
rt6_info by ip6_route_info_create making a newly inserted route appear
as though it is a per-cpu route. ip6_rt_cache_alloc sees the flag set
and expects rt->dst.from to be set - which it is not since it is not
really a per-cpu copy. The subsequent call to __ip6_dst_alloc then
generates the fault.

Fix by checking for the flag and failing with EINVAL.

Fixes: d52d3997 ("ipv6: Create percpu rt6_info")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
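
The shape of the fix (reject the flag instead of copying it) can be sketched in isolation. The function name is hypothetical, and `SKETCH_RTF_PCPU` is a local constant assumed to carry the same value as RTF_PCPU in the uapi route header; the real check lives in the kernel's IPv6 route creation path.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed to mirror RTF_PCPU; defined locally so the sketch stands alone. */
#define SKETCH_RTF_PCPU UINT32_C(0x40000000)

static_assert(SKETCH_RTF_PCPU == (UINT32_C(1) << 30),
              "the sketch treats the per-cpu marker as a single flag bit");

/* Validate flags supplied by user space before they are copied into a
 * new route: the per-cpu marker is kernel-internal, so refuse it. */
static int sketch_validate_route_flags(uint32_t rtmsg_flags)
{
    if (rtmsg_flags & SKETCH_RTF_PCPU)
        return -EINVAL;
    return 0;
}
```

Rejecting the request up front means later code that tests the flag can keep assuming it was set only by the kernel itself.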

bpf: add napi_id read access to __sk_buff · b1d9fc41

Committed by Daniel Borkmann on Apr 19, 2017

Add napi_id access to __sk_buff for socket filter program types, tc
program types and other bpf_convert_ctx_access() users. Having access
to skb->napi_id is useful for per RX queue listener siloing, f.e.
in combination with SO_ATTACH_REUSEPORT_EBPF and when busy polling is
used, meaning SO_REUSEPORT enabled listeners can then select the
corresponding socket at SYN time already [1]. The skb is marked via
skb_mark_napi_id() early in the receive path (e.g., napi_gro_receive()).

Currently, sockets can only use SO_INCOMING_NAPI_ID from 6d433902
("net: Introduce SO_INCOMING_NAPI_ID") as a socket option to look up
the NAPI ID associated with the queue for steering, which requires a
prior sk_mark_napi_id() after the socket was looked up.

Semantics for the __sk_buff napi_id access are similar, meaning if
skb->napi_id is < MIN_NAPI_ID (e.g. outgoing packets using sender_cpu),
then an invalid napi_id of 0 is returned to the program, otherwise a
valid non-zero napi_id.

  [1] http://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Replace 2 jiffies with sysctl netdev_budget_usecs to enable softirq tuning · 7acf8a1e

Committed by Matthew Whitehead on Apr 19, 2017

Constants used for tuning are generally a bad idea, especially as hardware
changes over time. Replace the constant 2 jiffies with sysctl variable
netdev_budget_usecs to enable sysadmins to tune the softirq processing.
Also document the variable.

For example, a very fast machine might tune this to 1000 microseconds,
while my regression testing 486DX-25 needs it to be 4000 microseconds on
a nearly idle network to prevent time_squeeze from being incremented.

Version 2: changed jiffies to microseconds for predictable units.
Signed-off-by: Matthew Whitehead <tedheadster@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
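
To see why microseconds are the more predictable unit, the sketch below converts a jiffies count to microseconds for a given tick rate and models a time-based budget check. Both helpers are hypothetical illustrations, not the kernel's implementation; `hz` stands for the kernel's configured tick frequency.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Length of a jiffies interval in microseconds at a given tick rate. */
static uint64_t sketch_jiffies_to_usecs(uint64_t jiffies, uint32_t hz)
{
    assert(hz != 0);
    return jiffies * UINT64_C(1000000) / hz;
}

/* True once the elapsed time has reached the configured budget. */
static bool sketch_budget_exhausted(uint64_t start_us, uint64_t now_us,
                                    uint64_t budget_us)
{
    return now_us - start_us >= budget_us;
}
```

The same "2 jiffies" therefore means 2000 microseconds on a 1000 Hz kernel, 8000 on a 250 Hz kernel and 20000 on a 100 Hz kernel, whereas a value expressed in microseconds keeps its meaning across configurations.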

ip6_tunnel: Allow policy-based routing through tunnels · 0a473b82

Committed by Craig Gallek on Apr 19, 2017

This feature allows the administrator to set an fwmark for
packets traversing a tunnel.  This allows the use of independent
routing tables for tunneled packets without the use of iptables.
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

14 Apr, 2017 3 commits

xfrm: Add an IPsec hardware offloading API · d77e38e6

Committed by Steffen Klassert on Apr 14, 2017

This patch adds all the bits that are needed to do
IPsec hardware offload for IPsec states and ESP packets.
We add xfrmdev_ops to the net_device. xfrmdev_ops has
function pointers that are needed to manage the xfrm
states in the hardware and to do a per packet
offloading decision.

Joint work with:
Ilan Tayari <ilant@mellanox.com>
Guy Shapiro <guysh@mellanox.com>
Yossi Kuperman <yossiku@mellanox.com>
Signed-off-by: Guy Shapiro <guysh@mellanox.com>
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Yossi Kuperman <yossiku@mellanox.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

netlink: allow sending extended ACK with cookie on success · ba0dc5f6

Committed by Johannes Berg on Apr 12, 2017

Now that we have extended error reporting and a new message format for
netlink ACK messages, also extend this to be able to return arbitrary
cookie data on success.

This will allow, for example, nl80211 to not send an extra message for
cookies identifying newly created objects, but return those directly
in the ACK message.

The cookie data size is currently limited to 20 bytes (since Jamal
talked about using SHA1 for identifiers.)

Thanks to Jamal Hadi Salim for bringing up this idea during the
discussions.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

netlink: extended ACK reporting · 2d4bc933

Committed by Johannes Berg on Apr 12, 2017

Add the base infrastructure and UAPI for netlink extended ACK
reporting. All "manual" calls to netlink_ack() pass NULL for now and
thus don't get extended ACK reporting.

Big thanks goes to Pablo Neira Ayuso for not only bringing up the
whole topic at netconf (again) but also coming up with the nlattr
passing trick and various other ideas.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

11 Apr, 2017 1 commit

Revert "virtio_pci: don't duplicate the msix_enable flag in struct pci_dev" · 2008c154

Committed by Michael S. Tsirkin on Apr 04, 2017

This reverts commit 53a020c6.

The cleanup seems to be one of the changes that broke
hibernation for some users. We are still not sure why
but the revert helps.
Tested-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

10 Apr, 2017 2 commits

bpf: fix comment typo · 3c60a531

Committed by Alexander Alemayhu on Apr 08, 2017

o s/bpf_bpf_get_socket_cookie/bpf_get_socket_cookie
Signed-off-by: Alexander Alemayhu <alexander@alemayhu.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

Revert "rtnl: Add support for netdev event to link messages" · bf74b20d

Committed by David S. Miller on Apr 09, 2017

This reverts commit def12888.

As per discussion between Roopa Prabhu and David Ahern, it is
advisable that we instead have the code collect the setlink triggered
events into a bitmask emitted in the IFLA_EVENT netlink attribute.
Signed-off-by: David S. Miller <davem@davemloft.net>

08 Apr, 2017 2 commits

netlink: uapi: use hex numbers for NLM_F_* flags · 261a0a54

Committed by Johannes Berg on Apr 06, 2017

It's rather confusing that the netlink message flags are
numbered 1, 2, 4, 8, 16, 32, <unused>, 0x100. Make that
more understandable by numbering the lower ones with hex
constants as well.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

New getsockopt option to get socket cookie · 5daab9db

Committed by Chenbo Feng on Apr 05, 2017

Introduce a new getsockopt operation to retrieve the socket cookie
for a specific socket based on the socket fd. It returns a unique
non-decreasing cookie for each socket.
Tested: https://android-review.googlesource.com/#/c/358163/
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Chenbo Feng <fengc@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

05 Apr, 2017 3 commits

rtnl: Add support for netdev event to link messages · def12888

Committed by Vlad Yasevich on Apr 04, 2017

When netdev events happen, a rtnetlink_event() handler will send
messages for every event in its white list.  These messages contain
current information about a particular device, but they do not include
the information about which event just happened.  The consumer of
the message has to try to infer this information.  In some cases
(ex: NETDEV_NOTIFY_PEERS), that is not possible.

This patch adds a new extension to RTM_NEWLINK message called IFLA_EVENT
that would have an encoding of which event triggered this
message.  This would allow the message consumer to easily determine
if it is interested in a particular event or not.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

netlink/diag: report flags for netlink sockets · 457c79e5

Committed by Andrey Vagin on Apr 03, 2017

cb_running is reported in /proc/self/net/netlink and it is reported by
the ss tool, when it gets information from the proc files.

sock_diag is a new interface which is used instead of proc files, so it
looks reasonable that this interface has to report no less information
about sockets than proc files.

We use these flags to dump and restore netlink sockets.
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

phy/ethtool: Add missing SPEED_<foo> strings · 1f37b177

Committed by Joe Perches on Apr 02, 2017

Add all the currently available SPEED_<foo> strings.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

04 Apr, 2017 1 commit

sctp: add SCTP_PR_STREAM_STATUS sockopt for prsctp · d229d48d

Committed by Xin Long on Apr 01, 2017

Before when implementing sctp prsctp, SCTP_PR_STREAM_STATUS wasn't
added, as it needs to save abandoned_(un)sent for every stream.

After sctp stream reconf is added in sctp, assoc has structure
sctp_stream_out to save per stream info.

This patch is to add SCTP_PR_STREAM_STATUS by putting the prsctp
per stream statistics into sctp_stream_out.

v1->v2:
  fix an indent issue.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

03 Apr, 2017 2 commits

statx: Include a mask for stx_attributes in struct statx · 3209f68b

Committed by David Howells on Mar 31, 2017

Include a mask in struct statx to indicate which bits of stx_attributes the
filesystem actually supports.

This would also be useful if we add another system call that allows you to
do a 'bulk attribute set' and pass in a statx struct with the masks
appropriately set to say what you want to set.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

statx: Reserve the top bit of the mask for future struct expansion · 47071aee

Committed by David Howells on Mar 31, 2017

Reserve the top bit of the mask for future expansion of the statx struct
and give an error if statx() sees it set.  All the other bits are ignored
if we see them set but don't support the bit; we just clear the bit in the
returned mask.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
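
The mask handling described above can be modelled as follows. This is a sketch under assumptions: the function and the `supported` parameter are hypothetical, and `SKETCH_STATX_RESERVED` is a local constant standing in for the reserved top bit of a 32-bit mask.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Top bit of the 32-bit request mask, kept for future struct expansion. */
#define SKETCH_STATX_RESERVED UINT32_C(0x80000000)

/* Reject a request that sets the reserved bit; otherwise report back
 * only the requested bits that are actually supported, silently
 * clearing the rest. */
static int sketch_check_statx_mask(uint32_t requested, uint32_t supported,
                                   uint32_t *returned)
{
    assert(returned != NULL);
    if (requested & SKETCH_STATX_RESERVED)
        return -EINVAL;
    *returned = requested & supported;
    return 0;
}
```

Treating unknown low bits as harmless while failing hard on the top bit lets old kernels serve new callers, yet still leaves one bit whose meaning can change the struct layout later.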

02 Apr, 2017 1 commit

bpf: introduce BPF_PROG_TEST_RUN command · 1cf1cae9

Committed by Alexei Starovoitov on Mar 30, 2017

development and testing of networking bpf programs is quite cumbersome.
Despite availability of user space bpf interpreters the kernel is
the ultimate authority and execution environment.
Current test frameworks for TC include creation of netns, veth,
qdiscs and use of various packet generators just to test functionality
of a bpf program. XDP testing is even more complicated, since
qemu needs to be started with gro/gso disabled and precise queue
configuration, transferring of xdp program from host into guest,
attaching to virtio/eth0 and generating traffic from the host
while capturing the results from the guest.

Moreover analyzing performance bottlenecks in XDP program is
impossible in virtio environment, since cost of running the program
is tiny comparing to the overhead of virtio packet processing,
so performance testing can only be done on physical nic
with another server generating traffic.

Furthermore ongoing changes to user space control plane of production
applications cannot be run on the test servers leaving bpf programs
stubbed out for testing.

Last but not least, the upstream llvm changes are validated by the bpf
backend testsuite which has no ability to test the code generated.

To improve this situation introduce BPF_PROG_TEST_RUN command
to test and performance benchmark bpf programs.

Joint work with Daniel Borkmann.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

31 Mar, 2017 1 commit

cfg80211: Add support for FILS shared key authentication offload · a3caf744

Committed by Vidyullatha Kanchanapally on Mar 31, 2017

Enhance nl80211 and cfg80211 connect request and response APIs to
support FILS shared key authentication offload. The new nl80211
attributes can be used to provide additional information to the driver
to establish a FILS connection. Also enhance the set/del PMKSA to allow
support for adding and deleting PMKSA based on FILS cache identifier.

Add a new feature flag that drivers can use to advertize support for
FILS shared key authentication and association in station mode when
using their own SME.
Signed-off-by: Vidyullatha Kanchanapally <vkanchan@qti.qualcomm.com>
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

29 Mar, 2017 2 commits

rtnetlink: Add RTM_DELNETCONF · 983701eb

Committed by David Ahern on Mar 28, 2017

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: Support for pipeline debug (dpipe) · 1555d204

Committed by Arkadi Sharshevsky on Mar 28, 2017

The pipeline debug is used to export the pipeline abstractions for the
main objects - tables, headers and entries. The only support for set is
for changing the counter parameter on specific table.

The basic structures:

Header - can represent a real protocol header information or internal
         metadata. Generic protocol headers like IPv4 can be shared
         between drivers. Each driver can add local headers.

Field - part of a header. Can represent protocol field or specific ASIC
        metadata field. Hardware special metadata fields can be mapped
        to different resources, for example switch ASIC ports can have
        internal number which from the system's point of view is mapped
        to netdevice ifindex.

Match - represent specific match rule. Can describe match on specific
        field or header. The header index should be specified as well
        in order to support several header instances of the same type
        (tunneling).

Action - represents specific action rule. Actions can describe operations
         on specific field values for example like set, increment, etc.
         And header operation like add and delete.

Value - represents value which can be associated with specific match or
        action.

Table - represents a hardware block which can be described with match/
        action behavior. The match/action can be done on the packet's
        data or on the internal metadata that it gathered along the
        packet's traversal through the pipeline which is vendor specific
        and should be exported in order to provide understanding of
        the ASIC's behavior.

Entry - represents single record in a specific table. The entry is
        identified by specific combination of values for match/action.

Prior to accessing the tables/entries the drivers provide the header/
field data base which is used by driver to user-space. The data base
is split between the shared headers and unique headers.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

26 Mar, 2017 2 commits

gtp: support SGSN-side tunnels · 91ed81f9

Committed by Jonas Bonn on Mar 24, 2017

The GTP-tunnel driver is explicitly GGSN-side as it searches for PDP
contexts based on the incoming packets _destination_ address.  If we
want to place ourselves on the SGSN side of the  tunnel, then we want
to be identifying PDP contexts based on _source_ address.

Let it be noted that in a "real" configuration this module would never
be used:  the SGSN normally does not see IP packets as input.  The
justification for this functionality is for PGW load-testing applications
where the input to the SGSN is locally generated IP traffic.

This patch adds a "role" argument at GTP-link creation time to specify
whether we are on the GGSN or SGSN side of the tunnel; this flag is then
used to determine which part of the IP packet to use in determining
the PDP context.
Signed-off-by: Jonas Bonn <jonas@southpole.se>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Harald Welte <laforge@gnumonks.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
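
The role-dependent lookup described above reduces to choosing which address of the IP packet serves as the key. The sketch below uses hypothetical names and plain 32-bit IPv4 addresses; it does not reproduce the driver's data structures.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical role values; the driver defines its own. */
enum sketch_gtp_role {
    SKETCH_GTP_ROLE_GGSN,
    SKETCH_GTP_ROLE_SGSN
};

/* Pick the address used to find the PDP context: the GGSN side keys on
 * the destination address, the SGSN side on the source address. */
static uint32_t sketch_pdp_lookup_addr(enum sketch_gtp_role role,
                                       uint32_t saddr, uint32_t daddr)
{
    assert(role == SKETCH_GTP_ROLE_GGSN || role == SKETCH_GTP_ROLE_SGSN);
    return role == SKETCH_GTP_ROLE_SGSN ? saddr : daddr;
}
```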

gtp: rename SGSN netlink attribute · ae6336b5

Committed by Jonas Bonn on Mar 24, 2017

This is a mostly cosmetic rename of the SGSN netlink attribute to
the GTP link.  The justification for this is that we will be making
the module support decapsulation of "downstream" SGSN packets, in
which case the netlink parameter actually refers to the upstream GGSN
peer.  Renaming the parameter makes the relationship clearer.

The legacy name is maintained as a define in the header file in order
to not break existing code.
Signed-off-by: Jonas Bonn <jonas@southpole.se>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Harald Welte <laforge@gnumonks.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

25 Mar, 2017 3 commits

uapi: add missing install of userio.h · 5659495a

Committed by Naohiro Aota on Mar 24, 2017

While commit 5523662e ("Input: add userio module") added userio.h
under the uapi/ directory, it forgot to add the header file to Kbuild.
Thus, the file was missing from header installation.
Signed-off-by: Naohiro Aota <naota@elisp.net>
Reviewed-by: Lyude Paul <thatslyude@gmail.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

net: Introduce SO_INCOMING_NAPI_ID · 6d433902

Committed by Sridhar Samudrala on Mar 24, 2017

This socket option returns the NAPI ID associated with the queue on which
the last frame is received. This information can be used by the apps to
split the incoming flows among the threads based on the Rx queue on which
they are received.

If the NAPI ID actually represents a sender_cpu then the value is ignored
and 0 is returned.
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

uapi: fix rdma/mlx5-abi.h userspace compilation errors · 812755d6

Committed by Dmitry V. Levin on Feb 24, 2017

Consistently use types from linux/types.h to fix the following
rdma/mlx5-abi.h userspace compilation errors:

/usr/include/rdma/mlx5-abi.h:69:25: error: 'u64' undeclared here (not in a function)
  MLX5_LIB_CAP_4K_UAR = (u64)1 << 0,
/usr/include/rdma/mlx5-abi.h:69:29: error: expected ',' or '}' before numeric constant
  MLX5_LIB_CAP_4K_UAR = (u64)1 << 0,

Include <linux/if_ether.h> to fix the following rdma/mlx5-abi.h
userspace compilation error:

/usr/include/rdma/mlx5-abi.h:286:12: error: 'ETH_ALEN' undeclared here (not in a function)
  __u8 dmac[ETH_ALEN];
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>

24 Mar, 2017 2 commits

Add a eBPF helper function to retrieve socket uid · 6acc5c29

Committed by Chenbo Feng on Mar 22, 2017

Returns the owner uid of the socket inside a sk_buff. This is useful to
perform per-UID accounting of network traffic or per-UID packet
filtering. The socket needs to be a fullsock, otherwise overflowuid is
returned.
Signed-off-by: Chenbo Feng <fengc@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Add a helper function to get socket cookie in eBPF · 91b8270f

Committed by Chenbo Feng on Mar 22, 2017

Retrieve the socket cookie generated by sock_gen_cookie() from a sk_buff
with a known socket. Generates a new cookie if one was not yet set. If
the socket pointer inside sk_buff is NULL, 0 is returned. The helper
function could be useful in monitoring per socket networking traffic
statistics and provide a unique socket identifier per namespace.
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Chenbo Feng <fengc@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

23 Mar, 2017 5 commits

bpf: Add hash of maps support · bcc6b1b7

Committed by Martin KaFai Lau on Mar 22, 2017

This patch adds hash of maps support (hashmap->bpf_map).
BPF_MAP_TYPE_HASH_OF_MAPS is added.

A map-in-map contains a pointer to another map and lets call
this pointer 'inner_map_ptr'.

Notes on deleting inner_map_ptr from a hash map:

1. For BPF_F_NO_PREALLOC map-in-map, when deleting
   an inner_map_ptr, the htab_elem itself will go through
   a rcu grace period and the inner_map_ptr resides
   in the htab_elem.

2. For pre-allocated htab_elem (!BPF_F_NO_PREALLOC),
   when deleting an inner_map_ptr, the htab_elem may
   get reused immediately.  This situation is similar
   to the existing prealloc-ated use cases.

   However, the bpf_map_fd_put_ptr() calls bpf_map_put() which calls
   inner_map->ops->map_free(inner_map) which will go
   through a rcu grace period (i.e. all bpf_map's map_free
   currently goes through a rcu grace period).  Hence,
   the inner_map_ptr is still safe for the rcu reader side.

This patch also includes BPF_MAP_TYPE_HASH_OF_MAPS to the
check_map_prealloc() in the verifier.  preallocation is a
must for BPF_PROG_TYPE_PERF_EVENT.  Hence, even we don't expect
heavy updates to map-in-map, enforcing BPF_F_NO_PREALLOC for map-in-map
is impossible without disallowing BPF_PROG_TYPE_PERF_EVENT from using
map-in-map first.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

bpf: Add array of maps support · 56f668df

Committed by Martin KaFai Lau on Mar 22, 2017

This patch adds a few helper funcs to enable map-in-map
support (i.e. outer_map->inner_map).  The first outer_map type
BPF_MAP_TYPE_ARRAY_OF_MAPS is also added in this patch.
The next patch will introduce a hash of maps type.

Any bpf map type can be acted as an inner_map.  The exception
is BPF_MAP_TYPE_PROG_ARRAY because the extra level of
indirection makes it harder to verify the owner_prog_type
and owner_jited.

Multi-level map-in-map is not supported (i.e. map->map is ok
but not map->map->map).

When adding an inner_map to an outer_map, it currently checks the
map_type, key_size, value_size, map_flags, max_entries and ops.
The verifier also uses those map's properties to do static analysis.
map_flags is needed because we need to ensure BPF_PROG_TYPE_PERF_EVENT
is using a preallocated hashtab for the inner_hash also.  ops and
max_entries are needed to generate inlined map-lookup instructions.
For simplicity reason, a simple '==' test is used for both map_flags
and max_entries.  The equality of ops is implied by the equality of
map_type.

During outer_map creation time, an inner_map_fd is needed to create an
outer_map.  However, the inner_map_fd's life time does not depend on the
outer_map.  The inner_map_fd is merely used to initialize
the inner_map_meta of the outer_map.

Also, for the outer_map:

* It allows element update and delete from syscall
* It allows element lookup from bpf_prog

The above is similar to the current fd_array pattern.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
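
The compatibility test applied when an inner_map is added to an outer_map can be sketched as a field-by-field comparison against the recorded inner_map_meta. The struct and function below are hypothetical stand-ins that carry only the properties named in the commit message; ops is omitted because its equality is said to follow from map_type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of a map's properties, as listed in the text. */
struct sketch_map_meta {
    uint32_t map_type;
    uint32_t key_size;
    uint32_t value_size;
    uint32_t map_flags;
    uint32_t max_entries;
};

/* A candidate inner map is accepted only if every recorded property
 * matches exactly (a plain '==' test, including flags and max_entries). */
static bool sketch_inner_map_compatible(const struct sketch_map_meta *meta,
                                        const struct sketch_map_meta *cand)
{
    assert(meta != NULL && cand != NULL);
    return meta->map_type == cand->map_type &&
           meta->key_size == cand->key_size &&
           meta->value_size == cand->value_size &&
           meta->map_flags == cand->map_flags &&
           meta->max_entries == cand->max_entries;
}
```

Exact equality is stricter than necessary for some fields, but it keeps the verifier's static assumptions about any inner map trivially valid.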

net: ipv6: Add sysctl for minimum prefix len acceptable in RIOs. · bbea124b

Committed by Joel Scherpelz on Mar 22, 2017

This commit adds a new sysctl accept_ra_rt_info_min_plen that
defines the minimum acceptable prefix length of Route Information
Options. The new sysctl is intended to be used together with
accept_ra_rt_info_max_plen to configure a range of acceptable
prefix lengths. It is useful to prevent misconfigurations from
unintentionally blackholing too much of the IPv6 address space
(e.g., home routers announcing RIOs for fc00::/7, which is
incorrect).
Signed-off-by: Joel Scherpelz <jscherpelz@google.com>
Acked-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
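
Taken together, the two sysctls describe a closed range of acceptable prefix lengths. A minimal sketch of that range check follows; the function is hypothetical and ignores any other conditions the kernel applies to Route Information Options.

```c
#include <assert.h>
#include <stdbool.h>

/* Accept a Route Information Option only if its prefix length lies
 * within [min_plen, max_plen], the range configured through
 * accept_ra_rt_info_min_plen and accept_ra_rt_info_max_plen. */
static bool sketch_rio_plen_acceptable(unsigned int plen,
                                       unsigned int min_plen,
                                       unsigned int max_plen)
{
    assert(max_plen <= 128); /* IPv6 prefixes are at most 128 bits */
    return plen >= min_plen && plen <= max_plen;
}
```

For instance, with an illustrative minimum of 48 and maximum of 64, an announcement for fc00::/7 (prefix length 7) would be ignored, while a /64 route would be accepted.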

openvswitch: Optimize sample action for the clone use cases · 798c1661

Committed by andy zhou on Mar 20, 2017

With the introduction of open flow 'clone' action, the OVS user space
can now translate the 'clone' action into kernel datapath 'sample'
action, with 100% probability, to ensure that the clone semantics,
which is that the packet seen by the clone action is the same as the
packet seen by the action after clone, is faithfully carried out
in the datapath.

While the sample action in the datapath has the matching semantics,
its implementation is only optimized for its original use.
Specifically, there are two limitations: First, there is a 3 level
nesting restriction, enforced at the flow downloading time. This
limit turns out to be too restrictive for the 'clone' use case.
Second, the implementation avoids recursive calls only if the sample
action list has a single userspace action.

The main optimization implemented in this series removes the static
nesting limit check, instead, implement the run time recursion limit
check, and recursion avoidance similar to that of the 'recirc' action.
This optimization solve both #1 and #2 issues above.

One related optimization attempts to avoid copying flow key as
long as the actions enclosed does not change the flow key. The
detection is performed only once at the flow downloading time.

Another related optimization is to rewrite the action list
at flow downloading time in order to save the fast path from parsing
the sample action list in its original form repeatedly.
Signed-off-by: Andy Zhou <azhou@ovn.org>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

sock: introduce SO_MEMINFO getsockopt · a2d133b1

Committed by Josh Hunt on Mar 20, 2017

Allows reading of SK_MEMINFO_VARS via socket option. This way an
application can get all meminfo related information in single socket
option call instead of multiple calls.

Adds helper function, sk_get_meminfo(), and uses that for both
getsockopt and sock_diag_put_meminfo().

Suggested by Eric Dumazet.
Signed-off-by: Josh Hunt <johunt@akamai.com>
Reviewed-by: Jason Baron <jbaron@akamai.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
