提交 · efc21d95067f658a20e51e24c4c86d68f23b7f7f · openeuler / raspberrypi-kernel

15 3月, 2016 2 次提交

net: add a hardware buffer management helper API · 8cb2d8bf

由 Gregory CLEMENT 提交于 3月 14, 2016

This basic implementation allows to share code between driver using
hardware buffer management. As the code is hardware agnostic, there is
few helpers, most of the optimization brought by the an HW BM has to be
done at driver level.
Tested-by: NSebastian Careba <nitroshift@yahoo.com>
Signed-off-by: NGregory CLEMENT <gregory.clement@free-electrons.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8cb2d8bf

bus: mvebu-mbus: provide api for obtaining IO and DRAM window information · f2900ace

由 Marcin Wojtas 提交于 3月 14, 2016

This commit enables finding appropriate mbus window and obtaining its
target id and attribute for given physical address in two separate
routines, both for IO and DRAM windows. This functionality
is needed for Armada XP/38x Network Controller's Buffer Manager and
PnC configuration.

[gregory.clement@free-electrons.com: Fix size test for
mvebu_mbus_get_dram_win_info]
Signed-off-by: NMarcin Wojtas <mw@semihalf.com>
[DRAM window information reference in LKv3.10]
Signed-off-by: NEvan Wang <xswang@marvell.com>
Signed-off-by: NGregory CLEMENT <gregory.clement@free-electrons.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f2900ace

14 3月, 2016 9 次提交

ipv6: Pass proto to csum_ipv6_magic as __u8 instead of unsigned short · 1e940829

由 Alexander Duyck 提交于 3月 11, 2016

This patch updates csum_ipv6_magic so that it correctly recognizes that
protocol is a unsigned 8 bit value.

This will allow us to better understand what limitations may or may not be
present in how we handle the data.  For example there are a number of
places that call htonl on the protocol value.  This is likely not necessary
and can be replaced with a multiplication by ntohl(1) which will be
converted to a shift by the compiler.
Signed-off-by: NAlexander Duyck <aduyck@mirantis.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1e940829

ipv4: Update parameters for csum_tcpudp_magic to their original types · 01cfbad7

由 Alexander Duyck 提交于 3月 11, 2016

This patch updates all instances of csum_tcpudp_magic and
csum_tcpudp_nofold to reflect the types that are usually used as the source
inputs.  For example the protocol field is populated based on nexthdr which
is actually an unsigned 8 bit value.  The length is usually populated based
on skb->len which is an unsigned integer.

This addresses an issue in which the IPv6 function csum_ipv6_magic was
generating a checksum using the full 32b of skb->len while
csum_tcpudp_magic was only using the lower 16 bits.  As a result we could
run into issues when attempting to adjust the checksum as there was no
protocol agnostic way to update it.

With this change the value is still truncated as many architectures use
"(len + proto) << 8", however this truncation only occurs for values
greater than 16776960 in length and as such is unlikely to occur as we stop
the inner headers at ~64K in size.

I did have to make a few minor changes in the arm, mn10300, nios2, and
score versions of the function in order to support these changes as they
were either using things such as an OR to combine the protocol and length,
or were using ntohs to convert the length which would have truncated the
value.

I also updated a few spots in terms of whitespace and type differences for
the addresses.  Most of this was just to make sure all of the definitions
were in sync going forward.
Signed-off-by: NAlexander Duyck <aduyck@mirantis.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

01cfbad7

net: add MACsec netdevice priv_flags and helper · 3c175784

由 Sabrina Dubroca 提交于 3月 11, 2016

Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
Reviewed-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3c175784

uapi: add MACsec bits · dece8d2b

由 Sabrina Dubroca 提交于 3月 11, 2016

Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
Reviewed-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

dece8d2b

sctp: allow sctp_transmit_packet and others to use gfp · cea8768f

由 Marcelo Ricardo Leitner 提交于 3月 10, 2016

Currently sctp_sendmsg() triggers some calls that will allocate memory
with GFP_ATOMIC even when not necessary. In the case of
sctp_packet_transmit it will allocate a linear skb that will be used to
construct the packet and this may cause sends to fail due to ENOMEM more
often than anticipated specially with big MTUs.

This patch thus allows it to inherit gfp flags from upper calls so that
it can use GFP_KERNEL if it was triggered by a sctp_sendmsg call or
similar. All others, like retransmits or flushes started from BH, are
still allocated using GFP_ATOMIC.

In netperf tests this didn't result in any performance drawbacks when
memory is not too fragmented and made it trigger ENOMEM way less often.
Signed-off-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cea8768f

phy: remove documentation of removed members of phy_device structure · 470c3822

由 LABBE Corentin 提交于 3月 10, 2016

Commit e5a03bfd ("phy: Add an mdio_device structure") removed addr,
bus and dev member of the phy_device structure.
This patch remove the documentation about those members.
Signed-off-by: NLABBE Corentin <clabbe.montjoie@gmail.com>
Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
Acked-by: NFlorian Fainelli <f.fainelli@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

470c3822

xen-netback: re-import canonical netif header · 6b8abef5

由 Paul Durrant 提交于 3月 10, 2016

The canonical netif header (in the Xen source repo) and the Linux variant
have diverged significantly. Recently much documentation has been added to
the canonical header which is highly useful for developers making
modifications to either xen-netfront or xen-netback. This patch therefore
re-imports the canonical header in its entirity.

To maintain compatibility and some style consistency with the old Linux
variant, the header was stripped of its emacs boilerplate, and
post-processed and copied into place with the following commands:

ed -s netif.h << EOF
H
,s/NETTXF_/XEN_NETTXF_/g
,s/NETRXF_/XEN_NETRXF_/g
,s/NETIF_/XEN_NETIF_/g
,s/XEN_XEN_/XEN_/g
,s/netif/xen_netif/g
,s/xen_xen_/xen_/g
,s/^typedef.*$//g
,s/^    /${TAB}/g
w
$
w
EOF

indent --line-length 80 --linux-style netif.h \
-o include/xen/interface/io/netif.h
Signed-off-by: NPaul Durrant <paul.durrant@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Acked-by: NWei Liu <wei.liu2@citrix.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6b8abef5

netconf: add macro to represent all attributes · 136ba622

由 Zhang Shengju 提交于 3月 10, 2016

This patch adds macro NETCONFA_ALL to represent all type of netconf
attributes for IPv4 and IPv6.
Signed-off-by: NZhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

136ba622

csum: Update csum_block_add to use rotate instead of byteswap · 33803963

由 Alexander Duyck 提交于 3月 09, 2016

The code for csum_block_add was doing a funky byteswap to swap the even and
odd bytes of the checksum if the offset was odd.  Instead of doing this we
can save ourselves some trouble and just shift by 8 as this should have the
same effect in terms of the final checksum value and only requires one
instruction.

In addition we can update csum_block_sub to just use csum_block_add with a
inverse value for csum2.  This way we follow the same code path as
csum_block_add without having to duplicate it.
Signed-off-by: NAlexander Duyck <aduyck@mirantis.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

33803963

12 3月, 2016 6 次提交

bpf: support flow label for bpf_skb_{set, get}_tunnel_key · 4018ab18

由 Daniel Borkmann 提交于 3月 09, 2016

This patch extends bpf_tunnel_key with a tunnel_label member, that maps
to ip_tunnel_key's label so underlying backends like vxlan and geneve
can propagate the label to udp_tunnel6_xmit_skb(), where it's being set
in the IPv6 header. It allows for having 20 more bits to encode/decode
flow related meta information programmatically. Tested with vxlan and
geneve.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4018ab18

geneve: support setting IPv6 flow label · 8eb3b995

由 Daniel Borkmann 提交于 3月 09, 2016

This work adds support for setting the IPv6 flow label for geneve per
device and through collect metadata (ip_tunnel_key) frontends. Also here,
the geneve dst cache does not need any special considerations, for the
cases where caches can be used, the label is static per cache.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8eb3b995

vxlan: support setting IPv6 flow label · e7f70af1

由 Daniel Borkmann 提交于 3月 09, 2016

This work adds support for setting the IPv6 flow label for vxlan per
device and through collect metadata (ip_tunnel_key) frontends. The
vxlan dst cache does not need any special considerations here, for
the cases where caches can be used, the label is static per cache.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e7f70af1

ip_tunnel: add support for setting flow label via collect metadata · 13461144

由 Daniel Borkmann 提交于 3月 09, 2016

This patch extends udp_tunnel6_xmit_skb() to pass in the IPv6 flow label
from call sites. Currently, there's no such option and it's always set to
zero when writing ip6_flow_hdr(). Add a label member to ip_tunnel_key, so
that flow-based tunnels via collect metadata frontends can make use of it.
vxlan and geneve will be converted to add flow label support separately.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

13461144

bridge: allow zero ageing time · 4c656c13

由 Stephen Hemminger 提交于 3月 08, 2016

This fixes a regression in the bridge ageing time caused by:
commit c62987bb ("bridge: push bridge setting ageing_time down to switchdev")

There are users of Linux bridge which use the feature that if ageing time
is set to 0 it causes entries to never expire. See:
  https://www.linuxfoundation.org/collaborate/workgroups/networking/bridge

For a pure software bridge, it is unnecessary for the code to have
arbitrary restrictions on what values are allowable.
Signed-off-by: NStephen Hemminger <stephen@networkplumber.org>
Acked-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4c656c13

net/flower: Fix pointer cast · 8208d21b

由 Amir Vadai 提交于 3月 11, 2016

Cast pointer to unsigned long instead of u64, to fix compilation warning
on 32 bit arch, spotted by 0day build.

Fixes: 5b33f488 ("net/flower: Introduce hardware offload support")
Signed-off-by: NAmir Vadai <amir@vadai.me>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8208d21b

11 3月, 2016 5 次提交

net/act_skbedit: Utility functions for mark action · 519afb18

由 Amir Vadai 提交于 3月 08, 2016

Enable device drivers to query the action, if and only if is a mark
action and what value to use for marking.
Acked-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NAmir Vadai <amir@vadai.me>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

519afb18

net/sched: Macro instead of CONFIG_NET_CLS_ACT ifdef · 00175aec

由 Amir Vadai 提交于 3月 08, 2016

Introduce the macros tc_no_actions and tc_for_each_action to make code
clearer.
Extracted struct tc_action out of the ifdef to make calls to
is_tcf_gact_shot() and similar functions valid, even when it is a nop.
Acked-by: NJiri Pirko <jiri@mellanox.com>
Acked-by: NJohn Fastabend <john.r.fastabend@intel.com>
Suggested-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NAmir Vadai <amir@vadai.me>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

00175aec

net/flow_dissector: Make dissector_uses_key() and skb_flow_dissector_target() public · 8de2d793

由 Amir Vadai 提交于 3月 08, 2016

Will be used in a following patch to query if a key is being used, and
what it's value in the target object.
Acked-by: NJohn Fastabend <john.r.fastabend@intel.com>
Acked-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NAmir Vadai <amir@vadai.me>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8de2d793

net/flower: Introduce hardware offload support · 5b33f488

由 Amir Vadai 提交于 3月 08, 2016

This patch is based on a patch made by John Fastabend.
It adds support for offloading cls_flower.
when NETIF_F_HW_TC is on:
  flags = 0       => Rule will be processed twice - by hardware, and if
                     still relevant, by software.
  flags = SKIP_HW => Rull will be processed by software only

If hardware fail/not capabale to apply the rule, operation will NOT
fail. Filter will be processed by SW only.
Acked-by: NJiri Pirko <jiri@mellanox.com>
Suggested-by: NJohn Fastabend <john.r.fastabend@intel.com>
Signed-off-by: NAmir Vadai <amir@vadai.me>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5b33f488

kcm: mark helper functions inline · f720d0ca

由 Arnd Bergmann 提交于 3月 10, 2016

The stub helper functions for the newly added kcm_proc_init/exit interfaces
are defined as 'static' in a header file, which leads to build warnings for
each file that includes them without calling them:

include/net/kcm.h:183:12: error: 'kcm_proc_init' defined but not used [-Werror=unused-function]
include/net/kcm.h:184:13: error: 'kcm_proc_exit' defined but not used [-Werror=unused-function]

This marks the two functions as 'static inline' instead, which avoids the
warnings and is obviously what was meant here.
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
Fixes: cd6e111b ("kcm: Add statistics and proc interfaces")
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f720d0ca

10 3月, 2016 10 次提交

net: validate variable length ll headers · 2793a23a

由 Willem de Bruijn 提交于 3月 09, 2016

Netdevice parameter hard_header_len is variously interpreted both as
an upper and lower bound on link layer header length. The field is
used as upper bound when reserving room at allocation, as lower bound
when validating user input in PF_PACKET.

Clarify the definition to be maximum header length. For validation
of untrusted headers, add an optional validate member to header_ops.

Allow bypassing of validation by passing CAP_SYS_RAWIO, for instance
for deliberate testing of corrupt input. In this case, pad trailing
bytes, as some device drivers expect completely initialized headers.

See also http://comments.gmane.org/gmane.linux.network/401064Signed-off-by: NWillem de Bruijn <willemb@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2793a23a

NFC: microread: Drop platform data header file · 87aca737

由 Jean Delvare 提交于 1月 21, 2016

Originally I only wanted to drop the unneeded inclusion of
<linux/i2c.h>, but then noticed that struct
microread_nfc_platform_data isn't actually used, and
MICROREAD_DRIVER_NAME is redefined in the only file where it is used,
so we can get rid of the header file and dead code altogether.
Signed-off-by: NJean Delvare <jdelvare@suse.de>
Cc: Lauro Ramos Venancio <lauro.venancio@openbossa.org>
Cc: Aloisio Almeida Jr <aloisio.almeida@openbossa.org>
Signed-off-by: NSamuel Ortiz <sameo@linux.intel.com>

87aca737

kcm: Add receive message timeout · 29152a34

由 Tom Herbert 提交于 3月 07, 2016

This patch adds receive timeout for message assembly on the attached TCP
sockets. The timeout is set when a new messages is started and the whole
message has not been received by TCP (not in the receive queue). If the
completely message is subsequently received the timer is cancelled, if the
timer expires the RX side is aborted.

The timeout value is taken from the socket timeout (SO_RCVTIMEO) that is
set on a TCP socket (i.e. set by get sockopt before attaching a TCP socket
to KCM.
Signed-off-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

29152a34

kcm: Add memory limit for receive message construction · 7ced95ef

由 Tom Herbert 提交于 3月 07, 2016

Message assembly is performed on the TCP socket. This is logically
equivalent of an application that performs a peek on the socket to find
out how much memory is needed for a receive buffer. The receive socket
buffer also provides the maximum message size which is checked.

The receive algorithm is something like:

   1) Receive the first skbuf for a message (or skbufs if multiple are
      needed to determine message length).
   2) Check the message length against the number of bytes in the TCP
      receive queue (tcp_inq()).
	- If all the bytes of the message are in the queue (incluing the
	  skbuf received), then proceed with message assembly (it should
	  complete with the tcp_read_sock)
        - Else, mark the psock with the number of bytes needed to
	  complete the message.
   3) In TCP data ready function, if the psock indicates that we are
      waiting for the rest of the bytes of a messages, check the number
      of queued bytes against that.
        - If there are still not enough bytes for the message, just
	  return
        - Else, clear the waiting bytes and proceed to receive the
	  skbufs.  The message should now be received in one
	  tcp_read_sock
Signed-off-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7ced95ef

kcm: Add statistics and proc interfaces · cd6e111b

由 Tom Herbert 提交于 3月 07, 2016

This patch adds various counters for KCM. These include counters for
messages and bytes received or sent, as well as counters for number of
attached/unattached TCP sockets and other error or edge events.

The statistics are exposed via a proc interface. /proc/net/kcm provides
statistics per KCM socket and per psock (attached TCP sockets).
/proc/net/kcm_stats provides aggregate statistics.
Signed-off-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cd6e111b

kcm: Kernel Connection Multiplexor module · ab7ac4eb

由 Tom Herbert 提交于 3月 07, 2016

This module implements the Kernel Connection Multiplexor.

Kernel Connection Multiplexor (KCM) is a facility that provides a
message based interface over TCP for generic application protocols.
With KCM an application can efficiently send and receive application
protocol messages over TCP using datagram sockets.

For more information see the included Documentation/networking/kcm.txt
Signed-off-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ab7ac4eb

tcp: Add tcp_inq to get available receive bytes on socket · 473bd239

由 Tom Herbert 提交于 3月 07, 2016

Create a common kernel function to get the number of bytes available
on a TCP socket. This is based on code in INQ getsockopt and we now call
the function for that getsockopt.
Signed-off-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

473bd239

net: Add MSG_BATCH flag · f092276d

由 Tom Herbert 提交于 3月 07, 2016

Add a new msg flag called MSG_BATCH. This flag is used in sendmsg to
indicate that more messages will follow (i.e. a batch of messages is
being sent). This is similar to MSG_MORE except that the following
messages are not merged into one packet, they are sent individually.
sendmmsg is updated so that each contained message except for the
last one is marked as MSG_BATCH.

MSG_BATCH is a performance optimization in cases where a socket
implementation can benefit by transmitting packets in a batch.
Signed-off-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f092276d

net: Make sock_alloc exportable · f4a00aac

由 Tom Herbert 提交于 3月 07, 2016

Export it for cases where we want to create sockets by hand.
Signed-off-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f4a00aac

rcu: Add list_next_or_null_rcu · ff3c44e6

由 Tom Herbert 提交于 3月 07, 2016

This is a convenience function that returns the next entry in an RCU
list or NULL if at the end of the list.
Signed-off-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ff3c44e6

09 3月, 2016 8 次提交

ip_tunnel, bpf: ip_tunnel_info_opts_{get, set} depends on CONFIG_INET · e28e87ed

由 Daniel Borkmann 提交于 3月 08, 2016

Helpers like ip_tunnel_info_opts_{get,set}() are only available if
CONFIG_INET is set, thus add an empty definition into the header for
the !CONFIG_INET case, where already other empty inline helpers are
defined.

This avoids ifdef kludge inside filter.c, but also vxlan and geneve
themself where this facility can only be used with, depend on INET
being set. For the !INET case TUNNEL_OPTIONS_PRESENT would never be
set in flags.

Fixes: 14ca0751 ("bpf: support for access to tunnel options")
Reported-by: NFengguang Wu <fengguang.wu@intel.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e28e87ed

bpf: convert stackmap to pre-allocation · 557c0c6e

由 Alexei Starovoitov 提交于 3月 07, 2016

It was observed that calling bpf_get_stackid() from a kprobe inside
slub or from spin_unlock causes similar deadlock as with hashmap,
therefore convert stackmap to use pre-allocated memory.

The call_rcu is no longer feasible mechanism, since delayed freeing
causes bpf_get_stackid() to fail unpredictably when number of actual
stacks is significantly less than user requested max_entries.
Since elements are no longer freed into slub, we can push elements into
freelist immediately and let them be recycled.
However the very unlikley race between user space map_lookup() and
program-side recycling is possible:
     cpu0                          cpu1
     ----                          ----
user does lookup(stackidX)
starts copying ips into buffer
                                   delete(stackidX)
                                   calls bpf_get_stackid()
				   which recyles the element and
                                   overwrites with new stack trace

To avoid user space seeing a partial stack trace consisting of two
merged stack traces, do bucket = xchg(, NULL); copy; xchg(,bucket);
to preserve consistent stack trace delivery to user space.
Now we can move memset(,0) of left-over element value from critical
path of bpf_get_stackid() into slow-path of user space lookup.
Also disallow lookup() from bpf program, since it's useless and
program shouldn't be messing with collected stack trace.

Note that similar race between user space lookup and kernel side updates
is also present in hashmap, but it's not a new race. bpf programs were
always allowed to modify hash and array map elements while user space
is copying them.

Fixes: d5a3b1f6 ("bpf: introduce BPF_MAP_TYPE_STACK_TRACE")
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

557c0c6e

bpf: pre-allocate hash map elements · 6c905981

由 Alexei Starovoitov 提交于 3月 07, 2016

If kprobe is placed on spin_unlock then calling kmalloc/kfree from
bpf programs is not safe, since the following dead lock is possible:
kfree->spin_lock(kmem_cache_node->lock)...spin_unlock->kprobe->
bpf_prog->map_update->kmalloc->spin_lock(of the same kmem_cache_node->lock)
and deadlocks.

The following solutions were considered and some implemented, but
eventually discarded
- kmem_cache_create for every map
- add recursion check to slow-path of slub
- use reserved memory in bpf_map_update for in_irq or in preempt_disabled
- kmalloc via irq_work

At the end pre-allocation of all map elements turned out to be the simplest
solution and since the user is charged upfront for all the memory, such
pre-allocation doesn't affect the user space visible behavior.

Since it's impossible to tell whether kprobe is triggered in a safe
location from kmalloc point of view, use pre-allocation by default
and introduce new BPF_F_NO_PREALLOC flag.

While testing of per-cpu hash maps it was discovered
that alloc_percpu(GFP_ATOMIC) has odd corner cases and often
fails to allocate memory even when 90% of it is free.
The pre-allocation of per-cpu hash elements solves this problem as well.

Turned out that bpf_map_update() quickly followed by
bpf_map_lookup()+bpf_map_delete() is very common pattern used
in many of iovisor/bcc/tools, so there is additional benefit of
pre-allocation, since such use cases are must faster.

Since all hash map elements are now pre-allocated we can remove
atomic increment of htab->count and save few more cycles.

Also add bpf_map_precharge_memlock() to check rlimit_memlock early to avoid
large malloc/free done by users who don't have sufficient limits.

Pre-allocation is done with vmalloc and alloc/free is done
via percpu_freelist. Here are performance numbers for different
pre-allocation algorithms that were implemented, but discarded
in favor of percpu_freelist:

1 cpu:
pcpu_ida	2.1M
pcpu_ida nolock	2.3M
bt		2.4M
kmalloc		1.8M
hlist+spinlock	2.3M
pcpu_freelist	2.6M

4 cpu:
pcpu_ida	1.5M
pcpu_ida nolock	1.8M
bt w/smp_align	1.7M
bt no/smp_align	1.1M
kmalloc		0.7M
hlist+spinlock	0.2M
pcpu_freelist	2.0M

8 cpu:
pcpu_ida	0.7M
bt w/smp_align	0.8M
kmalloc		0.4M
pcpu_freelist	1.5M

32 cpu:
kmalloc		0.13M
pcpu_freelist	0.49M

pcpu_ida nolock is a modified percpu_ida algorithm without
percpu_ida_cpu locks and without cross-cpu tag stealing.
It's faster than existing percpu_ida, but not as fast as pcpu_freelist.

bt is a variant of block/blk-mq-tag.c simlified and customized
for bpf use case. bt w/smp_align is using cache line for every 'long'
(similar to blk-mq-tag). bt no/smp_align allocates 'long'
bitmasks continuously to save memory. It's comparable to percpu_ida
and in some cases faster, but slower than percpu_freelist

hlist+spinlock is the simplest free list with single spinlock.
As expeceted it has very bad scaling in SMP.

kmalloc is existing implementation which is still available via
BPF_F_NO_PREALLOC flag. It's significantly slower in single cpu and
in 8 cpu setup it's 3 times slower than pre-allocation with pcpu_freelist,
but saves memory, so in cases where map->max_entries can be large
and number of map update/delete per second is low, it may make
sense to use it.
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6c905981

bpf: prevent kprobe+bpf deadlocks · b121d1e7

由 Alexei Starovoitov 提交于 3月 07, 2016

if kprobe is placed within update or delete hash map helpers
that hold bucket spin lock and triggered bpf program is trying to
grab the spinlock for the same bucket on the same cpu, it will
deadlock.
Fix it by extending existing recursion prevention mechanism.

Note, map_lookup and other tracing helpers don't have this problem,
since they don't hold any locks and don't modify global data.
bpf_trace_printk has its own recursive check and ok as well.
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b121d1e7

ipv6: per netns FIB garbage collection · 3dc94f93

由 Michal Kubeček 提交于 3月 08, 2016

One of our customers observed issues with FIB6 garbage collectors
running in different network namespaces blocking each other, resulting
in soft lockups (fib6_run_gc() initiated from timer runs always in
forced mode).

Now that FIB6 walkers are separated per namespace, there is no more need
for instances of fib6_run_gc() in different namespaces blocking each
other. There is still a call to icmp6_dst_gc() which operates on shared
data but this function is protected by its own shared lock.
Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
Reviewed-by: NCong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3dc94f93

ipv6: per netns fib6 walkers · 9a03cd8f

由 Michal Kubeček 提交于 3月 08, 2016

The IPv6 FIB data structures are separated per network namespace but
there is still only one global walkers list and one global walker list
lock. This means changes in one namespace unnecessarily interfere with
walkers in other namespaces.

Replace the global list with per-netns lists (and give each its own
lock).
Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
Reviewed-by: NCong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9a03cd8f

bpf, vxlan, geneve, gre: fix usage of dst_cache on xmit · db3c6139

由 Daniel Borkmann 提交于 3月 04, 2016

The assumptions from commit 0c1d70af ("net: use dst_cache for vxlan
device"), 468dfffc ("geneve: add dst caching support") and 3c1cb4d2
("net/ipv4: add dst cache support for gre lwtunnels") on dst_cache usage
when ip_tunnel_info is used is unfortunately not always valid as assumed.

While it seems correct for ip_tunnel_info front-ends such as OVS, eBPF
however can fill in ip_tunnel_info for consumers like vxlan, geneve or gre
with different remote dsts, tos, etc, therefore they cannot be assumed as
packet independent.

Right now vxlan, geneve, gre would cache the dst for eBPF and every packet
would reuse the same entry that was first created on the initial route
lookup. eBPF doesn't store/cache the ip_tunnel_info, so each skb may have
a different one.

Fix it by adding a flag that checks the ip_tunnel_info. Also the !tos test
in vxlan needs to be handeled differently in this context as it is currently
inferred from ip_tunnel_info as well if present. ip_tunnel_dst_cache_usable()
helper is added for the three tunnel cases, which checks if we can use dst
cache.

Fixes: 0c1d70af ("net: use dst_cache for vxlan device")
Fixes: 468dfffc ("geneve: add dst caching support")
Fixes: 3c1cb4d2 ("net/ipv4: add dst cache support for gre lwtunnels")
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NPaolo Abeni <pabeni@redhat.com>
Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

db3c6139

bpf: support for access to tunnel options · 14ca0751

由 Daniel Borkmann 提交于 3月 04, 2016

After eBPF being able to programmatically access/manage tunnel key meta
data via commit d3aa45ce ("bpf: add helpers to access tunnel metadata")
and more recently also for IPv6 through c6c33454 ("bpf: support ipv6
for bpf_skb_{set,get}_tunnel_key"), this work adds two complementary
helpers to generically access their auxiliary tunnel options.

Geneve and vxlan support this facility. For geneve, TLVs can be pushed,
and for the vxlan case its GBP extension. I.e. setting tunnel key for geneve
case only makes sense, if we can also read/write TLVs into it. In the GBP
case, it provides the flexibility to easily map the group policy ID in
combination with other helpers or maps.

I chose to model this as two separate helpers, bpf_skb_{set,get}_tunnel_opt(),
for a couple of reasons. bpf_skb_{set,get}_tunnel_key() is already rather
complex by itself, and there may be cases for tunnel key backends where
tunnel options are not always needed. If we would have integrated this
into bpf_skb_{set,get}_tunnel_key() nevertheless, we are very limited with
remaining helper arguments, so keeping compatibility on structs in case of
passing in a flat buffer gets more cumbersome. Separating both also allows
for more flexibility and future extensibility, f.e. options could be fed
directly from a map, etc.

Moreover, change geneve's xmit path to test only for info->options_len
instead of TUNNEL_GENEVE_OPT flag. This makes it more consistent with vxlan's
xmit path and allows for avoiding to specify a protocol flag in the API on
xmit, so it can be protocol agnostic. Having info->options_len is enough
information that is needed. Tested with vxlan and geneve.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

14ca0751