提交 · c1b7fca65070bfadca94dd53a4e6b71cd4f69715 · openanolis / cloud-kernel

11 3月, 2016 3 次提交

sh_eth: fix NULL pointer dereference in sh_eth_ring_format() · c1b7fca6

由 Sergei Shtylyov 提交于 3月 08, 2016

In a low memory situation, if netdev_alloc_skb() fails on a first RX ring
loop iteration in sh_eth_ring_format(), 'rxdesc' is still NULL. Avoid
kernel oops by adding the 'rxdesc' check after the loop.
Reported-by: NWolfram Sang <wsa+renesas@sang-engineering.com>
Signed-off-by: NSergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c1b7fca6

kcm: mark helper functions inline · f720d0ca

由 Arnd Bergmann 提交于 3月 10, 2016

The stub helper functions for the newly added kcm_proc_init/exit interfaces
are defined as 'static' in a header file, which leads to build warnings for
each file that includes them without calling them:

include/net/kcm.h:183:12: error: 'kcm_proc_init' defined but not used [-Werror=unused-function]
include/net/kcm.h:184:13: error: 'kcm_proc_exit' defined but not used [-Werror=unused-function]

This marks the two functions as 'static inline' instead, which avoids the
warnings and is obviously what was meant here.
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
Fixes: cd6e111b ("kcm: Add statistics and proc interfaces")
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f720d0ca

Merge tag 'linux-can-next-for-4.6-20160310' of... · 798fee48

由 David S. Miller 提交于 3月 10, 2016

Merge tag 'linux-can-next-for-4.6-20160310' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next

Marc Kleine-Budde says:

====================
pull-request: can-next 2016-03-10

this is a pull request of 5 patch for net-next/master.

Marek Vasut contributes 4 patches for the ifi CAN driver, which makes
it work on real hardware. There is one patch by Ramesh Shanmugasundaram
for the rcar_can driver that adds support for the 3rd generation IP
core.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

798fee48

10 3月, 2016 25 次提交

can: rcar_can: Add r8a7795 support · e481ab23

由 Ramesh Shanmugasundaram 提交于 2月 29, 2016

Added r8a7795 SoC support.
Signed-off-by: NRamesh Shanmugasundaram <ramesh.shanmugasundaram@bp.renesas.com>
Acked-by: NRob Herring <robh@kernel.org>
Signed-off-by: NMarc Kleine-Budde <mkl@pengutronix.de>

e481ab23

can: ifi: Add obscure bit swap for EFF frame IDs · 6cc64266

由 Marek Vasut 提交于 3月 03, 2016

In case of CAN2.0 EFF frame, the controller handles frame IDs in a
rather bizzare way. The ID is split into an extended part, IDX[28:11]
and standard part, ID[10:0]. In the TX path, the core first sends the
top 11 bits of the IDX, followed by ID and finally the rest of IDX.
In the RX path, the core stores the ID the LSbit part of IDX field,
followed by the LSbit parts of real IDX. The MSbit parts of IDX are
stored in ID field of the register.

This patch implements the necessary bit shuffling to mitigate this
obscure behavior. In case two of these controllers are connected
together, the RX and TX bit swapping nullifies itself and the issue
does not manifest. The issue only manifests when talking to another
different CAN controller.
Signed-off-by: NMarek Vasut <marex@denx.de>
Cc: Marc Kleine-Budde <mkl@pengutronix.de>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Oliver Hartkopp <socketcan@hartkopp.net>
Cc: Wolfgang Grandegger <wg@grandegger.com>
Reviewed-by: NOliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: NMarc Kleine-Budde <mkl@pengutronix.de>

6cc64266

can: ifi: Fix RX and TX ID mask · 22365435

由 Marek Vasut 提交于 3月 03, 2016

The RX and TX ID mask for CAN2.0 is 11 bits wide. This patch fixes
the incorrect mask, which caused the CAN IDs to miss the MSBit both
on receive and transmit.
Signed-off-by: NMarek Vasut <marex@denx.de>
Cc: Marc Kleine-Budde <mkl@pengutronix.de>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Oliver Hartkopp <socketcan@hartkopp.net>
Cc: Wolfgang Grandegger <wg@grandegger.com>
Reviewed-by: NOliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: NMarc Kleine-Budde <mkl@pengutronix.de>

22365435

can: ifi: Fix TX DLC configuration · f1deaee0

由 Marek Vasut 提交于 3月 03, 2016

The TX DLC, the transmission length information, was not written
into the transmit configuration register. When using the CAN core
with different CAN controller, the receiving CAN controller will
receive only the ID part of the CAN frame, but no data at all.

This patch adds the TX DLC into the register to fix this issue.
Signed-off-by: NMarek Vasut <marex@denx.de>
Cc: Marc Kleine-Budde <mkl@pengutronix.de>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Oliver Hartkopp <socketcan@hartkopp.net>
Cc: Wolfgang Grandegger <wg@grandegger.com>
Reviewed-by: NOliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: NMarc Kleine-Budde <mkl@pengutronix.de>

f1deaee0

can: ifi: Fix clock generator configuration · 99312c37

由 Marek Vasut 提交于 3月 03, 2016

The clock generation does not match reality when using the CAN IP
core outside of the FPGA design. This patch fixes the computation
of values which are programmed into the clock generator registers.

First, there are some off-by-one errors which manifest themselves
only when communicating with different controller, so those are
fixed.

Second, the bits in the clock generator registers have different
meaning depending on whether the core is in ISO CANFD mode or any
of the other modes (BOSCH CANFD or CAN2.0). Detect the ISO CANFD
mode and fix handling of this special case of clock configuration.

Finally, the CAN clock speed is in CANCLOCK register, not SYSCLOCK
register, so fix this as well.
Signed-off-by: NMarek Vasut <marex@denx.de>
Cc: Marc Kleine-Budde <mkl@pengutronix.de>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Oliver Hartkopp <socketcan@hartkopp.net>
Cc: Wolfgang Grandegger <wg@grandegger.com>
Reviewed-by: NOliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: NMarc Kleine-Budde <mkl@pengutronix.de>

99312c37

bpf: avoid copying junk bytes in bpf_get_current_comm() · cdc4e47d

由 Alexei Starovoitov 提交于 3月 09, 2016

Lots of places in the kernel use memcpy(buf, comm, TASK_COMM_LEN); but
the result is typically passed to print("%s", buf) and extra bytes
after zero don't cause any harm.
In bpf the result of bpf_get_current_comm() is used as the part of
map key and was causing spurious hash map mismatches.
Use strlcpy() to guarantee zero-terminated string.
bpf verifier checks that output buffer is zero-initialized,
so even for short task names the output buffer don't have junk bytes.
Note it's not a security concern, since kprobe+bpf is root only.

Fixes: ffeedafb ("bpf: introduce current->pid, tgid, uid, gid, comm accessors")
Reported-by: NTobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cdc4e47d

bpf: bpf_stackmap_copy depends on CONFIG_PERF_EVENTS · b8cdc051

由 Alexei Starovoitov 提交于 3月 09, 2016

0-day bot reported build error:
kernel/built-in.o: In function `map_lookup_elem':
>> kernel/bpf/.tmp_syscall.o:(.text+0x329b3c): undefined reference to `bpf_stackmap_copy'
when CONFIG_BPF_SYSCALL is set and CONFIG_PERF_EVENTS is not.
Add weak definition to resolve it.
This code path in map_lookup_elem() is never taken
when CONFIG_PERF_EVENTS is not set.

Fixes: 557c0c6e ("bpf: convert stackmap to pre-allocation")
Reported-by: NFengguang Wu <fengguang.wu@intel.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b8cdc051

Merge branch 'variable-length-ll-headers' · 3b8377dc

由 David S. Miller 提交于 3月 09, 2016

Willem de Bruijn says:

====================
net: validate variable length ll headers

Allow device-specific validation of link layer headers. Existing
checks drop all packets shorter than hard_header_len. For variable
length protocols, such packets can be valid.

patch 1 adds header_ops.validate and dev_validate_header
patch 2 implements the protocol specific callback for AX25
patch 3 replaces ll_header_truncated with dev_validate_header
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3b8377dc

packet: validate variable length ll headers · 9ed988cd

由 Willem de Bruijn 提交于 3月 09, 2016

Replace link layer header validation check ll_header_truncate with
more generic dev_validate_header.

Validation based on hard_header_len incorrectly drops valid packets
in variable length protocols, such as AX25. dev_validate_header
calls header_ops.validate for such protocols to ensure correctness
below hard_header_len.

See also http://comments.gmane.org/gmane.linux.network/401064

Fixes 9c707762 ("packet: make packet_snd fail on len smaller than l2 header")
Signed-off-by: NWillem de Bruijn <willemb@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9ed988cd

ax25: add link layer header validation function · ea47781c

由 Willem de Bruijn 提交于 3月 09, 2016

As variable length protocol, AX25 fails link layer header validation
tests based on a minimum length. header_ops.validate allows protocols
to validate headers that are shorter than hard_header_len. Implement
this callback for AX25.

See also http://comments.gmane.org/gmane.linux.network/401064Signed-off-by: NWillem de Bruijn <willemb@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ea47781c

net: validate variable length ll headers · 2793a23a

由 Willem de Bruijn 提交于 3月 09, 2016

Netdevice parameter hard_header_len is variously interpreted both as
an upper and lower bound on link layer header length. The field is
used as upper bound when reserving room at allocation, as lower bound
when validating user input in PF_PACKET.

Clarify the definition to be maximum header length. For validation
of untrusted headers, add an optional validate member to header_ops.

Allow bypassing of validation by passing CAP_SYS_RAWIO, for instance
for deliberate testing of corrupt input. In this case, pad trailing
bytes, as some device drivers expect completely initialized headers.

See also http://comments.gmane.org/gmane.linux.network/401064Signed-off-by: NWillem de Bruijn <willemb@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2793a23a

Merge branch 'kcm' · 9531ab65

由 David S. Miller 提交于 3月 09, 2016

Tom Herbert says:

====================
kcm: Kernel Connection Multiplexor (KCM)

Kernel Connection Multiplexor (KCM) is a facility that provides a
message based interface over TCP for generic application protocols.
The motivation for this is based on the observation that although
TCP is byte stream transport protocol with no concept of message
boundaries, a common use case is to implement a framed application
layer protocol running over TCP. To date, most TCP stacks offer
byte stream API for applications, which places the burden of message
delineation, message I/O operation atomicity, and load balancing
in the application. With KCM an application can efficiently send
and receive application protocol messages over TCP using a
datagram interface.

In order to delineate message in a TCP stream for receive in KCM, the
kernel implements a message parser. For this we chose to employ BPF
which is applied to the TCP stream. BPF code parses application layer
messages and returns a message length. Nearly all binary application
protocols are parsable in this manner, so KCM should be applicable
across a wide range of applications. Other than message length
determination in receive, KCM does not require any other application
specific awareness. KCM does not implement any other application
protocol semantics-- these are are provided in userspace or could be
implemented in a kernel module layered above KCM.

KCM implements an NxM multiplexor in the kernel as diagrammed below:

+------------+   +------------+   +------------+   +------------+
| KCM socket |   | KCM socket |   | KCM socket |   | KCM socket |
+------------+   +------------+   +------------+   +------------+
      |                 |               |                |
      +-----------+     |               |     +----------+
                  |     |               |     |
               +----------------------------------+
               |           Multiplexor            |
               +----------------------------------+
                 |   |           |           |  |
       +---------+   |           |           |  ------------+
       |             |           |           |              |
+----------+  +----------+  +----------+  +----------+ +----------+
|  Psock   |  |  Psock   |  |  Psock   |  |  Psock   | |  Psock   |
+----------+  +----------+  +----------+  +----------+ +----------+
      |              |           |            |             |
+----------+  +----------+  +----------+  +----------+ +----------+
| TCP sock |  | TCP sock |  | TCP sock |  | TCP sock | | TCP sock |
+----------+  +----------+  +----------+  +----------+ +----------+

The KCM sockets provide the datagram interface to applications,
Psocks are the state for each attached TCP connection (i.e. where
message delineation is performed on receive).

A description of the APIs and design can be found in the included
Documentation/networking/kcm.txt.

In this patch set:

  - Add MSG_BATCH flag. This is used in sendmsg msg_hdr flags to
    indicate that more messages will be sent on the socket. The stack
    may batch messages up if it is beneficial for transmission.
  - In sendmmsg, set MSG_BATCH in all sub messages except for the last
    one.
  - In order to allow sendmmsg to contain multiple messages with
    SOCK_SEQPAKET we allow each msg_hdr in the sendmmsg to set MSG_EOR.
  - Add KCM module
    - This supports SOCK_DGRAM and SOCK_SEQPACKET.
  - KCM documentation

v2:
  - Added splice and page operations.
  - Assemble receive messages in place on TCP socket (don't have a
    separate assembly queue.
  - Based on above, enforce maxmimum receive message to be the size
    of the recceive socket buffer.
  - Support message assembly timeout. Use the timeout value in
    sk_rcvtimeo on the TCP socket.
  - Tested some with a couple of other production applications,
    see ~5% improvement in application latency.

Testing:

Dave Watson has integrated KCM into Thrift and we intend to put these
changes into open source. Example of this is in:

https://github.com/djwatson/fbthrift/commit/
dd7e0f9cf4e80912fdb90f6cd394db24e61a14cc

Some initial KCM Thrift benchmark numbers (comment from Dave)

Thrift by default ties a single connection to a single thread.  KCM is
instead able to load balance multiple connections across multiple epoll
loops easily.

A test sending ~5k bytes of data to a kcm thrift server, dropping the
bytes on recv:

QPS     Latency / std dev Latency
  without KCM
    70336     209/123
  with KCM
    70353     191/124

A test sending a small request, then doing work in the epoll thread,
before serving more requests:

QPS     Latency / std dev Latency
without KCM
    14282     559/602
with KCM
    23192     344/234

At the high end, there's definitely some additional kernel overhead:

Cranking the pipelining way up, with lots of small requests

QPS     Latency / std dev Latency
without KCM
   1863429     127/119
with KCM
   1337713     192/241

---

So for a "realistic" workload, KCM performs pretty well (second case).
Under extreme conditions of highest tps we still have some work to do.
In its nature a multiplexor will spread work between CPUs which is
logically good for load balancing but coan conflict with the goal
promoting affinity. Batching messages on both send and receive are
the means to recoup performance.

Future support:

 - Integration with TLS (TLS-in-kernel is a separate initiative).
 - Page operations/splice support
 - Unconnected KCM sockets. Will be able to attach sockets to different
   destinations, AF_KCM addresses with be used in sendmsg and recvmsg
   to indicate destination
 - Explore more utility in performing BPF inline with a TCP data stream
   (setting SO_MARK, rxhash for messages being sent received on
   KCM sockets).
 - Performance work
   - Diagnose performance issues under high message load

FAQ (Questions posted on LWN)

Q: Why do this in the kernel?

A: Because the kernel is good at scheduling threads and steering packets
   to threads. KCM fits well into this model since it allows the unit
   of work for scheduling and steering to be the application layer
   messages themselves. KCM should be thought of as generic application
   protocol acceleration. It to the philosophy that the kernel provides
   generic and extensible interfaces.

Q: How can adding code in the path yield better performance?

A: It is true that for just sending receiving a single message there
   would be some performance loss since the code path is longer (for
   instance comparing netperf to KCM). But for real production
   applications performance takes on many dynamics. Parallelism, context
   switching, affinity, granularity of locking, and load balancing are
   all relevant. The theory of KCM is that by an application-centric
   interface, the kernel can provide better support for these
   performance characteristics.

Q: Why not use an existing message-oriented protocol such as RUDP,
   DCCP, SCTP, RDS, and others?

A: Because that would entail using a completely new transport protocol.
   Deploying a new protocol at scale is either a huge undertaking or
   fundamentally infeasible. This is true in either the Internet and in
   the data center due in a large part to protocol ossification.
   Besides, KCM we want KCM to work existing, well deployed application
   protocols that we couldn't change even if we wanted to (e.g. http/2).

   KCM simply defines a new interface method, it does not redefine any
   aspect of the transport protocol nor application protocol, nor set
   any new requirements on these. Neither does KCM attempt to implement
   any application protocol logic other than message deliniation in the
   stream. These are fundamental requirement of KCM.

Q: How does this affect TCP?

A: It doesn't, not in the slightest. The use of KCM can be one-sided,
   KCM has no effect on the wire.

Q: Why force TCP into doing something it's not designed for?

A: TCP is defined as transport protocol and there is no standard that
   says the API into TCP must be stream based sockets, or for that
   matter sockets at all (or even that TCP needs to be implemented in a
   kernel). KCM is not inconsistent with the design of TCP just because
   to makes an message based interface over TCP, if it were then every
   application protocol sending messages over TCP would also be! :-)

Q: What about the problem of a connections with very slow rate of
   incoming data? As a result your application can get storms of very
   short reads. And it actually happens a lot with connection from
   mobile devices and it is a problem for servers handling a lot of
   connections.

A: The storm of short reads will occur regardless of whether KCM is used
   or not. KCM does have one advantage in this scenario though, it will
   only wake up the application when a full message has been received,
   not for each packet that makes up part of a bigger messages. If a
   bunch of small messages are received, the application can receive
   messages in batches using recvmmsg.

Q: Why not just use DPDK, or at least provide KCM like functionality in
   DPDK?

A: DPDK, or more generally OS bypass presumably with a TCP stack in
   userland, presents a different model of load balancing than that of
   KCM (and the kernel). KCM implements load balancing of messages
   across the threads of an application, whereas DPDK load balances
   based on queues which are more static and coarse-grained since
   multiple connections are bound to queues. DPDK works best when
   processing of packets is silo'ed in a thread on the CPU processing
   a queue, and packet processing (for both the stack and application)
   is fairly uniform. KCM works well for applications where the amount
   of work to process messages varies an application work is commonly
   delegated to worker threads often on different CPUs.

   The message based interface over TCP is something that could be
   provide by a DPDK or OS bypass library.

Q: I'm not quite seeing this for HTTP. Maybe for HTTP/2, I guess, or web
   sockets?

A: Yes. KCM is most appropriate for message based protocols over TCP
   where is easy to deduce the message length (e.g. a length field)
   and the protocol implements its own message ordering semantics.
   Fortunately this encompasses many modern protocols.

Q: How is memory limited and controlled?

A: In v2 all data for messages is now kept in socket buffers, either
   those for TCP or KCM, so socket buffer limits are applicable.
   This includes receive messages assembly which is now done ont teh
   TCP socket buffer instead of a separate queue-- this has the
   consequence that the TCP socket buffer limit provides an
   enforceable maxmimum message size.

   Additionally, a timeout may be set for messages assembly. The
   value used for this is taken from sk_rcvtimeo of the TCP socket.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9531ab65

kcm: Add description in Documentation · 10016594

由 Tom Herbert 提交于 3月 07, 2016

Add kcm.txt to desribe KCM and interfaces.
Signed-off-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

10016594

kcm: Add receive message timeout · 29152a34

由 Tom Herbert 提交于 3月 07, 2016

This patch adds receive timeout for message assembly on the attached TCP
sockets. The timeout is set when a new messages is started and the whole
message has not been received by TCP (not in the receive queue). If the
completely message is subsequently received the timer is cancelled, if the
timer expires the RX side is aborted.

The timeout value is taken from the socket timeout (SO_RCVTIMEO) that is
set on a TCP socket (i.e. set by get sockopt before attaching a TCP socket
to KCM.
Signed-off-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

29152a34

kcm: Add memory limit for receive message construction · 7ced95ef

由 Tom Herbert 提交于 3月 07, 2016

Message assembly is performed on the TCP socket. This is logically
equivalent of an application that performs a peek on the socket to find
out how much memory is needed for a receive buffer. The receive socket
buffer also provides the maximum message size which is checked.

The receive algorithm is something like:

   1) Receive the first skbuf for a message (or skbufs if multiple are
      needed to determine message length).
   2) Check the message length against the number of bytes in the TCP
      receive queue (tcp_inq()).
	- If all the bytes of the message are in the queue (incluing the
	  skbuf received), then proceed with message assembly (it should
	  complete with the tcp_read_sock)
        - Else, mark the psock with the number of bytes needed to
	  complete the message.
   3) In TCP data ready function, if the psock indicates that we are
      waiting for the rest of the bytes of a messages, check the number
      of queued bytes against that.
        - If there are still not enough bytes for the message, just
	  return
        - Else, clear the waiting bytes and proceed to receive the
	  skbufs.  The message should now be received in one
	  tcp_read_sock
Signed-off-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7ced95ef

kcm: Sendpage support · f29698fc

由 Tom Herbert 提交于 3月 07, 2016

Implement kcm_sendpage. Set in sendpage to kcm_sendpage in both
dgram and seqpacket ops.
Signed-off-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f29698fc

kcm: Splice support · 91687355

由 Tom Herbert 提交于 3月 07, 2016

Implement kcm_splice_read. This is supported only for seqpacket.
Add kcm_seqpacket_ops and set splice read to kcm_splice_read.
Signed-off-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

91687355

kcm: Add statistics and proc interfaces · cd6e111b

由 Tom Herbert 提交于 3月 07, 2016

This patch adds various counters for KCM. These include counters for
messages and bytes received or sent, as well as counters for number of
attached/unattached TCP sockets and other error or edge events.

The statistics are exposed via a proc interface. /proc/net/kcm provides
statistics per KCM socket and per psock (attached TCP sockets).
/proc/net/kcm_stats provides aggregate statistics.
Signed-off-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cd6e111b

kcm: Kernel Connection Multiplexor module · ab7ac4eb

由 Tom Herbert 提交于 3月 07, 2016

This module implements the Kernel Connection Multiplexor.

Kernel Connection Multiplexor (KCM) is a facility that provides a
message based interface over TCP for generic application protocols.
With KCM an application can efficiently send and receive application
protocol messages over TCP using datagram sockets.

For more information see the included Documentation/networking/kcm.txt
Signed-off-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ab7ac4eb

tcp: Add tcp_inq to get available receive bytes on socket · 473bd239

由 Tom Herbert 提交于 3月 07, 2016

Create a common kernel function to get the number of bytes available
on a TCP socket. This is based on code in INQ getsockopt and we now call
the function for that getsockopt.
Signed-off-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

473bd239

net: Walk fragments in __skb_splice_bits · fa9835e5

由 Tom Herbert 提交于 3月 07, 2016

Add walking of fragments in __skb_splice_bits.
Signed-off-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fa9835e5

net: Add MSG_BATCH flag · f092276d

由 Tom Herbert 提交于 3月 07, 2016

Add a new msg flag called MSG_BATCH. This flag is used in sendmsg to
indicate that more messages will follow (i.e. a batch of messages is
being sent). This is similar to MSG_MORE except that the following
messages are not merged into one packet, they are sent individually.
sendmmsg is updated so that each contained message except for the
last one is marked as MSG_BATCH.

MSG_BATCH is a performance optimization in cases where a socket
implementation can benefit by transmitting packets in a batch.
Signed-off-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f092276d

net: Allow MSG_EOR in each msghdr of sendmmsg · 28a94d8f

由 Tom Herbert 提交于 3月 07, 2016

This patch allows setting MSG_EOR in each individual msghdr passed
in sendmmsg. This allows a sendmmsg to send multiple messages when
using SOCK_SEQPACKET.
Signed-off-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

28a94d8f

net: Make sock_alloc exportable · f4a00aac

由 Tom Herbert 提交于 3月 07, 2016

Export it for cases where we want to create sockets by hand.
Signed-off-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f4a00aac

rcu: Add list_next_or_null_rcu · ff3c44e6

由 Tom Herbert 提交于 3月 07, 2016

This is a convenience function that returns the next entry in an RCU
list or NULL if at the end of the list.
Signed-off-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ff3c44e6

09 3月, 2016 12 次提交

samples/bpf: add map performance test · 26e90931

由 Alexei Starovoitov 提交于 3月 08, 2016

performance tests for hash map and per-cpu hash map
with and without pre-allocation
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

26e90931

samples/bpf: stress test bpf_get_stackid · 7dcc42b6

由 Alexei Starovoitov 提交于 3月 08, 2016

increase stress by also calling bpf_get_stackid() from
various *spin* functions
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7dcc42b6

samples/bpf: add bpf map stress test · 9d8b612d

由 Alexei Starovoitov 提交于 3月 08, 2016

this test calls bpf programs from different contexts:
from inside of slub, from rcu, from pretty much everywhere,
since it kprobes all spin_lock functions.
It stresses the bpf hash and percpu map pre-allocation,
deallocation logic and call_rcu mechanisms.
User space part adding more stress by walking and deleting map elements.

Note that due to nature bpf_load.c the earlier kprobe+bpf programs are
already active while loader loads new programs, creates new kprobes and
attaches them.
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9d8b612d

ip_tunnel, bpf: ip_tunnel_info_opts_{get, set} depends on CONFIG_INET · e28e87ed

由 Daniel Borkmann 提交于 3月 08, 2016

Helpers like ip_tunnel_info_opts_{get,set}() are only available if
CONFIG_INET is set, thus add an empty definition into the header for
the !CONFIG_INET case, where already other empty inline helpers are
defined.

This avoids ifdef kludge inside filter.c, but also vxlan and geneve
themself where this facility can only be used with, depend on INET
being set. For the !INET case TUNNEL_OPTIONS_PRESENT would never be
set in flags.

Fixes: 14ca0751 ("bpf: support for access to tunnel options")
Reported-by: NFengguang Wu <fengguang.wu@intel.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e28e87ed

Merge branch 'bpf-map-prealloc' · f14b488d

由 David S. Miller 提交于 3月 08, 2016

Alexei Starovoitov says:

====================
bpf: map pre-alloc

v1->v2:
. fix few issues spotted by Daniel
. converted stackmap into pre-allocation as well
. added a workaround for lockdep false positive
. added pcpu_freelist_populate to be used by hashmap and stackmap

this path set switches bpf hash map to use pre-allocation by default
and introduces BPF_F_NO_PREALLOC flag to keep old behavior for cases
where full map pre-allocation is too memory expensive.

Some time back Daniel Wagner reported crashes when bpf hash map is
used to compute time intervals between preempt_disable->preempt_enable
and recently Tom Zanussi reported a dead lock in iovisor/bcc/funccount
tool if it's used to count the number of invocations of kernel
'*spin*' functions. Both problems are due to the recursive use of
slub and can only be solved by pre-allocating all map elements.

A lot of different solutions were considered. Many implemented,
but at the end pre-allocation seems to be the only feasible answer.
As far as pre-allocation goes it also was implemented 4 different ways:
- simple free-list with single lock
- percpu_ida with optimizations
- blk-mq-tag variant customized for bpf use case
- percpu_freelist
For bpf style of alloc/free patterns percpu_freelist is the best
and implemented in this patch set.
Detailed performance numbers in patch 3.
Patch 2 introduces percpu_freelist
Patch 1 fixes simple deadlocks due to missing recursion checks
Patch 5: converts stackmap to pre-allocation
Patches 6-9: prepare test infra
Patch 10: stress test for hash map infra. It attaches to spin_lock
functions and bpf_map_update/delete are called from different contexts
Patch 11: stress for bpf_get_stackid
Patch 12: map performance test
Reported-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
Reported-by: NTom Zanussi <tom.zanussi@linux.intel.com>
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f14b488d

samples/bpf: test both pre-alloc and normal maps · c3f85cff

由 Alexei Starovoitov 提交于 3月 07, 2016

extend test coveraged to include pre-allocated and run-time alloc maps
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c3f85cff

samples/bpf: add map_flags to bpf loader · 89b97607

由 Alexei Starovoitov 提交于 3月 07, 2016

note old loader is compatible with new kernel.
map_flags are optional
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

89b97607

samples/bpf: move ksym_search() into library · 3622e7e4

由 Alexei Starovoitov 提交于 3月 07, 2016

move ksym search from offwaketime into library to be reused
in other tests
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3622e7e4

samples/bpf: make map creation more verbose · 618ec9a7

由 Alexei Starovoitov 提交于 3月 07, 2016

map creation is typically the first one to fail when rlimits are
too low, not enough memory, etc
Make this failure scenario more verbose
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

618ec9a7

bpf: convert stackmap to pre-allocation · 557c0c6e

由 Alexei Starovoitov 提交于 3月 07, 2016

It was observed that calling bpf_get_stackid() from a kprobe inside
slub or from spin_unlock causes similar deadlock as with hashmap,
therefore convert stackmap to use pre-allocated memory.

The call_rcu is no longer feasible mechanism, since delayed freeing
causes bpf_get_stackid() to fail unpredictably when number of actual
stacks is significantly less than user requested max_entries.
Since elements are no longer freed into slub, we can push elements into
freelist immediately and let them be recycled.
However the very unlikley race between user space map_lookup() and
program-side recycling is possible:
     cpu0                          cpu1
     ----                          ----
user does lookup(stackidX)
starts copying ips into buffer
                                   delete(stackidX)
                                   calls bpf_get_stackid()
				   which recyles the element and
                                   overwrites with new stack trace

To avoid user space seeing a partial stack trace consisting of two
merged stack traces, do bucket = xchg(, NULL); copy; xchg(,bucket);
to preserve consistent stack trace delivery to user space.
Now we can move memset(,0) of left-over element value from critical
path of bpf_get_stackid() into slow-path of user space lookup.
Also disallow lookup() from bpf program, since it's useless and
program shouldn't be messing with collected stack trace.

Note that similar race between user space lookup and kernel side updates
is also present in hashmap, but it's not a new race. bpf programs were
always allowed to modify hash and array map elements while user space
is copying them.

Fixes: d5a3b1f6 ("bpf: introduce BPF_MAP_TYPE_STACK_TRACE")
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

557c0c6e

bpf: check for reserved flag bits in array and stack maps · 823707b6

由 Alexei Starovoitov 提交于 3月 07, 2016

Suggested-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

823707b6

bpf: pre-allocate hash map elements · 6c905981

由 Alexei Starovoitov 提交于 3月 07, 2016

If kprobe is placed on spin_unlock then calling kmalloc/kfree from
bpf programs is not safe, since the following dead lock is possible:
kfree->spin_lock(kmem_cache_node->lock)...spin_unlock->kprobe->
bpf_prog->map_update->kmalloc->spin_lock(of the same kmem_cache_node->lock)
and deadlocks.

The following solutions were considered and some implemented, but
eventually discarded
- kmem_cache_create for every map
- add recursion check to slow-path of slub
- use reserved memory in bpf_map_update for in_irq or in preempt_disabled
- kmalloc via irq_work

At the end pre-allocation of all map elements turned out to be the simplest
solution and since the user is charged upfront for all the memory, such
pre-allocation doesn't affect the user space visible behavior.

Since it's impossible to tell whether kprobe is triggered in a safe
location from kmalloc point of view, use pre-allocation by default
and introduce new BPF_F_NO_PREALLOC flag.

While testing of per-cpu hash maps it was discovered
that alloc_percpu(GFP_ATOMIC) has odd corner cases and often
fails to allocate memory even when 90% of it is free.
The pre-allocation of per-cpu hash elements solves this problem as well.

Turned out that bpf_map_update() quickly followed by
bpf_map_lookup()+bpf_map_delete() is very common pattern used
in many of iovisor/bcc/tools, so there is additional benefit of
pre-allocation, since such use cases are must faster.

Since all hash map elements are now pre-allocated we can remove
atomic increment of htab->count and save few more cycles.

Also add bpf_map_precharge_memlock() to check rlimit_memlock early to avoid
large malloc/free done by users who don't have sufficient limits.

Pre-allocation is done with vmalloc and alloc/free is done
via percpu_freelist. Here are performance numbers for different
pre-allocation algorithms that were implemented, but discarded
in favor of percpu_freelist:

1 cpu:
pcpu_ida	2.1M
pcpu_ida nolock	2.3M
bt		2.4M
kmalloc		1.8M
hlist+spinlock	2.3M
pcpu_freelist	2.6M

4 cpu:
pcpu_ida	1.5M
pcpu_ida nolock	1.8M
bt w/smp_align	1.7M
bt no/smp_align	1.1M
kmalloc		0.7M
hlist+spinlock	0.2M
pcpu_freelist	2.0M

8 cpu:
pcpu_ida	0.7M
bt w/smp_align	0.8M
kmalloc		0.4M
pcpu_freelist	1.5M

32 cpu:
kmalloc		0.13M
pcpu_freelist	0.49M

pcpu_ida nolock is a modified percpu_ida algorithm without
percpu_ida_cpu locks and without cross-cpu tag stealing.
It's faster than existing percpu_ida, but not as fast as pcpu_freelist.

bt is a variant of block/blk-mq-tag.c simlified and customized
for bpf use case. bt w/smp_align is using cache line for every 'long'
(similar to blk-mq-tag). bt no/smp_align allocates 'long'
bitmasks continuously to save memory. It's comparable to percpu_ida
and in some cases faster, but slower than percpu_freelist

hlist+spinlock is the simplest free list with single spinlock.
As expeceted it has very bad scaling in SMP.

kmalloc is existing implementation which is still available via
BPF_F_NO_PREALLOC flag. It's significantly slower in single cpu and
in 8 cpu setup it's 3 times slower than pre-allocation with pcpu_freelist,
but saves memory, so in cases where map->max_entries can be large
and number of map update/delete per second is low, it may make
sense to use it.
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6c905981

openanolis / cloud-kernel 接近 2 年 前同步成功

openanolis / cloud-kernel
接近 2 年前同步成功