Commits · 860b642b9c33ea4a6ae2f416607b0b98a9d11bb0 · openeuler / Kernel

04 Jul, 2018 · 11 commits

net/sched: Allow creating a Qdisc watchdog with other clocks · 860b642b

Committed by Vinicius Costa Gomes on Jul 03, 2018

This adds 'qdisc_watchdog_init_clockid()', which allows a clockid to be
passed so that other time references can be used when scheduling
the Qdisc to run.
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipv4: Hook into time based transmission · bc969a97

Committed by Jesus Sanchez-Palencia on Jul 03, 2018

Add a transmit_time field to struct inet_cork, then copy the
timestamp from the CMSG cookie at ip_setup_cork() so we can
safely copy it into the skb later during __ip_make_skb().

For the raw fast path, just perform the copy at raw_send_hdrinc().
Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Add a new socket option for a future transmit time. · 80b14dee

Committed by Richard Cochran on Jul 03, 2018

This patch introduces SO_TXTIME. User space enables this option in
order to pass a desired future transmit time in a CMSG when calling
sendmsg(2). The argument to this socket option is an 8-byte struct
provided by the uapi header net_tstamp.h, defined as:

struct sock_txtime {
	clockid_t 	clockid;
	u32		flags;
};

Note that new fields were added to struct sock by filling a 2-byte
hole found in the struct. For that reason, neither the struct size nor
the number of cachelines was altered.
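The 8-byte layout described above can be checked from user space with a small sketch. The struct below is a hypothetical local mirror named example_sock_txtime (the `u32` field is written as `uint32_t`), and make_txtime() is an illustrative helper, not part of the patch. Real programs would include the uapi header rather than redefine the struct.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical mirror of the struct quoted in the commit message. */
struct example_sock_txtime {
	clockid_t clockid;	/* reference clock for the transmit time */
	uint32_t  flags;	/* option flags */
};

/* Build a zero-initialised option argument (illustrative helper). */
struct example_sock_txtime make_txtime(clockid_t clockid, uint32_t flags)
{
	struct example_sock_txtime cfg;

	/* The commit message describes the argument as 8 bytes long. */
	assert(sizeof(cfg) == 8);

	memset(&cfg, 0, sizeof(cfg));
	cfg.clockid = clockid;
	cfg.flags = flags;
	return cfg;
}
```

On a kernel and libc that define SO_TXTIME, the resulting struct would presumably be handed to setsockopt() at the SOL_SOCKET level; that call is left out here so the sketch builds against older headers.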
Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net:sched: add action inheritdsfield to skbedit · e7e3728b

Committed by Qiaobin Fu on Jul 01, 2018

The new action inheritdsfield copies the field DS of
IPv4 and IPv6 packets into skb->priority. This enables
later classification of packets based on the DS field.

v5:
* Update the drop counter for TC_ACT_SHOT

v4:
* Do not allow setting flags other than the expected ones.
* Allow dumping the pure flags.

v3:
* Use optional flags, so that it won't break old versions of tc.
* Allow users to set both SKBEDIT_F_PRIORITY and SKBEDIT_F_INHERITDSFIELD flags.

v2:
* Fix the style issue
* Move the code from skbmod to skbedit

Original idea by Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Qiaobin Fu <qiaobinf@bu.edu>
Reviewed-by: Michel Machado <michel@digirati.com.br>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipv4: listified version of ip_rcv · 17266ee9

Committed by Edward Cree on Jul 02, 2018

Also involved adding a way to run a netfilter hook over a list of packets.
Rather than attempting to make netfilter know about lists (which would be
a major project in itself) we just let it call the regular okfn (in this
case ip_rcv_finish()) for any packets it steals, and have it give us back
a list of packets it's synchronously accepted (which normally NF_HOOK
would automatically call okfn() on, but we want to be able to potentially
pass the list to a listified version of okfn().)
The netfilter hooks themselves are indirect calls that still happen per-
packet (see nf_hook_entry_hookfn()), but again, changing that can be left
for future work.

There is potential for out-of-order receives if the netfilter hook ends up
synchronously stealing packets, as they will be processed before any
accepts earlier in the list. However, it was already possible for an
asynchronous accept to cause out-of-order receives, so presumably this is
considered OK.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: core: another layer of lists, around PF_MEMALLOC skb handling · 4ce0017a

Committed by Edward Cree on Jul 02, 2018

First example of a layer splitting the list (rather than merely taking
individual packets off it).
Involves a new list.h function, list_cut_before(), like list_cut_position()
but cuts on the other side of the given entry.
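To make "cuts on the other side of the given entry" concrete, here is a minimal user-space sketch of the operation on a hand-rolled circular list. The ex_list type and ex_* helpers are stand-ins written for this illustration, not the kernel's list.h, and the sketch assumes the entry itself stays on the original list while everything in front of it moves.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal circular doubly linked list, modelled on the list.h idiom. */
struct ex_list {
	struct ex_list *next, *prev;
};

void ex_list_init(struct ex_list *head)
{
	head->next = head;
	head->prev = head;
}

bool ex_list_empty(const struct ex_list *head)
{
	return head->next == head;
}

void ex_list_add_tail(struct ex_list *node, struct ex_list *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/*
 * Move every node in front of `entry` from `head` onto `list`.
 * `entry` and everything after it stay on `head`.
 * `list` is overwritten, so it should be empty or uninitialised.
 */
void ex_list_cut_before(struct ex_list *list, struct ex_list *head,
			struct ex_list *entry)
{
	assert(list && head && entry);

	if (head->next == entry) {	/* nothing precedes entry */
		ex_list_init(list);
		return;
	}
	list->next = head->next;
	list->next->prev = list;
	list->prev = entry->prev;
	list->prev->next = list;
	head->next = entry;
	entry->prev = head;
}
```

A receive path could use such a cut to hand off the leading run of packets while continuing to walk the list from the entry that triggered the split.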
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: core: unwrap skb list receive slightly further · 920572b7

Committed by Edward Cree on Jul 02, 2018

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: core: trivial netif_receive_skb_list() entry point · f6ad8c1b

Committed by Edward Cree on Jul 02, 2018

Just calls netif_receive_skb() in a loop.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sctp: add spp_ipv6_flowlabel and spp_dscp for sctp_paddrparams · 0b0dce7a

Committed by Xin Long on Jul 02, 2018

spp_ipv6_flowlabel and spp_dscp are added to sctp_paddrparams in
this patch so that users can set the sctp_sock/asoc/transport dscp
and flowlabel with spp_flags SPP_IPV6_FLOWLABEL or SPP_DSCP via
SCTP_PEER_ADDR_PARAMS, as described in section 8.1.12 of RFC 6458.

As said in the last patch, it uses '| 0x100000' or '| 0x1' to mark
that flowlabel or dscp is set, so that their values can be set
to 0.

Note that to guarantee that an old app built with old kernel
headers could work on the newer kernel, the param's check in
sctp_g/setsockopt_peer_addr_params() is also improved, which
follows the way that sctp_g/setsockopt_delayed_ack() or some
other sockopts' process that accept two types of params does.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sctp: add support for dscp and flowlabel per transport · 8a9c58d2

Committed by Xin Long on Jul 02, 2018

Like some other per transport params, flowlabel and dscp are added
in transport, asoc and sctp_sock. By default, transport sets its
value from asoc's, and asoc does it from sctp_sock. flowlabel
only works for ipv6 transport.

Other than that they need to be passed down in sctp_xmit, flow4/6
also needs to set them before looking up route in get_dst.

Note that it uses '& 0x100000' to check if flowlabel is set and
'& 0x1' (tos 1st bit is unused) to check if dscp is set by users,
so that they could be set to 0 by sockopt in next patch.
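The marker-bit trick can be sketched in a few lines: a spare bit records "the user set this", so a stored value of 0 is distinguishable from "never set". The ex_* names and the value masks (20 bits for the flow label, the upper six tos bits for dscp) are assumptions made for this illustration; only the 0x100000 and 0x1 marker bits come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EX_FLOWLABEL_SET_MASK 0x100000u	/* marker bit from the commit message */
#define EX_FLOWLABEL_VAL_MASK 0x0fffffu	/* assumed: 20-bit IPv6 flow label */
#define EX_DSCP_SET_MASK      0x1u	/* marker bit: lowest tos bit */
#define EX_DSCP_VAL_MASK      0xfcu	/* assumed: upper six bits of tos */

/* Store a flow label together with the "is set" marker. */
uint32_t ex_flowlabel_store(uint32_t label)
{
	assert((label & ~EX_FLOWLABEL_VAL_MASK) == 0);
	return label | EX_FLOWLABEL_SET_MASK;
}

bool ex_flowlabel_is_set(uint32_t stored)
{
	return (stored & EX_FLOWLABEL_SET_MASK) != 0;
}

uint32_t ex_flowlabel_value(uint32_t stored)
{
	return stored & EX_FLOWLABEL_VAL_MASK;
}

/* Store a dscp value (already positioned in the tos byte) with its marker. */
uint8_t ex_dscp_store(uint8_t tos)
{
	return (uint8_t)((tos & EX_DSCP_VAL_MASK) | EX_DSCP_SET_MASK);
}

bool ex_dscp_is_set(uint8_t stored)
{
	return (stored & EX_DSCP_SET_MASK) != 0;
}

uint8_t ex_dscp_value(uint8_t stored)
{
	return stored & EX_DSCP_VAL_MASK;
}
```

With this encoding, a zero-initialised field reads as "not set", while an explicit 0 from the user is stored as 0x100000 or 0x1 respectively.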
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: add __ip_queue_xmit() that supports tos param · 69b9e1e0

Committed by Xin Long on Jul 02, 2018

This patch introduces __ip_queue_xmit(), through which the callers
can pass tos param into it without having to set inet->tos. For
ipv6, ip6_xmit() already allows passing tclass parameter.

It's needed when some transport protocol doesn't use inet->tos,
like sctp's per transport dscp, which will be added in next patch.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

02 Jul, 2018 · 6 commits

net: expose sk wmem in sock_exceed_buf_limit tracepoint · d6f19938

Committed by Yafang Shao on Jul 01, 2018

Currently trace_sock_exceed_buf_limit() only shows rmem info,
but the wmem limit may also be hit.
So expose wmem info in this tracepoint as well.

Regarding memcg, I think it is better to introduce a new tracepoint (if
that is needed), i.e. trace_memcg_limit_hit, rather than show memcg info in
trace_sock_exceed_buf_limit.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: fix use-after-free in GRO with ESP · 603d4cf8

Committed by Sabrina Dubroca on Jun 30, 2018

Since the addition of GRO for ESP, gro_receive can consume the skb and
return -EINPROGRESS. In that case, the lower layer GRO handler cannot
touch the skb anymore.

Commit 5f114163 ("net: Add a skb_gro_flush_final helper.") converted
some of the gro_receive handlers that can lead to ESP's gro_receive so
that they wouldn't access the skb when -EINPROGRESS is returned, but
missed other spots, mainly in tunneling protocols.

This patch finishes the conversion to using skb_gro_flush_final(), and
adds a new helper, skb_gro_flush_final_remcsum(), used in VXLAN and
GUE.

Fixes: 5f114163 ("net: Add a skb_gro_flush_final helper.")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Enable Tx queue selection based on Rx queues · fc9bab24

Committed by Amritha Nambiar on Jun 29, 2018

This patch adds support to pick Tx queue based on the Rx queue(s) map
configuration set by the admin through the sysfs attribute
for each Tx queue. If the user configuration for receive queue(s) map
does not apply, then the Tx queue selection falls back to CPU(s) map
based selection and finally to hashing.
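The fallback order described above (Rx queue map, then CPU map, then hashing) can be sketched as a simple chain. ex_pick_tx_queue() and its parameters are hypothetical names for this illustration; each map lookup is assumed to have already produced either a queue index or -1 for "does not apply", and the modulo stands in for whatever hash-to-queue scaling the kernel actually uses.

```c
#include <assert.h>

/*
 * Pick a Tx queue index.
 *   rxq_map_result: queue from the Rx queue(s) map, or -1 if it does not apply
 *   cpu_map_result: queue from the CPU(s) map, or -1 if it does not apply
 *   flow_hash:      flow hash used as the last resort
 */
int ex_pick_tx_queue(int rxq_map_result, int cpu_map_result,
		     unsigned int flow_hash, unsigned int num_tx_queues)
{
	assert(num_tx_queues > 0);

	if (rxq_map_result >= 0)
		return rxq_map_result;		/* admin-configured Rx queue map */
	if (cpu_map_result >= 0)
		return cpu_map_result;		/* fall back to the CPU map */
	return (int)(flow_hash % num_tx_queues);	/* finally, hashing */
}
```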
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Record receive queue number for a connection · c6345ce7

Committed by Amritha Nambiar on Jun 29, 2018

This patch adds a new field to sock_common 'skc_rx_queue_mapping'
which holds the receive queue number for the connection. The Rx queue
is marked in tcp_finish_connect() to allow a client app to do
SO_INCOMING_NAPI_ID after a connect() call to get the right queue
association for a socket. Rx queue is also marked in tcp_conn_request()
to allow syn-ack to go on the right tx-queue associated with
the queue on which syn is received.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sock: Change tx_queue_mapping in sock_common to unsigned short · 755c31cd

Committed by Amritha Nambiar on Jun 29, 2018

Change 'skc_tx_queue_mapping' field in sock_common structure from
'int' to 'unsigned short' type with ~0 indicating unset and
other positive queue values being set. This will accommodate adding
a new 'unsigned short' field in sock_common in the next patch for
rx_queue_mapping.
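A minimal sketch of the "~0 means unset" convention for an unsigned short field follows. The ex_sock struct, the EX_NO_QUEUE_MAPPING name, and the -1 return for "unset" are assumptions chosen for this illustration rather than the patch's actual helpers.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* All bits set, i.e. (unsigned short)~0, marks "no queue recorded". */
#define EX_NO_QUEUE_MAPPING USHRT_MAX

struct ex_sock {
	unsigned short tx_queue_mapping;
};

void ex_tx_queue_clear(struct ex_sock *sk)
{
	assert(sk);
	sk->tx_queue_mapping = EX_NO_QUEUE_MAPPING;
}

/* Record a queue; rejects values that would collide with the sentinel. */
bool ex_tx_queue_set(struct ex_sock *sk, int queue)
{
	assert(sk);
	if (queue < 0 || queue >= EX_NO_QUEUE_MAPPING)
		return false;
	sk->tx_queue_mapping = (unsigned short)queue;
	return true;
}

/* Returns the recorded queue, or -1 when none is set. */
int ex_tx_queue_get(const struct ex_sock *sk)
{
	assert(sk);
	if (sk->tx_queue_mapping == EX_NO_QUEUE_MAPPING)
		return -1;
	return sk->tx_queue_mapping;
}
```

Narrowing the field to two bytes is what leaves room for a second unsigned short next to it, at the cost of one unusable queue number.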
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Refactor XPS for CPUs and Rx queues · 80d19669

Committed by Amritha Nambiar on Jun 29, 2018

Refactor XPS code to support Tx queue selection based on
CPU(s) map or Rx queue(s) map.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

30 Jun, 2018 · 7 commits

tipc: extend sock diag for group communication · a1be5a20

Committed by GhantaKrishnamurthy MohanKrishna on Jun 29, 2018

This commit extends the existing TIPC socket diagnostics framework
for information related to TIPC group communication.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/smc: add SMC-D diag support · 4b1b7d3b

Committed by Hans Wippel on Jun 28, 2018

This patch adds diag support for SMC-D.
Signed-off-by: Hans Wippel <hwippel@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Suggested-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/smc: add pnetid support for SMC-D and ISM · 1619f770

Committed by Hans Wippel on Jun 28, 2018

SMC-D relies on PNETIDs to find usable SMC-D/ISM devices for a SMC
connection. This patch adds SMC-D/ISM support to the current PNETID
implementation.
Signed-off-by: Hans Wippel <hwippel@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Suggested-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/smc: add base infrastructure for SMC-D and ISM · c6ba7c9b

Committed by Hans Wippel on Jun 28, 2018

SMC supports two variants: SMC-R and SMC-D. For data transport, SMC-R
uses RDMA devices, SMC-D uses so-called Internal Shared Memory (ISM)
devices. An ISM device only allows shared memory communication between
SMC instances on the same machine. For example, this allows virtual
machines on the same host to communicate via SMC without RDMA devices.

This patch adds the base infrastructure for SMC-D and ISM devices to
the existing SMC code. It contains the following:

* ISM driver interface:
  This interface allows an ISM driver to register ISM devices in SMC. In
  the process, the driver provides a set of device ops for each device.
  SMC uses these ops to execute SMC specific operations on or transfer
  data over the device.

* Core SMC-D link group, connection, and buffer support:
  Link groups, SMC connections and SMC buffers (in smc_core) are
  extended to support SMC-D.

* SMC type checks:
  Some type checks are added to prevent using SMC-R specific code for
  SMC-D and vice versa.

To actually use SMC-D, additional changes to pnetid, CLC, CDC, etc. are
required. These are added in follow-up patches.
Signed-off-by: Hans Wippel <hwippel@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Suggested-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/smc: add pnetid support · 0afff91c

Committed by Ursula Braun on Jun 28, 2018

s390 hardware supports the definition of a so-called Physical NETwork
IDentifier (PNETID for short) per network device port. These PNETIDs
can be used to identify network devices that are attached to the same
physical network (broadcast domain).

On s390 try to use the PNETID of the ethernet device port used for
initial connecting, and derive the IB device port used for SMC RDMA
traffic.

On platforms without PNETID support fall back to the existing
solution of a configured pnet table.
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: add new SNMP counter for drops when try to queue in rcv queue · ea5d0c32

Committed by Yafang Shao on Jun 28, 2018

When sk_rmem_alloc is larger than the receive buffer and we can't
schedule more memory for it, the skb will be dropped.

In the above situation, if this skb is put into the ofo queue,
LINUX_MIB_TCPOFODROP is incremented to track it.

However, if this skb is put into the receive queue, there's no record.
So a new SNMP counter is introduced to track this behavior.

LINUX_MIB_TCPRCVQDROP:  Number of packets meant to be queued in rcv queue
			but dropped because socket rcvbuf limit hit.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bpf: undo prog rejection on read-only lock failure · 85782e03

Committed by Daniel Borkmann on Jun 28, 2018

Partially undo commit 9facc336 ("bpf: reject any prog that failed
read-only lock") since it caused a regression, that is, syzkaller was
able to manage to cause a panic via fault injection deep in set_memory_ro()
path by letting an allocation fail: In x86's __change_page_attr_set_clr()
it was able to change the attributes of the primary mapping but not in
the alias mapping via cpa_process_alias(), so the second, inner call
to the __change_page_attr() via __change_page_attr_set_clr() had to split
a larger page and failed in the alloc_pages() with the artificially triggered
allocation error which is then propagated down to the call site.

Thus, for set_memory_ro() this means that it returned with an error, but
from debugging a probe_kernel_write() revealed EFAULT on that memory since
the primary mapping succeeded to get changed. Therefore the subsequent
hdr->locked = 0 reset triggered the panic as it was performed on read-only
memory, so call-site assumptions were in fact wrong to assume that it would
either succeed /or/ not succeed at all since there's no such rollback in
set_memory_*() calls from partial change of mappings, in other words, we're
left in a state that is "half done". A later undo via set_memory_rw() is
succeeding though due to matching permissions on that part (aka due to the
try_preserve_large_page() succeeding). While reproducing locally with
explicitly triggering this error, the initial splitting only happens on
rare occasions and in real world it would additionally need oom conditions,
but that said, it could partially fail. Therefore, it is definitely wrong
to bail out on set_memory_ro() error and reject the program with the
set_memory_*() semantics we have today. Shouldn't have gone the extra mile
since no other user in tree today in fact checks for any set_memory_*()
errors, e.g. neither module_enable_ro() / module_disable_ro() for module
RO/NX handling which is mostly default these days nor kprobes core with
alloc_insn_page() / free_insn_page() as examples that could be invoked long
after bootup and original 314beb9b ("x86: bpf_jit_comp: secure bpf jit
against spraying attacks") did neither when it got first introduced to BPF
so "improving" with bailing out was clearly not right when set_memory_*()
cannot handle it today.

Kees suggested that if set_memory_*() can fail, we should annotate it with
__must_check, and all callers need to deal with it gracefully given those
set_memory_*() markings aren't "advisory", but they're expected to actually
do what they say. This might be an option worth pursuing in the future
but would at the same time require that set_memory_*() calls from supporting
archs are guaranteed to be "atomic" in that they provide rollback if part
of the range fails, once that happened, the transition from RW -> RO could
be made more robust that way, while subsequent RO -> RW transition /must/
continue guaranteeing to always succeed the undo part.

Reported-by: syzbot+a4eb8c7766952a1ca872@syzkaller.appspotmail.com
Reported-by: syzbot+d866d1925855328eac3b@syzkaller.appspotmail.com
Fixes: 9facc336 ("bpf: reject any prog that failed read-only lock")
Cc: Laura Abbott <labbott@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

29 Jun, 2018 · 10 commits

net/sched: add tunnel option support to act_tunnel_key · 0ed5269f

Committed by Simon Horman on Jun 26, 2018

Allow setting tunnel options using the act_tunnel_key action.

Options are expressed as class:type:data and multiple options
may be listed using a comma delimiter.

 # ip link add name geneve0 type geneve dstport 0 external
 # tc qdisc add dev eth0 ingress
 # tc filter add dev eth0 protocol ip parent ffff: \
     flower indev eth0 \
        ip_proto udp \
        action tunnel_key \
            set src_ip 10.0.99.192 \
            dst_ip 10.0.99.193 \
            dst_port 6081 \
            id 11 \
            geneve_opts 0102:80:00800022,0102:80:00800022 \
    action mirred egress redirect dev geneve0
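The textual option format above (class:type:data items separated by commas) can be illustrated with a small user-space parser. This is only a sketch of the string format shown in the example; the ex_tun_opt struct and ex_* functions are hypothetical, the fields are assumed to be hexadecimal as in the geneve_opts example, and it is neither the tc parser nor the kernel's netlink handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct ex_tun_opt {
	unsigned int  opt_class;	/* up to 4 hex digits */
	unsigned int  type;		/* up to 2 hex digits */
	unsigned char data[32];		/* decoded option data */
	size_t        data_len;
};

/* Parse one "class:type:data" item; returns 0 on success, -1 if malformed. */
int ex_parse_tun_opt(const char *item, struct ex_tun_opt *out)
{
	char hex[65];
	int consumed = 0;

	assert(item && out);
	if (sscanf(item, "%4x:%2x:%64[0-9a-fA-F]%n",
		   &out->opt_class, &out->type, hex, &consumed) != 3)
		return -1;
	if (item[consumed] != '\0' || strlen(hex) % 2 != 0)
		return -1;

	out->data_len = strlen(hex) / 2;
	for (size_t i = 0; i < out->data_len; i++) {
		unsigned int byte;

		sscanf(hex + 2 * i, "%2x", &byte);	/* two hex digits per byte */
		out->data[i] = (unsigned char)byte;
	}
	return 0;
}

/* Split a comma-delimited list; returns the number of options or -1. */
int ex_parse_tun_opts(const char *list, struct ex_tun_opt *opts, int max_opts)
{
	char buf[256];
	int count = 0;

	if (strlen(list) >= sizeof(buf))
		return -1;
	strcpy(buf, list);
	for (char *tok = strtok(buf, ","); tok; tok = strtok(NULL, ",")) {
		if (count == max_opts || ex_parse_tun_opt(tok, &opts[count]) != 0)
			return -1;
		count++;
	}
	return count;
}
```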
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: check tunnel option type in tunnel flags · 256c87c1

Committed by Pieter Jansen van Vuuren on Jun 26, 2018

Check the tunnel option type stored in tunnel flags when creating options
for tunnels, thereby ensuring we do not set geneve, vxlan or erspan tunnel
options on interfaces that are not associated with them.

Make sure all users of the infrastructure set the correct flags; for the BPF
helper we have to set all bits to keep backward compatibility.
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sg: remove ->sg_magic member · 9544bc53

Committed by Jens Axboe on Jun 29, 2018

This was introduced more than a decade ago when sg chaining was
added, but we never really caught anything with it. The scatterlist
entry size can be critical, since drivers allocate it, so remove
the magic member. Recently it's been triggering allocation stalls
and failures in NVMe.
Tested-by: Jordan Glover <Golden_Miller83@protonmail.ch>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

aio: mark __aio_sigset::sigmask const · 2cd3ae21

Committed by Avi Kivity on Jun 29, 2018

io_pgetevents() will not change the signal mask.  Mark it const to make
it clear and to reduce the need for casts in user code.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Avi Kivity <avi@scylladb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[hch: reapply the patch that got incorrectly reverted]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

sctp: add support for SCTP_REUSE_PORT sockopt · b0e9a2fe

Committed by Xin Long on Jun 28, 2018

This feature is actually already supported by sk->sk_reuse which can be
set by socket level opt SO_REUSEADDR. But it's not working exactly as
RFC6458 demands in section 8.1.27, like:

  - This option only supports one-to-one style SCTP sockets
  - This socket option must not be used after calling bind()
    or sctp_bindx().

Besides, the SCTP_REUSE_PORT sockopt should be provided for user programs.
Otherwise, programs using SCTP_REUSE_PORT from other systems will not
work on Linux.

To separate it from the socket level version, this patch adds 'reuse' in
sctp_sock and it works pretty much as sk->sk_reuse, but with some extra
setup limitations that are needed when it is being enabled.

"It should be noted that the behavior of the socket-level socket option
to reuse ports and/or addresses for SCTP sockets is unspecified", so it
leaves SO_REUSEADDR as is for the compatibility.

Note that the name SCTP_REUSE_PORT is somewhat confusing, as its
functionality is nearly identical to SO_REUSEADDR, but with some
extra restrictions. Here it uses 'reuse' in sctp_sock instead of
'reuseport'. As for sk->sk_reuseport support for SCTP, it will be
added in another patch.

Thanks to Neil for making this clear.

v1->v2:
  - add sctp_sk->reuse to separate it from the socket level version.
v2->v3:
  - improve changelog according to Marcelo's suggestion.
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ila: Flush netlink command to clear xlat table · b6e71bde

Committed by Tom Herbert on Jun 27, 2018

Add ILA_CMD_FLUSH netlink command to clear the ILA translation table.
Signed-off-by: Tom Herbert <tom@quantonium.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

bpf: Change bpf_fib_lookup to return lookup status · 4c79579b

Committed by David Ahern on Jun 26, 2018

For ACLs implemented using either FIB rules or FIB entries, the BPF
program needs the FIB lookup status to be able to drop the packet.
Since the bpf_fib_lookup API has not reached a released kernel yet,
change the return code to contain an encoding of the FIB lookup
result and return the nexthop device index in the params struct.

In addition, inform the BPF program of any post FIB lookup reason as
to why the packet needs to go up the stack.

The fib result for unicast routes must have an egress device, so remove
the check that it is non-NULL.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

include/linux/dax.h: dax_iomap_fault() returns vm_fault_t · f77bc3a8

Committed by Souptick Joarder on Jun 27, 2018

Commit 1c8f4220 ("mm: change return type to vm_fault_t") missed a
conversion.  It's not a big problem at present because mainline is still
using

	typedef int vm_fault_t;

Fixes: 1c8f4220 ("mm: change return type to vm_fault_t")
Link: http://lkml.kernel.org/r/20180620172046.GA27894@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

slub: fix failure when we delete and create a slab cache · d50d82fa

Committed by Mikulas Patocka on Jun 27, 2018

In kernel 4.17 I removed some code from dm-bufio that did slab cache
merging (commit 21bb1327: "dm bufio: remove code that merges slab
caches") - both slab and slub support merging caches with identical
attributes, so dm-bufio now just calls kmem_cache_create and relies on
implicit merging.

This uncovered a bug in the slub subsystem - if we delete a cache and
immediately create another cache with the same attributes, it fails
because of duplicate filename in /sys/kernel/slab/.  The slub subsystem
offloads freeing the cache to a workqueue - and if we create the new
cache before the workqueue runs, it complains because of duplicate
filename in sysfs.

This patch fixes the bug by moving the call of kobject_del from
sysfs_slab_remove_workfn to shutdown_cache.  kobject_del must be called
while we hold slab_mutex - so that the sysfs entry is deleted before a
cache with the same attributes could be created.

Running device-mapper-test-suite with:

  dmtest run --suite thin-provisioning -n /commit_failure_causes_fallback/

triggered:

  Buffer I/O error on dev dm-0, logical block 1572848, async page read
  device-mapper: thin: 253:1: metadata operation 'dm_pool_alloc_data_block' failed: error = -5
  device-mapper: thin: 253:1: aborting current metadata transaction
  sysfs: cannot create duplicate filename '/kernel/slab/:a-0000144'
  CPU: 2 PID: 1037 Comm: kworker/u48:1 Not tainted 4.17.0.snitm+ #25
  Hardware name: Supermicro SYS-1029P-WTR/X11DDW-L, BIOS 2.0a 12/06/2017
  Workqueue: dm-thin do_worker [dm_thin_pool]
  Call Trace:
   dump_stack+0x5a/0x73
   sysfs_warn_dup+0x58/0x70
   sysfs_create_dir_ns+0x77/0x80
   kobject_add_internal+0xba/0x2e0
   kobject_init_and_add+0x70/0xb0
   sysfs_slab_add+0xb1/0x250
   __kmem_cache_create+0x116/0x150
   create_cache+0xd9/0x1f0
   kmem_cache_create_usercopy+0x1c1/0x250
   kmem_cache_create+0x18/0x20
   dm_bufio_client_create+0x1ae/0x410 [dm_bufio]
   dm_block_manager_create+0x5e/0x90 [dm_persistent_data]
   __create_persistent_data_objects+0x38/0x940 [dm_thin_pool]
   dm_pool_abort_metadata+0x64/0x90 [dm_thin_pool]
   metadata_operation_failed+0x59/0x100 [dm_thin_pool]
   alloc_data_block.isra.53+0x86/0x180 [dm_thin_pool]
   process_cell+0x2a3/0x550 [dm_thin_pool]
   do_worker+0x28d/0x8f0 [dm_thin_pool]
   process_one_work+0x171/0x370
   worker_thread+0x49/0x3f0
   kthread+0xf8/0x130
   ret_from_fork+0x35/0x40
  kobject_add_internal failed for :a-0000144 with -EEXIST, don't try to register things with the same name in the same directory.
  kmem_cache_create(dm_bufio_buffer-16) failed with error -17

Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1806151817130.6333@file01.intranet.prod.int.rdu2.redhat.com
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reported-by: Mike Snitzer <snitzer@redhat.com>
Tested-by: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL · a11e1d43

Committed by Linus Torvalds on Jun 28, 2018

The poll() changes were not well thought out, and completely
unexplained.  They also caused a huge performance regression, because
"->poll()" was no longer a trivial file operation that just called down
to the underlying file operations, but instead did at least two indirect
calls.

Indirect calls are sadly slow now with the Spectre mitigation, but the
performance problem could at least be largely mitigated by changing the
"->get_poll_head()" operation to just have a per-file-descriptor pointer
to the poll head instead.  That gets rid of one of the new indirections.

But that doesn't fix the new complexity that is completely unwarranted
for the regular case.  The (undocumented) reason for the poll() changes
was some alleged AIO poll race fixing, but we don't make the common case
slower and more complex for some uncommon special case, so this all
really needs way more explanations and most likely a fundamental
redesign.

[ This revert is a revert of about 30 different commits, not reverted
  individually because that would just be unnecessarily messy  - Linus ]

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

28 Jun, 2018 · 3 commits

netfilter: check if the socket netns is correct. · f5646501

Committed by Flavio Leitner on Jun 27, 2018

Netfilter assumes that if the socket is present in the skb, then
it can be used because that reference is cleaned up while the skb
is crossing netns.

We want to change that to preserve the socket reference in a future
patch, so this is a preparation updating netfilter to check if the
socket netns matches before using it.
Signed-off-by: Flavio Leitner <fbl@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

net sched actions: fix coding style in pedit headers · d020d455

Committed by Roman Mashak on Jun 27, 2018

Fix coding style issues in tc pedit headers detected by the
checkpatch script.
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

netem: slotting with non-uniform distribution · 0a9fe5c3

Committed by Yousuk Seung on Jun 27, 2018

Extend slotting with support for non-uniform distributions. This is
similar to netem's non-uniform distribution delay feature.

Commit f043efeae2f1 ("netem: support delivering packets in delayed
time slots") added the slotting feature to approximate the behaviors
of media with packet aggregation but only supported a uniform
distribution for delays between transmission attempts. Tests with TCP
BBR with emulated wifi links with non-uniform distributions produced
more useful results.

Syntax:
   slot dist DISTRIBUTION DELAY JITTER [packets MAX_PACKETS] \
      [bytes MAX_BYTES]

The syntax and use of the distribution table is the same as in the
non-uniform distribution delay feature. A file DISTRIBUTION must be
present in TC_LIB_DIR (e.g. /usr/lib/tc) containing numbers scaled by
NETEM_DIST_SCALE. A random value x is selected from the table and it
takes DELAY + ( x * JITTER ) as delay. Correlation between values is not
supported.
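The delay formula above, DELAY + ( x * JITTER ), can be worked through in a short sketch. The function name ex_slot_delay_ns() is hypothetical, the scale factor of 8192 for NETEM_DIST_SCALE is an assumption stated here rather than taken from the commit message, and the arithmetic is simplified compared with whatever overflow handling the kernel performs.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value of NETEM_DIST_SCALE: table entries are x * 8192. */
#define EX_NETEM_DIST_SCALE 8192

/*
 * Compute a slot delay from one distribution table entry.
 * table_value is the scaled random value x read from the table.
 */
int64_t ex_slot_delay_ns(int64_t delay_ns, int64_t jitter_ns,
			 int16_t table_value)
{
	assert(jitter_ns >= 0);
	return delay_ns + (jitter_ns * table_value) / EX_NETEM_DIST_SCALE;
}
```

Under these assumptions, a table entry of 8192 corresponds to x = 1, so with DELAY = 800us and JITTER = 100us it yields a 900us slot delay, and an entry of -8192 yields 700us.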

Examples:
  Normal distribution delay with mean = 800us and stdev = 100us.
  > tc qdisc add dev eth0 root netem slot dist normal 800us 100us

  Optionally set the max slot size in bytes and/or packets.
  > tc qdisc add dev eth0 root netem slot dist normal 800us 100us \
    bytes 64k packets 42
Signed-off-by: Yousuk Seung <ysseung@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

27 Jun, 2018 · 3 commits

nfp: reject binding to shared blocks · 951a8ee6

Committed by John Hurley on Jun 25, 2018

TC shared blocks allow multiple qdiscs to be grouped together and filters
shared between them. Currently the chains of filters attached to a block
are only flushed when the block is removed. If a qdisc is removed from a
block but the block still exists, flow del messages are not passed to the
callback registered for that qdisc. For the NFP, this presents the
possibility of rules still existing in hw when they should be removed.

Prevent binding to shared blocks until the kernel can send per qdisc del
messages when block unbinds occur.

tcf_block_shared() was not used outside of the core until now, so also
add an empty implementation for builds with CONFIG_NET_CLS=n.

Fixes: 48617387 ("net: sched: introduce shared filter blocks infrastructure")
Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: E-Switch, Avoid setup attempt if not being e-switch manager · 0efc8562

Committed by Or Gerlitz on May 31, 2018

In smartnic env, the host (PF) driver might not be an e-switch
manager, hence the FW will err on driver attempts to deal with
setting/unsetting the eswitch and as a result the overall setup
of sriov will fail.

Fix that by avoiding the operation if e-switch management is not
allowed for this driver instance. While here, move to use the
correct name for the esw manager capability name.

Fixes: 81848731 ('net/mlx5: E-Switch, Add SR-IOV (FDB) support')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reported-by: Guy Kushnir <guyk@mellanox.com>
Reviewed-by: Eli Cohen <eli@melloanox.com>
Tested-by: Eli Cohen <eli@melloanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

block: Fix transfer when chunk sectors exceeds max · 15bfd21f

Committed by Keith Busch on Jun 26, 2018

A device may have boundary restrictions where the number of sectors
between boundaries exceeds its max transfer size. In this case, we need
to cap the max size to the smaller of the two limits.
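The capping rule can be sketched as taking the minimum of two limits: the sectors remaining until the next chunk boundary and the device's max transfer size. ex_max_size_offset() and its parameter names are hypothetical, and the sketch assumes chunk_sectors is a power of two so the boundary distance can be computed with a mask.

```c
#include <assert.h>

/*
 * Largest transfer, in sectors, allowed to start at `offset`.
 *   chunk_sectors: distance between boundaries (0 means no boundary limit)
 *   max_sectors:   device max transfer size
 */
unsigned int ex_max_size_offset(unsigned long long offset,
				unsigned int chunk_sectors,
				unsigned int max_sectors)
{
	unsigned int to_boundary;

	if (chunk_sectors == 0)
		return max_sectors;

	/* Power-of-two chunk size assumed for the mask below. */
	assert((chunk_sectors & (chunk_sectors - 1)) == 0);

	to_boundary = chunk_sectors -
		      (unsigned int)(offset & (chunk_sectors - 1));

	/* Cap to the smaller of the two limits. */
	return to_boundary < max_sectors ? to_boundary : max_sectors;
}
```

Without the final minimum, a request starting at a chunk boundary could be sized up to chunk_sectors even when that exceeds what the device can transfer at once.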
Reported-by: Jitendra Bhivare <jitendra.bhivare@broadcom.com>
Tested-by: Jitendra Bhivare <jitendra.bhivare@broadcom.com>
Cc: <stable@vger.kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
