提交 · df560056d960a3e164c179d89770d5a51b798537 · openanolis / cloud-kernel

07 1月, 2017 3 次提交

udp: inuse checks can quit early for reuseport · df560056

由 Eric Garver 提交于 1月 05, 2017

UDP lib inuse checks will walk the entire hash bucket to check if the
portaddr is in use. In the case of reuseport we can stop searching when
we find a matching reuseport.

On a 16-core VM a test program that spawns 16 threads that each bind to
1024 sockets (one per 10ms) takes 1m45s. With this change it takes 11s.

Also add a cond_resched() when the port is not specified.
Signed-off-by: NEric Garver <e@erig.me>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

df560056

net: ipv4: make fib_select_default static · c7b371e3

由 David Ahern 提交于 1月 05, 2017

fib_select_default has a single caller within the same file.
Make it static.
Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c7b371e3

net: ipv4: Simplify rt_fill_info · 0c8d803f

由 David Ahern 提交于 1月 05, 2017

rt_fill_info has only 1 caller and both of the last 2 args -- nowait
and flags -- are hardcoded to 0. Given that remove them as input arguments
and simplify rt_fill_info accordingly.
Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0c8d803f

06 1月, 2017 1 次提交

tcp: provide timestamps for partial writes · ad02c4f5

由 Soheil Hassas Yeganeh 提交于 1月 04, 2017

For TCP sockets, TX timestamps are only captured when the user data
is successfully and fully written to the socket. In many cases,
however, TCP writes can be partial for which no timestamp is
collected.

Collect timestamps whenever any user data is (fully or partially)
copied into the socket. Pass tcp_write_queue_tail to tcp_tx_timestamp
instead of the local skb pointer since it can be set to NULL on
the error path.

Note that tcp_write_queue_tail can be NULL, even if bytes have been
copied to the socket. This is because acknowledgements are being
processed in tcp_sendmsg(), and by the time tcp_tx_timestamp is
called tcp_write_queue_tail can be NULL. For such cases, this patch
does not collect any timestamps (i.e., it is best-effort).

This patch is written with suggestions from Willem de Bruijn and
Eric Dumazet.

Change-log V1 -> V2:
	- Use sockc.tsflags instead of sk->sk_tsflags.
	- Use the same code path for normal writes and errors.
Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
Acked-by: NYuchung Cheng <ycheng@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Acked-by: NWillem de Bruijn <willemb@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ad02c4f5

05 1月, 2017 1 次提交

scm: remove use CMSG{_COMPAT}_ALIGN(sizeof(struct {compat_}cmsghdr)) · 1ff8cebf

由 yuan linyu 提交于 1月 03, 2017

sizeof(struct cmsghdr) and sizeof(struct compat_cmsghdr) already aligned.
remove use CMSG_ALIGN(sizeof(struct cmsghdr)) and
CMSG_COMPAT_ALIGN(sizeof(struct compat_cmsghdr)) keep code consistent.
Signed-off-by: Nyuan linyu <Linyu.Yuan@alcatel-sbell.com.cn>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1ff8cebf

03 1月, 2017 3 次提交

ipmr, ip6mr: add RTNH_F_UNRESOLVED flag to unresolved cache entries · 1708ebc9

由 Nikolay Aleksandrov 提交于 1月 03, 2017

While working with ipmr, we noticed that it is impossible to determine
if an entry is actually unresolved or its IIF interface has disappeared
(e.g. virtual interface got deleted). These entries look almost
identical to user-space when dumping or receiving notifications. So in
order to recognize them add a new RTNH_F_UNRESOLVED flag which is set when
sending an unresolved cache entry to user-space.
Suggested-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1708ebc9

ipv4: Do not allow MAIN to be alias for new LOCAL w/ custom rules · 5350d54f

由 Alexander Duyck 提交于 1月 02, 2017

In the case of custom rules being present we need to handle the case of the
LOCAL table being intialized after the new rule has been added.  To address
that I am adding a new check so that we can make certain we don't use an
alias of MAIN for LOCAL when allocating a new table.

Fixes: 0ddcf43d ("ipv4: FIB Local/MAIN table collapse")
Reported-by: NOliver Brunel <jjk@jjacky.com>
Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5350d54f

igmp: Make igmp group member RFC 3376 compliant · 7ababb78

由 Michal Tesar 提交于 1月 02, 2017

5.2. Action on Reception of a Query

 When a system receives a Query, it does not respond immediately.
 Instead, it delays its response by a random amount of time, bounded
 by the Max Resp Time value derived from the Max Resp Code in the
 received Query message.  A system may receive a variety of Queries on
 different interfaces and of different kinds (e.g., General Queries,
 Group-Specific Queries, and Group-and-Source-Specific Queries), each
 of which may require its own delayed response.

 Before scheduling a response to a Query, the system must first
 consider previously scheduled pending responses and in many cases
 schedule a combined response.  Therefore, the system must be able to
 maintain the following state:

 o A timer per interface for scheduling responses to General Queries.

 o A per-group and interface timer for scheduling responses to Group-
   Specific and Group-and-Source-Specific Queries.

 o A per-group and interface list of sources to be reported in the
   response to a Group-and-Source-Specific Query.

 When a new Query with the Router-Alert option arrives on an
 interface, provided the system has state to report, a delay for a
 response is randomly selected in the range (0, [Max Resp Time]) where
 Max Resp Time is derived from Max Resp Code in the received Query
 message.  The following rules are then used to determine if a Report
 needs to be scheduled and the type of Report to schedule.  The rules
 are considered in order and only the first matching rule is applied.

 1. If there is a pending response to a previous General Query
    scheduled sooner than the selected delay, no additional response
    needs to be scheduled.

 2. If the received Query is a General Query, the interface timer is
    used to schedule a response to the General Query after the
    selected delay.  Any previously pending response to a General
    Query is canceled.
--8<--

Currently the timer is rearmed with new random expiration time for
every incoming query regardless of possibly already pending report.
Which is not aligned with the above RFE.
It also might happen that higher rate of incoming queries can
postpone the report after the expiration time of the first query
causing group membership loss.

Now the per interface general query timer is rearmed only
when there is no pending report already scheduled on that interface or
the newly selected expiration time is before the already pending
scheduled report.
Signed-off-by: NMichal Tesar <mtesar@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7ababb78

31 12月, 2016 1 次提交

net: Allow IP_MULTICAST_IF to set index to L3 slave · 7bb387c5

由 David Ahern 提交于 12月 29, 2016

IP_MULTICAST_IF fails if sk_bound_dev_if is already set and the new index
does not match it. e.g.,

    ntpd[15381]: setsockopt IP_MULTICAST_IF 192.168.1.23 fails: Invalid argument

Relax the check in setsockopt to allow setting mc_index to an L3 slave if
sk_bound_dev_if points to an L3 master.

Make a similar change for IPv6. In this case change the device lookup to
take the rcu_read_lock avoiding a refcnt. The rcu lock is also needed for
the lookup of a potential L3 master device.

This really only silences a setsockopt failure since uses of mc_index are
secondary to sk_bound_dev_if if it is set. In both cases, if either index
is an L3 slave or master, lookups are directed to the same FIB table so
relaxing the check at setsockopt time causes no harm.

Patch is based on a suggested change by Darwin for a problem noted in
their code base.
Suggested-by: NDarwin Dingel <darwin.dingel@alliedtelesis.co.nz>
Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7bb387c5

30 12月, 2016 4 次提交

net: ipv4: dst for local input routes should use l3mdev if relevant · f5a0aab8

由 David Ahern 提交于 12月 29, 2016

IPv4 output routes already use l3mdev device instead of loopback for dst's
if it is applicable. Change local input routes to do the same.

This fixes icmp responses for unreachable UDP ports which are directed
to the wrong table after commit 9d1a6c4e because local_input
routes use the loopback device. Moving from ingress device to loopback
loses the L3 domain causing responses based on the dst to get to lost.

Fixes: 9d1a6c4e ("net: icmp_route_lookup should use rt dev to
		       determine L3 domain")
Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f5a0aab8

net: fix incorrect original ingress device index in PKTINFO · f0c16ba8

由 Wei Zhang 提交于 12月 29, 2016

When we send a packet for our own local address on a non-loopback
interface (e.g. eth0), due to the change had been introduced from
commit 0b922b7a ("net: original ingress device index in PKTINFO"), the
original ingress device index would be set as the loopback interface.
However, the packet should be considered as if it is being arrived via the
sending interface (eth0), otherwise it would break the expectation of the
userspace application (e.g. the DHCPRELEASE message from dhcp_release
binary would be ignored by the dnsmasq daemon, since it come from lo which
is not the interface dnsmasq bind to)

Fixes: 0b922b7a ("net: original ingress device index in PKTINFO")
Acked-by: NDavid Ahern <dsa@cumulusnetworks.com>
Signed-off-by: NWei Zhang <asuka.com@163.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f0c16ba8

ipv4: Namespaceify tcp_max_syn_backlog knob · fee83d09

由 Haishuang Yan 提交于 12月 28, 2016

Different namespace application might require different maximal
number of remembered connection requests.
Signed-off-by: NHaishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fee83d09

ipv4: Namespaceify tcp_tw_recycle and tcp_max_tw_buckets knob · 1946e672

由 Haishuang Yan 提交于 12月 28, 2016

Different namespace application might require fast recycling
TIME-WAIT sockets independently of the host.
Signed-off-by: NHaishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1946e672

28 12月, 2016 1 次提交

ipv4: Namespaceify tcp_tw_reuse knob · 56ab6b93

由 Haishuang Yan 提交于 12月 25, 2016

Different namespaces might have different requirements to reuse
TIME-WAIT sockets for new connections. This might be required in
cases where different namespace applications are in place which
require TIME_WAIT socket connections to be reduced independently
of the host.
Signed-off-by: NHaishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

56ab6b93

26 12月, 2016 1 次提交

ktime: Get rid of the union · 2456e855

由 Thomas Gleixner 提交于 12月 25, 2016

ktime is a union because the initial implementation stored the time in
scalar nanoseconds on 64 bit machine and in a endianess optimized timespec
variant for 32bit machines. The Y2038 cleanup removed the timespec variant
and switched everything to scalar nanoseconds. The union remained, but
become completely pointless.

Get rid of the union and just keep ktime_t as simple typedef of type s64.

The conversion was done with coccinelle and some manual mopping up.
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>

2456e855

25 12月, 2016 1 次提交

Replace <asm/uaccess.h> with <linux/uaccess.h> globally · 7c0f6ba6

由 Linus Torvalds 提交于 12月 24, 2016

This was entirely automated, using the script by Al:

  PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
  sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
        $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

to do the replacement at the end of the merge window.
Requested-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

7c0f6ba6

24 12月, 2016 1 次提交

inet: fix IP(V6)_RECVORIGDSTADDR for udp sockets · 39b2dd76

由 Willem de Bruijn 提交于 12月 22, 2016

Socket cmsg IP(V6)_RECVORIGDSTADDR checks that port range lies within
the packet. For sockets that have transport headers pulled, transport
offset can be negative. Use signed comparison to avoid overflow.

Fixes: e6afc8ac ("udp: remove headers from UDP packets before queueing")
Reported-by: NNisar Jagabar <njagabar@cloudmark.com>
Signed-off-by: NWillem de Bruijn <willemb@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

39b2dd76

23 12月, 2016 1 次提交

net: ipv4: Don't crash if passing a null sk to ip_do_redirect. · 7d995694

由 Lorenzo Colitti 提交于 12月 23, 2016

Commit e2d118a1 ("net: inet: Support UID-based routing in IP
protocols.") made ip_do_redirect call sock_net(sk) to determine
the network namespace of the passed-in socket. This crashes if sk
is NULL.

Fix this by getting the network namespace from the skb instead.

Fixes: e2d118a1 ("net: inet: Support UID-based routing in IP protocols.")
Signed-off-by: NLorenzo Colitti <lorenzo@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7d995694

22 12月, 2016 1 次提交

tcp: add a missing barrier in tcp_tasklet_func() · 0a9648f1

由 Eric Dumazet 提交于 12月 21, 2016

Madalin reported crashes happening in tcp_tasklet_func() on powerpc64

Before TSQ_QUEUED bit is cleared, we must ensure the changes done
by list_del(&tp->tsq_node); are committed to memory, otherwise
corruption might happen, as an other cpu could catch TSQ_QUEUED
clearance too soon.

We can notice that old kernels were immune to this bug, because
TSQ_QUEUED was cleared after a bh_lock_sock(sk)/bh_unlock_sock(sk)
section, but they could have missed a kick to write additional bytes,
when NIC interrupts for a given flow are spread to multiple cpus.

Affected TCP flows would need an incoming ACK or RTO timer to add more
packets to the pipe. So overall situation should be better now.

Fixes: b223feb9 ("tcp: tsq: add shortcut in tcp_tasklet_func()")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: NMadalin Bucur <madalin.bucur@nxp.com>
Tested-by: NMadalin Bucur <madalin.bucur@nxp.com>
Tested-by: NXing Lei <xing.lei@nxp.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0a9648f1

20 12月, 2016 1 次提交

ipv4: Should use consistent conditional judgement for ip fragment in... · 0a28cfd5

由 zheng li 提交于 12月 12, 2016

ipv4: Should use consistent conditional judgement for ip fragment in __ip_append_data and ip_finish_output

There is an inconsistent conditional judgement in __ip_append_data and
ip_finish_output functions, the variable length in __ip_append_data just
include the length of application's payload and udp header, don't include
the length of ip header, but in ip_finish_output use
(skb->len > ip_skb_dst_mtu(skb)) as judgement, and skb->len include the
length of ip header.

That causes some particular application's udp payload whose length is
between (MTU - IP Header) and MTU were fragmented by ip_fragment even
though the rst->dev support UFO feature.

Add the length of ip header to length in __ip_append_data to keep
consistent conditional judgement as ip_finish_output for ip fragment.
Signed-off-by: NZheng Li <james.z.li@ericsson.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0a28cfd5

18 12月, 2016 2 次提交

inet: Fix get port to handle zero port number with soreuseport set · 0643ee4f

由 Tom Herbert 提交于 12月 14, 2016

A user may call listen with binding an explicit port with the intent
that the kernel will assign an available port to the socket. In this
case inet_csk_get_port does a port scan. For such sockets, the user may
also set soreuseport with the intent a creating more sockets for the
port that is selected. The problem is that the initial socket being
opened could inadvertently choose an existing and unreleated port
number that was already created with soreuseport.

This patch adds a boolean parameter to inet_bind_conflict that indicates
rather soreuseport is allowed for the check (in addition to
sk->sk_reuseport). In calls to inet_bind_conflict from inet_csk_get_port
the argument is set to true if an explicit port is being looked up (snum
argument is nonzero), and is false if port scan is done.
Signed-off-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0643ee4f

inet: Don't go into port scan when looking for specific bind port · 9af7e923

由 Tom Herbert 提交于 12月 14, 2016

inet_csk_get_port is called with port number (snum argument) that may be
zero or nonzero. If it is zero, then the intent is to find an available
ephemeral port number to bind to. If snum is non-zero then the caller
is asking to allocate a specific port number. In the latter case we
never want to perform the scan in ephemeral port range. It is
conceivable that this can happen if the "goto again" in "tb_found:"
is done. This patch adds a check that snum is zero before doing
the "goto again".
Signed-off-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9af7e923

10 12月, 2016 4 次提交

udp: udp_rmem_release() should touch sk_rmem_alloc later · 02ab0d13

由 Eric Dumazet 提交于 12月 08, 2016

In flood situations, keeping sk_rmem_alloc at a high value
prevents producers from touching the socket.

It makes sense to lower sk_rmem_alloc only at the end
of udp_rmem_release() after the thread draining receive
queue in udp_recvmsg() finished the writes to sk_forward_alloc.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

02ab0d13

udp: add batching to udp_rmem_release() · 6b229cf7

由 Eric Dumazet 提交于 12月 08, 2016

If udp_recvmsg() constantly releases sk_rmem_alloc
for every read packet, it gives opportunity for
producers to immediately grab spinlocks and desperatly
try adding another packet, causing false sharing.

We can add a simple heuristic to give the signal
by batches of ~25 % of the queue capacity.

This patch considerably increases performance under
flood by about 50 %, since the thread draining the queue
is no longer slowed by false sharing.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6b229cf7

udp: copy skb->truesize in the first cache line · c84d9490

由 Eric Dumazet 提交于 12月 08, 2016

In UDP RX handler, we currently clear skb->dev before skb
is added to receive queue, because device pointer is no longer
available once we exit from RCU section.

Since this first cache line is always hot, lets reuse this space
to store skb->truesize and thus avoid a cache line miss at
udp_recvmsg()/udp_skb_destructor time while receive queue
spinlock is held.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c84d9490

udp: add busylocks in RX path · 4b272750

由 Eric Dumazet 提交于 12月 08, 2016

Idea of busylocks is to let producers grab an extra spinlock
to relieve pressure on the receive_queue spinlock shared by consumer.

This behavior is requested only once socket receive queue is above
half occupancy.

Under flood, this means that only one producer can be in line
trying to acquire the receive_queue spinlock.

These busylock can be allocated on a per cpu manner, instead of a
per socket one (that would consume a cache line per socket)

This patch considerably improves UDP behavior under stress,
depending on number of NIC RX queues and/or RPS spread.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4b272750

09 12月, 2016 2 次提交

udp: under rx pressure, try to condense skbs · c8c8b127

由 Eric Dumazet 提交于 12月 07, 2016

Under UDP flood, many softirq producers try to add packets to
UDP receive queue, and one user thread is burning one cpu trying
to dequeue packets as fast as possible.

Two parts of the per packet cost are :
- copying payload from kernel space to user space,
- freeing memory pieces associated with skb.

If socket is under pressure, softirq handler(s) can try to pull in
skb->head the payload of the packet if it fits.

Meaning the softirq handler(s) can free/reuse the page fragment
immediately, instead of letting udp_recvmsg() do this hundreds of usec
later, possibly from another node.

Additional gains :
- We reduce skb->truesize and thus can store more packets per SO_RCVBUF
- We avoid cache line misses at copyout() time and consume_skb() time,
and avoid one put_page() with potential alien freeing on NUMA hosts.

This comes at the cost of a copy, bounded to available tail room, which
is usually small. (We might have to fix GRO_MAX_HEAD which looks bigger
than necessary)

This patch gave me about 5 % increase in throughput in my tests.

skb_condense() helper could probably used in other contexts.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c8c8b127

icmp: correct return value of icmp_rcv() · f91c58d6

由 Zhang Shengju 提交于 12月 07, 2016

Currently, icmp_rcv() always return zero on a packet delivery upcall.

To make its behavior more compliant with the way this API should be
used, this patch changes this to let it return NET_RX_SUCCESS when the
packet is proper handled, and NET_RX_DROP otherwise.
Signed-off-by: NZhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f91c58d6

07 12月, 2016 9 次提交

netfilter: rpfilter: bypass ipv4 lbcast packets with zeronet source · 3b760dcb

由 Liping Zhang 提交于 12月 03, 2016

Otherwise, DHCP Discover packets(0.0.0.0->255.255.255.255) may be
dropped incorrectly.
Signed-off-by: NLiping Zhang <zlpnobody@gmail.com>
Acked-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

3b760dcb

netfilter: nft_fib_ipv4: initialize *dest to zero · e0ffdbc7

由 Liping Zhang 提交于 11月 23, 2016

Otherwise, if fib lookup fail, *dest will be filled with garbage value,
so reverse path filtering will not work properly:
 # nft add rule x prerouting fib saddr oif eq 0 drop

Fixes: f6d0cbcf ("netfilter: nf_tables: add fib expression")
Signed-off-by: NLiping Zhang <zlpnobody@gmail.com>
Acked-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

e0ffdbc7

netfilter: nft_fib: convert htonl to ntohl properly · 11583438

由 Liping Zhang 提交于 11月 23, 2016

Acctually ntohl and htonl are identical, so this doesn't affect
anything, but it is conceptually wrong.
Signed-off-by: NLiping Zhang <zlpnobody@gmail.com>
Acked-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

11583438

netfilter: x_tables: pack percpu counter allocations · ae0ac0ed

由 Florian Westphal 提交于 11月 22, 2016

instead of allocating each xt_counter individually, allocate 4k chunks
and then use these for counter allocation requests.

This should speed up rule evaluation by increasing data locality,
also speeds up ruleset loading because we reduce calls to the percpu
allocator.

As Eric points out we can't use PAGE_SIZE, page_allocator would fail on
arches with 64k page size.
Suggested-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

ae0ac0ed

netfilter: x_tables: pass xt_counters struct to counter allocator · f28e15ba

由 Florian Westphal 提交于 11月 22, 2016

Keeps some noise away from a followup patch.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

f28e15ba

netfilter: x_tables: pass xt_counters struct instead of packet counter · 4d31eef5

由 Florian Westphal 提交于 11月 22, 2016

On SMP we overload the packet counter (unsigned long) to contain
percpu offset.  Hide this from callers and pass xt_counters address
instead.

Preparation patch to allocate the percpu counters in page-sized batch
chunks.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

4d31eef5

netfilter: defrag: only register defrag functionality if needed · 834184b1

由 Florian Westphal 提交于 11月 15, 2016

nf_defrag modules for ipv4 and ipv6 export an empty stub function.
Any module that needs the defragmentation hooks registered simply 'calls'
this empty function to create a phony module dependency -- modprobe will
then load the defrag module too.

This extends netfilter ipv4/ipv6 defragmentation modules to delay the hook
registration until the functionality is requested within a network namespace
instead of module load time for all namespaces.

Hooks are only un-registered on module unload or when a namespace that used
such defrag functionality exits.

We have to use struct net for this as the register hooks can be called
before netns initialization here from the ipv4/ipv6 conntrack module
init path.

There is no unregister functionality support, defrag will always be
active once it was requested inside a net namespace.

The reason is that defrag has impact on nft and iptables rulesets
(without defrag we might see framents).
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

834184b1

Revert "dctcp: update cwnd on congestion event" · 343dfaa1

由 Florian Westphal 提交于 12月 06, 2016

Neal Cardwell says:
 If I am reading the code correctly, then I would have two concerns:
 1) Has that been tested? That seems like an extremely dramatic
    decrease in cwnd. For example, if the cwnd is 80, and there are 40
    ACKs, and half the ACKs are ECE marked, then my back-of-the-envelope
    calculations seem to suggest that after just 11 ACKs the cwnd would be
    down to a minimal value of 2 [..]
 2) That seems to contradict another passage in the draft [..] where it
    sazs:
       Just as specified in [RFC3168], DCTCP does not react to congestion
       indications more than once for every window of data.

Neal is right.  Fortunately we don't have to complicate this by testing
vs. current rtt estimate, we can just revert the patch.

Normal stack already handles this for us: receiving ACKs with ECE
set causes a call to tcp_enter_cwr(), from there on the ssthresh gets
adjusted and prr will take care of cwnd adjustment.

Fixes: 47805667 ("dctcp: update cwnd on congestion event")
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

343dfaa1

tcp: warn on bogus MSS and try to amend it · dcb17d22

由 Marcelo Ricardo Leitner 提交于 12月 05, 2016

There have been some reports lately about TCP connection stalls caused
by NIC drivers that aren't setting gso_size on aggregated packets on rx
path. This causes TCP to assume that the MSS is actually the size of the
aggregated packet, which is invalid.

Although the proper fix is to be done at each driver, it's often hard
and cumbersome for one to debug, come to such root cause and report/fix
it.

This patch amends this situation in two ways. First, it adds a warning
on when this situation occurs, so it gives a hint to those trying to
debug this. It also limit the maximum probed MSS to the adverised MSS,
as it should never be any higher than that.

The result is that the connection may not have the best performance ever
but it shouldn't stall, and the admin will have a hint on what to look
for.

Tested with virtio by forcing gso_size to 0.

v2: updated msg per David's suggestion
v3: use skb_iif to find the interface and also log its name, per Eric
    Dumazet's suggestion. As the skb may be backlogged and the interface
    gone by then, we need to check if the number still has a meaning.
v4: use helper tcp_gro_dev_warn() and avoid pr_warn_once inside __once, per
    David's suggestion

Cc: Jonathan Maxwell <jmaxwell37@gmail.com>
Signed-off-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

dcb17d22

06 12月, 2016 3 次提交

A
switch getfrag callbacks to ..._full() primitives · 0b62fca2
由 Al Viro 提交于 11月 03, 2016
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
0b62fca2

net: ping: check minimum size on ICMP header length · 0eab121e

由 Kees Cook 提交于 12月 05, 2016

Prior to commit c0371da6 ("put iov_iter into msghdr") in v3.19, there
was no check that the iovec contained enough bytes for an ICMP header,
and the read loop would walk across neighboring stack contents. Since the
iov_iter conversion, bad arguments are noticed, but the returned error is
EFAULT. Returning EINVAL is a clearer error and also solves the problem
prior to v3.19.

This was found using trinity with KASAN on v3.18:

BUG: KASAN: stack-out-of-bounds in memcpy_fromiovec+0x60/0x114 at addr ffffffc071077da0
Read of size 8 by task trinity-c2/9623
page:ffffffbe034b9a08 count:0 mapcount:0 mapping:          (null) index:0x0
flags: 0x0()
page dumped because: kasan: bad access detected
CPU: 0 PID: 9623 Comm: trinity-c2 Tainted: G    BU         3.18.0-dirty #15
Hardware name: Google Tegra210 Smaug Rev 1,3+ (DT)
Call trace:
[<ffffffc000209c98>] dump_backtrace+0x0/0x1ac arch/arm64/kernel/traps.c:90
[<ffffffc000209e54>] show_stack+0x10/0x1c arch/arm64/kernel/traps.c:171
[<     inline     >] __dump_stack lib/dump_stack.c:15
[<ffffffc000f18dc4>] dump_stack+0x7c/0xd0 lib/dump_stack.c:50
[<     inline     >] print_address_description mm/kasan/report.c:147
[<     inline     >] kasan_report_error mm/kasan/report.c:236
[<ffffffc000373dcc>] kasan_report+0x380/0x4b8 mm/kasan/report.c:259
[<     inline     >] check_memory_region mm/kasan/kasan.c:264
[<ffffffc00037352c>] __asan_load8+0x20/0x70 mm/kasan/kasan.c:507
[<ffffffc0005b9624>] memcpy_fromiovec+0x5c/0x114 lib/iovec.c:15
[<     inline     >] memcpy_from_msg include/linux/skbuff.h:2667
[<ffffffc000ddeba0>] ping_common_sendmsg+0x50/0x108 net/ipv4/ping.c:674
[<ffffffc000dded30>] ping_v4_sendmsg+0xd8/0x698 net/ipv4/ping.c:714
[<ffffffc000dc91dc>] inet_sendmsg+0xe0/0x12c net/ipv4/af_inet.c:749
[<     inline     >] __sock_sendmsg_nosec net/socket.c:624
[<     inline     >] __sock_sendmsg net/socket.c:632
[<ffffffc000cab61c>] sock_sendmsg+0x124/0x164 net/socket.c:643
[<     inline     >] SYSC_sendto net/socket.c:1797
[<ffffffc000cad270>] SyS_sendto+0x178/0x1d8 net/socket.c:1761

CVE-2016-8399
Reported-by: NQidan He <i@flanker017.me>
Fixes: c319b4d7 ("net: ipv4: add IPPROTO_ICMP socket kind")
Cc: stable@vger.kernel.org
Signed-off-by: NKees Cook <keescook@chromium.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0eab121e

tcp: tsq: move tsq_flags close to sk_wmem_alloc · 7aa5470c

由 Eric Dumazet 提交于 12月 03, 2016

tsq_flags being in the same cache line than sk_wmem_alloc
makes a lot of sense. Both fields are changed from tcp_wfree()
and more generally by various TSQ related functions.

Prior patch made room in struct sock and added sk_tsq_flags,
this patch deletes tsq_flags from struct tcp_sock.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7aa5470c

openanolis / cloud-kernel 大约 1 年 前同步成功

openanolis / cloud-kernel
大约 1 年前同步成功