Commits · c15df306fc79c672573f1cc2ebdfcb32d7e68780 · openanolis / cloud-kernel

16 Jul, 2015 5 commits

ipv6: Remove unused arguments for __ipv6_dev_get_saddr(). · c15df306

Committed by YOSHIFUJI Hideaki on Jul 16, 2015

Signed-off-by: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c15df306

netlink: changes for setting and clearing protodown via netlink. · 88d6378b

Committed by Anuradha Karuppiah on Jul 14, 2015

Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

88d6378b

net core: Add protodown support. · d746d707

Committed by Anuradha Karuppiah on Jul 14, 2015

This patch introduces the proto_down flag that can be used by user space
applications to notify switch drivers that errors have been detected on the
device.

The switch driver can react to protodown notification by doing a phys down
on the associated switch port.

Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d746d707

ipv6: Fix finding best source address in ipv6_dev_get_saddr(). · c0b8da1e

Committed by YOSHIFUJI Hideaki/吉藤英明 on Jul 13, 2015

Commit 9131f3de ("ipv6: Do not iterate over all interfaces when
finding source address on specific interface.") did not properly
update the best source address available.  It also introduced a
possible NULL pointer dereference.

Bug was reported by Erik Kline <ek@google.com>.
Based on patch proposed by Hajime Tazaki <thehajime@gmail.com>.

Fixes: 9131f3de ("ipv6: Do not iterate over all interfaces when finding source address on specific interface.")
Signed-off-by: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
Acked-by: Hajime Tazaki <thehajime@gmail.com>
Acked-by: Erik Kline <ek@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c0b8da1e

pkt_sched: sch_qfq: remove unused member of struct qfq_sched · 40bdc536

Committed by Andrea Parri on Jul 14, 2015

The member (u32) "num_active_agg" of struct qfq_sched has been unused
since its introduction in 462dbc91
"pkt_sched: QFQ Plus: fair-queueing service at DRR cost" and (AFAICT)
there is no active plan to use it; this removes the member.

Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Acked-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: David S. Miller <davem@davemloft.net>

40bdc536

14 Jul, 2015 2 commits

bridge: mdb: add vlan support for user entries · 74fe61f1

Committed by Nikolay Aleksandrov on Jul 10, 2015

Until now, all user mdb entries were added in vlan 0. This patch adds
support for letting the user specify the vlan for the entry.
As for the uapi change, a hole in struct br_mdb_entry is used, so the size
and offsets stay the same (verified with pahole and tested with older
iproute2).

Example:
$ bridge mdb
dev br0 port eth1 grp 239.0.0.1 permanent vlan 2000
dev br0 port eth1 grp 239.0.0.1 permanent vlan 200
dev br0 port eth1 grp 239.0.0.1 permanent

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

74fe61f1

net: Build IPv6 into kernel by default · de551f2e

Committed by Tom Herbert on Jul 13, 2015

This patch makes the default to build IPv6 into the kernel. IPv6
now has significant traction and any remaining vestiges of IPv6
not being provided parity with IPv4 should be swept away. IPv6 is now
core to the Internet and kernel.

Points on IPv6 adoption:

- Per Google statistics, IPv6 usage has reached 7% on the Internet
  and continues to exhibit an exponential growth rate
  https://www.google.com/intl/en/ipv6/statistics.html
- Just a few days ago ARIN officially depleted its IPv4 pool
- IPv6 only data centers are being successfully built
  (e.g. at Facebook)

This patch changes the IPv6 Kconfig for IPV6. Default for CONFIG_IPV6
is set to "y" and the text has been updated to reflect the maturity of
IPv6.

Impact:

Under some circumstances, building modules into the kernel might have a
performance advantage. In my testing, I did notice a very slight
improvement.

This will obviously increase the size of the kernel image. In my
configuration I see:

IPv6 as module:

   text    data     bss      dec    hex filename
9703666 1899288  933888 12536842 bf4c0a vmlinux

IPv6 built into kernel:

   text    data     bss      dec    hex filename
9436490 1879600  913408 12229498 ba9b7a vmlinux

Which increases text size by ~270K (2.8% increase in size for me). If
image size is an issue, presumably for a device which does not do IP
networking (IMO we should be discouraging IPv4-only devices), IPV6 can
be disabled or still built as a module.

Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

de551f2e
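
The "~270K (2.8%)" figure in the commit message above can be checked directly from the two quoted `size vmlinux` listings. The following is only a minimal arithmetic sketch: the variable names are illustrative, and the numbers are copied from the message without assuming anything else about the build.

```python
# Text-segment sizes copied from the two `size vmlinux` listings quoted
# in the commit message (names here are illustrative only).
text_larger = 9703666
text_smaller = 9436490

# Absolute difference in bytes between the two text segments.
diff = text_larger - text_smaller

# Relative difference, measured against the smaller image.
pct = 100.0 * diff / text_smaller

print(f"text differs by {diff} bytes ({pct:.1f}%)")
```

The difference is 267176 bytes, which rounds to the roughly 270K quoted in the message.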

13 Jul, 2015 1 commit

can: replace timestamp as unique skb attribute · d3b58c47

Committed by Oliver Hartkopp on Jun 26, 2015

Commit 514ac99c "can: fix multiple delivery of a single CAN frame for
overlapping CAN filters" requires the skb->tstamp to be set to check for
identical CAN skbs.

When user space applications did not request timestamping, this timestamp
was not generated, which led to commit 36c01245 "can: fix loss of CAN frames
in raw_rcv". That commit forces the timestamp to be set in all CAN related
skbuffs by introducing several __net_timestamp() calls.

This forces e.g. out of tree drivers which are not using alloc_can{,fd}_skb()
to add __net_timestamp() after skbuff creation to prevent the frame loss fixed
in mainline Linux.

This patch removes the timestamp dependency and uses an atomic counter to
create a unique identifier together with the skbuff pointer.

Btw: the new skbcnt element introduced in struct can_skb_priv has to be
initialized with zero in out-of-tree drivers which are not using
alloc_can{,fd}_skb() too.

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

d3b58c47
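
The identification scheme described in the commit above (an atomic counter combined with the skbuff pointer) can be modelled in userspace roughly as follows. This is a sketch under stated assumptions, not the kernel code: `Skb`, `tag_skb` and `same_frame` are hypothetical names, `itertools.count` stands in for the kernel's atomic counter, and `id()` stands in for the skbuff pointer.

```python
import itertools

# Stand-in for the kernel's atomic counter. It starts at 1 so that a
# value of 0 can mean "not yet assigned".
_skb_counter = itertools.count(1)


class Skb:
    """Hypothetical stand-in for an skbuff carrying struct can_skb_priv."""

    def __init__(self):
        # Must start at zero, as the commit message requires for
        # out-of-tree drivers.
        self.skbcnt = 0


def tag_skb(skb):
    """Assign a counter value on first use; later calls leave it unchanged."""
    if skb.skbcnt == 0:
        skb.skbcnt = next(_skb_counter)
    return skb.skbcnt


def same_frame(last_seen, skb):
    """Return True if skb matches the (object identity, counter) pair seen last."""
    return last_seen is not None and last_seen == (id(skb), skb.skbcnt)
```

One plausible reason for pairing the two values is that a pointer alone can be reused after an skbuff is freed, while the counter alone says nothing about which buffer is in hand; the commit message itself only states that both are used together.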

12 Jul, 2015 4 commits

net: dsa: Fix off-by-one in switch address parsing · c8cf89f7

Committed by Florian Fainelli on Jul 11, 2015

cd->sw_addr is used as an MDIO bus address, which cannot exceed
PHY_MAX_ADDR (32); our check was off by one.

Fixes: 5e95329b ("dsa: add device tree bindings to register DSA switches")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c8cf89f7
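
The off-by-one described above comes down to whether the bound itself is accepted. A minimal sketch, assuming the usual 0-based addressing in which valid addresses run from 0 to PHY_MAX_ADDR - 1 (the helper name `sw_addr_is_valid` is hypothetical and is not the kernel's code):

```python
PHY_MAX_ADDR = 32  # value quoted in the commit message


def sw_addr_is_valid(sw_addr: int) -> bool:
    """Accept addresses 0 .. PHY_MAX_ADDR - 1 and reject the bound itself."""
    return 0 <= sw_addr < PHY_MAX_ADDR
```

A check that only rejects values greater than PHY_MAX_ADDR would let the value 32 through, which is the kind of off-by-one the commit title refers to.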

net: dsa: Test array index before use · 8f5063e9

Committed by Florian Fainelli on Jul 11, 2015

port_index is used as an index into an array, and this information comes
from Device Tree. Make sure that port_index is not equal to the array
size before using it. Move the check against port_index earlier in the
loop.

Fixes: 5e95329b ("dsa: add device tree bindings to register DSA switches")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8f5063e9

net: switchdev: don't abort unsupported operations · 2ee94014

Committed by Vivien Didelot on Jul 10, 2015

There is no need to abort attribute setting or object addition, if the
prepare phase returned operation not supported.

Thus, abort these two transactions only if the error is not -EOPNOTSUPP.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2ee94014
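
The rule stated in the commit above (abort only when the prepare phase fails with something other than -EOPNOTSUPP) can be sketched as a generic two-phase transaction. The function `run_transaction` and its three callbacks are hypothetical; this models the control flow only and is not the switchdev API.

```python
import errno


def run_transaction(prepare, commit, abort):
    """Two-phase sketch: prepare, then commit; abort only on 'real' errors.

    Each callback returns 0 on success or a negative errno value,
    mirroring the kernel convention.
    """
    err = prepare()
    if err:
        # "Operation not supported" leaves nothing to roll back,
        # so the abort step is skipped for that one error.
        if err != -errno.EOPNOTSUPP:
            abort()
        return err
    return commit()
```

In both failure cases the error is still returned to the caller; only the rollback step differs.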

Revert "ipv4: use skb coalescing in defragmentation" · 14fe22e3

Committed by Florian Westphal on Jul 11, 2015

This reverts commit 3cc49492.

There is nothing wrong with coalescing during defragmentation, it
reduces truesize overhead and simplifies things for the receiving
socket (no fraglist walk needed).

However, it also destroys geometry of the original fragments.
While that doesn't cause any breakage (we make sure to not exceed largest
original size) ip_do_fragment contains a 'fastpath' that takes advantage
of a present frag list and results in fragments that (in most cases)
match what was received.

In case it's needed, the coalescing could be done later, when we're sure
the skb is not forwarded.  But discussion during NFWS resulted in
'let's just remove this for now'.

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

14fe22e3

11 Jul, 2015 5 commits

net: inet_diag: always export IPV6_V6ONLY sockopt for listening sockets · 8220ea23

Committed by Phil Sutter on Jul 10, 2015

Reconsidering my commit 20462155 "net: inet_diag: export IPV6_V6ONLY
sockopt", I am not happy with the limitations it causes for socket
analysing code in userspace. Exporting the value only if it is set makes
it hard for userspace to decide whether the option is not set or the
kernel does not support exporting the option at all.

From an auditor's perspective, the interesting question for listening
AF_INET6 sockets is: "Does it NOT have IPV6_V6ONLY set?" Because it is
the unexpected case. This patch makes it possible to answer this question
reliably.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8220ea23

ipv6: Do not iterate over all interfaces when finding source address on specific interface. · 9131f3de

Committed by YOSHIFUJI Hideaki/吉藤英明 on Jul 10, 2015

If the outgoing interface is specified and the candidate address is
restricted to the outgoing interface, it is enough to iterate
over that given interface only.

Signed-off-by: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
Acked-by: Erik Kline <ek@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9131f3de

bridge: mdb: allow the user to delete mdb entry if there's a querier · 51ed7f3e

Committed by Satish Ashok on Jul 09, 2015

Until now, when a querier was present, static entries couldn't be deleted.
Fix this and allow the user to manipulate the mdb with or without a
querier.

Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

51ed7f3e

net: call rcu_read_lock early in process_backlog · 2c17d27c

Committed by Julian Anastasov on Jul 09, 2015

Incoming packet should be either in backlog queue or
in RCU read-side section. Otherwise, the final sequence of
flush_backlog() and synchronize_net() may miss packets
that can run without device reference:

CPU 1                  CPU 2
                       skb->dev: no reference
                       process_backlog:__skb_dequeue
                       process_backlog:local_irq_enable

on_each_cpu for
flush_backlog =>       IPI(hardirq): flush_backlog
                       - packet not found in backlog

                       CPU delayed ...
synchronize_net
- no ongoing RCU
read-side sections

netdev_run_todo,
rcu_barrier: no
ongoing callbacks
                       __netif_receive_skb_core:rcu_read_lock
                       - too late
free dev
                       process packet for freed dev

Fixes: 6e583ce5 ("net: eliminate refcounting in backlog queue")
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>

2c17d27c

net: do not process device backlog during unregistration · e9e4dd32

Committed by Julian Anastasov on Jul 09, 2015

commit 381c759d ("ipv4: Avoid crashing in ip_error")
fixes a problem where processed packet comes from device
with destroyed inetdev (dev->ip_ptr). This is not expected
because inetdev_destroy is called in NETDEV_UNREGISTER
phase and packets should not be processed after
dev_close_many() and synchronize_net(). Above fix is still
required because inetdev_destroy can be called for other
reasons. But it shows the real problem: backlog can keep
packets for long time and they do not hold reference to
device. Such packets are then delivered to upper levels
at the same time when device is unregistered.
Calling flush_backlog after NETDEV_UNREGISTER_FINAL still
accounts all packets from backlog but before that some packets
continue to be delivered to upper levels long after the
synchronize_net call which is supposed to wait the last
ones. Also, as Eric pointed out, processed packets, mostly
from other devices, can continue to add new packets to backlog.

Fix the problem by moving flush_backlog early, after the
device driver is stopped and before the synchronize_net() call.
Then use netif_running check to make sure we do not add more
packets to backlog. We have to do it in enqueue_to_backlog
context when the local IRQ is disabled. As result, after the
flush_backlog and synchronize_net sequence all packets
should be accounted.

Thanks to Eric W. Biederman for the test script and his
valuable feedback!

Reported-by: Vittorio Gambaletta <linuxbugs@vittgam.net>
Fixes: 6e583ce5 ("net: eliminate refcounting in backlog queue")
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>

e9e4dd32

10 Jul, 2015 17 commits

bridge: fix potential crash in __netdev_pick_tx() · a7d35f9d

Committed by Eric Dumazet on Jul 09, 2015

Commit c29390c6 ("xps: must clear sender_cpu before forwarding")
fixed an issue in the normal forward path, caused by the sender_cpu & napi_id
skb fields being a union.

Bridge is another point where skb can be forwarded, so we need
the same cure.

Bug triggers if packet was received on a NIC using skb_mark_napi_id()

Fixes: 2bd82484 ("xps: fix xps for stacked devices")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Bob Liu <bob.liu@oracle.com>
Tested-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a7d35f9d

tcp: do not export tcp_init_xmit_timers() · a4e2405c

Committed by Eric Dumazet on Jul 09, 2015

After commit 900f65d3 ("tcp: move duplicate code from
tcp_v4_init_sock()/tcp_v6_init_sock()"), we no longer
need to export tcp_init_xmit_timers().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a4e2405c

bridge: mdb: fill state in br_mdb_notify · 09cf0211

Committed by Nikolay Aleksandrov on Jul 09, 2015

Also fill the port group state when sending notifications.

Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

09cf0211

route: remove unsed variable in __mkroute_input · cb1c6168

Committed by Masatake YAMATO on Jul 09, 2015

The flags local variable in __mkroute_input is not used as a variable.

Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cb1c6168

ipv6: Nonlocal bind · 35a256fe

Committed by Tom Herbert on Jul 08, 2015

Add support to allow non-local binds similar to how this was done for IPv4.
Non-local binds are very useful in emulating the Internet in a box, etc.

This adds the ip_nonlocal_bind sysctl under ipv6.

Testing:

Set up nonlocal binding and receive routing on a host, e.g.:

ip -6 rule add from ::/0 iif eth0 lookup 200
ip -6 route add local 2001:0:0:1::/64 dev lo proto kernel scope host table 200
sysctl -w net.ipv6.ip_nonlocal_bind=1

Set up routing to 2001:0:0:1::/64 on peer to go to first host

ping6 -I 2001:0:0:1::1 peer-address -- to verify

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

35a256fe

inet: inet_twsk_deschedule factorization · dbe7faa4

Committed by Eric Dumazet on Jul 08, 2015

inet_twsk_deschedule() calls are followed by inet_twsk_put().

The only exception is in inet_twsk_purge(), but there is no point
in deferring the inet_twsk_put() until after re-enabling BH.

Let's rename inet_twsk_deschedule() to inet_twsk_deschedule_put()
and move the inet_twsk_put() inside.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dbe7faa4

inet: simplify timewait refcounting · fc01538f

Committed by Eric Dumazet on Jul 08, 2015

timewait sockets have a complex refcounting logic.
Once we realize it should be similar to established and
syn_recv sockets, we can use sk_nulls_del_node_init_rcu()
and remove inet_twsk_unhash()

In particular, the deferred inet_twsk_put() added in commit
13475a30 ("tcp: connect() race with timewait reuse")
looks unnecessary: when removing a timewait socket from
ehash or bhash, the caller must own a reference on the socket
anyway.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fc01538f

ipv6: use flag instead of u16 for hop in inet6_skb_parm · 8b58a398

Committed by Florian Westphal on Jul 08, 2015

Hop was always either 0 or sizeof(struct ipv6hdr).

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

8b58a398

net: pktgen: kill the "Wait for kthread_stop" code in pktgen_thread_worker() · 1fbe4b46

Committed by Oleg Nesterov on Jul 08, 2015

pktgen_thread_worker() doesn't need to wait for kthread_stop(), it
can simply exit. Just pktgen_create_thread() and pg_net_exit() should
do get_task_struct()/put_task_struct(). kthread_stop(dead_thread) is
fine.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1fbe4b46

net: pktgen: fix race between pktgen_thread_worker() and kthread_stop() · fecdf8be

Committed by Oleg Nesterov on Jul 08, 2015

pktgen_thread_worker() is obviously racy: kthread_stop() can come
between the kthread_should_stop() check and set_current_state().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Jan Stancek <jstancek@redhat.com>
Reported-by: Marcelo Leitner <mleitner@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fecdf8be

tcp: update congestion state first before raising cwnd · b20a3fa3

Committed by Yuchung Cheng on Jul 09, 2015

The congestion state and cwnd can be updated in the wrong order.
For example, upon receiving a dubious ACK, we incorrectly raise
the cwnd first (tcp_may_raise_cwnd()/tcp_cong_avoid()) because
the state is still Open, then enter recovery state to reduce cwnd.

For another example, if the ACK indicates spurious timeout or
retransmits, we first revert the cwnd reduction and congestion
state back to Open state.  But we don't raise the cwnd even though
the ACK does not indicate any congestion.

To fix this problem we should first call tcp_fastretrans_alert() to
process the dubious ACK and update the congestion state, then call
tcp_may_raise_cwnd() that raises cwnd based on the current state.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b20a3fa3

tcp: do not slow start when cwnd equals ssthresh · 76174004

Committed by Yuchung Cheng on Jul 09, 2015

In the original design slow start is only used to raise cwnd
when cwnd is strictly below ssthresh. It makes little sense
to slow start when cwnd == ssthresh, especially
when hystart has set ssthresh in the initial ramp, or after
recovery when cwnd resets to ssthresh. Not doing so will
also help reduce the buffer bloat slightly.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

76174004
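
The policy described in the commit above reduces to a strict comparison between the congestion window and the slow start threshold. A minimal userspace sketch, in which the parameter names follow the kernel's field naming but the function is a plain Python model rather than the kernel helper itself:

```python
def tcp_in_slow_start(snd_cwnd: int, snd_ssthresh: int) -> bool:
    """Slow start applies only while cwnd is strictly below ssthresh."""
    return snd_cwnd < snd_ssthresh
```

With this rule, a connection whose cwnd has just been reset to ssthresh after recovery is no longer treated as being in slow start.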

tcp: add tcp_in_slow_start helper · 071d5080

Committed by Yuchung Cheng on Jul 09, 2015

Add a helper to test the slow start condition in various congestion
control modules and other places. This is to prepare a slight improvement
in policy as to exactly when to slow start.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

071d5080

net: skb_defer_rx_timestamp should check for phydev before setting up classify · 1007f59d

Committed by Alexander Duyck on Jul 09, 2015

This change makes it so that the call skb_defer_rx_timestamp will first
check for a phydev before going in and manipulating the skb->data and
skb->len values. By doing this we can avoid unnecessary work on network
devices that don't support phydev. As a result we reduce the total
instruction count needed to process this on most devices.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1007f59d

tcp: v1 always send a quick ack when quickacks are enabled · 2251ae46

Committed by Jon Maxwell on Jul 08, 2015

V1 of this patch contains Eric Dumazet's suggestion to move the per
dst RTAX_QUICKACK check into tcp_in_quickack_mode(). Thanks Eric.

I ran some tests and after setting the "ip route change quickack 1"
knob there were still many delayed ACKs sent. This occurred
because when icsk_ack.quick=0 the !icsk_ack.pingpong value is
subsequently ignored as tcp_in_quickack_mode() checks both these
values. The condition for a quick ack to trigger requires
that both icsk_ack.quick != 0 and icsk_ack.pingpong=0. Currently
only icsk_ack.pingpong is controlled by the knob. But the
icsk_ack.quick value changes dynamically depending on heuristics.
The crux of the matter is that delayed acks still cannot be entirely
disabled even with the RTAX_QUICKACK per dst knob enabled. This
patch ensures that a quick ack is always sent when the RTAX_QUICKACK
per dst knob is turned on.

The "ip route change quickack 1" knob was recently added to enable
quickacks. It was modeled around the TCP_QUICKACK setsockopt() option.
The issue is that even with "ip route change quickack 1" enabled
we still see delayed ACKs under some conditions. It would be nice
to be able to completely disable delayed ACKs.

Here is an example:

# netstat -s|grep dela
    3 delayed acks sent

For all routes enable the knob

# ip route change quickack 1

Generate some traffic across a slow link and we still see the delayed
acks.

# netstat -s|grep dela
    106 delayed acks sent
    1 delayed acks further delayed because of locked socket

The issue is that both the "ip route change quickack 1" knob and
the TCP_QUICKACK option set the icsk_ack.pingpong variable to 0.
However at the business end in the __tcp_ack_snd_check() routine,
tcp_in_quickack_mode() checks that both icsk_ack.quick != 0
and icsk_ack.pingpong=0 in order to trigger a quickack. As
icsk_ack.quick is determined by heuristics it can be 0. When
that occurs the icsk_ack.pingpong value is ignored and a delayed
ACK is sent regardless.

This patch moves the RTAX_QUICKACK per dst check into the
tcp_in_quickack_mode() routine which ensures that a quickack is
always sent when the quickack knob is enabled for that dst.

Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2251ae46
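
The condition discussed in the commit above can be written out as two small predicates, one for the behaviour described as the problem and one for the behaviour the patch aims for. This is a boolean model only: `route_quickack` is a hypothetical parameter standing in for the per-dst RTAX_QUICKACK check, and `quick` and `pingpong` stand in for icsk_ack.quick and icsk_ack.pingpong.

```python
def quickack_mode_before(quick: int, pingpong: bool) -> bool:
    """Behaviour described as the problem: both conditions must hold."""
    return quick != 0 and not pingpong


def quickack_mode_after(route_quickack: bool, quick: int, pingpong: bool) -> bool:
    """Behaviour the patch aims for: the per-route knob alone is sufficient."""
    return route_quickack or (quick != 0 and not pingpong)
```

When the heuristics have driven `quick` to 0, the first predicate is false regardless of `pingpong`, which matches the delayed ACKs seen in the netstat output; the second predicate stays true whenever the route knob is set.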

libceph: treat sockaddr_storage with uninitialized family as blank · c44bd69c

Committed by Ilya Dryomov on Jul 09, 2015

addr_is_blank() should return true if family is neither AF_INET nor
AF_INET6.  This is what its counterpart entity_addr_t::is_blank_ip() is
doing and it is the right thing to do: in process_banner() we check if
our address is blank and if it is "learn" it from our peer.  As it is,
we never learn our address and always send out a blank one.  This goes
way back to ceph.git commit dd732cbfc1c9 ("use sockaddr_storage; and
some ipv6 support groundwork") from 2009.

While at it, do not open-code ipv6_addr_any() and use INADDR_ANY
constant instead of 0.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>

c44bd69c
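
The rule described in the commit above (an address counts as blank when it is the wildcard address, or when the family is neither AF_INET nor AF_INET6) can be modelled in userspace as follows. This is a sketch using Python's standard `socket` and `ipaddress` modules; the signature, which takes a family and a textual address, is an assumption made for illustration and does not mirror the kernel function.

```python
import ipaddress
import socket


def addr_is_blank(family: int, address: str = "") -> bool:
    """Return True for a wildcard address or an unrecognised family."""
    if family == socket.AF_INET:
        # Counterpart of comparing against INADDR_ANY.
        return ipaddress.IPv4Address(address).is_unspecified
    if family == socket.AF_INET6:
        # Counterpart of ipv6_addr_any().
        return ipaddress.IPv6Address(address).is_unspecified
    # Uninitialised or unknown family: treat as blank.
    return True
```

The final branch is the case the commit title refers to: a sockaddr_storage whose family was never set is treated as blank, so that the address can be learned from the peer.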

libceph: enable ceph in a non-default network namespace · 757856d2

Committed by Ilya Dryomov on Jun 25, 2015

Grab a reference on a network namespace of the 'rbd map' (in case of
rbd) or 'mount' (in case of ceph) process and use that to open sockets
instead of always using init_net and bailing if network namespace is
anything but init_net.  Be careful to not share struct ceph_client
instances between different namespaces and don't add any code in the
!CONFIG_NET_NS case.

This is based on a patch from Hong Zhiguo <zhiguohong@tencent.com>.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>

757856d2

09 Jul, 2015 6 commits

ipv4: add support for linkdown sysctl to netconf · 974d7af5

Committed by Andy Gospodarek on Jul 07, 2015

This kernel patch exports the value of the new
ignore_routes_with_linkdown via netconf.

v2: changes to notify userspace via netlink when sysctl values change
and proposed for 'net' since this could be considered a bugfix

Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Suggested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

974d7af5

bridge: mdb: zero out the local br_ip variable before use · f1158b74

Committed by Nikolay Aleksandrov on Jul 07, 2015

Since commit b0e9a30d ("bridge: Add vlan id to multicast groups")
there's a check in br_ip_equal() for a matching vlan id, but the mdb
functions were not modified to use it (or at least zero it), so when an
entry was added it would have a garbage vlan id (from the local br_ip
variable in __br_mdb_add/del) and this would prevent it from being
matched and also deleted. So zero out the whole local ip var to protect
ourselves from future changes and also to fix the current bug. Since
there's no vlan id support in the mdb uapi, always use vlan id 0.
Example before patch:
root@debian:~# bridge mdb add dev br0 port eth1 grp 239.0.0.1 permanent
root@debian:~# bridge mdb
dev br0 port eth1 grp 239.0.0.1 permanent
root@debian:~# bridge mdb del dev br0 port eth1 grp 239.0.0.1 permanent
RTNETLINK answers: Invalid argument

After patch:
root@debian:~# bridge mdb add dev br0 port eth1 grp 239.0.0.1 permanent
root@debian:~# bridge mdb
dev br0 port eth1 grp 239.0.0.1 permanent
root@debian:~# bridge mdb del dev br0 port eth1 grp 239.0.0.1 permanent
root@debian:~# bridge mdb

Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Fixes: b0e9a30d ("bridge: Add vlan id to multicast groups")
Signed-off-by: David S. Miller <davem@davemloft.net>

f1158b74

net/tipc: initialize security state for new connection socket · fdd75ea8

Committed by Stephen Smalley on Jul 07, 2015

Calling connect() with an AF_TIPC socket would trigger a series
of error messages from SELinux along the lines of:
SELinux: Invalid class 0
type=AVC msg=audit(1434126658.487:34500): avc:  denied  { <unprintable> }
  for pid=292 comm="kworker/u16:5" scontext=system_u:system_r:kernel_t:s0
  tcontext=system_u:object_r:unlabeled_t:s0 tclass=<unprintable>
  permissive=0

This was due to a failure to initialize the security state of the new
connection sock by the tipc code, leaving it with junk in the security
class field and an unlabeled secid.  Add a call to security_sk_clone()
to inherit the security state from the parent socket.

Reported-by: Tim Shearer <tim.shearer@overturenetworks.com>
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: Paul Moore <paul@paul-moore.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fdd75ea8

ip_tunnel: fix ipv4 pmtu check to honor inner ip header df · fc24f2b2

Committed by Timo Teräs on Jul 07, 2015

Frag needed should be sent only if the inner header asked
to not fragment. Currently fragmentation is broken if the
tunnel has df set, but df was not asked in the original
packet. The tunnel's df needs to be still checked to update
internally the pmtu cache.

Commit 23a3647b broke it, and this commit fixes
the ipv4 df check back to the way it was.

Fixes: 23a3647b ("ip_tunnels: Use skb-len to PMTU check.")
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fc24f2b2

rtnetlink: verify IFLA_VF_INFO attributes before passing them to driver · 4f7d2cdf

Committed by Daniel Borkmann on Jul 07, 2015

Jason Gunthorpe reported that since commit c02db8c6 ("rtnetlink: make
SR-IOV VF interface symmetric"), we don't verify IFLA_VF_INFO attributes
anymore with respect to their policy, that is, ifla_vfinfo_policy[].

Before, they were part of ifla_policy[], but they have been nested since
placed under IFLA_VFINFO_LIST, that contains the attribute IFLA_VF_INFO,
which is another nested attribute for the actual VF attributes such as
IFLA_VF_MAC, IFLA_VF_VLAN, etc.

Despite the policy being split out from ifla_policy[] in this commit,
it's never applied anywhere. nla_for_each_nested() only does basic nla_ok()
testing for struct nlattr, but it doesn't know about the data context and
their requirements.

Fix, on top of Jason's initial work, does 1) parsing of the attributes
with the right policy, and 2) using the resulting parsed attribute table
from 1) instead of the nla_for_each_nested() loop (just like we used to
do when still part of ifla_policy[]).

Reference: http://thread.gmane.org/gmane.linux.network/368913
Fixes: c02db8c6 ("rtnetlink: make SR-IOV VF interface symmetric")
Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Sucheta Chakraborty <sucheta.chakraborty@qlogic.com>
Cc: Greg Rose <gregory.v.rose@intel.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Rony Efraim <ronye@mellanox.com>
Cc: Vlad Zolotarov <vladz@cloudius-systems.com>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4f7d2cdf

Revert "dev: set iflink to 0 for virtual interfaces" · 95ec655b

Committed by Nicolas Dichtel on Jul 06, 2015

This reverts commit e1622baf.

The side effect of that commit is to add a '@NONE' after each virtual
interface name in 'ip link' output. It may break existing scripts.

Reported-by: Olivier Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Tested-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

95ec655b

openanolis / cloud-kernel synced successfully more than 1 year ago
