Commits · f7cb8886335dea39fa31bb701700361f1aa7a6ea · openanolis / cloud-kernel

15 Nov, 2013 · 10 commits

sit/gre6: don't try to add the same route two times · f7cb8886

Authored by Nicolas Dichtel on Nov 14, 2013

addrconf_add_linklocal() already adds the link local route, so there is no
reason to add it before calling this function.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sit: link local routes are missing · f0e2acfa

Authored by Nicolas Dichtel on Nov 14, 2013

When a link local address was added to a sit interface, the corresponding route
was not configured. This breaks routing protocols that use the link local
address, like OSPFv3.

To ease the code reading, I remove sit_route_add(), which only adds v4 mapped
routes, and add this kind of route directly in sit_add_v4_addrs(). Thus link
local and v4 mapped routes are configured in the same place.
Reported-by: Li Hongjun <hongjun.li@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sit: fix prefix length of ll and v4mapped addresses · 929c9cf3

Authored by Nicolas Dichtel on Nov 14, 2013

When the local IPv4 endpoint is a wildcard (0.0.0.0), the prefix length is
correctly set, i.e. 64 if the address is a link local one or 96 if the address is
a v4 mapped one.
But when the local endpoint is specified, the prefix length is set to 128 for
both kinds of address. This patch fixes this wrong prefix length.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
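
A minimal Python sketch of the intended prefix lengths, assuming the link local and v4 mapped addresses are built by placing the local IPv4 endpoint in the low 32 bits of fe80::/64 and ::/96 respectively. The helper name `sit_addrs` is hypothetical and this is not the kernel code:

```python
import ipaddress

def sit_addrs(local_v4):
    """Return (link_local, v4_mapped) interfaces for a local IPv4 endpoint.

    Assumption: both addresses carry the IPv4 endpoint in their low 32 bits.
    """
    v4 = int(ipaddress.IPv4Address(local_v4))
    link_local = ipaddress.IPv6Address(int(ipaddress.IPv6Address("fe80::")) | v4)
    v4_mapped = ipaddress.IPv6Address(v4)
    # Prefix lengths described in the commit message: 64 and 96, never 128.
    return (ipaddress.IPv6Interface(f"{link_local}/64"),
            ipaddress.IPv6Interface(f"{v4_mapped}/96"))
```

For a local endpoint of 192.0.2.1 this yields one address in fe80::/64 and one in ::/96, whether or not the endpoint was given explicitly.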

sit: fix use after free of fb_tunnel_dev · 9434266f

Authored by Willem de Bruijn on Nov 13, 2013

Bug: The fallback device is created in sit_init_net and assumed to be
freed in sit_exit_net. First, it is dereferenced in that function, in
sit_destroy_tunnels:

        struct net *net = dev_net(sitn->fb_tunnel_dev);

Prior to this, rtnl_link_unregister has removed all devices that match
rtnl_link_ops == sit_link_ops.

Commit 205983c4 added the line

+       sitn->fb_tunnel_dev->rtnl_link_ops = &sit_link_ops;

which causes the fallback device to match here and be freed before it
is last dereferenced.

Fix: This commit adds an explicit .dellink callback to sit_link_ops
that skips deallocation at rtnl_link_unregister for the fallback
device. This mechanism is comparable to the one in ip_tunnel.

It also modifies sit_destroy_tunnels and its only caller sit_exit_net
to avoid the offending dereference in the first place. That double
lookup is more complicated than required.

Test: The bug is only triggered when CONFIG_NET_NS is enabled. It
causes a GPF only when CONFIG_DEBUG_SLAB is enabled. Verified that
this bug exists at the mentioned commit, at davem-net HEAD and at
3.11.y HEAD. Verified that it went away after applying this patch.

Fixes: 205983c4 ("sit: allow to use rtnl ops on fb tunnel")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sctp: bug-fixing: retran_path not set properly after transports recovering (v3) · d30a58ba

Authored by Chang Xiangzhong on Nov 14, 2013

When a transport recovers because of a newly arrived SACK, SCTP should
iterate over its whole transport_list to locate the __two__ most recently used
transports and set them as active_path and retran_path respectively. The existing
code does not find the two properly. Given the following list:

[most-recent] -> [2nd-most-recent] -> ...

Both active_path and retran_path would be set to the 1st element.

The bug happens when:
1) multi-homing is in use
2) a failed or partially failed transport recovers
Both active_path and retran_path are then set to the same most-recent transport. In
other words, retran_path does not fulfil its role, and an end user might not even
notice this issue.
Signed-off-by: Chang Xiangzhong <changxiangzhong@gmail.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
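
The selection described above can be sketched in Python as a single pass that tracks the two most recently heard active transports. The `Transport` class, its field names and `pick_paths` are hypothetical stand-ins for illustration, not the kernel structures:

```python
from dataclasses import dataclass

@dataclass
class Transport:
    name: str
    active: bool
    last_time_heard: int  # larger means more recently used

def pick_paths(transports):
    """Return (active_path, retran_path) as the two most recent active transports."""
    first = second = None
    for t in transports:
        if not t.active:
            continue
        if first is None or t.last_time_heard > first.last_time_heard:
            # New most-recent transport: the previous best becomes second best.
            second, first = first, t
        elif second is None or t.last_time_heard > second.last_time_heard:
            second = t
    if second is None:
        # Only one usable transport: both roles fall back to it.
        second = first
    return first, second
```

With the list `[most-recent] -> [2nd-most-recent] -> ...` this returns two different transports, which is the behaviour the commit message asks for.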

net-tcp: fix panic in tcp_fastopen_cache_set() · dccf76ca

Authored by Eric Dumazet on Nov 13, 2013

We had some reports of crashes using TCP fastopen, and Dave Jones
gave a nice stack trace pointing to the error.

The issue is that tcp_get_metrics() should not be called with a NULL dst.

Fixes: 1fe4c481 ("net-tcp: Fast Open client - cookie cache")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dave Jones <davej@redhat.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Tested-by: Dave Jones <davej@fedoraproject.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: tsq: restore minimal amount of queueing · 98e09386

Authored by Eric Dumazet on Nov 13, 2013

After commit c9eeec26 ("tcp: TSQ can use a dynamic limit"), several
users reported throughput regressions, notably on mvneta and wifi
adapters.

802.11 AMPDU requires a fair amount of queueing to be effective.

This patch partially reverts the change done in tcp_write_xmit()
so that the minimal amount is sysctl_tcp_limit_output_bytes.

It also removes the use of this sysctl while building skb stored
in write queue, as TSO autosizing does the right thing anyway.

Users with well-behaved NICs and a correct qdisc (like sch_fq)
can then lower the default sysctl_tcp_limit_output_bytes value from
128KB to 8KB.

This new usage of sysctl_tcp_limit_output_bytes lets driver
authors check how their driver performs when/if the value is set
to a minimum of 4KB.

Normally, line rate for a single TCP flow should be possible,
but some drivers rely on timers to perform TX completion and
too long TX completion delays prevent reaching full throughput.

Fixes: c9eeec26 ("tcp: TSQ can use a dynamic limit")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Sujith Manoharan <sujith@msujith.org>
Reported-by: Arnaud Ebalard <arno@natisbad.org>
Tested-by: Sujith Manoharan <sujith@msujith.org>
Cc: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
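
A rough Python model of the restored minimum, assuming the dynamic TSQ limit from c9eeec26 is derived from the socket pacing rate as `sk_pacing_rate >> 10` (about 1 ms worth of data). That detail and the function name `tsq_limit` are assumptions for illustration and are not stated in this commit message:

```python
def tsq_limit(sysctl_tcp_limit_output_bytes, sk_pacing_rate):
    """Bytes a flow may keep queued below the TCP stack.

    The sysctl acts as a floor; the pacing-rate term (assumed here) can only
    raise the limit for fast flows.
    """
    return max(sysctl_tcp_limit_output_bytes, sk_pacing_rate >> 10)
```

Under this model a slow flow still gets the full sysctl value, which is the queueing that 802.11 AMPDU needs, while a fast flow can exceed it.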

bridge: Fix memory leak when deleting bridge with vlan filtering enabled · b4e09b29

Authored by Toshiaki Makita on Nov 13, 2013

We currently don't call br_vlan_flush() when deleting a bridge, which
leads to a memory leak if br->vlan_info is allocated.

Steps to reproduce:
  while :
  do
    brctl addbr br0
    bridge vlan add dev br0 vid 10 self
    brctl delbr br0
  done
We can observe that the cache size of the corresponding slab entry
(kmalloc-2048 in SLUB) keeps increasing.

kmemleak output:
unreferenced object 0xffff8800b68a7000 (size 2048):
  comm "bridge", pid 2086, jiffies 4295774704 (age 47.656s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 48 9b 36 00 88 ff ff  .........H.6....
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff815eb6ae>] kmemleak_alloc+0x4e/0xb0
    [<ffffffff8116a1ca>] kmem_cache_alloc_trace+0xca/0x220
    [<ffffffffa03eddd6>] br_vlan_add+0x66/0xe0 [bridge]
    [<ffffffffa03e543c>] br_setlink+0x2dc/0x340 [bridge]
    [<ffffffff8150e481>] rtnl_bridge_setlink+0x101/0x200
    [<ffffffff8150d9d9>] rtnetlink_rcv_msg+0x99/0x260
    [<ffffffff81528679>] netlink_rcv_skb+0xa9/0xc0
    [<ffffffff8150d938>] rtnetlink_rcv+0x28/0x30
    [<ffffffff81527ccd>] netlink_unicast+0xdd/0x190
    [<ffffffff8152807f>] netlink_sendmsg+0x2ff/0x740
    [<ffffffff814e8368>] sock_sendmsg+0x88/0xc0
    [<ffffffff814e8ac8>] ___sys_sendmsg.part.14+0x298/0x2b0
    [<ffffffff814e91de>] __sys_sendmsg+0x4e/0x90
    [<ffffffff814e922e>] SyS_sendmsg+0xe/0x10
    [<ffffffff81601669>] system_call_fastpath+0x16/0x1b
    [<ffffffffffffffff>] 0xffffffffffffffff
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: Call vlan_vid_del for all vids at nbp_vlan_flush · dbbaf949

Authored by Toshiaki Makita on Nov 13, 2013

We should call vlan_vid_del for all vids at nbp_vlan_flush to prevent
vid_info->refcount from being leaked when detaching a bridge port.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: Use vlan_vid_[add/del] instead of direct ndo_vlan_rx_[add/kill]_vid calls · 19236837

Authored by Toshiaki Makita on Nov 13, 2013

We should use the wrapper functions vlan_vid_[add/del] instead of
ndo_vlan_rx_[add/kill]_vid. Otherwise, we might not be able to communicate
over a vlan interface in certain situations.

Example of problematic case:
  vconfig add eth0 10
  brctl addif br0 eth0
  bridge vlan add dev eth0 vid 10
  bridge vlan del dev eth0 vid 10
  brctl delif br0 eth0
In this case, we cannot communicate via eth0.10 because vlan 10 is
filtered by a NIC that has the vlan filtering feature.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>

14 Nov, 2013 · 1 commit

core/dev: do not ignore dmac in dev_forward_skb() · 81b9eab5

Authored by Alexei Starovoitov on Nov 12, 2013

Commit 06a23fe3
("core/dev: set pkt_type after eth_type_trans() in dev_forward_skb()")
and the refactoring in 64261f23
("dev: move skb_scrub_packet() after eth_type_trans()")
force pkt_type to be PACKET_HOST when an skb traverses veth.

This means that IP forwarding will kick in inside the netns
even if skb->eth->h_dest != dev->dev_addr.

Fix the order of eth_type_trans() and skb_scrub_packet() in dev_forward_skb()
and in ip_tunnel_rcv().

Fixes: 06a23fe3 ("core/dev: set pkt_type after eth_type_trans() in dev_forward_skb()")
CC: Isaku Yamahata <yamahatanetdev@gmail.com>
CC: Maciej Zenczykowski <zenczykowski@gmail.com>
CC: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

11 Nov, 2013 · 4 commits

ipv6: protect for_each_sk_fl_rcu in mem_check with rcu_read_lock_bh · f8c31c8f

Authored by Hannes Frederic Sowa on Nov 08, 2013

Fixes a suspicious rcu dereference warning.

Cc: Florent Fourcot <florent.fourcot@enst-bretagne.fr>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

vlan: Implement vlan_dev_get_egress_qos_mask as an inline. · e267cb96

Authored by David S. Miller on Nov 11, 2013

This is to avoid very silly Kconfig dependencies for modules
using this routine.
Signed-off-by: David S. Miller <davem@davemloft.net>

netfilter: push reasm skb through instead of original frag skbs · 6aafeef0

Authored by Jiri Pirko on Nov 06, 2013

Pushing original fragments through causes several problems. For example,
when matching, frags may not be matched correctly. Take the following
example:

<example>
On HOSTA do:
ip6tables -I INPUT -p icmpv6 -j DROP
ip6tables -I INPUT -p icmpv6 -m icmp6 --icmpv6-type 128 -j ACCEPT

and on HOSTB you do:
ping6 HOSTA -s2000    (MTU is 1500)

Incoming echo requests will be filtered out on HOSTA. This issue does
not occur with packets smaller than the MTU (where fragmentation does not happen).
</example>

As was discussed previously, the only correct solution seems to be to use
the reassembled skb instead of separate frags. Doing this has positive side
effects: it shrinks sk_buff by one pointer (nfct_reasm), and the reasm
dances in ipvs and conntrack can be removed.

Future plan is to remove net/ipv6/netfilter/nf_conntrack_reasm.c
entirely and use code in net/ipv6/reassembly.c instead.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ip6_output: fragment outgoing reassembled skb properly · 9037c357

Authored by Jiri Pirko on Nov 06, 2013

If a reassembled packet would fit into the outdev MTU, it is not fragmented
according to the original frag size and it is sent as a single big packet.

The second case is if skb is gso. In that case fragmentation does not happen
according to the original frag size.

This patch fixes these.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>

10 Nov, 2013 · 1 commit

net_sched: tbf: support of 64bit rates · a33c4a26

Authored by Yang Yingliang on Nov 08, 2013

With psched_ratecfg_precompute(), tbf can deal with 64bit rates.
Add two new attributes so that tc can use them to break the 32bit
limit.
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Suggested-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

09 Nov, 2013 · 7 commits

ipv6: use rt6_get_dflt_router to get default router in rt6_route_rcv · f104a567

Authored by Duan Jiong on Nov 08, 2013

As RFC 4191 says, the Router Preference and Lifetime values in a
::/0 Route Information Option should override the preference and lifetime
values in the Router Advertisement header. But when the kernel deals with
a ::/0 Route Information Option, rt6_get_route_info() always returns
NULL, which means that the override will not happen, because those default
routers were added without the RTF_ROUTEINFO flag in rt6_add_dflt_router().

To handle that case, we should call rt6_get_dflt_router
when the prefix length is 0.
Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfnetlink: do not ack malformed messages · cdbe7c2d

Authored by Jiri Benc on Nov 07, 2013

Commit 0628b123 ("netfilter: nfnetlink: add batch support and use it
from nf_tables") introduced a bug leading to various crashes in netlink_ack
when a netlink message with an invalid nlmsg_len was sent by an unprivileged
user.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Fix "ip rule delete table 256" · 13eb2ab2

Authored by Andreas Henriksson on Nov 07, 2013

When trying to delete a table >= 256 using iproute2, the local table
will be deleted.
The table id is specified as a netlink attribute when it needs more than
8 bits, and iproute2 then sets the table field to RT_TABLE_UNSPEC (0).
The preconditions for matching the table id in the rule delete code
don't seem to take the "table id in netlink attribute" case into consideration,
so the frh_get_table helper function never gets to do its job when
matching against the current rule.
Use the helper function twice instead of peeking at the table value directly.

Originally reported at: http://bugs.debian.org/724783
Reported-by: Nicolas HICHER <nhicher@avencall.com>
Signed-off-by: Andreas Henriksson <andreas@fatal.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
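
The matching problem can be modelled in a few lines of Python. This is a simplified sketch under assumptions: `hdr_table` stands for the 8-bit table field in the request header, `fra_table` for the optional netlink attribute (None when absent), and the two `rule_matches_*` helpers are hypothetical names for the check before and after the fix:

```python
RT_TABLE_UNSPEC = 0
RT_TABLE_LOCAL = 255

def frh_get_table(hdr_table, fra_table):
    """Prefer the table id from the netlink attribute when it is present."""
    return fra_table if fra_table is not None else hdr_table

def rule_matches_old(rule_table, hdr_table, fra_table):
    # Peeks at the header field directly: with RT_TABLE_UNSPEC (0) the table
    # comparison is skipped entirely, so any rule matches.
    if hdr_table and frh_get_table(hdr_table, fra_table) != rule_table:
        return False
    return True

def rule_matches_new(rule_table, hdr_table, fra_table):
    # Uses the helper for both the "is a table requested" test and the comparison.
    wanted = frh_get_table(hdr_table, fra_table)
    if wanted and wanted != rule_table:
        return False
    return True
```

With a request for table 256 (`hdr_table=RT_TABLE_UNSPEC`, `fra_table=256`), the old check accepts the rule for the local table, while the new check only accepts a rule whose table is 256.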

ipv6: protect flow label renew against GC · 394055f6

Authored by Florent Fourcot on Nov 07, 2013

Take ip6_fl_lock before reading and updating
a label.

v2: protect only the relevant code
Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: increase maximum lifetime of flow labels · 53b47106

Authored by Florent Fourcot on Nov 07, 2013

Although the latest RFC 6437 does not give any constraints
on the lifetime of flow labels, the previous RFC 3697
spoke of a minimum of 120 seconds between
reattribution of a flow label.

The maximum linger is currently set to 60 seconds
and does not allow this configuration without
the CAP_NET_ADMIN right.

This patch increases the maximum linger to 150
seconds, allowing more flexibility to standard
users.
Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: enable IPV6_FLOWLABEL_MGR for getsockopt · 3fdfa5ff

Authored by Florent Fourcot on Nov 07, 2013

It is already possible to set/put/renew a label
with IPV6_FLOWLABEL_MGR and setsockopt. This patch
adds the possibility to get information about this
label (current value, time before expiration, etc).

It helps applications decide whether to renew
or release the label.

v2:
 * Add spin_lock to prevent race condition
 * return -ENOENT if no result found
 * check if flr_action is GET

v3:
 * move the spin_lock to protect only the
   relevant code
Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: flow_dissector: small optimizations in IPv4 dissect · 3797d3e8

Authored by Eric Dumazet on Nov 07, 2013

By moving code around, we avoid:

1) A reload of iph->ihl (bit field, so needs a mask)

2) A conditional test (replaced by a conditional mov on x86)
   Fast path loads iph->protocol anyway.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

08 Nov, 2013 · 10 commits

inet: fix a UFO regression · dcd60771

Authored by Eric Dumazet on Nov 07, 2013

While testing virtio_net and skb_segment() changes, Hannes reported
that UFO was sending wrong frames.

It appears this was introduced by a recent commit :
8c3a897b ("inet: restore gso for vxlan")

The old condition to perform IP frag was :

tunnel = !!skb->encapsulation;
...
        if (!tunnel && proto == IPPROTO_UDP) {

So the new one should be :

udpfrag = !skb->encapsulation && proto == IPPROTO_UDP;
...
        if (udpfrag) {

Initialization of udpfrag must be done before call
to ops->callbacks.gso_segment(skb, features), as
skb_udp_tunnel_segment() clears skb->encapsulation

(We want udpfrag to be true for UFO, false for VXLAN)

With help from Alexei Starovoitov
Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
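
Why the ordering matters can be shown with a toy Python model. The dict-based skb, `tunnel_gso_segment` and `needs_udp_frag` are hypothetical stand-ins for illustration; the only behaviour taken from the commit message is that the tunnel segmentation callback clears the encapsulation flag:

```python
IPPROTO_UDP = 17

def tunnel_gso_segment(skb):
    """Models skb_udp_tunnel_segment(): it clears skb->encapsulation."""
    skb["encapsulation"] = False

def plain_gso_segment(skb):
    """Models a segmentation callback that leaves the flag alone."""

def needs_udp_frag(skb, proto, segment, compute_before):
    """Return udpfrag, computed either before or after the segment callback."""
    if compute_before:
        udpfrag = (not skb["encapsulation"]) and proto == IPPROTO_UDP
        segment(skb)
    else:
        segment(skb)
        udpfrag = (not skb["encapsulation"]) and proto == IPPROTO_UDP
    return udpfrag
```

Computing the flag first gives False for an encapsulated (VXLAN-like) skb and True for plain UFO. Computing it after the callback wrongly gives True for the encapsulated case as well.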

net: skbuff - kernel-doc fixes · bc32383c

Authored by Mathias Krause on Nov 07, 2013

Use "@" to refer to parameters in the kernel-doc description. According
to Documentation/kernel-doc-nano-HOWTO.txt "&" shall be used to refer to
structures only.
Signed-off-by: Mathias Krause <mathias.krause@secunet.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

caif: use pskb_put() instead of reimplementing its functionality · 253c6daa

Authored by Mathias Krause on Nov 07, 2013

Also remove the warning for fragmented packets -- skb_cow_data() will
linearize the buffer, removing all fragments.
Signed-off-by: Mathias Krause <mathias.krause@secunet.com>
Cc: Dmitry Tarnyagin <dmitry.tarnyagin@lockless.no>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: move pskb_put() to core code · 0c7ddf36

Authored by Mathias Krause on Nov 07, 2013

This function has uses besides IPsec, so move it to the core skbuff code.
While doing so, give it some documentation and change its return type to
'unsigned char *' to be in line with skb_put().
Signed-off-by: Mathias Krause <mathias.krause@secunet.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Add layer 2 hardware acceleration operations for macvlan devices · a6cc0cfa

Authored by John Fastabend on Nov 06, 2013

Add an operations structure that allows a network interface to export
the fact that it supports packet forwarding in hardware between
physical interfaces and other mac layer devices assigned to it (such
as macvlans). This operations structure can be used by virtual mac
devices to bypass software switching so that forwarding can be done
in hardware more efficiently.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

6lowpan: release device on error path · 78032f9b

Authored by Dan Carpenter on Nov 07, 2013

We recently added a new error path and it needs a dev_put().

Fixes: 7adac1ec ('6lowpan: Only make 6lowpan links to IEEE802154 devices')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/vlan: Provide read access to the vlan egress map · d3243539

Authored by Eyal Perry on Nov 06, 2013

Provide a method for read-only access to the vlan device egress mapping.

Do this by refactoring vlan_dev_get_egress_qos_mask() so that it now
receives the skb priority as an argument instead of a pointer to the skb.

Such an access is needed for the IBoE stack where the control plane
goes through the network stack. This is an add-on step on top of commit
d4a96865 "net/route: export symbol ip_tos2prio" which allowed the RDMA-CM
to use ip_tos2prio.
Signed-off-by: Eyal Perry <eyalpe@mellanox.com>
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tipc: reassembly failures should cause link reset · a715b49e

Authored by Erik Hugne on Nov 06, 2013

If appending a received fragment to the pending fragment chain
in a unicast link fails, the current code tries to force a retransmission
of the fragment by decrementing the 'next received sequence number'
field in the link. This is done under the assumption that the failure
is caused by an out-of-memory situation, an assumption that does
not hold true after the previous patch in this series.

A failure to append a fragment can now only be caused by a protocol
violation by the sending peer, and it must hence be assumed that it
is either malicious or buggy.  Either way, the correct behavior is now
to reset the link instead of trying to revert its sequence number.
So, this is what we do in this commit.
Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tipc: message reassembly using fragment chain · 40ba3cdf

Authored by Erik Hugne on Nov 06, 2013

When the first fragment of a long data message is received on a link, a
reassembly buffer large enough to hold the data from this and all subsequent
fragments of the message is allocated. The payload of each new fragment is
copied into this buffer upon arrival. When the last fragment is received, the
reassembled message is delivered upwards to the port/socket layer.

Not only is this an inefficient approach, but it may also cause bursts of
reassembly failures in low memory situations, since we may fail to allocate
the necessary large buffer in the first place. Furthermore, after 100 subsequent
such failures the link will be reset, something that in reality aggravates the
situation.

To remedy this problem, this patch introduces a different approach. Instead of
allocating a big reassembly buffer, we now append the arriving fragments
to a reassembly chain on the link, and deliver the whole chain up to the
socket layer once the last fragment has been received. This is safe because
the retransmission layer of a TIPC link always delivers packets in strict
uninterrupted order, to the reassembly layer as to all other upper layers.
Hence there can never be more than one fragment chain pending reassembly at
any given time in a link, and we can trust (but still verify) that the
fragments will be chained up in the correct order.
Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
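
The chain-based approach can be sketched in Python as follows. This is a toy model under assumptions: the fragment kinds, the `ReassemblyChain` class and the specific conditions treated as protocol violations are hypothetical and only illustrate the "one pending chain per link, trust but verify" idea from the commit message:

```python
FIRST, MIDDLE, LAST = "first", "middle", "last"

class ProtocolViolation(Exception):
    """Raised when fragments arrive in an order the sender should never produce."""

class ReassemblyChain:
    def __init__(self):
        self.pending = None  # list of fragment payloads, or None if no chain

    def receive(self, kind, payload):
        """Append a fragment; return the whole chain once the last one arrives."""
        if kind == FIRST:
            if self.pending is not None:
                self.pending = None
                raise ProtocolViolation("first fragment while a chain is pending")
            self.pending = [payload]
            return None
        if self.pending is None:
            raise ProtocolViolation("fragment without a first fragment")
        self.pending.append(payload)  # no copy into a big preallocated buffer
        if kind == LAST:
            chain, self.pending = self.pending, None
            return chain
        return None
```

No large buffer is allocated up front, so the only way appending can fail in this model is an ordering violation by the sender.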

tipc: don't reroute message fragments · 528f6f4b

Authored by Erik Hugne on Nov 06, 2013

When a message fragment is received in a broadcast or unicast link,
the reception code will append the fragment payload to a big reassembly
buffer through a call to the function tipc_recv_fragm(). However, after
the return of that call, the logic goes on and passes the fragment
buffer to the function tipc_net_route_msg(), which will simply drop it.
This behavior is a remnant from the now obsolete multi-cluster
functionality, and has no relevance in the current code base.

Although currently harmless, this unnecessary call would be fatal
after applying the next patch in this series, which introduces
a completely new reassembly algorithm. So we change the code to
eliminate the redundant call.
Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

06 Nov, 2013 · 4 commits

ipv6: drop the judgement in rt6_alloc_cow() · 249a3630

Authored by Duan Jiong on Nov 05, 2013

Now rt6_alloc_cow() is only called by ip6_pol_route() when
rt->rt6i_flags doesn't contain both RTF_NONEXTHOP and RTF_GATEWAY,
and rt->rt6i_flags hasn't been changed in ip6_rt_copy().
So there is no need to check whether rt->rt6i_flags contains
RTF_GATEWAY or not.
Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: fix headroom calculation in udp6_ufo_fragment · 0e033e04

Authored by Hannes Frederic Sowa on Nov 05, 2013

Commit 1e2bd517 ("udp6: Fix udp
fragmentation for tunnel traffic.") changed the check for whether
there is enough space to include a fragment header in the skb from a
skb->mac_header derived one to skb_headroom. Because we already peeled
off the skb to transport_header this is wrong. Change this back to check
if we have enough room before the mac_header.

This fixes a panic Saran Neti reported. He used the tbf scheduler which
skb_gso_segments the skb. The offsets become negative and we panic in memcpy
because the skb was erroneously not expanded at the head.
Reported-by: Saran Neti <Saran.Neti@telus.com>
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE · 482fc609

Authored by Hannes Frederic Sowa on Nov 05, 2013

Sockets marked with IP_PMTUDISC_INTERFACE won't do path mtu discovery,
their sockets won't accept and install new path mtu information and they
will always use the interface mtu for outgoing packets. It is guaranteed
that the packet is not fragmented locally. But we won't set the DF-Flag
on the outgoing frames.

Florian Weimer had the idea to use this flag to ensure DNS servers
never generate outgoing fragments. They may well be fragmented on the
path, but the server never stores or uses path mtu values, which could
well be forged in an attack.

(The root of the problem with path MTU discovery is that there is
no reliable way to authenticate ICMP Fragmentation Needed But DF Set
messages because they are sent from intermediate routers with their
source addresses, and the ICMP payload will not always contain sufficient
information to identify a flow.)

Recent research in the DNS community showed that it is possible to
implement an attack where DNS cache poisoning is feasible by spoofing
fragments. This work was done by Amir Herzberg and Haya Shulman:
<https://sites.google.com/site/hayashulman/files/fragmentation-poisoning.pdf>

This issue was previously discussed among the DNS community, e.g.
<http://www.ietf.org/mail-archive/web/dnsext/current/msg01204.html>,
without leading to fixes.

This patch depends on the patch "ipv4: fix DO and PROBE pmtu mode
regarding local fragmentation with UFO/CORK" for the enforcement of the
non-fragmentable checks. If other users than ip_append_page/data should
use this semantic too, we have to add a new flag to IPCB(skb)->flags to
suppress local fragmentation and check for this in ip_finish_output.

Many thanks to Florian Weimer for the idea and feedback while implementing
this patch.

Cc: David S. Miller <davem@davemloft.net>
Suggested-by: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
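
From user space, a socket could opt in roughly as in the Python sketch below. It assumes the Linux uapi values IP_MTU_DISCOVER = 10 and IP_PMTUDISC_INTERFACE = 4; the constants are defined locally because Python's socket module may not export them, and the helper name `use_interface_mtu` is hypothetical:

```python
import socket

# Assumed Linux uapi values (linux/in.h); defined here in case the socket
# module on this platform does not provide them.
IP_MTU_DISCOVER = getattr(socket, "IP_MTU_DISCOVER", 10)
IP_PMTUDISC_INTERFACE = 4

def use_interface_mtu(sock):
    """Ask the kernel to ignore path MTU information for this socket."""
    sock.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_INTERFACE)
```

A DNS server would call this once on its UDP socket after creating it. On kernels without this mode the call is expected to fail with an OSError, so callers may want to handle that case.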

ipv6: remove old conditions on flow label sharing · b579035f

Authored by Florent Fourcot on Nov 02, 2013

The flow label code in the Linux kernel follows
the rules of RFC 1809 (an informational one) for
conditions on flow label sharing. These rules are
not in the latest proposed standard for flow labels
(RFC 6437), or in the previous one (RFC 3697).

Since this code does not follow any current or
old standard, we can remove it.

With this removal, the ipv6_opt_cmp function is
now dead code and it can be removed too.

Changelog to v1:
 * add justification for the change
 * remove the condition on IPv6 options

[ Remove ipv6_hdr_cmp as it is now unused as well. -DaveM ]
Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

05 Nov, 2013 · 3 commits

net: introduce skb_coalesce_rx_frag() · f8e617e1

Authored by Jason Wang on Nov 01, 2013

Sometimes we need to coalesce the rx frags to avoid a frag list. One example is
the virtio-net driver, which tries to use small frags for both MTU sized packets and
GSO packets. So this patch introduces skb_coalesce_rx_frag() to do this.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michael Dalton <mwdalton@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: properly handle stretch acks in slow start · 9f9843a7

Authored by Yuchung Cheng on Oct 31, 2013

Slow start currently increases cwnd by 1 if an ACK acknowledges some packets,
regardless of the number of packets. Consequently slow start performance
is highly dependent on the degree of the stretch ACKs caused by
receiver or network ACK compression mechanisms (e.g., delayed-ACK,
GRO, etc).  But the slow start algorithm is meant to send twice the number of
packets that have left the network, so it should process a stretch ACK of degree
N as if it were N ACKs of degree 1, then exit when cwnd exceeds ssthresh. A
follow up patch will use the remainder of the N (if greater than 1)
to adjust cwnd in the congestion avoidance phase.

In addition this patch retires the experimental limited slow start
(LSS) feature. LSS has multiple drawbacks but questionable benefit. The
fractional cwnd increase in LSS requires a loop in slow start even
though it's rarely used. Configuring such an increase step via a global
sysctl on different BDPs seems hard. Finally and most importantly the
slow start overshoot concern is now better covered by the Hybrid slow
start (hystart) enabled by default.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
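
A small Python model of processing a stretch ACK of degree N in slow start, assuming cwnd starts at or below ssthresh and ignoring any upper clamp on cwnd. The function name and the choice to stop at ssthresh + 1 are assumptions made for illustration, not a copy of the kernel code:

```python
def slow_start(cwnd, ssthresh, acked):
    """Grow cwnd by one per acknowledged packet until it exceeds ssthresh.

    Returns (new_cwnd, leftover), where leftover is the part of the ACK
    that was not consumed by slow start.
    """
    new_cwnd = cwnd + acked
    if new_cwnd > ssthresh:
        # Exit slow start as soon as cwnd exceeds ssthresh.
        new_cwnd = ssthresh + 1
    leftover = acked - (new_cwnd - cwnd)
    return new_cwnd, leftover
```

An ACK covering 4 packets with cwnd 10 and ssthresh 100 raises cwnd to 14 with nothing left over. With ssthresh 12 and an ACK covering 5 packets, cwnd stops at 13 and 2 acknowledged packets remain for the congestion avoidance phase.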

tcp: enable sockets to use MSG_FASTOPEN by default · 0d41cca4

Authored by Yuchung Cheng on Oct 31, 2013

Applications have started to use Fast Open (e.g., Chrome browser has
such an optional flag) and the feature has gone through several
generations of kernels since 3.7 with many real network tests. It's
time to enable this flag by default for applications to test more
conveniently and extensively.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

openanolis / cloud-kernel · synced successfully 12 months ago
