- 29 5月, 2018 1 次提交
-
-
由 David Ahern 提交于
Add support for IFA_RT_PRIORITY to ipv4 addresses. If the metric is changed on an existing address then the new route is inserted before removing the old one. Since the metric is one of the route keys, the prefix route can not be replaced. Signed-off-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 15 3月, 2018 1 次提交
-
-
由 Sabrina Dubroca 提交于
Prior to the rework of PMTU information storage in commit 2c8cec5c ("ipv4: Cache learned PMTU information in inetpeer."), when a PMTU event advertising a PMTU smaller than net.ipv4.route.min_pmtu was received, we would disable setting the DF flag on packets by locking the MTU metric, and set the PMTU to net.ipv4.route.min_pmtu. Since then, we don't disable DF, and set PMTU to net.ipv4.route.min_pmtu, so the intermediate router that has this link with a small MTU will have to drop the packets. This patch reestablishes pre-2.6.39 behavior by splitting rtable->rt_pmtu into a bitfield with rt_mtu_locked and rt_pmtu. rt_mtu_locked indicates that we shouldn't set the DF bit on that path, and is checked in ip_dont_fragment(). One possible workaround is to set net.ipv4.route.min_pmtu to a value low enough to accommodate the lowest MTU encountered. Fixes: 2c8cec5c ("ipv4: Cache learned PMTU information in inetpeer.") Signed-off-by: NSabrina Dubroca <sd@queasysnail.net> Reviewed-by: NStefano Brivio <sbrivio@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 16 2月, 2018 2 次提交
-
-
由 Xin Long 提交于
In early time, when freeing a xdst, it would be inserted into dst_garbage.list first. Then if it's refcnt was still held somewhere, later it would be put into dst_busy_list in dst_gc_task(). When one dev was being unregistered, the dev of these dsts in dst_busy_list would be set with loopback_dev and put this dev. So that this dev's removal wouldn't get blocked, and avoid the kmsg warning: kernel:unregister_netdevice: waiting for veth0 to become \ free. Usage count = 2 However after Commit 52df157f ("xfrm: take refcnt of dst when creating struct xfrm_dst bundle"), the xdst will not be freed with dst gc, and this warning happens. To fix it, we need to find these xdsts that are still held by others when removing the dev, and free xdst's dev and set it with loopback_dev. But unfortunately after flow_cache for xfrm was deleted, no list tracks them anymore. So we need to save these xdsts somewhere to release the xdst's dev later. To make this easier, this patch is to reuse uncached_list to track xdsts, so that the dev refcnt can be released in the event NETDEV_UNREGISTER process of fib_netdev_notifier. Thanks to Florian, we could move forward this fix quickly. Fixes: 52df157f ("xfrm: take refcnt of dst when creating struct xfrm_dst bundle") Reported-by: NJianlin Shi <jishi@redhat.com> Reported-by: NHangbin Liu <liuhangbin@gmail.com> Tested-by: NEyal Birger <eyal.birger@gmail.com> Signed-off-by: NXin Long <lucien.xin@gmail.com> Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
-
由 David Ahern 提交于
Remove rt_table_id from rtable. It was added for getroute to return the table id that was hit in the lookup. With the changes for fibmatch the table id can be extracted from the fib_info returned in the fib_result so it no longer needs to be in rtable directly. Signed-off-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 25 1月, 2018 1 次提交
-
-
由 Al Viro 提交于
Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 01 10月, 2017 1 次提交
-
-
由 Paolo Abeni 提交于
The UDP early demux can leverate the rx dst cache even for multicast unconnected sockets. In such scenario the ipv4 source address is validated only on the first packet in the given flow. After that, when we fetch the dst entry from the socket rx cache, we stop enforcing the rp_filter and we even start accepting any kind of martian addresses. Disabling the dst cache for unconnected multicast socket will cause large performace regression, nearly reducing by half the max ingress tput. Instead we factor out a route helper to completely validate an skb source address for multicast packets and we call it from the UDP early demux for mcast packets landing on unconnected sockets, after successful fetching the related cached dst entry. This still gives a measurable, but limited performance regression: rp_filter = 0 rp_filter = 1 edmux disabled: 1182 Kpps 1127 Kpps edmux before: 2238 Kpps 2238 Kpps edmux after: 2037 Kpps 2019 Kpps The above figures are on top of current net tree. Applying the net-next commit 6e617de8 ("net: avoid a full fib lookup when rp_filter is disabled.") the delta with rp_filter == 0 will decrease even more. Fixes: 421b3885 ("udp: ipv4: Add udp early demux") Signed-off-by: NPaolo Abeni <pabeni@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 22 9月, 2017 1 次提交
-
-
由 Eric Dumazet 提交于
In linux-4.13, Wei worked hard to convert dst to a traditional refcounted model, removing GC. We now want to make sure a dst refcount can not transition from 0 back to 1. The problem here is that input path attached a not refcounted dst to an skb. Then later, because packet is forwarded and hits skb_dst_force() before exiting RCU section, we might try to take a refcount on one dst that is about to be freed, if another cpu saw 1 -> 0 transition in dst_release() and queued the dst for freeing after one RCU grace period. Lets unify skb_dst_force() and skb_dst_force_safe(), since we should always perform the complete check against dst refcount, and not assume it is not zero. Bugzilla : https://bugzilla.kernel.org/show_bug.cgi?id=197005 [ 989.919496] skb_dst_force+0x32/0x34 [ 989.919498] __dev_queue_xmit+0x1ad/0x482 [ 989.919501] ? eth_header+0x28/0xc6 [ 989.919502] dev_queue_xmit+0xb/0xd [ 989.919504] neigh_connected_output+0x9b/0xb4 [ 989.919507] ip_finish_output2+0x234/0x294 [ 989.919509] ? ipt_do_table+0x369/0x388 [ 989.919510] ip_finish_output+0x12c/0x13f [ 989.919512] ip_output+0x53/0x87 [ 989.919513] ip_forward_finish+0x53/0x5a [ 989.919515] ip_forward+0x2cb/0x3e6 [ 989.919516] ? pskb_trim_rcsum.part.9+0x4b/0x4b [ 989.919518] ip_rcv_finish+0x2e2/0x321 [ 989.919519] ip_rcv+0x26f/0x2eb [ 989.919522] ? vlan_do_receive+0x4f/0x289 [ 989.919523] __netif_receive_skb_core+0x467/0x50b [ 989.919526] ? tcp_gro_receive+0x239/0x239 [ 989.919529] ? inet_gro_receive+0x226/0x238 [ 989.919530] __netif_receive_skb+0x4d/0x5f [ 989.919532] netif_receive_skb_internal+0x5c/0xaf [ 989.919533] napi_gro_receive+0x45/0x81 [ 989.919536] ixgbe_poll+0xc8a/0xf09 [ 989.919539] ? kmem_cache_free_bulk+0x1b6/0x1f7 [ 989.919540] net_rx_action+0xf4/0x266 [ 989.919543] __do_softirq+0xa8/0x19d [ 989.919545] irq_exit+0x5d/0x6b [ 989.919546] do_IRQ+0x9c/0xb5 [ 989.919548] common_interrupt+0x93/0x93 [ 989.919548] </IRQ> Similarly dst_clone() can use dst_hold() helper to have additional debugging, as a follow up to commit 44ebe791 ("net: add debug atomic_inc_not_zero() in dst_hold()") In net-next we will convert dst atomic_t to refcount_t for peace of mind. Fixes: a4c2fd7f ("net: remove DST_NOCACHE flag") Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Wei Wang <weiwan@google.com> Reported-by: NPaweł Staszewski <pstaszewski@itcare.pl> Bisected-by: NPaweł Staszewski <pstaszewski@itcare.pl> Acked-by: NWei Wang <weiwan@google.com> Acked-by: NMartin KaFai Lau <kafai@fb.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 04 9月, 2017 1 次提交
-
-
由 Stefano Brivio 提交于
After ip_route_input() calls ip_route_input_noref(), another check on skb_dst() is done, but if this fails, we shouldn't override the return code from ip_route_input_noref(), as it could have been more specific (i.e. -EHOSTUNREACH). This also saves one call to skb_dst_force_safe() and one to skb_dst() in case the ip_route_input_noref() check fails. Reported-by: NSabrina Dubroca <sdubroca@redhat.com> Fixes: 9df16efa ("ipv4: call dst_hold_safe() properly") Signed-off-by: NStefano Brivio <sbrivio@redhat.com> Acked-by: NWei Wang <weiwan@google.com> Acked-by: NSabrina Dubroca <sd@queasysnail.net> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 18 6月, 2017 1 次提交
-
-
由 Wei Wang 提交于
This patch checks all the calls to dst_hold()/skb_dst_force()/dst_clone()/dst_use() to see if dst_hold_safe() is needed to avoid double free issue if dst gc is removed and dst_release() directly destroys dst when dst->__refcnt drops to 0. In tx path, TCP hold sk->sk_rx_dst ref count and also hold sock_lock(). UDP and other similar protocols always hold refcount for skb->_skb_refdst. So both paths seem to be safe. In rx path, as it is lockless and skb_dst_set_noref() is likely to be used, dst_hold_safe() should always be used when trying to hold dst. In the routing code, if dst is held during an rcu protected session, it is necessary to call dst_hold_safe() as the current dst might be in its rcu grace period. Signed-off-by: NWei Wang <weiwan@google.com> Acked-by: NMartin KaFai Lau <kafai@fb.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 27 5月, 2017 2 次提交
-
-
由 David Ahern 提交于
A later patch wants access to the fib result on an input route lookup with the rcu lock held. Refactor ip_route_input_noref pushing the logic between rcu_read_lock ... rcu_read_unlock into a new helper that takes the fib_result as an input arg. Signed-off-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David Ahern 提交于
A later patch wants access to the fib result on an output route lookup with the rcu lock held. Refactor __ip_route_output_key_hash, pushing the logic between rcu_read_lock ... rcu_read_unlock into a new helper with the fib_result as an input arg. To keep the name length under control remove the leading underscores from the name and add _rcu to the name of the new helper indicating it is called with the rcu read lock held. Signed-off-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 09 5月, 2017 2 次提交
-
-
由 David S. Miller 提交于
This reverts commit 82486aa6. As implemented, this causes dangling netdevice refs. Reported-by: NEric Dumazet <eric.dumazet@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 WANG Cong 提交于
IPv4 dst could use fi->fib_metrics to store metrics but fib_info itself is refcnt'ed, so without taking a refcnt fi and fi->fib_metrics could be freed while dst metrics still points to it. This triggers use-after-free as reported by Andrey twice. This patch reverts commit 2860583f ("ipv4: Kill rt->fi") to restore this reference counting. It is a quick fix for -net and -stable, for -net-next, as Eric suggested, we can consider doing reference counting for metrics itself instead of relying on fib_info. IPv6 is very different, it copies or steals the metrics from mx6_config in fib6_commit_metrics() so probably doesn't need a refcnt. Decnet has already done the refcnt'ing, see dn_fib_semantic_match(). Fixes: 2860583f ("ipv4: Kill rt->fi") Reported-by: NAndrey Konovalov <andreyknvl@google.com> Tested-by: NAndrey Konovalov <andreyknvl@google.com> Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com> Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 22 3月, 2017 1 次提交
-
-
由 Nikolay Aleksandrov 提交于
This patch adds support for ECMP hash policy choice via a new sysctl called fib_multipath_hash_policy and also adds support for L4 hashes. The current values for fib_multipath_hash_policy are: 0 - layer 3 (default) 1 - layer 4 If there's an skb hash already set and it matches the chosen policy then it will be used instead of being calculated (currently only for L4). In L3 mode we always calculate the hash due to the ICMP error special case, the flow dissector's field consistentification should handle the address order thus we can remove the address reversals. If the skb is provided we always use it for the hash calculation, otherwise we fallback to fl4, that is if skb is NULL fl4 has to be set. Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 05 11月, 2016 1 次提交
-
-
由 Lorenzo Colitti 提交于
- Use the UID in routing lookups made by protocol connect() and sendmsg() functions. - Make sure that routing lookups triggered by incoming packets (e.g., Path MTU discovery) take the UID of the socket into account. - For packets not associated with a userspace socket, (e.g., ping replies) use UID 0 inside the user namespace corresponding to the network namespace the socket belongs to. This allows all namespaces to apply routing and iptables rules to kernel-originated traffic in that namespaces by matching UID 0. This is better than using the UID of the kernel socket that is sending the traffic, because the UID of kernel sockets created at namespace creation time (e.g., the per-processor ICMP and TCP sockets) is the UID of the user that created the socket, which might not be mapped in the namespace. Tested: compiles allnoconfig, allyesconfig, allmodconfig Tested: https://android-review.googlesource.com/253302Signed-off-by: NLorenzo Colitti <lorenzo@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 11 9月, 2016 1 次提交
-
-
由 David Ahern 提交于
No longer needed Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 12 4月, 2016 1 次提交
-
-
由 David Ahern 提交于
Vivek reported a kernel exception deleting a VRF with an active connection through it. The root cause is that the socket has a cached reference to a dst that is destroyed. Converting the dst_destroy to dst_release and letting proper reference counting kick in does not work as the dst has a reference to the device which needs to be released as well. I talked to Hannes about this at netdev and he pointed out the ipv4 and ipv6 dst handling has dst_ifdown for just this scenario. Rather than continuing with the reinvented dst wheel in VRF just remove it and leverage the ipv4 and ipv6 versions. Fixes: 193125db ("net: Introduce VRF device driver") Fixes: 35402e31 ("net: Add IPv6 support to VRF device") Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 08 4月, 2016 1 次提交
-
-
由 Tom Herbert 提交于
In inet_iif check if skb_rtable is NULL for the skb and return skb->skb_iif if it is. This change allows inet_iif to be called before the dst information has been set in the skb (e.g. when doing socket based UDP GRO). Signed-off-by: NTom Herbert <tom@herbertland.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 17 2月, 2016 1 次提交
-
-
由 Nikolay Borisov 提交于
Signed-off-by: NNikolay Borisov <kernel@kyup.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 05 1月, 2016 1 次提交
-
-
由 David Ahern 提交于
Commands run in a vrf context are not failing as expected on a route lookup: root@kenny:~# ip ro ls table vrf-red unreachable default root@kenny:~# ping -I vrf-red -c1 -w1 10.100.1.254 ping: Warning: source address might be selected on device other than vrf-red. PING 10.100.1.254 (10.100.1.254) from 0.0.0.0 vrf-red: 56(84) bytes of data. --- 10.100.1.254 ping statistics --- 2 packets transmitted, 0 received, 100% packet loss, time 999ms Since the vrf table does not have a route for 10.100.1.254 the ping should have failed. The saddr lookup causes a full VRF table lookup. Propogating a lookup failure to the user allows the command to fail as expected: root@kenny:~# ping -I vrf-red -c1 -w1 10.100.1.254 connect: No route to host Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 07 10月, 2015 2 次提交
-
-
由 David Ahern 提交于
Add operation to l3mdev to lookup source address for a given flow. Add support for the operation to VRF driver and convert existing IPv4 hooks to use the new lookup. Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David Ahern 提交于
Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 05 10月, 2015 1 次提交
-
-
由 Peter Nørlund 提交于
ICMP packets are inspected to let them route together with the flow they belong to, minimizing the chance that a problematic path will affect flows on other paths, and so that anycast environments can work with ECMP. Signed-off-by: NPeter Nørlund <pch@ordbogen.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 30 9月, 2015 2 次提交
-
-
由 David Ahern 提交于
Change CONFIG dependency to CONFIG_NET_L3_MASTER_DEV as well. Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David Ahern 提交于
Rename IFF_VRF_MASTER to IFF_L3MDEV_MASTER and update the name of the netif_is_vrf and netif_index_is_vrf macros. Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 26 9月, 2015 1 次提交
-
-
由 Eric Dumazet 提交于
Very soon, TCP stack might call inet_csk_route_req(), which calls inet_csk_route_req() with an unlocked listener socket, so we need to make sure ip_route_output_flow() is not trying to change any field from its socket argument. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 18 9月, 2015 1 次提交
-
-
由 David Ahern 提交于
Steffen reported that the recent change to add oif to dst lookups breaks the VTI use case. The problem is that with the oif set in the flow struct the comparison to the nh_oif is triggered. Fix by splitting the FLOWI_FLAG_VRFSRC into 2 flags -- one that triggers the vrf device cache bypass (FLOWI_FLAG_VRFSRC) and another telling the lookup to not compare nh oif (FLOWI_FLAG_SKIP_NH_OIF). Fixes: 42a7b32b ("xfrm: Add oif to dst lookups") Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com> Acked-by: NSteffen Klassert <steffen.klassert@secunet.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 16 9月, 2015 1 次提交
-
-
由 David Ahern 提交于
Add the FIB table id to rtable to make the information available for IPv4 as it is for IPv6. Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 02 9月, 2015 1 次提交
-
-
由 David Ahern 提交于
A number of VRF patches used 'int' for table id. It should be u32 to be consistent with the rest of the stack. Fixes: 4e3c8992 ("net: Introduce VRF related flags and helpers") 15be405e ("net: Add inet_addr lookup by table") 30bbaa19 ("net: Fix up inet_addr_type checks") 021dd3b8 ("net: Add routes to the table associated with the device") dc028da5 ("inet: Move VRF table lookup to inlined function") f6d3c192 ("net: FIB tracepoints") Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com> Reviewed-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 21 8月, 2015 1 次提交
-
-
由 Jiri Benc 提交于
Currently, the lwtunnel state resides in per-protocol data. This is a problem if we encapsulate ipv6 traffic in an ipv4 tunnel (or vice versa). The xmit function of the tunnel does not know whether the packet has been routed to it by ipv4 or ipv6, yet it needs the lwtstate data. Moving the lwtstate data to dst_entry makes such inter-protocol tunneling possible. As a bonus, this brings a nice diffstat. Signed-off-by: NJiri Benc <jbenc@redhat.com> Acked-by: NRoopa Prabhu <roopa@cumulusnetworks.com> Acked-by: NThomas Graf <tgraf@suug.ch> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 14 8月, 2015 3 次提交
-
-
由 David Ahern 提交于
Currently inet_addr_type and inet_dev_addr_type expect local addresses to be in the local table. With the VRF device local routes for devices associated with a VRF will be in the table associated with the VRF. Provide an alternate inet_addr lookup to use a specific table rather than defaulting to the local table. inet_addr_type_dev_table keeps the same semantics as inet_addr_type but if the passed in device is enslaved to a VRF then the table for that VRF is used for the lookup. Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David Ahern 提交于
Currently inet_addr_type and inet_dev_addr_type expect local addresses to be in the local table. With the VRF device local routes for devices associated with a VRF will be in the table associated with the VRF. Provide an alternate inet_addr lookup to use a specific table rather than defaulting to the local table. Signed-off-by: NShrijeet Mukherjee <shm@cumulusnetworks.com> Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David Ahern 提交于
As with ingress use the index of VRF master device for route lookups on egress. However, the oif should only be used to direct the lookups to a specific table. Routes in the table are not based on the VRF device but rather interfaces that are part of the VRF so do not consider the oif for lookups within the table. The FLOWI_FLAG_VRFSRC is used to control this latter part. Signed-off-by: NShrijeet Mukherjee <shm@cumulusnetworks.com> Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 22 7月, 2015 1 次提交
-
-
由 Roopa Prabhu 提交于
This patch adds support in ipv4 fib functions to parse user provided encap attributes and attach encap state data to fib_nh and rtable. Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 16 1月, 2015 1 次提交
-
-
由 Eric Dumazet 提交于
RAW sockets with hdrinc suffer from contention on rt_uncached_lock spinlock. One solution is to use percpu lists, since most routes are destroyed by the cpu that created them. It is unclear why we even have to put these routes in uncached_list, as all outgoing packets should be freed when a device is dismantled. Signed-off-by: NEric Dumazet <edumazet@google.com> Fixes: caacf05e ("ipv4: Properly purge netdev references on uncached routes.") Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 25 3月, 2014 1 次提交
-
-
由 Li RongQing 提交于
ip_rt_dump do nothing after IPv4 route caches removal, so we can remove it. Signed-off-by: NLi RongQing <roy.qing.li@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 14 1月, 2014 1 次提交
-
-
由 Hannes Frederic Sowa 提交于
While forwarding we should not use the protocol path mtu to calculate the mtu for a forwarded packet but instead use the interface mtu. We mark forwarded skbs in ip_forward with IPSKB_FORWARDED, which was introduced for multicast forwarding. But as it does not conflict with our usage in unicast code path it is perfect for reuse. I moved the functions ip_sk_accept_pmtu, ip_sk_use_pmtu and ip_skb_dst_mtu along with the new ip_dst_mtu_maybe_forward to net/ip.h to fix circular dependencies because of IPSKB_FORWARDED. Because someone might have written a software which does probe destinations manually and expects the kernel to honour those path mtus I introduced a new per-namespace "ip_forward_use_pmtu" knob so someone can disable this new behaviour. We also still use mtus which are locked on a route for forwarding. The reason for this change is, that path mtus information can be injected into the kernel via e.g. icmp_err protocol handler without verification of local sockets. As such, this could cause the IPv4 forwarding path to wrongfully emit fragmentation needed notifications or start to fragment packets along a path. Tunnel and ipsec output paths clear IPCB again, thus IPSKB_FORWARDED won't be set and further fragmentation logic will use the path mtu to determine the fragmentation size. They also recheck packet size with help of path mtu discovery and report appropriate errors. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Miller <davem@davemloft.net> Cc: John Heffner <johnwheffner@gmail.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 06 12月, 2013 1 次提交
-
-
由 Steffen Klassert 提交于
FLOWI_FLAG_CAN_SLEEP was used to notify xfrm about the posibility to sleep until the needed states are resolved. This code is gone, so FLOWI_FLAG_CAN_SLEEP is not needed anymore. Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
-
- 06 11月, 2013 1 次提交
-
-
由 Hannes Frederic Sowa 提交于
Sockets marked with IP_PMTUDISC_INTERFACE won't do path mtu discovery, their sockets won't accept and install new path mtu information and they will always use the interface mtu for outgoing packets. It is guaranteed that the packet is not fragmented locally. But we won't set the DF-Flag on the outgoing frames. Florian Weimer had the idea to use this flag to ensure DNS servers are never generating outgoing fragments. They may well be fragmented on the path, but the server never stores or usees path mtu values, which could well be forged in an attack. (The root of the problem with path MTU discovery is that there is no reliable way to authenticate ICMP Fragmentation Needed But DF Set messages because they are sent from intermediate routers with their source addresses, and the IMCP payload will not always contain sufficient information to identify a flow.) Recent research in the DNS community showed that it is possible to implement an attack where DNS cache poisoning is feasible by spoofing fragments. This work was done by Amir Herzberg and Haya Shulman: <https://sites.google.com/site/hayashulman/files/fragmentation-poisoning.pdf> This issue was previously discussed among the DNS community, e.g. <http://www.ietf.org/mail-archive/web/dnsext/current/msg01204.html>, without leading to fixes. This patch depends on the patch "ipv4: fix DO and PROBE pmtu mode regarding local fragmentation with UFO/CORK" for the enforcement of the non-fragmentable checks. If other users than ip_append_page/data should use this semantic too, we have to add a new flag to IPCB(skb)->flags to suppress local fragmentation and check for this in ip_finish_output. Many thanks to Florian Weimer for the idea and feedback while implementing this patch. Cc: David S. Miller <davem@davemloft.net> Suggested-by: NFlorian Weimer <fweimer@redhat.com> Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 18 10月, 2013 1 次提交
-
-
由 Eric Dumazet 提交于
Half of the rt_cache_stat fields are no longer used after IP route cache removal, lets shrink this per cpu area. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-