提交 · bb176f67090ca54869fc1262c913aa69d2ede070 · openeuler / Kernel

22 9月, 2017 1 次提交

net: prevent dst uses after free · 222d7dbd

由 Eric Dumazet 提交于 9月 21, 2017

In linux-4.13, Wei worked hard to convert dst to a traditional
refcounted model, removing GC.

We now want to make sure a dst refcount can not transition from 0 back
to 1.

The problem here is that input path attached a not refcounted dst to an
skb. Then later, because packet is forwarded and hits skb_dst_force()
before exiting RCU section, we might try to take a refcount on one dst
that is about to be freed, if another cpu saw 1 -> 0 transition in
dst_release() and queued the dst for freeing after one RCU grace period.

Lets unify skb_dst_force() and skb_dst_force_safe(), since we should
always perform the complete check against dst refcount, and not assume
it is not zero.

Bugzilla : https://bugzilla.kernel.org/show_bug.cgi?id=197005

[  989.919496]  skb_dst_force+0x32/0x34
[  989.919498]  __dev_queue_xmit+0x1ad/0x482
[  989.919501]  ? eth_header+0x28/0xc6
[  989.919502]  dev_queue_xmit+0xb/0xd
[  989.919504]  neigh_connected_output+0x9b/0xb4
[  989.919507]  ip_finish_output2+0x234/0x294
[  989.919509]  ? ipt_do_table+0x369/0x388
[  989.919510]  ip_finish_output+0x12c/0x13f
[  989.919512]  ip_output+0x53/0x87
[  989.919513]  ip_forward_finish+0x53/0x5a
[  989.919515]  ip_forward+0x2cb/0x3e6
[  989.919516]  ? pskb_trim_rcsum.part.9+0x4b/0x4b
[  989.919518]  ip_rcv_finish+0x2e2/0x321
[  989.919519]  ip_rcv+0x26f/0x2eb
[  989.919522]  ? vlan_do_receive+0x4f/0x289
[  989.919523]  __netif_receive_skb_core+0x467/0x50b
[  989.919526]  ? tcp_gro_receive+0x239/0x239
[  989.919529]  ? inet_gro_receive+0x226/0x238
[  989.919530]  __netif_receive_skb+0x4d/0x5f
[  989.919532]  netif_receive_skb_internal+0x5c/0xaf
[  989.919533]  napi_gro_receive+0x45/0x81
[  989.919536]  ixgbe_poll+0xc8a/0xf09
[  989.919539]  ? kmem_cache_free_bulk+0x1b6/0x1f7
[  989.919540]  net_rx_action+0xf4/0x266
[  989.919543]  __do_softirq+0xa8/0x19d
[  989.919545]  irq_exit+0x5d/0x6b
[  989.919546]  do_IRQ+0x9c/0xb5
[  989.919548]  common_interrupt+0x93/0x93
[  989.919548]  </IRQ>

Similarly dst_clone() can use dst_hold() helper to have additional
debugging, as a follow up to commit 44ebe791 ("net: add debug
atomic_inc_not_zero() in dst_hold()")

In net-next we will convert dst atomic_t to refcount_t for peace of
mind.

Fixes: a4c2fd7f ("net: remove DST_NOCACHE flag")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Reported-by: NPaweł Staszewski <pstaszewski@itcare.pl>
Bisected-by: NPaweł Staszewski <pstaszewski@itcare.pl>
Acked-by: NWei Wang <weiwan@google.com>
Acked-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

222d7dbd

19 8月, 2017 1 次提交

ipv4: convert dst_metrics.refcnt from atomic_t to refcount_t · 9620fef2

由 Eric Dumazet 提交于 8月 18, 2017

refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9620fef2

18 6月, 2017 8 次提交

net: add debug atomic_inc_not_zero() in dst_hold() · 44ebe791

由 Wei Wang 提交于 6月 17, 2017

This patch is meant to add a debug warning on the situation where dst is
being held during its destroy phase. This could potentially cause double
free issue on the dst.
Signed-off-by: NWei Wang <weiwan@google.com>
Acked-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

44ebe791

net: reorder all the dst flags · 1eb04e7c

由 Wei Wang 提交于 6月 17, 2017

As some dst flags are removed, reorder the dst flags to fill in the
blanks.
Note: these flags are not exposed into user space. So it is safe to
reorder.
Signed-off-by: NWei Wang <weiwan@google.com>
Acked-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1eb04e7c

net: remove DST_NOCACHE flag · a4c2fd7f

由 Wei Wang 提交于 6月 17, 2017

DST_NOCACHE flag check has been removed from dst_release() and
dst_hold_safe() in a previous patch because all the dst are now ref
counted properly and can be released based on refcnt only.
Looking at the rest of the DST_NOCACHE use, all of them can now be
removed or replaced with other checks.
So this patch gets rid of all the DST_NOCACHE usage and remove this flag
completely.
Signed-off-by: NWei Wang <weiwan@google.com>
Acked-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a4c2fd7f

net: remove DST_NOGC flag · b2a9c0ed

由 Wei Wang 提交于 6月 17, 2017

Now that all the components have been changed to release dst based on
refcnt only and not depend on dst gc anymore, we can remove the
temporary flag DST_NOGC.

Note that we also need to remove the DST_NOCACHE check in dst_release()
and dst_hold_safe() because now all the dst are released based on refcnt
and behaves as DST_NOCACHE.
Signed-off-by: NWei Wang <weiwan@google.com>
Acked-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b2a9c0ed

net: remove dst gc related code · 5b7c9a8f

由 Wei Wang 提交于 6月 17, 2017

This patch removes all dst gc related code and all the dst free
functions
Signed-off-by: NWei Wang <weiwan@google.com>
Acked-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5b7c9a8f

xfrm: take refcnt of dst when creating struct xfrm_dst bundle · 52df157f

由 Wei Wang 提交于 6月 17, 2017

During the creation of xfrm_dst bundle, always take ref count when
allocating the dst. This way, xfrm_bundle_create() will form a linked
list of dst with dst->child pointing to a ref counted dst child. And
the returned dst pointer is also ref counted. This makes the link from
the flow cache to this dst now ref counted properly.
As the dst is always ref counted properly, we can safely mark
DST_NOGC flag so dst_release() will release dst based on refcnt only.
And dst gc is no longer needed and all dst_free() and its related
function calls should be replaced with dst_release() or
dst_release_immediate().

The special handling logic for dst->child in dst_destroy() can be
replaced with a simple dst_release_immediate() call on the child to
release the whole list linked by dst->child pointer.
Previously used DST_NOHASH flag is not needed anymore as well. The
reason that DST_NOHASH is used in the existing code is mainly to prevent
the dst inserted in the fib tree to be wrongly destroyed during the
deletion of the xfrm_dst bundle. So in the existing code, DST_NOHASH
flag is marked in all the dst children except the one which is in the
fib tree.
However, with this patch series to remove dst gc logic and release dst
only based on ref count, it is safe to release all the children from a
xfrm_dst bundle as long as the dst children are all ref counted
properly which is already the case in the existing code.
So, this patch removes the use of DST_NOHASH flag.
Signed-off-by: NWei Wang <weiwan@google.com>
Acked-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

52df157f

net: introduce a new function dst_dev_put() · 4a6ce2b6

由 Wei Wang 提交于 6月 17, 2017

This function should be called when removing routes from fib tree after
the dst gc is no longer in use.
We first mark DST_OBSOLETE_DEAD on this dst to make sure next
dst_ops->check() fails and returns NULL.
Secondly, as we no longer keep the gc_list, we need to properly
release dst->dev right at the moment when the dst is removed from
the fib/fib6 tree.
It does the following:
1. change dst->input and output pointers to dst_discard/dst_dscard_out to
   discard all packets
2. replace dst->dev with loopback interface
Signed-off-by: NWei Wang <weiwan@google.com>
Acked-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4a6ce2b6

net: introduce DST_NOGC in dst_release() to destroy dst based on refcnt · 5f56f409

由 Wei Wang 提交于 6月 17, 2017

The current mechanism of freeing dst is a bit complicated. dst has its
ref count and when user grabs the reference to the dst, the ref count is
properly taken in most cases except in IPv4/IPv6/decnet/xfrm routing
code due to some historic reasons.

If the reference to dst is always taken properly, we should be able to
simplify the logic in dst_release() to destroy dst when dst->__refcnt
drops from 1 to 0. And this should be the only condition to determine
if we can call dst_destroy().
And as dst is always ref counted, there is no need for a dst garbage
list to hold the dst entries that already get removed by the routing
code but are still held by other users. And the task to periodically
check the list to free dst if ref count become 0 is also not needed
anymore.

This patch introduces a temporary flag DST_NOGC(no garbage collector).
If it is set in the dst, dst_release() will call dst_destroy() when
dst->__refcnt drops to 0. dst_hold_safe() will also check for this flag
and do atomic_inc_not_zero() similar as DST_NOCACHE to avoid double free
issue.
This temporary flag is mainly used so that we can make the transition
component by component without breaking other parts.
This flag will be removed after all components are properly transitioned.

This patch also introduces a new function dst_release_immediate() which
destroys dst without waiting on the rcu when refcnt drops to 0. It will
be used in later patches.

Follow-up patches will correct all the places to properly take ref count
on dst and mark DST_NOGC. dst_release() or dst_release_immediate() will
be used to release the dst instead of dst_free() and its related
functions.
And final clean-up patch will remove the DST_NOGC flag.
Signed-off-by: NWei Wang <weiwan@google.com>
Acked-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5f56f409

27 5月, 2017 1 次提交

ipv4: add reference counting to metrics · 3fb07daf

由 Eric Dumazet 提交于 5月 25, 2017

Andrey Konovalov reported crashes in ipv4_mtu()

I could reproduce the issue with KASAN kernels, between
10.246.7.151 and 10.246.7.152 :

1) 20 concurrent netperf -t TCP_RR -H 10.246.7.152 -l 1000 &

2) At the same time run following loop :
while :
do
 ip ro add 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500
 ip ro del 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500
done

Cong Wang attempted to add back rt->fi in commit
82486aa6 ("ipv4: restore rt->fi for reference counting")
but this proved to add some issues that were complex to solve.

Instead, I suggested to add a refcount to the metrics themselves,
being a standalone object (in particular, no reference to other objects)

I tried to make this patch as small as possible to ease its backport,
instead of being super clean. Note that we believe that only ipv4 dst
need to take care of the metric refcount. But if this is wrong,
this patch adds the basic infrastructure to extend this to other
families.

Many thanks to Julian Anastasov for reviewing this patch, and Cong Wang
for his efforts on this problem.

Fixes: 2860583f ("ipv4: Kill rt->fi")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: NAndrey Konovalov <andreyknvl@google.com>
Reviewed-by: NJulian Anastasov <ja@ssi.bg>
Acked-by: NCong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3fb07daf

18 5月, 2017 1 次提交

net: make struct dst_entry::dev first member · 66727145

由 Alexey Dobriyan 提交于 5月 17, 2017

struct dst_entry::dev is used most often. Move it so it can be
accessed without imm8 offset on x86_64.

	add/remove: 0/0 grow/shrink: 9/239 up/down: 52/-413 (-361)
	function                                     old     new   delta
	dst_rcu_free                                 126     138     +12
	fnhe_flush_routes                            211     219      +8
	rt_set_nexthop                               747     754      +7
	rt_cache_route                                85      91      +6
	rt6_release                                  209     215      +6
	dst_release                                  107     111      +4
	dst_destroy_rcu                               29      33      +4
	dn_dst_check_expire                          329     333      +4
	dn_insert_route                              484     485      +1
	xfrm_resolve_and_create_bundle              2991    2990      -1
					...
	ip_route_me_harder                          1163    1157      -6
	__ip_append_data.isra                       2730    2724      -6
	ip6_forward                                 3052    3045      -7
	callforward_do_filter                        659     651      -8
	dst_gc_task                                  571     549     -22
Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

66727145

12 2月, 2017 1 次提交

net: rename dst_neigh_output back to neigh_output · c16ec185

由 Julian Anastasov 提交于 2月 11, 2017

After the dst->pending_confirm flag was removed, we do not
need anymore to provide dst arg to dst_neigh_output.
So, rename it to neigh_output as before commit 5110effe
("net: Do delayed neigh confirmation.").
Signed-off-by: NJulian Anastasov <ja@ssi.bg>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c16ec185

08 2月, 2017 2 次提交

net: pending_confirm is not used anymore · 51ce8bd4

由 Julian Anastasov 提交于 2月 06, 2017

When same struct dst_entry can be used for many different
neighbours we can not use it for pending confirmations.
As last step, we can remove the pending_confirm flag.
Reported-by: NYueHaibing <yuehaibing@huawei.com>
Fixes: 5110effe ("net: Do delayed neigh confirmation.")
Fixes: f2bb4bed ("ipv4: Cache output routes in fib_info nexthops.")
Signed-off-by: NJulian Anastasov <ja@ssi.bg>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

51ce8bd4

net: add confirm_neigh method to dst_ops · 63fca65d

由 Julian Anastasov 提交于 2月 06, 2017

Add confirm_neigh method to dst_ops and use it from IPv4 and IPv6
to lookup and confirm the neighbour. Its usage via the new helper
dst_confirm_neigh() should be restricted to MSG_PROBE users for
performance reasons.

For XFRM prefer the last tunnel address, if present. With help
from Steffen Klassert.
Signed-off-by: NJulian Anastasov <ja@ssi.bg>
Acked-by: NSteffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

63fca65d

26 4月, 2016 1 次提交

route: move lwtunnel state to a single place · 0868e253

由 Jiri Benc 提交于 4月 22, 2016

Commit 751a587a ("route: fix breakage after moving lwtunnel state")
moved lwtstate to the end of dst_entry for 32bit archs. This makes it share
the cacheline with __refcnt which had an unkown effect on performance. For
this reason, the pointer was kept in place for 64bit archs.

However, later performance measurements showed this is of no concern. It
turns out that every performance sensitive path that accesses lwtstate
accesses also struct rtable or struct rt6_info which share the same cache
line.

Thus, to get rid of a few #ifdefs, move the field to the end of the struct
also for 64bit.
Signed-off-by: NJiri Benc <jbenc@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0868e253

19 3月, 2016 1 次提交

bpf, dst: add and use dst_tclassid helper · 808c1b69

由 Daniel Borkmann 提交于 3月 16, 2016

We can just add a small helper dst_tclassid() for retrieving the
dst->tclassid value. It makes the code a bit better in that we can
get rid of the ifdef from filter.c by moving this into the header.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

808c1b69

15 12月, 2015 1 次提交

net: fix IP early demux races · 5037e9ef

由 Eric Dumazet 提交于 12月 14, 2015

David Wilder reported crashes caused by dst reuse.

<quote David>
  I am seeing a crash on a distro V4.2.3 kernel caused by a double
  release of a dst_entry.  In ipv4_dst_destroy() the call to
  list_empty() finds a poisoned next pointer, indicating the dst_entry
  has already been removed from the list and freed. The crash occurs
  18 to 24 hours into a run of a network stress exerciser.
</quote>

Thanks to his detailed report and analysis, we were able to understand
the core issue.

IP early demux can associate a dst to skb, after a lookup in TCP/UDP
sockets.

When socket cache is not properly set, we want to store into
sk->sk_dst_cache the dst for future IP early demux lookups,
by acquiring a stable refcount on the dst.

Problem is this acquisition is simply using an atomic_inc(),
which works well, unless the dst was queued for destruction from
dst_release() noticing dst refcount went to zero, if DST_NOCACHE
was set on dst.

We need to make sure current refcount is not zero before incrementing
it, or risk double free as David reported.

This patch, being a stable candidate, adds two new helpers, and use
them only from IP early demux problematic paths.

It might be possible to merge in net-next skb_dst_force() and
skb_dst_force_safe(), but I prefer having the smallest patch for stable
kernels : Maybe some skb_dst_force() callers do not expect skb->dst
can suddenly be cleared.

Can probably be backported back to linux-3.6 kernels
Reported-by: NDavid J. Wilder <dwilder@us.ibm.com>
Tested-by: NDavid J. Wilder <dwilder@us.ibm.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5037e9ef

08 10月, 2015 2 次提交

dst: Pass net into dst->output · ede2059d

由 Eric W. Biederman 提交于 10月 07, 2015

The network namespace is already passed into dst_output pass it into
dst->output lwt->output and friends.
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ede2059d

net: Pass net into dst_output and remove dst_output_okfn · 13206b6b

由 Eric W. Biederman 提交于 10月 07, 2015

Replace dst_output_okfn with dst_output
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

13206b6b

26 9月, 2015 1 次提交

inet: constify ip_route_output_flow() socket argument · 6f9c9615

由 Eric Dumazet 提交于 9月 25, 2015

Very soon, TCP stack might call inet_csk_route_req(), which
calls inet_csk_route_req() with an unlocked listener socket,
so we need to make sure ip_route_output_flow() is not trying to
change any field from its socket argument.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6f9c9615

18 9月, 2015 2 次提交

netfilter: Pass net into okfn · 0c4b51f0

由 Eric W. Biederman 提交于 9月 15, 2015

This is immediately motivated by the bridge code that chains functions that
call into netfilter.  Without passing net into the okfns the bridge code would
need to guess about the best expression for the network namespace to process
packets in.

As net is frequently one of the first things computed in continuation functions
after netfilter has done it's job passing in the desired network namespace is in
many cases a code simplification.

To support this change the function dst_output_okfn is introduced to
simplify passing dst_output as an okfn.  For the moment dst_output_okfn
just silently drops the struct net.
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0c4b51f0

net: Merge dst_output and dst_output_sk · 5a70649e

由 Eric W. Biederman 提交于 9月 15, 2015

Add a sock paramter to dst_output making dst_output_sk superfluous.
Add a skb->sk parameter to all of the callers of dst_output
Have the callers of dst_output_sk call dst_output.
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5a70649e

01 9月, 2015 1 次提交

tcp: use dctcp if enabled on the route to the initiator · c3a8d947

由 Daniel Borkmann 提交于 8月 31, 2015

Currently, the following case doesn't use DCTCP, even if it should:
A responder has f.e. Cubic as system wide default, but for a specific
route to the initiating host, DCTCP is being set in RTAX_CC_ALGO. The
initiating host then uses DCTCP as congestion control, but since the
initiator sets ECT(0), tcp_ecn_create_request() doesn't set ecn_ok,
and we have to fall back to Reno after 3WHS completes.

We were thinking on how to solve this in a minimal, non-intrusive
way without bloating tcp_ecn_create_request() needlessly: lets cache
the CA ecn option flag in RTAX_FEATURES. In other words, when ECT(0)
is set on the SYN packet, set ecn_ok=1 iff route RTAX_FEATURES
contains the unexposed (internal-only) DST_FEATURE_ECN_CA. This allows
to only do a single metric feature lookup inside tcp_ecn_create_request().

Joint work with Florian Westphal.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c3a8d947

28 8月, 2015 1 次提交

dst: Add __skb_dst_copy() variation · e79e2595

由 Joe Stringer 提交于 8月 26, 2015

This variation on skb_dst_copy() doesn't require two skbs.
Signed-off-by: NJoe Stringer <joestringer@nicira.com>
Acked-by: NPravin B Shelar <pshelar@nicira.com>
Acked-by: NThomas Graf <tgraf@suug.ch>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e79e2595

24 8月, 2015 1 次提交

route: fix breakage after moving lwtunnel state · 751a587a

由 Jiri Benc 提交于 8月 21, 2015

__recnt and related fields need to be in its own cacheline for performance
reasons. Commit 61adedf3 ("route: move lwtunnel state to dst_entry")
broke that on 32bit archs, causing BUILD_BUG_ON in dst_hold to be triggered.

This patch fixes the breakage by moving the lwtunnel state to the end of
dst_entry on 32bit archs. Unfortunately, this makes it share the cacheline
with __refcnt and may affect performance, thus further patches may be
needed.
Reported-by: Nkbuild test robot <fengguang.wu@intel.com>
Fixes: 61adedf3 ("route: move lwtunnel state to dst_entry")
Signed-off-by: NJiri Benc <jbenc@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

751a587a

21 8月, 2015 1 次提交

route: move lwtunnel state to dst_entry · 61adedf3

由 Jiri Benc 提交于 8月 20, 2015

Currently, the lwtunnel state resides in per-protocol data. This is
a problem if we encapsulate ipv6 traffic in an ipv4 tunnel (or vice versa).
The xmit function of the tunnel does not know whether the packet has been
routed to it by ipv4 or ipv6, yet it needs the lwtstate data. Moving the
lwtstate data to dst_entry makes such inter-protocol tunneling possible.

As a bonus, this brings a nice diffstat.
Signed-off-by: NJiri Benc <jbenc@redhat.com>
Acked-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: NThomas Graf <tgraf@suug.ch>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

61adedf3

22 7月, 2015 1 次提交

dst: Metadata destinations · f38a9eb1

由 Thomas Graf 提交于 7月 21, 2015

Introduces a new dst_metadata which enables to carry per packet metadata
between forwarding and processing elements via the skb->dst pointer.

The structure is set up to be a union. Thus, each separate type of
metadata requires its own dst instance. If demand arises to carry
multiple types of metadata concurrently, metadata dst entries can be
made stackable.

The metadata dst entry is refcnt'ed as expected for now but a non
reference counted use is possible if the reference is forced before
queueing the skb.

In order to allow allocating dsts with variable length, the existing
dst_alloc() is split into a dst_alloc() and dst_init() function. The
existing dst_init() function to initialize the subsystem is being
renamed to dst_subsys_init() to make it clear what is what.

The check before ip_route_input() is changed to ignore metadata dsts
and drop the dst inside the routing function thus allowing to interpret
metadata in a later commit.
Signed-off-by: NThomas Graf <tgraf@suug.ch>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f38a9eb1

13 5月, 2015 1 次提交

net: make skb_dst_pop routine static · 9449c3cd

由 Ying Xue 提交于 5月 12, 2015

As xfrm_output_one() is the only caller of skb_dst_pop(), we should
make skb_dst_pop() localized.
Signed-off-by: NYing Xue <ying.xue@windriver.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9449c3cd

02 5月, 2015 1 次提交

ipv6: Remove DST_METRICS_FORCE_OVERWRITE and _rt6i_peer · afc4eef8

由 Martin KaFai Lau 提交于 4月 28, 2015

_rt6i_peer is no longer needed after the last patch,
'ipv6: Stop rt6_info from using inet_peer's metrics'.

DST_METRICS_FORCE_OVERWRITE is added by
commit e5fd387a ("ipv6: do not overwrite inetpeer metrics prematurely").
Since inetpeer is no longer used for metrics, this bit is also not needed.
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Reviewed-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Michal Kubeček <mkubecek@suse.cz>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

afc4eef8

12 2月, 2015 1 次提交

xfrm: release dst_orig in case of error in xfrm_lookup() · ac37e251

由 huaibin Wang 提交于 2月 11, 2015

dst_orig should be released on error. Function like __xfrm_route_forward()
expects that behavior.
Since a recent commit, xfrm_lookup() may also be called by xfrm_lookup_route(),
which expects the opposite.
Let's introduce a new flag (XFRM_LOOKUP_KEEP_DST_REF) to tell what should be
done in case of error.

Fixes: f92ee619("xfrm: Generate blackhole routes only from route lookup functions")
Signed-off-by: Nhuaibin Wang <huaibin.wang@6wind.com>
Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>

ac37e251

16 9月, 2014 2 次提交

xfrm: Generate queueing routes only from route lookup functions · b8c203b2

由 Steffen Klassert 提交于 9月 16, 2014

Currently we genarate a queueing route if we have matching policies
but can not resolve the states and the sysctl xfrm_larval_drop is
disabled. Here we assume that dst_output() is called to kill the
queued packets. Unfortunately this assumption is not true in all
cases, so it is possible that these packets leave the system unwanted.

We fix this by generating queueing routes only from the
route lookup functions, here we can guarantee a call to
dst_output() afterwards.

Fixes: a0073fe1 ("xfrm: Add a state resolution packet queue")
Reported-by: NKonstantinos Kolelis <k.kolelis@sirrix.com>
Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>

b8c203b2

xfrm: Generate blackhole routes only from route lookup functions · f92ee619

由 Steffen Klassert 提交于 9月 16, 2014

Currently we genarate a blackhole route route whenever we have
matching policies but can not resolve the states. Here we assume
that dst_output() is called to kill the balckholed packets.
Unfortunately this assumption is not true in all cases, so
it is possible that these packets leave the system unwanted.

We fix this by generating blackhole routes only from the
route lookup functions, here we can guarantee a call to
dst_output() afterwards.

Fixes: 2774c131 ("xfrm: Handle blackhole route creation via afinfo.")
Reported-by: NKonstantinos Kolelis <k.kolelis@sirrix.com>
Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>

f92ee619

16 4月, 2014 1 次提交

ipv4: add a sock pointer to dst->output() path. · aad88724

由 Eric Dumazet 提交于 4月 15, 2014

In the dst->output() path for ipv4, the code assumes the skb it has to
transmit is attached to an inet socket, specifically via
ip_mc_output() : The sk_mc_loop() test triggers a WARN_ON() when the
provider of the packet is an AF_PACKET socket.

The dst->output() method gets an additional 'struct sock *sk'
parameter. This needs a cascade of changes so that this parameter can
be propagated from vxlan to final consumer.

Fixes: 8f646c92 ("vxlan: keep original skb ownership")
Reported-by: Nlucien xin <lucien.xin@gmail.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

aad88724

28 3月, 2014 1 次提交

ipv6: do not overwrite inetpeer metrics prematurely · e5fd387a

由 Michal Kubeček 提交于 3月 27, 2014

If an IPv6 host route with metrics exists, an attempt to add a
new route for the same target with different metrics fails but
rewrites the metrics anyway:

12sp0:~ # ip route add fec0::1 dev eth0 rto_min 1000
12sp0:~ # ip -6 route show
fe80::/64 dev eth0  proto kernel  metric 256
fec0::1 dev eth0  metric 1024  rto_min lock 1s
12sp0:~ # ip route add fec0::1 dev eth0 rto_min 1500
RTNETLINK answers: File exists
12sp0:~ # ip -6 route show
fe80::/64 dev eth0  proto kernel  metric 256
fec0::1 dev eth0  metric 1024  rto_min lock 1.5s

This is caused by all IPv6 host routes using the metrics in
their inetpeer (or the shared default). This also holds for the
new route created in ip6_route_add() which shares the metrics
with the already existing route and thus ip6_route_add()
rewrites the metrics even if the new route ends up not being
used at all.

Another problem is that old metrics in inetpeer can reappear
unexpectedly for a new route, e.g.

12sp0:~ # ip route add fec0::1 dev eth0 rto_min 1000
12sp0:~ # ip route del fec0::1
12sp0:~ # ip route add fec0::1 dev eth0
12sp0:~ # ip route change fec0::1 dev eth0 hoplimit 10
12sp0:~ # ip -6 route show
fe80::/64 dev eth0  proto kernel  metric 256
fec0::1 dev eth0  metric 1024  hoplimit 10 rto_min lock 1s

Resolve the first problem by moving the setting of metrics down
into fib6_add_rt2node() to the point we are sure we are
inserting the new route into the tree. Second problem is
addressed by introducing new flag DST_METRICS_FORCE_OVERWRITE
which is set for a new host route in ip6_route_add() and makes
ipv6_cow_metrics() always overwrite the metrics in inetpeer
(even if they are not "new"); it is reset after that.

v5: use a flag in _metrics member rather than one in flags

v4: fix a typo making a condition always true (thanks to Hannes
Frederic Sowa)

v3: rewritten based on David Miller's idea to move setting the
metrics (and allocation in non-host case) down to the point we
already know the route is to be inserted. Also rebased to
net-next as it is quite late in the cycle.
Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e5fd387a

07 3月, 2014 1 次提交

inet: remove now unused flag DST_NOPEER · e90c1483

由 Hannes Frederic Sowa 提交于 3月 06, 2014

Commit e688a604 ("net: introduce DST_NOPEER dst flag") introduced
DST_NOPEER because because of crashes in ipv6_select_ident called from
udp6_ufo_fragment.

Since commit 916e4cf4 ("ipv6: reuse ip6_frag_id from
ip6_ufo_append_data") we don't call ipv6_select_ident any more from
ip6_ufo_append_data, thus this flag lost its purpose and can be removed.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e90c1483

18 12月, 2013 1 次提交

net: Add utility functions to clear rxhash · 7539fadc

由 Tom Herbert 提交于 12月 15, 2013

In several places 'skb->rxhash = 0' is being done to clear the
rxhash value in an skb.  This does not clear l4_rxhash which could
still be set so that the rxhash wouldn't be recalculated on subsequent
call to skb_get_rxhash.  This patch adds an explict function to clear
all the rxhash related information in the skb properly.

skb_clear_hash_if_not_l4 clears the rxhash only if it is not marked as
l4_rxhash.

Fixed up places where 'skb->rxhash = 0' was being called.
Signed-off-by: NTom Herbert <therbert@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7539fadc

18 10月, 2013 1 次提交

net: dst: provide accessor function to dst->xfrm · e87b3998

由 Vlad Yasevich 提交于 10月 15, 2013

dst->xfrm is conditionally defined.  Provide accessor funtion that
is always available.
Signed-off-by: NVlad Yasevich <vyasevich@gmail.com>
Acked-by: NNeil Horman <nhorman@tuxdriver.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e87b3998

21 9月, 2013 1 次提交

dst.h: Remove extern from function prototypes · a4023dd0

由 Joe Perches 提交于 9月 20, 2013

There are a mix of function prototypes with and without extern
in the kernel sources.  Standardize on not using extern for
function prototypes.

Function prototypes don't need to be written with extern.
extern is assumed by the compiler.  Its use is as unnecessary as
using auto to declare automatic/local variables in a block.
Signed-off-by: NJoe Perches <joe@perches.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a4023dd0

04 9月, 2013 1 次提交

tunnels: harmonize cleanup done on skb on rx path · ea23192e

由 Nicolas Dichtel 提交于 9月 02, 2013

The goal of this patch is to harmonize cleanup done on a skbuff on rx path.
Before this patch, behaviors were different depending of the tunnel type.
Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ea23192e

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功