Commits · eb49a97363f020c1d7eef8bcd93865726b1fa11d · openanolis / cloud-kernel

24 Mar, 2011 · 1 commit

ipv4: fix ip_rt_update_pmtu() · eb49a973

Committed by Eric Dumazet on Mar 23, 2011

commit 2c8cec5c (Cache learned PMTU information in inetpeer) added
an extra inet_putpeer() call in ip_rt_update_pmtu().

This results in various problems, since we can free one inetpeer, while
it is still in use.

Ref: http://www.spinics.net/lists/netdev/msg159121.html
Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

16 Mar, 2011 · 1 commit

net_sched: fix ip_tos2prio · 4a2b9c37

Committed by Dan Siemon on Mar 15, 2011

ECN support incorrectly maps ECN BESTEFFORT packets to TC_PRIO_FILLER
(1) instead of TC_PRIO_BESTEFFORT (0).

This means ECN enabled flows are placed in the pfifo_fast/prio low
priority band, giving ECN enabled flows [ECT(0) and CE codepoints]
higher drop probabilities.

This is rather unfortunate, given that we would like ECN to be more
widely used.

Ref: http://www.coverfire.com/archives/2011/03/13/pfifo_fast-and-ecn/
Signed-off-by: Dan Siemon <dan@coverfire.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Dave Täht <d@taht.net>
Cc: Jonathan Morton <chromatix99@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
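
The effect described in the net_sched commit above can be illustrated with a toy lookup. This is a minimal sketch, not the kernel's `ip_tos2prio` table: the two-entry tables, the `toy_*` names, and the assumption that the table index is taken from bit 1 of the TOS byte (the bit set by both ECT(0) = 0x02 and CE = 0x03) are all illustrative. Only the priority values 0 and 1 come from the commit message.

```c
#include <stdint.h>

/* Priority values as given in the commit message. */
enum { TOY_PRIO_BESTEFFORT = 0, TOY_PRIO_FILLER = 1 };

/* Toy tables covering only TOS bytes whose upper (TOS class) bits are zero.
 * Entry 0: no ECN bit in the index; entry 1: ECN bit in the index. */
const uint8_t toy_before_fix[2] = { TOY_PRIO_BESTEFFORT, TOY_PRIO_FILLER };
const uint8_t toy_after_fix[2]  = { TOY_PRIO_BESTEFFORT, TOY_PRIO_BESTEFFORT };

/* Assumed indexing: bit 1 of the TOS byte selects the entry. */
static unsigned toy_index(uint8_t tos)
{
    return (tos >> 1) & 0x1u;
}

/* Look up the priority band for a TOS byte in one of the toy tables. */
int toy_prio(const uint8_t table[2], uint8_t tos)
{
    return table[toy_index(tos)];
}
```

With the "before" table, a best-effort packet marked ECT(0) or CE lands in the filler band; with the "after" table it stays in the best-effort band, which is the behavior the commit message asks for.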

14 Mar, 2011 · 1 commit

ipv4: Fix PMTU update. · 46af3180

Committed by Hiroaki SHIMODA on Mar 09, 2011

On current net-next-2.6, when Linux receives ICMP Type: 3, Code: 4
(Destination unreachable (Fragmentation needed)),

  icmp_unreach
    -> ip_rt_frag_needed
         (peer->pmtu_expires is set here)
    -> tcp_v4_err
         -> do_pmtu_discovery
              -> ip_rt_update_pmtu
                   (peer->pmtu_expires is already set,
                    so check_peer_pmtu is skipped.)
                   -> check_peer_pmtu

check_peer_pmtu is skipped and MTU is not updated.

To fix this, let check_peer_pmtu execute unconditionally.
This commit also makes some minor fixes:
1) Avoid potentially setting peer->pmtu_expires to zero.
2) In check_peer_pmtu, the arguments of time_before were reversed.
3) check_peer_pmtu expects peer->pmtu_orig to be initialized to zero,
   but it was not initialized.
Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
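
Minor fix 2) above concerns argument order in a wraparound-safe time comparison. The following user-space sketch restates the usual signed-difference idiom behind such comparisons; the `toy_*` names and the `toy_pmtu_still_valid()` helper are hypothetical and are not the code in check_peer_pmtu. It assumes the common two's-complement conversion from `unsigned long` to `long`.

```c
/* Returns nonzero if tick a is later than tick b, even across wraparound. */
static int toy_time_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/* Returns nonzero if tick a is earlier than tick b. */
static int toy_time_before(unsigned long a, unsigned long b)
{
    return toy_time_after(b, a);
}

/* Hypothetical check: learned PMTU is still valid while now < expires.
 * Swapping the two arguments would invert the result. */
int toy_pmtu_still_valid(unsigned long now, unsigned long expires)
{
    return toy_time_before(now, expires);
}
```

Because the comparison is done on the difference, an expiry that has wrapped past zero still compares as being in the future.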

13 Mar, 2011 · 4 commits

ipv4: Use flowi4 in public route lookup interfaces. · 9d6ec938

Committed by David S. Miller on Mar 12, 2011

Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: Use struct flowi4 internally in routing lookups. · 68a5e3dd

Committed by David S. Miller on Mar 11, 2011

We will change the externally visible APIs next.
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: Pass ipv4 flow objects into fib_lookup() paths. · 22bd5b9b

Committed by David S. Miller on Mar 11, 2011

To start doing these conversions, we need to add some temporary
flow4_* macros which will eventually go away when all the protocol
code paths are changed to work on AF specific flowi objects.
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Put flowi_* prefix on AF independent members of struct flowi · 1d28f42c

Committed by David S. Miller on Mar 12, 2011

I intend to turn struct flowi into a union of AF specific flowi
structs.  There will be a common structure that each variant includes
first, much like struct sock_common.

This is the first step to move in that direction.
Signed-off-by: David S. Miller <davem@davemloft.net>

11 Mar, 2011 · 3 commits

ipv4: Kill flowi arg to fib_select_multipath() · 1b7fe593

Committed by David S. Miller on Mar 10, 2011

Completely unused.
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: Remove unnecessary test from ip_mkroute_input() · ff3fccb3

Committed by David S. Miller on Mar 10, 2011

fl->oif will always be zero on the input path, so there is no reason
to test for that.
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: Remove redundant RCU locking in ip_check_mc(). · dbdd9a52

Committed by David S. Miller on Mar 10, 2011

All callers are under rcu_read_lock() protection already.

Rename to ip_check_mc_rcu() to make it even more clear.
Signed-off-by: David S. Miller <davem@davemloft.net>

10 Mar, 2011 · 1 commit

ipv4: Optimize flow initialization in input route lookup. · 67e28ffd

Committed by David S. Miller on Mar 09, 2011

Like in commit 44713b67
("ipv4: Optimize flow initialization in output route lookup."),
we can optimize the on-stack flow setup to only initialize
the members which are actually used.

Otherwise we bzero the entire structure, then initialize
explicitly the first half of it.
Signed-off-by: David S. Miller <davem@davemloft.net>

05 Mar, 2011 · 4 commits

ipv4: Remove flowi from struct rtable. · 5e2b61f7

Committed by David S. Miller on Mar 04, 2011

The only necessary parts are the src/dst addresses, the
interface indexes, the TOS, and the mark.

The rest is unnecessary bloat, which amounts to nearly
50 bytes on 64-bit.
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: Set rt->rt_iif more sanely on output routes. · 1018b5c0

Committed by David S. Miller on Mar 04, 2011

rt->rt_iif is only ever inspected on input routes, for example DCCP
uses this to populate a route lookup flow key when generating replies
to another packet.

Therefore, setting it to anything other than zero on output routes
makes no sense.
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: Get peer more cheaply in rt_init_metrics(). · 3c0afdca

Committed by David S. Miller on Mar 04, 2011

We know this is a new route object, so doing atomics and
stuff makes no sense at all.
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: Optimize flow initialization in output route lookup. · 44713b67

Committed by David S. Miller on Mar 04, 2011

We burn a lot of useless cycles, cpu store buffer traffic, and
memory operations memset()'ing the on-stack flow used to perform
output route lookups in __ip_route_output_key().

Only the first half of the flow object members even matter for
output route lookups in this context, specifically:

FIB rules matching cares about:

	dst, src, tos, iif, oif, mark

FIB trie lookup cares about:

	dst

FIB semantic match cares about:

	tos, scope, oif

Therefore only initialize these specific members and elide the
memset entirely.

On Niagara2 this kills roughly 300 cycles from the output route
lookup path.

Likely, we can take things further, since all callers of output
route lookups essentially throw away the on-stack flow they use.
So they don't care if we use it as a scratch-pad to compute the
final flow key.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
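
The idea in the commit above, initializing only the members a lookup reads instead of zeroing the whole structure, can be sketched as follows. The `toy_flow` structure and both functions are hypothetical; the member names follow the lists in the commit message, and the 64-byte tail merely stands in for whatever the rest of the real flow object holds.

```c
#include <stdint.h>
#include <string.h>

struct toy_flow {
    /* Members the commit message says the lookup cares about. */
    uint32_t dst, src, mark;
    int iif, oif;
    uint8_t tos, scope;
    /* Stand-in for the remaining members that the lookup never reads. */
    unsigned char unused_tail[64];
};

/* Baseline: zero everything, then overwrite the interesting members. */
void toy_flow_init_memset(struct toy_flow *fl, uint32_t dst, uint32_t src,
                          uint8_t tos, int oif)
{
    memset(fl, 0, sizeof(*fl));
    fl->dst = dst;
    fl->src = src;
    fl->tos = tos;
    fl->oif = oif;
}

/* Optimized: write each used member exactly once and skip the memset.
 * unused_tail is deliberately left uninitialized. */
void toy_flow_init_partial(struct toy_flow *fl, uint32_t dst, uint32_t src,
                           uint8_t tos, int oif)
{
    fl->dst = dst;
    fl->src = src;
    fl->mark = 0;
    fl->iif = 0;
    fl->oif = oif;
    fl->tos = tos;
    fl->scope = 0;
}
```

Both variants leave the used members in the same state; the second is only safe as long as no consumer reads the untouched tail.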

03 Mar, 2011 · 3 commits

ipv4: ip_route_output_key() is better as an inline. · 5bfa787f

Committed by David S. Miller on Mar 02, 2011

This avoids a stack frame at zero cost.
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: Make output route lookup return rtable directly. · b23dd4fe

Committed by David S. Miller on Mar 02, 2011

Instead of on the stack.
Signed-off-by: David S. Miller <davem@davemloft.net>

xfrm: Return dst directly from xfrm_lookup() · 452edd59

Committed by David S. Miller on Mar 02, 2011

Instead of on the stack.
Signed-off-by: David S. Miller <davem@davemloft.net>

02 Mar, 2011 · 4 commits

xfrm: Handle blackhole route creation via afinfo. · 2774c131

Committed by David S. Miller on Mar 01, 2011

That way we don't have to potentially do this in every xfrm_lookup()
caller.
Signed-off-by: David S. Miller <davem@davemloft.net>

xfrm: Kill XFRM_LOOKUP_WAIT flag. · 80c0bc9e

Committed by David S. Miller on Mar 01, 2011

This can be determined from the flow flags instead.
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: Kill can_sleep arg to ip_route_output_flow() · 273447b3

Committed by David S. Miller on Mar 01, 2011

This boolean state is now available in the flow flags.
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: Make final arg to ip_route_output_flow to be boolean "can_sleep" · 420d44da

Committed by David S. Miller on Mar 01, 2011

Since that is what the current vague "flags" argument means.
Signed-off-by: David S. Miller <davem@davemloft.net>

19 Feb, 2011 · 1 commit

net: provide default_advmss() methods to blackhole dst_ops · 214f45c9

Committed by Eric Dumazet on Feb 18, 2011

Commit 0dbaee3b (net: Abstract default ADVMSS behind an
accessor.) introduced a possible crash in tcp_connect_init(), when
dst->default_advmss() is called from dst_metric_advmss().
Reported-by: George Spelvin <linux@horizon.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

18 Feb, 2011 · 5 commits

ipv4: Use const'ify fib_result deep in the route call chains. · 982721f3

Committed by David S. Miller on Feb 16, 2011

The only troublesome bit here is __mkroute_output, which wants
to override res->fi and res->type; compute those in local
variables instead.
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Add initial_ref arg to dst_alloc(). · 3c7bd1a1

Committed by David S. Miller on Feb 16, 2011

This allows avoiding multiple writes to the initial __refcnt.

The simplest cases of wanting an initial reference of "1"
in ipv4 and ipv6 have been converted; the rest have been left
alone and kept at the existing "0".
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: Consolidate ipv4 dst allocation logic. · 0c4dcd58

Committed by David S. Miller on Feb 17, 2011

This also allows us to combine all the dst->flags settings and avoid
read/modify/write sequences to this struct member.
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: Move rcu_read_{lock,unlock}() into ip_route_output_slow(). · 010c2708

Committed by David S. Miller on Feb 17, 2011

Simplifies tail of __ip_route_output_key().
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: Simplify output route creation call sequence. · 5ada5527

Committed by David S. Miller on Feb 17, 2011

There's a lot of redundancy and unnecessary stack frames
in the output route creation path.

1) Make __mkroute_output() return error pointers.

2) Eliminate ip_mkroute_output() entirely, made possible by #1.

3) Call __mkroute_output() directly and handle the returned error
   pointers in ip_route_output_slow().
Signed-off-by: David S. Miller <davem@davemloft.net>
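
The "error pointer" convention mentioned in step 1 above can be sketched in user space as follows. This is an illustration of the general idiom (a small negative errno encoded in the pointer value itself), not the kernel's own helpers; every `toy_*` name, the 4095 limit, and the `toy_mkroute()` function are assumptions made for the example.

```c
#include <errno.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *toy_err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

/* Recover the errno value from an encoded pointer. */
static inline long toy_ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

/* True if the pointer value lies in the top TOY_MAX_ERRNO addresses. */
static inline int toy_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

struct toy_route { int oif; };

/* Hypothetical constructor: returns a valid route or an encoded error,
 * so the caller needs no separate "int *err" out-parameter. */
struct toy_route *toy_mkroute(struct toy_route *storage, int oif)
{
    if (oif < 0)
        return toy_err_ptr(-EINVAL);
    storage->oif = oif;
    return storage;
}
```

A caller checks the result with `toy_is_err()` and, on failure, extracts the code with `toy_ptr_err()`, which is what lets an intermediate wrapper function be removed without losing the error value.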

15 Feb, 2011 · 2 commits

ipv4: Cache learned redirect information in inetpeer. · f39925db

Committed by David S. Miller on Feb 09, 2011

Note that we do not generate the redirect netevent any longer,
because we don't create a new cached route.

Instead, once the new neighbour is bound to the cached route,
we emit a neigh update event.
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: Cache learned PMTU information in inetpeer. · 2c8cec5c

Committed by David S. Miller on Feb 09, 2011

The general idea is that if we learn new PMTU information, we
bump the peer genid.

This triggers the dst_ops->check() code to validate and if
necessary propagate the new PMTU value into the metrics.

Learned PMTU information self-expires.

This means that it is not necessary to kill a cached route
entry just because the PMTU information is too old.

As a consequence:

1) When the path appears unreachable (dst_ops->link_failure
   or dst_ops->negative_advice) we unwind the PMTU state if
   it is out of date, instead of killing the cached route.

   A redirected route will still be invalidated in these
   situations.

2) rt_check_expire(), rt_worker_func(), et al. are no longer
   necessary at all.
Signed-off-by: David S. Miller <davem@davemloft.net>

11 Feb, 2011 · 1 commit

inet: Create a mechanism for upward inetpeer propagation into routes. · 6431cbc2

Committed by David S. Miller on Feb 07, 2011

If we didn't have a routing cache, we would not be able to properly
propagate certain kinds of dynamic path attributes, for example
PMTU information and redirects.

The reason is that if we didn't have a routing cache, then there would
be no way to look up all of the active cached routes hanging off of
sockets, tunnels, IPSEC bundles, etc.

Consider the case where we created a cached route, but no inetpeer
entry existed and also we were not asked to pre-COW the route metrics
and therefore did not force the creation of a new inetpeer entry.

If we later get a PMTU message, or a redirect, and store this
information in a new inetpeer entry, there is no way to teach that
cached route about the newly existing inetpeer entry.

The facilities implemented here handle this problem.

First we create a generation ID.  When we create a cached route of any
kind, we remember the generation ID at the time of attachment.  Any
time we force-create an inetpeer entry in response to new path
information, we bump that generation ID.

The dst_ops->check() callback is where the knowledge of this event
is propagated.  If the global generation ID does not equal the one
stored in the cached route, and the cached route has not attached
to an inetpeer yet, we look it up and attach if one is found.  Now
that we've updated the cached route's information, we update the
route's generation ID too.

This clears the way for implementing PMTU and redirects directly in
the inetpeer cache.  There is absolutely no need to consult cached
route information in order to maintain this information.

At this point nothing bumps the inetpeer genids; that comes in the
later changes which handle PMTUs and redirects using inetpeers.
Signed-off-by: David S. Miller <davem@davemloft.net>
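
The generation ID scheme described in the commit above can be sketched as follows. This is a simplified single-threaded model under stated assumptions: one global counter, a one-slot "peer cache", and hypothetical `toy_*` names throughout. It is not the kernel's implementation and ignores locking and per-address lookup.

```c
#include <stddef.h>

struct toy_peer { int pmtu; };

struct toy_route {
    unsigned int peer_genid;  /* generation seen when last validated */
    struct toy_peer *peer;    /* NULL until a peer entry is attached */
};

unsigned int toy_genid;               /* global generation ID */
static struct toy_peer *toy_peer_slot; /* one-slot stand-in for the peer cache */

/* Creating a cached route: attach a peer if one exists, remember the genid. */
void toy_route_create(struct toy_route *rt)
{
    rt->peer = toy_peer_slot;
    rt->peer_genid = toy_genid;
}

/* New path information arrives: force-create the peer and bump the genid. */
void toy_learn_path_info(struct toy_peer *storage, int pmtu)
{
    storage->pmtu = pmtu;
    toy_peer_slot = storage;
    toy_genid++;
}

/* Analogue of the check step: on a genid mismatch, attach a peer if the
 * route has none yet, then record the current generation. */
void toy_route_check(struct toy_route *rt)
{
    if (rt->peer_genid != toy_genid) {
        if (rt->peer == NULL)
            rt->peer = toy_peer_slot;
        rt->peer_genid = toy_genid;
    }
}
```

A route created before any peer entry existed picks up the later-created entry the next time it is checked, without anyone having to enumerate all cached routes.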

09 Feb, 2011 · 1 commit

net: Kill NETEVENT_PMTU_UPDATE. · 8d13a2a9

Committed by David S. Miller on Feb 08, 2011

Nobody actually does anything in response to the event,
so just kill it off.
Signed-off-by: David S. Miller <davem@davemloft.net>

05 Feb, 2011 · 2 commits

inetpeer: Move ICMP rate limiting state into inet_peer entries. · 92d86829

Committed by David S. Miller on Feb 04, 2011

Like metrics, the ICMP rate limiting bits are cached state about
a destination.  So move it into the inet_peer entries.

If an inet_peer cannot be bound (the reason is memory allocation
failure or similar), the policy is to allow.
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: Don't miss existing cached metrics in new routes. · 0131ba45

Committed by David S. Miller on Feb 04, 2011

Always look up whether we have an existing inetpeer entry for
a route.  Let FLOWI_FLAG_PRECOW_METRICS merely influence the
"create" argument to rt_bind_peer().

Also, call rt_bind_peer() unconditionally since it is not
possible for rt->peer to be non-NULL at this point.
Signed-off-by: David S. Miller <davem@davemloft.net>

01 Feb, 2011 · 2 commits

ipv4: Consolidate all default route selection implementations. · 0c838ff1

Committed by David S. Miller on Jan 31, 2011

Both fib_trie and fib_hash have a local implementation of
fib_table_select_default().  This is completely unnecessary
code duplication.

Since we now remember the fib_table and the head of the fib
alias list of the default route, we can implement one single
generic version of this routine.

Looking at the fib_hash implementation you may get the impression
that it's possible for there to be multiple top-level routes in
the table for the default route.  The truth is, it isn't, the
insert code will only allow one entry to exist in the zero
prefix hash table, because all keys evaluate to zero and all
keys in a hash table must be unique.
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Add default_mtu() methods to blackhole dst_ops · ec831ea7

Committed by Roland Dreier on Jan 31, 2011

When an IPSEC SA is still being set up, __xfrm_lookup() will return
-EREMOTE and so ip_route_output_flow() will return a blackhole route.
This can happen in a sendmsg call, and after d33e4553 ("net: Abstract
default MTU metric calculation behind an accessor.") this leads to a
crash in ip_append_data() because the blackhole dst_ops have no
default_mtu() method and so dst_mtu() calls a NULL pointer.

Fix this by adding default_mtu() methods (that simply return 0, matching
the old behavior) to the blackhole dst_ops.

The IPv4 part of this patch fixes a crash that I saw when using an IPSEC
VPN; the IPv6 part is untested because I don't have an IPv6 VPN, but it
looks to be needed as well.
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
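
The failure mode and fix in the commit above, an ops table whose method pointer was left NULL, can be sketched as follows. The structure, the zero-argument method signature, and all `toy_*` names are simplifications chosen for the example and are not the kernel's `dst_ops` definition.

```c
#include <stddef.h>

/* Simplified ops table with a single optional-looking method. */
struct toy_dst_ops {
    unsigned int (*default_mtu)(void);
};

/* The fix described above: a method that simply returns 0. */
static unsigned int toy_blackhole_default_mtu(void)
{
    return 0;
}

/* Blackhole ops with the method filled in. */
const struct toy_dst_ops toy_blackhole_ops = {
    .default_mtu = toy_blackhole_default_mtu,
};

/* Hypothetical accessor: prefer an explicit metric, else ask the ops.
 * The call through ops->default_mtu is unconditional, so a NULL method
 * pointer here would be a NULL function call. */
unsigned int toy_dst_mtu(const struct toy_dst_ops *ops, unsigned int metric_mtu)
{
    if (metric_mtu)
        return metric_mtu;
    return ops->default_mtu();
}
```

Supplying a trivial method in every ops table keeps the accessor free of NULL checks on the hot path, at the cost of requiring each table to define it.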

29 Jan, 2011 · 1 commit

ipv4: If fib metrics are default, no need to grab ref to FIB info. · b8dad61c

Committed by David S. Miller on Jan 28, 2011

The fib metric memory in this case is static in the kernel image,
so we don't need to reference count it since it's never going
to go away on us.
Signed-off-by: David S. Miller <davem@davemloft.net>

28 Jan, 2011 · 2 commits

net: Pre-COW metrics for TCP. · a4daad6b

Committed by David S. Miller on Jan 27, 2011

TCP is going to record metrics for the connection,
so pre-COW the route metrics at route cache entry
creation time.

This avoids several atomic operations that have to
occur if we COW the metrics after the entry reaches
global visibility.
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Store ipv4/ipv6 COW'd metrics in inetpeer cache. · 06582540

Committed by David S. Miller on Jan 27, 2011

Please note that the IPSEC dst entry metrics keep using
the generic metrics COW'ing mechanism using kmalloc/kfree.

This gives the IPSEC routes an opportunity to use metrics
which are unique to their encapsulated paths.
Signed-off-by: David S. Miller <davem@davemloft.net>

27 Jan, 2011 · 1 commit

net: Implement read-only protection and COW'ing of metrics. · 62fa8a84

Committed by David S. Miller on Jan 26, 2011

Routing metrics are now copy-on-write.

Initially a route entry points its metrics at a read-only location.
If a routing table entry exists, it will point there.  Else it will
point at the all zero metric place-holder called 'dst_default_metrics'.

The writeability state of the metrics is stored in the low bits of the
metrics pointer, we have two bits left to spare if we want to store
more states.

For the initial implementation, COW is implemented simply via kmalloc.
However future enhancements will change this to place the writable
metrics somewhere else, in order to increase sharing.  Very likely
this "somewhere else" will be the inetpeer cache.

Note also that this means that metrics updates may transiently fail
if we cannot COW the metrics successfully.

But even by itself, this patch should decrease memory usage and
increase cache locality especially for routing workloads.  In those
cases the read-only metric copies stay in place and never get written
to.

TCP workloads where metrics get updated, and those rare cases where
PMTU triggers occur, will take a very slight performance hit.  But
that hit will be alleviated when the long-term writable metrics
move to a more sharable location.

Since the metrics storage went from a u32 array of RTAX_MAX entries to
what is essentially a pointer, some retooling of the dst_entry layout
was necessary.

Most importantly, we need to preserve the alignment of the reference
count so that it doesn't share cache lines with the read-mostly state,
as per Eric Dumazet's alignment assertion checks.

The only non-trivial bit here is the move of the 'flags' member into
the writeable cacheline.  This is OK since we are always accessing the
flags around the same moment when we make a modification to the
reference count.
Signed-off-by: David S. Miller <davem@davemloft.net>
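
The commit above stores the writeability state in the low bits of the metrics pointer and copies on first write. A minimal user-space sketch of that idea follows. All names are hypothetical, the flag values are chosen for the example, and it assumes the metrics arrays are at least 4-byte aligned so that the two lowest pointer bits are free; it is not the kernel's code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_METRICS_READ_ONLY ((uintptr_t)0x1)
#define TOY_METRICS_FLAG_MASK ((uintptr_t)0x3)

/* Shared placeholder that must never be written through a route. */
uint32_t toy_default_metrics[4];

/* Pack a metrics pointer and its read-only flag into one word. */
uintptr_t toy_metrics_tag(uint32_t *metrics, int read_only)
{
    return (uintptr_t)metrics | (read_only ? TOY_METRICS_READ_ONLY : 0);
}

/* Strip the flag bits to recover the real pointer. */
uint32_t *toy_metrics_ptr(uintptr_t tagged)
{
    return (uint32_t *)(tagged & ~TOY_METRICS_FLAG_MASK);
}

int toy_metrics_read_only(uintptr_t tagged)
{
    return (tagged & TOY_METRICS_READ_ONLY) != 0;
}

/* Copy-on-write: if the current metrics are read-only, copy n entries
 * into caller-provided private storage and return a writable tag. */
uintptr_t toy_metrics_cow(uintptr_t tagged, uint32_t *private_copy, size_t n)
{
    if (!toy_metrics_read_only(tagged))
        return tagged;
    memcpy(private_copy, toy_metrics_ptr(tagged), n * sizeof(uint32_t));
    return toy_metrics_tag(private_copy, 0);
}
```

Routes that never write their metrics keep sharing the read-only copy, which is where the memory and cache-locality savings described in the commit message would come from.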

openanolis / cloud-kernel · synced successfully about 1 year ago
