Commits · 183dea5818315c0a172d21ecbcd2554894bf01e3 · openanolis / cloud-kernel

03 Dec, 2017 1 commit

openvswitch: do not propagate headroom updates to internal port · 183dea58

Authored by Paolo Abeni on Nov 30, 2017

After commit 3a927bc7 ("ovs: propagate per dp max headroom to
all vports") the needed_headroom for the internal vport is updated
according to the max needed headroom in its datapath.

That avoids the pskb_expand_head() costs when sending/forwarding
packets towards tunnel devices, at least for some scenarios.

We still require such copy when using the ovs-preferred configuration
for vxlan tunnels:

    br_int
  /       \
tap      vxlan
           (remote_ip:X)

br_phy
     \
    NIC

where the route towards the IP 'X' is via 'br_phy'.

When forwarding traffic from the tap towards the vxlan device, we
will call pskb_expand_head() in vxlan_build_skb() because
br_phy->needed_headroom is equal to tun->needed_headroom.

With this change we avoid updating the internal vport needed_headroom,
so that in the above scenario no head copy is needed, giving a 5%
performance improvement in a UDP throughput test.

As a trade-off, packets sent from the internal port towards a tunnel
device will now experience the head copy overhead. The rationale is
that the latter use-case is less relevant performance-wise.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

183dea58
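
The headroom argument in the commit above can be sketched in plain C. This is a simplified userspace model, not the kernel code: struct out_dev, min_headroom() and needs_head_copy() are hypothetical names, and the byte counts used with them are illustrative only. It assumes the tunnel asks for the output device's link-layer header size plus its needed_headroom plus its own encapsulation headers, and reallocates the head when the packet has less than that.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of the device the tunnel routes through. */
struct out_dev {
	unsigned int hard_header_len;  /* link-layer header size */
	unsigned int needed_headroom;  /* extra headroom the device asks for */
};

/* Headroom the tunnel wants before pushing tun_hdr_len bytes of headers. */
static unsigned int min_headroom(const struct out_dev *dev,
				 unsigned int tun_hdr_len)
{
	return dev->hard_header_len + dev->needed_headroom + tun_hdr_len;
}

/* True when the head must be reallocated (the pskb_expand_head() cost). */
static bool needs_head_copy(unsigned int skb_headroom,
			    const struct out_dev *dev,
			    unsigned int tun_hdr_len)
{
	return skb_headroom < min_headroom(dev, tun_hdr_len);
}
```

In this model, raising needed_headroom on the device that carries the route (br_phy in the diagram) raises the threshold, so a packet that already has enough room for the tunnel headers is copied anyway. Leaving the internal port's value untouched keeps the threshold at the real requirement.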

02 Dec, 2017 26 commits

Merge branch 'cpsw-ale-cleanups' · c5f66a85

Authored by David S. Miller on Dec 01, 2017

Grygorii Strashko says:

====================
net: ethernet: ti: cpsw/ale clean up and optimization

This is a set of non-critical cleanups and optimizations for the TI
CPSW and ALE drivers.

Rebased on top of net-next.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

c5f66a85

net: ethernet: ti: ale: fix port check in cpsw_ale_control_set/get · 97193601

Authored by Grygorii Strashko on Nov 30, 2017

The ALE port count includes the Host port and the external ports, and
ALE port numbering starts from 0, so correct the corresponding port
checks in cpsw_ale_control_set/get().
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

97193601
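
A minimal sketch of the kind of bounds check the commit above describes, assuming ports are numbered 0 to ale_ports - 1 and the count already includes the host port. ale_port_check() is a hypothetical helper written for illustration, not the driver's function.

```c
#include <errno.h>

/* Hypothetical helper: valid ports are 0 .. ale_ports - 1, host port included. */
static int ale_port_check(int port, int ale_ports)
{
	if (port < 0 || port >= ale_ports)
		return -EINVAL;
	return 0;
}
```

With three ALE ports, ports 0, 1 and 2 pass and port 3 is rejected; a check written as `port > ale_ports` would have let port 3 through.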

net: ethernet: ti: ale: use devm_kzalloc in cpsw_ale_create() · 1971ab58

Authored by Grygorii Strashko on Nov 30, 2017

Use devm_kzalloc() in cpsw_ale_create(). This also makes the
cpsw_ale_destroy() function a nop, so remove it.
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1971ab58

net: ethernet: ti: ale: move static initialization in cpsw_ale_create() · fb1a732d

Authored by Grygorii Strashko on Nov 30, 2017

Move static initialization from cpsw_ale_start() to cpsw_ale_create() as it
does not make much sense to perform static initialization in
cpsw_ale_start(), which is called every time netif[s] is opened.
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fb1a732d

net: ethernet: ti: ale: optimize ale entry mask bits configuartion · b5d31f29

Authored by Grygorii Strashko on Nov 30, 2017

The ale->params.ale_ports parameter can be used to derive values for all
ale entry mask bits: port_mask_bits, port_mask_bits, port_num_bits.
Hence, calculate the above values and drop all hardcoded values. For the
port_num_bits calculation use the order_base_2() API.
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b5d31f29
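
To illustrate the calculation mentioned above, here is a userspace stand-in for order_base_2(), which returns the smallest number of bits able to index n items. order_base_2_sketch(), struct ale_bits and ale_bits_from_ports() are hypothetical names; the assumption that the port mask uses one bit per port is mine, not taken from the commit text.

```c
/* Userspace stand-in for the kernel's order_base_2():
 * ceil(log2(n)), with 0 returned for n <= 1. */
static unsigned int order_base_2_sketch(unsigned int n)
{
	unsigned int bits = 0;

	while (bits < 32 && (1u << bits) < n)
		bits++;
	return bits;
}

/* Hypothetical container for the derived field widths. */
struct ale_bits {
	unsigned int port_mask_bits;  /* assumed: one mask bit per port */
	unsigned int port_num_bits;   /* bits needed to encode a port number */
};

static struct ale_bits ale_bits_from_ports(unsigned int ale_ports)
{
	struct ale_bits b;

	b.port_mask_bits = ale_ports;
	b.port_num_bits = order_base_2_sketch(ale_ports);
	return b;
}
```

For a three-port ALE this gives a 3-bit mask and a 2-bit port number, which is why a single ale_ports value is enough to replace the hardcoded widths.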

net: ethernet: ti: ale: disable ale from stop() · d0aef029

Authored by Grygorii Strashko on Nov 30, 2017

ALE is enabled from cpsw_ale_start() now, but disabled only from
cpsw_ale_destroy(), which introduces an inconsistency as cpsw_ale_start() is
called when netif[s] is opened, but cpsw_ale_destroy() is called when the
driver is removed. Hence, move ALE disabling into cpsw_ale_stop().
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d0aef029

net: ethernet: ti: ale: use proper io apis · 4ff2c4bd

Authored by Grygorii Strashko on Nov 30, 2017

Switch to the writel_relaxed()/readl_relaxed() I/O API instead of the raw
version, as recommended.
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4ff2c4bd

net: ethernet: ti: cpsw: fix ale port numbers · c6395f12

Authored by Grygorii Strashko on Nov 30, 2017

TI OMAP/Sitara SoCs have a fixed number of ALE ports (3), which includes
the Host port.

Hence, use a fixed value instead of a value calculated from DT, which can be
set by the user and might not reflect the actual HW configuration.
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c6395f12

net: ethernet: ti: cpsw: move mac_hi/lo defines in cpsw.h · 2733d7b8

Authored by Grygorii Strashko on Nov 30, 2017

Move the mac_hi/lo defines into the common header cpsw.h and reuse
them in netcp_ethss.c.
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2733d7b8

net: ethernet: ti: cpsw: move platform data struct to .c file · 2c8a14d6

Authored by Grygorii Strashko on Nov 30, 2017

The CPSW platform data structs cpsw_platform_data and cpsw_slave_data are
used only inside the cpsw.c module, so move these definitions there.
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2c8a14d6

net: ethernet: ti: cpsw: use proper io apis · dda5f5fe

Authored by Grygorii Strashko on Nov 30, 2017

Switch to the writel_relaxed()/readl_relaxed() I/O API instead of the raw
version, as recommended.
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dda5f5fe

net: ethernet: ti: cpsw: drop unused var poll from cpsw_update_channels_res · fc49be85

Authored by Grygorii Strashko on Nov 30, 2017

Drop unused variable "poll" from cpsw_update_channels_res().
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fc49be85

net: phy: remove generic settings for callbacks config_aneg and read_status from drivers · 80274aba

Authored by Heiner Kallweit on Nov 30, 2017

Remove generic settings for callbacks config_aneg and read_status
from drivers.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

80274aba

net: phy: core: use genphy version of callbacks read_status and config_aneg per default · 00fde795

Authored by Heiner Kallweit on Nov 30, 2017

read_status and config_aneg are the only mandatory callbacks and most
of the time the generic implementation is used by drivers.
So make the core fall back to the generic version if a driver doesn't
implement the respective callback.

Also, the core currently doesn't seem to verify that drivers implement
the mandatory calls. If a driver doesn't do so we'd just get a NULL
pointer dereference (NPE). With this patch this potential issue doesn't
exist any longer.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

00fde795
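
The fallback described in the commit above follows a common pattern: call the driver's callback if it is set, otherwise call the generic one. The sketch below models that in userspace. Every name in it (struct phy_drv_sketch, generic_read_status(), vendor_read_status(), do_read_status()) is hypothetical, and the return values are arbitrary markers chosen so the two paths can be told apart.

```c
#include <stddef.h>

struct phy_dev;  /* opaque in this sketch */

/* Hypothetical driver ops table; a NULL entry means "not implemented". */
struct phy_drv_sketch {
	int (*config_aneg)(struct phy_dev *dev);
	int (*read_status)(struct phy_dev *dev);
};

/* Stand-in for the generic implementation supplied by the core. */
static int generic_read_status(struct phy_dev *dev)
{
	(void)dev;
	return 0;
}

/* Stand-in for a driver-specific implementation. */
static int vendor_read_status(struct phy_dev *dev)
{
	(void)dev;
	return 1;
}

/* Core-side dispatch: never dereferences a NULL callback. */
static int do_read_status(const struct phy_drv_sketch *drv, struct phy_dev *dev)
{
	if (drv->read_status)
		return drv->read_status(dev);
	return generic_read_status(dev);
}
```

With this dispatch in the core, a driver that leaves the field unset gets the generic behaviour, which is what allows the companion commit to delete the explicit generic assignments from drivers.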

Merge branch 'ip6_gre-add-erspan-native-tunnel-for-ipv6' · efbae716

Authored by David S. Miller on Dec 01, 2017

William Tu says:

====================
ip6_gre: add erspan native tunnel for ipv6

The patch series adds support for ERSPAN tunnel over ipv6.  The first patch
refactors the existing ipv4 gre implementation and the second refactors the
ipv6 gre's xmit code.  Finally the last patch introduces erspan protocol.

change in v5:
  - add cover-letter description

change in v4:
  - rebase on top of net-next
  - use log_ecn_error in ip6_tnl_rcv

change in v3:
  - add inline for functions in header
  - rebase on top of net-next

change in v2:
  - remove inline
  - fix some indent
  - fix errors reports by clang and scan-build
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

efbae716

ip6_gre: Add ERSPAN native tunnel support · 5a963eb6

Authored by William Tu on Nov 30, 2017

The patch adds support for ERSPAN tunnel over ipv6.
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5a963eb6

ip6_gre: Refactor ip6gre xmit codes · 898b2979

Authored by William Tu on Nov 30, 2017

This patch refactors ip6gre_xmit_{ipv4, ipv6}.
It is prep work for adding the ip6erspan tunnel.
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

898b2979

ip_gre: Refector the erpsan tunnel code. · a3222dc9

Authored by William Tu on Nov 30, 2017

Move two erspan functions to the header file erspan.h, so the ipv6
erspan implementation can use them.
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a3222dc9

Merge branch 'ethtool-reset-AP' · 50e0f5c0

Authored by David S. Miller on Dec 01, 2017

Scott Branden says:

====================
net: ethtool: add support for ETH_RESET_AP

Add support to reset application processors inside SmartNICs by
defining a new ETH_RESET_AP bit.

And use the new ETH_RESET_AP bit in the bnxt ethernet driver.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

50e0f5c0

bnxt_en: Add ETH_RESET_AP support · 6502ad59

Authored by Scott Branden on Nov 30, 2017

Add ETH_RESET_AP support handling to reset the internal
Application Processor(s) of the SmartNIC card.
Signed-off-by: Scott Branden <scott.branden@broadcom.com>
Acked-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6502ad59

net: ethtool: add support for reset of AP inside NIC interface. · 40e44a1e

Authored by Scott Branden on Nov 30, 2017

Add ETH_RESET_AP to reset the application processor(s) inside the NIC
interface.

The existing ETH_RESET_MGMT covers a management processor inside the NIC.
This is typically used for remote NIC management purposes.

Application processors exist inside some SmartNICs to run various
applications inside the NIC processor, ranging from a simple algorithm
without an OS to something as complex as hosting multiple VMs.
Signed-off-by: Scott Branden <scott.branden@broadcom.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

40e44a1e
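
As a rough illustration of how a reset bit like this is consumed, the sketch below follows the ethtool reset convention as I understand it: the caller passes a bitmask of components to reset and the driver clears the bits it actually handled. The SK_-prefixed constants mirror what I believe the uapi header defines (bit 0 for MGMT, bit 8 for AP) and should be treated as assumptions; sketch_reset() and its counter argument are hypothetical.

```c
#include <stdint.h>

/* Assumed to match include/uapi/linux/ethtool.h; verify against the header. */
#define SK_ETH_RESET_MGMT (1u << 0)  /* management processor */
#define SK_ETH_RESET_AP   (1u << 8)  /* application processor */

/* Hypothetical driver hook: resets the application processor when asked
 * and clears that bit so the caller can see what was done. Bits the
 * driver does not handle are left set. */
static int sketch_reset(uint32_t *flags, int *ap_resets)
{
	if (*flags & SK_ETH_RESET_AP) {
		(*ap_resets)++;              /* stand-in for the real reset */
		*flags &= ~SK_ETH_RESET_AP;  /* report it as handled */
	}
	return 0;
}
```

Keeping AP separate from MGMT lets a caller restart the workload-carrying processor without touching the one used for remote management.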

Merge branch 'rds-tcp-netns-delete-related-fixes' · 68bf33f4

Authored by David S. Miller on Dec 01, 2017

Sowmini Varadhan says:

====================
rds-tcp netns delete related fixes

Patchset contains cleanup and bug fixes. Patch 1 is the removal
of some redundant code/functions. Patch 2 and 3 are fixes for
corner cases identified by syzkaller. I've not been able to
reproduce the actual use-after-free race flagged in the syzkaller
reports, thus these fixes are based on code inspection plus
manual testing to make sure the modified code paths are executed
without problems in the commonly encountered timing cases.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

68bf33f4

rds: tcp: atomically purge entries from rds_tcp_conn_list during netns delete · f10b4cff

Authored by Sowmini Varadhan on Nov 30, 2017

The rds_tcp_kill_sock() function parses the rds_tcp_conn_list
to find the rds_connection entries marked for deletion as part
of the netns deletion under the protection of the rds_tcp_conn_lock.
Since the rds_tcp_conn_list tracks rds_tcp_connections (which
have a 1:1 mapping with rds_conn_path), multiple tc entries in
the rds_tcp_conn_list will map to a single rds_connection, and will
be deleted as part of the rds_conn_destroy() operation that is
done outside the rds_tcp_conn_lock.

The rds_tcp_conn_list traversal done under the protection of
rds_tcp_conn_lock should not leave any doomed tc entries in
the list after the rds_tcp_conn_lock is released, else another
concurrently executing netns delete thread (for a different netns)
may trip on these entries.
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f10b4cff
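
The requirement above, that no doomed entry remains on the shared list once the lock is dropped, can be sketched with a plain singly linked list. This is not the rds code: struct tc_entry and detach_doomed() are hypothetical, and the locking is only described in comments. The idea shown is to unlink every matching entry in a single pass and hand the caller a private chain to destroy later.

```c
#include <stddef.h>

/* Hypothetical list node; netns_id stands in for the owning namespace. */
struct tc_entry {
	int netns_id;
	struct tc_entry *next;
};

/* Unlink every entry that belongs to netns_id from *list and return the
 * unlinked entries as a separate chain. In the kernel the whole walk
 * would run with rds_tcp_conn_lock held, so other threads never observe
 * a doomed entry after the lock is released. */
static struct tc_entry *detach_doomed(struct tc_entry **list, int netns_id)
{
	struct tc_entry *doomed = NULL;
	struct tc_entry **pp = list;

	while (*pp) {
		struct tc_entry *e = *pp;

		if (e->netns_id == netns_id) {
			*pp = e->next;     /* remove from the shared list */
			e->next = doomed;  /* push onto the private chain */
			doomed = e;
		} else {
			pp = &e->next;
		}
	}
	return doomed;
}
```

The expensive teardown can then run on the returned chain without the lock, while a concurrent deleter working on a different namespace only ever sees entries that are still live.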

rds: tcp: correctly sequence cleanup on netns deletion. · 681648e6

Authored by Sowmini Varadhan on Nov 30, 2017

Commit 8edc3aff ("rds: tcp: Take explicit refcounts on struct net")
introduces a regression in rds-tcp netns cleanup. The cleanup_net(),
(and thus rds_tcp_dev_event notification) is only called from put_net()
when all netns refcounts go to 0, but this cannot happen if the
rds_connection itself is holding a c_net ref that it expects to
release in rds_tcp_kill_sock.

Instead, the rds_tcp_kill_sock callback should make sure to
tear down state carefully, ensuring that the socket teardown
is only done after all data-structures and workqs that depend
on it are quiesced.

The original motivation for commit 8edc3aff ("rds: tcp: Take explicit
refcounts on struct net") was to resolve a race condition reported by
syzkaller where workqs for tx/rx/connect were triggered after the
namespace was deleted. Those worker threads should have been
cancelled/flushed before socket tear-down and indeed,
rds_conn_path_destroy() does try to sequence this by doing
     /* cancel cp_send_w */
     /* cancel cp_recv_w */
     /* flush cp_down_w */
     /* free data structures */
Here the "flush cp_down_w" will trigger rds_conn_shutdown and thus
invoke rds_tcp_conn_path_shutdown() to close the tcp socket, so that
we ought to have satisfied the requirement that "socket-close is
done after all other dependent state is quiesced". However,
rds_conn_shutdown has a bug in that it *always* triggers the reconnect
workq (and if connection is successful, we always restart tx/rx
workqs so with the right timing, we risk the race conditions reported
by syzkaller).

Netns deletion is like module teardown: there is no need to restart a
reconnect in this case. We can use the c_destroy_in_prog bit
to avoid restarting the reconnect.

Fixes: 8edc3aff ("rds: tcp: Take explicit refcounts on struct net")
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

681648e6

rds: tcp: remove redundant function rds_tcp_conn_paths_destroy() · 2d746c93

Authored by Sowmini Varadhan on Nov 30, 2017

A side-effect of commit c14b0366 ("rds: tcp: set linger to 1
when unloading a rds-tcp") is that we always send a RST on the tcp
connection for rds_conn_destroy(), so rds_tcp_conn_paths_destroy()
is not needed any more and is removed in this patch.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2d746c93

tipc: fall back to smaller MTU if allocation of local send skb fails · 4c94cc2d

Authored by Jon Maloy on Nov 30, 2017

When sending node local messages the code is using an 'mtu' of 66060
bytes to avoid unnecessary fragmentation. During situations of low
memory tipc_msg_build() may sometimes fail to allocate such large
buffers, resulting in unnecessary send failures. This can easily be
remedied by falling back to a smaller MTU, and then reassembling the
buffer chain as if the message were arriving from a remote node.

At the same time, we change the initial MTU setting of the broadcast
link to a lower value, so that large messages always are fragmented
into smaller buffers even when we run in single node mode. Apart from
obtaining the same advantage as for the 'fallback' solution above, this
turns out to give a significant performance improvement. This can
probably be explained with the __pskb_copy() operation performed on the
buffer for each recipient during reception. We found the optimal value
for this, considering the most relevant skb pool, to be 3744 bytes.
Acked-by: Ying Xue <ying.xue@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4c94cc2d

01 Dec, 2017 6 commits

Merge branch 'macb-rx-packet-filtering' · 201c78e0

Authored by David S. Miller on Nov 30, 2017

Rafal Ozieblo says:

====================
Receive packets filtering for macb driver

This patch series adds support for receive packet
filtering to the Cadence GEM driver. Packets can be redirected
to different hardware queues based on source IP, destination IP,
source port or destination port. To enable filtering,
support for RX queueing was added as well.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

201c78e0

net: macb: Added support for RX filtering · ae8223de

Authored by Rafal Ozieblo on Nov 30, 2017

This patch allows filtering received packets to different
hardware queues (aka ntuple).
Signed-off-by: Rafal Ozieblo <rafalo@cadence.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ae8223de

net: macb: Added some queue statistics · 512286bb

Authored by Rafal Ozieblo on Nov 30, 2017

Added statistics per queue:
- qX_rx_packets
- qX_rx_bytes
- qX_rx_dropped
- qX_tx_packets
- qX_tx_bytes
- qX_tx_dropped
Signed-off-by: Rafal Ozieblo <rafalo@cadence.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

512286bb

net: macb: Added support for many RX queues · ae1f2a56

Authored by Rafal Ozieblo on Nov 30, 2017

To enable packet reception on different RX queues, some
configuration has to be performed. This patch checks how many
hardware queues GEM supports and initializes them.
Signed-off-by: Rafal Ozieblo <rafalo@cadence.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ae1f2a56

vmxnet3: increase default rx ring sizes · 7475908f

Authored by Shrikrishna Khare on Nov 30, 2017

There are several reasons for increasing the receive ring sizes:

1. The original ring size of 256 was chosen about 10 years ago when
vmxnet3 was first created. At that time, 10Gbps Ethernet was not prevalent
and servers were dominated by 1Gbps Ethernet. Now 10Gbps is commonplace,
and higher bandwidth links -- 25Gbps, 40Gbps, 50Gbps -- are starting
to appear. 256 Rx ring entries are simply not enough to keep up with
higher link speeds when there is a burst of network frames coming from
these high speed links. Even with full MTU size frames, they are gone
in a short time. It is also more common to have a mix of frame sizes,
and more likely a bi-modal distribution of frame sizes, so the average frame
size is not close to full MTU. If we consider an average frame size of 800B,
1024 frames that come in a burst take ~0.65 ms to arrive at 10Gbps. With
256 entries, it takes ~0.16 ms to arrive at 10Gbps. At 25Gbps or 40Gbps,
this time is reduced accordingly.

2. On a hypervisor where there are many VMs and CPU is over committed,
i.e. the number of VCPUs is more than the number of PCPUs, each PCPU is
in effect time shared between multiple VMs/VCPUs. The time granularity at
which this multiplexing occurs is typically coarser than between processes
on a guest OS. Trying to time slice more finely is not efficient, for
example, if memory cache is barely warmed up when switching from one VM
to another occurs. This CPU overcommit adds delay to when the driver
in a VM can service incoming packets. Whether CPU is over committed
really depends on customer workloads. For certain situations, it is very
common. For example, workloads of desktop VMs and product testing setups.
Consolidation and sharing is what drives efficiency of a customer setup
for such workloads. In these situations, the raw network bandwidth may
not be very high, but the delays between when a VM is running or not
running can also be relatively long.
Signed-off-by: Shrikrishna Khare <skhare@vmware.com>
Acked-by: Jin Heo <heoj@vmware.com>
Acked-by: Guolin Yang <gyang@vmware.com>
Acked-by: Boon Ang <bang@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7475908f
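
The burst-time figures quoted in the commit above can be reproduced with a few lines of arithmetic. burst_ms() is a hypothetical helper; it assumes back-to-back frames and ignores preamble, inter-frame gap and other framing overhead, which is the same simplification the quoted numbers appear to make.

```c
/* Time in milliseconds for `frames` frames of `frame_bytes` bytes each
 * to arrive on a link running at `gbps` gigabits per second. */
static double burst_ms(unsigned int frames, unsigned int frame_bytes,
		       double gbps)
{
	double bits = (double)frames * (double)frame_bytes * 8.0;

	return bits / (gbps * 1e9) * 1e3;
}
```

With 800-byte frames at 10Gbps, 1024 frames take about 0.655 ms and 256 frames about 0.164 ms, matching the ~0.65 ms and ~0.16 ms in the text. At 40Gbps the 256-frame figure drops to roughly 0.041 ms, which is the window the driver has before a 256-entry ring overflows.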

net: dsa: bcm_sf2: Utilize b53_get_tag_protocol() · 9f66816a

Authored by Florian Fainelli on Nov 30, 2017

Utilize the much more capable b53_get_tag_protocol(), which takes care of
all Broadcom switch specifics to resolve which port can have Broadcom
tags enabled or not.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9f66816a

30 Nov, 2017 7 commits

net/reuseport: drop legacy code · e94a62f5

Authored by Paolo Abeni on Nov 30, 2017

Since commit e32ea7e7 ("soreuseport: fast reuseport UDP socket
selection") and commit c125e80b ("soreuseport: fast reuseport
TCP socket selection") the relevant reuseport socket matching the current
packet is selected by the reuseport_select_sock() call. The only
exceptions are invalid BPF filters/filters returning out-of-range
indices.
In the latter case the code implicitly falls back to using the hash
demultiplexing, but instead of selecting the socket inside the
reuseport_select_sock() function, it relies on the hash selection
logic introduced with the early soreuseport implementation.

With this patch, in case of a BPF filter returning a bad socket
index value, we fall back to hash-based selection inside the
reuseport_select_sock() body, so that we can drop some duplicate
code in the ipv4 and ipv6 stack.

This also allows faster lookup in the above scenario and will allow
us to avoid computing the hash value for successful, BPF based
demultiplexing - in a later patch.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e94a62f5
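
The selection logic described above, using the BPF result when it is a valid index and falling back to the hash otherwise, can be modelled in a few lines. select_sock_index() and the convention that a negative bpf_index means "no usable filter result" are hypothetical; scale_hash() uses the same multiply-and-shift mapping as the kernel's reciprocal_scale().

```c
#include <stdint.h>

/* Map a 32-bit hash onto [0, n) without a division. */
static uint32_t scale_hash(uint32_t hash, uint32_t n)
{
	return (uint32_t)(((uint64_t)hash * n) >> 32);
}

/* Pick a socket slot. bpf_index < 0 models "no filter or filter failed";
 * an index >= num_socks models a filter returning an out-of-range value.
 * Both cases fall back to hash-based selection in the same function. */
static uint32_t select_sock_index(long bpf_index, uint32_t hash,
				  uint32_t num_socks)
{
	if (bpf_index >= 0 && (unsigned long)bpf_index < num_socks)
		return (uint32_t)bpf_index;
	return scale_hash(hash, num_socks);
}
```

Because the fallback lives in one place, the callers no longer need their own copy of the hash-based loop, and the hash only has to be consulted when the filter result is unusable.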

Documentation: net: dsa: Cut set_addr() documentation · 0fc66ddf

Authored by Linus Walleij on Nov 29, 2017

This is not supported anymore. Devices needing a MAC address
just assign one at random; it's just a driver peculiarity.
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

0fc66ddf

Merge branch 'net-dst_entry-shrink' · 3d8068c5

Authored by David S. Miller on Nov 30, 2017

David Miller says:

====================
net: Significantly shrink the size of routes.

Through a combination of several things, our route structures are
larger than they need to be.

Mostly this stems from having members in dst_entry which are only used
by one class of routes.  So the majority of the work in this series is
about "un-commoning" these members and pushing them into the type
specific structures.

Unfortunately, IPSEC needed the most surgery.  The majority of the
changes here had to do with bundle creation and management.

The other issue is the refcount alignment in dst_entry.  Once we get
rid of the not-so-common members, it really opens the door to removing
that alignment entirely.

I think the new layout looks really nice, so I'll reproduce it here:

	struct net_device       *dev;
	struct  dst_ops	        *ops;
	unsigned long		_metrics;
	unsigned long           expires;
	struct xfrm_state	*xfrm;
	int			(*input)(struct sk_buff *);
	int			(*output)(struct net *net, struct sock *sk, struct sk_buff *skb);
	unsigned short		flags;
	short			obsolete;
	unsigned short		header_len;
	unsigned short		trailer_len;
	atomic_t		__refcnt;
	int			__use;
	unsigned long		lastuse;
	struct lwtunnel_state   *lwtstate;
	struct rcu_head		rcu_head;
	short			error;
	short			__pad;
	__u32			tclassid;

(This is for 64-bit, on 32-bit the __refcnt comes at the very end)

So, the good news:

1) struct dst_entry shrinks from 160 to 112 bytes.

2) struct rtable shrinks from 216 to 168 bytes.

3) struct rt6_info shrinks from 384 to 320 bytes.

Enjoy.

v2:
	Collapse some patches logically based upon feedback.
	Fix the strange patch #7.

v3:	xfrm_dst_path() needs inline keyword
	Properly align __refcnt on 32-bit.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

3d8068c5

net: Remove dst->next · 7149f813

Authored by David Miller on Nov 28, 2017

There are no more users.
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>

7149f813

xfrm: Stop using dst->next in bundle construction. · 5492093d

Authored by David Miller on Nov 28, 2017

While building ipsec bundles, blocks of xfrm dsts are linked together
using dst->next from bottom to the top.

The only thing this is used for is initializing the pmtu values of the
xfrm stack, and for updating the mtu values at xfrm_bundle_ok() time.

The bundle pmtu entries must be processed in this order so that pmtu
values lower in the stack of routes can propagate up to the higher
ones.

Avoid using dst->next by simply maintaining an array of dst pointers
as we already do for the xfrm_state objects when building the bundle.
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>

5492093d

net: Rearrange dst_entry layout to avoid useless padding. · 8b207e73

Authored by David Miller on Nov 28, 2017

We have padding to try and align the refcount on a separate cache
line.  But after several simplifications the padding has increased
substantially.

So now it's easy to change the layout to get rid of the padding
entirely.

We group the write-heavy __refcnt and __use with less often used
items such as the rcu_head and the error code.
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>

8b207e73
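
To show why forcing one member onto its own cache line can cost so much space, here is a small userspace comparison. Both structs are hypothetical and far smaller than dst_entry, and C11 _Alignas is used where the kernel would use its own padding or alignment annotations; the 64-byte line size is an assumption.

```c
#include <stddef.h>

/* Members laid out back to back: the write-heavy counters simply follow
 * the read-mostly fields. */
struct compact_layout {
	long expires;
	void *dev;
	int refcnt;
	int use;
};

/* Same members, but refcnt is forced onto a 64-byte boundary. The
 * compiler fills the gap before it with padding and rounds the total
 * size up to a multiple of 64. */
struct aligned_layout {
	long expires;
	void *dev;
	_Alignas(64) int refcnt;
	int use;
};

/* Bytes spent on alignment in this toy example. */
static size_t wasted_bytes(void)
{
	return sizeof(struct aligned_layout) - sizeof(struct compact_layout);
}
```

The fewer members that sit in front of the aligned field, the larger the hole becomes, which is the effect the commit describes after earlier patches removed members from dst_entry. Grouping the write-heavy counters with rarely used fields keeps them away from the hot read-mostly members without explicit padding.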

xfrm: Move dst->path into struct xfrm_dst · 0f6c480f

Authored by David Miller on Nov 28, 2017

The first member of an IPSEC route bundle chain sets its dst->path to
the underlying ipv4/ipv6 route that carries the bundle.

Stated another way, if one were to follow the xfrm_dst->child chain of
the bundle, the final non-NULL pointer would be the path and point to
either an ipv4 or an ipv6 route.

This is largely used to make sure that PMTU events propagate down to
the correct ipv4 or ipv6 route.

When we don't have the top of an IPSEC bundle 'dst->path == dst'.

Move it down into xfrm_dst and key off of dst->xfrm.
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>

0f6c480f

openanolis / cloud-kernel synced successfully almost 2 years ago
