  1. 06 Oct 2014 (4 commits)
  2. 05 Oct 2014 (3 commits)
    • net: sched: suspicious RCU usage in qdisc_watchdog · 1e203c1a
      Committed by John Fastabend
      Suspicious RCU usage in qdisc_watchdog: the rcu_dereference() in the
      qdisc_watchdog() call needs to be done inside
      rcu_read_lock()/rcu_read_unlock(). Qdisc destroy operations also need
      to ensure the timer is cancelled before the qdisc structure is removed.
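      The shape of the fix, as a hedged sketch (modeled on that era's
      qdisc_watchdog() in net/sched/sch_api.c, not a verbatim copy of the
      patch):

        static enum hrtimer_restart qdisc_watchdog(struct hrtimer *timer)
        {
                struct qdisc_watchdog *wd = container_of(timer,
                                                         struct qdisc_watchdog,
                                                         timer);

                /* qdisc_root() does an rcu_dereference(), so it must run
                 * inside an RCU read-side critical section. */
                rcu_read_lock();
                qdisc_unthrottled(wd->qdisc);
                __netif_schedule(qdisc_root(wd->qdisc));
                rcu_read_unlock();

                return HRTIMER_NORESTART;
        }

      The lockdep splat: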
      
      [ 3992.191339] ===============================
      [ 3992.191340] [ INFO: suspicious RCU usage. ]
      [ 3992.191343] 3.17.0-rc6net-next+ #72 Not tainted
      [ 3992.191345] -------------------------------
      [ 3992.191347] include/net/sch_generic.h:272 suspicious rcu_dereference_check() usage!
      [ 3992.191348]
      [ 3992.191348] other info that might help us debug this:
      [ 3992.191348]
      [ 3992.191351]
      [ 3992.191351] rcu_scheduler_active = 1, debug_locks = 1
      [ 3992.191353] no locks held by swapper/1/0.
      [ 3992.191355]
      [ 3992.191355] stack backtrace:
      [ 3992.191358] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.17.0-rc6net-next+ #72
      [ 3992.191360] Hardware name:                  /DZ77RE-75K, BIOS GAZ7711H.86A.0060.2012.1115.1750 11/15/2012
      [ 3992.191362]  0000000000000001 ffff880235803e48 ffffffff8178f92c 0000000000000000
      [ 3992.191366]  ffff8802322224a0 ffff880235803e78 ffffffff810c9966 ffff8800a5fe3000
      [ 3992.191370]  ffff880235803f30 ffff8802359cd768 ffff8802359cd6e0 ffff880235803e98
      [ 3992.191374] Call Trace:
      [ 3992.191376]  <IRQ>  [<ffffffff8178f92c>] dump_stack+0x4e/0x68
      [ 3992.191387]  [<ffffffff810c9966>] lockdep_rcu_suspicious+0xe6/0x130
      [ 3992.191392]  [<ffffffff8167213a>] qdisc_watchdog+0x8a/0xb0
      [ 3992.191396]  [<ffffffff810f93f2>] __run_hrtimer+0x72/0x420
      [ 3992.191399]  [<ffffffff810f9bcd>] ? hrtimer_interrupt+0x7d/0x240
      [ 3992.191403]  [<ffffffff816720b0>] ? tc_classify+0xc0/0xc0
      [ 3992.191406]  [<ffffffff810f9c4f>] hrtimer_interrupt+0xff/0x240
      [ 3992.191410]  [<ffffffff8109e4a5>] ? __atomic_notifier_call_chain+0x5/0x140
      [ 3992.191415]  [<ffffffff8103577b>] local_apic_timer_interrupt+0x3b/0x60
      [ 3992.191419]  [<ffffffff8179c2b5>] smp_apic_timer_interrupt+0x45/0x60
      [ 3992.191422]  [<ffffffff8179a6bf>] apic_timer_interrupt+0x6f/0x80
      [ 3992.191424]  <EOI>  [<ffffffff815ed233>] ? cpuidle_enter_state+0x73/0x2e0
      [ 3992.191432]  [<ffffffff815ed22e>] ? cpuidle_enter_state+0x6e/0x2e0
      [ 3992.191437]  [<ffffffff815ed567>] cpuidle_enter+0x17/0x20
      [ 3992.191441]  [<ffffffff810c0741>] cpu_startup_entry+0x3d1/0x4a0
      [ 3992.191445]  [<ffffffff81106fc6>] ? clockevents_config_and_register+0x26/0x30
      [ 3992.191448]  [<ffffffff81033c16>] start_secondary+0x1b6/0x260
      
      Fixes: b26b0d1e ("net: qdisc: use rcu prefix and silence sparse warnings")
      Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
      Acked-by: Cong Wang <cwang@twopensource.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: do not call phy_start_aneg · f7d6b96f
      Committed by Florian Fainelli
      Commit f7f1de51 ("net: dsa: start and stop the PHY state machine")
      added calls to phy_start() in dsa_slave_open() and phy_stop() in
      dsa_slave_close() respectively.

      We also call phy_start_aneg() in dsa_slave_create(), and this call
      interferes with the PHY state machine, since we basically start
      auto-negotiation and then restart it later when calling phy_start().
      phy_start() does not currently handle the PHY_FORCING or PHY_AN states
      properly, but such a fix would be too invasive for this window.
      
      Fixes: f7f1de51 ("net: dsa: start and stop the PHY state machine")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Cleanup skb cloning by adding SKB_FCLONE_FREE · c8753d55
      Committed by Vijay Subramanian
      SKB_FCLONE_UNAVAILABLE has an overloaded meaning depending on the type
      of skb:
      1: If the skb is allocated from head_cache, it indicates that no
      fclone is available.
      2: If the skb is a companion fclone skb (allocated from fclone_cache),
      it indicates that it is available to be used.

      To avoid confusion for case 2 above, this patch replaces
      SKB_FCLONE_UNAVAILABLE with SKB_FCLONE_FREE where appropriate. For
      fclone companion skbs, this indicates the skb is free for use.

      SKB_FCLONE_UNAVAILABLE now simply indicates that the skb is from
      head_cache and cannot / will not have a companion fclone.
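      After this patch the fclone state enum reads roughly as follows (a
      sketch; the comments are ours):

        enum {
                SKB_FCLONE_UNAVAILABLE, /* skb from head_cache, no companion */
                SKB_FCLONE_ORIG,        /* orig skb, from fclone_cache */
                SKB_FCLONE_CLONE,       /* companion fclone skb, in use */
                SKB_FCLONE_FREE,        /* companion fclone skb, free for use */
        };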
      Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 04 Oct 2014 (8 commits)
    • ip_tunnel: Add GUE support · bc1fc390
      Committed by Tom Herbert
      This patch allows configuring IPIP, sit, and GRE tunnels to use GUE.
      This is very similar to fou, except that we need to insert the GUE
      header in addition to the UDP header on transmit.
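      For reference, a hedged sketch of the GUE header that gets inserted
      after the UDP header (field names approximate those in the GUE
      patches; endianness handling of the bitfields is omitted):

        struct guehdr {
                __u8    hlen:4,    /* length of optional fields, 32-bit words */
                        version:4;
                __u8    next_hdr;  /* IP protocol of the encapsulated packet */
                __be16  flags;
        };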
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gue: Receive side for Generic UDP Encapsulation · 37dd0247
      Committed by Tom Herbert
      This patch adds support for receiving GUE packets in the fou module.
      The fou module now supports direct foo-over-udp (no encapsulation
      header) and GUE. To support this, a type parameter is added to the fou
      netlink parameters.

      For a GUE socket we define gue_udp_recv, gue_gro_receive, and
      gue_gro_complete to handle the specifics of the GUE protocol. Most of
      the code to manage and configure sockets is common with fou.
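      A hedged sketch of the wiring (pattern only; the real setup lives in
      net/ipv4/fou.c and shares its socket plumbing with fou):

        /* A GUE socket is a kernel UDP socket whose encapsulation hook
         * points at the GUE receive path. */
        udp_sk(sk)->encap_type = 1;             /* mark as an encap socket */
        udp_sk(sk)->encap_rcv  = gue_udp_recv;  /* strip UDP+GUE, reinject */
        udp_encap_enable();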
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • fou: eliminate IPv4,v6 specific GRO functions · efc98d08
      Committed by Tom Herbert
      This patch removes the fou[46]_gro_receive and fou[46]_gro_complete
      functions. The v4 or v6 variants were chosen for the UDP offloads
      based on the address family of the socket; this is neither necessary
      nor correct. Instead, this patch adds is_ipv6 to napi_gro_cb. It is
      set in udp6_gro_receive and cleared in udp4_gro_receive. In
      fou_gro_receive the value is used to select the correct inet_offloads
      for the protocol of the outer IP header.
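      The selection logic described above, sketched (close to the resulting
      fou_gro_receive, abbreviated):

        static struct sk_buff **fou_gro_receive(struct sk_buff **head,
                                                struct sk_buff *skb)
        {
                const struct net_offload **offloads;
                u8 proto = NAPI_GRO_CB(skb)->proto;

                /* pick the offload table matching the outer IP header */
                offloads = NAPI_GRO_CB(skb)->is_ipv6 ? inet6_offloads
                                                     : inet_offloads;
                ...
        }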
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip_tunnel: Account for secondary encapsulation header in max_headroom · 7371e022
      Committed by Tom Herbert
      When adjusting max_headroom for the tunnel interface based on the
      egress device, we need to account for any extra bytes of secondary
      encapsulation (e.g. FOU).
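      Illustratively (the encap-length member here is an assumption for the
      sketch, not lifted from the patch):

        /* headroom must cover the outer IP header plus any secondary
         * encapsulation, e.g. the UDP(+GUE) bytes added by FOU */
        max_headroom = LL_RESERVED_SPACE(rt->dst.dev)
                     + sizeof(struct iphdr)
                     + rt->dst.header_len
                     + tunnel->encap_hlen;   /* assumed member: FOU bytes */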
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: do not export skb_gro_receive() · 01291202
      Committed by Eric Dumazet
      skb_gro_receive() is only called from tcp_gro_receive(), which is not
      in a module.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • qdisc: validate skb without holding lock · 55a93b3e
      Committed by Eric Dumazet
      Validation of an skb can be pretty expensive:

      GSO segmentation and/or checksum computations.

      We can do this without holding the qdisc lock, so that other cpus can
      queue additional packets.

      The trick is that requeued packets were already validated, so we carry
      a boolean so that sch_direct_xmit() can either validate a fresh skb
      list or directly use an old one.
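      Sketched (close to the resulting code path, abbreviated):

        /* Called with the qdisc root_lock NOT held; requeued packets pass
         * validate == false since they were validated on first dequeue. */
        if (likely(validate))
                skb = validate_xmit_skb_list(skb, dev); /* GSO seg, csum */
        if (skb)
                /* take HARD_TX_LOCK and hand the list to the driver */
                skb = dev_hard_start_xmit(skb, dev, txq, &ret);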
      
      Tested on a 40Gb NIC (8 TX queues) with 200 concurrent flows on a
      48-thread host.

      Turning TSO on or off had no effect on throughput, only a few more cpu
      cycles. Lock contention on the qdisc lock disappeared.

      Same when disabling TX checksum offload.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • qdisc: dequeue bulking also pickup GSO/TSO packets · 808e7ac0
      Committed by Jesper Dangaard Brouer
      TSO and GSO segmented packets already benefit from bulking on their
      own.

      TSO packets have always taken advantage of updating the tailptr only
      once for a large packet.

      GSO segmented packets have recently taken advantage of the bulking
      xmit_more API, via merge commit 53fda7f7 ("Merge branch 'xmit_list'"),
      specifically via commit 7f2e870f ("net: Move main gso loop out of
      dev_hard_start_xmit() into helper.") allowing qdisc requeue of the
      remaining list, and via commit ce93718f ("net: Don't keep around
      original SKB when we software segment GSO frames.").

      This patch allows further bulking of TSO/GSO packets together when
      dequeueing from the qdisc.
      
      Testing:
       Measuring HoL (Head-of-Line) blocking for TSO and GSO with
      netperf-wrapper. Bulking several TSO packets shows no performance
      regressions (requeues were in the area of 32 requeues/sec).

      Bulking several GSO packets shows a small regression or a very small
      improvement (requeues were in the area of 8000 requeues/sec).

       Using ixgbe at 10Gbit/s with GSO bulking, we can measure some
      additional latency. The base case, which is "normal" GSO bulking, sees
      a varying high-prio queue delay between 0.38ms and 0.47ms. Bulking
      several GSOs together results in a stable high-prio queue delay of
      0.50ms.

       Using igb at 100Mbit/s with GSO bulking shows an improvement. The
      base case sees a varying high-prio queue delay between 2.23ms and
      2.35ms.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE · 5772e9a3
      Committed by Jesper Dangaard Brouer
      Based on DaveM's recent API work on dev_hard_start_xmit(), which
      allows sending/processing an entire skb list.
      
      This patch implements qdisc bulk dequeue, by allowing multiple packets
      to be dequeued in dequeue_skb().
      
      The optimization principle for this is twofold: (1) amortize locking
      cost and (2) avoid the expensive tailptr update for notifying HW.
       (1) Several packets are dequeued while holding the qdisc root_lock,
      amortizing the locking cost over several packets. The dequeued SKB
      list is processed under the TXQ lock in dev_hard_start_xmit(), thus
      also amortizing the cost of the TXQ lock.
       (2) Furthermore, dev_hard_start_xmit() will utilize the
      skb->xmit_more API to delay the HW tailptr update, which also reduces
      the cost per packet.
      
      One restriction of the new API is that every SKB must belong to the
      same TXQ. This patch takes the easy way out by restricting bulk
      dequeue to qdiscs with the TCQ_F_ONETXQUEUE flag, which specifies that
      the qdisc has only a single TXQ attached.
      
      Some detail about the flow: dev_hard_start_xmit() will process the skb
      list and transmit packets individually towards the driver (see
      xmit_one()). In case the driver stops midway through the list, the
      remaining skb list is returned by dev_hard_start_xmit(). In
      sch_direct_xmit() this returned list is requeued by dev_requeue_skb().
      
      To avoid overshooting the HW limits, which results in requeuing, the
      patch limits the amount of bytes dequeued based on the driver's BQL
      limits. In effect, bulking will only happen for BQL-enabled drivers,
      as in the sketch below.
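      A hedged sketch of the BQL-bounded bulk dequeue (modeled on the
      resulting dequeue_skb() helper, abbreviated):

        static void try_bulk_dequeue_skb(struct Qdisc *q, struct sk_buff *skb,
                                         const struct netdev_queue *txq)
        {
                /* only bulk what BQL says the driver can take right now */
                int bytelimit = qdisc_avail_bulklimit(txq) - skb->len;

                while (bytelimit > 0) {
                        struct sk_buff *nskb = q->dequeue(q);

                        if (!nskb)
                                break;
                        bytelimit -= nskb->len;
                        skb->next = nskb;       /* chain into one skb list */
                        skb = nskb;
                }
                skb->next = NULL;
        }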
      
      Small amounts of extra HoL blocking (2x MTU / 0.24ms) were measured at
      100Mbit/s with bulking of 8 packets, but the oscillating nature of the
      measurement indicates that something like scheduler latency might be
      causing this effect. More comparisons show that this oscillation goes
      away occasionally. Thus, we disregard this artifact completely and
      remove any "magic" bulking limit.
      
      For now, as a conservative approach, stop bulking when seeing TSO and
      segmented GSO packets. They already benefit from bulking on their own.
      A followup patch adds this, to allow easier bisectability for finding
      regressions.
      
      Joint work with Hannes, Daniel and Florian.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 03 Oct 2014 (8 commits)
  5. 02 Oct 2014 (17 commits)
    • Bluetooth: 6lowpan: Check transmit errors for multicast packets · 9c238ca8
      Committed by Jukka Rissanen
      We did not return an error if a multicast packet transmit failed. This
      might not be desired, so return an error in this case as well. If
      there are multiple 6lowpan devices to which the multicast packet is
      sent, then return an error even if sending to only one of them fails.
      Signed-off-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
      Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
    • Bluetooth: 6lowpan: Return EAGAIN error also for multicast packets · d7b6b0a5
      Committed by Jukka Rissanen
      Make sure that we are able to return EAGAIN from l2cap_chan_send()
      even for multicast packets. The error code was ignored unnecessarily.
      Signed-off-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
      Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
    • Bluetooth: 6lowpan: Avoid memory leak if memory allocation fails · a7807d73
      Committed by Jukka Rissanen
      If skb_unshare() returns NULL, then we leak the original skb. The
      solution is to use a temporary variable to hold the new skb.
      Signed-off-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
      Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
    • Bluetooth: 6lowpan: Memory leak as the skb is not freed · fc12518a
      Committed by Jukka Rissanen
      The earlier multicast commit 36b3dd25 ("Bluetooth: 6lowpan: Ensure
      header compression does not corrupt IPv6 header") dropped one skb
      free, which caused a memory leak.
      Signed-off-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
      Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
    • Bluetooth: Fix lockdep warning with l2cap_chan_connect · 02e246ae
      Committed by Johan Hedberg
      The L2CAP connection's channel list lock (conn->chan_lock) must never
      be taken while already holding a channel lock (chan->lock), in order
      to avoid lock inversion and lockdep warnings. So far the
      l2cap_chan_connect function has acquired chan->lock early in the
      function and then later called l2cap_chan_add(conn, chan), which tries
      to take conn->chan_lock. This violates the correct order of taking the
      locks and may lead to the following type of lockdep warning:
      
      -> #1 (&conn->chan_lock){+.+...}:
             [<c109324d>] lock_acquire+0x9d/0x140
             [<c188459c>] mutex_lock_nested+0x6c/0x420
             [<d0aab48e>] l2cap_chan_add+0x1e/0x40 [bluetooth]
             [<d0aac618>] l2cap_chan_connect+0x348/0x8f0 [bluetooth]
             [<d0cc9a91>] lowpan_control_write+0x221/0x2d0 [bluetooth_6lowpan]
      -> #0 (&chan->lock){+.+.+.}:
             [<c10928d8>] __lock_acquire+0x1a18/0x1d20
             [<c109324d>] lock_acquire+0x9d/0x140
             [<c188459c>] mutex_lock_nested+0x6c/0x420
             [<d0ab05fd>] l2cap_connect_cfm+0x1dd/0x3f0 [bluetooth]
             [<d0a909c4>] hci_le_meta_evt+0x11a4/0x1260 [bluetooth]
             [<d0a910eb>] hci_event_packet+0x3ab/0x3120 [bluetooth]
             [<d0a7cb08>] hci_rx_work+0x208/0x4a0 [bluetooth]
      
             CPU0                    CPU1
             ----                    ----
        lock(&conn->chan_lock);
                                     lock(&chan->lock);
                                     lock(&conn->chan_lock);
        lock(&chan->lock);
      
      Before calling l2cap_chan_add() the channel is not part of the
      conn->chan_l list and can therefore only be accessed by the L2CAP user
      (such as l2cap_sock.c). We can therefore assume that it is the
      responsibility of the user to handle mutual exclusion until this point
      (which we can see is already true in l2cap_sock.c, since it in many
      places touches chan members without holding chan->lock).
      
      Since the hci_conn, and by extension the l2cap_conn, creation in the
      l2cap_chan_connect() function depends on chan details, we cannot
      simply add a mutex_lock(&conn->chan_lock) at the beginning of the
      function (since the conn object doesn't yet exist there). What we can
      do, however, is move the taking of chan->lock later into the function,
      where we already have the conn object and can therefore take
      conn->chan_lock first.
      
      This patch implements the above strategy and makes some other
      necessary changes, such as using __l2cap_chan_add(), which assumes
      conn->chan_lock is held, as well as adding a second label so the
      unlocking happens as it should.
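      The resulting lock-ordering rule, restated as a sketch:

        /* conn->chan_lock must always be taken before chan->lock */
        mutex_lock(&conn->chan_lock);
        l2cap_chan_lock(chan);
        __l2cap_chan_add(conn, chan);   /* assumes chan_lock is held */
        ...
        l2cap_chan_unlock(chan);
        mutex_unlock(&conn->chan_lock);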
      Reported-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
      Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
      Tested-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
      Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
      Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
    • net: pktgen: packet bursting via skb->xmit_more · 38b2cf29
      Committed by Alexei Starovoitov
      This patch demonstrates the effect of delaying the update of the HW
      tailptr. (Based on an earlier patch by Jesper.)

      burst=1 is the default. It sends one packet with xmit_more=false.
      burst=2 sends one packet with xmit_more=true and a 2nd copy of the
              same packet with xmit_more=false.
      burst=3 sends two copies of the same packet with xmit_more=true and a
              3rd copy with xmit_more=false.

      Performance with ixgbe (usec 30):
      burst=1  tx:9.2 Mpps
      burst=2  tx:13.5 Mpps
      burst=3  tx:14.5 Mpps (full 10G line rate)
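      The core of the burst loop, sketched (the 4-argument
      netdev_start_xmit() form carries the xmit_more hint):

        /* all copies but the last are sent with xmit_more == true, letting
         * the driver defer the doorbell/tailptr write until the last one */
        ret = netdev_start_xmit(pkt_dev->skb, odev, txq, --burst > 0);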
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: add a br_set_state helper function · 775dd692
      Committed by Florian Fainelli
      In preparation for being able to propagate port states to e.g.
      notifiers or other kernel parts, do not manipulate the port state
      directly; instead use a helper function, which will allow us to do a
      bit more than just setting the state.
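      The helper, as described (a sketch matching its stated purpose):

        void br_set_state(struct net_bridge_port *p, unsigned int state)
        {
                /* single choke point: later patches can notify here */
                p->state = state;
        }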
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: avoid calling tcf_unbind_filter() in call_rcu callback · a0efb80c
      Committed by WANG Cong
      This fixes the following crash:
      
      [   63.976822] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [   63.980094] CPU: 1 PID: 15 Comm: ksoftirqd/1 Not tainted 3.17.0-rc6+ #648
      [   63.980094] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [   63.980094] task: ffff880117dea690 ti: ffff880117dfc000 task.ti: ffff880117dfc000
      [   63.980094] RIP: 0010:[<ffffffff817e6d07>]  [<ffffffff817e6d07>] u32_destroy_key+0x27/0x6d
      [   63.980094] RSP: 0018:ffff880117dffcc0  EFLAGS: 00010202
      [   63.980094] RAX: ffff880117dea690 RBX: ffff8800d02e0820 RCX: 0000000000000000
      [   63.980094] RDX: 0000000000000001 RSI: 0000000000000002 RDI: 6b6b6b6b6b6b6b6b
      [   63.980094] RBP: ffff880117dffcd0 R08: 0000000000000000 R09: 0000000000000000
      [   63.980094] R10: 00006c0900006ba8 R11: 00006ba100006b9d R12: 0000000000000001
      [   63.980094] R13: ffff8800d02e0898 R14: ffffffff817e6d4d R15: ffff880117387a30
      [   63.980094] FS:  0000000000000000(0000) GS:ffff88011a800000(0000) knlGS:0000000000000000
      [   63.980094] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [   63.980094] CR2: 00007f07e6732fed CR3: 000000011665b000 CR4: 00000000000006e0
      [   63.980094] Stack:
      [   63.980094]  ffff88011a9cd300 ffffffff82051ac0 ffff880117dffce0 ffffffff817e6d68
      [   63.980094]  ffff880117dffd70 ffffffff810cb4c7 ffffffff810cb3cd ffff880117dfffd8
      [   63.980094]  ffff880117dea690 ffff880117dea690 ffff880117dfffd8 000000000000000a
      [   63.980094] Call Trace:
      [   63.980094]  [<ffffffff817e6d68>] u32_delete_key_freepf_rcu+0x1b/0x1d
      [   63.980094]  [<ffffffff810cb4c7>] rcu_process_callbacks+0x3bb/0x691
      [   63.980094]  [<ffffffff810cb3cd>] ? rcu_process_callbacks+0x2c1/0x691
      [   63.980094]  [<ffffffff817e6d4d>] ? u32_destroy_key+0x6d/0x6d
      [   63.980094]  [<ffffffff810780a4>] __do_softirq+0x142/0x323
      [   63.980094]  [<ffffffff810782a8>] run_ksoftirqd+0x23/0x53
      [   63.980094]  [<ffffffff81092126>] smpboot_thread_fn+0x203/0x221
      [   63.980094]  [<ffffffff81091f23>] ? smpboot_unpark_thread+0x33/0x33
      [   63.980094]  [<ffffffff8108e44d>] kthread+0xc9/0xd1
      [   63.980094]  [<ffffffff819e00ea>] ? do_wait_for_common+0xf8/0x125
      [   63.980094]  [<ffffffff8108e384>] ? __kthread_parkme+0x61/0x61
      [   63.980094]  [<ffffffff819e43ec>] ret_from_fork+0x7c/0xb0
      [   63.980094]  [<ffffffff8108e384>] ? __kthread_parkme+0x61/0x61
      
      tp could be freed in a call_rcu callback too; the order is not
      guaranteed.
      
      John Fastabend says:

      ====================
      It's worth noting why this is safe. Any running schedulers will either
      read the valid class field or it will be zeroed.

      All schedulers today, when the class is 0, do a lookup using the same
      call used by tcf_exts_bind(). So even if a running classifier hits the
      null class pointer it will do a lookup and get to the same result.
      This is particularly fragile at the moment because the only way to
      verify this is to audit the scheduler call sites.
      ====================
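      A hedged sketch of the resulting pattern in cls_u32 (abbreviated; the
      freepf callback name appears in the trace above):

        /* unbind synchronously while tp is known valid; only the memory
         * free is deferred to RCU */
        tcf_unbind_filter(tp, &n->res);
        call_rcu(&n->rcu, u32_delete_key_freepf_rcu); /* frees, no tp use */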
      
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: fix another crash in cls_tcindex · 6e056569
      Committed by WANG Cong
      This patch fixes the following crash:
      
      [  166.670795] BUG: unable to handle kernel NULL pointer dereference at           (null)
      [  166.674230] IP: [<ffffffff814b739f>] __list_del_entry+0x5c/0x98
      [  166.674230] PGD d0ea5067 PUD ce7fc067 PMD 0
      [  166.674230] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [  166.674230] CPU: 1 PID: 775 Comm: tc Not tainted 3.17.0-rc6+ #642
      [  166.674230] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [  166.674230] task: ffff8800d03c4d20 ti: ffff8800cae7c000 task.ti: ffff8800cae7c000
      [  166.674230] RIP: 0010:[<ffffffff814b739f>]  [<ffffffff814b739f>] __list_del_entry+0x5c/0x98
      [  166.674230] RSP: 0018:ffff8800cae7f7d0  EFLAGS: 00010207
      [  166.674230] RAX: 0000000000000000 RBX: ffff8800cba8d700 RCX: ffff8800cba8d700
      [  166.674230] RDX: 0000000000000000 RSI: dead000000200200 RDI: ffff8800cba8d700
      [  166.674230] RBP: ffff8800cae7f7d0 R08: 0000000000000001 R09: 0000000000000001
      [  166.674230] R10: 0000000000000000 R11: 000000000000859a R12: ffffffffffffffe8
      [  166.674230] R13: ffff8800cba8c5b8 R14: 0000000000000001 R15: ffff8800cba8d700
      [  166.674230] FS:  00007fdb5f04a740(0000) GS:ffff88011a800000(0000) knlGS:0000000000000000
      [  166.674230] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [  166.674230] CR2: 0000000000000000 CR3: 00000000cf929000 CR4: 00000000000006e0
      [  166.674230] Stack:
      [  166.674230]  ffff8800cae7f7e8 ffffffff814b73e8 ffff8800cba8d6e8 ffff8800cae7f828
      [  166.674230]  ffffffff817caeec 0000000000000046 ffff8800cba8c5b0 ffff8800cba8c5b8
      [  166.674230]  0000000000000000 0000000000000001 ffff8800cf8e33e8 ffff8800cae7f848
      [  166.674230] Call Trace:
      [  166.674230]  [<ffffffff814b73e8>] list_del+0xd/0x2b
      [  166.674230]  [<ffffffff817caeec>] tcf_action_destroy+0x4c/0x71
      [  166.674230]  [<ffffffff817ca0ce>] tcf_exts_destroy+0x20/0x2d
      [  166.674230]  [<ffffffff817ec2b5>] tcindex_delete+0x196/0x1b7
      
      A struct list_head cannot simply be copied; it must always be
      initialized.
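      Illustratively (the member names here are an assumption for the
      sketch):

        struct tcindex_filter_result cr = *r; /* struct copy: stale ptrs */
        INIT_LIST_HEAD(&cr.exts.actions);     /* re-init before list ops */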
      
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gre: Set inner protocol in v4 and v6 GRE transmit · 54bc9bac
      Committed by Tom Herbert
      Call skb_set_inner_protocol to set the inner Ethernet protocol to the
      protocol being encapsulated by GRE before tunnel_xmit. This is needed
      for GSO if UDP encapsulation (fou) is being done.
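      The added call, sketched in the shared GRE transmit path:

        /* record what GRE is carrying so GSO can pick the right offload */
        skb_set_inner_protocol(skb, proto); /* proto: the inner Ethertype */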
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipip: Set inner IP protocol in ipip · 077c5a09
      Committed by Tom Herbert
      Call skb_set_inner_ipproto to set inner IP protocol to IPPROTO_IPV4
      before tunnel_xmit. This is needed if UDP encapsulation (fou) is
      being done.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sit: Set inner IP protocol in sit · 469471cd
      Committed by Tom Herbert
      Call skb_set_inner_ipproto to set inner IP protocol to IPPROTO_IPV6
      before tunnel_xmit. This is needed if UDP encapsulation (fou) is
      being done.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: Generalize skb_udp_segment · 8bce6d7d
      Committed by Tom Herbert
      skb_udp_segment is the function called from udp4_ufo_fragment to
      segment a UDP tunnel packet. This function currently assumes the
      segmentation is transparent Ethernet bridging (i.e. VXLAN
      encapsulation). This patch generalizes the function to operate on
      either an Ethertype or an IP protocol.

      The inner_protocol field must be set to the protocol of the inner
      header. This can now be either an Ethertype or an IP protocol (in a
      union). A new flag in the skbuff indicates which type is in effect.
      The skb_set_inner_protocol and skb_set_inner_ipproto helper functions
      were added to set the inner_protocol. These functions are called from
      the point where the tunnel encapsulation is occurring.
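      The two helpers, sketched to match the description above (flag
      constant names per this series):

        static inline void skb_set_inner_protocol(struct sk_buff *skb,
                                                  __be16 protocol)
        {
                skb->inner_protocol = protocol;
                skb->inner_protocol_type = ENCAP_TYPE_ETHER;
        }

        static inline void skb_set_inner_ipproto(struct sk_buff *skb,
                                                 __u8 ipproto)
        {
                skb->inner_ipproto = ipproto;
                skb->inner_protocol_type = ENCAP_TYPE_IPPROTO;
        }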
      
      When skb_udp_tunnel_segment is called, the function to segment the
      inner packet is selected based on the inner IP protocol or Ethertype.
      In the case of an IP protocol encapsulation, the function is derived
      from inet[6]_offloads. In the case of an Ethertype, skb->protocol is
      set to the inner_protocol and skb_mac_gso_segment is called. (GRE
      currently does this, but it might be possible to look up the protocol
      in offload_base and call the appropriate segmentation function
      directly.)
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: avoid one atomic operation in skb_clone() · ce1a4ea3
      Committed by Eric Dumazet
      Fast clone cloning can actually avoid an atomic_inc() if we guarantee
      that the prior clone_ref value is 1.

      This requires a change in kfree_skbmem(), to perform the
      atomic_dec_and_test() on clone_ref before setting fclone to
      SKB_FCLONE_UNAVAILABLE.
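      Sketched (abbreviated from the described skb_clone() fast path):

        struct sk_buff_fclones *fclones = container_of(skb,
                                                       struct sk_buff_fclones,
                                                       skb1);
        struct sk_buff *n = &fclones->skb2;

        if (skb->fclone == SKB_FCLONE_ORIG &&
            n->fclone == SKB_FCLONE_UNAVAILABLE) {
                n->fclone = SKB_FCLONE_CLONE;
                /* clone_ref is guaranteed to be 1 here, so a plain
                 * atomic_set() replaces the atomic_inc() */
                atomic_set(&fclones->fclone_ref, 2);
        }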
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/dccp/ccid.c: add __init to ccid_activate · e500f488
      Committed by Fabian Frederick
      ccid_activate is only called by __init ccid_initialize_builtins in the
      same module.
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/dccp/proto.c: add __init to dccp_mib_init · 0c5b8a46
      Committed by Fabian Frederick
      dccp_mib_init is only called by __init dccp_init in the same module.
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: cleanup and document skb fclone layout · d0bf4a9e
      Committed by Eric Dumazet
      Let's use a proper structure to clearly document and implement skb
      fast clones.

      Then we might more easily experiment with alternative layouts.

      This patch adds a new skb_fclone_busy() helper, used by tcp and xfrm,
      to stop leaking implementation details.
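      The layout this commit introduces, sketched from the description:

        struct sk_buff_fclones {
                struct sk_buff  skb1;       /* the original skb */
                struct sk_buff  skb2;       /* its companion fast clone */
                atomic_t        fclone_ref; /* shared reference count */
        };

        /* helper so tcp/xfrm no longer poke at the layout directly */
        static inline bool skb_fclone_busy(const struct sk_buff *skb)
        {
                const struct sk_buff_fclones *fclones;

                fclones = container_of(skb, struct sk_buff_fclones, skb1);

                return skb->fclone == SKB_FCLONE_ORIG &&
                       fclones->skb2.fclone == SKB_FCLONE_CLONE;
        }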
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>