Commits · 5d2525f7b8a7df810c3fbc548a91ba6e3cde578a · openanolis / cloud-kernel

02 Jul, 2016 26 commits

net: hns: add media-type property for hns · 5d2525f7

Authored by Kejian Yan on Jul 01, 2016

A service port in GE mode is of type PORT_TP, so it is wrong to
judge the port type by whether the port is a service port. Add a
media-type property so that the port type is known.
Reported-by: Jinchuan Tian <tianjinchuan1@huawei.com>
Signed-off-by: Kejian Yan <yankejian@huawei.com>
Signed-off-by: Yisen Zhuang <Yisen.Zhuang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns: remove redundant hns_mac_dev_to_enet_if() · 2e14b218

Authored by Kejian Yan on Jul 01, 2016

hns_mac_dev_to_enet_if() performs the same steps as
hns_get_enet_interface(), which is already called during
initialization to get the mac mode, and the mode is not changed
anywhere afterwards. A separate hns_mac_dev_to_enet_if() function to
get the mac mode is therefore redundant, so remove it.
Reported-by: Jinchuan Tian <tianjinchuan1@huawei.com>
Signed-off-by: Kejian Yan <yankejian@huawei.com>
Signed-off-by: Yisen Zhuang <Yisen.Zhuang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns: normalize two different loop · 45fc764e

Authored by Daode Huang on Jul 01, 2016

Data is assigned in two different ways: one uses two loops, the
other uses one. This patch normalizes both methods to a single loop.
Signed-off-by: Daode Huang <huangdaode@hisilicon.com>
Signed-off-by: Yisen Zhuang <Yisen.Zhuang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns: add a space before "*/" · 6ba312eb

Authored by Daode Huang on Jul 01, 2016

Some comment lines are missing a space before the closing */, so
this patch adds a space before */.
Signed-off-by: Daode Huang <huangdaode@hisilicon.com>
Signed-off-by: Yisen Zhuang <Yisen.Zhuang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns: delete redundant parenthese · 68fa1636

Authored by Daode Huang on Jul 01, 2016

Following the previous review comments from Andy, this patch
deletes the redundant parentheses.
Signed-off-by: Daode Huang <huangdaode@hisilicon.com>
Signed-off-by: Yisen Zhuang <Yisen.Zhuang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns: change code style from a = a + x to a += x · 8ec98ba7

Authored by Daode Huang on Jul 01, 2016

This patch fixes the code style in the hns driver, changing
"buff = buff + xxx" to "buff += xxx". The review comment is
from Andy.
Reviewed-by: Andriy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Daode Huang <huangdaode@hisilicon.com>
Signed-off-by: Yisen Zhuang <Yisen.Zhuang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns: fix code style about hns driver · d9fdb4ed

Authored by Daode Huang on Jul 01, 2016

This patch fixes the code style of the hns driver to make it
simpler.
Signed-off-by: Daode Huang <huangdaode@hisilicon.com>
Signed-off-by: Yisen Zhuang <Yisen.Zhuang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

MAINTAINERS: add maintainers for hns driver · b30d74e4

Authored by Daode Huang on Jul 01, 2016

This patch adds maintainers for the HiSilicon network subsystem driver.
Signed-off-by: Daode Huang <huangdaode@hisilicon.com>
Signed-off-by: Yisen Zhuang <Yisen.Zhuang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'rds-multipath-datastructures' · 1364db42

Authored by David S. Miller on Jul 01, 2016

Sowmini Varadhan says:

====================
RDS:TCP data structure changes for multipath support

The second installment of changes to enable multipath support in
RDS-TCP. This series implements the changes in rds-tcp so that the
rds_conn_path has a pointer to the rds_tcp_connection in cp_transport_data.
Struct rds_tcp_connection keeps track of the inet_sk per path in
t_sock. The ->sk_user_data in turn is a pointer to the rds_conn_path.
With this set of changes, rds_tcp has the needed plumbing to handle
multiple paths (sockets) per rds_connection.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

RDS: Do not send a pong to an incoming ping with 0 src port · 11bb62f7

Authored by Sowmini Varadhan on Jun 30, 2016

RDS ping messages are sent with a non-zero src port to a zero
dst port, so that the rds pong messages can be sent back to the
originator's src port. However, if a confused/malicious sender
sends a ping with a 0 src port, we'd have an infinite ping-pong
loop. To avoid this, the receiver should ignore ping messages
with a 0 src port.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
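The receive-side rule described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: struct rds_hdr_sketch and should_send_pong() are hypothetical names, not the kernel's RDS header or functions, and ports are shown in host byte order for simplicity.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified header: only the two ports matter here. */
struct rds_hdr_sketch {
	uint16_t sport;	/* source port of the incoming message */
	uint16_t dport;	/* destination port; 0 marks a ping */
};

/*
 * Decide whether an incoming message deserves a pong.
 * A ping is addressed to dst port 0. The pong goes back to the ping's
 * src port, so a ping with src port 0 would produce a pong that itself
 * looks like a ping, and the two peers would bounce it forever.
 */
bool should_send_pong(const struct rds_hdr_sketch *h)
{
	if (h->dport != 0)
		return false;	/* ordinary data, not a ping */
	return h->sport != 0;	/* ignore pings with a 0 src port */
}
```

A well-formed ping (non-zero src port, zero dst port) is answered; a ping whose src port is also 0 is silently ignored.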

RDS: TCP: Simplify reconnect to avoid duelling reconnnect attempts · 8315011a

Authored by Sowmini Varadhan on Jun 30, 2016

When reconnecting, the peer with the smaller IP address will initiate
the reconnect, to avoid needless duelling SYN issues.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
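The tie-break rule above can be illustrated with a small C sketch. It assumes that "smaller" means numerically smaller once the IPv4 addresses are converted to host byte order; should_initiate_reconnect() is a hypothetical helper, not a function from the RDS code.

```c
#include <arpa/inet.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Both arguments are IPv4 addresses in network byte order, as they
 * would appear in a socket address. Only the side whose local address
 * compares lower starts the reconnect; the other side waits for the
 * incoming connection, so the two peers never send SYNs at each other
 * at the same time.
 */
bool should_initiate_reconnect(uint32_t laddr_be, uint32_t faddr_be)
{
	return ntohl(laddr_be) < ntohl(faddr_be);
}
```

Because the comparison is strict, exactly one of two distinct peers gets `true`, which is what removes the duel.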

RDS: TCP: Hooks to set up a single connection path · b04e8554

Authored by Sowmini Varadhan on Jun 30, 2016

This patch adds ->conn_path_connect callbacks in the rds_transport
that are used to set up a single connection path.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

RDS: TCP: make receive path use the rds_conn_path · 2da43c4a

Authored by Sowmini Varadhan on Jun 30, 2016

The ->sk_user_data contains a pointer to the rds_conn_path
for the socket. Use this consistently in the rds_tcp_data_ready
callbacks to get the rds_conn_path for rds_recv_incoming.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

RDS: TCP: make ->sk_user_data point to a rds_conn_path · ea3b1ea5

Authored by Sowmini Varadhan on Jun 30, 2016

The socket callbacks should all operate on a struct rds_conn_path,
in preparation for an MP-capable RDS-TCP.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

RDS: TCP: Refactor connection destruction to handle multiple paths · afb4164d

Authored by Sowmini Varadhan on Jun 30, 2016

A single rds_connection may have multiple rds_conn_paths that have
to be carefully and correctly destroyed, for both rmmod and
netns-delete cases.

For both cases, we extract a single rds_tcp_connection for
each conn into a temporary list, and then invoke rds_conn_destroy()
which iteratively dismantles every path in the rds_connection.

For the netns deletion case, we additionally have to make sure
that we do not leave a socket in TIME_WAIT state, as this will
hold up the netns deletion. Thus we call rds_tcp_conn_paths_destroy()
to reset state quickly.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

RDS: TCP: Make rds_tcp_connection track the rds_conn_path · 02105b2c

Authored by Sowmini Varadhan on Jun 30, 2016

The struct rds_tcp_connection is the transport-specific private
data structure that tracks TCP information per rds_conn_path.
Modify this structure to have a back-pointer to the rds_conn_path
for which it is the ->cp_transport_data.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

RDS: TCP: Remove dead logic around c_passive in rds-tcp · 26e4e6bb

Authored by Sowmini Varadhan on Jun 30, 2016

The c_passive bit is only intended for the IB transport and will
never be encountered in rds-tcp, so remove the dead logic that
predicates on this bit.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

RDS: Rework path specific indirections · 226f7a7d

Authored by Sowmini Varadhan on Jun 30, 2016

Refactor code to avoid separate indirections for single-path
and multipath transports. All transports (both single and mp-capable)
will get a pointer to the rds_conn_path, and can trivially derive
the rds_connection from the ->cp_conn.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'bpf-cgroup2' · dc9a2002

Authored by David S. Miller on Jul 01, 2016

Martin KaFai Lau says:

====================
cgroup: bpf: cgroup2 membership test on skb

This series is to implement a bpf-way to
check the cgroup2 membership of a skb (sk_buff).

It is similar to the feature added in netfilter:
c38c4597 ("netfilter: implement xt_cgroup cgroup2 path match")

The current target is the tc-like usage.

v3:
- Remove WARN_ON_ONCE(!rcu_read_lock_held())
- Stop BPF_MAP_TYPE_CGROUP_ARRAY usage in patch 2/4
- Avoid mounting bpf fs manually in patch 4/4

- Thanks for Daniel's review and the above suggestions

- Check CONFIG_SOCK_CGROUP_DATA instead of CONFIG_CGROUPS.  Thanks to
  the kbuild bot's report.
  Patch 2/4 only needs CONFIG_CGROUPS while patch 3/4 needs
  CONFIG_SOCK_CGROUP_DATA.  Since a single bpf cgrp2 array alone is
  not useful for now, CONFIG_SOCK_CGROUP_DATA is also used in
  patch 2/4.  We can fine tune it later if we find other use cases
  for the cgrp2 array.
- Return EAGAIN instead of ENOENT if the cgrp2 array entry is
  NULL.  It is to distinguish these two cases: 1) the userland has
  not populated this array entry yet, or 2) not finding cgrp2 from the skb.

- Belated thanks to Alexei and Tejun for reviewing v1 and giving advice on
  this work.

v2:
- Fix two return cases in cgroup_get_from_fd()
- Fix compilation errors when CONFIG_CGROUPS is not used:
  - arraymap.c: avoid registering BPF_MAP_TYPE_CGROUP_ARRAY
  - filter.c: tc_cls_act_func_proto() returns NULL on BPF_FUNC_skb_in_cgroup
- Add comments to BPF_FUNC_skb_in_cgroup and cgroup_get_from_fd()
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

cgroup: bpf: Add an example to do cgroup checking in BPF · a3f74617

Authored by Martin KaFai Lau on Jun 30, 2016

test_cgrp2_array_pin.c:
A userland program that creates a bpf_map (BPF_MAP_TYPE_CGROUP_ARRAY),
populates/updates it with a cgroup2-backed fd and pins it to a
bpf-fs file.  The pinned file can be loaded by tc and then used
by the bpf prog later.  This program can also update an existing pinned
array, which could be useful for debugging/testing purposes.

test_cgrp2_tc_kern.c:
A bpf prog which should be loaded by tc.  It is to demonstrate
the usage of bpf_skb_in_cgroup.

test_cgrp2_tc.sh:
A script that glues the test_cgrp2_array_pin.c and
test_cgrp2_tc_kern.c together.  The idea is like:
1. Load the test_cgrp2_tc_kern.o by tc
2. Use test_cgrp2_array_pin.c to populate a BPF_MAP_TYPE_CGROUP_ARRAY
   with a cgroup fd
3. Do a 'ping -6 ff02::1%ve' to ensure the packet has been
   dropped because of a match on the cgroup

Most of the lines in test_cgrp2_tc.sh are boilerplate
to set up the cgroup/bpf-fs/net-devices/netns, etc.  It is
not bulletproof on errors but should work well enough and
give enough debug info if things did not go well.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

cgroup: bpf: Add bpf_skb_in_cgroup_proto · 4a482f34

Authored by Martin KaFai Lau on Jun 30, 2016

Adds a bpf helper, bpf_skb_in_cgroup, to decide if a skb->sk
belongs to a descendant of a cgroup2.  It is similar to the
feature added in netfilter:
commit c38c4597 ("netfilter: implement xt_cgroup cgroup2 path match")

The user is expected to populate a BPF_MAP_TYPE_CGROUP_ARRAY
which will be used by the bpf_skb_in_cgroup.

Modifications to the bpf verifier are to ensure BPF_MAP_TYPE_CGROUP_ARRAY
and bpf_skb_in_cgroup() are always used together.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

cgroup: bpf: Add BPF_MAP_TYPE_CGROUP_ARRAY · 4ed8ec52

Authored by Martin KaFai Lau on Jun 30, 2016

Add a BPF_MAP_TYPE_CGROUP_ARRAY and its bpf_map_ops implementations.
To update an element, the caller is expected to obtain a cgroup2-backed
fd by open(cgroup2_dir) and then update the array with that fd.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

cgroup: Add cgroup_get_from_fd · 1f3fe7eb

Authored by Martin KaFai Lau on Jun 30, 2016

Add a helper function to get a cgroup2 from a fd.  It will be
stored in a bpf array (BPF_MAP_TYPE_CGROUP_ARRAY) which will
be introduced in a later patch.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'bpf-robustify' · 6bd3847b

Authored by David S. Miller on Jul 01, 2016

Daniel Borkmann says:

====================
Further robustify putting BPF progs

This series addresses a potential issue reported to us by Jann Horn
with regards to putting progs. First patch moves progs generally under
RCU destruction and second patch refactors getting of progs to simplify
code a bit. For details, please see individual patches. Note, we think
that addressing this one in net-next should be sufficient.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

bpf: refactor bpf_prog_get and type check into helper · 113214be

Authored by Daniel Borkmann on Jun 30, 2016

Since bpf_prog_get() and the program type check are used in a couple of
places, refactor this into a small helper function that we can make use of.
Since the non RO prog->aux part is not used in performance critical paths
and a program destruction via RCU is very unlikely when doing the put, we
shouldn't have an issue just doing the bpf_prog_get() + prog->type != type
check, but actually not taking the ref at all (due to being in fdget() /
fdput() section of the bpf fd) is even cleaner and makes the diff smaller
as well, so just go for that. Callsites are changed to make use of the new
helper where possible.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

bpf: generally move prog destruction to RCU deferral · 1aacde3d

Authored by Daniel Borkmann on Jun 30, 2016

Jann Horn reported following analysis that could potentially result
in a very hard to trigger (if not impossible) UAF race, to quote his
event timeline:

 - Set up a process with threads T1, T2 and T3
 - Let T1 set up a socket filter F1 that invokes another filter F2
   through a BPF map [tail call]
 - Let T1 trigger the socket filter via a unix domain socket write,
   don't wait for completion
 - Let T2 call PERF_EVENT_IOC_SET_BPF with F2, don't wait for completion
 - Now T2 should be behind bpf_prog_get(), but before bpf_prog_put()
 - Let T3 close the file descriptor for F2, dropping the reference
   count of F2 to 2
 - At this point, T1 should have looked up F2 from the map, but not
   finished executing it
 - Let T3 remove F2 from the BPF map, dropping the reference count of
   F2 to 1
 - Now T2 should call bpf_prog_put() (wrong BPF program type), dropping
   the reference count of F2 to 0 and scheduling bpf_prog_free_deferred()
   via schedule_work()
 - At this point, the BPF program could be freed
 - BPF execution is still running in a freed BPF program

While at PERF_EVENT_IOC_SET_BPF time it's only guaranteed that the perf
event fd we're doing the syscall on doesn't disappear from underneath us
for whole syscall time, it may not be the case for the bpf fd used as
an argument only after we did the put. It needs to be a valid fd pointing
to a BPF program at the time of the call to make the bpf_prog_get() and
while T2 gets preempted, F2 must have dropped reference to 1 on the other
CPU. The fput() from the close() in T3 should also add additional delay
to the reference drop via exit_task_work() when bpf_prog_release() gets
called as well as scheduling bpf_prog_free_deferred().

That said, it makes nevertheless sense to move the BPF prog destruction
generally after RCU grace period to guarantee that such scenario above,
but also others as recently fixed in ceb56070 ("bpf, perf: delay release
of BPF prog after grace period") with regards to tail calls won't happen.
Integrating bpf_prog_free_deferred() directly into the RCU callback is
not allowed since the invocation might happen from either softirq or
process context, so we're not permitted to block. Reviewing all bpf_prog_put()
invocations from eBPF side (note, cBPF -> eBPF progs don't use this for
their destruction) with call_rcu() look good to me.

Since we don't know whether at the time of attaching the program, we're
already part of a tail call map, we need to use RCU variant. However, due
to this, there won't be severely more stress on the RCU callback queue:
situations with above bpf_prog_get() and bpf_prog_put() combo in practice
normally won't lead to releases, but even if they would, enough effort/
cycles have to be put into loading a BPF program into the kernel already.
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

01 Jul, 2016 14 commits

atm: horizon: Use setup_timer · 466fc793

Authored by Amitoj Kaur Chawla on Jun 30, 2016

Convert a call to init_timer and accompanying initializations of
the timer's data and function fields to a call to setup_timer.

The Coccinelle semantic patch that fixes this problem is
as follows:
@@
expression t,d,f,e1;
identifier x1;
statement S1;
@@

(
-t.data = d;
|
-t.function = f;
|
-init_timer(&t);
+setup_timer(&t,f,d);
|
-init_timer_on_stack(&t);
+setup_timer_on_stack(&t,f,d);
)
<... when != S1
t.x1 = e1;
...>
Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'qed-next' · e3cc6e37

Authored by David S. Miller on Jul 01, 2016

Manish Chopra says:

====================
qede: Enhancements

This patch series adds support for a few small fastpath
features and some code refactoring.

Note - regarding get/set tunable configuration via ethtool
Surprisingly, there is NO ethtool application support for
such configuration given that we have kernel support.
Do let us know if we need to add support for that in user ethtool.

Please consider applying this series to "net-next".
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

qede: Bump up driver version to 8.10.1.20 · 831a8e6c

Authored by Manish Chopra on Jun 30, 2016

Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qede: Add get/set rx copy break tunable support · 3d789994

Authored by Manish Chopra on Jun 30, 2016

Signed-off-by: Manish <manish.chopra@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qede: Utilize xmit_more · 312e0676

Authored by Manish Chopra on Jun 30, 2016

This patch uses the xmit_more optimization to reduce the
number of TX doorbell writes per packet.
Signed-off-by: Manish <manish.chopra@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
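The idea behind xmit_more can be shown with a toy model in C. This is not qede code: struct txq_sketch and xmit_one() are hypothetical, and the doorbell is represented by a counter. A real driver would also ring the doorbell when it has to stop the queue, which the sketch leaves out.

```c
#include <stdbool.h>

/* Hypothetical TX queue model: counts descriptors and doorbell writes. */
struct txq_sketch {
	unsigned int pending;	/* descriptors posted but not yet announced */
	unsigned int doorbells;	/* doorbell (MMIO) writes issued so far */
};

/*
 * Post one packet. 'more' plays the role of the stack's xmit_more hint:
 * when it is true another packet follows immediately, so the expensive
 * doorbell write is postponed and a later one covers the whole batch.
 */
void xmit_one(struct txq_sketch *q, bool more)
{
	q->pending++;
	if (!more) {
		q->doorbells++;	/* one write announces all pending descriptors */
		q->pending = 0;
	}
}
```

In this model, sending four packets with the hint set on the first three costs one doorbell write instead of four.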

qede: qede_poll refactoring · c774169d

Authored by Manish Chopra on Jun 30, 2016

This patch cleans up the qede_poll() routine a bit
and makes qede_poll() do a single iteration to handle
TX completion [under heavy TX load, qede_poll() might
otherwise run indefinitely in the while(1) loop for TX
completion processing and leave the CPU stuck].
Signed-off-by: Manish <manish.chopra@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qede: Add support for handling IP fragmented packets. · c72a6125

Authored by Manish Chopra on Jun 30, 2016

When handling IP fragmented packets with csum in their
transport header, the csum isn't changed as part of the
fragmentation. As a result, the packet containing the
transport headers would have the correct csum of the original
packet, but one that mismatches the actual packet that
passes on the wire. As a result, on receive path HW would
give an indication that the packet has incorrect csum,
which would cause qede to discard the incoming packet.

Since HW also delivers a notification of IP fragments,
change driver behavior to pass such incoming packets
to stack and let it make the decision whether it needs
to be dropped.
Signed-off-by: Manish <manish.chopra@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
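The receive decision described above can be written as a small C function. This is a sketch only: enum rx_action and classify_rx() are hypothetical names, and the two boolean inputs stand in for whatever hardware flags the driver actually reads.

```c
#include <stdbool.h>

/* Hypothetical outcomes for a received packet. */
enum rx_action {
	RX_DROP,		/* discard in the driver */
	RX_PASS_UNVERIFIED,	/* hand to the stack, checksum not validated */
	RX_PASS_VERIFIED	/* hand to the stack, checksum validated by HW */
};

/*
 * csum_err: HW reported a bad transport checksum.
 * ip_frag:  HW reported that the packet is an IP fragment.
 * For a fragment the transport checksum covers the whole original
 * datagram, so a per-packet mismatch is expected and must not cause a
 * drop; the stack re-checks after reassembly.
 */
enum rx_action classify_rx(bool csum_err, bool ip_frag)
{
	if (!csum_err)
		return RX_PASS_VERIFIED;
	return ip_frag ? RX_PASS_UNVERIFIED : RX_DROP;
}
```

Only non-fragmented packets with a reported checksum error are dropped in this model.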

Merge branch 'tun-skb_array' · beb528d0

Authored by David S. Miller on Jul 01, 2016

Jason Wang says:

====================
switch to use tx skb array in tun

This series tries to switch to use skb array in tun. This is used to
eliminate the spinlock contention between producer and consumer. The
conversion was straightforward: just introduce a tx skb array and use
it instead of sk_receive_queue.

A minor issue is to keep the tx_queue_len behaviour, since tun used to
use it for the length of sk_receive_queue. This is done through:

- add the ability to resize multiple rings at once to avoid handling
  partial resize failure for multiple rings.
- add the support for zero length ring.
- introduce a notifier which is triggered when tx_queue_len is
  changed for a netdev.
- resize all queues when tx_queue_len changes.

Tests show about 15% improvement on guest rx pps:

Before: ~1300000pps
After : ~1500000pps

Changes from V3:
- fix kbuild warnings
- call NETDEV_CHANGE_TX_QUEUE_LEN on IFLA_TXQLEN

Changes from V2:
- add multiple rings resizing support for ptr_ring/skb_array
- add zero length ring support
- introduce a NETDEV_CHANGE_TX_QUEUE_LEN
- drop new flags

Changes from V1:
- switch to use skb array instead of a customized circular buffer
- add non-blocking support
- rename .peek to .peek_len
- drop lockless peeking since test show very minor improvement
====================
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-from-altitude: 34697 feet.
Signed-off-by: David S. Miller <davem@davemloft.net>

tun: switch to use skb array for tx · 1576d986

Authored by Jason Wang on Jun 30, 2016

We used to queue tx packets in sk_receive_queue, which is less
efficient since it requires spinlocks to synchronize between producer
and consumer.

This patch tries to address this by:

- switch from sk_receive_queue to a skb_array, and resize it when
  tx_queue_len was changed.
- introduce a new proto_ops peek_len which was used for peeking the
  skb length.
- implement a tun version of peek_len for vhost_net to use and convert
  vhost_net to use peek_len if possible.

Pktgen test shows about 15.3% improvement on guest receiving pps for small
buffers:

Before: ~1300000pps
After : ~1500000pps
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: introduce NETDEV_CHANGE_TX_QUEUE_LEN · 08294a26

Authored by Jason Wang on Jun 30, 2016

This patch introduces a new event, NETDEV_CHANGE_TX_QUEUE_LEN, which
is triggered when tx_queue_len changes. It can be used by net devices
that want to do some processing at that time. An example is tun, which
may want to resize its tx array when tx_queue_len is changed.

Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

skb_array: add wrappers for resizing · bf900b3d

Authored by Jason Wang on Jun 30, 2016

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ptr_ring: support resizing multiple queues · 59e6ae53

Authored by Michael S. Tsirkin on Jun 30, 2016

Sometimes we need to support resizing multiple queues at once. This is
because it is not easy to recover from a partial failure when
resizing multiple queues.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
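One common way to avoid partial failure is to split the resize into an allocation phase that may fail and a swap phase that cannot. The C sketch below shows that pattern in userspace. It is an assumption about the general technique, not the ptr_ring implementation; struct ring_sketch, ring_alloc(), resize_multiple() and the alloc_budget failure-injection knob are all hypothetical.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical ring: just a slot array and its size. */
struct ring_sketch {
	void **queue;
	int size;
};

/* Failure injection for experiments: <0 unlimited, otherwise the number
 * of allocations that may still succeed. */
int alloc_budget = -1;

void **ring_alloc(int size)
{
	if (alloc_budget == 0)
		return NULL;
	if (alloc_budget > 0)
		alloc_budget--;
	return calloc(size > 0 ? size : 1, sizeof(void *));
}

/* Resize all rings or none of them. */
int resize_multiple(struct ring_sketch **rings, int nrings, int new_size)
{
	void ***queues = calloc(nrings > 0 ? nrings : 1, sizeof(*queues));
	int i;

	if (!queues)
		return -ENOMEM;

	/* Phase 1: allocate every new array; no ring is touched yet. */
	for (i = 0; i < nrings; i++) {
		queues[i] = ring_alloc(new_size);
		if (!queues[i])
			goto fail;
	}

	/* Phase 2: swap; nothing in this loop can fail. Entries that do
	 * not fit in the new array are simply dropped in this sketch. */
	for (i = 0; i < nrings; i++) {
		int n = rings[i]->size < new_size ? rings[i]->size : new_size;

		if (n > 0)
			memcpy(queues[i], rings[i]->queue, n * sizeof(void *));
		free(rings[i]->queue);
		rings[i]->queue = queues[i];
		rings[i]->size = new_size;
	}
	free(queues);
	return 0;

fail:
	while (--i >= 0)
		free(queues[i]);
	free(queues);
	return -ENOMEM;
}
```

If any allocation fails, every ring keeps its old array and size, so the caller has nothing to roll back.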

skb_array: minor tweak · fd68adec

Authored by Jason Wang on Jun 30, 2016

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ptr_ring: support zero length ring · 982fb490

Authored by Jason Wang on Jun 30, 2016

Sometimes we need a zero-length ring, but the current code will crash
since we don't do any check before accessing the ring. This patch fixes this.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
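The kind of check this calls for can be sketched with a minimal pointer ring in C. The names (struct ring_sketch, ring_produce(), ring_peek()) are hypothetical and the sketch has no locking; it only shows that every access to the slot array is guarded by a size test, so a ring of size 0 behaves as permanently full for producers and permanently empty for consumers.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical single-threaded pointer ring; a NULL slot means "free". */
struct ring_sketch {
	void **queue;	/* may be NULL when size is 0 */
	int size;
	int producer;
	int consumer;
};

/* Returns 0 on success, -ENOSPC when the ring is full or has no slots. */
int ring_produce(struct ring_sketch *r, void *ptr)
{
	/* The size test comes first so queue[] is never read when size is 0. */
	if (r->size == 0 || r->queue[r->producer])
		return -ENOSPC;
	r->queue[r->producer++] = ptr;
	if (r->producer >= r->size)
		r->producer = 0;
	return 0;
}

/* Returns the next entry without removing it, or NULL if there is none. */
void *ring_peek(const struct ring_sketch *r)
{
	if (r->size == 0)
		return NULL;
	return r->queue[r->consumer];
}
```

Without the two size tests, both functions would dereference an empty slot array as soon as they were called on a zero-length ring.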

openanolis / cloud-kernel synced successfully more than 1 year ago
