- 05 Jun 2015, 3 commits
-
-
Committed by Tom Herbert

key_basic is set twice in __skb_flow_dissect, which seems unnecessary. Remove the second one.

Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Tom Herbert

Do break when we see a routing flag or a non-zero version number in the GRE header.

Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Alexei Starovoitov

Fix build error:

    net/core/filter.c: In function 'bpf_clone_redirect':
    net/core/filter.c:1429:18: error: 'struct sk_buff' has no member named 'tc_verd'
      if (G_TC_AT(skb2->tc_verd) & AT_INGRESS)

Fixes: 3896d655 ("bpf: introduce bpf_clone_redirect() helper")
Reported-by: Or Gerlitz <gerlitz.or@gmail.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 04 Jun 2015, 1 commit
-
-
Committed by Alexei Starovoitov

Allow eBPF programs attached to classifiers/actions to call the bpf_clone_redirect(skb, ifindex, flags) helper, which will mirror or redirect the packet by dynamic ifindex selection from within the program to a target device, either at ingress or at egress. It can be used for various scenarios, for example to load-balance skbs into veths or to split parts of the traffic to local taps.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
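A minimal sketch of how such a program could use the helper; the map name, section name, and modern libbpf scaffolding (which postdates this commit) are illustrative assumptions, not part of the patch:

    /* tc classifier that mirrors each packet to an ifindex chosen at
     * runtime from a map; flags 0 = egress, BPF_F_INGRESS = ingress. */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
            __uint(type, BPF_MAP_TYPE_ARRAY);
            __uint(max_entries, 1);
            __type(key, __u32);
            __type(value, __u32);
    } target_if SEC(".maps");

    SEC("classifier")
    int mirror(struct __sk_buff *skb)
    {
            __u32 key = 0;
            __u32 *ifindex = bpf_map_lookup_elem(&target_if, &key);

            if (ifindex && *ifindex)
                    bpf_clone_redirect(skb, *ifindex, 0);
            return 0;       /* TC_ACT_OK: keep processing the original skb */
    }

    char _license[] SEC("license") = "GPL";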
-
- 02 Jun 2015, 2 commits
-
-
Committed by David S. Miller

When we scan a packet for GRO processing, we want to see the most common packet types at the front of the offload_base list. So add a priority field so we can handle this properly. IPv4/IPv6 get the highest priority with the implicit zero priority field. Next comes Ethernet with a priority of 10, and then we have the MPLS types with a priority of 15.

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Suggested-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
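A hedged sketch of priority-ordered insertion into the offload list; the lock and list names follow common net/core/dev.c conventions but are assumptions here:

    static DEFINE_SPINLOCK(offload_lock);
    static LIST_HEAD(offload_base);

    void dev_add_offload(struct packet_offload *po)
    {
            struct packet_offload *elem;

            spin_lock(&offload_lock);
            /* walk until the first entry with a larger priority value,
             * so lower values (IPv4/IPv6 at 0) stay at the front */
            list_for_each_entry(elem, &offload_base, list) {
                    if (po->priority < elem->priority)
                            break;
            }
            list_add_rcu(&po->list, elem->list.prev);
            spin_unlock(&offload_lock);
    }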
-
Committed by David S. Miller

This reverts commit f96dee13. It isn't right: ethtool is meant to manage one PHY instance per netdevice at a time, and this is selected by the SET command. Therefore, by definition, the GET command must only return the settings for the configured and selected PHY.

Reported-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 01 Jun 2015, 1 commit
-
-
Committed by Daniel Borkmann

As this is already exported from the tracing side via commit d9847d31 ("tracing: Allow BPF programs to call bpf_ktime_get_ns()"), we might as well move it to the core, so networking users can make use of it too, e.g. to measure time diffs for certain flows from ingress/egress.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
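A sketch of the kind of measurement the commit message hints at; the per-CPU map and program scaffolding are illustrative assumptions:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
            __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
            __uint(max_entries, 1);
            __type(key, __u32);
            __type(value, __u64);
    } last_seen SEC(".maps");

    SEC("classifier")
    int measure(struct __sk_buff *skb)
    {
            __u32 key = 0;
            __u64 now = bpf_ktime_get_ns();
            __u64 *prev = bpf_map_lookup_elem(&last_seen, &key);

            if (prev && *prev)
                    bpf_printk("inter-packet gap: %llu ns", now - *prev);
            bpf_map_update_elem(&last_seen, &key, &now, BPF_ANY);
            return 0;
    }

    char _license[] SEC("license") = "GPL";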
-
- 31 May 2015, 2 commits
-
-
Committed by Wang Long

Remove the automatic variable 'err' in register_netevent_notifier() and return the result of atomic_notifier_chain_register() directly.

Signed-off-by: Wang Long <long.wanglong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
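The resulting function plausibly reduces to a one-liner like this (a sketch inferred from the message, not the verbatim diff):

    int register_netevent_notifier(struct notifier_block *nb)
    {
            return atomic_notifier_chain_register(&netevent_notif_chain, nb);
    }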
-
Committed by Alexei Starovoitov

Classic BPF already exposes skb->dev->ifindex via the SKF_AD_IFINDEX extension. Allow eBPF programs to access it as well. Note that classic BPF aborts execution of the program if 'skb->dev == NULL' (which is inconvenient for program writers), whereas eBPF returns zero in that case. Also expose the 'skb_iif' field, since programs triggered by redirected packets need to know the original interface index.

Summary:
    __skb->ifindex         -> skb->dev->ifindex
    __skb->ingress_ifindex -> skb->skb_iif

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
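A hedged usage sketch; the program scaffolding is an assumption, the two fields are the ones the commit adds:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    SEC("classifier")
    int ifidx_check(struct __sk_buff *skb)
    {
            /* ifindex: skb->dev->ifindex (0 when dev is NULL)
             * ingress_ifindex: original RX interface (skb->skb_iif) */
            if (skb->ifindex != skb->ingress_ifindex)
                    bpf_printk("redirected: %u -> %u",
                               skb->ingress_ifindex, skb->ifindex);
            return 0;
    }

    char _license[] SEC("license") = "GPL";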
-
- 27 May 2015, 1 commit
-
-
Committed by Eric Dumazet

By making sure the minimal value of sk->sk_gso_max_segs is one, and the minimal value of sysctl_tcp_min_tso_segs is one as well, tcp_tso_autosize() will return a non-zero value. We can then revert 843925f3 ("tcp: Do not apply TSO segment limit to non-TSO packets") and save a few CPU cycles in the fast path.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
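A hedged sketch of the invariant being relied on; the body is illustrative, not the exact kernel function:

    static u32 tcp_tso_autosize_sketch(const struct sock *sk, unsigned int mss_now)
    {
            u32 bytes = sk->sk_pacing_rate >> 10;
            /* both floors are now guaranteed >= 1 ... */
            u32 segs = max_t(u32, bytes / mss_now, sysctl_tcp_min_tso_segs);

            /* ... so this clamp can never yield 0 */
            return min_t(u32, segs, sk->sk_gso_max_segs);
    }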
-
- 26 May 2015, 3 commits
-
-
Committed by Eric Dumazet

make C=2 CF=-D__CHECK_ENDIAN__ net/core/utils.o

    ...
    net/core/utils.c:307:72: warning: incorrect type in argument 2 (different base types)
    net/core/utils.c:307:72:    expected restricted __wsum [usertype] addend
    net/core/utils.c:307:72:    got restricted __be32 [usertype] from
    net/core/utils.c:308:34: warning: incorrect type in argument 2 (different base types)
    net/core/utils.c:308:34:    expected restricted __wsum [usertype] addend
    net/core/utils.c:308:34:    got restricted __be32 [usertype] to
    net/core/utils.c:310:70: warning: incorrect type in argument 2 (different base types)
    net/core/utils.c:310:70:    expected restricted __wsum [usertype] addend
    net/core/utils.c:310:70:    got restricted __be32 [usertype] from
    net/core/utils.c:310:77: warning: incorrect type in argument 2 (different base types)
    net/core/utils.c:310:77:    expected restricted __wsum [usertype] addend
    net/core/utils.c:310:77:    got restricted __be32 [usertype] to
    net/core/utils.c:312:72: warning: incorrect type in argument 2 (different base types)
    net/core/utils.c:312:72:    expected restricted __wsum [usertype] addend
    net/core/utils.c:312:72:    got restricted __be32 [usertype] from
    net/core/utils.c:313:35: warning: incorrect type in argument 2 (different base types)
    net/core/utils.c:313:35:    expected restricted __wsum [usertype] addend
    net/core/utils.c:313:35:    got restricted __be32 [usertype] to

Note we can use the csum_replace4() helper.

Fixes: 58e3cac5 ("net: optimise inet_proto_csum_replace4()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
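A sketch of the helper's use: csum_replace4() accepts __be32 from/to values directly, which is what silences the sparse warnings above (the wrapper function is illustrative):

    #include <net/checksum.h>

    /* update a 16-bit checksum after a 4-byte big-endian field changes */
    static void fix_csum(__sum16 *sum, __be32 from, __be32 to)
    {
            csum_replace4(sum, from, to);   /* updates *sum in place */
    }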
-
Committed by Eric Dumazet

make C=2 CF=-D__CHECK_ENDIAN__ net/core/secure_seq.o

    net/core/secure_seq.c:157:50: warning: restricted __be32 degrades to integer

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Eric Dumazet

Sparse reports:

    net/core/pktgen.c:2672:43: warning: incorrect type in assignment (different base types)
    net/core/pktgen.c:2672:43:    expected unsigned short [unsigned] [short] [usertype] <noident>
    net/core/pktgen.c:2672:43:    got restricted __be16 [usertype] protocol

Let's use a proper struct ethhdr instead of hard coding everything.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
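A hedged sketch of the struct-based approach; the pkt_dev field layout is an assumption:

    #include <linux/etherdevice.h>
    #include <linux/if_ether.h>

    struct ethhdr *eth = (struct ethhdr *)skb_put(skb, ETH_HLEN);

    ether_addr_copy(eth->h_dest, pkt_dev->hh);          /* assumed layout */
    ether_addr_copy(eth->h_source, pkt_dev->hh + 6);
    eth->h_proto = protocol;    /* both sides __be16, no cast needed */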
-
- 25 May 2015, 3 commits
-
-
Committed by Hannes Frederic Sowa

unix_stream_recvmsg is refactored to unix_stream_read_generic in this patch and enhanced to deal with pipe splicing. The refactoring is negligible; we mostly have to deal with a non-existing struct msghdr argument.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Hannes Frederic Sowa

Prepare skb_splice_bits to be able to deal with AF_UNIX sockets. AF_UNIX sockets don't use lock_sock/release_sock, so we have to use a callback to make the locking and unlocking configurable.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
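A generic sketch of the callback-locking pattern being described; the names and signature are illustrative, not the actual skb_splice_bits() interface:

    struct splice_lock_ops {
            void (*lock)(struct sock *sk);
            void (*unlock)(struct sock *sk);
    };

    static int splice_locked(struct sock *sk, const struct splice_lock_ops *ops)
    {
            int ret;

            ops->lock(sk);          /* AF_UNIX supplies its own mutex here,
                                     * TCP supplies lock_sock/release_sock */
            ret = 0;                /* splice work would happen here */
            ops->unlock(sk);
            return ret;
    }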
-
Committed by Hannes Frederic Sowa

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 23 May 2015, 4 commits
-
-
Committed by Jesper Dangaard Brouer

Giving /proc/net/pktgen/pgctrl an invalid command just returns shell success and prints a warning in dmesg. This is not very useful for shell scripting, as the error can only be detected by parsing dmesg. Instead, return -EINVAL when the command is unknown, as this gives userspace shell scripts a way of detecting it.

Also bump the version tag to 2.75, because (1) reading /proc/net/pktgen/pgctrl outputs this version number, which allows detecting this small semantic change, and (2) the pktgen version tag has not been updated since 2010.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
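A hedged sketch of the dispatch change; the handler names follow pktgen's style but the exact diff is assumed:

    if (!strcmp(data, "stop"))
            pktgen_stop_all_threads_ifs(pn);
    else if (!strcmp(data, "start"))
            pktgen_run_all_threads(pn);
    else if (!strcmp(data, "reset"))
            pktgen_reset_all_threads(pn);
    else
            return -EINVAL;   /* was: pr_warn(...) followed by "success" */

With this, a script can simply check the exit status of 'echo unknown > /proc/net/pktgen/pgctrl' instead of grepping dmesg.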
-
Committed by Jesper Dangaard Brouer

Too many spaces were introduced in commit 63adc6fb ("pktgen: cleanup checkpatch warnings"), misaligning "src_min:" with the other columns.

Fixes: 63adc6fb ("pktgen: cleanup checkpatch warnings")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Arun Parameswaran

When trying to configure the settings for PHY1 with commands like 'ethtool -s eth0 phyad 1 speed 100', ethtool modifies settings other than the speed of PHY1. It queries the settings for PHY0 and uses those as the base for applying the new settings to PHY1, wrongly reconfiguring PHY1's other settings.

The issue is caused by '_ethtool_get_settings()', which is called for the 'ETHTOOL_GSET' command and clears the 'cmd' structure (of type 'struct ethtool_cmd') with memset. This wipes any parameters passed in for the 'ETHTOOL_GSET' command, so the driver's callback is always invoked with 'cmd->phy_address' as '0'.

'_ethtool_get_settings()' is called from other files in 'net/core', so the fix is applied to 'ethtool_get_settings()', which is only called in the context of the ethtool ioctl.

Signed-off-by: Arun Parameswaran <aparames@broadcom.com>
Reviewed-by: Ray Jui <rjui@broadcom.com>
Reviewed-by: Scott Branden <sbranden@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
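A hedged sketch of the fixed ioctl path: copy the user's struct first so fields like phy_address survive, rather than zeroing it (the body is illustrative, not the verbatim patch):

    static int ethtool_get_settings(struct net_device *dev, void __user *useraddr)
    {
            struct ethtool_cmd cmd;
            int err;

            if (copy_from_user(&cmd, useraddr, sizeof(cmd)))
                    return -EFAULT;         /* cmd.phy_address preserved */

            err = _ethtool_get_settings(dev, &cmd);
            if (err < 0)
                    return err;

            if (copy_to_user(useraddr, &cmd, sizeof(cmd)))
                    return -EFAULT;
            return 0;
    }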
-
Committed by Jiri Pirko

This restores the previous behaviour. If the caller does not want ports to be filled, we should not break.

Fixes: 06635a35 ("flow_dissect: use programable dissector in skb_flow_dissect and friends")
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 22 May 2015, 3 commits
-
-
Committed by Alexei Starovoitov

Introduce the bpf_tail_call(ctx, &jmp_table, index) helper function, which can be used from BPF programs like:

    int bpf_prog(struct pt_regs *ctx)
    {
            ...
            bpf_tail_call(ctx, &jmp_table, index);
            ...
    }

which is roughly equivalent to:

    int bpf_prog(struct pt_regs *ctx)
    {
            ...
            if (jmp_table[index])
                    return (*jmp_table[index])(ctx);
            ...
    }

The important detail is that it's not a normal call, but a tail call. The kernel stack is precious, so this helper reuses the current stack frame and jumps into another BPF program without adding an extra call frame. It's trivially done in the interpreter and a bit trickier in JITs. In the case of the x64 JIT, the bigger part of the generated assembler prologue is common for all programs, so it is simply skipped while jumping. Other JITs can do a similar prologue-skipping optimization or do a stack unwind before jumping into the next program.

bpf_tail_call() arguments:

    ctx       - context pointer
    jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
    index     - index in the jump table

Since all BPF programs are identified by file descriptor, user space needs to populate the jmp_table with FDs of other BPF programs. If jmp_table[index] is empty, bpf_tail_call() doesn't jump anywhere and program execution continues as normal.

The new BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can populate this jmp_table array with FDs of other BPF programs. Programs can share the same jmp_table array or use multiple jmp_tables. The chain of tail calls can form unpredictable dynamic loops, therefore tail_call_cnt is used to limit the number of calls and is currently set to 32.

Use cases:
==========
- simplify complex programs by splitting them into a sequence of small programs

- dispatch routine

  For tracing and future seccomp the program may be triggered on all system calls, but processing of syscall arguments will be different. It's more efficient to implement them as:

    int syscall_entry(struct seccomp_data *ctx)
    {
            bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
            ... default: process unknown syscall ...
    }
    int sys_write_event(struct seccomp_data *ctx) {...}
    int sys_read_event(struct seccomp_data *ctx) {...}
    syscall_jmp_table[__NR_write] = sys_write_event;
    syscall_jmp_table[__NR_read] = sys_read_event;

  For networking the program may call into different parsers depending on packet format, like:

    int packet_parser(struct __sk_buff *skb)
    {
            ... parse L2, L3 here ...
            __u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
            bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
            ... default: process unknown protocol ...
    }
    int parse_tcp(struct __sk_buff *skb) {...}
    int parse_udp(struct __sk_buff *skb) {...}
    ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
    ipproto_jmp_table[IPPROTO_UDP] = parse_udp;

- for the TC use case, bpf_tail_call() allows implementing reclassify-like logic

- bpf_map_update_elem/delete calls into a BPF_MAP_TYPE_PROG_ARRAY jump table are atomic, so user space can build chains of BPF programs on the fly

Implementation details:
=======================
- high performance of bpf_tail_call() is the goal. It could have been implemented without JIT changes as a wrapper on top of the BPF_PROG_RUN() macro, but with two downsides:
  . all programs would have to pay a performance penalty for this feature, and the tail call itself would be slower, since a mandatory stack unwind, return, and stack allocate would be done for every tail call.
  . the tail call would be limited to programs running preempt_disabled, since the generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would need to be either a global per_cpu variable accessed by helper and by wrapper, or a global variable protected by locks.

  In this implementation the x64 JIT bypasses the stack unwind and jumps into the callee program after the prologue.

- bpf_prog_array_compatible() ensures that the prog_type of callee and caller are the same and the JITed/non-JITed flag is the same, since calling a JITed program from a non-JITed one is invalid, as their stack frames are different. Similarly, calling a kprobe-type program from a socket-type program is invalid.

- the jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse the 'map' abstraction, its user space API and all of the verifier logic. It lives in the existing arraymap.c file, since several functions are shared with the regular array map.

Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Daniel Borkmann

Reduce ifdef pollution slightly; no functional change. We can simply remove the extra alternative definitions of handle_ing() and nf_ingress().

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Erik Kline

[1] When entering NUD_PROBE state via neigh_update(), perhaps received from userspace, correctly (re)initialize the probes count to zero. This is useful for forcing revalidation of a neighbor (for example if the host is attempting to do DNA [IPv4: RFC 4436, IPv6: RFC 6059]).

[2] Notify listeners when a neighbor goes into NUD_PROBE state. By sending notifications on entry to NUD_PROBE state, listeners get more timely warnings of imminent connectivity issues. The current notifications on entry to NUD_STALE have somewhat limited usefulness: NUD_STALE is a perfectly normal state, as is NUD_DELAY, whereas notifications on entry to NUD_FAILURE come after a neighbor reachability problem has been confirmed (typically after three probes).

Signed-off-by: Erik Kline <ek@google.com>
Acked-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 18 May 2015, 4 commits
-
-
Committed by WANG Cong

The spinlock is used to protect netns_ids, which is per-net, so there is no need to use a global spinlock.

Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
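A hedged sketch of the shape of the change; the field placement and helper are illustrative:

    struct net {
            /* ... other members ... */
            spinlock_t nsid_lock;   /* now per-net; was one global lock */
            struct idr netns_ids;
    };

    static int alloc_netns_id(struct net *net, struct net *peer, int reqid)
    {
            int id;

            spin_lock(&net->nsid_lock);
            id = idr_alloc(&net->netns_ids, peer, reqid, reqid + 1, GFP_ATOMIC);
            spin_unlock(&net->nsid_lock);
            return id;
    }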
-
Committed by Jiri Pirko

Fixes: 06635a35 ("flow_dissect: use programable dissector in skb_flow_dissect and friends")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Eric Dumazet

sk_mem_reclaim_partial()'s goal is to ensure each socket has one SK_MEM_QUANTUM of forward allocation. This is needed both for performance and for better handling of memory-pressure situations in follow-up patches. SK_MEM_QUANTUM is currently a page, but might be reduced to 4096 bytes, as some arches have 64KB pages.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
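A hedged sketch of the intended behavior; the exact reclaim amount in the patch may differ:

    static inline void sk_mem_reclaim_partial_sketch(struct sock *sk)
    {
            /* give back everything above one quantum of forward
             * allocation, keeping one SK_MEM_QUANTUM cached */
            if (sk->sk_forward_alloc > SK_MEM_QUANTUM)
                    __sk_mem_reclaim(sk, sk->sk_forward_alloc - SK_MEM_QUANTUM);
    }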
-
Committed by Nicolas Dichtel

Before the patch, the command 'ip link add bond2 type bond mode 802.3ad' caused the kernel to send an rtnl message for the bond2 interface with an ifindex of 0. 'ip monitor' shows:

    0: bond2: <BROADCAST,MULTICAST,MASTER> mtu 1500 state DOWN group default
        link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
    9: bond2@NONE: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN group default
        link/ether ea:3e:1f:53:92:7b brd ff:ff:ff:ff:ff:ff
    [snip]

The patch fixes the spotted bug by checking in the bond driver whether the interface is registered before calling the notifier chain. It also adds a check in rtmsg_ifinfo() to prevent this kind of bug in the future.

Fixes: d4261e56 ("bonding: create netlink event when bonding option is changed")
CC: Jiri Pirko <jiri@resnulli.us>
Reported-by: Julien Meunier <julien.meunier@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
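A hedged sketch of the defensive check described for rtmsg_ifinfo(); the surrounding body is assumed:

    void rtmsg_ifinfo(int type, struct net_device *dev, unsigned int change,
                      gfp_t flags)
    {
            /* don't announce devices that aren't (yet) registered */
            if (dev->reg_state != NETREG_REGISTERED)
                    return;
            /* build and broadcast the RTM_NEWLINK message as before */
    }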
-
- 15 May 2015, 2 commits
-
-
Committed by Florian Westphal

Commit d2788d34 ("net: sched: further simplify handle_ing") removed the call to qdisc_enqueue_root(). However, after this removal we no longer set the qdisc pkt length. This breaks traffic policing on ingress.

This is the minimum fix: set the qdisc pkt length before tc_classify. Only setting the length does remove support for 'stab' on ingress, but as Alexei pointed out: "Though it was allowed to add qdisc_size_table to ingress, it's useless. Nothing takes advantage of recomputed qdisc_pkt_len".

Jamal suggested to use qdisc_pkt_len_init(), but as Eric mentioned, that would result in qdisc_pkt_len_init no longer getting inlined due to the additional second call site. Ingress policing is rare, and GRO doesn't really work that well with police on ingress, as we see packets > MTU and drop skbs that -- without aggregation -- would still have fit the policer budget. Thus, to have reliable/smooth ingress policing, GRO has to be turned off.

Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Fixes: d2788d34 ("net: sched: further simplify handle_ing")
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
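The "minimum fix" plausibly amounts to one line in the ingress path (a sketch; the enclosing function is assumed to be handle_ing()):

    /* policing needs a valid qdisc packet length */
    qdisc_skb_cb(skb)->pkt_len = skb->len;
    /* ... then proceed to tc_classify() as before */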
-
Committed by Nicolas Dichtel

An unlock was missing on the error path.

Fixes: 95f38411 ("netns: use a spin_lock to protect nsid management")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
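A generic sketch of the bug class fixed here; the names are illustrative, not the patch's code:

    spin_lock(&net->nsid_lock);
    id = assign_nsid(net, peer, reqid);         /* illustrative helper */
    if (id < 0) {
            spin_unlock(&net->nsid_lock);       /* the missing unlock */
            return id;
    }
    spin_unlock(&net->nsid_lock);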
-
- 14 May 2015, 11 commits
-
-
Committed by Pablo Neira

This patch adds the Netfilter ingress hook just after the existing tc ingress hook; that seems to be the consensus solution for this. Note that the Netfilter hook resides under the global static key that enables ingress filtering. Nonetheless, Netfilter still also has its own static key for minimal impact on the existing handle_ing().

* Without this patch:

    Result: OK: 6216490(c6216338+d152) usec, 100000000 (60byte,0frags)
      16086246pps 7721Mb/sec (7721398080bps) errors: 100000000

    42.46%  kpktgend_0  [kernel.kallsyms]  [k] __netif_receive_skb_core
    25.92%  kpktgend_0  [kernel.kallsyms]  [k] kfree_skb
     7.81%  kpktgend_0  [pktgen]           [k] pktgen_thread_worker
     5.62%  kpktgend_0  [kernel.kallsyms]  [k] ip_rcv
     2.70%  kpktgend_0  [kernel.kallsyms]  [k] netif_receive_skb_internal
     2.34%  kpktgend_0  [kernel.kallsyms]  [k] netif_receive_skb_sk
     1.44%  kpktgend_0  [kernel.kallsyms]  [k] __build_skb

* With this patch:

    Result: OK: 6214833(c6214731+d101) usec, 100000000 (60byte,0frags)
      16090536pps 7723Mb/sec (7723457280bps) errors: 100000000

    41.23%  kpktgend_0  [kernel.kallsyms]  [k] __netif_receive_skb_core
    26.57%  kpktgend_0  [kernel.kallsyms]  [k] kfree_skb
     7.72%  kpktgend_0  [pktgen]           [k] pktgen_thread_worker
     5.55%  kpktgend_0  [kernel.kallsyms]  [k] ip_rcv
     2.78%  kpktgend_0  [kernel.kallsyms]  [k] netif_receive_skb_internal
     2.06%  kpktgend_0  [kernel.kallsyms]  [k] netif_receive_skb_sk
     1.43%  kpktgend_0  [kernel.kallsyms]  [k] __build_skb

* Without this patch + tc ingress:

    tc filter add dev eth4 parent ffff: protocol ip prio 1 \
        u32 match ip dst 4.3.2.1/32

    Result: OK: 9269001(c9268821+d179) usec, 100000000 (60byte,0frags)
      10788648pps 5178Mb/sec (5178551040bps) errors: 100000000

    40.99%  kpktgend_0  [kernel.kallsyms]  [k] __netif_receive_skb_core
    17.50%  kpktgend_0  [kernel.kallsyms]  [k] kfree_skb
    11.77%  kpktgend_0  [cls_u32]          [k] u32_classify
     5.62%  kpktgend_0  [kernel.kallsyms]  [k] tc_classify_compat
     5.18%  kpktgend_0  [pktgen]           [k] pktgen_thread_worker
     3.23%  kpktgend_0  [kernel.kallsyms]  [k] tc_classify
     2.97%  kpktgend_0  [kernel.kallsyms]  [k] ip_rcv
     1.83%  kpktgend_0  [kernel.kallsyms]  [k] netif_receive_skb_internal
     1.50%  kpktgend_0  [kernel.kallsyms]  [k] netif_receive_skb_sk
     0.99%  kpktgend_0  [kernel.kallsyms]  [k] __build_skb

* With this patch + tc ingress:

    tc filter add dev eth4 parent ffff: protocol ip prio 1 \
        u32 match ip dst 4.3.2.1/32

    Result: OK: 9308218(c9308091+d126) usec, 100000000 (60byte,0frags)
      10743194pps 5156Mb/sec (5156733120bps) errors: 100000000

    42.01%  kpktgend_0  [kernel.kallsyms]  [k] __netif_receive_skb_core
    17.78%  kpktgend_0  [kernel.kallsyms]  [k] kfree_skb
    11.70%  kpktgend_0  [cls_u32]          [k] u32_classify
     5.46%  kpktgend_0  [kernel.kallsyms]  [k] tc_classify_compat
     5.16%  kpktgend_0  [pktgen]           [k] pktgen_thread_worker
     2.98%  kpktgend_0  [kernel.kallsyms]  [k] ip_rcv
     2.84%  kpktgend_0  [kernel.kallsyms]  [k] tc_classify
     1.96%  kpktgend_0  [kernel.kallsyms]  [k] netif_receive_skb_internal
     1.57%  kpktgend_0  [kernel.kallsyms]  [k] netif_receive_skb_sk

Note that the results are very similar before and after. I can see gcc gets the code under the ingress static key out of the hot path. Then, on that cold branch, it generates the code to accommodate the netfilter ingress static key. My explanation for this is that this reduces the pressure on the instruction cache for non-users, as the new code is out of the hot path, and it comes with minimal impact for tc ingress users.

Using gcc version 4.8.4 on:

    Architecture:   x86_64
    CPU op-mode(s): 32-bit, 64-bit
    Byte Order:     Little Endian
    CPU(s):         8
    [...]
    L1d cache:      16K
    L1i cache:      64K
    L2 cache:       2048K
    L3 cache:       8192K

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
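A hedged sketch of the nesting described; the helper names follow the message, but the bodies and exact call shape are assumptions:

    static int do_ingress_filters(struct sk_buff *skb, struct packet_type **pt,
                                  int *ret, struct net_device *orig_dev)
    {
            /* both hooks live under the one ingress_needed static key */
            if (!static_key_false(&ingress_needed))
                    return 0;

            skb = handle_ing(skb, pt, ret, orig_dev);   /* tc ingress */
            if (!skb)
                    return -1;

            return nf_ingress(skb, pt, ret, orig_dev);  /* Netfilter hook;
                                                         * own static key inside */
    }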
-
Committed by Pablo Neira

This new config switch enables the ingress filtering infrastructure that is controlled through the ingress_needed static key. This prepares the introduction of the Netfilter ingress hook that resides under this unique static key. Note that CONFIG_SCH_INGRESS automatically selects this; that should be no problem, since it also depends on CONFIG_NET_CLS_ACT.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Alexander Duyck

When I inlined __alloc_rx_skb into __netdev_alloc_skb and __napi_alloc_skb, I overlooked the fact that there was a return in __alloc_rx_skb. As a result, we weren't reserving headroom or setting skb->dev in certain cases. This change corrects that by adding a couple of jump labels to jump to, depending on __alloc_skb succeeding or failing.

Fixes: 9451980a ("net: Use cached copy of pfmemalloc to avoid accessing page")
Reported-by: Felipe Balbi <balbi@ti.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Tested-by: Kevin Hilman <khilman@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
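A simplified sketch of the control flow after the fix; the fragment-allocator fast path is omitted, and the function name marks it as illustrative:

    struct sk_buff *netdev_alloc_skb_sketch(struct net_device *dev,
                                            unsigned int len, gfp_t gfp_mask)
    {
            struct sk_buff *skb;

            skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
            if (unlikely(!skb))
                    goto skb_fail;
            goto skb_success;

    skb_success:
            skb_reserve(skb, NET_SKB_PAD);  /* previously skipped on the
                                             * __alloc_skb fallback path */
            skb->dev = dev;                 /* likewise */
    skb_fail:
            return skb;
    }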
-
Committed by Jiri Pirko

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Jiri Pirko

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Jiri Pirko

So far, only hashes made out of IPv6 addresses could be dissected. This patch introduces support for dissection of full IPv6 addresses.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Jiri Pirko

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Jiri Pirko

Introduce the dissector infrastructure, which allows the user to specify which parts of the skb they want to dissect.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
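A hedged sketch of how a caller might declare the keys it cares about; the struct, key, and function names follow this series in spirit, but the exact identifiers and signatures are assumptions:

    struct my_flow {
            struct flow_dissector_key_basic basic;
            struct flow_dissector_key_ports ports;
    };

    static const struct flow_dissector_key my_keys[] = {
            { FLOW_DISSECTOR_KEY_BASIC, offsetof(struct my_flow, basic) },
            { FLOW_DISSECTOR_KEY_PORTS, offsetof(struct my_flow, ports) },
    };

    /* init once, then dissect only the requested parts of each skb */
    skb_flow_dissector_init(&my_dissector, my_keys, ARRAY_SIZE(my_keys));
    skb_flow_dissect(skb, &my_dissector, &flow);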
-
Committed by Jiri Pirko

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Jiri Pirko

Move it next to its user. It has no relation to flow_dissector, so it makes no sense to have it in flow_dissector.c.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Jiri Pirko

The __skb_tx_hash function has no relation to flow_dissect, so just move it to dev.c.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
-