Commits · cdb3f4a31b64c3a1c6eef40bc01ebc9594c58a8c · openeuler / raspberrypi-kernel

08 Jan, 2014 1 commit

net: Do not enable tx-nocache-copy by default · cdb3f4a3

Authored by Benjamin Poirier on Jan 07, 2014

There are many cases where this feature does not improve performance or even
reduces it.

For example, here are the results from tests that I've run using 3.12.6 on one
Intel Xeon W3565 and one i7 920 connected by ixgbe adapters. The results are
from the Xeon, but they're similar on the i7. All numbers report the
mean±stddev over 10 runs of 10s.

1) latency tests similar to what is described in "c6e1a0d1 net: Allow no-cache
copy from user on transmit"
There is no statistically significant difference between tx-nocache-copy
on/off.
nic irqs spread out (one queue per cpu)

200x netperf -r 1400,1
tx-nocache-copy off
        692000±1000 tps
        50/90/95/99% latency (us): 275±2/643.8±0.4/799±1/2474.4±0.3
tx-nocache-copy on
        693000±1000 tps
        50/90/95/99% latency (us): 274±1/644.1±0.7/800±2/2474.5±0.7

200x netperf -r 14000,14000
tx-nocache-copy off
        86450±80 tps
        50/90/95/99% latency (us): 334.37±0.02/838±1/2100±20/3990±40
tx-nocache-copy on
        86110±60 tps
        50/90/95/99% latency (us): 334.28±0.01/837±2/2110±20/3990±20

2) single stream throughput tests
tx-nocache-copy leads to higher service demand

                        throughput  cpu0        cpu1        demand
                        (Gb/s)      (Gcycle)    (Gcycle)    (cycle/B)

nic irqs and netperf on cpu0 (1x netperf -T0,0 -t omni -- -d send)

tx-nocache-copy off     9402±5      9.4±0.2                 0.80±0.01
tx-nocache-copy on      9403±3      9.85±0.04               0.838±0.004

nic irqs on cpu0, netperf on cpu1 (1x netperf -T1,1 -t omni -- -d send)

tx-nocache-copy off     9401±5      5.83±0.03   5.0±0.1     0.923±0.007
tx-nocache-copy on      9404±2      5.74±0.03   5.523±0.009 0.958±0.002

As a second example, here are some results from Eric Dumazet with latest
net-next.
tx-nocache-copy also leads to higher service demand

(cpu is Intel(R) Xeon(R) CPU X5660  @ 2.80GHz)

lpq83:~# ./ethtool -K eth0 tx-nocache-copy on
lpq83:~# perf stat ./netperf -H lpq84 -c
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      9407.44   2.50     -1.00    0.522   -1.000

 Performance counter stats for './netperf -H lpq84 -c':

       4282.648396 task-clock                #    0.423 CPUs utilized
             9,348 context-switches          #    0.002 M/sec
                88 CPU-migrations            #    0.021 K/sec
               355 page-faults               #    0.083 K/sec
    11,812,797,651 cycles                    #    2.758 GHz                     [82.79%]
     9,020,522,817 stalled-cycles-frontend   #   76.36% frontend cycles idle    [82.54%]
     4,579,889,681 stalled-cycles-backend    #   38.77% backend  cycles idle    [67.33%]
     6,053,172,792 instructions              #    0.51  insns per cycle
                                             #    1.49  stalled cycles per insn [83.64%]
       597,275,583 branches                  #  139.464 M/sec                   [83.70%]
         8,960,541 branch-misses             #    1.50% of all branches         [83.65%]

      10.128990264 seconds time elapsed

lpq83:~# ./ethtool -K eth0 tx-nocache-copy off
lpq83:~# perf stat ./netperf -H lpq84 -c
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      9412.45   2.15     -1.00    0.449   -1.000

 Performance counter stats for './netperf -H lpq84 -c':

       2847.375441 task-clock                #    0.281 CPUs utilized
            11,632 context-switches          #    0.004 M/sec
                49 CPU-migrations            #    0.017 K/sec
               354 page-faults               #    0.124 K/sec
     7,646,889,749 cycles                    #    2.686 GHz                     [83.34%]
     6,115,050,032 stalled-cycles-frontend   #   79.97% frontend cycles idle    [83.31%]
     1,726,460,071 stalled-cycles-backend    #   22.58% backend  cycles idle    [66.55%]
     2,079,702,453 instructions              #    0.27  insns per cycle
                                             #    2.94  stalled cycles per insn [83.22%]
       363,773,213 branches                  #  127.757 M/sec                   [83.29%]
         4,242,732 branch-misses             #    1.17% of all branches         [83.51%]

      10.128449949 seconds time elapsed

CC: Tom Herbert <therbert@google.com>
Signed-off-by: Benjamin Poirier <bpoirier@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

cdb3f4a3
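To put the figures quoted in this commit on a common footing, the following Python sketch computes the relative change between tx-nocache-copy off and on. It uses only the means reported above and ignores the stated standard deviations, so it is a reading aid rather than a significance test; the helper name `relative_change` is made up for illustration.

```python
def relative_change(off: float, on: float) -> float:
    """Return the change from `off` to `on` as a percentage of `off`."""
    return (on - off) / off * 100.0


# (value with tx-nocache-copy off, value with tx-nocache-copy on),
# copied from the commit message above.
measurements = {
    "W3565, irqs and netperf on cpu0, demand (cycle/B)": (0.80, 0.838),
    "W3565, irqs on cpu0, netperf on cpu1, demand (cycle/B)": (0.923, 0.958),
    "X5660, local send service demand (us/KB)": (0.449, 0.522),
    "X5660, perf stat cycles": (7_646_889_749, 11_812_797_651),
}

for label, (off, on) in measurements.items():
    print(f"{label}: {relative_change(off, on):+.1f}%")
```

Every row comes out positive, which is the sense in which the message says that tx-nocache-copy "leads to higher service demand".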

07 Jan, 2014 1 commit

net: Export skb_zerocopy() to zerocopy from one skb to another · af2806f8

Authored by Thomas Graf on Dec 13, 2013

Make the skb zerocopy logic written for nfnetlink queue available for
use by other modules.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Reviewed-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jesse Gross <jesse@nicira.com>

af2806f8

06 Jan, 2014 1 commit

net: netdev_kobject_init: annotate with __init · a48d4bb0

Authored by Daniel Borkmann on Jan 06, 2014

netdev_kobject_init() is only being called from __init context,
that is, net_dev_init(), so annotate it with __init as well, thus
the kernel can take this as a hint that the function is used only
during the initialization phase and free up used memory resources
after its invocation.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a48d4bb0

04 Jan, 2014 3 commits

socket: cleanups · 8f09898b

Authored by stephen hemminger on Jan 03, 2014

Namespace related cleaning

 * make cred_to_ucred static
 * remove unused sock_rmalloc function
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

8f09898b

net: netprio: rename config to be more consistent with cgroup configs · 86f8515f

Authored by Daniel Borkmann on Dec 29, 2013

While we're at it and have introduced CGROUP_NET_CLASSID, let's also make
NETPRIO_CGROUP more consistent with the rest of cgroups and rename it
into CONFIG_CGROUP_NET_PRIO so that for networking, we now have
CONFIG_CGROUP_NET_{PRIO,CLASSID}. This not only makes the CONFIG
option consistent among networking cgroups, but also among cgroups
CONFIG conventions in general as the vast majority has a prefix of
CONFIG_CGROUP_<SUBSYS>.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Zefan Li <lizefan@huawei.com>
Cc: cgroups@vger.kernel.org
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

86f8515f

net: net_cls: move cgroupfs classid handling into core · fe1217c4

Authored by Daniel Borkmann on Dec 29, 2013

Zefan Li requested [1] to perform the following cleanup/refactoring:

- Split cgroupfs classid handling into net core to better express a
  possible more generic use.

- Disable module support for cgroupfs bits as the majority of other
  cgroupfs subsystems do not have that, and seems to be not wished
  from cgroup side. Zefan probably might want to follow-up for netprio
  later on.

- By this, code can be further reduced which previously took care of
  functionality built when compiled as module.

cgroupfs bits are being placed under net/core/netclassid_cgroup.c, so
that we are consistent with {netclassid,netprio}_cgroup naming that is
under net/core/ as suggested by Zefan.

No change in functionality, but only code refactoring that is being
done here.

 [1] http://patchwork.ozlabs.org/patch/304825/

Suggested-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: cgroups@vger.kernel.org
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

fe1217c4

03 Jan, 2014 1 commit

netpoll: Fix missing TXQ unlock and OOPS. · aca5f58f

Authored by David S. Miller on Jan 02, 2014

The VLAN tag handling code in netpoll_send_skb_on_dev() has two problems.

1) It exits without unlocking the TXQ.

2) It then tries to queue a NULL skb to npinfo->txq.
Reported-by: Ahmed Tamrawi <atamrawi@iastate.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>

aca5f58f
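The two problems named in the netpoll commit (an exit path that keeps the TXQ locked, and a NULL skb being queued) can be illustrated with a small userspace model. This is a Python sketch of the control flow only, not the kernel code: `send_skb`, `put_vlan_tag` and `xmit` are hypothetical names, and `None` stands in for a NULL skb.

```python
import threading
from collections import deque


def send_skb(skb, txq_lock, backlog, put_vlan_tag, xmit):
    """Try to transmit skb; queue it for later only if it still exists."""
    sent = False
    with txq_lock:                   # released on every exit path
        skb = put_vlan_tag(skb)      # may consume the skb and return None
        if skb is not None:
            sent = xmit(skb)
    if skb is None:
        return "dropped"             # nothing left to queue
    if not sent:
        backlog.append(skb)          # retry later from the backlog
        return "queued"
    return "sent"


lock, backlog = threading.Lock(), deque()
print(send_skb(b"pkt", lock, backlog, lambda s: None, lambda s: True))
print(lock.locked(), len(backlog))
```

Using a `with` block ties the unlock to scope exit, which plays the role of a single unlock label that every path passes through.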

02 Jan, 2014 2 commits

net: core functions cleanup · 1d143d9f

Authored by stephen hemminger on Dec 29, 2013

The following functions are not used outside of net/core/dev.c
and should be declared static.

  call_netdevice_notifiers_info
  __dev_remove_offload
  netdev_has_any_upper_dev
  __netdev_adjacent_dev_remove
  __netdev_adjacent_dev_link_lists
  __netdev_adjacent_dev_unlink_lists
  __netdev_adjacent_dev_unlink
  __netdev_adjacent_dev_link_neighbour
  __netdev_adjacent_dev_unlink_neighbour

And the following are never used and should be deleted
  netdev_lower_dev_get_private_rcu
  __netdev_find_adj_rcu
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

1d143d9f

netlink: cleanup rtnl_af_register · 3678a9d8

Authored by stephen hemminger on Dec 30, 2013

The function __rtnl_af_register is never called outside this
code, and the return value is always 0.
Signed-off-by: David S. Miller <davem@davemloft.net>

3678a9d8

01 Jan, 2014 2 commits

net, rps: fix the comment of net_rps_action_and_irq_enable() · 855abcf0

Authored by Zhi Yong Wu on Jan 01, 2014

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

855abcf0

vlan: Fix header ops passthru when doing TX VLAN offload. · 2205369a

Authored by David S. Miller on Dec 31, 2013

When the vlan code detects that the real device can do TX VLAN offloads
in hardware, it tries to arrange for the real device's header_ops to
be invoked directly.

But it does so illegally, by simply hooking the real device's
header_ops up to the VLAN device.

This doesn't work because we will end up invoking a set of header_ops
routines which expect a device type which matches the real device, but
will see a VLAN device instead.

Fix this by providing a pass-thru set of header_ops which will arrange
to pass the proper real device instead.

To facilitate this add a dev_rebuild_header().  There are
implementations which provide a ->cache and ->create but not a
->rebuild (f.e. PLIP).  So we need a helper function just like
dev_hard_header() to avoid crashes.

Use this helper in the one existing place where the
header_ops->rebuild was being invoked, the neighbour code.

With lots of help from Florian Westphal.
Signed-off-by: David S. Miller <davem@davemloft.net>

2205369a
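The two ideas in the VLAN commit, a helper that tolerates a missing `->rebuild` and a pass-through set of header ops that hands the real device to the real device's routines, can be modelled in userspace. The following is a Python sketch under stated assumptions: the `HeaderOps` and `Device` classes and the callback signatures are simplified stand-ins, not the kernel structures.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class HeaderOps:
    create: Optional[Callable] = None
    rebuild: Optional[Callable] = None


@dataclass
class Device:
    name: str
    header_ops: Optional[HeaderOps] = None
    real_dev: Optional["Device"] = None


def dev_rebuild_header(dev: Device, skb: bytes) -> int:
    """Call ->rebuild only if the device provides one; otherwise return 0."""
    ops = dev.header_ops
    if ops is None or ops.rebuild is None:
        return 0
    return ops.rebuild(dev, skb)


def vlan_passthru_ops() -> HeaderOps:
    """Header ops for a VLAN device that forward to the real device."""
    def create(vlan_dev: Device, skb: bytes) -> int:
        real = vlan_dev.real_dev
        # Pass the real device, so its routines see the type they expect.
        return real.header_ops.create(real, skb)

    def rebuild(vlan_dev: Device, skb: bytes) -> int:
        return dev_rebuild_header(vlan_dev.real_dev, skb)

    return HeaderOps(create=create, rebuild=rebuild)
```

A device that offers `create` but no `rebuild` (the message cites PLIP) then yields 0 from `dev_rebuild_header()` instead of calling through a missing callback.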

22 Dec, 2013 1 commit

net: use kfree_skb_list() helper · 289dccbe

Authored by Eric Dumazet on Dec 20, 2013

We can use kfree_skb_list() instead of open coding it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

289dccbe

20 Dec, 2013 1 commit

rps: NUMA flow limit allocations · 5b59d467

Authored by Eric Dumazet on Dec 18, 2013

Given we allocate memory for each cpu, we can do this
using NUMA affinities, instead of using NUMA policies
of the process changing flow_limit_cpu_bitmap value.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5b59d467

18 Dec, 2013 5 commits

net: allow netdev_all_upper_get_next_dev_rcu with rtnl lock held · 85328240

Authored by John Fastabend on Nov 26, 2013

It is useful to be able to walk all upper devices when bringing
a device online where the RTNL lock is held. In this case it
is safe to walk the all_adj_list because the RTNL lock is used
to protect the write side as well.

This patch adds a check to see if the rtnl lock is held before
throwing a warning in netdev_all_upper_get_next_dev_rcu().

Also, because we now have a call site for lockdep_rtnl_is_held()
outside CONFIG_PROVE_LOCKING, an inline definition returning 1 is
needed, similar to rcu_read_lock_is_held().

Fixes: 2a47fa45 ("ixgbe: enable l2 forwarding acceleration for macvlans")
CC: Veaceslav Falico <vfalico@redhat.com>
Reported-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

85328240

net: Add utility function to copy skb hash · 3df7a74e

Authored by Tom Herbert on Dec 15, 2013

Adds skb_copy_hash to copy rxhash and l4_rxhash from one skb to another.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3df7a74e

net: Change skb_get_rxhash to skb_get_hash · 3958afa1

Authored by Tom Herbert on Dec 15, 2013

Changing name of function as part of making the hash in skbuff to be
generic property, not just for receive path.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3958afa1

neigh: Netlink notification for administrative NUD state change · 53385d2d

Authored by Bob Gilligan on Dec 15, 2013

The neighbour code sends up an RTM_NEWNEIGH netlink notification if
the NUD state of a neighbour cache entry is changed by a timer (e.g.
from REACHABLE to STALE), even if the lladdr of the entry has not
changed.

But an administrative change to the NUD state of a neighbour cache
entry that does not change the lladdr (e.g. via "ip -4 neigh change
...  nud ...") does not trigger a netlink notification.  This means
that netlink listeners will not hear about administrative NUD state
changes such as from a resolved state to PERMANENT.

This patch changes the neighbor code to generate an RTM_NEWNEIGH
message when the NUD state of an entry is changed administratively.
Signed-off-by: Bob Gilligan <gilligan@aristanetworks.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

53385d2d

net: remove dead code for add/del multiple · 477bb933

Authored by stephen hemminger on Dec 13, 2013

These functions to manipulate multiple addresses are not used anywhere
in the current net-next tree. Some out-of-tree code may be using these, but
too bad; they should submit their code upstream.

Also, make __hw_addr_flush local since only used by dev_addr_lists.c
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

477bb933

14 Dec, 2013 1 commit

bonding: create bond_first_slave_rcu() · e001bfad

Authored by dingtianhong on Dec 13, 2013

bond_first_slave_rcu() will be used instead of bond_first_slave()
under rcu_read_lock().

Following Jay Vosburgh's suggestion, struct netdev_adjacent should be
hidden from users who want to use it directly, so I package a
new function to get the first slave of the bond.
Suggested-by: Nikolay Aleksandrov <nikolay@redhat.com>
Suggested-by: Jay Vosburgh <fubar@us.ibm.com>
Suggested-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e001bfad

13 Dec, 2013 1 commit

net-gro: Prepare GRO stack for the upcoming tunneling support · 299603e8

Authored by Jerry Chu on Dec 11, 2013

This patch modifies the GRO stack to avoid the use of "network_header"
and associated macros like ip_hdr() and ipv6_hdr() in order to allow
an arbitrary number of IP hdrs (v4 or v6) to be used in the
encapsulation chain. This lays the foundation for various IP
tunneling support (IP-in-IP, GRE, VXLAN, SIT,...) to be added later.

With this patch, the GRO stack traversing now is mostly based on
skb_gro_offset rather than special hdr offsets saved in skb (e.g.,
skb->network_header). As a result all but the top layer (i.e., the
transport layer) must have hdrs of the same length in order for
a pkt to be considered for aggregation. Therefore when adding a new
encap layer (e.g., for tunneling), one must check and skip flows
(e.g., by setting NAPI_GRO_CB(p)->same_flow to 0) that have a
different hdr length.

Note that unlike the network header, the transport header can and
will continue to be set by the GRO code since there will be at
most one "transport layer" in the encap chain.
Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

299603e8

12 Dec, 2013 1 commit

ipv6: router reachability probing · 7e980569

Authored by Jiri Benc on Dec 11, 2013

RFC 4191 states in 3.5:

   When a host avoids using any non-reachable router X and instead sends
   a data packet to another router Y, and the host would have used
   router X if router X were reachable, then the host SHOULD probe each
   such router X's reachability by sending a single Neighbor
   Solicitation to that router's address.  A host MUST NOT probe a
   router's reachability in the absence of useful traffic that the host
   would have sent to the router if it were reachable.  In any case,
   these probes MUST be rate-limited to no more than one per minute per
   router.

Currently, when the neighbour corresponding to a router falls into
NUD_FAILED, it's never considered again. Introduce a new rt6_nud_state
value, RT6_NUD_FAIL_PROBE, which suggests the route should not be used but
should be probed with a single NS. The probe is ratelimited by the existing
code. To better distinguish meanings of the failure values, rename
RT6_NUD_FAIL_SOFT to RT6_NUD_FAIL_DO_RR.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

7e980569
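The rate limit quoted from RFC 4191 in this commit ("no more than one per minute per router") can be sketched as a per-router timestamp check. This Python model is illustrative only: `ProbeLimiter` is a hypothetical name, the 60-second default simply mirrors the RFC wording, and the kernel's existing rate-limiting code is not reproduced here.

```python
class ProbeLimiter:
    """Allow at most one probe per router within `min_interval` seconds."""

    def __init__(self, min_interval: float = 60.0) -> None:
        self.min_interval = min_interval
        self._last_probe = {}          # router address -> time of last probe

    def should_probe(self, router: str, now: float) -> bool:
        last = self._last_probe.get(router)
        if last is not None and now - last < self.min_interval:
            return False               # probed too recently, stay quiet
        self._last_probe[router] = now
        return True


limiter = ProbeLimiter()
for t in (0.0, 30.0, 60.0):
    print(t, limiter.should_probe("fe80::1", now=t))
```

The RFC's other condition, that a host only probes when it has traffic it would have sent via that router, is left to the caller in this sketch.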

11 Dec, 2013 3 commits

net: more spelling fixes · 8e3bff96

Authored by stephen hemminger on Dec 08, 2013

Various spelling fixes in networking stack
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

8e3bff96

net: unix: allow set_peek_off to fail · 12663bfc

Authored by Sasha Levin on Dec 07, 2013

unix_dgram_recvmsg() will hold the readlock of the socket until recv
is complete.

At the same time, we may try to setsockopt(SO_PEEK_OFF), which will hang until
unix_dgram_recvmsg() completes (which can take a while) without allowing
us to break out of it, triggering a hung task spew.

Instead, allow set_peek_off to fail; this way userspace will not hang.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

12663bfc

neigh: use neigh_parms_net() to get struct neigh_parms->net pointer · 77d47afb

Authored by Jiri Pirko on Dec 10, 2013

This fixes compile error when CONFIG_NET_NS is not set.

Introduced by:
commit 1d4c8c29
    "neigh: restore old behaviour of default parms values"
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>

77d47afb

10 Dec, 2013 7 commits

net: drop_monitor: fix the value of maxattr · d323e92c

Authored by Changli Gao on Dec 08, 2013

maxattr in genl_family should be used to save the max attribute
type, but not the max command type. Drop monitor doesn't support
any attributes, so we should leave it as zero.
Signed-off-by: David S. Miller <davem@davemloft.net>

d323e92c

neigh: ipv6: respect default values set before an address is assigned to device · bba24896

Authored by Jiri Pirko on Dec 07, 2013

Make the behaviour similar to ipv4. This will allow the user to set sysctl
default neigh param values, and these values will be respected even by
devices registered earlier (those that do not have an address set yet).
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>

bba24896

neigh: restore old behaviour of default parms values · 1d4c8c29

Authored by Jiri Pirko on Dec 07, 2013

Previously inet devices were only constructed when addresses are added.
Therefore the default neigh parms values they get are the ones at the
time of these operations.

Now that we're creating inet devices earlier, this changes the behaviour
of default neigh parms values in an incompatible way (see bug #8519).

This patch creates a compromise by setting the default values at the
same point as before but only for those that have not been explicitly
set by the user since the inet device's creation.

Introduced by:
commit 8030f544
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date:   Thu Feb 22 01:53:47 2007 +0900

    [IPV4] devinet: Register inetdev earlier.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>

1d4c8c29

neigh: use tbl->family to distinguish ipv4 from ipv6 · 73af614a

Authored by Jiri Pirko on Dec 07, 2013

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>

73af614a

neigh: wrap proc dointvec functions · cb5b09c1

Authored by Jiri Pirko on Dec 07, 2013

This will be needed later on to provide better management of default values.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>

cb5b09c1

neigh: convert parms to an array · 1f9248e5

Authored by Jiri Pirko on Dec 07, 2013

This patch converts the neigh param members to an array. This allows easier
manipulation which will be needed later on to provide better management of
default values.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>

1f9248e5

net: dev: move inline skb_needs_linearize helper to header · 4262e5cc

Authored by Daniel Borkmann on Dec 06, 2013

As we need it elsewhere, move the inline helper function of
skb_needs_linearize() over to skbuff.h include file. While
at it, also convert the return to 'bool' instead of 'int'
and add a proper kernel doc.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4262e5cc
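For readers unfamiliar with the helper moved by this commit, the decision it makes can be modelled as a boolean predicate over the skb's layout and the device's feature bits. The sketch below is a Python approximation written from memory of the upstream helper, not a copy of it: the `Skb` class is a stand-in, and the feature bit values `F_SG` and `F_FRAGLIST` are hypothetical placeholders for NETIF_F_SG and NETIF_F_FRAGLIST.

```python
from dataclasses import dataclass

# Hypothetical bit values; the kernel defines its own feature flags.
F_SG = 1 << 0
F_FRAGLIST = 1 << 1


@dataclass
class Skb:
    nr_frags: int = 0            # number of page fragments
    has_frag_list: bool = False  # chained skbs hanging off this one

    @property
    def is_nonlinear(self) -> bool:
        return self.nr_frags > 0 or self.has_frag_list


def skb_needs_linearize(skb: Skb, features: int) -> bool:
    """True if the device cannot take the skb in its current layout."""
    return bool(
        skb.is_nonlinear
        and (
            (skb.has_frag_list and not (features & F_FRAGLIST))
            or (skb.nr_frags > 0 and not (features & F_SG))
        )
    )
```

Returning `bool` rather than an integer matches the commit's note that the helper's return type was converted to 'bool'.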

07 Dec, 2013 2 commits

net: introduce dev_consume_skb_any() · e6247027

Authored by Eric Dumazet on Dec 05, 2013

Some network drivers use dev_kfree_skb_any() and dev_kfree_skb_irq()
helpers to free skbs, both for dropped packets and TX completed ones.

We need to separate the two causes to get better diagnostics
given by dropwatch or "perf record -e skb:kfree_skb"

This patch provides two new helpers, dev_consume_skb_any() and
dev_consume_skb_irq() to be used for consumed skbs.

__dev_kfree_skb_irq() is slightly optimized to remove one
atomic_dec_and_test() in fast path, and use this_cpu_{r|w} accessors.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e6247027

gro: small napi_get_frags() optim · 84b9cd63

Authored by Eric Dumazet on Dec 05, 2013

Remove one useless conditional branch :
napi->skb is NULL, so nothing bad can happen.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

84b9cd63

06 Dec, 2013 1 commit

net: clear local_df when passing skb between namespaces · 239c78db

Authored by Hannes Frederic Sowa on Dec 05, 2013

We must clear local_df when passing the skb between namespaces as the
packet is not local to the new namespace any more and thus may not get
fragmented by local rules. Fred Templin noticed that other namespaces
do fragment IPv6 packets while forwarding. Instead they should have sent
back a PTB.

The same problem should be present when forwarding DF-IPv4 packets
between namespaces.
Reported-by: Templin, Fred L <Fred.L.Templin@boeing.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

239c78db

02 Dec, 2013 1 commit

{pktgen, xfrm} Update IPv4 header total len and checksum after transformation · 3868204d

Authored by fan.du on Dec 01, 2013

commit a553e4a6 ("[PKTGEN]: IPSEC support")
tried to support IPsec ESP transport transformation for pktgen, but actually
this doesn't work at all for two reasons (the original transformed packet has
a bad IPv4 checksum value as well as a wrong auth value, as reported by Wireshark):

- After transformation, the IPv4 header total length needs to be updated,
  because the encrypted payload's length is NOT the same as that of the plain text.

- After transformation, the IPv4 checksum needs to be recalculated because the
  payload has been changed.

With this patch, and pktgen armed with the configuration below, Wireshark is able
to decrypt the ESP packets generated by pktgen without any IPv4 checksum error or
auth value error.

pgset "flag IPSEC"
pgset "flows 1"
Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3868204d
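The two header fix-ups described in the pktgen commit (update the total length, then recompute the header checksum) can be shown on a raw IPv4 header in userspace. This Python sketch uses the standard Internet checksum (one's-complement sum of 16-bit words); the function names are hypothetical and the sample header is an arbitrary 20-byte example, not a packet produced by pktgen.

```python
import struct


def ipv4_checksum(header: bytes) -> int:
    """One's-complement sum of 16-bit words, complemented (Internet checksum)."""
    if len(header) % 2:
        header += b"\x00"
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:                       # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def fix_ipv4_header(header: bytes, payload_len: int) -> bytes:
    """Return the header with total length and checksum brought up to date."""
    ihl = (header[0] & 0x0F) * 4             # header length in bytes
    hdr = bytearray(header[:ihl])
    struct.pack_into("!H", hdr, 2, ihl + payload_len)   # total length field
    struct.pack_into("!H", hdr, 10, 0)                   # zero checksum first
    struct.pack_into("!H", hdr, 10, ipv4_checksum(bytes(hdr)))
    return bytes(hdr)


stale = bytes.fromhex("450000280000400040110000c0a80001c0a800c7")
fixed = fix_ipv4_header(stale, payload_len=95)
print(fixed.hex())
```

A receiver validates the result by running the same sum over the full header, checksum included; a correct header yields 0.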

22 Nov, 2013 1 commit

gso: handle new frag_list of frags GRO packets · 9d8506cc

Authored by Herbert Xu on Nov 21, 2013

Recently GRO started generating packets with frag_lists of frags.
This was not handled by GSO, thus leading to a crash.

Thankfully these packets are of a regular form and are easy to
handle.  This patch handles them in two ways.  For completely
non-linear frag_list entries, we simply continue to iterate over
the frag_list frags once we exhaust the normal frags.  For frag_list
entries with linear parts, we call pskb_trim on the first part
of the frag_list skb, and then process the rest of the frags in
the usual way.

This patch also kills a chunk of dead frag_list code that has
obviously never ever been run since it ends up generating a bogus
GSO-segmented packet with a frag_list entry.

Future work is planned to split super big packets into TSO
ones.

Fixes: 8a29111c ("net: gro: allow to build full sized skb")
Reported-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Reported-by: Jerry Chu <hkchu@google.com>
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Sander Eikelenboom <linux@eikelenboom.it>
Tested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9d8506cc

21 Nov, 2013 2 commits

net: rework recvmsg handler msg_name and msg_namelen logic · f3d33426

Authored by Hannes Frederic Sowa on Nov 21, 2013

This patch now always passes msg->msg_namelen as 0. recvmsg handlers must
set msg_namelen to the proper size <= sizeof(struct sockaddr_storage)
to return msg_name to the user.

This prevents numerous uninitialized memory leaks we had in the
recvmsg handlers and makes it harder for new code to accidentally leak
uninitialized memory.

Optimize for the case recvfrom is called with NULL as address. We don't
need to copy the address at all, so set it to NULL before invoking the
recvmsg handler. We can do so, because all the recvmsg handlers must
cope with the case a plain read() is called on them. read() also sets
msg_name to NULL.

Also document these changes in include/linux/net.h as suggested by David
Miller.

Changes since RFC:

Set msg->msg_name = NULL if user specified a NULL in msg_name but had a
non-null msg_namelen in verify_iovec/verify_compat_iovec. This doesn't
affect sendto as it would bail out earlier while trying to copy-in the
address. It also more naturally reflects the logic by the callers of
verify_iovec.

With this change in place I could remove "
if (!uaddr || msg_sys->msg_namelen == 0)
	msg->msg_name = NULL
".

This change does not alter the user visible error logic as we ignore
msg_namelen as long as msg_name is NULL.

Also remove two unnecessary curly brackets in ___sys_recvmsg and change
comments to netdev style.

Cc: David Miller <davem@davemloft.net>
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

f3d33426
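The contract described in the recvmsg commit (the caller always starts with msg_namelen at 0, and a handler must set it explicitly to return an address) can be modelled in userspace to show why it closes the leak. This is a Python sketch under stated assumptions: `MsgHdr`, `sys_recvmsg` and `dgram_handler` are simplified stand-ins for the kernel structures and call chain, and 128 is used as the size of struct sockaddr_storage.

```python
from dataclasses import dataclass
from typing import Callable, Optional

SOCKADDR_STORAGE_SIZE = 128   # assumed sizeof(struct sockaddr_storage)


@dataclass
class MsgHdr:
    name: Optional[bytearray]  # kernel-side address buffer, or None
    namelen: int


def sys_recvmsg(want_address: bool, handler: Callable[[MsgHdr], None]) -> bytes:
    """Return the address bytes that would be copied back to the user."""
    # Stale bytes stand in for uninitialized kernel memory.
    buf = bytearray(b"\xaa" * SOCKADDR_STORAGE_SIZE) if want_address else None
    msg = MsgHdr(name=buf, namelen=0)        # always start from 0
    handler(msg)
    if msg.namelen > SOCKADDR_STORAGE_SIZE:
        raise ValueError("handler reported an oversized address")
    if msg.name is None:
        return b""
    return bytes(msg.name[:msg.namelen])     # only what the handler filled in


PEER = bytes.fromhex("02001f907f000001")     # illustrative address bytes


def dgram_handler(msg: MsgHdr) -> None:
    if msg.name is not None:                 # caller asked for the address
        msg.name[:len(PEER)] = PEER
        msg.namelen = len(PEER)
```

A handler that forgets to touch the address leaves namelen at 0, so none of the stale buffer reaches the caller, and a caller that passes no buffer skips the copy entirely.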

net: core: Always propagate flag changes to interfaces · d2615bf4

Authored by Vlad Yasevich on Nov 19, 2013

The following commit:
    b6c40d68
    net: only invoke dev->change_rx_flags when device is UP

tried to fix a problem with VLAN devices and promiscuous flag setting.
The issue was that VLAN device was setting a flag on an interface that
was down, thus resulting in bad promiscuity count.
This commit blocked flag propagation to any device that is currently
down.

A later commit:
    deede2fa
    vlan: Don't propagate flag changes on down interfaces

fixed VLAN code to only propagate flags when the VLAN interface is up,
thus fixing the same issue as above, only localized to VLAN.

The problem we have now is that if we create a complex stack
involving multiple software devices like bridges, bonds, and vlans,
then it is possible that the flags would not propagate properly to
the physical devices.  A simple example of the scenario is the
following:

  eth0----> bond0 ----> bridge0 ---> vlan50

If bond0 or eth0 happen to be down at the time bond0 is added to
the bridge, then eth0 will never have promisc mode set which is
currently required for operation as part of the bridge.  As a
result, packets with vlan50 will be dropped by the interface.

The only 2 devices that implement the special flag handling are
VLAN and DSA and they both have required code to prevent incorrect
flag propagation.  As a result we can remove the generic solution
introduced in b6c40d68 and leave
it to the individual devices to decide whether they will block
flag propagation or not.
Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Suggested-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d2615bf4
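The eth0/bond0 scenario from this commit can be replayed with a toy model of stacked devices. The sketch below is a Python illustration, not kernel code: `Dev` and `set_promiscuity` are hypothetical names, and the `require_up` switch stands in for the generic "only propagate when the device is UP" check that this commit removes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Dev:
    name: str
    lower: Optional["Dev"] = None   # device this one is stacked on
    up: bool = True
    promiscuity: int = 0


def set_promiscuity(dev: Dev, inc: int, require_up: bool) -> None:
    """Adjust the counter on dev and push the change to the lower device."""
    dev.promiscuity += inc
    if dev.lower is None:
        return
    if require_up and not dev.up:
        return                      # old generic check: change is lost here
    set_promiscuity(dev.lower, inc, require_up)


for require_up in (True, False):
    eth0 = Dev("eth0")
    bond0 = Dev("bond0", lower=eth0, up=False)   # down when enslaved
    set_promiscuity(bond0, +1, require_up)       # bridge asks for promisc
    print(f"require_up={require_up}: eth0.promiscuity={eth0.promiscuity}")
```

With the generic check in place eth0 never learns about the request; without it the request reaches the physical device, and devices that need to block propagation (the message names VLAN and DSA) do so in their own code.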

20 Nov, 2013 2 commits

genetlink: make multicast groups const, prevent abuse · 2a94fe48

Authored by Johannes Berg on Nov 19, 2013

Register generic netlink multicast groups as an array with
the family and give them contiguous group IDs. Then instead
of passing the global group ID to the various functions that
send messages, pass the ID relative to the family - for most
families that's just 0 because they only have one group.

This avoids the list_head and ID in each group, adding a new
field for the mcast group ID offset to the family.

At the same time, this allows us to prevent abusing groups
again like the quota and dropmon code did, since we can now
check that a family only uses a group it owns.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2a94fe48
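The scheme described in this commit (contiguous group IDs per family, callers passing a family-relative index, and an ownership check) can be sketched as follows. This is a Python model with hypothetical names; `Family`, `Registry` and `global_group_id` are illustrative, the starting ID of 1 is an assumption, and the family names and group counts in the usage lines are made up.

```python
from dataclasses import dataclass


@dataclass
class Family:
    name: str
    n_mcgrps: int           # number of multicast groups the family owns
    mcgrp_offset: int = 0   # global ID of the family's group 0


class Registry:
    """Hands out contiguous blocks of global group IDs to families."""

    def __init__(self, first_id: int = 1) -> None:
        self._next_id = first_id

    def register(self, family: Family) -> Family:
        family.mcgrp_offset = self._next_id
        self._next_id += family.n_mcgrps
        return family


def global_group_id(family: Family, group: int) -> int:
    """Translate a family-relative group index into a global group ID."""
    if not 0 <= group < family.n_mcgrps:
        raise ValueError(f"{family.name} does not own group {group}")
    return family.mcgrp_offset + group


reg = Registry()
first = reg.register(Family("family_a", n_mcgrps=1))
second = reg.register(Family("family_b", n_mcgrps=3))
print(global_group_id(first, 0), global_group_id(second, 2))
```

Because the bounds check sits in the translation step, a family cannot send to a group outside its own block, which is the abuse the commit message refers to.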

genetlink: pass family to functions using groups · 68eb5503

Authored by Johannes Berg on Nov 19, 2013

This doesn't really change anything, but prepares for the
next patch that will change the APIs to pass the group ID
within the family, rather than the global group ID.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

68eb5503