1. 16 Apr, 2013 (2 commits)
  2. 15 Apr, 2013 (1 commit)
  3. 13 Apr, 2013 (1 commit)
    • tcp: GSO should be TSQ friendly · d6a4a104
      Committed by Eric Dumazet
      I noticed that TSQ (TCP Small Queues) was less effective when TSO is
      turned off and GSO is on. If BQL is not enabled, TSQ then has no
      effect.
      
      It turns out the GSO engine frees the original gso_skb at the time the
      fragments are generated and queued to the NIC.
      
      We should instead call the tcp_wfree() destructor for the last fragment,
      to keep the flow control as intended in TSQ. This effectively limits
      the number of queued packets on qdisc + NIC layers.
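      
      A minimal standalone C sketch of the destructor handoff (the types
      and names are illustrative, not the kernel's): the completion
      callback moves to the last segment, so the flow-control budget is
      released only when the final fragment is consumed.
      
      	#include <stdio.h>
      
      	struct pkt {
      		int len;
      		void (*destructor)(struct pkt *);	/* tcp_wfree() analogue */
      	};
      
      	static void wfree(struct pkt *p)
      	{
      		printf("flow-control budget released (len=%d)\n", p->len);
      	}
      
      	/* Split orig into n segments; hand the destructor to the LAST one. */
      	static void segment(struct pkt *orig, struct pkt *segs, int n)
      	{
      		for (int i = 0; i < n; i++) {
      			segs[i].len = orig->len / n;
      			segs[i].destructor = NULL;
      		}
      		segs[n - 1].destructor = orig->destructor;
      		orig->destructor = NULL;	/* original is about to be freed */
      	}
      
      	int main(void)
      	{
      		struct pkt orig = { 64000, wfree }, segs[4];
      
      		segment(&orig, segs, 4);
      		for (int i = 0; i < 4; i++)	/* NIC consumes the fragments */
      			if (segs[i].destructor)
      				segs[i].destructor(&segs[i]);
      		return 0;
      	}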
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d6a4a104
  4. 12 Apr, 2013 (2 commits)
    • usbnet: handle link change · 4b49f58f
      Committed by Ming Lei
      The link change is detected via the interrupt pipe, and the bulk
      pipes are responsible for transferring packets, so it is reasonable
      to stop bulk transfers once the link is reported as down.
      
      Two advantages may be obtained by stopping bulk transfers
      after the link goes down:
      
      - USB bus bandwidth is saved (the USB bus is a shared bus, except
      for USB 3.0); for example, many 'IN' token packets and 'NYET'
      handshake packets are transferred on a 2.0 bus.
      
      - power might be saved for the USB host controller, since
      cancelling bulk transfers may disable the asynchronous schedule of
      the host controller.
      
      With this patch, when the link goes down, a roughly 10% performance
      boost can be seen on bulk transfers of another USB device attached
      to the same bus as the usbnet device; see the test below on
      next-20130410:
      
      - read from USB mass storage (SanDisk Extreme USB 3.0) on a
      PandaBoard with the command below, after unplugging the Ethernet
      cable:
      
      	dd if=/dev/sda iflag=direct of=/dev/null bs=1M count=800
      
      - without the patch
      1, 838860800 bytes (839 MB) copied, 36.2216 s, 23.2 MB/s
      2, 838860800 bytes (839 MB) copied, 35.8368 s, 23.4 MB/s
      3, 838860800 bytes (839 MB) copied, 35.823 s, 23.4 MB/s
      4, 838860800 bytes (839 MB) copied, 35.937 s, 23.3 MB/s
      5, 838860800 bytes (839 MB) copied, 35.7365 s, 23.5 MB/s
      average: 23.4MB/s
      
      - with the patch
      1, 838860800 bytes (839 MB) copied, 32.3817 s, 25.9 MB/s
      2, 838860800 bytes (839 MB) copied, 31.7389 s, 26.4 MB/s
      3, 838860800 bytes (839 MB) copied, 32.438 s, 25.9 MB/s
      4, 838860800 bytes (839 MB) copied, 32.5492 s, 25.8 MB/s
      5, 838860800 bytes (839 MB) copied, 31.6178 s, 26.5 MB/s
      average: 26.1MB/s
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4b49f58f
    • usbnet: introduce usbnet_link_change API · ac64995d
      Committed by Ming Lei
      This patch introduces the usbnet_link_change API, so that
      usbnet can handle link changes centrally; this in turn helps
      implement killing traffic URBs to save USB bus bandwidth
      and host-controller power.
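      
      A sketch of how a minidriver's status callback might use the helper
      (kernel-context fragment, not a standalone program; the foo_* names
      are hypothetical, while usbnet_link_change(dev, link, need_reset) is
      the helper introduced here):
      
      	static void foo_status(struct usbnet *dev, struct urb *urb)
      	{
      		struct foo_event *ev = urb->transfer_buffer;
      		bool link = !!(ev->flags & FOO_LINK_UP);
      
      		/* Let the usbnet core update carrier state centrally, so
      		 * traffic URBs can be killed/resumed in one place. */
      		usbnet_link_change(dev, link, false);
      	}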
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ac64995d
  5. 10 Apr, 2013 (4 commits)
  6. 09 Apr, 2013 (3 commits)
    • net: ipv6: add tokenized interface identifier support · f53adae4
      Committed by Daniel Borkmann
      This patch adds support for IPv6 tokenized IIDs, which allow
      administrators to assign well-known host-part addresses to
      nodes whilst still obtaining the global network prefix from
      Router Advertisements. The mechanism is currently in draft status.
      
        The primary target for such support is server platforms
        where addresses are usually manually configured, rather
        than using DHCPv6 or SLAAC. By using tokenised identifiers,
        hosts can still determine their network prefix by use of
        SLAAC, but more readily be automatically renumbered should
        their network prefix change. [...]
      
        The disadvantage with static addresses is that they are
        likely to require manual editing should the network prefix
        in use change.  If instead there were a method to only
        manually configure the static identifier part of the IPv6
        address, then the address could be automatically updated
        when a new prefix was introduced, as described in [RFC4192]
        for example.  In such cases a DNS server might be
        configured with such a tokenised interface identifier of
        ::53, and SLAAC would use the token in constructing the
        interface address, using the advertised prefix. [...]
      
        http://tools.ietf.org/html/draft-chown-6man-tokenised-ipv6-identifiers-02
      
      The implementation is partially based on Mark K. Thompson's
      proof of concept. However, it uses the netlink interface for
      configuration and data retrieval, so that it can easily be
      extended in the future. Successfully tested by myself.
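      
      To make the construction concrete, here is a standalone userspace C
      illustration (not kernel code) of how SLAAC combines the advertised
      /64 prefix with a configured token of ::53:
      
      	#include <stdio.h>
      	#include <string.h>
      	#include <arpa/inet.h>
      
      	int main(void)
      	{
      		struct in6_addr prefix, token, addr;
      		char buf[INET6_ADDRSTRLEN];
      
      		inet_pton(AF_INET6, "2001:db8:1::", &prefix);	/* from RA */
      		inet_pton(AF_INET6, "::53", &token);		/* configured */
      
      		memcpy(&addr, &prefix, 8);			/* upper 64 bits */
      		memcpy((char *)&addr + 8, (char *)&token + 8, 8); /* lower 64 */
      
      		printf("%s\n", inet_ntop(AF_INET6, &addr, buf, sizeof(buf)));
      		return 0;					/* 2001:db8:1::53 */
      	}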
      
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Cc: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f53adae4
    • ieee802154/nl-mac.c: make some MLME operations optional · 56aa091d
      Committed by Werner Almesberger
      Check for NULL before calling the following operations from "struct
      ieee802154_mlme_ops": assoc_req, assoc_resp, disassoc_req, start_req,
      and scan_req.
      
      This fixes a current oops where those functions are called but not
      implemented. It also updates the documentation to clarify that they
      are now optional by design. If a call to an unimplemented function
      is attempted, the kernel returns EOPNOTSUPP via netlink.
      
      The following operations are still required: get_phy, get_pan_id,
      get_short_addr, and get_dsn.
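      
      The NULL-check pattern, modelled in standalone C (the struct is a
      simplified stand-in for the kernel's ieee802154_mlme_ops, not its
      exact definition):
      
      	#include <errno.h>
      	#include <stdio.h>
      
      	struct mlme_ops {
      		int (*assoc_req)(void *dev);	/* optional */
      		int (*get_pan_id)(void *dev);	/* still required */
      	};
      
      	static int do_assoc_req(const struct mlme_ops *ops, void *dev)
      	{
      		if (!ops->assoc_req)
      			return -EOPNOTSUPP;	/* reported via netlink */
      		return ops->assoc_req(dev);
      	}
      
      	int main(void)
      	{
      		struct mlme_ops ops = { NULL, NULL };
      
      		printf("assoc_req: %d\n", do_assoc_req(&ops, NULL));
      		return 0;
      	}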
      
      Note that the places where this patch changes the initialization
      of "ret" should not affect the rest of the code since "ret" was
      always set (again) before returning its value.
      Signed-off-by: Werner Almesberger <werner@almesberger.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      56aa091d
    • IEEE 802.15.4: remove get_bsn from "struct ieee802154_mlme_ops" · d87c8c6d
      Committed by Werner Almesberger
      It served no purpose: we never call it from anywhere in the stack
      and the only driver that did implement it (fakehard) merely provided
      a dummy value.
      
      There is also considerable doubt whether it would make sense to
      even attempt beacon processing at this level in the Linux kernel.
      Signed-off-by: Werner Almesberger <werner@almesberger.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d87c8c6d
  7. 08 Apr, 2013 (2 commits)
  8. 06 Apr, 2013 (4 commits)
    • netfilter: don't reset nf_trace in nf_reset() · 124dff01
      Committed by Patrick McHardy
      Commit 130549fe ("netfilter: reset nf_trace in nf_reset") added code
      to reset nf_trace in nf_reset(). This is wrong and unnecessary.
      
      nf_reset() is used in the following cases:
      
      - when passing packets up to the socket layer, at which point we want to
        release all netfilter references that might keep modules pinned while
        the packet is queued. nf_trace doesn't matter anymore at this point.
      
      - when encapsulating or decapsulating IPsec packets. We want to continue
        tracing these packets after IPsec processing.
      
      - when passing packets through virtual network devices. Only devices
        that encapsulate in IPv4/v6 matter, since otherwise nf_trace is not
        used anymore. It's not entirely clear whether those packets should
        still be traced after that, but we have always done so.
      
      - when passing packets through virtual network devices that make the
        packet cross network namespace boundaries. This is the only case
        where we clearly want to reset nf_trace, and it is also what the
        original patch intended to fix.
      
      Add a new function nf_reset_trace() and use it in dev_forward_skb() to
      fix this properly.
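      
      The new helper is small; a sketch modelled on the change (the exact
      config guard may differ in the tree):
      
      	static inline void nf_reset_trace(struct sk_buff *skb)
      	{
      	#if IS_ENABLED(CONFIG_NETFILTER_XT_TARGET_TRACE)
      		skb->nf_trace = 0;	/* clear only the trace flag */
      	#endif
      	}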
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      124dff01
    • netfilter: remove unneeded variable proc_net_netfilter · 12202fa7
      Committed by Pablo Neira Ayuso
      Now that nflog and nfqueue support net namespaces, we can
      remove the global proc_net_netfilter, which has no
      users anymore.
      
      Based on a patch from Gao feng.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      12202fa7
    • netfilter: nf_log: prepare net namespace support for loggers · 30e0c6a6
      Committed by Gao feng
      This patch adds netns support to nf_log and prepares netns
      support for existing loggers. It consists of four major
      changes.
      
      1) nf_log_register has been split into two functions: nf_log_register
         and nf_log_set. The new nf_log_register is used to globally
         register an nf_logger, and nf_log_set is used to enable
         per-net support for nf_loggers (see the sketch after this list).
      
         Per-netns support is not yet complete after this patch; it comes
         in separate follow-up patches.
      
      2) Add net as a parameter of nf_log_bind_pf. Per-netns support is
         not yet complete after this patch; it only allows binding the
         nf_logger to the protocol family from init_net and skips
         other cases.
      
      3) Adapt all nf_log_packet callers to pass the netns as a parameter.
         After this patch, this function only works for init_net.
      
      4) Make the sysctl net/netfilter/nf_log pernet.
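      
      A sketch of what the split looks like from a logger module's
      perspective (kernel-context fragment; my_logger and my_net_ops are
      hypothetical, and the exact prototypes should be checked against the
      patch):
      
      	static int __net_init my_log_net_init(struct net *net)
      	{
      		/* enable this logger for IPv4 in each net namespace */
      		nf_log_set(net, NFPROTO_IPV4, &my_logger);
      		return 0;
      	}
      
      	static int __init my_log_init(void)
      	{
      		int ret = register_pernet_subsys(&my_net_ops);
      
      		if (ret < 0)
      			return ret;
      		/* one global registration of the logger itself */
      		return nf_log_register(NFPROTO_IPV4, &my_logger);
      	}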
      Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      30e0c6a6
    • netfilter: make /proc/net/netfilter pernet · f3c1a44a
      Committed by Gao feng
      This patch makes this proc dentry pernet. So far, only init_net
      has had a /proc/net/netfilter directory.
      Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      f3c1a44a
  9. 05 Apr, 2013 (2 commits)
    • net: count hw_addr syncs so that unsync works properly. · 4543fbef
      Committed by Vlad Yasevich
      A few drivers use dev_uc_sync/unsync to synchronize the
      address lists from a master down to slave/lower devices.  In
      some cases (bond/team) a single address list is synced down
      to multiple devices.  At the time of unsync, we have a leak
      in these lower devices, because "synced" is treated as a
      boolean and the address will not be unsynced for anything after
      the first device/call.
      
      Treat "synced" as a count (same as refcount) and allow all
      unsync calls to work.
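      
      A toy model of the change in standalone C (the kernel tracks this
      per address in netdev_hw_addr; names here are illustrative):
      
      	#include <stdio.h>
      
      	struct ha { int synced; };	/* was effectively a boolean */
      
      	static void sync_addr(struct ha *a)
      	{
      		a->synced++;		/* one count per lower device */
      	}
      
      	static int unsync_addr(struct ha *a)
      	{
      		if (!a->synced)
      			return -1;	/* nothing left to unsync */
      		a->synced--;
      		return 0;
      	}
      
      	int main(void)
      	{
      		struct ha a = { 0 };
      
      		sync_addr(&a);		/* synced to slave 1 */
      		sync_addr(&a);		/* synced to slave 2 */
      		/* with a boolean, the second unsync was a no-op leak */
      		printf("%d %d\n", unsync_addr(&a), unsync_addr(&a));
      		return 0;
      	}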
      Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4543fbef
    • net: frag queue per hash bucket locking · 19952cc4
      Committed by Jesper Dangaard Brouer
      This patch implements per hash bucket locking for the frag queue
      hash.  This removes two write locks, and the only remaining write
      lock is for protecting hash rebuild.  This essentially reduces the
      readers-writer lock to a rebuild lock.
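      
      The structure of the idea, as a standalone C model (one lock per
      hash chain instead of one readers-writer lock for the whole table;
      INETFRAGS_HASHSZ mirrors the kernel constant):
      
      	#include <pthread.h>
      	#include <stdio.h>
      
      	#define INETFRAGS_HASHSZ 64
      
      	struct frag_bucket {
      		pthread_spinlock_t lock;
      		void *chain;		/* hlist of frag queues in the kernel */
      	};
      
      	static struct frag_bucket frag_hash[INETFRAGS_HASHSZ];
      
      	static void bucket_insert(unsigned int h, void *q)
      	{
      		struct frag_bucket *b = &frag_hash[h % INETFRAGS_HASHSZ];
      
      		pthread_spin_lock(&b->lock);	/* contend per bucket only */
      		b->chain = q;			/* link into the chain ... */
      		pthread_spin_unlock(&b->lock);
      	}
      
      	int main(void)
      	{
      		for (int i = 0; i < INETFRAGS_HASHSZ; i++)
      			pthread_spin_init(&frag_hash[i].lock,
      					  PTHREAD_PROCESS_PRIVATE);
      		bucket_insert(123, NULL);
      		puts("ok");
      		return 0;
      	}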
      
      This patch is part of "net: frag performance followup"
       http://thread.gmane.org/gmane.linux.network/263644
      of which two patches have already been accepted.
      
      Same test setup as previously:
       (http://thread.gmane.org/gmane.linux.network/257155)
       Two 10G interfaces, on separate NUMA nodes, are under test and use
       Ethernet flow control.  A third interface is used for generating the
       DoS attack (with trafgen).
      
      Note that I have changed the frag DoS generator script to be more
      efficient/deadly.  Before, it would only hit one RX queue; now it
      sends packets causing multi-queue RX, due to "better" RX hashing.
      
      Test types summary (netperf UDP_STREAM):
       Test-20G64K     == 2x10G with 65K fragments
       Test-20G3F      == 2x10G with 3x fragments (3*1472 bytes)
       Test-20G64K+DoS == Same as 20G64K with frag DoS
       Test-20G3F+DoS  == Same as 20G3F  with frag DoS
       Test-20G64K+MQ  == Same as 20G64K with Multi-Queue frag DoS
       Test-20G3F+MQ   == Same as 20G3F  with Multi-Queue frag DoS
      
      When I rebased this patch (03) (on top of net-next commit a210576c)
      and removed the _bh spinlock, I saw a performance regression.  But
      this was caused by some unrelated change in between.  See tests below.
      
      Test (A) is what I reported before for patch-02, accepted in commit 1b5ab0de.
      Test (B) is a verifying retest of commit 1b5ab0de, corresponding to patch-02.
      Test (C) is what I reported before for this patch.
      
      Test (D) is net-next master HEAD (commit a210576c), which reveals some
      (unknown) performance regression (compared against test (B)).
      Test (D) functions as a new base test.
      
      Performance table summary (in Mbit/s):
      
      (#) Test-type:  20G64K    20G3F    20G64K+DoS  20G3F+DoS  20G64K+MQ 20G3F+MQ
          ----------  -------   -------  ----------  ---------  --------  -------
      (A) Patch-02  : 18848.7   13230.1   4103.04     5310.36     130.0    440.2
      (B) 1b5ab0de  : 18841.5   13156.8   4101.08     5314.57     129.0    424.2
      (C) Patch-03v1: 18838.0   13490.5   4405.11     6814.72     196.6    461.6
      
      (D) a210576c  : 18321.5   11250.4   3635.34     5160.13     119.1    405.2
      (E) with _bh  : 17247.3   11492.6   3994.74     6405.29     166.7    413.6
      (F) without bh: 17471.3   11298.7   3818.05     6102.11     165.7    406.3
      
      Tests (E) and (F) are this patch (03), with (V1) and without (V2) the _bh spinlocks.
      
      I cannot explain the slowdown for 20G64K (but it is an artificial
      "lab test", so I'm not worried).  The other results do show
      improvements, and the test (E) "with _bh" version is slightly better.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: Eric Dumazet <edumazet@google.com>
      
      ----
      V2:
      - By analysis from Hannes Frederic Sowa and Eric Dumazet, we don't
        need the spinlock _bh versions, as Netfilter currently does a
        local_bh_disable() before entering inet_fragment.
      - Fold-in desc from cover-mail
      V3:
      - Drop the chain_len counter per hash bucket.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      19952cc4
  10. 03 Apr, 2013 (2 commits)
  11. 02 Apr, 2013 (17 commits)
    • 152b0f5d
    • PM / devfreq: Fix compiler warnings for CONFIG_PM_DEVFREQ unset · 5faaa035
      Committed by Rajagopal Venkat
      Fix compiler warnings generated when devfreq is not enabled
      (CONFIG_PM_DEVFREQ is not set).
      Signed-off-by: Rajagopal Venkat <rajagopal.venkat@linaro.org>
      Acked-by: MyungJoo Ham <myungjoo.ham@samsung.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      5faaa035
    • netfilter: xt_NFQUEUE: introduce CPU fanout · 8746ddcf
      Committed by holger@eitzenberger.org
      The current NFQUEUE target uses a hash, computed over the source and
      destination addresses (and other parameters), for steering the packet
      to the actual NFQUEUE. This, however, ignores the fact that the
      packet is eventually handled by a particular CPU at the user's request.
      
      If, for example,
      
        1) IRQ affinity is used to handle packets on a particular CPU already
           (in both the single-queue and multi-queue cases)
      
      and/or
      
        2) RPS is used to steer packets to a specific softirq
      
      the target easily chooses an NFQUEUE which is not handled by a process
      pinned to the same CPU.
      
      The idea is therefore to use the CPU index for determining the
      NFQUEUE handling the packet.
      
      For example, on a system with 4 CPUs, 4 MQ queues and 4 NFQUEUEs, it
      looks like this:
      
       +-----+  +-----+  +-----+  +-----+
       |NFQ#0|  |NFQ#1|  |NFQ#2|  |NFQ#3|
       +-----+  +-----+  +-----+  +-----+
          ^        ^        ^        ^
          |        |NFQUEUE |        |
          +        +        +        +
       +-----+  +-----+  +-----+  +-----+
       |rx-0 |  |rx-1 |  |rx-2 |  |rx-3 |
       +-----+  +-----+  +-----+  +-----+
      
      The NFQUEUEs do not necessarily have to start at number 0, and setups
      with fewer NFQUEUEs than packet-handling CPUs are not a problem
      either: the queue is chosen by CPU index modulo the number of
      configured queues, as sketched below.
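      
      The selection itself is one line; a standalone sketch (illustrative,
      not the exact kernel code):
      
      	#include <stdio.h>
      
      	/* queues are numbered [queuenum, queuenum + queues_total) */
      	static unsigned int pick_queue(unsigned int queuenum,
      				       unsigned int queues_total,
      				       unsigned int cpu)
      	{
      		return queuenum + cpu % queues_total;
      	}
      
      	int main(void)
      	{
      		for (unsigned int cpu = 0; cpu < 4; cpu++)
      			printf("cpu %u -> NFQ#%u\n", cpu, pick_queue(0, 4, cpu));
      		return 0;
      	}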
      
      This patch extends the NFQUEUE target to accept a new
      NFQ_FLAG_CPU_FANOUT flag. If this is specified, the target uses the
      CPU index to determine the NFQUEUE being used. I had to introduce
      rev3 for this; the 'flags' are folded into the _v2 'bypass' field.
      
      By changing the way the queue is assigned, I am able to improve
      performance if the processes reading the NFQUEUEs are pinned
      correctly.
      Signed-off-by: Holger Eitzenberger <holger@eitzenberger.org>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      8746ddcf
    • ipvs: convert services to rcu · ceec4c38
      Committed by Julian Anastasov
      This is the final step in RCU conversion.
      
      Things that are removed:
      
      - svc->usecnt: now svc is accessed under RCU read lock
      - svc->inc: and some unused code
      - ip_vs_bind_pe and ip_vs_unbind_pe: no ability to replace PE
      - __ip_vs_svc_lock: replaced with RCU
      - IP_VS_WAIT_WHILE: now readers look up svcs and dests under
      	RCU and work in parallel with configuration
      
      Other changes:
      
      - previously, an RCU read-side critical section included the
      call to the schedule method; now it is extended to include the
      service lookup
      - ip_vs_svc_table and ip_vs_svc_fwm_table are now using hlist
      - svc->pe and svc->scheduler remain until the end (of the grace period);
      	the schedulers are prepared for such RCU readers
      	even after done_service is called, but they need
      	to use synchronize_rcu because the last ip_vs_scheduler_put
      	can happen while RCU read-side critical sections
      	still use an outdated svc->scheduler pointer
      - as planned, update_service is removed
      - empty services can be freed immediately after the grace period.
      	If dests were present, the services are freed from
      	the dest trash code
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
      ceec4c38
    • ipvs: convert dests to rcu · 413c2d04
      Committed by Julian Anastasov
      In previous commits the schedulers started to access
      svc->destinations with _rcu list traversal primitives
      because the IP_VS_WAIT_WHILE macro still plays the role of
      grace period. Now it is time to finish the updating part,
      i.e. adding and deleting dests with the _rcu suffix primitives, before
      removing IP_VS_WAIT_WHILE in the next commit.
      
      We use the same rule for conns as for the
      schedulers: dests can be searched in an RCU read-side critical
      section, where ip_vs_dest_hold can be called by ip_vs_bind_dest.
      
      Some things are not perfect; for example, we call
      functions like ip_vs_lookup_dest from updating code under
      RCU, just because we use the same function from both the reader
      and the updater.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
      413c2d04
    • ipvs: convert sched_lock to spin lock · ba3a3ce1
      Committed by Julian Anastasov
      As all read_locks are gone, a plain spin lock is preferred.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
      ba3a3ce1
    • ipvs: do not expect result from done_service · ed3ffc4e
      Committed by Julian Anastasov
      This method releases the scheduler state;
      it cannot fail. This change will help to properly
      replace the scheduler in a following patch.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
      ed3ffc4e
    • ipvs: reorganize dest trash · 578bc3ef
      Committed by Julian Anastasov
      All dests will go to the trash, with no exceptions.
      But we have to use the new list node t_list for this, due
      to RCU changes in the following patches. Dests will wait there
      for the initial grace period, and later for all conns and schedulers
      to put their references. Dests no longer get a reference for
      staying in the dest trash, as they did before.
      
      	As a result, we do not load ip_vs_dest_put with
      extra checks for the last refcnt, and the schedulers do not
      need to play games with atomic_inc_not_zero while
      selecting the best destination.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
      578bc3ef
    • ipvs: add ip_vs_dest_hold and ip_vs_dest_put · fca9c20a
      Committed by Julian Anastasov
      ip_vs_dest_hold will be used under the RCU lock,
      while ip_vs_dest_put can be called even after the dest
      is removed from the service, as happens for conns and
      some schedulers.
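      
      The pair reduces to a plain refcount; a toy standalone C analogue
      (in the kernel, hold/put wrap atomic operations on dest->refcnt,
      with actual freeing handled by the dest trash):
      
      	#include <stdatomic.h>
      	#include <stdlib.h>
      
      	struct dest {
      		atomic_int refcnt;
      	};
      
      	static void dest_hold(struct dest *d)
      	{
      		/* only legal while d is still reachable (under RCU) */
      		atomic_fetch_add(&d->refcnt, 1);
      	}
      
      	static void dest_put(struct dest *d)
      	{
      		if (atomic_fetch_sub(&d->refcnt, 1) == 1)
      			free(d);	/* last reference gone */
      	}
      
      	int main(void)
      	{
      		struct dest *d = calloc(1, sizeof(*d));
      
      		dest_hold(d);	/* e.g. taken by ip_vs_bind_dest */
      		dest_put(d);	/* dropped when the conn unbinds */
      		return 0;
      	}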
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
      fca9c20a
    • ipvs: preparations for using rcu in schedulers · 6b6df466
      Committed by Julian Anastasov
      Allow schedulers to use rcu_dereference when
      returning a destination on lookup. The RCU read-side critical
      section will allow ip_vs_bind_dest to take a dest refcnt, as
      preparation for the step where destinations will be
      deleted without the IP_VS_WAIT_WHILE guard that holds
      packet processing during updates.
      
      	Add new optional scheduler methods add_dest,
      del_dest and upd_dest. For now the methods are called
      together with update_service, but update_service will be
      removed in a following change.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
      6b6df466
    • ipvs: avoid kmem_cache_zalloc in ip_vs_conn_new · 9a05475c
      Committed by Julian Anastasov
      We have many fields to set and few to reset,
      so use kmem_cache_alloc instead to save some cycles.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
      9a05475c
    • ipvs: reorder keys in connection structure · 1845ed0b
      Committed by Julian Anastasov
      __ip_vs_conn_in_get and ip_vs_conn_out_get are
      hot paths. Optimize them so that ports are matched first.
      By moving net and fwmark lower, on a 32-bit arch we can fit
      caddr into a 32-byte cache line and all addresses into a 64-byte
      cache line.
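      
      The layout idea, illustrated in standalone C (fields simplified, not
      the exact ip_vs_conn definition):
      
      	#include <stdio.h>
      	#include <stddef.h>
      	#include <netinet/in.h>
      
      	struct conn_keys {
      		unsigned short	cport, dport, vport;	/* matched first */
      		unsigned short	af;
      		struct in6_addr	caddr, vaddr, daddr;	/* addresses next */
      		void		*net;			/* colder: moved below */
      		unsigned int	fwmark;
      	};
      
      	int main(void)
      	{
      		/* ports plus caddr end at byte 24, within a 32-byte line */
      		printf("caddr ends at byte %zu\n",
      		       offsetof(struct conn_keys, caddr) +
      		       sizeof(struct in6_addr));
      		return 0;
      	}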
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
      1845ed0b
    • ipvs: convert connection locking · 088339a5
      Committed by Julian Anastasov
      Convert __ip_vs_conntbl_lock_array as follows:
      
      - readers that do not modify conn lists will use RCU lock
      - updaters that modify lists will use spinlock_t
      
      Conn lookups will now use an RCU read-side
      critical section. Without using __ip_vs_conn_get, such
      places have access to connection fields and can
      dereference some pointers like pe and pe_data, plus
      the ability to update the timer expiration. If full access
      is required, we contend for a reference.
      
      We add a barrier in __ip_vs_conn_put, so that
      other CPUs see the refcnt operation after the other writes.
      
      With the introduction of ip_vs_conn_unlink(),
      we try to reorganize ip_vs_conn_expire() so that
      unhashing of connections that should stay longer is
      avoided, even if only for a very short time.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
      088339a5
    • ipvs: remove rs_lock by using RCU · 276472ea
      Committed by Julian Anastasov
      rs_lock was used to protect rs_table (a hash table)
      from updaters (under the global mutex) and readers (packet handlers).
      We can remove rs_lock by using the RCU lock for readers. Reclaiming
      a dest with just kfree_rcu is enough, because the readers access
      only fields of the ip_vs_dest structure.
      
      Use hlist for rs_table.
      
      As we are now using hlist_del_rcu, introduce an in_rs_table
      flag as a replacement for the list_empty checks, which do not
      work with RCU. It is needed because only NAT dests are in
      the rs_table.
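      
      The reader/updater split can be modelled in userspace with liburcu
      (link with -lurcu; a simplified single-pointer analogue of the
      rs_table lookup, not the kernel code):
      
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <urcu.h>
      
      	struct dest { int addr; };
      
      	static struct dest *rs_entry;	/* RCU-protected pointer */
      
      	static int lookup(void)		/* packet-path reader */
      	{
      		struct dest *d;
      		int a = -1;
      
      		rcu_read_lock();
      		d = rcu_dereference(rs_entry);
      		if (d)
      			a = d->addr;
      		rcu_read_unlock();
      		return a;
      	}
      
      	static void replace(struct dest *newd)	/* config-path updater */
      	{
      		struct dest *old = rs_entry;
      
      		rcu_assign_pointer(rs_entry, newd);
      		synchronize_rcu();	/* kfree_rcu defers this in the kernel */
      		free(old);		/* no reader can still see it */
      	}
      
      	int main(void)
      	{
      		struct dest *d = malloc(sizeof(*d));
      
      		rcu_register_thread();
      		d->addr = 42;
      		rcu_assign_pointer(rs_entry, d);
      		printf("%d\n", lookup());
      		replace(NULL);
      		rcu_unregister_thread();
      		return 0;
      	}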
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
      276472ea
    • ipvs: convert app locks · 363c97d7
      Committed by Julian Anastasov
      We use locks like tcp_app_lock, udp_app_lock and
      sctp_app_lock to protect access to the protocol hash tables
      from readers in packet context, while the application
      instances (inc) are [un]registered under the global mutex.
      
      As the hash tables are mostly read when conns are
      created and bound to an app, use RCU for readers and reclaim
      the app instance after a grace period.
      
      Simplify ip_vs_app_inc_get because we use usecnt
      only for statistics and rely on module refcounting.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
      363c97d7
    • ipvs: optimize dst usage for real server · 026ace06
      Committed by Julian Anastasov
      Currently, when forwarding requests to real servers,
      we use dst_lock and atomic operations when cloning the
      dst_cache value. As the dst_cache value does not change
      most of the time, it is better to use RCU and to take
      dst_lock only when we need to replace the obsolete dst.
      For this to work, we keep dst_cache in a new structure protected
      by RCU. For packets to remote real servers we will use the noref
      version of dst_cache; it will be valid while we are in an RCU
      read-side critical section, because dst_release for replaced
      dsts will now be invoked after the grace period. Packets to
      local real servers that are passed to the local stack with
      NF_ACCEPT need a dst clone.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
      026ace06
    • ipvs: rename functions related to dst_cache reset · d1deae4d
      Committed by Julian Anastasov
      Move and give better names to two functions:
      
      - ip_vs_dst_reset to __ip_vs_dst_cache_reset
      - __ip_vs_dev_reset to ip_vs_forget_dev
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
      d1deae4d