Commits · 335c54bdc4d3bacdbd619ec95cd0b352435bd37f · openanolis / cloud-kernel

29 Apr, 2009 3 commits

NFSD: Prevent a buffer overflow in svc_xprt_names() · 335c54bd

Committed by Chuck Lever on Apr 23, 2009

The svc_xprt_names() function can overflow its buffer if it's so near
the end of the passed in buffer that the "name too long" string still
doesn't fit.  Of course, it could never tell if it was near the end
of the passed in buffer, since its only caller passes in zero as the
buffer length.

Let's make this API a little safer.

Change svc_xprt_names() so it *always* checks for a buffer overflow,
and change its only caller to pass in the correct buffer length.

If svc_xprt_names() does overflow its buffer, it now fails with an
ENAMETOOLONG errno, instead of trying to write a message at the end
of the buffer.  I don't like this much, but I can't figure out a clean
way that's always safe to return some of the names, *and* an
indication that the buffer was not long enough.

The displayed error when doing a 'cat /proc/fs/nfsd/portlist' is
"File name too long".
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>

335c54bd
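
The behaviour described in 335c54bd can be sketched in user-space C. This is not the kernel's svc_xprt_names() code: `append_names` and its parameters are hypothetical names chosen for illustration. The sketch assumes only what the message states, namely that the buffer length is always checked and that an overflow produces ENAMETOOLONG instead of a partial message.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: append each name plus a newline to buf.
 * Returns the number of bytes written, or -ENAMETOOLONG if the
 * names do not all fit in buflen bytes (including the final NUL).
 */
static int append_names(char *buf, size_t buflen,
                        const char *const *names, size_t count)
{
    size_t used = 0;

    /* A zero-length buffer can never hold anything, not even a NUL. */
    if (buflen == 0)
        return -ENAMETOOLONG;
    buf[0] = '\0';

    for (size_t i = 0; i < count; i++) {
        /* snprintf returns the length it wanted, so truncation is detectable. */
        int n = snprintf(buf + used, buflen - used, "%s\n", names[i]);

        if (n < 0 || (size_t)n >= buflen - used) {
            buf[0] = '\0';  /* do not hand back a partial list */
            return -ENAMETOOLONG;
        }
        used += (size_t)n;
    }
    return (int)used;
}
```

A caller would pass the real size of its buffer, for example `append_names(buf, sizeof(buf), names, count)`, which is the part the original caller got wrong by passing zero.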

SUNRPC: Fix error return value of svc_addr_len() · abc5c44d

Committed by Chuck Lever on Apr 23, 2009

The svc_addr_len() helper function returns -EAFNOSUPPORT if it doesn't
recognize the address family of the passed-in socket address. However,
the return type of this function is size_t, which means -EAFNOSUPPORT
is turned into a very large positive value in this case.

The check in svc_udp_recvfrom() to see if the return value is less
than zero therefore won't work at all.

Additionally, handle_connect_req() passes this value directly to
memset(). This could cause memset() to clobber a large chunk of memory
if svc_addr_len() has returned an error. Currently the address family
of these addresses, however, is known to be supported long before
handle_connect_req() is called, so this isn't a real risk.

Change the error return value of svc_addr_len() to zero, which fits in
the range of size_t, and is safer to pass to memset() directly.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>

abc5c44d
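
The size_t problem described in abc5c44d is easy to reproduce outside the kernel. The sketch below is a user-space illustration, not the kernel's svc_addr_len(); the function names `addr_len_buggy` and `addr_len_fixed` are made up for this example.

```c
#include <errno.h>
#include <netinet/in.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>

/* Buggy shape: a negative errno converted to size_t becomes a huge length. */
static size_t addr_len_buggy(sa_family_t family)
{
    switch (family) {
    case AF_INET:
        return sizeof(struct sockaddr_in);
    case AF_INET6:
        return sizeof(struct sockaddr_in6);
    }
    /* Cast made explicit here; the conversion is implicit in the buggy form. */
    return (size_t)-EAFNOSUPPORT;
}

/* Fixed shape: zero fits in size_t and is harmless as a memset() length. */
static size_t addr_len_fixed(sa_family_t family)
{
    switch (family) {
    case AF_INET:
        return sizeof(struct sockaddr_in);
    case AF_INET6:
        return sizeof(struct sockaddr_in6);
    }
    return 0;
}
```

With the buggy form, a check such as `len < 0` can never be true because size_t is unsigned, which is why the test in svc_udp_recvfrom() could not work.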

net/sunrpc/svc_xprt.c: fix sparse warnings · dcf1a357

Committed by H Hartley Sweeten on Apr 22, 2009

Fix the following sparse warnings in net/sunrpc/svc_xprt.c.

warning: symbol 'svc_recv' was not declared. Should it be static?
warning: symbol 'svc_drop' was not declared. Should it be static?
warning: symbol 'svc_send' was not declared. Should it be static?
warning: symbol 'svc_close_all' was not declared. Should it be static?
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>

dcf1a357

04 Apr, 2009 1 commit

nfsd: don't use the deferral service, return NFS4ERR_DELAY · 2f425878

Committed by Andy Adamson on Apr 03, 2009

On an NFSv4.1 server cache miss that causes an upcall, NFS4ERR_DELAY will be
returned. It is up to the NFSv4.1 client to resend only the operations that
have not been processed.

Initialize rq_usedeferral to 1 in svc_process(). It will be turned off in
nfsd4_proc_compound() only when NFSv4.1 Sessions are used.

Note: this isn't an adequate solution on its own. It's acceptable as a way
to get some minimal 4.1 up and working, but we're going to have to find a
way to avoid returning DELAY in all common cases before 4.1 can really be
considered ready.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfsd41: reverse rq_nodeferral negative logic]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[sunrpc: initialize rq_usedeferral]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>

2f425878

02 Apr, 2009 1 commit

SUNRPC: Ensure IPV6_V6ONLY is set on the socket before binding to a port · c69da774

Committed by Trond Myklebust on Mar 30, 2009

Also ensure that we use the protocol family instead of the address
family when calling sock_create_kern().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

c69da774
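
The ordering that c69da774 enforces has a direct user-space analogue. The sketch below is not the kernel code (which uses sock_create_kern()); `make_v6only_socket` is a hypothetical helper that sets IPV6_V6ONLY before calling bind(), and passes a PF_ constant to socket() while using an AF_ constant in the address.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a PF_INET6 TCP socket, set IPV6_V6ONLY,
 * and only then bind it. Returns the fd, or -1 on any failure.
 */
static int make_v6only_socket(unsigned short port)
{
    struct sockaddr_in6 sin6;
    int on = 1;
    /* socket() takes a protocol family (PF_*), not an address family. */
    int fd = socket(PF_INET6, SOCK_STREAM, IPPROTO_TCP);

    if (fd < 0)
        return -1;

    /* Set the option first, so the socket is never bound without it. */
    if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &on, sizeof(on)) < 0)
        goto fail;

    memset(&sin6, 0, sizeof(sin6));
    sin6.sin6_family = AF_INET6;   /* addresses carry an address family */
    sin6.sin6_addr = in6addr_any;
    sin6.sin6_port = htons(port);
    if (bind(fd, (struct sockaddr *)&sin6, sizeof(sin6)) < 0)
        goto fail;

    return fd;

fail:
    close(fd);
    return -1;
}
```

Passing port 0 lets the system pick a free port, which is convenient for trying the helper on a machine where IPv6 is enabled.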

31 Mar, 2009 1 commit

proc 2/2: remove struct proc_dir_entry::owner · 99b76233

Committed by Alexey Dobriyan on Mar 25, 2009

Setting ->owner as done currently (pde->owner = THIS_MODULE) is racy
as correctly noted at bug #12454. Someone can look up an entry with a NULL
->owner, thus not pinning anything, and release it later, resulting
in a module refcount underflow.

We can keep ->owner and supply it at registration time like ->proc_fops
and ->data.

But this leaves ->owner as an easily manipulated field (just one C assignment),
and somebody will forget to unpin the previous module and pin the current one
when switching ->owner. ->proc_fops is declared as "const", which should give
some pause for thought.

->read_proc/->write_proc were just fixed to not require ->owner for
protection.

rmmod'ed directories will be empty and return "." and ".." -- no harm.
And directories with tricky enough readdir and lookup shouldn't be modular.
We definitely don't want such modular code.

Removing ->owner will also make PDE smaller.

So, let's nuke it.

Kudos to Jeff Layton for reminding about this, let's say, oversight.

http://bugzilla.kernel.org/show_bug.cgi?id=12454
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>

99b76233

30 Mar, 2009 2 commits

trivial: fix typos/grammar errors in Kconfig texts · 692105b8

Committed by Matt LaPlante on Jan 26, 2009

Signed-off-by: Matt LaPlante <kernel1@cyberdogtech.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>

692105b8

cpumask: use new cpumask_ functions in core code. · aa85ea5b

Committed by Rusty Russell on Mar 30, 2009

Impact: cleanup

Time to clean up remaining laggards using the old cpu_ functions.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Trond.Myklebust@netapp.com

aa85ea5b

29 Mar, 2009 16 commits

SUNRPC: Remove CONFIG_SUNRPC_REGISTER_V4 · 93559828

Committed by Chuck Lever on Mar 18, 2009

We just augmented the kernel's RPC service registration code so that
it automatically adjusts to what is supported in user space. Thus we
no longer need the kernel configuration option to enable registering
RPC services with v4 -- it's all done automatically.

This patch is part of a series that addresses
http://bugzilla.kernel.org/show_bug.cgi?id=12256
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

93559828

SUNRPC: rpcb_register() should handle errors silently · 363f724c

Committed by Chuck Lever on Mar 18, 2009

Move error reporting for RPC registration to rpcb_register's caller.

This way the caller can choose to recover silently from certain
errors, but report errors it does not recognize. Error reporting
for kernel RPC service registration is now handled in one place.

363f724c

SUNRPC: Simplify kernel RPC service registration · cadc0fa5

Committed by Chuck Lever on Mar 18, 2009

The kernel registers RPC services with the local portmapper with an
rpcbind SET upcall to the local portmapper.  Traditionally, this used
rpcbind v2 (PMAP), but registering RPC services that support IPv6
requires rpcbind v3 or v4.

Since we now want separate PF_INET and PF_INET6 listeners for each
kernel RPC service, svc_register() will do only one of those
registrations at a time.

For PF_INET, it tries an rpcb v4 SET upcall first; if that fails, it
does a legacy portmap SET.  This makes it entirely backwards
compatible with legacy user space, but allows a proper v4 SET to be
used if rpcbind is available.

For PF_INET6, it does an rpcb v4 SET upcall.  If that fails, it fails
the registration, and thus the transport creation.  This lets the
kernel detect if user space is able to support IPv6 RPC services, and
thus whether it should maintain a PF_INET6 listener for each service
at all.

This provides complete backwards compatibility with legacy user space
that only supports rpcbind v2.  The only down-side is that registering
a new kernel RPC service may take an extra exchange with the local
portmapper on legacy systems, but this is an infrequent operation and
is done over UDP (no lingering sockets in TIMEWAIT), so it shouldn't
be consequential.

This patch is part of a series that addresses
   http://bugzilla.kernel.org/show_bug.cgi?id=12256
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

cadc0fa5
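
The per-family fallback described in cadc0fa5 can be sketched as follows. This is a simplified model, not the kernel's svc_register(): the upcalls are passed in as hypothetical function pointers, the two `stub_` functions exist only so the sketch can be exercised, and returning -EAFNOSUPPORT for an unknown family is an assumption borrowed from the convention mentioned in 4b62e58c.

```c
#include <errno.h>
#include <sys/socket.h>

/* Hypothetical upcall type: returns 0 on success or a negative errno. */
typedef int (*set_upcall_fn)(void);

/* Stand-in upcalls used only to exercise the sketch. */
static int stub_ok(void)   { return 0; }
static int stub_fail(void) { return -EPROTONOSUPPORT; }

/*
 * Register one listener with the local portmapper.
 * PF_INET:  try an rpcbind v4 SET, then fall back to a legacy v2 SET.
 * PF_INET6: try an rpcbind v4 SET only; failure fails the registration.
 */
static int register_service(int family, set_upcall_fn rpcb_v4_set,
                            set_upcall_fn pmap_v2_set)
{
    switch (family) {
    case PF_INET:
        if (rpcb_v4_set() == 0)
            return 0;
        return pmap_v2_set();   /* the extra exchange on legacy systems */
    case PF_INET6:
        return rpcb_v4_set();   /* no legacy fallback for IPv6 */
    }
    return -EAFNOSUPPORT;       /* assumed convention, see lead-in */
}
```

For example, `register_service(PF_INET, stub_fail, stub_ok)` models a legacy portmapper: the v4 SET fails, the v2 SET succeeds, and the registration as a whole succeeds.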

SUNRPC: Simplify svc_unregister() · d5a8620f

Committed by Chuck Lever on Mar 18, 2009

Our initial implementation of svc_unregister() assumed that PMAP_UNSET
cleared all rpcbind registrations for a [program, version] tuple.
However, we now have evidence that PMAP_UNSET clears only "inet"
entries, and not "inet6" entries, in the rpcbind database.

For backwards compatibility with the legacy portmapper, the
svc_unregister() function also must work if user space doesn't support
rpcbind version 4 at all.

Thus we'll send an rpcbind v4 UNSET, and if that fails, we'll send a
PMAP_UNSET.

This simplifies the code in svc_unregister() and provides better
backwards compatibility with legacy user space that does not support
rpcbind version 4.  We can get rid of the conditional compilation in
here as well.

This patch is part of a series that addresses
   http://bugzilla.kernel.org/show_bug.cgi?id=12256
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

d5a8620f

SUNRPC: Allow callers to pass rpcb_v4_register a NULL address · 1673d0de

Committed by Chuck Lever on Mar 18, 2009

The user space TI-RPC library uses an empty string for the universal
address when unregistering all target addresses for [program, version].
The kernel's rpcb client should behave the same way.

Here, we are switching between several registration methods based on
the protocol family of the incoming address.  Rename the other rpcbind
v4 registration functions to make it clear that they, as well, are
switched on protocol family.  In /etc/netconfig, this is either "inet"
or "inet6".

NB: The loopback protocol families are not supported in the kernel.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

1673d0de

SUNRPC: rpcbind actually interprets r_owner string · 126e4bc3

Committed by Chuck Lever on Mar 18, 2009

RFC 1833 has little to say about the contents of r_owner; it only
specifies that it is a string, and states that it is used to control
who can UNSET an entry.

Our port of rpcbind (from Sun) assumes this string contains a numeric
UID value, not alphabetical or symbolic characters, but checks this
value only for AF_LOCAL RPCB_SET or RPCB_UNSET requests.  In all other
cases, rpcbind ignores the contents of the r_owner string.

The reference user space implementation of rpcb_set(3) uses a numeric
UID for all SET/UNSET requests (even via the network) and an empty
string for all other requests.  We emulate that behavior here to
maintain bug-for-bug compatibility.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

126e4bc3

SUNRPC: Clean up address type casts in rpcb_v4_register() · 3aba4553

Committed by Chuck Lever on Mar 18, 2009

Clean up: Simplify rpcb_v4_register() and its helpers by moving the
details of sockaddr type casting to rpcb_v4_register()'s helper
functions.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

3aba4553

SUNRPC: Don't return EPROTONOSUPPORT in svc_register()'s helpers · ba5c35e0

Committed by Chuck Lever on Mar 18, 2009

The RPC client returns -EPROTONOSUPPORT if there is a protocol version
mismatch (i.e., the remote RPC server doesn't support the RPC protocol
version sent by the client).

Helpers for the svc_register() function return -EPROTONOSUPPORT if they
don't recognize the passed-in IPPROTO_ value.

These are two entirely different failure modes.

Have the helpers return -ENOPROTOOPT instead of -EPROTONOSUPPORT.  This
will allow callers to determine more precisely what the underlying
problem is, and decide to report or recover appropriately.

This patch is part of a series that addresses
   http://bugzilla.kernel.org/show_bug.cgi?id=12256
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

ba5c35e0
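
The distinction drawn in ba5c35e0 comes down to which errno a helper returns for an unrecognized IPPROTO_ value. The sketch below is illustrative only: `proto_to_netid` is a hypothetical helper, not one of svc_register()'s real helpers, and the "udp" and "tcp" strings stand in for whatever identifiers the real code passes to rpcbind.

```c
#include <errno.h>
#include <netinet/in.h>
#include <stddef.h>

/*
 * Hypothetical helper: map an IPPROTO_ value to a netid string.
 * Returns 0 and sets *netid on success, or -ENOPROTOOPT when the
 * IPPROTO_ value is not recognized. -EPROTONOSUPPORT is deliberately
 * not used here, so callers can reserve it for an RPC version mismatch.
 */
static int proto_to_netid(int protocol, const char **netid)
{
    switch (protocol) {
    case IPPROTO_UDP:
        *netid = "udp";
        return 0;
    case IPPROTO_TCP:
        *netid = "tcp";
        return 0;
    }
    return -ENOPROTOOPT;
}
```

A caller that sees -ENOPROTOOPT knows the problem is its own argument, while -EPROTONOSUPPORT coming back from the RPC client still means the remote side rejected the protocol version.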

SUNRPC: Use IPv4 loopback for registering AF_INET6 kernel RPC services · fc28decd

Committed by Chuck Lever on Mar 18, 2009

The kernel uses an IPv6 loopback address when registering its AF_INET6
RPC services so that it can tell whether the local portmapper is
actually IPv6-enabled.

Since the legacy portmapper doesn't listen on IPv6, however, this
causes a long timeout on older systems if the kernel happens to try
creating and registering an AF_INET6 RPC service.  Originally I wanted
to use a connected transport (either TCP or connected UDP) so that the
upcall would fail immediately if the portmapper wasn't listening on
IPv6, but we never agreed on what transport to use.

In the end, it's of little consequence to the kernel whether the local
portmapper is listening on IPv6.  It's only important whether the
portmapper supports rpcbind v4.  And the kernel can't tell that at all
if it is sending requests via IPv6 -- the portmapper will just ignore
them.

So, send both rpcbind v2 and v4 SET/UNSET requests via IPv4 loopback
to maintain better backwards compatibility between new kernels and
legacy user space, and prevent multi-second hangs in some cases when
the kernel attempts to register RPC services.

This patch is part of a series that addresses

   http://bugzilla.kernel.org/show_bug.cgi?id=12256
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

fc28decd

SUNRPC: Set IPV6ONLY flag on PF_INET6 RPC listener sockets · 7d21c0f9

Committed by Chuck Lever on Mar 18, 2009

We are about to convert to using separate RPC listener sockets for
PF_INET and PF_INET6. This echoes the way IPv6 is handled in user
space by TI-RPC, and eliminates the need for ULPs to worry about
mapped IPv4 AF_INET6 addresses when doing address comparisons.

Start by setting the IPV6ONLY flag on PF_INET6 RPC listener sockets.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

7d21c0f9

SUNRPC: Remove @family argument from svc_create() and svc_create_pooled() · 49a9072f

Committed by Chuck Lever on Mar 18, 2009

Since an RPC service listener's protocol family is specified now via
svc_create_xprt(), it no longer needs to be passed to svc_create() or
svc_create_pooled(). Remove that argument from the synopsis of those
functions, and remove the sv_family field from the svc_serv struct.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

49a9072f

SUNRPC: Change svc_create_xprt() to take a @family argument · 9652ada3

Committed by Chuck Lever on Mar 18, 2009

The sv_family field is going away.  Pass a protocol family argument to
svc_create_xprt() instead of extracting the family from the passed-in
svc_serv struct.

Again, as this is a listener socket and not an address, we make this
new argument an "int" protocol family, instead of an "sa_family_t."
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

9652ada3

SUNRPC: svc_setup_socket() gets protocol family from socket · baf01caf

Committed by Chuck Lever on Mar 18, 2009

Since the sv_family field is going away, modify svc_setup_socket() to
extract the protocol family from the passed-in socket instead of from
the passed-in svc_serv struct.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

baf01caf

SUNRPC: Pass a family argument to svc_register() · 4b62e58c

Committed by Chuck Lever on Mar 18, 2009

The sv_family field is going away. Instead of using sv_family, have
the svc_register() function take a protocol family argument.

Since this argument represents a protocol family, and not an address
family, this argument takes an int, as this is what is passed to
sock_create_kern(). Also make sure svc_register's helpers are
checking for PF_FOO instead of AF_FOO. The values of [AP]F_FOO are
equivalent; this is simply a symbolic change to reflect the semantics
of the value stored in that variable.

sock_create_kern() should return EPFNOSUPPORT if the passed-in
protocol family isn't supported, but it uses EAFNOSUPPORT for this
case. We will stick with that tradition here, as svc_register()
is called by the RPC server in the same path as sock_create_kern().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

4b62e58c

SUNRPC: Clean up svc_find_xprt() calling sequence · 156e6209

Committed by Chuck Lever on Mar 18, 2009

Clean up: add documenting comment and use appropriate data types for
svc_find_xprt()'s arguments.

This also eliminates a mixed sign comparison: @port was an int, while
the return value of svc_xprt_local_port() is an unsigned short.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

156e6209

SUNRPC: Don't flag empty RPCB_GETADDR reply as bogus · 776bd5c7

Committed by Chuck Lever on Mar 18, 2009

In 2007, commit e65fe397 added
additional sanity checking to rpcb_decode_getaddr() to make sure we
were getting a reply that was long enough to be an actual universal
address.  If the uaddr string isn't long enough, the XDR decoder
returns EIO.

However, an empty string is a valid RPCB_GETADDR response if the
requested service isn't registered.  Moreover, "::.n.m" is also a
valid RPCB_GETADDR response for IPv6 addresses that is shorter
than rpcb_decode_getaddr()'s lower limit of 11.  So this sanity
check introduced a regression for rpcbind requests against IPv6
remotes.

So revert the lower bound check added by commit
e65fe397, and add an explicit check
for an empty uaddr string, similar to libtirpc's rpcb_getaddr(3).
Pointed-out-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

776bd5c7
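
The regression described in 776bd5c7 can be shown with plain string lengths. The sketch below is not rpcb_decode_getaddr(); `passes_old_lower_bound` and `check_uaddr` are hypothetical names that contrast the reverted length check with an explicit empty-string check. The example address "::.8.1" is one instance of the "::.n.m" form mentioned in the message.

```c
#include <stdbool.h>
#include <string.h>

enum uaddr_status {
    UADDR_NOT_REGISTERED,  /* empty string: the service is not registered */
    UADDR_PRESENT          /* non-empty: pass on to the address parser */
};

/* The lower bound described in the message, which the commit reverts. */
static bool passes_old_lower_bound(const char *uaddr)
{
    return strlen(uaddr) >= 11;
}

/* Replacement idea: test only for the empty string, with no minimum length. */
static enum uaddr_status check_uaddr(const char *uaddr)
{
    if (uaddr[0] == '\0')
        return UADDR_NOT_REGISTERED;
    return UADDR_PRESENT;
}
```

"::.8.1" is six characters long, so the old bound rejects it even though it is a valid reply, while the shortest IPv4 form, "0.0.0.0.0.0", is exactly 11 characters and passes.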

28 Mar, 2009 2 commits

sunrpc/svc.c: Remove unused line 'rqstp->rq_server = serv;' in svc_process · abd91ee9

Committed by ideawu on Mar 26, 2009

There is no need to set rqstp->rq_server to serv, since serv is initialized from rqstp->rq_server on the previous line, and rqstp->rq_server does not change between the two lines.
Signed-off-by: ideawu <ideawu@163.com>
Reviewed-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>

abd91ee9

constify dentry_operations: rest · 3ba13d17

Committed by Al Viro on Feb 20, 2009

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

3ba13d17

20 Mar, 2009 5 commits

SVCRDMA: fix recent printk format warnings. · 2e3c230b

Committed by Tom Talpey on Mar 12, 2009

printk formats in prior commit were reversed/incorrect.
Compiled without warning on x86 and x86_64, but detected on ppc.
Signed-off-by: Tom Talpey <tmtalpey@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

2e3c230b

SUNRPC: Ensure we close the socket on EPIPE errors too... · 55420c24

Committed by Trond Myklebust on Mar 11, 2009

As long as one task is holding the socket lock, then calls to
xprt_force_disconnect(xprt) will not succeed in shutting down the socket.
In particular, this would mean that a server initiated shutdown will not
succeed until the lock is relinquished.
In order to avoid the deadlock, we should ensure that xs_tcp_send_request()
closes the socket on EPIPE errors too.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

55420c24

SUNRPC: xs_tcp_connect_worker{4,6}: merge common code · b61d59ff

Committed by Trond Myklebust on Mar 11, 2009

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

b61d59ff

SUNRPC: Add a sysctl to control the duration of the socket linger timeout · 25fe6142

Committed by Trond Myklebust on Mar 11, 2009

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

25fe6142

SUNRPC: Add the equivalent of the linger and linger2 timeouts to RPC sockets · 7d1e8255

Committed by Trond Myklebust on Mar 11, 2009

This fixes a regression against FreeBSD servers as reported by Tomas
Kasparek. Apparently when using RPC over a TCP socket, the FreeBSD servers
don't ever react to the client closing the socket, and so commit
e06799f9 (SUNRPC: Use shutdown() instead of
close() when disconnecting a TCP socket) causes the setup to hang forever
whenever the client attempts to close and then reconnect.

We break the deadlock by adding a 'linger2' style timeout to the socket,
after which, the client will abort the connection using a TCP 'RST'.

The default timeout is set to 15 seconds. A subsequent patch will put it
under user control by means of a sysctl.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

7d1e8255

19 Mar, 2009 3 commits

svcrpc: take advantage of tcp autotuning · 47a14ef1

Committed by Olga Kornievskaia on Oct 21, 2008

Allow the NFSv4 server to make use of TCP autotuning behaviour, which
was previously disabled by setting the sk_userlocks variable.

Set the receive buffers to be big enough to receive the whole RPC
request, and set this for the listening socket, not the accept socket.

Remove the code that readjusts the receive/send buffer sizes for the
accepted socket. Previously this code was used to influence the TCP
window management behaviour, which is no longer needed when autotuning
is enabled.

This can improve IO bandwidth on networks with high bandwidth-delay
products, where a large tcp window is required.  It also simplifies
performance tuning, since getting adequate tcp buffers previously
required increasing the number of nfsd threads.
Signed-off-by: Olga Kornievskaia <aglo@citi.umich.edu>
Cc: Jim Rees <rees@umich.edu>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>

47a14ef1

knfsd: add file to export stats about nfsd pools · 03cf6c9f

Committed by Greg Banks on Jan 13, 2009

Add /proc/fs/nfsd/pool_stats to export to userspace various
statistics about the operation of rpc server thread pools.

This patch is based on a forward-ported version of
knfsd-add-pool-thread-stats which has been shipping in the SGI
"Enhanced NFS" product since 2006 and which was previously
posted:

http://article.gmane.org/gmane.linux.nfs/10375

It has also been updated thus:

 * moved EXPORT_SYMBOL() to near the function it exports
 * made the new struct seq_operations const
 * used SEQ_START_TOKEN instead of ((void *)1)
 * merged fix from SGI PV 990526 "sunrpc: use dprintk instead of
   printk in svc_pool_stats_*()" by Harshula Jayasuriya.
 * merged fix from SGI PV 964001 "Crash reading pool_stats before
   nfsds are started".
Signed-off-by: Greg Banks <gnb@sgi.com>
Signed-off-by: Harshula Jayasuriya <harshula@sgi.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>

03cf6c9f

knfsd: avoid overloading the CPU scheduler with enormous load averages · 59a252ff

Committed by Greg Banks on Jan 13, 2009

Avoid overloading the CPU scheduler with enormous load averages
when handling high call-rate NFS loads. When the knfsd bottom half
is made aware of an incoming call by the socket layer, it tries to
choose an nfsd thread and wake it up. As long as there are idle
threads, one will be woken up.

If there are a lot of nfsd threads (a sensible configuration when
the server is disk-bound or is running an HSM), there will be many
more nfsd threads than CPUs to run them. Under a high call-rate
low service-time workload, the result is that almost every nfsd is
runnable, but only a handful are actually able to run. This situation
causes two significant problems:

1. The CPU scheduler takes over 10% of each CPU, which is robbing
the nfsd threads of valuable CPU time.

2. At a high enough load, the nfsd threads starve userspace threads
of CPU time, to the point where daemons like portmap and rpc.mountd
do not schedule for tens of seconds at a time. Clients attempting
to mount an NFS filesystem time out at the very first step (opening
a TCP connection to portmap) because portmap cannot wake up from
select() and call accept() in time.

Disclaimer: these effects were observed on a SLES9 kernel; modern
kernels' schedulers may behave more gracefully.

The solution is simple: keep in each svc_pool a counter of the number
of threads which have been woken but have not yet run, and do not wake
any more if that count reaches an arbitrary small threshold.

Testing was on a 4 CPU 4 NIC Altix using 4 IRIX clients, each with 16
synthetic client threads simulating an rsync (i.e. recursive directory
listing) workload reading from an i386 RH9 install image (161480
regular files in 10841 directories) on the server. That tree is small
enough to fill in the server's RAM so no disk traffic was involved.
This setup gives a sustained call rate in excess of 60000 calls/sec
before being CPU-bound on the server. The server was running 128 nfsds.

Profiling showed schedule() taking 6.7% of every CPU, and __wake_up()
taking 5.2%. This patch drops those contributions to 3.0% and 2.2%.
Load average was over 120 before the patch, and 20.9 after.

This patch is a forward-ported version of knfsd-avoid-nfsd-overload
which has been shipping in the SGI "Enhanced NFS" product since 2006.
It has been posted before:

http://article.gmane.org/gmane.linux.nfs/10374
Signed-off-by: Greg Banks <gnb@sgi.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>

59a252ff
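
The throttle that 59a252ff describes (count the threads that have been woken but have not yet run, and stop waking more once the count reaches a small threshold) can be modelled in a few lines. This is a single-threaded sketch with no locking, not the svc_pool code; the struct, the function names, and the threshold value of 5 are all assumptions made for illustration.

```c
#include <stdbool.h>

/* Hypothetical threshold; the message only calls it "arbitrary" and "small". */
#define WAKE_THRESHOLD 5

struct pool {
    unsigned int idle_threads;  /* sleeping threads available to be woken */
    unsigned int nr_waking;     /* woken, but not yet actually running */
};

/* Called when a request arrives. Returns true if a thread was woken. */
static bool pool_try_wake(struct pool *p)
{
    if (p->idle_threads == 0)
        return false;
    if (p->nr_waking >= WAKE_THRESHOLD)
        return false;           /* enough threads are already on their way */
    p->idle_threads--;
    p->nr_waking++;
    return true;
}

/* Called by a woken thread once it gets the CPU and starts working. */
static void pool_thread_running(struct pool *p)
{
    if (p->nr_waking > 0)
        p->nr_waking--;
}
```

With 128 idle threads and a burst of 100 requests, this model wakes only five threads until one of them reports that it is running, which is the effect the commit relies on to keep the run queue short.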

13 Mar, 2009 1 commit

cpumask: replace node_to_cpumask with cpumask_of_node. · a70f7302

Committed by Rusty Russell on Mar 13, 2009

Impact: cleanup

node_to_cpumask (and the blecherous node_to_cpumask_ptr which
contained a declaration) are replaced now that everyone implements
cpumask_of_node.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

a70f7302

12 Mar, 2009 5 commits

SUNRPC: Ensure that xs_nospace return values are propagated · 5e3771ce

Committed by Trond Myklebust on Mar 11, 2009

If xs_nospace() finds that the socket has disconnected, it attempts to
return ENOTCONN; however, that value is then squashed by the callers.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

5e3771ce

SUNRPC: Delay, then retry on connection errors. · 8a2cec29

Committed by Trond Myklebust on Mar 11, 2009

Enforce the comment in xs_tcp_connect_worker4/xs_tcp_connect_worker6 that
we should delay, then retry on certain connection errors.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

8a2cec29

SUNRPC: Return EAGAIN instead of ENOTCONN when waking up xprt->pending · 2a491991

Committed by Trond Myklebust on Mar 11, 2009

While we should definitely return socket errors to the task that is
currently trying to send data, there is no need to propagate the same error
to all the other tasks on xprt->pending. Doing so actually slows down
recovery, since it causes more than one task to attempt socket recovery.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

2a491991

SUNRPC: Handle socket errors correctly · 482f32e6

Committed by Trond Myklebust on Mar 11, 2009

Ensure that we pick up and handle socket errors as they occur.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

482f32e6

SUNRPC: Handle ECONNREFUSED correctly in xprt_transmit() · c8485e4d

Committed by Trond Myklebust on Mar 11, 2009

If we get an ECONNREFUSED error, we currently go to sleep on the
'xprt->sending' wait queue. The problem is that no timeout is set there,
and there is nothing else that will wake the task up later.

We should deal with ECONNREFUSED in call_status, given that is where we
also deal with -EHOSTDOWN, and friends.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

c8485e4d

openanolis / cloud-kernel synced successfully almost 2 years ago
