提交 · 31cbecb4ab538f433145bc5a46f3bea9b9627031 · openeuler / Kernel

03 11月, 2011 10 次提交

pnfs-obj: Support for RAID5 read-4-write interface. · 278c023a

由 Boaz Harrosh 提交于 10月 31, 2011

The ore need suplied a r4w_get_page/r4w_put_page API
from Filesystem so it can get cache pages to read-into when
writing parial stripes.
Signed-off-by: NBoaz Harrosh <bharrosh@panasas.com>
Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>

278c023a

pnfs-obj: move to ore 03: Remove old raid engine · 04291b62

由 Boaz Harrosh 提交于 10月 31, 2011

Finally remove all the old raid engine, which is by now
dead code.
Signed-off-by: NBoaz Harrosh <bharrosh@panasas.com>
Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>

04291b62

pnfs-obj: move to ore 02: move to ORE · eecfc631

由 Boaz Harrosh 提交于 10月 31, 2011

In this patch we are actually moving to the ORE.
(Object Raid Engine).

objio_state holds a pointer to an ore_io_state. Once
we have an ore_io_state at hand we can call the ore
for reading/writing. We register on the done path
to kick off the nfs io_done mechanism.

Again for Ease of reviewing the old code is "#if 0"
but is not removed so the diff command works better.
The old code will be removed in the next patch.

fs/exofs/Kconfig::ORE is modified to also be auto-included
if PNFS_OBJLAYOUT is set. Since we now depend on ORE.
(See comments in fs/exofs/Kconfig)
Signed-off-by: NBoaz Harrosh <bharrosh@panasas.com>
Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>

eecfc631

pnfs-obj: move to ore 01: ore_layout & ore_components · af4f5b54

由 Boaz Harrosh 提交于 10月 31, 2011

For Ease of reviewing I split the move to ore into 3 parts
	move to ore 01: ore_layout & ore_components
	move to ore 02: move to ORE
	move to ore 03: Remove old raid engine

This patch modifies the objio_lseg, layout-segment level
and devices and components arrays to use the ORE types.

Though it will be removed soon, also the raid engine
is modified to actually compile, possibly run, with
the new types. So it is the same old raid engine but
with some new ORE types.

For Ease of reviewing, some of the old code is
"#if 0" but is not removed so the diff command works
better. The old code will be removed in the 3rd patch.
Signed-off-by: NBoaz Harrosh <bharrosh@panasas.com>
Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>

af4f5b54

pnfs-obj: Rename objlayout_io_state => objlayout_io_res · e2e04355

由 Boaz Harrosh 提交于 10月 31, 2011

* All instances of objlayout_io_state => objlayout_io_res
* All instances of state => oir;
* All instances of ol_state => oir;

Big but nothing to it
Signed-off-by: NBoaz Harrosh <bharrosh@panasas.com>
Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>

e2e04355

pnfs-obj: Get rid of objlayout_{alloc,free}_io_state · 96218556

由 Boaz Harrosh 提交于 10月 31, 2011

This is part of moving objio_osd to use the ORE.

objlayout_io_state had two functions:
1. It was used in the error reporting mechanism at layout_return.
   This function is kept intact.
   (Later patch will rename objlayout_io_state => objlayout_io_res)
2. Carrier of rw io members into the objio_read/write_paglist API.
   This is removed in this patch.

The {r,w}data received from NFS are passed directly to the
objio_{read,write}_paglist API. The io_engine is now allocating
it's own IO state as part of the read/write. The minimal
functionality that was part of the generic allocation is passed
to the io_engine.

So part of this patch is rename of:
	ios->ol_state.foo => ios->foo

At objlayout_{read,write}_done an objlayout_io_state is passed that
denotes the result of the IO. (Hence the later name change).
If the IO is successful objlayout calls an objio_free_result() API
immediately (Which for objio_osd causes the release of the io_state).
If the IO ended in an error it is hanged onto until reported in
layout_return and is released later through the objio_free_result()
API. (All this is not new just renamed and cleaned)
Signed-off-by: NBoaz Harrosh <bharrosh@panasas.com>
Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>

96218556

pnfs-obj: Return PNFS_NOT_ATTEMPTED in case of read/write_pagelist · e6c40fe3

由 Boaz Harrosh 提交于 10月 31, 2011

objlayout driver was always returning PNFS_ATTEMPTED from it's
read/write_pagelist operations. Even on error. Fix that.

Start by establishing an error return API from io-engine, by
not returning ssize_t (length-or-error) but returning "int"
0=OK, 0>Error. And clean up all return types in io-engine.

Then if io-engine returned error return PNFS_NOT_ATTEMPTED
to generic layer. (With a dprint)
Signed-off-by: NBoaz Harrosh <bharrosh@panasas.com>
Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>

e6c40fe3

pnfs-obj: Remove redundant EOF from objlayout_io_state · 4cdc685c

由 Boaz Harrosh 提交于 10月 31, 2011

The EOF calculation was done on .read_pagelist(), cached
in objlayout_io_state->eof, and set in objlayout_read_done()
into nfs_read_data->res.eof.

So set it directly into nfs_read_data->res.eof and avoid
the extra member.
Signed-off-by: NBoaz Harrosh <bharrosh@panasas.com>
Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>

4cdc685c

nfs: Remove unused variable from write.c · 2b72c9cc

由 Rakib Mullick 提交于 11月 01, 2011

When CONFIG_NFS=y and CONFIG_NFS_V3_{,V4}=n we get the following warning.

	fs/nfs/write.c: In function ‘nfs_writeback_done’:
	fs/nfs/write.c:1246:21: warning: unused variable ‘server’

 Remove the variable 'server' to fix the above warning.
Signed-off-by: NRakib Mullick <rakib.mullick@gmail.com>
Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>

2b72c9cc

nfs: Fix unused variable warning from file.c · 6f276e49

由 Rakib Mullick 提交于 11月 01, 2011

Fix the following unused variable warning.

fs/nfs/file.c: In function ‘nfs_file_release’:
fs/nfs/file.c:140:17: warning: unused variable ‘dentry’
fs/nfs/file.c: In function ‘nfs_file_read’:
fs/nfs/file.c:237:9: warning: unused variable ‘count’
Signed-off-by: NRakib Mullick <rakib.mullick@gmail.com>
Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>

6f276e49

02 11月, 2011 1 次提交

sysfs: Make sysfs_rename safe with sysfs_dirents in rbtrees. · f6d90b4f

由 Eric W. Biederman 提交于 11月 01, 2011

In sysfs_rename we need to remove the optimization of not calling
sysfs_unlink_sibling and sysfs_link_sibling if the renamed parent
directory is not changing.  This optimization is no longer valid now
that sysfs dirents are stored in an rbtree sorted by name.

Move the assignment of s_ns before the call of sysfs_link_sibling.  With
no sysfs_dirent fields changing after the call of sysfs_link_sibling
this allows sysfs_link_sibling to take any of the directory entries into
account when it builds the rbtrees, and s_ns looks like a prime canidate
to be used in the rbtree in the future.
Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
Cc: Jiri Slaby <jirislaby@gmail.com>
Cc: Greg KH <gregkh@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f6d90b4f

01 11月, 2011 14 次提交

epoll: fix spurious lockdep warnings · d8805e63

由 Nelson Elhage 提交于 10月 31, 2011

epoll can acquire recursively acquire ep->mtx on multiple "struct
eventpoll"s at once in the case where one epoll fd is monitoring another
epoll fd.  This is perfectly OK, since we're careful about the lock
ordering, but it causes spurious lockdep warnings.  Annotate the recursion
using mutex_lock_nested, and add a comment explaining the nesting rules
for good measure.

Recent versions of systemd are triggering this, and it can also be
demonstrated with the following trivial test program:

--------------------8<--------------------

int main(void) {
   int e1, e2;
   struct epoll_event evt = {
       .events = EPOLLIN
   };

   e1 = epoll_create1(0);
   e2 = epoll_create1(0);
   epoll_ctl(e1, EPOLL_CTL_ADD, e2, &evt);
   return 0;
}
--------------------8<--------------------
Reported-by: NPaul Bolle <pebolle@tiscali.nl>
Tested-by: NPaul Bolle <pebolle@tiscali.nl>
Signed-off-by: NNelson Elhage <nelhage@nelhage.com>
Acked-by: NJason Baron <jbaron@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d8805e63

fat: follow rename pack_hex_byte() to hex_byte_pack() · 0a90e0f1

由 Andy Shevchenko 提交于 10月 31, 2011

There is no functional change.
Signed-off-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: NOGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

0a90e0f1

treewide: use __printf not __attribute__((format(printf,...))) · b9075fa9

由 Joe Perches 提交于 10月 31, 2011

Standardize the style for compiler based printf format verification.
Standardized the location of __printf too.

Done via script and a little typing.

$ grep -rPl --include=*.[ch] -w "__attribute__" * | \
  grep -vP "^(tools|scripts|include/linux/compiler-gcc.h)" | \
  xargs perl -n -i -e 'local $/; while (<>) { s/\b__attribute__\s*\(\s*\(\s*format\s*\(\s*printf\s*,\s*(.+)\s*,\s*(.+)\s*\)\s*\)\s*\)/__printf($1, $2)/g ; print; }'

[akpm@linux-foundation.org: revert arch bits]
Signed-off-by: NJoe Perches <joe@perches.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b9075fa9

fs/pipe.c: add ->statfs callback for pipefs · d70ef97b

由 Pavel Emelyanov 提交于 10月 31, 2011

Currently a statfs on a pipe's /proc/<pid>/fd/ link returns -ENOSYS.  Wire
pipfs up so that the statfs succeeds.

This is required by checkpoint-restart in the userspace to make it
possible to distinguish pipes from fifos.

When we dump information about task's open files we use the /proc/pid/fd
directoy's symlinks and the fact that opening any of them gives us exactly
the same dentry->inode pair as the original process has.  Now if a task
we're dumping has opened pipe and fifo we need to detect this and act
accordingly.  Knowing that an fd with type S_ISFIFO resides on a pipefs is
the most precise way.
Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
Reviewed-by: NTejun Heo <tj@kernel.org>
Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d70ef97b

fs/buffer.c: add device information for error output in __find_get_block_slow() · 72a2ebd8

由 Tao Ma 提交于 10月 31, 2011

On the ext4 mailing list[1], we got some report about errors in
__find_get_block_slow(), but the information is very limited.

If the device information is given, we can know the name of the sick
volume.  Futhermore, we can get the corresponding status of that
block(group, inode block etc) by analyzing the disk layout.

[1] http://marc.info/?l=linux-ext4&m=131379831421147&w=2Signed-off-by: NTao Ma <boyu.mt@taobao.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

72a2ebd8

vmscan: fix shrinker callback bug in fs/super.c · 09f363c7

由 Mikulas Patocka 提交于 10月 31, 2011

The callback must not return -1 when nr_to_scan is zero. Fix the bug in
fs/super.c and add this requirement to the callback specification.
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

09f363c7

lib/string.c: introduce memchr_inv() · 79824820

由 Akinobu Mita 提交于 10月 31, 2011

memchr_inv() is mainly used to check whether the whole buffer is filled
with just a specified byte.

The function name and prototype are stolen from logfs and the
implementation is from SLUB.
Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
Acked-by: NChristoph Lameter <cl@linux-foundation.org>
Acked-by: NPekka Enberg <penberg@kernel.org>
Cc: Matt Mackall <mpm@selenic.com>
Acked-by: NJoern Engel <joern@logfs.org>
Cc: Marcin Slusarz <marcin.slusarz@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

79824820

ext4: warn if direct reclaim tries to writeback pages · 966dbde2

由 Mel Gorman 提交于 10月 31, 2011

Direct reclaim should never writeback pages.  Warn if an attempt is made.
Signed-off-by: NMel Gorman <mgorman@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

966dbde2

xfs: warn if direct reclaim tries to writeback pages · 94054fa3

由 Mel Gorman 提交于 10月 31, 2011

Direct reclaim should never writeback pages.  For now, handle the
situation and warn about it.  Ultimately, this will be a BUG_ON.
Signed-off-by: NMel Gorman <mgorman@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

94054fa3

mm: distinguish between mlocked and pinned pages · bc3e53f6

由 Christoph Lameter 提交于 10月 31, 2011

Some kernel components pin user space memory (infiniband and perf) (by
increasing the page count) and account that memory as "mlocked".

The difference between mlocking and pinning is:

A. mlocked pages are marked with PG_mlocked and are exempt from
   swapping. Page migration may move them around though.
   They are kept on a special LRU list.

B. Pinned pages cannot be moved because something needs to
   directly access physical memory. They may not be on any
   LRU list.

I recently saw an mlockalled process where mm->locked_vm became
bigger than the virtual size of the process (!) because some
memory was accounted for twice:

Once when the page was mlocked and once when the Infiniband
layer increased the refcount because it needt to pin the RDMA
memory.

This patch introduces a separate counter for pinned pages and
accounts them seperately.
Signed-off-by: NChristoph Lameter <cl@linux.com>
Cc: Mike Marciniszyn <infinipath@qlogic.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

bc3e53f6

tmpfs: add "tmpfs" to the Kconfig prompt to make it obvious. · f5fc870d

由 Robert P. J. Day 提交于 10月 31, 2011

Add the leading word "tmpfs" to the Kconfig string to make it blindingly
obvious that this selection refers to tmpfs.
Signed-off-by: NRobert P. J. Day <rpjday@crashcourse.ca>
Acked-by: NHugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f5fc870d

oom: remove oom_disable_count · c9f01245

由 David Rientjes 提交于 10月 31, 2011

This removes mm->oom_disable_count entirely since it's unnecessary and
currently buggy.  The counter was intended to be per-process but it's
currently decremented in the exit path for each thread that exits, causing
it to underflow.

The count was originally intended to prevent oom killing threads that
share memory with threads that cannot be killed since it doesn't lead to
future memory freeing.  The counter could be fixed to represent all
threads sharing the same mm, but it's better to remove the count since:

 - it is possible that the OOM_DISABLE thread sharing memory with the
   victim is waiting on that thread to exit and will actually cause
   future memory freeing, and

 - there is no guarantee that a thread is disabled from oom killing just
   because another thread sharing its mm is oom disabled.
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Reported-by: NOleg Nesterov <oleg@redhat.com>
Reviewed-by: NOleg Nesterov <oleg@redhat.com>
Cc: Ying Han <yinghan@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c9f01245

Cross Memory Attach · fcf63409

由 Christopher Yeoh 提交于 10月 31, 2011

The basic idea behind cross memory attach is to allow MPI programs doing
intra-node communication to do a single copy of the message rather than a
double copy of the message via shared memory.

The following patch attempts to achieve this by allowing a destination
process, given an address and size from a source process, to copy memory
directly from the source process into its own address space via a system
call.  There is also a symmetrical ability to copy from the current
process's address space into a destination process's address space.

- Use of /proc/pid/mem has been considered, but there are issues with
  using it:
  - Does not allow for specifying iovecs for both src and dest, assuming
    preadv or pwritev was implemented either the area read from or
  written to would need to be contiguous.
  - Currently mem_read allows only processes who are currently
  ptrace'ing the target and are still able to ptrace the target to read
  from the target. This check could possibly be moved to the open call,
  but its not clear exactly what race this restriction is stopping
  (reason  appears to have been lost)
  - Having to send the fd of /proc/self/mem via SCM_RIGHTS on unix
  domain socket is a bit ugly from a userspace point of view,
  especially when you may have hundreds if not (eventually) thousands
  of processes  that all need to do this with each other
  - Doesn't allow for some future use of the interface we would like to
  consider adding in the future (see below)
  - Interestingly reading from /proc/pid/mem currently actually
  involves two copies! (But this could be fixed pretty easily)

As mentioned previously use of vmsplice instead was considered, but has
problems.  Since you need the reader and writer working co-operatively if
the pipe is not drained then you block.  Which requires some wrapping to
do non blocking on the send side or polling on the receive.  In all to all
communication it requires ordering otherwise you can deadlock.  And in the
example of many MPI tasks writing to one MPI task vmsplice serialises the
copying.

There are some cases of MPI collectives where even a single copy interface
does not get us the performance gain we could.  For example in an
MPI_Reduce rather than copy the data from the source we would like to
instead use it directly in a mathops (say the reduce is doing a sum) as
this would save us doing a copy.  We don't need to keep a copy of the data
from the source.  I haven't implemented this, but I think this interface
could in the future do all this through the use of the flags - eg could
specify the math operation and type and the kernel rather than just
copying the data would apply the specified operation between the source
and destination and store it in the destination.

Although we don't have a "second user" of the interface (though I've had
some nibbles from people who may be interested in using it for intra
process messaging which is not MPI).  This interface is something which
hardware vendors are already doing for their custom drivers to implement
fast local communication.  And so in addition to this being useful for
OpenMPI it would mean the driver maintainers don't have to fix things up
when the mm changes.

There was some discussion about how much faster a true zero copy would
go. Here's a link back to the email with some testing I did on that:

http://marc.info/?l=linux-mm&m=130105930902915&w=2

There is a basic man page for the proposed interface here:

http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt

This has been implemented for x86 and powerpc, other architecture should
mainly (I think) just need to add syscall numbers for the process_vm_readv
and process_vm_writev. There are 32 bit compatibility versions for
64-bit kernels.

For arch maintainers there are some simple tests to be able to quickly
verify that the syscalls are working correctly here:

http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgzSigned-off-by: NChris Yeoh <yeohc@au1.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: <linux-man@vger.kernel.org>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

fcf63409

/proc/self/numa_maps: restore "huge" tag for hugetlb vmas · fc360bd9

由 Andrew Morton 提交于 10月 31, 2011

The display of the "huge" tag was accidentally removed in 29ea2f69 ("mm:
use walk_page_range() instead of custom page table walking code").
Reported-by: NStephen Hemminger <shemminger@vyatta.com>
Tested-by: NStephen Hemminger <shemminger@vyatta.com>
Reviewed-by: NStephen Wilson <wilsons@start.ca>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: NDavid Rientjes <rientjes@google.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

fc360bd9

31 10月, 2011 5 次提交

NFS: Remove no-op less-than-zero checks on unsigned variables. · e414966b

由 Chuck Lever 提交于 10月 25, 2011

Introduced by commit 16b374ca "NFSv4.1: pnfs: filelayout: add driver's
LAYOUTGET and GETDEVICEINFO infrastructure" (October 20, 2010).
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>

e414966b

NFS: Clean up nfs4_xdr_dec_secinfo() · c6e69666

由 Chuck Lever 提交于 10月 25, 2011

Clean up: Remove superfluous logic at the tail of
nfs4_xdr_dec_secinfo() .

Introduced by commit 5a5ea0d4 "NFS: Add secinfo procedure" (March 24,
2011).
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>

c6e69666

NFS: Fix documenting comment for nfs_create_request() · c02f557d

由 Chuck Lever 提交于 10月 25, 2011

Clean up: the first parameter of nfs_create_request() has been
incorrectly documented since time immemorial (OK, since before
2.6.12).
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>

c02f557d

NFS4: fix cb_recallany decode error · d743c3c9

由 Peng Tao 提交于 10月 23, 2011

craa_type_mask is bitmap4 per RFC5661. We need to expect a length before
extracting bitmap value.

Cc: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: NPeng Tao <peng_tao@emc.com>
Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>

d743c3c9

nfs4: serialize layoutcommit · 92407e75

由 Peng Tao 提交于 10月 23, 2011

Current pnfs_layoutcommit_inode can not handle parallel layoutcommit.
And as Trond suggested , there is no need for client to optimize for
parallel layoutcommit. So add NFS_INO_LAYOUTCOMMITTING flag to
mark inflight layoutcommit and serialize lalyoutcommit with it.
Also mark_inode_dirty_sync if pnfs_layoutcommit_inode fails to issue
layoutcommit.
Reported-by: NVitaliy Gusev <gusev.vitaliy@nexenta.com>
Signed-off-by: NPeng Tao <peng_tao@emc.com>
Signed-off-by: NJim Rees <rees@umich.edu>
Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>

92407e75

28 10月, 2011 10 次提交

leases: fix write-open/read-lease race · f3c7691e

由 J. Bruce Fields 提交于 9月 21, 2011

In setlease, we use i_writecount to decide whether we can give out a
read lease.

In open, we break leases before incrementing i_writecount.

There is therefore a window between the break lease and the i_writecount
increment when setlease could add a new read lease.

This would leave us with a simultaneous write open and read lease, which
shouldn't happen.
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

f3c7691e

nfs: drop unnecessary locking in llseek · 79835a71

由 Andi Kleen 提交于 9月 15, 2011

This makes NFS follow the standard generic_file_llseek locking scheme.

Cc: Trond.Myklebust@netapp.com
Signed-off-by: NAndi Kleen <ak@linux.intel.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

79835a71

ext4: replace cut'n'pasted llseek code with generic_file_llseek_size · 4cce0e28

由 Andi Kleen 提交于 9月 15, 2011

This gives ext4 the benefits of unlocked llseek.

Cc: tytso@mit.edu
Signed-off-by: NAndi Kleen <ak@linux.intel.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

4cce0e28

vfs: add generic_file_llseek_size · 5760495a

由 Andi Kleen 提交于 9月 15, 2011

Add a generic_file_llseek variant to the VFS that allows passing in
the maximum file size of the file system, instead of always
using maxbytes from the superblock.

This can be used to eliminate some cut'n'paste seek code in ext4.
Signed-off-by: NAndi Kleen <ak@linux.intel.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

5760495a

vfs: do (nearly) lockless generic_file_llseek · ef3d0fd2

由 Andi Kleen 提交于 9月 15, 2011

The i_mutex lock use of generic _file_llseek hurts.  Independent processes
accessing the same file synchronize over a single lock, even though
they have no need for synchronization at all.

Under high utilization this can cause llseek to scale very poorly on larger
systems.

This patch does some rethinking of the llseek locking model:

First the 64bit f_pos is not necessarily atomic without locks
on 32bit systems. This can already cause races with read() today.
This was discussed on linux-kernel in the past and deemed acceptable.
The patch does not change that.

Let's look at the different seek variants:

SEEK_SET: Doesn't really need any locking.
If there's a race one writer wins, the other loses.

For 32bit the non atomic update races against read()
stay the same. Without a lock they can also happen
against write() now.  The read() race was deemed
acceptable in past discussions, and I think if it's
ok for read it's ok for write too.

=> Don't need a lock.

SEEK_END: This behaves like SEEK_SET plus it reads
the maximum size too. Reading the maximum size would have the
32bit atomic problem. But luckily we already have a way to read
the maximum size without locking (i_size_read), so we
can just use that instead.

Without i_mutex there is no synchronization with write() anymore,
however since the write() update is atomic on 64bit it just behaves
like another racy SEEK_SET.  On non atomic 32bit it's the same
as SEEK_SET.

=> Don't need a lock, but need to use i_size_read()

SEEK_CUR: This has a read-modify-write race window
on the same file. One could argue that any application
doing unsynchronized seeks on the same file is already broken.
But for the sake of not adding a regression here I'm
using the file->f_lock to synchronize this. Using this
lock is much better than the inode mutex because it doesn't
synchronize between processes.

=> So still need a lock, but can use a f_lock.

This patch implements this new scheme in generic_file_llseek.
I dropped generic_file_llseek_unlocked and changed all callers.
Signed-off-by: NAndi Kleen <ak@linux.intel.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

ef3d0fd2

direct-io: merge direct_io_walker into __blockdev_direct_IO · 847cc637

由 Andi Kleen 提交于 8月 01, 2011

This doesn't change anything for the compiler, but hch thought it would
make the code clearer.

I moved the reference counting into its own little inline.
Signed-off-by: NAndi Kleen <ak@linux.intel.com>
Acked-by: NJeff Moyer <jmoyer@redhat.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

847cc637

direct-io: inline the complete submission path · ba253fbf

由 Andi Kleen 提交于 8月 01, 2011

Add inlines to all the submission path functions. While this increases
code size it also gives gcc a lot of optimization opportunities
in this critical hotpath.

In particular -- together with some other changes -- this
allows gcc to get rid of the unnecessary clearing of
sdio at the beginning and optimize the messy parameter passing.
Any non inlining of a function which takes a sdio parameter
would break this optimization because they cannot be done if the
address of a structure is taken.

Note that benefits are only seen with CONFIG_OPTIMIZE_INLINING
and CONFIG_CC_OPTIMIZE_FOR_SIZE both set to off.

This gives about 2.2% improvement on a large database benchmark
with a high IOPS rate.
Signed-off-by: NAndi Kleen <ak@linux.intel.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

ba253fbf

direct-io: separate map_bh from dio · 18772641

由 Andi Kleen 提交于 8月 01, 2011

Only a single b_private field in the map_bh buffer head is needed after
the submission path. Move map_bh separately to avoid storing
this information in the long term slab.

This avoids the weird 104 byte hole in struct dio_submit which also needed
to be memseted early.
Signed-off-by: NAndi Kleen <ak@linux.intel.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

18772641

direct-io: use a slab cache for struct dio · 6e8267f5

由 Andi Kleen 提交于 8月 01, 2011

A direct slab call is slightly faster than kmalloc and can be better cached
per CPU. It also avoids rounding to the next kmalloc slab.

In addition this enforces cache line alignment for struct dio to avoid
any false sharing.
Signed-off-by: NAndi Kleen <ak@linux.intel.com>
Acked-by: NJeff Moyer <jmoyer@redhat.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

6e8267f5

direct-io: rearrange fields in dio/dio_submit to avoid holes · 0dc2bc49

由 Andi Kleen 提交于 8月 01, 2011

Fix most problems reported by pahole.

There is still a weird 104 byte hole after map_bh. I'm not sure what
causes this.
Signed-off-by: NAndi Kleen <ak@linux.intel.com>
Acked-by: NJeff Moyer <jmoyer@redhat.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

0dc2bc49

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功