提交 · 293bc9822fa9b3c9d4b7893bcb241e085580771a · openanolis / cloud-kernel

07 5月, 2014 2 次提交

new methods: ->read_iter() and ->write_iter() · 293bc982

由 Al Viro 提交于 2月 11, 2014

Beginning to introduce those.  Just the callers for now, and it's
clumsier than it'll eventually become; once we finish converting
aio_read and aio_write instances, the things will get nicer.

For now, these guys are in parallel to ->aio_read() and ->aio_write();
they take iocb and iov_iter, with everything in iov_iter already
validated.  File offset is passed in iocb->ki_pos, iov/nr_segs -
in iov_iter.

Main concerns in that series are stack footprint and ability to
split the damn thing cleanly.

[fix from Peter Ujfalusi <peter.ujfalusi@ti.com> folded]
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

293bc982

replace checking for ->read/->aio_read presence with check in ->f_mode · 7f7f25e8

由 Al Viro 提交于 2月 11, 2014

Since we are about to introduce new methods (read_iter/write_iter), the
tests in a bunch of places would have to grow inconveniently. Check
once (at open() time) and store results in ->f_mode as FMODE_CAN_READ
and FMODE_CAN_WRITE resp. It might end up being a temporary measure -
once everything switches from ->aio_{read,write} to ->{read,write}_iter
it might make sense to return to open-coded checks. We'll see...
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

7f7f25e8

23 3月, 2014 1 次提交

vfs: atomic f_pos access in llseek() · d7a15f8d

由 Eric Biggers 提交于 3月 16, 2014

Commit 9c225f26 ("vfs: atomic f_pos accesses as per POSIX") changed
several system calls to use fdget_pos() instead of fdget(), but missed
sys_llseek().  Fix it.
Signed-off-by: NEric Biggers <ebiggers3@gmail.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

d7a15f8d

10 3月, 2014 2 次提交

get rid of fget_light() · bd2a31d5

由 Al Viro 提交于 3月 04, 2014

instead of returning the flags by reference, we can just have the
low-level primitive return those in lower bits of unsigned long,
with struct file * derived from the rest.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

bd2a31d5

vfs: atomic f_pos accesses as per POSIX · 9c225f26

由 Linus Torvalds 提交于 3月 03, 2014

Our write() system call has always been atomic in the sense that you get
the expected thread-safe contiguous write, but we haven't actually
guaranteed that concurrent writes are serialized wrt f_pos accesses, so
threads (or processes) that share a file descriptor and use "write()"
concurrently would quite likely overwrite each others data.

This violates POSIX.1-2008/SUSv4 Section XSI 2.9.7 that says:

 "2.9.7 Thread Interactions with Regular File Operations

  All of the following functions shall be atomic with respect to each
  other in the effects specified in POSIX.1-2008 when they operate on
  regular files or symbolic links: [...]"

and one of the effects is the file position update.

This unprotected file position behavior is not new behavior, and nobody
has ever cared.  Until now.  Yongzhi Pan reported unexpected behavior to
Michael Kerrisk that was due to this.

This resolves the issue with a f_pos-specific lock that is taken by
read/write/lseek on file descriptors that may be shared across threads
or processes.
Reported-by: NYongzhi Pan <panyongzhi@gmail.com>
Reported-by: NMichael Kerrisk <mtk.manpages@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

9c225f26

06 3月, 2014 1 次提交

fs/compat: optional preadv64/pwrite64 compat system calls · 378a10f3

由 Heiko Carstens 提交于 3月 05, 2014

The preadv64/pwrite64 have been implemented for the x32 ABI, in order
to allow passing 64 bit arguments from user space without splitting
them into two 32 bit parameters, like it would be necessary for usual
compat tasks.
Howevert these two system calls are only being used for the x32 ABI,
so add __ARCH_WANT_COMPAT defines for these two compat syscalls and
make these two only visible for x86.
Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com>

378a10f3

30 1月, 2014 1 次提交

fs/compat: fix parameter handling for compat readv/writev syscalls · dfd948e3

由 Heiko Carstens 提交于 1月 29, 2014

We got a report that the pwritev syscall does not work correctly in
compat mode on s390.

It turned out that with commit 72ec3516 ("switch compat readv/writev
variants to COMPAT_SYSCALL_DEFINE") we lost the zero extension of a
couple of syscall parameters because the some parameter types haven't
been converted from unsigned long to compat_ulong_t.

This is needed for architectures where the ABI requires that the caller
of a function performed zero and/or sign extension to 64 bit of all
parameters.
Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: <stable@vger.kernel.org>	[v3.10+]
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

dfd948e3

22 1月, 2014 1 次提交

fs/read_write.c:compat_readv(): remove bogus area verify · 4e4f9e66

由 Corey Minyard 提交于 1月 21, 2014

The compat_do_readv_writev() function was doing a verify_area on the
incoming iov, but the nr_segs value is not checked. If someone passes
in a -1 for nr_segs, for instance, the function should return an EINVAL.
However, it returns a EFAULT because the verify_area fails because it is
checking an array of size MAX_UINT. The check is bogus, anyway, because
the next check, compat_rw_copy_check_uvector(), will do all the
necessary checking, anyway. The non-compat do_readv_writev() function
doesn't do this check, so I think it's safe to just remove the code.
Signed-off-by: NCorey Minyard <cminyard@mvista.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4e4f9e66

25 10月, 2013 1 次提交
- A
  file->f_op is never NULL... · 72c2d531
  由 Al Viro 提交于 9月 22, 2013
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  72c2d531
30 7月, 2013 1 次提交

aio: Kill aio_rw_vect_retry() · 73a7075e

由 Kent Overstreet 提交于 5月 09, 2013

This code doesn't serve any purpose anymore, since the aio retry
infrastructure has been removed.

This change should be safe because aio_read/write are also used for
synchronous IO, and called from do_sync_read()/do_sync_write() - and
there's no looping done in the sync case (the read and write syscalls).
Signed-off-by: NKent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: NBenjamin LaHaise <bcrl@kvack.org>

73a7075e

03 7月, 2013 1 次提交

vfs: export lseek_execute() to modules · 46a1c2c7

由 Jie Liu 提交于 6月 25, 2013

For those file systems(btrfs/ext4/ocfs2/tmpfs) that support
SEEK_DATA/SEEK_HOLE functions, we end up handling the similar
matter in lseek_execute() to update the current file offset
to the desired offset if it is valid, ceph also does the
simliar things at ceph_llseek().

To reduce the duplications, this patch make lseek_execute()
public accessible so that we can call it directly from the
underlying file systems.

Thanks Dave Chinner for this suggestion.

[AV: call it vfs_setpos(), don't bring the removed 'inode' argument back]

v2->v1:
- Add kernel-doc comments for lseek_execute()
- Call lseek_execute() in ceph->llseek()
Signed-off-by: NJie Liu <jeff.liu@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Cc: Ben Myers <bpm@sgi.com>
Cc: Ted Tso <tytso@mit.edu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Sage Weil <sage@inktank.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

46a1c2c7

29 6月, 2013 5 次提交
- A
  lseek_execute() doesn't need an inode passed to it · 2142914e
  由 Al Viro 提交于 6月 23, 2013
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  2142914e
- A
  constify rw_verify_area() · 68d70d03
  由 Al Viro 提交于 6月 19, 2013
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  68d70d03
- A
  new helper: fixed_size_llseek() · 1bf9d14d
  由 Al Viro 提交于 6月 16, 2013
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  1bf9d14d
- A
  don't call file_pos_write() if vfs_{read,write}{,v}() fails · 5faf153e
  由 Al Viro 提交于 6月 15, 2013
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  5faf153e
- A
  lift file_*_write out of do_splice_direct() · 50cd2c57
  由 Al Viro 提交于 5月 23, 2013
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  50cd2c57
20 6月, 2013 1 次提交
- A
  splice: don't pass the address of ->f_pos to methods · 7995bd28
  由 Al Viro 提交于 6月 20, 2013
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  7995bd28
08 5月, 2013 2 次提交

aio: don't include aio.h in sched.h · a27bb332

由 Kent Overstreet 提交于 5月 07, 2013

Faster kernel compiles by way of fewer unnecessary includes.

[akpm@linux-foundation.org: fix fallout]
[akpm@linux-foundation.org: fix build]
Signed-off-by: NKent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: N"Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a27bb332

aio: remove retry-based AIO · 41003a7b

由 Zach Brown 提交于 5月 07, 2013

This removes the retry-based AIO infrastructure now that nothing in tree
is using it.

We want to remove retry-based AIO because it is fundemantally unsafe.
It retries IO submission from a kernel thread that has only assumed the
mm of the submitting task.  All other task_struct references in the IO
submission path will see the kernel thread, not the submitting task.
This design flaw means that nothing of any meaningful complexity can use
retry-based AIO.

This removes all the code and data associated with the retry machinery.
The most significant benefit of this is the removal of the locking
around the unused run list in the submission path.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: NKent Overstreet <koverstreet@google.com>
Signed-off-by: NZach Brown <zab@redhat.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Acked-by: NJeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: N"Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

41003a7b

05 5月, 2013 1 次提交

kill fs/read_write.h · c0bd14af

由 Al Viro 提交于 5月 04, 2013

fs/compat.c doesn't need it anymore, so let's just move the remaining
contents (two typedefs) into fs/read_write.c
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

c0bd14af

30 4月, 2013 1 次提交

fs/read_write.c: fix generic_file_llseek() comment · 546ae2d2

由 Ming Lei 提交于 4月 29, 2013

Commit ef3d0fd2 ("vfs: do (nearly) lockless generic_file_llseek")
has removed i_mutex from generic_file_llseek, so update the comment
accordingly.
Signed-off-by: NMing Lei <tom.leiming@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

546ae2d2

10 4月, 2013 3 次提交
- A
  lift sb_start_write() out of ->write() · 03d95eb2
  由 Al Viro 提交于 3月 20, 2013
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  03d95eb2
- A
  switch compat readv/writev variants to COMPAT_SYSCALL_DEFINE · 72ec3516
  由 Al Viro 提交于 3月 20, 2013
```
... and take to fs/read_write.c
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  72ec3516
- A
  lift sb_start_write/sb_end_write out of ->aio_write() · 8d71db4f
  由 Al Viro 提交于 3月 19, 2013
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  8d71db4f
28 3月, 2013 1 次提交

vfs/splice: Fix missed checks in new __kernel_write() helper · 3e84f48e

由 Al Viro 提交于 3月 27, 2013

Commit 06ae43f3 ("Don't bother with redoing rw_verify_area() from
default_file_splice_from()") lost the checks to test existence of the
write/aio_write methods.  My apologies ;-/

Eventually, we want that in fs/splice.c side of things (no point
repeating it for every buffer, after all), but for now this is the
obvious minimal fix.
Reported-by: NDave Jones <davej@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3e84f48e

22 3月, 2013 1 次提交

Don't bother with redoing rw_verify_area() from default_file_splice_from() · 06ae43f3

由 Al Viro 提交于 3月 20, 2013

default_file_splice_from() ends up calling vfs_write() (via very convoluted
callchain). It's an overkill, since we already have done rw_verify_area()
in the caller by the time we call vfs_write() we are under set_fs(KERNEL_DS),
so access_ok() is also pointless. Add a new helper (__kernel_write()),
use it instead of kernel_write() in there.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

06ae43f3

04 3月, 2013 2 次提交
- A
  convert sendfile{,64} to COMPAT_SYSCALL_DEFINE · 19f4fc3a
  由 Al Viro 提交于 2月 24, 2013
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  19f4fc3a
- A
  teach SYSCALL_DEFINE<n> how to deal with long long/unsigned long long · 4a0fd5bf
  由 Al Viro 提交于 1月 21, 2013
```
... and convert a bunch of SYSCALL_DEFINE ones to SYSCALL_DEFINE<n>,
killing the boilerplate crap around them.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  4a0fd5bf
24 2月, 2013 1 次提交
- A
  switch lseek to COMPAT_SYSCALL_DEFINE · 561c6731
  由 Al Viro 提交于 2月 24, 2013
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  561c6731
23 2月, 2013 1 次提交
- A
  new helper: file_inode(file) · 496ad9aa
  由 Al Viro 提交于 1月 23, 2013
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  496ad9aa
21 12月, 2012 1 次提交

sendfile: allows bypassing of notifier events · a68c2f12

由 Scott Wolchok 提交于 12月 20, 2012

do_sendfile() in fs/read_write.c does not call the fsnotify functions,
unlike its neighbors.  This manifests as a lack of inotify ACCESS events
when a file is sent using sendfile(2).

Addresses
  https://bugzilla.kernel.org/show_bug.cgi?id=12812

[akpm@linux-foundation.org: use fsnotify_modify(out.file), not fsnotify_access(), per Dave]
Signed-off-by: NAlan Cox <alan@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Scott Wolchok <swolchok@umich.edu>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a68c2f12

18 12月, 2012 1 次提交

lseek: the "whence" argument is called "whence" · 965c8e59

由 Andrew Morton 提交于 12月 17, 2012

But the kernel decided to call it "origin" instead.  Fix most of the
sites.
Acked-by: NHugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

965c8e59

03 10月, 2012 1 次提交

compat: fs: Generic compat_sys_sendfile implementation · 8f9c0119

由 Catalin Marinas 提交于 9月 19, 2012

This function is used by sparc, powerpc and arm64 for compat support.
The patch adds a generic implementation which calls do_sendfile()
directly and avoids set_fs().

The sparc architecture has wrappers for the sign extensions while
powerpc relies on the compiler to do the this. The patch adds wrappers
for powerpc to handle the u32->int type conversion.

compat_sys_sendfile64() can be replaced by a sys_sendfile() call since
compat_loff_t has the same size as off_t on a 64-bit system.

On powerpc, the patch also changes the 64-bit sendfile call from
sys_sendile64 to sys_sendfile.
Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

8f9c0119

27 9月, 2012 1 次提交
- A
  switch simple cases of fget_light to fdget · 2903ff01
  由 Al Viro 提交于 8月 28, 2012
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  2903ff01
23 7月, 2012 1 次提交

vfs: allow custom EOF in generic_file_llseek code · e8b96eb5

由 Eric Sandeen 提交于 4月 30, 2012

For ext3/4 htree directories, using the vfs llseek function with
SEEK_END goes to i_size like for any other file, but in reality
we want the maximum possible hash value.  Recent changes
in ext4 have cut & pasted generic_file_llseek() back into fs/ext4/dir.c,
but replicating this core code seems like a bad idea, especially
since the copy has already diverged from the vfs.

This patch updates generic_file_llseek_size to accept
both a custom maximum offset, and a custom EOF position.  With this
in place, ext4_dir_llseek can pass in the appropriate maximum hash
position for both maxsize and eof, and get what it wants.

As far as I know, this does not fix any bugs - nfs in the kernel
doesn't use SEEK_END, and I don't know of any user who does.  But
some ext4 folks seem keen on doing the right thing here, and I can't
really argue.

(Patch also fixes up some comments slightly)
Signed-off-by: NEric Sandeen <sandeen@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

e8b96eb5

01 6月, 2012 1 次提交

aio/vfs: cleanup of rw_copy_check_uvector() and compat_rw_copy_check_uvector() · ac34ebb3

由 Christopher Yeoh 提交于 5月 31, 2012

A cleanup of rw_copy_check_uvector and compat_rw_copy_check_uvector after
changes made to support CMA in an earlier patch.

Rather than having an additional check_access parameter to these
functions, the first paramater type is overloaded to allow the caller to
specify CHECK_IOVEC_ONLY which means check that the contents of the iovec
are valid, but do not check the memory that they point to. This is used
by process_vm_readv/writev where we need to validate that a iovec passed
to the syscall is valid but do not want to check the memory that it points
to at this point because it refers to an address space in another process.
Signed-off-by: NChris Yeoh <yeohc@au1.ibm.com>
Reviewed-by: NOleg Nesterov <oleg@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

ac34ebb3

29 2月, 2012 1 次提交

fs: reduce the use of module.h wherever possible · 630d9c47

由 Paul Gortmaker 提交于 11月 16, 2011

For files only using THIS_MODULE and/or EXPORT_SYMBOL, map
them onto including export.h -- or if the file isn't even
using those, then just delete the include.  Fix up any implicit
include dependencies that were being masked by module.h along
the way.
Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>

630d9c47

01 11月, 2011 1 次提交

Cross Memory Attach · fcf63409

由 Christopher Yeoh 提交于 10月 31, 2011

The basic idea behind cross memory attach is to allow MPI programs doing
intra-node communication to do a single copy of the message rather than a
double copy of the message via shared memory.

The following patch attempts to achieve this by allowing a destination
process, given an address and size from a source process, to copy memory
directly from the source process into its own address space via a system
call.  There is also a symmetrical ability to copy from the current
process's address space into a destination process's address space.

- Use of /proc/pid/mem has been considered, but there are issues with
  using it:
  - Does not allow for specifying iovecs for both src and dest, assuming
    preadv or pwritev was implemented either the area read from or
  written to would need to be contiguous.
  - Currently mem_read allows only processes who are currently
  ptrace'ing the target and are still able to ptrace the target to read
  from the target. This check could possibly be moved to the open call,
  but its not clear exactly what race this restriction is stopping
  (reason  appears to have been lost)
  - Having to send the fd of /proc/self/mem via SCM_RIGHTS on unix
  domain socket is a bit ugly from a userspace point of view,
  especially when you may have hundreds if not (eventually) thousands
  of processes  that all need to do this with each other
  - Doesn't allow for some future use of the interface we would like to
  consider adding in the future (see below)
  - Interestingly reading from /proc/pid/mem currently actually
  involves two copies! (But this could be fixed pretty easily)

As mentioned previously use of vmsplice instead was considered, but has
problems.  Since you need the reader and writer working co-operatively if
the pipe is not drained then you block.  Which requires some wrapping to
do non blocking on the send side or polling on the receive.  In all to all
communication it requires ordering otherwise you can deadlock.  And in the
example of many MPI tasks writing to one MPI task vmsplice serialises the
copying.

There are some cases of MPI collectives where even a single copy interface
does not get us the performance gain we could.  For example in an
MPI_Reduce rather than copy the data from the source we would like to
instead use it directly in a mathops (say the reduce is doing a sum) as
this would save us doing a copy.  We don't need to keep a copy of the data
from the source.  I haven't implemented this, but I think this interface
could in the future do all this through the use of the flags - eg could
specify the math operation and type and the kernel rather than just
copying the data would apply the specified operation between the source
and destination and store it in the destination.

Although we don't have a "second user" of the interface (though I've had
some nibbles from people who may be interested in using it for intra
process messaging which is not MPI).  This interface is something which
hardware vendors are already doing for their custom drivers to implement
fast local communication.  And so in addition to this being useful for
OpenMPI it would mean the driver maintainers don't have to fix things up
when the mm changes.

There was some discussion about how much faster a true zero copy would
go. Here's a link back to the email with some testing I did on that:

http://marc.info/?l=linux-mm&m=130105930902915&w=2

There is a basic man page for the proposed interface here:

http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt

This has been implemented for x86 and powerpc, other architecture should
mainly (I think) just need to add syscall numbers for the process_vm_readv
and process_vm_writev. There are 32 bit compatibility versions for
64-bit kernels.

For arch maintainers there are some simple tests to be able to quickly
verify that the syscalls are working correctly here:

http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgzSigned-off-by: NChris Yeoh <yeohc@au1.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: <linux-man@vger.kernel.org>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

fcf63409

28 10月, 2011 2 次提交

vfs: add generic_file_llseek_size · 5760495a

由 Andi Kleen 提交于 9月 15, 2011

Add a generic_file_llseek variant to the VFS that allows passing in
the maximum file size of the file system, instead of always
using maxbytes from the superblock.

This can be used to eliminate some cut'n'paste seek code in ext4.
Signed-off-by: NAndi Kleen <ak@linux.intel.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

5760495a

vfs: do (nearly) lockless generic_file_llseek · ef3d0fd2

由 Andi Kleen 提交于 9月 15, 2011

The i_mutex lock use of generic _file_llseek hurts.  Independent processes
accessing the same file synchronize over a single lock, even though
they have no need for synchronization at all.

Under high utilization this can cause llseek to scale very poorly on larger
systems.

This patch does some rethinking of the llseek locking model:

First the 64bit f_pos is not necessarily atomic without locks
on 32bit systems. This can already cause races with read() today.
This was discussed on linux-kernel in the past and deemed acceptable.
The patch does not change that.

Let's look at the different seek variants:

SEEK_SET: Doesn't really need any locking.
If there's a race one writer wins, the other loses.

For 32bit the non atomic update races against read()
stay the same. Without a lock they can also happen
against write() now.  The read() race was deemed
acceptable in past discussions, and I think if it's
ok for read it's ok for write too.

=> Don't need a lock.

SEEK_END: This behaves like SEEK_SET plus it reads
the maximum size too. Reading the maximum size would have the
32bit atomic problem. But luckily we already have a way to read
the maximum size without locking (i_size_read), so we
can just use that instead.

Without i_mutex there is no synchronization with write() anymore,
however since the write() update is atomic on 64bit it just behaves
like another racy SEEK_SET.  On non atomic 32bit it's the same
as SEEK_SET.

=> Don't need a lock, but need to use i_size_read()

SEEK_CUR: This has a read-modify-write race window
on the same file. One could argue that any application
doing unsynchronized seeks on the same file is already broken.
But for the sake of not adding a regression here I'm
using the file->f_lock to synchronize this. Using this
lock is much better than the inode mutex because it doesn't
synchronize between processes.

=> So still need a lock, but can use a f_lock.

This patch implements this new scheme in generic_file_llseek.
I dropped generic_file_llseek_unlocked and changed all callers.
Signed-off-by: NAndi Kleen <ak@linux.intel.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

ef3d0fd2

openanolis / cloud-kernel 大约 1 年 前同步成功

openanolis / cloud-kernel
大约 1 年前同步成功