提交 · 8b284dc47291daf72fe300e1138a2e7ed56f38ab · openeuler / raspberrypi-kernel

01 10月, 2013 11 次提交

fuse: writepages: handle same page rewrites · 8b284dc4

由 Miklos Szeredi 提交于 10月 01, 2013

As Maxim Patlasov pointed out, it's possible to get a dirty page while it's
copy is still under writeback, despite fuse_page_mkwrite() doing its thing
(direct IO).

This could result in two concurrent write request for the same offset, with
data corruption if they get mixed up.

To prevent this, fuse needs to check and delay such writes.  This
implementation does this by:

 1. check if page is still under writeout, if so create a new, single page
    secondary request for it

 2. chain this secondary request onto the in-flight request

 2/a. if a seconday request for the same offset was already chained to the
    in-flight request, then just copy the contents of the page and discard
    the new secondary request.  This makes sure that for each page will
    have at most two requests associated with it

 3. when the in-flight request finished, send off all secondary requests
    chained onto it
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>

8b284dc4

fuse: writepages: fix aggregation · 1e112a48

由 Miklos Szeredi 提交于 10月 01, 2013

Checking against tmp-page indexes is not very useful, and results in one
(or rarely two) page requests.  Which is not much of an improvement...
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>

1e112a48

fuse: fix race in fuse_writepages() · 2d033eaa

由 Maxim Patlasov 提交于 8月 16, 2013

The patch fixes a race between ftruncate(2), mmap-ed write and write(2):

1) An user makes a page dirty via mmap-ed write.
2) The user performs shrinking truncate(2) intended to purge the page.
3) Before fuse_do_setattr calls truncate_pagecache, the page goes to
   writeback. fuse_writepages_fill attaches a new page to FUSE_WRITE request,
   then releases the original page by end_page_writeback and unlock it.
4) fuse_do_setattr completes and successfully returns. Since now, i_mutex
   is free.
5) Ordinary write(2) extends i_size back to cover the page. Note that
   fuse_send_write_pages do wait for fuse writeback, but for another
   page->index.
6) fuse_writepages_fill attaches more pages to the request (if any), then
   fuse_writepages_send is eventually called. It is supposed to crop
   inarg->size of the request, but it doesn't because i_size has already been
   extended back.

Moving end_page_writeback behind fuse_writepages_send guarantees that
__fuse_release_nowrite (called from fuse_do_setattr) will crop inarg->size
of the request before write(2) gets the chance to extend i_size.
Signed-off-by: NMaxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>

2d033eaa

fuse: Implement writepages callback · 26d614df

由 Pavel Emelyanov 提交于 6月 29, 2013

The .writepages one is required to make each writeback request carry more than
one page on it. The patch enables optimized behaviour unconditionally,
i.e. mmap-ed writes will benefit from the patch even if fc->writeback_cache=0.

[SzM: simplify, add comments]
Signed-off-by: NMaxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>

26d614df

fuse: don't BUG on no write file · 72523425

由 Miklos Szeredi 提交于 10月 01, 2013

Don't bug if there's no writable files found for page writeback. If ever
this is triggered, a WARN_ON helps debugging it much better then a BUG_ON.
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>

72523425

fuse: lock page in mkwrite · cca24370

由 Miklos Szeredi 提交于 10月 01, 2013

Lock the page in fuse_page_mkwrite() to protect against a race with
fuse_writepage() where the page is redirtied before the actual writeback
begins.
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>

cca24370

fuse: Prepare to handle multiple pages in writeback · 385b1268

由 Pavel Emelyanov 提交于 6月 29, 2013

The .writepages callback will issue writeback requests with more than one
page aboard. Make existing end/check code be aware of this.
Signed-off-by: NMaxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>

385b1268

fuse: Getting file for writeback helper · adcadfa8

由 Pavel Emelyanov 提交于 6月 29, 2013

There will be a .writepageS callback implementation which will need to
get a fuse_file out of a fuse_inode, thus make a helper for this.
Signed-off-by: NMaxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>

adcadfa8

fuse: no RCU mode in fuse_access() · 698fa1d1

由 Miklos Szeredi 提交于 10月 01, 2013

fuse_access() is never called in RCU walk, only on the final component of
access(2) and chdir(2)...
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>

698fa1d1

fuse: readdirplus: fix RCU walk · 6314efee

由 Miklos Szeredi 提交于 10月 01, 2013

Doing dput(parent) is not valid in RCU walk mode.  In RCU mode it would
probably be okay to update the parent flags, but it's actually not
necessary most of the time...

So only set the FUSE_I_ADVISE_RDPLUS flag on the parent when the entry was
recently initialized by READDIRPLUS.

This is achieved by setting FUSE_I_INIT_RDPLUS on entries added by
READDIRPLUS and only dropping out of RCU mode if this flag is set.
FUSE_I_INIT_RDPLUS is cleared once the FUSE_I_ADVISE_RDPLUS flag is set in
the parent.
Reported-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org

6314efee

fuse: don't check_submounts_and_drop() in RCU walk · 3c70b8ee

由 Miklos Szeredi 提交于 10月 01, 2013

If revalidate finds an invalid dentry in RCU walk mode, let the VFS deal
with it instead of calling check_submounts_and_drop() which is not prepared
for being called from RCU walk.
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org

3c70b8ee

18 9月, 2013 2 次提交

fuse: fix fallocate vs. ftruncate race · 0ab08f57

由 Maxim Patlasov 提交于 9月 13, 2013

A former patch introducing FUSE_I_SIZE_UNSTABLE flag provided detailed
description of races between ftruncate and anyone who can extend i_size:

> 1. As in the previous scenario fuse_dentry_revalidate() discovered that i_size
> changed (due to our own fuse_do_setattr()) and is going to call
> truncate_pagecache() for some  'new_size' it believes valid right now. But by
> the time that particular truncate_pagecache() is called ...
> 2. fuse_do_setattr() returns (either having called truncate_pagecache() or
> not -- it doesn't matter).
> 3. The file is extended either by write(2) or ftruncate(2) or fallocate(2).
> 4. mmap-ed write makes a page in the extended region dirty.

This patch adds necessary bits to fuse_file_fallocate() to protect from that
race.
Signed-off-by: NMaxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org

0ab08f57

fuse: wait for writeback in fuse_file_fallocate() · bde52788

由 Maxim Patlasov 提交于 9月 13, 2013

The patch fixes a race between mmap-ed write and fallocate(PUNCH_HOLE):

1) An user makes a page dirty via mmap-ed write.
2) The user performs fallocate(2) with mode == PUNCH_HOLE|KEEP_SIZE
   and <offset, size> covering the page.
3) Before truncate_pagecache_range call from fuse_file_fallocate,
   the page goes to write-back. The page is fully processed by fuse_writepage
   (including end_page_writeback on the page), but fuse_flush_writepages did
   nothing because fi->writectr < 0.
4) truncate_pagecache_range is called and fuse_file_fallocate is finishing
   by calling fuse_release_nowrite. The latter triggers processing queued
   write-back request which will write stale data to the hole soon.

Changed in v2 (thanks to Brian for suggestion):
 - Do not truncate page cache until FUSE_FALLOCATE succeeded. Otherwise,
   we can end up in returning -ENOTSUPP while user data is already punched
   from page cache. Use filemap_write_and_wait_range() instead.
Changed in v3 (thanks to Miklos for suggestion):
 - fuse_wait_on_writeback() is prone to livelocks; use fuse_set_nowrite()
   instead. So far as we need a dirty-page barrier only, fuse_sync_writes()
   should be enough.
 - rebased to for-linus branch of fuse.git
Signed-off-by: NMaxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org

bde52788

15 9月, 2013 1 次提交

vfs: fix typo in comment in recent dentry work · 05a8252b

由 Linus Torvalds 提交于 9月 15, 2013

Sedat points out that I transposed some letters in "LRU" and wrote "RLU"
instead in one of the new comments explaining the flow.  Let's just fix
it.
Reported-by: NSedat Dilek <sedat.dilek@jpberlin.de>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

05a8252b

14 9月, 2013 3 次提交

vfs: fix dentry LRU list handling and nr_dentry_unused accounting · 89dc77bc

由 Linus Torvalds 提交于 9月 13, 2013

The LRU list changes interacted badly with our nr_dentry_unused
accounting, and even worse with the new DCACHE_LRU_LIST bit logic.

This introduces helper functions to make sure everything follows the
proper dcache d_lru list rules: the dentry cache is complicated by the
fact that some of the hotpaths don't even want to look at the LRU list
at all, and the fact that we use the same list entry in the dentry for
both the LRU list and for our temporary shrinking lists when removing
things from the LRU.

The helper functions temporarily have some extra sanity checking for the
flag bits that have to match the current LRU state of the dentry.  We'll
remove that before the final 3.12 release, but considering how easy it
is to get wrong, this first cleanup version has some very particular
sanity checking.
Acked-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

89dc77bc

cifs: Avoid calling unlock_page() twice in cifs_readpage() when using fscache · 466bd31b

由 Sachin Prabhu 提交于 9月 13, 2013

When reading a single page with cifs_readpage(), we make a call to
fscache_read_or_alloc_page() which once done, asynchronously calls
the completion function cifs_readpage_from_fscache_complete(). This
completion function unlocks the page once it has been populated from
cache. The module then attempts to unlock the page a second time in
cifs_readpage() which leads to warning messages.

In case of a successful call to fscache_read_or_alloc_page() we should skip
the second unlock_page() since this will be called by the
cifs_readpage_from_fscache_complete() once the page has been populated by
fscache.

With the modifications to cifs_readpage_worker(), we will need to re-grab the
page lock in cifs_write_begin().

The problem was first noticed when testing new fscache patches for cifs.
https://bugzilla.redhat.com/show_bug.cgi?id=1005737Signed-off-by: NSachin Prabhu <sprabhu@redhat.com>
Reviewed-by: NJeff Layton <jlayton@redhat.com>
Signed-off-by: NSteve French <smfrench@gmail.com>

466bd31b

cifs: Do not take a reference to the page in cifs_readpage_worker() · a9e9b7bc

由 Sachin Prabhu 提交于 9月 13, 2013

We do not need to take a reference to the pagecache in
cifs_readpage_worker() since the calling function will have already
taken one before passing the pointer to the page as an argument to the
function.
Signed-off-by: NSachin Prabhu <sprabhu@redhat.com>
Reviewed-by: NJeff Layton <jlayton@redhat.com>
Signed-off-by: NSteve French <smfrench@gmail.com>

a9e9b7bc

13 9月, 2013 8 次提交

thp: account anon transparent huge pages into NR_ANON_PAGES · 3cd14fcd

由 Kirill A. Shutemov 提交于 9月 12, 2013

We use NR_ANON_PAGES as base for reporting AnonPages to user.  There's
not much sense in not accounting transparent huge pages there, but add
them on printing to user.

Let's account transparent huge pages in NR_ANON_PAGES in the first place.
Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: NDave Hansen <dave.hansen@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Ning Qu <quning@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3cd14fcd

truncate: drop 'oldsize' truncate_pagecache() parameter · 7caef267

由 Kirill A. Shutemov 提交于 9月 12, 2013

truncate_pagecache() doesn't care about old size since commit
cedabed4 ("vfs: Fix vmtruncate() regression").  Let's drop it.
Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

7caef267

vfs: make d_path() get the root path under RCU · 68f0d9d9

由 Linus Torvalds 提交于 9月 12, 2013

This avoids the spinlocks and refcounts in the d_path() sequence too
(used by /proc and various other entities).  See commit 8b19e341 for
the equivalent getcwd() system call path.

And unlike getcwd(), d_path() doesn't copy the result to user space, so
I don't need to fear _that_ particular bug happening again.
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

68f0d9d9

vfs: use __getname/__putname for getcwd() system call · 3272c544

由 Linus Torvalds 提交于 9月 12, 2013

It's a pathname.  It should use the pathname allocators and
deallocators, and PATH_MAX instead of PAGE_SIZE.  Never mind that the
two are commonly the same.

With this, the allocations scale up nicely too, and I can do getcwd()
system calls at a rate of about 300M/s, with no lock contention
anywhere.

Of course, nobody sane does that, especially since getcwd() is
traditionally a very slow operation in Unix.  But this was also the
simplest way to benchmark the prepend_path() improvements by Waiman, and
once I saw the profiles I couldn't leave it well enough alone.

But apart from being an performance improvement (from using per-cpu slab
allocators instead of the raw page allocator), it's actually a valid and
real cleanup.
Signed-off-by: NLinus "OCD" Torvalds <torvalds@linux-foundation.org>

3272c544

vfs: don't copy things to user space holding the rcu readlock · ff812d72

由 Linus Torvalds 提交于 9月 12, 2013

Oops.  That wasn't very smart.  We don't actually need the RCU lock any
more by the time we copy the cwd string to user space, but I had
stupidly surrounded the whole thing with it.

Introduced by commit 8b19e341 ("vfs: make getcwd() get the root and
pwd path under rcu")

Is-a-big-hairy-idiot: Linus Torvalds <torvalds@linux-foundation.org>

ff812d72

vfs: make getcwd() get the root and pwd path under rcu · 8b19e341

由 Linus Torvalds 提交于 9月 12, 2013

This allows us to skip all the crazy spinlocks and reference count
updates, and instead use the fs sequence read-lock to get an atomic
snapshot of the root and cwd information.

We might want to make the rule that "prepend_path()" is always called
with the RCU lock held, but the RCU lock nests fine and this is the
minimal fix.
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8b19e341

vfs: move get_fs_root_and_pwd() to single caller · 5762482f

由 Linus Torvalds 提交于 9月 12, 2013

Let's not pollute the include files with inline functions that are only
used in a single place. Especially not if we decide we might want to
change the semantics of said function to make it more efficient..
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5762482f

dcache: get/release read lock in read_seqbegin_or_lock() & friend · 18129977

由 Waiman Long 提交于 9月 12, 2013

This patch modifies read_seqbegin_or_lock() and need_seqretry() to use
newly introduced read_seqlock_excl() and read_sequnlock_excl()
primitives so that they won't change the sequence number even if they
fall back to take the lock.  This is OK as no change to the protected
data structure is being made.

It will prevent one fallback to lock taking from cascading into a series
of lock taking reducing performance because of the sequence number
change.  It will also allow other sequence readers to go forward while
an exclusive reader lock is taken.

This patch also updates some of the inaccurate comments in the code.
Signed-off-by: NWaiman Long <Waiman.Long@hp.com>
To: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

18129977

12 9月, 2013 15 次提交

xfs: remove dead code from xlog_recover_inode_pass2 · 08474ed6

由 Mark Tinguely 提交于 9月 12, 2013

Additional code in the error handler of xlog_recover_inode_pass2()
results in the following error:

static checker warning: "fs/xfs/xfs_log_recover.c:2999
xlog_recover_inode_pass2()
	 info: ignoring unreachable code."
Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: NMark Tinguely <tinguely@sgi.com>
Reviewed-by: Ben Myers <bpm@sgi.com
Signed-off-by: NBen Myers <bpm@sgi.com>

08474ed6

xfs: = vs == typo in ASSERT() · aa9e1040

由 Dan Carpenter 提交于 9月 12, 2013

There is a '=' vs '==' typo so the ASSERT()s are always true.
Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NEric Sandeen <sandeen@redhat.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

aa9e1040

initmpfs: move rootfs code from fs/ramfs/ to init/ · 57f150a5

由 Rob Landley 提交于 9月 11, 2013

When the rootfs code was a wrapper around ramfs, having them in the same
file made sense.  Now that it can wrap another filesystem type, move it in
with the init code instead.

This also allows a subsequent patch to access rootfstype= command line
arg.
Signed-off-by: NRob Landley <rob@landley.net>
Cc: Jeff Layton <jlayton@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Stephen Warren <swarren@nvidia.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jim Cromie <jim.cromie@gmail.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

57f150a5

initmpfs: move bdi setup from init_rootfs to init_ramfs · 4bbee76b

由 Rob Landley 提交于 9月 11, 2013

Even though ramfs hasn't got a backing device, commit e0bf68dd ("mm:
bdi init hooks") added one anyway, and put the initialization in
init_rootfs() since that's the first user, leaving it out of init_ramfs()
to avoid duplication.

But initmpfs uses init_tmpfs() instead, so move the init into the
filesystem's init function, add a "once" guard to prevent duplicate
initialization, and call the filesystem init from rootfs init.

This goes part of the way to allowing ramfs to be built as a module.

[akpm@linux-foundation.org; using bit 1 was odd]
Signed-off-by: NRob Landley <rob@landley.net>
Cc: Jeff Layton <jlayton@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Stephen Warren <swarren@nvidia.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jim Cromie <jim.cromie@gmail.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4bbee76b

initmpfs: replace MS_NOUSER in initramfs · 137fdcc1

由 Rob Landley 提交于 9月 11, 2013

Mounting MS_NOUSER prevents --bind mounts from rootfs.  Prevent new rootfs
mounts with a different mechanism that doesn't affect bind mounts.
Signed-off-by: NRob Landley <rob@landley.net>
Cc: Jeff Layton <jlayton@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Stephen Warren <swarren@nvidia.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jim Cromie <jim.cromie@gmail.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

137fdcc1

lib/radix-tree.c: make radix_tree_node_alloc() work correctly within interrupt · 5e4c0d97

由 Jan Kara 提交于 9月 11, 2013

With users of radix_tree_preload() run from interrupt (block/blk-ioc.c is
one such possible user), the following race can happen:

radix_tree_preload()
...
radix_tree_insert()
  radix_tree_node_alloc()
    if (rtp->nr) {
      ret = rtp->nodes[rtp->nr - 1];
<interrupt>
...
radix_tree_preload()
...
radix_tree_insert()
  radix_tree_node_alloc()
    if (rtp->nr) {
      ret = rtp->nodes[rtp->nr - 1];

And we give out one radix tree node twice.  That clearly results in radix
tree corruption with different results (usually OOPS) depending on which
two users of radix tree race.

We fix the problem by making radix_tree_node_alloc() always allocate fresh
radix tree nodes when in interrupt.  Using preloading when in interrupt
doesn't make sense since all the allocations have to be atomic anyway and
we cannot steal nodes from process-context users because some users rely
on radix_tree_insert() succeeding after radix_tree_preload().
in_interrupt() check is somewhat ugly but we cannot simply key off passed
gfp_mask as that is acquired from root_gfp_mask() and thus the same for
all preload users.

Another part of the fix is to avoid node preallocation in
radix_tree_preload() when passed gfp_mask doesn't allow waiting.  Again,
preallocation in such case doesn't make sense and when preallocation would
happen in interrupt we could possibly leak some allocated nodes.  However,
some users of radix_tree_preload() require following radix_tree_insert()
to succeed.  To avoid unexpected effects for these users,
radix_tree_preload() only warns if passed gfp mask doesn't allow waiting
and we provide a new function radix_tree_maybe_preload() for those users
which get different gfp mask from different call sites and which are
prepared to handle radix_tree_insert() failure.
Signed-off-by: NJan Kara <jack@suse.cz>
Cc: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5e4c0d97

affs: use loff_t in affs_truncate() · 63259326

由 Dan Carpenter 提交于 9月 11, 2013

It seems pretty unlikely that AFFS supports files over 4GB but we may as
well leave use loff_t just for cleanness sake instead of truncating it to
32 bits.
Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
Cc: Marco Stornelli <marco.stornelli@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

63259326

vmcore: enable /proc/vmcore mmap for s390 · 11e376a3

由 Michael Holzheu 提交于 9月 11, 2013

The patch "s390/vmcore: Implement remap_oldmem_pfn_range for s390" allows
now to use mmap also on s390.

So enable mmap for s390 again.
Signed-off-by: NMichael Holzheu <holzheu@linux.vnet.ibm.com>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: Jan Willeke <willeke@de.ibm.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

11e376a3

vmcore: introduce remap_oldmem_pfn_range() · 9cb21813

由 Michael Holzheu 提交于 9月 11, 2013

For zfcpdump we can't map the HSA storage because it is only available via
a read interface.  Therefore, for the new vmcore mmap feature we have
introduce a new mechanism to create mappings on demand.

This patch introduces a new architecture function remap_oldmem_pfn_range()
that should be used to create mappings with remap_pfn_range() for oldmem
areas that can be directly mapped.  For zfcpdump this is everything
besides of the HSA memory.  For the areas that are not mapped by
remap_oldmem_pfn_range() a generic vmcore a new generic vmcore fault
handler mmap_vmcore_fault() is called.

This handler works as follows:

* Get already available or new page from page cache (find_or_create_page)
* Check if /proc/vmcore page is filled with data (PageUptodate)
* If yes:
  Return that page
* If no:
  Fill page using __vmcore_read(), set PageUptodate, and return page
Signed-off-by: NMichael Holzheu <holzheu@linux.vnet.ibm.com>
Acked-by: NVivek Goyal <vgoyal@redhat.com>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: Jan Willeke <willeke@de.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9cb21813

vmcore: introduce ELF header in new memory feature · be8a8d06

由 Michael Holzheu 提交于 9月 11, 2013

For s390 we want to use /proc/vmcore for our SCSI stand-alone dump
(zfcpdump).  We have support where the first HSA_SIZE bytes are saved into
a hypervisor owned memory area (HSA) before the kdump kernel is booted.
When the kdump kernel starts, it is restricted to use only HSA_SIZE bytes.

The advantages of this mechanism are:

 * No crashkernel memory has to be defined in the old kernel.
 * Early boot problems (before kexec_load has been done) can be dumped
 * Non-Linux systems can be dumped.

We modify the s390 copy_oldmem_page() function to read from the HSA memory
if memory below HSA_SIZE bytes is requested.

Since we cannot use the kexec tool to load the kernel in this scenario,
we have to build the ELF header in the 2nd (kdump/new) kernel.

So with the following patch set we would like to introduce the new
function that the ELF header for /proc/vmcore can be created in the 2nd
kernel memory.

The following steps are done during zfcpdump execution:

1.  Production system crashes
2.  User boots a SCSI disk that has been prepared with the zfcpdump tool
3.  Hypervisor saves CPU state of boot CPU and HSA_SIZE bytes of memory into HSA
4.  Boot loader loads kernel into low memory area
5.  Kernel boots and uses only HSA_SIZE bytes of memory
6.  Kernel saves registers of non-boot CPUs
7.  Kernel does memory detection for dump memory map
8.  Kernel creates ELF header for /proc/vmcore
9.  /proc/vmcore uses this header for initialization
10. The zfcpdump user space reads /proc/vmcore to write dump to SCSI disk
    - copy_oldmem_page() copies from HSA for memory below HSA_SIZE
    - copy_oldmem_page() copies from real memory for memory above HSA_SIZE

Currently for s390 we create the ELF core header in the 2nd kernel with a
small trick.  We relocate the addresses in the ELF header in a way that
for the /proc/vmcore code it seems to be in the 1st kernel (old) memory
and the read_from_oldmem() returns the correct data.  This allows the
/proc/vmcore code to use the ELF header in the 2nd kernel.

This patch:

Exchange the old mechanism with the new and much cleaner function call
override feature that now offcially allows to create the ELF core header
in the 2nd kernel.

To use the new feature the following function have to be defined
by the architecture backend code to read from new memory:

 * elfcorehdr_alloc: Allocate ELF header
 * elfcorehdr_free: Free the memory of the ELF header
 * elfcorehdr_read: Read from ELF header
 * elfcorehdr_read_notes: Read from ELF notes
Signed-off-by: NMichael Holzheu <holzheu@linux.vnet.ibm.com>
Acked-by: NVivek Goyal <vgoyal@redhat.com>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: Jan Willeke <willeke@de.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

be8a8d06

exec: cleanup the error handling in search_binary_handler() · 6b3c538f

由 Oleg Nesterov 提交于 9月 11, 2013

The error hanling and ret-from-loop look confusing and inconsistent.

- "retval >= 0" simply returns

- "!bprm->file" returns too but with read_unlock() because
   binfmt_lock was already re-acquired

- "retval != -ENOEXEC || bprm->mm == NULL" does "break" and
  relies on the same check after the main loop

Consolidate these checks into a single if/return statement.

need_retry still checks "retval == -ENOEXEC", but this and -ENOENT before
the main loop are not needed.  This is only for pathological and
impossible list_empty(&formats) case.

It is not clear why do we check "bprm->mm == NULL", probably this
should be removed.
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
Acked-by: NKees Cook <keescook@chromium.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Cc: Zach Levis <zml@linux.vnet.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

6b3c538f

exec: don't retry if request_module() fails · 4e0621a0

由 Oleg Nesterov 提交于 9月 11, 2013

A separate one-liner for better documentation.

It doesn't make sense to retry if request_module() fails to exec
/sbin/modprobe, add the additional "request_module() < 0" check.

However, this logic still doesn't look exactly right:

1. It would be better to check "request_module() != 0", the user
   space modprobe process should report the correct exit code.
   But I didn't dare to add the user-visible change.

2. The whole ENOEXEC logic looks suboptimal. Suppose that we try
   to exec a "#!path-to-unsupported-binary" script. In this case
   request_module() + "retry" will be done twice: first by the
   "depth == 1" code, and then again by the "depth == 0" caller
   which doesn't make sense.

3. And note that in the case above bprm->buf was already changed
   by load_script()->prepare_binprm(), so this looks even more
   ugly.
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
Acked-by: NKees Cook <keescook@chromium.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Cc: Zach Levis <zml@linux.vnet.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4e0621a0

exec: cleanup the CONFIG_MODULES logic · cb7b6b1c

由 Oleg Nesterov 提交于 9月 11, 2013

search_binary_handler() uses "for (try=0; try<2; try++)" to avoid "goto"
but the code looks too complicated and horrible imho.  We still need to
check "try == 0" before request_module() and add the additional "break"
for !CONFIG_MODULES case.

Kill this loop and use a simple "bool need_retry" + "goto retry".  The
code looks much simpler and we do not even need ifdef's, gcc can optimize
out the "if (need_retry)" block if !IS_ENABLED().
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
Acked-by: NKees Cook <keescook@chromium.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Cc: Zach Levis <zml@linux.vnet.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

cb7b6b1c

exec: kill ->load_binary != NULL check in search_binary_handler() · 92eaa565

由 Oleg Nesterov 提交于 9月 11, 2013

search_binary_handler() checks ->load_binary != NULL for no reason, this
method should be always defined.  Turn this check into WARN_ON() and move
it into __register_binfmt().

Also, kill the function pointer.  The current code looks confusing, as if
->load_binary can go away after read_unlock(&binfmt_lock).  But we rely on
module_get(fmt->module), this fmt can't be changed or unregistered,
otherwise this code is buggy anyway.
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
Acked-by: NKees Cook <keescook@chromium.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Cc: Zach Levis <zml@linux.vnet.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

92eaa565

exec: move allow_write_access/fput to exec_binprm() · 52f14282

由 Oleg Nesterov 提交于 9月 11, 2013

When search_binary_handler() succeeds it does allow_write_access() and
fput(), then it clears bprm->file to ensure the caller will not do the
same.

We can simply move this code to exec_binprm() which is called only once.
In fact we could move this to free_bprm() and remove the same code in
do_execve_common's error path.
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
Acked-by: NKees Cook <keescook@chromium.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Cc: Zach Levis <zml@linux.vnet.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

52f14282