提交 · 4023bfc9f351a7994fb6a7d515476c320f94a574 · openeuler / Kernel

15 9月, 2014 2 次提交

be careful with nd->inode in path_init() and follow_dotdot_rcu() · 4023bfc9

由 Al Viro 提交于 9月 13, 2014

in the former we simply check if dentry is still valid after picking
its ->d_inode; in the latter we fetch ->d_inode in the same places
where we fetch dentry and its ->d_seq, under the same checks.

Cc: stable@vger.kernel.org # 2.6.38+
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

4023bfc9

don't bugger nd->seq on set_root_rcu() from follow_dotdot_rcu() · 7bd88377

由 Al Viro 提交于 9月 13, 2014

return the value instead, and have path_init() do the assignment.  Broken by
"vfs: Fix absolute RCU path walk failures due to uninitialized seq number",
which was Cc-stable with 2.6.38+ as destination.  This one should go where
it went.

To avoid dummy value returned in case when root is already set (it would do
no harm, actually, since the only caller that doesn't ignore the return value
is guaranteed to have nd->root *not* set, but it's more obvious that way),
lift the check into callers.  And do the same to set_root(), to keep them
in sync.

Cc: stable@vger.kernel.org # 2.6.38+
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

7bd88377

14 9月, 2014 2 次提交

fix bogus read_seqretry() checks introduced in · f5be3e29

由 Al Viro 提交于 9月 13, 2014

read_seqretry() returns true on mismatch, not on match...

Cc: stable@vger.kernel.org # 3.15+
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

f5be3e29

A
move the call of __d_drop(anon) into __d_materialise_unique(dentry, anon) · 6f18493e
由 Al Viro 提交于 9月 11, 2014
```
and lock the right list there

Cc: stable@vger.kernel.org
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
6f18493e

09 9月, 2014 2 次提交

lockd: fix rpcbind crash on lockd startup failure · 7c17705e

由 J. Bruce Fields 提交于 8月 29, 2014

Nikita Yuschenko reported that booting a kernel with init=/bin/sh and
then nfs mounting without portmap or rpcbind running using a busybox
mount resulted in:

  # mount -t nfs 10.30.130.21:/opt /mnt
  svc: failed to register lockdv1 RPC service (errno 111).
  lockd_up: makesock failed, error=-111
  Unable to handle kernel paging request for data at address 0x00000030
  Faulting instruction address: 0xc055e65c
  Oops: Kernel access of bad area, sig: 11 [#1]
  MPC85xx CDS
  Modules linked in:
  CPU: 0 PID: 1338 Comm: mount Not tainted 3.10.44.cge #117
  task: cf29cea0 ti: cf35c000 task.ti: cf35c000
  NIP: c055e65c LR: c0566490 CTR: c055e648
  REGS: cf35dad0 TRAP: 0300   Not tainted  (3.10.44.cge)
  MSR: 00029000 <CE,EE,ME>  CR: 22442488  XER: 20000000
  DEAR: 00000030, ESR: 00000000

  GPR00: c05606f4 cf35db80 cf29cea0 cf0ded80 cf0dedb8 00000001 1dec3086
  00000000
  GPR08: 00000000 c07b1640 00000007 1dec3086 22442482 100b9758 00000000
  10090ae8
  GPR16: 00000000 000186a5 00000000 00000000 100c3018 bfa46edc 100b0000
  bfa46ef0
  GPR24: cf386ae0 c07834f0 00000000 c0565f88 00000001 cf0dedb8 00000000
  cf0ded80
  NIP [c055e65c] call_start+0x14/0x34
  LR [c0566490] __rpc_execute+0x70/0x250
  Call Trace:
  [cf35db80] [00000080] 0x80 (unreliable)
  [cf35dbb0] [c05606f4] rpc_run_task+0x9c/0xc4
  [cf35dbc0] [c0560840] rpc_call_sync+0x50/0xb8
  [cf35dbf0] [c056ee90] rpcb_register_call+0x54/0x84
  [cf35dc10] [c056f24c] rpcb_register+0xf8/0x10c
  [cf35dc70] [c0569e18] svc_unregister.isra.23+0x100/0x108
  [cf35dc90] [c0569e38] svc_rpcb_cleanup+0x18/0x30
  [cf35dca0] [c0198c5c] lockd_up+0x1dc/0x2e0
  [cf35dcd0] [c0195348] nlmclnt_init+0x2c/0xc8
  [cf35dcf0] [c015bb5c] nfs_start_lockd+0x98/0xec
  [cf35dd20] [c015ce6c] nfs_create_server+0x1e8/0x3f4
  [cf35dd90] [c0171590] nfs3_create_server+0x10/0x44
  [cf35dda0] [c016528c] nfs_try_mount+0x158/0x1e4
  [cf35de20] [c01670d0] nfs_fs_mount+0x434/0x8c8
  [cf35de70] [c00cd3bc] mount_fs+0x20/0xbc
  [cf35de90] [c00e4f88] vfs_kern_mount+0x50/0x104
  [cf35dec0] [c00e6e0c] do_mount+0x1d0/0x8e0
  [cf35df10] [c00e75ac] SyS_mount+0x90/0xd0
  [cf35df40] [c000ccf4] ret_from_syscall+0x0/0x3c

The addition of svc_shutdown_net() resulted in two calls to
svc_rpcb_cleanup(); the second is no longer necessary and crashes when
it calls rpcb_register_call with clnt=NULL.
Reported-by: NNikita Yushchenko <nyushchenko@dev.rtsoft.ru>
Fixes: 679b033d "lockd: ensure we tear down any live sockets when socket creation fails during lockd_up"
Cc: stable@vger.kernel.org
Acked-by: NJeff Layton <jlayton@primarydata.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

7c17705e

nfsd4: fix rd_dircount enforcement · aee37764

由 J. Bruce Fields 提交于 8月 20, 2014

Commit 3b299709 "nfsd4: enforce rd_dircount" totally misunderstood
rd_dircount; it refers to total non-attribute bytes returned, not number
of directory entries returned.

Bring the code into agreement with RFC 3530 section 14.2.24.

Cc: stable@vger.kernel.org
Fixes: 3b299709 "nfsd4: enforce rd_dircount"
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

aee37764

08 9月, 2014 1 次提交

ufs: fix deadlocks introduced by sb mutex merge · 9ef7db7f

由 Alexey Khoroshilov 提交于 9月 02, 2014

Commit 0244756e ("ufs: sb mutex merge + mutex_destroy") introduces
deadlocks in ufs_new_inode() and ufs_free_inode().
Most callers of that functions acqure the mutex by themselves and
ufs_{new,free}_inode() do that via lock_ufs(),
i.e we have an unavoidable double lock.

The patch proposes to resolve the issue by making sure that
ufs_{new,free}_inode() are not called with the mutex held.

Found by Linux Driver Verification project (linuxtesting.org).

Cc: stable@vger.kernel.org # 3.16
Signed-off-by: NAlexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

9ef7db7f

05 9月, 2014 8 次提交

Export sync_filesystem() for modular ->remount_fs() use · 10096fb1

由 Anton Altaparmakov 提交于 8月 21, 2014

This patch changes sync_filesystem() to be EXPORT_SYMBOL().

The reason this is needed is that starting with 3.15 kernel, due to
Theodore Ts'o's commit 02b9984d ("fs: push sync_filesystem() down to
the file system's remount_fs()"), all file systems that have dirty data
to be written out need to call sync_filesystem() from their
->remount_fs() method when remounting read-only.

As this is now a generically required function rather than an internal
only function it should be EXPORT_SYMBOL() so that all file systems can
call it.
Signed-off-by: NAnton Altaparmakov <aia21@cantab.net>
Acked-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

10096fb1

aio: block exit_aio() until all context requests are completed · 6098b45b

由 Gu Zheng 提交于 9月 03, 2014

It seems that exit_aio() also needs to wait for all iocbs to complete (like
io_destroy), but we missed the wait step in current implemention, so fix
it in the same way as we did in io_destroy.
Signed-off-by: NGu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: NBenjamin LaHaise <bcrl@kvack.org>
Cc: stable@vger.kernel.org

6098b45b

A
udf: saner calling conventions for udf_new_inode() · 0b93a92b
由 Al Viro 提交于 9月 04, 2014
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NJan Kara <jack@suse.cz>
```
0b93a92b

udf: fix the udf_iget() vs. udf_new_inode() races · b2315096

由 Al Viro 提交于 9月 04, 2014

Currently udf_iget() (triggered by NFS) can race with udf_new_inode()
leading to two inode structures with the same inode number:

nfsd: iget_locked() creates inode
nfsd: try to read from disk, block on that.
udf_new_inode(): allocate inode with that inumber
udf_new_inode(): insert it into icache, set it up and dirty
udf_write_inode(): write inode into buffer cache
nfsd: get CPU again, look into buffer cache, see nice and sane on-disk
  inode, set the in-core inode from it

Fix the problem by putting inode into icache in locked state (I_NEW set)
and unlocking it only after it's fully set up.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NJan Kara <jack@suse.cz>

b2315096

udf: merge the pieces inserting a new non-directory object into directory · d2be51cb

由 Al Viro 提交于 9月 04, 2014

boilerplate code in udf_{create,mknod,symlink} taken to new helper

symlink case converted to unique id calculated by udf_new_inode() - no
point finding a new one.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NJan Kara <jack@suse.cz>

d2be51cb

udf: Set i_generation field · 470cca56

由 Jan Kara 提交于 9月 04, 2014

Currently UDF doesn't initialize i_generation in any way and thus NFS
can easily get reallocated inodes from stale file handles. Luckily UDF
already has a unique object identifier associated with each inode -
i_unique. Use that for initialization of i_generation.
Signed-off-by: NJan Kara <jack@suse.cz>

470cca56

udf: Properly detect stale inodes · 4071b913

由 Jan Kara 提交于 9月 04, 2014

NFS can easily ask for inodes that are already deleted. Currently UDF
happily returns such inodes which is a bug. Return -ESTALE if
udf_read_inode() is asked to read deleted inode.
Signed-off-by: NJan Kara <jack@suse.cz>

4071b913

udf: Make udf_read_inode() and udf_iget() return error · 6d3d5e86

由 Jan Kara 提交于 9月 04, 2014

Currently __udf_read_inode() wasn't returning anything and we found out
whether we succeeded reading inode by checking whether inode is bad or
not. udf_iget() returned NULL on failure and inode pointer otherwise.
Make these two functions properly propagate errors up the call stack and
use the return value in callers.
Signed-off-by: NJan Kara <jack@suse.cz>

6d3d5e86

04 9月, 2014 3 次提交

udf: Avoid infinite loop when processing indirect ICBs · c03aa9f6

由 Jan Kara 提交于 9月 04, 2014

We did not implement any bound on number of indirect ICBs we follow when
loading inode. Thus corrupted medium could cause kernel to go into an
infinite loop, possibly causing a stack overflow.

Fix the possible stack overflow by removing recursion from
__udf_read_inode() and limit number of indirect ICBs we follow to avoid
infinite loops.
Signed-off-by: NJan Kara <jack@suse.cz>

c03aa9f6

udf: Fold udf_fill_inode() into __udf_read_inode() · bb7720a0

由 Jan Kara 提交于 9月 04, 2014

There's no good reason to separate these since udf_fill_inode() is
called only from __udf_read_inode() and both do part of the same thing.
Signed-off-by: NJan Kara <jack@suse.cz>

bb7720a0

udf: Avoid dir link count to go negative · 8a70ee33

由 Jan Kara 提交于 9月 04, 2014

If we are writing back inode of unlinked directory, its link count ends
up being (u16)-1. Although the inode is deleted, udf_iget() can load the
inode when NFS uses stale file handle and get confused.
Signed-off-by: NJan Kara <jack@suse.cz>

8a70ee33

03 9月, 2014 2 次提交

ext4: avoid trying to kfree an ERR_PTR pointer · a9cfcd63

由 Theodore Ts'o 提交于 9月 03, 2014

Thanks to Dan Carpenter for extending smatch to find bugs like this.
(This was found using a development version of smatch.)

Fixes: 36de9286
Reported-by: Dan Carpenter <dan.carpenter@oracle.com
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org

a9cfcd63

aio: add missing smp_rmb() in read_events_ring · 2ff396be

由 Jeff Moyer 提交于 9月 02, 2014

We ran into a case on ppc64 running mariadb where io_getevents would
return zeroed out I/O events.  After adding instrumentation, it became
clear that there was some missing synchronization between reading the
tail pointer and the events themselves.  This small patch fixes the
problem in testing.

Thanks to Zach for helping to look into this, and suggesting the fix.
Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
Signed-off-by: NBenjamin LaHaise <bcrl@kvack.org>
Cc: stable@vger.kernel.org

2ff396be

02 9月, 2014 8 次提交

f2fs: reposition unlock_new_inode to prevent accessing invalid inode · b73e5282

由 Chao Yu 提交于 8月 30, 2014

As the race condition on the inode cache, following scenario can appear:
[Thread a]				[Thread b]
					->f2fs_mkdir
					  ->f2fs_add_link
					    ->__f2fs_add_link
					      ->init_inode_metadata failed here
->gc_thread_func
  ->f2fs_gc
    ->do_garbage_collect
      ->gc_data_segment
        ->f2fs_iget
          ->iget_locked
            ->wait_on_inode
					  ->unlock_new_inode
        ->move_data_page
					  ->make_bad_inode
					  ->iput

When we fail in create/symlink/mkdir/mknod/tmpfile, the new allocated inode
should be set as bad to avoid being accessed by other thread. But in above
scenario, it allows f2fs to access the invalid inode before this inode was set
as bad.
This patch fix the potential problem, and this issue was found by code review.

change log from v1:
 o Add condition judgment in gc_data_segment() suggested by Changman Lee.
 o use iget_failed to simplify code.
Signed-off-by: NChao Yu <chao2.yu@samsung.com>
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

b73e5282

xfs: trim eofblocks before collapse range · 41b9d726

由 Brian Foster 提交于 9月 02, 2014

xfs_collapse_file_space() currently writes back the entire file
undergoing collapse range to settle things down for the extent shift
algorithm. While this prevents changes to the extent list during the
collapse operation, the writeback itself is not enough to prevent
unnecessary collapse failures.

The current shift algorithm uses the extent index to iterate the in-core
extent list. If a post-eof delalloc extent persists after the writeback
(e.g., a prior zero range op where the end of the range aligns with eof
can separate the post-eof blocks such that they are not written back and
converted), xfs_bmap_shift_extents() becomes confused over the encoded
br_startblock value and fails the collapse.

As with the full writeback, this is a temporary fix until the algorithm
is improved to cope with a volatile extent list and avoid attempts to
shift post-eof extents.
Signed-off-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDave Chinner <david@fromorbit.com>

41b9d726

xfs: xfs_file_collapse_range is delalloc challenged · 1669a8ca

由 Dave Chinner 提交于 9月 02, 2014

If we have delalloc extents on a file before we run a collapse range
opertaion, we sync the range that we are going to collapse to
convert delalloc extents in that region to real extents to simplify
the shift operation.

However, the shift operation then assumes that the extent list is
not going to change as it iterates over the extent list moving
things about. Unfortunately, this isn't true because we can't hold
the ILOCK over all the operations. We can prevent new IO from
modifying the extent list by holding the IOLOCK, but that doesn't
prevent writeback from running....

And when writeback runs, it can convert delalloc extents is the
range of the file prior to the region being collapsed, and this
changes the indexes of all the extents in the file. That causes the
collapse range operation to Go Bad.

The right fix is to rewrite the extent shift operation not to be
dependent on the extent list not changing across the entire
operation, but this is a fairly significant piece of work to do.
Hence, as a short-term workaround for the problem, sync the entire
file before starting a collapse operation to remove all delalloc
ranges from the file and so avoid the problem of concurrent
writeback changing the extent list.
Diagnosed-and-Reported-by: NBrian Foster <bfoster@redhat.com>
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDave Chinner <david@fromorbit.com>

1669a8ca

xfs: don't log inode unless extent shift makes extent modifications · ca446d88

由 Brian Foster 提交于 9月 02, 2014

The file collapse mechanism uses xfs_bmap_shift_extents() to collapse
all subsequent extents down into the specified, previously punched out,
region. This function performs some validation, such as whether a
sufficient hole exists in the target region of the collapse, then shifts
the remaining exents downward.

The exit path of the function currently logs the inode unconditionally.
While we must log the inode (and abort) if an error occurs and the
transaction is dirty, the initial validation paths can generate errors
before the transaction has been dirtied. This creates an unnecessary
filesystem shutdown scenario, as the caller will cancel a transaction
that has been marked dirty.

Modify xfs_bmap_shift_extents() to OR the logflags bits as modifications
are made to the inode bmap. Only log the inode in the exit path if
logflags has been set. This ensures we only have to cancel a dirty
transaction if modifications have been made and prevents an unnecessary
filesystem shutdown otherwise.
Signed-off-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDave Chinner <david@fromorbit.com>

ca446d88

xfs: use ranged writeback and invalidation for direct IO · 7d4ea3ce

由 Dave Chinner 提交于 9月 02, 2014

Now we are not doing silly things with dirtying buffers beyond EOF
and using invalidation correctly, we can finally reduce the ranges of
writeback and invalidation used by direct IO to match that of the IO
being issued.

Bring the writeback and invalidation ranges back to match the
generic direct IO code - this will greatly reduce the perturbation
of cached data when direct IO and buffered IO are mixed, but still
provide the same buffered vs direct IO coherency behaviour we
currently have.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDave Chinner <david@fromorbit.com>

7d4ea3ce

xfs: don't zero partial page cache pages during O_DIRECT writes · 834ffca6

由 Dave Chinner 提交于 9月 02, 2014

Similar to direct IO reads, direct IO writes are using 
truncate_pagecache_range to invalidate the page cache. This is
incorrect due to the sub-block zeroing in the page cache that
truncate_pagecache_range() triggers.

This patch fixes things by using invalidate_inode_pages2_range
instead.  It preserves the page cache invalidation, but won't zero
any pages.

cc: stable@vger.kernel.org
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDave Chinner <david@fromorbit.com>

834ffca6

xfs: don't zero partial page cache pages during O_DIRECT writes · 85e584da

由 Chris Mason 提交于 9月 02, 2014

xfs is using truncate_pagecache_range to invalidate the page cache
during DIO reads.  This is different from the other filesystems who
only invalidate pages during DIO writes.

truncate_pagecache_range is meant to be used when we are freeing the
underlying data structs from disk, so it will zero any partial
ranges in the page.  This means a DIO read can zero out part of the
page cache page, and it is possible the page will stay in cache.

buffered reads will find an up to date page with zeros instead of
the data actually on disk.

This patch fixes things by using invalidate_inode_pages2_range
instead.  It preserves the page cache invalidation, but won't zero
any pages.

[dchinner: catch error and warn if it fails. Comment.]

cc: stable@vger.kernel.org
Signed-off-by: NChris Mason <clm@fb.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDave Chinner <david@fromorbit.com>

85e584da

xfs: don't dirty buffers beyond EOF · 22e757a4

由 Dave Chinner 提交于 9月 02, 2014

generic/263 is failing fsx at this point with a page spanning
EOF that cannot be invalidated. The operations are:

1190 mapwrite   0x52c00 thru    0x5e569 (0xb96a bytes)
1191 mapread    0x5c000 thru    0x5d636 (0x1637 bytes)
1192 write      0x5b600 thru    0x771ff (0x1bc00 bytes)

where 1190 extents EOF from 0x54000 to 0x5e569. When the direct IO
write attempts to invalidate the cached page over this range, it
fails with -EBUSY and so any attempt to do page invalidation fails.

The real question is this: Why can't that page be invalidated after
it has been written to disk and cleaned?

Well, there's data on the first two buffers in the page (1k block
size, 4k page), but the third buffer on the page (i.e. beyond EOF)
is failing drop_buffers because it's bh->b_state == 0x3, which is
BH_Uptodate | BH_Dirty.  IOWs, there's dirty buffers beyond EOF. Say
what?

OK, set_buffer_dirty() is called on all buffers from
__set_page_buffers_dirty(), regardless of whether the buffer is
beyond EOF or not, which means that when we get to ->writepage,
we have buffers marked dirty beyond EOF that we need to clean.
So, we need to implement our own .set_page_dirty method that
doesn't dirty buffers beyond EOF.

This is messy because the buffer code is not meant to be shared
and it has interesting locking issues on the buffer dirty bits.
So just copy and paste it and then modify it to suit what we need.

Note: the solutions the other filesystems and generic block code use
of marking the buffers clean in ->writepage does not work for XFS.
It still leaves dirty buffers beyond EOF and invalidations still
fail. Hence rather than play whack-a-mole, this patch simply
prevents those buffers from being dirtied in the first place.

cc: <stable@kernel.org>
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

22e757a4

31 8月, 2014 2 次提交

fix EBUSY on umount() from MNT_SHRINKABLE · 81b6b061

由 Al Viro 提交于 8月 30, 2014

We need the parents of victims alive until namespace_unlock() gets to
dput() of the (ex-)mountpoints. However, that screws up the "is it
busy" checks in case when we have shrinkable mounts that need to be
killed. Solution: go ahead and decrement refcounts of parents right
in umount_tree(), increment them again just before dropping rwsem in
namespace_unlock() (and let the loop in the end of namespace_unlock()
finally drop those references for good, as we do now). Parents can't
get freed until we drop rwsem - at least one reference is kept until
then, both in case when parent is among the victims and when it is
not. So they'll still be around when we get to namespace_unlock().

Cc: stable@vger.kernel.org # 3.12+
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

81b6b061

get rid of propagate_umount() mistakenly treating slaves as busy. · 88b368f2

由 Al Viro 提交于 8月 18, 2014

The check in __propagate_umount() ("has somebody explicitly mounted
something on that slave?") is done *before* taking the already doomed
victims out of the child lists.

Cc: stable@vger.kernel.org
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

88b368f2

30 8月, 2014 4 次提交

ocfs2: quorum: add a log for node not fenced · 8c7b638c

由 Junxiao Bi 提交于 8月 29, 2014

For debug use, we can see from the log whether the fence decision is
made and why it is not fenced.
Signed-off-by: NJunxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: NSrinivas Eeda <srinivas.eeda@oracle.com>
Reviewed-by: NMark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8c7b638c

ocfs2: o2net: set tcp user timeout to max value · 8e9801df

由 Junxiao Bi 提交于 8月 29, 2014

When tcp retransmit timeout(15mins), the connection will be closed.
Pending messages may be lost during this time.  So we set tcp user
timeout to override the retransmit timeout to the max value.  This is OK
for ocfs2 since we have disk heartbeat, if peer crash, the disk
heartbeat will timeout and it will be evicted, if disk heartbeat not
timeout and connection idle for a long time, then this means the cluster
enters split-brain state, since fence can't happen, we'd better keep the
connection and wait network recover.
Signed-off-by: NJunxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: NSrinivas Eeda <srinivas.eeda@oracle.com>
Reviewed-by: NMark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8e9801df

ocfs2: o2net: don't shutdown connection when idle timeout · c43c363d

由 Junxiao Bi 提交于 8月 29, 2014

This patch series is to fix a possible message lost bug in ocfs2 when
network go bad.  This bug will cause ocfs2 hung forever even network
become good again.

The messages may lost in this case.  After the tcp connection is
established between two nodes, an idle timer will be set to check its
state periodically, if no messages are received during this time, idle
timer will timeout, it will shutdown the connection and try to
reconnect, so pending messages in tcp queues will be lost.  This
messages may be from dlm.  Dlm may get hung in this case.  This may
cause the whole ocfs2 cluster hung.

This is very possible to happen when network state goes bad.  Do the
reconnect is useless, it will fail if network state is still bad.  Just
waiting there for network recovering may be a good idea, it will not
lost messages and some node will be fenced until cluster goes into
split-brain state, for this case, Tcp user timeout is used to override
the tcp retransmit timeout.  It will timeout after 25 days, user should
have notice this through the provided log and fix the network, if they
don't, ocfs2 will fall back to original reconnect way.

This patch (of 3):

Some messages in the tcp queue maybe lost if we shutdown the connection
and reconnect when idle timeout.  If packets lost and reconnect success,
then the ocfs2 cluster maybe hung.

To fix this, we can leave the connection there and do the fence decision
when idle timeout, if network recover before fence dicision is made, the
connection survive without lost any messages.

This bug can be saw when network state go bad.  It may cause ocfs2 hung
forever if some packets lost.  With this fix, ocfs2 will recover from
hung if network becomes good again.
Signed-off-by: NJunxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: NSrinivas Eeda <srinivas.eeda@oracle.com>
Reviewed-by: NMark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c43c363d

ocfs2: do not write error flag to user structure we cannot copy from/to · 2b462638

由 Ben Hutchings 提交于 8月 29, 2014

If we failed to copy from the structure, writing back the flags leaks 31
bits of kernel memory (the rest of the ir_flags field).

In any case, if we cannot copy from/to the structure, why should we
expect putting just the flags to work?

Also make sure ocfs2_info_handle_freeinode() returns the right error
code if the copy_to_user() fails.

Fixes: ddee5cdb ('Ocfs2: Add new OCFS2_IOC_INFO ioctl for ocfs2 v8.')
Signed-off-by: NBen Hutchings <ben@decadent.org.uk>
Cc: Joel Becker <jlbec@evilplan.org>
Acked-by: NMark Fasheh <mfasheh@suse.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

2b462638

29 8月, 2014 6 次提交

f2fs: fix wrong casting for dentry name · 3304b564

由 Jaegeuk Kim 提交于 8月 29, 2014

The dentry name type is unsigned char *.
If we don't match this type, some character codes can be changed by signed bit.
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

3304b564

ext4: fix same-dir rename when inline data directory overflows · d80d448c

由 Darrick J. Wong 提交于 8月 27, 2014

When performing a same-directory rename, it's possible that adding or
setting the new directory entry will cause the directory to overflow
the inline data area, which causes the directory to be converted to an
extent-based directory.  Under this circumstance it is necessary to
re-read the directory when deleting the old dirent because the "old
directory" context still points to i_block in the inode table, which
is now an extent tree root!  The delete fails with an FS error, and
the subsequent fsck complains about incorrect link counts and
hardlinked directories.

Test case (originally found with flat_dir_test in the metadata_csum
test program):

# mkfs.ext4 -O inline_data /dev/sda
# mount /dev/sda /mnt
# mkdir /mnt/x
# touch /mnt/x/changelog.gz /mnt/x/copyright /mnt/x/README.Debian
# sync
# for i in /mnt/x/*; do mv $i $i.longer; done
# ls -la /mnt/x/
total 0
-rw-r--r-- 1 root root 0 Aug 25 12:03 changelog.gz.longer
-rw-r--r-- 1 root root 0 Aug 25 12:03 copyright
-rw-r--r-- 1 root root 0 Aug 25 12:03 copyright.longer
-rw-r--r-- 1 root root 0 Aug 25 12:03 README.Debian.longer

(Hey!  Why are there four files now??)
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org

d80d448c

jbd2: fix descriptor block size handling errors with journal_csum · db9ee220

由 Darrick J. Wong 提交于 8月 27, 2014

It turns out that there are some serious problems with the on-disk
format of journal checksum v2.  The foremost is that the function to
calculate descriptor tag size returns sizes that are too big.  This
causes alignment issues on some architectures and is compounded by the
fact that some parts of jbd2 use the structure size (incorrectly) to
determine the presence of a 64bit journal instead of checking the
feature flags.

Therefore, introduce journal checksum v3, which enlarges the
descriptor block tag format to allow for full 32-bit checksums of
journal blocks, fix the journal tag function to return the correct
sizes, and fix the jbd2 recovery code to use feature flags to
determine 64bitness.

Add a few function helpers so we don't have to open-code quite so
many pieces.

Switching to a 16-byte block size was found to increase journal size
overhead by a maximum of 0.1%, to convert a 32-bit journal with no
checksumming to a 32-bit journal with checksum v3 enabled.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reported-by: NTR Reardon <thomas_reardon@hotmail.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org

db9ee220

jbd2: fix infinite loop when recovering corrupt journal blocks · 022eaa75

由 Darrick J. Wong 提交于 8月 27, 2014

When recovering the journal, don't fall into an infinite loop if we
encounter a corrupt journal block.  Instead, just skip the block and
return an error, which fails the mount and thus forces the user to run
a full filesystem fsck.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org

022eaa75

ext4: update i_disksize coherently with block allocation on error path · 6603120e

由 Dmitry Monakhov 提交于 8月 27, 2014

In case of delalloc block i_disksize may be less than i_size. So we
have to update i_disksize each time we allocated and submitted some
blocks beyond i_disksize.  We weren't doing this on the error paths,
so fix this.

testcase: xfstest generic/019
Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org

6603120e

f2fs: simplify by using a literal · 922cedbd

由 Dan Carpenter 提交于 8月 28, 2014

We can make the code a bit simpler because we know that "!retry" is
zero.
Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

922cedbd

openeuler / Kernel 大约 1 年 前同步成功

openeuler / Kernel
大约 1 年前同步成功