Commits · af7d93729c7c2beadea8ec5a6e66c53bef0e6290 · openeuler / raspberrypi-kernel

28 May, 2016 8 commits

direct-io: fix direct write stale data exposure from concurrent buffered read · 9ecd10b7

Authored by Eryu Guan on May 27, 2016

Currently direct writes inside i_size on a DIO_SKIP_HOLES filesystem are
not allowed to allocate blocks (get_more_blocks() sets 'create' to 0
before calling the get_block() callback).  If the file is sparse, direct
writes fall back to buffered writes to avoid stale data exposure from a
concurrent buffered read.  But two cases that can result in stale data
exposure are not correctly detected.

1. The detection of "writing inside i_size" is not sufficient, so
   writes can wrongly be treated as "extending writes".  For example,
   consider a direct write of 1FSB (file system block) to a 1FSB sparse
   file on ext2/3/4, starting from offset 0.  This write is inside
   i_size, but 'create' is non-zero, because 'block_in_file' and
   'i_size_read(inode) >> blkbits' are both zero.

2. Direct writes starting from or beyond i_size (not inside i_size)
   could also trigger block allocation and expose stale data.  For
   example, consider a sparse file with an i_size of 2k, and a write to
   offset 2k or 3k into the file, with a filesystem block size of 4k.
   (Thanks to Jeff Moyer for pointing this case out in his review.)

The first problem can be demonstrated by running ltp-aiodio test ADSP045
many times.  When testing on extN filesystems, I occasionally see test
failures where a buffered read returns non-zero (stale) data.

ADSP045: dio_sparse -a 4k -w 4k -s 2k -n 1

dio_sparse    0  TINFO  :  Dirtying free blocks
dio_sparse    0  TINFO  :  Starting I/O tests
non zero buffer at buf[0] => 0xffffffaa,ffffffaa,ffffffaa,ffffffaa
non-zero read at offset 0
dio_sparse    0  TINFO  :  Killing childrens(s)
dio_sparse    1  TFAIL  :  dio_sparse.c:191: 1 children(s) exited abnormally

The second problem can also be reproduced easily by a hacked dio_sparse
program, which accepts an option to specify the write offset.

What we should really do is to disable block allocation for writes that
could result in filling holes inside i_size.

Link: http://lkml.kernel.org/r/1463156728-13357-1-git-send-email-guaneryu@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

9ecd10b7
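
As a rough illustration of the two cases in 9ecd10b7, the sketch below models the 'create' decision in Python. It is not the kernel patch: both function names are hypothetical, the first mirrors the comparison quoted in the message, and the second is one possible stricter check, assumed here, that refuses allocation whenever the write starts at or before the block holding the last byte inside i_size.

```python
def may_allocate_old(offset, i_size, blkbits):
    """Check as quoted in the message: allocation is disabled only when
    the write's block index is strictly below i_size >> blkbits."""
    block_in_file = offset >> blkbits
    return not (block_in_file < (i_size >> blkbits))


def may_allocate_strict(offset, i_size, blkbits):
    """Assumed stricter check: disable allocation whenever the write starts
    at or before the block that holds the last byte inside i_size."""
    if i_size == 0:
        return True  # empty file: no block lies inside i_size
    start_block = offset >> blkbits
    last_block_in_file = (i_size - 1) >> blkbits
    return start_block > last_block_in_file


BLKBITS = 12   # 4k filesystem block
I_SIZE = 2048  # 2k sparse file, as in the second example

# Print offset, old decision, stricter decision for a few write offsets.
for off in (0, 2048, 3072, 4096):
    print(off,
          may_allocate_old(off, I_SIZE, BLKBITS),
          may_allocate_strict(off, I_SIZE, BLKBITS))
```

With these values the old check permits allocation for offsets 0, 2k and 3k, all of which land in the block that also holds data inside i_size, while the stricter check only permits it from offset 4k onward.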

ocfs2: bump up o2cb network protocol version · 38b52efd

Authored by Junxiao Bi on May 27, 2016

Two new messages are added to support negotiating the hb timeout.  Stop
nodes speaking an old version from mounting, as they would cause the
negotiation to fail.

Link: http://lkml.kernel.org/r/1464231615-27939-1-git-send-email-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

38b52efd

ocfs2: o2hb: fix hb hung time · 6633ca57

Authored by Junxiao Bi on May 27, 2016

hr_last_timeout_start should be set to the last time at which hb was
still OK.  When an hb write times out, the hung time will be (jiffies -
hr_last_timeout_start).

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Gang He <ghe@suse.com>
Cc: rwxybh <rwxybh@126.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

6633ca57

ocfs2: o2hb: don't negotiate if last hb fail · 88dbe98d

Authored by Junxiao Bi on May 27, 2016

Sometimes an io error is returned when storage is down for a while.  For
example, with an iscsi device, storage is taken offline when the session
times out, and this makes all io return -EIO.  In this case, nodes
shouldn't negotiate the timeout but should fence themselves.  So let
nodes fence themselves when o2hb_do_disk_heartbeat returns an error;
this is the same behavior as o2hb without the negotiate timer.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Gang He <ghe@suse.com>
Cc: rwxybh <rwxybh@126.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

88dbe98d

ocfs2: o2hb: add some user/debug log · 1bd12902

Authored by Junxiao Bi on May 27, 2016

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Gang He <ghe@suse.com>
Cc: rwxybh <rwxybh@126.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1bd12902

ocfs2: o2hb: add NEGOTIATE_APPROVE message · e76f8237

Authored by Junxiao Bi on May 27, 2016

This message is used to re-queue the write timeout timer and the
negotiate timer when all nodes suffer a write hang to storage.  This
keeps nodes from fencing themselves when storage is down.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Gang He <ghe@suse.com>
Cc: rwxybh <rwxybh@126.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

e76f8237

ocfs2: o2hb: add NEGO_TIMEOUT message · 34069b88

Authored by Junxiao Bi on May 27, 2016

This message is sent to the master node when a non-master node's
negotiate timer expires.  The master node records these nodes in a
bitmap, which is used to decide whether to re-queue the write timeout
timer.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Gang He <ghe@suse.com>
Cc: rwxybh <rwxybh@126.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

34069b88

ocfs2: o2hb: add negotiate timer · e0cbb798

Authored by Junxiao Bi on May 27, 2016

This series of patches fixes the issue that when storage goes down, all
nodes fence themselves due to write timeout.

With this patch set, all nodes will keep going until storage is back
online, unless one of the following issues happens, in which case all
nodes will fence themselves as before.

1. an io error is returned
2. the network between nodes goes down
3. a node panics

This patch (of 6):

When storage goes down, all nodes fence themselves due to write timeout.
The negotiate timer is designed to avoid this; with it, a node will wait
until storage is up again.

The negotiate timer works in the following way:

1. The timer expires before the write timeout timer; its timeout is
   currently half of the write timeout.  It is re-queued along with the
   write timeout timer.  If it expires, it sends a NEGO_TIMEOUT message
   to the master node (the node with the lowest node number).  This
   message does nothing but mark a bit in a bitmap on the master node
   recording which nodes are negotiating the timeout.

2. If storage is down, nodes send this message to the master node.
   When the master node finds that its bitmap includes all online
   nodes, it sends a NEGO_APPROVE message to all nodes one by one; this
   message re-queues the write timeout timer and the negotiate timer.
   Any node that doesn't receive this message, or that hits an issue
   when handling it, will be fenced.  If storage comes up at any time,
   o2hb_thread will run and re-queue all the timers, so nothing will be
   affected by these two steps.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Gang He <ghe@suse.com>
Cc: rwxybh <rwxybh@126.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

e0cbb798
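
The master-side bookkeeping described in e0cbb798 can be pictured with a small Python model. This is only a sketch: the class and method names are hypothetical, a Python set stands in for the kernel bitmap, and clearing the set once approval is granted is an assumption made here for the sake of the example.

```python
class NegotiateMaster:
    """Toy model of the master node's bookkeeping (hypothetical names)."""

    def __init__(self, online_nodes):
        self.online = set(online_nodes)
        self.negotiating = set()  # stands in for the bitmap on the master

    def on_nego_timeout(self, node):
        """Record a NEGO_TIMEOUT from `node`.

        Returns the list of nodes that would be sent the approve message,
        or an empty list if some online node has not reported yet.
        """
        if node not in self.online:
            return []
        self.negotiating.add(node)
        if self.negotiating >= self.online:
            approved = sorted(self.online)
            self.negotiating.clear()  # assumption: start a fresh round
            return approved
        return []


def negotiate_timeout_ms(write_timeout_ms):
    """The negotiate timer fires at half of the write timeout."""
    return write_timeout_ms // 2


master = NegotiateMaster([0, 1, 2])
for node in (1, 0, 2):
    print(node, master.on_nego_timeout(node))
```

Only the report from the last outstanding node produces a non-empty approval list, which matches the rule that the master approves once its bitmap includes all online nodes.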

27 May, 2016 1 commit

ocfs2: fix improper handling of return errno · 1f3a437f

Authored by Eric Ren on May 26, 2016

Previously, if a bad inode was found in ocfs2_iget(), -ESTALE was
always returned to the caller.  Since commit d2b9d71a2da7 ("ocfs2:
check/fix inode block for online file check") can now handle the return
value from ocfs2_read_locked_inode(), we know the exact errno that was
returned to us.

Link: http://lkml.kernel.org/r/1463970656-18413-1-git-send-email-zren@suse.com
Signed-off-by: Eric Ren <zren@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1f3a437f

26 May, 2016 31 commits

pnfs: pnfs_update_layout needs to consider if strict iomode checking is on · c7d73af2

Authored by Tom Haynes on May 25, 2016

As flexfiles has FF_FLAGS_NO_READ_IO, there is a need to generically
support enforcing that an IOMODE_RW segment will not allow READ I/O.

Signed-off-by: Tom Haynes <loghyr@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

c7d73af2

nfs/flexfiles: Use the layout segment for reading unless it a IOMODE_RW and reading is disabled · 602c4cd4

Authored by Tom Haynes on May 25, 2016

Signed-off-by: Tom Haynes <loghyr@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

602c4cd4

ceph: fix wake_up_session_cb() · e5360309

Authored by Yan, Zheng on May 19, 2016

We should reset i_requested_max_size before waking the waiters.
(A zero i_requested_max_size makes the waiter re-request the max size.)

Signed-off-by: Yan, Zheng <zyan@redhat.com>

e5360309

ceph: don't use truncate_pagecache() to invalidate read cache · 9abd4db7

Authored by Yan, Zheng on May 18, 2016

truncate_pagecache() drops dirty pages, so it's dangerous to use it
to invalidate read cache. Besides, we shouldn't start invalidating
read cache while there are buffer writers, because buffer writers
may add dirty pages later.

Signed-off-by: Yan, Zheng <zyan@redhat.com>

9abd4db7

ceph: SetPageError() for writeback pages if writepages fails · b109eec6

Authored by Yan, Zheng on May 13, 2016

Signed-off-by: Yan, Zheng <zyan@redhat.com>

b109eec6

ceph: handle interrupted ceph_writepage() · ad15ec06

Authored by Yan, Zheng on May 13, 2016

writepage() can be interrupted when it's called by the direct memory
reclaimer (the direct memory reclaimer is killed). To avoid losing
data, we redirty the page.

Signed-off-by: Yan, Zheng <zyan@redhat.com>

ad15ec06

ceph: make ceph_update_writeable_page() uninterruptible · a78bbd4b

Authored by Yan, Zheng on May 13, 2016

ceph_update_writeable_page() is used by ceph_write_begin(). It breaks
the atomicity of the write operation if it's interruptible.

Signed-off-by: Yan, Zheng <zyan@redhat.com>

a78bbd4b

ceph: handle -EAGAIN returned by ceph_update_writeable_page() · f0b33df5

Authored by Yan, Zheng on May 10, 2016

When ceph_update_writeable_page() returns -EAGAIN, the caller should
lock the page and call ceph_update_writeable_page() again.

Signed-off-by: Yan, Zheng <zyan@redhat.com>

f0b33df5
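
The retry rule in f0b33df5 amounts to a simple loop. The Python sketch below shows only the control flow; `write_begin_with_retry` and both callbacks are hypothetical stand-ins, and it assumes the update step has released the page by the time it returns -EAGAIN, so the caller has to take the lock again.

```python
import errno


def write_begin_with_retry(lock_page, update_writeable_page):
    """Lock the page and retry for as long as the update step returns -EAGAIN.

    Both callables are stand-ins supplied by the caller; any result other
    than -EAGAIN (success or a real error) is passed straight back.
    """
    while True:
        page = lock_page()
        ret = update_writeable_page(page)
        if ret != -errno.EAGAIN:
            return ret


# Demo: the update step asks for a retry twice, then succeeds.
results = iter([-errno.EAGAIN, -errno.EAGAIN, 0])
locks = []
ret = write_begin_with_retry(
    lambda: locks.append(1) or "page",  # count each lock, return a dummy page
    lambda page: next(results),
)
print(ret, len(locks))
```

In the demo the page is locked three times in total, once per call to the update step.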

ceph: make fault/page_mkwrite return VM_FAULT_OOM for -ENOMEM · 6ce026e4

Authored by Yan, Zheng on May 10, 2016

Signed-off-by: Yan, Zheng <zyan@redhat.com>

6ce026e4

ceph: block non-fatal signals for fault/page_mkwrite · 4f7e89f6

Authored by Yan, Zheng on May 10, 2016

Fault and page_mkwrite are supposed to be uninterruptible. But they
call ceph functions that are interruptible. So they should block
signals before calling functions that are interruptible.

Signed-off-by: Yan, Zheng <zyan@redhat.com>

4f7e89f6

ceph: make logical calculation functions return bool · 3b33f692

Authored by Zhang Zhuoyu on Mar 25, 2016

This patch makes several logical calculation functions return bool to
improve readability, since these particular functions only use 0/1
as their return value.

No functional change.

Signed-off-by: Zhang Zhuoyu <zhangzhuoyu@cmss.chinamobile.com>

3b33f692

ceph: tolerate bad i_size for symlink inode · 224a7542

Authored by Yan, Zheng on May 05, 2016

An mds bug can cause a symlink's size to be truncated to zero.

Signed-off-by: Yan, Zheng <zyan@redhat.com>

224a7542

ceph: improve fragtree change detection · 1b1bc16d

Authored by Yan, Zheng on May 04, 2016

Check if the number of splits in i_fragtree is equal to the number of
splits in the mds reply.

Signed-off-by: Yan, Zheng <zyan@redhat.com>

1b1bc16d

ceph: keep leaf frag when updating fragtree · a4b7431f

Authored by Yan, Zheng on May 04, 2016

Nodes in i_fragtree are sorted according to ceph_frag_compare().
This means a frag node in i_fragtree always follows its direct parent
node. To check if a leaf node is valid, we just need to check if
it's a child of the previous split node.

Signed-off-by: Yan, Zheng <zyan@redhat.com>

a4b7431f

ceph: fix dir_auth check in ceph_fill_dirfrag() · 42172119

Authored by Yan, Zheng on May 03, 2016

-1 is CDIR_AUTH_PARENT; it means the dir's auth mds is the same as
the inode's auth mds.

Signed-off-by: Yan, Zheng <zyan@redhat.com>

42172119

ceph: don't assume frag tree splits in mds reply are sorted · a407846e

Authored by Yan, Zheng on May 03, 2016

The algorithm that updates i_fragtree relies on the frag tree splits
in the mds reply being in the same order as i_fragtree. This is not
true, because the current MDS encodes frag tree splits in ascending
order of (unsigned)frag_t, but nodes in i_fragtree are sorted according
to ceph_frag_compare().

The fix is to sort the frag tree splits first, then update i_fragtree.

Signed-off-by: Yan, Zheng <zyan@redhat.com>

a407846e
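
To see why the two orders in a407846e can disagree, the Python sketch below sorts the same frags both ways. It rests on an assumption not stated in the message: that a frag packs its split depth into the top 8 bits and its value into the low 24 bits, and that the comparison looks at the value first and the depth second. The helper names are hypothetical.

```python
from functools import cmp_to_key


def frag_bits(f):
    """Split depth, assumed to live in the top 8 bits."""
    return (f >> 24) & 0xFF


def frag_value(f):
    """Frag value, assumed to live in the low 24 bits."""
    return f & 0xFFFFFF


def frag_compare(a, b):
    """Order by value first, then by depth (assumed comparison order)."""
    ka = (frag_value(a), frag_bits(a))
    kb = (frag_value(b), frag_bits(b))
    return (ka > kb) - (ka < kb)


def sort_splits(frags):
    """Return the frags in comparison order rather than raw integer order."""
    return sorted(frags, key=cmp_to_key(frag_compare))


# Root, its two children, and the two children of the first child.
frags = [0x00000000, 0x01000000, 0x01800000, 0x02000000, 0x02400000]
print([hex(f) for f in sorted(frags)])       # ascending unsigned order
print([hex(f) for f in sort_splits(frags)])  # comparison order
```

Under these assumptions the comparison order places each frag directly after its parent's subtree entries, whereas plain integer order groups frags by depth, so a reply sorted one way cannot be walked in lockstep with a tree sorted the other way.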

ceph: fix inode reference leak · 209ae762

Authored by Yan, Zheng on Apr 29, 2016

Signed-off-by: Yan, Zheng <zyan@redhat.com>

209ae762

ceph: using hash value to compose dentry offset · f3c4ebe6

Authored by Yan, Zheng on Apr 29, 2016

If the MDS sorts dentries in a dirfrag in hash order, we use the hash
value to compose the dentry offset. The dentry offset is:

  (0xff << 52) | ((24 bits hash) << 28) |
  (the nth entry that has a hash collision)

This offset is stable across directory fragmentation. This also means
there is no need to reset the readdir offset if the directory gets
fragmented in the middle of readdir.

Signed-off-by: Yan, Zheng <zyan@redhat.com>

f3c4ebe6
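
The offset layout quoted in f3c4ebe6 can be written out as a short Python sketch. The function names are hypothetical, and the 28-bit bound on the collision index is an assumption inferred from the shift amounts rather than something the message states.

```python
def hash_order_offset(hash24, nth):
    """Compose (0xff << 52) | (hash << 28) | nth from the quoted layout."""
    if not 0 <= hash24 < (1 << 24):
        raise ValueError("hash must fit in 24 bits")
    if not 0 <= nth < (1 << 28):
        raise ValueError("collision index must fit in 28 bits (assumed)")
    return (0xFF << 52) | (hash24 << 28) | nth


def offset_hash(offset):
    """Recover the 24-bit hash field from a composed offset."""
    return (offset >> 28) & 0xFFFFFF


def offset_nth(offset):
    """Recover the collision index from a composed offset."""
    return offset & ((1 << 28) - 1)


off = hash_order_offset(0x123456, 2)
print(hex(off), hex(offset_hash(off)), offset_nth(off))
```

Because the hash sits above the collision index, offsets compare in hash order first, which is what lets the position survive a directory being split into different fragments.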

ceph: don't forbid marking directory complete after forward seek · 076c40f1

Authored by Yan, Zheng on Apr 28, 2016

A forward seek within the same frag does not update fi->last_name, so
it will not affect the contents of a later readdir reply. So there is
no need to forbid marking the directory complete.

Signed-off-by: Yan, Zheng <zyan@redhat.com>

076c40f1

ceph: record 'offset' for each entry of readdir result · 8974eebd

Authored by Yan, Zheng on Apr 28, 2016

This is preparation for using hash value as dentry 'offset'

Signed-off-by: Yan, Zheng <zyan@redhat.com>

8974eebd

ceph: define 'end/complete' in readdir reply as bit flags · 956d39d6

Authored by Yan, Zheng on Apr 27, 2016

Set a flag in the readdir request which indicates that the client
interprets 'end/complete' as bit flags, so that the mds can return
additional flags in the readdir reply.

Signed-off-by: Yan, Zheng <zyan@redhat.com>

956d39d6

ceph: define struct for dir entry in readdir reply · 2a5beea3

Authored by Yan, Zheng on Apr 28, 2016

This avoids defining multiple arrays for entries in the readdir reply.

Signed-off-by: Yan, Zheng <zyan@redhat.com>

2a5beea3

ceph: simplify 'offset in frag' · a78600e7

Authored by Yan, Zheng on Apr 27, 2016

Don't distinguish the leftmost frag from other frags. Always use 2 as
the first entry's offset.

Signed-off-by: Yan, Zheng <zyan@redhat.com>

a78600e7

ceph: remove unnecessary checks in __dcache_readdir · 1cd42a42

Authored by Yan, Zheng on Apr 29, 2016

We never add snapdir and the hidden .ceph dir into the readdir cache.

Signed-off-by: Yan, Zheng <zyan@redhat.com>

1cd42a42

ceph: search cache postion for dcache readdir · c530cd24

Authored by Yan, Zheng on Apr 28, 2016

Use binary search to find the cache index that corresponds to the
readdir position.

Signed-off-by: Yan, Zheng <zyan@redhat.com>

c530cd24
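
The lookup in c530cd24 can be illustrated generically in Python. The sketch assumes the cached entries are kept in ascending offset order and that the index wanted is that of the first entry at or after the readdir position; `find_cache_index` is a hypothetical name, and the real cache layout is not described in the message.

```python
from bisect import bisect_left


def find_cache_index(offsets, pos):
    """Index of the first cached entry whose offset is >= pos.

    `offsets` must be sorted in ascending order. The result equals
    len(offsets) when every cached entry lies before `pos`.
    """
    return bisect_left(offsets, pos)


cached_offsets = [2, 3, 5, 9]
for pos in (0, 4, 5, 10):
    print(pos, find_cache_index(cached_offsets, pos))
```

A binary search keeps the lookup logarithmic in the number of cached entries, instead of walking the cache from the start on every readdir call.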

ceph: use CEPH_MDS_OP_RMXATTR request to remove xattr · 04303d8a

Authored by Yan, Zheng on Apr 21, 2016

Setxattr with a NULL value and the XATTR_REPLACE flag should be
equivalent to removexattr. But the current MDS does not support deleting
vxattrs through an MDS_OP_SETXATTR request. The workaround is to send an
MDS_OP_RMXATTR request if setxattr actually removes an xattr.

Signed-off-by: Yan, Zheng <zyan@redhat.com>

04303d8a
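
The workaround in 04303d8a is a choice between two request types. The Python sketch below models only that choice: `choose_mds_op` is a hypothetical helper, the operations are represented as plain strings, and the flag values are those of the Linux setxattr(2) interface.

```python
XATTR_CREATE = 0x1   # setxattr(2): fail if the attribute already exists
XATTR_REPLACE = 0x2  # setxattr(2): fail if the attribute does not exist


def choose_mds_op(value, flags):
    """Pick the request type for a setxattr call (illustrative only).

    A missing value (None, standing in for NULL) together with
    XATTR_REPLACE is treated as a removal; everything else stays a set.
    """
    if value is None and flags & XATTR_REPLACE:
        return "CEPH_MDS_OP_RMXATTR"
    return "CEPH_MDS_OP_SETXATTR"


print(choose_mds_op(None, XATTR_REPLACE))
print(choose_mds_op(b"v", XATTR_REPLACE))
print(choose_mds_op(None, 0))
```

Only the combination of a NULL value and XATTR_REPLACE is rerouted; how a NULL value without that flag is handled is outside what the message describes, so the sketch leaves it as an ordinary set request.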

ceph: report mount root in session metadata · 3f384954

Authored by Yan, Zheng on Apr 21, 2016

Signed-off-by: Yan, Zheng <zyan@redhat.com>

3f384954

ceph: don't show symlink target in debugfs/mdsc · aeda081c

Authored by Yan, Zheng on Apr 18, 2016

The symlink target is useless for debugging and can be very long. It's
annoying to show it in debugfs/mdsc.

Signed-off-by: Yan, Zheng <zyan@redhat.com>

aeda081c

ceph: don't call truncate_pagecache in ceph_writepages_start · 6c93df5d

Authored by Yan, Zheng on Apr 15, 2016

truncate_pagecache() may decrease the inode's reference count. This can
cause deadlock if the inode's last reference is dropped and iput_final()
wants to evict the inode. (evict() calls inode_wait_for_writeback(),
which waits for ceph_writepages_start() to return.)

The fix is to use a work thread to truncate dirty pages. Also add a
'forced umount' check to ceph_update_writeable_page(), which prevents
new pages from getting dirty.

Signed-off-by: Yan, Zheng <zyan@redhat.com>

6c93df5d

ceph: renew caps for read/write if mds session got killed. · 77310320

Authored by Yan, Zheng on Apr 08, 2016

When the mds session gets killed, read/write operations may hang.
The client waits for Frw caps, but the mds does not know what caps the
client wants. To recover from this, the client sends an open request to
the mds. The request tells the mds what caps the client wants.

Signed-off-by: Yan, Zheng <zyan@redhat.com>

77310320

ceph: CEPH_FEATURE_MDSENC support · d463a43d

Authored by Yan, Zheng on Mar 31, 2016

Signed-off-by: Yan, Zheng <zyan@redhat.com>

d463a43d