提交 · 55f57d2c4259a9a4048cf4629a2c6ba53729188d · openanolis / cloud-kernel

05 8月, 2015 33 次提交

f2fs: fix double lock in handle_failed_inode · 55f57d2c

由 Chao Yu 提交于 7月 16, 2015

In handle_failed_inode, there is a potential deadlock which can happen
in below call path:

- f2fs_create
 - f2fs_lock_op   down_read(cp_rwsem)
 - f2fs_add_link
  - __f2fs_add_link
   - init_inode_metadata
    - f2fs_init_security    failed
    - truncate_blocks    failed
 - handle_failed_inode
  - f2fs_truncate
   - truncate_blocks(..,true)
					- write_checkpoint
					 - block_operations
					  - f2fs_lock_all  down_write(cp_rwsem)
    - f2fs_lock_op   down_read(cp_rwsem)

So in this path, we pass parameter to f2fs_truncate to make sure
cp_rwsem in truncate_blocks will not be locked again.
Signed-off-by: NChao Yu <chao2.yu@samsung.com>
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

55f57d2c

f2fs: reduce region of cp_rwsem covered in f2fs_do_collapse · ecbaa406

由 Chao Yu 提交于 7月 16, 2015

In f2fs_do_collapse, region cp_rwsem covered is large, since it will be
held until all blocks are left shifted, so if we try to collapse small
area at the beginning of large file, checkpoint who want to grab writer's
lock of cp_rwsem will be delayed for long time.

In order to avoid this condition, altering to lock/unlock cp_rwsem each
shift operation.
Signed-off-by: NChao Yu <chao2.yu@samsung.com>
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

ecbaa406

f2fs: add new interfaces for extent tree · 0f825ee6

由 Fan Li 提交于 7月 15, 2015

Add a lookup and a insertion interface for extent tree.
The new lookup return the insert position and the prev/next
extents closest to the offset we lookup when find no match.
The new insertion uses above parameters to improve performance.

There are three possible insertions after the lookup in
f2fs_update_extent_tree, two of them insert parts of removed extent
back to tree, since no merge happens during this process, new insertion
skips the merge check in this scanario; the another insertion inserts a
new extent to tree, new insertion uses prev/next extent and insert
position to insert this extent directly, and save the time of searching
down the tree.

As long as tree remains unchanged between lookup and insertion, this
would work fine. And the new lookup would be useful when add
multi-blocks extent support for insertion interface.
Signed-off-by: NFan li <fanofcode.li@samsung.com>
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

0f825ee6

f2fs: callers take care of the page from bio error · 86531d6b

由 Jaegeuk Kim 提交于 7月 15, 2015

This patch changes for a caller to handle the page after its bio gets an error.
Reviewed-by: NChao Yu <chao2.yu@samsung.com>
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

86531d6b

f2fs: use atomic_t to record hit ratio info of extent cache · 727edac5

由 Chao Yu 提交于 7月 15, 2015

Variables for recording extent cache ratio info were updated without
protection, this patch tries to alter them to atomic_t type for more
accurate stat.
Signed-off-by: NChao Yu <chao2.yu@samsung.com>
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

727edac5

f2fs: stat inline xattr inode number · d5e8f6c9

由 Chao Yu 提交于 7月 15, 2015

This patch adds to stat the number of inline xattr inode for
showing in debugfs.
Signed-off-by: NChao Yu <chao2.yu@samsung.com>
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

d5e8f6c9

f2fs: use a page temporarily for encrypted gced page · 1b77c416

由 Jaegeuk Kim 提交于 7月 13, 2015

That encrypted page is used temporarily, so we don't need to mark it accessed.
Reviewed-by: NChao Yu <chao2.yu@samsung.com>
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

1b77c416

f2fs: expose f2fs_write_cache_pages · 8f46dcae

由 Chao Yu 提交于 7月 14, 2015

If there are gced dirty pages and normal dirty pages in the mapping
of one inode, we might writeback them alternately with discontinuous
block address, resulting in low performance.

This patch introduces f2fs_write_cache_pages with codes copied from
write_cache_pages in mm/page-writeback.c.

In this function, we refactor flow with two steps:
1) writeback all cold type pages.
2) writeback all non-cold type pages.

By using this method, f2fs will writeback dirty pages with the same
temperature in bunch mode, it makes writeouted block being with
more continuous address, so they can be merged as much as possible
in f2fs bio cache, and also it will reduce the chance of submiting
small IO from block layer.

Test environment: 8g nokia sd card (very old sd card, but it shows
better effect when testing with this patch, and with a 32g kingston
sd card, I didn't see much more improvement).

Test step:
1. touch testfile;
2. truncate -s 512K testfile;
3. write all pages with odd index;
4. trigger gc by ioctl;
5. write all pages with even index;
6. time fsync testfile.

before:
real	0m0.402s
user	0m0.000s
sys	0m0.000s

after:
real	0m0.143s
user	0m0.004s
sys	0m0.004s
Signed-off-by: NChao Yu <chao2.yu@samsung.com>
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

8f46dcae

f2fs: correct return value of ->setxattr · 037fe70c

由 Chao Yu 提交于 7月 13, 2015

This patch fixes to return correct error number of ->setxattr, which
is reported by xfstest tests/generic/026 as below:

generic/026      - output mismatch
    --- tests/generic/026.out
    +++ results/generic/026.out.bad
    @@ -4,6 +4,6 @@
     1 below acl max
     acl max
     1 above acl max
    -chacl: cannot set access acl on "largeaclfile": Argument list too long
    +chacl: cannot set access acl on "largeaclfile": Numerical result out of range
     use 16 aces
     use 17 aces
    ...
Ran: generic/026
Failures: generic/026
Failed 1 of 1 tests
Signed-off-by: NChao Yu <chao2.yu@samsung.com>
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

037fe70c

f2fs: cleanup write_orphan_inodes · bd936f84

由 Chao Yu 提交于 7月 13, 2015

Previously, since 'commit 4531929e ("f2fs: move grabing orphan
pages out of protection region")' was committed, in write_orphan_inodes(),
we will grab all meta page in a batch before we use them under spinlock,
so that we can avoid large time delay of grabbing meta pages under
spinlock.

Now, 'commit d6c67a4f ("f2fs: revmove spin_lock for
write_orphan_inodes")' remove the spinlock in write_orphan_inodes,
so there is no issue we describe above, we'd better recover to move
the grab operation to original place for readability.
Signed-off-by: NChao Yu <chao2.yu@samsung.com>
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

bd936f84

f2fs: warm up cold page after mmaped write · 5b339124

由 Chao Yu 提交于 7月 13, 2015

With cost-benifit method, background gc will consider old section with
fewer valid blocks as candidate victim, these old blocks in section will
be treated as cold data, and laterly will be moved into cold segment.

But if the gcing page is attached by user through buffered or mmaped
write, we should reset the page as non-cold one, because this page may
have more opportunity for further updating.

So fix to add clearing code for the missed 'mmap' case.
Signed-off-by: NChao Yu <chao2.yu@samsung.com>
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

5b339124

f2fs: add new ioctl F2FS_IOC_GARBAGE_COLLECT · c1c1b583

由 Chao Yu 提交于 7月 10, 2015

When background gc is off, the only way to trigger gc is executing
a force gc in some operations who wants to grab space in disk.

The executing condition is limited: to execute force gc, we should
wait for the time when there is almost no more free section for LFS
allocation. This seems not reasonable for our user who wants to
control triggering gc by himself.

This patch introduces F2FS_IOC_GARBAGE_COLLECT interface for
triggering garbage collection by using ioctl. It provides our users
one more option to trigger gc.
Signed-off-by: NChao Yu <chao2.yu@samsung.com>
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

c1c1b583

f2fs: maintain extent cache in separated file · a28ef1f5

由 Chao Yu 提交于 7月 08, 2015

This patch moves extent cache related code from data.c into extent_cache.c
since extent cache is independent feature, and its codes are not relate to
others in data.c, it's better for us to maintain them in separated place.

There is no functionality change, but several small coding style fixes
including:
* rename __drop_largest_extent to f2fs_drop_largest_extent for exporting;
* rename misspelled word 'untill' to 'until';
* remove unneeded 'return' in the end of f2fs_destroy_extent_tree().
Signed-off-by: NChao Yu <chao2.yu@samsung.com>
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

a28ef1f5

f2fs: don't try to split extents shorter than F2FS_MIN_EXTENT_LEN · 3c7df87d

由 Fan Li 提交于 7月 08, 2015

Since only parts of extents longer than F2FS_MIN_EXTENT_LEN will
be kept in extent cache after split, extents already shorter than
F2FS_MIN_EXTENT_LEN don't need to try split at all.
Signed-off-by: NFan Li <fanofcode.li@samsung.com>
Reviewed-by: NChao Yu <chao2.yu@samsung.com>
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

3c7df87d

f2fs: fix to update page flag · 90d4388a

由 Chao Yu 提交于 7月 08, 2015

This patch fixes to update page flag (e.g. Uptodate/cold flag) in
->write_begin.

Otherwise, page will be non-uptodate when we try to write entire
page, and cold data flag in page will not be clean when gced page
is being rewritten.
Signed-off-by: NChao Yu <chao2.yu@samsung.com>
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

90d4388a

f2fs: shrink unreferenced extent_caches first · 7023a1ad

由 Jaegeuk Kim 提交于 6月 29, 2015

If an extent_tree entry has a zero reference count, we can drop it from the
cache in higher priority rather than currently referencing entries.
Reviewed-by: NChao Yu <chao2.yu@samsung.com>
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

7023a1ad

f2fs: enhance multithread performance · bb96a8d5

由 Chao Yu 提交于 7月 06, 2015

In ->writepages, we use writepages mutex lock to serialize all block
address allocation and page submitting pairs from different inodes.
This method makes our delayed dirty pages of one inode being written
continously as many as possible.

But there is one problem that we did not submit current cached bio in
protection region of writepages mutex lock, so there is a small chance
that we submit the one of other thread's as below, resulting in
splitting more bios.

thread 1			thread 2
->writepages
  lock(writepages)
  ->write_cache_pages
  unlock(writepages)
				  lock(writepages)
				  ->write_cache_pages
  ->f2fs_submit_merged_bio
				    ->writepage
				  unlock(writepages)

fs_mark-6535  [002] ....  2242.270230: f2fs_submit_write_bio: dev = (1,0), WRITE_SYNC, DATA, sector = 5766152, size = 524288
fs_mark-6536  [000] ....  2242.270361: f2fs_submit_write_bio: dev = (1,0), WRITE_SYNC, DATA, sector = 5767176, size = 4096
fs_mark-6536  [000] ....  2242.270370: f2fs_submit_write_bio: dev = (1,0), WRITE_SYNC, NODE, sector = 8138112, size = 4096
fs_mark-6535  [002] ....  2242.270776: f2fs_submit_write_bio: dev = (1,0), WRITE_SYNC, DATA, sector = 5767184, size = 516096

This may really increase time of block layer works, and may cause
larger IO lantency.

This patch moves the submitting operation into region of writepages
mutex lock to avoid bio splits when concurrently writebacking is
intensive.

my test environment: virtual machine,
intel cpu i5 2500, 8GB size memory, 4GB size ramdisk

time fs_mark  -t  16  -L  1  -s  524288  -S  1  -d  /mnt/f2fs/

before:
real	0m4.244s
user	0m0.088s
sys	0m12.336s

after:
real	0m3.822s
user	0m0.072s
sys	0m10.760s
Signed-off-by: NChao Yu <chao2.yu@samsung.com>
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

bb96a8d5

f2fs: restrict multimedia filename · 741a7bea

由 Chao Yu 提交于 7月 06, 2015

When testing with fs_mark, some blocks were written out as cold
data which were mixed with warm data, resulting in splitting more
bios.

This is because fs_mark will create file with random filename as
below:

559551ee~~~~~~~~15Z29OCC05JCKQP60JQ42MKV
559551ee~~~~~~~~NZAZ6X8OA8LHIIP6XD0L58RM
559551ef~~~~~~~~B15YDSWAK789HPSDZKYTW6WM
559551f1~~~~~~~~2DAE5DPS79785BUNTFWBEMP3
559551f1~~~~~~~~1MYDY0BKSQCJPI32Q8C514RM
559551f1~~~~~~~~YQOTMAOMN5CVRFOUNI026MP4
559551f3~~~~~~~~1WF42LPRTQJNPPGR3EINKMPE
559551f3~~~~~~~~8Y2NRK7CEPPAA02LY936PJPG

They are regarded as cold file since their filename are ended with
multimedia files' extension, but this should be wrong as we only
match the extension of filename, not the whole one.

In this patch, we try to fix the format of multimedia filename to:
"filename + '.' + extension", then we set cold file only its
filename matches the format.

So after this change, it will reduce the probability we set the
wrong cold file, also it helps a little for fs_mark's performance
on f2fs.
Signed-off-by: NChao Yu <chao2.yu@samsung.com>
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

741a7bea

f2fs: make the function check_dnode have a return type of bool and change it's name to is_alive · c1079892

由 Nicholas Krause 提交于 6月 30, 2015

This makes the function check_dnode have a return type of bool
due to this particular function only ever returning either one
or zero as its return value and changes the name of the function
to is_alive in order to better explain this function's intended
work of checking if a dnode is still in use by the filesystem.
Signed-off-by: NNicholas Krause <xerofoify@gmail.com>
[Jaegeuk Kim: change the return value check for the renamed function]
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

c1079892

f2fs: check the largest extent at look-up time · 84bc926c

由 Jaegeuk Kim 提交于 6月 29, 2015

Because of the extent shrinker or other -ENOMEM scenarios, it cannot guarantee
that the largest extent would be cached in the tree all the time.

Instead of relying on extent_tree, we can simply check the cached one in extent
tree accordingly.
Reviewed-by: NChao Yu <chao2.yu@samsung.com>
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

84bc926c

f2fs: use extent_cache by default · 3e72f721

由 Jaegeuk Kim 提交于 6月 19, 2015

We don't need to handle the duplicate extent information.

The integrated rule is:
 - update on-disk extent with largest one tracked by in-memory extent_cache
 - destroy extent_tree for the truncation case
 - drop per-inode extent_cache by shrinker
Reviewed-by: NChao Yu <chao2.yu@samsung.com>
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

3e72f721

f2fs: add noextent_cache mount option · 7daaea25

由 Jaegeuk Kim 提交于 6月 25, 2015

This patch adds noextent_cache mount option.
Reviewed-by: NChao Yu <chao2.yu@samsung.com>
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

7daaea25

f2fs: shrink extent_cache entries · 554df79e

由 Jaegeuk Kim 提交于 6月 19, 2015

This patch registers shrinking extent_caches.
Reviewed-by: NChao Yu <chao2.yu@samsung.com>
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

554df79e

f2fs: shrink nat_cache entries · 1b38dc8e

由 Jaegeuk Kim 提交于 6月 19, 2015

This patch registers shrinking nat_cache entries.
Reviewed-by: NChao Yu <chao2.yu@samsung.com>
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

1b38dc8e

f2fs: introduce a shrinker for mounted fs · 2658e50d

由 Jaegeuk Kim 提交于 6月 19, 2015

This patch introduces a shrinker targeting to reduce memory footprint consumed
by a number of in-memory f2fs data structures.

In addition, it newly adds:
 - sbi->umount_mutex to avoid data races on shrinker and put_super
 - sbi->shruinker_run_no to not revisit objects

Note that the basic implementation was copied from fs/ubifs/shrinker.c
Reviewed-by: NChao Yu <chao2.yu@samsung.com>
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

2658e50d

f2fs: set cached_en after checking finally · 244f4fc1

由 Jaegeuk Kim 提交于 6月 22, 2015

This patch relocates cached_en not only to be covered by spin_lock, but also
to set once after checking out completely.
Reviewed-by: NChao Yu <chao2.yu@samsung.com>
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

244f4fc1

f2fs: update on-disk extents even under extent_cache · cbe91923

由 Jaegeuk Kim 提交于 6月 16, 2015

Previously, f2fs_update_extent_cache() updates in-memory extent_cache all the
time, and then finally preserves its up-to-date extent into on-disk one during
f2fs_evict_inode.

But, in the following scenario:

1. mount
2. open & write an extent X
3. f2fs_evict_inode; on-disk extent is X
4. open & update the extent X with Y
5. sync; trigger checkpoint
6. power-cut

after power-on, f2fs should serve extent Y, but we have an on-disk extent X.

This causes a failure on xfstests/311.
Reviewed-by: NChao Yu <chao2.yu@samsung.com>
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

cbe91923

f2fs: fix wrong block address calculation for a split extent · 7a2cb678

由 Jaegeuk Kim 提交于 6月 18, 2015

This patch fixes wrong calculation on block address field when an extent is
split.
Reviewed-by: NChao Yu <chao2.yu@samsung.com>
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

7a2cb678

f2fs: convert inline_data for various fallocate · 97a7b2c2

由 Jaegeuk Kim 提交于 6月 17, 2015

For newly added fallocate types, it should convert inline_data before handling
block swapping.
Reviewed-by: NChao Yu <chao2.yu@samsung.com>
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

97a7b2c2

f2fs: avoid to use failed inode immediately · c9b63bd0

由 Jaegeuk Kim 提交于 6月 23, 2015

Before iput is called, the inode number used by a bad inode can be reassigned
to other new inode, resulting in any abnormal behaviors on the new inode.
This should not happen for the new inode.
Reviewed-by: NChao Yu <chao2.yu@samsung.com>
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

c9b63bd0

f2fs: avoid freed stat information · eca616f8

由 Jaegeuk Kim 提交于 6月 15, 2015

The write_checkpoint can update stat information, so we should destroy the stat
structure after it.
Reviewed-by: NChao Yu <chao2.yu@samsung.com>
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

eca616f8

f2fs: fix to record dirty page count for symlink · 5ac9f36f

由 Chao Yu 提交于 6月 29, 2015

Dirty page can be exist in mapping of newly created symlink, but previously
we did not maintain the counting of dirty page for symlink like we maintained
for regular/directory, so the counting we lookuped should be wrong.

This patch adds missed dirty page counting for symlink to fix this issue.
Signed-off-by: NChao Yu <chao2.yu@samsung.com>
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

5ac9f36f

f2fs crypto: delete an unnecessary check before the function call "key_put" · 92859a5e

由 Markus Elfring 提交于 6月 26, 2015

The key_put() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.
Signed-off-by: NMarkus Elfring <elfring@users.sourceforge.net>
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

92859a5e

02 8月, 2015 1 次提交

link_path_walk(): be careful when failing with ENOTDIR · 97242f99

由 Al Viro 提交于 8月 01, 2015

In RCU mode we might end up with dentry evicted just we check
that it's a directory.  In such case we should return ECHILD
rather than ENOTDIR, so that pathwalk would be retries in non-RCU
mode.

Breakage had been introduced in commit b18825a7 - prior to that
we were looking at nd->inode, which had been fetched before
verifying that ->d_seq was still valid.  That form of check
would only be satisfied if at some point the pathname prefix
would indeed have resolved to a non-directory.  The fix consists
of checking ->d_seq after we'd run into a non-directory dentry,
and failing with ECHILD in case of mismatch.

Note that all branches since 3.12 have that problem...
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

97242f99

31 7月, 2015 2 次提交

ceph: always re-send cap flushes when MDS recovers · fc927cd3

由 Yan, Zheng 提交于 7月 20, 2015

commit e548e9b9 makes the kclient
only re-send cap flush once during MDS failover. If the kclient sends
a cap flush after MDS enters reconnect stage but before MDS recovers.
The kclient will skip re-sending the same cap flush when MDS recovers.

This causes problem for newly created inode. The MDS handles cap
flushes before replaying unsafe requests, so it's possible that MDS
find corresponding inode is missing when handling cap flush. The fix
is reverting to old behaviour: always re-send when MDS recovers
Signed-off-by: NYan, Zheng <zyan@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

fc927cd3

ceph: fix ceph_encode_locks_to_buffer() · f6762cb2

由 Yan, Zheng 提交于 7月 07, 2015

posix locks should be in ctx->flc_posix list
Signed-off-by: NYan, Zheng <zyan@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

f6762cb2

29 7月, 2015 3 次提交

xfs: remote attributes need to be considered data · df150ed1

由 Dave Chinner 提交于 7月 29, 2015

We don't log remote attribute contents, and instead write them
synchronously before we commit the block allocation and attribute
tree update transaction. As a result we are writing to the allocated
space before the allcoation has been made permanent.

As a result, we cannot consider this allocation to be a metadata
allocation. Metadata allocation can take blocks from the free list
and so reuse them before the transaction that freed the block is
committed to disk. This behaviour is perfectly fine for journalled
metadata changes as log recovery will ensure the free operation is
replayed before the overwrite, but for remote attribute writes this
is not the case.

Hence we have to consider the remote attribute blocks to contain
data and allocate accordingly. We do this by dropping the
XFS_BMAPI_METADATA flag from the block allocation. This means the
allocation will not use blocks that are on the busy list without
first ensuring that the freeing transaction has been committed to
disk and the blocks removed from the busy list. This ensures we will
never overwrite a freed block without first ensuring that it is
really free.

cc: <stable@vger.kernel.org>
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

df150ed1

xfs: remote attribute headers contain an invalid LSN · e3c32ee9

由 Dave Chinner 提交于 7月 29, 2015

In recent testing, a system that crashed failed log recovery on
restart with a bad symlink buffer magic number:

XFS (vda): Starting recovery (logdev: internal)
XFS (vda): Bad symlink block magic!
XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 2060

On examination of the log via xfs_logprint, none of the symlink
buffers in the log had a bad magic number, nor were any other types
of buffer log format headers mis-identified as symlink buffers.
Tracing was used to find the buffer the kernel was tripping over,
and xfs_db identified it's contents as:

000: 5841524d 00000000 00000346 64d82b48 8983e692 d71e4680 a5f49e2c b317576e
020: 00000000 00602038 00000000 006034ce d0020000 00000000 4d4d4d4d 4d4d4d4d
040: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
060: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
.....

This is a remote attribute buffer, which are notable in that they
are not logged but are instead written synchronously by the remote
attribute code so that they exist on disk before the attribute
transactions are committed to the journal.

The above remote attribute block has an invalid LSN in it - cycle
0xd002000, block 0 - which means when log recovery comes along to
determine if the transaction that writes to the underlying block
should be replayed, it sees a block that has a future LSN and so
does not replay the buffer data in the transaction. Instead, it
validates the buffer magic number and attaches the buffer verifier
to it.  It is this buffer magic number check that is failing in the
above assert, indicating that we skipped replay due to the LSN of
the underlying buffer.

The problem here is that the remote attribute buffers cannot have a
valid LSN placed into them, because the transaction that contains 
the attribute tree pointer changes and the block allocation that the
attribute data is being written to hasn't yet been committed. Hence
the LSN field in the attribute block is completely unwritten,
thereby leaving the underlying contents of the block in the LSN
field. It could have any value, and hence a future overwrite of the
block by log recovery may or may not work correctly.

Fix this by always writing an invalid LSN to the remote attribute
block, as any buffer in log recovery that needs to write over the
remote attribute should occur. We are protected from having old data
written over the attribute by the fact that freeing the block before
the remote attribute is written will result in the buffer being
marked stale in the log and so all changes prior to the buffer stale
transaction will be cancelled by log recovery.

Hence it is safe to ignore the LSN in the case or synchronously
written, unlogged metadata such as remote attribute blocks, and to
ensure we do that correctly, we need to write an invalid LSN to all
remote attribute blocks to trigger immediate recovery of metadata
that is written over the top.

As a further protection for filesystems that may already have remote
attribute blocks with bad LSNs on disk, change the log recovery code
to always trigger immediate recovery of metadata over remote
attribute blocks.

cc: <stable@vger.kernel.org>
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

e3c32ee9

xfs: call dax_fault on read page faults for DAX · b2442c5a

由 Dave Chinner 提交于 7月 29, 2015

When modifying the patch series to handle the XFS MMAP_LOCK nesting
of page faults, I botched the conversion of the read page fault
path, and so it is only every calling through the page cache. Re-add
the necessary __dax_fault() call for such files.

Because the get_blocks callback on read faults may not set up the
mapping buffer correctly to allow unwritten extent completion to be
run, we need to allow callers of __dax_fault() to pass a null
complete_unwritten() callback. The DAX code always zeros the
unwritten page when it is read faulted so there are no stale data
exposure issues with not doing the conversion. The only downside
will be the potential for increased CPU overhead on repeated read
faults of the same page. If this proves to be a problem, then the
filesystem needs to fix it's get_block callback and provide a
convert_unwritten() callback to the read fault path.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NMatthew Wilcox <willy@linux.intel.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

b2442c5a

28 7月, 2015 1 次提交

nfs: Fix an oops caused by using other thread's stack space in ASYNC mode · a49c2691

由 Kinglong Mee 提交于 7月 27, 2015

An oops caused by using other thread's stack space in sunrpc ASYNC sending thread.

[ 9839.007187] ------------[ cut here ]------------
[ 9839.007923] kernel BUG at fs/nfs/nfs4xdr.c:910!
[ 9839.008069] invalid opcode: 0000 [#1] SMP
[ 9839.008069] Modules linked in: blocklayoutdriver rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm joydev iosf_mbi crct10dif_pclmul snd_timer crc32_pclmul crc32c_intel ghash_clmulni_intel snd soundcore ppdev pvpanic parport_pc i2c_piix4 serio_raw virtio_balloon parport acpi_cpufreq nfsd nfs_acl lockd grace auth_rpcgss sunrpc qxl drm_kms_helper virtio_net virtio_console virtio_blk ttm drm virtio_pci virtio_ring virtio ata_generic pata_acpi
[ 9839.008069] CPU: 0 PID: 308 Comm: kworker/0:1H Not tainted 4.0.0-0.rc4.git1.3.fc23.x86_64 #1
[ 9839.008069] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 9839.008069] Workqueue: rpciod rpc_async_schedule [sunrpc]
[ 9839.008069] task: ffff8800d8b4d8e0 ti: ffff880036678000 task.ti: ffff880036678000
[ 9839.008069] RIP: 0010:[<ffffffffa0339cc9>]  [<ffffffffa0339cc9>] reserve_space.part.73+0x9/0x10 [nfsv4]
[ 9839.008069] RSP: 0018:ffff88003667ba58  EFLAGS: 00010246
[ 9839.008069] RAX: 0000000000000000 RBX: 000000001fc15e18 RCX: ffff8800c0193800
[ 9839.008069] RDX: ffff8800e4ae3f24 RSI: 000000001fc15e2c RDI: ffff88003667bcd0
[ 9839.008069] RBP: ffff88003667ba58 R08: ffff8800d9173008 R09: 0000000000000003
[ 9839.008069] R10: ffff88003667bcd0 R11: 000000000000000c R12: 0000000000010000
[ 9839.008069] R13: ffff8800d9173350 R14: 0000000000000000 R15: ffff8800c0067b98
[ 9839.008069] FS:  0000000000000000(0000) GS:ffff88011fc00000(0000) knlGS:0000000000000000
[ 9839.008069] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9839.008069] CR2: 00007f988c9c8bb0 CR3: 00000000d99b6000 CR4: 00000000000407f0
[ 9839.008069] Stack:
[ 9839.008069]  ffff88003667bbc8 ffffffffa03412c5 00000000c6c55680 ffff880000000003
[ 9839.008069]  0000000000000088 00000010c6c55680 0001000000000002 ffffffff816e87e9
[ 9839.008069]  0000000000000000 00000000477290e2 ffff88003667bab8 ffffffff81327ba3
[ 9839.008069] Call Trace:
[ 9839.008069]  [<ffffffffa03412c5>] encode_attrs+0x435/0x530 [nfsv4]
[ 9839.008069]  [<ffffffff816e87e9>] ? inet_sendmsg+0x69/0xb0
[ 9839.008069]  [<ffffffff81327ba3>] ? selinux_socket_sendmsg+0x23/0x30
[ 9839.008069]  [<ffffffff8164c1df>] ? do_sock_sendmsg+0x9f/0xc0
[ 9839.008069]  [<ffffffff8164c278>] ? kernel_sendmsg+0x58/0x70
[ 9839.008069]  [<ffffffffa011acc0>] ? xdr_reserve_space+0x20/0x170 [sunrpc]
[ 9839.008069]  [<ffffffffa011acc0>] ? xdr_reserve_space+0x20/0x170 [sunrpc]
[ 9839.008069]  [<ffffffffa0341b40>] ? nfs4_xdr_enc_open_noattr+0x130/0x130 [nfsv4]
[ 9839.008069]  [<ffffffffa03419a5>] encode_open+0x2d5/0x340 [nfsv4]
[ 9839.008069]  [<ffffffffa0341b40>] ? nfs4_xdr_enc_open_noattr+0x130/0x130 [nfsv4]
[ 9839.008069]  [<ffffffffa011ab89>] ? xdr_encode_opaque+0x19/0x20 [sunrpc]
[ 9839.008069]  [<ffffffffa0339cfb>] ? encode_string+0x2b/0x40 [nfsv4]
[ 9839.008069]  [<ffffffffa0341bf3>] nfs4_xdr_enc_open+0xb3/0x140 [nfsv4]
[ 9839.008069]  [<ffffffffa0110a4c>] rpcauth_wrap_req+0xac/0xf0 [sunrpc]
[ 9839.008069]  [<ffffffffa01017db>] call_transmit+0x18b/0x2d0 [sunrpc]
[ 9839.008069]  [<ffffffffa0101650>] ? call_decode+0x860/0x860 [sunrpc]
[ 9839.008069]  [<ffffffffa0101650>] ? call_decode+0x860/0x860 [sunrpc]
[ 9839.008069]  [<ffffffffa010caa0>] __rpc_execute+0x90/0x460 [sunrpc]
[ 9839.008069]  [<ffffffffa010ce85>] rpc_async_schedule+0x15/0x20 [sunrpc]
[ 9839.008069]  [<ffffffff810b452b>] process_one_work+0x1bb/0x410
[ 9839.008069]  [<ffffffff810b47d3>] worker_thread+0x53/0x470
[ 9839.008069]  [<ffffffff810b4780>] ? process_one_work+0x410/0x410
[ 9839.008069]  [<ffffffff810b4780>] ? process_one_work+0x410/0x410
[ 9839.008069]  [<ffffffff810ba7b8>] kthread+0xd8/0xf0
[ 9839.008069]  [<ffffffff810ba6e0>] ? kthread_worker_fn+0x180/0x180
[ 9839.008069]  [<ffffffff81786418>] ret_from_fork+0x58/0x90
[ 9839.008069]  [<ffffffff810ba6e0>] ? kthread_worker_fn+0x180/0x180
[ 9839.008069] Code: 00 00 48 c7 c7 21 fa 37 a0 e8 94 1c d6 e0 c6 05 d2 17 05 00 01 8b 03 eb d7 66 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 <0f> 0b 0f 1f 44 00 00 66 66 66 66 90 55 48 89 e5 41 54 53 89 f3
[ 9839.008069] RIP  [<ffffffffa0339cc9>] reserve_space.part.73+0x9/0x10 [nfsv4]
[ 9839.008069]  RSP <ffff88003667ba58>
[ 9839.071114] ---[ end trace cc14c03adb522e94 ]---
Signed-off-by: NKinglong Mee <kinglongmee@gmail.com>
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>

a49c2691

openanolis / cloud-kernel 接近 2 年 前同步成功

openanolis / cloud-kernel
接近 2 年前同步成功