Commits · 36f90b0a2ddd60823fe193a85e60ff1906c2a9b3 · openanolis / cloud-kernel

23 Jan, 2016 1 commit

wrappers for ->i_mutex access · 5955102c

Authored by Al Viro on Jan 22, 2016

parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).

Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

5955102c
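
The naming pattern described in this commit message, where `inode_foo(inode)` is `mutex_foo(&inode->i_mutex)`, can be sketched in userspace as follows. This is only an illustration, not the kernel code: `struct toy_mutex` and the `toy_mutex_*` helpers are hypothetical stand-ins (a spin flag rather than the kernel's sleeping mutex), and `struct inode` is reduced to the lock alone. Only the wrapper names follow the commit message.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's struct mutex: a simple spin flag. */
struct toy_mutex {
	atomic_bool held;
};

static inline void toy_mutex_init(struct toy_mutex *m)
{
	atomic_init(&m->held, false);
}

/* Returns 1 if the lock was acquired, 0 if it was already held. */
static inline int toy_mutex_trylock(struct toy_mutex *m)
{
	return !atomic_exchange(&m->held, true);
}

static inline void toy_mutex_lock(struct toy_mutex *m)
{
	while (!toy_mutex_trylock(m))
		; /* busy-wait; a real mutex would sleep instead */
}

static inline int toy_mutex_is_locked(struct toy_mutex *m)
{
	return atomic_load(&m->held);
}

static inline void toy_mutex_unlock(struct toy_mutex *m)
{
	assert(toy_mutex_is_locked(m)); /* unlocking an unheld lock is a bug */
	atomic_store(&m->held, false);
}

/* Minimal inode carrying only the lock. */
struct inode {
	struct toy_mutex i_mutex;
};

/* The wrappers: each inode_foo(inode) forwards to foo on inode->i_mutex. */
static inline void inode_lock(struct inode *inode)
{
	toy_mutex_lock(&inode->i_mutex);
}

static inline void inode_unlock(struct inode *inode)
{
	toy_mutex_unlock(&inode->i_mutex);
}

static inline int inode_trylock(struct inode *inode)
{
	return toy_mutex_trylock(&inode->i_mutex);
}

static inline int inode_is_locked(struct inode *inode)
{
	return toy_mutex_is_locked(&inode->i_mutex);
}
```

Callers that go through the wrappers never name `i_mutex` directly, which is what would let the underlying lock type change later without touching every call site.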

09 Jan, 2016 1 commit

ext4: adds project ID support · 040cb378

Authored by Li Xi on Jan 08, 2016

Signed-off-by: Li Xi <lixi@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Jan Kara <jack@suse.cz>

040cb378

09 Dec, 2015 1 commit

don't put symlink bodies in pagecache into highmem · 21fc61c7

Authored by Al Viro on Nov 17, 2015

kmap() in page_follow_link_light() needed to go - being allowed to hold
an arbitrary number of kmaps for long is a great way to deadlock
the system.

new helper (inode_nohighmem(inode)) needs to be used for pagecache
symlink inodes; done for all in-tree cases.  page_follow_link_light()
instrumented to yell about anything missed.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

21fc61c7

08 Dec, 2015 6 commits

ext4: use pre-zeroed blocks for DAX page faults · ba5843f5

Authored by Jan Kara on Dec 07, 2015

Make DAX fault path use pre-zeroed blocks to avoid races with extent
conversion and zeroing when two page faults to the same block happen.
Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ba5843f5

ext4: implement allocation of pre-zeroed blocks · c86d8db3

Authored by Jan Kara on Dec 07, 2015

DAX page fault path needs to get blocks that are pre-zeroed to avoid
races when two concurrent page faults happen in the same block of a
file. Implement support for this in ext4_map_blocks().
Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

c86d8db3

ext4: provide ext4_issue_zeroout() · 53085fac

Authored by Jan Kara on Dec 07, 2015

Create new function ext4_issue_zeroout() to zeroout contiguous (both
logically and physically) part of inode data. We will need to issue
zeroout when extent structure is not readily available and this function
will allow us to do it without making up fake extent structures.
Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

53085fac

ext4: get rid of EXT4_GET_BLOCKS_NO_LOCK flag · 2dcba478

Authored by Jan Kara on Dec 07, 2015

When dioread_nolock mode is enabled, we grab i_data_sem in
ext4_ext_direct_IO() and therefore we need to instruct _ext4_get_block()
not to grab i_data_sem again using EXT4_GET_BLOCKS_NO_LOCK. However
holding i_data_sem over overwrite direct IO isn't needed these days. We
have exclusion against truncate / hole punching because we increase
i_dio_count under i_mutex in ext4_ext_direct_IO() so once
ext4_file_write_iter() verifies blocks are allocated & written, they are
guaranteed to stay so during the whole direct IO even after we drop
i_mutex.

So we can just remove this locking abuse and the no longer necessary
EXT4_GET_BLOCKS_NO_LOCK flag.
Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

2dcba478

ext4: fix races of writeback with punch hole and zero range · 01127848

Authored by Jan Kara on Dec 07, 2015

When doing delayed allocation, update of on-disk inode size is postponed
until IO submission time. However hole punch or zero range fallocate
calls can end up discarding the tail page cache page and thus on-disk
inode size would never be properly updated.

Make sure the on-disk inode size is updated before truncating page
cache.
Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

01127848

ext4: fix races between page faults and hole punching · ea3d7209

Authored by Jan Kara on Dec 07, 2015

Currently, page faults and hole punching are completely unsynchronized.
This can result in page fault faulting in a page into a range that we
are punching after truncate_pagecache_range() has been called and thus
we can end up with a page mapped to disk blocks that will be shortly
freed. Filesystem corruption will shortly follow. Note that the same
race is avoided for truncate by checking page fault offset against
i_size but there isn't similar mechanism available for punching holes.

Fix the problem by creating new rw semaphore i_mmap_sem in inode and
grab it for writing over truncate, hole punching, and other functions
removing blocks from extent tree and for read over page faults. We
cannot easily use i_data_sem for this since that ranks below transaction
start and we need something ranking above it so that it can be held over
the whole truncate / hole punching operation. Also remove various
workarounds we had in the code to reduce race window when page fault
could have created pages with stale mapping information.
Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ea3d7209

11 Nov, 2015 1 commit

vfs: remove unused wrapper block_page_mkwrite() · 5c500029

Authored by Ross Zwisler on Oct 13, 2015

The function currently called "__block_page_mkwrite()" used to be called
"block_page_mkwrite()" until a wrapper for this function was added by:

commit 24da4fab ("vfs: Create __block_page_mkwrite() helper passing
	error values back")

This wrapper, the current "block_page_mkwrite()", is currently unused.
__block_page_mkwrite() is used directly by ext4, nilfs2 and xfs.

Remove the unused wrapper, rename __block_page_mkwrite() back to
block_page_mkwrite() and update the comment above block_page_mkwrite().
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.com>
Cc: Jan Kara <jack@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

5c500029

07 Nov, 2015 1 commit

mm, fs: introduce mapping_gfp_constraint() · c62d2555

Authored by Michal Hocko on Nov 06, 2015

There are many places which use mapping_gfp_mask to restrict a more
generic gfp mask which would be used for allocations which are not
directly related to the page cache but they are performed in the same
context.

Let's introduce a helper function which makes the restriction explicit and
easier to track.  This patch doesn't introduce any functional changes.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

c62d2555
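
The commit message above does not show the helper itself. A minimal userspace sketch of the idea, assuming the helper simply restricts the caller's mask to the bits the mapping allows, might look like this. Every `toy_` name and every flag value here is hypothetical and chosen only for the illustration.

```c
#include <assert.h>

typedef unsigned int toy_gfp_t;

/* Illustrative flag bits; the values are made up for this sketch. */
#define TOY_GFP_WAIT	0x1u
#define TOY_GFP_IO	0x2u
#define TOY_GFP_FS	0x4u
#define TOY_GFP_KERNEL	(TOY_GFP_WAIT | TOY_GFP_IO | TOY_GFP_FS)
#define TOY_GFP_NOFS	(TOY_GFP_WAIT | TOY_GFP_IO)

/* Stand-in for a page cache mapping that carries its own allocation mask. */
struct toy_address_space {
	toy_gfp_t gfp_mask;
};

static inline toy_gfp_t toy_mapping_gfp_mask(const struct toy_address_space *mapping)
{
	assert(mapping);
	return mapping->gfp_mask;
}

/*
 * Restrict a more generic mask to what the mapping allows. Before a helper
 * like this, callers would open-code the same AND at each call site.
 */
static inline toy_gfp_t toy_mapping_gfp_constraint(const struct toy_address_space *mapping,
						   toy_gfp_t gfp_mask)
{
	return toy_mapping_gfp_mask(mapping) & gfp_mask;
}
```

With a mapping whose mask is `TOY_GFP_NOFS`, constraining `TOY_GFP_KERNEL` drops the `TOY_GFP_FS` bit, so an allocation made in the same context cannot recurse into the filesystem.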

18 Oct, 2015 2 commits

ext4: clean up feature test macros with predicate functions · e2b911c5

Authored by Darrick J. Wong on Oct 17, 2015

Create separate predicate functions to test/set/clear feature flags,
thereby replacing the wordy old macros.  Furthermore, clean out the
places where we open-coded feature tests.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

e2b911c5
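
To illustrate the difference between a wordy feature-test macro and per-feature predicate functions, here is a small userspace sketch. It assumes a single host-endian feature word and one example flag; the `toy_` names, the generator macro, and the flag value are hypothetical and are not taken from the commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified superblock holding one feature word (host-endian here). */
struct toy_super_block {
	uint32_t s_feature_compat;
};

/* Example flag; the value is illustrative. */
#define TOY_FEATURE_COMPAT_HAS_JOURNAL 0x0004u

/* Old style: the caller spells out both the feature class and the mask. */
#define TOY_HAS_COMPAT_FEATURE(sb, mask) \
	(((sb)->s_feature_compat & (mask)) != 0)

/* New style: generate test/set/clear functions once per feature. */
#define TOY_FEATURE_COMPAT_FUNCS(name, flag) \
static inline bool toy_has_feature_##name(const struct toy_super_block *sb) \
{ \
	assert(sb); \
	return (sb->s_feature_compat & (flag)) != 0; \
} \
static inline void toy_set_feature_##name(struct toy_super_block *sb) \
{ \
	sb->s_feature_compat |= (flag); \
} \
static inline void toy_clear_feature_##name(struct toy_super_block *sb) \
{ \
	sb->s_feature_compat &= ~(flag); \
}

TOY_FEATURE_COMPAT_FUNCS(journal, TOY_FEATURE_COMPAT_HAS_JOURNAL)
```

A call site then reads `toy_has_feature_journal(sb)` instead of `TOY_HAS_COMPAT_FEATURE(sb, TOY_FEATURE_COMPAT_HAS_JOURNAL)`, and the compiler type-checks the argument.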

ext4: call out CRC and corruption errors with specific error codes · 6a797d27

Authored by Darrick J. Wong on Oct 17, 2015

Instead of overloading EIO for CRC errors and corrupt structures,
return the same error codes that XFS returns for the same issues.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

6a797d27

15 Oct, 2015 1 commit

ext4: use private version of page_zero_new_buffers() for data=journal mode · b90197b6

Authored by Theodore Ts'o on Oct 15, 2015

If there is an error while copying data from userspace into the page
cache during a write(2) system call, in data=journal mode, in
ext4_journalled_write_end() we were using page_zero_new_buffers() from
fs/buffer.c.  Unfortunately, this sets the buffer dirty flag, which is
no good if journalling is enabled.  This is a long-standing bug that
goes back for years and years in ext3, but a combination of (a)
data=journal not being very common, (b) in many cases it only results
in a warning message, and (c) only very rarely causes the kernel to hang,
means that we only really noticed this as a problem when commit
998ef75d caused this failure to happen frequently enough to cause
generic/208 to fail when run in data=journal mode.

The fix is to have our own version of this function that doesn't call
mark_buffer_dirty(), since we will end up calling
ext4_handle_dirty_metadata() on the buffer head(s) in question very
shortly afterwards in ext4_journalled_write_end().

Thanks to Dave Hansen and Linus Torvalds for helping to identify the
root cause of the problem.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.com>

b90197b6

03 Oct, 2015 2 commits

ext4 crypto: ext4_page_crypto() doesn't need a encryption context · 3684de8c

Authored by Theodore Ts'o on Oct 03, 2015

Since ext4_page_crypto() doesn't need an encryption context (at least
not any more), this allows us to simplify a number of function signatures
and also allows us to avoid needing to allocate a context in
ext4_block_write_begin().  It also means we no longer need a separate
ext4_decrypt_one() function.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

3684de8c

ext4: optimize ext4_writepage() for attempted 4k delalloc writes · cccd147a

Authored by Theodore Ts'o on Oct 03, 2015

In cases where the file system block size is the same as the page
size, and ext4_writepage() is asked to write out a page which
either has the unwritten bit set in the extent tree, or which does not
yet have a block assigned due to delayed allocation, we can bail out
early, unlocking the page earlier and avoiding a round trip
through ext4_bio_write_page() with the attendant calls to
set_page_writeback() and redirty_page_for_writepage().
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

cccd147a

09 Sep, 2015 2 commits

ext4: add ext4_get_block_dax() · ed923b57

Authored by Matthew Wilcox on Sep 08, 2015

DAX wants different semantics from any currently-existing ext4 get_block
callback.  Unlike ext4_get_block_write(), it needs to honour the
'create' flag, and unlike ext4_get_block(), it needs to be able to
return unwritten extents.  So introduce a new ext4_get_block_dax() which
has those semantics.

We could also change ext4_get_block_write() to honour the 'create' flag,
but that might have consequences on other users that I do not currently
understand.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ed923b57

dax: move DAX-related functions to a new header · c94c2acf

Authored by Matthew Wilcox on Sep 08, 2015

In order to handle the !CONFIG_TRANSPARENT_HUGEPAGES case, we need to
return VM_FAULT_FALLBACK from the inlined dax_pmd_fault(), which is
defined in linux/mm.h.  Given that we don't want to include <linux/mm.h>
in <linux/fs.h>, the easiest solution is to move the DAX-related
functions to a new header, <linux/dax.h>.  We could also have moved
VM_FAULT_* definitions to a new header, or a different header that isn't
quite such a boil-the-ocean header as <linux/mm.h>, but this felt like
the best option.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

c94c2acf

29 Jul, 2015 1 commit

ext4: update c/mtime on truncate up · 911af577

Authored by Eryu Guan on Jul 28, 2015

Commit 3da40c7b ("ext4: only call ext4_truncate when size <= isize")
introduced a bug that c/mtime is not updated on truncate up.

Fix the issue by setting c/mtime explicitly in the truncate up case.

Note that ftruncate(2) is not affected, so you won't see this bug using
truncate(1) and xfs_io(1).
Signed-off-by: Zirong Lang <zorro.lang@gmail.com>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

911af577

24 Jul, 2015 1 commit

ext4: Handle error from dquot_initialize() · a7cdadee

Authored by Jan Kara on Jun 29, 2015

dquot_initialize() can now return error. Handle it where possible.
Acked-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.com>

a7cdadee

04 Jul, 2015 1 commit

ext4: fix reservation release on invalidatepage for delalloc fs · 9705acd6

Authored by Lukas Czerner on Jul 03, 2015

On delalloc enabled file system on invalidatepage operation
in ext4_da_page_release_reservation() we want to clear the delayed
buffer and remove the extent covering the delayed buffer from the extent
status tree.

However, there is currently a bug where on systems with page size >
block size we will always remove extents from the start of the page
regardless of where the actual delayed buffers are positioned in the page.
This leads to errors like this:

EXT4-fs warning (device loop0): ext4_da_release_space:1225:
ext4_da_release_space: ino 13, to_free 1 with only 0 reserved data
blocks

This, however, can cause data loss at writeback time if the file system is
in an ENOSPC condition because we're releasing the reservation for someone
else's delayed buffer.

Fix this by only removing extents that correspond to the part of the
page we want to invalidate.

This problem is reproducible with the following fio recipe (however, I was
only able to reproduce it with fio-2.1 or older):

[global]
bs=8k
iodepth=1024
iodepth_batch=60
randrepeat=1
size=1m
directory=/mnt/test
numjobs=20
[job1]
ioengine=sync
bs=1k
direct=1
rw=randread
filename=file1:file2
[job2]
ioengine=libaio
rw=randwrite
direct=1
filename=file1:file2
[job3]
bs=1k
ioengine=posixaio
rw=randwrite
direct=1
filename=file1:file2
[job5]
bs=1k
ioengine=sync
rw=randread
filename=file1:file2
[job7]
ioengine=libaio
rw=randwrite
filename=file1:file2
[job8]
ioengine=posixaio
rw=randwrite
filename=file1:file2
[job10]
ioengine=mmap
rw=randwrite
bs=1k
filename=file1:file2
[job11]
ioengine=mmap
rw=randwrite
direct=1
filename=file1:file2
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org

9705acd6

02 Jul, 2015 1 commit

ext4: fix fencepost error in lazytime optimization · 0f0ff9a9

Authored by Theodore Ts'o on Jul 01, 2015

Commit 8f4d8558: "ext4: fix lazytime optimization" was not a
complete fix.  In the case where the inode number is a multiple of 16,
we could still end up with an inode's dirty timestamps
written to the wrong inode on disk.  Oops.

This can be easily reproduced by using generic/005 with a file system
with metadata_csum and lazytime enabled.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org

0f0ff9a9
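
The commit message above does not show the arithmetic involved. As a hedged illustration of how a fencepost error can single out inode numbers that are multiples of 16, the sketch below assumes one-based inode numbers and 16 inodes per inode table block, so that a block holds inodes 16n+1 through 16n+16. Both function names are hypothetical, and the expressions are not claimed to be the kernel's own.

```c
#include <assert.h>

/*
 * First inode number stored in the same table block as `ino`.
 * Inode numbers are one-based, so shift to zero-based before masking.
 */
static unsigned long first_ino_in_block(unsigned long ino,
					unsigned long inodes_per_block)
{
	assert(ino >= 1);
	/* The mask trick only works for a power-of-two block population. */
	assert(inodes_per_block &&
	       (inodes_per_block & (inodes_per_block - 1)) == 0);
	return ((ino - 1) & ~(inodes_per_block - 1)) + 1;
}

/*
 * Off-by-one variant: masks the one-based number directly. It agrees with
 * the function above except when `ino` is a multiple of inodes_per_block,
 * where it lands on the first inode of the NEXT block.
 */
static unsigned long first_ino_in_block_off_by_one(unsigned long ino,
						   unsigned long inodes_per_block)
{
	return (ino & ~(inodes_per_block - 1)) + 1;
}
```

For inode 16 the first function yields 1 (the block holding inodes 1 through 16), while the off-by-one variant yields 17, which belongs to the following block.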

22 Jun, 2015 2 commits

ext4: only call ext4_truncate when size <= isize · 3da40c7b

Authored by Josef Bacik on Jun 22, 2015

At LSF we decided that if we truncate up from isize we shouldn't trim
fallocated blocks that were fallocated with KEEP_SIZE and are past the
new i_size.  This patch fixes ext4 to do this.

[ Completely reworked patch so that i_disksize would actually get set
  when truncating up.  Also reworked the code for handling truncate so
  that it's easier to handle. -- tytso ]
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>

3da40c7b

ext4: minor cleanup of ext4_da_reserve_space() · c27e43a1

Authored by Eric Whitney on Jun 21, 2015

Remove outdated comments and dead code from ext4_da_reserve_space.
Clean up its trace point, and relocate it to make it more useful.

While we're at it, fix a nearby conditional used to determine if
we have a non-bigalloc file system.  It doesn't match usage elsewhere
in the code, and misleadingly suggests that an s_cluster_ratio value
of 0 would be legal.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

c27e43a1

21 Jun, 2015 1 commit

ext4: prevent ext4_quota_write() from failing due to ENOSPC · c5e298ae

Authored by Theodore Ts'o on Jun 21, 2015

In order to prevent quota block tracking from becoming inaccurate when
ext4_quota_write() fails with ENOSPC, we make two changes.  The quota
file can now use the reserved blocks (since the quota file is arguably
file system metadata), and ext4_quota_write() now uses
ext4_should_retry_alloc() to retry the block allocation after a commit
has completed and released some blocks for allocation.

This fixes failures of xfstests generic/270:

Quota error (device vdc): write_blk: dquota write failed
Quota error (device vdc): qtree_write_dquot: Error -28 occurred while creating quota
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

c5e298ae
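
The retry-on-ENOSPC idea in this commit message can be sketched in userspace as below. This is a toy model under stated assumptions, not the ext4 code: `struct toy_fs`, `toy_alloc_block()`, `toy_should_retry_alloc()`, `toy_quota_write()`, and the retry limit of 3 are all hypothetical, and "waiting for a commit" is reduced to incrementing a counter.

```c
#include <assert.h>
#include <errno.h>

#define TOY_MAX_RETRIES 3 /* illustrative limit */

/* Toy filesystem: blocks become free once `free_after` commits have run. */
struct toy_fs {
	int commits_done;
	int free_after;
};

/* Hypothetical allocator: 0 on success, -ENOSPC while nothing is free. */
static int toy_alloc_block(struct toy_fs *fs)
{
	return fs->commits_done >= fs->free_after ? 0 : -ENOSPC;
}

/*
 * Stand-in for "let a commit complete, then report whether another
 * allocation attempt is worthwhile". Gives up after TOY_MAX_RETRIES.
 */
static int toy_should_retry_alloc(struct toy_fs *fs, int *retries)
{
	if ((*retries)++ >= TOY_MAX_RETRIES)
		return 0;
	fs->commits_done++; /* pretend a commit released some blocks */
	return 1;
}

/* Retry the allocation only on -ENOSPC, and only while retries remain. */
static int toy_quota_write(struct toy_fs *fs)
{
	int retries = 0;
	int err;

	assert(fs);
	do {
		err = toy_alloc_block(fs);
	} while (err == -ENOSPC && toy_should_retry_alloc(fs, &retries));
	return err;
}
```

If blocks are released within the retry budget the write succeeds; otherwise the caller still sees `-ENOSPC`, so the loop cannot spin forever.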

13 Jun, 2015 1 commit

ext4: fix race between truncate and __ext4_journalled_writepage() · bdf96838

Authored by Theodore Ts'o on Jun 12, 2015

The commit cf108bca: "ext4: Invert the locking order of page_lock
and transaction start" caused __ext4_journalled_writepage() to drop
the page lock before the page was written back, as part of changing
the locking order to jbd2_journal_start -> page_lock.  However, this
introduced a potential race if there was a truncate racing with the
data=journalled writeback mode.

Fix this by grabbing the page lock after starting the journal handle,
and then checking to see if page had gotten truncated out from under
us.

This fixes a number of different warnings or BUG_ON's when running
xfstests generic/086 in data=journalled mode, including:

jbd2_journal_dirty_metadata: vdc-8: bad jh for block 115643: transaction (ee3fe7
c0, 164), jh->b_transaction (  (null), 0), jh->b_next_transaction (  (null), 0), jlist 0

	      	      	  - and -

kernel BUG at /usr/projects/linux/ext4/fs/jbd2/transaction.c:2200!
    ...
Call Trace:
 [<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117
 [<c02b2de5>] __ext4_journalled_invalidatepage+0x10f/0x117
 [<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117
 [<c027d883>] ? lock_buffer+0x36/0x36
 [<c02b2dfa>] ext4_journalled_invalidatepage+0xd/0x22
 [<c0229139>] do_invalidatepage+0x22/0x26
 [<c0229198>] truncate_inode_page+0x5b/0x85
 [<c022934b>] truncate_inode_pages_range+0x156/0x38c
 [<c0229592>] truncate_inode_pages+0x11/0x15
 [<c022962d>] truncate_pagecache+0x55/0x71
 [<c02b913b>] ext4_setattr+0x4a9/0x560
 [<c01ca542>] ? current_kernel_time+0x10/0x44
 [<c026c4d8>] notify_change+0x1c7/0x2be
 [<c0256a00>] do_truncate+0x65/0x85
 [<c0226f31>] ? file_ra_state_init+0x12/0x29

	      	      	  - and -

WARNING: CPU: 1 PID: 1331 at /usr/projects/linux/ext4/fs/jbd2/transaction.c:1396
irty_metadata+0x14a/0x1ae()
    ...
Call Trace:
 [<c01b879f>] ? console_unlock+0x3a1/0x3ce
 [<c082cbb4>] dump_stack+0x48/0x60
 [<c0178b65>] warn_slowpath_common+0x89/0xa0
 [<c02ef2cf>] ? jbd2_journal_dirty_metadata+0x14a/0x1ae
 [<c0178bef>] warn_slowpath_null+0x14/0x18
 [<c02ef2cf>] jbd2_journal_dirty_metadata+0x14a/0x1ae
 [<c02d8615>] __ext4_handle_dirty_metadata+0xd4/0x19d
 [<c02b2f44>] write_end_fn+0x40/0x53
 [<c02b4a16>] ext4_walk_page_buffers+0x4e/0x6a
 [<c02b59e7>] ext4_writepage+0x354/0x3b8
 [<c02b2f04>] ? mpage_release_unused_pages+0xd4/0xd4
 [<c02b1b21>] ? wait_on_buffer+0x2c/0x2c
 [<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8
 [<c02b5a5b>] __writepage+0x10/0x2e
 [<c0225956>] write_cache_pages+0x22d/0x32c
 [<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8
 [<c02b6ee8>] ext4_writepages+0x102/0x607
 [<c019adfe>] ? sched_clock_local+0x10/0x10e
 [<c01a8a7c>] ? __lock_is_held+0x2e/0x44
 [<c01a8ad5>] ? lock_is_held+0x43/0x51
 [<c0226dff>] do_writepages+0x1c/0x29
 [<c0276bed>] __writeback_single_inode+0xc3/0x545
 [<c0277c07>] writeback_sb_inodes+0x21f/0x36d
    ...
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org

bdf96838

04 Jun, 2015 1 commit

dax: don't abuse get_block mapping for endio callbacks · e842f290

Authored by Dave Chinner on Jun 04, 2015

dax_fault() currently relies on the get_block callback to attach an
io completion callback to the mapping buffer head so that it can
run unwritten extent conversion after zeroing allocated blocks.

Instead of this hack, pass the conversion callback directly into
dax_fault() similar to the get_block callback. When the filesystem
allocates unwritten extents, it will set the buffer_unwritten()
flag, and hence the dax_fault code can call the completion function
in the contexts where it is necessary without overloading the
mapping buffer head.

Note: The changes to ext4 to use this interface are suspect at best.
In fact, the way ext4 did this end_io assignment in the first place
looks suspect because it only set a completion callback when there
wasn't already some other write() call taking place on the same
inode. The ext4 end_io code looks rather intricate and fragile with
all its reference counting and passing to different contexts for
modification via inode private pointers that aren't protected by
locks...
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>

e842f290

15 May, 2015 1 commit

ext4: fix lazytime optimization · 8f4d8558

Authored by Theodore Ts'o on May 14, 2015

We had a fencepost error in the lazytime optimization which means that
timestamp would get written to the wrong inode.

Cc: stable@vger.kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

8f4d8558

11 May, 2015 2 commits

ext4: switch to simple_follow_link() · 75e7566b

Authored by Al Viro on May 02, 2015

for fast symlinks only, of course...
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

75e7566b

ext4: split inode_operations for encrypted symlinks off the rest · a7a67e8a

Authored by Al Viro on Apr 27, 2015

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

a7a67e8a

03 May, 2015 1 commit

ext4: fix data corruption caused by unwritten and delayed extents · d2dc317d

Authored by Lukas Czerner on May 02, 2015

Currently it is possible to lose a whole file system block's worth of data
when we hit a specific interaction between unwritten and delayed extents
in the extent status tree.

The problem is that when we insert a delayed extent into the extent status
tree, the only way to get rid of it is to write out the delayed buffer.
However, there is a limitation in the extent status tree implementation
so that when inserting an unwritten extent, should there be even a single
delayed block, the whole unwritten extent would be marked as delayed.

At this point, there is no way to get rid of the delayed extents,
because there are no delayed buffers to write out. So when we write
into said unwritten extent we will convert it to written, but it still
remains delayed.

When we try to write into that block later, ext4_da_map_blocks() will set
the buffer new and delayed and map it to an invalid block, which causes
the rest of the block to be zeroed, losing already written data.

For now we can fix this by simply not allowing the delayed status to be set
on a written extent in the extent status tree. Also add WARN_ON() to make
sure that we notice if this happens in the future.

This problem can be easily reproduced by running the following xfs_io
commands:

xfs_io -f -c "pwrite -S 0xaa 4096 2048" \
          -c "falloc 0 131072" \
          -c "pwrite -S 0xbb 65536 2048" \
          -c "fsync" /mnt/test/fff

echo 3 > /proc/sys/vm/drop_caches
xfs_io -c "pwrite -S 0xdd 67584 2048" /mnt/test/fff

This can theoretically also be reproduced at random by running fsx,
but it's not very reliable, though on machines with a bigger page size
(like ppc) this can be seen more often (especially xfstest generic/127).
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org

d2dc317d

25 Apr, 2015 1 commit

direct-io: only inc/dec inode->i_dio_count for file systems · fe0f07d0

Authored by Jens Axboe on Apr 15, 2015

do_blockdev_direct_IO() increments and decrements the inode
->i_dio_count for each IO operation. It does this to protect against
truncate of a file. Block devices don't need this sort of protection.

For a capable multiqueue setup, this atomic int is the only shared
state between applications accessing the device for O_DIRECT, and it
presents a scaling wall for that. In my testing, as much as 30% of
system time is spent incrementing and decrementing this value. A mixed
read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
better latencies too. Before:

clat percentiles (usec):
 |  1.00th=[   33],  5.00th=[   34], 10.00th=[   34], 20.00th=[   34],
 | 30.00th=[   34], 40.00th=[   34], 50.00th=[   35], 60.00th=[   35],
 | 70.00th=[   35], 80.00th=[   35], 90.00th=[   37], 95.00th=[   80],
 | 99.00th=[   98], 99.50th=[  151], 99.90th=[  155], 99.95th=[  155],
 | 99.99th=[  165]

After:

clat percentiles (usec):
 |  1.00th=[   95],  5.00th=[  108], 10.00th=[  129], 20.00th=[  149],
 | 30.00th=[  155], 40.00th=[  161], 50.00th=[  167], 60.00th=[  171],
 | 70.00th=[  177], 80.00th=[  185], 90.00th=[  201], 95.00th=[  270],
 | 99.00th=[  390], 99.50th=[  398], 99.90th=[  418], 99.95th=[  422],
 | 99.99th=[  438]

In other setups, Robert Elliott reported seeing good performance
improvements:

https://lkml.org/lkml/2015/4/3/557

The more applications accessing the device, the worse it gets.

Add a new direct-io flag, DIO_SKIP_DIO_COUNT, which tells
do_blockdev_direct_IO() that it need not worry about incrementing
or decrementing the inode i_dio_count for this caller.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Elliott, Robert (Server Storage) <elliott@hp.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

fe0f07d0

16 Apr, 2015 3 commits

ext4 crypto: add symlink encryption · f348c252

Authored by Theodore Ts'o on Apr 16, 2015

Signed-off-by: Uday Savagaonkar <savagaon@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

f348c252

dax: unify ext2/4_{dax,}_file_operations · be64f884

Authored by Boaz Harrosh on Apr 15, 2015

The original dax patchset split the ext2/4_file_operations because of the
two NULL splice_read/splice_write in the dax case.

In the vfs if splice_read/splice_write are NULL we then call
default_splice_read/write.

What we do here is make generic_file_splice_read aware of IS_DAX() so the
original ext2/4_file_operations can be used as is.

For write it appears that iter_file_splice_write is just fine.  It uses
the regular f_op->write(file,..) or new_sync_write(file, ...).
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

be64f884

VFS: normal filesystems (and lustre): d_inode() annotations · 2b0143b5

Authored by David Howells on Mar 17, 2015

that's the bulk of filesystem drivers dealing with inodes of their own
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

2b0143b5

12 Apr, 2015 5 commits

ext4 crypto: implement the ext4 decryption read path · c9c7429c

Authored by Michael Halcrow on Apr 12, 2015

Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Ildar Muslukhov <ildarm@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

c9c7429c

ext4 crypto: implement the ext4 encryption write path · 2058f83a

Authored by Michael Halcrow on Apr 12, 2015

Pulls block_write_begin() into fs/ext4/inode.c because it might need
to do a low-level read of the existing data, in which case we need to
decrypt it.
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Ildar Muslukhov <ildarm@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

2058f83a

direct_IO: remove rw from a_ops->direct_IO() · 22c6186e

Authored by Omar Sandoval on Mar 16, 2015

Now that no one is using rw, remove it completely.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

22c6186e

direct_IO: use iov_iter_rw() instead of rw everywhere · 6f673763

Authored by Omar Sandoval on Mar 16, 2015

The rw parameter to direct_IO is redundant with iov_iter->type, and
treated slightly differently just about everywhere it's used: some users
do rw & WRITE, and others do rw == WRITE where they should be doing a
bitwise check. Simplify this with the new iov_iter_rw() helper, which
always returns either READ or WRITE.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

6f673763
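
The pitfall this commit message describes, comparing a flag word with `==` where a bitwise check is needed, can be illustrated with a small userspace sketch. It assumes the direction lives in the low bit of a `type` word that may also carry other flag bits; the `toy_` names and the flag values are hypothetical.

```c
#include <assert.h>

#define TOY_READ	0u
#define TOY_WRITE	1u
#define TOY_RW_MASK	1u
#define TOY_ITER_KVEC	2u /* illustrative extra flag sharing the word */

/* Minimal iterator: direction bit plus other flags in one field. */
struct toy_iov_iter {
	unsigned int type;
};

/* Always returns exactly TOY_READ or TOY_WRITE, whatever else is set. */
static inline unsigned int toy_iov_iter_rw(const struct toy_iov_iter *i)
{
	assert(i);
	return i->type & TOY_RW_MASK;
}
```

For an iterator whose `type` is `TOY_WRITE | TOY_ITER_KVEC`, the test `type == TOY_WRITE` is false even though it is a write, whereas `toy_iov_iter_rw(&it) == TOY_WRITE` gives the intended answer.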

Remove rw from dax_{do_,}io() · a95cd631

Authored by Omar Sandoval on Mar 16, 2015

And use iov_iter_rw() instead.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

a95cd631

openanolis / cloud-kernel synced successfully about 1 year ago
