提交 · 6515925b8259549b7f2187e25d3260306e3e85e5 · openeuler / raspberrypi-kernel

22 2月, 2013 1 次提交

mm: only enforce stable page writes if the backing device requires it · 1d1d1a76

由 Darrick J. Wong 提交于 2月 21, 2013

Create a helper function to check if a backing device requires stable
page writes and, if so, performs the necessary wait.  Then, make it so
that all points in the memory manager that handle making pages writable
use the helper function.  This should provide stable page write support
to most filesystems, while eliminating unnecessary waiting for devices
that don't require the feature.

Before this patchset, all filesystems would block, regardless of whether
or not it was necessary.  ext3 would wait, but still generate occasional
checksum errors.  The network filesystems were left to do their own
thing, so they'd wait too.

After this patchset, all the disk filesystems except ext3 and btrfs will
wait only if the hardware requires it.  ext3 (if necessary) snapshots
pages instead of blocking, and btrfs provides its own bdi so the mm will
never wait.  Network filesystems haven't been touched, so either they
provide their own stable page guarantees or they don't block at all.
The blocking behavior is back to what it was before 3.0 if you don't
have a disk requiring stable page writes.

Here's the result of using dbench to test latency on ext2:

3.8.0-rc3:
 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 WriteX        109347     0.028    59.817
 ReadX         347180     0.004     3.391
 Flush          15514    29.828   287.283

Throughput 57.429 MB/sec  4 clients  4 procs  max_latency=287.290 ms

3.8.0-rc3 + patches:
 WriteX        105556     0.029     4.273
 ReadX         335004     0.005     4.112
 Flush          14982    30.540   298.634

Throughput 55.4496 MB/sec  4 clients  4 procs  max_latency=298.650 ms

As you can see, the maximum write latency drops considerably with this
patch enabled.  The other filesystems (ext3/ext4/xfs/btrfs) behave
similarly, but see the cover letter for those results.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Acked-by: NSteven Whitehouse <swhiteho@redhat.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

1d1d1a76

18 2月, 2013 4 次提交

ext4: lookup block mapping in extent status tree · d100eef2

由 Zheng Liu 提交于 2月 18, 2013

After tracking all extent status, we already have a extent cache in
memory.  Every time we want to lookup a block mapping, we can first
try to lookup it in extent status tree to avoid a potential disk I/O.

A new function called ext4_es_lookup_extent is defined to finish this
work.  When we try to lookup a block mapping, we always call
ext4_map_blocks and/or ext4_da_map_blocks.  So in these functions we
first try to lookup a block mapping in extent status tree.

A new flag EXT4_GET_BLOCKS_NO_PUT_HOLE is used in ext4_da_map_blocks
in order not to put a hole into extent status tree because this hole
will be converted to delayed extent in the tree immediately.
Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
Cc: Jan kara <jack@suse.cz>

d100eef2

ext4: track all extent status in extent status tree · f7fec032

由 Zheng Liu 提交于 2月 18, 2013

By recording the phycisal block and status, extent status tree is able
to track the status of every extents.  When we call _map_blocks
functions to lookup an extent or create a new written/unwritten/delayed
extent, this extent will be inserted into extent status tree.

We don't load all extents from disk in alloc_inode() because it costs
too much memory, and if a file is opened and closed frequently it will
takes too much time to load all extent information.  So currently when
we create/lookup an extent, this extent will be inserted into extent
status tree.  Hence, the extent status tree may not comprehensively
contain all of the extents found in the file.

Here a condition we need to take care is that an extent might contains
unwritten and delayed status simultaneously because an extent is delayed
allocated and could be allocated by fallocate.  At this time we need to
keep delayed status because later we need to update delayed reservation
space using it.
Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
Cc: Jan kara <jack@suse.cz>

f7fec032

ext4: let ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag · a25a4e1a

由 Zheng Liu 提交于 2月 18, 2013

This commit lets ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag
because in later commit ext4_map_blocks needs to use this flag to
determine the extent status.
Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: NJan Kara <jack@suse.cz>

a25a4e1a

ext4: add physical block and status member into extent status tree · fdc0212e

由 Zheng Liu 提交于 2月 18, 2013

This commit adds two members in extent_status structure to let it record
physical block and extent status.  Here es_pblk is used to record both
of them because physical block only has 48 bits.  So extent status could
be stashed into it so that we can save some memory.  Now written,
unwritten, delayed and hole are defined as status.

Due to new member is added into extent status tree, all interfaces need
to be adjusted.
Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: NJan Kara <jack@suse.cz>

fdc0212e

15 2月, 2013 3 次提交

ext4: use ERR_PTR() abstraction for ext4_append() · 0f70b406

由 Theodore Ts'o 提交于 2月 15, 2013

Use ERR_PTR()/IS_ERR() abstraction instead of passing in a separate
pointer to an integer for the error code, as a code cleanup.
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

0f70b406

ext4: add debugging context for warning in ext4_da_update_reserve_space() · 01a523eb

由 Theodore Ts'o 提交于 2月 14, 2013

Print some additional debugging context to hopefully help to debug a
warning which is getting triggered by xfstests #74.

Also remove extraneous newlines from when printk's were converted to
ext4_warning() and ext4_msg().
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

01a523eb

ext4: use KERN_WARNING for warning messages · 8de5c325

由 Theodore Ts'o 提交于 2月 14, 2013

Some messages printed related to a WARN_ON(1) were printed using
KERN_NOTICE.  Use KERN_WARNING or ext4_warning() instead so that
context related to the WARN_ON() is printed at the same printk warning
level (and log files, etc.)
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

8de5c325

09 2月, 2013 2 次提交

ext4: grab page before starting transaction handle in write_begin() · 47564bfb

由 Theodore Ts'o 提交于 2月 09, 2013

The grab_cache_page_write_begin() function can potentially sleep for a
long time, since it may need to do memory allocation which can block
if the system is under significant memory pressure, and because it may
be blocked on page writeback.  If it does take a long time to grab the
page, it's better that we not hold an active jbd2 handle.

So grab a handle on the page first, and _then_ start the transaction
handle.

This commit fixes the following long transaction handle hold time:

postmark-2917  [000] ....   196.435786: jbd2_handle_stats: dev 254,32
   tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1
   dirtied_blocks 0
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: NJan Kara <jack@suse.cz>

47564bfb

ext4: pass context information to jbd2__journal_start() · 9924a92a

由 Theodore Ts'o 提交于 2月 08, 2013

So we can better understand what bits of ext4 are responsible for
long-running jbd2 handles, use jbd2__journal_start() so we can pass
context information for logging purposes.

The recommended way for finding the longer-running handles is:

   T=/sys/kernel/debug/tracing
   EVENT=$T/events/jbd2/jbd2_handle_stats
   echo "interval > 5" > $EVENT/filter
   echo 1 > $EVENT/enable

   ./run-my-fs-benchmark

   cat $T/trace > /tmp/problem-handles

This will list handles that were active for longer than 20ms.  Having
longer-running handles is bad, because a commit started at the wrong
time could stall for those 20+ milliseconds, which could delay an
fsync() or an O_SYNC operation.  Here is an example line from the
trace file describing a handle which lived on for 311 jiffies, or over
1.2 seconds:

postmark-2917  [000] ....   196.435786: jbd2_handle_stats: dev 254,32 
   tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1
   dirtied_blocks 0
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

9924a92a

30 1月, 2013 1 次提交

ext4: fix possible use-after-free with AIO · 091e26df

由 Jan Kara 提交于 1月 29, 2013

Running AIO is pinning inode in memory using file reference. Once AIO
is completed using aio_complete(), file reference is put and inode can
be freed from memory. So we have to be sure that calling aio_complete()
is the last thing we do with the inode.

CC: stable@vger.kernel.org
Reviewed-by: NCarlos Maiolino <cmaiolino@redhat.com>
Acked-by: NJeff Moyer <jmoyer@redhat.com>
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

091e26df

29 1月, 2013 3 次提交

ext4: fix ext4_writepage() to achieve data=ordered guarantees · fe386132

由 Jan Kara 提交于 1月 28, 2013

So far ext4_writepage() skipped writing pages that had any delayed or
unwritten buffers attached. When blocksize < pagesize this breaks
data=ordered mode guarantees as we can have a page with one freshly
allocated buffer whose allocation is part of the committing
transaction and another buffer in the page which is delayed or
unwritten. So fix this problem by calling ext4_bio_writepage()
anyway. It will submit mapped buffers and leave others alone.
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

fe386132

ext4: simplify mpage_add_bh_to_extent() · b6a8e62f

由 Jan Kara 提交于 1月 28, 2013

The argument b_size of mpage_add_bh_to_extent() was bogus since it was
always == blocksize (which we can easily derive from inode->i_blkbits).
Also second branch of condition:
	if (nrblocks >= EXT4_MAX_TRANS_DATA) {
	} else if ((nrblocks + (b_size >> mpd->inode->i_blkbits)) >
						EXT4_MAX_TRANS_DATA) {
	}
was never taken because (b_size >> mpd->inode->i_blkbits) == 1.
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

b6a8e62f

ext4: dirty page has always buffers attached · f8bec370

由 Jan Kara 提交于 1月 28, 2013

ext4_writepage(), write_cache_pages_da(), and mpage_da_submit_io()
doesn't have to deal with the case when page doesn't have buffers. We
attach buffers to a page in ->write_begin() and ->page_mkwrite() which
covers all places where a page can become dirty.
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

f8bec370

28 1月, 2013 3 次提交

ext4: remove __ext4_journalled_writepage() from mpage_da_submit_io() · fe089c77

由 Jan Kara 提交于 1月 28, 2013

We don't support delayed allocation in data=journal mode. So checking for it in
mpage_da_submit_io() doesn't make really sence. If we ever decide to extend
delayed allocation support to data=journal mode, adding
__ext4_journalled_writepage() call will be the least of problems we have to
solve. Most likely we'd have to implement separate writepages call anyways
because we don't have transaction credits for writing more than a single page
so mapping of page buffers would have to be done differently.
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

fe089c77

ext4: Always use ext4_bio_write_page() for writeout · 36ade451

由 Jan Kara 提交于 1月 28, 2013

Currently we sometimes used block_write_full_page() and sometimes
ext4_bio_write_page() for writeback (depending on mount options and call
path). Let's always use ext4_bio_write_page() to simplify things a bit.
Reviewed-by: NZheng Liu <wenqing.lz@taobao.com>
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

36ade451

ext4: add punching hole support for non-extent-mapped files · 8bad6fc8

由 Zheng Liu 提交于 1月 28, 2013

This patch add supports for indirect file support punching hole.  It
is almost the same as ext4_ext_punch_hole.  First, we invalidate all
pages between this hole, and then we try to deallocate all blocks of
this hole.

A recursive function is used to handle deallocation of blocks.  In
this function, it iterates over the entries in inode's i_blocks or
indirect blocks, and try to free the block for each one of them.

After applying this patch, xfstest #255 will not pass w/o extent because
indirect-based file doesn't support unwritten extents.
Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

8bad6fc8

17 1月, 2013 1 次提交

ext4: add tracepoint in punching hole · aaddea81

由 Zheng Liu 提交于 1月 16, 2013

This patch adds a tracepoint in ext4_punch_hole.

CC: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

aaddea81

13 1月, 2013 2 次提交

ext4: use unlikely to improve the efficiency of the kernel · aebf0243

由 Wang Shilong 提交于 1月 12, 2013

Because the function 'sb_getblk' seldomly fails to return NULL
value,it will be better to use 'unlikely' to optimize it.
Signed-off-by: NWang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

aebf0243

ext4: return ENOMEM if sb_getblk() fails · 860d21e2

由 Theodore Ts'o 提交于 1月 12, 2013

The only reason for sb_getblk() failing is if it can't allocate the
buffer_head.  So ENOMEM is more appropriate than EIO.  In addition,
make sure that the file system is marked as being inconsistent if
sb_getblk() fails.
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org

860d21e2

26 12月, 2012 2 次提交

ext4: fix deadlock in journal_unmap_buffer() · 53e87268

由 Jan Kara 提交于 12月 25, 2012

We cannot wait for transaction commit in journal_unmap_buffer()
because we hold page lock which ranks below transaction start.  We
solve the issue by bailing out of journal_unmap_buffer() and
jbd2_journal_invalidatepage() with -EBUSY.  Caller is then responsible
for waiting for transaction commit to finish and try invalidation
again. Since the issue can happen only for page stradding i_size, it
is simple enough to manually call jbd2_journal_invalidatepage() for
such page from ext4_setattr(), check the return value and wait if
necessary.
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

53e87268

ext4: split off ext4_journalled_invalidatepage() · 4520fb3c

由 Jan Kara 提交于 12月 25, 2012

In data=journal mode we don't need delalloc or DIO handling in invalidatepage
and similarly in other modes we don't need the journal handling. So split
invalidatepage implementations.
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

4520fb3c

11 12月, 2012 6 次提交

ext4: let ext4_truncate handle inline data correctly · aef1c851

由 Tao Ma 提交于 12月 10, 2012

Signed-off-by: NRobin Dong <sanbai@taobao.com>
Signed-off-by: NTao Ma <boyu.mt@taobao.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

aef1c851

ext4: add delalloc support for inline data · 9c3569b5

由 Tao Ma 提交于 12月 10, 2012

For delayed allocation mode, we write to inline data if the file
is small enough. And in case of we write to some offset larger
than the inline size, the 1st page is dirtied, so that
ext4_da_writepages can handle the conversion. When the 1st page
is initialized with blocks, the inline part is removed.
Signed-off-by: NTao Ma <boyu.mt@taobao.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

9c3569b5

T
ext4: add journalled write support for inline data · 3fdcfb66
由 Tao Ma 提交于 12月 10, 2012
```
Signed-off-by: NTao Ma <boyu.mt@taobao.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
```
3fdcfb66

ext4: add normal write support for inline data · f19d5870

由 Tao Ma 提交于 12月 10, 2012

For a normal write case (not journalled write, not delayed
allocation), we write to the inline if the file is small and convert
it to an extent based file when the write is larger than the max
inline size.
Signed-off-by: NTao Ma <boyu.mt@taobao.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

f19d5870

ext4: add read support for inline data · 46c7f254

由 Tao Ma 提交于 12月 10, 2012

Let readpage and readpages handle the case when we want to read an
inlined file.
Signed-off-by: NTao Ma <boyu.mt@taobao.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

46c7f254

ext4: add the basic function for inline data support · 67cf5b09

由 Tao Ma 提交于 12月 10, 2012

Implement inline data with xattr.

Now we use "system.data" to store xattr, and the xattr will
be extended if the i_size is increased while we don't release
the space during truncate.
Signed-off-by: NTao Ma <boyu.mt@taobao.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

67cf5b09

03 12月, 2012 1 次提交

ext4: move extra inode read to a new function · 152a7b0a

由 Tao Ma 提交于 12月 02, 2012

Currently, in ext4_iget we do a simple check to see whether
there does exist some information starting from the end
of i_extra_size. With inline data added, this procedure
is more complicated. So move it to a new function named
ext4_iget_extra_inode.
Signed-off-by: NTao Ma <boyu.mt@taobao.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

152a7b0a

30 11月, 2012 1 次提交

ext4: restructure ext4_ext_direct_IO() · 69c499d1

由 Theodore Ts'o 提交于 11月 29, 2012

Remove a level of indentation by moving the DIO read and extending
write case to the beginning of the file.  This results in no actual
programmatic changes to the file, but makes it easier to
read/understand.
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

69c499d1

16 11月, 2012 1 次提交

ext4: remove calls to ext4_jbd2_file_inode() from delalloc write path · f3b59291

由 Theodore Ts'o 提交于 11月 15, 2012

The calls to ext4_jbd2_file_inode() are needed to guarantee that we do
not expose stale data in the data=ordered mode. However, they are not
necessary because in all of the cases where we have newly allocated
blocks in the delayed allocation write path, we immediately submit the
dirty pages for I/O. Hence, we can avoid the overhead of adding the
inode to the list of inodes whose data pages will be to be flushed out
to disk completely during the next commit operation.
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

f3b59291

15 11月, 2012 1 次提交

ext4: init pagevec in ext4_da_block_invalidatepages · 66bea92c

由 Eric Sandeen 提交于 11月 14, 2012

ext4_da_block_invalidatepages is missing a pagevec_init(),
which means that pvec->cold contains random garbage.

This affects whether the page goes to the front or
back of the LRU when ->cold makes it to
free_hot_cold_page()
Reviewed-by: NLukas Czerner <lczerner@redhat.com>
Reviewed-by: NCarlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: NEric Sandeen <sandeen@redhat.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org

66bea92c

09 11月, 2012 3 次提交

ext4: reimplement ext4_find_delay_alloc_range on extent status tree · 7d1b1fbc

由 Zheng Liu 提交于 11月 08, 2012

Signed-off-by: NYongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: NAllison Henderson <achender@linux.vnet.ibm.com>
Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

7d1b1fbc

ext4: let ext4 maintain extent status tree · 51865fda

由 Zheng Liu 提交于 11月 08, 2012

This patch lets ext4 maintain extent status tree.

Currently it only tracks delay extent status in extent status tree.  When a
delay allocation is issued, the related delay extent will be inserted into
extent status tree.  When a delay extent is written out or invalidated, it will
be removed from this tree.
Signed-off-by: NYongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: NAllison Henderson <achender@linux.vnet.ibm.com>
Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

51865fda

ext4: remove code duplication in ext4_get_block_write_nolock() · 8b0f165f

由 Anatol Pomozov 提交于 11月 08, 2012

729f52c6 introduced function ext4_get_block_write_nolock() that
is very similar to _ext4_get_block(). Eliminate code duplication
by passing different flags to _ext4_get_block()

Tested: xfs tests
Reviewed-by: NZheng Liu <wenqing.lz@taobao.com>
Signed-off-by: NAnatol Pomozov <anatol.pomozov@gmail.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

8b0f165f

01 10月, 2012 1 次提交

ext4: fix mtime update in nodelalloc mode · 041bbb6d

由 Theodore Ts'o 提交于 9月 30, 2012

Commits 5e8830dc and 41c4d25f introduced a regression into
v3.6-rc1 for ext4 in nodealloc mode, such that mtime updates would not
take place for files modified via mmap if the page was already in the
page cache.  This would also affect ext3 file systems mounted using
the ext4 file system driver.

The problem was that ext4_page_mkwrite() had a shortcut which would
avoid calling __block_page_mkwrite() under some circumstances, and the
above two commit transferred the responsibility of calling
file_update_time() to __block_page_mkwrite --- which woudln't get
called in some circumstances.

Since __block_page_mkwrite() only has three callers,
block_page_mkwrite(), ext4_page_mkwrite, and nilfs_page_mkwrite(), the
best way to solve this is to move the responsibility for calling
file_update_time() to its caller.

This problem was found via xfstests #215 with a file system mounted
with -o nodelalloc.
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: NJan Kara <jack@suse.cz>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: stable@vger.kernel.org

041bbb6d

29 9月, 2012 4 次提交

ext4: serialize truncate with owerwrite DIO workers · 1f555cfa

由 Dmitry Monakhov 提交于 9月 29, 2012

Jan Kara have spotted interesting issue:
There are  potential data corruption issue with  direct IO overwrites
racing with truncate:
 Like:
  dio write                      truncate_task
  ->ext4_ext_direct_IO
   ->overwrite == 1
    ->down_read(&EXT4_I(inode)->i_data_sem);
    ->mutex_unlock(&inode->i_mutex);
                               ->ext4_setattr()
                                ->inode_dio_wait()
                                ->truncate_setsize()
                                ->ext4_truncate()
                                 ->down_write(&EXT4_I(inode)->i_data_sem);
    ->__blockdev_direct_IO
     ->ext4_get_block
     ->submit_io()
    ->up_read(&EXT4_I(inode)->i_data_sem);
                                 # truncate data blocks, allocate them to
                                 # other inode - bad stuff happens because
                                 # dio is still in flight.

In order to serialize with truncate dio worker should grab extra i_dio_count
reference before drop i_mutex.
Reviewed-by: NJan Kara <jack@suse.cz>
Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

1f555cfa

ext4: endless truncate due to nonlocked dio readers · 1b65007e

由 Dmitry Monakhov 提交于 9月 29, 2012

If we have enough aggressive DIO readers, truncate and other dio
waiters will wait forever inside inode_dio_wait(). It is reasonable
to disable nonlock DIO read optimization during truncate.
Reviewed-by: NJan Kara <jack@suse.cz>
Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

1b65007e

ext4: serialize unlocked dio reads with truncate · 1c9114f9

由 Dmitry Monakhov 提交于 9月 29, 2012

Current serialization will works only for DIO which holds
i_mutex, but nonlocked DIO following race is possible:

dio_nolock_read_task            truncate_task
				->ext4_setattr()
				 ->inode_dio_wait()
->ext4_ext_direct_IO
  ->ext4_ind_direct_IO
    ->__blockdev_direct_IO
      ->ext4_get_block
				 ->truncate_setsize()
				 ->ext4_truncate()
				 #alloc truncated blocks
				 #to other inode
      ->submit_io()
     #INFORMATION LEAK

In order to serialize with unlocked DIO reads we have to
rearrange wait sequence
1) update i_size first
2) if i_size about to be reduced wait for outstanding DIO requests
3) and only after that truncate inode blocks
Reviewed-by: NJan Kara <jack@suse.cz>
Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

1c9114f9

ext4: serialize dio nonlocked reads with defrag workers · 17335dcc

由 Dmitry Monakhov 提交于 9月 29, 2012

Inode's block defrag and ext4_change_inode_journal_flag() may
affect nonlocked DIO reads result, so proper synchronization
required.

- Add missed inode_dio_wait() calls where appropriate
- Check inode state under extra i_dio_count reference.
Reviewed-by: NJan Kara <jack@suse.cz>
Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

17335dcc