Commits · 0ccff1a49def92d6b838a6da166c89004b3a4d0c · OpenHarmony / kernel_linux

18 Aug, 2009 2 commits

ext4: Fix possible deadlock between ext4_truncate() and ext4_get_blocks() · 487caeef

Committed by Jan Kara on Aug 17, 2009

During truncate we are sometimes forced to start a new transaction as
the amount of blocks to be journaled is both quite large and hard to
predict. So far we restarted a transaction while holding i_data_sem
and that violates lock ordering because i_data_sem ranks below a
transaction start (and it can lead to a real deadlock with
ext4_get_blocks() mapping blocks in some page while having a
transaction open).

We fix the problem by dropping the i_data_sem before restarting the
transaction and reacquiring it afterwards. It's slightly subtle that this
works:

1) By the time ext4_truncate() is called, all the page cache for the
truncated part of the file is dropped so get_block() should not be
called on it (we only have to invalidate extent cache after we
reacquire i_data_sem because some extent from not-truncated part could
extend also into the part we are going to truncate).

2) Writes, migrate or defrag hold i_mutex so they are stopped for all
the time of the truncate.

This bug has been found and analyzed by Theodore Tso <tytso@mit.edu>.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

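The lock-ordering argument in the commit above can be illustrated with a toy rank checker. This is only a sketch: the ranks, `struct lock_state`, and the `toy_*` and `restart_*` names are invented for illustration and are not kernel APIs. It assumes, as the message states, that a transaction start must come before i_data_sem in the lock order.

```c
#include <stdbool.h>

/* Toy ranks: a lower number must be taken first (ordering assumed from the commit text). */
enum lock_rank { RANK_TRANSACTION = 1, RANK_I_DATA_SEM = 2 };

struct lock_state {
    bool held[3];     /* indexed by rank; slot 0 is unused */
    int violations;   /* number of out-of-order acquisitions seen */
};

static void toy_acquire(struct lock_state *s, enum lock_rank r)
{
    /* Taking rank r while a higher rank is already held inverts the order. */
    for (int higher = r + 1; higher <= RANK_I_DATA_SEM; higher++)
        if (s->held[higher])
            s->violations++;
    s->held[r] = true;
}

static void toy_release(struct lock_state *s, enum lock_rank r)
{
    s->held[r] = false;
}

/* Old behaviour: restart the transaction while i_data_sem is still held. */
static int restart_holding_sem(void)
{
    struct lock_state s = {0};

    toy_acquire(&s, RANK_TRANSACTION);
    toy_acquire(&s, RANK_I_DATA_SEM);
    toy_release(&s, RANK_TRANSACTION);   /* restart: stop the old transaction ... */
    toy_acquire(&s, RANK_TRANSACTION);   /* ... and start a new one under the sem */
    toy_release(&s, RANK_I_DATA_SEM);
    toy_release(&s, RANK_TRANSACTION);
    return s.violations;
}

/* Fixed behaviour: drop i_data_sem around the restart, then take it again. */
static int restart_after_drop(void)
{
    struct lock_state s = {0};

    toy_acquire(&s, RANK_TRANSACTION);
    toy_acquire(&s, RANK_I_DATA_SEM);
    toy_release(&s, RANK_I_DATA_SEM);    /* drop the sem first */
    toy_release(&s, RANK_TRANSACTION);
    toy_acquire(&s, RANK_TRANSACTION);   /* restart with nothing lower-ranked held */
    toy_acquire(&s, RANK_I_DATA_SEM);    /* reacquire in the correct order */
    toy_release(&s, RANK_I_DATA_SEM);
    toy_release(&s, RANK_TRANSACTION);
    return s.violations;
}
```

In this model `restart_holding_sem()` records one ordering violation and `restart_after_drop()` records none. The checker says nothing about why dropping the semaphore is safe; that part rests on the two points listed in the commit message.
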
jbd2: Annotate transaction start also for jbd2_journal_restart() · 9599b0e5

Committed by Jan Kara on Aug 17, 2009

lockdep annotation for a transaction start has been at the end of
jbd2_journal_start(). But a transaction is also started from
jbd2_journal_restart(). Move the lockdep annotation to start_this_handle()
which covers both cases.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

19 Sep, 2009 1 commit

ext4: Show unwritten extent flag in ext4_ext_show_leaf() · 553f9008

Committed by Mingming on Sep 18, 2009

ext4_ext_show_leaf() will display the leaf extents when extent
debugging is enabled.

Printing out the unwritten bit is useful for debugging unwritten
extents: it allows us to tell unwritten extents from written ones
after the unwritten extents are split or converted.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>

01 Sep, 2009 1 commit

ext4: Compile warning fix when EXT_DEBUG enabled · 84fe3bef

Committed by Mingming on Sep 01, 2009

When EXT_DEBUG is enabled I received the following compile warning on
PPC64:

  CC [M]  fs/ext4/inode.o
  CC [M]  fs/ext4/extents.o
fs/ext4/extents.c: In function ‘ext4_ext_rm_leaf’:
fs/ext4/extents.c:2097: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 2 has type ‘ext4_lblk_t’
fs/ext4/extents.c: In function ‘ext4_ext_get_blocks’:
fs/ext4/extents.c:2789: warning: format ‘%u’ expects type ‘unsigned int’, but argument 4 has type ‘long unsigned int’
fs/ext4/extents.c:2852: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 3 has type ‘ext4_lblk_t’
fs/ext4/extents.c:2953: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 4 has type ‘unsigned int’
  CC [M]  fs/ext4/migrate.o

This patch fixes the compile warnings.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

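Warnings of this kind usually go away in one of two ways: use the conversion specifier that matches the typedef's underlying type, or keep the specifier and cast the argument. The sketch below shows both. It assumes `ext4_lblk_t` is a 32-bit unsigned type; the typedef here is a local stand-in and the `format_lblk_*` helpers are hypothetical, not taken from the patch.

```c
#include <stdio.h>
#include <stddef.h>

/* Local stand-in for the kernel typedef; assumed to be a 32-bit unsigned type. */
typedef unsigned int ext4_lblk_t;

/* Option 1: pick the specifier that matches the underlying type. */
static int format_lblk_native(char *buf, size_t len, ext4_lblk_t block)
{
    return snprintf(buf, len, "block %u", block);
}

/* Option 2: keep "%lu" and cast, so argument and specifier agree on every architecture. */
static int format_lblk_cast(char *buf, size_t len, ext4_lblk_t block)
{
    return snprintf(buf, len, "block %lu", (unsigned long)block);
}
```

Both helpers produce the same text. The cast variant is the more defensive choice when the typedef might change width later.
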
19 Sep, 2009 1 commit

ext4: Avoid group preallocation for closed files · 50797481

Committed by Theodore Ts'o on Sep 18, 2009

Currently the group preallocation code tries to find a large (512)
free block from which to do per-cpu group allocation for small files.
The problem with this scheme is that it leaves the filesystem horribly
fragmented. In the worst case, if the filesystem is unmounted and
remounted (after a system shutdown, for example) we forget the fact
that we were using a particular (now-partially filled) 512 block
extent. So the next time we try to allocate space for a small file,
we will find *another* completely free 512 block chunk to allocate
small files. Given that there are 32,768 blocks in a block group,
after 64 iterations of "mount, write one 4k file in a directory,
unmount", the block group will have 64 files, each separated by 511
blocks, and the block group will no longer have any completely free
512-block chunks for group preallocation space.

So if we try to allocate blocks for a file that has been closed, such
that we know the final size of the file, and the filesystem is not
busy, avoid using group preallocation.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

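The arithmetic in the message (32,768 blocks per group, 512-block chunks, so 64 chunks) can be replayed with a small simulation. This is a simplified model, not the mballoc algorithm: it assumes chunks are 512-block aligned and that every remount picks the first completely free chunk. All names are made up for the sketch.

```c
#include <stdbool.h>
#include <string.h>

#define BLOCKS_PER_GROUP 32768
#define CHUNK_BLOCKS     512
#define NUM_CHUNKS       (BLOCKS_PER_GROUP / CHUNK_BLOCKS)   /* 64 */

static bool used[BLOCKS_PER_GROUP];

/* A chunk counts as free only if none of its 512 blocks is in use. */
static bool chunk_is_free(int chunk)
{
    for (int i = 0; i < CHUNK_BLOCKS; i++)
        if (used[chunk * CHUNK_BLOCKS + i])
            return false;
    return true;
}

static int count_free_chunks(void)
{
    int n = 0;

    for (int c = 0; c < NUM_CHUNKS; c++)
        if (chunk_is_free(c))
            n++;
    return n;
}

/*
 * One "mount, write one 4k file, unmount" cycle: the previous chunk has been
 * forgotten, so a fresh completely free chunk is claimed for a single block.
 */
static bool write_small_file_after_remount(void)
{
    for (int c = 0; c < NUM_CHUNKS; c++) {
        if (chunk_is_free(c)) {
            used[c * CHUNK_BLOCKS] = true;
            return true;
        }
    }
    return false;   /* no completely free chunk is left */
}

/* Reset the group, run the given number of cycles, report remaining free chunks. */
static int free_chunks_after(int cycles)
{
    memset(used, 0, sizeof(used));
    for (int i = 0; i < cycles; i++)
        write_small_file_after_remount();
    return count_free_chunks();
}
```

Under these assumptions, 64 cycles leave no completely free chunk even though only 64 of the 32,768 blocks are in use, which is the fragmentation pattern the message describes.
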
10 Aug, 2009 2 commits

ext4: Fix bugs in mballoc's stream allocation mode · 4ba74d00

Committed by Theodore Ts'o on Aug 09, 2009

The logic around sbi->s_mb_last_group and sbi->s_mb_last_start was all
screwed up.  These fields were being set unconditionally, even when
stream allocation had not taken place, and they were being used even
when the file was smaller than s_mb_stream_request, which is when the
allocation should _not_ be doing stream allocation.

Fix this by determining whether or not stream allocation should
take place once, in ext4_mb_group_or_file(), and setting a flag which
gets used in ext4_mb_regular_allocator() and ext4_mb_use_best_found().
This simplifies the code and assures that we are consistently using
(or not using) the stream allocation logic.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: Display the mballoc flags in mb_history in hex instead of decimal · 0ef90db9

Committed by Theodore Ts'o on Aug 09, 2009

Displaying the flags in base 16 makes it easier to see which flags
have been set.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

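A quick way to see why hex is easier to read for bit flags: each hex digit covers exactly four bits, so set bits can be read off directly. The flag values and helper names below are hypothetical and are not the real mballoc flags.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical flag bits, for illustration only; not the real mballoc values. */
#define TOY_FLAG_A 0x0040u
#define TOY_FLAG_B 0x0400u

/* Decimal output hides which bits are set. */
static int show_flags_decimal(char *buf, size_t len, unsigned int flags)
{
    return snprintf(buf, len, "%u", flags);
}

/* Hex output maps each digit to four flag bits. */
static int show_flags_hex(char *buf, size_t len, unsigned int flags)
{
    return snprintf(buf, len, "0x%04x", flags);
}
```

With `TOY_FLAG_A | TOY_FLAG_B`, the decimal form is `1088` while the hex form is `0x0440`, where both set bits are visible at a glance.
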
19 Sep, 2009 1 commit

ext4: Add configurable run-time mballoc debugging · 6ba495e9

Committed by Theodore Ts'o on Sep 18, 2009

Allow mballoc debugging to be enabled at run-time instead of just at
compile time.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

11 Aug, 2009 3 commits

ext4: fix journal ref count in move_extent_par_page · 91cc219a

Committed by Peng Tao on Aug 10, 2009

move_extent_par_page calls a_ops->write_begin() to increase journal
handler's reference count. However, if either mext_replace_branches()
or ext4_get_block fails, the increased reference count isn't
decreased. This will cause a later attempt to umount the fs to hang
forever. The patch addresses the issue by calling ext4_journal_stop()
if page is not NULL (which means a_ops->write_end() isn't invoked).
Signed-off-by: Peng Tao <bergwolf@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

jbd2: round commit timer up to avoid uncommitted transaction · b1f485f2

Committed by Andreas Dilger on Aug 10, 2009

Fix jiffy rounding in the jbd commit timer setup code.  Rounding down
could cause the timer to fire before the corresponding transaction
has expired.  That transaction can stay uncommitted forever if no
new transaction is created and no explicit sync/umount happens.
Signed-off-by: Alex Zhuravlev (Tomas) <alex.zhuravlev@sun.com>
Signed-off-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

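The effect of the rounding direction can be shown with plain integer arithmetic. The sketch assumes a tick rate of 1000 and rounding to whole seconds; `TOY_HZ` and the `toy_*` helpers are illustrative names, not the kernel's timer API.

```c
/* Toy tick rate, assumed for illustration. */
#define TOY_HZ 1000UL

/* Round a tick count down to the previous whole second. */
static unsigned long toy_round_down(unsigned long j)
{
    return j - (j % TOY_HZ);
}

/* Round a tick count up to the next whole second (unchanged if already aligned). */
static unsigned long toy_round_up(unsigned long j)
{
    unsigned long rem = j % TOY_HZ;

    return rem ? j + (TOY_HZ - rem) : j;
}

/* Returns 1 if a timer armed for `timer` fires before the transaction expiry. */
static int fires_too_early(unsigned long timer, unsigned long expires)
{
    return timer < expires;
}
```

For a transaction expiring at tick 5250, rounding down arms the timer for 5000, so it fires while the transaction is still unexpired and nothing gets committed. Rounding up arms it for 6000, after the expiry.
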
ext4: remove redundant test on unsigned · c333e073

Committed by Roel Kluin on Aug 10, 2009

unsigned i_block cannot be less than 0.
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

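A short sketch of why the removed test was redundant. The function name and the range check are hypothetical; only the fact that `i_block` is unsigned comes from the commit message.

```c
#include <stdbool.h>

/*
 * Before (shown as a comment, since compilers warn about it):
 *
 *     return i_block < 0 || i_block >= max_blocks;
 *
 * An unsigned value is never below zero, so the first comparison is dead code.
 * A "negative" input has already wrapped to a huge positive value and is
 * caught by the upper-bound check alone.
 */
static bool block_out_of_range(unsigned int i_block, unsigned int max_blocks)
{
    return i_block >= max_blocks;
}
```

Passing `(unsigned int)-1` demonstrates the wrap-around: it converts to the largest unsigned value and is rejected by the single remaining comparison.
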
28 Jul, 2009 1 commit

ext4: fix build warning when EXT4FS_DEBUG is on · 785b4b3a

Committed by Peng Tao on Jul 27, 2009

When compiling with EXT4FS_DEBUG on, gcc complains with the following warning:

linux-2.6/fs/ext4/ialloc.c: In function ‘ext4_count_free_inodes’:
linux-2.6/fs/ext4/ialloc.c:1192: warning: format ‘%lu’ expects type
‘long unsigned int’, but argument 2 has type ‘ext4_group_t’

So add a type cast to suppress it.
Signed-off-by: Peng Tao <bergwolf@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

06 Jul, 2009 2 commits

ext4: Fix compile warnings with MB_DEBUG · 1c718505

Committed by Akira Fujita on Jul 05, 2009

When MB_DEBUG is enabled, we get some compile warnings because
ext4_group_t is unsigned int.  This patch fixes them.

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: Remove unnecessary semicolons in mballoc.c · 5a4a7989

Committed by Joe Perches on Jul 05, 2009

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

17 Jul, 2009 2 commits

ext4: More buffer head reference leaks · 6487a9d3

Committed by Curt Wohlgemuth on Jul 17, 2009

After the patch I posted last week regarding buffer head ref leaks in
no-journal mode, I looked at all the code that uses buffer heads and
searched for more potential leaks.

The patch below fixes the issues I found; these can occur even when a
journal is present.

The change to inode.c fixes a double release if
ext4_journal_get_create_access() fails.

The changes to namei.c are more complicated.  add_dirent_to_buf() will
release the input buffer head EXCEPT when it returns -ENOSPC.  There are
some callers of this routine that don't always do the brelse() in the event
that -ENOSPC is returned.  Unfortunately, to put this fix into ext4_add_entry()
required capturing the return value of make_indexed_dir() and
add_dirent_to_buf().
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

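The ownership rule described for add_dirent_to_buf() (the callee releases the buffer head on every return except -ENOSPC) is easy to get wrong in callers. The sketch below models it with a plain counter. `struct toy_buf` and the `toy_*` functions are stand-ins invented for this example, not the ext4 code.

```c
#include <errno.h>

/* Minimal stand-in for a reference-counted buffer head. */
struct toy_buf {
    int refcount;
};

static void toy_brelse(struct toy_buf *bh)
{
    if (bh)
        bh->refcount--;
}

/*
 * Callee contract as described in the commit text: the buffer is released
 * on every return path EXCEPT -ENOSPC, where the caller keeps ownership.
 */
static int toy_add_dirent(struct toy_buf *bh, int have_space)
{
    if (!have_space)
        return -ENOSPC;   /* caller still owns the reference */
    toy_brelse(bh);
    return 0;
}

/* Caller that honours the contract by capturing the return value. */
static int toy_add_entry(struct toy_buf *bh, int have_space)
{
    int err = toy_add_dirent(bh, have_space);

    if (err == -ENOSPC)
        toy_brelse(bh);   /* without this release the reference leaks */
    return err;
}
```

Starting from a refcount of 1, both the success path and the -ENOSPC path end at 0. Dropping the `err == -ENOSPC` branch leaves the count at 1 on the error path, which is the kind of leak the patch addresses.
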
jbd2: Fail to load a journal if it is too short · f6f50e28

Committed by Jan Kara on Jul 17, 2009

Due to on-disk corruption, the journal can end up too short. Fail
to load it in that case so that we don't oops somewhere later.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

28 Jul, 2009 2 commits

ext4: Avoid null pointer dereference when decoding EROFS w/o a journal · 78f1ddbb

Committed by Theodore Ts'o on Jul 27, 2009

We need to check to make sure a journal is present before checking the
journal flags in ext4_decode_error().
Signed-off-by: Eric Sesterhenn <eric.sesterhenn@lsexperts.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: Fix typo in ext4/Kconfig · 43b38520

Committed by Manish Katiyar on Jul 27, 2009

Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

17 Jul, 2009 1 commit

ext4: Fix memory leak fix when mounting an ext4 filesystem · 024eab4d

Committed by Aneesh Kumar K.V on Jul 17, 2009

The allocation of the ext4_group_info array was moved to a new
function ext4_mb_add_group_info() in commit 5f21b0e6 so that online
resize would use a common (and correct) codepath.  Unfortunately, the
call to the new ext4_mb_add_group_info() function was added without
removing the code which originally allocated the array.  This caused a
memory leak each time an ext4 filesystem was mounted.

The fix is simple; remove the code that did the original allocation,
since it is no longer needed.
Reported-by: Catalin Marinas <catalin.marinas@arm.com>
Tested-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

16 Sep, 2009 16 commits

writeback: fix possible bdi writeback refcounting problem · 1ef7d9aa

Committed by Nick Piggin on Sep 15, 2009

wb_clear_pending AFAIKS should not be called after the item has been
put on the list, except by the worker threads. It could lead to the
situation where the refcount is decremented below 0 and cause lots of
problems.

Presumably the !wb_has_dirty_io case is not a common one, so it can
be discovered when the thread wakes up to check?

Also add a comment in bdi_work_clear.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

writeback: Fix bdi use after free in wb_work_complete() · 77b9d059

Committed by Nick Piggin on Sep 15, 2009

By the time bdi_work_on_stack gets evaluated again in bdi_work_free, it
can already have been deallocated and used for something else in the
!on stack case, giving a false positive in this test and causing
corruption.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

writeback: improve scalability of bdi writeback work queues · 77fad5e6

Committed by Nick Piggin on Sep 15, 2009

If you're going to do an atomic RMW on each list entry, there's not much
point in all the RCU complexities of the list walking. This is only going
to help the multi-thread case I guess, but it doesn't hurt to do now.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

writeback: remove smp_mb(), it's not needed with list_add_tail_rcu() · deed62ed

Committed by Nick Piggin on Sep 15, 2009

list_add_tail_rcu contains required barriers.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

writeback: use schedule_timeout_interruptible() · 49db0414

Committed by Jens Axboe on Sep 15, 2009

Gets rid of a manual set_current_state().
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

writeback: add comments to bdi_work structure · 8010c3b6

Committed by Jens Axboe on Sep 15, 2009

And document its retriever, get_next_work_item().
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

writeback: separate starting of sync vs opportunistic writeback · b6e51316

Committed by Jens Axboe on Sep 16, 2009

bdi_start_writeback() is currently split into two paths, one for
WB_SYNC_NONE and one for WB_SYNC_ALL. Add bdi_sync_writeback()
for WB_SYNC_ALL writeback and let bdi_start_writeback() handle
only WB_SYNC_NONE.

Push down the writeback_control allocation and only accept the
parameters that make sense for each function. This cleans up
the API considerably.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

writeback: inline allocation failure handling in bdi_alloc_queue_work() · bcddc3f0

Committed by Jens Axboe on Sep 13, 2009

This gets rid of work == NULL in bdi_queue_work() and puts the
OOM handling where it belongs.
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

writeback: use RCU to protect bdi_list · cfc4ba53

Committed by Jens Axboe on Sep 14, 2009

Now that bdi_writeback_all() no longer handles integrity writeback,
it doesn't have to block anymore. This means that we can switch
bdi_list reader side protection to RCU.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

writeback: only use bdi_writeback_all() for WB_SYNC_NONE writeout · f11fcae8

Committed by Jens Axboe on Sep 15, 2009

Data integrity writeback must use bdi_start_writeback() and ensure
that wbc->sb and wbc->bdi are set.
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

fs: Assign bdi in super_block · 32a88aa1

Committed by Jens Axboe on Sep 16, 2009

We do this automatically in get_sb_bdev() from the set_bdev_super()
callback. Filesystems that have their own private backing_dev_info
must assign that in ->fill_super().

Note that ->s_bdi assignment is required for proper writeback!
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

writeback: make wb_writeback() take an argument structure · c4a77a6c

Committed by Jens Axboe on Sep 16, 2009

We need to be able to pass in range_cyclic as well, so instead
of growing yet another argument, split the arguments into a
struct wb_writeback_args structure that we can use internally.
Also makes it easier to just copy all members to an on-stack
struct, since we can't access work after clearing the pending
bit.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

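The pattern the message describes, bundling arguments into one structure and copying it to the stack before giving up ownership of the work item, might look roughly like this. The structure layouts and `toy_*` names are simplified assumptions, not the kernel definitions.

```c
#include <stdbool.h>

/* Simplified stand-in for an argument bundle; fields are illustrative. */
struct toy_wb_args {
    long nr_pages;
    int sync_mode;
    bool range_cyclic;
};

/* Simplified stand-in for a queued work item. */
struct toy_work {
    struct toy_wb_args args;
    bool pending;
};

/* The worker takes one struct pointer instead of a growing parameter list. */
static long toy_wb_writeback(const struct toy_wb_args *args)
{
    return args->nr_pages;   /* placeholder for the real writeback loop */
}

static long toy_process_work(struct toy_work *work)
{
    /* Copy every member in one assignment while the work item is still ours. */
    struct toy_wb_args args = work->args;

    /* After this point the submitter may free or reuse *work. */
    work->pending = false;

    /* Only the on-stack copy is touched from here on. */
    return toy_wb_writeback(&args);
}
```

Adding another parameter later means adding a field to the structure, and the single struct assignment keeps copying everything without further changes to the call sites.
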
writeback: merely wakeup flusher thread if work allocation fails for WB_SYNC_NONE · f0fad8a5

Committed by Christoph Hellwig on Sep 11, 2009

Since it's an opportunistic writeback and not a data integrity action,
don't punt to blocking writeback. Just wakeup the thread and it will
flush old data.
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

writeback: get rid of wbc->for_writepages · 1fe06ad8

Committed by Jens Axboe on Sep 15, 2009

It's only set, it's never checked. Kill it.
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

fs: remove bdev->bd_inode_backing_dev_info · 2c96ce9f

Committed by Jens Axboe on Sep 15, 2009

It has been unused since it was introduced in:

commit 520808bf20e90fdbdb320264ba7dd5cf9d47dcac
Author: Andrew Morton <akpm@osdl.org>
Date:   Fri May 21 00:46:17 2004 -0700

    [PATCH] block device layer: separate backing_dev_info infrastructure

So let's just kill it.
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

driver model: constify attribute groups · a4dbd674

Committed by David Brownell on Jun 24, 2009

Let attribute group vectors be declared "const".  We'd
like to let most attribute metadata live in read-only
sections... this is a start.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

15 Sep, 2009 4 commits

udf: Fix possible corruption when close races with write · cbc8cc33

Committed by Jan Kara on Aug 07, 2009

When we close a file, we remove preallocated blocks from it. But this
truncation was not protected by i_mutex and thus it could have raced with a
write through a different fd and cause crashes or even filesystem corruption.
Signed-off-by: Jan Kara <jack@suse.cz>

udf: Perform preallocation only for regular files · 81056dd0

Committed by Jan Kara on Jul 16, 2009

So far we preallocated blocks also for directories but that brings a
problem, when to get rid of preallocated blocks we don't need. So far
we removed them in udf_clear_inode() which has a disadvantage that
1) blocks are unavailable long after writing to a directory finished
   and thus one can get out of space unnecessarily early
2) releasing blocks from udf_clear_inode is problematic because VFS
   does not expect us to redirty inode there and it also slows down
   memory reclaim.

So preallocate blocks only for regular files where we can drop preallocation
in udf_release_file.
Signed-off-by: Jan Kara <jack@suse.cz>

udf: Remove wrong assignment in udf_symlink · 7c6e3d1a

Committed by Jan Kara on Jul 16, 2009

Recomputation of the pointer was wrong (it should have been just increment).
Luckily, we never use the computed value. Remove it.
Signed-off-by: Jan Kara <jack@suse.cz>

udf: Remove dead code · 5891d9dd

Committed by Jan Kara on Jul 16, 2009

Remove code that never gets used.
Signed-off-by: Jan Kara <jack@suse.cz>

14 Sep, 2009 1 commit

fsync: wait for data writeout completion before calling ->fsync · 2daea67e

Committed by Christoph Hellwig on Sep 03, 2009

Currently vfs_fsync(_range) first calls filemap_fdatawrite to write out
the data, then calls into ->fsync to write out the metadata and then finally
calls filemap_fdatawait to wait for the data I/O to complete.  What sounds
like a clever micro-optimization actually is a nasty trap for many filesystems.

For many modern filesystems i_size or other inode information is only
updated on I/O completion and we need to wait for I/O to finish before
we can write out the metadata.  For old-fashioned filesystems that
instantiate blocks during the actual write and also update the metadata
at that point it opens up a large window where we could expose uninitialized
blocks after a crash.  While a few filesystems that need it already wait
for the I/O to finish inside their ->fsync methods it is rather suboptimal
as it is done under the i_mutex and also always for the whole file instead
of just a part as we could do for O_SYNC handling.

Here is a small audit of all fsync instances in the tree:

 - spufs_mfc_fsync:
 - ps3flash_fsync:
 - vol_cdev_fsync:
 - printer_fsync:
 - fb_deferred_io_fsync:
 - bad_file_fsync:
 - simple_sync_file:

	don't care - filesystems/drivers don't use the page cache or are
	purely in-memory.

 - simple_fsync:
 - file_fsync:
 - affs_file_fsync:
 - fat_file_fsync:
 - jfs_fsync:
 - ubifs_fsync:
 - reiserfs_dir_fsync:
 - reiserfs_sync_file:

	never touch pagecache themselves.  We need to wait before if we do
	not want to expose stale data after an allocation.

 - afs_fsync:
 - fuse_fsync_common:

	do the waiting writeback itself in awkward ways, would benefit from
	proper semantics

 - block_fsync:

	Does a filemap_write_and_wait on the block device inode.  Because we
	now have f_mapping that is the same inode we call it on in vfs_fsync.
	So just removing it and letting the VFS do the work in one go would
	be an improvement.

 - btrfs_sync_file:
 - cifs_fsync:
 - xfs_file_fsync:

	need the wait first and currently do it themselves. would benefit from
	doing it outside i_mutex.

 - coda_fsync:
 - ecryptfs_fsync:
 - exofs_file_fsync:
 - shm_fsync:

	only passes the fsync through to the lower layer

 - ext3_sync_file:

	doesn't seem to care, comments are confusing.

 - ext4_sync_file:

	would need the wait to work correctly for delalloc mode with late
	i_size updates.  Otherwise the ext3 comment applies.

	currently implements its own writeback and wait in an odd way,
	could benefit from doing it properly.

 - gfs2_fsync:

	not needed for journaled data mode, but probably harmless there.
	Currently writes back data asynchronously itself.  Needs some
	major audit.

 - hostfs_fsync:

	just calls fsync/datasync on the host FD.  Without the wait before
	data might not even be inflight yet if we're unlucky.

 - hpfs_file_fsync:
 - ncp_fsync:

	no-ops.  Dangerous before and after.

 - jffs2_fsync:

	just calls jffs2_flush_wbuf_gc, not sure how this relates to data.

 - nfs_fsync_dir:

	just increments stats, claims all directory operations are synchronous

 - nfs_file_fsync:

	only writes out data???  Looks very odd.

 - nilfs_sync_file:

	looks like it expects all data done, but not sure from the code

 - ntfs_dir_fsync:
 - ntfs_file_fsync:

	appear to do their own data writeback.  Very convoluted code.

 - ocfs2_sync_file:

	does its own data writeback, but no wait.  probably needs the wait.

 - smb_fsync:

	according to a comment expects all pages written already, probably needs
	the wait before.

This patch only changes vfs_fsync_range, removal of the wait in the methods
that have it is left to the filesystem maintainers.  Note that most
filesystems really do need an audit for their fsync methods given the
gems found in this very brief audit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>

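The reordering described in the fsync commit comes down to moving the wait step ahead of the filesystem's ->fsync call. The sketch below only records the order of the three steps in a string. The step names and `toy_*` functions are placeholders, and no real I/O or kernel API is involved.

```c
#include <string.h>
#include <stddef.h>

/* Append a step name to a comma-separated log, never overflowing the buffer. */
static void log_step(char *log, size_t len, const char *step)
{
    if (log[0] != '\0')
        strncat(log, ",", len - strlen(log) - 1);
    strncat(log, step, len - strlen(log) - 1);
}

/* Old order: metadata is written while data I/O may still be in flight. */
static void toy_fsync_old(char *log, size_t len)
{
    log[0] = '\0';
    log_step(log, len, "write_data");   /* start data writeout */
    log_step(log, len, "fs_fsync");     /* filesystem writes metadata */
    log_step(log, len, "wait_data");    /* wait for data I/O to complete */
}

/* New order: wait for data I/O to finish before asking for metadata. */
static void toy_fsync_new(char *log, size_t len)
{
    log[0] = '\0';
    log_step(log, len, "write_data");
    log_step(log, len, "wait_data");
    log_step(log, len, "fs_fsync");
}
```

With the new order, a filesystem that updates i_size only on I/O completion sees the final value by the time its fsync method runs. In the old order the metadata write could capture a stale size.
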
OpenHarmony / kernel_linux · Last synced about 4 years ago