- 27 Feb, 2011 8 commits
-
-
Committed by Theodore Ts'o
If we have accumulated a contiguous region of memory to be written out, and the next page can be added to this region, don't bother locking (and then unlocking) the page before writing out the memory. In the unlikely event that the next page was being written back by some other CPU, we can also skip waiting for that page to finish writeback.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
Committed by Theodore Ts'o
Because the ext4 page writeback codepath had been prematurely calling clear_page_dirty_for_io(), if it turned out that a particular page couldn't be written out during a particular pass of write_cache_pages_da(), the page would have to get redirtied by calling redirty_page_for_writepage(). Not only was this wasted work, but redirty_page_for_writepage() would increment wbc->pages_skipped to signal to writeback_sb_inodes() that buffers were locked, and that it should skip this inode until later.

Since this signal was incorrect in ext4's case --- which was caused by ext4's historically incorrect use of write_cache_pages() --- ext4_da_writepages() saved and restored wbc->pages_skipped to avoid confusing writeback_sb_inodes().

Now that we've fixed ext4 to call clear_page_dirty_for_io() right before initiating the page I/O, we can nuke the pages_skipped save/restore hackery, and breathe a sigh of relief.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
Committed by Theodore Ts'o
Move the call to clear_page_dirty_for_io() to just before we actually write the page. This simplifies the code somewhat, and avoids marking pages as clean and then needing to re-mark them as dirty later.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
Committed by Theodore Ts'o
Eliminate duplicate code, unneeded variables, etc., to make it easier to understand the code. No behavioral changes were made in this patch.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
Committed by Theodore Ts'o
Fold the __mpage_da_writepage() function into write_cache_pages_da(). This will give us opportunities to clean up and simplify the resulting code.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
Committed by Theodore Ts'o
Now that we've fixed the file corruption bug in commit d50bdd5a, it's time to enable mblk_io_submit by default.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
Committed by Curt Wohlgemuth
If ext4_da_block_invalidatepages() is called because of a failure from ext4_map_blocks() in mpage_da_map_and_submit(), it's supposed to clean up -- including unlock -- all the pages in the mpd structure. But these values may not match up, even on a system in which block size == page size:

        mpd->b_blocknr != mpd->first_page
        mpd->b_size != (mpd->next_page - mpd->first_page)

ext4_da_block_invalidatepages() has been using b_blocknr and b_size; this patch changes it to use first_page and next_page.

Tested: I injected a small number (5%) of failures in ext4_map_blocks() in the case that the flags contain EXT4_GET_BLOCKS_DELALLOC_RESERVE, and ran fsstress on this kernel. Without this patch, I got hung tasks every time. With this patch, I see no hangs in many runs of fsstress.

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
Committed by Curt Wohlgemuth
In mpage_da_map_and_submit(), if we have a delayed block allocation failure from ext4_map_blocks(), we need to mark the IO as complete, by setting

        mpd->io_done = 1;

Otherwise, we could end up submitting the pages in an outer loop; since they are unlocked on mapping failure in ext4_da_block_invalidatepages(), this will cause a bug check in mpage_da_submit_io().

I tested this by injecting failures into ext4_map_blocks(). Without this patch, a simple fsstress run will bug check; with the patch, it works fine.

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
- 25 Feb, 2011 5 commits
-
-
Committed by Coly Li
In ext4_mb_check_group_pa(), the current preallocation space is replaced with a new preallocation space when the two have the same distance from the goal block. This doesn't actually gain us anything, so change things so that the function only switches to the new preallocation group if its distance from the goal block is strictly smaller than the current preallocation group's distance from the goal block.

Signed-off-by: Coly Li <bosong.ly@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
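The strict comparison described above can be illustrated with a minimal sketch. This is not the kernel function: `struct pa_sketch`, `distance_to_goal()` and `pick_closer_pa()` are hypothetical names, and a preallocation is reduced to its start block purely for illustration.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a preallocation; only the start block matters here. */
struct pa_sketch {
    long long pstart;
};

/* Absolute distance between a preallocation's start block and the goal block. */
static long long distance_to_goal(const struct pa_sketch *pa, long long goal)
{
    return llabs(pa->pstart - goal);
}

/*
 * Keep the current preallocation unless the candidate is strictly closer
 * to the goal. On a tie the current one is kept, so no pointless switch
 * happens.
 */
const struct pa_sketch *pick_closer_pa(const struct pa_sketch *cur,
                                       const struct pa_sketch *candidate,
                                       long long goal)
{
    if (cur == NULL)
        return candidate;
    if (distance_to_goal(candidate, goal) < distance_to_goal(cur, goal))
        return candidate;
    return cur;
}
```

With a goal of 100, a current start of 90 and a candidate start of 110 are equally distant, so the current preallocation is retained.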
-
Committed by Coly Li
Signed-off-by: Coly Li <bosong.ly@taobao.com>
Cc: Alex Tomas <alex@clusterfs.com>
Cc: Theodore Tso <tytso@google.com>
-
Committed by Coly Li
This patch adds comments to ext4_mb_mark_free_simple() to make it more understandable.

Signed-off-by: Coly Li <bosong.ly@taobao.com>
Cc: Alex Tomas <alex@clusterfs.com>
Cc: Theodore Tso <tytso@google.com>
-
Committed by Coly Li
In __mb_check_buddy(), look at the code below:

 591         fstart = -1;
 592         buddy = mb_find_buddy(e4b, 0, &max);
 593         for (i = 0; i < max; i++) {
 594                 if (!mb_test_bit(i, buddy)) {
 595                         MB_CHECK_ASSERT(i >= e4b->bd_info->bb_first_free);
 596                         if (fstart == -1) {
 597                                 fragments++;
 598                                 fstart = i;
 599                         }
 600                         continue;
 601                 }
 602                 fstart = -1;
 603                 /* check used bits only */
 604                 for (j = 0; j < e4b->bd_blkbits + 1; j++) {
 605                         buddy2 = mb_find_buddy(e4b, j, &max2);
 606                         k = i >> j;
 607                         MB_CHECK_ASSERT(k < max2);
 608                         MB_CHECK_ASSERT(mb_test_bit(k, buddy2));
 609                 }
 610         }
 611         MB_CHECK_ASSERT(!EXT4_MB_GRP_NEED_INIT(e4b->bd_info));
 612         MB_CHECK_ASSERT(e4b->bd_info->bb_fragments == fragments);
 613
 614         grp = ext4_get_group_info(sb, e4b->bd_group);
 615         buddy = mb_find_buddy(e4b, 0, &max);

On line 592, buddy is fetched by mb_find_buddy() with order 0. Between lines 593 and 615, buddy is not changed, so there is no need to fetch it again from mb_find_buddy() with order 0. We can safely remove the second mb_find_buddy() call on line 615.

Signed-off-by: Coly Li <bosong.ly@taobao.com>
Cc: Alex Tomas <alex@clusterfs.com>
Cc: Theodore Tso <tytso@google.com>
-
Committed by Coly Li
The current code calculates max regardless of whether order is zero, which is unnecessary. This cleanup patch sets max to "1 << (e4b->bd_blkbits + 3)" only when order == 0.

Signed-off-by: Coly Li <bosong.ly@taobao.com>
Cc: Alex Tomas <alex@clusterfs.com>
Cc: Theodore Tso <tytso@google.com>
-
- 24 Feb, 2011 4 commits
-
-
Committed by Eric Sandeen
There's no good reason to require the extra step of providing a mount option for acl or user_xattr once the feature is configured on; no other filesystem that I know of requires this.

Userspace patches have set these options in default mount options, and this patch makes them default in the kernel. At some point we can start to deprecate the options, perhaps.

For now I've removed default mount option checks in show_options() to be explicit about what's set, since it's changing the default, but I'm open to alternatives if desired.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
Committed by Lukas Czerner
Discard granularity tells us the minimum size of extent that can be discarded by the device. If the user supplies a minimum extent that should be discarded (range.minlen) which is smaller than the discard granularity, increase minlen to the discard granularity, since there's no point submitting trim requests that the device will reject anyway.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
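A minimal sketch of the clamping rule described above. The helper name `clamp_minlen_sketch()` is hypothetical, and both arguments are assumed to be expressed in the same unit (for example bytes); the real patch may convert units before comparing.

```c
#include <stdint.h>

/*
 * Raise the user-supplied minimum extent length to the device's discard
 * granularity when it is smaller, since smaller requests would be
 * rejected by the device anyway. Larger values pass through unchanged.
 */
uint64_t clamp_minlen_sketch(uint64_t minlen, uint64_t discard_granularity)
{
    if (minlen < discard_granularity)
        return discard_granularity;
    return minlen;
}
```

A caller would apply this once, before walking the free extents, so that every candidate extent is compared against the adjusted minimum.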
-
Committed by Lukas Czerner
For a device that does not support discard, the FITRIM ioctl returns -EOPNOTSUPP when blkdev_issue_discard() returns this error code, which is how the user is informed that the device does not support discard.

If there are no suitable free extents to be trimmed, then FITRIM will return success even though the device does not support discard, which could confuse the user. So check explicitly whether the device supports discard and return an error code at the beginning of the FITRIM ioctl processing.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
Committed by Lukas Czerner
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
- 22 Feb, 2011 3 commits
-
-
Committed by Alexander V. Lukyanov
I cannot disable the inode-read-ahead feature of ext4 (on 2.6.37):

        # echo 0 > /sys/fs/ext4/sda2/inode_readahead_blks
        bash: echo: write error: Invalid argument

On a server with lots of small files and random access this read-ahead makes performance worse, and I'd like to disable it. I work around this problem by using a value of 1, but it still reads an extra block.

This patch fixes the problem by checking for zero explicitly.

Signed-off-by: Alexander V. Lukyanov <lav@netis.ru>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
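The explicit zero check might look roughly like the sketch below. It assumes that the validation rejecting 0 was a power-of-two test, which the message itself does not state; `is_power_of_two()` and `validate_readahead_blks()` are hypothetical names, not the kernel's.

```c
#include <errno.h>
#include <stdbool.h>

/* True for 1, 2, 4, 8, ...; false for 0 and for anything else. */
static bool is_power_of_two(unsigned long v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

/*
 * Accept 0 explicitly (meaning "read-ahead disabled"); otherwise require
 * a power of two, which is the assumed pre-existing rule.
 */
int validate_readahead_blks(unsigned long value)
{
    if (value != 0 && !is_power_of_two(value))
        return -EINVAL;
    return 0;
}
```

Under this sketch, writing 0 succeeds while a value such as 3 is still rejected with -EINVAL.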
-
Committed by Peter Huewe
This patch fixes the warning "Using plain integer as NULL pointer", generated by sparse, by replacing the offending 0s with NULL.

Signed-off-by: Peter Huewe <peterhuewe@gmx.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
Committed by Theodore Ts'o
When compiling 2.6.38-rc1 with EXT4FS_DEBUG turned on, we get the following compile warnings. This patch fixes them.

          CC      fs/ext4/hash.o
          CC      fs/ext4/resize.o
        fs/ext4/resize.c: In function 'setup_new_group_blocks':
        fs/ext4/resize.c:233:2: warning: format '%#04llx' expects type 'long long unsigned int', but argument 3 has type 'long unsigned int'
        fs/ext4/resize.c:251:2: warning: format '%#04llx' expects type 'long long unsigned int', but argument 3 has type 'long unsigned int'
          CC      fs/ext4/extents.o
          CC      fs/ext4/ext4_jbd2.o
          CC      fs/ext4/migrate.o

Reported-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
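One common way to silence this class of warning is to cast the argument to the type the format expects. The sketch below shows that approach in user-space C; whether the actual patch used a cast or changed the format string is not stated in the message, and `format_block_sketch()` is a hypothetical helper.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Print an unsigned long block number with a '%#04llx' format. The
 * explicit cast makes the argument type match the 'll' length modifier
 * on every architecture, which removes the format warning.
 */
int format_block_sketch(char *buf, size_t size, unsigned long block)
{
    return snprintf(buf, size, "%#04llx", (unsigned long long)block);
}
```

The alternative is to keep the argument as is and use '%#04lx', which is equally valid when the variable is always an unsigned long.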
-
- 12 Feb, 2011 2 commits
-
-
Committed by Eric Sandeen
ext4 has a data corruption case when doing non-block-aligned asynchronous direct IO into a sparse file, as demonstrated by xfstest 240.

The root cause is that while ext4 preallocates space in the hole, mappings of that space still look "new" and dio_zero_block() will zero out the unwritten portions. When more than one AIO thread is going, they both find this "new" block and race to zero out their portion; this is uncoordinated and causes data corruption.

Dave Chinner fixed this for xfs by simply serializing all unaligned asynchronous direct IO. I've done the same here. The difference is that we only wait on conversions, not all IO. This is a very big hammer, and I'm not very pleased with stuffing this into ext4_file_write(). But since ext4 is DIO_LOCKING, we need to serialize it at this high level.

I tried to move this into ext4_ext_direct_IO, but by then we have the i_mutex already, and we will wait on the work queue to do conversions - which must also take the i_mutex. So that won't work.

This was originally exposed by qemu-kvm installing to a raw disk image with a normal sector-63 alignment. I've tested a backport of this patch with qemu, and it does avoid the corruption. It is also quite a lot slower (14 min for package installs, vs. 8 min for well-aligned) but I'll take slow correctness over fast corruption any day.

Mingming suggested that we can track outstanding conversions, and wait on those so that non-sparse files won't be affected, and I've implemented that here; unaligned AIO to nonsparse files won't take a perf hit.

[tytso@mit.edu: Keep the mutex as a hashed array instead of bloating the ext4 inode]

[tytso@mit.edu: Fix up namespace issues so that global variables are protected with an "ext4_" prefix.]

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
Committed by Eric Sandeen
In 2.6.37 I was running into oopses with repeated module loads & unloads. I tracked this down to:

        fb1813f4 ext4: use dedicated slab caches for group_info structures

(this was in addition to the features advert unload problem)

The kstrdup & subsequent kfree of the cache name was causing a double free. In slub, at least, if I read it right it allocates & frees the name itself, slab seems to do something different... so in slub I think we were leaking -our- cachep->name, and double freeing the one allocated by slub.

After getting lost in slab/slub/slob a bit, I just looked at other sized-caches that get allocated. jbd2, biovec, sgpool all do it more or less the way jbd2 does. The patch below follows the jbd2 method of dynamically allocating a cache at mount time from a list of static names.

(This might also possibly fix a race creating the caches with parallel mounts running).

[Folded in a fix from Dan Carpenter which fixed an off-by-one error in the original patch]

Cc: stable@kernel.org
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
- 08 Feb, 2011 1 commit
-
-
Committed by Curt Wohlgemuth
This fixes a corruption problem with the multi-block writepages submittal change for ext4, from commit bd2d0210 ("ext4: use bio layer instead of buffer layer in mpage_da_submit_io").

(Note that this corruption is not present in 2.6.37 on ext4, because the corruption was detected after the feature was merged in 2.6.37-rc1, and so it was turned off by adding a non-default mount option, mblk_io_submit. With this commit, which hopefully fixes the last of the bugs with this feature, we'll be able to turn on this performance feature by default in 2.6.38, and remove the mblk_io_submit option.)

The ext4 code path to bundle multiple pages for writeback in ext4_bio_write_page() had a bug: we should be clearing buffer head dirty flags *before* we submit the bio, not in the completion routine.

The patch below was tested on 2.6.37 under KVM with the postgresql script which was submitted by Jon Nelson as documented in commit 1449032b.

Without the patch, I'd hit the corruption problem about 50-70% of the time. With the patch, I executed the script > 100 times with no corruption seen.

I also fixed a bug to make sure ext4_end_bio() doesn't dereference the bio after the bio_put() call.

Reported-by: Jon Nelson <jnelson@jamponi.net>
Reported-by: Matthias Bayer <jackdachef@gmail.com>
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
-
- 04 Feb, 2011 3 commits
-
-
Committed by Theodore Ts'o
Make sure the correct cleanup happens if we die while trying to load the ext4 file system.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
Committed by Lukas Czerner
The ext4 features interface was not properly unregistered, which led to problems while unloading/reloading the ext4 module. This commit fixes that by adding proper kobject unregistration code into ext4_exit_fs() as well as the fail-path of ext4_init_fs().

Reported-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
-
Committed by Eric Sandeen
https://bugzilla.kernel.org/show_bug.cgi?id=27652

If the lazyinit thread is running, the teardown function ext4_destroy_lazyinit_thread() has problems:

        ext4_clear_request_list();
        while (ext4_li_info->li_task) {
                wake_up(&ext4_li_info->li_wait_daemon);
                wait_event(ext4_li_info->li_wait_task,
                           ext4_li_info->li_task == NULL);
        }

Clearing the request list will cause the thread to exit and free ext4_li_info, so then we're waiting on something which is getting freed.

Fix this up by making the thread respond to kthread_stop, and exit, without the need to wait for that exit in some other homegrown way.

Cc: stable@kernel.org
Reported-and-Tested-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
- 17 Jan, 2011 2 commits
-
-
Committed by Christoph Hellwig
Currently all filesystems except XFS implement fallocate asynchronously, while XFS forced a commit. Both of these are suboptimal - in case of O_SYNC I/O we really want our allocation on disk, especially for the !KEEP_SIZE case where we actually grow the file with user-visible zeroes. On the other hand always committing the transaction is a bad idea for fast-path uses of fallocate like for example in recent Samba versions. Given that block allocation is a data plane operation anyway change it from an inode operation to a file operation so that we have the file structure available that lets us check for O_SYNC.

This also includes moving the code around for a few of the filesystems, and removing the already unneeded S_ISDIR checks given that we only wire up fallocate for regular files.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
Committed by Christoph Hellwig
Instead of various home-grown checks that might need updates for new flags, just check for any bit outside the mask of the features supported by the filesystem. This makes the check future-proof for any newly added flag.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
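The mask-based check described above can be sketched as follows. The `SKETCH_` flag names and values and the helper `check_fallocate_mode()` are illustrative assumptions rather than the kernel's definitions, and the choice of -EOPNOTSUPP as the error is also an assumption.

```c
#include <errno.h>

/* Illustrative flag bits; names and values are assumptions for this sketch. */
#define SKETCH_FL_KEEP_SIZE  0x01
#define SKETCH_FL_PUNCH_HOLE 0x02

/*
 * Reject the request if any bit is set outside the mask of modes the
 * filesystem supports. A flag introduced later is rejected automatically
 * because it is not in the mask, so the check never needs updating.
 */
int check_fallocate_mode(int mode, int supported_mask)
{
    if (mode & ~supported_mask)
        return -EOPNOTSUPP;
    return 0;
}
```

A filesystem that supports only the keep-size mode would pass `SKETCH_FL_KEEP_SIZE` as the mask, and a hole-punch request would then fail without any flag-specific code.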
-
- 14 Jan, 2011 1 commit
-
-
Committed by Andrew Morton
pr_warning_ratelimited() doesn't exist. Also include printk.h, which defines these things.

Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 13 Jan, 2011 2 commits
-
-
Committed by Josef Bacik
Ext4 doesn't have the ability to punch holes yet, so make sure we return EOPNOTSUPP if we try to use hole punching through fallocate. This support can be added later. Thanks,

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
Committed by Jan Kara
As Al Viro pointed out, path resolution during Q_QUOTAON calls to quotactl is prone to deadlocks. We hold the s_umount semaphore for reading during the path resolution, and the resolution itself may need to acquire the semaphore for writing when, e.g., an autofs mountpoint is passed.

Solve the problem by performing the resolution before we get hold of the superblock (and thus the s_umount semaphore). The whole thing is complicated by the fact that some filesystems (OCFS2) ignore the path argument. So to distinguish between filesystems that want the path and those that do not, we introduce a new .quota_on_meta callback which does not get the path. OCFS2 then uses this callback instead of the old .quota_on.

CC: Al Viro <viro@ZenIV.linux.org.uk>
CC: Christoph Hellwig <hch@lst.de>
CC: Ted Ts'o <tytso@mit.edu>
CC: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
-
- 12 Jan, 2011 2 commits
-
-
Committed by Jan Kara
When s_first_data_block is not zero (which happens e.g. when the block size is 1KB) and the trim ioctl is called to start trimming from block 0, the math in ext4_get_group_no_and_offset() overflows. The overall result is that the ioctl returns EINVAL, which is rather unexpected, and we probably don't want userspace tools to bother with internal details of the filesystem structure. So just silently increase the starting offset (and shorten the length) when the starting block is below s_first_data_block.

CC: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
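The adjustment described above amounts to moving the start of the range forward and shrinking the length by the same amount. A minimal sketch, with the hypothetical names `struct trim_range` and `clamp_trim_start()` and all quantities assumed to be in filesystem blocks:

```c
#include <stdint.h>

/* Hypothetical representation of a trim request, in filesystem blocks. */
struct trim_range {
    uint64_t start;
    uint64_t len;
};

/*
 * If the requested start lies below the first data block, advance the
 * start to that block and shorten the length by the blocks skipped.
 * The length is floored at zero so the subtraction cannot wrap around.
 */
struct trim_range clamp_trim_start(uint64_t start, uint64_t len,
                                   uint64_t first_data_block)
{
    struct trim_range r = { start, len };

    if (start < first_data_block) {
        uint64_t skipped = first_data_block - start;

        r.start = first_data_block;
        r.len = len > skipped ? len - skipped : 0;
    }
    return r;
}
```

For a 1KB-block filesystem whose first data block is 1, a request for 100 blocks starting at block 0 becomes a request for 99 blocks starting at block 1.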
-
Committed by Theodore Ts'o
This reverts commit 4f531501: ext4: fix possible overflow in ext4_trim_fs()

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
- 11 Jan, 2011 7 commits
-
-
Committed by Eric Sandeen
Since check_eofblocks_fl() only uses the m_lblk portion of the map structure, we may as well pass that directly, rather than passing the entire map, which IMHO obfuscates what parameters check_eofblocks_fl() cares about.

Not a big deal, but seems tidier and less confusing, to me.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
Committed by Theodore Ts'o
Commit 40389687 moved a call to ext4_forget() out of ext4_free_branches() and let ext4_free_blocks() handle calling bforget(). But that change unfortunately did not replace the call to ext4_forget() with brelse(), which was needed to drop the in-use count of the indirect block's buffer head, which led to a memory leak when deleting files that used indirect blocks. Fix this.

Thanks to Hugh Dickins for pointing this out.

Cc: stable@kernel.org
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
Committed by Theodore Ts'o
This function was never implemented, except for a BUG_ON which was tripping when ext4 is run without a journal. The problem is that although the comment asserts that "truncate (which is the only way to free block) discards all preallocations", ext4_free_blocks() is also called in various error recovery paths when blocks have been allocated, but for various reasons, we were not able to use those data blocks (for example, because we ran out of memory while trying to manipulate the extent tree, or some other similar situation).

In addition to the fact that this function isn't implemented except for the incorrect BUG_ON, the single caller of this function, ext4_free_blocks(), doesn't use it at all if the journal is enabled.

So remove the (stub) function entirely for now. If we decide it's better to add it back, it's only going to be useful with a relatively large number of code changes anyway.

Google-Bug-Id: 3236408
Cc: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
Committed by Jiaying Zhang
Ted first found the bug when running a 2.6.36 kernel with the dioread_nolock mount option: xfstests #13 complained about a wrong file size during fsck. However, the bug exists in older kernels as well, although it is somewhat harder to trigger.

The problem is that ext4_end_io_work() can happen after we have truncated an inode to a smaller size. Then when ext4_end_io_work() calls ext4_convert_unwritten_extents(), we may reallocate some blocks that have been truncated, so the inode size becomes inconsistent with the allocated blocks.

The following patch flushes the i_completed_io_list during truncate to reduce the risk that some pending end_io requests are executed later and convert already truncated blocks to initialized.

Note that although the fix helps reduce the problem a lot, there may still be a race window between vmtruncate() and ext4_end_io_work(). The fundamental problem is that if vmtruncate() is called without either i_mutex or i_alloc_sem held, it can race with an ongoing write request so that the io_end request is processed later when the corresponding blocks have been truncated.

Ted and I have discussed the problem offline and we saw a few ways to fix the race completely:

a) We guarantee that the i_mutex lock and the i_alloc_sem write lock are both held whenever vmtruncate() is called. The i_mutex lock prevents any new write requests from entering writeback and the i_alloc_sem prevents the race from ext4_page_mkwrite(). Currently we hold both locks if vmtruncate() is called from do_truncate(), which is probably the most common case. However, there are places where we may call vmtruncate() without holding either i_mutex or i_alloc_sem. I would like to ask for other people's opinions on what locks are expected to be held before calling vmtruncate(). There seems to be a disagreement among the callers of that function.

b) We change the ext4 write path so that we change the extent tree to contain the newly allocated blocks and update i_size both at the same time --- when the write of the data blocks is completed.

c) We add some additional locking to synchronize vmtruncate() and ext4_end_io_work(). This approach may have performance implications so we need to be careful.

All of the above proposals may require more substantial changes, so we may consider taking the following patch as a bandaid.

Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
Committed by Theodore Ts'o
Call ext4_std_error() in various places when we can't bail out cleanly, so the file system can be marked as in error.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
Committed by Jan Kara
When ext4_trim_fs() is called to trim a part of a single group, the logic will wrongly set the last block of the interval to 'len' instead of 'first_block + len'. Thus a shorter interval is possibly trimmed. Fix it.

CC: Lukas Czerner <lczerner@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
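The arithmetic behind this fix can be shown with a small sketch. `trim_end_in_group_sketch()` is a hypothetical helper; treating the end as exclusive and clamping it to the group size are assumptions made for illustration, not details taken from the patch.

```c
#include <stdint.h>

/*
 * Compute the (exclusive) end of a trim interval inside one block group.
 * The interval starts at first_block within the group and covers len
 * blocks, so the end is first_block + len; using len alone, as the buggy
 * code did, yields an interval that is too short whenever first_block
 * is nonzero.
 */
uint64_t trim_end_in_group_sketch(uint64_t first_block, uint64_t len,
                                  uint64_t blocks_per_group)
{
    uint64_t end = first_block + len;

    /* Assumed clamp: never run past the end of the group. */
    if (end > blocks_per_group)
        end = blocks_per_group;
    return end;
}
```

For example, trimming 50 blocks starting at block 100 of a group should end at block 150, whereas the buggy computation would have stopped at block 50.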
-
Committed by Andrew Morton
        fs/ext4/super.c: In function 'ext4_register_li_request':
        fs/ext4/super.c:2936: warning: 'ret' may be used uninitialized in this function

It looks buggy to me, too.

Cc: Lukas Czerner <lczerner@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-