- 04 9月, 2013 1 次提交
-
-
由 Christoph Hellwig 提交于
Add support to the core direct-io code to defer AIO completions to user context using a workqueue. This replaces opencoded and less efficient code in XFS and ext4 (we save a memory allocation for each direct IO) and will be needed to properly support O_(D)SYNC for AIO. The communication between the filesystem and the direct I/O code requires a new buffer head flag, which is a bit ugly but not avoidable until the direct I/O code stops abusing the buffer_head structure for communicating with the filesystems. Currently this creates a per-superblock unbound workqueue for these completions, which is taken from an earlier patch by Jan Kara. I'm not really convinced about this use and would prefer a "normal" global workqueue with a high concurrency limit, but this needs further discussion. JK: Fixed ext4 part, dynamic allocation of the workqueue. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 17 8月, 2013 1 次提交
-
-
由 Jan Kara 提交于
Commit 0713ed0c added jbd2_journal_file_inode() call into ext4_block_zero_page_range(). However that function gets called from truncate path and thus inode needn't have jinode attached - that happens in ext4_file_open() but the file needn't be ever open since mount. Calling jbd2_journal_file_inode() without jinode attached results in the oops. We fix the problem by attaching jinode to inode also in ext4_truncate() and ext4_punch_hole() when we are going to zero out partial blocks. Reported-by: Nmajianpeng <majianpeng@gmail.com> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
-
- 30 7月, 2013 1 次提交
-
-
由 Zheng Liu 提交于
In commit 921f266b: ext4: add self-testing infrastructure to do a sanity check, some sanity checks were added in map_blocks to make sure 'retval == map->m_len'. Enable these checks by default and report any assertion failures using ext4_warning() and WARN_ON() since they can help us to figure out some bugs that are otherwise hard to hit. Signed-off-by: NZheng Liu <wenqing.lz@taobao.com> Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
-
- 16 7月, 2013 1 次提交
-
-
由 Theodore Ts'o 提交于
If there are no items in the extent status tree, ext4_es_lru_add() is a no-op. So it is not sufficient to call ext4_es_lru_add() before we try to lookup an entry in the extent status tree. We also need to call it at the end of ext4_ext_map_blocks(), after items have been added to the extent status tree. This could lead to inodes with that have extent status trees but which are not in the LRU list, which means they won't get considered for eviction by the es_shrinker. Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu> Cc: Zheng Liu <wenqing.lz@taobao.com> Cc: stable@vger.kernel.org
-
- 13 7月, 2013 1 次提交
-
-
由 Theodore Ts'o 提交于
Replace "assertation" with "assertion" in lots and lots of debugging messages. Correct the comment stating when ext4_es_insert_extent() is used. It was no doubt tree at one point, but it is no longer true... Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu> Cc: Zheng Liu <gnehzuil.liu@gmail.com>
-
- 06 7月, 2013 1 次提交
-
-
由 Jan Kara 提交于
The loop in mpage_map_and_submit_extent() is guaranteed to always run at least once since the caller of mpage_map_and_submit_extent() makes sure map->m_len > 0. So make that explicit using do-while instead of pure while which also silences the compiler warning about uninitialized 'err' variable. Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu> Reviewed-by: NLukas Czerner <lczerner@redhat.com>
-
- 01 7月, 2013 6 次提交
-
-
由 Theodore Ts'o 提交于
The function mpage_released_unused_page() must only be called once; otherwise the kernel will BUG() when the second call to mpage_released_unused_page() tries to unlock the pages which had been unlocked by the first call. Also restructure the error handling so that we only give up on writing the dirty pages in the case of ENOSPC where retrying the allocation won't help. Otherwise, a transient failure, such as a kmalloc() failure in calling ext4_map_blocks() might cause us to give up on those pages, leading to a scary message in /var/log/messages plus data loss. Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu> Reviewed-by: NJan Kara <jack@suse.cz>
-
由 Lukas Czerner 提交于
Currently if we pass range into ext4_zero_partial_blocks() which covers entire block we would attempt to zero it even though we should only zero unaligned part of the block. Fix this by checking whether the range covers the whole block skip zeroing if so. Signed-off-by: NLukas Czerner <lczerner@redhat.com> Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
-
由 Theodore Ts'o 提交于
The function ext4_write_inline_data_end() can return an error. So we need to assign it to a signed integer variable to check for an error return (since copied is an unsigned int). Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu> Cc: Zheng Liu <wenqing.lz@taobao.com> Cc: stable@vger.kernel.org
-
由 jon ernst 提交于
Comparing unsigned variable with 0 always returns false. err = 0 is duplicated and unnecessary. [ tytso: Also cleaned up error handling in ext4_block_zero_page_range() ] Signed-off-by: N"Jon Ernst" <jonernst07@gmx.com> Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
-
由 Ashish Sangwan 提交于
No need to pass file pointer when we can directly pass inode pointer. Signed-off-by: NAshish Sangwan <a.sangwan@samsung.com> Signed-off-by: NNamjae Jeon <namjae.jeon@samsung.com> Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
-
由 Zheng Liu 提交于
Now we maintain an proper in-order LRU list in ext4 to reclaim entries from extent status tree when we are under heavy memory pressure. For keeping this order, a spin lock is used to protect this list. But this lock burns a lot of CPU time. We can use the following steps to trigger it. % cd /dev/shm % dd if=/dev/zero of=ext4-img bs=1M count=2k % mkfs.ext4 ext4-img % mount -t ext4 -o loop ext4-img /mnt % cd /mnt % for ((i=0;i<160;i++)); do truncate -s 64g $i; done % for ((i=0;i<160;i++)); do cp $i /dev/null &; done % perf record -a -g % perf report This commit tries to fix this problem. Now a new member called i_touch_when is added into ext4_inode_info to record the last access time for an inode. Meanwhile we never need to keep a proper in-order LRU list. So this can avoid to burns some CPU time. When we try to reclaim some entries from extent status tree, we use list_sort() to get a proper in-order list. Then we traverse this list to discard some entries. In ext4_sb_info, we use s_es_last_sorted to record the last time of sorting this list. When we traverse the list, we skip the inode that is newer than this time, and move this inode to the tail of LRU list. When the head of the list is newer than s_es_last_sorted, we will sort the LRU list again. In this commit, we break the loop if s_extent_cache_cnt == 0 because that means that all extents in extent status tree have been reclaimed. Meanwhile in this commit, ext4_es_{un}register_shrinker()'s prototype is changed to save a local variable in these functions. Reported-by: NDave Hansen <dave.hansen@intel.com> Signed-off-by: NZheng Liu <wenqing.lz@taobao.com> Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
-
- 07 6月, 2013 1 次提交
-
-
由 Theodore Ts'o 提交于
Rename ext4_da_writepages() to ext4_writepages() and use it for all modes. We still need to iterate over all the pages in the case of data=journalling, but in the case of nodelalloc/data=ordered (which is what file systems mounted using ext3 backwards compatibility will use) this will allow us to use a much more efficient I/O submission path. Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
-
- 05 6月, 2013 10 次提交
-
-
由 Jan Kara 提交于
Now that we clear PageWriteback after extent conversion, there's no need to wait for io_end processing in ext4_evict_inode(). Running AIO/DIO keeps file reference until aio_complete() is called so ext4_evict_inode() cannot be called. For io_end structures resulting from buffered IO waiting is happening because we wait for PageWriteback in truncate_inode_pages(). Reviewed-by: NZheng Liu <wenqing.lz@taobao.com> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
-
由 Jan Kara 提交于
We don't have to wait for extent conversion in ext4_punch_hole() as buffered IO for the punched range has been flushed and waited upon (thus all extent conversions for that range have completed). Also we wait for all DIO to finish using inode_dio_wait() so there cannot be any extent conversions pending due to direct IO. Also remove ext4_flush_unwritten_io() since it's unused now. Reviewed-by: NZheng Liu <wenqing.lz@taobao.com> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
-
由 Jan Kara 提交于
Since PageWriteback bit is now cleared after extents are converted from unwritten to written ones, we have full exclusion of writeback path from truncate (truncate_inode_pages() waits for PageWriteback bits to get cleared on all invalidated pages). Exclusion from DIO path is achieved by inode_dio_wait() call in ext4_setattr(). So there's no need to wait for extent convertion in ext4_truncate() anymore. Reviewed-by: NZheng Liu <wenqing.lz@taobao.com> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
-
由 Jan Kara 提交于
Make sure extent conversion after DIO happens while i_dio_count is still elevated so that inode_dio_wait() waits until extent conversion is done. This removes the need for explicit waiting for extent conversion in some cases. Reviewed-by: NZheng Liu <wenqing.lz@taobao.com> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
-
由 Jan Kara 提交于
Later we would like to clear PageWriteback bit only after extent conversion from unwritten to written extents is performed. However it is not possible to start a transaction after PageWriteback is set because that violates lock ordering (and is easy to deadlock). So we have to reserve a transaction before locking pages and sending them for IO and later we use the transaction for extent conversion from ext4_end_io(). Reviewed-by: NZheng Liu <wenqing.lz@taobao.com> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
-
由 Jan Kara 提交于
There isn't any need for setting BH_Uninit on buffers anymore. It was only used to signal we need to mark io_end as needing extent conversion in add_bh_to_extent() but now we can mark the io_end directly when mapping extent. Reviewed-by: NZheng Liu <wenqing.lz@taobao.com> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
-
由 Jan Kara 提交于
There are two issues with current writeback path in ext4. For one we don't necessarily map complete pages when blocksize < pagesize and thus needn't do any writeback in one iteration. We always map some blocks though so we will eventually finish mapping the page. Just if writeback races with other operations on the file, forward progress is not really guaranteed. The second problem is that current code structure makes it hard to associate all the bios to some range of pages with one io_end structure so that unwritten extents can be converted after all the bios are finished. This will be especially difficult later when io_end will be associated with reserved transaction handle. We restructure the writeback path to a relatively simple loop which first prepares extent of pages, then maps one or more extents so that no page is partially mapped, and once page is fully mapped it is submitted for IO. We keep all the mapping and IO submission information in mpage_da_data structure to somewhat reduce stack usage. Resulting code is somewhat shorter than the old one and hopefully also easier to read. Reviewed-by: NZheng Liu <wenqing.lz@taobao.com> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
-
由 Jan Kara 提交于
We limit the number of blocks written in a single loop of ext4_da_writepages() to 64 when inode uses indirect blocks. That is unnecessary as credit estimates for mapping logically continguous run of blocks is rather low even for inode with indirect blocks. So just lift this limitation and properly calculate the number of necessary credits. This better credit estimate will also later allow us to always write at least a single page in one iteration. Reviewed-by: NZheng Liu <wenqing.lz@taobao.com> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
-
由 Jan Kara 提交于
ext4_ind_trans_blocks() wrongly used 'chunk' argument to decide whether blocks mapped are logically contiguous. That is wrong since the argument informs whether the blocks are physically contiguous. As the blocks mapped are always logically contiguous and that's all ext4_ind_trans_blocks() cares about, just remove the 'chunk' argument. Reviewed-by: NZheng Liu <wenqing.lz@taobao.com> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
-
由 Jan Kara 提交于
Writeback code got better in how it submits IO and now the number of pages requested to be written is usually higher than original 1024. The number is now dynamically computed based on observed throughput and is set to be about 0.5 s worth of writeback. E.g. on ordinary SATA drive this ends up somewhere around 10000 as my testing shows. So remove the unnecessary smarts from ext4_da_writepages(). Reviewed-by: NZheng Liu <wenqing.lz@taobao.com> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
-
- 04 6月, 2013 1 次提交
-
-
由 Jan Kara 提交于
Change writeback path to create just one io_end structure for the extent to which we submit IO and share it among bios writing that extent. This prevents needless splitting and joining of unwritten extents when they cannot be submitted as a single bio. Bugs in ENOMEM handling found by Linux File System Verification project (linuxtesting.org) and fixed by Alexey Khoroshilov <khoroshilov@ispras.ru>. CC: Alexey Khoroshilov <khoroshilov@ispras.ru> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
-
- 01 6月, 2013 1 次提交
-
-
由 Jan Kara 提交于
The arithmetics adding delalloc blocks to the number of used blocks in ext4_getattr() can easily overflow on 32-bit archs as we first multiply number of blocks by blocksize and then divide back by 512. Make the arithmetics more clever and also use proper type (unsigned long long instead of unsigned long). Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
- 28 5月, 2013 5 次提交
-
-
由 Lukas Czerner 提交于
The discard_partial_page_buffers is no longer used anywhere so we can simply remove it including the *_no_lock variant and EXT4_DISCARD_PARTIAL_PG_ZERO_UNMAPPED define. Signed-off-by: NLukas Czerner <lczerner@redhat.com> Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
由 Lukas Czerner 提交于
We're doing to get rid of ext4_discard_partial_page_buffers() since it is duplicating some code and also partially duplicating work of truncate_pagecache_range(), moreover the old implementation was much clearer. Now when the truncate_inode_pages_range() can handle truncating non page aligned regions we can use this to invalidate and zero out block aligned region of the punched out range and then use ext4_block_truncate_page() to zero the unaligned blocks on the start and end of the range. This will greatly simplify the punch hole code. Moreover after this commit we can get rid of the ext4_discard_partial_page_buffers() completely. We also introduce function ext4_prepare_punch_hole() to do come common operations before we attempt to do the actual punch hole on indirect or extent file which saves us some code duplication. This has been tested on ppc64 with 1k block size with fsx and xfstests without any problems. Signed-off-by: NLukas Czerner <lczerner@redhat.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
由 Lukas Czerner 提交于
This reverts commit 189e868f. This commit reintroduces the use of ext4_block_truncate_page() in ext4 truncate operation instead of ext4_discard_partial_page_buffers(). The statement in the commit description that the truncate operation only zero block unaligned portion of the last page is not exactly right, since truncate_pagecache_range() also zeroes and invalidate the unaligned portion of the page. Then there is no need to zero and unmap it once more and ext4_block_truncate_page() was doing the right job, although we still need to update the buffer head containing the last block, which is exactly what ext4_block_truncate_page() is doing. Moreover the problem described in the commit is fixed more properly with commit 15291164 jbd2: clear BH_Delay & BH_Unwritten in journal_unmap_buffer This was tested on ppc64 machine with block size of 1024 bytes without any problems. Signed-off-by: NLukas Czerner <lczerner@redhat.com> Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
由 Lukas Czerner 提交于
In data=ordered mode we should call ext4_jbd2_file_inode() so that crash after the truncate transaction has committed does not expose stall data in the tail of the block. Thanks Jan Kara for pointing that out. Signed-off-by: NLukas Czerner <lczerner@redhat.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
由 Lukas Czerner 提交于
This reverts commit ccb4d7af. This commit reintroduces functions ext4_block_truncate_page() and ext4_block_zero_page_range() which has been previously removed in favour of ext4_discard_partial_page_buffers(). In future commits we want to reintroduce those function and remove ext4_discard_partial_page_buffers() since it is duplicating some code and also partially duplicating work of truncate_pagecache_range(), moreover the old implementation was much clearer. Signed-off-by: NLukas Czerner <lczerner@redhat.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
- 22 5月, 2013 3 次提交
-
-
由 Lukas Czerner 提交于
->invalidatepage() aop now accepts range to invalidate so we can make use of it in all ext4 invalidatepage routines. Signed-off-by: NLukas Czerner <lczerner@redhat.com> Reviewed-by: NJan Kara <jack@suse.cz>
-
由 Lukas Czerner 提交于
invalidatepage now accepts range to invalidate and there are two file system using jbd2 also implementing punch hole feature which can benefit from this. We need to implement the same thing for jbd2 layer in order to allow those file system take benefit of this functionality. This commit adds length argument to the jbd2_journal_invalidatepage() and updates all instances in ext4 and ocfs2. Signed-off-by: NLukas Czerner <lczerner@redhat.com> Reviewed-by: NJan Kara <jack@suse.cz>
-
由 Lukas Czerner 提交于
Currently there is no way to truncate partial page where the end truncate point is not at the end of the page. This is because it was not needed and the functionality was enough for file system truncate operation to work properly. However more file systems now support punch hole feature and it can benefit from mm supporting truncating page just up to the certain point. Specifically, with this functionality truncate_inode_pages_range() can be changed so it supports truncating partial page at the end of the range (currently it will BUG_ON() if 'end' is not at the end of the page). This commit changes the invalidatepage() address space operation prototype to accept range to be invalidated and update all the instances for it. We also change the block_invalidatepage() in the same way and actually make a use of the new length argument implementing range invalidation. Actual file system implementations will follow except the file systems where the changes are really simple and should not change the behaviour in any way .Implementation for truncate_page_range() which will be able to accept page unaligned ranges will follow as well. Signed-off-by: NLukas Czerner <lczerner@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com>
-
- 12 5月, 2013 1 次提交
-
-
由 Theodore Ts'o 提交于
This reverts commit 4eec708d. Multiple users have reported crashes which is apparently caused by this commit. Thanks to Dmitry Monakhov for bisecting it. Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu> Cc: Dmitry Monakhov <dmonakhov@openvz.org> Cc: Jan Kara <jack@suse.cz>
-
- 08 5月, 2013 1 次提交
-
-
由 Kent Overstreet 提交于
Faster kernel compiles by way of fewer unnecessary includes. [akpm@linux-foundation.org: fix fallout] [akpm@linux-foundation.org: fix build] Signed-off-by: NKent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: N"Theodore Ts'o" <tytso@mit.edu> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 23 4月, 2013 1 次提交
-
-
由 Theodore Ts'o 提交于
Due to a missing cast, the high 32-bits of a 64-bit block number used when calculating the readahead block for inode tables can get lost. This means we can end up fetching the wrong blocks for readahead for file systems > 16TB. Linus found this when experimenting with an enhacement to the sparse static code checker which checks for missing widening casts before binary "not" operators. Reported-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
-
- 22 4月, 2013 1 次提交
-
-
由 Theodore Ts'o 提交于
This allows metadata writebacks which are issued via block device writeback to be sent with the current write request flags. Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
-
- 12 4月, 2013 1 次提交
-
-
由 Jan Kara 提交于
Change writeback path to create just one io_end structure for the extent to which we submit IO and share it among bios writing that extent. This prevents needless splitting and joining of unwritten extents when they cannot be submitted as a single bio. Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu> Reviewed-by: NDmitry Monakhov <dmonakhov@openvz.org> Reviewed-by: NZheng Liu <wenqing.lz@taobao.com>
-
- 10 4月, 2013 2 次提交
-
-
由 Dmitry Monakhov 提交于
Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
-
由 Lukas Czerner 提交于
Currently in ENOSPC condition when writing into unwritten space, or punching a hole, we might need to split the extent and grow extent tree. However since we can not allocate any new metadata blocks we'll have to zero out unwritten part of extent or punched out part of extent, or in the worst case return ENOSPC even though use actually does not allocate any space. Also in delalloc path we do reserve metadata and data blocks for the time we're going to write out, however metadata block reservation is very tricky especially since we expect that logical connectivity implies physical connectivity, however that might not be the case and hence we might end up allocating more metadata blocks than previously reserved. So in future, metadata reservation checks should be removed since we can not assure that we do not under reserve. And this is where reserved space comes into the picture. When mounting the file system we slice off a little bit of the file system space (2% or 4096 clusters, whichever is smaller) which can be then used for the cases mentioned above to prevent costly zeroout, or unexpected ENOSPC. The number of reserved clusters can be set via sysfs, however it can never be bigger than number of free clusters in the file system. Note that this patch fixes the failure of xfstest 274 as expected. Signed-off-by: NLukas Czerner <lczerner@redhat.com> Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu> Reviewed-by: NCarlos Maiolino <cmaiolino@redhat.com>
-