- 18 3月, 2020 13 次提交
-
-
由 Ritesh Harjani 提交于
commit f629afe3369e9885fd6e9cc7a4f514b6a65cf9e9 upstream. Apparently our current rwsem code doesn't like doing the trylock, then lock for real scheme. So change our dax read/write methods to just do the trylock for the RWF_NOWAIT case. This seems to fix AIM7 regression in some scalable filesystems upto ~25% in some cases. Claimed in commit 942491c9 ("xfs: fix AIM7 regression") Reviewed-by: NJan Kara <jack@suse.cz> Reviewed-by: NMatthew Bobrowski <mbobrowski@mbobrowski.org> Tested-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: NRitesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20191212055557.11151-2-riteshh@linux.ibm.comSigned-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Matthew Bobrowski 提交于
commit 378f32bab3714f04c4e0c3aee4129f6703805550 upstream. This patch introduces a new direct I/O write path which makes use of the iomap infrastructure. All direct I/O writes are now passed from the ->write_iter() callback through to the new direct I/O handler ext4_dio_write_iter(). This function is responsible for calling into the iomap infrastructure via iomap_dio_rw(). Code snippets from the existing direct I/O write code within ext4_file_write_iter() such as, checking whether the I/O request is unaligned asynchronous I/O, or whether the write will result in an overwrite have effectively been moved out and into the new direct I/O ->write_iter() handler. The block mapping flags that are eventually passed down to ext4_map_blocks() from the *_get_block_*() suite of routines have been taken out and introduced within ext4_iomap_alloc(). For inode extension cases, ext4_handle_inode_extension() is effectively the function responsible for performing such metadata updates. This is called after iomap_dio_rw() has returned so that we can safely determine whether we need to potentially truncate any allocated blocks that may have been prepared for this direct I/O write. We don't perform the inode extension, or truncate operations from the ->end_io() handler as we don't have the original I/O 'length' available there. The ->end_io() however is responsible fo converting allocated unwritten extents to written extents. In the instance of a short write, we fallback and complete the remainder of the I/O using buffered I/O via ext4_buffered_write_iter(). The existing buffer_head direct I/O implementation has been removed as it's now redundant. [ Fix up ext4_dio_write_iter() per Jan's comments at https://lore.kernel.org/r/20191105135932.GN22379@quack2.suse.cz -- TYT ] Signed-off-by: NMatthew Bobrowski <mbobrowski@mbobrowski.org> Reviewed-by: NJan Kara <jack@suse.cz> Reviewed-by: NRitesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/e55db6f12ae6ff017f36774135e79f3e7b0333da.1572949325.git.mbobrowski@mbobrowski.orgSigned-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Matthew Bobrowski 提交于
commit 3eaf9cc62f447a742b26fa601993e94406aa1ea1 upstream. When the filesystem is created without a journal, we eventually call into __generic_file_fsync() in order to write out all the modified in-core data to the permanent storage device. This function happens to try and obtain an inode_lock() while synchronizing the files buffer and it's associated metadata. Generally, this is fine, however it becomes a problem when there is higher level code that has already obtained an inode_lock() as this leads to a recursive lock situation. This case is especially true when porting across direct I/O to iomap infrastructure as we obtain an inode_lock() early on in the I/O within ext4_dio_write_iter() and hold it until the I/O has been completed. Consequently, to not run into this specific issue, we move away from calling into __generic_file_fsync() and perform the necessary synchronization tasks within ext4_sync_file(). Signed-off-by: NMatthew Bobrowski <mbobrowski@mbobrowski.org> Reviewed-by: NRitesh Harjani <riteshh@linux.ibm.com> Reviewed-by: NJan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/3495f35ef67f2021b567e28e6f59222e583689b8.1572949325.git.mbobrowski@mbobrowski.orgSigned-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Matthew Bobrowski 提交于
commit 0b9f230b94dd7457802264dc4c16921b3527dcf1 upstream. Lift the inode extension/orphan list handling code out from ext4_iomap_alloc() and apply it within the ext4_dax_write_iter(). Signed-off-by: NMatthew Bobrowski <mbobrowski@mbobrowski.org> Reviewed-by: NJan Kara <jack@suse.cz> Reviewed-by: NRitesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/fd5c84db25d5d0da87d97ed4c36fd844f57da759.1572949325.git.mbobrowski@mbobrowski.orgSigned-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Matthew Bobrowski 提交于
commit 569342dc2485392e95b6a626281708c25014ba37 upstream. In preparation for implementing the iomap direct I/O modifications, the inode extension/truncate code needs to be moved out from the ext4_iomap_end() callback. For direct I/O, if the current code remained, it would behave incorrrectly. Updating the inode size prior to converting unwritten extents would potentially allow a racing direct I/O read to find unwritten extents before being converted correctly. The inode extension/truncate code now resides within a new helper ext4_handle_inode_extension(). This function has been designed so that it can accommodate for both DAX and direct I/O extension/truncate operations. Signed-off-by: NMatthew Bobrowski <mbobrowski@mbobrowski.org> Reviewed-by: NJan Kara <jack@suse.cz> Reviewed-by: NRitesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/d41ffa26e20b15b12895812c3cad7c91a6a59bc6.1572949325.git.mbobrowski@mbobrowski.orgSigned-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Matthew Bobrowski 提交于
commit b1b4705d54abedfd69dcdf42779c521aa1e0fbd3 upstream. This patch introduces a new direct I/O read path which makes use of the iomap infrastructure. The new function ext4_do_read_iter() is responsible for calling into the iomap infrastructure via iomap_dio_rw(). If the read operation performed on the inode is not supported, which is checked via ext4_dio_supported(), then we simply fallback and complete the I/O using buffered I/O. Existing direct I/O read code path has been removed, as it is now redundant. Signed-off-by: NMatthew Bobrowski <mbobrowski@mbobrowski.org> Reviewed-by: NJan Kara <jack@suse.cz> Reviewed-by: NRitesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/f98a6f73fadddbfbad0fc5ed04f712ca0b799f37.1572949325.git.mbobrowski@mbobrowski.orgSigned-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Matthew Bobrowski 提交于
commit 09edf4d381957b144440bac18a4769c53063b943 upstream. As part of the ext4_iomap_begin() cleanups that precede this patch, we also split up the IOMAP_REPORT branch into a completely separate ->iomap_begin() callback named ext4_iomap_begin_report(). Again, the raionale for this change is to reduce the overall clutter within ext4_iomap_begin(). Signed-off-by: NMatthew Bobrowski <mbobrowski@mbobrowski.org> Reviewed-by: NJan Kara <jack@suse.cz> Reviewed-by: NRitesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/5c97a569e26ddb6696e3d3ac9fbde41317e029a0.1572949325.git.mbobrowski@mbobrowski.orgSigned-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Goldwyn Rodrigues 提交于
commit c039b99792726346ad46ff17c5a5bcb77a5edac4 upstream. The srcmap is used to identify where the read is to be performed from. It is passed to ->iomap_begin, which can fill it in if we need to read data for partially written blocks from a different location than the write target. The srcmap is only supported for buffered writes so far. Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com> [hch: merged two patches, removed the IOMAP_F_COW flag, use iomap as srcmap if not set, adjust length down to srcmap end as well] Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Acked-by: NGoldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Matthew Bobrowski 提交于
commit f063db5ee989aafe2dc9d571b5538f2a1f1cbad2 upstream. In preparation for porting across the ext4 direct I/O path over to the iomap infrastructure, split up the IOMAP_WRITE branch that's currently within ext4_iomap_begin() into a separate helper ext4_alloc_iomap(). This way, when we add in the necessary code for direct I/O, we don't end up with ext4_iomap_begin() becoming a monstrous twisty maze. Signed-off-by: NMatthew Bobrowski <mbobrowski@mbobrowski.org> Reviewed-by: NJan Kara <jack@suse.cz> Reviewed-by: NRitesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/50eef383add1ea529651640574111076c55aca9f.1572949325.git.mbobrowski@mbobrowski.orgSigned-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Matthew Bobrowski 提交于
commit c8fdfe294187455b70e42a15df35a3e1882f332d upstream. Separate the iomap field population code that is currently within ext4_iomap_begin() into a separate helper ext4_set_iomap(). The intent of this function is self explanatory, however the rationale behind taking this step is to reeduce the overall clutter that we currently have within the ext4_iomap_begin() callback. Signed-off-by: NMatthew Bobrowski <mbobrowski@mbobrowski.org> Reviewed-by: NJan Kara <jack@suse.cz> Reviewed-by: NRitesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/1ea34da65eecffcddffb2386668ae06134e8deaf.1572949325.git.mbobrowski@mbobrowski.orgSigned-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Matthew Bobrowski 提交于
commit 2e9b51d78229d5145725a481bb5464ebc0a3f9b2 upstream. This patch addresses what Dave Chinner had discovered and fixed within commit: 7684e2c4384d. This changes does not have any user visible impact for ext4 as none of the current users of ext4_iomap_begin() that extend files depend on IOMAP_F_DIRTY. When doing a direct IO that spans the current EOF, and there are written blocks beyond EOF that extend beyond the current write, the only metadata update that needs to be done is a file size extension. However, we don't mark such iomaps as IOMAP_F_DIRTY to indicate that there is IO completion metadata updates required, and hence we may fail to correctly sync file size extensions made in IO completion when O_DSYNC writes are being used and the hardware supports FUA. Hence when setting IOMAP_F_DIRTY, we need to also take into account whether the iomap spans the current EOF. If it does, then we need to mark it dirty so that IO completion will call generic_write_sync() to flush the inode size update to stable storage correctly. Signed-off-by: NMatthew Bobrowski <mbobrowski@mbobrowski.org> Reviewed-by: NJan Kara <jack@suse.cz> Reviewed-by: NRitesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/8b43ee9ee94bee5328da56ba0909b7d2229ef150.1572949325.git.mbobrowski@mbobrowski.orgSigned-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Matthew Bobrowski 提交于
commit 548feebec7e93e58b647dba70b3303dcb569c914 upstream. This patch updates the lock pattern in ext4_direct_IO_read() to not block on inode lock in cases of IOCB_NOWAIT direct I/O reads. The locking condition implemented here is similar to that of 942491c9 ("xfs: fix AIM7 regression"). Fixes: 16c54688 ("ext4: Allow parallel DIO reads") Signed-off-by: NMatthew Bobrowski <mbobrowski@mbobrowski.org> Reviewed-by: NJan Kara <jack@suse.cz> Reviewed-by: NRitesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/c5d5e759f91747359fbd2c6f9a36240cf75ad79f.1572949325.git.mbobrowski@mbobrowski.orgSigned-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Matthew Bobrowski 提交于
commit 53e5cca56795a301bbe8465781dab084f7ae8d54 upstream. For the direct I/O changes that follow in this patch series, we need to accommodate for the case where the block mapping flags passed through to ext4_map_blocks() result in m_flags having both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN bits set. In order for any allocated unwritten extents to be converted correctly in the ->end_io() handler, the iomap->type must be set to IOMAP_UNWRITTEN for cases where the EXT4_MAP_UNWRITTEN bit has been set within m_flags. Hence the reason why we need to reshuffle this conditional statement around. This change is a no-op for DAX as the block mapping flags passed through to ext4_map_blocks() i.e. EXT4_GET_BLOCKS_CREATE_ZERO never results in both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN being set at once. Signed-off-by: NMatthew Bobrowski <mbobrowski@mbobrowski.org> Reviewed-by: NJan Kara <jack@suse.cz> Reviewed-by: NRitesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/1309ad80d31a637b2deed55a85283d582a54a26a.1572949325.git.mbobrowski@mbobrowski.orgSigned-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
- 17 1月, 2020 3 次提交
-
-
由 zhangyi (F) 提交于
commit 9ba55543fc0c6bb1cf8edd63be8802d9ab7e1202 upstream. If user specify a large enough value of "commit=" option, it may trigger signed integer overflow which may lead to sbi->s_commit_interval becomes a large or small value, zero in particular. UBSAN: Undefined behaviour in ../fs/ext4/super.c:1592:31 signed integer overflow: 536870912 * 1000 cannot be represented in type 'int' [...] Call trace: [...] [<ffffff9008a2d120>] ubsan_epilogue+0x34/0x9c lib/ubsan.c:166 [<ffffff9008a2d8b8>] handle_overflow+0x228/0x280 lib/ubsan.c:197 [<ffffff9008a2d95c>] __ubsan_handle_mul_overflow+0x4c/0x68 lib/ubsan.c:218 [<ffffff90086d070c>] handle_mount_opt fs/ext4/super.c:1592 [inline] [<ffffff90086d070c>] parse_options+0x1724/0x1a40 fs/ext4/super.c:1773 [<ffffff90086d51c4>] ext4_remount+0x2ec/0x14a0 fs/ext4/super.c:4834 [...] Although it is not a big deal, still silence the UBSAN by limit the input value. Signed-off-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
由 Khazhismel Kumykov 提交于
commit 4b99faa23c51ca31312b9afb876e8e5878daeb80 upstream. Signed-off-by: NKhazhismel Kumykov <khazhy@google.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Reviewed-by: NAndreas Dilger <adilger@dilger.ca> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
由 Nikolay Borisov 提交于
commit 82dd124c40b8cda710878b88fb0182301c040ffe upstream. There is a function which clearly conveys the objective of checking i_writecount. Additionally the usage in ext4_mb_initialize_context was wrong, since a node would have wrongfully been reported as writable if i_writecount had a negative value (MMAP_DENY_WRITE). Signed-off-by: NNikolay Borisov <nborisov@suse.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
- 15 1月, 2020 3 次提交
-
-
由 Xiaoguang Wang 提交于
In ext4_writepages(), for every iteration, mpage_prepare_extent_to_map() will try to find 2048 pages to map and normally one bio can contain 256 pages at most. If we really found 2048 pages to map, there will be 4 bios and 4 ext4_io_submit() calls which are called both in ext4_writepages() and mpage_map_and_submit_extent(). But note that in mpage_map_and_submit_extent(), we hold a valid jbd2 handle, when dioread_nolock is enabled and extent is unwritten, jbd2 commit thread will wait this handle to finish, so wait the unwritten extent is written to disk, this will introduce unnecessary stall time, especially longer when the writeback operation is io throttled, need to fix this issue. Here for this scene, we accumulate bios in ext4_io_submit's io_bio, and only submit these bios after dropping the jbd2 handle. Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
由 zhangliguang 提交于
This is a temporary workaround plan to avoid the limitation when creating hard link cross two projids. Signed-off-by: Nzhangliguang <zhangliguang@linux.alibaba.com> Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
由 Joseph Qi 提交于
This is trying to do jbd2 checkpoint in a specific kernel thread, then checkpoint won't be under io throttle control. Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com> Signed-off-by: Nzhangliguang <zhangliguang@linux.alibaba.com> Reviewed by: Baoyou Xie <baoyou.xie@linux.alibaba.com> Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
- 27 12月, 2019 8 次提交
-
-
由 Eric Whitney 提交于
commit 7bd75230b43727b258a4f7a59d62114cffe1b6c8 upstream. Ext4 may not free clusters correctly when punching holes in bigalloc file systems under high load conditions. If it's not possible to extend and restart the journal in ext4_ext_rm_leaf() when preparing to remove blocks from a punched region, a retry of the entire punch operation is triggered in ext4_ext_remove_space(). This causes a partial cluster to be set to the first cluster in the extent found to the right of the punched region. However, if the punch operation prior to the retry had made enough progress to delete one or more extents and a partial cluster candidate for freeing had already been recorded, the retry would overwrite the partial cluster. The loss of this information makes it impossible to correctly free the original partial cluster in all cases. This bug can cause generic/476 to fail when run as part of xfstests-bld's bigalloc and bigalloc_1k test cases. The failure is reported when e2fsck detects bad iblocks counts greater than expected in units of whole clusters and also detects a number of negative block bitmap differences equal to the iblocks discrepancy in cluster units. Signed-off-by: NEric Whitney <enwlinux@gmail.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Xiaoguang Wang 提交于
commit a297b2fcee461e40df763e179cbbfba5a9e572d2 upstream. In mpage_add_bh_to_extent(), when accumulated extents length is greater than MAX_WRITEPAGES_EXTENT_LEN or buffer head's b_stat is not equal, we will not continue to search unmapped area for this page, but note this page is locked, and will only be unlocked in mpage_release_unused_pages() after ext4_io_submit, if io also is throttled by blk-throttle or similar io qos, we will hold this page locked for a while, it's unnecessary. I think the best fix is to refactor mpage_add_bh_to_extent() to let it return some hints whether to unlock this page, but given that we will improve dioread_nolock later, we can let it done later, so currently the simple fix would just call mpage_release_unused_pages() before ext4_io_submit(). Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com>
-
由 Eric Whitney 提交于
commit f456767d3391e9f7d9d25a2e7241d75676dc19da upstream. Add new code to count canceled pending cluster reservations on bigalloc file systems and to reduce the cluster reservation count on all file systems using delayed allocation. This replaces old code in ext4_da_page_release_reservations that was incorrect. Signed-off-by: NEric Whitney <enwlinux@gmail.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
-
由 Eric Whitney 提交于
commit 9fe671496b6c286f9033aedfc1718d67721da0ae upstream. Modify ext4_ext_remove_space() and the code it calls to correct the reserved cluster count for pending reservations (delayed allocated clusters shared with allocated blocks) when a block range is removed from the extent tree. Pending reservations may be found for the clusters at the ends of written or unwritten extents when a block range is removed. If a physical cluster at the end of an extent is freed, it's necessary to increment the reserved cluster count to maintain correct accounting if the corresponding logical cluster is shared with at least one delayed and unwritten extent as found in the extents status tree. Add a new function, ext4_rereserve_cluster(), to reapply a reservation on a delayed allocated cluster sharing blocks with a freed allocated cluster. To avoid ENOSPC on reservation, a flag is applied to ext4_free_blocks() to briefly defer updating the freeclusters counter when an allocated cluster is freed. This prevents another thread from allocating the freed block before the reservation can be reapplied. Redefine the partial cluster object as a struct to carry more state information and to clarify the code using it. Adjust the conditional code structure in ext4_ext_remove_space to reduce the indentation level in the main body of the code to improve readability. Signed-off-by: NEric Whitney <enwlinux@gmail.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
-
由 Eric Whitney 提交于
commit b6bf9171ef5c37b66d446378ba63af5339a56a97 upstream. Ext4 does not always reduce the reserved cluster count by the number of clusters allocated when mapping a delayed extent. It sometimes adds back one or more clusters after allocation if delalloc blocks adjacent to the range allocated by ext4_ext_map_blocks() share the clusters newly allocated for that range. However, this overcounts the number of clusters needed to satisfy future mapping requests (holding one or more reservations for clusters that have already been allocated) and premature ENOSPC and quota failures, etc., result. Ext4 also does not reduce the reserved cluster count when allocating clusters for non-delayed allocated writes that have previously been reserved for delayed writes. This also results in overcounts. To make it possible to handle reserved cluster accounting for fallocated regions in the same manner as used for other non-delayed writes, do the reserved cluster accounting for them at the time of allocation. In the current code, this is only done later when a delayed extent sharing the fallocated region is finally mapped. Address comment correcting handling of unsigned long long constant from Jan Kara's review of RFC version of this patch. Signed-off-by: NEric Whitney <enwlinux@gmail.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
-
由 Eric Whitney 提交于
commit 0b02f4c0d6d9e2c611dfbdd4317193e9dca740e6 upstream. The code in ext4_da_map_blocks sometimes reserves space for more delayed allocated clusters than it should, resulting in premature ENOSPC, exceeded quota, and inaccurate free space reporting. Fix this by checking for written and unwritten blocks shared in the same cluster with the newly delayed allocated block. A cluster reservation should not be made for a cluster for which physical space has already been allocated. Signed-off-by: NEric Whitney <enwlinux@gmail.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
-
由 Eric Whitney 提交于
commit 1dc0aa46e74a3366e12f426b7caaca477853e9c3 upstream. Add new pending reservation mechanism to help manage reserved cluster accounting. Its primary function is to avoid the need to read extents from the disk when invalidating pages as a result of a truncate, punch hole, or collapse range operation. Signed-off-by: NEric Whitney <enwlinux@gmail.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
-
由 Eric Whitney 提交于
commit ad431025aecda85d3ebef5e4a3aca5c1c681d0c7 upstream. Ext4 contains a few functions that are used to search for delayed extents or blocks in the extents status tree. Rather than duplicate code to add new functions to search for extents with different status values, such as written or a combination of delayed and unwritten, generalize the existing code to search for caller-specified extents status values. Also, move this code into extents_status.c where it is better associated with the data structures it operates upon, and where it can be more readily used to implement new extents status tree functions that might want a broader scope for i_es_lock. Three missing static specifiers in RFC version of patch reported and fixed by Fengguang Wu <fengguang.wu@intel.com>. Signed-off-by: NEric Whitney <enwlinux@gmail.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
-
- 18 12月, 2019 3 次提交
-
-
由 yangerkun 提交于
commit 565333a1554d704789e74205989305c811fd9c7a upstream. No need to wait for any commit once the page is fully truncated. Besides, it may confuse e.g. concurrent ext4_writepage() with the page still be dirty (will be cleared by truncate_pagecache() in ext4_setattr()) but buffers has been freed; and then trigger a bug show as below: [ 26.057508] ------------[ cut here ]------------ [ 26.058531] kernel BUG at fs/ext4/inode.c:2134! ... [ 26.088130] Call trace: [ 26.088695] ext4_writepage+0x914/0xb28 [ 26.089541] writeout.isra.4+0x1b4/0x2b8 [ 26.090409] move_to_new_page+0x3b0/0x568 [ 26.091338] __unmap_and_move+0x648/0x988 [ 26.092241] unmap_and_move+0x48c/0xbb8 [ 26.093096] migrate_pages+0x220/0xb28 [ 26.093945] kernel_mbind+0x828/0xa18 [ 26.094791] __arm64_sys_mbind+0xc8/0x138 [ 26.095716] el0_svc_common+0x190/0x490 [ 26.096571] el0_svc_handler+0x60/0xd0 [ 26.097423] el0_svc+0x8/0xc Run the procedure (generate by syzkaller) parallel with ext3. void main() { int fd, fd1, ret; void *addr; size_t length = 4096; int flags; off_t offset = 0; char *str = "12345"; fd = open("a", O_RDWR | O_CREAT); assert(fd >= 0); /* Truncate to 4k */ ret = ftruncate(fd, length); assert(ret == 0); /* Journal data mode */ flags = 0xc00f; ret = ioctl(fd, _IOW('f', 2, long), &flags); assert(ret == 0); /* Truncate to 0 */ fd1 = open("a", O_TRUNC | O_NOATIME); assert(fd1 >= 0); addr = mmap(NULL, length, PROT_WRITE | PROT_READ, MAP_SHARED, fd, offset); assert(addr != (void *)-1); memcpy(addr, str, 5); mbind(addr, length, 0, 0, 0, MPOL_MF_MOVE); } And the bug will be triggered once we seen the below order. reproduce1 reproduce2 ... | ... truncate to 4k | change to journal data mode | | memcpy(set page dirty) truncate to 0: | ext4_setattr: | ... | ext4_wait_for_tail_page_commit | | mbind(trigger bug) truncate_pagecache(clean dirty)| ... ... | mbind will call ext4_writepage() since the page still be dirty, and then report the bug since the buffers has been free. Fix it by return directly once offset equals to 0 which means the page has been fully truncated. Reported-by: NHulk Robot <hulkci@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Link: https://lore.kernel.org/r/20190919063508.1045-1-yangerkun@huawei.comReviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Theodore Ts'o 提交于
commit c7df4a1ecb8579838ec8c56b2bb6a6716e974f37 upstream. If the file system is corrupted such that a file's i_links_count is too small, then it's possible that when unlinking that file, i_nlink will already be zero. Previously we were working around this kind of corruption by forcing i_nlink to one; but we were doing this before trying to delete the directory entry --- and if the file system is corrupted enough that ext4_delete_entry() fails, then we exit with i_nlink elevated, and this causes the orphan inode list handling to be FUBAR'ed, such that when we unmount the file system, the orphan inode list can get corrupted. A better way to fix this is to simply skip trying to call drop_nlink() if i_nlink is already zero, thus moving the check to the place where it makes the most sense. https://bugzilla.kernel.org/show_bug.cgi?id=205433 Link: https://lore.kernel.org/r/20191112032903.8828-1-tytso@mit.eduSigned-off-by: NTheodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Reviewed-by: NAndreas Dilger <adilger@dilger.ca> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Jan Kara 提交于
commit 65db869c754e7c271691dd5feabf884347e694f5 upstream. Estimate for the number of credits needed for final freeing of inode in ext4_evict_inode() was to small. We may modify 4 blocks (inode & sb for orphan deletion, bitmap & group descriptor for inode freeing) and not just 3. [ Fixed minor whitespace nit. -- TYT ] Fixes: e50e5129 ("ext4: xattr-in-inode support") CC: stable@vger.kernel.org Signed-off-by: NJan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191105164437.32602-6-jack@suse.czSigned-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 05 12月, 2019 1 次提交
-
-
由 Theodore Ts'o 提交于
commit 4ea99936a1630f51fc3a2d61a58ec4a1c4b7d55a upstream. It's possible to specify a non-zero s_want_extra_isize via debugging option, and this can cause bad things(tm) to happen when using a file system with an inode size of 128 bytes. Add better checking when the file system is mounted, as well as when we are actually doing the trying to do the inode expansion. Link: https://lore.kernel.org/r/20191110121510.GH23325@mit.edu Reported-by: syzbot+f8d6f8386ceacdbfff57@syzkaller.appspotmail.com Reported-by: syzbot+33d7ea72e47de3bdf4e1@syzkaller.appspotmail.com Reported-by: syzbot+44b6763edfc17144296f@syzkaller.appspotmail.com Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 24 11月, 2019 1 次提交
-
-
由 Gabriel Krisman Bertazi 提交于
[ Upstream commit 799578ab16e86b074c184ec5abbda0bc698c7b0b ] Enabling DX_DEBUG triggers the build error below. info is an attribute of the dxroot structure. linux/fs/ext4/namei.c:2264:12: error: ‘info’ undeclared (first use in this function); did you mean ‘insl’? info->indirect_levels)); Fixes: e08ac99f ("ext4: add largedir feature") Signed-off-by: NGabriel Krisman Bertazi <krisman@collabora.co.uk> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Reviewed-by: NLukas Czerner <lczerner@redhat.com> Signed-off-by: NSasha Levin <sashal@kernel.org>
-
- 06 11月, 2019 1 次提交
-
-
由 Theodore Ts'o 提交于
[ Upstream commit 6e589291f4b1b700ca12baec5930592a0d51e63c ] A malicious/clueless root user can use EXT4_IOC_SWAP_BOOT to force a corner casew which can lead to the file system getting corrupted. There's no usefulness to allowing this, so just prohibit this case. Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NSasha Levin <sashal@kernel.org>
-
- 08 10月, 2019 1 次提交
-
-
由 zhangyi (F) 提交于
[ Upstream commit 7727ae52975d4f4ef7ff69ed8e6e25f6a4168158 ] Remount process will release system zone which was allocated before if "noblock_validity" is specified. If we mount an ext4 file system to two mountpoints with default mount options, and then remount one of them with "noblock_validity", it may trigger a use after free problem when someone accessing the other one. # mount /dev/sda foo # mount /dev/sda bar User access mountpoint "foo" | Remount mountpoint "bar" | ext4_map_blocks() | ext4_remount() check_block_validity() | ext4_setup_system_zone() ext4_data_block_valid() | ext4_release_system_zone() | free system_blks rb nodes access system_blks rb nodes | trigger use after free | This problem can also be reproduced by one mountpint, At the same time, add_system_zone() can get called during remount as well so there can be racing ext4_data_block_valid() reading the rbtree at the same time. This patch add RCU to protect system zone from releasing or building when doing a remount which inverse current "noblock_validity" mount option. It assign the rbtree after the whole tree was complete and do actual freeing after rcu grace period, avoid any intermediate state. Reported-by: syzbot+1e470567330b7ad711d5@syzkaller.appspotmail.com Signed-off-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NSasha Levin <sashal@kernel.org>
-
- 05 10月, 2019 2 次提交
-
-
由 Theodore Ts'o 提交于
commit c1e8220bd316d8ae8e524df39534b8a412a45d5e upstream. If a program attempts to punch a hole on an inline data file, we need to convert it to a normal file first. This was detected using ext4/032 using the adv configuration. Simple reproducer: mke2fs -Fq -t ext4 -O inline_data /dev/vdc mount /vdc echo "" > /vdc/testfile xfs_io -c 'truncate 33554432' /vdc/testfile xfs_io -c 'fpunch 0 1048576' /vdc/testfile umount /vdc e2fsck -fy /dev/vdc Cc: stable@vger.kernel.org Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Rakesh Pandit 提交于
commit e3d550c2c4f2f3dba469bc3c4b83d9332b4e99e1 upstream. Really enable warning when CONFIG_EXT4_DEBUG is set and fix missing first argument. This was introduced in commit ff95ec22 ("ext4: add warning to ext4_convert_unwritten_extents_endio") and splitting extents inside endio would trigger it. Fixes: ff95ec22 ("ext4: add warning to ext4_convert_unwritten_extents_endio") Signed-off-by: NRakesh Pandit <rakesh@tuxera.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 16 9月, 2019 4 次提交
-
-
由 Colin Ian King 提交于
[ Upstream commit fbbbbd2f28aec991f3fbc248df211550fbdfd58c ] There are two cases where u32 variables n and err are being checked for less than zero error values, the checks is always false because the variables are not signed. Fix this by making the variables ints. Addresses-Coverity: ("Unsigned compared against 0") Fixes: 345c0dbf3a30 ("ext4: protect journal inode's blocks using block_validity") Signed-off-by: NColin Ian King <colin.king@canonical.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NSasha Levin <sashal@kernel.org>
-
由 Theodore Ts'o 提交于
[ Upstream commit 170417c8c7bb2cbbdd949bf5c443c0c8f24a203b ] Commit 345c0dbf3a30 ("ext4: protect journal inode's blocks using block_validity") failed to add an exception for the journal inode in ext4_check_blockref(), which is the function used by ext4_get_branch() for indirect blocks. This caused attempts to read from the ext3-style journals to fail with: [ 848.968550] EXT4-fs error (device sdb7): ext4_get_branch:171: inode #8: block 30343695: comm jbd2/sdb7-8: invalid block Fix this by adding the missing exception check. Fixes: 345c0dbf3a30 ("ext4: protect journal inode's blocks using block_validity") Reported-by: NArthur Marsh <arthur.marsh@internode.on.net> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NSasha Levin <sashal@kernel.org>
-
由 Theodore Ts'o 提交于
[ Upstream commit 0a944e8a6c66ca04c7afbaa17e22bf208a8b37f0 ] Since the journal inode is already checked when we added it to the block validity's system zone, if we check it again, we'll just trigger a failure. This was causing failures like this: [ 53.897001] EXT4-fs error (device sda): ext4_find_extent:909: inode #8: comm jbd2/sda-8: pblk 121667583 bad header/extent: invalid extent entries - magic f30a, entries 8, max 340(340), depth 0(0) [ 53.931430] jbd2_journal_bmap: journal block not found at offset 49 on sda-8 [ 53.938480] Aborting journal on device sda-8. ... but only if the system was under enough memory pressure that logical->physical mapping for the journal inode gets pushed out of the extent cache. (This is why it wasn't noticed earlier.) Fixes: 345c0dbf3a30 ("ext4: protect journal inode's blocks using block_validity") Reported-by: NDan Rue <dan.rue@linaro.org> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Tested-by: NNaresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: NSasha Levin <sashal@kernel.org>
-
由 Theodore Ts'o 提交于
[ Upstream commit 345c0dbf3a30872d9b204db96b5857cd00808cae ] Add the blocks which belong to the journal inode to block_validity's system zone so attempts to deallocate or overwrite the journal due a corrupted file system where the journal blocks are also claimed by another inode. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202879Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: NSasha Levin <sashal@kernel.org>
-