- July 5, 2022 (1 commit)
-
-
Submitted by Ritesh Harjani

stable inclusion
from stable-v5.10.110
commit 69d2421b5527609387998d76cabd8d700c2f23c1
bugzilla: https://gitee.com/openeuler/kernel/issues/I574AL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=69d2421b5527609387998d76cabd8d700c2f23c1

--------------------------------

[ Upstream commit a5c0e2fd ]

ext4_mb_mark_bb() currently miscalculates the cluster length (clen) and
flex_group->free_clusters. This patch fixes that.

Identified by code review of the ext4_mb_mark_bb() function.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/a0b035d536bafa88110b74456853774b64c8ac40.1644992609.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yu Liao <liaoyu15@huawei.com>
Reviewed-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
- June 21, 2022 (2 commits)
-
-
Submitted by Baokun Li

hulk inclusion
category: bugfix
bugzilla: 186777, https://gitee.com/openeuler/kernel/issues/I5C568
CVE: NA

--------------------------------

ext4_mb_normalize_request() can move the logical start of the allocated
blocks to reduce fragmentation and better utilize preallocation. However,
the logical block requested as the start of the allocation
(ac->ac_o_ex.fe_logical) should always be covered by the allocated
blocks, so check for that by changing "and" to "or" in the assertion.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
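A rough illustration of the one-character fix described above, inferred from the commit text rather than the verbatim diff: the two range violations were previously AND-ed together, a condition that can never be true, so the assertion was dead code.

    /* Sketch inferred from the commit text, not the verbatim diff.
     * fe_logical must lie inside [start, start + size); a block cannot
     * be both below and above that window, so "&&" made the check
     * impossible to trigger. */
    BUG_ON(start + size <= ac->ac_o_ex.fe_logical ||    /* was: && */
           start > ac->ac_o_ex.fe_logical);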
-
Submitted by Baokun Li

hulk inclusion
category: bugfix
bugzilla: 186777, https://gitee.com/openeuler/kernel/issues/I5C568
CVE: NA

--------------------------------

Hulk Robot reported a BUG_ON:

==================================================================
kernel BUG at fs/ext4/mballoc.c:3211!
[...]
RIP: 0010:ext4_mb_mark_diskspace_used.cold+0x85/0x136f
[...]
Call Trace:
 ext4_mb_new_blocks+0x9df/0x5d30
 ext4_ext_map_blocks+0x1803/0x4d80
 ext4_map_blocks+0x3a4/0x1a10
 ext4_writepages+0x126d/0x2c30
 do_writepages+0x7f/0x1b0
 __filemap_fdatawrite_range+0x285/0x3b0
 file_write_and_wait_range+0xb1/0x140
 ext4_sync_file+0x1aa/0xca0
 vfs_fsync_range+0xfb/0x260
 do_fsync+0x48/0xa0
[...]
==================================================================

The above issue may happen as follows:

do_fsync
 vfs_fsync_range
  ext4_sync_file
   file_write_and_wait_range
    __filemap_fdatawrite_range
     do_writepages
      ext4_writepages
       mpage_map_and_submit_extent
        mpage_map_one_extent
         ext4_map_blocks
          ext4_mb_new_blocks
           ext4_mb_normalize_request
            >>> start + size <= ac->ac_o_ex.fe_logical
           ext4_mb_regular_allocator
            ext4_mb_simple_scan_group
             ext4_mb_use_best_found
              ext4_mb_new_preallocation
               ext4_mb_new_inode_pa
                ext4_mb_use_inode_pa
                 >>> set ac->ac_b_ex.fe_len <= 0
          ext4_mb_mark_diskspace_used
           >>> BUG_ON(ac->ac_b_ex.fe_len <= 0);

The problem is easy to reproduce with the following commands:

  fallocate -l100M disk
  mkfs.ext4 -b 1024 -g 256 disk
  mount disk /mnt
  fsstress -d /mnt -l 0 -n 1000 -p 1

The size must be smaller than or equal to EXT4_BLOCKS_PER_GROUP, so
"start + size <= ac->ac_o_ex.fe_logical" can occur when the size is
truncated. Therefore, start should be the start position of the group
in which ac_o_ex.fe_logical is located, after alignment. In addition,
when the value of fe_logical or EXT4_BLOCKS_PER_GROUP is very large,
the value calculated via start_off is more accurate.

Fixes: cd648b8a ("ext4: trim allocation requests to group size")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
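A minimal sketch of the realignment the message describes; the names come from the commit text and the rounddown() formulation is illustrative, not the verbatim upstream diff:

    /* Illustrative sketch, not the verbatim fix: once the window has
     * been truncated to one group, realign its start to the group
     * containing the originally requested logical block, so that
     * start <= ac->ac_o_ex.fe_logical < start + size still holds. */
    if (size > EXT4_BLOCKS_PER_GROUP(ac->ac_sb)) {
            size = EXT4_BLOCKS_PER_GROUP(ac->ac_sb);
            start = rounddown(ac->ac_o_ex.fe_logical, size);
    }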
-
- May 17, 2022 (1 commit)
-
-
Submitted by Xin Yin

stable inclusion
from stable-v5.10.99
commit 6c5bd55e36d3bdcbd723902b29bcf083e5592c6f
bugzilla: https://gitee.com/openeuler/kernel/issues/I55O7H
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=6c5bd55e36d3bdcbd723902b29bcf083e5592c6f

--------------------------------

commit 31a074a0 upstream.

Currently in ext4_mb_new_blocks_simple(), when we find a block that
should be excluded, we switch to the next group, which may cause
'group' to run out of range. Change this to check the next block in
the same group when the current block should be excluded. Also change
the search range to EXT4_CLUSTERS_PER_GROUP and add error checking.

Signed-off-by: Xin Yin <yinxin.x@bytedance.com>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20220110035141.1980-3-yinxin.x@bytedance.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yu Liao <liaoyu15@huawei.com>
Reviewed-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
- April 27, 2022 (2 commits)
-
-
Submitted by Chunguang Xu

stable inclusion
from stable-v5.10.94
commit f9ed0ea0a9fc59de71b230ff02f59a51fd174ca7
bugzilla: https://gitee.com/openeuler/kernel/issues/I531X9
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=f9ed0ea0a9fc59de71b230ff02f59a51fd174ca7

--------------------------------

commit 8c80fb31 upstream.

We found on an older kernel (3.10) that in a scenario of insufficient
disk space, the system may trigger an ABBA deadlock. It seems that this
problem still exists in the latest kernel; try to fix it here.

The main flow triggered by this problem is that task A occupies the PA
and waits for the jbd2 transaction to finish, the jbd2 transaction
waits for the completion of task B's IO (plug_list), but task B waits
for task A to release the PA so it can finish the discard, which
indirectly forms an ABBA deadlock. The related call trace is as
follows:

  Task A
  vfs_write
   ext4_mb_new_blocks()
    ext4_mb_mark_diskspace_used()       JBD2
     jbd2_journal_get_write_access() -> jbd2_journal_commit_transaction()
                                         -> schedule()
                                        filemap_fdatawait()
  |
  | Task B
  |  do_unlinkat()
  |   ext4_evict_inode()
  |    jbd2_journal_begin_ordered_truncate()
  |     filemap_fdatawrite_range()
  |      ext4_mb_new_blocks()
  |       ext4_mb_discard_group_preallocations() <-----

Here, cancel the internal retry in
ext4_mb_discard_group_preallocations() when a PA is busy, and instead
do a limited number of retries inside ext4_mb_discard_preallocations().
This circumvents the problem above and also has some advantages:

1. Since the PA is in a busy state, if other groups have free PAs,
   keeping the current PA may help reduce fragmentation.
2. Continue to traverse forward instead of waiting for the current
   group's PA to be released. In most scenarios, the PA discard time
   can be reduced.

However, with little free space, if only a few groups have space, the
repeated traversal of the groups may increase CPU overhead. In
contrast, the overall benefit still seems better than the cost.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/1637630277-23496-1-git-send-email-brookxu.cn@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
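A hedged sketch of the bounded retry described above, following the commit text; the exact retry count and the &busy out-parameter of the group-level helper are assumptions:

    /* Sketch of the bounded retry: instead of spinning inside one
     * group while its PAs are busy, make at most a couple of extra
     * passes over all groups. */
    static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
    {
            ext4_group_t i, ngroups = ext4_get_groups_count(sb);
            int ret, freed = 0, busy = 0, retry = 0;

    repeat:
            for (i = 0; i < ngroups && needed > 0; i++) {
                    ret = ext4_mb_discard_group_preallocations(sb, i, &busy);
                    freed += ret;
                    needed -= ret;
                    cond_resched();
            }
            if (needed > 0 && busy && ++retry < 3) {
                    busy = 0;
                    goto repeat;    /* bounded re-scan, no busy-wait */
            }
            return freed;
    }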
-
Submitted by Jan Kara

stable inclusion
from stable-v5.10.94
commit f871cd8ee0f02ad7b00f9c6f326b3d6d2c386535
bugzilla: https://gitee.com/openeuler/kernel/issues/I531X9
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=f871cd8ee0f02ad7b00f9c6f326b3d6d2c386535

--------------------------------

[ Upstream commit 173b6e38 ]

A user reported the FITRIM ioctl failing for him on ext4 on some
devices without apparent reason. After some debugging we found out
that these devices (LVM volumes) report a rather large discard
granularity of 42MB, while the filesystem had a 1k block size and thus
a group size of 8MB. Because the ext4 FITRIM implementation puts the
discard granularity into minlen, ext4_trim_fs() declared the trim
request invalid. However, silently doing nothing is a more appropriate
reaction to such a combination of parameters, since the user did not
specify anything wrong.

CC: Lukas Czerner <lczerner@redhat.com>
Fixes: 5c2ed62f ("ext4: Adjust minlen with discard_granularity in the FITRIM ioctl")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211112152202.26614-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
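A minimal sketch of the behavior change in ext4_trim_fs(), assuming the surrounding variable names; the real patch may differ in detail:

    /* Sketch: when minlen came purely from the device's discard
     * granularity and exceeds what a single group can hold, return
     * success without trimming instead of -EINVAL.
     * discard_granularity and the out label are assumed names. */
    if (range->minlen < discard_granularity) {
            minlen = EXT4_NUM_B2C(EXT4_SB(sb),
                                  discard_granularity >> sb->s_blocksize_bits);
            if (minlen > EXT4_CLUSTERS_PER_GROUP(sb))
                    goto out;   /* nothing trimmable at this granularity */
    }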
-
- October 13, 2021 (1 commit)
-
-
Submitted by Stephen Brennan

stable inclusion
from stable-5.10.50
commit aa07327083b5ee2d53abe9e700bed1940bcecdba
bugzilla: 174522 https://gitee.com/openeuler/kernel/issues/I4DNFY
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=aa07327083b5ee2d53abe9e700bed1940bcecdba

--------------------------------

commit cd84bbba upstream.

Commit 5d1b1b3f ("ext4: fix BUG when calling ext4_error with locked
block group") introduced ext4_grp_locked_error() to handle unlocking a
group in error cases; otherwise, there is a possibility of sleeping
while atomic. However, since 43c73221 ("ext4: replace BUG_ON with
WARN_ON in mb_find_extent()"), mb_find_extent() has contained an
ext4_error() call while a group spinlock is held. Replace it with
ext4_grp_locked_error().

Fixes: 43c73221 ("ext4: replace BUG_ON with WARN_ON in mb_find_extent()")
Cc: <stable@vger.kernel.org> # 4.14+
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Link: https://lore.kernel.org/r/20210623232114.34457-1-stephen.s.brennan@oracle.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
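The shape of the replacement, sketched; the message text and argument list are illustrative, not the verbatim diff:

    /* Sketch: report corruption without sleeping under the group
     * spinlock; ext4_grp_locked_error() defers the heavy error
     * handling until the group is unlocked. */
    ext4_grp_locked_error(sb, e4b->bd_group, 0, 0,
                          "corruption or bug in mb_find_extent: block=%d, order=%d, needed=%d",
                          block, order, needed);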
-
- June 15, 2021 (1 commit)
-
-
Submitted by Phillip Potter

stable inclusion
from stable-5.10.43
commit 2050c6e5b161e5e25ce3c420fef58b24fa388a49
bugzilla: 109284
CVE: NA

--------------------------------

commit a8867f4e upstream.

Fix a memory leak discovered by syzbot when a file system is corrupted
with an illegally large s_log_groups_per_flex.

Reported-by: syzbot+aa12d6106ea4ca1b6aae@syzkaller.appspotmail.com
Signed-off-by: Phillip Potter <phil@philpotter.co.uk>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20210412073837.1686-1-phil@philpotter.co.uk
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
- April 19, 2021 (1 commit)
-
-
Submitted by Sabyrzhan Tasbolatov

stable inclusion
from stable-5.10.27
commit df61d3cff422433527d3bc388f69484f0781d226
bugzilla: 51493

--------------------------------

commit f91436d5 upstream.

syzbot found a UBSAN shift-out-of-bounds in ext4_mb_init [1], when
1 << sbi->s_es->s_log_groups_per_flex is bigger than UINT_MAX, where
sbi->s_mb_prefetch is of unsigned integer type.

32 is the maximum allowed power of s_log_groups_per_flex. The
following if check would also trigger a UBSAN shift-out-of-bounds:

	if (1 << sbi->s_es->s_log_groups_per_flex >= UINT_MAX) {

So the check is made against the raw number; perhaps there is another
way to calculate the UINT_MAX maximum power. Also use min_t to make
sure the result is of uint type.

[1]
UBSAN: shift-out-of-bounds in fs/ext4/mballoc.c:2713:24
shift exponent 60 is too large for 32-bit type 'int'
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x137/0x1be lib/dump_stack.c:120
 ubsan_epilogue lib/ubsan.c:148 [inline]
 __ubsan_handle_shift_out_of_bounds+0x432/0x4d0 lib/ubsan.c:395
 ext4_mb_init_backend fs/ext4/mballoc.c:2713 [inline]
 ext4_mb_init+0x19bc/0x19f0 fs/ext4/mballoc.c:2898
 ext4_fill_super+0xc2ec/0xfbe0 fs/ext4/super.c:4983

Reported-by: syzbot+a8b4b0c60155e87e9484@syzkaller.appspotmail.com
Signed-off-by: Sabyrzhan Tasbolatov <snovitoll@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210224095800.3350002-1-snovitoll@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
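A sketch of the guard, following the description above; the error path is an assumption for illustration:

    /* Sketch: reject the raw exponent before any 32-bit shift can
     * overflow; s_mb_prefetch is a uint, so 32 is the hard ceiling.
     * Returning -EINVAL here is illustrative. */
    if (sbi->s_es->s_log_groups_per_flex >= 32) {
            ext4_msg(sb, KERN_ERR,
                     "too many log groups per flexible block group");
            return -EINVAL;
    }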
-
- January 18, 2021 (1 commit)
-
-
Submitted by Chunguang Xu

stable inclusion
from stable-5.10.5
commit aceb8ae8e3b10503a2b82b17f626c9278fe792b4
bugzilla: 46931

--------------------------------

[ Upstream commit 82ef1370 ]

Commit cfd73237 ("ext4: add prefetching for block allocation bitmaps")
introduced block bitmap prefetch, and expects to read the block
bitmaps of a flex_bg through one IO. However, it seems to ignore the
value range of s_log_groups_per_flex. In a scenario where the value of
s_log_groups_per_flex is greater than 27, s_mb_prefetch or
s_mb_prefetch_limit will overflow, causing a divide-by-zero exception.

In addition, the logic for calculating nr is also flawed, because the
size of a flexbg is fixed during a single mount while s_mb_prefetch
can be modified, which allows nr to fall outside the valid range of
[1, flexbg_size].

To solve this problem, we need to set an upper limit on s_mb_prefetch.
Since we expect to load the block bitmaps of a flex_bg through one IO,
we can determine a reasonable upper limit among the IO limit
parameters. After consideration, we chose BLK_MAX_SEGMENT_SIZE. This
is a good choice for solving the divide-by-zero problem while avoiding
performance degradation.

[ Some minor code simplifications to make the changes easy to follow -- TYT ]

Reported-by: Tosk Robot <tencent_os_robot@tencent.com>
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Samuel Liao <samuelliao@tencent.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/1607051143-24508-1-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
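A hedged sketch of the clamp the message describes; the exact conversion expression is an assumption consistent with the stated choice of BLK_MAX_SEGMENT_SIZE as the single-IO ceiling:

    /* Sketch: cap s_mb_prefetch so one flex_bg's bitmaps fit in a
     * single IO, using BLK_MAX_SEGMENT_SIZE as the ceiling; the
     * shift-based unit conversion is assumed from the commit text. */
    sbi->s_mb_prefetch = min_t(uint, 1 << sbi->s_es->s_log_groups_per_flex,
                               BLK_MAX_SEGMENT_SIZE >> (sb->s_blocksize_bits - 9));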
-
- January 12, 2021 (1 commit)
-
-
Submitted by Chunguang Xu

stable inclusion
from stable-5.10.4
commit d28f60699590877f940f675717eafbfb30299b26
bugzilla: 46903

--------------------------------

commit cca41553 upstream.

When freeing metadata, we create an ext4_free_data and insert it into
the pending free list. After the current transaction is committed, the
object is freed. ext4_mb_free_metadata() checks whether the area to be
freed overlaps with the pending free list; if true, it returns
directly. At that point the ext4_free_data is leaked. Fortunately, the
probability of this problem is small, since it only occurs if the file
system is corrupted such that a block is claimed by more than one
inode and those inodes are deleted within a single jbd2 transaction.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Link: https://lore.kernel.org/r/1604764698-4269-8-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
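The shape of the fix, sketched from the description; the cache name matches mballoc's existing ext4_free_data_cachep, and the surrounding overlap check is elided:

    /* Sketch: in ext4_mb_free_metadata(), when the range to free
     * overlaps an entry already on the pending list, release the
     * freshly allocated node before returning instead of leaking it. */
    kmem_cache_free(ext4_free_data_cachep, new_entry);
    return 0;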
-
- November 7, 2020 (2 commits)
-
-
Submitted by Harshad Shirwadkar

Fast commit file system states are recorded in sbi->s_mount_flags.
Fast commit expects these bit manipulations to be atomic. This patch
adds helpers to make those modifications atomic.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201106035911.1942128-21-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
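What such helpers typically look like, sketched under the assumption that s_mount_flags is (or becomes) an unsigned long suitable for the generic atomic bitops:

    /* Sketch of atomic flag helpers; field type is assumed. */
    static inline void ext4_set_mount_flag(struct super_block *sb, int bit)
    {
            set_bit(bit, &EXT4_SB(sb)->s_mount_flags);
    }

    static inline void ext4_clear_mount_flag(struct super_block *sb, int bit)
    {
            clear_bit(bit, &EXT4_SB(sb)->s_mount_flags);
    }

    static inline int ext4_test_mount_flag(struct super_block *sb, int bit)
    {
            return test_bit(bit, &EXT4_SB(sb)->s_mount_flags);
    }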
-
Submitted by Dan Carpenter

Smatch complains that "i" can be uninitialized if we don't enter the
loop. I don't know if that is possible, but we may as well silence the
warning.

[ Initialize i to sb->s_blocksize instead of 0. The only way the for
  loop could be skipped entirely is if the in-memory data structures,
  in particular the bh->b_data for the on-disk superblock, have gotten
  corrupted enough that the calculated value of group is >=
  ext4_get_groups_count(sb). In that case, we want to exit immediately
  without allocating a block. -- TYT ]

Fixes: 8016e29f ("ext4: fast commit recovery path")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: https://lore.kernel.org/r/20201030114620.GB3251003@mwanda
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
-
- October 22, 2020 (1 commit)
-
-
Submitted by Harshad Shirwadkar

This patch adds fast commit recovery path support for the ext4 file
system. We add several helper functions that are similar in spirit to
the e2fsprogs journal recovery path handlers. Examples of such
functions include a simple block allocator and an idempotent block
bitmap update function. Using these routines and the fast commit log
in the fast commit area, the recovery path (ext4_fc_replay()) performs
fast commit log recovery.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201015203802.3597742-8-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
- October 18, 2020 (5 commits)
-
-
Submitted by Chunguang Xu

Make bb_check_counter per group, so each group has the same chance of
being checked, which can expose errors more easily.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Link: https://lore.kernel.org/r/1601292995-32205-2-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Submitted by Chunguang Xu

The comment near mb_buddy_adjust_border seems meaningless; just remove
it.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Link: https://lore.kernel.org/r/1601292995-32205-1-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Submitted by Randy Dunlap

Delete repeated words in fs/ext4/: {the, this, of, we, after}.
Also change the spelling of "xttr" in inline.c to "xattr" in 2 places.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200805024850.12129-1-rdunlap@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Submitted by Jan Kara

ext4_mb_discard_group_preallocations() can release the group lock with
preallocations still accumulated on its local list. Thus, although
discard_pa_seq was incremented and concurrent allocating processes
will be retrying their allocations, it can happen that a premature
ENOSPC error is returned, because blocks used for preallocations are
not yet available for reuse. Make sure we always free locally
accumulated preallocations before releasing the group lock.

Fixes: 07b5b8e1 ("ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200924150959.4335-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
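The ordering constraint, sketched as a simplified skeleton of the discard loop; the helper calls follow mballoc's existing naming, with details elided:

    /* Simplified sketch of the required ordering inside
     * ext4_mb_discard_group_preallocations(): return the collected PAs
     * to the buddy while the group lock is still held, so a racing
     * allocator that saw the discard_pa_seq bump can actually find the
     * blocks. */
    ext4_lock_group(sb, group);
    list_for_each_entry_safe(pa, tmp, &list, u.pa_tmp_list) {
            if (pa->pa_type == MB_GROUP_PA)
                    ext4_mb_release_group_pa(&e4b, pa);
            else
                    ext4_mb_release_inode_pa(&e4b, bitmap_bh, pa);
            list_del(&pa->u.pa_tmp_list);
            call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
    }
    ext4_unlock_group(sb, group);   /* only now drop the lock */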
-
Submitted by Ye Bin

While testing disk offline/online with fsstress running, we found the
fsstress process stuck in a running state:

kworker/u32:3-262 [004] ...1 140.787471: ext4_mb_discard_preallocations: dev 8,32 needed 114
....
kworker/u32:3-262 [004] ...1 140.787471: ext4_mb_discard_preallocations: dev 8,32 needed 114

ext4_mb_new_blocks
repeat:
  ext4_mb_discard_preallocations_should_retry(sb, ac, &seq)
    freed = ext4_mb_discard_preallocations
      ext4_mb_discard_group_preallocations
        this_cpu_inc(discard_pa_seq);
    ---> freed == 0
    seq_retry = ext4_get_discard_pa_seq_sum
      for_each_possible_cpu(__cpu)
        __seq += per_cpu(discard_pa_seq, __cpu);
    if (seq_retry != *seq) {
      *seq = seq_retry;
      ret = true;
    }

As we can see, seq_retry is the sum of discard_pa_seq over every CPU.
Even if ext4_mb_discard_group_preallocations() returns zero,
discard_pa_seq on this CPU may still have been incremented by one, so
the condition "seq_retry != *seq" is always met. Ritesh Harjani
suggested that ext4_mb_discard_group_preallocations() only increase
discard_pa_seq when there actually is some PA to free.

Fixes: 07b5b8e1 ("ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200916113859.1556397-3-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
- August 20, 2020 (3 commits)
-
-
Submitted by brookxu

In the scenario of writing sparse files, the per-inode prealloc list
may become very long, resulting in high overhead in
ext4_mb_use_preallocated(). To circumvent this problem, we limit the
maximum length of the per-inode prealloc list to 512 and allow users
to modify it.

After patching, we observed that the sys CPU ratio dropped and system
throughput increased significantly. We created a process to write a
sparse file, and its running time was significantly reduced on the
fixed kernel, as follows:

Running time on the unfixed kernel:
[root@TENCENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat
real	0m2.051s
user	0m0.008s
sys	0m2.026s

Running time on the fixed kernel:
[root@TENCENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat
real	0m0.471s
user	0m0.004s
sys	0m0.395s

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Link: https://lore.kernel.org/r/d7a98178-056b-6db5-6bce-4ead23f4a257@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
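A hedged sketch of how such a cap can be enforced; the field and helper names are assumptions patterned on mballoc conventions, not the verbatim patch:

    /* Sketch: after adding a new inode PA, trim the list back toward
     * the configured cap (default 512). Names are assumptions. */
    static void ext4_mb_trim_inode_pa(struct inode *inode)
    {
            struct ext4_inode_info *ei = EXT4_I(inode);
            struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
            int count, delta;

            count = atomic_read(&ei->i_prealloc_active);
            delta = (sbi->s_mb_max_inode_prealloc >> 2) + 1;
            if (count > sbi->s_mb_max_inode_prealloc + delta) {
                    count -= sbi->s_mb_max_inode_prealloc;
                    ext4_discard_preallocations(inode, count);
            }
    }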
-
Submitted by brookxu

Reorganize the if statement of ext4_mb_release_context() to make it
easier to read.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Link: https://lore.kernel.org/r/5439ac6f-db79-ad68-76c1-a4dda9aa0cc3@gmail.com
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Submitted by brookxu

Lost chunks occur when some other process races with the current
thread to grab a particular block allocation. Add an mb_debug log for
developers who want to see how often this happens for a particular
workload.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Link: https://lore.kernel.org/r/0a165ac0-1912-aebd-8a0d-b42e7cd1aea1@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
- August 19, 2020 (4 commits)
-
-
Submitted by Xu Wang

seq_puts is a lot cheaper than seq_printf, so use it to print literal
strings.

Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200810022158.9167-1-vulab@iscas.ac.cn
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Submitted by brookxu

It might be better to adjust the code in two places:

1. Determining whether grp is corrupt or not should come first.
2. The check (cr <= 2 && free < ac->ac_g_ex.fe_len) belongs to the
   per-cr strategy, so it is more appropriate in the subsequent switch
   statement block. For cr1 and cr2, the conditions in the switch
   already realize the above judgment. For cr0, we should add the
   (free < ac->ac_g_ex.fe_len) judgment and then delete
   ((free / fragments) >= ac->ac_g_ex.fe_len), because cr0 returns
   true by default.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/e20b2d8f-1154-adb7-3831-a9e11ba842e9@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Submitted by brookxu

These comments do not seem to be related to ext4_mb_check_limits() and
may be invalid.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/c49faf0c-d5d5-9c51-6911-9e0ff57c6bfa@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Submitted by brookxu

Fix typos in the ext4_mb_regular_allocator() comment.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/d6514145-73b3-808b-ec5a-a8be27c51f9c@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
- August 8, 2020 (3 commits)
-
-
Submitted by Jan Kara

Currently, system zones just track ranges of blocks that are
"important" fs metadata (bitmaps, group descriptors, journal blocks,
etc.). This however complicates how the extent tree (or indirect
blocks) can be checked for inodes that actually track such metadata -
currently the journal inode, but arguably we should be treating quota
files or the resize inode similarly. We cannot run __ext4_ext_check()
on such metadata inodes when loading their extents, as that would
immediately trigger the validity checks, so we just hack around that
and special-case the journal inode. This however leads to a situation
where a journal inode with an extent tree of depth at least one can
have an invalid extent tree that goes unnoticed until
ext4_cache_extents() crashes.

To overcome this limitation, track the inode number each system zone
belongs to (0 is used for zones not belonging to any inode). We can
then verify that the inode number matches the expected one when
verifying the extent tree, and thus avoid the false errors. With this
there is no need to special-case the journal inode during extent tree
checking anymore, so remove it.

Fixes: 0a944e8a ("ext4: don't perform block validity checks on the journal inode")
Reported-by: Wolfgang Frisch <wolfgang.frisch@suse.com>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200728130437.7804-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Submitted by brookxu

Delete the invalid BUG_ON in ext4_mb_load_buddy_gfp(); the preceding
code has already checked whether page is NULL.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/ad68e8a2-5ec3-5beb-537f-f3e53f55367a@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Submitted by Theodore Ts'o

For file systems where we can afford to keep the buddy bitmaps cached,
we can speed up initial writes to large file systems by starting to
load the block allocation bitmaps as soon as the file system is
mounted. This won't work well for _super_ large file systems or
memory-constrained systems, so we only enable this when it is
requested via a mount option.

Addresses-Google-Bug: 159488342
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
-
- August 6, 2020 (3 commits)
-
-
Submitted by Alex Zhuravlev

cr=0 is supposed to be an optimization to save CPU cycles, but if
buddy data (in memory) is not initialized, then all this makes no
sense, as we have to do sync IO taking a lot of cycles. Also, at cr=0
mballoc doesn't choose any available chunk. cr=1 also skips groups
using a heuristic based on the average fragment size. It's more useful
to skip such groups and switch to cr=2, where groups will be scanned
for available chunks. However, we always read the first block group in
a flex_bg, so metadata blocks will get read into the first flex_bg if
possible.

A sparse image on a dm-slow virtual device of 120TB was simulated;
then the image was formatted and filled using debugfs to mark ~85% of
the available space as busy. The mount process without the patch
couldn't complete in half an hour (according to vmstat it would take
~10-11 hours). With the patch applied, mount took ~20 seconds.

Lustre-bug-id: https://jira.whamcloud.com/browse/LU-12988
Signed-off-by: Alex Zhuravlev <azhuravlev@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Artem Blagodarenko <artem.blagodarenko@gmail.com>
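A hedged sketch of the skip heuristic; the flex_bg first-group test is written out from the description, and the exact placement inside the good-group check is assumed:

    /* Sketch: during the optimistic cr=0/1 passes, don't pay
     * synchronous IO to initialize a group's buddy data, except for
     * the first group of a flex_bg (kept so metadata lands in the
     * first flex_bg if possible). */
    if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
            if (cr < 2 &&
                (!sbi->s_log_groups_per_flex ||
                 (group & ((1 << sbi->s_log_groups_per_flex) - 1)) != 0))
                    return false;   /* revisit at cr=2 instead */
    }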
-
Submitted by Alex Zhuravlev

This should significantly improve bitmap loading, especially for flex
groups, as it tries to load all bitmaps within a flex group instead of
one by one synchronously. Prefetching is done in chunks of
8 * flex_bg groups, so it should amount to 8 read-ahead reads for a
single allocating thread. At the end of the allocation, the thread
waits for read-ahead completion and initializes the buddy information
so that the read-aheads are not lost in case of memory pressure.

At cr=0 the number of prefetching IOs is limited per allocation
context to prevent a situation where mballoc loads thousands of
bitmaps looking for a perfect group while ignoring groups with good
chunks.

Together with the patch "ext4: limit scanning of uninitialized
groups", the mount time (which includes a few tiny allocations) of a
1PB filesystem is reduced significantly:

                 0% full    50% full, unpatched    50% full, patched
  mount time         33s                  9279s                 563s

[ Restructured by tytso; removed the state flags in the allocation
  context, so it can be used to lazily prefetch the allocation bitmaps
  immediately after the file system is mounted. Skip prefetching block
  groups which are uninitialized. Finally, pass in the REQ_RAHEAD flag
  to the block layer while prefetching. ]

Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Submitted by brookxu

Fix spelling typos in ext4_mb_initialize_context.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/883b523c-58ec-7f38-0bb8-cd2ea4393684@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
- June 11, 2020 (1 commit)
-
-
Submitted by Ritesh Harjani

Simplify reading a seq variable by directly using the this_cpu_read
API instead of doing this_cpu_ptr and then dereferencing it. This also
avoids the below kernel BUG, which happens when CONFIG_DEBUG_PREEMPT
is enabled:

BUG: using smp_processor_id() in preemptible [00000000] code: syz-fuzzer/6927
caller is ext4_mb_new_blocks+0xa4d/0x3b70 fs/ext4/mballoc.c:4711
CPU: 1 PID: 6927 Comm: syz-fuzzer Not tainted 5.7.0-next-20200602-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x18f/0x20d lib/dump_stack.c:118
 check_preemption_disabled+0x20d/0x220 lib/smp_processor_id.c:48
 ext4_mb_new_blocks+0xa4d/0x3b70 fs/ext4/mballoc.c:4711
 ext4_ext_map_blocks+0x201b/0x33e0 fs/ext4/extents.c:4244
 ext4_map_blocks+0x4cb/0x1640 fs/ext4/inode.c:626
 ext4_getblk+0xad/0x520 fs/ext4/inode.c:833
 ext4_bread+0x7c/0x380 fs/ext4/inode.c:883
 ext4_append+0x153/0x360 fs/ext4/namei.c:67
 ext4_init_new_dir fs/ext4/namei.c:2757 [inline]
 ext4_mkdir+0x5e0/0xdf0 fs/ext4/namei.c:2802
 vfs_mkdir+0x419/0x690 fs/namei.c:3632
 do_mkdirat+0x21e/0x280 fs/namei.c:3655
 do_syscall_64+0x60/0xe0 arch/x86/entry/common.c:359
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 42f56b7a4a7d ("ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling")
Suggested-by: Borislav Petkov <bp@alien8.de>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reported-by: syzbot+82f324bb69744c5f6969@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/534f275016296996f54ecf65168bb3392b6f653d.1591699601.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
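The one-line simplification, sketched:

    /* this_cpu_read() is preemption-safe, unlike taking a
     * this_cpu_ptr() and dereferencing it in preemptible context,
     * which trips the CONFIG_DEBUG_PREEMPT smp_processor_id() check. */
    u64 seq = this_cpu_read(discard_pa_seq);
    /* was: u64 seq = *this_cpu_ptr(&discard_pa_seq); */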
-
- June 4, 2020 (7 commits)
-
-
Submitted by Ritesh Harjani

Currently, while doing block allocation, grp->bb_free may be getting
modified if a discard is happening in parallel. For example, consider
a case where many threads have preallocated lots of blocks and there
is a thread trying to discard all of this group's PAs. It could then
happen that we see the group's bb_free as zero and fail the
allocation, while there is sufficient space if we free up all the PAs.

So this patch adds another flag, "EXT4_MB_STRICT_CHECK", which is set
if we are unable to allocate any blocks in the first try (since we may
not have considered blocks about to be discarded from PA lists). During
the retry attempt to allocate blocks, we then hold ext4_lock_group()
while checking whether the group is good or not.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/9cb740a117c958c36596f167b12af1beae9a68b7.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Submitted by Ritesh Harjani

The ext4_mb_good_group() definition was changed some time back, and
now it even initializes the buddy cache (via ext4_mb_init_group()) if
EXT4_MB_GRP_NEED_INIT() is true for a group. Note that
ext4_mb_init_group() can sleep and so should not be called with a
spinlock held. This is fine as of now, because ext4_mb_good_group() is
called before loading the buddy bitmap without ext4_lock_group() held,
and again after loading the bitmap, only this time with
ext4_lock_group() held. But this whole arrangement is confusing.

So this patch refactors out ext4_mb_good_group_nolock(), which should
be called without holding ext4_lock_group(). Also, in further patches
we hold the spinlock (ext4_lock_group()) while doing any calculations
which involve grp->bb_free or grp->bb_fragments.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/d9f7d031a5fbe1c943fae6bf1ff5cdf0604ae722.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Submitted by Ritesh Harjani

There could be a race in ext4_mb_discard_group_preallocations(), where
the 1st thread may iterate through the group's bb_prealloc_list,
remove all the PAs, and add them to the function's local list head.
Now if the 2nd thread comes in to discard the group preallocations, it
will see that group->bb_prealloc_list is empty and will return 0.

Consider a case where we have few groups (e.g. just group 0): this may
even return an -ENOSPC error from ext4_mb_new_blocks() (which calls
ext4_mb_discard_group_preallocations()). But that is wrong, since the
2nd thread should have waited for the 1st thread to release all the
PAs and then retried the allocation, given that the 1st thread was
going to discard the PAs anyway.

The algorithm using this percpu seq counter goes as follows:

1. We sample the percpu discard_pa_seq counter before trying for block
   allocation in ext4_mb_new_blocks().
2. We increment this percpu discard_pa_seq counter when we either
   allocate or free these blocks, i.e. while marking those blocks as
   used/free in mb_mark_used()/mb_free_blocks().
3. We also increment this percpu seq counter when we successfully
   identify that bb_prealloc_list is not empty and hence proceed to
   discard those PAs inside ext4_mb_discard_group_preallocations().

Now, to make sure that the regular fast path of block allocation is
not affected, as a small optimization we only sample the percpu seq
counter on that cpu. Only when the block allocation fails and the
count of freed blocks is 0 do we sample the percpu seq counter for all
cpus, using the function ext4_get_discard_pa_seq_sum() (sketched
below). This happens after making sure that all the PAs on
grp->bb_prealloc_list got freed, or that the list is empty.

It can well be argued that we should just check grp->bb_free to see if
there are any free blocks to be allocated. So here are the two
concerns that were discussed:

1. If for some reason the blocks available in the group are not
   appropriate for the allocation logic (say e.g.
   EXT4_MB_HINT_GOAL_ONLY, although this is not yet implemented), then
   the retry logic may result in infinite looping, since grp->bb_free
   is non-zero.
2. Also, before preallocation was clubbed with block allocation under
   the same ext4_lock_group(), there were many races where
   grp->bb_free could not be reliably relied upon.

Due to the above, this patch uses the discard_pa_seq logic to
determine whether we should retry the block allocation. Say there are
n threads trying for block allocation and none of them could allocate
or discard any blocks; then all n threads will fail the block
allocation and return an -ENOSPC error (since the seq counter for all
of them will match, as no block allocation/discard was done during
that interval).

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/7f254686903b87c419d798742fd9a1be34f0657b.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
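A sketch of the counter and its all-CPU summation; the summation loop matches the pseudocode quoted in the later fix above, while the retry helper's exact shape is an assumption:

    /* Percpu sequence counter and its slow-path summation. */
    static DEFINE_PER_CPU(u64, discard_pa_seq);

    static inline u64 ext4_get_discard_pa_seq_sum(void)
    {
            int cpu;
            u64 seq = 0;

            for_each_possible_cpu(cpu)
                    seq += per_cpu(discard_pa_seq, cpu);
            return seq;
    }

    /* Retry decision, assumed shape: retry if we freed something, or
     * if any CPU's counter moved since we sampled before allocating. */
    static bool ext4_mb_discard_preallocations_should_retry(
                    struct super_block *sb,
                    struct ext4_allocation_context *ac, u64 *seq)
    {
            u64 seq_retry;

            if (ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len))
                    return true;
            seq_retry = ext4_get_discard_pa_seq_sum();
            if (seq_retry != *seq) {
                    *seq = seq_retry;
                    return true;
            }
            return false;
    }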
-
Submitted by Ritesh Harjani

Implement ext4_mb_discard_preallocations_should_retry(), which we will
need in later patches to add more logic, such as checking for a
sequence-number match to decide whether we should retry the block
allocation. There should be no functional change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/1cfae0098d2aa9afbeb59331401258182868c8f2.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Submitted by Ritesh Harjani

ext4_mb_discard_preallocations() only checks every group's
grp->bb_prealloc_list to discard the group's PAs and free up space
when an allocation request fails. Consider the race below:

Process A                               Process B
1. allocate blocks from
   ext4_mb_regular_allocator()
   ext4_lock_group()
   allocated blocks more
   than ac_o_ex.fe_len
   ext4_unlock_group()
                                        1. Fails block allocation from
                                           ext4_mb_regular_allocator()
                                        2. Scans grp->bb_prealloc_list
                                           (under ext4_lock_group()),
                                           finds nothing and thus
                                           returns -ENOSPC.
2. Adds the additional blocks
   to the PA list:
   ext4_lock_group()
   add blocks to grp->bb_prealloc_list
   ext4_unlock_group()

The above race can be avoided by adding those additional blocks to
grp->bb_prealloc_list at the same time as the block allocation, while
ext4_lock_group() is still held. With this, discard-PA will know if
there are actually any blocks which could be freed from the PA.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/a2217dd782585b42328981832e6d396abaaccb80.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
Submitted by Ritesh Harjani

The mb_debug() msg had only one control level for all types of msgs,
and if mballoc_debug was enabled then all of those msgs would be
enabled. Instead of adding multiple debug levels for mb_debug() msgs,
use pr_debug(), with which we get finer control and can enable msgs at
all the different levels (i.e. per file, func, line no.). Also add the
process name/pid, superblock id, and other info to the mb_debug() msg.
This also kills the mballoc_debug module parameter, since it is not
needed any more.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/f0c660cbde9e2edbe95c67942ca9ad80dd2231eb.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
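A hedged sketch of what such a macro can look like; the exact format string is an assumption built from the fields the message lists:

    /* Sketch: route mballoc debugging through pr_debug() so dynamic
     * debug can enable it per file/function/line; prefix with task and
     * superblock info. Format details assumed from the commit text. */
    #define mb_debug(sb, fmt, ...)                                      \
            pr_debug("[%s/%d] EXT4-fs (%s): (%s, %d): %s: " fmt,        \
                     current->comm, task_pid_nr(current), sb->s_id,     \
                     __FILE__, __LINE__, __func__, ##__VA_ARGS__)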
-
Submitted by Ritesh Harjani

Make sure to check for e4b->bd_info->bb_bitmap == NULL in
mb_cmp_bitmaps() and return if it is NULL, to avoid a possible NULL
pointer dereference, similar to how we do this in the other
ifdef DOUBLE_CHECK functions.

Also remove the BUG_ON() logic for when kmalloc() or
ext4_read_block_bitmap() fails: we should simply mark grp->bb_bitmap
as NULL if the above happens. In fact, ext4_read_block_bitmap() may
even return an error in case of a resize ioctl. Hence remove this
BUG_ON logic (fstests ext4/032 may trigger this).

Link: https://lore.kernel.org/r/9a54f8a696ff17c057cd571be3d15ac3ec1407f1.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-