1. 14 February 2020, 4 commits
  2. 25 January 2020, 5 commits
  3. 24 January 2020, 1 commit
    • ext4: fix extent_status fragmentation for plain files · 4068664e
      Committed by Dmitry Monakhov
      Extents are only cached in read_extent_tree_block(), which is never
      called for inodes whose extent tree has depth == 0; as a result, the
      extents of such inodes are not cached when we look them up with
      ext4_find_extent().  The result of the lookup is cached in
      ext4_map_blocks(), but it covers only a subset of the extent on disk.
      Consequently, the extent status cache can become very badly fragmented
      for certain workloads, such as a random 4k read workload.
      
      File size of /mnt/test is 33554432 (8192 blocks of 4096 bytes)
       ext:     logical_offset:        physical_offset: length:   expected: flags:
         0:        0..    8191:      40960..     49151:   8192:             last,eof
      
      $ perf record -e 'ext4:ext4_es_*' /root/bin/fio --name=t --direct=0 --rw=randread --bs=4k --filesize=32M --size=32M --filename=/mnt/test
      $ perf script | grep ext4_es_insert_extent | head -n 10
                   fio   131 [000]    13.975421:           ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [494/1) mapped 41454 status W
                   fio   131 [000]    13.975939:           ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [6064/1) mapped 47024 status W
                   fio   131 [000]    13.976467:           ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [6907/1) mapped 47867 status W
                   fio   131 [000]    13.976937:           ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [3850/1) mapped 44810 status W
                   fio   131 [000]    13.977440:           ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [3292/1) mapped 44252 status W
                   fio   131 [000]    13.977931:           ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [6882/1) mapped 47842 status W
                   fio   131 [000]    13.978376:           ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [3117/1) mapped 44077 status W
                   fio   131 [000]    13.978957:           ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [2896/1) mapped 43856 status W
                   fio   131 [000]    13.979474:           ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [7479/1) mapped 48439 status W
      
      Fix this by caching the extents for inodes with depth == 0 in
      ext4_find_extent().
      
      [ Renamed ext4_es_cache_extents() to ext4_cache_extents() since this
        newly added function is not in extents_status.c, and to avoid
        potential visual confusion with ext4_es_cache_extent().  -TYT ]
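
      For illustration only, here is a simplified sketch of the approach (not
      the patch itself; the helper in the patch may do more, e.g. also record
      the holes between extents).  The new helper walks the extents in a
      depth-0 inode's leaf header and pushes each of them into the extent
      status cache, and ext4_find_extent() calls it when the tree depth is 0:

      /* Sketch: cache every on-disk extent of a depth == 0 inode in the
       * extent status tree, so later lookups see whole extents instead of
       * the single-block entries that caused the fragmentation above. */
      static void ext4_cache_extents(struct inode *inode,
                                     struct ext4_extent_header *eh)
      {
              struct ext4_extent *ex = EXT_FIRST_EXTENT(eh);
              int i;

              for (i = le16_to_cpu(eh->eh_entries); i > 0; i--, ex++) {
                      unsigned int status = ext4_ext_is_unwritten(ex) ?
                              EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;

                      ext4_es_cache_extent(inode, le32_to_cpu(ex->ee_block),
                                           ext4_ext_get_actual_len(ex),
                                           ext4_ext_pblock(ex), status);
              }
      }
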
      Signed-off-by: Dmitry Monakhov <dmonakhov@gmail.com>
      Link: https://lore.kernel.org/r/20191106122502.19986-1-dmonakhov@gmail.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  4. 18 January 2020, 25 commits
  5. 15 January 2020, 1 commit
    • fs-verity: implement readahead of Merkle tree pages · fd39073d
      Committed by Eric Biggers
      When fs-verity verifies data pages, currently it reads each Merkle tree
      page synchronously using read_mapping_page().
      
      Therefore, when the Merkle tree pages aren't already cached, fs-verity
      causes an extra 4 KiB I/O request for every 512 KiB of data: with
      SHA-256 and 4 KiB blocks, one 4 KiB Merkle tree page holds 128 32-byte
      hashes and so covers 512 KiB of data.  This results in more I/O
      requests, and more performance loss, than is strictly necessary.
      
      To avoid this, implement readahead of the Merkle tree pages.
      
      For simplicity, we take advantage of the fact that the kernel already
      does readahead of the file's *data*, just like it does for any other
      file.  Due to this, we don't really need a separate readahead state
      (struct file_ra_state) just for the Merkle tree, but rather we just need
      to piggy-back on the existing data readahead requests.
      
      We also only really need to bother with the first level of the Merkle
      tree, since the usual fan-out factor is 128: each higher level is 1/128
      the size of the level below it, so normally over 99% of Merkle tree I/O
      requests are for the first level.
      
      Therefore, make fsverity_verify_bio() enable readahead of the first
      Merkle tree level, for up to 1/4 the number of pages in the bio, when it
      sees that the REQ_RAHEAD flag is set on the bio.  The readahead size is
      then passed down to ->read_merkle_tree_page() for the filesystem to
      (optionally) implement if it sees that the requested page is uncached.
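
      As a rough sketch of that calculation (the helper name below is
      invented for illustration; per the description above, the patch does
      this in fsverity_verify_bio() and hands the result down to
      ->read_merkle_tree_page()):

      #include <linux/bio.h>

      /* Illustrative sketch: how many Merkle tree pages the filesystem may
       * read ahead for a given data bio.  If the bio itself was issued as
       * readahead (REQ_RAHEAD), allow tree readahead of up to 1/4 of the
       * number of data pages in the bio; otherwise read on demand only. */
      static unsigned long fsverity_tree_ra_pages(const struct bio *bio)
      {
              if (bio->bi_opf & REQ_RAHEAD)
                      return bio->bi_iter.bi_size >> (PAGE_SHIFT + 2);
              return 0;
      }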
      
      While we're at it, also make build_merkle_tree_level() set the Merkle
      tree readahead size, since it's easy to do there.
      
      However, for now don't set the readahead size in fsverity_verify_page(),
      since currently it's only used to verify holes on ext4 and f2fs, and it
      would need parameters added to know how much to read ahead.
      
      This patch significantly improves fs-verity sequential read performance.
      Some quick benchmarks with 'cat'-ing a 250MB file after dropping caches:
      
          On an ARM64 phone (using sha256-ce):
              Before: 217 MB/s
              After: 263 MB/s
              (compare to sha256sum of non-verity file: 357 MB/s)
      
          In an x86_64 VM (using sha256-avx2):
              Before: 173 MB/s
              After: 215 MB/s
              (compare to sha256sum of non-verity file: 223 MB/s)
      
      Link: https://lore.kernel.org/r/20200106205533.137005-1-ebiggers@kernel.org
      Reviewed-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Eric Biggers <ebiggers@google.com>
  6. 10 January 2020, 1 commit
    • kunit: allow kunit tests to be loaded as a module · c475c77d
      Committed by Alan Maguire
      As tests are added to kunit, it will become less feasible to execute
      all built tests together.  By supporting modular tests we provide
      a simple way to do selective execution on a running system; specifying
      
      CONFIG_KUNIT=y
      CONFIG_KUNIT_EXAMPLE_TEST=m
      
      ...means we can simply "insmod example-test.ko" to run the tests.
      
      To achieve this we need to do the following:
      
      o export the required symbols in kunit
      o string-stream tests utilize non-exported symbols, so for now we skip
        building them when CONFIG_KUNIT_TEST=m.
      o drivers/base/power/qos-test.c contains a few unexported interface
        references, namely freq_qos_read_value() and freq_constraints_init().
        Both of these could potentially be defined as static inline functions
        in include/linux/pm_qos.h, but for now we simply avoid supporting a
        module build for that test suite.
      o support a new way of declaring test suites.  Because a module cannot
        do multiple late_initcall()s, we provide a kunit_test_suites() macro
        to declare multiple suites within the same module at once.
      o some test module names would have been too general ("test-test"
        and "example-test" for kunit tests, "inode-test" for ext4 tests);
        rename these as appropriate ("kunit-test", "kunit-example-test"
        and "ext4-inode-test" respectively).
      
      Also define kunit_test_suite() via kunit_test_suites(), since callers
      in other trees may still need the old definition.
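
      For illustration, a minimal single-suite test module using the new
      macro could look like the sketch below (the suite and test names are
      made up for this example; the declarations come from <kunit/test.h>):

      #include <kunit/test.h>
      #include <linux/module.h>

      /* Trivial example case; the names are illustrative only. */
      static void my_example_add_test(struct kunit *test)
      {
              KUNIT_EXPECT_EQ(test, 4, 2 + 2);
      }

      static struct kunit_case my_example_cases[] = {
              KUNIT_CASE(my_example_add_test),
              {}
      };

      static struct kunit_suite my_example_suite = {
              .name = "my-example",
              .test_cases = my_example_cases,
      };

      /* Registers the suite whether built in or built as a module, avoiding
       * the need for each test module to provide its own late_initcall(). */
      kunit_test_suites(&my_example_suite);

      MODULE_LICENSE("GPL");

      Built with CONFIG_KUNIT=y and the test's own Kconfig option set to m,
      such a module can then be loaded with insmod and its results read from
      the kernel log.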
      Co-developed-by: Knut Omang <knut.omang@oracle.com>
      Signed-off-by: Knut Omang <knut.omang@oracle.com>
      Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
      Reviewed-by: Brendan Higgins <brendanhiggins@google.com>
      Acked-by: Theodore Ts'o <tytso@mit.edu> # for ext4 bits
      Acked-by: David Gow <davidgow@google.com> # For list-test
      Reported-by: kbuild test robot <lkp@intel.com>
      Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
  7. 01 January 2020, 2 commits
  8. 27 December 2019, 1 commit
    • ext4: Optimize ext4 DIO overwrites · 8cd115bd
      Committed by Jan Kara
      Currently we start a journal transaction when mapping every extent for
      a direct IO write.  This is unnecessary when we know we are overwriting
      already-allocated blocks, and the overhead of starting a transaction
      can be significant, especially for multithreaded workloads doing small
      writes.  Use iomap operations that avoid starting a transaction for
      direct IO overwrites.
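
      As a hedged sketch of the check this relies on (paraphrasing ext4's
      existing overwrite test rather than quoting the patch): before taking
      the no-transaction path, the write range must lie entirely within
      i_size and map to already-written, allocated blocks.

      /* Sketch: return true only if [pos, pos + len) is inside i_size and is
       * fully mapped to written blocks, so a direct IO write may reuse them
       * without starting a journal transaction. */
      static bool ext4_overwrite_io(struct inode *inode, loff_t pos, loff_t len)
      {
              struct ext4_map_blocks map;
              unsigned int blkbits = inode->i_blkbits;
              int err, blklen;

              if (pos + len > i_size_read(inode))
                      return false;

              map.m_lblk = pos >> blkbits;
              map.m_len = EXT4_MAX_BLOCKS(len, pos, blkbits);
              blklen = map.m_len;

              /* Read-only lookup (no handle); the whole range must come back
               * mapped as written for this to count as an overwrite. */
              err = ext4_map_blocks(NULL, inode, &map, 0);
              return err == blklen && (map.m_flags & EXT4_MAP_MAPPED);
      }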
      
      This improves throughput of 4k random writes - fio jobfile:
      [global]
      rw=randrw
      norandommap=1
      invalidate=0
      bs=4k
      numjobs=16
      time_based=1
      ramp_time=30
      runtime=120
      group_reporting=1
      ioengine=psync
      direct=1
      size=16G
      filename=file1.0.0:file1.0.1:file1.0.2:file1.0.3:file1.0.4:file1.0.5:file1.0.6:file1.0.7:file1.0.8:file1.0.9:file1.0.10:file1.0.11:file1.0.12:file1.0.13:file1.0.14:file1.0.15:file1.0.16:file1.0.17:file1.0.18:file1.0.19:file1.0.20:file1.0.21:file1.0.22:file1.0.23:file1.0.24:file1.0.25:file1.0.26:file1.0.27:file1.0.28:file1.0.29:file1.0.30:file1.0.31
      file_service_type=random
      nrfiles=32
      
      from 3018MB/s to 4059MB/s in my test VM, running the test against a
      simulated pmem device (note that before the iomap conversion this
      workload was able to achieve 3708MB/s, because the old direct IO path
      also avoided the transaction start for overwrites).  For DAX, the win
      is even larger, improving throughput from 3042MB/s to 4311MB/s.
      Reported-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20191218174433.19380-1-jack@suse.cz
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>