提交 · 356b720e56133cd41478fc09b0de26b403fb5145 · openeuler / Kernel

19 10月, 2021 40 次提交

new helper: inode_wrong_type() · 356b720e

由 Al Viro 提交于 10月 19, 2021

stable inclusion
from stable-5.10.63
commit 40ba433a85dbbf5b2e58f2ac6b161ce37ac872fc
bugzilla: 182231 https://gitee.com/openeuler/kernel/issues/I4EFS1

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=40ba433a85dbbf5b2e58f2ac6b161ce37ac872fc

--------------------------------

commit 6e3e2c43 upstream.

inode_wrong_type(inode, mode) returns true if setting inode->i_mode
to given value would've changed the inode type.  We have enough of
those checks open-coded to make a helper worthwhile.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Acked-by: NWeilong Chen <chenweilong@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

356b720e

ceph: fix possible null-pointer dereference in ceph_mdsmap_decode() · 3beff2ef

由 Tuo Li 提交于 10月 19, 2021

stable inclusion
from stable-5.10.63
commit 23c29490b84dd89582b7d3233e97f73c41f1a065
bugzilla: 182231 https://gitee.com/openeuler/kernel/issues/I4EFS1

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=23c29490b84dd89582b7d3233e97f73c41f1a065

--------------------------------

[ Upstream commit a9e6ffbc ]

kcalloc() is called to allocate memory for m->m_info, and if it fails,
ceph_mdsmap_destroy() behind the label out_err will be called:
  ceph_mdsmap_destroy(m);

In ceph_mdsmap_destroy(), m->m_info is dereferenced through:
  kfree(m->m_info[i].export_targets);

To fix this possible null-pointer dereference, check m->m_info before the
for loop to free m->m_info[i].export_targets.

[ jlayton: fix up whitespace damage
	   only kfree(m->m_info) if it's non-NULL ]
Reported-by: NTOTE Robot <oslab@tsinghua.edu.cn>
Signed-off-by: NTuo Li <islituo@gmail.com>
Signed-off-by: NJeff Layton <jlayton@kernel.org>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Acked-by: NWeilong Chen <chenweilong@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

3beff2ef

Revert "Add a reference to ucounts for each cred" · 87250fc4

由 Greg Kroah-Hartman 提交于 10月 19, 2021

stable inclusion
from stable-5.10.63
commit ae16b7c668378ea00eb60ab9d29e0d46b0e7aa15
bugzilla: 182231 https://gitee.com/openeuler/kernel/issues/I4EFS1

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=ae16b7c668378ea00eb60ab9d29e0d46b0e7aa15

--------------------------------

This reverts commit b2c4d9a33cc2dec7466f97eba2c4dd571ad798a5 which is
commit 905ae01c upstream.

This commit should not have been applied to the 5.10.y stable tree, so
revert it.
Reported-by: N"Eric W. Biederman" <ebiederm@xmission.com>
Link: https://lore.kernel.org/r/87v93k4bl6.fsf@disp2133
Cc: Alexey Gladkov <legion@kernel.org>
Cc: Sasha Levin <sashal@kernel.org>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Acked-by: NWeilong Chen <chenweilong@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

87250fc4

ubifs: report correct st_size for encrypted symlinks · bf8e5a09

由 Eric Biggers 提交于 10月 19, 2021

stable inclusion
from stable-5.10.63
commit 0479b2bd2959ae03e7f727a797ea87b3d0b7dfb2
bugzilla: 182231 https://gitee.com/openeuler/kernel/issues/I4EFS1

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=0479b2bd2959ae03e7f727a797ea87b3d0b7dfb2

--------------------------------

commit 064c7349 upstream.

The stat() family of syscalls report the wrong size for encrypted
symlinks, which has caused breakage in several userspace programs.

Fix this by calling fscrypt_symlink_getattr() after ubifs_getattr() for
encrypted symlinks.  This function computes the correct size by reading
and decrypting the symlink target (if it's not already cached).

For more details, see the commit which added fscrypt_symlink_getattr().

Fixes: ca7f85be ("ubifs: Add support for encrypted symlinks")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210702065350.209646-5-ebiggers@kernel.orgSigned-off-by: NEric Biggers <ebiggers@google.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Acked-by: NWeilong Chen <chenweilong@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

bf8e5a09

f2fs: report correct st_size for encrypted symlinks · b7fbe59b

由 Eric Biggers 提交于 10月 19, 2021

stable inclusion
from stable-5.10.63
commit 3ac01789f6d9ca93ecc1faecd23414c13b4582c9
bugzilla: 182231 https://gitee.com/openeuler/kernel/issues/I4EFS1

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=3ac01789f6d9ca93ecc1faecd23414c13b4582c9

--------------------------------

commit 461b43a8 upstream.

The stat() family of syscalls report the wrong size for encrypted
symlinks, which has caused breakage in several userspace programs.

Fix this by calling fscrypt_symlink_getattr() after f2fs_getattr() for
encrypted symlinks.  This function computes the correct size by reading
and decrypting the symlink target (if it's not already cached).

For more details, see the commit which added fscrypt_symlink_getattr().

Fixes: cbaf042a ("f2fs crypto: add symlink encryption")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210702065350.209646-4-ebiggers@kernel.orgSigned-off-by: NEric Biggers <ebiggers@google.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Acked-by: NWeilong Chen <chenweilong@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

b7fbe59b

ext4: report correct st_size for encrypted symlinks · a810087a

由 Eric Biggers 提交于 10月 19, 2021

stable inclusion
from stable-5.10.63
commit 894a02236d0d20305556af4bfba3259f28c0b86b
bugzilla: 182231 https://gitee.com/openeuler/kernel/issues/I4EFS1

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=894a02236d0d20305556af4bfba3259f28c0b86b

--------------------------------

commit 8c4bca10 upstream.

The stat() family of syscalls report the wrong size for encrypted
symlinks, which has caused breakage in several userspace programs.

Fix this by calling fscrypt_symlink_getattr() after ext4_getattr() for
encrypted symlinks.  This function computes the correct size by reading
and decrypting the symlink target (if it's not already cached).

For more details, see the commit which added fscrypt_symlink_getattr().

Fixes: f348c252 ("ext4 crypto: add symlink encryption")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210702065350.209646-3-ebiggers@kernel.orgSigned-off-by: NEric Biggers <ebiggers@google.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Acked-by: NWeilong Chen <chenweilong@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

a810087a

fscrypt: add fscrypt_symlink_getattr() for computing st_size · 337717e5

由 Eric Biggers 提交于 10月 19, 2021

stable inclusion
from stable-5.10.63
commit b8c298cf57dcb5b18855f11437199fd0eb1ea388
bugzilla: 182231 https://gitee.com/openeuler/kernel/issues/I4EFS1

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=b8c298cf57dcb5b18855f11437199fd0eb1ea388

--------------------------------

commit d1876056 upstream.

Add a helper function fscrypt_symlink_getattr() which will be called
from the various filesystems' ->getattr() methods to read and decrypt
the target of encrypted symlinks in order to report the correct st_size.

Detailed explanation:

As required by POSIX and as documented in various man pages, st_size for
a symlink is supposed to be the length of the symlink target.
Unfortunately, st_size has always been wrong for encrypted symlinks
because st_size is populated from i_size from disk, which intentionally
contains the length of the encrypted symlink target.  That's slightly
greater than the length of the decrypted symlink target (which is the
symlink target that userspace usually sees), and usually won't match the
length of the no-key encoded symlink target either.

This hadn't been fixed yet because reporting the correct st_size would
require reading the symlink target from disk and decrypting or encoding
it, which historically has been considered too heavyweight to do in
->getattr().  Also historically, the wrong st_size had only broken a
test (LTP lstat03) and there were no known complaints from real users.
(This is probably because the st_size of symlinks isn't used too often,
and when it is, typically it's for a hint for what buffer size to pass
to readlink() -- which a slightly-too-large size still works for.)

However, a couple things have changed now.  First, there have recently
been complaints about the current behavior from real users:

- Breakage in rpmbuild:
  https://github.com/rpm-software-management/rpm/issues/1682
  https://github.com/google/fscrypt/issues/305

- Breakage in toybox cpio:
  https://www.mail-archive.com/toybox@lists.landley.net/msg07193.html

- Breakage in libgit2: https://issuetracker.google.com/issues/189629152
  (on Android public issue tracker, requires login)

Second, we now cache decrypted symlink targets in ->i_link.  Therefore,
taking the performance hit of reading and decrypting the symlink target
in ->getattr() wouldn't be as big a deal as it used to be, since usually
it will just save having to do the same thing later.

Also note that eCryptfs ended up having to read and decrypt symlink
targets in ->getattr() as well, to fix this same issue; see
commit 3a60a168 ("eCryptfs: Decrypt symlink target for stat size").

So, let's just bite the bullet, and read and decrypt the symlink target
in ->getattr() in order to report the correct st_size.  Add a function
fscrypt_symlink_getattr() which the filesystems will call to do this.

(Alternatively, we could store the decrypted size of symlinks on-disk.
But there isn't a great place to do so, and encryption is meant to hide
the original size to some extent; that property would be lost.)

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210702065350.209646-2-ebiggers@kernel.orgSigned-off-by: NEric Biggers <ebiggers@google.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Acked-by: NWeilong Chen <chenweilong@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

337717e5

pipe: do FASYNC notifications for every pipe IO, not just state changes · 5d9a850e

由 Linus Torvalds 提交于 10月 19, 2021

mainline inclusion
from mainline-5.10.62
commit 3b2018f9c9c088741d7d33a2baf9aa39e93d58c5
bugzilla: 182217 https://gitee.com/openeuler/kernel/issues/I4EFOS

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3b2018f9c9c088741d7d33a2baf9aa39e93d58c5

--------------------------------

commit fe67f4dd upstream.

It turns out that the SIGIO/FASYNC situation is almost exactly the same
as the EPOLLET case was: user space really wants to be notified after
every operation.

Now, in a perfect world it should be sufficient to only notify user
space on "state transitions" when the IO state changes (ie when a pipe
goes from unreadable to readable, or from unwritable to writable).  User
space should then do as much as possible - fully emptying the buffer or
what not - and we'll notify it again the next time the state changes.

But as with EPOLLET, we have at least one case (stress-ng) where the
kernel sent SIGIO due to the pipe being marked for asynchronous
notification, but the user space signal handler then didn't actually
necessarily read it all before returning (it read more than what was
written, but since there could be multiple writes, it could leave data
pending).

The user space code then expected to get another SIGIO for subsequent
writes - even though the pipe had been readable the whole time - and
would only then read more.

This is arguably a user space bug - and Colin King already fixed the
stress-ng code in question - but the kernel regression rules are clear:
it doesn't matter if kernel people think that user space did something
silly and wrong.  What matters is that it used to work.

So if user space depends on specific historical kernel behavior, it's a
regression when that behavior changes.  It's on us: we were silly to
have that non-optimal historical behavior, and our old kernel behavior
was what user space was tested against.

Because of how the FASYNC notification was tied to wakeup behavior, this
was first broken by commits f467a6a6 and 1b6b26ae ("pipe: fix
and clarify pipe read/write wakeup logic"), but at the time it seems
nobody noticed.  Probably because the stress-ng problem case ends up
being timing-dependent too.

It was then unwittingly fixed by commit 3a34b13a ("pipe: make pipe
writes always wake up readers") only to be broken again when by commit
3b844826 ("pipe: avoid unnecessary EPOLLET wakeups under normal
loads").

And at that point the kernel test robot noticed the performance
refression in the stress-ng.sigio.ops_per_sec case.  So the "Fixes" tag
below is somewhat ad hoc, but it matches when the issue was noticed.

Fix it for good (knock wood) by simply making the kill_fasync() case
separate from the wakeup case.  FASYNC is quite rare, and we clearly
shouldn't even try to use the "avoid unnecessary wakeups" logic for it.

Link: https://lore.kernel.org/lkml/20210824151337.GC27667@xsang-OptiPlex-9020/
Fixes: 3b844826 ("pipe: avoid unnecessary EPOLLET wakeups under normal loads")
Reported-by: Nkernel test robot <oliver.sang@intel.com>
Tested-by: NOliver Sang <oliver.sang@intel.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Colin Ian King <colin.king@canonical.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Acked-by: NWeilong Chen <chenweilong@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

5d9a850e

pipe: avoid unnecessary EPOLLET wakeups under normal loads · 019904bc

由 Linus Torvalds 提交于 10月 19, 2021

mainline inclusion
from mainline-5.10.62
commit e91da23c1be16ebcfca0991976ed9377a8233935
bugzilla: 182217 https://gitee.com/openeuler/kernel/issues/I4EFOS

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e91da23c1be16ebcfca0991976ed9377a8233935

--------------------------------

commit 3b844826 upstream.

I had forgotten just how sensitive hackbench is to extra pipe wakeups,
and commit 3a34b13a ("pipe: make pipe writes always wake up
readers") ended up causing a quite noticeable regression on larger
machines.

Now, hackbench isn't necessarily a hugely meaningful benchmark, and it's
not clear that this matters in real life all that much, but as Mel
points out, it's used often enough when comparing kernels and so the
performance regression shows up like a sore thumb.

It's easy enough to fix at least for the common cases where pipes are
used purely for data transfer, and you never have any exciting poll
usage at all.  So set a special 'poll_usage' flag when there is polling
activity, and make the ugly "EPOLLET has crazy legacy expectations"
semantics explicit to only that case.

I would love to limit it to just the broken EPOLLET case, but the pipe
code can't see the difference between epoll and regular select/poll, so
any non-read/write waiting will trigger the extra wakeup behavior.  That
is sufficient for at least the hackbench case.

Apart from making the odd extra wakeup cases more explicitly about
EPOLLET, this also makes the extra wakeup be at the _end_ of the pipe
write, not at the first write chunk.  That is actually much saner
semantics (as much as you can call any of the legacy edge-triggered
expectations for EPOLLET "sane") since it means that you know the wakeup
will happen once the write is done, rather than possibly in the middle
of one.

[ For stable people: I'm putting a "Fixes" tag on this, but I leave it
  up to you to decide whether you actually want to backport it or not.
  It likely has no impact outside of synthetic benchmarks  - Linus ]

Link: https://lore.kernel.org/lkml/20210802024945.GA8372@xsang-OptiPlex-9020/
Fixes: 3a34b13a ("pipe: make pipe writes always wake up readers")
Reported-by: Nkernel test robot <oliver.sang@intel.com>
Tested-by: NSandeep Patil <sspatil@android.com>
Tested-by: NMel Gorman <mgorman@techsingularity.net>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Acked-by: NWeilong Chen <chenweilong@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

019904bc

btrfs: fix race between marking inode needs to be logged and log syncing · 6b99745e

由 Filipe Manana 提交于 10月 19, 2021

mainline inclusion
from mainline-5.10.62
commit d845f89d59fc3f17ea4e86321b82d8edf6c1719f
bugzilla: 182217 https://gitee.com/openeuler/kernel/issues/I4EFOS

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d845f89d59fc3f17ea4e86321b82d8edf6c1719f

--------------------------------

commit bc0939fc upstream.

We have a race between marking that an inode needs to be logged, either
at btrfs_set_inode_last_trans() or at btrfs_page_mkwrite(), and between
btrfs_sync_log(). The following steps describe how the race happens.

1) We are at transaction N;

2) Inode I was previously fsynced in the current transaction so it has:

    inode->logged_trans set to N;

3) The inode's root currently has:

   root->log_transid set to 1
   root->last_log_commit set to 0

   Which means only one log transaction was committed to far, log
   transaction 0. When a log tree is created we set ->log_transid and
   ->last_log_commit of its parent root to 0 (at btrfs_add_log_tree());

4) One more range of pages is dirtied in inode I;

5) Some task A starts an fsync against some other inode J (same root), and
   so it joins log transaction 1.

   Before task A calls btrfs_sync_log()...

6) Task B starts an fsync against inode I, which currently has the full
   sync flag set, so it starts delalloc and waits for the ordered extent
   to complete before calling btrfs_inode_in_log() at btrfs_sync_file();

7) During ordered extent completion we have btrfs_update_inode() called
   against inode I, which in turn calls btrfs_set_inode_last_trans(),
   which does the following:

     spin_lock(&inode->lock);
     inode->last_trans = trans->transaction->transid;
     inode->last_sub_trans = inode->root->log_transid;
     inode->last_log_commit = inode->root->last_log_commit;
     spin_unlock(&inode->lock);

   So ->last_trans is set to N and ->last_sub_trans set to 1.
   But before setting ->last_log_commit...

8) Task A is at btrfs_sync_log():

   - it increments root->log_transid to 2
   - starts writeback for all log tree extent buffers
   - waits for the writeback to complete
   - writes the super blocks
   - updates root->last_log_commit to 1

   It's a lot of slow steps between updating root->log_transid and
   root->last_log_commit;

9) The task doing the ordered extent completion, currently at
   btrfs_set_inode_last_trans(), then finally runs:

     inode->last_log_commit = inode->root->last_log_commit;
     spin_unlock(&inode->lock);

   Which results in inode->last_log_commit being set to 1.
   The ordered extent completes;

10) Task B is resumed, and it calls btrfs_inode_in_log() which returns
    true because we have all the following conditions met:

    inode->logged_trans == N which matches fs_info->generation &&
    inode->last_subtrans (1) <= inode->last_log_commit (1) &&
    inode->last_subtrans (1) <= root->last_log_commit (1) &&
    list inode->extent_tree.modified_extents is empty

    And as a consequence we return without logging the inode, so the
    existing logged version of the inode does not point to the extent
    that was written after the previous fsync.

It should be impossible in practice for one task be able to do so much
progress in btrfs_sync_log() while another task is at
btrfs_set_inode_last_trans() right after it reads root->log_transid and
before it reads root->last_log_commit. Even if kernel preemption is enabled
we know the task at btrfs_set_inode_last_trans() can not be preempted
because it is holding the inode's spinlock.

However there is another place where we do the same without holding the
spinlock, which is in the memory mapped write path at:

  vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
  {
     (...)
     BTRFS_I(inode)->last_trans = fs_info->generation;
     BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid;
     BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit;
     (...)

So with preemption happening after setting ->last_sub_trans and before
setting ->last_log_commit, it is less of a stretch to have another task
do enough progress at btrfs_sync_log() such that the task doing the memory
mapped write ends up with ->last_sub_trans and ->last_log_commit set to
the same value. It is still a big stretch to get there, as the task doing
btrfs_sync_log() has to start writeback, wait for its completion and write
the super blocks.

So fix this in two different ways:

1) For btrfs_set_inode_last_trans(), simply set ->last_log_commit to the
   value of ->last_sub_trans minus 1;

2) For btrfs_page_mkwrite() only set the inode's ->last_sub_trans, just
   like we do for buffered and direct writes at btrfs_file_write_iter(),
   which is all we need to make sure multiple writes and fsyncs to an
   inode in the same transaction never result in an fsync missing that
   the inode changed and needs to be logged. Turn this into a helper
   function and use it both at btrfs_page_mkwrite() and at
   btrfs_file_write_iter() - this also fixes the problem that at
   btrfs_page_mkwrite() we were setting those fields without the
   protection of the inode's spinlock.

This is an extremely unlikely race to happen in practice.
Signed-off-by: NFilipe Manana <fdmanana@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NAnand Jain <anand.jain@oracle.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Acked-by: NWeilong Chen <chenweilong@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

6b99745e

Revert "btrfs: compression: don't try to compress if we don't have enough pages" · 6a0544e0

由 Qu Wenruo 提交于 10月 19, 2021

mainline inclusion
from mainline-5.10.62
commit 3134292a8e79e089f0f19f28b8b20eb1f961575c
bugzilla: 182217 https://gitee.com/openeuler/kernel/issues/I4EFOS

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3134292a8e79e089f0f19f28b8b20eb1f961575c

--------------------------------

commit 4e965576 upstream.

This reverts commit f2165627.

[BUG]
It's no longer possible to create compressed inline extent after commit
f2165627 ("btrfs: compression: don't try to compress if we don't
have enough pages").

[CAUSE]
For compression code, there are several possible reasons we have a range
that needs to be compressed while it's no more than one page.

- Compressed inline write
  The data is always smaller than one sector and the test lacks the
  condition to properly recognize a non-inline extent.

- Compressed subpage write
  For the incoming subpage compressed write support, we require page
  alignment of the delalloc range.
  And for 64K page size, we can compress just one page into smaller
  sectors.

For those reasons, the requirement for the data to be more than one page
is not correct, and is already causing regression for compressed inline
data writeback.  The idea of skipping one page to avoid wasting CPU time
could be revisited in the future.

[FIX]
Fix it by reverting the offending commit.
Reported-by: NZygo Blaxell <ce3g8jdj@umail.furryterror.org>
Link: https://lore.kernel.org/linux-btrfs/afa2742.c084f5d6.17b6b08dffc@tnonline.net
Fixes: f2165627 ("btrfs: compression: don't try to compress if we don't have enough pages")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: NQu Wenruo <wqu@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Acked-by: NWeilong Chen <chenweilong@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

6a0544e0

ceph: correctly handle releasing an embedded cap flush · d7da0da4

由 Xiubo Li 提交于 10月 19, 2021

mainline inclusion
from mainline-5.10.62
commit e55a8b461585a77ad3c42e6ae1fdad9efb0ff207
bugzilla: 182217 https://gitee.com/openeuler/kernel/issues/I4EFOS

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e55a8b461585a77ad3c42e6ae1fdad9efb0ff207

--------------------------------

commit b2f9fa1f upstream.

The ceph_cap_flush structures are usually dynamically allocated, but
the ceph_cap_snap has an embedded one.

When force umounting, the client will try to remove all the session
caps. During this, it will free them, but that should not be done
with the ones embedded in a capsnap.

Fix this by adding a new boolean that indicates that the cap flush is
embedded in a capsnap, and skip freeing it if that's set.

At the same time, switch to using list_del_init() when detaching the
i_list and g_list heads.  It's possible for a forced umount to remove
these objects but then handle_cap_flushsnap_ack() races in and does the
list_del_init() again, corrupting memory.

Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/52283Signed-off-by: NXiubo Li <xiubli@redhat.com>
Reviewed-by: NJeff Layton <jlayton@kernel.org>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Acked-by: NWeilong Chen <chenweilong@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

d7da0da4

ovl: fix uninitialized pointer read in ovl_lookup_real_one() · dd7a8d14

由 Miklos Szeredi 提交于 10月 19, 2021

mainline inclusion
from mainline-5.10.62
commit ef2d68ef9a3bff68915e6fdf5b61822bd1f6af4c
bugzilla: 182217 https://gitee.com/openeuler/kernel/issues/I4EFOS

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ef2d68ef9a3bff68915e6fdf5b61822bd1f6af4c

--------------------------------

[ Upstream commit 580c6104 ]

One error path can result in release_dentry_name_snapshot() being called
before "name" was initialized by take_dentry_name_snapshot().

Fix by moving the release_dentry_name_snapshot() to immediately after the
only use.
Reported-by: NColin Ian King <colin.king@canonical.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Acked-by: NWeilong Chen <chenweilong@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

dd7a8d14

ext4: fix potential uninitialized access to retval in kmmpd · 070fdd4d

由 Ye Bin 提交于 10月 19, 2021

mainline inclusion
from mainline-5.14-rc5
commit b6654142
category: bugfix
bugzilla: 176138 https://gitee.com/openeuler/kernel/issues/I4DDEL

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b66541422824cf6cf20e9a35112e9cb5d82cdf62

-------------------------------------------------

if (!ext4_has_feature_mmp(sb)) then retval can be unitialized before
we jump to the wait_to_exit label.

Fixes: 61bb4a1c ("ext4: fix possible UAF when remounting r/o a mmp-protected file system")
Signed-off-by: NYe Bin <yebin10@huawei.com>
Link: https://lore.kernel.org/r/20210713022728.2533770-1-yebin10@huawei.comSigned-off-by: NTheodore Ts'o <tytso@mit.edu>
Signed-off-by: NBaokun Li <libaokun1@huawei.com>
Reviewed-by: NZhang Yi <yi.zhang@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

070fdd4d

take LOOKUP_{ROOT,ROOT_GRABBED,JUMPED} out of LOOKUP_... space · 42d3b3b2

由 Al Viro 提交于 10月 19, 2021

mainline inclusion
from mainline-5.14-rc1
commit bcba1e7d
category: bugfix
bugzilla: 181657 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bcba1e7d0d520adba895d9e0800a056f734b0a6a

---------------------------

Separate field in nameidata (nd->state) holding the flags that
should be internal-only - that way we both get some spare bits
in LOOKUP_... and get simpler rules for nd->root lifetime rules,
since we can set the replacement of LOOKUP_ROOT (ND_ROOT_PRESET)
at the same time we set nd->root.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

Conflicts:
	fs/namei.c
	[ Bugfix 7d01ef75("Make sure nd->path.mnt and nd->path.dentry
	  are always valid pointers") is not applid, the problem to be
	  fixed not exists.
	  Feature 6c6ec2b0("fs: add support for LOOKUP_CACHED") is
	  not applied. ]
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: NZhang Yi <yi.zhang@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

42d3b3b2

switch file_open_root() to struct path · d16f801d

由 Al Viro 提交于 10月 19, 2021

mainline inclusion
from mainline-5.14-rc1
commit ffb37ca3
category: bugfix
bugzilla: 181657 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ffb37ca3bd16ce6ea2df2f87fde9a31e94ebb54b

---------------------------

... and provide file_open_root_mnt(), using the root of given mount.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

Conflicts:
	Documentation/filesystems/porting.rst
	[ Non-bugfix 14e43bf4("vfs: don't unnecessarily clone
	  write access for writable fd") is not applied. ]
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: NZhang Yi <yi.zhang@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
[Roberto Sassu: Adjust file_open_root() called by load_digest_list()]
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

d16f801d

memcg: enable accounting for new namesapces and struct nsproxy · befac4c1

由 Vasily Averin 提交于 10月 19, 2021

mainline inclusion
from mainline-v5.15-rc1
commit 30acd0bd
bugzilla: 181858 https://gitee.com/openeuler/kernel/issues/I4DDEL

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=30acd0bdfb86548172168a0cc71d455944de0683

--------------------------------

Container admin can create new namespaces and force kernel to allocate up
to several pages of memory for the namespaces and its associated
structures.

Net and uts namespaces have enabled accounting for such allocations.  It
makes sense to account for rest ones to restrict the host's memory
consumption from inside the memcg-limited container.

Link: https://lkml.kernel.org/r/5525bcbf-533e-da27-79b7-158686c64e13@virtuozzo.comSigned-off-by: NVasily Averin <vvs@virtuozzo.com>
Acked-by: NSerge Hallyn <serge@hallyn.com>
Acked-by: NChristian Brauner <christian.brauner@ubuntu.com>
Acked-by: NKirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yutian Yang <nglaive@gmail.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

Conflicts:
	kernel/time/namespace.c
Signed-off-by: NLi Ming <limingming.li@huawei.com>
Signed-off-by: NLu Jialin <lujialin4@huawei.com>
Reviewed-by: NXiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

befac4c1

memcg: enable accounting for fasync_cache · b4e5feb6

由 Vasily Averin 提交于 10月 19, 2021

mainline inclusion
from mainline-v5.15-rc1
commit 839d6820
bugzilla: 181858 https://gitee.com/openeuler/kernel/issues/I4DDEL

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=839d68206de869b8cb4272c5ea10da2ef7ec34cb

--------------------------------

fasync_struct is used by almost all character device drivers to set up the
fasync queue, and for regular files by the file lease code.  This
structure is quite small but long-living and it can be assigned for any
open file.

It makes sense to account for its allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Link: https://lkml.kernel.org/r/1b408625-d71c-0b26-b0b6-9baf00f93e69@virtuozzo.comSigned-off-by: NVasily Averin <vvs@virtuozzo.com>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yutian Yang <nglaive@gmail.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NLi Ming <limingming.li@huawei.com>
Signed-off-by: NLu Jialin <lujialin4@huawei.com>
Reviewed-by: NXiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

b4e5feb6

memcg: enable accounting for mnt_cache entries · b2f25c6a

由 Vasily Averin 提交于 10月 19, 2021

mainline inclusion
from mainline-v5.15-rc1
commit 79f6540b
bugzilla: 181858 https://gitee.com/openeuler/kernel/issues/I4DDEL

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=79f6540ba88dfb383ecf057a3425e668105ca774

--------------------------------

Patch series "memcg accounting from OpenVZ", v7.

OpenVZ uses memory accounting 20+ years since v2.2.x linux kernels.
Initially we used our own accounting subsystem, then partially committed
it to upstream, and a few years ago switched to cgroups v1.  Now we're
rebasing again, revising our old patches and trying to push them upstream.

We try to protect the host system from any misuse of kernel memory
allocation triggered by untrusted users inside the containers.

Patch-set is addressed mostly to cgroups maintainers and cgroups@ mailing
list, though I would be very grateful for any comments from maintainersi
of affected subsystems or other people added in cc:

Compared to the upstream, we additionally account the following kernel objects:
- network devices and its Tx/Rx queues
- ipv4/v6 addresses and routing-related objects
- inet_bind_bucket cache objects
- VLAN group arrays
- ipv6/sit: ip_tunnel_prl
- scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
- nsproxy and namespace objects itself
- IPC objects: semaphores, message queues and share memory segments
- mounts
- pollfd and select bits arrays
- signals and posix timers
- file lock
- fasync_struct used by the file lease code and driver's fasync queues
- tty objects
- per-mm LDT

We have an incorrect/incomplete/obsoleted accounting for few other kernel
objects: sk_filter, af_packets, netlink and xt_counters for iptables.
They require rework and probably will be dropped at all.

Also we're going to add an accounting for nft, however it is not ready
yet.

We have not tested performance on upstream, however, our performance team
compares our current RHEL7-based production kernel and reports that they
are at least not worse as the according original RHEL7 kernel.

This patch (of 10):

The kernel allocates ~400 bytes of 'struct mount' for any new mount.
Creating a new mount namespace clones most of the parent mounts, and this
can be repeated many times.  Additionally, each mount allocates up to
PATH_MAX=4096 bytes for mnt->mnt_devname.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Link: https://lkml.kernel.org/r/045db11f-4a45-7c9b-2664-5b32c2b44943@virtuozzo.comSigned-off-by: NVasily Averin <vvs@virtuozzo.com>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Acked-by: NChristian Brauner <christian.brauner@ubuntu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Yutian Yang <nglaive@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Borislav Petkov <bp@suse.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NLi Ming <limingming.li@huawei.com>
Signed-off-by: NLu Jialin <lujialin4@huawei.com>
Reviewed-by: NXiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

b2f25c6a

memcg: charge fs_context and legacy_fs_context · 1f91195e

由 Yutian Yang 提交于 10月 19, 2021

mainline inclusion
from mainline-v5.15-rc1
commit bb902cb4
bugzilla: 181858 https://gitee.com/openeuler/kernel/issues/I4DDEL

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bb902cb47cf93b33cd92b3b7a4019330a03ef57f

--------------------------------

This patch adds accounting flags to fs_context and legacy_fs_context
allocation sites so that kernel could correctly charge these objects.

We have written a PoC to demonstrate the effect of the missing-charging
bugs.  The PoC takes around 1,200MB unaccounted memory, while it is
charged for only 362MB memory usage.  We evaluate the PoC on QEMU x86_64
v5.2.90 + Linux kernel v5.10.19 + Debian buster.  All the limitations
including ulimits and sysctl variables are set as default.  Specifically,
the hard NOFILE limit and nr_open in sysctl are both 1,048,576.

/*------------------------- POC code ----------------------------*/

                        } while (0)

static inline int fsopen(const char *fs_name, unsigned int flags)
{
        return syscall(__NR_fsopen, fs_name, flags);
}

static char thread_stack[512][STACK_SIZE];

int thread_fn(void* arg)
{
  for (int i = 0; i< 800000; ++i) {
    int fsfd = fsopen("nfs", FSOPEN_CLOEXEC);
    if (fsfd == -1) {
      errExit("fsopen");
    }
  }
  while(1);
  return 0;
}

int main(int argc, char *argv[]) {
  int thread_pid;
  for (int i = 0; i < 1; ++i) {
    thread_pid = clone(thread_fn, thread_stack[i] + STACK_SIZE, \
      SIGCHLD, NULL);
  }
  while(1);
  return 0;
}

/*-------------------------- end --------------------------------*/

Link: https://lkml.kernel.org/r/1626517201-24086-1-git-send-email-nglaive@gmail.comSigned-off-by: NYutian Yang <nglaive@gmail.com>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <shenwenbo@zju.edu.cn>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NLi Ming <limingming.li@huawei.com>
Signed-off-by: NLu Jialin <lujialin4@huawei.com>
Reviewed-by: NXiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

1f91195e

ext4: prevent getting empty inode buffer · a3e3ff2d

由 Zhang Yi 提交于 10月 19, 2021

hulk inclusion
category: bugfix
bugzilla: 174653 https://gitee.com/openeuler/kernel/issues/I4DDEL
---------------------------

In ext4_get_inode_loc(), we may skip IO and get an zero && uptodate
inode buffer when the inode monopolize an inode block for performance
reason. For most cases, ext4_mark_iloc_dirty() will fill the inode
buffer to make it fine, but we could miss this call if something bad
happened. Finally, __ext4_get_inode_loc_noinmem() may probably get an
empty inode buffer and trigger ext4 error.

For example, if we remove a nonexistent xattr on inode A,
ext4_xattr_set_handle() will return ENODATA before invoking
ext4_mark_iloc_dirty(), it will left an uptodate but zero buffer. We
will get checksum error message in ext4_iget() when getting inode again.

  EXT4-fs error (device sda): ext4_lookup:1784: inode #131074: comm cat: iget: checksum invalid

Even worse, if we allocate another inode B at the same inode block, it
will corrupt the inode A on disk when write back inode B.

So this patch initialize the inode buffer by filling the in-mem inode
contents if we skip read I/O, ensure that the buffer is really uptodate.
Signed-off-by: NZhang Yi <yi.zhang@huawei.com>
Reviewed-by: NYang Erkun <yangerkun@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

a3e3ff2d

ext4: move ext4_fill_raw_inode() related functions · fce5c640

由 Zhang Yi 提交于 10月 19, 2021

hulk inclusion
category: bugfix
bugzilla: 174653 https://gitee.com/openeuler/kernel/issues/I4DDEL
---------------------------

In preparation for calling ext4_fill_raw_inode() in
__ext4_get_inode_loc(), move three related functions before
__ext4_get_inode_loc(), no logical change.
Signed-off-by: NZhang Yi <yi.zhang@huawei.com>
Reviewed-by: NYang Erkun <yangerkun@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

fce5c640

ext4: factor out ext4_fill_raw_inode() · 40b43297

由 Zhang Yi 提交于 10月 19, 2021

hulk inclusion
category: bugfix
bugzilla: 174653 https://gitee.com/openeuler/kernel/issues/I4DDEL
---------------------------

Factor out ext4_fill_raw_inode() from ext4_do_update_inode(), which is
use to fill the in-mem inode contents into the inode table buffer, in
preparation for initializing the exclusive inode buffer without reading
the block in __ext4_get_inode_loc().
Signed-off-by: NZhang Yi <yi.zhang@huawei.com>
Reviewed-by: NYang Erkun <yangerkun@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

40b43297

ext4: make the updating inode data procedure atomic · 77c57232

由 Zhang Yi 提交于 10月 19, 2021

mainline inclusion
from mainline-5.15-rc1
commit baaae979
category: bugfix
bugzilla: 174653 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=baaae979b112642a41b71c71c599d875c067d257

---------------------------

Now that ext4_do_update_inode() return error before filling the whole
inode data if we fail to set inode blocks in ext4_inode_blocks_set().
This error should never happen in theory since sb->s_maxbytes should not
have allowed this, we have already init sb->s_maxbytes according to this
feature in ext4_fill_super(). So even through that could only happen due
to the filesystem corruption, we'd better to return after we finish
updating the inode because it may left an uninitialized buffer and we
could read this buffer later in "errors=continue" mode.

This patch make the updating inode data procedure atomic, call
EXT4_ERROR_INODE() after we dropping i_raw_lock after something bad
happened, make sure that the inode is integrated, and also drop a BUG_ON
and do some small cleanups.
Signed-off-by: NZhang Yi <yi.zhang@huawei.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210826130412.3921207-4-yi.zhang@huawei.comSigned-off-by: NTheodore Ts'o <tytso@mit.edu>
Reviewed-by: NYang Erkun <yangerkun@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

77c57232

ext4: move inode eio simulation behind io completeion · cc1cdb17

由 Zhang Yi 提交于 10月 19, 2021

mainline inclusion
from mainline-5.15-rc1
commit 0904c9ae
category: bugfix
bugzilla: 174653 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0904c9ae3465c7acc066a564a76b75c0af83e6c7

---------------------------

No EIO simulation is required if the buffer is uptodate, so move the
simulation behind read bio completeion just like inode/block bitmap
simulation does.
Signed-off-by: NZhang Yi <yi.zhang@huawei.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210826130412.3921207-2-yi.zhang@huawei.comSigned-off-by: NTheodore Ts'o <tytso@mit.edu>
Reviewed-by: NYang Erkun <yangerkun@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

cc1cdb17

ext4: wipe ext4_dir_entry2 upon file deletion · cbf92ac9

由 Leah Rumancik 提交于 10月 19, 2021

mainline inclusion
from mainline-5.13-rc1
commit 6c091273
category: bugfix
bugzilla: 176545 https://gitee.com/openeuler/kernel/issues/I4DDEL

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6c0912739699d8e4b6a87086401bf3ad3c59502d

-------------------------------------------------

Upon file deletion, zero out all fields in ext4_dir_entry2 besides rec_len.
In case sensitive data is stored in filenames, this ensures no potentially
sensitive data is left in the directory entry upon deletion. Also, wipe
these fields upon moving a directory entry during the conversion to an
htree and when splitting htree nodes.

The data wiped may still exist in the journal, but there are future
commits planned to address this.
Signed-off-by: NLeah Rumancik <leah.rumancik@gmail.com>
Link: https://lore.kernel.org/r/20210422180834.2242353-1-leah.rumancik@gmail.comSigned-off-by: NTheodore Ts'o <tytso@mit.edu>
Signed-off-by: NBaokun Li <libaokun1@huawei.com>
Reviewed-by: NZhang Yi <yi.zhang@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

cbf92ac9

io_uring: ensure symmetry in handling iter types in loop_rw_iter() · f4f98f3b

由 Jens Axboe 提交于 10月 19, 2021

stable inclusion
from stable-5.10.68
commit ce8f81b76d3bef7b9fe6c8f84d029ab898b19469
bugzilla: 182253 https://gitee.com/openeuler/kernel/issues/I4DDEL
CVE: CVE-2021-41073

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=ce8f81b76d3bef7b9fe6c8f84d029ab898b19469

--------------------------------

commit 16c8d2df upstream.

When setting up the next segment, we check what type the iter is and
handle it accordingly. However, when incrementing and processed amount
we do not, and both iter advance and addr/len are adjusted, regardless
of type. Split the increment side just like we do on the setup side.

Fixes: 4017eb91 ("io_uring: make loop_rw_iter() use original user supplied pointers")
Cc: stable@vger.kernel.org
Reported-by: NValentina Palmiotti <vpalmiotti@gmail.com>
Reviewed-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Reviewed-by: NYang Erkun <yangerkun@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

f4f98f3b

ext4: fix race writing to an inline_data file while its xattrs are changing · 0458938a

由 Theodore Ts'o 提交于 10月 19, 2021

mainline inclusion
from mainline
commit a54c4613
bugzilla: 181006 https://gitee.com/openeuler/kernel/issues/I4DDEL
CVE: CVE-2021-40490

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a54c4613dac1500b40e4ab55199f7c51f028e848

--------------------------------

The location of the system.data extended attribute can change whenever
xattr_sem is not taken.  So we need to recalculate the i_inline_off
field since it mgiht have changed between ext4_write_begin() and
ext4_write_end().

This means that caching i_inline_off is probably not helpful, so in
the long run we should probably get rid of it and shrink the in-memory
ext4 inode slightly, but let's fix the race the simple way for now.

Cc: stable@kernel.org
Fixes: f19d5870 ("ext4: add normal write support for inline data")
Reported-by: syzbot+13146364637c7363a7de@syzkaller.appspotmail.com
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Reviewed-by: NXiu Jianfeng <xiujianfeng@huawei.com>
Reviewed-by: NZhang Yi <yi.zhang@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

0458938a

btrfs: fix NULL pointer dereference when deleting device by invalid id · 6bb7ddd9

由 Qu Wenruo 提交于 10月 19, 2021

mainline inclusion
from mainline
commit e4571b8c
bugzilla: 181007 https://gitee.com/openeuler/kernel/issues/I4DDEL
CVE: CVE-2021-3739

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e4571b8c5e9ffa1e85c0c671995bd4dcc5c75091

--------------------------------

[BUG]
It's easy to trigger NULL pointer dereference, just by removing a
non-existing device id:

 # mkfs.btrfs -f -m single -d single /dev/test/scratch1 \
				     /dev/test/scratch2
 # mount /dev/test/scratch1 /mnt/btrfs
 # btrfs device remove 3 /mnt/btrfs

Then we have the following kernel NULL pointer dereference:

 BUG: kernel NULL pointer dereference, address: 0000000000000000
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] PREEMPT SMP NOPTI
 CPU: 9 PID: 649 Comm: btrfs Not tainted 5.14.0-rc3-custom+ #35
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
 RIP: 0010:btrfs_rm_device+0x4de/0x6b0 [btrfs]
  btrfs_ioctl+0x18bb/0x3190 [btrfs]
  ? lock_is_held_type+0xa5/0x120
  ? find_held_lock.constprop.0+0x2b/0x80
  ? do_user_addr_fault+0x201/0x6a0
  ? lock_release+0xd2/0x2d0
  ? __x64_sys_ioctl+0x83/0xb0
  __x64_sys_ioctl+0x83/0xb0
  do_syscall_64+0x3b/0x90
  entry_SYSCALL_64_after_hwframe+0x44/0xae

[CAUSE]
Commit a27a94c2 ("btrfs: Make btrfs_find_device_by_devspec return
btrfs_device directly") moves the "missing" device path check into
btrfs_rm_device().

But btrfs_rm_device() itself can have case where it only receives
@devid, with NULL as @device_path.

In that case, calling strcmp() on NULL will trigger the NULL pointer
dereference.

Before that commit, we handle the "missing" case inside
btrfs_find_device_by_devspec(), which will not check @device_path at all
if @devid is provided, thus no way to trigger the bug.

[FIX]
Before calling strcmp(), also make sure @device_path is not NULL.

Fixes: a27a94c2 ("btrfs: Make btrfs_find_device_by_devspec return btrfs_device directly")
CC: stable@vger.kernel.org # 5.4+
Reported-by: Nbutt3rflyh4ck <butterflyhuangxx@gmail.com>
Reviewed-by: NAnand Jain <anand.jain@oracle.com>
Signed-off-by: NQu Wenruo <wqu@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Reviewed-by: NXiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

6bb7ddd9

hfs: fix null-ptr-deref in hfs_find_init() · 6cdcbcc1

由 Yang Yingliang 提交于 10月 19, 2021

hulk inclusion
category: bugfix
bugzilla: 179898 https://gitee.com/openeuler/kernel/issues/I4DDEL
CVE: CVE-2018-12928

---------------------------

It will cause a null-ptr-deref in hfs_find_init()
in fuzz test.

[  107.092729] hfs: continuing without an alternate MDB
[  107.097632] general protection fault, probably for non-canonical address 0xdffffc0000000008: 0000 [#1] SMP KASAN PTI
[  107.104679] KASAN: null-ptr-deref in range [0x0000000000000040-0x0000000000000047]
[  107.109100] CPU: 0 PID: 379 Comm: hfs_inject Not tainted 5.7.0-rc7-00001-g24627f5f2973 #897
[  107.114142] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[  107.121095] RIP: 0010:hfs_find_init+0x72/0x170
[  107.123609] Code: c1 ea 03 80 3c 02 00 0f 85 e6 00 00 00 4c 8d 65 40 48 c7 43 18 00 00 00 00 48 b8 00 00 00 00 00 fc ff df 4c 89 e2 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e a5 00 00 00 8b 45 40 be c0 0c
[  107.134660] RSP: 0018:ffff88810291f3f8 EFLAGS: 00010202
[  107.137897] RAX: dffffc0000000000 RBX: ffff88810291f468 RCX: 1ffff110175cdf05
[  107.141874] RDX: 0000000000000008 RSI: ffff88810291f468 RDI: ffff88810291f480
[  107.145844] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffed1020381013
[  107.149431] R10: ffff88810291f500 R11: ffffed1020381012 R12: 0000000000000040
[  107.152315] R13: 0000000000000000 R14: ffff888101c0814a R15: ffff88810291f468
[  107.155464] FS:  00000000009ea880(0000) GS:ffff88810c600000(0000) knlGS:0000000000000000
[  107.159795] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  107.162987] CR2: 00005605a19dd284 CR3: 0000000103a0c006 CR4: 0000000000020ef0
[  107.166665] Call Trace:
[  107.167969]  ? find_held_lock+0x33/0x1c0
[  107.169972]  hfs_ext_read_extent+0x16b/0xb00
[  107.172092]  ? create_page_buffers+0x14e/0x1b0
[  107.174303]  ? hfs_free_extents+0x280/0x280
[  107.176437]  ? lock_downgrade+0x730/0x730
[  107.178272]  hfs_get_block+0x496/0x8a0
[  107.179972]  block_read_full_page+0x241/0x8d0
[  107.181971]  ? hfs_extend_file+0xae0/0xae0
[  107.183814]  ? end_buffer_async_read_io+0x10/0x10
[  107.185954]  ? add_to_page_cache_lru+0x13f/0x1f0
[  107.188006]  ? add_to_page_cache_locked+0x10/0x10
[  107.190175]  do_read_cache_page+0xc6a/0x1180
[  107.192096]  ? generic_file_read_iter+0x4c0/0x4c0
[  107.194234]  ? hfs_btree_open+0x408/0x1000
[  107.196068]  ? lock_downgrade+0x730/0x730
[  107.197926]  ? wake_bit_function+0x180/0x180
[  107.199845]  ? lockdep_init_map_waits+0x267/0x7c0
[  107.201895]  hfs_btree_open+0x455/0x1000
[  107.203479]  hfs_mdb_get+0x122c/0x1ae8
[  107.205065]  ? hfs_mdb_put+0x350/0x350
[  107.206590]  ? queue_work_node+0x260/0x260
[  107.208309]  ? rcu_read_lock_sched_held+0xa1/0xd0
[  107.210227]  ? lockdep_init_map_waits+0x267/0x7c0
[  107.212144]  ? lockdep_init_map_waits+0x267/0x7c0
[  107.213979]  hfs_fill_super+0x9ba/0x1280
[  107.215444]  ? bdev_name.isra.9+0xf1/0x2b0
[  107.217028]  ? hfs_remount+0x190/0x190
[  107.218428]  ? pointer+0x5da/0x710
[  107.219745]  ? file_dentry_name+0xf0/0xf0
[  107.221262]  ? mount_bdev+0xd1/0x330
[  107.222592]  ? vsnprintf+0x7bd/0x1250
[  107.224007]  ? pointer+0x710/0x710
[  107.225332]  ? down_write+0xe5/0x160
[  107.226698]  ? hfs_remount+0x190/0x190
[  107.228120]  ? snprintf+0x91/0xc0
[  107.229388]  ? vsprintf+0x10/0x10
[  107.230628]  ? sget+0x3af/0x4a0
[  107.231848]  ? hfs_remount+0x190/0x190
[  107.233300]  mount_bdev+0x26e/0x330
[  107.234611]  ? hfs_statfs+0x540/0x540
[  107.236015]  legacy_get_tree+0x101/0x1f0
[  107.237431]  ? security_capable+0x58/0x90
[  107.238832]  vfs_get_tree+0x89/0x2d0
[  107.240082]  ? ns_capable_common+0x5c/0xd0
[  107.241521]  do_mount+0xd8a/0x1720
[  107.242727]  ? lock_downgrade+0x730/0x730
[  107.244116]  ? copy_mount_string+0x20/0x20
[  107.245557]  ? _copy_from_user+0xbe/0x100
[  107.246967]  ? memdup_user+0x47/0x70
[  107.248212]  __x64_sys_mount+0x162/0x1b0
[  107.249537]  do_syscall_64+0xa5/0x4f0
[  107.250742]  entry_SYSCALL_64_after_hwframe+0x49/0xb3
[  107.252369] RIP: 0033:0x44e8ea
[  107.253360] Code: 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
[  107.259240] RSP: 002b:00007ffd910e4c28 EFLAGS: 00000207 ORIG_RAX: 00000000000000a5
[  107.261668] RAX: ffffffffffffffda RBX: 0000000000400400 RCX: 000000000044e8ea
[  107.263920] RDX: 000000000049321e RSI: 0000000000493222 RDI: 00007ffd910e4d00
[  107.266177] RBP: 00007ffd910e5d10 R08: 0000000000000000 R09: 000000000000000a
[  107.268451] R10: 0000000000000001 R11: 0000000000000207 R12: 0000000000401c40
[  107.270721] R13: 0000000000000000 R14: 00000000006ba018 R15: 0000000000000000
[  107.273025] Modules linked in:
[  107.274029] Dumping ftrace buffer:
[  107.275121]    (ftrace buffer empty)
[  107.276370] ---[ end trace c5e0b9d684f3570e ]---

We need check tree in hfs_find_init().

https://lore.kernel.org/linux-fsdevel/20180419024358.GA5215@bombadil.infradead.org/
https://marc.info/?l=linux-fsdevel&m=152406881024567&w=2
References: CVE-2018-12928
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
Reviewed-by: NJason Yan <yanaijie@huawei.com>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

6cdcbcc1

io_uring: only assign io_uring_enter() SQPOLL error in actual error case · 8e0001af

由 Jens Axboe 提交于 10月 19, 2021

stable inclusion
from stable-5.10.61
commit f15e6426739389876d66d093f49866b4f24c1a4b
bugzilla: 177029 https://gitee.com/openeuler/kernel/issues/I4EAXD

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=f15e6426739389876d66d093f49866b4f24c1a4b

--------------------------------

[ upstream commit 21f96522 ]

If an SQPOLL based ring is newly created and an application issues an
io_uring_enter(2) system call on it, then we can return a spurious
-EOWNERDEAD error. This happens because there's nothing to submit, and
if the caller doesn't specify any other action, the initial error
assignment of -EOWNERDEAD never gets overwritten. This causes us to
return it directly, even if it isn't valid.

Move the error assignment into the actual failure case instead.

Cc: stable@vger.kernel.org
Fixes: d9d05217 ("io_uring: stop SQPOLL submit on creator's death")
Reported-by: Sherlock Holo sherlockya@gmail.com
Link: https://github.com/axboe/liburing/issues/413Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Acked-by: NWeilong Chen <chenweilong@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

8e0001af

io_uring: fix xa_alloc_cycle() error return value check · a2ae87d2

由 Jens Axboe 提交于 10月 19, 2021

stable inclusion
from stable-5.10.61
commit 695ab28a7fa107d0350ab19eba8ec89fac45a95d
bugzilla: 177029 https://gitee.com/openeuler/kernel/issues/I4EAXD

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=695ab28a7fa107d0350ab19eba8ec89fac45a95d

--------------------------------

[ upstream commit a30f895a ]

We currently check for ret != 0 to indicate error, but '1' is a valid
return and just indicates that the allocation succeeded with a wrap.
Correct the check to be for < 0, like it was before the xarray
conversion.

Cc: stable@vger.kernel.org
Fixes: 61cf9370 ("io_uring: Convert personality_idr to XArray")
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Acked-by: NWeilong Chen <chenweilong@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

a2ae87d2

fs: warn about impending deprecation of mandatory locks · 7b286885

由 Jeff Layton 提交于 10月 19, 2021

stable inclusion
from stable-5.10.61
commit 0d5fcfc6406ee7138ca0141769f90a37d4b4ae8b
bugzilla: 177029 https://gitee.com/openeuler/kernel/issues/I4EAXD

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=0d5fcfc6406ee7138ca0141769f90a37d4b4ae8b

--------------------------------

[ Upstream commit fdd92b64 ]

We've had CONFIG_MANDATORY_FILE_LOCKING since 2015 and a lot of distros
have disabled it. Warn the stragglers that still use "-o mand" that
we'll be dropping support for that mount option.

Cc: stable@vger.kernel.org
Signed-off-by: NJeff Layton <jlayton@kernel.org>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Acked-by: NWeilong Chen <chenweilong@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

7b286885

btrfs: prevent rename2 from exchanging a subvol with a directory from different parents · 1dbbf013

由 NeilBrown 提交于 10月 19, 2021

stable inclusion
from stable-5.10.61
commit 67fece6289a9dab4e2700a93004db45ac834523b
bugzilla: 177029 https://gitee.com/openeuler/kernel/issues/I4EAXD

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=67fece6289a9dab4e2700a93004db45ac834523b

--------------------------------

[ Upstream commit 3f79f6f6 ]

Cross-rename lacks a check when that would prevent exchanging a
directory and subvolume from different parent subvolume. This causes
data inconsistencies and is caught before commit by tree-checker,
turning the filesystem to read-only.

Calling the renameat2 with RENAME_EXCHANGE flags like

  renameat2(AT_FDCWD, namesrc, AT_FDCWD, namedest, (1 << 1))

on two paths:

  namesrc = dir1/subvol1/dir2
 namedest = subvol2/subvol3

will cause key order problem with following write time tree-checker
report:

  [1194842.307890] BTRFS critical (device loop1): corrupt leaf: root=5 block=27574272 slot=10 ino=258, invalid previous key objectid, have 257 expect 258
  [1194842.322221] BTRFS info (device loop1): leaf 27574272 gen 8 total ptrs 11 free space 15444 owner 5
  [1194842.331562] BTRFS info (device loop1): refs 2 lock_owner 0 current 26561
  [1194842.338772]        item 0 key (256 1 0) itemoff 16123 itemsize 160
  [1194842.338793]                inode generation 3 size 16 mode 40755
  [1194842.338801]        item 1 key (256 12 256) itemoff 16111 itemsize 12
  [1194842.338809]        item 2 key (256 84 2248503653) itemoff 16077 itemsize 34
  [1194842.338817]                dir oid 258 type 2
  [1194842.338823]        item 3 key (256 84 2363071922) itemoff 16043 itemsize 34
  [1194842.338830]                dir oid 257 type 2
  [1194842.338836]        item 4 key (256 96 2) itemoff 16009 itemsize 34
  [1194842.338843]        item 5 key (256 96 3) itemoff 15975 itemsize 34
  [1194842.338852]        item 6 key (257 1 0) itemoff 15815 itemsize 160
  [1194842.338863]                inode generation 6 size 8 mode 40755
  [1194842.338869]        item 7 key (257 12 256) itemoff 15801 itemsize 14
  [1194842.338876]        item 8 key (257 84 2505409169) itemoff 15767 itemsize 34
  [1194842.338883]                dir oid 256 type 2
  [1194842.338888]        item 9 key (257 96 2) itemoff 15733 itemsize 34
  [1194842.338895]        item 10 key (258 12 256) itemoff 15719 itemsize 14
  [1194842.339163] BTRFS error (device loop1): block=27574272 write time tree block corruption detected
  [1194842.339245] ------------[ cut here ]------------
  [1194842.443422] WARNING: CPU: 6 PID: 26561 at fs/btrfs/disk-io.c:449 csum_one_extent_buffer+0xed/0x100 [btrfs]
  [1194842.511863] CPU: 6 PID: 26561 Comm: kworker/u17:2 Not tainted 5.14.0-rc3-git+ #793
  [1194842.511870] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008
  [1194842.511876] Workqueue: btrfs-worker-high btrfs_work_helper [btrfs]
  [1194842.511976] RIP: 0010:csum_one_extent_buffer+0xed/0x100 [btrfs]
  [1194842.512068] RSP: 0018:ffffa2c284d77da0 EFLAGS: 00010282
  [1194842.512074] RAX: 0000000000000000 RBX: 0000000000001000 RCX: ffff928867bd9978
  [1194842.512078] RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff928867bd9970
  [1194842.512081] RBP: ffff92876b958000 R08: 0000000000000001 R09: 00000000000c0003
  [1194842.512085] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
  [1194842.512088] R13: ffff92875f989f98 R14: 0000000000000000 R15: 0000000000000000
  [1194842.512092] FS:  0000000000000000(0000) GS:ffff928867a00000(0000) knlGS:0000000000000000
  [1194842.512095] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [1194842.512099] CR2: 000055f5384da1f0 CR3: 0000000102fe4000 CR4: 00000000000006e0
  [1194842.512103] Call Trace:
  [1194842.512128]  ? run_one_async_free+0x10/0x10 [btrfs]
  [1194842.631729]  btree_csum_one_bio+0x1ac/0x1d0 [btrfs]
  [1194842.631837]  run_one_async_start+0x18/0x30 [btrfs]
  [1194842.631938]  btrfs_work_helper+0xd5/0x1d0 [btrfs]
  [1194842.647482]  process_one_work+0x262/0x5e0
  [1194842.647520]  worker_thread+0x4c/0x320
  [1194842.655935]  ? process_one_work+0x5e0/0x5e0
  [1194842.655946]  kthread+0x135/0x160
  [1194842.655953]  ? set_kthread_struct+0x40/0x40
  [1194842.655965]  ret_from_fork+0x1f/0x30
  [1194842.672465] irq event stamp: 1729
  [1194842.672469] hardirqs last  enabled at (1735): [<ffffffffbd1104f5>] console_trylock_spinning+0x185/0x1a0
  [1194842.672477] hardirqs last disabled at (1740): [<ffffffffbd1104cc>] console_trylock_spinning+0x15c/0x1a0
  [1194842.672482] softirqs last  enabled at (1666): [<ffffffffbdc002e1>] __do_softirq+0x2e1/0x50a
  [1194842.672491] softirqs last disabled at (1651): [<ffffffffbd08aab7>] __irq_exit_rcu+0xa7/0xd0

The corrupted data will not be written, and filesystem can be unmounted
and mounted again (all changes since the last commit will be lost).

Add the missing check for new_ino so that all non-subvolumes must reside
under the same parent subvolume. There's an exception allowing to
exchange two subvolumes from any parents as the directory representing a
subvolume is only a logical link and does not have any other structures
related to the parent subvolume, unlike files, directories etc, that
are always in the inode namespace of the parent subvolume.

Fixes: cdd1fedf ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
CC: stable@vger.kernel.org # 4.7+
Reviewed-by: NNikolay Borisov <nborisov@suse.com>
Signed-off-by: NNeilBrown <neilb@suse.de>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Acked-by: NWeilong Chen <chenweilong@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

1dbbf013

ceph: take snap_empty_lock atomically with snaprealm refcount change · 0208ec5b

由 Jeff Layton 提交于 10月 19, 2021

stable inclusion
from stable-5.10.60
commit 2fe07584a6236d22be17f3866c4c45e0a3058d2a
bugzilla: 177018 https://gitee.com/openeuler/kernel/issues/I4EAUG

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=2fe07584a6236d22be17f3866c4c45e0a3058d2a

--------------------------------

commit 8434ffe7 upstream.

There is a race in ceph_put_snap_realm. The change to the nref and the
spinlock acquisition are not done atomically, so you could decrement
nref, and before you take the spinlock, the nref is incremented again.
At that point, you end up putting it on the empty list when it
shouldn't be there. Eventually __cleanup_empty_realms runs and frees
it when it's still in-use.

Fix this by protecting the 1->0 transition with atomic_dec_and_lock,
and just drop the spinlock if we can get the rwsem.

Because these objects can also undergo a 0->1 refcount transition, we
must protect that change as well with the spinlock. Increment locklessly
unless the value is at 0, in which case we take the spinlock, increment
and then take it off the empty list if it did the 0->1 transition.

With these changes, I'm removing the dout() messages from these
functions, as well as in __put_snap_realm. They've always been racy, and
it's better to not print values that may be misleading.

Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/46419Reported-by: NMark Nelson <mnelson@redhat.com>
Signed-off-by: NJeff Layton <jlayton@kernel.org>
Reviewed-by: NLuis Henriques <lhenriques@suse.de>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Acked-by: NWeilong Chen <chenweilong@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

0208ec5b

ceph: clean up locking annotation for ceph_get_snap_realm and __lookup_snap_realm · a0fc0d16

由 Jeff Layton 提交于 10月 19, 2021

stable inclusion
from stable-5.10.60
commit a23aced54c2c9053a0956fc1a8a6d7c0a0fff96f
bugzilla: 177018 https://gitee.com/openeuler/kernel/issues/I4EAUG

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=a23aced54c2c9053a0956fc1a8a6d7c0a0fff96f

--------------------------------

commit df2c0cb7 upstream.

They both say that the snap_rwsem must be held for write, but I don't
see any real reason for it, and it's not currently always called that
way.

The lookup is just walking the rbtree, so holding it for read should be
fine there. The "get" is bumping the refcount and (possibly) removing
it from the empty list. I see no need to hold the snap_rwsem for write
for that.
Signed-off-by: NJeff Layton <jlayton@kernel.org>
Reviewed-by: NIlya Dryomov <idryomov@gmail.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Acked-by: NWeilong Chen <chenweilong@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

a0fc0d16

ceph: add some lockdep assertions around snaprealm handling · 638d92e3

由 Jeff Layton 提交于 10月 19, 2021

stable inclusion
from stable-5.10.60
commit b0efc93271caf3868885a953f7f71c3aced7ba61
bugzilla: 177018 https://gitee.com/openeuler/kernel/issues/I4EAUG

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=b0efc93271caf3868885a953f7f71c3aced7ba61

--------------------------------

commit a6862e67 upstream.

Turn some comments into lockdep asserts.
Signed-off-by: NJeff Layton <jlayton@kernel.org>
Reviewed-by: NIlya Dryomov <idryomov@gmail.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Acked-by: NWeilong Chen <chenweilong@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

638d92e3

vboxsf: Add support for the atomic_open directory-inode op · 32e78d1f

由 Hans de Goede 提交于 10月 19, 2021

stable inclusion
from stable-5.10.60
commit dcdb587ac470f1c2476688318c81897f3ee1c233
bugzilla: 177018 https://gitee.com/openeuler/kernel/issues/I4EAUG

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=dcdb587ac470f1c2476688318c81897f3ee1c233

--------------------------------

commit 52dfd86a upstream.

Opening a new file is done in 2 steps on regular filesystems:

1. Call the create inode-op on the parent-dir to create an inode
to hold the meta-data related to the file.
2. Call the open file-op to get a handle for the file.

vboxsf however does not really use disk-backed inodes because it
is based on passing through file-related system-calls through to
the hypervisor. So both steps translate to an open(2) call being
passed through to the hypervisor. With the handle returned by
the first call immediately being closed again.

Making 2 open calls for a single open(..., O_CREATE, ...) calls
has 2 problems:

a) It is not really efficient.
b) It actually breaks some apps.

An example of b) is doing a git clone inside a vboxsf mount.
When git clone tries to create a tempfile to store the pak
files which is downloading the following happens:

1. vboxsf_dir_mkfile() gets called with a mode of 0444 and succeeds.
2. vboxsf_file_open() gets called with file->f_flags containing
O_RDWR. When the host is a Linux machine this fails because doing
a open(..., O_RDWR) on a file which exists and has mode 0444 results
in an -EPERM error.

Other network-filesystems and fuse avoid the problem of needing to
pass 2 open() calls to the other side by using the atomic_open
directory-inode op.

This commit fixes git clone not working inside a vboxsf mount,
by adding support for the atomic_open directory-inode op.
As an added bonus this should also make opening new files faster.

The atomic_open implementation is modelled after the atomic_open
implementations from the 9p and fuse code.

Fixes: 0fd16957 ("fs: Add VirtualBox guest shared folder (vboxsf) support")
Reported-by: NLudovic Pouzenc <bugreports@pouzenc.fr>
Signed-off-by: NHans de Goede <hdegoede@redhat.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Acked-by: NWeilong Chen <chenweilong@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

32e78d1f

vboxsf: Add vboxsf_[create|release]_sf_handle() helpers · 703a5747

由 Hans de Goede 提交于 10月 19, 2021

stable inclusion
from stable-5.10.60
commit 7cd14c1a7fed9fd5669c8578abd34ef35d4afc1a
bugzilla: 177018 https://gitee.com/openeuler/kernel/issues/I4EAUG

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=7cd14c1a7fed9fd5669c8578abd34ef35d4afc1a

--------------------------------

commit 02f840f9 upstream.

Factor out the code to create / release a struct vboxsf_handle into
2 new helper functions.

This is a preparation patch for adding atomic_open support.

Fixes: 0fd16957 ("fs: Add VirtualBox guest shared folder (vboxsf) support")
Signed-off-by: NHans de Goede <hdegoede@redhat.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Acked-by: NWeilong Chen <chenweilong@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

703a5747

ceph: reduce contention in ceph_check_delayed_caps() · 464616e1

由 Luis Henriques 提交于 10月 19, 2021

stable inclusion
from stable-5.10.60
commit f3fcf9d1b759915dcc1a93bceada910287c54e76
bugzilla: 177018 https://gitee.com/openeuler/kernel/issues/I4EAUG

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=f3fcf9d1b759915dcc1a93bceada910287c54e76

--------------------------------

commit bf2ba432 upstream.

Function ceph_check_delayed_caps() is called from the mdsc->delayed_work
workqueue and it can be kept looping for quite some time if caps keep
being added back to the mdsc->cap_delay_list.  This may result in the
watchdog tainting the kernel with the softlockup flag.

This patch breaks this loop if the caps have been recently (i.e. during
the loop execution).  Any new caps added to the list will be handled in
the next run.

Also, allow schedule_delayed() callers to explicitly set the delay value
instead of defaulting to 5s, so we can ensure that it runs soon
afterward if it looks like there is more work.

Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/46284Signed-off-by: NLuis Henriques <lhenriques@suse.de>
Reviewed-by: NJeff Layton <jlayton@kernel.org>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Acked-by: NWeilong Chen <chenweilong@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

464616e1

openeuler / Kernel 接近 2 年 前同步成功

openeuler / Kernel
接近 2 年前同步成功