提交 · 9e62ccec3ba0a17c8050ea78500dfdd0e4c5c0cc · openeuler / Kernel

02 4月, 2020 1 次提交

powerpc: Add back __ARCH_WANT_SYS_LLSEEK macro · 9e62ccec

由 Michal Suchanek 提交于 3月 20, 2020

This partially reverts commit caf6f9c8 ("asm-generic: Remove
unneeded __ARCH_WANT_SYS_LLSEEK macro")

When CONFIG_COMPAT is disabled on ppc64 the kernel does not build.

There is resistance to both removing the llseek syscall from the 64bit
syscall tables and building the llseek interface unconditionally.
Signed-off-by: NMichal Suchanek <msuchanek@suse.de>
Reviewed-by: NArnd Bergmann <arnd@arndb.de>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/lkml/20190828151552.GA16855@infradead.org/
Link: https://lore.kernel.org/lkml/20190829214319.498c7de2@naga/
Link: https://lore.kernel.org/r/dd4575c51e31766e87f7e7fa121d099ab78d3290.1584699455.git.msuchanek@suse.de

9e62ccec

19 2月, 2020 1 次提交

sysfs: Wrap __compat_only_sysfs_link_entry_to_kobj function to change the symlink name · 9255782f

由 Sourabh Jain 提交于 12月 11, 2019

The __compat_only_sysfs_link_entry_to_kobj function creates a symlink
to a kobject but doesn't provide an option to change the symlink file
name.

This patch adds a wrapper function compat_only_sysfs_link_entry_to_kobj
that extends the __compat_only_sysfs_link_entry_to_kobj functionality
which allows function caller to customize the symlink name.
Signed-off-by: NSourabh Jain <sourabhjain@linux.ibm.com>
[mpe: Fix compile error when CONFIG_SYSFS=n]
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191211160910.21656-3-sourabhjain@linux.ibm.com

9255782f

15 2月, 2020 3 次提交

ext4: improve explanation of a mount failure caused by a misconfigured kernel · d65d87a0

由 Theodore Ts'o 提交于 2月 14, 2020

If CONFIG_QFMT_V2 is not enabled, but CONFIG_QUOTA is enabled, when a
user tries to mount a file system with the quota or project quota
enabled, the kernel will emit a very confusing messsage:

    EXT4-fs warning (device vdc): ext4_enable_quotas:5914: Failed to enable quota tracking (type=0, err=-3). Please run e2fsck to fix.
    EXT4-fs (vdc): mount failed

We will now report an explanatory message indicating which kernel
configuration options have to be enabled, to avoid customer/sysadmin
confusion.

Link: https://lore.kernel.org/r/20200215012738.565735-1-tytso@mit.edu
Google-Bug-Id: 149093531
Fixes: 7c319d32 ("ext4: make quota as first class supported feature")
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org

d65d87a0

cifs: make sure we do not overflow the max EA buffer size · 85db6b7a

由 Ronnie Sahlberg 提交于 2月 13, 2020

RHBZ: 1752437

Before we add a new EA we should check that this will not overflow
the maximum buffer we have available to read the EAs back.
Otherwise we can get into a situation where the EAs are so big that
we can not read them back to the client and thus we can not list EAs
anymore or delete them.
Signed-off-by: NRonnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: NSteve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>

85db6b7a

cifs: enable change notification for SMB2.1 dialect · 2c6251ad

由 Steve French 提交于 2月 12, 2020

It was originally enabled only for SMB3 or later dialects, but
had requests to add it to SMB2.1 mounts as well given the
large number of systems at that dialect level.
Signed-off-by: NSteve French <stfrench@microsoft.com>
Reported-by: NL Walsh <cifs@tlinx.org>
Acked-by: NRonnie Sahlberg <lsahlber@redhat.com>

2c6251ad

14 2月, 2020 11 次提交

io_uring: prune request from overflow list on flush · 2ca10259

由 Jens Axboe 提交于 2月 13, 2020

Carter reported an issue where he could produce a stall on ring exit,
when we're cleaning up requests that match the given file table. For
this particular test case, a combination of a few things caused the
issue:

- The cq ring was overflown
- The request being canceled was in the overflow list

The combination of the above means that the cq overflow list holds a
reference to the request. The request is canceled correctly, but since
the overflow list holds a reference to it, the final put won't happen.
Since the final put doesn't happen, the request remains in the inflight.
Hence we never finish the cancelation flush.

Fix this by removing requests from the overflow list if we're canceling
them.

Cc: stable@vger.kernel.org # 5.5
Reported-by: NCarter Li 李通洲 <carter.li@eoitek.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

2ca10259

NFSv4: Ensure the delegation cred is pinned when we call delegreturn · 5d63944f

由 Trond Myklebust 提交于 2月 13, 2020

Ensure we don't release the delegation cred during the call to
nfs4_proc_delegreturn().

Fixes: ee05f456 ("NFSv4: Fix races between open and delegreturn")
Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

5d63944f

NFSv4: Ensure the delegation is pinned in nfs_do_return_delegation() · 8c75593c

由 Trond Myklebust 提交于 2月 13, 2020

The call to nfs_do_return_delegation() needs to be taken without
any RCU locks. Add a refcount to make sure the delegation remains
pinned in memory until we're done.

Fixes: ee05f456 ("NFSv4: Fix races between open and delegreturn")
Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

8c75593c

NFSv4.1 make cachethis=no for writes · cd1b659d

由 Olga Kornievskaia 提交于 2月 12, 2020

Turning caching off for writes on the server should improve performance.

Fixes: fba83f34 ("NFS: Pass "privileged" value to nfs4_init_sequence()")
Signed-off-by: NOlga Kornievskaia <kolga@netapp.com>
Reviewed-by: NTrond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

cd1b659d

jbd2: do not clear the BH_Mapped flag when forgetting a metadata buffer · c96dceea

由 zhangyi (F) 提交于 2月 13, 2020

Commit 904cdbd4 ("jbd2: clear dirty flag when revoking a buffer from
an older transaction") set the BH_Freed flag when forgetting a metadata
buffer which belongs to the committing transaction, it indicate the
committing process clear dirty bits when it is done with the buffer. But
it also clear the BH_Mapped flag at the same time, which may trigger
below NULL pointer oops when block_size < PAGE_SIZE.

rmdir 1             kjournald2                 mkdir 2
                    jbd2_journal_commit_transaction
		    commit transaction N
jbd2_journal_forget
set_buffer_freed(bh1)
                    jbd2_journal_commit_transaction
                     commit transaction N+1
                     ...
                     clear_buffer_mapped(bh1)
                                               ext4_getblk(bh2 ummapped)
                                               ...
                                               grow_dev_page
                                                init_page_buffers
                                                 bh1->b_private=NULL
                                                 bh2->b_private=NULL
                     jbd2_journal_put_journal_head(jh1)
                      __journal_remove_journal_head(hb1)
		       jh1 is NULL and trigger oops

*) Dir entry block bh1 and bh2 belongs to one page, and the bh2 has
   already been unmapped.

For the metadata buffer we forgetting, we should always keep the mapped
flag and clear the dirty flags is enough, so this patch pick out the
these buffers and keep their BH_Mapped flag.

Link: https://lore.kernel.org/r/20200213063821.30455-3-yi.zhang@huawei.com
Fixes: 904cdbd4 ("jbd2: clear dirty flag when revoking a buffer from an older transaction")
Reviewed-by: NJan Kara <jack@suse.cz>
Signed-off-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org

c96dceea

jbd2: move the clearing of b_modified flag to the journal_unmap_buffer() · 6a66a7de

由 zhangyi (F) 提交于 2月 13, 2020

There is no need to delay the clearing of b_modified flag to the
transaction committing time when unmapping the journalled buffer, so
just move it to the journal_unmap_buffer().

Link: https://lore.kernel.org/r/20200213063821.30455-2-yi.zhang@huawei.comReviewed-by: NJan Kara <jack@suse.cz>
Signed-off-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org

6a66a7de

ext4: add cond_resched() to ext4_protect_reserved_inode · af133ade

由 Shijie Luo 提交于 2月 10, 2020

When journal size is set too big by "mkfs.ext4 -J size=", or when
we mount a crafted image to make journal inode->i_size too big,
the loop, "while (i < num)", holds cpu too long. This could cause
soft lockup.

[  529.357541] Call trace:
[  529.357551]  dump_backtrace+0x0/0x198
[  529.357555]  show_stack+0x24/0x30
[  529.357562]  dump_stack+0xa4/0xcc
[  529.357568]  watchdog_timer_fn+0x300/0x3e8
[  529.357574]  __hrtimer_run_queues+0x114/0x358
[  529.357576]  hrtimer_interrupt+0x104/0x2d8
[  529.357580]  arch_timer_handler_virt+0x38/0x58
[  529.357584]  handle_percpu_devid_irq+0x90/0x248
[  529.357588]  generic_handle_irq+0x34/0x50
[  529.357590]  __handle_domain_irq+0x68/0xc0
[  529.357593]  gic_handle_irq+0x6c/0x150
[  529.357595]  el1_irq+0xb8/0x140
[  529.357599]  __ll_sc_atomic_add_return_acquire+0x14/0x20
[  529.357668]  ext4_map_blocks+0x64/0x5c0 [ext4]
[  529.357693]  ext4_setup_system_zone+0x330/0x458 [ext4]
[  529.357717]  ext4_fill_super+0x2170/0x2ba8 [ext4]
[  529.357722]  mount_bdev+0x1a8/0x1e8
[  529.357746]  ext4_mount+0x44/0x58 [ext4]
[  529.357748]  mount_fs+0x50/0x170
[  529.357752]  vfs_kern_mount.part.9+0x54/0x188
[  529.357755]  do_mount+0x5ac/0xd78
[  529.357758]  ksys_mount+0x9c/0x118
[  529.357760]  __arm64_sys_mount+0x28/0x38
[  529.357764]  el0_svc_common+0x78/0x130
[  529.357766]  el0_svc_handler+0x38/0x78
[  529.357769]  el0_svc+0x8/0xc
[  541.356516] watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [mount:18674]

Link: https://lore.kernel.org/r/20200211011752.29242-1-luoshijie1@huawei.comReviewed-by: NJan Kara <jack@suse.cz>
Signed-off-by: NShijie Luo <luoshijie1@huawei.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org

af133ade

ext4: fix checksum errors with indexed dirs · 48a34311

由 Jan Kara 提交于 2月 10, 2020

DIR_INDEX has been introduced as a compat ext4 feature. That means that
even kernels / tools that don't understand the feature may modify the
filesystem. This works because for kernels not understanding indexed dir
format, internal htree nodes appear just as empty directory entries.
Index dir aware kernels then check the htree structure is still
consistent before using the data. This all worked reasonably well until
metadata checksums were introduced. The problem is that these
effectively made DIR_INDEX only ro-compatible because internal htree
nodes store checksums in a different place than normal directory blocks.
Thus any modification ignorant to DIR_INDEX (or just clearing
EXT4_INDEX_FL from the inode) will effectively cause checksum mismatch
and trigger kernel errors. So we have to be more careful when dealing
with indexed directories on filesystems with checksumming enabled.

1) We just disallow loading any directory inodes with EXT4_INDEX_FL when
DIR_INDEX is not enabled. This is harsh but it should be very rare (it
means someone disabled DIR_INDEX on existing filesystem and didn't run
e2fsck), e2fsck can fix the problem, and we don't want to answer the
difficult question: "Should we rather corrupt the directory more or
should we ignore that DIR_INDEX feature is not set?"

2) When we find out htree structure is corrupted (but the filesystem and
the directory should in support htrees), we continue just ignoring htree
information for reading but we refuse to add new entries to the
directory to avoid corrupting it more.

Link: https://lore.kernel.org/r/20200210144316.22081-1-jack@suse.cz
Fixes: dbe89444 ("ext4: Calculate and verify checksums for htree nodes")
Reviewed-by: NAndreas Dilger <adilger@dilger.ca>
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org

48a34311

ext4: fix support for inode sizes > 1024 bytes · 4f97a681

由 Theodore Ts'o 提交于 2月 06, 2020

A recent commit, 9803387c ("ext4: validate the
debug_want_extra_isize mount option at parse time"), moved mount-time
checks around.  One of those changes moved the inode size check before
the blocksize variable was set to the blocksize of the file system.
After 9803387c was set to the minimum allowable blocksize, which
in practice on most systems would be 1024 bytes.  This cuased file
systems with inode sizes larger than 1024 bytes to be rejected with a
message:

EXT4-fs (sdXX): unsupported inode size: 4096

Fixes: 9803387c ("ext4: validate the debug_want_extra_isize mount option at parse time")
Link: https://lore.kernel.org/r/20200206225252.GA3673@mit.eduReported-by: NHerbert Poetzl <herbert@13thfloor.at>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org

4f97a681

ext4: simplify checking quota limits in ext4_statfs() · 46d36880

由 Jan Kara 提交于 1月 30, 2020

Coverity reports that conditions checking quota limits in ext4_statfs()
contain dead code. Indeed it is right and current conditions can be
simplified.

Link: https://lore.kernel.org/r/20200130111148.10766-1-jack@suse.czReported-by: NCoverity <scan-admin@coverity.com>
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org

46d36880

ext4: don't assume that mmp_nodename/bdevname have NUL · 14c9ca05

由 Andreas Dilger 提交于 1月 26, 2020

Don't assume that the mmp_nodename and mmp_bdevname strings are NUL
terminated, since they are filled in by snprintf(), which is not
guaranteed to do so.

Link: https://lore.kernel.org/r/1580076215-1048-1-git-send-email-adilger@dilger.caSigned-off-by: NAndreas Dilger <adilger@dilger.ca>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org

14c9ca05

13 2月, 2020 10 次提交

cifs: Fix mode output in debugging statements · f52aa79d

由 Frank Sorenson 提交于 2月 12, 2020

A number of the debug statements output file or directory mode
in hex.  Change these to print using octal.
Signed-off-by: NFrank Sorenson <sorenson@redhat.com>
Signed-off-by: NSteve French <stfrench@microsoft.com>

f52aa79d

io-wq: don't call kXalloc_node() with non-online node · 7563439a

由 Jens Axboe 提交于 2月 11, 2020

Glauber reports a crash on init on a box he has:

 RIP: 0010:__alloc_pages_nodemask+0x132/0x340
 Code: 18 01 75 04 41 80 ce 80 89 e8 48 8b 54 24 08 8b 74 24 1c c1 e8 0c 48 8b 3c 24 83 e0 01 88 44 24 20 48 85 d2 0f 85 74 01 00 00 <3b> 77 08 0f 82 6b 01 00 00 48 89 7c 24 10 89 ea 48 8b 07 b9 00 02
 RSP: 0018:ffffb8be4d0b7c28 EFLAGS: 00010246
 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000e8e8
 RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000002080
 RBP: 0000000000012cc0 R08: 0000000000000000 R09: 0000000000000002
 R10: 0000000000000dc0 R11: ffff995c60400100 R12: 0000000000000000
 R13: 0000000000012cc0 R14: 0000000000000001 R15: ffff995c60db00f0
 FS:  00007f4d115ca900(0000) GS:ffff995c60d80000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000002088 CR3: 00000017cca66002 CR4: 00000000007606e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 PKRU: 55555554
 Call Trace:
  alloc_slab_page+0x46/0x320
  new_slab+0x9d/0x4e0
  ___slab_alloc+0x507/0x6a0
  ? io_wq_create+0xb4/0x2a0
  __slab_alloc+0x1c/0x30
  kmem_cache_alloc_node_trace+0xa6/0x260
  io_wq_create+0xb4/0x2a0
  io_uring_setup+0x97f/0xaa0
  ? io_remove_personalities+0x30/0x30
  ? io_poll_trigger_evfd+0x30/0x30
  do_syscall_64+0x5b/0x1c0
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f4d116cb1ed

which is due to the 'wqe' and 'worker' allocation being node affine.
But it isn't valid to call the node affine allocation if the node isn't
online.

Setup structures for even offline nodes, as usual, but skip them in
terms of thread setup to not waste resources. If the node isn't online,
just alloc memory with NUMA_NO_NODE.
Reported-by: NGlauber Costa <glauber@scylladb.com>
Tested-by: NGlauber Costa <glauber@scylladb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

7563439a

NFSv4: Fix revalidation of dentries with delegations · efeda80d

由 Trond Myklebust 提交于 2月 05, 2020

If a dentry was not initially looked up while we were holding a
delegation, then we do still need to revalidate that it still holds
the same name. If there are multiple hard links to the same file,
then all the hard links need validation.
Reported-by: NBenjamin Coddington <bcodding@redhat.com>
Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: NBenjamin Coddington <bcodding@redhat.com>
Tested-by: NBenjamin Coddington <bcodding@redhat.com>
[Anna: Put nfs_unset_verifier_delegated() under CONFIG_NFS_V4]
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

efeda80d

btrfs: sysfs, move device id directories to UUID/devinfo · 1b9867eb

由 Anand Jain 提交于 2月 12, 2020

Originally it was planned to create device id directories under
UUID/devinfo, but it got under UUID/devices by mistake. We really want
it under definfo so the bare device node names are not mixed with device
ids and are easy to enumerate.

Fixes: 668e48af ("btrfs: sysfs, add devid/dev_state kobject and device attributes")
Signed-off-by: NAnand Jain <anand.jain@oracle.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

1b9867eb

btrfs: sysfs, add UUID/devinfo kobject · a013d141

由 Anand Jain 提交于 2月 12, 2020

Create directory /sys/fs/btrfs/UUID/devinfo to hold devices directories
by the id (unlike /devices).
Signed-off-by: NAnand Jain <anand.jain@oracle.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

a013d141

Btrfs: fix race between shrinking truncate and fiemap · 28553fa9

由 Filipe Manana 提交于 2月 07, 2020

When there is a fiemap executing in parallel with a shrinking truncate
we can end up in a situation where we have extent maps for which we no
longer have corresponding file extent items. This is generally harmless
and at the moment the only consequences are missing file extent items
representing holes after we expand the file size again after the
truncate operation removed the prealloc extent items, and stale
information for future fiemap calls (reporting extents that no longer
exist or may have been reallocated to other files for example).

Consider the following example:

1) Our inode has a size of 128KiB, one 128KiB extent at file offset 0
   and a 1MiB prealloc extent at file offset 128KiB;

2) Task A starts doing a shrinking truncate of our inode to reduce it to
   a size of 64KiB. Before it searches the subvolume tree for file
   extent items to delete, it drops all the extent maps in the range
   from 64KiB to (u64)-1 by calling btrfs_drop_extent_cache();

3) Task B starts doing a fiemap against our inode. When looking up for
   the inode's extent maps in the range from 128KiB to (u64)-1, it
   doesn't find any in the inode's extent map tree, since they were
   removed by task A.  Because it didn't find any in the extent map
   tree, it scans the inode's subvolume tree for file extent items, and
   it finds the 1MiB prealloc extent at file offset 128KiB, then it
   creates an extent map based on that file extent item and adds it to
   inode's extent map tree (this ends up being done by
   btrfs_get_extent() <- btrfs_get_extent_fiemap() <-
   get_extent_skip_holes());

4) Task A then drops the prealloc extent at file offset 128KiB and
   shrinks the 128KiB extent file offset 0 to a length of 64KiB. The
   truncation operation finishes and we end up with an extent map
   representing a 1MiB prealloc extent at file offset 128KiB, despite we
   don't have any more that extent;

After this the two types of problems we have are:

1) Future calls to fiemap always report that a 1MiB prealloc extent
   exists at file offset 128KiB. This is stale information, no longer
   correct;

2) If the size of the file is increased, by a truncate operation that
   increases the file size or by a write into a file offset > 64KiB for
   example, we end up not inserting file extent items to represent holes
   for any range between 128KiB and 128KiB + 1MiB, since the hole
   expansion function, btrfs_cont_expand() will skip hole insertion for
   any range for which an extent map exists that represents a prealloc
   extent. This causes fsck to complain about missing file extent items
   when not using the NO_HOLES feature.

The second issue could be often triggered by test case generic/561 from
fstests, which runs fsstress and duperemove in parallel, and duperemove
does frequent fiemap calls.

Essentially the problems happens because fiemap does not acquire the
inode's lock while truncate does, and fiemap locks the file range in the
inode's iotree while truncate does not. So fix the issue by making
btrfs_truncate_inode_items() lock the file range from the new file size
to (u64)-1, so that it serializes with fiemap.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NFilipe Manana <fdmanana@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

28553fa9

btrfs: log message when rw remount is attempted with unclean tree-log · 10a3a3ed

由 David Sterba 提交于 2月 05, 2020

A remount to a read-write filesystem is not safe when there's tree-log
to be replayed. Files that could be opened until now might be affected
by the changes in the tree-log.

A regular mount is needed to replay the log so the filesystem presents
the consistent view with the pending changes included.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: NAnand Jain <anand.jain@oracle.com>
Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

10a3a3ed

btrfs: print message when tree-log replay starts · e8294f2f

由 David Sterba 提交于 2月 05, 2020

There's no logged information about tree-log replay although this is
something that points to previous unclean unmount. Other filesystems
report that as well.
Suggested-by: NChris Murphy <lists@colorremedies.com>
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: NAnand Jain <anand.jain@oracle.com>
Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

e8294f2f

Btrfs: fix race between using extent maps and merging them · ac05ca91

由 Filipe Manana 提交于 1月 31, 2020

We have a few cases where we allow an extent map that is in an extent map
tree to be merged with other extents in the tree. Such cases include the
unpinning of an extent after the respective ordered extent completed or
after logging an extent during a fast fsync. This can lead to subtle and
dangerous problems because when doing the merge some other task might be
using the same extent map and as consequence see an inconsistent state of
the extent map - for example sees the new length but has seen the old start
offset.

With luck this triggers a BUG_ON(), and not some silent bug, such as the
following one in __do_readpage():

  $ cat -n fs/btrfs/extent_io.c
  3061  static int __do_readpage(struct extent_io_tree *tree,
  3062                           struct page *page,
  (...)
  3127                  em = __get_extent_map(inode, page, pg_offset, cur,
  3128                                        end - cur + 1, get_extent, em_cached);
  3129                  if (IS_ERR_OR_NULL(em)) {
  3130                          SetPageError(page);
  3131                          unlock_extent(tree, cur, end);
  3132                          break;
  3133                  }
  3134                  extent_offset = cur - em->start;
  3135                  BUG_ON(extent_map_end(em) <= cur);
  (...)

Consider the following example scenario, where we end up hitting the
BUG_ON() in __do_readpage().

We have an inode with a size of 8KiB and 2 extent maps:

  extent A: file offset 0, length 4KiB, disk_bytenr = X, persisted on disk by
            a previous transaction

  extent B: file offset 4KiB, length 4KiB, disk_bytenr = X + 4KiB, not yet
            persisted but writeback started for it already. The extent map
	    is pinned since there's writeback and an ordered extent in
	    progress, so it can not be merged with extent map A yet

The following sequence of steps leads to the BUG_ON():

1) The ordered extent for extent B completes, the respective page gets its
   writeback bit cleared and the extent map is unpinned, at that point it
   is not yet merged with extent map A because it's in the list of modified
   extents;

2) Due to memory pressure, or some other reason, the MM subsystem releases
   the page corresponding to extent B - btrfs_releasepage() is called and
   returns 1, meaning the page can be released as it's not dirty, not under
   writeback anymore and the extent range is not locked in the inode's
   iotree. However the extent map is not released, either because we are
   not in a context that allows memory allocations to block or because the
   inode's size is smaller than 16MiB - in this case our inode has a size
   of 8KiB;

3) Task B needs to read extent B and ends up __do_readpage() through the
   btrfs_readpage() callback. At __do_readpage() it gets a reference to
   extent map B;

4) Task A, doing a fast fsync, calls clear_em_loggin() against extent map B
   while holding the write lock on the inode's extent map tree - this
   results in try_merge_map() being called and since it's possible to merge
   extent map B with extent map A now (the extent map B was removed from
   the list of modified extents), the merging begins - it sets extent map
   B's start offset to 0 (was 4KiB), but before it increments the map's
   length to 8KiB (4kb + 4KiB), task A is at:

   BUG_ON(extent_map_end(em) <= cur);

   The call to extent_map_end() sees the extent map has a start of 0
   and a length still at 4KiB, so it returns 4KiB and 'cur' is 4KiB, so
   the BUG_ON() is triggered.

So it's dangerous to modify an extent map that is in the tree, because some
other task might have got a reference to it before and still using it, and
needs to see a consistent map while using it. Generally this is very rare
since most paths that lookup and use extent maps also have the file range
locked in the inode's iotree. The fsync path is pretty much the only
exception where we don't do it to avoid serialization with concurrent
reads.

Fix this by not allowing an extent map do be merged if if it's being used
by tasks other then the one attempting to merge the extent map (when the
reference count of the extent map is greater than 2).
Reported-by: Nryusuke1925 <st13s20@gm.ibaraki-ct.ac.jp>
Reported-by: NKoki Mitani <koki.mitani.xg@hco.ntt.co.jp>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206211
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NFilipe Manana <fdmanana@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

ac05ca91

btrfs: ref-verify: fix memory leaks · f311ade3

由 Wenwen Wang 提交于 2月 01, 2020

In btrfs_ref_tree_mod(), 'ref' and 'ra' are allocated through kzalloc() and
kmalloc(), respectively. In the following code, if an error occurs, the
execution will be redirected to 'out' or 'out_unlock' and the function will
be exited. However, on some of the paths, 'ref' and 'ra' are not
deallocated, leading to memory leaks. For example, if 'action' is
BTRFS_ADD_DELAYED_EXTENT, add_block_entry() will be invoked. If the return
value indicates an error, the execution will be redirected to 'out'. But,
'ref' is not deallocated on this path, causing a memory leak.

To fix the above issues, deallocate both 'ref' and 'ra' before exiting from
the function when an error is encountered.

CC: stable@vger.kernel.org # 4.15+
Signed-off-by: NWenwen Wang <wenwen@cs.uga.edu>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

f311ade3

12 2月, 2020 3 次提交

ceph: noacl mount option is effectively ignored · 3b20bc2f

由 Xiubo Li 提交于 2月 11, 2020

For the old mount API, the module parameters parseing function will
be called in ceph_mount() and also just after the default posix acl
flag set, so we can control to enable/disable it via the mount option.

But for the new mount API, it will call the module parameters
parseing function before ceph_get_tree(), so the posix acl will always
be enabled.

Fixes: 82995cc6 ("libceph, rbd, ceph: convert to use the new mount API")
Signed-off-by: NXiubo Li <xiubli@redhat.com>
Reviewed-by: NIlya Dryomov <idryomov@gmail.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

3b20bc2f

ceph: canonicalize server path in place · b27a939e

由 Ilya Dryomov 提交于 2月 10, 2020

syzbot reported that 4fbc0c71 ("ceph: remove the extra slashes in
the server path") had caused a regression where an allocation could be
done under a spinlock -- compare_mount_options() is called by sget_fc()
with sb_lock held.

We don't really need the supplied server path, so canonicalize it
in place and compare it directly.  To make this work, the leading
slash is kept around and the logic in ceph_real_mount() to skip it
is restored.  CEPH_MSG_CLIENT_SESSION now reports the same (i.e.
canonicalized) path, with the leading slash of course.

Fixes: 4fbc0c71 ("ceph: remove the extra slashes in the server path")
Reported-by: syzbot+98704a51af8e3d9425a9@syzkaller.appspotmail.com
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Reviewed-by: NJeff Layton <jlayton@kernel.org>

b27a939e

ceph: do not execute direct write in parallel if O_APPEND is specified · 8e4473bb

由 Xiubo Li 提交于 2月 03, 2020

In O_APPEND & O_DIRECT mode, the data from different writers will
be possibly overlapping each other since they take the shared lock.

For example, both Writer1 and Writer2 are in O_APPEND and O_DIRECT
mode:

          Writer1                         Writer2

     shared_lock()                   shared_lock()
     getattr(CAP_SIZE)               getattr(CAP_SIZE)
     iocb->ki_pos = EOF              iocb->ki_pos = EOF
     write(data1)
                                     write(data2)
     shared_unlock()                 shared_unlock()

The data2 will overlap the data1 from the same file offset, the
old EOF.

Switch to exclusive lock instead when O_APPEND is specified.
Signed-off-by: NXiubo Li <xiubli@redhat.com>
Reviewed-by: NJeff Layton <jlayton@kernel.org>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

8e4473bb

10 2月, 2020 7 次提交

NFSv4: Fix races between open and dentry revalidation · cf5b4059

由 Trond Myklebust 提交于 2月 05, 2020

We want to make sure that we revalidate the dentry if and only if
we've done an OPEN by filename.
In order to avoid races with remote changes to the directory on the
server, we want to save the verifier before calling OPEN. The exception
is if the server returned a delegation with our OPEN, as we then
know that the filename can't have changed on the server.
Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: NBenjamin Coddington <bcodding@gmail.com>
Tested-by: NBenjamin Coddington <bcodding@gmail.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

cf5b4059

NFS: Fix up directory verifier races · a1147b82

由 Trond Myklebust 提交于 2月 05, 2020

In order to avoid having our dentry revalidation race with an update
of the directory on the server, we need to store the verifier before
the RPC calls to LOOKUP and READDIR.
Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: NBenjamin Coddington <bcodding@gmail.com>
Tested-by: NBenjamin Coddington <bcodding@gmail.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

a1147b82

cifs: fix mount option display for sec=krb5i · 3f6166aa

由 Petr Pavlu 提交于 2月 10, 2020

Fix display for sec=krb5i which was wrongly interleaved by cruid,
resulting in string "sec=krb5,cruid=<...>i" instead of
"sec=krb5i,cruid=<...>".

Fixes: 96281b9e ("smb3: for kerberos mounts display the credential uid used")
Signed-off-by: NPetr Pavlu <petr.pavlu@suse.com>
Signed-off-by: NSteve French <stfrench@microsoft.com>

3f6166aa

io_uring: retain sockaddr_storage across send/recvmsg async punt · b537916c

由 Jens Axboe 提交于 2月 09, 2020

Jonas reports that he sometimes sees -97/-22 error returns from
sendmsg, if it gets punted async. This is due to not retaining the
sockaddr_storage between calls. Include that in the state we copy when
going async.

Cc: stable@vger.kernel.org # 5.3+
Reported-by: NJonas Bonn <jonas@norrbonn.se>
Tested-by: NJonas Bonn <jonas@norrbonn.se>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b537916c

io_uring: cancel pending async work if task exits · 6ab23144

由 Jens Axboe 提交于 2月 08, 2020

Normally we cancel all work we track, but for untracked work we could
leave the async worker behind until that work completes. This is totally
fine, but does leave resources pending after the task is gone until that
work completes.

Cancel work that this task queued up when it goes away.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

6ab23144

io-wq: add io_wq_cancel_pid() to cancel based on a specific pid · 36282881

由 Jens Axboe 提交于 2月 08, 2020

Add a helper that allows the caller to cancel work based on what mm
it belongs to. This allows io_uring to cancel work from a given
task or thread when it exits.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

36282881

io-wq: make io_wqe_cancel_work() take a match handler · 00bcda13

由 Jens Axboe 提交于 2月 08, 2020

We want to use the cancel functionality for canceling based on not
just the work itself. Instead of matching on the work address
manually, allow a match handler to tell us if we found the right work
item or not.

No functional changes in this patch.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

00bcda13

09 2月, 2020 4 次提交

fs: Add VirtualBox guest shared folder (vboxsf) support · 0fd16957

由 Hans de Goede 提交于 12月 12, 2019

VirtualBox hosts can share folders with guests, this commit adds a
VFS driver implementing the Linux-guest side of this, allowing folders
exported by the host to be mounted under Linux.

This driver depends on the guest <-> host IPC functions exported by
the vboxguest driver.
Acked-by: NChristoph Hellwig <hch@infradead.org>
Signed-off-by: NHans de Goede <hdegoede@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

0fd16957

io_uring: fix openat/statx's filename leak · 0bdbdd08

由 Pavel Begunkov 提交于 2月 08, 2020

As in the previous patch, make openat*_prep() and statx_prep() handle
double preparation to avoid resource leakage.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

0bdbdd08

io_uring: fix double prep iovec leak · 5f798bea

由 Pavel Begunkov 提交于 2月 08, 2020

Requests may be prepared multiple times with ->io allocated (i.e. async
prepared). Preparation functions don't handle it and forget about
previously allocated resources. This may happen in case of:
- spurious defer_check
- non-head (i.e. async prepared) request executed in sync (via nxt).

Make the handlers check, whether they already allocated resources, which
is true IFF REQ_F_NEED_CLEANUP is set.

Cc: stable@vger.kernel.org # 5.5
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

5f798bea

io_uring: fix async close() with f_op->flush() · a93b3331

由 Pavel Begunkov 提交于 2月 08, 2020

First, io_close() misses filp_close() and io_cqring_add_event(), when
f_op->flush is defined. That's because in this case it will
io_queue_async_work() itself not grabbing files, so the corresponding
chunk in io_close_finish() won't be executed.

Second, when submitted through io_wq_submit_work(), it will do
filp_close() and *_add_event() twice: first inline in io_close(),
and the second one in call to io_close_finish() from io_close().
The second one will also fire, because it was submitted async through
generic path, and so have grabbed files.

And the last nice thing is to remove this weird pilgrimage with checking
work/old_work and casting it to nxt. Just use a helper instead.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

a93b3331

openeuler / Kernel 接近 2 年 前同步成功

openeuler / Kernel
接近 2 年前同步成功