1. 26 Feb, 2021 2 commits
    • I
      btrfs: use copy_highpage() instead of 2 kmaps() · 80cc8384
      Ira Weiny committed
      There are many places where kmap/memmove/kunmap patterns occur.
      
      This pattern exists in the core common function copy_highpage().
      
      Use copy_highpage() to avoid open coding the use of kmap and to leverage
      the core function's use of kmap_local_page().
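
A minimal userspace sketch of the idea, not kernel code: `struct model_page`, `model_map()`, `model_unmap()`, `copy_open_coded()` and `model_copy_highpage()` are hypothetical stand-ins invented for illustration. It only shows how the two-mapping copy collapses into a single helper call.

```c
#include <string.h>

#define MODEL_PAGE_SIZE 4096

/* Hypothetical stand-in for struct page: a plain buffer, no real mapping. */
struct model_page {
    unsigned char data[MODEL_PAGE_SIZE];
};

/* Stand-ins for mapping and unmapping a page; assumptions for this sketch. */
static void *model_map(struct model_page *page) { return page->data; }
static void model_unmap(void *addr) { (void)addr; }

/* The open-coded pattern: two maps, a copy, two unmaps at every call site. */
static void copy_open_coded(struct model_page *dst, struct model_page *src)
{
    void *d = model_map(dst);
    void *s = model_map(src);

    memmove(d, s, MODEL_PAGE_SIZE);
    model_unmap(s);
    model_unmap(d);
}

/* The helper shape: the mapping details live in one place. */
static void model_copy_highpage(struct model_page *to, struct model_page *from)
{
    void *vfrom = model_map(from);
    void *vto = model_map(to);

    memcpy(vto, vfrom, MODEL_PAGE_SIZE);
    model_unmap(vto);
    model_unmap(vfrom);
}
```

A call site then shrinks from five statements to `model_copy_highpage(&dst, &src);`.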
      
      Development of this patch was aided by the following coccinelle script:
      
      // <smpl>
      // SPDX-License-Identifier: GPL-2.0-only
      // Find kmap/copypage/kunmap pattern and replace with copy_highpage calls
      //
      // NOTE: The expressions in the copy page version of this kmap pattern are
      // overly complex and so these all need individual attention.
      //
      // Confidence: Low
      // Copyright: (C) 2021 Intel Corporation
      // URL: http://coccinelle.lip6.fr/
      // Comments:
      // Options:
      
      //
      // Then a copy_page where we have 2 pages involved.
      //
      @ copy_page_rule @
      expression page, page2, To, From, Size;
      identifier ptr, ptr2;
      type VP, VP2;
      @@
      
      /* kmap */
      (
      -VP ptr = kmap(page);
      ...
      -VP2 ptr2 = kmap(page2);
      |
      -VP ptr = kmap_atomic(page);
      ...
      -VP2 ptr2 = kmap_atomic(page2);
      |
      -ptr = kmap(page);
      ...
      -ptr2 = kmap(page2);
      |
      -ptr = kmap_atomic(page);
      ...
      -ptr2 = kmap_atomic(page2);
      )
      
      // 1 or more copy versions of the entire page
      <+...
      (
      -copy_page(To, From);
      +copy_highpage(To, From);
      |
      -memmove(To, From, Size);
      +memmoveExtra(To, From, Size);
      )
      ...+>
      
      /* kunmap */
      (
      -kunmap(page2);
      ...
      -kunmap(page);
      |
      -kunmap(page);
      ...
      -kunmap(page2);
      |
      -kunmap_atomic(ptr2);
      ...
      -kunmap_atomic(ptr);
      )
      
      // Remove any pointers left unused
      @
      depends on copy_page_rule
      @
      identifier copy_page_rule.ptr;
      identifier copy_page_rule.ptr2;
      type VP, VP1;
      type VP2, VP21;
      @@
      
      -VP ptr;
      	... when != ptr;
      ? VP1 ptr;
      -VP2 ptr2;
      	... when != ptr2;
      ? VP21 ptr2;
      
      // </smpl>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • I
      btrfs: use memcpy_[to|from]_page() and kmap_local_page() · 3590ec58
      Ira Weiny committed
      There are many places where the pattern kmap/memcpy/kunmap occurs.
      
      This pattern was lifted to the core common functions
      memcpy_[to|from]_page().
      
      Use these new functions to reduce the code, eliminate direct uses of
      kmap, and leverage the new core functions' use of kmap_local_page().
      
      Also, there is 1 place where a kmap/memcpy is followed by an
      optional memset.  Here we leave the kmap open coded to avoid remapping
      the page but use kmap_local_page() directly.
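
As a rough userspace illustration of what the lifted helpers do, the sketch below models them with a plain buffer. Everything prefixed `model_` is hypothetical; only the argument order (page, offset, source, length, and destination, page, offset, length) mirrors the calls generated by the script in this commit message.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096

/* Hypothetical stand-in for struct page: a plain buffer, no real mapping. */
struct model_page {
    unsigned char data[MODEL_PAGE_SIZE];
};

/* Stand-ins for the local map/unmap pair; assumptions for this sketch. */
static void *model_map(struct model_page *page) { return page->data; }
static void model_unmap(void *addr) { (void)addr; }

/* Models memcpy_to_page(page, offset, from, len): map, copy in, unmap. */
static void model_memcpy_to_page(struct model_page *page, size_t offset,
                                 const char *from, size_t len)
{
    char *to = model_map(page);

    memcpy(to + offset, from, len);
    model_unmap(to);
}

/* Models memcpy_from_page(to, page, offset, len): map, copy out, unmap. */
static void model_memcpy_from_page(char *to, struct model_page *page,
                                   size_t offset, size_t len)
{
    char *from = model_map(page);

    memcpy(to, from + offset, len);
    model_unmap(from);
}
```

With helpers of this shape, a caller no longer holds a mapped pointer at all, which is why the script can also delete the temporary `ptr` declarations.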
      
      Development of this patch was aided by the coccinelle script:
      
      // <smpl>
      // SPDX-License-Identifier: GPL-2.0-only
      // Find kmap/memcpy/kunmap pattern and replace with memcpy*page calls
      //
      // NOTE: Offsets and other expressions may be more complex than what the script
      // will automatically generate.  Therefore a catchall rule is provided to find
      // the pattern which then must be evaluated by hand.
      //
      // Confidence: Low
      // Copyright: (C) 2021 Intel Corporation
      // URL: http://coccinelle.lip6.fr/
      // Comments:
      // Options:
      
      //
      // simple memcpy version
      //
      @ memcpy_rule1 @
      expression page, T, F, B, Off;
      identifier ptr;
      type VP;
      @@
      
      (
      -VP ptr = kmap(page);
      |
      -ptr = kmap(page);
      |
      -VP ptr = kmap_atomic(page);
      |
      -ptr = kmap_atomic(page);
      )
      <+...
      (
      -memcpy(ptr + Off, F, B);
      +memcpy_to_page(page, Off, F, B);
      |
      -memcpy(ptr, F, B);
      +memcpy_to_page(page, 0, F, B);
      |
      -memcpy(T, ptr + Off, B);
      +memcpy_from_page(T, page, Off, B);
      |
      -memcpy(T, ptr, B);
      +memcpy_from_page(T, page, 0, B);
      )
      ...+>
      (
      -kunmap(page);
      |
      -kunmap_atomic(ptr);
      )
      
      // Remove any pointers left unused
      @
      depends on memcpy_rule1
      @
      identifier memcpy_rule1.ptr;
      type VP, VP1;
      @@
      
      -VP ptr;
      	... when != ptr;
      ? VP1 ptr;
      
      //
      // Some callers kmap without a temp pointer
      //
      @ memcpy_rule2 @
      expression page, T, Off, F, B;
      @@
      
      <+...
      (
      -memcpy(kmap(page) + Off, F, B);
      +memcpy_to_page(page, Off, F, B);
      |
      -memcpy(kmap(page), F, B);
      +memcpy_to_page(page, 0, F, B);
      |
      -memcpy(T, kmap(page) + Off, B);
      +memcpy_from_page(T, page, Off, B);
      |
      -memcpy(T, kmap(page), B);
      +memcpy_from_page(T, page, 0, B);
      )
      ...+>
      -kunmap(page);
      // No need for the ptr variable removal
      
      //
      // Catch all
      //
      @ memcpy_rule3 @
      expression page;
      expression GenTo, GenFrom, GenSize;
      identifier ptr;
      type VP;
      @@
      
      (
      -VP ptr = kmap(page);
      |
      -ptr = kmap(page);
      |
      -VP ptr = kmap_atomic(page);
      |
      -ptr = kmap_atomic(page);
      )
      <+...
      (
      //
      // Some call sites have complex expressions within the memcpy
      // match a catch all to be evaluated by hand.
      //
      -memcpy(GenTo, GenFrom, GenSize);
      +memcpy_to_pageExtra(page, GenTo, GenFrom, GenSize);
      +memcpy_from_pageExtra(GenTo, page, GenFrom, GenSize);
      )
      ...+>
      (
      -kunmap(page);
      |
      -kunmap_atomic(ptr);
      )
      
      // Remove any pointers left unused
      @
      depends on memcpy_rule3
      @
      identifier memcpy_rule3.ptr;
      type VP, VP1;
      @@
      
      -VP ptr;
      	... when != ptr;
      ? VP1 ptr;
      
      // </smpl>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 06 Feb, 2021 2 commits
  3. 05 Feb, 2021 2 commits
  4. 04 Feb, 2021 1 commit
    • X
      io_uring: don't modify identity's files unless identity is cowed · d7e10d47
      Xiaoguang Wang committed
      Abaci Robot reported the following panic:
      BUG: kernel NULL pointer dereference, address: 0000000000000000
      PGD 800000010ef3f067 P4D 800000010ef3f067 PUD 10d9df067 PMD 0
      Oops: 0002 [#1] SMP PTI
      CPU: 0 PID: 1869 Comm: io_wqe_worker-0 Not tainted 5.11.0-rc3+ #1
      Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      RIP: 0010:put_files_struct+0x1b/0x120
      Code: 24 18 c7 00 f4 ff ff ff e9 4d fd ff ff 66 90 0f 1f 44 00 00 41 57 41 56 49 89 fe 41 55 41 54 55 53 48 83 ec 08 e8 b5 6b db ff  41 ff 0e 74 13 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f e9 9c
      RSP: 0000:ffffc90002147d48 EFLAGS: 00010293
      RAX: 0000000000000000 RBX: ffff88810d9a5300 RCX: 0000000000000000
      RDX: ffff88810d87c280 RSI: ffffffff8144ba6b RDI: 0000000000000000
      RBP: 0000000000000080 R08: 0000000000000001 R09: ffffffff81431500
      R10: ffff8881001be000 R11: 0000000000000000 R12: ffff88810ac2f800
      R13: ffff88810af38a00 R14: 0000000000000000 R15: ffff8881057130c0
      FS:  0000000000000000(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000000 CR3: 000000010dbaa002 CR4: 00000000003706f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       __io_clean_op+0x10c/0x2a0
       io_dismantle_req+0x3c7/0x600
       __io_free_req+0x34/0x280
       io_put_req+0x63/0xb0
       io_worker_handle_work+0x60e/0x830
       ? io_wqe_worker+0x135/0x520
       io_wqe_worker+0x158/0x520
       ? __kthread_parkme+0x96/0xc0
       ? io_worker_handle_work+0x830/0x830
       kthread+0x134/0x180
       ? kthread_create_worker_on_cpu+0x90/0x90
       ret_from_fork+0x1f/0x30
      Modules linked in:
      CR2: 0000000000000000
      ---[ end trace c358ca86af95b1e7 ]---
      
      I guess the case below can trigger the above panic: two threads operate
      on different io_uring ctxs and share the same sqthread identity. When
      one thread later exits, io_uring_cancel_task_requests() clears
      task->io_uring->identity->files to NULL in sqpoll mode, and the other
      ctx that uses the same identity then panics.
      
      In fact we don't need to clear task->io_uring->identity->files here.
      io_grab_identity() should handle identity->files changes well: if
      task->io_uring->identity->files is not equal to current->files,
      io_cow_identity() should handle the change.
      
      Cc: stable@vger.kernel.org # 5.5+
      Reported-by: Abaci Robot <abaci@linux.alibaba.com>
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. 02 Feb, 2021 1 commit
    • G
      smb3: Fix out-of-bounds bug in SMB2_negotiate() · 8d8d1dbe
      Gustavo A. R. Silva committed
      While addressing some warnings generated by -Warray-bounds, I found this
      bug that was introduced back in 2017:
      
        CC [M]  fs/cifs/smb2pdu.o
      fs/cifs/smb2pdu.c: In function ‘SMB2_negotiate’:
      fs/cifs/smb2pdu.c:822:16: warning: array subscript 1 is above array bounds
      of ‘__le16[1]’ {aka ‘short unsigned int[1]’} [-Warray-bounds]
        822 |   req->Dialects[1] = cpu_to_le16(SMB30_PROT_ID);
            |   ~~~~~~~~~~~~~^~~
      fs/cifs/smb2pdu.c:823:16: warning: array subscript 2 is above array bounds
      of ‘__le16[1]’ {aka ‘short unsigned int[1]’} [-Warray-bounds]
        823 |   req->Dialects[2] = cpu_to_le16(SMB302_PROT_ID);
            |   ~~~~~~~~~~~~~^~~
      fs/cifs/smb2pdu.c:824:16: warning: array subscript 3 is above array bounds
      of ‘__le16[1]’ {aka ‘short unsigned int[1]’} [-Warray-bounds]
        824 |   req->Dialects[3] = cpu_to_le16(SMB311_PROT_ID);
            |   ~~~~~~~~~~~~~^~~
      fs/cifs/smb2pdu.c:816:16: warning: array subscript 1 is above array bounds
      of ‘__le16[1]’ {aka ‘short unsigned int[1]’} [-Warray-bounds]
        816 |   req->Dialects[1] = cpu_to_le16(SMB302_PROT_ID);
            |   ~~~~~~~~~~~~~^~~
      
      At the time, the size of array _Dialects_ was changed from 1 to 3 in struct
      validate_negotiate_info_req, and then in 2019 it was changed from 3 to 4,
      but those changes were never made in struct smb2_negotiate_req, which has
      led to a three-and-a-half-year-old out-of-bounds bug in function
      SMB2_negotiate() (fs/cifs/smb2pdu.c).
      
      Fix this by increasing the size of array _Dialects_ in struct
      smb2_negotiate_req to 4.
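
The effect of the size change can be sketched in plain C. The two structs below are simplified, hypothetical stand-ins (the real struct smb2_negotiate_req has more fields), endianness conversion is omitted, and `set_dialects_fixed()` is an invented helper used only to show the bound.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the request before the fix: room for 1 dialect. */
struct negotiate_req_old {
    uint16_t DialectCount;
    uint16_t Dialects[1];
};

/* Simplified stand-in for the request after the fix: room for 4 dialects. */
struct negotiate_req_fixed {
    uint16_t DialectCount;
    uint16_t Dialects[4];
};

/* Number of elements the Dialects array can actually hold. */
#define DIALECT_CAPACITY(req) \
    (sizeof((req).Dialects) / sizeof((req).Dialects[0]))

/* Hypothetical helper: store dialect IDs, refusing to write past the array. */
static int set_dialects_fixed(struct negotiate_req_fixed *req,
                              const uint16_t *ids, size_t count)
{
    if (count > DIALECT_CAPACITY(*req))
        return -1;
    for (size_t i = 0; i < count; i++)
        req->Dialects[i] = ids[i];
    req->DialectCount = (uint16_t)count;
    return 0;
}
```

With the old layout, writing `Dialects[1]` through `Dialects[3]` lands outside the declared array, which is exactly what -Warray-bounds reports above.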
      
      Fixes: 9764c02f ("SMB3: Add support for multidialect negotiate (SMB2.1 and later)")
      Fixes: d5c7076b ("smb3: add smb3.1.1 to default dialect list")
      Cc: stable@vger.kernel.org
      Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
      Signed-off-by: Steve French <stfrench@microsoft.com>
  6. 30 Jan, 2021 1 commit
  7. 29 Jan, 2021 6 commits
    • R
      cifs: fix dfs domain referrals · 0d4873f9
      Ronnie Sahlberg committed
      The new mount API requires additional changes to how DFS
      is handled. Additional testing of DFS uncovered problems
      with domain-based DFS referrals (a follow-on patch addresses
      DFS links), which this patch addresses.
      Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
      Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
      Signed-off-by: Steve French <stfrench@microsoft.com>
    • P
      io_uring: reinforce cancel on flush during exit · 3a7efd1a
      Pavel Begunkov committed
      What 84965ff8 ("io_uring: if we see flush on exit, cancel related tasks")
      really wants is to cancel all relevant REQ_F_INFLIGHT requests reliably.
      That can be achieved by io_uring_cancel_files(), but we miss it when
      io_uring_flush() calls io_uring_cancel_task_requests(files=NULL),
      because that path goes through __io_uring_cancel_task_requests().
      
      Just always call io_uring_cancel_files() during cancel; it's good enough
      for now.
      
      Cc: stable@vger.kernel.org # 5.9+
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • S
      cifs: returning mount parm processing errors correctly · bd2f0b43
      Steve French committed
      During additional testing of the updated cifs.ko with the
      new mount API support, we found a few additional cases where
      we were logging errors, but not returning them to the user.
      
      For example:
         a) invalid security mechanisms
         b) invalid cache options
         c) unsupported rdma
         d) invalid smb dialect requested
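
The difference between logging an error and returning it can be sketched as follows. This is a hypothetical userspace parser, not the cifs.ko code: the function names, the enum, and the accepted values are assumptions chosen for illustration.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

enum cache_mode { CACHE_STRICT, CACHE_NONE, CACHE_LOOSE };

/* Buggy shape: a bad value is logged, but the caller still sees success. */
static int parse_cache_log_only(const char *value, enum cache_mode *mode)
{
    if (strcmp(value, "strict") == 0)
        *mode = CACHE_STRICT;
    else if (strcmp(value, "none") == 0)
        *mode = CACHE_NONE;
    else if (strcmp(value, "loose") == 0)
        *mode = CACHE_LOOSE;
    else
        fprintf(stderr, "bad cache= option: %s\n", value);
    return 0;
}

/* Fixed shape: the same log line, plus an error the caller can act on. */
static int parse_cache_checked(const char *value, enum cache_mode *mode)
{
    if (strcmp(value, "strict") == 0) {
        *mode = CACHE_STRICT;
    } else if (strcmp(value, "none") == 0) {
        *mode = CACHE_NONE;
    } else if (strcmp(value, "loose") == 0) {
        *mode = CACHE_LOOSE;
    } else {
        fprintf(stderr, "bad cache= option: %s\n", value);
        return -EINVAL;
    }
    return 0;
}
```

In the first variant a mount with an invalid value would appear to succeed with whatever mode was already set; in the second the failure reaches the user.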
      
      Fixes: 24e0a1ef ("cifs: switch to new mount api")
      Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
    • P
      io_uring: fix sqo ownership false positive warning · 70b2c60d
      Pavel Begunkov committed
      WARNING: CPU: 0 PID: 21359 at fs/io_uring.c:9042
          io_uring_cancel_task_requests+0xe55/0x10c0 fs/io_uring.c:9042
      Call Trace:
       io_uring_flush+0x47b/0x6e0 fs/io_uring.c:9227
       filp_close+0xb4/0x170 fs/open.c:1295
       close_files fs/file.c:403 [inline]
       put_files_struct fs/file.c:418 [inline]
       put_files_struct+0x1cc/0x350 fs/file.c:415
       exit_files+0x7e/0xa0 fs/file.c:435
       do_exit+0xc22/0x2ae0 kernel/exit.c:820
       do_group_exit+0x125/0x310 kernel/exit.c:922
       get_signal+0x427/0x20f0 kernel/signal.c:2773
       arch_do_signal_or_restart+0x2a8/0x1eb0 arch/x86/kernel/signal.c:811
       handle_signal_work kernel/entry/common.c:147 [inline]
       exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
       exit_to_user_mode_prepare+0x148/0x250 kernel/entry/common.c:201
       __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
       syscall_exit_to_user_mode+0x19/0x50 kernel/entry/common.c:302
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Now io_uring_cancel_task_requests() can be called directly rather than
      only through file notes, so remove the WARN_ONCE() there that gives us
      false positives. That check is not very important, and we catch it in
      other places.
      
      Fixes: 84965ff8 ("io_uring: if we see flush on exit, cancel related tasks")
      Cc: stable@vger.kernel.org # 5.9+
      Reported-by: syzbot+3e3d9bd0c6ce9efbc3ef@syzkaller.appspotmail.com
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • P
      io_uring: fix list corruption for splice file_get · f609cbb8
      Pavel Begunkov committed
      kernel BUG at lib/list_debug.c:29!
      Call Trace:
       __list_add include/linux/list.h:67 [inline]
       list_add include/linux/list.h:86 [inline]
       io_file_get+0x8cc/0xdb0 fs/io_uring.c:6466
       __io_splice_prep+0x1bc/0x530 fs/io_uring.c:3866
       io_splice_prep fs/io_uring.c:3920 [inline]
       io_req_prep+0x3546/0x4e80 fs/io_uring.c:6081
       io_queue_sqe+0x609/0x10d0 fs/io_uring.c:6628
       io_submit_sqe fs/io_uring.c:6705 [inline]
       io_submit_sqes+0x1495/0x2720 fs/io_uring.c:6953
       __do_sys_io_uring_enter+0x107d/0x1f30 fs/io_uring.c:9353
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      io_file_get() may be called from splice, and so REQ_F_INFLIGHT may
      already be set.
      
      Fixes: 02a13674 ("io_uring: account io_uring internal files as REQ_F_INFLIGHT")
      Cc: stable@vger.kernel.org # 5.9+
      Reported-by: syzbot+6879187cf57845801267@syzkaller.appspotmail.com
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • S
      cifs: fix mounts to subdirectories of target · c9b8cd6a
      Steve French committed
      The "prefixpath" mount option needs to be ignored. This was
      missed in the recent conversion to the new mount API
      (prefixpath would be set by the mount helper if mounting a
      subdirectory of the root of a share, e.g.
      //server/share/subdir).
      
      Fixes: 24e0a1ef ("cifs: switch to new mount api")
      Suggested-by: Ronnie Sahlberg <lsahlber@redhat.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
      Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
  8. 28 Jan, 2021 9 commits
    • A
      cifs: ignore auto and noauto options if given · 19d51588
      Adam Harvey committed
      In 24e0a1ef, the noauto and auto options were missed when migrating
      to the new mount API. As a result, users with noauto in their fstab
      mount options are now unable to mount cifs filesystems, as they'll
      receive an "Unknown parameter" error.
      
      This restores the old behaviour of ignoring noauto and auto if they're
      given.
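
One way to picture "ignore if given" is a parameter check that accepts certain names without acting on them. The sketch below is hypothetical userspace C, not the cifs parameter table: `ignored_params`, `handled_params`, and `check_mount_param()` are invented names, and "ro"/"rw" merely stand in for options that are really handled.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Names accepted but deliberately not acted upon (assumed list). */
static const char *const ignored_params[] = { "auto", "noauto" };

/* Names that a real parser would act upon (placeholder list). */
static const char *const handled_params[] = { "ro", "rw" };

/* Returns 1 if key appears in table, 0 otherwise. */
static int name_in(const char *key, const char *const *table, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(key, table[i]) == 0)
            return 1;
    return 0;
}

/* Returns 0 for handled or ignored names, -EINVAL for unknown ones. */
static int check_mount_param(const char *key)
{
    if (name_in(key, ignored_params,
                sizeof(ignored_params) / sizeof(ignored_params[0])))
        return 0;
    if (name_in(key, handled_params,
                sizeof(handled_params) / sizeof(handled_params[0])))
        return 0;
    return -EINVAL;
}
```

Without the ignored list, "noauto" would fall through to the unknown-name branch, matching the "Unknown parameter" failure described above.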
      
      Fixes: 24e0a1ef ("cifs: switch to new mount api")
      Signed-off-by: Adam Harvey <adam@adamharvey.name>
      Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
    • S
      ovl: implement volatile-specific fsync error behaviour · 335d3fc5
      Sargun Dhillon committed
      Overlayfs's volatile option allows the user to bypass all forced sync calls
      to the upperdir filesystem. This comes at the cost of safety. We can never
      ensure that the user's data is intact, but we can make a best effort to
      expose whether or not the data is likely to be in a bad state.
      
      The best way to handle this for the time being is that if an overlayfs's
      upperdir experiences an error after a volatile mount occurs, that error
      will be returned on fsync, fdatasync, sync, and syncfs. This is
      contrary to the traditional behaviour of VFS, which fails the call
      once, and only raises an error if a subsequent fsync error has occurred
      and been raised by the filesystem.
      
      One awkward aspect of the patch is that we have to manually set the
      superblock's errseq_t after the sync_fs callback as opposed to just
      returning an error from syncfs. This is because the call chain looks
      something like this:
      
      sys_syncfs ->
      	sync_filesystem ->
      		__sync_filesystem ->
      			/* The return value is ignored here */
      			sb->s_op->sync_fs(sb)
      			_sync_blockdev
      		/* Where the VFS fetches the error to raise to userspace */
      		errseq_check_and_advance
      
      Because of this we call errseq_set every time the sync_fs callback occurs.
      Due to the nature of this seen / unseen dichotomy, if the upperdir is in an
      inconsistent state at the initial mount time, overlayfs will refuse to
      mount, as overlayfs cannot get a snapshot of the upperdir's errseq that
      will increment on error until the user calls syncfs.
      Signed-off-by: Sargun Dhillon <sargun@sargun.me>
      Suggested-by: Amir Goldstein <amir73il@gmail.com>
      Reviewed-by: Amir Goldstein <amir73il@gmail.com>
      Fixes: c86243b0 ("ovl: provide a mount option "volatile"")
      Cc: stable@vger.kernel.org
      Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • A
      ovl: skip getxattr of security labels · 03fedf93
      Amir Goldstein committed
      When inode has no listxattr op of its own (e.g. squashfs) vfs_listxattr
      calls the LSM inode_listsecurity hooks to list the xattrs that LSMs will
      intercept in inode_getxattr hooks.
      
      When selinux LSM is installed but not initialized, it will list the
      security.selinux xattr in inode_listsecurity, but will not intercept it
      in inode_getxattr.  This results in -ENODATA for a getxattr call for an
      xattr returned by listxattr.
      
      This situation was manifested as overlayfs failure to copy up lower
      files from squashfs when selinux is built-in but not initialized,
      because ovl_copy_xattr() iterates the lower inode xattrs by
      vfs_listxattr() and vfs_getxattr().
      
      ovl_copy_xattr() skips copy up of security labels that are identified by
      inode_copy_up_xattr LSM hooks, but it does that after vfs_getxattr().
      Since we are not going to copy them, skip vfs_getxattr() of the security
      labels.
      Reported-by: Michael Labriola <michael.d.labriola@gmail.com>
      Tested-by: Michael Labriola <michael.d.labriola@gmail.com>
      Link: https://lore.kernel.org/linux-unionfs/2nv9d47zt7.fsf@aldarion.sourceruckus.org/
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • L
      ovl: fix dentry leak in ovl_get_redirect · e04527fe
      Liangyan committed
      We need to take d_parent->d_lock before calling dget_dlock(); otherwise
      d_lockref may be updated in parallel, as in the call trace below, which
      causes a dentry->d_lockref leak and risks a crash.
      
           CPU 0                                CPU 1
      ovl_set_redirect                       lookup_fast
        ovl_get_redirect                       __d_lookup
          dget_dlock
            //no lock protection here            spin_lock(&dentry->d_lock)
            dentry->d_lockref.count++            dentry->d_lockref.count++
      
      [   49.799059] PGD 800000061fed7067 P4D 800000061fed7067 PUD 61fec5067 PMD 0
      [   49.799689] Oops: 0002 [#1] SMP PTI
      [   49.800019] CPU: 2 PID: 2332 Comm: node Not tainted 4.19.24-7.20.al7.x86_64 #1
      [   49.800678] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8a46cfe 04/01/2014
      [   49.801380] RIP: 0010:_raw_spin_lock+0xc/0x20
      [   49.803470] RSP: 0018:ffffac6fc5417e98 EFLAGS: 00010246
      [   49.803949] RAX: 0000000000000000 RBX: ffff93b8da3446c0 RCX: 0000000a00000000
      [   49.804600] RDX: 0000000000000001 RSI: 000000000000000a RDI: 0000000000000088
      [   49.805252] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffff993cf040
      [   49.805898] R10: ffff93b92292e580 R11: ffffd27f188a4b80 R12: 0000000000000000
      [   49.806548] R13: 00000000ffffff9c R14: 00000000fffffffe R15: ffff93b8da3446c0
      [   49.807200] FS:  00007ffbedffb700(0000) GS:ffff93b927880000(0000) knlGS:0000000000000000
      [   49.807935] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   49.808461] CR2: 0000000000000088 CR3: 00000005e3f74006 CR4: 00000000003606a0
      [   49.809113] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   49.809758] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [   49.810410] Call Trace:
      [   49.810653]  d_delete+0x2c/0xb0
      [   49.810951]  vfs_rmdir+0xfd/0x120
      [   49.811264]  do_rmdir+0x14f/0x1a0
      [   49.811573]  do_syscall_64+0x5b/0x190
      [   49.811917]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [   49.812385] RIP: 0033:0x7ffbf505ffd7
      [   49.814404] RSP: 002b:00007ffbedffada8 EFLAGS: 00000297 ORIG_RAX: 0000000000000054
      [   49.815098] RAX: ffffffffffffffda RBX: 00007ffbedffb640 RCX: 00007ffbf505ffd7
      [   49.815744] RDX: 0000000004449700 RSI: 0000000000000000 RDI: 0000000006c8cd50
      [   49.816394] RBP: 00007ffbedffaea0 R08: 0000000000000000 R09: 0000000000017d0b
      [   49.817038] R10: 0000000000000000 R11: 0000000000000297 R12: 0000000000000012
      [   49.817687] R13: 00000000072823d8 R14: 00007ffbedffb700 R15: 00000000072823d8
      [   49.818338] Modules linked in: pvpanic cirrusfb button qemu_fw_cfg atkbd libps2 i8042
      [   49.819052] CR2: 0000000000000088
      [   49.819368] ---[ end trace 4e652b8aa299aa2d ]---
      [   49.819796] RIP: 0010:_raw_spin_lock+0xc/0x20
      [   49.821880] RSP: 0018:ffffac6fc5417e98 EFLAGS: 00010246
      [   49.822363] RAX: 0000000000000000 RBX: ffff93b8da3446c0 RCX: 0000000a00000000
      [   49.823008] RDX: 0000000000000001 RSI: 000000000000000a RDI: 0000000000000088
      [   49.823658] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffff993cf040
      [   49.825404] R10: ffff93b92292e580 R11: ffffd27f188a4b80 R12: 0000000000000000
      [   49.827147] R13: 00000000ffffff9c R14: 00000000fffffffe R15: ffff93b8da3446c0
      [   49.828890] FS:  00007ffbedffb700(0000) GS:ffff93b927880000(0000) knlGS:0000000000000000
      [   49.830725] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   49.832359] CR2: 0000000000000088 CR3: 00000005e3f74006 CR4: 00000000003606a0
      [   49.834085] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   49.835792] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      
      Cc: <stable@vger.kernel.org>
      Fixes: a6c60655 ("ovl: redirect on rename-dir")
      Signed-off-by: Liangyan <liangyan.peng@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • M
      ovl: avoid deadlock on directory ioctl · b854cc65
      Miklos Szeredi committed
      The function ovl_dir_real_file() currently uses the inode lock to serialize
      writes to the od->upperfile field.
      
      However, this function will get called by ovl_ioctl_set_flags(), which
      utilizes the inode lock too.  In this case ovl_dir_real_file() will try to
      claim a lock that is owned by a function in its call stack, which won't get
      released before ovl_dir_real_file() returns.
      
      Fix by replacing the open coded compare and exchange by an explicit atomic
      op.
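
The general technique, publishing a pointer once without holding a lock, can be sketched in userspace with C11 atomics. This is an illustration under assumptions, not the overlayfs code: `struct dir_file_model` and `install_upperfile()` are hypothetical names, and the kernel uses its own atomic primitives rather than `<stdatomic.h>`.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-directory state holding the cached file. */
struct dir_file_model {
    _Atomic(void *) upperfile;
};

/*
 * Try to install candidate as the cached file. If the slot was still NULL,
 * candidate wins and is returned. Otherwise the already installed pointer is
 * returned and *lost_race is set, so the caller knows to release its
 * candidate. No lock is taken, so this is safe to call while the caller
 * already holds one.
 */
static void *install_upperfile(struct dir_file_model *od, void *candidate,
                               int *lost_race)
{
    void *expected = NULL;

    if (atomic_compare_exchange_strong(&od->upperfile, &expected, candidate)) {
        *lost_race = 0;
        return candidate;
    }
    *lost_race = 1;
    return expected;
}
```

Because the update is a single atomic operation, a function further up the call stack can hold the inode lock without the self-deadlock described above.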
      
      Fixes: 61536bed ("ovl: support [S|G]ETFLAGS and FS[S|G]ETXATTR ioctls for directories")
      Cc: stable@vger.kernel.org # v5.10
      Reported-by: Icenowy Zheng <icenowy@aosc.io>
      Tested-by: Icenowy Zheng <icenowy@aosc.io>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • M
      ovl: perform vfs_getxattr() with mounter creds · 554677b9
      Miklos Szeredi committed
      The vfs_getxattr() in ovl_xattr_set() is used to check whether an xattr
      exists on a lower layer file that is to be removed.  If the xattr does not
      exist, then there is no need to copy up the file.
      
      This call of vfs_getxattr() wasn't wrapped in credential override, and this
      is probably okay.  But for consistency, wrap this instance as well.
      Reported-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • M
      ovl: add warning on user_ns mismatch · 9efb069d
      Miklos Szeredi committed
      Currently there's no way to create an overlay filesystem outside of the
      current user namespace.  Make sure that if this assumption changes it
      doesn't go unnoticed.
      Reported-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • H
      io_uring: fix flush cqring overflow list while TASK_INTERRUPTIBLE · 6195ba09
      Hao Xu committed
      Abaci reported the following warning:
      
      [   27.073425] do not call blocking ops when !TASK_RUNNING; state=1 set at [] prepare_to_wait_exclusive+0x3a/0xc0
      [   27.075805] WARNING: CPU: 0 PID: 951 at kernel/sched/core.c:7853 __might_sleep+0x80/0xa0
      [   27.077604] Modules linked in:
      [   27.078379] CPU: 0 PID: 951 Comm: a.out Not tainted 5.11.0-rc3+ #1
      [   27.079637] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [   27.080852] RIP: 0010:__might_sleep+0x80/0xa0
      [   27.081835] Code: 65 48 8b 04 25 80 71 01 00 48 8b 90 c0 15 00 00 48 8b 70 18 48 c7 c7 08 39 95 82 c6 05 f9 5f de 08 01 48 89 d1 e8 00 c6 fa ff  0b eb bf 41 0f b6 f5 48 c7 c7 40 23 c9 82 e8 f3 48 ec 00 eb a7
      [   27.084521] RSP: 0018:ffffc90000fe3ce8 EFLAGS: 00010286
      [   27.085350] RAX: 0000000000000000 RBX: ffffffff82956083 RCX: 0000000000000000
      [   27.086348] RDX: ffff8881057a0000 RSI: ffffffff8118cc9e RDI: ffff88813bc28570
      [   27.087598] RBP: 00000000000003a7 R08: 0000000000000001 R09: 0000000000000001
      [   27.088819] R10: ffffc90000fe3e00 R11: 00000000fffef9f0 R12: 0000000000000000
      [   27.089819] R13: 0000000000000000 R14: ffff88810576eb80 R15: ffff88810576e800
      [   27.091058] FS:  00007f7b144cf740(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
      [   27.092775] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   27.093796] CR2: 00000000022da7b8 CR3: 000000010b928002 CR4: 00000000003706f0
      [   27.094778] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   27.095780] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [   27.097011] Call Trace:
      [   27.097685]  __mutex_lock+0x5d/0xa30
      [   27.098565]  ? prepare_to_wait_exclusive+0x71/0xc0
      [   27.099412]  ? io_cqring_overflow_flush.part.101+0x6d/0x70
      [   27.100441]  ? lockdep_hardirqs_on_prepare+0xe9/0x1c0
      [   27.101537]  ? _raw_spin_unlock_irqrestore+0x2d/0x40
      [   27.102656]  ? trace_hardirqs_on+0x46/0x110
      [   27.103459]  ? io_cqring_overflow_flush.part.101+0x6d/0x70
      [   27.104317]  io_cqring_overflow_flush.part.101+0x6d/0x70
      [   27.105113]  io_cqring_wait+0x36e/0x4d0
      [   27.105770]  ? find_held_lock+0x28/0xb0
      [   27.106370]  ? io_uring_remove_task_files+0xa0/0xa0
      [   27.107076]  __x64_sys_io_uring_enter+0x4fb/0x640
      [   27.107801]  ? rcu_read_lock_sched_held+0x59/0xa0
      [   27.108562]  ? lockdep_hardirqs_on_prepare+0xe9/0x1c0
      [   27.109684]  ? syscall_enter_from_user_mode+0x26/0x70
      [   27.110731]  do_syscall_64+0x2d/0x40
      [   27.111296]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [   27.112056] RIP: 0033:0x7f7b13dc8239
      [   27.112663] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05  3d 01 f0 ff ff 73 01 c3 48 8b 0d 27 ec 2c 00 f7 d8 64 89 01 48
      [   27.115113] RSP: 002b:00007ffd6d7f5c88 EFLAGS: 00000286 ORIG_RAX: 00000000000001aa
      [   27.116562] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7b13dc8239
      [   27.117961] RDX: 000000000000478e RSI: 0000000000000000 RDI: 0000000000000003
      [   27.118925] RBP: 00007ffd6d7f5cb0 R08: 0000000020000040 R09: 0000000000000008
      [   27.119773] R10: 0000000000000001 R11: 0000000000000286 R12: 0000000000400480
      [   27.120614] R13: 00007ffd6d7f5d90 R14: 0000000000000000 R15: 0000000000000000
      [   27.121490] irq event stamp: 5635
      [   27.121946] hardirqs last  enabled at (5643): [] console_unlock+0x5c4/0x740
      [   27.123476] hardirqs last disabled at (5652): [] console_unlock+0x4e7/0x740
      [   27.125192] softirqs last  enabled at (5272): [] __do_softirq+0x3c5/0x5aa
      [   27.126430] softirqs last disabled at (5267): [] asm_call_irq_on_stack+0xf/0x20
      [   27.127634] ---[ end trace 289d7e28fa60f928 ]---
      
      This is caused by calling io_cqring_overflow_flush(), which may sleep,
      after calling prepare_to_wait_exclusive(), which sets the task state to
      TASK_INTERRUPTIBLE.
      Reported-by: Abaci <abaci@linux.alibaba.com>
      Fixes: 6c503150 ("io_uring: patch up IOPOLL overflow_flush sync")
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      6195ba09
    • M
      Revert "block: simplify set_init_blocksize" to regain lost performance · 8dc932d3
      Maxim Mikityanskiy committed
      The cited commit introduced a serious regression with SATA write speed,
      as found by bisecting. This patch reverts this commit, which restores
      write speed back to the values observed before this commit.
      
      The performance tests were done on a Helios4 NAS (2nd batch) with 4 HDDs
      (WD8003FFBX) using dd (bs=1M count=2000). "Direct" is a test with a
      single HDD, the rest are different RAID levels built over the first
      partitions of 4 HDDs. Test results are in MB/s, R is read, W is write.
      
                      | Direct | RAID0 | RAID10 f2 | RAID10 n2 | RAID6
      ----------------+--------+-------+-----------+-----------+--------
      9011495c        | R:256  | R:313 | R:276     | R:313     | R:323
      (before faulty) | W:254  | W:253 | W:195     | W:204     | W:117
      ----------------+--------+-------+-----------+-----------+--------
      5ff9f192        | R:257  | R:398 | R:312     | R:344     | R:391
      (faulty commit) | W:154  | W:122 | W:67.7    | W:66.6    | W:67.2
      ----------------+--------+-------+-----------+-----------+--------
      5.10.10         | R:256  | R:401 | R:312     | R:356     | R:375
      unpatched       | W:149  | W:123 | W:64      | W:64.1    | W:61.5
      ----------------+--------+-------+-----------+-----------+--------
      5.10.10         | R:255  | R:396 | R:312     | R:340     | R:393
      patched         | W:247  | W:274 | W:220     | W:225     | W:121
      
      Applying this patch doesn't hurt read performance, while it improves the
      write speed by 1.5x - 3.5x (more impact on RAID tests). The write speed
      is restored back to the state before the faulty commit, and even a bit
      higher in RAID tests (which aren't HDD-bound on this device) - that is
      likely related to other optimizations done between the faulty commit and
      5.10.10 which also improved the read speed.
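
The quoted speedup range can be checked against the table. The sketch below is only an illustration: the struct and function names are made up here, and the numbers are the write figures (MB/s) from the two 5.10.10 rows above.

```c
#include <assert.h>
#include <stddef.h>

/* Write throughput in MB/s, copied from the 5.10.10 rows of the table. */
struct write_result {
    const char *setup;
    double unpatched;
    double patched;
};

static const struct write_result write_results[] = {
    { "Direct",    149.0, 247.0 },
    { "RAID0",     123.0, 274.0 },
    { "RAID10 f2",  64.0, 220.0 },
    { "RAID10 n2",  64.1, 225.0 },
    { "RAID6",      61.5, 121.0 },
};

#define NUM_RESULTS (sizeof(write_results) / sizeof(write_results[0]))

/* Speedup factor of the patched kernel over the unpatched one. */
double write_speedup(size_t i)
{
    assert(i < NUM_RESULTS);
    return write_results[i].patched / write_results[i].unpatched;
}
```

With these inputs the single-disk case comes out at roughly 1.7x and the RAID10 cases at roughly 3.4x to 3.5x, in line with the range stated in the message.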
      Signed-off-by: Maxim Mikityanskiy <maxtram95@gmail.com>
      Fixes: 5ff9f192 ("block: simplify set_init_blocksize")
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8dc932d3
  9. 27 Jan, 2021 2 commits
    • P
      io_uring: fix wqe->lock/completion_lock deadlock · 907d1df3
      Pavel Begunkov committed
      Joseph reports the following deadlock:
      
      CPU0:
      ...
      io_kill_linked_timeout  // &ctx->completion_lock
      io_commit_cqring
      __io_queue_deferred
      __io_queue_async_work
      io_wq_enqueue
      io_wqe_enqueue  // &wqe->lock
      
      CPU1:
      ...
      __io_uring_files_cancel
      io_wq_cancel_cb
      io_wqe_cancel_pending_work  // &wqe->lock
      io_cancel_task_cb  // &ctx->completion_lock
      
      Only __io_queue_deferred() calls queue_async_work() while holding
      ctx->completion_lock, so enqueue drained requests via
      io_req_task_queue() instead.
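
The two call chains above form a classic AB-BA inversion: CPU0 takes completion_lock and then wqe->lock, while CPU1 takes them in the opposite order. A minimal userspace sketch of detecting such an inversion follows. It is a toy in the spirit of a lock-order checker, not how lockdep is actually implemented, and all names in it are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum lock_id { COMPLETION_LOCK, WQE_LOCK, NUM_LOCKS };

/* order_seen[a][b] is true once some path took lock b while holding a. */
struct lock_order {
    bool order_seen[NUM_LOCKS][NUM_LOCKS];
};

/*
 * Record that "inner" was acquired while "outer" was held.
 * Returns true if the inverse nesting was recorded earlier, i.e. the
 * two paths can deadlock against each other (AB-BA).
 */
bool record_nested(struct lock_order *lo, enum lock_id outer,
                   enum lock_id inner)
{
    assert(outer != inner);
    lo->order_seen[outer][inner] = true;
    return lo->order_seen[inner][outer];
}
```

Recording CPU0's nesting and then CPU1's nesting reports the inversion. Once the enqueue no longer happens under completion_lock, only CPU1's nesting remains and nothing is reported.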
      
      Cc: stable@vger.kernel.org # 5.9+
      Reported-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      907d1df3
    • P
      io_uring: fix cancellation taking mutex while TASK_UNINTERRUPTIBLE · ca70f00b
      Pavel Begunkov committed
      do not call blocking ops when !TASK_RUNNING; state=2 set at
      	[<00000000ced9dbfc>] prepare_to_wait+0x1f4/0x3b0
      	kernel/sched/wait.c:262
      WARNING: CPU: 1 PID: 19888 at kernel/sched/core.c:7853
      	__might_sleep+0xed/0x100 kernel/sched/core.c:7848
      RIP: 0010:__might_sleep+0xed/0x100 kernel/sched/core.c:7848
      Call Trace:
       __mutex_lock_common+0xc4/0x2ef0 kernel/locking/mutex.c:935
       __mutex_lock kernel/locking/mutex.c:1103 [inline]
       mutex_lock_nested+0x1a/0x20 kernel/locking/mutex.c:1118
       io_wq_submit_work+0x39a/0x720 fs/io_uring.c:6411
       io_run_cancel fs/io-wq.c:856 [inline]
       io_wqe_cancel_pending_work fs/io-wq.c:990 [inline]
       io_wq_cancel_cb+0x614/0xcb0 fs/io-wq.c:1027
       io_uring_cancel_files fs/io_uring.c:8874 [inline]
       io_uring_cancel_task_requests fs/io_uring.c:8952 [inline]
       __io_uring_files_cancel+0x115d/0x19e0 fs/io_uring.c:9038
       io_uring_files_cancel include/linux/io_uring.h:51 [inline]
       do_exit+0x2e6/0x2490 kernel/exit.c:780
       do_group_exit+0x168/0x2d0 kernel/exit.c:922
       get_signal+0x16b5/0x2030 kernel/signal.c:2770
       arch_do_signal_or_restart+0x8e/0x6a0 arch/x86/kernel/signal.c:811
       handle_signal_work kernel/entry/common.c:147 [inline]
       exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
       exit_to_user_mode_prepare+0xac/0x1e0 kernel/entry/common.c:201
       __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
       syscall_exit_to_user_mode+0x48/0x190 kernel/entry/common.c:302
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Rewrite io_uring_cancel_files() to mimic __io_uring_task_cancel()'s
      counting scheme, so it does all the heavy work before setting
      TASK_UNINTERRUPTIBLE.
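
The ordering problem can be modelled in userspace. The sketch below is only a toy: the toy_* helpers are hypothetical stand-ins for prepare_to_wait(), finish_wait() and the might-sleep check, and the "heavy work" is reduced to a single call that may block.

```c
#include <assert.h>

/* Same numeric values as the kernel's task states; "state=2" in the
 * warning above is TASK_UNINTERRUPTIBLE. */
enum toy_state {
    TOY_TASK_RUNNING = 0,
    TOY_TASK_INTERRUPTIBLE = 1,
    TOY_TASK_UNINTERRUPTIBLE = 2,
};

struct toy_task {
    enum toy_state state;
    int warnings;   /* blocking ops attempted while not TOY_TASK_RUNNING */
};

/* Count a warning if a blocking operation runs in a sleeping state. */
void toy_might_sleep(struct toy_task *t)
{
    if (t->state != TOY_TASK_RUNNING)
        t->warnings++;
}

void toy_prepare_to_wait(struct toy_task *t, enum toy_state s)
{
    assert(s != TOY_TASK_RUNNING);
    t->state = s;
}

void toy_finish_wait(struct toy_task *t)
{
    t->state = TOY_TASK_RUNNING;
}

/* Stand-in for cancellation work that takes a mutex and may block. */
void toy_cancel_requests(struct toy_task *t)
{
    toy_might_sleep(t);
}

/* Buggy order: the task state is changed before the blocking work. */
int cancel_buggy(void)
{
    struct toy_task t = { TOY_TASK_RUNNING, 0 };

    toy_prepare_to_wait(&t, TOY_TASK_UNINTERRUPTIBLE);
    toy_cancel_requests(&t);
    toy_finish_wait(&t);
    return t.warnings;
}

/* Fixed order: all blocking work is done while still runnable. */
int cancel_fixed(void)
{
    struct toy_task t = { TOY_TASK_RUNNING, 0 };

    toy_cancel_requests(&t);
    toy_prepare_to_wait(&t, TOY_TASK_UNINTERRUPTIBLE);
    /* a real waiter would call schedule() here */
    toy_finish_wait(&t);
    return t.warnings;
}
```

In this model cancel_buggy() reports one warning and cancel_fixed() reports none.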
      
      Cc: stable@vger.kernel.org # 5.9+
      Reported-by: syzbot+f655445043a26a7cfab8@syzkaller.appspotmail.com
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      [axboe: fix inverted task check]
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ca70f00b
  10. 26 Jan, 2021 6 commits
    • P
      io_uring: fix __io_uring_files_cancel() with TASK_UNINTERRUPTIBLE · a1bb3cd5
      Pavel Begunkov committed
      If the tctx inflight number hasn't changed because of cancellation,
      __io_uring_task_cancel() will continue, leaving the task in
      TASK_UNINTERRUPTIBLE state, which is not expected by
      __io_uring_files_cancel(). Ensure we always call finish_wait() before
      retrying.
      
      Cc: stable@vger.kernel.org # 5.9+
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a1bb3cd5
    • M
      ecryptfs: fix uid translation for setxattr on security.capability · 0b964446
      Miklos Szeredi committed
      Prior to commit 7c03e2cd ("vfs: move cap_convert_nscap() call into
      vfs_setxattr()") the translation of nscap->rootid did not take stacked
      filesystems (overlayfs and ecryptfs) into account.
      
      That patch fixed the overlay case, but made the ecryptfs case worse.
      
      Restore the old behavior for ecryptfs that existed before the overlayfs
      fix.  This does not fix ecryptfs's handling of complex user namespace
      setups, but it does make sure existing setups don't regress.
      Reported-by: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Tyler Hicks <code@tyhicks.com>
      Fixes: 7c03e2cd ("vfs: move cap_convert_nscap() call into vfs_setxattr()")
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Tyler Hicks <code@tyhicks.com>
      0b964446
    • J
      fs/pipe: allow sendfile() to pipe again · f8ad8187
      Johannes Berg committed
      After commit 36e2c742 ("fs: don't allow splice read/write
      without explicit ops") sendfile() could no longer send data
      from a real file to a pipe, breaking for example certain cgit
      setups (e.g. when running behind fcgiwrap), because in this
      case cgit will try to do exactly this: sendfile() to a pipe.
      
      Fix this by using iter_file_splice_write for the splice_write
      method of pipes, as suggested by Christoph.
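
For reference, the userspace pattern that broke looks roughly like the sketch below: sendfile() from a regular file into the write end of a pipe. This is an illustrative, Linux-only example and not taken from cgit; the helper name is made up, and the message is assumed to be small enough to fit in the pipe buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write msg to a temporary regular file, sendfile() it into a pipe, then
 * read it back from the pipe into out.
 * Returns the number of bytes read from the pipe, or -1 on failure.
 */
ssize_t sendfile_through_pipe(const char *msg, char *out, size_t outlen)
{
    size_t len = strlen(msg);
    ssize_t ret = -1;
    int pipefd[2];
    FILE *tmp = tmpfile();

    if (!tmp)
        return -1;
    if (pipe(pipefd) < 0) {
        fclose(tmp);
        return -1;
    }

    if (fwrite(msg, 1, len, tmp) == len && fflush(tmp) == 0) {
        off_t offset = 0;
        /* regular file -> pipe: the direction affected by the regression */
        ssize_t sent = sendfile(pipefd[1], fileno(tmp), &offset, len);

        if (sent > 0) {
            size_t want = (size_t)sent < outlen ? (size_t)sent : outlen;
            ret = read(pipefd[0], out, want);
        }
    }

    close(pipefd[0]);
    close(pipefd[1]);
    fclose(tmp);
    return ret;
}
```

On a kernel with the regression, the sendfile() call would be expected to fail, so the helper returns -1. With the fix it returns the message length.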
      
      Cc: stable@vger.kernel.org
      Fixes: 36e2c742 ("fs: don't allow splice read/write without explicit ops")
      Suggested-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Johannes Berg <johannes@sipsolutions.net>
      Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f8ad8187
    • F
      btrfs: fix log replay failure due to race with space cache rebuild · 9ad6d91f
      Filipe Manana committed
      After a sudden power failure we may end up with a space cache on disk that
      is not valid and needs to be rebuilt from scratch.
      
      If that happens, during log replay when we attempt to pin an extent buffer
      from a log tree, at btrfs_pin_extent_for_log_replay(), we do not wait for
      the space cache to be rebuilt through the call to:
      
          btrfs_cache_block_group(cache, 1);
      
      That is because that only waits for the task (work queue job) that loads
      the space cache to change the cache state from BTRFS_CACHE_FAST to any
      other value. That is ok when the space cache on disk exists and is valid,
      but when the cache is not valid and needs to be rebuilt, it ends up
      returning as soon as the cache state changes to BTRFS_CACHE_STARTED (done
      at caching_thread()).
      
      So this means that we can end up trying to unpin a range which is not yet
      marked as free in the block group. This results in the call to
      btrfs_remove_free_space() returning -EINVAL to
      btrfs_pin_extent_for_log_replay(), which in turn makes the log replay fail
      as well as mounting the filesystem. More specifically the -EINVAL comes
      from free_space_cache.c:remove_from_bitmap(), because the requested range
      is not marked as free space (ones in the bitmap), we have the following
      condition triggered:
      
      static noinline int remove_from_bitmap(struct btrfs_free_space_ctl *ctl,
      (...)
             if (ret < 0 || search_start != *offset)
                  return -EINVAL;
      (...)
      
      It's the "search_start != *offset" that results in the condition being
      evaluated to true.
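
A reduced model of that check is sketched below. It is not the btrfs code: the free-space bitmap is shrunk to a single 64-bit word with one bit per unit, and the toy_* names are hypothetical. It only shows why a range that was never marked free makes the search land on a different offset and fail with -EINVAL.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define TOY_BITS 64u

/* Return the first set (free) bit at or after start, or TOY_BITS if none. */
unsigned toy_next_free(uint64_t bitmap, unsigned start)
{
    for (unsigned i = start; i < TOY_BITS; i++)
        if (bitmap & (UINT64_C(1) << i))
            return i;
    return TOY_BITS;
}

/*
 * Remove one unit at "offset" from the free space bitmap.
 * Fails with -EINVAL when the unit is not currently marked free, which
 * shows up as the search starting point differing from the request.
 */
int toy_remove_from_bitmap(uint64_t *bitmap, unsigned offset)
{
    unsigned search_start;

    assert(offset < TOY_BITS);
    search_start = toy_next_free(*bitmap, offset);
    if (search_start == TOY_BITS || search_start != offset)
        return -EINVAL;

    *bitmap &= ~(UINT64_C(1) << offset);
    return 0;
}
```

With bits 4 to 7 marked free, removing unit 4 succeeds, while removing unit 2 fails because the search lands on bit 4 instead of bit 2.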
      
      When this happens we get the following in dmesg/syslog:
      
      [72383.415114] BTRFS: device fsid 32b95b69-0ea9-496a-9f02-3f5a56dc9322 devid 1 transid 1432 /dev/sdb scanned by mount (3816007)
      [72383.417837] BTRFS info (device sdb): disk space caching is enabled
      [72383.418536] BTRFS info (device sdb): has skinny extents
      [72383.423846] BTRFS info (device sdb): start tree-log replay
      [72383.426416] BTRFS warning (device sdb): block group 30408704 has wrong amount of free space
      [72383.427686] BTRFS warning (device sdb): failed to load free space cache for block group 30408704, rebuilding it now
      [72383.454291] BTRFS: error (device sdb) in btrfs_recover_log_trees:6203: errno=-22 unknown (Failed to pin buffers while recovering log root tree.)
      [72383.456725] BTRFS: error (device sdb) in btrfs_replay_log:2253: errno=-22 unknown (Failed to recover log tree)
      [72383.460241] BTRFS error (device sdb): open_ctree failed
      
      We also mark the range for the extent buffer in the excluded extents io
      tree. That is fine when the space cache is valid on disk and we can load
      it, in which case it causes no problems.
      
      However, for the case where we need to rebuild the space cache, because it
      is either invalid or it is missing, having the extent buffer range marked
      in the excluded extents io tree leads to a -EINVAL failure from the call
      to btrfs_remove_free_space(), causing the log replay and mount to
      fail. This is because by having the range marked in the excluded extents
      io tree, the caching thread ends up never adding the range of the extent
      buffer as free space in the block group since the calls to
      add_new_free_space(), called from load_extent_tree_free(), filter out any
      ranges that are marked as excluded extents.
      
      So fix this by making sure that during log replay we wait for the caching
      task to finish completely when we need to rebuild a space cache, and also
      drop the need to mark the extent buffer range in the excluded extents io
      tree, as well as clearing ranges from that tree at
      btrfs_finish_extent_commit().
      
      This started to happen with some frequency on large filesystems having
      block groups with a lot of fragmentation since the recent commit
      e747853c ("btrfs: load free space cache asynchronously"), but in
      fact the issue has been there for years, it was just much less likely
      to happen.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9ad6d91f
    • S
      btrfs: fix lockdep warning due to seqcount_mutex on 32bit arch · c41ec452
      Su Yue committed
      This effectively reverts commit d5c82388 ("btrfs: convert
      data_seqcount to seqcount_mutex_t").
      
      While running fstests on a 32-bit test box, many tests failed because of
      warnings in dmesg. One of those warnings (btrfs/003):
      
        [66.441317] WARNING: CPU: 6 PID: 9251 at include/linux/seqlock.h:279 btrfs_remove_chunk+0x58b/0x7b0 [btrfs]
        [66.441446] CPU: 6 PID: 9251 Comm: btrfs Tainted: G           O      5.11.0-rc4-custom+ #5
        [66.441449] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
        [66.441451] EIP: btrfs_remove_chunk+0x58b/0x7b0 [btrfs]
        [66.441472] EAX: 00000000 EBX: 00000001 ECX: c576070c EDX: c6b15803
        [66.441475] ESI: 10000000 EDI: 00000000 EBP: c56fbcfc ESP: c56fbc70
        [66.441477] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010246
        [66.441481] CR0: 80050033 CR2: 05c8da20 CR3: 04b20000 CR4: 00350ed0
        [66.441485] Call Trace:
        [66.441510]  btrfs_relocate_chunk+0xb1/0x100 [btrfs]
        [66.441529]  ? btrfs_lookup_block_group+0x17/0x20 [btrfs]
        [66.441562]  btrfs_balance+0x8ed/0x13b0 [btrfs]
        [66.441586]  ? btrfs_ioctl_balance+0x333/0x3c0 [btrfs]
        [66.441619]  ? __this_cpu_preempt_check+0xf/0x11
        [66.441643]  btrfs_ioctl_balance+0x333/0x3c0 [btrfs]
        [66.441664]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
        [66.441683]  btrfs_ioctl+0x414/0x2ae0 [btrfs]
        [66.441700]  ? __lock_acquire+0x35f/0x2650
        [66.441717]  ? lockdep_hardirqs_on+0x87/0x120
        [66.441720]  ? lockdep_hardirqs_on_prepare+0xd0/0x1e0
        [66.441724]  ? call_rcu+0x2d3/0x530
        [66.441731]  ? __might_fault+0x41/0x90
        [66.441736]  ? kvm_sched_clock_read+0x15/0x50
        [66.441740]  ? sched_clock+0x8/0x10
        [66.441745]  ? sched_clock_cpu+0x13/0x180
        [66.441750]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
        [66.441750]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
        [66.441768]  __ia32_sys_ioctl+0x165/0x8a0
        [66.441773]  ? __this_cpu_preempt_check+0xf/0x11
        [66.441785]  ? __might_fault+0x89/0x90
        [66.441791]  __do_fast_syscall_32+0x54/0x80
        [66.441796]  do_fast_syscall_32+0x32/0x70
        [66.441801]  do_SYSENTER_32+0x15/0x20
        [66.441805]  entry_SYSENTER_32+0x9f/0xf2
        [66.441808] EIP: 0xab7b5549
        [66.441814] EAX: ffffffda EBX: 00000003 ECX: c4009420 EDX: bfa91f5c
        [66.441816] ESI: 00000003 EDI: 00000001 EBP: 00000000 ESP: bfa91e98
        [66.441818] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000292
        [66.441833] irq event stamp: 42579
        [66.441835] hardirqs last  enabled at (42585): [<c60eb065>] console_unlock+0x495/0x590
        [66.441838] hardirqs last disabled at (42590): [<c60eafd5>] console_unlock+0x405/0x590
        [66.441840] softirqs last  enabled at (41698): [<c601b76c>] call_on_stack+0x1c/0x60
        [66.441843] softirqs last disabled at (41681): [<c601b76c>] call_on_stack+0x1c/0x60
      
        ========================================================================
        btrfs_remove_chunk+0x58b/0x7b0:
        __seqprop_mutex_assert at linux/./include/linux/seqlock.h:279
        (inlined by) btrfs_device_set_bytes_used at linux/fs/btrfs/volumes.h:212
        (inlined by) btrfs_remove_chunk at linux/fs/btrfs/volumes.c:2994
        ========================================================================
      
      The warning is produced by lockdep_assert_held() in
      __seqprop_mutex_assert() if CONFIG_LOCKDEP is enabled.
      And volumes.c:2994 is btrfs_device_set_bytes_used() with mutex lock
      fs_info->chunk_mutex held already.
      
      After adding some debug prints, the cause was found: many
      __alloc_device() calls are made with a NULL @fs_info (during the
      scanning ioctl). Inside the function, btrfs_device_data_ordered_init()
      is expanded to seqcount_mutex_init().  In this scenario, its second
      parameter info->chunk_mutex is &NULL->chunk_mutex, which unexpectedly
      equals offsetof(struct btrfs_fs_info, chunk_mutex). Thus,
      seqcount_mutex_init() is called in the wrong way, and later the
      btrfs_device_get/set helpers trigger lockdep warnings.
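
The pointer arithmetic behind this can be illustrated in userspace. The structs below are hypothetical stand-ins, not the real btrfs_fs_info layout, and the address is computed with offsetof() rather than by evaluating a member of a NULL pointer, which would be undefined behaviour in portable C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_mutex {
    int locked;
};

struct toy_fs_info {
    uint64_t generation;            /* any member placed before the mutex */
    struct toy_mutex chunk_mutex;
};

/*
 * Address that &info->chunk_mutex works out to for an object located at
 * info_addr, computed without dereferencing anything.
 */
uintptr_t chunk_mutex_addr(uintptr_t info_addr)
{
    return info_addr + offsetof(struct toy_fs_info, chunk_mutex);
}
```

For info_addr == 0 the result is just the member offset: a small non-zero value that is not a valid mutex address but also does not compare equal to NULL.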
      
      The device and filesystem object lifetimes are different and we'd have
      to synchronize initialization of the btrfs_device::data_seqcount with
      the fs_info, possibly using some additional synchronization. It would
      still not prevent concurrent access to the seqcount lock when it's used
      for read and initialization.
      
      Commit d5c82388 ("btrfs: convert data_seqcount to seqcount_mutex_t")
      does not mention a particular problem being fixed so revert should not
      cause any harm and we'll get the lockdep warning fixed.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=210139
      Reported-by: Erhard F <erhard_f@mailbox.org>
      Fixes: d5c82388 ("btrfs: convert data_seqcount to seqcount_mutex_t")
      CC: stable@vger.kernel.org # 5.10
      CC: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: Su Yue <l@damenly.su>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c41ec452
    • J
      btrfs: fix possible free space tree corruption with online conversion · 2f96e402
      Josef Bacik committed
      While running btrfs/011 in a loop I would often ASSERT() while trying to
      add a new free space entry that already existed, or get an EEXIST while
      adding a new block to the extent tree, which is another indication of
      double allocation.
      
      This occurs because when we do the free space tree population, we create
      the new root and then populate the tree and commit the transaction.
      The problem is when you create a new root, the root node and commit root
      node are the same.  During this initial transaction commit we will run
      all of the delayed refs that were paused during the free space tree
      generation, and thus begin to cache block groups.  While caching block
      groups the caching thread will be reading from the main root for the
      free space tree, so as we make allocations we'll be changing the free
      space tree, which can cause us to add the same range twice which results
      in either the ASSERT(ret != -EEXIST); in __btrfs_add_free_space, or in a
      variety of different errors when running delayed refs because of a
      double allocation.
      
      Fix this by marking the fs_info as unsafe to load the free space tree,
      and fall back on the old slow method.  We could be smarter than this,
      for example caching the block group while we're populating the free
      space tree, but since this is a serious problem I've opted for the
      simplest solution.
      
      CC: stable@vger.kernel.org # 4.9+
      Fixes: a5ed9182 ("Btrfs: implement the free space B-tree")
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2f96e402
  11. 25 Jan, 2021 8 commits