1. 02 2月, 2021 15 次提交
  2. 29 1月, 2021 6 次提交
    • R
      cifs: fix dfs domain referrals · 0d4873f9
      Ronnie Sahlberg 提交于
      The new mount API requires additional changes to how DFS
      is handled. Additional testing of DFS uncovered problems
      with domain based DFS referrals (a follow on patch addresses
      DFS links) which this patch addresses.
      Signed-off-by: NRonnie Sahlberg <lsahlber@redhat.com>
      Signed-off-by: NPaulo Alcantara (SUSE) <pc@cjr.nz>
      Signed-off-by: NSteve French <stfrench@microsoft.com>
      0d4873f9
    • P
      io_uring: reinforce cancel on flush during exit · 3a7efd1a
      Pavel Begunkov 提交于
      What 84965ff8 ("io_uring: if we see flush on exit, cancel related tasks")
      really wants is to cancel all relevant REQ_F_INFLIGHT requests reliably.
      That can be achieved by io_uring_cancel_files(), but we'll miss it
      calling io_uring_cancel_task_requests(files=NULL) from io_uring_flush(),
      because it will go through __io_uring_cancel_task_requests().
      
      Just always call io_uring_cancel_files() during cancel, it's good enough
      for now.
      
      Cc: stable@vger.kernel.org # 5.9+
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      3a7efd1a
    • S
      cifs: returning mount parm processing errors correctly · bd2f0b43
      Steve French 提交于
      During additional testing of the updated cifs.ko with the
      new mount API support, we found a few additional cases where
      we were logging errors, but not returning them to the user.
      
      For example:
         a) invalid security mechanisms
         b) invalid cache options
         c) unsupported rdma
         d) invalid smb dialect requested
      
      Fixes: 24e0a1ef ("cifs: switch to new mount api")
      Acked-by: NRonnie Sahlberg <lsahlber@redhat.com>
      Signed-off-by: NSteve French <stfrench@microsoft.com>
      bd2f0b43
    • P
      io_uring: fix sqo ownership false positive warning · 70b2c60d
      Pavel Begunkov 提交于
      WARNING: CPU: 0 PID: 21359 at fs/io_uring.c:9042
          io_uring_cancel_task_requests+0xe55/0x10c0 fs/io_uring.c:9042
      Call Trace:
       io_uring_flush+0x47b/0x6e0 fs/io_uring.c:9227
       filp_close+0xb4/0x170 fs/open.c:1295
       close_files fs/file.c:403 [inline]
       put_files_struct fs/file.c:418 [inline]
       put_files_struct+0x1cc/0x350 fs/file.c:415
       exit_files+0x7e/0xa0 fs/file.c:435
       do_exit+0xc22/0x2ae0 kernel/exit.c:820
       do_group_exit+0x125/0x310 kernel/exit.c:922
       get_signal+0x427/0x20f0 kernel/signal.c:2773
       arch_do_signal_or_restart+0x2a8/0x1eb0 arch/x86/kernel/signal.c:811
       handle_signal_work kernel/entry/common.c:147 [inline]
       exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
       exit_to_user_mode_prepare+0x148/0x250 kernel/entry/common.c:201
       __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
       syscall_exit_to_user_mode+0x19/0x50 kernel/entry/common.c:302
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Now io_uring_cancel_task_requests() can be called not through file
      notes but directly, remove a WARN_ONCE() there that give us false
      positives. That check is not very important and we catch it in other
      places.
      
      Fixes: 84965ff8 ("io_uring: if we see flush on exit, cancel related tasks")
      Cc: stable@vger.kernel.org # 5.9+
      Reported-by: syzbot+3e3d9bd0c6ce9efbc3ef@syzkaller.appspotmail.com
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      70b2c60d
    • P
      io_uring: fix list corruption for splice file_get · f609cbb8
      Pavel Begunkov 提交于
      kernel BUG at lib/list_debug.c:29!
      Call Trace:
       __list_add include/linux/list.h:67 [inline]
       list_add include/linux/list.h:86 [inline]
       io_file_get+0x8cc/0xdb0 fs/io_uring.c:6466
       __io_splice_prep+0x1bc/0x530 fs/io_uring.c:3866
       io_splice_prep fs/io_uring.c:3920 [inline]
       io_req_prep+0x3546/0x4e80 fs/io_uring.c:6081
       io_queue_sqe+0x609/0x10d0 fs/io_uring.c:6628
       io_submit_sqe fs/io_uring.c:6705 [inline]
       io_submit_sqes+0x1495/0x2720 fs/io_uring.c:6953
       __do_sys_io_uring_enter+0x107d/0x1f30 fs/io_uring.c:9353
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      io_file_get() may be called from splice, and so REQ_F_INFLIGHT may
      already be set.
      
      Fixes: 02a13674 ("io_uring: account io_uring internal files as REQ_F_INFLIGHT")
      Cc: stable@vger.kernel.org # 5.9+
      Reported-by: syzbot+6879187cf57845801267@syzkaller.appspotmail.com
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f609cbb8
    • S
      cifs: fix mounts to subdirectories of target · c9b8cd6a
      Steve French 提交于
      The "prefixpath" mount option needs to be ignored
      which was missed in the recent conversion to the
      new mount API (prefixpath would be set by the mount
      helper if mounting a subdirectory of the root of a
      share e.g. //server/share/subdir)
      
      Fixes: 24e0a1ef ("cifs: switch to new mount api")
      Suggested-by: NRonnie Sahlberg <lsahlber@redhat.com>
      Signed-off-by: NSteve French <stfrench@microsoft.com>
      Reviewed-by: NRonnie Sahlberg <lsahlber@redhat.com>
      c9b8cd6a
  3. 28 1月, 2021 3 次提交
    • A
      cifs: ignore auto and noauto options if given · 19d51588
      Adam Harvey 提交于
      In 24e0a1ef, the noauto and auto options were missed when migrating
      to the new mount API. As a result, users with noauto in their fstab
      mount options are now unable to mount cifs filesystems, as they'll
      receive an "Unknown parameter" error.
      
      This restores the old behaviour of ignoring noauto and auto if they're
      given.
      
      Fixes: 24e0a1ef ("cifs: switch to new mount api")
      Signed-off-by: NAdam Harvey <adam@adamharvey.name>
      Reviewed-by: NRonnie Sahlberg <lsahlber@redhat.com>
      Signed-off-by: NSteve French <stfrench@microsoft.com>
      19d51588
    • H
      io_uring: fix flush cqring overflow list while TASK_INTERRUPTIBLE · 6195ba09
      Hao Xu 提交于
      Abaci reported the follow warning:
      
      [   27.073425] do not call blocking ops when !TASK_RUNNING; state=1 set at [] prepare_to_wait_exclusive+0x3a/0xc0
      [   27.075805] WARNING: CPU: 0 PID: 951 at kernel/sched/core.c:7853 __might_sleep+0x80/0xa0
      [   27.077604] Modules linked in:
      [   27.078379] CPU: 0 PID: 951 Comm: a.out Not tainted 5.11.0-rc3+ #1
      [   27.079637] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [   27.080852] RIP: 0010:__might_sleep+0x80/0xa0
      [   27.081835] Code: 65 48 8b 04 25 80 71 01 00 48 8b 90 c0 15 00 00 48 8b 70 18 48 c7 c7 08 39 95 82 c6 05 f9 5f de 08 01 48 89 d1 e8 00 c6 fa ff  0b eb bf 41 0f b6 f5 48 c7 c7 40 23 c9 82 e8 f3 48 ec 00 eb a7
      [   27.084521] RSP: 0018:ffffc90000fe3ce8 EFLAGS: 00010286
      [   27.085350] RAX: 0000000000000000 RBX: ffffffff82956083 RCX: 0000000000000000
      [   27.086348] RDX: ffff8881057a0000 RSI: ffffffff8118cc9e RDI: ffff88813bc28570
      [   27.087598] RBP: 00000000000003a7 R08: 0000000000000001 R09: 0000000000000001
      [   27.088819] R10: ffffc90000fe3e00 R11: 00000000fffef9f0 R12: 0000000000000000
      [   27.089819] R13: 0000000000000000 R14: ffff88810576eb80 R15: ffff88810576e800
      [   27.091058] FS:  00007f7b144cf740(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
      [   27.092775] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   27.093796] CR2: 00000000022da7b8 CR3: 000000010b928002 CR4: 00000000003706f0
      [   27.094778] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   27.095780] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [   27.097011] Call Trace:
      [   27.097685]  __mutex_lock+0x5d/0xa30
      [   27.098565]  ? prepare_to_wait_exclusive+0x71/0xc0
      [   27.099412]  ? io_cqring_overflow_flush.part.101+0x6d/0x70
      [   27.100441]  ? lockdep_hardirqs_on_prepare+0xe9/0x1c0
      [   27.101537]  ? _raw_spin_unlock_irqrestore+0x2d/0x40
      [   27.102656]  ? trace_hardirqs_on+0x46/0x110
      [   27.103459]  ? io_cqring_overflow_flush.part.101+0x6d/0x70
      [   27.104317]  io_cqring_overflow_flush.part.101+0x6d/0x70
      [   27.105113]  io_cqring_wait+0x36e/0x4d0
      [   27.105770]  ? find_held_lock+0x28/0xb0
      [   27.106370]  ? io_uring_remove_task_files+0xa0/0xa0
      [   27.107076]  __x64_sys_io_uring_enter+0x4fb/0x640
      [   27.107801]  ? rcu_read_lock_sched_held+0x59/0xa0
      [   27.108562]  ? lockdep_hardirqs_on_prepare+0xe9/0x1c0
      [   27.109684]  ? syscall_enter_from_user_mode+0x26/0x70
      [   27.110731]  do_syscall_64+0x2d/0x40
      [   27.111296]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [   27.112056] RIP: 0033:0x7f7b13dc8239
      [   27.112663] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05  3d 01 f0 ff ff 73 01 c3 48 8b 0d 27 ec 2c 00 f7 d8 64 89 01 48
      [   27.115113] RSP: 002b:00007ffd6d7f5c88 EFLAGS: 00000286 ORIG_RAX: 00000000000001aa
      [   27.116562] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7b13dc8239
      [   27.117961] RDX: 000000000000478e RSI: 0000000000000000 RDI: 0000000000000003
      [   27.118925] RBP: 00007ffd6d7f5cb0 R08: 0000000020000040 R09: 0000000000000008
      [   27.119773] R10: 0000000000000001 R11: 0000000000000286 R12: 0000000000400480
      [   27.120614] R13: 00007ffd6d7f5d90 R14: 0000000000000000 R15: 0000000000000000
      [   27.121490] irq event stamp: 5635
      [   27.121946] hardirqs last  enabled at (5643): [] console_unlock+0x5c4/0x740
      [   27.123476] hardirqs last disabled at (5652): [] console_unlock+0x4e7/0x740
      [   27.125192] softirqs last  enabled at (5272): [] __do_softirq+0x3c5/0x5aa
      [   27.126430] softirqs last disabled at (5267): [] asm_call_irq_on_stack+0xf/0x20
      [   27.127634] ---[ end trace 289d7e28fa60f928 ]---
      
      This is caused by calling io_cqring_overflow_flush() which may sleep
      after calling prepare_to_wait_exclusive() which set task state to
      TASK_INTERRUPTIBLE
      Reported-by: NAbaci <abaci@linux.alibaba.com>
      Fixes: 6c503150 ("io_uring: patch up IOPOLL overflow_flush sync")
      Reviewed-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NHao Xu <haoxu@linux.alibaba.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      6195ba09
    • M
      Revert "block: simplify set_init_blocksize" to regain lost performance · 8dc932d3
      Maxim Mikityanskiy 提交于
      The cited commit introduced a serious regression with SATA write speed,
      as found by bisecting. This patch reverts this commit, which restores
      write speed back to the values observed before this commit.
      
      The performance tests were done on a Helios4 NAS (2nd batch) with 4 HDDs
      (WD8003FFBX) using dd (bs=1M count=2000). "Direct" is a test with a
      single HDD, the rest are different RAID levels built over the first
      partitions of 4 HDDs. Test results are in MB/s, R is read, W is write.
      
                      | Direct | RAID0 | RAID10 f2 | RAID10 n2 | RAID6
      ----------------+--------+-------+-----------+-----------+--------
      9011495c    | R:256  | R:313 | R:276     | R:313     | R:323
      (before faulty) | W:254  | W:253 | W:195     | W:204     | W:117
      ----------------+--------+-------+-----------+-----------+--------
      5ff9f192    | R:257  | R:398 | R:312     | R:344     | R:391
      (faulty commit) | W:154  | W:122 | W:67.7    | W:66.6    | W:67.2
      ----------------+--------+-------+-----------+-----------+--------
      5.10.10         | R:256  | R:401 | R:312     | R:356     | R:375
      unpatched       | W:149  | W:123 | W:64      | W:64.1    | W:61.5
      ----------------+--------+-------+-----------+-----------+--------
      5.10.10         | R:255  | R:396 | R:312     | R:340     | R:393
      patched         | W:247  | W:274 | W:220     | W:225     | W:121
      
      Applying this patch doesn't hurt read performance, while improves the
      write speed by 1.5x - 3.5x (more impact on RAID tests). The write speed
      is restored back to the state before the faulty commit, and even a bit
      higher in RAID tests (which aren't HDD-bound on this device) - that is
      likely related to other optimizations done between the faulty commit and
      5.10.10 which also improved the read speed.
      Signed-off-by: NMaxim Mikityanskiy <maxtram95@gmail.com>
      Fixes: 5ff9f192 ("block: simplify set_init_blocksize")
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Acked-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      8dc932d3
  4. 27 1月, 2021 2 次提交
    • P
      io_uring: fix wqe->lock/completion_lock deadlock · 907d1df3
      Pavel Begunkov 提交于
      Joseph reports following deadlock:
      
      CPU0:
      ...
      io_kill_linked_timeout  // &ctx->completion_lock
      io_commit_cqring
      __io_queue_deferred
      __io_queue_async_work
      io_wq_enqueue
      io_wqe_enqueue  // &wqe->lock
      
      CPU1:
      ...
      __io_uring_files_cancel
      io_wq_cancel_cb
      io_wqe_cancel_pending_work  // &wqe->lock
      io_cancel_task_cb  // &ctx->completion_lock
      
      Only __io_queue_deferred() calls queue_async_work() while holding
      ctx->completion_lock, enqueue drained requests via io_req_task_queue()
      instead.
      
      Cc: stable@vger.kernel.org # 5.9+
      Reported-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Tested-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      907d1df3
    • P
      io_uring: fix cancellation taking mutex while TASK_UNINTERRUPTIBLE · ca70f00b
      Pavel Begunkov 提交于
      do not call blocking ops when !TASK_RUNNING; state=2 set at
      	[<00000000ced9dbfc>] prepare_to_wait+0x1f4/0x3b0
      	kernel/sched/wait.c:262
      WARNING: CPU: 1 PID: 19888 at kernel/sched/core.c:7853
      	__might_sleep+0xed/0x100 kernel/sched/core.c:7848
      RIP: 0010:__might_sleep+0xed/0x100 kernel/sched/core.c:7848
      Call Trace:
       __mutex_lock_common+0xc4/0x2ef0 kernel/locking/mutex.c:935
       __mutex_lock kernel/locking/mutex.c:1103 [inline]
       mutex_lock_nested+0x1a/0x20 kernel/locking/mutex.c:1118
       io_wq_submit_work+0x39a/0x720 fs/io_uring.c:6411
       io_run_cancel fs/io-wq.c:856 [inline]
       io_wqe_cancel_pending_work fs/io-wq.c:990 [inline]
       io_wq_cancel_cb+0x614/0xcb0 fs/io-wq.c:1027
       io_uring_cancel_files fs/io_uring.c:8874 [inline]
       io_uring_cancel_task_requests fs/io_uring.c:8952 [inline]
       __io_uring_files_cancel+0x115d/0x19e0 fs/io_uring.c:9038
       io_uring_files_cancel include/linux/io_uring.h:51 [inline]
       do_exit+0x2e6/0x2490 kernel/exit.c:780
       do_group_exit+0x168/0x2d0 kernel/exit.c:922
       get_signal+0x16b5/0x2030 kernel/signal.c:2770
       arch_do_signal_or_restart+0x8e/0x6a0 arch/x86/kernel/signal.c:811
       handle_signal_work kernel/entry/common.c:147 [inline]
       exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
       exit_to_user_mode_prepare+0xac/0x1e0 kernel/entry/common.c:201
       __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
       syscall_exit_to_user_mode+0x48/0x190 kernel/entry/common.c:302
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Rewrite io_uring_cancel_files() to mimic __io_uring_task_cancel()'s
      counting scheme, so it does all the heavy work before setting
      TASK_UNINTERRUPTIBLE.
      
      Cc: stable@vger.kernel.org # 5.9+
      Reported-by: syzbot+f655445043a26a7cfab8@syzkaller.appspotmail.com
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      [axboe: fix inverted task check]
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      ca70f00b
  5. 26 1月, 2021 6 次提交
    • P
      io_uring: fix __io_uring_files_cancel() with TASK_UNINTERRUPTIBLE · a1bb3cd5
      Pavel Begunkov 提交于
      If the tctx inflight number haven't changed because of cancellation,
      __io_uring_task_cancel() will continue leaving the task in
      TASK_UNINTERRUPTIBLE state, that's not expected by
      __io_uring_files_cancel(). Ensure we always call finish_wait() before
      retrying.
      
      Cc: stable@vger.kernel.org # 5.9+
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a1bb3cd5
    • M
      ecryptfs: fix uid translation for setxattr on security.capability · 0b964446
      Miklos Szeredi 提交于
      Prior to commit 7c03e2cd ("vfs: move cap_convert_nscap() call into
      vfs_setxattr()") the translation of nscap->rootid did not take stacked
      filesystems (overlayfs and ecryptfs) into account.
      
      That patch fixed the overlay case, but made the ecryptfs case worse.
      
      Restore old the behavior for ecryptfs that existed before the overlayfs
      fix.  This does not fix ecryptfs's handling of complex user namespace
      setups, but it does make sure existing setups don't regress.
      Reported-by: NEric W. Biederman <ebiederm@xmission.com>
      Cc: Tyler Hicks <code@tyhicks.com>
      Fixes: 7c03e2cd ("vfs: move cap_convert_nscap() call into vfs_setxattr()")
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: NTyler Hicks <code@tyhicks.com>
      0b964446
    • J
      fs/pipe: allow sendfile() to pipe again · f8ad8187
      Johannes Berg 提交于
      After commit 36e2c742 ("fs: don't allow splice read/write
      without explicit ops") sendfile() could no longer send data
      from a real file to a pipe, breaking for example certain cgit
      setups (e.g. when running behind fcgiwrap), because in this
      case cgit will try to do exactly this: sendfile() to a pipe.
      
      Fix this by using iter_file_splice_write for the splice_write
      method of pipes, as suggested by Christoph.
      
      Cc: stable@vger.kernel.org
      Fixes: 36e2c742 ("fs: don't allow splice read/write without explicit ops")
      Suggested-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Tested-by: NJohannes Berg <johannes@sipsolutions.net>
      Signed-off-by: NJohannes Berg <johannes@sipsolutions.net>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f8ad8187
    • F
      btrfs: fix log replay failure due to race with space cache rebuild · 9ad6d91f
      Filipe Manana 提交于
      After a sudden power failure we may end up with a space cache on disk that
      is not valid and needs to be rebuilt from scratch.
      
      If that happens, during log replay when we attempt to pin an extent buffer
      from a log tree, at btrfs_pin_extent_for_log_replay(), we do not wait for
      the space cache to be rebuilt through the call to:
      
          btrfs_cache_block_group(cache, 1);
      
      That is because that only waits for the task (work queue job) that loads
      the space cache to change the cache state from BTRFS_CACHE_FAST to any
      other value. That is ok when the space cache on disk exists and is valid,
      but when the cache is not valid and needs to be rebuilt, it ends up
      returning as soon as the cache state changes to BTRFS_CACHE_STARTED (done
      at caching_thread()).
      
      So this means that we can end up trying to unpin a range which is not yet
      marked as free in the block group. This results in the call to
      btrfs_remove_free_space() to return -EINVAL to
      btrfs_pin_extent_for_log_replay(), which in turn makes the log replay fail
      as well as mounting the filesystem. More specifically the -EINVAL comes
      from free_space_cache.c:remove_from_bitmap(), because the requested range
      is not marked as free space (ones in the bitmap), we have the following
      condition triggered:
      
      static noinline int remove_from_bitmap(struct btrfs_free_space_ctl *ctl,
      (...)
             if (ret < 0 || search_start != *offset)
                  return -EINVAL;
      (...)
      
      It's the "search_start != *offset" that results in the condition being
      evaluated to true.
      
      When this happens we got the following in dmesg/syslog:
      
      [72383.415114] BTRFS: device fsid 32b95b69-0ea9-496a-9f02-3f5a56dc9322 devid 1 transid 1432 /dev/sdb scanned by mount (3816007)
      [72383.417837] BTRFS info (device sdb): disk space caching is enabled
      [72383.418536] BTRFS info (device sdb): has skinny extents
      [72383.423846] BTRFS info (device sdb): start tree-log replay
      [72383.426416] BTRFS warning (device sdb): block group 30408704 has wrong amount of free space
      [72383.427686] BTRFS warning (device sdb): failed to load free space cache for block group 30408704, rebuilding it now
      [72383.454291] BTRFS: error (device sdb) in btrfs_recover_log_trees:6203: errno=-22 unknown (Failed to pin buffers while recovering log root tree.)
      [72383.456725] BTRFS: error (device sdb) in btrfs_replay_log:2253: errno=-22 unknown (Failed to recover log tree)
      [72383.460241] BTRFS error (device sdb): open_ctree failed
      
      We also mark the range for the extent buffer in the excluded extents io
      tree. That is fine when the space cache is valid on disk and we can load
      it, in which case it causes no problems.
      
      However, for the case where we need to rebuild the space cache, because it
      is either invalid or it is missing, having the extent buffer range marked
      in the excluded extents io tree leads to a -EINVAL failure from the call
      to btrfs_remove_free_space(), resulting in the log replay and mount to
      fail. This is because by having the range marked in the excluded extents
      io tree, the caching thread ends up never adding the range of the extent
      buffer as free space in the block group since the calls to
      add_new_free_space(), called from load_extent_tree_free(), filter out any
      ranges that are marked as excluded extents.
      
      So fix this by making sure that during log replay we wait for the caching
      task to finish completely when we need to rebuild a space cache, and also
      drop the need to mark the extent buffer range in the excluded extents io
      tree, as well as clearing ranges from that tree at
      btrfs_finish_extent_commit().
      
      This started to happen with some frequency on large filesystems having
      block groups with a lot of fragmentation since the recent commit
      e747853c ("btrfs: load free space cache asynchronously"), but in
      fact the issue has been there for years, it was just much less likely
      to happen.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9ad6d91f
    • S
      btrfs: fix lockdep warning due to seqcount_mutex on 32bit arch · c41ec452
      Su Yue 提交于
      This effectively reverts commit d5c82388 ("btrfs: convert
      data_seqcount to seqcount_mutex_t").
      
      While running fstests on 32 bits test box, many tests failed because of
      warnings in dmesg. One of those warnings (btrfs/003):
      
        [66.441317] WARNING: CPU: 6 PID: 9251 at include/linux/seqlock.h:279 btrfs_remove_chunk+0x58b/0x7b0 [btrfs]
        [66.441446] CPU: 6 PID: 9251 Comm: btrfs Tainted: G           O      5.11.0-rc4-custom+ #5
        [66.441449] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
        [66.441451] EIP: btrfs_remove_chunk+0x58b/0x7b0 [btrfs]
        [66.441472] EAX: 00000000 EBX: 00000001 ECX: c576070c EDX: c6b15803
        [66.441475] ESI: 10000000 EDI: 00000000 EBP: c56fbcfc ESP: c56fbc70
        [66.441477] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010246
        [66.441481] CR0: 80050033 CR2: 05c8da20 CR3: 04b20000 CR4: 00350ed0
        [66.441485] Call Trace:
        [66.441510]  btrfs_relocate_chunk+0xb1/0x100 [btrfs]
        [66.441529]  ? btrfs_lookup_block_group+0x17/0x20 [btrfs]
        [66.441562]  btrfs_balance+0x8ed/0x13b0 [btrfs]
        [66.441586]  ? btrfs_ioctl_balance+0x333/0x3c0 [btrfs]
        [66.441619]  ? __this_cpu_preempt_check+0xf/0x11
        [66.441643]  btrfs_ioctl_balance+0x333/0x3c0 [btrfs]
        [66.441664]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
        [66.441683]  btrfs_ioctl+0x414/0x2ae0 [btrfs]
        [66.441700]  ? __lock_acquire+0x35f/0x2650
        [66.441717]  ? lockdep_hardirqs_on+0x87/0x120
        [66.441720]  ? lockdep_hardirqs_on_prepare+0xd0/0x1e0
        [66.441724]  ? call_rcu+0x2d3/0x530
        [66.441731]  ? __might_fault+0x41/0x90
        [66.441736]  ? kvm_sched_clock_read+0x15/0x50
        [66.441740]  ? sched_clock+0x8/0x10
        [66.441745]  ? sched_clock_cpu+0x13/0x180
        [66.441750]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
        [66.441750]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
        [66.441768]  __ia32_sys_ioctl+0x165/0x8a0
        [66.441773]  ? __this_cpu_preempt_check+0xf/0x11
        [66.441785]  ? __might_fault+0x89/0x90
        [66.441791]  __do_fast_syscall_32+0x54/0x80
        [66.441796]  do_fast_syscall_32+0x32/0x70
        [66.441801]  do_SYSENTER_32+0x15/0x20
        [66.441805]  entry_SYSENTER_32+0x9f/0xf2
        [66.441808] EIP: 0xab7b5549
        [66.441814] EAX: ffffffda EBX: 00000003 ECX: c4009420 EDX: bfa91f5c
        [66.441816] ESI: 00000003 EDI: 00000001 EBP: 00000000 ESP: bfa91e98
        [66.441818] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000292
        [66.441833] irq event stamp: 42579
        [66.441835] hardirqs last  enabled at (42585): [<c60eb065>] console_unlock+0x495/0x590
        [66.441838] hardirqs last disabled at (42590): [<c60eafd5>] console_unlock+0x405/0x590
        [66.441840] softirqs last  enabled at (41698): [<c601b76c>] call_on_stack+0x1c/0x60
        [66.441843] softirqs last disabled at (41681): [<c601b76c>] call_on_stack+0x1c/0x60
      
        ========================================================================
        btrfs_remove_chunk+0x58b/0x7b0:
        __seqprop_mutex_assert at linux/./include/linux/seqlock.h:279
        (inlined by) btrfs_device_set_bytes_used at linux/fs/btrfs/volumes.h:212
        (inlined by) btrfs_remove_chunk at linux/fs/btrfs/volumes.c:2994
        ========================================================================
      
      The warning is produced by lockdep_assert_held() in
      __seqprop_mutex_assert() if CONFIG_LOCKDEP is enabled.
      And "olumes.c:2994 is btrfs_device_set_bytes_used() with mutex lock
      fs_info->chunk_mutex held already.
      
      After adding some debug prints, the cause was found that many
      __alloc_device() are called with NULL @fs_info (during scanning ioctl).
      Inside the function, btrfs_device_data_ordered_init() is expanded to
      seqcount_mutex_init().  In this scenario, its second
      parameter info->chunk_mutex  is &NULL->chunk_mutex which equals
      to offsetof(struct btrfs_fs_info, chunk_mutex) unexpectedly. Thus,
      seqcount_mutex_init() is called in wrong way. And later
      btrfs_device_get/set helpers trigger lockdep warnings.
      
      The device and filesystem object lifetimes are different and we'd have
      to synchronize initialization of the btrfs_device::data_seqcount with
      the fs_info, possibly using some additional synchronization. It would
      still not prevent concurrent access to the seqcount lock when it's used
      for read and initialization.
      
      Commit d5c82388 ("btrfs: convert data_seqcount to seqcount_mutex_t")
      does not mention a particular problem being fixed so revert should not
      cause any harm and we'll get the lockdep warning fixed.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=210139Reported-by: NErhard F <erhard_f@mailbox.org>
      Fixes: d5c82388 ("btrfs: convert data_seqcount to seqcount_mutex_t")
      CC: stable@vger.kernel.org # 5.10
      CC: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: NSu Yue <l@damenly.su>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c41ec452
    • J
      btrfs: fix possible free space tree corruption with online conversion · 2f96e402
      Josef Bacik 提交于
      While running btrfs/011 in a loop I would often ASSERT() while trying to
      add a new free space entry that already existed, or get an EEXIST while
      adding a new block to the extent tree, which is another indication of
      double allocation.
      
      This occurs because when we do the free space tree population, we create
      the new root and then populate the tree and commit the transaction.
      The problem is when you create a new root, the root node and commit root
      node are the same.  During this initial transaction commit we will run
      all of the delayed refs that were paused during the free space tree
      generation, and thus begin to cache block groups.  While caching block
      groups the caching thread will be reading from the main root for the
      free space tree, so as we make allocations we'll be changing the free
      space tree, which can cause us to add the same range twice which results
      in either the ASSERT(ret != -EEXIST); in __btrfs_add_free_space, or in a
      variety of different errors when running delayed refs because of a
      double allocation.
      
      Fix this by marking the fs_info as unsafe to load the free space tree,
      and fall back on the old slow method.  We could be smarter than this,
      for example caching the block group while we're populating the free
      space tree, but since this is a serious problem I've opted for the
      simplest solution.
      
      CC: stable@vger.kernel.org # 4.9+
      Fixes: a5ed9182 ("Btrfs: implement the free space B-tree")
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2f96e402
  6. 25 1月, 2021 8 次提交