1. 13 Apr, 2021 17 commits
    • P
      io_uring: fix inconsistent lock state · 58a77c0f
      Pavel Begunkov committed
      stable inclusion
      from stable-5.10.26
      commit 1c20e9040f49687ba2ccc2ffd4411351a6c2ebff
      bugzilla: 51363
      
      --------------------------------
      
      [ Upstream commit 9ae1f8dd ]
      
      WARNING: inconsistent lock state
      
      inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
      syz-executor217/8450 [HC1[1]:SC0[0]:HE0:SE1] takes:
      ffff888023d6e620 (&fs->lock){?.+.}-{2:2}, at: spin_lock include/linux/spinlock.h:354 [inline]
      ffff888023d6e620 (&fs->lock){?.+.}-{2:2}, at: io_req_clean_work fs/io_uring.c:1398 [inline]
      ffff888023d6e620 (&fs->lock){?.+.}-{2:2}, at: io_dismantle_req+0x66f/0xf60 fs/io_uring.c:2029
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock(&fs->lock);
        <Interrupt>
          lock(&fs->lock);
      
       *** DEADLOCK ***
      
      1 lock held by syz-executor217/8450:
       #0: ffff88802417c3e8 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0x1071/0x1f30 fs/io_uring.c:9442
      
      stack backtrace:
      CPU: 1 PID: 8450 Comm: syz-executor217 Not tainted 5.11.0-rc5-next-20210129-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <IRQ>
      [...]
       _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
       spin_lock include/linux/spinlock.h:354 [inline]
       io_req_clean_work fs/io_uring.c:1398 [inline]
       io_dismantle_req+0x66f/0xf60 fs/io_uring.c:2029
       __io_free_req+0x3d/0x2e0 fs/io_uring.c:2046
       io_free_req fs/io_uring.c:2269 [inline]
       io_double_put_req fs/io_uring.c:2392 [inline]
       io_put_req+0xf9/0x570 fs/io_uring.c:2388
       io_link_timeout_fn+0x30c/0x480 fs/io_uring.c:6497
       __run_hrtimer kernel/time/hrtimer.c:1519 [inline]
       __hrtimer_run_queues+0x609/0xe40 kernel/time/hrtimer.c:1583
       hrtimer_interrupt+0x334/0x940 kernel/time/hrtimer.c:1645
       local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1085 [inline]
       __sysvec_apic_timer_interrupt+0x146/0x540 arch/x86/kernel/apic/apic.c:1102
       asm_call_irq_on_stack+0xf/0x20
       </IRQ>
       __run_sysvec_on_irqstack arch/x86/include/asm/irq_stack.h:37 [inline]
       run_sysvec_on_irqstack_cond arch/x86/include/asm/irq_stack.h:89 [inline]
       sysvec_apic_timer_interrupt+0xbd/0x100 arch/x86/kernel/apic/apic.c:1096
       asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:629
      RIP: 0010:__raw_spin_unlock_irq include/linux/spinlock_api_smp.h:169 [inline]
      RIP: 0010:_raw_spin_unlock_irq+0x25/0x40 kernel/locking/spinlock.c:199
       spin_unlock_irq include/linux/spinlock.h:404 [inline]
       io_queue_linked_timeout+0x194/0x1f0 fs/io_uring.c:6525
       __io_queue_sqe+0x328/0x1290 fs/io_uring.c:6594
       io_queue_sqe+0x631/0x10d0 fs/io_uring.c:6639
       io_queue_link_head fs/io_uring.c:6650 [inline]
       io_submit_sqe fs/io_uring.c:6697 [inline]
       io_submit_sqes+0x19b5/0x2720 fs/io_uring.c:6960
       __do_sys_io_uring_enter+0x107d/0x1f30 fs/io_uring.c:9443
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Don't free requests from under hrtimer context (softirq) as it may sleep
      or take spinlocks improperly (e.g. non-irq versions).
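
      As a rough user-space illustration of that principle (not the upstream
      patch), the sketch below has a stand-in timer callback that only queues
      a request, leaving the actual free to a later process-context step. All
      names here (struct req, timer_cb_defer_free, drain_deferred, req_alloc)
      are hypothetical, and a real kernel implementation would need an
      interrupt-safe queue rather than this plain list.

      ```c
      #include <assert.h>
      #include <stddef.h>
      #include <stdlib.h>

      /* Hypothetical request type; not the kernel's request structure. */
      struct req {
          struct req *next;
          int id;
      };

      /* Requests waiting to be freed from a safe (process) context.
       * Assumption: single-threaded model, so no locking is shown. */
      static struct req *deferred_head;

      /* Stand-in for the timer callback: only queue the request.
       * It never frees and never takes locks that are also taken
       * with interrupts enabled. */
      void timer_cb_defer_free(struct req *r)
      {
          r->next = deferred_head;
          deferred_head = r;
      }

      /* Stand-in for process context: free everything that was queued.
       * Returns the number of requests freed. */
      int drain_deferred(void)
      {
          int freed = 0;

          while (deferred_head) {
              struct req *r = deferred_head;

              deferred_head = r->next;
              free(r);
              freed++;
          }
          return freed;
      }

      /* Helper for the example: allocate a request with the given id. */
      struct req *req_alloc(int id)
      {
          struct req *r = malloc(sizeof(*r));

          assert(r != NULL);
          r->next = NULL;
          r->id = id;
          return r;
      }
      ```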
      
      Cc: stable@vger.kernel.org # 5.6+
      Reported-by: syzbot+81d17233a2b02eafba33@syzkaller.appspotmail.com
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • S
      cifs: fix allocation size on newly created files · 8a7a3082
      Steve French committed
      stable inclusion
      from stable-5.10.26
      commit 04eb2b2fa12ff6023a92d5199275255e9b82011b
      bugzilla: 51363
      
      --------------------------------
      
      commit 65af8f01 upstream.
      
      Applications that create and extend and write to a file do not
      expect to see 0 allocation size.  When a file is extended,
      set its allocation size to a plausible value until we have a
      chance to query the server for it.  When the file is cached
      this will prevent showing an impossible number of allocated
      blocks (like 0).  This fixes e.g. xfstests 614 which does
      
          1) create a file and set its size to 64K
          2) mmap write 64K to the file
          3) stat -c %b for the file (to query the number of allocated blocks)
      
      It was failing because we returned 0 blocks.  Even though we would
      return the correct cached file size, we returned an impossible
      allocation size.
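
      One plausible way to derive such a value, sketched here as an assumption
      since the message above does not spell out the formula, is to round the
      new file size up to whole 512-byte units, the unit in which stat reports
      allocated blocks. The helper name estimate_blocks is hypothetical.

      ```c
      #include <assert.h>
      #include <stdint.h>

      /* stat(2) reports allocated blocks in 512-byte (2^9) units. */
      #define BLOCK_SHIFT 9

      /* Hypothetical helper: estimate the number of 512-byte blocks for a
       * file of the given size, rounding up so that a 1-byte file is not
       * reported as 0 blocks. Assumes size is far below UINT64_MAX. */
      uint64_t estimate_blocks(uint64_t size)
      {
          return (size + (1ULL << BLOCK_SHIFT) - 1) >> BLOCK_SHIFT;
      }
      ```

      With this estimate, the 64K file from the test case above would report
      128 blocks instead of 0.
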
      Signed-off-by: Steve French <stfrench@microsoft.com>
      CC: <stable@vger.kernel.org>
      Reviewed-by: Aurelien Aptel <aaptel@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • J
      io_uring: ensure that SQPOLL thread is started for exit · 6d7783c6
      Jens Axboe committed
      stable inclusion
      from stable-5.10.26
      commit 6cae8095490caae12875300243ec94b39b6a2a78
      bugzilla: 51363
      
      --------------------------------
      
      commit 3ebba796 upstream.
      
      If we create it in a disabled state because IORING_SETUP_R_DISABLED is
      set on ring creation, we need to ensure that we've kicked the thread if
      we're exiting before it's been explicitly disabled. Otherwise we can run
      into a deadlock where exit is waiting to park the SQPOLL thread, but the
      SQPOLL thread itself is waiting to get a signal to start.
      
      That results in the below trace of both tasks hung, waiting on each other:
      
      INFO: task syz-executor458:8401 blocked for more than 143 seconds.
            Not tainted 5.11.0-next-20210226-syzkaller #0
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      task:syz-executor458 state:D stack:27536 pid: 8401 ppid:  8400 flags:0x00004004
      Call Trace:
       context_switch kernel/sched/core.c:4324 [inline]
       __schedule+0x90c/0x21a0 kernel/sched/core.c:5075
       schedule+0xcf/0x270 kernel/sched/core.c:5154
       schedule_timeout+0x1db/0x250 kernel/time/timer.c:1868
       do_wait_for_common kernel/sched/completion.c:85 [inline]
       __wait_for_common kernel/sched/completion.c:106 [inline]
       wait_for_common kernel/sched/completion.c:117 [inline]
       wait_for_completion+0x168/0x270 kernel/sched/completion.c:138
       io_sq_thread_park fs/io_uring.c:7115 [inline]
       io_sq_thread_park+0xd5/0x130 fs/io_uring.c:7103
       io_uring_cancel_task_requests+0x24c/0xd90 fs/io_uring.c:8745
       __io_uring_files_cancel+0x110/0x230 fs/io_uring.c:8840
       io_uring_files_cancel include/linux/io_uring.h:47 [inline]
       do_exit+0x299/0x2a60 kernel/exit.c:780
       do_group_exit+0x125/0x310 kernel/exit.c:922
       __do_sys_exit_group kernel/exit.c:933 [inline]
       __se_sys_exit_group kernel/exit.c:931 [inline]
       __x64_sys_exit_group+0x3a/0x50 kernel/exit.c:931
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x43e899
      RSP: 002b:00007ffe89376d48 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
      RAX: ffffffffffffffda RBX: 00000000004af2f0 RCX: 000000000043e899
      RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000000
      RBP: 0000000000000000 R08: ffffffffffffffc0 R09: 0000000010000000
      R10: 0000000000008011 R11: 0000000000000246 R12: 00000000004af2f0
      R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001
      INFO: task iou-sqp-8401:8402 can't die for more than 143 seconds.
      task:iou-sqp-8401    state:D stack:30272 pid: 8402 ppid:  8400 flags:0x00004004
      Call Trace:
       context_switch kernel/sched/core.c:4324 [inline]
       __schedule+0x90c/0x21a0 kernel/sched/core.c:5075
       schedule+0xcf/0x270 kernel/sched/core.c:5154
       schedule_timeout+0x1db/0x250 kernel/time/timer.c:1868
       do_wait_for_common kernel/sched/completion.c:85 [inline]
       __wait_for_common kernel/sched/completion.c:106 [inline]
       wait_for_common kernel/sched/completion.c:117 [inline]
       wait_for_completion+0x168/0x270 kernel/sched/completion.c:138
       io_sq_thread+0x27d/0x1ae0 fs/io_uring.c:6717
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
      INFO: task iou-sqp-8401:8402 blocked for more than 143 seconds.
      
      Reported-by: syzbot+fb5458330b4442f2090d@syzkaller.appspotmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • T
      pstore: Fix warning in pstore_kill_sb() · 64ef7217
      Tetsuo Handa committed
      stable inclusion
      from stable-5.10.26
      commit a7acb614287b7de8bf86d6758dac43bbd1d29534
      bugzilla: 51363
      
      --------------------------------
      
      commit 9c7d83ae upstream.
      
      syzbot is hitting WARN_ON(pstore_sb != sb) at pstore_kill_sb() [1], for the
      assumption that pstore_sb != NULL is wrong because pstore_fill_super() will
      not assign pstore_sb = sb when new_inode() for d_make_root() returned NULL
      (due to memory allocation fault injection).
      
      Since mount_single() calls pstore_kill_sb() when pstore_fill_super()
      failed, pstore_kill_sb() needs to be aware of such failure path.
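
      A minimal user-space model of this failure path follows. It is a sketch
      only: the upstream diff is not reproduced here, and the names
      (super_block_stub, recorded_sb, fill_super_model, kill_sb_should_warn)
      are hypothetical stand-ins for the pstore code.

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stddef.h>

      /* Hypothetical stand-in for a superblock. */
      struct super_block_stub {
          int id;
      };

      /* Models the global pstore_sb pointer. */
      static struct super_block_stub *recorded_sb;

      /* Models the fill-super step: the superblock is recorded only when
       * the (simulated) allocation succeeds. */
      int fill_super_model(struct super_block_stub *sb, bool alloc_fails)
      {
          if (alloc_fails)
              return -1;          /* recorded_sb stays NULL */
          recorded_sb = sb;
          return 0;
      }

      /* Models the kill path. Returns true when a warning is justified:
       * only if a superblock was recorded and it is a different one.
       * A NULL recorded_sb (fill-super failed) is not treated as a bug. */
      bool kill_sb_should_warn(const struct super_block_stub *sb)
      {
          bool warn = recorded_sb != NULL && recorded_sb != sb;

          recorded_sb = NULL;
          return warn;
      }
      ```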
      
      [1] https://syzkaller.appspot.com/bug?id=6abacb8da5137cb47a416f2bef95719ed60508a0

      Reported-by: syzbot <syzbot+d0cf0ad6513e9a1da5df@syzkaller.appspotmail.com>
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20210214031307.57903-1-penguin-kernel@I-love.SAKURA.ne.jp
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • O
      NFSD: fix dest to src mount in inter-server COPY · 8ca28d72
      Olga Kornievskaia committed
      stable inclusion
      from stable-5.10.26
      commit 982b899ba672c1eb2e0c01fef197bda13de4af55
      bugzilla: 51363
      
      --------------------------------
      
      commit 614c9750 upstream.
      
      A cleanup of the inter SSC copy needs to call fput() of the source
      file handle to make sure that file structure is freed as well as
      drop the reference on the superblock to unmount the source server.
      
      Fixes: 36e1e5ba ("NFSD: Fix use-after-free warning when doing inter-server copy")
      Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Dai Ngo <dai.ngo@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • J
      nfsd: don't abort copies early · 03a80099
      J. Bruce Fields committed
      stable inclusion
      from stable-5.10.26
      commit 12628e7779f8e191c010955058d278df5bf0c0d4
      bugzilla: 51363
      
      --------------------------------
      
      commit bfdd89f2 upstream.
      
      The typical result of the backwards comparison here is that the source
      server in a server-to-server copy will return BAD_STATEID within a few
      seconds of the copy starting, instead of giving the copy a full lease
      period, so the copy_file_range() call will end up unnecessarily
      returning a short read.
      
      Fixes: 624322f1 "NFSD add COPY_NOTIFY operation"
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • T
      nfsd: Don't keep looking up unhashed files in the nfsd file cache · f764a833
      Trond Myklebust committed
      stable inclusion
      from stable-5.10.26
      commit 5ea0aa29ad4b8bc96b8cfcfb367f04b50b9cf92f
      bugzilla: 51363
      
      --------------------------------
      
      commit d30881f5 upstream.
      
      If a file is unhashed, then we're going to reject it anyway and retry,
      so make sure we skip it when we're doing the RCU lockless lookup.
      This avoids a number of unnecessary nfserr_jukebox returns from
      nfsd_file_acquire()
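
      The idea can be sketched in user space as a lookup that passes over
      entries no longer marked as hashed. This is an illustrative model under
      stated assumptions, not the nfsd file cache code: struct cache_entry and
      cache_lookup are hypothetical, and a flat array stands in for the
      RCU-protected hash chain.

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stddef.h>

      /* Hypothetical cache entry: a key plus a flag saying whether the
       * entry is still linked into the cache. */
      struct cache_entry {
          int key;
          bool hashed;
      };

      /* Return the first entry that matches the key and is still hashed,
       * or NULL if there is none. Unhashed matches are skipped because
       * the caller would reject them and retry anyway. */
      const struct cache_entry *cache_lookup(const struct cache_entry *tab,
                                             size_t n, int key)
      {
          for (size_t i = 0; i < n; i++) {
              if (tab[i].key != key)
                  continue;
              if (!tab[i].hashed)
                  continue;
              return &tab[i];
          }
          return NULL;
      }
      ```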
      
      Fixes: 65294c1f ("nfsd: add a new struct file caching facility to nfsd")
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • D
      afs: Stop listxattr() from listing "afs.*" attributes · 8b25f3b2
      David Howells committed
      stable inclusion
      from stable-5.10.26
      commit 64195f022ae8c24e0abccc1545d557b064e73ed3
      bugzilla: 51363
      
      --------------------------------
      
      commit a7889c63 upstream.
      
      afs_listxattr() lists all the available special afs xattrs (i.e. those in
      the "afs.*" space), no matter what type of server we're dealing with.  But
      OpenAFS servers, for example, cannot deal with some of the extra-capable
      attributes that AuriStor (YFS) servers provide.  Unfortunately, the
      presence of the afs.yfs.* attributes causes errors[1] for anything that
      tries to read them if the server is of the wrong type.
      
      Fix the problem by removing afs_listxattr() so that none of the special
      xattrs are listed (AFS doesn't support xattrs).  It does mean, however,
      that getfattr won't list them, though they can still be accessed with
      getxattr() and setxattr().
      
      This can be tested with something like:
      
      	getfattr -d -m ".*" /afs/example.com/path/to/file
      
      With this change, none of the afs.* attributes should be visible.
      
      Changes:
      ver #2:
       - Hide all of the afs.* xattrs, not just the ACL ones.
      
      Fixes: ae46578b ("afs: Get YFS ACLs and information through xattrs")
      Reported-by: Gaja Sophie Peters <gaja.peters@math.uni-hamburg.de>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-by: Gaja Sophie Peters <gaja.peters@math.uni-hamburg.de>
      Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
      Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
      Link: http://lists.infradead.org/pipermail/linux-afs/2021-March/003502.html [1]
      Link: http://lists.infradead.org/pipermail/linux-afs/2021-March/003567.html # v1
      Link: http://lists.infradead.org/pipermail/linux-afs/2021-March/003573.html # v2
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • D
      afs: Fix accessing YFS xattrs on a non-YFS server · 844e32f2
      David Howells committed
      stable inclusion
      from stable-5.10.26
      commit 78ba4793b084f722a0aaf5f32a3d9f7c3e284b22
      bugzilla: 51363
      
      --------------------------------
      
      commit 64fcbb61 upstream.
      
      If someone attempts to access YFS-related xattrs (e.g. afs.yfs.acl) on a
      file on a non-YFS AFS server (such as OpenAFS), then the kernel will jump
      to a NULL function pointer because the afs_fetch_acl_operation descriptor
      doesn't point to a function for issuing an operation on a non-YFS
      server[1].
      
      Fix this by making afs_wait_for_operation() check that the issue_afs_rpc
      method is set before jumping to it and setting -ENOTSUPP if not.  This fix
      also covers other potential operations that also only exist on YFS servers.
      
      afs_xattr_get/set_yfs() then need to translate -ENOTSUPP to -ENODATA as the
      former error is internal to the kernel.
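
      The two-step handling described above can be modelled in user space
      roughly as follows. This is a sketch under stated assumptions rather
      than the kafs code: struct operation_ops, run_afs_operation,
      xattr_get_yfs_model and sample_rpc_ok are hypothetical, and
      SKETCH_ENOTSUPP mirrors the kernel-internal ENOTSUPP value (524), which
      user-space errno.h does not provide.

      ```c
      #include <assert.h>
      #include <errno.h>
      #include <stddef.h>

      /* Kernel-internal error code, defined locally for this sketch. */
      #define SKETCH_ENOTSUPP 524

      /* Hypothetical operation descriptor with per-server-type methods. */
      struct operation_ops {
          int (*issue_afs_rpc)(void);
          int (*issue_yfs_rpc)(void);
      };

      /* Check that the method is set before calling through it, instead
       * of jumping to a NULL function pointer. */
      int run_afs_operation(const struct operation_ops *ops)
      {
          if (!ops->issue_afs_rpc)
              return -SKETCH_ENOTSUPP;
          return ops->issue_afs_rpc();
      }

      /* The xattr handler translates the internal error into one that is
       * meaningful to user space. */
      int xattr_get_yfs_model(const struct operation_ops *ops)
      {
          int ret = run_afs_operation(ops);

          if (ret == -SKETCH_ENOTSUPP)
              ret = -ENODATA;
          return ret;
      }

      /* Trivial RPC stand-in that always succeeds. */
      int sample_rpc_ok(void)
      {
          return 0;
      }
      ```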
      
      The bug shows up as an oops like the following:
      
      	BUG: kernel NULL pointer dereference, address: 0000000000000000
      	[...]
      	Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
      	[...]
      	Call Trace:
      	 afs_wait_for_operation+0x83/0x1b0 [kafs]
      	 afs_xattr_get_yfs+0xe6/0x270 [kafs]
      	 __vfs_getxattr+0x59/0x80
      	 vfs_getxattr+0x11c/0x140
      	 getxattr+0x181/0x250
      	 ? __check_object_size+0x13f/0x150
      	 ? __fput+0x16d/0x250
      	 __x64_sys_fgetxattr+0x64/0xb0
      	 do_syscall_64+0x49/0xc0
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      	RIP: 0033:0x7fb120a9defe
      
      This was triggered with "cp -a" which attempts to copy xattrs, including
      afs ones, but is easier to reproduce with getfattr, e.g.:
      
      	getfattr -d -m ".*" /afs/openafs.org/
      
      Fixes: e49c7b2f ("afs: Build an abstraction around an "operation" concept")
      Reported-by: Gaja Sophie Peters <gaja.peters@math.uni-hamburg.de>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-by: Gaja Sophie Peters <gaja.peters@math.uni-hamburg.de>
      Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
      Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
      cc: linux-afs@lists.infradead.org
      Link: http://lists.infradead.org/pipermail/linux-afs/2021-March/003498.html [1]
      Link: http://lists.infradead.org/pipermail/linux-afs/2021-March/003566.html # v1
      Link: http://lists.infradead.org/pipermail/linux-afs/2021-March/003572.html # v2
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • D
      btrfs: fix slab cache flags for free space tree bitmap · b0305f10
      David Sterba committed
      stable inclusion
      from stable-5.10.26
      commit 2c8d6a9474f07375c87c4dc6f008610b3ce755a7
      bugzilla: 51363
      
      --------------------------------
      
      commit 34e49994 upstream.
      
      The free space tree bitmap slab cache is created with SLAB_RED_ZONE but
      that's a debugging flag and not always enabled. Also the other slabs are
      created with at least SLAB_MEM_SPREAD that we want as well to average
      the memory placement cost.
      Reported-by: Vlastimil Babka <vbabka@suse.cz>
      Fixes: 3acd4850 ("btrfs: fix allocation of free space cache v1 bitmap pages")
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • F
      btrfs: fix race when cloning extent buffer during rewind of an old root · 5faf447c
      Filipe Manana committed
      stable inclusion
      from stable-5.10.26
      commit 38ffe9eaeb7cce383525439f0948f9eb74632e1d
      bugzilla: 51363
      
      --------------------------------
      
      commit dbcc7d57 upstream.
      
      While resolving backreferences, as part of a logical ino ioctl call or
      fiemap, we can end up hitting a BUG_ON() when replaying tree mod log
      operations of a root, triggering a stack trace like the following:
      
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/ctree.c:1210!
        invalid opcode: 0000 [#1] SMP KASAN PTI
        CPU: 1 PID: 19054 Comm: crawl_335 Tainted: G        W         5.11.0-2d11c0084b02-misc-next+ #89
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
        RIP: 0010:__tree_mod_log_rewind+0x3b1/0x3c0
        Code: 05 48 8d 74 10 (...)
        RSP: 0018:ffffc90001eb70b8 EFLAGS: 00010297
        RAX: 0000000000000000 RBX: ffff88812344e400 RCX: ffffffffb28933b6
        RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff88812344e42c
        RBP: ffffc90001eb7108 R08: 1ffff11020b60a20 R09: ffffed1020b60a20
        R10: ffff888105b050f9 R11: ffffed1020b60a1f R12: 00000000000000ee
        R13: ffff8880195520c0 R14: ffff8881bc958500 R15: ffff88812344e42c
        FS:  00007fd1955e8700(0000) GS:ffff8881f5600000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007efdb7928718 CR3: 000000010103a006 CR4: 0000000000170ee0
        Call Trace:
         btrfs_search_old_slot+0x265/0x10d0
         ? lock_acquired+0xbb/0x600
         ? btrfs_search_slot+0x1090/0x1090
         ? free_extent_buffer.part.61+0xd7/0x140
         ? free_extent_buffer+0x13/0x20
         resolve_indirect_refs+0x3e9/0xfc0
         ? lock_downgrade+0x3d0/0x3d0
         ? __kasan_check_read+0x11/0x20
         ? add_prelim_ref.part.11+0x150/0x150
         ? lock_downgrade+0x3d0/0x3d0
         ? __kasan_check_read+0x11/0x20
         ? lock_acquired+0xbb/0x600
         ? __kasan_check_write+0x14/0x20
         ? do_raw_spin_unlock+0xa8/0x140
         ? rb_insert_color+0x30/0x360
         ? prelim_ref_insert+0x12d/0x430
         find_parent_nodes+0x5c3/0x1830
         ? resolve_indirect_refs+0xfc0/0xfc0
         ? lock_release+0xc8/0x620
         ? fs_reclaim_acquire+0x67/0xf0
         ? lock_acquire+0xc7/0x510
         ? lock_downgrade+0x3d0/0x3d0
         ? lockdep_hardirqs_on_prepare+0x160/0x210
         ? lock_release+0xc8/0x620
         ? fs_reclaim_acquire+0x67/0xf0
         ? lock_acquire+0xc7/0x510
         ? poison_range+0x38/0x40
         ? unpoison_range+0x14/0x40
         ? trace_hardirqs_on+0x55/0x120
         btrfs_find_all_roots_safe+0x142/0x1e0
         ? find_parent_nodes+0x1830/0x1830
         ? btrfs_inode_flags_to_xflags+0x50/0x50
         iterate_extent_inodes+0x20e/0x580
         ? tree_backref_for_extent+0x230/0x230
         ? lock_downgrade+0x3d0/0x3d0
         ? read_extent_buffer+0xdd/0x110
         ? lock_downgrade+0x3d0/0x3d0
         ? __kasan_check_read+0x11/0x20
         ? lock_acquired+0xbb/0x600
         ? __kasan_check_write+0x14/0x20
         ? _raw_spin_unlock+0x22/0x30
         ? __kasan_check_write+0x14/0x20
         iterate_inodes_from_logical+0x129/0x170
         ? iterate_inodes_from_logical+0x129/0x170
         ? btrfs_inode_flags_to_xflags+0x50/0x50
         ? iterate_extent_inodes+0x580/0x580
         ? __vmalloc_node+0x92/0xb0
         ? init_data_container+0x34/0xb0
         ? init_data_container+0x34/0xb0
         ? kvmalloc_node+0x60/0x80
         btrfs_ioctl_logical_to_ino+0x158/0x230
         btrfs_ioctl+0x205e/0x4040
         ? __might_sleep+0x71/0xe0
         ? btrfs_ioctl_get_supported_features+0x30/0x30
         ? getrusage+0x4b6/0x9c0
         ? __kasan_check_read+0x11/0x20
         ? lock_release+0xc8/0x620
         ? __might_fault+0x64/0xd0
         ? lock_acquire+0xc7/0x510
         ? lock_downgrade+0x3d0/0x3d0
         ? lockdep_hardirqs_on_prepare+0x210/0x210
         ? lockdep_hardirqs_on_prepare+0x210/0x210
         ? __kasan_check_read+0x11/0x20
         ? do_vfs_ioctl+0xfc/0x9d0
         ? ioctl_file_clone+0xe0/0xe0
         ? lock_downgrade+0x3d0/0x3d0
         ? lockdep_hardirqs_on_prepare+0x210/0x210
         ? __kasan_check_read+0x11/0x20
         ? lock_release+0xc8/0x620
         ? __task_pid_nr_ns+0xd3/0x250
         ? lock_acquire+0xc7/0x510
         ? __fget_files+0x160/0x230
         ? __fget_light+0xf2/0x110
         __x64_sys_ioctl+0xc3/0x100
         do_syscall_64+0x37/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7fd1976e2427
        Code: 00 00 90 48 8b 05 (...)
        RSP: 002b:00007fd1955e5cf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        RAX: ffffffffffffffda RBX: 00007fd1955e5f40 RCX: 00007fd1976e2427
        RDX: 00007fd1955e5f48 RSI: 00000000c038943b RDI: 0000000000000004
        RBP: 0000000001000000 R08: 0000000000000000 R09: 00007fd1955e6120
        R10: 0000557835366b00 R11: 0000000000000246 R12: 0000000000000004
        R13: 00007fd1955e5f48 R14: 00007fd1955e5f40 R15: 00007fd1955e5ef8
        Modules linked in:
        ---[ end trace ec8931a1c36e57be ]---
      
        (gdb) l *(__tree_mod_log_rewind+0x3b1)
        0xffffffff81893521 is in __tree_mod_log_rewind (fs/btrfs/ctree.c:1210).
        1205                     * the modification. as we're going backwards, we do the
        1206                     * opposite of each operation here.
        1207                     */
        1208                    switch (tm->op) {
        1209                    case MOD_LOG_KEY_REMOVE_WHILE_FREEING:
        1210                            BUG_ON(tm->slot < n);
        1211                            fallthrough;
        1212                    case MOD_LOG_KEY_REMOVE_WHILE_MOVING:
        1213                    case MOD_LOG_KEY_REMOVE:
        1214                            btrfs_set_node_key(eb, &tm->key, tm->slot);
      
      Here's what happens to hit that BUG_ON():
      
      1) We have one tree mod log user (through fiemap or the logical ino ioctl),
         with a sequence number of 1, so we have fs_info->tree_mod_seq == 1;
      
      2) Another task is at ctree.c:balance_level() and we have eb X currently as
         the root of the tree, and we promote its single child, eb Y, as the new
         root.
      
         Then, at ctree.c:balance_level(), we call:
      
            tree_mod_log_insert_root(eb X, eb Y, 1);
      
      3) At tree_mod_log_insert_root() we create tree mod log elements for each
         slot of eb X, of operation type MOD_LOG_KEY_REMOVE_WHILE_FREEING each
         with a ->logical pointing to ebX->start. These are placed in an array
         named tm_list.
         Let's assume there are N elements (N pointers in eb X);
      
      4) Then, still at tree_mod_log_insert_root(), we create a tree mod log
         element of operation type MOD_LOG_ROOT_REPLACE, ->logical set to
         ebY->start, ->old_root.logical set to ebX->start, ->old_root.level set
         to the level of eb X and ->generation set to the generation of eb X;
      
      5) Then tree_mod_log_insert_root() calls tree_mod_log_free_eb() with
         tm_list as argument. After that, tree_mod_log_free_eb() calls
         __tree_mod_log_insert() for each member of tm_list in reverse order,
         from highest slot in eb X, slot N - 1, to slot 0 of eb X;
      
      6) __tree_mod_log_insert() sets the sequence number of each given tree mod
         log operation - it increments fs_info->tree_mod_seq and sets
         fs_info->tree_mod_seq as the sequence number of the given tree mod log
         operation.
      
         This means that for the tm_list created at tree_mod_log_insert_root(),
         the element corresponding to slot 0 of eb X has the highest sequence
         number (1 + N), and the element corresponding to the last slot has the
         lowest sequence number (2);
      
      7) Then, after inserting tm_list's elements into the tree mod log rbtree,
         the MOD_LOG_ROOT_REPLACE element is inserted, which gets the highest
         sequence number, which is N + 2;
      
      8) Back to ctree.c:balance_level(), we free eb X by calling
         btrfs_free_tree_block() on it. Because eb X was created in the current
         transaction, has no other references and writeback did not happen for
         it, we add it back to the free space cache/tree;
      
      9) Later some other task T allocates the metadata extent from eb X, since
         it is marked as free space in the space cache/tree, and uses it as a
         node for some other btree;
      
      10) The tree mod log user task calls btrfs_search_old_slot(), which calls
          get_old_root(), and finally that calls __tree_mod_log_oldest_root()
          with time_seq == 1 and eb_root == eb Y;
      
      11) First iteration of the while loop finds the tree mod log element with
          sequence number N + 2, for the logical address of eb Y and of type
          MOD_LOG_ROOT_REPLACE;
      
      12) Because the operation type is MOD_LOG_ROOT_REPLACE, we don't break out
          of the loop, and set root_logical to point to tm->old_root.logical
          which corresponds to the logical address of eb X;
      
      13) On the next iteration of the while loop, the call to
          tree_mod_log_search_oldest() returns the smallest tree mod log element
          for the logical address of eb X, which has a sequence number of 2, an
          operation type of MOD_LOG_KEY_REMOVE_WHILE_FREEING and corresponds to
          the old slot N - 1 of eb X (eb X had N items in it before being freed);
      
      14) We then break out of the while loop and return the tree mod log operation
          of type MOD_LOG_ROOT_REPLACE (eb Y), and not the one for slot N - 1 of
          eb X, to get_old_root();
      
      15) At get_old_root(), we process the MOD_LOG_ROOT_REPLACE operation
          and set "logical" to the logical address of eb X, which was the old
          root. We then call tree_mod_log_search() passing it the logical
          address of eb X and time_seq == 1;
      
      16) Then before calling tree_mod_log_search(), task T adds a key to eb X,
          which results in adding a tree mod log operation of type
          MOD_LOG_KEY_ADD to the tree mod log - this is done at
          ctree.c:insert_ptr() - but after adding the tree mod log operation
          and before updating the number of items in eb X from 0 to 1...
      
      17) The task at get_old_root() calls tree_mod_log_search() and gets the
          tree mod log operation of type MOD_LOG_KEY_ADD just added by task T.
          Then it enters the following if branch:
      
          if (old_root && tm && tm->op != MOD_LOG_KEY_REMOVE_WHILE_FREEING) {
             (...)
          } (...)
      
          Calls read_tree_block() for eb X, which gets a reference on eb X but
          does not lock it - task T has it locked.
          Then it clones eb X while it has nritems set to 0 in its header, before
          task T sets nritems to 1 in eb X's header. From here on we use the
          clone of eb X which no other task has access to;
      
      18) Then we call __tree_mod_log_rewind(), passing it the MOD_LOG_KEY_ADD
          mod log operation we just got from tree_mod_log_search() in the
          previous step and the cloned version of eb X;
      
      19) At __tree_mod_log_rewind(), we set the local variable "n" to the number
          of items set in eb X's clone, which is 0. Then we enter the while loop,
          and in its first iteration we process the MOD_LOG_KEY_ADD operation,
          which just decrements "n" from 0 to (u32)-1, since "n" is declared with
          a type of u32. At the end of this iteration we call rb_next() to find the
          next tree mod log operation for eb X, that gives us the mod log operation
          of type MOD_LOG_KEY_REMOVE_WHILE_FREEING, for slot 0, with a sequence
          number of N + 1 (steps 3 to 6);
      
      20) Then we go back to the top of the while loop and trigger the following
          BUG_ON():
      
              (...)
              switch (tm->op) {
              case MOD_LOG_KEY_REMOVE_WHILE_FREEING:
                       BUG_ON(tm->slot < n);
                       fallthrough;
              (...)
      
          Because "n" has a value of (u32)-1 (4294967295) and tm->slot is 0.
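
      The wraparound in steps 19 and 20 can be reproduced in isolation. The
      following is only a user-space sketch, not the kernel code: the helper
      names rewind_key_add() and bug_on_would_fire() are hypothetical, and the
      only things taken from the description above are the u32 type of "n" and
      the BUG_ON() condition.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for rewinding one MOD_LOG_KEY_ADD operation:
 * undoing an added key removes one item, so the item count is decremented.
 * With an unsigned 32-bit count, 0 - 1 wraps around to UINT32_MAX.
 */
static uint32_t rewind_key_add(uint32_t n)
{
	return n - 1;
}

/* Mirrors the condition checked by BUG_ON(tm->slot < n). */
static int bug_on_would_fire(uint32_t slot, uint32_t n)
{
	return slot < n;
}
```

      With a clone that was taken while nritems was still 0, rewinding the
      MOD_LOG_KEY_ADD yields (u32)-1, so the check for slot 0 fires. Had the
      clone been taken after nritems was set to 1, "n" would be 0 and the
      check would pass.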
      
      Fix this by taking a read lock on the extent buffer before cloning it at
      ctree.c:get_old_root(). This should be done regardless of the extent
      buffer having been freed and reused, as a concurrent task might be
      modifying it (while holding a write lock on it).
      Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Link: https://lore.kernel.org/linux-btrfs/20210227155037.GN28049@hungrycats.org/
      Fixes: 834328a8 ("Btrfs: tree mod log's old roots could still be part of the tree")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • C
      zonefs: fix to update .i_wr_refcnt correctly in zonefs_open_zone() · 1026615a
      Chao Yu committed
      stable inclusion
      from stable-5.10.26
      commit 78486cf1f31e3f646a981f91f4be3db62689265e
      bugzilla: 51363
      
      --------------------------------
      
      commit 6980d29c upstream.
      
      In zonefs_open_zone(), if the number of open zones exceeds the
      .s_max_open_zones threshold, we fail to restore .i_wr_refcnt.
      Fix this.
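
      The pattern being fixed can be sketched in user space as follows. This
      is an illustration only: struct zone_file, struct zone_sb and
      open_zone() are hypothetical names, the counters are plain integers
      rather than the kernel's types, and locking is left out.

```c
#include <errno.h>

/* Hypothetical per-file state: number of writers holding the zone open. */
struct zone_file {
	unsigned int wr_refcnt;
};

/* Hypothetical per-superblock state: open zone count and its limit. */
struct zone_sb {
	unsigned int open_zones;
	unsigned int max_open_zones;
};

static int open_zone(struct zone_sb *sb, struct zone_file *zf)
{
	zf->wr_refcnt++;
	if (zf->wr_refcnt == 1) {
		/* First writer: this file now counts as an open zone. */
		sb->open_zones++;
		if (sb->open_zones > sb->max_open_zones) {
			/* Over the limit: undo both counters before failing. */
			sb->open_zones--;
			zf->wr_refcnt--;
			return -EBUSY;
		}
	}
	return 0;
}
```

      Without the decrement of wr_refcnt on the failure path, a later open of
      the same file would see a non-zero count and skip the limit check.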
      
      Fixes: b5c00e97 ("zonefs: open/close zone on file open/close")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • D
      zonefs: prevent use of seq files as swap file · b30912b0
      Damien Le Moal committed
      stable inclusion
      from stable-5.10.26
      commit 9c1c5e81a00250628b1dea74b815fc641ee77952
      bugzilla: 51363
      
      --------------------------------
      
      commit 1601ea06 upstream.
      
      The sequential write constraint of sequential zone files prevents their
      use as swap files. Only allow conventional zone files to be used as swap
      files.
      
      Fixes: 8dcc1a9d ("fs: New zonefs file system")
      Cc: <stable@vger.kernel.org>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • D
      zonefs: Fix O_APPEND async write handling · 54eb69ce
      Damien Le Moal committed
      stable inclusion
      from stable-5.10.26
      commit dfbdbf0f359abbe5005ee3d99d1923af904c8584
      bugzilla: 51363
      
      --------------------------------
      
      commit ebfd68cd upstream.
      
      zonefs updates the size of a sequential zone file inode only on
      completion of direct writes. When executing asynchronous append writes
      (with a file opened with O_APPEND or using RWF_APPEND), the use of the
      current inode size in generic_write_checks() to set an iocb offset thus
      leads to an unaligned write if an application issues an append write
      operation while another write is still being executed.
      
      Fix this problem by introducing zonefs_write_checks() as a modified
      version of generic_write_checks() using the file inode wp_offset for an
      append write iocb offset. Also introduce zonefs_write_check_limits() to
      replace the generic_write_check_limits() call. This zonefs special helper
      makes sure that the maximum file limit used is the maximum size of the
      file being accessed.
      
      Since zonefs_write_checks() already truncates the iov_iter, the calls
      to iov_iter_truncate() in zonefs_file_dio_write() and
      zonefs_file_buffered_write() are removed.
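
      The offset collision described above can be modelled in user space.
      This is a sketch under stated assumptions, not the zonefs code: struct
      seq_file_state, submit_append() and complete_write() are hypothetical,
      and the model assumes that wp_offset advances when a write is submitted
      while i_size advances only when a write completes.

```c
#include <stdint.h>

/* Hypothetical model of a sequential zone file. */
struct seq_file_state {
	uint64_t i_size;    /* advances only when a direct write completes */
	uint64_t wp_offset; /* assumed to advance when a write is submitted */
};

/*
 * Pick the offset for an append write and account for its submission.
 * use_wp_offset == 0 models taking the offset from the inode size,
 * use_wp_offset != 0 models taking it from wp_offset.
 */
static uint64_t submit_append(struct seq_file_state *f, uint64_t len,
			      int use_wp_offset)
{
	uint64_t pos = use_wp_offset ? f->wp_offset : f->i_size;

	f->wp_offset += len;
	return pos;
}

/* Model the completion of a previously submitted write. */
static void complete_write(struct seq_file_state *f, uint64_t len)
{
	f->i_size += len;
}
```

      In this model, two appends submitted back to back with no completion in
      between get the same offset when i_size is used, and consecutive offsets
      when wp_offset is used.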
      
      Fixes: 8dcc1a9d ("fs: New zonefs file system")
      Cc: <stable@vger.kernel.org>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • J
      Revert "nfsd4: a client's own opens needn't prevent delegations" · fc206bef
      J. Bruce Fields committed
      stable inclusion
      from stable-5.10.25
      commit df8596f5774387f92133e0e5b7e05808ff6595d7
      bugzilla: 51362
      
      --------------------------------
      
      commit 6ee65a77 upstream.
      
      This reverts commit 94415b06.
      
      That commit claimed to allow a client to get a read delegation when it
      was the only writer.  Actually it allowed a client to get a read
      delegation when *any* client has a write open!
      
      The main problem is that it's depending on nfs4_clnt_odstate structures
      that are actually only maintained for pnfs exports.
      
      This causes clients to miss writes performed by other clients, even when
      there have been intervening closes and opens, violating close-to-open
      cache consistency.
      
      We can do this a different way, but first we should just revert this.
      
      I've added pynfs 4.1 test DELEG19 to test for this, as I should have
      done originally!
      
      Cc: stable@vger.kernel.org
      Reported-by: Timo Rothenpieler <timo@rothenpieler.org>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • J
      Revert "nfsd4: remove check_conflicting_opens warning" · 66d46437
      J. Bruce Fields committed
      stable inclusion
      from stable-5.10.25
      commit 894ecf0cb505561b9f37b302b7479eea939b0790
      bugzilla: 51362
      
      --------------------------------
      
      commit 4aa5e002 upstream.
      
      This reverts commit 50747dd5 "nfsd4: remove check_conflicting_opens
      warning", as a prerequisite for reverting 94415b06, which has a
      serious bug.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • A
      fuse: fix live lock in fuse_iget() · 5cd84439
      Amir Goldstein committed
      stable inclusion
      from stable-5.10.25
      commit d955f13ea2120269319d6133d0dd82b66d1eeca3
      bugzilla: 51362
      
      --------------------------------
      
      commit 775c5033 upstream.
      
      Commit 5d069dbe ("fuse: fix bad inode") replaced make_bad_inode()
      in fuse_iget() with a private implementation fuse_make_bad().
      
      The private implementation fails to remove the bad inode from the inode
      cache, so the retry loop with iget5_locked() finds the same bad inode
      and marks it bad forever.
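
      The live lock can be illustrated with a toy model. Everything here is
      hypothetical (struct toy_inode, the one-entry cache_slot, toy_iget() and
      its max_retries bound, which exists only so that the sketch terminates);
      it does not reproduce fuse_iget() or the inode cache API. The "stale"
      flag stands for whatever condition makes the lookup reject a cached
      inode.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_inode {
	bool bad;   /* set when the inode is marked bad */
	bool stale; /* cached inode no longer matches what the lookup wants */
};

static struct toy_inode *cache_slot; /* one-entry stand-in for the inode cache */
static struct toy_inode fresh;       /* what a cache miss would allocate */

/*
 * Look up an inode, retrying when the cached one has to be rejected.
 * Returns NULL if the retry budget is exhausted, which models the
 * endless loop described above.
 */
static struct toy_inode *toy_iget(bool unhash_on_bad, int max_retries)
{
	for (int i = 0; i < max_retries; i++) {
		/* Cache hit returns the cached inode, a miss inserts a new one. */
		struct toy_inode *inode = cache_slot ? cache_slot : &fresh;

		cache_slot = inode;
		if (!inode->stale)
			return inode;

		inode->bad = true;
		if (unhash_on_bad)
			cache_slot = NULL; /* drop it so the retry cannot find it */
	}
	return NULL;
}
```

      When the rejected inode stays in the cache, every retry finds it again
      and the budget runs out. When it is dropped from the cache as it is
      marked bad, the next iteration gets a fresh inode.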
      
      kmsg snip:
      
      [ ] rcu: INFO: rcu_sched self-detected stall on CPU
      ...
      [ ]  ? bit_wait_io+0x50/0x50
      [ ]  ? fuse_init_file_inode+0x70/0x70
      [ ]  ? find_inode.isra.32+0x60/0xb0
      [ ]  ? fuse_init_file_inode+0x70/0x70
      [ ]  ilookup5_nowait+0x65/0x90
      [ ]  ? fuse_init_file_inode+0x70/0x70
      [ ]  ilookup5.part.36+0x2e/0x80
      [ ]  ? fuse_init_file_inode+0x70/0x70
      [ ]  ? fuse_inode_eq+0x20/0x20
      [ ]  iget5_locked+0x21/0x80
      [ ]  ? fuse_inode_eq+0x20/0x20
      [ ]  fuse_iget+0x96/0x1b0
      
      Fixes: 5d069dbe ("fuse: fix bad inode")
      Cc: stable@vger.kernel.org # 5.10+
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  2. 09 Apr, 2021 23 commits