1. 12 12月, 2020 1 次提交
    • M
      proc: use untagged_addr() for pagemap_read addresses · 40d6366e
      Miles Chen 提交于
      When we try to visit the pagemap of a tagged userspace pointer, we find
      that the start_vaddr is not correct because of the tag.
      To fix it, we should untag the userspace pointers in pagemap_read().
      
      I tested with 5.10-rc4 and the issue remains.
      
      Explanation from Catalin in [1]:
      
       "Arguably, that's a user-space bug since tagged file offsets were never
        supported. In this case it's not even a tag at bit 56 as per the arm64
        tagged address ABI but rather down to bit 47. You could say that the
        problem is caused by the C library (malloc()) or whoever created the
        tagged vaddr and passed it to this function. It's not a kernel
        regression as we've never supported it.
      
        Now, pagemap is a special case where the offset is usually not
        generated as a classic file offset but rather derived by shifting a
        user virtual address. I guess we can make a concession for pagemap
        (only) and allow such offset with the tag at bit (56 - PAGE_SHIFT + 3)"
      
      My test code is based on [2]:
      
      A userspace pointer which has been tagged by 0xb4: 0xb400007662f541c8
      
      userspace program:
      
        uint64 OsLayer::VirtualToPhysical(void *vaddr) {
      	uint64 frame, paddr, pfnmask, pagemask;
      	int pagesize = sysconf(_SC_PAGESIZE);
      	off64_t off = ((uintptr_t)vaddr) / pagesize * 8; // off = 0xb400007662f541c8 / pagesize * 8 = 0x5a00003b317aa0
      	int fd = open(kPagemapPath, O_RDONLY);
      	...
      
      	if (lseek64(fd, off, SEEK_SET) != off || read(fd, &frame, 8) != 8) {
      		int err = errno;
      		string errtxt = ErrorString(err);
      		if (fd >= 0)
      			close(fd);
      		return 0;
      	}
        ...
        }
      
      kernel fs/proc/task_mmu.c:
      
        static ssize_t pagemap_read(struct file *file, char __user *buf,
      		size_t count, loff_t *ppos)
        {
      	...
      	src = *ppos;
      	svpfn = src / PM_ENTRY_BYTES; // svpfn == 0xb400007662f54
      	start_vaddr = svpfn << PAGE_SHIFT; // start_vaddr == 0xb400007662f54000
      	end_vaddr = mm->task_size;
      
      	/* watch out for wraparound */
      	// svpfn == 0xb400007662f54
      	// (mm->task_size >> PAGE) == 0x8000000
      	if (svpfn > mm->task_size >> PAGE_SHIFT) // the condition is true because of the tag 0xb4
      		start_vaddr = end_vaddr;
      
      	ret = 0;
      	while (count && (start_vaddr < end_vaddr)) { // we cannot visit correct entry because start_vaddr is set to end_vaddr
      		int len;
      		unsigned long end;
      		...
      	}
      	...
        }
      
      [1] https://lore.kernel.org/patchwork/patch/1343258/
      [2] https://github.com/stressapptest/stressapptest/blob/master/src/os.cc#L158
      
      Link: https://lkml.kernel.org/r/20201204024347.8295-1-miles.chen@mediatek.comSigned-off-by: NMiles Chen <miles.chen@mediatek.com>
      Reviewed-by: NVincenzo Frascino <vincenzo.frascino@arm.com>
      Reviewed-by: NCatalin Marinas <catalin.marinas@arm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
      Cc: <stable@vger.kernel.org>	[5.4-]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      40d6366e
  2. 11 12月, 2020 3 次提交
    • A
      NFS: Disable READ_PLUS by default · 21e31401
      Anna Schumaker 提交于
      We've been seeing failures with xfstests generic/091 and generic/263
      when using READ_PLUS. I've made some progress on these issues, and the
      tests fail later on but still don't pass. Let's disable READ_PLUS by
      default until we can work out what is going on.
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      21e31401
    • D
      NFSv4.2: Fix 5 seconds delay when doing inter server copy · fe8eb820
      Dai Ngo 提交于
      Since commit b4868b44 ("NFSv4: Wait for stateid updates after
      CLOSE/OPEN_DOWNGRADE"), every inter server copy operation suffers 5
      seconds delay regardless of the size of the copy. The delay is from
      nfs_set_open_stateid_locked when the check by nfs_stateid_is_sequential
      fails because the seqid in both nfs4_state and nfs4_stateid are 0.
      
      Fix __nfs42_ssc_open to delay setting of NFS_OPEN_STATE in nfs4_state,
      until after the call to update_open_stateid, to indicate this is the 1st
      open. This fix is part of a 2 patches, the other patch is the fix in the
      source server to return the stateid for COPY_NOTIFY request with seqid 1
      instead of 0.
      
      Fixes: ce0887ac ("NFSD add nfs4 inter ssc to nfsd4_copy")
      Signed-off-by: NDai Ngo <dai.ngo@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      fe8eb820
    • C
      NFS: Fix rpcrdma_inline_fixup() crash with new LISTXATTRS operation · 1c87b851
      Chuck Lever 提交于
      By switching to an XFS-backed export, I am able to reproduce the
      ibcomp worker crash on my client with xfstests generic/013.
      
      For the failing LISTXATTRS operation, xdr_inline_pages() is called
      with page_len=12 and buflen=128.
      
      - When ->send_request() is called, rpcrdma_marshal_req() does not
        set up a Reply chunk because buflen is smaller than the inline
        threshold. Thus rpcrdma_convert_iovs() does not get invoked at
        all and the transport's XDRBUF_SPARSE_PAGES logic is not invoked
        on the receive buffer.
      
      - During reply processing, rpcrdma_inline_fixup() tries to copy
        received data into rq_rcv_buf->pages because page_len is positive.
        But there are no receive pages because rpcrdma_marshal_req() never
        allocated them.
      
      The result is that the ibcomp worker faults and dies. Sometimes that
      causes a visible crash, and sometimes it results in a transport hang
      without other symptoms.
      
      RPC/RDMA's XDRBUF_SPARSE_PAGES support is not entirely correct, and
      should eventually be fixed or replaced. However, my preference is
      that upper-layer operations should explicitly allocate their receive
      buffers (using GFP_KERNEL) when possible, rather than relying on
      XDRBUF_SPARSE_PAGES.
      Reported-by: NOlga kornievskaia <kolga@netapp.com>
      Suggested-by: NOlga kornievskaia <kolga@netapp.com>
      Fixes: c10a7514 ("NFSv4.2: add the extended attribute proc functions.")
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Reviewed-by: NOlga kornievskaia <kolga@netapp.com>
      Reviewed-by: NFrank van der Linden <fllinden@amazon.com>
      Tested-by: NOlga kornievskaia <kolga@netapp.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      1c87b851
  3. 10 12月, 2020 1 次提交
    • D
      zonefs: fix page reference and BIO leak · 6bea0225
      Damien Le Moal 提交于
      In zonefs_file_dio_append(), the pages obtained using
      bio_iov_iter_get_pages() are not released on completion of the
      REQ_OP_APPEND BIO, nor when bio_iov_iter_get_pages() fails.
      Furthermore, a call to bio_put() is missing when
      bio_iov_iter_get_pages() fails.
      
      Fix these resource leaks by adding BIO resource release code (bio_put()i
      and bio_release_pages()) at the end of the function after the BIO
      execution and add a jump to this resource cleanup code in case of
      bio_iov_iter_get_pages() failure.
      
      While at it, also fix the call to task_io_account_write() to be passed
      the correct BIO size instead of bio_iov_iter_get_pages() return value.
      Reported-by: NChristoph Hellwig <hch@lst.de>
      Fixes: 02ef12a6 ("zonefs: use REQ_OP_ZONE_APPEND for sync DIO")
      Cc: stable@vger.kernel.org
      Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      6bea0225
  4. 09 12月, 2020 1 次提交
    • D
      afs: Fix memory leak when mounting with multiple source parameters · 4cb68296
      David Howells 提交于
      There's a memory leak in afs_parse_source() whereby multiple source=
      parameters overwrite fc->source in the fs_context struct without freeing
      the previously recorded source.
      
      Fix this by only permitting a single source parameter and rejecting with
      an error all subsequent ones.
      
      This was caught by syzbot with the kernel memory leak detector, showing
      something like the following trace:
      
        unreferenced object 0xffff888114375440 (size 32):
          comm "repro", pid 5168, jiffies 4294923723 (age 569.948s)
          backtrace:
            slab_post_alloc_hook+0x42/0x79
            __kmalloc_track_caller+0x125/0x16a
            kmemdup_nul+0x24/0x3c
            vfs_parse_fs_string+0x5a/0xa1
            generic_parse_monolithic+0x9d/0xc5
            do_new_mount+0x10d/0x15a
            do_mount+0x5f/0x8e
            __do_sys_mount+0xff/0x127
            do_syscall_64+0x2d/0x3a
            entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: 13fcc683 ("afs: Add fs_context support")
      Reported-by: syzbot+86dc6632faaca40133ab@syzkaller.appspotmail.com
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4cb68296
  5. 08 12月, 2020 1 次提交
    • H
      io_uring: fix file leak on error path of io ctx creation · f26c08b4
      Hillf Danton 提交于
      Put file as part of error handling when setting up io ctx to fix
      memory leaks like the following one.
      
         BUG: memory leak
         unreferenced object 0xffff888101ea2200 (size 256):
           comm "syz-executor355", pid 8470, jiffies 4294953658 (age 32.400s)
           hex dump (first 32 bytes):
             00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
             20 59 03 01 81 88 ff ff 80 87 a8 10 81 88 ff ff   Y..............
           backtrace:
             [<000000002e0a7c5f>] kmem_cache_zalloc include/linux/slab.h:654 [inline]
             [<000000002e0a7c5f>] __alloc_file+0x1f/0x130 fs/file_table.c:101
             [<000000001a55b73a>] alloc_empty_file+0x69/0x120 fs/file_table.c:151
             [<00000000fb22349e>] alloc_file+0x33/0x1b0 fs/file_table.c:193
             [<000000006e1465bb>] alloc_file_pseudo+0xb2/0x140 fs/file_table.c:233
             [<000000007118092a>] anon_inode_getfile fs/anon_inodes.c:91 [inline]
             [<000000007118092a>] anon_inode_getfile+0xaa/0x120 fs/anon_inodes.c:74
             [<000000002ae99012>] io_uring_get_fd fs/io_uring.c:9198 [inline]
             [<000000002ae99012>] io_uring_create fs/io_uring.c:9377 [inline]
             [<000000002ae99012>] io_uring_setup+0x1125/0x1630 fs/io_uring.c:9411
             [<000000008280baad>] do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
             [<00000000685d8cf0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Reported-by: syzbot+71c4697e27c99fddcf17@syzkaller.appspotmail.com
      Fixes: 0f212204 ("io_uring: don't rely on weak ->files references")
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NHillf Danton <hdanton@sina.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f26c08b4
  6. 07 12月, 2020 2 次提交
  7. 04 12月, 2020 3 次提交
  8. 02 12月, 2020 2 次提交
  9. 01 12月, 2020 4 次提交
  10. 30 11月, 2020 1 次提交
  11. 27 11月, 2020 1 次提交
    • A
      gfs2: Upgrade shared glocks for atime updates · 82e938bd
      Andreas Gruenbacher 提交于
      Commit 20f82999 ("gfs2: Rework read and page fault locking") lifted
      the glock lock taking from the low-level ->readpage and ->readahead
      address space operations to the higher-level ->read_iter file and
      ->fault vm operations.  The glocks are still taken in LM_ST_SHARED mode
      only.  On filesystems mounted without the noatime option, ->read_iter
      sometimes needs to update the atime as well, though.  Right now, this
      leads to a failed locking mode assertion in gfs2_dirty_inode.
      
      Fix that by introducing a new update_time inode operation.  There, if
      the glock is held non-exclusively, upgrade it to an exclusive lock.
      Reported-by: NAlexander Aring <aahringo@redhat.com>
      Fixes: 20f82999 ("gfs2: Rework read and page fault locking")
      Cc: stable@vger.kernel.org # v5.8+
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      82e938bd
  12. 26 11月, 2020 3 次提交
    • P
      io_uring: fix files grab/cancel race · af604703
      Pavel Begunkov 提交于
      When one task is in io_uring_cancel_files() and another is doing
      io_prep_async_work() a race may happen. That's because after accounting
      a request inflight in first call to io_grab_identity() it still may fail
      and go to io_identity_cow(), which migh briefly keep dangling
      work.identity and not only.
      
      Grab files last, so io_prep_async_work() won't fail if it did get into
      ->inflight_list.
      
      note: the bug shouldn't exist after making io_uring_cancel_files() not
      poking into other tasks' requests.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      af604703
    • B
      gfs2: Don't freeze the file system during unmount · f39e7d3a
      Bob Peterson 提交于
      GFS2's freeze/thaw mechanism uses a special freeze glock to control its
      operation. It does this with a sync glock operation (glops.c) called
      freeze_go_sync. When the freeze glock is demoted (glock's do_xmote) the
      glops function causes the file system to be frozen. This is intended. However,
      GFS2's mount and unmount processes also hold the freeze glock to prevent other
      processes, perhaps on different cluster nodes, from mounting the frozen file
      system in read-write mode.
      
      Before this patch, there was no check in freeze_go_sync for whether a freeze
      in intended or whether the glock demote was caused by a normal unmount.
      So it was trying to freeze the file system it's trying to unmount, which
      ends up in a deadlock.
      
      This patch adds an additional check to freeze_go_sync so that demotes of the
      freeze glock are ignored if they come from the unmount process.
      
      Fixes: 20b32912 ("gfs2: Fix regression in freeze_go_sync")
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      f39e7d3a
    • B
      gfs2: check for empty rgrp tree in gfs2_ri_update · 77872151
      Bob Peterson 提交于
      If gfs2 tries to mount a (corrupt) file system that has no resource
      groups it still tries to set preferences on the first one, which causes
      a kernel null pointer dereference. This patch adds a check to function
      gfs2_ri_update so this condition is detected and reported back as an
      error.
      
      Reported-by: syzbot+e3f23ce40269a4c9053a@syzkaller.appspotmail.com
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      77872151
  13. 25 11月, 2020 3 次提交
    • A
      efivarfs: revert "fix memory leak in efivarfs_create()" · ff04f3b6
      Ard Biesheuvel 提交于
      The memory leak addressed by commit fe5186cf is a false positive:
      all allocations are recorded in a linked list, and freed when the
      filesystem is unmounted. This leads to double frees, and as reported
      by David, leads to crashes if SLUB is configured to self destruct when
      double frees occur.
      
      So drop the redundant kfree() again, and instead, mark the offending
      pointer variable so the allocation is ignored by kmemleak.
      
      Cc: Vamshi K Sthambamkadi <vamshi.k.sthambamkadi@gmail.com>
      Fixes: fe5186cf ("efivarfs: fix memory leak in efivarfs_create()")
      Reported-by: NDavid Laight <David.Laight@aculab.com>
      Signed-off-by: NArd Biesheuvel <ardb@kernel.org>
      ff04f3b6
    • A
      gfs2: set lockdep subclass for iopen glocks · 515b269d
      Alexander Aring 提交于
      This patch introduce a new globs attribute to define the subclass of the
      glock lockref spinlock. This avoid the following lockdep warning, which
      occurs when we lock an inode lock while an iopen lock is held:
      
      ============================================
      WARNING: possible recursive locking detected
      5.10.0-rc3+ #4990 Not tainted
      --------------------------------------------
      kworker/0:1/12 is trying to acquire lock:
      ffff9067d45672d8 (&gl->gl_lockref.lock){+.+.}-{3:3}, at: lockref_get+0x9/0x20
      
      but task is already holding lock:
      ffff9067da308588 (&gl->gl_lockref.lock){+.+.}-{3:3}, at: delete_work_func+0x164/0x260
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock(&gl->gl_lockref.lock);
        lock(&gl->gl_lockref.lock);
      
       *** DEADLOCK ***
      
       May be due to missing lock nesting notation
      
      3 locks held by kworker/0:1/12:
       #0: ffff9067c1bfdd38 ((wq_completion)delete_workqueue){+.+.}-{0:0}, at: process_one_work+0x1b7/0x540
       #1: ffffac594006be70 ((work_completion)(&(&gl->gl_delete)->work)){+.+.}-{0:0}, at: process_one_work+0x1b7/0x540
       #2: ffff9067da308588 (&gl->gl_lockref.lock){+.+.}-{3:3}, at: delete_work_func+0x164/0x260
      
      stack backtrace:
      CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.10.0-rc3+ #4990
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
      Workqueue: delete_workqueue delete_work_func
      Call Trace:
       dump_stack+0x8b/0xb0
       __lock_acquire.cold+0x19e/0x2e3
       lock_acquire+0x150/0x410
       ? lockref_get+0x9/0x20
       _raw_spin_lock+0x27/0x40
       ? lockref_get+0x9/0x20
       lockref_get+0x9/0x20
       delete_work_func+0x188/0x260
       process_one_work+0x237/0x540
       worker_thread+0x4d/0x3b0
       ? process_one_work+0x540/0x540
       kthread+0x127/0x140
       ? __kthread_bind_mask+0x60/0x60
       ret_from_fork+0x22/0x30
      Suggested-by: NAndreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: NAlexander Aring <aahringo@redhat.com>
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      515b269d
    • A
      gfs2: Fix deadlock dumping resource group glocks · 16e6281b
      Alexander Aring 提交于
      Commit 0e539ca1 ("gfs2: Fix NULL pointer dereference in gfs2_rgrp_dump")
      introduced additional locking in gfs2_rgrp_go_dump, which is also used for
      dumping resource group glocks via debugfs.  However, on that code path, the
      glock spin lock is already taken in dump_glock, and taking it again in
      gfs2_glock2rgrp leads to deadlock.  This can be reproduced with:
      
        $ mkfs.gfs2 -O -p lock_nolock /dev/FOO
        $ mount /dev/FOO /mnt/foo
        $ touch /mnt/foo/bar
        $ cat /sys/kernel/debug/gfs2/FOO/glocks
      
      Fix that by not taking the glock spin lock inside the go_dump callback.
      
      Fixes: 0e539ca1 ("gfs2: Fix NULL pointer dereference in gfs2_rgrp_dump")
      Signed-off-by: NAlexander Aring <aahringo@redhat.com>
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      16e6281b
  14. 24 11月, 2020 7 次提交
    • P
      io_uring: fix ITER_BVEC check · 9c3a205c
      Pavel Begunkov 提交于
      iov_iter::type is a bitmask that also keeps direction etc., so it
      shouldn't be directly compared against ITER_*. Use proper helper.
      
      Fixes: ff6165b2 ("io_uring: retain iov_iter state over io_read/io_write calls")
      Reported-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Cc: <stable@vger.kernel.org> # 5.9
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      9c3a205c
    • J
      io_uring: fix shift-out-of-bounds when round up cq size · eb2667b3
      Joseph Qi 提交于
      Abaci Fuzz reported a shift-out-of-bounds BUG in io_uring_create():
      
      [ 59.598207] UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13
      [ 59.599665] shift exponent 64 is too large for 64-bit type 'long unsigned int'
      [ 59.601230] CPU: 0 PID: 963 Comm: a.out Not tainted 5.10.0-rc4+ #3
      [ 59.602502] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [ 59.603673] Call Trace:
      [ 59.604286] dump_stack+0x107/0x163
      [ 59.605237] ubsan_epilogue+0xb/0x5a
      [ 59.606094] __ubsan_handle_shift_out_of_bounds.cold+0xb2/0x20e
      [ 59.607335] ? lock_downgrade+0x6c0/0x6c0
      [ 59.608182] ? rcu_read_lock_sched_held+0xaf/0xe0
      [ 59.609166] io_uring_create.cold+0x99/0x149
      [ 59.610114] io_uring_setup+0xd6/0x140
      [ 59.610975] ? io_uring_create+0x2510/0x2510
      [ 59.611945] ? lockdep_hardirqs_on_prepare+0x286/0x400
      [ 59.613007] ? syscall_enter_from_user_mode+0x27/0x80
      [ 59.614038] ? trace_hardirqs_on+0x5b/0x180
      [ 59.615056] do_syscall_64+0x2d/0x40
      [ 59.615940] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [ 59.617007] RIP: 0033:0x7f2bb8a0b239
      
      This is caused by roundup_pow_of_two() if the input entries larger
      enough, e.g. 2^32-1. For sq_entries, it will check first and we allow
      at most IORING_MAX_ENTRIES, so it is okay. But for cq_entries, we do
      round up first, that may overflow and truncate it to 0, which is not
      the expected behavior. So check the cq size first and then do round up.
      
      Fixes: 88ec3211 ("io_uring: round-up cq size before comparing with rounded sq size")
      Reported-by: NAbaci Fuzz <abaci@linux.alibaba.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NStefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      eb2667b3
    • F
      btrfs: fix lockdep splat when enabling and disabling qgroups · a855fbe6
      Filipe Manana 提交于
      When running test case btrfs/017 from fstests, lockdep reported the
      following splat:
      
        [ 1297.067385] ======================================================
        [ 1297.067708] WARNING: possible circular locking dependency detected
        [ 1297.068022] 5.10.0-rc4-btrfs-next-73 #1 Not tainted
        [ 1297.068322] ------------------------------------------------------
        [ 1297.068629] btrfs/189080 is trying to acquire lock:
        [ 1297.068929] ffff9f2725731690 (sb_internal#2){.+.+}-{0:0}, at: btrfs_quota_enable+0xaf/0xa70 [btrfs]
        [ 1297.069274]
      		 but task is already holding lock:
        [ 1297.069868] ffff9f2702b61a08 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_enable+0x3b/0xa70 [btrfs]
        [ 1297.070219]
      		 which lock already depends on the new lock.
      
        [ 1297.071131]
      		 the existing dependency chain (in reverse order) is:
        [ 1297.071721]
      		 -> #1 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}:
        [ 1297.072375]        lock_acquire+0xd8/0x490
        [ 1297.072710]        __mutex_lock+0xa3/0xb30
        [ 1297.073061]        btrfs_qgroup_inherit+0x59/0x6a0 [btrfs]
        [ 1297.073421]        create_subvol+0x194/0x990 [btrfs]
        [ 1297.073780]        btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
        [ 1297.074133]        __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
        [ 1297.074498]        btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
        [ 1297.074872]        btrfs_ioctl+0x1a90/0x36f0 [btrfs]
        [ 1297.075245]        __x64_sys_ioctl+0x83/0xb0
        [ 1297.075617]        do_syscall_64+0x33/0x80
        [ 1297.075993]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [ 1297.076380]
      		 -> #0 (sb_internal#2){.+.+}-{0:0}:
        [ 1297.077166]        check_prev_add+0x91/0xc60
        [ 1297.077572]        __lock_acquire+0x1740/0x3110
        [ 1297.077984]        lock_acquire+0xd8/0x490
        [ 1297.078411]        start_transaction+0x3c5/0x760 [btrfs]
        [ 1297.078853]        btrfs_quota_enable+0xaf/0xa70 [btrfs]
        [ 1297.079323]        btrfs_ioctl+0x2c60/0x36f0 [btrfs]
        [ 1297.079789]        __x64_sys_ioctl+0x83/0xb0
        [ 1297.080232]        do_syscall_64+0x33/0x80
        [ 1297.080680]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [ 1297.081139]
      		 other info that might help us debug this:
      
        [ 1297.082536]  Possible unsafe locking scenario:
      
        [ 1297.083510]        CPU0                    CPU1
        [ 1297.084005]        ----                    ----
        [ 1297.084500]   lock(&fs_info->qgroup_ioctl_lock);
        [ 1297.084994]                                lock(sb_internal#2);
        [ 1297.085485]                                lock(&fs_info->qgroup_ioctl_lock);
        [ 1297.085974]   lock(sb_internal#2);
        [ 1297.086454]
      		  *** DEADLOCK ***
        [ 1297.087880] 3 locks held by btrfs/189080:
        [ 1297.088324]  #0: ffff9f2725731470 (sb_writers#14){.+.+}-{0:0}, at: btrfs_ioctl+0xa73/0x36f0 [btrfs]
        [ 1297.088799]  #1: ffff9f2702b60cc0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl+0x1f4d/0x36f0 [btrfs]
        [ 1297.089284]  #2: ffff9f2702b61a08 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_enable+0x3b/0xa70 [btrfs]
        [ 1297.089771]
      		 stack backtrace:
        [ 1297.090662] CPU: 5 PID: 189080 Comm: btrfs Not tainted 5.10.0-rc4-btrfs-next-73 #1
        [ 1297.091132] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        [ 1297.092123] Call Trace:
        [ 1297.092629]  dump_stack+0x8d/0xb5
        [ 1297.093115]  check_noncircular+0xff/0x110
        [ 1297.093596]  check_prev_add+0x91/0xc60
        [ 1297.094076]  ? kvm_clock_read+0x14/0x30
        [ 1297.094553]  ? kvm_sched_clock_read+0x5/0x10
        [ 1297.095029]  __lock_acquire+0x1740/0x3110
        [ 1297.095510]  lock_acquire+0xd8/0x490
        [ 1297.095993]  ? btrfs_quota_enable+0xaf/0xa70 [btrfs]
        [ 1297.096476]  start_transaction+0x3c5/0x760 [btrfs]
        [ 1297.096962]  ? btrfs_quota_enable+0xaf/0xa70 [btrfs]
        [ 1297.097451]  btrfs_quota_enable+0xaf/0xa70 [btrfs]
        [ 1297.097941]  ? btrfs_ioctl+0x1f4d/0x36f0 [btrfs]
        [ 1297.098429]  btrfs_ioctl+0x2c60/0x36f0 [btrfs]
        [ 1297.098904]  ? do_user_addr_fault+0x20c/0x430
        [ 1297.099382]  ? kvm_clock_read+0x14/0x30
        [ 1297.099854]  ? kvm_sched_clock_read+0x5/0x10
        [ 1297.100328]  ? sched_clock+0x5/0x10
        [ 1297.100801]  ? sched_clock_cpu+0x12/0x180
        [ 1297.101272]  ? __x64_sys_ioctl+0x83/0xb0
        [ 1297.101739]  __x64_sys_ioctl+0x83/0xb0
        [ 1297.102207]  do_syscall_64+0x33/0x80
        [ 1297.102673]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [ 1297.103148] RIP: 0033:0x7f773ff65d87
      
      This is because during the quota enable ioctl we lock first the mutex
      qgroup_ioctl_lock and then start a transaction, and starting a transaction
      acquires a fs freeze semaphore (at the VFS level). However, every other
      code path, except for the quota disable ioctl path, we do the opposite:
      we start a transaction and then lock the mutex.
      
      So fix this by making the quota enable and disable paths to start the
      transaction without having the mutex locked, and then, after starting the
      transaction, lock the mutex and check if some other task already enabled
      or disabled the quotas, bailing with success if that was the case.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a855fbe6
    • F
      btrfs: do nofs allocations when adding and removing qgroup relations · 7aa6d359
      Filipe Manana 提交于
      When adding or removing a qgroup relation we are doing a GFP_KERNEL
      allocation which is not safe because we are holding a transaction
      handle open and that can make us deadlock if the allocator needs to
      recurse into the filesystem. So just surround those calls with a
      nofs context.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7aa6d359
    • F
      btrfs: fix lockdep splat when reading qgroup config on mount · 3d05cad3
      Filipe Manana 提交于
      Lockdep reported the following splat when running test btrfs/190 from
      fstests:
      
        [ 9482.126098] ======================================================
        [ 9482.126184] WARNING: possible circular locking dependency detected
        [ 9482.126281] 5.10.0-rc4-btrfs-next-73 #1 Not tainted
        [ 9482.126365] ------------------------------------------------------
        [ 9482.126456] mount/24187 is trying to acquire lock:
        [ 9482.126534] ffffa0c869a7dac0 (&fs_info->qgroup_rescan_lock){+.+.}-{3:3}, at: qgroup_rescan_init+0x43/0xf0 [btrfs]
        [ 9482.126647]
      		 but task is already holding lock:
        [ 9482.126777] ffffa0c892ebd3a0 (btrfs-quota-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x120 [btrfs]
        [ 9482.126886]
      		 which lock already depends on the new lock.
      
        [ 9482.127078]
      		 the existing dependency chain (in reverse order) is:
        [ 9482.127213]
      		 -> #1 (btrfs-quota-00){++++}-{3:3}:
        [ 9482.127366]        lock_acquire+0xd8/0x490
        [ 9482.127436]        down_read_nested+0x45/0x220
        [ 9482.127528]        __btrfs_tree_read_lock+0x27/0x120 [btrfs]
        [ 9482.127613]        btrfs_read_lock_root_node+0x41/0x130 [btrfs]
        [ 9482.127702]        btrfs_search_slot+0x514/0xc30 [btrfs]
        [ 9482.127788]        update_qgroup_status_item+0x72/0x140 [btrfs]
        [ 9482.127877]        btrfs_qgroup_rescan_worker+0xde/0x680 [btrfs]
        [ 9482.127964]        btrfs_work_helper+0xf1/0x600 [btrfs]
        [ 9482.128039]        process_one_work+0x24e/0x5e0
        [ 9482.128110]        worker_thread+0x50/0x3b0
        [ 9482.128181]        kthread+0x153/0x170
        [ 9482.128256]        ret_from_fork+0x22/0x30
        [ 9482.128327]
      		 -> #0 (&fs_info->qgroup_rescan_lock){+.+.}-{3:3}:
        [ 9482.128464]        check_prev_add+0x91/0xc60
        [ 9482.128551]        __lock_acquire+0x1740/0x3110
        [ 9482.128623]        lock_acquire+0xd8/0x490
        [ 9482.130029]        __mutex_lock+0xa3/0xb30
        [ 9482.130590]        qgroup_rescan_init+0x43/0xf0 [btrfs]
        [ 9482.131577]        btrfs_read_qgroup_config+0x43a/0x550 [btrfs]
        [ 9482.132175]        open_ctree+0x1228/0x18a0 [btrfs]
        [ 9482.132756]        btrfs_mount_root.cold+0x13/0xed [btrfs]
        [ 9482.133325]        legacy_get_tree+0x30/0x60
        [ 9482.133866]        vfs_get_tree+0x28/0xe0
        [ 9482.134392]        fc_mount+0xe/0x40
        [ 9482.134908]        vfs_kern_mount.part.0+0x71/0x90
        [ 9482.135428]        btrfs_mount+0x13b/0x3e0 [btrfs]
        [ 9482.135942]        legacy_get_tree+0x30/0x60
        [ 9482.136444]        vfs_get_tree+0x28/0xe0
        [ 9482.136949]        path_mount+0x2d7/0xa70
        [ 9482.137438]        do_mount+0x75/0x90
        [ 9482.137923]        __x64_sys_mount+0x8e/0xd0
        [ 9482.138400]        do_syscall_64+0x33/0x80
        [ 9482.138873]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [ 9482.139346]
      		 other info that might help us debug this:
      
        [ 9482.140735]  Possible unsafe locking scenario:
      
        [ 9482.141594]        CPU0                    CPU1
        [ 9482.142011]        ----                    ----
        [ 9482.142411]   lock(btrfs-quota-00);
        [ 9482.142806]                                lock(&fs_info->qgroup_rescan_lock);
        [ 9482.143216]                                lock(btrfs-quota-00);
        [ 9482.143629]   lock(&fs_info->qgroup_rescan_lock);
        [ 9482.144056]
      		  *** DEADLOCK ***
      
        [ 9482.145242] 2 locks held by mount/24187:
        [ 9482.145637]  #0: ffffa0c8411c40e8 (&type->s_umount_key#44/1){+.+.}-{3:3}, at: alloc_super+0xb9/0x400
        [ 9482.146061]  #1: ffffa0c892ebd3a0 (btrfs-quota-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x120 [btrfs]
        [ 9482.146509]
      		 stack backtrace:
        [ 9482.147350] CPU: 1 PID: 24187 Comm: mount Not tainted 5.10.0-rc4-btrfs-next-73 #1
        [ 9482.147788] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        [ 9482.148709] Call Trace:
        [ 9482.149169]  dump_stack+0x8d/0xb5
        [ 9482.149628]  check_noncircular+0xff/0x110
        [ 9482.150090]  check_prev_add+0x91/0xc60
        [ 9482.150561]  ? kvm_clock_read+0x14/0x30
        [ 9482.151017]  ? kvm_sched_clock_read+0x5/0x10
        [ 9482.151470]  __lock_acquire+0x1740/0x3110
        [ 9482.151941]  ? __btrfs_tree_read_lock+0x27/0x120 [btrfs]
        [ 9482.152402]  lock_acquire+0xd8/0x490
        [ 9482.152887]  ? qgroup_rescan_init+0x43/0xf0 [btrfs]
        [ 9482.153354]  __mutex_lock+0xa3/0xb30
        [ 9482.153826]  ? qgroup_rescan_init+0x43/0xf0 [btrfs]
        [ 9482.154301]  ? qgroup_rescan_init+0x43/0xf0 [btrfs]
        [ 9482.154768]  ? qgroup_rescan_init+0x43/0xf0 [btrfs]
        [ 9482.155226]  qgroup_rescan_init+0x43/0xf0 [btrfs]
        [ 9482.155690]  btrfs_read_qgroup_config+0x43a/0x550 [btrfs]
        [ 9482.156160]  open_ctree+0x1228/0x18a0 [btrfs]
        [ 9482.156643]  btrfs_mount_root.cold+0x13/0xed [btrfs]
        [ 9482.157108]  ? rcu_read_lock_sched_held+0x5d/0x90
        [ 9482.157567]  ? kfree+0x31f/0x3e0
        [ 9482.158030]  legacy_get_tree+0x30/0x60
        [ 9482.158489]  vfs_get_tree+0x28/0xe0
        [ 9482.158947]  fc_mount+0xe/0x40
        [ 9482.159403]  vfs_kern_mount.part.0+0x71/0x90
        [ 9482.159875]  btrfs_mount+0x13b/0x3e0 [btrfs]
        [ 9482.160335]  ? rcu_read_lock_sched_held+0x5d/0x90
        [ 9482.160805]  ? kfree+0x31f/0x3e0
        [ 9482.161260]  ? legacy_get_tree+0x30/0x60
        [ 9482.161714]  legacy_get_tree+0x30/0x60
        [ 9482.162166]  vfs_get_tree+0x28/0xe0
        [ 9482.162616]  path_mount+0x2d7/0xa70
        [ 9482.163070]  do_mount+0x75/0x90
        [ 9482.163525]  __x64_sys_mount+0x8e/0xd0
        [ 9482.163986]  do_syscall_64+0x33/0x80
        [ 9482.164437]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [ 9482.164902] RIP: 0033:0x7f51e907caaa
      
      This happens because at btrfs_read_qgroup_config() we can call
      qgroup_rescan_init() while holding a read lock on a quota btree leaf,
      acquired by the previous call to btrfs_search_slot_for_read(), and
      qgroup_rescan_init() acquires the mutex qgroup_rescan_lock.
      
      A qgroup rescan worker does the opposite: it acquires the mutex
      qgroup_rescan_lock, at btrfs_qgroup_rescan_worker(), and then tries to
      update the qgroup status item in the quota btree through the call to
      update_qgroup_status_item(). This inversion of locking order
      between the qgroup_rescan_lock mutex and quota btree locks causes the
      splat.
      
      Fix this simply by releasing and freeing the path before calling
      qgroup_rescan_init() at btrfs_read_qgroup_config().
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3d05cad3
    • D
      btrfs: tree-checker: add missing returns after data_ref alignment checks · 6d06b0ad
      David Sterba 提交于
      There are sectorsize alignment checks that are reported but then
      check_extent_data_ref continues. This was not intended, wrong alignment
      is not a minor problem and we should return with error.
      
      CC: stable@vger.kernel.org # 5.4+
      Fixes: 0785a9aa ("btrfs: tree-checker: Add EXTENT_DATA_REF check")
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6d06b0ad
    • J
      btrfs: don't access possibly stale fs_info data for printing duplicate device · 0697d9a6
      Johannes Thumshirn 提交于
      Syzbot reported a possible use-after-free when printing a duplicate device
      warning device_list_add().
      
      At this point it can happen that a btrfs_device::fs_info is not correctly
      setup yet, so we're accessing stale data, when printing the warning
      message using the btrfs_printk() wrappers.
      
        ==================================================================
        BUG: KASAN: use-after-free in btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245
        Read of size 8 at addr ffff8880878e06a8 by task syz-executor225/7068
      
        CPU: 1 PID: 7068 Comm: syz-executor225 Not tainted 5.9.0-rc5-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        Call Trace:
         __dump_stack lib/dump_stack.c:77 [inline]
         dump_stack+0x1d6/0x29e lib/dump_stack.c:118
         print_address_description+0x66/0x620 mm/kasan/report.c:383
         __kasan_report mm/kasan/report.c:513 [inline]
         kasan_report+0x132/0x1d0 mm/kasan/report.c:530
         btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245
         device_list_add+0x1a88/0x1d60 fs/btrfs/volumes.c:943
         btrfs_scan_one_device+0x196/0x490 fs/btrfs/volumes.c:1359
         btrfs_mount_root+0x48f/0xb60 fs/btrfs/super.c:1634
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x88/0x270 fs/super.c:1547
         fc_mount fs/namespace.c:978 [inline]
         vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
         btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x88/0x270 fs/super.c:1547
         do_new_mount fs/namespace.c:2875 [inline]
         path_mount+0x179d/0x29e0 fs/namespace.c:3192
         do_mount fs/namespace.c:3205 [inline]
         __do_sys_mount fs/namespace.c:3413 [inline]
         __se_sys_mount+0x126/0x180 fs/namespace.c:3390
         do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x44840a
        RSP: 002b:00007ffedfffd608 EFLAGS: 00000293 ORIG_RAX: 00000000000000a5
        RAX: ffffffffffffffda RBX: 00007ffedfffd670 RCX: 000000000044840a
        RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffedfffd630
        RBP: 00007ffedfffd630 R08: 00007ffedfffd670 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000001a
        R13: 0000000000000004 R14: 0000000000000003 R15: 0000000000000003
      
        Allocated by task 6945:
         kasan_save_stack mm/kasan/common.c:48 [inline]
         kasan_set_track mm/kasan/common.c:56 [inline]
         __kasan_kmalloc+0x100/0x130 mm/kasan/common.c:461
         kmalloc_node include/linux/slab.h:577 [inline]
         kvmalloc_node+0x81/0x110 mm/util.c:574
         kvmalloc include/linux/mm.h:757 [inline]
         kvzalloc include/linux/mm.h:765 [inline]
         btrfs_mount_root+0xd0/0xb60 fs/btrfs/super.c:1613
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x88/0x270 fs/super.c:1547
         fc_mount fs/namespace.c:978 [inline]
         vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
         btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x88/0x270 fs/super.c:1547
         do_new_mount fs/namespace.c:2875 [inline]
         path_mount+0x179d/0x29e0 fs/namespace.c:3192
         do_mount fs/namespace.c:3205 [inline]
         __do_sys_mount fs/namespace.c:3413 [inline]
         __se_sys_mount+0x126/0x180 fs/namespace.c:3390
         do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        Freed by task 6945:
         kasan_save_stack mm/kasan/common.c:48 [inline]
         kasan_set_track+0x3d/0x70 mm/kasan/common.c:56
         kasan_set_free_info+0x17/0x30 mm/kasan/generic.c:355
         __kasan_slab_free+0xdd/0x110 mm/kasan/common.c:422
         __cache_free mm/slab.c:3418 [inline]
         kfree+0x113/0x200 mm/slab.c:3756
         deactivate_locked_super+0xa7/0xf0 fs/super.c:335
         btrfs_mount_root+0x72b/0xb60 fs/btrfs/super.c:1678
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x88/0x270 fs/super.c:1547
         fc_mount fs/namespace.c:978 [inline]
         vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
         btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x88/0x270 fs/super.c:1547
         do_new_mount fs/namespace.c:2875 [inline]
         path_mount+0x179d/0x29e0 fs/namespace.c:3192
         do_mount fs/namespace.c:3205 [inline]
         __do_sys_mount fs/namespace.c:3413 [inline]
         __se_sys_mount+0x126/0x180 fs/namespace.c:3390
         do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        The buggy address belongs to the object at ffff8880878e0000
         which belongs to the cache kmalloc-16k of size 16384
        The buggy address is located 1704 bytes inside of
         16384-byte region [ffff8880878e0000, ffff8880878e4000)
        The buggy address belongs to the page:
        page:0000000060704f30 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x878e0
        head:0000000060704f30 order:3 compound_mapcount:0 compound_pincount:0
        flags: 0xfffe0000010200(slab|head)
        raw: 00fffe0000010200 ffffea00028e9a08 ffffea00021e3608 ffff8880aa440b00
        raw: 0000000000000000 ffff8880878e0000 0000000100000001 0000000000000000
        page dumped because: kasan: bad access detected
      
        Memory state around the buggy address:
         ffff8880878e0580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
         ffff8880878e0600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        >ffff8880878e0680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      				    ^
         ffff8880878e0700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
         ffff8880878e0780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ==================================================================
      
      The syzkaller reproducer for this use-after-free crafts a filesystem image
      and loop mounts it twice in a loop. The mount will fail as the crafted
      image has an invalid chunk tree. When this happens btrfs_mount_root() will
      call deactivate_locked_super(), which then cleans up fs_info and
      fs_info::sb. If a second thread now adds the same block-device to the
      filesystem, it will get detected as a duplicate device and
      device_list_add() will reject the duplicate and print a warning. But as
      the fs_info pointer passed in is non-NULL this will result in a
      use-after-free.
      
      Instead of printing possibly uninitialized or already freed memory in
      btrfs_printk(), explicitly pass in a NULL fs_info so the printing of the
      device name will be skipped altogether.
      
      There was a slightly different approach discussed in
      https://lore.kernel.org/linux-btrfs/20200114060920.4527-1-anand.jain@oracle.com/t/#u
      
      Link: https://lore.kernel.org/linux-btrfs/000000000000c9e14b05afcc41ba@google.com
      Reported-by: syzbot+582e66e5edf36a22c7b0@syzkaller.appspotmail.com
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0697d9a6
  15. 23 11月, 2020 2 次提交
    • D
      afs: Fix speculative status fetch going out of order wrt to modifications · a9e5c87c
      David Howells 提交于
      When doing a lookup in a directory, the afs filesystem uses a bulk
      status fetch to speculatively retrieve the statuses of up to 48 other
      vnodes found in the same directory and it will then either update extant
      inodes or create new ones - effectively doing 'lookup ahead'.
      
      To avoid the possibility of deadlocking itself, however, the filesystem
      doesn't lock all of those inodes; rather just the directory inode is
      locked (by the VFS).
      
      When the operation completes, afs_inode_init_from_status() or
      afs_apply_status() is called, depending on whether the inode already
      exists, to commit the new status.
      
      A case exists, however, where the speculative status fetch operation may
      straddle a modification operation on one of those vnodes.  What can then
      happen is that the speculative bulk status RPC retrieves the old status,
      and whilst that is happening, the modification happens - which returns
      an updated status, then the modification status is committed, then we
      attempt to commit the speculative status.
      
      This results in something like the following being seen in dmesg:
      
      	kAFS: vnode modified {100058:861} 8->9 YFS.InlineBulkStatus
      
      showing that for vnode 861 on volume 100058, we saw YFS.InlineBulkStatus
      say that the vnode had data version 8 when we'd already recorded version
      9 due to a local modification.  This was causing the cache to be
      invalidated for that vnode when it shouldn't have been.  If it happens
      on a data file, this might lead to local changes being lost.
      
      Fix this by ignoring speculative status updates if the data version
      doesn't match the expected value.
      
      Note that it is possible to get a DV regression if a volume gets
      restored from a backup - but we should get a callback break in such a
      case that should trigger a recheck anyway.  It might be worth checking
      the volume creation time in the volsync info and, if a change is
      observed in that (as would happen on a restore), invalidate all caches
      associated with the volume.
      
      Fixes: 5cf9dd55 ("afs: Prospectively look up extra files when doing a single lookup")
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a9e5c87c
    • Y
      libfs: fix error cast of negative value in simple_attr_write() · 488dac0c
      Yicong Yang 提交于
      The attr->set() receive a value of u64, but simple_strtoll() is used for
      doing the conversion.  It will lead to the error cast if user inputs a
      negative value.
      
      Use kstrtoull() instead of simple_strtoll() to convert a string got from
      the user to an unsigned value.  The former will return '-EINVAL' if it
      gets a negetive value, but the latter can't handle the situation
      correctly.  Make 'val' unsigned long long as what kstrtoull() takes,
      this will eliminate the compile warning on no 64-bit architectures.
      
      Fixes: f7b88631 ("fs/libfs.c: fix simple_attr_write() on 32bit machines")
      Signed-off-by: NYicong Yang <yangyicong@hisilicon.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Link: https://lkml.kernel.org/r/1605341356-11872-1-git-send-email-yangyicong@hisilicon.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      488dac0c
  16. 20 11月, 2020 5 次提交
    • J
      ext4: fix bogus warning in ext4_update_dx_flag() · f902b216
      Jan Kara 提交于
      The idea of the warning in ext4_update_dx_flag() is that we should warn
      when we are clearing EXT4_INODE_INDEX on a filesystem with metadata
      checksums enabled since after clearing the flag, checksums for internal
      htree nodes will become invalid. So there's no need to warn (or actually
      do anything) when EXT4_INODE_INDEX is not set.
      
      Link: https://lore.kernel.org/r/20201118153032.17281-1-jack@suse.cz
      Fixes: 48a34311 ("ext4: fix checksum errors with indexed dirs")
      Reported-by: NEric Biggers <ebiggers@kernel.org>
      Reviewed-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      f902b216
    • M
      jbd2: fix kernel-doc markups · 2bf31d94
      Mauro Carvalho Chehab 提交于
      Kernel-doc markup should use this format:
              identifier - description
      
      They should not have any type before that, as otherwise
      the parser won't do the right thing.
      
      Also, some identifiers have different names between their
      prototypes and the kernel-doc markup.
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NMauro Carvalho Chehab <mchehab+huawei@kernel.org>
      Link: https://lore.kernel.org/r/72f5c6628f5f278d67625f60893ffbc2ca28d46e.1605521731.git.mchehab+huawei@kernel.orgSigned-off-by: NTheodore Ts'o <tytso@mit.edu>
      2bf31d94
    • D
      xfs: revert "xfs: fix rmap key and record comparison functions" · eb840907
      Darrick J. Wong 提交于
      This reverts commit 6ff646b2.
      
      Your maintainer committed a major braino in the rmap code by adding the
      attr fork, bmbt, and unwritten extent usage bits into rmap record key
      comparisons.  While XFS uses the usage bits *in the rmap records* for
      cross-referencing metadata in xfs_scrub and xfs_repair, it only needs
      the owner and offset information to distinguish between reverse mappings
      of the same physical extent into the data fork of a file at multiple
      offsets.  The other bits are not important for key comparisons for index
      lookups, and never have been.
      
      Eric Sandeen reports that this causes regressions in generic/299, so
      undo this patch before it does more damage.
      Reported-by: NEric Sandeen <sandeen@sandeen.net>
      Fixes: 6ff646b2 ("xfs: fix rmap key and record comparison functions")
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      eb840907
    • T
      ext4: drop fast_commit from /proc/mounts · 704c2317
      Theodore Ts'o 提交于
      The options in /proc/mounts must be valid mount options --- and
      fast_commit is not a mount option.  Otherwise, command sequences like
      this will fail:
      
          # mount /dev/vdc /vdc
          # mkdir -p /vdc/phoronix_test_suite /pts
          # mount --bind /vdc/phoronix_test_suite /pts
          # mount -o remount,nodioread_nolock /pts
          mount: /pts: mount point not mounted or bad option.
      
      And in the system logs, you'll find:
      
          EXT4-fs (vdc): Unrecognized mount option "fast_commit" or missing value
      
      Fixes: 995a3ed6 ("ext4: add fast_commit feature and handling for extended mount options")
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      704c2317
    • D
      xfs: don't allow NOWAIT DIO across extent boundaries · 883a790a
      Dave Chinner 提交于
      Jens has reported a situation where partial direct IOs can be issued
      and completed yet still return -EAGAIN. We don't want this to report
      a short IO as we want XFS to complete user DIO entirely or not at
      all.
      
      This partial IO situation can occur on a write IO that is split
      across an allocated extent and a hole, and the second mapping is
      returning EAGAIN because allocation would be required.
      
      The trivial reproducer:
      
      $ sudo xfs_io -fdt -c "pwrite 0 4k" -c "pwrite -V 1 -b 8k -N 0 8k" /mnt/scr/foo
      wrote 4096/4096 bytes at offset 0
      4 KiB, 1 ops; 0.0001 sec (27.509 MiB/sec and 7042.2535 ops/sec)
      pwrite: Resource temporarily unavailable
      $
      
      The pwritev2(0, 8kB, RWF_NOWAIT) call returns EAGAIN having done
      the first 4kB write:
      
       xfs_file_direct_write: dev 259:1 ino 0x83 size 0x1000 offset 0x0 count 0x2000
       iomap_apply:          dev 259:1 ino 0x83 pos 0 length 8192 flags WRITE|DIRECT|NOWAIT (0x31) ops xfs_direct_write_iomap_ops caller iomap_dio_rw actor iomap_dio_actor
       xfs_ilock_nowait:     dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_for_iomap
       xfs_iunlock:          dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_direct_write_iomap_begin
       xfs_iomap_found:      dev 259:1 ino 0x83 size 0x1000 offset 0x0 count 8192 fork data startoff 0x0 startblock 24 blockcount 0x1
       iomap_apply_dstmap:   dev 259:1 ino 0x83 bdev 259:1 addr 102400 offset 0 length 4096 type MAPPED flags DIRTY
      
      Here the first iomap loop has mapped the first 4kB of the file and
      issued the IO, and we enter the second iomap_apply loop:
      
       iomap_apply: dev 259:1 ino 0x83 pos 4096 length 4096 flags WRITE|DIRECT|NOWAIT (0x31) ops xfs_direct_write_iomap_ops caller iomap_dio_rw actor iomap_dio_actor
       xfs_ilock_nowait:     dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_for_iomap
       xfs_iunlock:          dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_direct_write_iomap_begin
      
      And we exit with -EAGAIN out because we hit the allocate case trying
      to make the second 4kB block.
      
      Then IO completes on the first 4kB and the original IO context
      completes and unlocks the inode, returning -EAGAIN to userspace:
      
       xfs_end_io_direct_write: dev 259:1 ino 0x83 isize 0x1000 disize 0x1000 offset 0x0 count 4096
       xfs_iunlock:          dev 259:1 ino 0x83 flags IOLOCK_SHARED caller xfs_file_dio_aio_write
      
      There are other vectors to the same problem when we re-enter the
      mapping code if we have to make multiple mappinfs under NOWAIT
      conditions. e.g. failing trylocks, COW extents being found,
      allocation being required, and so on.
      
      Avoid all these potential problems by only allowing IOMAP_NOWAIT IO
      to go ahead if the mapping we retrieve for the IO spans an entire
      allocated extent. This avoids the possibility of subsequent mappings
      to complete the IO from triggering NOWAIT semantics by any means as
      NOWAIT IO will now only enter the mapping code once per NOWAIT IO.
      Reported-and-tested-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      883a790a