1. 23 Jan, 2020 · 1 commit
  2. 17 Jan, 2020 · 3 commits
    • btrfs: check rw_devices, not num_devices for balance · b35cf1f0
      Committed by Josef Bacik
      The fstest btrfs/154 reports
      
        [ 8675.381709] BTRFS: Transaction aborted (error -28)
        [ 8675.383302] WARNING: CPU: 1 PID: 31900 at fs/btrfs/block-group.c:2038 btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs]
        [ 8675.390925] CPU: 1 PID: 31900 Comm: btrfs Not tainted 5.5.0-rc6-default+ #935
        [ 8675.392780] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
        [ 8675.395452] RIP: 0010:btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs]
        [ 8675.402672] RSP: 0018:ffffb2090888fb00 EFLAGS: 00010286
        [ 8675.404413] RAX: 0000000000000000 RBX: ffff92026dfa91c8 RCX: 0000000000000001
        [ 8675.406609] RDX: 0000000000000000 RSI: ffffffff8e100899 RDI: ffffffff8e100971
        [ 8675.408775] RBP: ffff920247c61660 R08: 0000000000000000 R09: 0000000000000000
        [ 8675.410978] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000ffffffe4
        [ 8675.412647] R13: ffff92026db74000 R14: ffff920247c616b8 R15: ffff92026dfbc000
        [ 8675.413994] FS:  00007fd5e57248c0(0000) GS:ffff92027d800000(0000) knlGS:0000000000000000
        [ 8675.416146] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [ 8675.417833] CR2: 0000564aa51682d8 CR3: 000000006dcbc004 CR4: 0000000000160ee0
        [ 8675.419801] Call Trace:
        [ 8675.420742]  btrfs_start_dirty_block_groups+0x355/0x480 [btrfs]
        [ 8675.422600]  btrfs_commit_transaction+0xc8/0xaf0 [btrfs]
        [ 8675.424335]  reset_balance_state+0x14a/0x190 [btrfs]
        [ 8675.425824]  btrfs_balance.cold+0xe7/0x154 [btrfs]
        [ 8675.427313]  ? kmem_cache_alloc_trace+0x235/0x2c0
        [ 8675.428663]  btrfs_ioctl_balance+0x298/0x350 [btrfs]
        [ 8675.430285]  btrfs_ioctl+0x466/0x2550 [btrfs]
        [ 8675.431788]  ? mem_cgroup_charge_statistics+0x51/0xf0
        [ 8675.433487]  ? mem_cgroup_commit_charge+0x56/0x400
        [ 8675.435122]  ? do_raw_spin_unlock+0x4b/0xc0
        [ 8675.436618]  ? _raw_spin_unlock+0x1f/0x30
        [ 8675.438093]  ? __handle_mm_fault+0x499/0x740
        [ 8675.439619]  ? do_vfs_ioctl+0x56e/0x770
        [ 8675.441034]  do_vfs_ioctl+0x56e/0x770
        [ 8675.442411]  ksys_ioctl+0x3a/0x70
        [ 8675.443718]  ? trace_hardirqs_off_thunk+0x1a/0x1c
        [ 8675.445333]  __x64_sys_ioctl+0x16/0x20
        [ 8675.446705]  do_syscall_64+0x50/0x210
        [ 8675.448059]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [ 8675.479187] BTRFS: error (device vdb) in btrfs_create_pending_block_groups:2038: errno=-28 No space left
      
      We now use btrfs_can_overcommit() to see if we can flip a block group
      read only.  Before this would fail because we weren't taking into
      account the usable un-allocated space for allocating chunks.  With my
      patches we were allowed to do the balance, which is technically correct.
      
      The test is trying to start balance on degraded mount.  So now we're
      trying to allocate a chunk and cannot because we want to allocate a
      RAID1 chunk, but there's only 1 device that's available for usage.  This
      results in an ENOSPC.
      
      But we shouldn't even get this far, because we don't have enough
      devices to restripe.  The problem is that we're using
      btrfs_num_devices(), which also counts missing devices.  That's not
      what we want; we need to use rw_devices.

      The chunk_mutex is not needed here: rw_devices changes only on device
      add, remove or replace, all of which are excluded by the EXCL_OP
      mechanism.
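
      The check described above can be sketched in userspace C. This is
      only an illustration, not the kernel code: the struct, the helper name
      balance_has_enough_devices() and the min_devs parameter are all
      hypothetical stand-ins.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the device counters discussed above. */
struct dev_counts {
	uint64_t num_devices;	/* all known devices, including missing ones */
	uint64_t rw_devices;	/* devices that can actually be written to */
};

/*
 * Return true when a restripe to a profile needing min_devs devices can
 * proceed. Only writable devices count: a missing device cannot hold a
 * new chunk.
 */
bool balance_has_enough_devices(const struct dev_counts *c, uint64_t min_devs)
{
	return c->rw_devices >= min_devs;
}
```

      On a degraded two-device filesystem, num_devices is still 2 but
      rw_devices is 1, so a request for a two-device profile is rejected
      before any chunk allocation is attempted.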
      
      Fixes: e4d8ec0f ("Btrfs: implement online profile changing")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add stacktrace, update changelog, drop chunk_mutex ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: always copy scrub arguments back to user space · 5afe6ce7
      Committed by Filipe Manana
      If scrub returns an error, we do not copy the scrub arguments
      structure back to user space. This prevents user space from knowing
      how much progress scrub made before the error. That includes
      -ECANCELED, which is returned when users ask for scrub to stop. One
      particular use case, found in btrfs-progs, is resuming scrub after it
      was canceled: it relies on reading the progress from the scrub
      arguments structure and then passing that progress to the call that
      resumes scrub.

      So fix this by always copying the scrub arguments structure to user
      space, and overwrite the value returned to user space with -EFAULT
      only if copying the structure failed. That lets user space know that
      either the copy did not happen, so the structure is stale, or it
      happened only partially, so the structure is probably invalid.
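
      A minimal userspace sketch of this return-value rule, under stated
      assumptions: struct scrub_args, copy_out() and scrub_ioctl_finish()
      are hypothetical names, and copy_out() only imitates the
      "returns bytes not copied" convention of copy_to_user().

```c
#include <errno.h>
#include <string.h>

/* Hypothetical progress structure; the real one has many more fields. */
struct scrub_args {
	unsigned long long bytes_scrubbed;
};

/*
 * Stand-in for copy_to_user(): returns the number of bytes NOT copied,
 * so 0 means success. A NULL destination models a faulting user pointer.
 */
unsigned long copy_out(struct scrub_args *dst, const struct scrub_args *src)
{
	if (!dst)
		return sizeof(*src);
	memcpy(dst, src, sizeof(*src));
	return 0;
}

/*
 * Always copy the arguments back, whatever scrub returned, and replace
 * the return value with -EFAULT only when the copy itself failed.
 */
int scrub_ioctl_finish(int scrub_ret, const struct scrub_args *kargs,
		       struct scrub_args *uargs)
{
	if (copy_out(uargs, kargs))
		return -EFAULT;
	return scrub_ret;
}
```

      With this shape, a canceled scrub still returns -ECANCELED, yet the
      caller sees the progress made so far and can use it to resume.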

      Reported-by: Graham Cobb <g.btrfs@cobb.uk.net>
      Link: https://lore.kernel.org/linux-btrfs/d0a97688-78be-08de-ca7d-bcb4c7fb397e@cobb.uk.net/
      Fixes: 06fe39ab ("Btrfs: do not overwrite scrub error with fault error in scrub ioctl")
      CC: stable@vger.kernel.org # 5.1+
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Tested-by: Graham Cobb <g.btrfs@cobb.uk.net>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • io_uring: only allow submit from owning task · 44d28279
      Committed by Jens Axboe
      If the credentials or the mm doesn't match, don't allow the task to
      submit anything on behalf of this ring. The task that owns the ring can
      pass the file descriptor to another task, but we don't want to allow
      that task to submit an SQE that then assumes the ring mm and creds if
      it needs to go async.
      
      Cc: stable@vger.kernel.org
      Suggested-by: Stefan Metzmacher <metze@samba.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 16 Jan, 2020 · 3 commits
    • fuse: fix fuse_send_readpages() in the syncronous read case · 7df1e988
      Committed by Miklos Szeredi
      Buffered read in fuse normally goes via:
      
       -> generic_file_buffered_read()
         -> fuse_readpages()
           -> fuse_send_readpages()
             -> fuse_simple_request() [called since v5.4]
      
      In the case of a read request, fuse_simple_request() will return a
      non-negative bytecount on success or a negative error value.  A positive
      bytecount was taken to be an error and the PG_error flag set on the page.
      This resulted in generic_file_buffered_read() falling back to ->readpage(),
      which would repeat the read request and succeed.  Because of the repeated
      read succeeding the bug was not detected with regression tests or other use
      cases.
      
      The FTP module in GVFS however fails the second read due to the
      non-seekable nature of FTP downloads.
      
      Fix by checking and ignoring positive return value from
      fuse_simple_request().
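
      The return-value convention above can be sketched as a small
      helper. This is an illustration only; normalize_read_result() is a
      hypothetical name and does not appear in the fuse code.

```c
#include <sys/types.h>

/*
 * Map a read reply to an error code. The reply follows the convention
 * described above: a non-negative byte count on success, a negative
 * errno value on failure. Any byte count, including a positive one,
 * is therefore reported as "no error".
 */
int normalize_read_result(ssize_t ret)
{
	return ret < 0 ? (int)ret : 0;
}
```

      Only a negative result should lead to the page being flagged as
      failed; a positive byte count must not trigger the fallback read.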

      Reported-by: Ondrej Holy <oholy@redhat.com>
      Link: https://gitlab.gnome.org/GNOME/gvfs/issues/441
      Fixes: 134831e3 ("fuse: convert readpages to simple api")
      Cc: <stable@vger.kernel.org> # v5.4
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • io_uring: ensure workqueue offload grabs ring mutex for poll list · 11ba820b
      Committed by Jens Axboe
      A previous commit moved the locking for the async sqthread, but didn't
      take into account that the io-wq workers still need it. We can't use
      req->in_async for this anymore, as both the sqthread and io-wq workers
      set it. Gate the need for locking on io_wq_current_is_worker() instead.
      
      Fixes: 8a4955ff ("io_uring: sqthread should grab ctx->uring_lock for submissions")
      Reported-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: clear req->result always before issuing a read/write request · 797f3f53
      Committed by Bijan Mottahedeh
      req->result is cleared when io_issue_sqe() calls io_read/write_pre()
      routines.  Those routines however are not called when the sqe
      argument is NULL, which is the case when io_issue_sqe() is called from
      io_wq_submit_work().  io_issue_sqe() may then examine a stale result if
      a polled request had previously failed with -EAGAIN:
      
              if (ctx->flags & IORING_SETUP_IOPOLL) {
                      if (req->result == -EAGAIN)
                              return -EAGAIN;
      
                      io_iopoll_req_issued(req);
              }
      
      and in turn cause a subsequently completed request to be re-issued in
      io_wq_submit_work().

      Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 15 Jan, 2020 · 6 commits
  5. 14 Jan, 2020 · 2 commits
    • io_uring: don't setup async context for read/write fixed · 74566df3
      Committed by Jens Axboe
      We don't need it, and if we have it, then the retry handler will attempt
      to copy the non-existent iovec with the inline iovec, with a segment
      count that doesn't make sense.
      
      Fixes: f67676d1 ("io_uring: ensure async punted read/write requests copy iovec")
      Reported-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • btrfs: relocation: fix reloc_root lifespan and access · 6282675e
      Committed by Qu Wenruo
      [BUG]
      There are several different KASAN reports for balance + snapshot
      workloads.  Involved call paths include:
      
         should_ignore_root+0x54/0xb0 [btrfs]
         build_backref_tree+0x11af/0x2280 [btrfs]
         relocate_tree_blocks+0x391/0xb80 [btrfs]
         relocate_block_group+0x3e5/0xa00 [btrfs]
         btrfs_relocate_block_group+0x240/0x4d0 [btrfs]
         btrfs_relocate_chunk+0x53/0xf0 [btrfs]
         btrfs_balance+0xc91/0x1840 [btrfs]
         btrfs_ioctl_balance+0x416/0x4e0 [btrfs]
         btrfs_ioctl+0x8af/0x3e60 [btrfs]
         do_vfs_ioctl+0x831/0xb10
      
         create_reloc_root+0x9f/0x460 [btrfs]
         btrfs_reloc_post_snapshot+0xff/0x6c0 [btrfs]
         create_pending_snapshot+0xa9b/0x15f0 [btrfs]
         create_pending_snapshots+0x111/0x140 [btrfs]
         btrfs_commit_transaction+0x7a6/0x1360 [btrfs]
         btrfs_mksubvol+0x915/0x960 [btrfs]
         btrfs_ioctl_snap_create_transid+0x1d5/0x1e0 [btrfs]
         btrfs_ioctl_snap_create_v2+0x1d3/0x270 [btrfs]
         btrfs_ioctl+0x241b/0x3e60 [btrfs]
         do_vfs_ioctl+0x831/0xb10
      
         btrfs_reloc_pre_snapshot+0x85/0xc0 [btrfs]
         create_pending_snapshot+0x209/0x15f0 [btrfs]
         create_pending_snapshots+0x111/0x140 [btrfs]
         btrfs_commit_transaction+0x7a6/0x1360 [btrfs]
         btrfs_mksubvol+0x915/0x960 [btrfs]
         btrfs_ioctl_snap_create_transid+0x1d5/0x1e0 [btrfs]
         btrfs_ioctl_snap_create_v2+0x1d3/0x270 [btrfs]
         btrfs_ioctl+0x241b/0x3e60 [btrfs]
         do_vfs_ioctl+0x831/0xb10
      
      [CAUSE]
      All these call sites rely only on root->reloc_root, which can
      undergo btrfs_drop_snapshot(). Since reloc roots have no real
      refcount-based protection, we can reach an already dropped reloc
      root, triggering KASAN.
      
      [FIX]
      To avoid such access to unstable root->reloc_root, we should check
      BTRFS_ROOT_DEAD_RELOC_TREE bit first.
      
      This patch introduces wrappers that provide the correct way to check the
      bit with memory barriers protection.
      
      Most callers don't distinguish merged reloc tree and no reloc tree.  The
      only exception is should_ignore_root(), as merged reloc tree can be
      ignored, while no reloc tree shouldn't.
      
      [CRITICAL SECTION ANALYSIS]
      Although test_bit()/set_bit()/clear_bit() doesn't imply a barrier, the
      DEAD_RELOC_TREE bit has extra help from transaction as a higher level
      barrier, the lifespan of root::reloc_root and DEAD_RELOC_TREE bit are:
      
      	NULL: reloc_root is NULL	PTR: reloc_root is not NULL
      	0: DEAD_RELOC_ROOT bit not set	DEAD: DEAD_RELOC_ROOT bit set
      
      	(NULL, 0)    Initial state		 __
      	  |					 /\ Section A
              btrfs_init_reloc_root()			 \/
      	  |				 	 __
      	(PTR, 0)     reloc_root initialized      /\
                |					 |
      	btrfs_update_reloc_root()		 |  Section B
                |					 |
      	(PTR, DEAD)  reloc_root has been merged  \/
                |					 __
      	=== btrfs_commit_transaction() ====================
      	  |					 /\
      	clean_dirty_subvols()			 |
      	  |					 |  Section C
      	(NULL, DEAD) reloc_root cleanup starts   \/
                |					 __
      	btrfs_drop_snapshot()			 /\
      	  |					 |  Section D
      	(NULL, 0)    Back to initial state	 \/
      
      Every have_reloc_root() or test_bit(DEAD_RELOC_ROOT) caller holds a
      transaction handle, so no such caller can cross a transaction boundary.

      In Section A, every caller finds no DEAD bit and grabs reloc_root.

      At the A-B boundary, a caller may see no DEAD bit, but reloc_root is
      still completely valid, so accessing it is safe.

      No test_bit() caller can cross the boundary between Section B and
      Section C.

      In Section C, every caller finds the DEAD bit set, so none will access
      reloc_root.

      At the C-D boundary, a caller either sees the DEAD bit set and avoids
      accessing reloc_root regardless of whether it would be safe, or sees
      the DEAD bit cleared and accesses reloc_root, which is already NULL,
      so nothing goes wrong.

      The memory write barriers sit between the reloc_root updates and the
      bit set/clear; the pairing read side is before test_bit.

      Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Fixes: d2311e69 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ barriers ]
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 09 Jan, 2020 · 3 commits
    • fs: move guard_bio_eod() after bio_set_op_attrs · 83c9c547
      Committed by Ming Lei
      Commit 85a8ce62 ("block: add bio_truncate to fix guard_bio_eod")
      adds bio_truncate() for handling bio EOD. However, bio_truncate()
      doesn't use the passed 'op' parameter from guard_bio_eod's callers.
      
      So bio_truncate() may retrieve the wrong 'op', and zeroing pages may
      not be done for a READ bio.

      Fix this issue by moving guard_bio_eod() after bio_set_op_attrs()
      in submit_bh_wbc() so that bio_truncate() can always retrieve the
      correct op info.

      Meanwhile, remove the 'op' parameter from guard_bio_eod() because it
      isn't used any more.
      
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: linux-fsdevel@vger.kernel.org
      Fixes: 85a8ce62 ("block: add bio_truncate to fix guard_bio_eod")
      Signed-off-by: Ming Lei <ming.lei@redhat.com>

      Fold in kerneldoc and bio_op() change.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • pstore/ram: Regularize prz label allocation lifetime · e163fdb3
      Committed by Kees Cook
      In my attempt to fix a memory leak, I introduced a double-free in the
      pstore error path. Instead of trying to manage the allocation lifetime
      between persistent_ram_new() and its callers, adjust the logic so
      persistent_ram_new() always takes a kstrdup() copy, and leaves the
      caller's allocation lifetime up to the caller. Therefore callers are
      _always_ responsible for freeing their label. Before, the label only
      needed freeing when the prz itself failed to allocate, and not in any
      of the other prz failure cases, which callers had no visibility into.
      That is the root design problem that led to both the leak and now the
      double-free bug.
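
      The ownership rule can be sketched in userspace C. This is an
      assumption-laden illustration, not the pstore code: struct zone,
      zone_new() and zone_free() are hypothetical, and malloc()/memcpy()
      stand in for kstrdup().

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical zone object that keeps its own private copy of the label. */
struct zone {
	char *label;
};

/*
 * The callee always duplicates the label, so it never takes ownership of
 * the caller's buffer. Returns NULL on allocation failure.
 */
struct zone *zone_new(const char *label)
{
	size_t len = strlen(label) + 1;
	struct zone *z = calloc(1, sizeof(*z));

	if (!z)
		return NULL;
	z->label = malloc(len);
	if (!z->label) {
		free(z);
		return NULL;
	}
	memcpy(z->label, label, len);
	return z;
}

/* The zone frees only what it allocated itself. */
void zone_free(struct zone *z)
{
	if (!z)
		return;
	free(z->label);
	free(z);
}
```

      Because the callee never keeps the caller's pointer, the caller
      frees its own label on every path, success or failure, and no
      double-free is possible.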

      Reported-by: Cengiz Can <cengiz@kernel.wtf>
      Link: https://lore.kernel.org/lkml/d4ec59002ede4aaf9928c7f7526da87c@kernel.wtf
      Fixes: 8df955a3 ("pstore/ram: Fix error-path memory leak in persistent_ram_new() callers")
      Cc: stable@vger.kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • btrfs: fix memory leak in qgroup accounting · 26ef8493
      Committed by Johannes Thumshirn
      When running xfstests on the current btrfs I get the following splat from
      kmemleak:
      
      unreferenced object 0xffff88821b2404e0 (size 32):
        comm "kworker/u4:7", pid 26663, jiffies 4295283698 (age 8.776s)
        hex dump (first 32 bytes):
          01 00 00 00 00 00 00 00 10 ff fd 26 82 88 ff ff  ...........&....
          10 ff fd 26 82 88 ff ff 20 ff fd 26 82 88 ff ff  ...&.... ..&....
        backtrace:
          [<00000000f94fd43f>] ulist_alloc+0x25/0x60 [btrfs]
          [<00000000fd023d99>] btrfs_find_all_roots_safe+0x41/0x100 [btrfs]
          [<000000008f17bd32>] btrfs_find_all_roots+0x52/0x70 [btrfs]
          [<00000000b7660afb>] btrfs_qgroup_rescan_worker+0x343/0x680 [btrfs]
          [<0000000058e66778>] btrfs_work_helper+0xac/0x1e0 [btrfs]
          [<00000000f0188930>] process_one_work+0x1cf/0x350
          [<00000000af5f2f8e>] worker_thread+0x28/0x3c0
          [<00000000b55a1add>] kthread+0x109/0x120
          [<00000000f88cbd17>] ret_from_fork+0x35/0x40
      
      This corresponds to:
      
        (gdb) l *(btrfs_find_all_roots_safe+0x41)
        0x8d7e1 is in btrfs_find_all_roots_safe (fs/btrfs/backref.c:1413).
        1408
        1409            tmp = ulist_alloc(GFP_NOFS);
        1410            if (!tmp)
        1411                    return -ENOMEM;
        1412            *roots = ulist_alloc(GFP_NOFS);
        1413            if (!*roots) {
        1414                    ulist_free(tmp);
        1415                    return -ENOMEM;
        1416            }
        1417
      
      Following the lifetime of the allocated 'roots' ulist, it gets freed
      again in btrfs_qgroup_account_extent().
      
      But this does not happen if the function is called with the
      'BTRFS_FS_QUOTA_ENABLED' flag cleared: btrfs_qgroup_account_extent()
      then takes an early exit and returns directly.

      Instead of returning directly, we should jump to 'out_free' in order
      to free all resources as expected.
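
      A minimal sketch of the early-exit pattern, assuming a function
      that owns two heap-allocated lists. Apart from the 'out_free' label,
      every name here (account_extent(), list_free(), freed_lists) is
      hypothetical and exists only so the behaviour can be checked.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Counts frees so the example can be checked; purely illustrative. */
int freed_lists;

void list_free(int *list)
{
	if (list) {
		free(list);
		freed_lists++;
	}
}

/*
 * Takes ownership of both lists. Even when accounting is disabled, leave
 * through the common cleanup label instead of returning directly.
 */
int account_extent(bool quota_enabled, int *old_roots, int *new_roots)
{
	int ret = 0;

	if (!quota_enabled)
		goto out_free;

	/* ... the real accounting work would happen here ... */

out_free:
	list_free(old_roots);
	list_free(new_roots);
	return ret;
}
```

      A bare "return 0" in the disabled branch would skip both
      list_free() calls, which is the shape of the leak kmemleak reported.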
      
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      [ add comment ]
      Signed-off-by: David Sterba <dsterba@suse.com>
  7. 08 Jan, 2020 · 4 commits
    • btrfs: do not delete mismatched root refs · 423a716c
      Committed by Josef Bacik
      btrfs_del_root_ref() will simply WARN_ON() if the ref doesn't match in
      any way, and then continue to delete the reference.  This shouldn't
      happen, we have these values because there's more to the reference than
      the original root and the sub root.  If any of these checks fail, return
      -ENOENT.
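
      The "check first, then delete" behaviour can be sketched as
      follows. The struct and its fields are a simplified, hypothetical
      model chosen for illustration; they are not the on-disk root ref
      layout.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified root reference: parent directory plus name. */
struct root_ref {
	uint64_t dirid;
	const char *name;
	bool present;
};

/*
 * Delete the reference only when every identifying field matches.
 * On any mismatch, leave it in place and report -ENOENT.
 */
int del_root_ref(struct root_ref *ref, uint64_t dirid, const char *name)
{
	if (!ref->present || ref->dirid != dirid ||
	    strcmp(ref->name, name) != 0)
		return -ENOENT;
	ref->present = false;
	return 0;
}
```

      A warning followed by an unconditional delete would remove a
      reference that belongs to a different directory or name; returning
      early keeps it intact.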
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix invalid removal of root ref · d49d3287
      Committed by Josef Bacik
      If we have the following sequence of events
      
        btrfs sub create A
        btrfs sub create A/B
        btrfs sub snap A C
        mkdir C/foo
        mv A/B C/foo
        rm -rf *
      
      We will end up with a transaction abort.
      
      The reason is that we create a root ref for B pointing to A.
      When we create the snapshot C we still have B in our tree, but because
      the root ref points to A and not C we will make it appear to be empty.

      The problem happens when we move B into C.  This removes the root ref
      for B pointing to A and adds a ref of B pointing to C.  When we rmdir C
      we'll see that we have a ref to our root and remove the root ref,
      despite it not actually matching our reference name.

      That btrfs_del_root_ref() allows this to work is a bug as well.
      However, we know that this inode does not actually point to a root ref,
      so we shouldn't be calling btrfs_del_root_ref() in the first place.
      Instead, simply look up our dir index for this item and do the rest of
      the removal.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rework arguments of btrfs_unlink_subvol · 045d3967
      Committed by Josef Bacik
      btrfs_unlink_subvol takes the name of the dentry and the root objectid
      based on what kind of inode this is, either a real subvolume link or an
      empty one that we inherited as a snapshot.  We need to fix how we unlink
      in the case of BTRFS_EMPTY_SUBVOL_DIR_OBJECTID in the future, so rework
      btrfs_unlink_subvol to just take the dentry and handle getting the right
      objectid given the type of inode this is.  There is no functional change
      here; this simply pushes the work into btrfs_unlink_subvol() proper.

      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • io_uring: remove punt of short reads to async context · eacc6dfa
      Committed by Jens Axboe
      We currently punt any short read on a regular file to async context,
      but this fails if the short read is due to running into EOF. This is
      especially problematic since we only do the single prep for commands
      now, as we don't reset kiocb->ki_pos. This can result in a 4k read on
      a 1k file returning zero, as we detect the short read and then retry
      from async context. At the time of retry, the position is now 1k, and
      we end up reading nothing, and hence return 0.
      
      Instead of trying to patch around the fact that short reads can be
      legitimate and won't succeed in case of retry, remove the logic to punt
      a short read to async context. Simply return it.
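
      The 4k-read-on-a-1k-file case can be reproduced with a small
      userspace model. Everything here is a hypothetical stand-in: struct
      mem_file models a file with a position (playing the role of
      kiocb->ki_pos), and read_with_retry() models the removed retry logic.

```c
#include <stddef.h>

/* Hypothetical in-memory file with a position, like kiocb->ki_pos. */
struct mem_file {
	size_t size;	/* bytes in the file */
	size_t pos;	/* current position, advanced by each read */
};

/* Read up to len bytes from the current position; returns bytes read. */
size_t mem_read(struct mem_file *f, size_t len)
{
	size_t avail = f->size - f->pos;
	size_t n = len < avail ? len : avail;

	f->pos += n;
	return n;
}

/* Old behaviour: a short read is retried without resetting the position. */
size_t read_with_retry(struct mem_file *f, size_t len)
{
	size_t n = mem_read(f, len);

	if (n < len)
		n = mem_read(f, len);	/* position is already at EOF */
	return n;
}
```

      In this model a plain mem_read() of 4096 bytes on a 1024-byte file
      reports 1024, while read_with_retry() reports 0 because the retry
      starts at the end of the file.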

      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  8. 07 Jan, 2020 · 1 commit
    • chardev: Avoid potential use-after-free in 'chrdev_open()' · 68faa679
      Committed by Will Deacon
      'chrdev_open()' calls 'cdev_get()' to obtain a reference to the
      'struct cdev *' stashed in the 'i_cdev' field of the target inode
      structure. If the pointer is NULL, then it is initialised lazily by
      looking up the kobject in the 'cdev_map' and so the whole procedure is
      protected by the 'cdev_lock' spinlock to serialise initialisation of
      the shared pointer.
      
      Unfortunately, it is possible for the initialising thread to fail *after*
      installing the new pointer, for example if the subsequent '->open()' call
      on the file fails. In this case, 'cdev_put()' is called, the reference
      count on the kobject is dropped and, if nobody else has taken a reference,
      the release function is called which finally clears 'inode->i_cdev' from
      'cdev_purge()' before potentially freeing the object. The problem here
      is that a racing thread can happily take the 'cdev_lock' and see the
      non-NULL pointer in the inode, which can result in a refcount increment
      from zero and a warning:
      
        |  ------------[ cut here ]------------
        |  refcount_t: addition on 0; use-after-free.
        |  WARNING: CPU: 2 PID: 6385 at lib/refcount.c:25 refcount_warn_saturate+0x6d/0xf0
        |  Modules linked in:
        |  CPU: 2 PID: 6385 Comm: repro Not tainted 5.5.0-rc2+ #22
        |  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
        |  RIP: 0010:refcount_warn_saturate+0x6d/0xf0
        |  Code: 05 55 9a 15 01 01 e8 9d aa c8 ff 0f 0b c3 80 3d 45 9a 15 01 00 75 ce 48 c7 c7 00 9c 62 b3 c6 08
        |  RSP: 0018:ffffb524c1b9bc70 EFLAGS: 00010282
        |  RAX: 0000000000000000 RBX: ffff9e9da1f71390 RCX: 0000000000000000
        |  RDX: ffff9e9dbbd27618 RSI: ffff9e9dbbd18798 RDI: ffff9e9dbbd18798
        |  RBP: 0000000000000000 R08: 000000000000095f R09: 0000000000000039
        |  R10: 0000000000000000 R11: ffffb524c1b9bb20 R12: ffff9e9da1e8c700
        |  R13: ffffffffb25ee8b0 R14: 0000000000000000 R15: ffff9e9da1e8c700
        |  FS:  00007f3b87d26700(0000) GS:ffff9e9dbbd00000(0000) knlGS:0000000000000000
        |  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        |  CR2: 00007fc16909c000 CR3: 000000012df9c000 CR4: 00000000000006e0
        |  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        |  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        |  Call Trace:
        |   kobject_get+0x5c/0x60
        |   cdev_get+0x2b/0x60
        |   chrdev_open+0x55/0x220
        |   ? cdev_put.part.3+0x20/0x20
        |   do_dentry_open+0x13a/0x390
        |   path_openat+0x2c8/0x1470
        |   do_filp_open+0x93/0x100
        |   ? selinux_file_ioctl+0x17f/0x220
        |   do_sys_open+0x186/0x220
        |   do_syscall_64+0x48/0x150
        |   entry_SYSCALL_64_after_hwframe+0x44/0xa9
        |  RIP: 0033:0x7f3b87efcd0e
        |  Code: 89 54 24 08 e8 a3 f4 ff ff 8b 74 24 0c 48 8b 3c 24 41 89 c0 44 8b 54 24 08 b8 01 01 00 00 89 f4
        |  RSP: 002b:00007f3b87d259f0 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
        |  RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3b87efcd0e
        |  RDX: 0000000000000000 RSI: 00007f3b87d25a80 RDI: 00000000ffffff9c
        |  RBP: 00007f3b87d25e90 R08: 0000000000000000 R09: 0000000000000000
        |  R10: 0000000000000000 R11: 0000000000000293 R12: 00007ffe188f504e
        |  R13: 00007ffe188f504f R14: 00007f3b87d26700 R15: 0000000000000000
        |  ---[ end trace 24f53ca58db8180a ]---
      
      Since 'cdev_get()' can already fail to obtain a reference, simply move
      it over to use 'kobject_get_unless_zero()' instead of 'kobject_get()',
      which will cause the racing thread to return -ENXIO if the initialising
      thread fails unexpectedly.
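
      The "take a reference unless the count is zero" idea can be
      sketched with C11 atomics. This is a userspace illustration under
      stated assumptions: struct obj and obj_get_unless_zero() are
      hypothetical and are not the kobject implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcounted object; stands in for a kobject. */
struct obj {
	atomic_int refs;
};

/*
 * Take a reference only if the count is still non-zero. A count of zero
 * means release has started, so the caller must treat the object as gone.
 */
bool obj_get_unless_zero(struct obj *o)
{
	int old = atomic_load(&o->refs);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&o->refs, &old, old + 1))
			return true;
	}
	return false;
}
```

      An unconditional increment would turn a count of 0 back into 1
      while the object is being torn down; the compare-and-swap loop
      refuses that transition, so the racing opener can fail cleanly.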
      
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Reported-by: syzbot+82defefbbd8527e1c2cb@syzkaller.appspotmail.com
      Signed-off-by: Will Deacon <will@kernel.org>
      Cc: stable <stable@vger.kernel.org>
      Link: https://lore.kernel.org/r/20191219120203.32691-1-will@kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  9. 05 Jan, 2020 · 6 commits
    • ocfs2: fix the crash due to call ocfs2_get_dlm_debug once less · b73eba2a
      Committed by Gang He
      Because ocfs2_get_dlm_debug() is called one time too few here, the
      ocfs2 file system triggers a system crash, usually after the file
      system is unmounted.

      The crash is caused by generic memory corruption, so the crash
      backtraces are not always the same. For example,
      
          ocfs2: Unmounting device (253,16) on (node 172167785)
          general protection fault: 0000 [#1] SMP PTI
          CPU: 3 PID: 14107 Comm: fence_legacy Kdump:
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
          RIP: 0010:__kmalloc+0xa5/0x2a0
          Code: 00 00 4d 8b 07 65 4d 8b
          RSP: 0018:ffffaa1fc094bbe8 EFLAGS: 00010286
          RAX: 0000000000000000 RBX: d310a8800d7a3faf RCX: 0000000000000000
          RDX: 0000000000000000 RSI: 0000000000000dc0 RDI: ffff96e68fc036c0
          RBP: d310a8800d7a3faf R08: ffff96e6ffdb10a0 R09: 00000000752e7079
          R10: 000000000001c513 R11: 0000000004091041 R12: 0000000000000dc0
          R13: 0000000000000039 R14: ffff96e68fc036c0 R15: ffff96e68fc036c0
          FS:  00007f699dfba540(0000) GS:ffff96e6ffd80000(0000) knlGS:00000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 000055f3a9d9b768 CR3: 000000002cd1c000 CR4: 00000000000006e0
          Call Trace:
           ext4_htree_store_dirent+0x35/0x100 [ext4]
           htree_dirblock_to_tree+0xea/0x290 [ext4]
           ext4_htree_fill_tree+0x1c1/0x2d0 [ext4]
           ext4_readdir+0x67c/0x9d0 [ext4]
           iterate_dir+0x8d/0x1a0
           __x64_sys_getdents+0xab/0x130
           do_syscall_64+0x60/0x1f0
           entry_SYSCALL_64_after_hwframe+0x49/0xbe
          RIP: 0033:0x7f699d33a9fb
      
      This regression problem was introduced by commit e581595e ("ocfs: no
      need to check return value of debugfs_create functions").
      
      Link: http://lkml.kernel.org/r/20191225061501.13587-1-ghe@suse.com
      Fixes: e581595e ("ocfs: no need to check return value of debugfs_create functions")
      Signed-off-by: Gang He <ghe@suse.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: <stable@vger.kernel.org>	[5.3+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: call journal flush to mark journal as empty after journal recovery when mount · 397eac17
      Committed by Kai Li
      If the journal is dirty at mount time, it will be replayed, but the
      jbd2 sb log tail cannot be updated to mark a new start because
      journal->j_flag has already been set with JBD2_ABORT first in
      journal_init_common.

      When a new transaction is committed, it will be recorded in block 1
      first (journal->j_tail is set to 1 in journal_reset).  If an emergency
      restart happens again before the journal super block is updated, the
      newly recorded transaction will not be replayed on the next mount.
      
      The following steps describe this procedure in detail.
      1. mount and touch some files
      2. these transactions are committed to journal area but not checkpointed
      3. emergency restart
      4. mount again and its journals are replayed
      5. journal super block's first s_start is 1, but its s_seq is not updated
      6. touch a new file and its trans is committed but not checkpointed
      7. emergency restart again
      8. mount and journal is dirty, but trans committed in 6 will not be
      replayed.
      
      This exception happens easily when this lun is used by only one node.
      If it is used by multiple nodes, another node will replay its journal,
      and its journal super block will be updated after recovery, which is
      what this patch does.

      ocfs2_recover_node->ocfs2_replay_journal.

      The following jbd2 journal can be generated by touching a new file after
      the journal is replayed. Seq 15 is the first valid commit, but the first
      seq is 13 in the journal super block.
      
      logdump:
        Block 0: Journal Superblock
        Seq: 0   Type: 4 (JBD2_SUPERBLOCK_V2)
        Blocksize: 4096   Total Blocks: 32768   First Block: 1
        First Commit ID: 13   Start Log Blknum: 1
        Error: 0
        Feature Compat: 0
        Feature Incompat: 2 block64
        Feature RO compat: 0
        Journal UUID: 4ED3822C54294467A4F8E87D2BA4BC36
        FS Share Cnt: 1   Dynamic Superblk Blknum: 0
        Per Txn Block Limit    Journal: 0    Data: 0
      
        Block 1: Journal Commit Block
        Seq: 14   Type: 2 (JBD2_COMMIT_BLOCK)
      
        Block 2: Journal Descriptor
        Seq: 15   Type: 1 (JBD2_DESCRIPTOR_BLOCK)
        No. Blocknum        Flags
         0. 587             none
        UUID: 00000000000000000000000000000000
         1. 8257792         JBD2_FLAG_SAME_UUID
         2. 619             JBD2_FLAG_SAME_UUID
         3. 24772864        JBD2_FLAG_SAME_UUID
         4. 8257802         JBD2_FLAG_SAME_UUID
         5. 513             JBD2_FLAG_SAME_UUID JBD2_FLAG_LAST_TAG
        ...
        Block 7: Inode
        Inode: 8257802   Mode: 0640   Generation: 57157641 (0x3682809)
        FS Generation: 2839773110 (0xa9437fb6)
        CRC32: 00000000   ECC: 0000
        Type: Regular   Attr: 0x0   Flags: Valid
        Dynamic Features: (0x1) InlineData
        User: 0 (root)   Group: 0 (root)   Size: 7
        Links: 1   Clusters: 0
        ctime: 0x5de5d870 0x11104c61 -- Tue Dec  3 11:37:20.286280801 2019
        atime: 0x5de5d870 0x113181a1 -- Tue Dec  3 11:37:20.288457121 2019
        mtime: 0x5de5d870 0x11104c61 -- Tue Dec  3 11:37:20.286280801 2019
        dtime: 0x0 -- Thu Jan  1 08:00:00 1970
        ...
        Block 9: Journal Commit Block
        Seq: 15   Type: 2 (JBD2_COMMIT_BLOCK)
      
      The following is the journal recovery log produced when the above jbd2
      journal is recovered on the next mount.
      
      syslog:
        ocfs2: File system on device (252,1) was not unmounted cleanly, recovering it.
        fs/jbd2/recovery.c:(do_one_pass, 449): Starting recovery pass 0
        fs/jbd2/recovery.c:(do_one_pass, 449): Starting recovery pass 1
        fs/jbd2/recovery.c:(do_one_pass, 449): Starting recovery pass 2
        fs/jbd2/recovery.c:(jbd2_journal_recover, 278): JBD2: recovery, exit status 0, recovered transactions 13 to 13
      
      Because the first commit seq (13) recorded in the journal superblock is
      not consistent with the value recorded in block 1 (seq 14), journal
      recovery terminates before seq 15 even though it is an intact commit.
      Inode 8257802 is a new file, and it will be lost.
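
The sequence mismatch described above can be illustrated with a small toy model. This is only a sketch under simplifying assumptions, not the jbd2 recovery code: `JournalBlock` and `replayable` are hypothetical names, and the scan is reduced to "stop at the first block whose sequence number is not the expected one". The block sequence numbers are taken from the logdump above.

```python
from dataclasses import dataclass


@dataclass
class JournalBlock:
    seq: int   # sequence number stored in the block header
    kind: str  # "descriptor" or "commit"


def replayable(first_commit_id, blocks):
    """Return the transaction IDs a simplified scan would replay.

    The scan starts at the sequence number recorded in the superblock and
    stops at the first block whose sequence number does not match.
    """
    expected = first_commit_id
    done = []
    for blk in blocks:
        if blk.seq != expected:
            break  # mismatch: treated as the end of the valid log
        if blk.kind == "commit":
            done.append(expected)
            expected += 1
    return done


# Journal blocks from the logdump: block 1 (commit 14), block 2
# (descriptor 15) and block 9 (commit 15).
log = [
    JournalBlock(14, "commit"),
    JournalBlock(15, "descriptor"),
    JournalBlock(15, "commit"),
]

# Superblock says "first commit ID 13": the scan stops immediately.
stale = replayable(13, log)

# Hypothetical consistent superblock pointing at transaction 15.
consistent = replayable(15, log[1:])

print(stale, consistent)
```

In this model the stale superblock yields no replayed transactions, while a superblock that agrees with the log would let transaction 15 (and the new file) be recovered.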
      
      Link: http://lkml.kernel.org/r/20191217020140.2197-1-li.kai4@h3c.com
      Signed-off-by: Kai Li <li.kai4@h3c.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Changwei Ge <gechangwei@live.cn>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • R
      fs/posix_acl.c: fix kernel-doc warnings · e39e773a
      Randy Dunlap committed on
      Fix kernel-doc warnings in fs/posix_acl.c.
      Also fix one typo (setgit -> setgid).
      
        fs/posix_acl.c:647: warning: Function parameter or member 'inode' not described in 'posix_acl_update_mode'
        fs/posix_acl.c:647: warning: Function parameter or member 'mode_p' not described in 'posix_acl_update_mode'
        fs/posix_acl.c:647: warning: Function parameter or member 'acl' not described in 'posix_acl_update_mode'
      
      Link: http://lkml.kernel.org/r/29b0dc46-1f28-a4e5-b1d0-ba2b65629779@infradead.org
      Fixes: 07393101 ("posix_acl: Clear SGID bit when setting file permissions")
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Acked-by: Andreas Gruenbacher <agruenba@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • E
      fs/namespace.c: make to_mnt_ns() static · 213921f9
      Eric Biggers committed on
      Make to_mnt_ns() static to address the following 'sparse' warning:
      
          fs/namespace.c:1731:22: warning: symbol 'to_mnt_ns' was not declared. Should it be static?
      
      Link: http://lkml.kernel.org/r/20191209234830.156260-1-ebiggers@kernel.org
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • E
      fs/nsfs.c: include headers for missing declarations · 7bebd69e
      Eric Biggers committed on
      Include linux/proc_fs.h and fs/internal.h to address the following
      'sparse' warnings:
      
          fs/nsfs.c:41:32: warning: symbol 'ns_dentry_operations' was not declared. Should it be static?
          fs/nsfs.c:145:5: warning: symbol 'open_related_ns' was not declared. Should it be static?
      
      Link: http://lkml.kernel.org/r/20191209234822.156179-1-ebiggers@kernel.org
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • E
      fs/direct-io.c: include fs/internal.h for missing prototype · b16155a0
      Eric Biggers committed on
      Include fs/internal.h to address the following 'sparse' warning:
      
          fs/direct-io.c:591:5: warning: symbol 'sb_init_dio_done_wq' was not declared. Should it be static?
      
      Link: http://lkml.kernel.org/r/20191209234544.128302-1-ebiggers@kernel.org
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 04 Jan, 2020 1 commit
  11. 03 Jan, 2020 3 commits
  12. 30 Dec, 2019 3 commits
    • F
      Btrfs: fix infinite loop during nocow writeback due to race · de7999af
      Filipe Manana committed on
      When starting writeback for a range that covers part of a preallocated
      extent, due to a race with writeback for another range that also covers
      another part of the same preallocated extent, we can end up in an infinite
      loop.
      
      Consider the following example where for inode 280 we have two dirty
      ranges:
      
        range A, from 294912 to 303103, 8192 bytes
        range B, from 348160 to 438271, 90112 bytes
      
      and we have the following file extent item layout for our inode:
      
        leaf 38895616 gen 24544 total ptrs 29 free space 13820 owner 5
            (...)
            item 27 key (280 108 200704) itemoff 14598 itemsize 53
                extent data disk bytenr 0 nr 0 type 1 (regular)
                extent data offset 0 nr 94208 ram 94208
            item 28 key (280 108 294912) itemoff 14545 itemsize 53
                extent data disk bytenr 10433052672 nr 81920 type 2 (prealloc)
                extent data offset 0 nr 81920 ram 81920
      
      Then the following happens:
      
      1) Writeback starts for range B (from 348160 to 438271), execution of
         run_delalloc_nocow() starts;
      
      2) The first iteration of run_delalloc_nocow()'s while loop leaves us at
         the extent item at slot 28, pointing to the prealloc extent item
         covering the range from 294912 to 376831. This extent covers part of
         our range;
      
      3) An ordered extent is created against that extent, covering the file
         range from 348160 to 376831 (28672 bytes);
      
      4) We adjust 'cur_offset' to 376832 and move on to the next iteration of
         the while loop;
      
      5) The call to btrfs_lookup_file_extent() leaves us at the same leaf,
         pointing to slot 29, 1 slot after the last item (the extent item
         we processed in the previous iteration);
      
      6) Because we are a slot beyond the last item, we call btrfs_next_leaf(),
         which releases the search path before doing another search for the
         last key of the leaf (280 108 294912);
      
      7) Right after btrfs_next_leaf() released the path, and before it did
         another search for the last key of the leaf, writeback for the range
         A (from 294912 to 303103) completes (it was previously started at
         some point);
      
      8) Upon completion of the ordered extent for range A, the prealloc extent
         we previously found got split into two extent items, one covering the
         range from 294912 to 303103 (8192 bytes), with a type of regular extent
         (and no longer prealloc) and another covering the range from 303104 to
         376831 (73728 bytes), with a type of prealloc and an offset of 8192
         bytes. So our leaf now has the following layout:
      
           leaf 38895616 gen 24544 total ptrs 31 free space 13664 owner 5
               (...)
               item 27 key (280 108 200704) itemoff 14598 itemsize 53
                   extent data disk bytenr 0 nr 0 type 1
                   extent data offset 0 nr 8192 ram 94208
               item 28 key (280 108 208896) itemoff 14545 itemsize 53
                   extent data disk bytenr 10433142784 nr 86016 type 1
                   extent data offset 0 nr 86016 ram 86016
               item 29 key (280 108 294912) itemoff 14492 itemsize 53
                   extent data disk bytenr 10433052672 nr 81920 type 1
                   extent data offset 0 nr 8192 ram 81920
               item 30 key (280 108 303104) itemoff 14439 itemsize 53
                   extent data disk bytenr 10433052672 nr 81920 type 2
                   extent data offset 8192 nr 73728 ram 81920
      
      9) After btrfs_next_leaf() returns, we have our path pointing to that same
         leaf and at slot 30, since it has a key we didn't have before and it's
         the first key greater than the key that was previously the last key of
         the leaf (key (280 108 294912));
      
      10) The extent item at slot 30 covers the range from 303104 to 376831
          which is in our target range, so we process it, despite having already
          created an ordered extent against this extent for the file range from
          348160 to 376831. This is because we skip to the next extent item only
          if its end is less than or equal to the start of our delalloc range,
          and not less than or equal to the current offset ('cur_offset');
      
      11) As a result we compute 'num_bytes' as:
      
          num_bytes = min(end + 1, extent_end) - cur_offset;
                    = min(438271 + 1, 376832) - 376832 = 0
      
      12) We then call create_io_em() for a 0 bytes range starting at offset
          376832;
      
      13) Then create_io_em() enters an infinite loop because its calls to
          btrfs_drop_extent_cache() do nothing due to the 0 length range
          passed to it. So no existing extent maps that cover the offset
          376832 get removed, and therefore calls to add_extent_mapping()
          return -EEXIST, resulting in an infinite loop. This loop from
          create_io_em() is the following:
      
          do {
              btrfs_drop_extent_cache(BTRFS_I(inode), em->start,
                                      em->start + em->len - 1, 0);
              write_lock(&em_tree->lock);
              ret = add_extent_mapping(em_tree, em, 1);
              write_unlock(&em_tree->lock);
              /*
               * The caller has taken lock_extent(), who could race with us
               * to add em?
               */
          } while (ret == -EEXIST);
      
      Also, each call to btrfs_drop_extent_cache() triggers a warning because
      the end offset passed to it (376832 - 1) is smaller than the start
      offset (376832) by 1, due to the 0 length:
      
        [258532.052621] ------------[ cut here ]------------
        [258532.052643] WARNING: CPU: 0 PID: 9987 at fs/btrfs/file.c:602 btrfs_drop_extent_cache+0x3f4/0x590 [btrfs]
        (...)
        [258532.052672] CPU: 0 PID: 9987 Comm: fsx Tainted: G        W         5.4.0-rc7-btrfs-next-64 #1
        [258532.052673] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
        [258532.052691] RIP: 0010:btrfs_drop_extent_cache+0x3f4/0x590 [btrfs]
        (...)
        [258532.052695] RSP: 0018:ffffb4be0153f860 EFLAGS: 00010287
        [258532.052700] RAX: ffff975b445ee360 RBX: ffff975b44eb3e08 RCX: 0000000000000000
        [258532.052700] RDX: 0000000000038fff RSI: 0000000000039000 RDI: ffff975b445ee308
        [258532.052700] RBP: 0000000000038fff R08: 0000000000000000 R09: 0000000000000001
        [258532.052701] R10: ffff975b513c5c10 R11: 00000000e3c0cfa9 R12: 0000000000039000
        [258532.052703] R13: ffff975b445ee360 R14: 00000000ffffffef R15: ffff975b445ee308
        [258532.052705] FS:  00007f86a821de80(0000) GS:ffff975b76a00000(0000) knlGS:0000000000000000
        [258532.052707] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [258532.052708] CR2: 00007fdacf0f3ab4 CR3: 00000001f9d26002 CR4: 00000000003606f0
        [258532.052712] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [258532.052717] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [258532.052717] Call Trace:
        [258532.052718]  ? preempt_schedule_common+0x32/0x70
        [258532.052722]  ? ___preempt_schedule+0x16/0x20
        [258532.052741]  create_io_em+0xff/0x180 [btrfs]
        [258532.052767]  run_delalloc_nocow+0x942/0xb10 [btrfs]
        [258532.052791]  btrfs_run_delalloc_range+0x30b/0x520 [btrfs]
        [258532.052812]  ? find_lock_delalloc_range+0x221/0x250 [btrfs]
        [258532.052834]  writepage_delalloc+0xe4/0x140 [btrfs]
        [258532.052855]  __extent_writepage+0x110/0x4e0 [btrfs]
        [258532.052876]  extent_write_cache_pages+0x21c/0x480 [btrfs]
        [258532.052906]  extent_writepages+0x52/0xb0 [btrfs]
        [258532.052911]  do_writepages+0x23/0x80
        [258532.052915]  __filemap_fdatawrite_range+0xd2/0x110
        [258532.052938]  btrfs_fdatawrite_range+0x1b/0x50 [btrfs]
        [258532.052954]  start_ordered_ops+0x57/0xa0 [btrfs]
        [258532.052973]  ? btrfs_sync_file+0x225/0x490 [btrfs]
        [258532.052988]  btrfs_sync_file+0x225/0x490 [btrfs]
        [258532.052997]  __x64_sys_msync+0x199/0x200
        [258532.053004]  do_syscall_64+0x5c/0x250
        [258532.053007]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [258532.053010] RIP: 0033:0x7f86a7dfd760
        (...)
        [258532.053014] RSP: 002b:00007ffd99af0368 EFLAGS: 00000246 ORIG_RAX: 000000000000001a
        [258532.053016] RAX: ffffffffffffffda RBX: 0000000000000ec9 RCX: 00007f86a7dfd760
        [258532.053017] RDX: 0000000000000004 RSI: 000000000000836c RDI: 00007f86a8221000
        [258532.053019] RBP: 0000000000021ec9 R08: 0000000000000003 R09: 00007f86a812037c
        [258532.053020] R10: 0000000000000001 R11: 0000000000000246 R12: 00000000000074a3
        [258532.053021] R13: 00007f86a8221000 R14: 000000000000836c R15: 0000000000000001
        [258532.053032] irq event stamp: 1653450494
        [258532.053035] hardirqs last  enabled at (1653450493): [<ffffffff9dec69f9>] _raw_spin_unlock_irq+0x29/0x50
        [258532.053037] hardirqs last disabled at (1653450494): [<ffffffff9d4048ea>] trace_hardirqs_off_thunk+0x1a/0x20
        [258532.053039] softirqs last  enabled at (1653449852): [<ffffffff9e200466>] __do_softirq+0x466/0x6bd
        [258532.053042] softirqs last disabled at (1653449845): [<ffffffff9d4c8a0c>] irq_exit+0xec/0x120
        [258532.053043] ---[ end trace 8476fce13d9ce20a ]---
      
      Which results in flooding dmesg/syslog since btrfs_drop_extent_cache()
      uses WARN_ON() and not WARN_ON_ONCE().
      
      So fix this issue by changing run_delalloc_nocow()'s loop to move to the
      next extent item when the current extent item ends at an offset less than
      or equal to the current offset instead of the start offset.
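
The effect of the changed comparison can be checked with the numbers from the example. The following is a minimal sketch, not the btrfs code: `nocow_step` and its `patched` flag are hypothetical names that model only the skip test and the 'num_bytes' computation quoted in step 11.

```python
def nocow_step(start, end, cur_offset, extent_start, extent_len, patched):
    """Return num_bytes for an extent item, or None if the item is skipped.

    patched=False models the old test (compare the extent end with the
    start of the delalloc range); patched=True models the new test
    (compare the extent end with cur_offset).
    """
    extent_end = extent_start + extent_len
    limit = cur_offset if patched else start
    if extent_end <= limit:
        return None  # move on to the next extent item
    return min(end + 1, extent_end) - cur_offset


# Range B and the state after step 4 of the example.
start, end = 348160, 438271
cur_offset = 376832

# Extent item at slot 30 after the split: file offset 303104, 73728 bytes.
buggy = nocow_step(start, end, cur_offset, 303104, 73728, patched=False)
fixed = nocow_step(start, end, cur_offset, 303104, 73728, patched=True)

print(buggy, fixed)
```

With the old comparison the item is processed and yields a 0 byte range, which is what sends create_io_em() into its loop; with the new comparison the item is skipped.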
      
      Fixes: 80ff3856 ("Btrfs: update nodatacow code v2")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • D
      btrfs: fix compressed write bio blkcg attribution · 46bcff2b
      Dennis Zhou committed on
      Bio attribution is handled at bio_set_dev() as once we have a device, we
      have a corresponding request_queue and then can derive the current css.
      In special cases, we want to attribute the bio to someone else. This can
      be done by calling bio_associate_blkg_from_css() or
      kthread_associate_blkcg() depending on the scenario. Btrfs does this for
      compressed writeback as they are handled by kworkers, so the latter can
      be done here.
      
      Commit 1a418027 ("btrfs: drop bio_set_dev where not needed") removes
      early bio_set_dev() calls prior to submit_stripe_bio(). This breaks the
      above assumption that we'll have a request_queue when we are doing
      association. To fix this, switch to using kthread_associate_blkcg().
      
      Without this, we crash in btrfs/024:
      
        [ 3052.093088] BUG: kernel NULL pointer dereference, address: 0000000000000510
        [ 3052.107013] #PF: supervisor read access in kernel mode
        [ 3052.107014] #PF: error_code(0x0000) - not-present page
        [ 3052.107015] PGD 0 P4D 0
        [ 3052.107021] Oops: 0000 [#1] SMP
        [ 3052.138904] CPU: 42 PID: 201270 Comm: kworker/u161:0 Kdump: loaded Not tainted 5.5.0-rc1-00062-g4852d8ac90a9 #712
        [ 3052.138905] Hardware name: Quanta Tioga Pass Single Side 01-0032211004/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
        [ 3052.138912] Workqueue: btrfs-delalloc btrfs_work_helper
        [ 3052.191375] RIP: 0010:bio_associate_blkg_from_css+0x1e/0x3c0
        [ 3052.191379] RSP: 0018:ffffc900210cfc90 EFLAGS: 00010282
        [ 3052.191380] RAX: 0000000000000000 RBX: ffff88bfe5573c00 RCX: 0000000000000000
        [ 3052.191382] RDX: ffff889db48ec2f0 RSI: ffff88bfe5573c00 RDI: ffff889db48ec2f0
        [ 3052.191386] RBP: 0000000000000800 R08: 0000000000203bb0 R09: ffff889db16b2400
        [ 3052.293364] R10: 0000000000000000 R11: ffff88a07fffde80 R12: ffff889db48ec2f0
        [ 3052.293365] R13: 0000000000001000 R14: ffff889de82bc000 R15: ffff889e2b7bdcc8
        [ 3052.293367] FS:  0000000000000000(0000) GS:ffff889ffba00000(0000) knlGS:0000000000000000
        [ 3052.293368] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [ 3052.293369] CR2: 0000000000000510 CR3: 0000000002611001 CR4: 00000000007606e0
        [ 3052.293370] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [ 3052.293371] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [ 3052.293372] PKRU: 55555554
        [ 3052.293376] Call Trace:
        [ 3052.402552]  btrfs_submit_compressed_write+0x137/0x390
        [ 3052.402558]  submit_compressed_extents+0x40f/0x4c0
        [ 3052.422401]  btrfs_work_helper+0x246/0x5a0
        [ 3052.422408]  process_one_work+0x200/0x570
        [ 3052.438601]  ? process_one_work+0x180/0x570
        [ 3052.438605]  worker_thread+0x4c/0x3e0
        [ 3052.438614]  kthread+0x103/0x140
        [ 3052.460735]  ? process_one_work+0x570/0x570
        [ 3052.460737]  ? kthread_mod_delayed_work+0xc0/0xc0
        [ 3052.460744]  ret_from_fork+0x24/0x30
      
      Fixes: 1a418027 ("btrfs: drop bio_set_dev where not needed")
      Reported-by: Chris Murphy <chris@colorremedies.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • D
      btrfs: punt all bios created in btrfs_submit_compressed_write() · 7b62e66c
      Dennis Zhou committed on
      Compressed writes happen in the background via kworkers. However, this
      causes bios to be attributed to root bypassing any cgroup limits from
      the actual writer. We tag the first bio with REQ_CGROUP_PUNT, which will
      punt the bio to an appropriate cgroup specific workqueue and attribute
      the IO properly. However, if btrfs_submit_compressed_write() creates a
      new bio, we don't tag it the same way. Add the appropriate tagging for
      subsequent bios.
      
      Fixes: ec39f769 ("Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios")
      Reviewed-by: Chris Mason <clm@fb.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  13. 29 Dec, 2019 2 commits
    • A
      locks: print unsigned ino in /proc/locks · 98ca480a
      Amir Goldstein committed on
      An ino is unsigned, so display it as such in /proc/locks.
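
A minimal sketch of why the format matters, assuming a 64-bit inode number (the value below is hypothetical and chosen only because its top bit is set): the same bit pattern reads as a negative number when interpreted as signed.

```python
import ctypes

# Hypothetical inode number with the most significant bit set.
ino = 2**64 - 42

# Same 64-bit pattern, interpreted as signed versus unsigned.
as_signed = ctypes.c_int64(ino).value
as_unsigned = ctypes.c_uint64(ino).value

print(as_signed, as_unsigned)
```

Tools that parse /proc/locks and match the inode field against values from stat() would not find a match for the signed rendering.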
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
    • M
      block: add bio_truncate to fix guard_bio_eod · 85a8ce62
      Ming Lei committed on
      Some filesystems, such as vfat, may send a bio that crosses the device
      boundary, and worse, an IO request that starts within the device
      boundary can contain more than one segment past EOD.
      
      Commit dce30ca9 ("fs: fix guard_bio_eod to check for real EOD errors")
      tries to fix this issue by returning -EIO in this situation. However,
      this way the fs user code loses the chance to handle -EIO, and
      sync_inodes_sb() may then hang forever.
      
      Also, the current truncation of the last segment is dangerous because it
      updates the last bvec, so the bvec table is no longer immutable, and fs
      bio users may not be able to retrieve the truncated pages via
      bio_for_each_segment_all() in their .end_io callback.
      
      Fix this issue by supporting multi-segment truncation. The approach is
      also simpler:
      
      - just update the bio size, since the block layer can build the correct
      bvec from the updated bio size. The bvec table then becomes really
      immutable.
      
      - zero all truncated segments for read bio
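
The two points above can be sketched with a toy model. This is not the kernel's bio_truncate(): `truncate_bio` is a hypothetical helper in which a bio is just a list of byte buffers plus a size, used to show that the segment table is left alone while only the size changes and, for reads, the bytes past the new size are zeroed.

```python
def truncate_bio(segments, new_size, is_read):
    """Truncate a modelled bio to new_size bytes and return its new size.

    segments is a list of bytearray objects standing in for the bvec
    table. The list and the segment lengths are never modified; for reads
    the bytes beyond new_size are zeroed in place.
    """
    total = sum(len(seg) for seg in segments)
    if new_size >= total:
        return total  # nothing to truncate

    if is_read:
        offset = 0
        for seg in segments:
            seg_end = offset + len(seg)
            if seg_end > new_size:
                keep = max(new_size - offset, 0)
                # Same-length slice assignment: zero the truncated tail.
                seg[keep:] = bytes(len(seg) - keep)
            offset = seg_end
    return new_size


# Three 4-byte segments, truncated to 6 bytes: the cut falls inside the
# second segment and the third lies entirely past the new end.
segs = [bytearray(b"\xff" * 4) for _ in range(3)]
size = truncate_bio(segs, 6, is_read=True)

print(size, [bytes(s) for s in segs])
```

Because only the size shrinks, a completion handler walking all segments still sees every original segment, which is the property the commit message relies on.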
      
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: linux-fsdevel@vger.kernel.org
      Fixed-by: dce30ca9 ("fs: fix guard_bio_eod to check for real EOD errors")
      Reported-by: syzbot+2b9e54155c8c25d8d165@syzkaller.appspotmail.com
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  14. 25 Dec, 2019 1 commit
  15. 23 Dec, 2019 1 commit