1. 21 1月, 2019 1 次提交
  2. 19 1月, 2019 5 次提交
    • J
      btrfs: wakeup cleaner thread when adding delayed iput · fd340d0f
      Josef Bacik 提交于
      The cleaner thread usually takes care of delayed iputs, with the
      exception of the btrfs_end_transaction_throttle path.  Delaying iputs
      means we are potentially delaying the eviction of an inode and it's
      respective space.  The cleaner thread only gets woken up every 30
      seconds, or when we require space.  If there are a lot of inodes that
      need to be deleted we could induce a serious amount of latency while we
      wait for these inodes to be evicted.  So instead wakeup the cleaner if
      it's not already awake to process any new delayed iputs we add to the
      list.  If we suddenly need space we will less likely be backed up
      behind a bunch of inodes that are waiting to be deleted, and we could
      possibly free space before we need to get into the flushing logic which
      will save us some latency.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fd340d0f
    • J
      btrfs: run delayed iputs before committing · 3ec9a4c8
      Josef Bacik 提交于
      Delayed iputs means we can have final iputs of deleted inodes in the
      queue, which could potentially generate a lot of pinned space that could
      be free'd.  So before we decide to commit the transaction for ENOPSC
      reasons, run the delayed iputs so that any potential space is free'd up.
      If there is and we freed enough we can then commit the transaction and
      potentially be able to make our reservation.
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3ec9a4c8
    • J
      btrfs: wait on ordered extents on abort cleanup · 74d5d229
      Josef Bacik 提交于
      If we flip read-only before we initiate writeback on all dirty pages for
      ordered extents we've created then we'll have ordered extents left over
      on umount, which results in all sorts of bad things happening.  Fix this
      by making sure we wait on ordered extents if we have to do the aborted
      transaction cleanup stuff.
      
      generic/475 can produce this warning:
      
       [ 8531.177332] WARNING: CPU: 2 PID: 11997 at fs/btrfs/disk-io.c:3856 btrfs_free_fs_root+0x95/0xa0 [btrfs]
       [ 8531.183282] CPU: 2 PID: 11997 Comm: umount Tainted: G        W 5.0.0-rc1-default+ #394
       [ 8531.185164] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
       [ 8531.187851] RIP: 0010:btrfs_free_fs_root+0x95/0xa0 [btrfs]
       [ 8531.193082] RSP: 0018:ffffb1ab86163d98 EFLAGS: 00010286
       [ 8531.194198] RAX: ffff9f3449494d18 RBX: ffff9f34a2695000 RCX:0000000000000000
       [ 8531.195629] RDX: 0000000000000002 RSI: 0000000000000001 RDI:0000000000000000
       [ 8531.197315] RBP: ffff9f344e930000 R08: 0000000000000001 R09:0000000000000000
       [ 8531.199095] R10: 0000000000000000 R11: ffff9f34494d4ff8 R12:ffffb1ab86163dc0
       [ 8531.200870] R13: ffff9f344e9300b0 R14: ffffb1ab86163db8 R15:0000000000000000
       [ 8531.202707] FS:  00007fc68e949fc0(0000) GS:ffff9f34bd800000(0000)knlGS:0000000000000000
       [ 8531.204851] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [ 8531.205942] CR2: 00007ffde8114dd8 CR3: 000000002dfbd000 CR4:00000000000006e0
       [ 8531.207516] Call Trace:
       [ 8531.208175]  btrfs_free_fs_roots+0xdb/0x170 [btrfs]
       [ 8531.210209]  ? wait_for_completion+0x5b/0x190
       [ 8531.211303]  close_ctree+0x157/0x350 [btrfs]
       [ 8531.212412]  generic_shutdown_super+0x64/0x100
       [ 8531.213485]  kill_anon_super+0x14/0x30
       [ 8531.214430]  btrfs_kill_super+0x12/0xa0 [btrfs]
       [ 8531.215539]  deactivate_locked_super+0x29/0x60
       [ 8531.216633]  cleanup_mnt+0x3b/0x70
       [ 8531.217497]  task_work_run+0x98/0xc0
       [ 8531.218397]  exit_to_usermode_loop+0x83/0x90
       [ 8531.219324]  do_syscall_64+0x15b/0x180
       [ 8531.220192]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
       [ 8531.221286] RIP: 0033:0x7fc68e5e4d07
       [ 8531.225621] RSP: 002b:00007ffde8116608 EFLAGS: 00000246 ORIG_RAX:00000000000000a6
       [ 8531.227512] RAX: 0000000000000000 RBX: 00005580c2175970 RCX:00007fc68e5e4d07
       [ 8531.229098] RDX: 0000000000000001 RSI: 0000000000000000 RDI:00005580c2175b80
       [ 8531.230730] RBP: 0000000000000000 R08: 00005580c2175ba0 R09:00007ffde8114e80
       [ 8531.232269] R10: 0000000000000000 R11: 0000000000000246 R12:00005580c2175b80
       [ 8531.233839] R13: 00007fc68eac61c4 R14: 00005580c2175a68 R15:0000000000000000
      
      Leaving a tree in the rb-tree:
      
      3853 void btrfs_free_fs_root(struct btrfs_root *root)
      3854 {
      3855         iput(root->ino_cache_inode);
      3856         WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree));
      
      CC: stable@vger.kernel.org
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      [ add stacktrace ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      74d5d229
    • J
      btrfs: handle delayed ref head accounting cleanup in abort · 31890da0
      Josef Bacik 提交于
      We weren't doing any of the accounting cleanup when we aborted
      transactions.  Fix this by making cleanup_ref_head_accounting global and
      calling it from the abort code, this fixes the issue where our
      accounting was all wrong after the fs aborts.
      
      The test generic/475 on a 2G VM can trigger the problems eg.:
      
        [ 8502.136957] WARNING: CPU: 0 PID: 11064 at fs/btrfs/extent-tree.c:5986 btrfs_free_block_grou +ps+0x3dc/0x410 [btrfs]
        [ 8502.148372] CPU: 0 PID: 11064 Comm: umount Not tainted 5.0.0-rc1-default+ #394
        [ 8502.150807] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626 +cc-prebuilt.qemu-project.org 04/01/2014
        [ 8502.154317] RIP: 0010:btrfs_free_block_groups+0x3dc/0x410 [btrfs]
        [ 8502.160623] RSP: 0018:ffffb1ab84b93de8 EFLAGS: 00010206
        [ 8502.161906] RAX: 0000000001000000 RBX: ffff9f34b1756400 RCX: 0000000000000000
        [ 8502.163448] RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffff9f34b1755400
        [ 8502.164906] RBP: ffff9f34b7e8c000 R08: 0000000000000001 R09: 0000000000000000
        [ 8502.166716] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9f34b7e8c108
        [ 8502.168498] R13: ffff9f34b7e8c158 R14: 0000000000000000 R15: dead000000000100
        [ 8502.170296] FS:  00007fb1cf15ffc0(0000) GS:ffff9f34bd400000(0000) knlGS:0000000000000000
        [ 8502.172439] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [ 8502.173669] CR2: 00007fb1ced507b0 CR3: 000000002f7a6000 CR4: 00000000000006f0
        [ 8502.175094] Call Trace:
        [ 8502.175759]  close_ctree+0x17f/0x350 [btrfs]
        [ 8502.176721]  generic_shutdown_super+0x64/0x100
        [ 8502.177702]  kill_anon_super+0x14/0x30
        [ 8502.178607]  btrfs_kill_super+0x12/0xa0 [btrfs]
        [ 8502.179602]  deactivate_locked_super+0x29/0x60
        [ 8502.180595]  cleanup_mnt+0x3b/0x70
        [ 8502.181406]  task_work_run+0x98/0xc0
        [ 8502.182255]  exit_to_usermode_loop+0x83/0x90
        [ 8502.183113]  do_syscall_64+0x15b/0x180
        [ 8502.183919]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Corresponding to
      
        release_global_block_rsv() {
        ...
        WARN_ON(fs_info->delayed_refs_rsv.reserved > 0);
      
      CC: stable@vger.kernel.org
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      [ add log dump ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      31890da0
    • D
      Revert "btrfs: balance dirty metadata pages in btrfs_finish_ordered_io" · 77b7aad1
      David Sterba 提交于
      This reverts commit e73e81b6.
      
      This patch causes a few problems:
      
      - adds latency to btrfs_finish_ordered_io
      - as btrfs_finish_ordered_io is used for free space cache, generating
        more work from btrfs_btree_balance_dirty_nodelay could end up in the
        same workque, effectively deadlocking
      
      12260 kworker/u96:16+btrfs-freespace-write D
      [<0>] balance_dirty_pages+0x6e6/0x7ad
      [<0>] balance_dirty_pages_ratelimited+0x6bb/0xa90
      [<0>] btrfs_finish_ordered_io+0x3da/0x770
      [<0>] normal_work_helper+0x1c5/0x5a0
      [<0>] process_one_work+0x1ee/0x5a0
      [<0>] worker_thread+0x46/0x3d0
      [<0>] kthread+0xf5/0x130
      [<0>] ret_from_fork+0x24/0x30
      [<0>] 0xffffffffffffffff
      
      Transaction commit will wait on the freespace cache:
      
      838 btrfs-transacti D
      [<0>] btrfs_start_ordered_extent+0x154/0x1e0
      [<0>] btrfs_wait_ordered_range+0xbd/0x110
      [<0>] __btrfs_wait_cache_io+0x49/0x1a0
      [<0>] btrfs_write_dirty_block_groups+0x10b/0x3b0
      [<0>] commit_cowonly_roots+0x215/0x2b0
      [<0>] btrfs_commit_transaction+0x37e/0x910
      [<0>] transaction_kthread+0x14d/0x180
      [<0>] kthread+0xf5/0x130
      [<0>] ret_from_fork+0x24/0x30
      [<0>] 0xffffffffffffffff
      
      And then writepages ends up waiting on transaction commit:
      
      9520 kworker/u96:13+flush-btrfs-1 D
      [<0>] wait_current_trans+0xac/0xe0
      [<0>] start_transaction+0x21b/0x4b0
      [<0>] cow_file_range_inline+0x10b/0x6b0
      [<0>] cow_file_range.isra.69+0x329/0x4a0
      [<0>] run_delalloc_range+0x105/0x3c0
      [<0>] writepage_delalloc+0x119/0x180
      [<0>] __extent_writepage+0x10c/0x390
      [<0>] extent_write_cache_pages+0x26f/0x3d0
      [<0>] extent_writepages+0x4f/0x80
      [<0>] do_writepages+0x17/0x60
      [<0>] __writeback_single_inode+0x59/0x690
      [<0>] writeback_sb_inodes+0x291/0x4e0
      [<0>] __writeback_inodes_wb+0x87/0xb0
      [<0>] wb_writeback+0x3bb/0x500
      [<0>] wb_workfn+0x40d/0x610
      [<0>] process_one_work+0x1ee/0x5a0
      [<0>] worker_thread+0x1e0/0x3d0
      [<0>] kthread+0xf5/0x130
      [<0>] ret_from_fork+0x24/0x30
      [<0>] 0xffffffffffffffff
      
      Eventually, we have every process in the system waiting on
      balance_dirty_pages(), and nobody is able to make progress on page
      writeback.
      
      The original patch tried to fix an OOM condition, that happened on 4.4 but no
      success reproducing that on later kernels (4.19 and 4.20). This is more likely
      a problem in OOM itself.
      
      Link: https://lore.kernel.org/linux-btrfs/20180528054821.9092-1-ethanlien@synology.com/Reported-by: NChris Mason <clm@fb.com>
      CC: stable@vger.kernel.org # 4.18+
      CC: ethanlien <ethanlien@synology.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      77b7aad1
  3. 18 1月, 2019 1 次提交
    • S
      pstore/ram: Fix console ramoops to show the previous boot logs · 6a4c9ab1
      Sai Prakash Ranjan 提交于
      commit b05c9506 ("pstore/ram: Simplify ramoops_get_next_prz()
      arguments") changed update assignment in getting next persistent ram zone
      by adding a check for record type. But the check always returns true since
      the record type is assigned 0. And this breaks console ramoops by showing
      current console log instead of previous log on warm reset and hard reset
      (actually hard reset should not be showing any logs).
      
      Fix this by having persistent ram zone type check instead of record type
      check. Tested this on SDM845 MTP and dragonboard 410c.
      
      Reproducing this issue is simple as below:
      
      1. Trigger hard reset and mount pstore. Will see console-ramoops
         record in the mounted location which is the current log.
      
      2. Trigger warm reset and mount pstore. Will see the current
         console-ramoops record instead of previous record.
      
      Fixes: b05c9506 ("pstore/ram: Simplify ramoops_get_next_prz() arguments")
      Signed-off-by: NSai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
      Acked-by: NJoel Fernandes (Google) <joel@joelfernandes.org>
      [kees: dropped local variable usage]
      Signed-off-by: NKees Cook <keescook@chromium.org>
      6a4c9ab1
  4. 17 1月, 2019 4 次提交
    • D
      afs: Fix race in async call refcounting · 34fa4761
      David Howells 提交于
      There's a race between afs_make_call() and afs_wake_up_async_call() in the
      case that an error is returned from rxrpc_kernel_send_data() after it has
      queued the final packet.
      
      afs_make_call() will try and clean up the mess, but the call state may have
      been moved on thereby causing afs_process_async_call() to also try and to
      delete the call.
      
      Fix this by:
      
       (1) Getting an extra ref for an asynchronous call for the call itself to
           hold.  This makes sure the call doesn't evaporate on us accidentally
           and will allow the call to be retained by the caller in a future
           patch.  The ref is released on leaving afs_make_call() or
           afs_wait_for_call_to_complete().
      
       (2) In the event of an error from rxrpc_kernel_send_data():
      
           (a) Don't set the call state to AFS_CALL_COMPLETE until *after* the
           	 call has been aborted and ended.  This prevents
           	 afs_deliver_to_call() from doing anything with any notifications
           	 it gets.
      
           (b) Explicitly end the call immediately to prevent further callbacks.
      
           (c) Cancel any queued async_work and wait for the work if it's
           	 executing.  This allows us to be sure the race won't recur when we
           	 change the state.  We put the work queue's ref on the call if we
           	 managed to cancel it.
      
           (d) Put the call's ref that we got in (1).  This belongs to us as long
           	 as the call is in state AFS_CALL_CL_REQUESTING.
      
      Fixes: 341f741f ("afs: Refcount the afs_call struct")
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      34fa4761
    • D
      afs: Provide a function to get a ref on a call · 7a75b007
      David Howells 提交于
      Provide a function to get a reference on an afs_call struct.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      7a75b007
    • D
      afs: Fix key refcounting in file locking code · 59d49076
      David Howells 提交于
      Fix the refcounting of the authentication keys in the file locking code.
      The vnode->lock_key member points to a key on which it expects to be
      holding a ref, but it isn't always given an extra ref, however.
      
      Fixes: 0fafdc9f ("afs: Fix file locking")
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      59d49076
    • M
      afs: Don't set vnode->cb_s_break in afs_validate() · 4882a27c
      Marc Dionne 提交于
      A cb_interest record is not necessarily attached to the vnode on entry to
      afs_validate(), which can cause an oops when we try to bring the vnode's
      cb_s_break up to date in the default case (ie. no current callback promise
      and the vnode has not been deleted).
      
      Fix this by simply removing the line, as vnode->cb_s_break will be set when
      needed by afs_register_server_cb_interest() when we next get a callback
      promise from RPC call.
      
      The oops looks something like:
      
          BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
          ...
          RIP: 0010:afs_validate+0x66/0x250 [kafs]
          ...
          Call Trace:
           afs_d_revalidate+0x8d/0x340 [kafs]
           ? __d_lookup+0x61/0x150
           lookup_dcache+0x44/0x70
           ? lookup_dcache+0x44/0x70
           __lookup_hash+0x24/0xa0
           do_unlinkat+0x11d/0x2c0
           __x64_sys_unlink+0x23/0x30
           do_syscall_64+0x4d/0xf0
           entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: ae3b7361 ("afs: Fix validation/callback interaction")
      Signed-off-by: NMarc Dionne <marc.dionne@auristor.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      4882a27c
  5. 16 1月, 2019 1 次提交
  6. 15 1月, 2019 1 次提交
    • J
      blockdev: Fix livelocks on loop device · 04906b2f
      Jan Kara 提交于
      bd_set_size() updates also block device's block size. This is somewhat
      unexpected from its name and at this point, only blkdev_open() uses this
      functionality. Furthermore, this can result in changing block size under
      a filesystem mounted on a loop device which leads to livelocks inside
      __getblk_gfp() like:
      
      Sending NMI from CPU 0 to CPUs 1:
      NMI backtrace for cpu 1
      CPU: 1 PID: 10863 Comm: syz-executor0 Not tainted 4.18.0-rc5+ #151
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google
      01/01/2011
      RIP: 0010:__sanitizer_cov_trace_pc+0x3f/0x50 kernel/kcov.c:106
      ...
      Call Trace:
       init_page_buffers+0x3e2/0x530 fs/buffer.c:904
       grow_dev_page fs/buffer.c:947 [inline]
       grow_buffers fs/buffer.c:1009 [inline]
       __getblk_slow fs/buffer.c:1036 [inline]
       __getblk_gfp+0x906/0xb10 fs/buffer.c:1313
       __bread_gfp+0x2d/0x310 fs/buffer.c:1347
       sb_bread include/linux/buffer_head.h:307 [inline]
       fat12_ent_bread+0x14e/0x3d0 fs/fat/fatent.c:75
       fat_ent_read_block fs/fat/fatent.c:441 [inline]
       fat_alloc_clusters+0x8ce/0x16e0 fs/fat/fatent.c:489
       fat_add_cluster+0x7a/0x150 fs/fat/inode.c:101
       __fat_get_block fs/fat/inode.c:148 [inline]
      ...
      
      Trivial reproducer for the problem looks like:
      
      truncate -s 1G /tmp/image
      losetup /dev/loop0 /tmp/image
      mkfs.ext4 -b 1024 /dev/loop0
      mount -t ext4 /dev/loop0 /mnt
      losetup -c /dev/loop0
      l /mnt
      
      Fix the problem by moving initialization of a block device block size
      into a separate function and call it when needed.
      
      Thanks to Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> for help with
      debugging the problem.
      
      Reported-by: syzbot+9933e4476f365f5d5a1b@syzkaller.appspotmail.com
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      04906b2f
  7. 11 1月, 2019 16 次提交
  8. 09 1月, 2019 4 次提交
    • F
      Btrfs: fix deadlock when using free space tree due to block group creation · a6d8654d
      Filipe Manana 提交于
      When modifying the free space tree we can end up COWing one of its extent
      buffers which in turn might result in allocating a new chunk, which in
      turn can result in flushing (finish creation) of pending block groups. If
      that happens we can deadlock because creating a pending block group needs
      to update the free space tree, and if any of the updates tries to modify
      the same extent buffer that we are COWing, we end up in a deadlock since
      we try to write lock twice the same extent buffer.
      
      So fix this by skipping pending block group creation if we are COWing an
      extent buffer from the free space tree. This is a case missed by commit
      5ce55557 ("Btrfs: fix deadlock when writing out free space caches").
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202173
      Fixes: 5ce55557 ("Btrfs: fix deadlock when writing out free space caches")
      CC: stable@vger.kernel.org # 4.18+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a6d8654d
    • F
      Btrfs: fix race between reflink/dedupe and relocation · d8b55242
      Filipe Manana 提交于
      The recent rework that makes btrfs' remap_file_range operation use the
      generic helper generic_remap_file_range_prep() introduced a race between
      relocation and reflinking (for both cloning and deduplication) the file
      extents between the source and destination inodes.
      
      This happens because we no longer lock the source range anymore, and we do
      not lock it anymore because we wait for direct IO writes and writeback to
      complete early on the code path right after locking the inodes, which
      guarantees no other file operations interfere with the reflinking. However
      there is one exception which is relocation, since it replaces the byte
      number of file extents items in the fs tree after locking the range the
      file extent items represent. This is a problem because after finding each
      file extent to clone in the fs tree, the reflink process copies the file
      extent item into a local buffer, releases the search path, inserts new
      file extent items in the destination range and then increments the
      reference count for the extent mentioned in the file extent item that it
      previously copied to the buffer. If right after copying the file extent
      item into the buffer and releasing the path the relocation process
      updates the file extent item to point to the new extent, the reflink
      process ends up creating a delayed reference to increment the reference
      count of the old extent, for which the relocation process already created
      a delayed reference to drop it. This results in failure to run delayed
      references because we will attempt to increment the count of a reference
      that was already dropped. This is illustrated by the following diagram:
      
              CPU 1                                       CPU 2
      
                                              relocation is running
      
        btrfs_clone_files()
      
          btrfs_clone()
            --> finds extent item
                in source range
                point to extent
                at bytenr X
            --> copies it into a
                local buffer
            --> releases path
      
                                              replace_file_extents()
                                                --> successfully locks the
                                                    range represented by
                                                    the file extent item
                                                --> replaces disk_bytenr
                                                    field in the file
                                                    extent item with some
                                                    other value Y
                                                --> creates delayed reference
                                                    to increment reference
                                                    count for extent at
                                                    bytenr Y
                                                --> creates delayed reference
                                                    to drop the extent at
                                                    bytenr X
      
            --> starts transaction
            --> creates delayed
                reference to
                increment extent
                at bytenr X
      
                          <delayed references are run, due to a transaction
                           commit for example, and the transaction is aborted
                           with -EIO because we attempt to increment reference
                           count for the extent at bytenr X after we freed it>
      
      When this race is hit the running transaction ends up getting aborted with
      an -EIO error and a trace like the following is produced:
      
      [ 4382.553858] WARNING: CPU: 2 PID: 3648 at fs/btrfs/extent-tree.c:1552 lookup_inline_extent_backref+0x4f4/0x650 [btrfs]
      (...)
      [ 4382.556293] CPU: 2 PID: 3648 Comm: btrfs Tainted: G        W         4.20.0-rc6-btrfs-next-41 #1
      [ 4382.556294] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
      [ 4382.556308] RIP: 0010:lookup_inline_extent_backref+0x4f4/0x650 [btrfs]
      (...)
      [ 4382.556310] RSP: 0018:ffffac784408f738 EFLAGS: 00010202
      [ 4382.556311] RAX: 0000000000000001 RBX: ffff8980673c3a48 RCX: 0000000000000001
      [ 4382.556312] RDX: 0000000000000008 RSI: 0000000000000000 RDI: 0000000000000000
      [ 4382.556312] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
      [ 4382.556313] R10: 0000000000000001 R11: ffff897f40000000 R12: 0000000000001000
      [ 4382.556313] R13: 00000000c224f000 R14: ffff89805de9bd40 R15: ffff8980453f4548
      [ 4382.556315] FS:  00007f5e759178c0(0000) GS:ffff89807b300000(0000) knlGS:0000000000000000
      [ 4382.563130] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 4382.563562] CR2: 00007f2e9789fcbc CR3: 0000000120512001 CR4: 00000000003606e0
      [ 4382.564005] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 4382.564451] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 4382.564887] Call Trace:
      [ 4382.565343]  insert_inline_extent_backref+0x55/0xe0 [btrfs]
      [ 4382.565796]  __btrfs_inc_extent_ref.isra.60+0x88/0x260 [btrfs]
      [ 4382.566249]  ? __btrfs_run_delayed_refs+0x93/0x1650 [btrfs]
      [ 4382.566702]  __btrfs_run_delayed_refs+0xa22/0x1650 [btrfs]
      [ 4382.567162]  btrfs_run_delayed_refs+0x7e/0x1d0 [btrfs]
      [ 4382.567623]  btrfs_commit_transaction+0x50/0x9c0 [btrfs]
      [ 4382.568112]  ? _raw_spin_unlock+0x24/0x30
      [ 4382.568557]  ? block_rsv_release_bytes+0x14e/0x410 [btrfs]
      [ 4382.569006]  create_subvol+0x3c8/0x830 [btrfs]
      [ 4382.569461]  ? btrfs_mksubvol+0x317/0x600 [btrfs]
      [ 4382.569906]  btrfs_mksubvol+0x317/0x600 [btrfs]
      [ 4382.570383]  ? rcu_sync_lockdep_assert+0xe/0x60
      [ 4382.570822]  ? __sb_start_write+0xd4/0x1c0
      [ 4382.571262]  ? mnt_want_write_file+0x24/0x50
      [ 4382.571712]  btrfs_ioctl_snap_create_transid+0x117/0x1a0 [btrfs]
      [ 4382.572155]  ? _copy_from_user+0x66/0x90
      [ 4382.572602]  btrfs_ioctl_snap_create+0x66/0x80 [btrfs]
      [ 4382.573052]  btrfs_ioctl+0x7c1/0x30e0 [btrfs]
      [ 4382.573502]  ? mem_cgroup_commit_charge+0x8b/0x570
      [ 4382.573946]  ? do_raw_spin_unlock+0x49/0xc0
      [ 4382.574379]  ? _raw_spin_unlock+0x24/0x30
      [ 4382.574803]  ? __handle_mm_fault+0xf29/0x12d0
      [ 4382.575215]  ? do_vfs_ioctl+0xa2/0x6f0
      [ 4382.575622]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
      [ 4382.576020]  do_vfs_ioctl+0xa2/0x6f0
      [ 4382.576405]  ksys_ioctl+0x70/0x80
      [ 4382.576776]  __x64_sys_ioctl+0x16/0x20
      [ 4382.577137]  do_syscall_64+0x60/0x1b0
      [ 4382.577488]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      (...)
      [ 4382.578837] RSP: 002b:00007ffe04bf64c8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
      [ 4382.579174] RAX: ffffffffffffffda RBX: 00005564136f3050 RCX: 00007f5e74724dd7
      [ 4382.579505] RDX: 00007ffe04bf64d0 RSI: 000000005000940e RDI: 0000000000000003
      [ 4382.579848] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000044
      [ 4382.580164] R10: 0000000000000541 R11: 0000000000000202 R12: 00005564136f3010
      [ 4382.580477] R13: 0000000000000003 R14: 00005564136f3035 R15: 00005564136f3050
      [ 4382.580792] irq event stamp: 0
      [ 4382.581106] hardirqs last  enabled at (0): [<0000000000000000>]           (null)
      [ 4382.581441] hardirqs last disabled at (0): [<ffffffff8d085842>] copy_process.part.32+0x6e2/0x2320
      [ 4382.581772] softirqs last  enabled at (0): [<ffffffff8d085842>] copy_process.part.32+0x6e2/0x2320
      [ 4382.582095] softirqs last disabled at (0): [<0000000000000000>]           (null)
      [ 4382.582413] ---[ end trace d3c188e3e9367382 ]---
      [ 4382.623855] BTRFS: error (device sdc) in btrfs_run_delayed_refs:2981: errno=-5 IO failure
      [ 4382.624295] BTRFS info (device sdc): forced readonly
      
      Fix this by locking the source range before searching for the file extent
      items in the fs tree, since the relocation process will try to lock the
      range a file extent item represents before updating it with the new extent
      location.
      
      Fixes: 34a28e3d ("Btrfs: use generic_remap_file_range_prep() for cloning and deduplication")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d8b55242
    • F
      Btrfs: fix race between cloning range ending at eof and writeback · f7fa1107
      Filipe Manana 提交于
      The recent rework that makes btrfs' remap_file_range operation use the
      generic helper generic_remap_file_range_prep() introduced a race between
      writeback and cloning a range that covers the eof extent of the source
      file into a destination offset that is greater then the same file's size.
      
      This happens because we now wait for writeback to complete before doing
      the truncation of the eof block, while previously we did the truncation
      and then waited for writeback to complete. This leads to a race between
      writeback of the truncated block and cloning the file extents in the
      source range, because we copy each file extent item we find in the fs
      root into a buffer, then release the path and then increment the reference
      count for the extent referred in that file extent item we copied, which
      can no longer exist if writeback of the truncated eof block completes
      after we copied the file extent item into the buffer and before we
      incremented the reference count. This is illustrated by the following
      diagram:
      
              CPU 1                                       CPU 2
      
        btrfs_clone_files()
          btrfs_cont_expand()
            btrfs_truncate_block()
               --> zeroes part of the
                   page containg eof,
                   marking it for
                  delalloc
      
          btrfs_clone()
            --> finds extent item
                covering eof,
                points to extent
                at bytenr X
            --> copies it into a
                local buffer
            --> releases path
      
                                              writeback starts
      
                                              btrfs_finish_ordered_io()
                                                insert_reserved_file_extent()
                                                  __btrfs_drop_extents()
                                                    --> creates delayed
                                                        reference to drop
                                                        the extent at
                                                        bytenr X
      
            --> starts transaction
            --> creates delayed
                reference to
                increment extent
                at bytenr X
      
                          <delayed references are run, due to a transaction
                           commit for example, and the transaction is aborted
                           with -EIO because we attempt to increment reference
                           count for the extent at bytenr X after we freed it>
      
      When this race is hit the running transaction ends up getting aborted with
      an -EIO error and a trace like the following is produced:
      
      [ 4382.553858] WARNING: CPU: 2 PID: 3648 at fs/btrfs/extent-tree.c:1552 lookup_inline_extent_backref+0x4f4/0x650 [btrfs]
      (...)
      [ 4382.556293] CPU: 2 PID: 3648 Comm: btrfs Tainted: G        W         4.20.0-rc6-btrfs-next-41 #1
      [ 4382.556294] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
      [ 4382.556308] RIP: 0010:lookup_inline_extent_backref+0x4f4/0x650 [btrfs]
      (...)
      [ 4382.556310] RSP: 0018:ffffac784408f738 EFLAGS: 00010202
      [ 4382.556311] RAX: 0000000000000001 RBX: ffff8980673c3a48 RCX: 0000000000000001
      [ 4382.556312] RDX: 0000000000000008 RSI: 0000000000000000 RDI: 0000000000000000
      [ 4382.556312] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
      [ 4382.556313] R10: 0000000000000001 R11: ffff897f40000000 R12: 0000000000001000
      [ 4382.556313] R13: 00000000c224f000 R14: ffff89805de9bd40 R15: ffff8980453f4548
      [ 4382.556315] FS:  00007f5e759178c0(0000) GS:ffff89807b300000(0000) knlGS:0000000000000000
      [ 4382.563130] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 4382.563562] CR2: 00007f2e9789fcbc CR3: 0000000120512001 CR4: 00000000003606e0
      [ 4382.564005] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 4382.564451] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 4382.564887] Call Trace:
      [ 4382.565343]  insert_inline_extent_backref+0x55/0xe0 [btrfs]
      [ 4382.565796]  __btrfs_inc_extent_ref.isra.60+0x88/0x260 [btrfs]
      [ 4382.566249]  ? __btrfs_run_delayed_refs+0x93/0x1650 [btrfs]
      [ 4382.566702]  __btrfs_run_delayed_refs+0xa22/0x1650 [btrfs]
      [ 4382.567162]  btrfs_run_delayed_refs+0x7e/0x1d0 [btrfs]
      [ 4382.567623]  btrfs_commit_transaction+0x50/0x9c0 [btrfs]
      [ 4382.568112]  ? _raw_spin_unlock+0x24/0x30
      [ 4382.568557]  ? block_rsv_release_bytes+0x14e/0x410 [btrfs]
      [ 4382.569006]  create_subvol+0x3c8/0x830 [btrfs]
      [ 4382.569461]  ? btrfs_mksubvol+0x317/0x600 [btrfs]
      [ 4382.569906]  btrfs_mksubvol+0x317/0x600 [btrfs]
      [ 4382.570383]  ? rcu_sync_lockdep_assert+0xe/0x60
      [ 4382.570822]  ? __sb_start_write+0xd4/0x1c0
      [ 4382.571262]  ? mnt_want_write_file+0x24/0x50
      [ 4382.571712]  btrfs_ioctl_snap_create_transid+0x117/0x1a0 [btrfs]
      [ 4382.572155]  ? _copy_from_user+0x66/0x90
      [ 4382.572602]  btrfs_ioctl_snap_create+0x66/0x80 [btrfs]
      [ 4382.573052]  btrfs_ioctl+0x7c1/0x30e0 [btrfs]
      [ 4382.573502]  ? mem_cgroup_commit_charge+0x8b/0x570
      [ 4382.573946]  ? do_raw_spin_unlock+0x49/0xc0
      [ 4382.574379]  ? _raw_spin_unlock+0x24/0x30
      [ 4382.574803]  ? __handle_mm_fault+0xf29/0x12d0
      [ 4382.575215]  ? do_vfs_ioctl+0xa2/0x6f0
      [ 4382.575622]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
      [ 4382.576020]  do_vfs_ioctl+0xa2/0x6f0
      [ 4382.576405]  ksys_ioctl+0x70/0x80
      [ 4382.576776]  __x64_sys_ioctl+0x16/0x20
      [ 4382.577137]  do_syscall_64+0x60/0x1b0
      [ 4382.577488]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      (...)
      [ 4382.578837] RSP: 002b:00007ffe04bf64c8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
      [ 4382.579174] RAX: ffffffffffffffda RBX: 00005564136f3050 RCX: 00007f5e74724dd7
      [ 4382.579505] RDX: 00007ffe04bf64d0 RSI: 000000005000940e RDI: 0000000000000003
      [ 4382.579848] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000044
      [ 4382.580164] R10: 0000000000000541 R11: 0000000000000202 R12: 00005564136f3010
      [ 4382.580477] R13: 0000000000000003 R14: 00005564136f3035 R15: 00005564136f3050
      [ 4382.580792] irq event stamp: 0
      [ 4382.581106] hardirqs last  enabled at (0): [<0000000000000000>]           (null)
      [ 4382.581441] hardirqs last disabled at (0): [<ffffffff8d085842>] copy_process.part.32+0x6e2/0x2320
      [ 4382.581772] softirqs last  enabled at (0): [<ffffffff8d085842>] copy_process.part.32+0x6e2/0x2320
      [ 4382.582095] softirqs last disabled at (0): [<0000000000000000>]           (null)
      [ 4382.582413] ---[ end trace d3c188e3e9367382 ]---
      [ 4382.623855] BTRFS: error (device sdc) in btrfs_run_delayed_refs:2981: errno=-5 IO failure
      [ 4382.624295] BTRFS info (device sdc): forced readonly
      
      Fix this by waiting for writeback to complete after truncating the eof
      block.
      
      Fixes: 34a28e3d ("Btrfs: use generic_remap_file_range_prep() for cloning and deduplication")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f7fa1107
    • M
      hugetlbfs: revert "Use i_mmap_rwsem to fix page fault/truncate race" · e7c58097
      Mike Kravetz 提交于
      This reverts c86aa7bb
      
      The reverted commit caused ABBA deadlocks when file migration raced with
      file eviction for specific hugetlbfs files.  This was discovered with a
      modified version of the LTP move_pages12 test.
      
      The purpose of the reverted patch was to close a long existing race
      between hugetlbfs file truncation and page faults.  After more analysis
      of the patch and impacted code, it was determined that i_mmap_rwsem can
      not be used for all required synchronization.  Therefore, revert this
      patch while working an another approach to the underlying issue.
      
      Link: http://lkml.kernel.org/r/20190103235452.29335-1-mike.kravetz@oracle.comSigned-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Reported-by: NJan Stancek <jstancek@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e7c58097
  9. 08 1月, 2019 2 次提交
  10. 07 1月, 2019 1 次提交
  11. 06 1月, 2019 1 次提交
    • E
      fscrypt: add Adiantum support · 8094c3ce
      Eric Biggers 提交于
      Add support for the Adiantum encryption mode to fscrypt.  Adiantum is a
      tweakable, length-preserving encryption mode with security provably
      reducible to that of XChaCha12 and AES-256, subject to a security bound.
      It's also a true wide-block mode, unlike XTS.  See the paper
      "Adiantum: length-preserving encryption for entry-level processors"
      (https://eprint.iacr.org/2018/720.pdf) for more details.  Also see
      commit 059c2a4d ("crypto: adiantum - add Adiantum support").
      
      On sufficiently long messages, Adiantum's bottlenecks are XChaCha12 and
      the NH hash function.  These algorithms are fast even on processors
      without dedicated crypto instructions.  Adiantum makes it feasible to
      enable storage encryption on low-end mobile devices that lack AES
      instructions; currently such devices are unencrypted.  On ARM Cortex-A7,
      on 4096-byte messages Adiantum encryption is about 4 times faster than
      AES-256-XTS encryption; decryption is about 5 times faster.
      
      In fscrypt, Adiantum is suitable for encrypting both file contents and
      names.  With filenames, it fixes a known weakness: when two filenames in
      a directory share a common prefix of >= 16 bytes, with CTS-CBC their
      encrypted filenames share a common prefix too, leaking information.
      Adiantum does not have this problem.
      
      Since Adiantum also accepts long tweaks (IVs), it's also safe to use the
      master key directly for Adiantum encryption rather than deriving
      per-file keys, provided that the per-file nonce is included in the IVs
      and the master key isn't used for any other encryption mode.  This
      configuration saves memory and improves performance.  A new fscrypt
      policy flag is added to allow users to opt-in to this configuration.
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      8094c3ce
  12. 05 1月, 2019 3 次提交