1. 25 8月, 2016 1 次提交
    • L
      Btrfs: fix memory leak of reloc_root · 1c1ea4f7
      Liu Bo 提交于
      When some critical errors occur and FS would be flipped into RO,
      if we have an on-going balance, we can end up with a memory leak
      of root->reloc_root since btrfs_drop_snapshots() bails out
      without freeing reloc_root at the very early start.
      
      However, we're not able to free reloc_root in btrfs_drop_snapshots()
      because its caller, merge_reloc_roots(), still needs to access it to
      cleanup reloc_root's rbtree.
      
      This makes us free reloc_root when we're going to free fs/file roots.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      1c1ea4f7
  2. 26 7月, 2016 6 次提交
    • J
      btrfs: tests, use BTRFS_FS_STATE_DUMMY_FS_INFO instead of dummy root · f5ee5c9a
      Jeff Mahoney 提交于
      Now that we have a dummy fs_info associated with each test that
      uses a root, we don't need the DUMMY_ROOT bit anymore.  This lets
      us make choices without needing an actual root like in e.g.
      btrfs_find_create_tree_block.
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f5ee5c9a
    • J
      btrfs: tests, require fs_info for root · 7c0260ee
      Jeff Mahoney 提交于
      This allows the upcoming patchset to push nodesize and sectorsize into
      fs_info.
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7c0260ee
    • J
      btrfs: btrfs_test_opt and friends should take a btrfs_fs_info · 3cdde224
      Jeff Mahoney 提交于
      btrfs_test_opt and friends only use the root pointer to access
      the fs_info.  Let's pass the fs_info directly in preparation to
      eliminate similar patterns all over btrfs.
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3cdde224
    • J
      btrfs: plumb fs_info into btrfs_work · cb001095
      Jeff Mahoney 提交于
      In order to provide an fsid for trace events, we'll need a btrfs_fs_info
      pointer.  The most lightweight way to do that for btrfs_work structures
      is to associate it with the __btrfs_workqueue structure.  Each queued
      btrfs_work structure has a workqueue associated with it, so that's
      a natural fit.  It's a privately defined structures, so we add accessors
      to retrieve the fs_info pointer.
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      cb001095
    • N
      btrfs: Fix slab accounting flags · fba4b697
      Nikolay Borisov 提交于
      BTRFS is using a variety of slab caches to satisfy internal needs.
      Those slab caches are always allocated with the SLAB_RECLAIM_ACCOUNT,
      meaning allocations from the caches are going to be accounted as
      SReclaimable. At the same time btrfs is not registering any shrinkers
      whatsoever, thus preventing memory from the slabs to be shrunk. This
      means those caches are not in fact reclaimable.
      
      To fix this remove the SLAB_RECLAIM_ACCOUNT on all caches apart from the
      inode cache, since this one is being freed by the generic VFS super_block
      shrinker. Also set the transaction related caches as SLAB_TEMPORARY,
      to better document the lifetime of the objects (it just translates
      to SLAB_RECLAIM_ACCOUNT).
      Signed-off-by: NNikolay Borisov <n.borisov.lkml@gmail.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fba4b697
    • L
      Btrfs: fix double free of fs root · 876d2cf1
      Liu Bo 提交于
      I got this warning while mounting a btrfs image,
      
      [ 3020.509606] ------------[ cut here ]------------
      [ 3020.510107] WARNING: CPU: 3 PID: 5581 at lib/idr.c:1051 ida_remove+0xca/0x190
      [ 3020.510853] ida_remove called for id=42 which is not allocated.
      [ 3020.511466] Modules linked in:
      [ 3020.511802] CPU: 3 PID: 5581 Comm: mount Not tainted 4.7.0-rc5+ #274
      [ 3020.512438] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014
      [ 3020.513385]  0000000000000286 0000000021295d86 ffff88006c66b8f0 ffffffff8182ba5a
      [ 3020.514153]  0000000000000000 0000000000000009 ffff88006c66b930 ffffffff810e0ed7
      [ 3020.514928]  0000041b00000000 ffffffff8289a8c0 ffff88007f437880 0000000000000000
      [ 3020.515717] Call Trace:
      [ 3020.515965]  [<ffffffff8182ba5a>] dump_stack+0xc9/0x13f
      [ 3020.516487]  [<ffffffff810e0ed7>] __warn+0x147/0x160
      [ 3020.517005]  [<ffffffff810e0f4f>] warn_slowpath_fmt+0x5f/0x80
      [ 3020.517572]  [<ffffffff8182e6ca>] ida_remove+0xca/0x190
      [ 3020.518075]  [<ffffffff813a2bcc>] free_anon_bdev+0x2c/0x60
      [ 3020.518609]  [<ffffffff81657a9f>] free_fs_root+0x13f/0x160
      [ 3020.519138]  [<ffffffff8165c679>] btrfs_get_fs_root+0x379/0x3d0
      [ 3020.519710]  [<ffffffff81e6e975>] ? __mutex_unlock_slowpath+0x155/0x2c0
      [ 3020.520366]  [<ffffffff816615b1>] open_ctree+0x2e91/0x3200
      [ 3020.520965]  [<ffffffff8161ede2>] btrfs_mount+0x1322/0x15b0
      [ 3020.521536]  [<ffffffff81e60e74>] ? kmemleak_alloc_percpu+0x44/0x170
      [ 3020.522167]  [<ffffffff8115f5e1>] ? lockdep_init_map+0x61/0x210
      [ 3020.522780]  [<ffffffff813a4f59>] mount_fs+0x49/0x2c0
      [ 3020.523305]  [<ffffffff813d840c>] vfs_kern_mount+0xac/0x1b0
      [ 3020.523872]  [<ffffffff8161dee1>] btrfs_mount+0x421/0x15b0
      [ 3020.524402]  [<ffffffff81e60e74>] ? kmemleak_alloc_percpu+0x44/0x170
      [ 3020.525045]  [<ffffffff8115f5e1>] ? lockdep_init_map+0x61/0x210
      [ 3020.525657]  [<ffffffff8115f5e1>] ? lockdep_init_map+0x61/0x210
      [ 3020.526289]  [<ffffffff813a4f59>] mount_fs+0x49/0x2c0
      [ 3020.526803]  [<ffffffff813d840c>] vfs_kern_mount+0xac/0x1b0
      [ 3020.527365]  [<ffffffff813dc27a>] do_mount+0x41a/0x1770
      [ 3020.527899]  [<ffffffff812e800d>] ? strndup_user+0x6d/0xc0
      [ 3020.528447]  [<ffffffff812e7f68>] ? memdup_user+0x78/0xb0
      [ 3020.528987]  [<ffffffff813ddad0>] SyS_mount+0x150/0x160
      [ 3020.529493]  [<ffffffff81e72b7c>] entry_SYSCALL_64_fastpath+0x1f/0xbd
      
      It turns out that we free fs root twice, btrfs_init_fs_root() calls
      free_anon_bdev(root->anon_dev) and later then btrfs_get_fs_root() cals
      free_fs_root which does another free_anon_bdev() and it ends up with the
      above warning.
      
      Instead of reset root->anon_dev to 0 after free_anon_bdev(), we can let
      btrfs_init_fs_root() return directly since its callers have already done
      the free job by calling free_fs_root().
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Reviewed-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      876d2cf1
  3. 24 6月, 2016 1 次提交
  4. 18 6月, 2016 3 次提交
    • C
      Btrfs: btrfs_check_super_valid: Allow 4096 as stripesize · dd5c9311
      Chandan Rajendra 提交于
      Older btrfs-progs/mkfs.btrfs sets 4096 as the stripesize. Hence
      restricting stripesize to be equal to sectorsize would cause super block
      validation to return an error on architectures where PAGE_SIZE is not
      equal to 4096.
      
      Hence as a workaround, this commit allows stripesize to be set to 4096
      bytes.
      Signed-off-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      dd5c9311
    • Z
      btrfs: avoid blocking open_ctree from cleaner_kthread · 90c711ab
      Zygo Blaxell 提交于
      This fixes a problem introduced in commit 2f3165ec
      "btrfs: don't force mounts to wait for cleaner_kthread to delete one or more subvolumes".
      
      open_ctree eventually calls btrfs_replay_log which in turn calls
      btrfs_commit_super which tries to lock the cleaner_mutex, causing a
      recursive mutex deadlock during mount.
      
      Instead of playing whack-a-mole trying to keep up with all the
      functions that may want to lock cleaner_mutex, put all the cleaner_mutex
      lockers back where they were, and attack the problem more directly:
      keep cleaner_kthread asleep until the filesystem is mounted.
      
      When filesystems are mounted read-only and later remounted read-write,
      open_ctree did not set fs_info->open and neither does anything else.
      Set this flag in btrfs_remount so that neither btrfs_delete_unused_bgs
      nor cleaner_kthread get confused by the common case of "/" filesystem
      read-only mount followed by read-write remount.
      Signed-off-by: NZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      90c711ab
    • L
      Btrfs: check if extent buffer is aligned to sectorsize · c871b0f2
      Liu Bo 提交于
      Thanks to fuzz testing, we can pass an invalid bytenr to extent buffer
      via alloc_extent_buffer().  An unaligned eb can have more pages than it
      should have, which ends up extent buffer's leak or some corrupted content
      in extent buffer.
      
      This adds a warning to let us quickly know what was happening.
      
      Now that alloc_extent_buffer() no more returns NULL, this changes its
      caller and callers of its caller to match with the new error
      handling.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c871b0f2
  5. 06 6月, 2016 2 次提交
  6. 03 6月, 2016 1 次提交
  7. 26 5月, 2016 1 次提交
  8. 10 5月, 2016 2 次提交
  9. 06 5月, 2016 1 次提交
    • Z
      btrfs: don't force mounts to wait for cleaner_kthread to delete one or more subvolumes · 2f3165ec
      Zygo Blaxell 提交于
      During a mount, we start the cleaner kthread first because the transaction
      kthread wants to wake up the cleaner kthread.  We start the transaction
      kthread next because everything in btrfs wants transactions.  We do reloc
      recovery in the thread that was doing the original mount call once the
      transaction kthread is running.  This means that the cleaner kthread
      could already be running when reloc recovery happens (e.g. if a snapshot
      delete was started before a crash).
      
      Relocation does not play well with the cleaner kthread, so a mutex was
      added in commit 5f316481 "Btrfs: fix
      race between balance recovery and root deletion" to prevent both from
      being active at the same time.
      
      If the cleaner kthread is already holding the mutex by the time we get
      to btrfs_recover_relocation, the mount will be blocked until at least
      one deleted subvolume is cleaned (possibly more if the mount process
      doesn't get the lock right away).  During this time (which could be an
      arbitrarily long time on a large/slow filesystem), the mount process is
      stuck and the filesystem is unnecessarily inaccessible.
      
      Fix this by locking cleaner_mutex before we start cleaner_kthread, and
      unlocking the mutex after mount no longer requires it.  This ensures
      that the mounting process will not be blocked by the cleaner kthread.
      The cleaner kthread is already prepared for mutex contention and will
      just go to sleep until the mutex is available.
      Signed-off-by: NZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2f3165ec
  10. 29 4月, 2016 2 次提交
  11. 28 4月, 2016 1 次提交
  12. 05 4月, 2016 1 次提交
    • K
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov 提交于
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  13. 22 3月, 2016 4 次提交
  14. 14 3月, 2016 1 次提交
  15. 12 3月, 2016 1 次提交
  16. 23 2月, 2016 1 次提交
    • L
      Btrfs: fix lockdep deadlock warning due to dev_replace · 73beece9
      Liu Bo 提交于
      Xfstests btrfs/011 complains about a deadlock warning,
      
      [ 1226.649039] =========================================================
      [ 1226.649039] [ INFO: possible irq lock inversion dependency detected ]
      [ 1226.649039] 4.1.0+ #270 Not tainted
      [ 1226.649039] ---------------------------------------------------------
      [ 1226.652955] kswapd0/46 just changed the state of lock:
      [ 1226.652955]  (&delayed_node->mutex){+.+.-.}, at: [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
      [ 1226.652955] but this lock took another, RECLAIM_FS-unsafe lock in the past:
      [ 1226.652955]  (&fs_info->dev_replace.lock){+.+.+.}
      
      and interrupts could create inverse lock ordering between them.
      
      [ 1226.652955]
      other info that might help us debug this:
      [ 1226.652955] Chain exists of:
        &delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock
      
      [ 1226.652955]  Possible interrupt unsafe locking scenario:
      
      [ 1226.652955]        CPU0                    CPU1
      [ 1226.652955]        ----                    ----
      [ 1226.652955]   lock(&fs_info->dev_replace.lock);
      [ 1226.652955]                                local_irq_disable();
      [ 1226.652955]                                lock(&delayed_node->mutex);
      [ 1226.652955]                                lock(&found->groups_sem);
      [ 1226.652955]   <Interrupt>
      [ 1226.652955]     lock(&delayed_node->mutex);
      [ 1226.652955]
       *** DEADLOCK ***
      
      Commit 084b6e7c ("btrfs: Fix a lockdep warning when running xfstest.") tried
      to fix a similar one that has the exactly same warning, but with that, we still
      run to this.
      
      The above lock chain comes from
      btrfs_commit_transaction
        ->btrfs_run_delayed_items
          ...
          ->__btrfs_update_delayed_inode
            ...
            ->__btrfs_cow_block
               ...
               ->find_free_extent
                  ->cache_block_group
                    ->load_free_space_cache
                      ->btrfs_readpages
                        ->submit_one_bio
                          ...
                          ->__btrfs_map_block
                            ->btrfs_dev_replace_lock
      
      However, with high memory pressure, tasks which hold dev_replace.lock can
      be interrupted by kswapd and then kswapd is intended to release memory occupied
      by superblock, inodes and dentries, where we may call evict_inode, and it comes
      to
      
      [ 1226.652955]  [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
      [ 1226.652955]  [<ffffffff81459e74>] btrfs_remove_delayed_node+0x24/0x30
      [ 1226.652955]  [<ffffffff8140c5fe>] btrfs_evict_inode+0x34e/0x700
      
      delayed_node->mutex may be acquired in __btrfs_release_delayed_node(), and it leads
      to a ABBA deadlock.
      
      To fix this, we can use "blocking rwlock" used in the case of extent_buffer, but
      things are simpler here since we only needs read's spinlock to blocking lock.
      
      With this, btrfs/011 no more produces warnings in dmesg.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      73beece9
  17. 18 2月, 2016 3 次提交
  18. 12 2月, 2016 2 次提交
  19. 11 2月, 2016 1 次提交
  20. 30 1月, 2016 1 次提交
    • B
      x86/cpufeature: Replace the old static_cpu_has() with safe variant · bc696ca0
      Borislav Petkov 提交于
      So the old one didn't work properly before alternatives had run.
      And it was supposed to provide an optimized JMP because the
      assumption was that the offset it is jumping to is within a
      signed byte and thus a two-byte JMP.
      
      So I did an x86_64 allyesconfig build and dumped all possible
      sites where static_cpu_has() was used. The optimization amounted
      to all in all 12(!) places where static_cpu_has() had generated
      a 2-byte JMP. Which has saved us a whopping 36 bytes!
      
      This clearly is not worth the trouble so we can remove it. The
      only place where the optimization might count - in __switch_to()
      - we will handle differently. But that's not subject of this
      patch.
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1453842730-28463-6-git-send-email-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      bc696ca0
  21. 26 1月, 2016 1 次提交
    • D
      Revert "btrfs: clear PF_NOFREEZE in cleaner_kthread()" · 80ad623e
      David Sterba 提交于
      This reverts commit 69624913. The
      cleaner thread can block freezing when there's a snapshot cleaning in
      progress and the other threads get suspended first. From the logs
      provided by Martin we're waiting for reading extent pages:
      
      kernel: PM: Syncing filesystems ... done.
      kernel: Freezing user space processes ... (elapsed 0.015 seconds) done.
      kernel: Freezing remaining freezable tasks ...
      kernel: Freezing of tasks failed after 20.003 seconds (1 tasks refusing to freeze, wq_busy=0):
      kernel: btrfs-cleaner   D ffff88033dd13bc0     0   152      2 0x00000000
      kernel: ffff88032ebc2e00 ffff88032e750000 ffff88032e74fa50 7fffffffffffffff
      kernel: ffffffff814a58df 0000000000000002 ffffea000934d580 ffffffff814a5451
      kernel: 7fffffffffffffff ffffffff814a6e8f 0000000000000000 0000000000000020
      kernel: Call Trace:
      kernel: [<ffffffff814a58df>] ? bit_wait+0x2c/0x2c
      kernel: [<ffffffff814a5451>] ? schedule+0x6f/0x7c
      kernel: [<ffffffff814a6e8f>] ? schedule_timeout+0x2f/0xd8
      kernel: [<ffffffff81076f94>] ? timekeeping_get_ns+0xa/0x2e
      kernel: [<ffffffff81077603>] ? ktime_get+0x36/0x44
      kernel: [<ffffffff814a4f6c>] ? io_schedule_timeout+0x94/0xf2
      kernel: [<ffffffff814a4f6c>] ? io_schedule_timeout+0x94/0xf2
      kernel: [<ffffffff814a590b>] ? bit_wait_io+0x2c/0x30
      kernel: [<ffffffff814a5694>] ? __wait_on_bit+0x41/0x73
      kernel: [<ffffffff8109eba8>] ? wait_on_page_bit+0x6d/0x72
      kernel: [<ffffffff8105d718>] ? autoremove_wake_function+0x2a/0x2a
      kernel: [<ffffffff811a02d7>] ? read_extent_buffer_pages+0x1bd/0x203
      kernel: [<ffffffff8117d9e9>] ? free_root_pointers+0x4c/0x4c
      kernel: [<ffffffff8117e831>] ? btree_read_extent_buffer_pages.constprop.57+0x5a/0xe9
      kernel: [<ffffffff8117f4f3>] ? read_tree_block+0x2d/0x45
      kernel: [<ffffffff8116782a>] ? read_block_for_search.isra.34+0x22a/0x26b
      kernel: [<ffffffff811656c3>] ? btrfs_set_path_blocking+0x1e/0x4a
      kernel: [<ffffffff8116919b>] ? btrfs_search_slot+0x648/0x736
      kernel: [<ffffffff81170559>] ? btrfs_lookup_extent_info+0xb7/0x2c7
      kernel: [<ffffffff81170ee5>] ? walk_down_proc+0x9c/0x1ae
      kernel: [<ffffffff81171c9d>] ? walk_down_tree+0x40/0xa4
      kernel: [<ffffffff8117375f>] ? btrfs_drop_snapshot+0x2da/0x664
      kernel: [<ffffffff8104ff21>] ? finish_task_switch+0x126/0x167
      kernel: [<ffffffff811850f8>] ? btrfs_clean_one_deleted_snapshot+0xa6/0xb0
      kernel: [<ffffffff8117eaba>] ? cleaner_kthread+0x13e/0x17b
      kernel: [<ffffffff8117e97c>] ? btrfs_item_end+0x33/0x33
      kernel: [<ffffffff8104d256>] ? kthread+0x95/0x9d
      kernel: [<ffffffff8104d1c1>] ? kthread_parkme+0x16/0x16
      kernel: [<ffffffff814a7b5f>] ? ret_from_fork+0x3f/0x70
      kernel: [<ffffffff8104d1c1>] ? kthread_parkme+0x16/0x16
      
      As this affects a released kernel (4.4) we need a minimal fix for
      stable kernels.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=108361Reported-by: NMartin Ziegler <ziegler@uni-freiburg.de>
      CC: stable@vger.kernel.org # 4.4
      CC: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      80ad623e
  22. 25 1月, 2016 1 次提交
  23. 20 1月, 2016 2 次提交
    • Q
      btrfs: Enhance super validation check · 319e4d06
      Qu Wenruo 提交于
      Enhance btrfs_check_super_valid() function by the following points:
      1) Restrict sector/node size check
         Not the old max/min valid check, but also check if it's a power of 2.
         So some bogus number like 12K node size won't pass now.
      
      2) Super flag check
         For now, there is still some inconsistency between kernel and
         btrfs-progs super flags.
         And considering btrfs-progs may add new flags for super block, this
         check will only output warning.
      
      3) Better root alignment check
         Now root bytenr is checked against sector size.
      
      4) Move some check into btrfs_check_super_valid().
         Like node size vs leaf size check, and PAGESIZE vs sectorsize check.
         And magic number check.
      Reported-by: NVegard Nossum <vegard.nossum@oracle.com>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      319e4d06
    • F
      Btrfs: fix deadlock running delayed iputs at transaction commit time · c2d6cb16
      Filipe Manana 提交于
      While running a stress test I ran into a deadlock when running the delayed
      iputs at transaction time, which produced the following report and trace:
      
      [  886.399989] =============================================
      [  886.400871] [ INFO: possible recursive locking detected ]
      [  886.401663] 4.4.0-rc6-btrfs-next-18+ #1 Not tainted
      [  886.402384] ---------------------------------------------
      [  886.403182] fio/8277 is trying to acquire lock:
      [  886.403568]  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.403568]
      [  886.403568] but task is already holding lock:
      [  886.403568]  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.403568]
      [  886.403568] other info that might help us debug this:
      [  886.403568]  Possible unsafe locking scenario:
      [  886.403568]
      [  886.403568]        CPU0
      [  886.403568]        ----
      [  886.403568]   lock(&fs_info->delayed_iput_sem);
      [  886.403568]   lock(&fs_info->delayed_iput_sem);
      [  886.403568]
      [  886.403568]  *** DEADLOCK ***
      [  886.403568]
      [  886.403568]  May be due to missing lock nesting notation
      [  886.403568]
      [  886.403568] 3 locks held by fio/8277:
      [  886.403568]  #0:  (sb_writers#11){.+.+.+}, at: [<ffffffff81174c4c>] __sb_start_write+0x5f/0xb0
      [  886.403568]  #1:  (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa054620d>] btrfs_file_write_iter+0x73/0x408 [btrfs]
      [  886.403568]  #2:  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.403568]
      [  886.403568] stack backtrace:
      [  886.403568] CPU: 6 PID: 8277 Comm: fio Not tainted 4.4.0-rc6-btrfs-next-18+ #1
      [  886.403568] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [  886.403568]  0000000000000000 ffff88009f80f770 ffffffff8125d4fd ffffffff82af1fc0
      [  886.403568]  ffff88009f80f830 ffffffff8108e5f9 0000000200000000 ffff88009fd92290
      [  886.403568]  0000000000000000 ffffffff82af1fc0 ffffffff829cfb01 00042b216d008804
      [  886.403568] Call Trace:
      [  886.403568]  [<ffffffff8125d4fd>] dump_stack+0x4e/0x79
      [  886.403568]  [<ffffffff8108e5f9>] __lock_acquire+0xd42/0xf0b
      [  886.403568]  [<ffffffff810c22db>] ? __module_address+0xdf/0x108
      [  886.403568]  [<ffffffff8108eb77>] lock_acquire+0x10d/0x194
      [  886.403568]  [<ffffffff8108eb77>] ? lock_acquire+0x10d/0x194
      [  886.403568]  [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.489542]  [<ffffffff8148556b>] down_read+0x3e/0x4d
      [  886.489542]  [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.489542]  [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.489542]  [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs]
      [  886.489542]  [<ffffffffa0521d7a>] flush_space+0x435/0x44a [btrfs]
      [  886.489542]  [<ffffffffa052218b>] ? reserve_metadata_bytes+0x26a/0x384 [btrfs]
      [  886.489542]  [<ffffffffa05221ae>] reserve_metadata_bytes+0x28d/0x384 [btrfs]
      [  886.489542]  [<ffffffffa052256c>] ? btrfs_block_rsv_refill+0x58/0x96 [btrfs]
      [  886.489542]  [<ffffffffa0522584>] btrfs_block_rsv_refill+0x70/0x96 [btrfs]
      [  886.489542]  [<ffffffffa053d747>] btrfs_evict_inode+0x394/0x55a [btrfs]
      [  886.489542]  [<ffffffff81188e31>] evict+0xa7/0x15c
      [  886.489542]  [<ffffffff81189878>] iput+0x1d3/0x266
      [  886.489542]  [<ffffffffa053887c>] btrfs_run_delayed_iputs+0x8f/0xbf [btrfs]
      [  886.489542]  [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs]
      [  886.489542]  [<ffffffff81085096>] ? signal_pending_state+0x31/0x31
      [  886.489542]  [<ffffffffa0521191>] btrfs_alloc_data_chunk_ondemand+0x1d7/0x288 [btrfs]
      [  886.489542]  [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs]
      [  886.489542]  [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs]
      [  886.489542]  [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs]
      [  886.489542]  [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128
      [  886.489542]  [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs]
      [  886.489542]  [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50
      [  886.489542]  [<ffffffff8117279e>] __vfs_write+0x7c/0xa5
      [  886.489542]  [<ffffffff81172cda>] vfs_write+0xa0/0xe4
      [  886.489542]  [<ffffffff811734cc>] SyS_write+0x50/0x7e
      [  886.489542]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [ 1081.852335] INFO: task fio:8244 blocked for more than 120 seconds.
      [ 1081.854348]       Not tainted 4.4.0-rc6-btrfs-next-18+ #1
      [ 1081.857560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 1081.863227] fio        D ffff880213f9bb28     0  8244   8240 0x00000000
      [ 1081.868719]  ffff880213f9bb28 00ffffff810fc6b0 ffffffff0000000a ffff88023ed55240
      [ 1081.872499]  ffff880206b5d400 ffff880213f9c000 ffff88020a4d5318 ffff880206b5d400
      [ 1081.876834]  ffffffff00000001 ffff880206b5d400 ffff880213f9bb40 ffffffff81482ba4
      [ 1081.880782] Call Trace:
      [ 1081.881793]  [<ffffffff81482ba4>] schedule+0x7f/0x97
      [ 1081.883340]  [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325
      [ 1081.895525]  [<ffffffff8108d48d>] ? trace_hardirqs_on_caller+0x16/0x1ab
      [ 1081.897419]  [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20
      [ 1081.899251]  [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20
      [ 1081.901063]  [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21
      [ 1081.902365]  [<ffffffff814855bd>] down_write+0x43/0x57
      [ 1081.903846]  [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
      [ 1081.906078]  [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
      [ 1081.908846]  [<ffffffff8108d461>] ? mark_held_locks+0x56/0x6c
      [ 1081.910409]  [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs]
      [ 1081.912482]  [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs]
      [ 1081.914597]  [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs]
      [ 1081.919037]  [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128
      [ 1081.920754]  [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs]
      [ 1081.922496]  [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50
      [ 1081.923922]  [<ffffffff8117279e>] __vfs_write+0x7c/0xa5
      [ 1081.925275]  [<ffffffff81172cda>] vfs_write+0xa0/0xe4
      [ 1081.926584]  [<ffffffff811734cc>] SyS_write+0x50/0x7e
      [ 1081.927968]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [ 1081.985293] INFO: lockdep is turned off.
      [ 1081.986132] INFO: task fio:8249 blocked for more than 120 seconds.
      [ 1081.987434]       Not tainted 4.4.0-rc6-btrfs-next-18+ #1
      [ 1081.988534] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 1081.990147] fio        D ffff880218febbb8     0  8249   8240 0x00000000
      [ 1081.991626]  ffff880218febbb8 00ffffff81486b8e ffff88020000000b ffff88023ed75240
      [ 1081.993258]  ffff8802120a9a00 ffff880218fec000 ffff88020a4d5318 ffff8802120a9a00
      [ 1081.994850]  ffffffff00000001 ffff8802120a9a00 ffff880218febbd0 ffffffff81482ba4
      [ 1081.996485] Call Trace:
      [ 1081.997037]  [<ffffffff81482ba4>] schedule+0x7f/0x97
      [ 1081.998017]  [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325
      [ 1081.999241]  [<ffffffff810852a5>] ? finish_wait+0x6d/0x76
      [ 1082.000306]  [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20
      [ 1082.001533]  [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20
      [ 1082.002776]  [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21
      [ 1082.003995]  [<ffffffff814855bd>] down_write+0x43/0x57
      [ 1082.005000]  [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
      [ 1082.007403]  [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
      [ 1082.008988]  [<ffffffffa0545064>] btrfs_fallocate+0x7c1/0xc2f [btrfs]
      [ 1082.010193]  [<ffffffff8108a1ba>] ? percpu_down_read+0x4e/0x77
      [ 1082.011280]  [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0
      [ 1082.012265]  [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0
      [ 1082.013021]  [<ffffffff811712e4>] vfs_fallocate+0x170/0x1ff
      [ 1082.013738]  [<ffffffff81181ebb>] ioctl_preallocate+0x89/0x9b
      [ 1082.014778]  [<ffffffff811822d7>] do_vfs_ioctl+0x40a/0x4ea
      [ 1082.015778]  [<ffffffff81176ea7>] ? SYSC_newfstat+0x25/0x2e
      [ 1082.016806]  [<ffffffff8118b4de>] ? __fget_light+0x4d/0x71
      [ 1082.017789]  [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
      [ 1082.018706]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      
      This happens because we can recursively acquire the semaphore
      fs_info->delayed_iput_sem when attempting to allocate space to satisfy
      a file write request as shown in the first trace above - when committing
      a transaction we acquire (down_read) the semaphore before running the
      delayed iputs, and when running a delayed iput() we can end up calling
      an inode's eviction handler, which in turn commits another transaction
      and attempts to acquire (down_read) again the semaphore to run more
      delayed iput operations.
      This results in a deadlock because if a task acquires multiple times a
      semaphore it should invoke down_read_nested() with a different lockdep
      class for each level of recursion.
      
      Fix this by simplifying the implementation and use a mutex instead that
      is acquired by the cleaner kthread before it runs the delayed iputs
      instead of always acquiring a semaphore before delayed references are
      run from anywhere.
      
      Fixes: d7c15171 (btrfs: Fix NO_SPACE bug caused by delayed-iput)
      Cc: stable@vger.kernel.org   # 4.1+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      c2d6cb16