1. 17 Dec, 2018 40 commits
    • btrfs: dev-replace: remove custom read/write blocking scheme · 53176dde
      David Sterba committed
      After the rw semaphore has been added, the custom blocking using
      ::blocking_readers and ::read_lock_wq is redundant.
      
      The blocking logic in __btrfs_map_block is replaced by extending the
      time the semaphore is held, which has the same blocking effect on writes
      as the previous custom scheme that waited until ::blocking_readers was
      zero.
      
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: dev-replace: switch locking to rw semaphore · 129827e3
      David Sterba committed
      This is the first part of removing the custom locking and waiting scheme
      used for device replace. It was probably copied from extent buffer
      locking, but there's nothing that would require more than is provided by
      the common locking primitives.
      
      The rw spinlock protects the waiting tasks counter (used in case of
      incompatible locks) and the waitqueue. The rw semaphore does the same.
      
      This patch only switches the locking primitive, for better
      bisectability.  There should be no functional change other than the
      overhead of the locking and potential sleeping instead of spinning when
      the lock is contended.
      
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: reada: reorder dev-replace locks before radix tree preload · ceb21a8d
      David Sterba committed
      The device-replace read lock is going to use an rw semaphore in follow-up
      commits. The semaphore might sleep, which is not allowed in the radix
      tree preload section. The lock nesting is now:
      
      * device replace
        * radix tree preload
          * readahead spinlock
      
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Fix error handling in btrfs_cleanup_ordered_extents · d1051d6e
      Nikolay Borisov committed
      Running btrfs/124 in a loop hung up on me sporadically with the
      following call trace:
      
      	btrfs           D    0  5760   5324 0x00000000
      	Call Trace:
      	 ? __schedule+0x243/0x800
      	 schedule+0x33/0x90
      	 btrfs_start_ordered_extent+0x10c/0x1b0 [btrfs]
      	 ? wait_woken+0xa0/0xa0
      	 btrfs_wait_ordered_range+0xbb/0x100 [btrfs]
      	 btrfs_relocate_block_group+0x1ff/0x230 [btrfs]
      	 btrfs_relocate_chunk+0x49/0x100 [btrfs]
      	 btrfs_balance+0xbeb/0x1740 [btrfs]
      	 btrfs_ioctl_balance+0x2ee/0x380 [btrfs]
      	 btrfs_ioctl+0x1691/0x3110 [btrfs]
      	 ? lockdep_hardirqs_on+0xed/0x180
      	 ? __handle_mm_fault+0x8e7/0xfb0
      	 ? _raw_spin_unlock+0x24/0x30
      	 ? __handle_mm_fault+0x8e7/0xfb0
      	 ? do_vfs_ioctl+0xa5/0x6e0
      	 ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
      	 do_vfs_ioctl+0xa5/0x6e0
      	 ? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
      	 ksys_ioctl+0x3a/0x70
      	 __x64_sys_ioctl+0x16/0x20
      	 do_syscall_64+0x60/0x1b0
      	 entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      This happens because during page writeback it's valid for
      writepage_delalloc to instantiate a delalloc range which doesn't belong
      to the page currently being written back.
      
      The reason this case is valid is due to find_lock_delalloc_range
      returning any available range after the passed delalloc_start and
      ignoring whether the page under writeback is within that range.
      
      In turn ordered extents (OE) are always created for the returned range
      from find_lock_delalloc_range. If, however, a failure occurs while OE
      are being created then the clean up code in btrfs_cleanup_ordered_extents
      will be called.
      
      Unfortunately the code in btrfs_cleanup_ordered_extents doesn't consider
      the case of such a 'foreign' range being processed and instead always
      assumes that the range the OEs are created for belongs to the page. As a
      result, the first page of such a foreign range is not cleaned up, since
      the current cleanup code deliberately skips it.
      
      Fix this by correctly checking whether the current page belongs to the
      range being instantiated and, if so, adjusting the range parameters passed
      for cleaning up. If it doesn't, then just clean the whole OE range
      directly.
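      
      As a rough user-space illustration of the check described above (not the
      kernel patch itself), the range adjustment could look like the sketch
      below. The 4096-byte page size, the struct and the function name are
      assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL /* assumed page size for the illustration */

/* Byte range the cleanup code should process: [offset, offset + bytes). */
struct cleanup_range {
	uint64_t offset;
	uint64_t bytes;
};

/*
 * If the page under writeback lies entirely inside the ordered-extent range,
 * shrink the range by one page from the front, because that page is cleaned
 * up by the caller. A 'foreign' range is returned unchanged so that all of
 * it gets cleaned.
 */
static struct cleanup_range adjust_cleanup_range(uint64_t page_start,
						 uint64_t offset,
						 uint64_t bytes)
{
	struct cleanup_range r = { offset, bytes };
	uint64_t page_end = page_start + SKETCH_PAGE_SIZE - 1;

	assert(bytes > 0);

	if (page_start >= offset && page_end <= offset + bytes - 1) {
		r.offset += SKETCH_PAGE_SIZE;
		r.bytes -= SKETCH_PAGE_SIZE;
	}
	return r;
}
```

      With a range starting at 8192 and 16384 bytes long, a page at 8192
      belongs to the range and the first page is skipped, while a page at 0 is
      foreign and the range is left untouched.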
      
      Fixes: 52427260 ("btrfs: Handle delalloc error correctly to avoid ordered extent hang")
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove always true if branch in find_delalloc_range · 3522e903
      Lu Fengqi committed
      The @found is always false when it comes to the if branch. Besides, the
      bool type is more suitable for @found. Change the return value of the
      function and its caller to bool as well.
      
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: skip file_extent generation check for free_space_inode in run_delalloc_nocow · 27a7ff55
      Lu Fengqi committed
      The test case btrfs/001 with inode_cache mount option will encounter the
      following warning:
      
        WARNING: CPU: 1 PID: 23700 at fs/btrfs/inode.c:956 cow_file_range.isra.19+0x32b/0x430 [btrfs]
        CPU: 1 PID: 23700 Comm: btrfs Kdump: loaded Tainted: G        W  O      4.20.0-rc4-custom+ #30
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:cow_file_range.isra.19+0x32b/0x430 [btrfs]
        Call Trace:
         ? free_extent_buffer+0x46/0x90 [btrfs]
         run_delalloc_nocow+0x455/0x900 [btrfs]
         btrfs_run_delalloc_range+0x1a7/0x360 [btrfs]
         writepage_delalloc+0xf9/0x150 [btrfs]
         __extent_writepage+0x125/0x3e0 [btrfs]
         extent_write_cache_pages+0x1b6/0x3e0 [btrfs]
         ? __wake_up_common_lock+0x63/0xc0
         extent_writepages+0x50/0x80 [btrfs]
         do_writepages+0x41/0xd0
         ? __filemap_fdatawrite_range+0x9e/0xf0
         __filemap_fdatawrite_range+0xbe/0xf0
         btrfs_fdatawrite_range+0x1b/0x50 [btrfs]
         __btrfs_write_out_cache+0x42c/0x480 [btrfs]
         btrfs_write_out_ino_cache+0x84/0xd0 [btrfs]
         btrfs_save_ino_cache+0x551/0x660 [btrfs]
         commit_fs_roots+0xc5/0x190 [btrfs]
         btrfs_commit_transaction+0x2bf/0x8d0 [btrfs]
         btrfs_mksubvol+0x48d/0x4d0 [btrfs]
         btrfs_ioctl_snap_create_transid+0x170/0x180 [btrfs]
         btrfs_ioctl_snap_create_v2+0x124/0x180 [btrfs]
         btrfs_ioctl+0x123f/0x3030 [btrfs]
      
      The file extent generation of the free space inode is equal to the last
      snapshot of the file root, so the inode will be passed to cow_file_range.
      But the inode was created and its extents were preallocated in
      btrfs_save_ino_cache, so there are no cow copies on disk.
      
      The preallocated extent is not yet in the extent tree, and
      btrfs_cross_ref_exist will ignore the -ENOENT returned by
      check_committed_ref, so we can directly write the inode to the disk.
      
      Fixes: 78d4295b ("btrfs: lift some btrfs_cross_ref_exist checks in nocow path")
      CC: stable@vger.kernel.org # 4.18+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix fsync of files with multiple hard links in new directories · 41bd6067
      Filipe Manana committed
      The log tree has a long standing problem that when a file is fsync'ed we
      only check for new ancestors, created in the current transaction, by
      following only the hard link for which the fsync was issued. We follow the
      ancestors using the VFS' dget_parent() API. This means that if we create a
      new link for a file in a directory that is new (or in any other new
      ancestor directory) and then fsync the file using an old hard link, we end
      up not logging the new ancestor, and on log replay that new hard link and
      ancestor do not exist. In some cases, involving renames, the file will not
      exist at all.
      
      Example:
      
        mkfs.btrfs -f /dev/sdb
        mount /dev/sdb /mnt
      
        mkdir /mnt/A
        touch /mnt/foo
        ln /mnt/foo /mnt/A/bar
        xfs_io -c fsync /mnt/foo
      
        <power failure>
      
      In this example, after log replay only the hard link named 'foo' exists
      and directory A does not exist, which is unexpected. In other major Linux
      filesystems, such as ext4, xfs and f2fs, both hard links exist and so
      does directory A after mounting the filesystem again.
      
      Checking whether any ancestors are new and need to be logged was added in
      2009 by commit 12fcfd22 ("Btrfs: tree logging unlink/rename fixes"),
      but only for the ancestors of the hard link (dentry) for which the
      fsync was issued, instead of for all ancestors of all of the inode's
      hard links.
      
      So fix this by tracking the id of the last transaction where a hard link
      was created for an inode and then, on fsync, falling back to a full
      transaction commit when an inode has more than one hard link and at least
      one new hard link was created in the current transaction. This is the
      simplest solution since this is not a common use case (frequently adding
      hard links for which there's an ancestor created in the current
      transaction and then fsyncing the file). In case it ever becomes a common
      use case, a solution that consists of iterating the fs/subvol btree for
      each hard link and checking whether any ancestor is new could be
      implemented.
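      
      The decision rule described above can be sketched in user space roughly
      as follows. The struct, its field names and both functions are
      hypothetical and only model the rule stated in the text; they are not
      taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, trimmed-down inode state; field names are illustrative. */
struct sketch_inode {
	unsigned int nlink;       /* number of hard links */
	uint64_t last_link_trans; /* id of the last transaction that added a link */
};

/* Record that a hard link was added in transaction 'transid'. */
static void sketch_note_new_link(struct sketch_inode *inode, uint64_t transid)
{
	inode->nlink++;
	inode->last_link_trans = transid;
}

/*
 * Fall back to a full transaction commit when the inode has more than one
 * hard link and at least one of them was created in the current transaction.
 */
static bool sketch_fsync_needs_full_commit(const struct sketch_inode *inode,
					   uint64_t current_transid)
{
	assert(inode->last_link_trans <= current_transid);
	return inode->nlink > 1 && inode->last_link_trans == current_transid;
}
```

      In this model a file with a single link, or one whose links all predate
      the current transaction, keeps using the fast log path.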
      
      This solves many unexpected scenarios reported by Jayashree Mohan and
      Vijay Chidambaram, and for which there is a new test case for fstests
      under review.
      
      Fixes: 12fcfd22 ("Btrfs: tree logging unlink/rename fixes")
      CC: stable@vger.kernel.org # 4.4+
      Reported-by: Vijay Chidambaram <vvijay03@gmail.com>
      Reported-by: Jayashree Mohan <jayashree2912@gmail.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: drop extra enum initialization where using defaults · bbe339cc
      David Sterba committed
      The first auto-assigned value of an enum is 0, so we can rely on that and
      not initialize members where the auto-increment yields the same value.
      This is used for values that are not part of the on-disk format.
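      
      A small illustration of the C rule this relies on is given below. The
      enumerator names are hypothetical and do not correspond to the actual
      btrfs definitions.

```c
#include <assert.h>

/* Explicitly numbered variant, as the code might have looked before. */
enum sketch_state_explicit {
	SKETCH_E_ERROR = 0,
	SKETCH_E_REMOUNTING = 1,
	SKETCH_E_TRANS_ABORTED = 2,
};

/* Same values, relying on the implicit start at 0 and auto-increment. */
enum sketch_state_default {
	SKETCH_D_ERROR,         /* implicitly 0 */
	SKETCH_D_REMOUNTING,    /* implicitly 1 */
	SKETCH_D_TRANS_ABORTED, /* implicitly 2 */
};

/* Compile-time check that both spellings produce identical values. */
static_assert((int)SKETCH_D_ERROR == (int)SKETCH_E_ERROR,
	      "first enumerator defaults to 0");
static_assert((int)SKETCH_D_TRANS_ABORTED == (int)SKETCH_E_TRANS_ABORTED,
	      "auto-increment matches explicit numbering");
```

      Dropping the initializers is only safe when the numeric values are not
      persisted anywhere, which is why the text restricts it to values that
      are not part of the on-disk format.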
      
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: switch BTRFS_ORDERED_* to enums · 5b840301
      David Sterba committed
      We can use a simple enum for values that are not part of the on-disk
      format: ordered extent flags.
      
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: switch EXTENT_FLAG_* to enums · 50b5b602
      David Sterba committed
      We can use a simple enum for values that are not part of the on-disk
      format: extent map flags.
      
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: switch EXTENT_BUFFER_* to enums · 80cb3836
      David Sterba committed
      We can use a simple enum for values that are not part of the on-disk
      format: extent buffer flags.
      
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: switch BTRFS_ROOT_* to enums · 61fa90c1
      David Sterba committed
      We can use a simple enum for values that are not part of the on-disk
      format: root tree flags.
      
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: switch BTRFS_FS_* to enums · eb1a524c
      David Sterba committed
      We can use a simple enum for values that are not part of the on-disk
      format: internal filesystem states.
      
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: switch BTRFS_BLOCK_RSV_* to enums · 688a75b9
      David Sterba committed
      We can use a simple enum for values that are not part of the on-disk
      format: block reserve types.
      
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: switch BTRFS_FS_STATE_* to enums · b00146b5
      David Sterba committed
      We can use a simple enum for values that are not part of the on-disk
      format: global filesystem states.
      
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Refactor btrfs_merge_bio_hook · da12fe54
      Nikolay Borisov committed
      This function really checks whether adding more data to the bio will
      straddle a stripe/chunk. So first let's give it a more appropriate name
      - btrfs_bio_fits_in_stripe. Secondly, the offset parameter was never
      used, so just remove it. Thirdly, pages are submitted to either btree or
      data inodes, so it's guaranteed that tree->ops is set; replace the
      check with an ASSERT. Finally, document the parameters of the function.
      No functional changes.
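      
      To make the "straddles a stripe" idea concrete, here is a simplified
      user-space model. It assumes fixed-size, back-to-back stripes and replaces
      the chunk mapping lookup with plain arithmetic; the function name and
      parameters are hypothetical and do not mirror the kernel signature.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true when 'add_len' more bytes can be appended to a bio that starts
 * at 'bio_start' and already holds 'bio_len' bytes without crossing the end
 * of the stripe containing 'bio_start'.
 */
static bool sketch_bio_fits_in_stripe(uint64_t bio_start, uint64_t bio_len,
				      uint64_t add_len, uint64_t stripe_len)
{
	uint64_t left_in_stripe;

	assert(stripe_len > 0);

	/* Bytes between the start of the bio and the end of its stripe. */
	left_in_stripe = stripe_len - (bio_start % stripe_len);
	return bio_len + add_len <= left_in_stripe;
}
```

      For a 64KiB stripe, a bio starting 4KiB before the stripe end can take
      exactly one more 4KiB page; a second page would straddle the boundary.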
      
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: cleanup the useless DEFINE_WAIT in cleanup_transaction · 2ab4fd31
      Lu Fengqi committed
      When it was introduced in commit f094ac32 ("Btrfs: fix NULL pointer
      after aborting a transaction"), it was not used.
      
      Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: document extent mapping assumptions in checksum · d2e174d5
      Johannes Thumshirn committed
      Document why map_private_extent_buffer() cannot return '1' (i.e. the map
      spans two pages) for the csum_tree_block() case.
      
      The current algorithm for detecting a page boundary crossing in
      map_private_extent_buffer() will return a '1' *IFF* the extent buffer's
      offset in the page + the offset passed in by csum_tree_block() and the
      minimal length passed in by csum_tree_block() - 1 are bigger than
      PAGE_SIZE.
      
      We always pass BTRFS_CSUM_SIZE (32) as offset and a minimal length of 32,
      and the current extent buffer allocator always guarantees page-aligned
      extents, so the above condition can't be true.
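      
      The arithmetic can be checked with a small user-space sketch. It
      expresses the boundary test as a comparison of the first and last page
      index touched by the mapping, which is an assumption about how the check
      is structured; the 4096-byte page size and all names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size */
#define SKETCH_CSUM_SIZE 32u   /* BTRFS_CSUM_SIZE as stated in the text */

/*
 * Return true when the bytes [start, start + min_len) of an extent buffer
 * that begins 'eb_offset_in_page' bytes into its first page span two pages.
 */
static bool sketch_map_crosses_page(size_t eb_offset_in_page, size_t start,
				    size_t min_len)
{
	size_t first, last;

	assert(min_len > 0);

	first = (eb_offset_in_page + start) / SKETCH_PAGE_SIZE;
	last = (eb_offset_in_page + start + min_len - 1) / SKETCH_PAGE_SIZE;
	return first != last;
}
```

      With a page-aligned extent buffer (offset 0), start 32 and length 32, the
      last byte touched is byte 63, so both indexes are 0 and no crossing is
      reported.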
      
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't initialize 'offset' in map_private_extent_buffer() · cc2c39d6
      Johannes Thumshirn committed
      In map_private_extent_buffer() the 'offset' variable is initialized to a
      page aligned version of the 'start' parameter.
      
      But later on it is overwritten with either the offset from the extent
      buffer's start or 0.
      
      So get rid of the initial initialization.
      
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix deadlock with memory reclaim during scrub · a5fb1142
      Filipe Manana committed
      When a transaction commit starts, it attempts to pause scrub and it blocks
      until the scrub is paused. So while the transaction is blocked waiting for
      scrub to pause, we can not do memory allocation with GFP_KERNEL from scrub,
      otherwise we risk getting into a deadlock with reclaim.
      
      Checking for scrub pause requests is done early at the beginning of the
      while loop of scrub_stripe() and later in the loop, scrub_extent() and
      scrub_raid56_parity() are called, which in turn call scrub_pages() and
      scrub_pages_for_parity() respectively. These last two functions do memory
      allocations using GFP_KERNEL. Same problem could happen while scrubbing
      the super blocks, since it calls scrub_pages().
      
      We also can not have any of the worker tasks, created by the scrub task,
      doing GFP_KERNEL allocations, because before pausing, the scrub task waits
      for all the worker tasks to complete (also done at scrub_stripe()).
      
      So make sure GFP_NOFS is used for the memory allocations because at any
      time a scrub pause request can happen from another task that started to
      commit a transaction.
      
      Fixes: 58c4e173 ("btrfs: scrub: use GFP_KERNEL on the submission path")
      CC: stable@vger.kernel.org # 4.6+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Remove extent_io_ops::readpage_io_failed_hook · 78e62c02
      Nikolay Borisov committed
      For data inodes this hook does nothing but return -EAGAIN, which is
      used to signal to the endio routines that this bio belongs to a data
      inode. If this is the case the actual retrying is handled by
      bio_readpage_error. Alternatively, if this bio belongs to the btree
      inode then btree_io_failed_hook just does some cleanup and doesn't retry
      anything.
      
      This patch simplifies the code flow by eliminating
      readpage_io_failed_hook and instead open-coding btree_io_failed_hook in
      end_bio_extent_readpage. Also eliminate some needless checks since IO is
      always performed on either data inode or btree inode, both of which are
      guaranteed to have their extent_io_tree::ops set.
      
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove btrfs_bio_end_io_t · 7b41ba71
      Johannes Thumshirn committed
      The btrfs_bio_end_io_t typedef was introduced with commit
      a1d3c478 ("btrfs: btrfs_multi_bio replaced with btrfs_bio")
      but never used anywhere. This commit also introduced a forward declaration
      of 'struct btrfs_bio' which is only needed for btrfs_bio_end_io_t.
      
      Remove both as they're not needed anywhere.
      
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: replace btrfs_io_bio::end_io with a simple helper · b3a0dd50
      David Sterba committed
      The end_io callback implemented as btrfs_io_bio_endio_readpage only
      calls kfree. Also the callback is set only in case the csum buffer is
      allocated and not pointing to the inline buffer. We can use that
      information to drop the indirection and call a helper that will free the
      csums only in the right case.
      
      This shrinks struct btrfs_io_bio by 8 bytes.
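      
      A user-space sketch of the idea follows: instead of an end_io function
      pointer, a helper frees the checksum buffer only when it is not the
      inline one. The struct layout, the 64-byte inline capacity and the
      function names are assumptions for illustration, with malloc/free
      standing in for the kernel allocator.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_INLINE_CSUM_SIZE 64 /* assumed inline capacity */

struct sketch_io_bio {
	uint8_t *csum; /* points at csum_inline or at a heap buffer */
	uint8_t csum_inline[SKETCH_INLINE_CSUM_SIZE];
};

/* Use the inline buffer when it is large enough, else allocate. 0 or -1. */
static int sketch_io_bio_init_csum(struct sketch_io_bio *b, size_t need)
{
	assert(need > 0);

	if (need <= sizeof(b->csum_inline)) {
		b->csum = b->csum_inline;
		return 0;
	}
	b->csum = malloc(need);
	return b->csum ? 0 : -1;
}

/* Free the checksum buffer only if it was heap-allocated. */
static void sketch_io_bio_free_csum(struct sketch_io_bio *b)
{
	if (b->csum != b->csum_inline) {
		free(b->csum);
		b->csum = NULL;
	}
}
```

      Comparing the pointer against the inline buffer carries the same
      information the callback pointer used to carry, so no extra member is
      needed to remember which case applies.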
      
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove redundant csum buffer in btrfs_io_bio · 31fecccb
      David Sterba committed
      The io_bio tracks checksums and has an inline buffer or an allocated
      one. And there's a third member that points to the right one, but we
      don't need to use an extra pointer for that. Let btrfs_io_bio::csum
      point to the right buffer and check that the inline buffer is not
      accidentally freed.
      
      This shrinks struct btrfs_io_bio by 8 bytes.
      
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: replace async_cow::root with fs_info · 600b6cf4
      David Sterba committed
      The async_cow::root is used to propagate fs_info to async_cow_submit.
      We can't use the inode to reach it because the inode could become NULL
      after a write without compression in async_cow_start.
      
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: merge btrfs_submit_bio_done to its caller · 06ea01b1
      David Sterba committed
      There's one caller and its code is simple, so we can open code it in
      run_one_async_done. The errors are passed through the bio.
      
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: balance: print to system log when balance ends or is paused · 7333bd02
      Anand Jain committed
      Print a kernel log message when the balance ends (cancelled or
      completed) or is paused.
      
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: balance: print args during start and resume · 56fc37d9
      Anand Jain committed
      The information about balance arguments is important for system audit,
      so this patch prints the textual representation when balance starts or
      is resumed.
      
      Example command:
      
       $ btrfs balance start -f -mprofiles=raid1,convert=single,soft -dlimit=10..20,usage=50 /btrfs
      
      Example kernel log output:
      
       BTRFS info (device sdb): balance: start -f -dusage=50,limit=10..20 -mconvert=single,soft,profiles=raid1 -sconvert=single,soft,profiles=raid1
      
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update changelog, simplify code ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add helper to describe block group flags · f89e09cf
      Anand Jain committed
      Factor out a helper that describes block group flags from
      describe_relocation. The result will not be longer than the given size.
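      
      A bounded "describe flags" helper of this kind could be sketched in user
      space as below. The flag bits, their names, the '|' separator and the
      function name are all assumptions for illustration; only the property
      that the output never exceeds the given size comes from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical flag bits, chosen for the illustration only. */
#define SKETCH_BG_DATA     (1u << 0)
#define SKETCH_BG_SYSTEM   (1u << 1)
#define SKETCH_BG_METADATA (1u << 2)

static const struct {
	uint32_t bit;
	const char *name;
} sketch_bg_names[] = {
	{ SKETCH_BG_DATA, "data" },
	{ SKETCH_BG_SYSTEM, "system" },
	{ SKETCH_BG_METADATA, "metadata" },
};

/*
 * Write a '|'-separated description of 'flags' into 'buf'. At most
 * 'size_buf' bytes are written, including the terminating NUL; output that
 * does not fit is truncated.
 */
static void sketch_describe_flags(uint32_t flags, char *buf, size_t size_buf)
{
	size_t used = 0;
	size_t i;

	assert(buf != NULL && size_buf > 0);
	buf[0] = '\0';

	for (i = 0; i < sizeof(sketch_bg_names) / sizeof(sketch_bg_names[0]); i++) {
		int n;

		if (!(flags & sketch_bg_names[i].bit))
			continue;
		n = snprintf(buf + used, size_buf - used, "%s%s",
			     used ? "|" : "", sketch_bg_names[i].name);
		if (n < 0 || (size_t)n >= size_buf - used)
			return; /* truncated, buf stays NUL-terminated */
		used += (size_t)n;
	}
}
```

      Tracking the bytes already used and passing the remaining space to
      snprintf() is what keeps the result within the caller's buffer.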
      
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add comments ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix deadlock when enabling quotas due to concurrent snapshot creation · 9a6f209e
      Filipe Manana committed
      If the quota enable and snapshot creation ioctls are called concurrently
      we can get into a deadlock where the task enabling quotas will deadlock
      on the fs_info->qgroup_ioctl_lock mutex because it attempts to lock it
      twice, or the task creating a snapshot tries to commit the transaction
      while the task enabling quota waits for the former task to commit the
      transaction while holding the mutex. The following time diagrams show how
      both cases happen.
      
      First scenario:
      
                 CPU 0                                    CPU 1
      
       btrfs_ioctl()
        btrfs_ioctl_quota_ctl()
         btrfs_quota_enable()
          mutex_lock(fs_info->qgroup_ioctl_lock)
          btrfs_start_transaction()
      
                                                   btrfs_ioctl()
                                                    btrfs_ioctl_snap_create_v2
                                                     create_snapshot()
                                                      --> adds snapshot to the
                                                          list pending_snapshots
                                                          of the current
                                                          transaction
      
          btrfs_commit_transaction()
           create_pending_snapshots()
             create_pending_snapshot()
              qgroup_account_snapshot()
               btrfs_qgroup_inherit()
                 mutex_lock(fs_info->qgroup_ioctl_lock)
                  --> deadlock, mutex already locked
                      by this task at
                      btrfs_quota_enable()
      
      Second scenario:
      
                 CPU 0                                    CPU 1
      
       btrfs_ioctl()
        btrfs_ioctl_quota_ctl()
         btrfs_quota_enable()
          mutex_lock(fs_info->qgroup_ioctl_lock)
          btrfs_start_transaction()
      
                                                   btrfs_ioctl()
                                                    btrfs_ioctl_snap_create_v2
                                                     create_snapshot()
                                                      --> adds snapshot to the
                                                          list pending_snapshots
                                                          of the current
                                                          transaction
      
                                                      btrfs_commit_transaction()
                                                       --> waits for task at
                                                           CPU 0 to release
                                                           its transaction
                                                           handle
      
          btrfs_commit_transaction()
           --> sees another task started
               the transaction commit first
           --> releases its transaction
               handle
           --> waits for the transaction
               commit to be completed by
               the task at CPU 1
      
                                                       create_pending_snapshot()
                                                        qgroup_account_snapshot()
                                                         btrfs_qgroup_inherit()
                                                          mutex_lock(fs_info->qgroup_ioctl_lock)
                                                           --> deadlock, task at CPU 0
                                                               has the mutex locked but
                                                               it is waiting for us to
                                                               finish the transaction
                                                               commit
      
      So fix this by setting the quota enabled flag in fs_info after committing
      the transaction at btrfs_quota_enable(). This ends up serializing quota
      enable and snapshot creation as if the snapshot creation happened just
      before the quota enable request. The quota rescan task, scheduled after
      committing the transaction in btrfs_quota_enable(), will do the accounting.
      
      Fixes: 6426c7ad ("btrfs: qgroup: Fix qgroup accounting when creating snapshot")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix access to available allocation bits when starting balance · 5a8067c0
      Filipe Manana committed
      The available allocation bits members from struct btrfs_fs_info are
      protected by a sequence lock, and when starting balance we access them
      incorrectly in two different ways:
      
      1) In the read sequence lock loop at btrfs_balance() we use the values we
         read from fs_info->avail_*_alloc_bits and we can immediately do actions
         that have side effects and can not be undone (printing a message and
         jumping to a label). This is wrong because a retry might be needed, so
         our actions must not have side effects and must be repeatable as long
         as read_seqretry() returns a non-zero value. In other words, we were
         essentially ignoring the sequence lock;
      
      2) Right below the read sequence lock loop, we were reading the values
         from avail_metadata_alloc_bits and avail_data_alloc_bits without any
         protection from concurrent writers, that is, reading them outside of
         the read sequence lock critical section.
      
      So fix this by making sure we only read the available allocation bits
      while in a read sequence lock critical section and that what we do in the
      critical section is repeatable (has nothing that can not be undone) so
      that any eventual retry that is needed is handled properly.
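      
      The pattern described above (copy values inside the retry loop, act on
      them only afterwards) can be modelled in user space with C11 atomics as
      below. This is a simplified single-writer sequence counter, not the
      kernel's seqlock API; all type and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct sketch_alloc_bits {
	atomic_uint seq; /* odd while a writer is mid-update */
	_Atomic uint64_t data;
	_Atomic uint64_t metadata;
};

struct sketch_snapshot {
	uint64_t data;
	uint64_t metadata;
};

/* Writer side; a single writer at a time is assumed. */
static void sketch_write_bits(struct sketch_alloc_bits *b, uint64_t data,
			      uint64_t metadata)
{
	assert((atomic_load(&b->seq) & 1u) == 0);
	atomic_fetch_add(&b->seq, 1); /* odd: update in progress */
	atomic_store(&b->data, data);
	atomic_store(&b->metadata, metadata);
	atomic_fetch_add(&b->seq, 1); /* even: update published */
}

/* Reader side: the loop body only copies values, so it is repeatable. */
static struct sketch_snapshot sketch_read_bits(struct sketch_alloc_bits *b)
{
	struct sketch_snapshot snap;
	unsigned int start;

	do {
		/* Wait until no writer is in progress. */
		while ((start = atomic_load(&b->seq)) & 1u)
			;
		snap.data = atomic_load(&b->data);
		snap.metadata = atomic_load(&b->metadata);
	} while (atomic_load(&b->seq) != start); /* retry if a writer ran */

	return snap;
}

/* Anything with side effects is decided on the snapshot, after the loop. */
static bool sketch_profile_allowed(const struct sketch_snapshot *snap,
				   uint64_t wanted)
{
	return (snap->data & wanted) == wanted;
}
```

      Printing a message or jumping to an error label would go after
      sketch_read_bits() returns, never inside its loop.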
      
      Fixes: de98ced9 ("Btrfs: use seqlock to protect fs_info->avail_{data, metadata, system}_alloc_bits")
      Fixes: 14506127 ("btrfs: fix a bogus warning when converting only data or metadata")
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: allow clear_extent_dirty() to receive a cached extent state record · 0e6ec385
      Filipe Manana committed
      We can have a lot of freed extents during the life span of a transaction, so
      the red black tree that keeps track of the ranges of each freed extent
      (fs_info->freed_extents[]) can get quite big. When finishing a
      transaction commit we find each range, process it (discard the extents,
      unpin them) and then remove it from the red black tree.
      
      We can use an extent state record as a cache when searching for a range,
      so that when we clean the range we can use the cached extent state we
      passed to the search function instead of iterating the red black tree
      again. Doing things as fast as possible when finishing a transaction (in
      state TRANS_STATE_UNBLOCKED) is convenient as it reduces the time we
      block another task that wants to commit the next transaction.
      
      So change clear_extent_dirty() to allow an optional extent state record to
      be passed as an argument, which will be passed down to __clear_extent_bit.
      
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Handle final split-brain possibility during fsid change · cc5de4e7
      Nikolay Borisov committed
      This patch lands the last case which needs to be handled by the fsid
      change code. Namely, this is the case where a multidisk filesystem has
      already undergone at least one successful fsid change, i.e. all disks
      have the METADATA_UUID incompat bit, and power failure occurs while another
      fsid change is in progress. When such an event occurs, disks could be
      split in 2 groups. One of the groups will have both METADATA_UUID and
      CHANGING_FSID_V2 flags set coupled with old fsid/metadata_uuid pairs.
      The other group of disks will have only METADATA_UUID bit set and their
      fsid will be different than the one in disks in the first group. Here
      we look at the following cases:
      
        a) A disk from the first group is scanned first, so fs_devices is
        created with stale fsid/metadata_uuid. Then when a disk from the
        second group is scanned it needs to first check whether there exists
        such an fs_devices that has fsid_change set to true (because it was
        created with a disk having the CHANGING_FSID_V2 flag), the
        metadata_uuid and fsid of the fs_devices will be different (since it was
        created by a disk which already has had at least 1 successful fsid change)
        and finally the metadata_uuid of the fs_devices will equal that of the
        currently scanned disk (because metadata_uuid never really changes).
        When the correct fs_devices is found the information from the scanned
        disk will replace the current one in fs_devices since the scanned disk
        will have higher generation number.
      
        b) A disk from the second group is scanned first, so fs_devices is created
        as usual with differing fsid/metadata_uuid. Then when a disk from the
        first group is scanned the code detects that it has both
        CHANGING_FSID_V2 and METADATA_UUID flags set and will search for
        fs_devices that has differing metadata_uuid/fsid and whose
        metadata_uuid is the same as that of the scanned device.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cc5de4e7
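The two matching conditions listed in cc5de4e7 can be written down as predicates. The following is a minimal sketch that follows the wording of the commit message only; the class and function names (`FsDevices`, `Disk`, `matches_case_a`, `matches_case_b`, `adopt_newer_disk`) are hypothetical, strings stand in for UUIDs, and the exact update rule in `adopt_newer_disk` is an assumption.

```python
from dataclasses import dataclass


@dataclass
class FsDevices:
    """Toy stand-in for an fs_devices entry."""
    fsid: str
    metadata_uuid: str
    fsid_change: bool
    latest_generation: int


@dataclass
class Disk:
    """Toy view of a scanned disk that already has METADATA_UUID set."""
    fsid: str
    metadata_uuid: str
    changing_fsid: bool  # CHANGING_FSID_V2 flag
    generation: int


def matches_case_a(fs: FsDevices, disk: Disk) -> bool:
    """Second-group disk meets an entry created from a first-group disk."""
    return (not disk.changing_fsid
            and fs.fsid_change
            and fs.metadata_uuid != fs.fsid
            and fs.metadata_uuid == disk.metadata_uuid)


def matches_case_b(fs: FsDevices, disk: Disk) -> bool:
    """First-group disk meets an entry created from a second-group disk."""
    return (disk.changing_fsid
            and fs.metadata_uuid != fs.fsid
            and fs.metadata_uuid == disk.metadata_uuid)


def adopt_newer_disk(fs: FsDevices, disk: Disk) -> None:
    """Assumed rule: a newer generation replaces the stale identifiers."""
    if fs.fsid_change and disk.generation > fs.latest_generation:
        fs.fsid = disk.fsid
        fs.metadata_uuid = disk.metadata_uuid
        fs.fsid_change = False
        fs.latest_generation = disk.generation
```

Both predicates hinge on the metadata_uuid, which is the one identifier that stays the same across fsid changes, so it is the only reliable key for pairing the two groups of disks.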
    • N
      btrfs: Handle one more split-brain scenario during fsid change · 7a62d0f0
      Nikolay Borisov committed
      This commit continues hardening the scanning code to handle cases where
      power loss could have caused disks in a multi-disk filesystem to be
      in inconsistent state. Namely handle the situation that can occur when
      some of the disks in a multi-disk fs have completed their fsid change, i.e.
      they have METADATA_UUID incompat flag set, have cleared the
      CHANGING_FSID_V2 flag and their fsid/metadata_uuid are different. At
      the same time the other half of the disks will have their
      fsid/metadata_uuid unchanged and will only have CHANGING_FSID_V2 flag.
      
      This is handled by introducing code in the scan path which:
      
       a) Handles the case when a device with CHANGING_FSID_V2 flag is
       scanned and as a result btrfs_fs_devices is created with matching
       fsid/metadata_uuid. Subsequently, when a device with completed fsid
       change is scanned it will detect this via the new code in find_fsid,
       i.e. that an fs_devices exists whose fsid_change flag is set to true,
       whose metadata_uuid/fsid match and whose metadata_uuid matches that of
       the scanned device. In this case, it's important to
       note that the device which has its fsid change completed will have a
       higher generation number than the device with CHANGING_FSID_V2 flag
       set, so its superblock will be used during mount. To prevent an
       assertion triggering because the sb used for mounting will have
       differing fsid/metadata_uuid than the ones in the fs_devices struct
       also add code in device_list_add which overwrites the values in
       fs_devices.
      
       b) Alternatively, a device that completed its fsid change can be
       scanned first, which will create the respective
       btrfs_fs_devices struct with differing fsid/metadata_uuid. In this
       case when a device with CHANGING_FSID_V2 flag set is scanned it will
       call the newly added find_fsid_inprogress function which will return
       the correct fs_devices.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7a62d0f0
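To see why both scan orders from 7a62d0f0 end up in the same state, the scan path can be modelled end to end. This is a simplified sketch, not the kernel implementation: `FsDevices`, `Disk`, `find_entry` and `scan` are hypothetical names, strings stand in for UUIDs, and the predicates are inferred from the commit message rather than copied from find_fsid, find_fsid_inprogress or device_list_add.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FsDevices:
    """Toy stand-in for btrfs_fs_devices."""
    fsid: str
    metadata_uuid: str
    fsid_change: bool
    latest_generation: int = 0


@dataclass
class Disk:
    """Toy view of a scanned superblock."""
    fsid: str
    metadata_uuid: str   # equals fsid while METADATA_UUID is not set
    changing_fsid: bool  # CHANGING_FSID_V2 flag
    generation: int


def find_entry(registry: List[FsDevices], disk: Disk) -> Optional[FsDevices]:
    # Ordinary case: both identifiers match an existing entry.
    for fs in registry:
        if fs.fsid == disk.fsid and fs.metadata_uuid == disk.metadata_uuid:
            return fs
    for fs in registry:
        # Case a): completed disk, entry created by an in-progress disk.
        if (not disk.changing_fsid and fs.fsid_change
                and fs.fsid == fs.metadata_uuid
                and fs.metadata_uuid == disk.metadata_uuid):
            return fs
        # Case b): in-progress disk, entry created by a completed disk.
        if (disk.changing_fsid and fs.fsid != fs.metadata_uuid
                and fs.metadata_uuid == disk.fsid):
            return fs
    return None


def scan(registry: List[FsDevices], disk: Disk) -> FsDevices:
    fs = find_entry(registry, disk)
    if fs is None:
        fs = FsDevices(disk.fsid, disk.metadata_uuid, disk.changing_fsid)
        registry.append(fs)
    # Overwrite stale values so they agree with the superblock used at mount.
    if (fs.fsid_change and not disk.changing_fsid
            and disk.generation > fs.latest_generation):
        fs.fsid, fs.metadata_uuid = disk.fsid, disk.metadata_uuid
        fs.fsid_change = False
    fs.latest_generation = max(fs.latest_generation, disk.generation)
    return fs
```

Scanning an in-progress disk `Disk("OLD", "OLD", True, 10)` and a completed disk `Disk("NEW", "OLD", False, 11)` in either order leaves a single entry with fsid `"NEW"` and metadata_uuid `"OLD"` in this model.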
    • N
      btrfs: add members to fs_devices to track fsid changes · d1a63002
      Nikolay Borisov committed
      In order to gracefully handle split-brain scenarios during fsid change
      (which are very unlikely, yet possible), two more pieces of information
      will be necessary:
      
      1. The highest generation number among all devices registered to a
         particular btrfs_fs_devices
      
      2. A boolean flag indicating whether a given btrfs_fs_devices was created
         by a device which had the CHANGING_FSID_V2 flag set.
      
      This is a preparatory patch and just introduces the variables as well
      as code which sets them, their actual use is going to happen in a later
      patch.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d1a63002
    • N
      btrfs: Add handling for disk split-brain scenario during fsid change · fbc6feae
      Nikolay Borisov committed
      Even though fsid change without rewrite is a very quick operation it's
      still possible to experience a split-brain scenario if power loss occurs
      at the most inconvenient time. This patch handles the case where power
      failure occurs while the first transaction (the one setting the
      CHANGING_FSID_V2 flag) is being persisted on disk. This can cause the
      btrfs_fs_devices of this filesystem to be created by a device which:
      
       a) has the CHANGING_FSID_V2 flag set but its fsid value is intact
      
       b) or a device which doesn't have CHANGING_FSID_V2 flag set and its
          fsid value is intact
      
      This situation is trivially handled by the current find_fsid code since
      in both cases the devices are going to be treated like ordinary devices.
      Since btrfs is always mounted using the superblock of the latest
      device (the one with highest generation number), meaning it will have
      the CHANGING_FSID_V2 flag set, ensure it's being cleared on mount. On
      the first transaction commit following mount all disks will have it
      cleared.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      fbc6feae
    • N
      btrfs: Remove fsid/metadata_fsid fields from btrfs_info · de37aa51
      Nikolay Borisov committed
      Currently btrfs_fs_info structure contains a copy of the
      fsid/metadata_uuid fields. Same values are also contained in the
      btrfs_fs_devices structure which fs_info has a reference to. Let's
      reduce duplication by removing the fields from fs_info and always refer
      to the ones in fs_devices. No functional changes.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      de37aa51
    • N
      btrfs: Add sysfs support for metadata_uuid feature · 56f20f40
      Nikolay Borisov committed
      Since the metadata_uuid is a new incompat feature it requires the
      respective sysfs hooks. This patch adds the 'metadata_uuid' feature to
      be shown if it is supported by the kernel. Additionally it adds
      /sys/fs/btrfs/UUID/metadata_uuid attribute which allows one to read
      the current metadata_uuid.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      56f20f40
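Reading the attribute added by 56f20f40 from user space could look like the sketch below. The path layout `/sys/fs/btrfs/UUID/metadata_uuid` comes from the commit message; the helper name `read_metadata_uuid` and its `sysfs_root` parameter are hypothetical, and returning `None` when the file is missing is an assumption about how a caller might want to treat older kernels or unmounted filesystems.

```python
from pathlib import Path
from typing import Optional


def read_metadata_uuid(fsid: str, sysfs_root: str = "/sys/fs/btrfs") -> Optional[str]:
    """Return the metadata_uuid of the mounted filesystem with this FSID.

    Returns None if the attribute does not exist, for example when the
    filesystem is not mounted or the kernel lacks the feature.
    """
    attr = Path(sysfs_root) / fsid / "metadata_uuid"
    try:
        return attr.read_text().strip()
    except FileNotFoundError:
        return None
```

The `sysfs_root` parameter exists only so the helper can be pointed at a scratch directory when testing it without a mounted btrfs filesystem.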
    • N
      btrfs: Introduce support for FSID change without metadata rewrite · 7239ff4b
      Nikolay Borisov committed
      The new metadata_uuid superblock field is going to be used when the user wants to change the UUID
      of the filesystem without having to rewrite all metadata blocks. This
      field adds another level of indirection such that when the FSID is
      changed what really happens is the current UUID (the one with which the
      fs was created) is copied to the 'metadata_uuid' field in the superblock
      and a new incompat flag, METADATA_UUID, is set. When the kernel
      detects this flag is set it knows that the superblock in fact has 2
      UUIDs:
      
      1. The UUID which is user-visible, currently known as FSID.
      2. Metadata UUID - this is the UUID which is stamped into all on-disk
         datastructures belonging to this file system.
      
      When the new incompat flag is present device scanning checks whether
      both fsid/metadata_uuid of the scanned device match any of the
      registered filesystems. When the flag is not set then both UUIDs are
      equal and only the FSID is retained on disk, metadata_uuid is set only
      in-memory during mount.
      
      Additionally a new metadata_uuid field is also added to the fs_info
      struct. It's initialised either with the FSID in case METADATA_UUID
      incompat flag is not set or with the metadata_uuid of the superblock
      otherwise.
      
      This commit introduces the new fields as well as the new incompat flag
      and switches all users of the fsid to the new logic.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ minor updates in comments ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      7239ff4b
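The indirection introduced by 7239ff4b can be illustrated with a small model of the superblock fields involved. This is a sketch under stated assumptions, not the on-disk format or the tool that performs the change: `SuperBlock`, `effective_metadata_uuid` and `change_fsid` are hypothetical names, and strings stand in for UUIDs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SuperBlock:
    """Toy superblock holding only the fields the commit message discusses."""
    fsid: str
    metadata_uuid: Optional[str] = None   # stored only once the flag is set
    has_metadata_uuid_flag: bool = False  # METADATA_UUID incompat flag

    def effective_metadata_uuid(self) -> str:
        """UUID stamped into metadata: the FSID unless the flag is set."""
        if self.has_metadata_uuid_flag:
            return self.metadata_uuid
        return self.fsid


def change_fsid(sb: SuperBlock, new_fsid: str) -> None:
    """Change the user-visible FSID without touching metadata blocks."""
    if not sb.has_metadata_uuid_flag:
        # First change: preserve the creation-time UUID and set the flag.
        sb.metadata_uuid = sb.fsid
        sb.has_metadata_uuid_flag = True
    sb.fsid = new_fsid
```

After `change_fsid(sb, "B")` on `SuperBlock("A")`, the user-visible FSID is `"B"` while `effective_metadata_uuid()` still returns `"A"`, which is why no metadata block needs to be rewritten.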
    • J
      btrfs: use EXPORT_FOR_TESTS for conditionally exported functions · ce9f967f
      Johannes Thumshirn committed
      Several functions in BTRFS are only used inside the source file they are
      declared in if CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not defined. However if
      CONFIG_BTRFS_FS_RUN_SANITY_TESTS is defined these functions are shared
      with the unit tests code.
      
      Before the introduction of the EXPORT_FOR_TESTS macro, these functions
      could not be declared as static and the compiler had a harder task when
      optimizing and inlining them.
      
      As we have EXPORT_FOR_TESTS now, use it where appropriate to support the
      compiler.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ce9f967f