1. 27 Aug, 2020 1 commit
    • J
      btrfs: fix potential deadlock in the search ioctl · a48b73ec
      Josef Bacik committed
      With the conversion of the tree locks to rwsem I got the following
      lockdep splat:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.8.0-rc7-00165-g04ec4da5f45f-dirty #922 Not tainted
        ------------------------------------------------------
        compsize/11122 is trying to acquire lock:
        ffff889fabca8768 (&mm->mmap_lock#2){++++}-{3:3}, at: __might_fault+0x3e/0x90
      
        but task is already holding lock:
        ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (btrfs-fs-00){++++}-{3:3}:
      	 down_write_nested+0x3b/0x70
      	 __btrfs_tree_lock+0x24/0x120
      	 btrfs_search_slot+0x756/0x990
      	 btrfs_lookup_inode+0x3a/0xb4
      	 __btrfs_update_delayed_inode+0x93/0x270
      	 btrfs_async_run_delayed_root+0x168/0x230
      	 btrfs_work_helper+0xd4/0x570
      	 process_one_work+0x2ad/0x5f0
      	 worker_thread+0x3a/0x3d0
      	 kthread+0x133/0x150
      	 ret_from_fork+0x1f/0x30
      
        -> #1 (&delayed_node->mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x9f/0x930
      	 btrfs_delayed_update_inode+0x50/0x440
      	 btrfs_update_inode+0x8a/0xf0
      	 btrfs_dirty_inode+0x5b/0xd0
      	 touch_atime+0xa1/0xd0
      	 btrfs_file_mmap+0x3f/0x60
      	 mmap_region+0x3a4/0x640
      	 do_mmap+0x376/0x580
      	 vm_mmap_pgoff+0xd5/0x120
      	 ksys_mmap_pgoff+0x193/0x230
      	 do_syscall_64+0x50/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (&mm->mmap_lock#2){++++}-{3:3}:
      	 __lock_acquire+0x1272/0x2310
      	 lock_acquire+0x9e/0x360
      	 __might_fault+0x68/0x90
      	 _copy_to_user+0x1e/0x80
      	 copy_to_sk.isra.32+0x121/0x300
      	 search_ioctl+0x106/0x200
      	 btrfs_ioctl_tree_search_v2+0x7b/0xf0
      	 btrfs_ioctl+0x106f/0x30a0
      	 ksys_ioctl+0x83/0xc0
      	 __x64_sys_ioctl+0x16/0x20
      	 do_syscall_64+0x50/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        other info that might help us debug this:
      
        Chain exists of:
          &mm->mmap_lock#2 --> &delayed_node->mutex --> btrfs-fs-00
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(btrfs-fs-00);
      				 lock(&delayed_node->mutex);
      				 lock(btrfs-fs-00);
          lock(&mm->mmap_lock#2);
      
         *** DEADLOCK ***
      
        1 lock held by compsize/11122:
         #0: ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
      
        stack backtrace:
        CPU: 17 PID: 11122 Comm: compsize Kdump: loaded Not tainted 5.8.0-rc7-00165-g04ec4da5f45f-dirty #922
        Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
        Call Trace:
         dump_stack+0x78/0xa0
         check_noncircular+0x165/0x180
         __lock_acquire+0x1272/0x2310
         lock_acquire+0x9e/0x360
         ? __might_fault+0x3e/0x90
         ? find_held_lock+0x72/0x90
         __might_fault+0x68/0x90
         ? __might_fault+0x3e/0x90
         _copy_to_user+0x1e/0x80
         copy_to_sk.isra.32+0x121/0x300
         ? btrfs_search_forward+0x2a6/0x360
         search_ioctl+0x106/0x200
         btrfs_ioctl_tree_search_v2+0x7b/0xf0
         btrfs_ioctl+0x106f/0x30a0
         ? __do_sys_newfstat+0x5a/0x70
         ? ksys_ioctl+0x83/0xc0
         ksys_ioctl+0x83/0xc0
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x50/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The problem is we're doing a copy_to_user() while holding tree locks,
      which can deadlock if we have to do a page fault for the copy_to_user().
      This exists even without my locking changes, so it needs to be fixed.
      Rework the search ioctl to pre-fault the user buffer first and then use
      copy_to_user_nofault for the copying.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
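The pre-fault-then-copy pattern described in this commit can be modelled in user space. The sketch below is only an illustration under stated assumptions: `struct user_buf`, `fault_in()`, `copy_nofault()` and `search_and_copy()` are hypothetical stand-ins, not the kernel's functions, and a plain counter stands in for the tree locks.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a user-space destination buffer. */
struct user_buf {
    char data[64];
    bool resident;      /* false: writing to it would need a page fault */
};

static int tree_locks_held;     /* models how many tree locks are held */

/* Stand-in for pre-faulting the destination; in the kernel this may
 * sleep and take mmap_lock, so it must run with no tree lock held. */
static void fault_in(struct user_buf *ub)
{
    assert(tree_locks_held == 0);
    ub->resident = true;
}

/* Stand-in for a non-faulting copy: it fails instead of faulting. */
static int copy_nofault(struct user_buf *ub, const void *src, size_t len)
{
    if (!ub->resident)
        return -EFAULT;
    memcpy(ub->data, src, len);
    return 0;
}

/* Copy one item to the user buffer; returns 0 or a negative errno. */
static int search_and_copy(struct user_buf *ub, const char *item, size_t len)
{
    int ret;

    if (len > sizeof(ub->data))
        return -EOVERFLOW;

    do {
        fault_in(ub);               /* no tree locks held here */
        tree_locks_held++;          /* take the tree read lock */
        ret = copy_nofault(ub, item, len);
        tree_locks_held--;          /* drop it before any retry */
    } while (ret == -EFAULT);       /* page went away: pre-fault again */

    return ret;
}
```

Retrying is safe in this model because the lock is dropped before each new pre-fault, so the fault path never runs with a tree lock held.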
  2. 27 Jul, 2020 12 commits
    • F
      btrfs: do not set the full sync flag on the inode during page release · 5e548b32
      Filipe Manana committed
      When removing an extent map at try_release_extent_mapping(), called through
      the page release callback (btrfs_releasepage()), we always set the full
      sync flag on the inode, which forces the next fsync to use a slower code
      path.
      
      This hurts performance for workloads that dirty an amount of data that
      exceeds or is very close to the system's RAM and do frequent fsync
      operations (as database servers can, for example). In particular if there
      are concurrent fsyncs against different files, by falling back to a full
      fsync we do a lot more checksum lookups in the checksums btree, as we do
      it for all the extents created in the current transaction, instead of only
      the new ones since the last fsync. These checksum lookups not only take
      some time but, more importantly, they also cause contention on the
      checksums btree locks due to the concurrency with checksum insertions in
      the btree by ordered extents from other inodes.
      
      We actually don't need to set the full sync flag on the inode, because we
      only remove extent maps that are in the list of modified extents if they
      were created in a past transaction, in which case an fsync skips them as
      it's pointless to log them. So stop setting the full fsync flag on the
      inode whenever we remove an extent map.
      
      This patch is part of a patchset that consists of 3 patches, which have
      the following subjects:
      
      1/3 btrfs: fix race between page release and a fast fsync
      2/3 btrfs: release old extent maps during page release
      3/3 btrfs: do not set the full sync flag on the inode during page release
      
      Performance tests were run against a branch (misc-next) containing the
      whole patchset. The test exercises a workload where there are multiple
      processes writing to files and fsyncing them (each writing and fsyncing
      its own file), and in total the amount of data dirtied ranges from 2x to
      4x the system's RAM (16GiB), so that the page release callback is
      invoked frequently.
      
      The following script, using fio, was used to perform the tests:
      
        $ cat test-fsync.sh
        #!/bin/bash
      
        DEV=/dev/sdk
        MNT=/mnt/sdk
        MOUNT_OPTIONS="-o ssd"
        MKFS_OPTIONS="-d single -m single"
      
        if [ $# -ne 3 ]; then
            echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ"
            exit 1
        fi
      
        NUM_JOBS=$1
        FILE_SIZE=$2
        FSYNC_FREQ=$3
      
        cat <<EOF > /tmp/fio-job.ini
        [writers]
        rw=write
        fsync=$FSYNC_FREQ
        fallocate=none
        group_reporting=1
        direct=0
        bs=64k
        ioengine=sync
        size=$FILE_SIZE
        directory=$MNT
        numjobs=$NUM_JOBS
        thread
        EOF
      
        echo "Using config:"
        echo
        cat /tmp/fio-job.ini
        echo
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
        mount $MOUNT_OPTIONS $DEV $MNT
        fio /tmp/fio-job.ini
        umount $MNT
      
      The tests were performed for different numbers of jobs, file sizes and
      fsync frequency. A qemu VM using kvm was used, with 8 cores (the host has
      12 cores, with the CPU governor set to performance mode on all cores), 16GiB
      of RAM (the host has 64GiB) and using an NVMe device directly (without an
      intermediary filesystem in the host). While running the tests, the host
      was not used for anything else, to avoid disturbing the tests.
      
      The obtained results were the following, and the last line printed by
      fio is pasted (includes aggregated throughput and test run time).
      
          *****************************************************
          ****     1 job, 32GiB file, fsync frequency 1     ****
          *****************************************************
      
      Before patchset:
      
      WRITE: bw=29.1MiB/s (30.5MB/s), 29.1MiB/s-29.1MiB/s (30.5MB/s-30.5MB/s), io=32.0GiB (34.4GB), run=1127557-1127557msec
      
      After patchset:
      
      WRITE: bw=29.3MiB/s (30.7MB/s), 29.3MiB/s-29.3MiB/s (30.7MB/s-30.7MB/s), io=32.0GiB (34.4GB), run=1119042-1119042msec
      (+0.7% throughput, -0.8% run time)
      
          *****************************************************
          ****     2 jobs, 16GiB files, fsync frequency 1   ****
          *****************************************************
      
      Before patchset:
      
      WRITE: bw=33.5MiB/s (35.1MB/s), 33.5MiB/s-33.5MiB/s (35.1MB/s-35.1MB/s), io=32.0GiB (34.4GB), run=979000-979000msec
      
      After patchset:
      
      WRITE: bw=39.9MiB/s (41.8MB/s), 39.9MiB/s-39.9MiB/s (41.8MB/s-41.8MB/s), io=32.0GiB (34.4GB), run=821283-821283msec
      (+19.1% throughput, -16.1% run time)
      
          *****************************************************
          ****     4 jobs, 8GiB files, fsync frequency 1    ****
          *****************************************************
      
      Before patchset:
      
      WRITE: bw=52.1MiB/s (54.6MB/s), 52.1MiB/s-52.1MiB/s (54.6MB/s-54.6MB/s), io=32.0GiB (34.4GB), run=629130-629130msec
      
      After patchset:
      
      WRITE: bw=71.8MiB/s (75.3MB/s), 71.8MiB/s-71.8MiB/s (75.3MB/s-75.3MB/s), io=32.0GiB (34.4GB), run=456357-456357msec
      (+37.8% throughput, -27.5% run time)
      
          *****************************************************
          ****     8 jobs, 4GiB files, fsync frequency 1    ****
          *****************************************************
      
      Before patchset:
      
      WRITE: bw=76.1MiB/s (79.8MB/s), 76.1MiB/s-76.1MiB/s (79.8MB/s-79.8MB/s), io=32.0GiB (34.4GB), run=430708-430708msec
      
      After patchset:
      
      WRITE: bw=133MiB/s (140MB/s), 133MiB/s-133MiB/s (140MB/s-140MB/s), io=32.0GiB (34.4GB), run=245458-245458msec
      (+74.7% throughput, -43.0% run time)
      
          *****************************************************
          ****    16 jobs, 2GiB files, fsync frequency 1    ****
          *****************************************************
      
      Before patchset:
      
      WRITE: bw=74.7MiB/s (78.3MB/s), 74.7MiB/s-74.7MiB/s (78.3MB/s-78.3MB/s), io=32.0GiB (34.4GB), run=438625-438625msec
      
      After patchset:
      
      WRITE: bw=184MiB/s (193MB/s), 184MiB/s-184MiB/s (193MB/s-193MB/s), io=32.0GiB (34.4GB), run=177864-177864msec
      (+146.3% throughput, -59.5% run time)
      
          *****************************************************
          ****    32 jobs, 2GiB files, fsync frequency 1    ****
          *****************************************************
      
      Before patchset:
      
      WRITE: bw=72.6MiB/s (76.1MB/s), 72.6MiB/s-72.6MiB/s (76.1MB/s-76.1MB/s), io=64.0GiB (68.7GB), run=902615-902615msec
      
      After patchset:
      
      WRITE: bw=227MiB/s (238MB/s), 227MiB/s-227MiB/s (238MB/s-238MB/s), io=64.0GiB (68.7GB), run=288936-288936msec
      (+212.7% throughput, -68.0% run time)
      
          *****************************************************
          ****    64 jobs, 1GiB files, fsync frequency 1    ****
          *****************************************************
      
      Before patchset:
      
      WRITE: bw=98.8MiB/s (104MB/s), 98.8MiB/s-98.8MiB/s (104MB/s-104MB/s), io=64.0GiB (68.7GB), run=663126-663126msec
      
      After patchset:
      
      WRITE: bw=294MiB/s (308MB/s), 294MiB/s-294MiB/s (308MB/s-308MB/s), io=64.0GiB (68.7GB), run=222940-222940msec
      (+197.6% throughput, -66.4% run time)
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
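The percentages quoted with each result appear to be plain relative changes between the before and after figures. A minimal sketch, assuming that formula; `pct_change()` and `report()` are hypothetical helper names, and the inputs are the rounded values printed by fio, so the last digit can differ slightly from the quoted numbers.

```c
#include <stdio.h>

/* Relative change from `before` to `after`, in percent. */
static double pct_change(double before, double after)
{
    return (after - before) / before * 100.0;
}

/* Print one result row: throughput in MiB/s, run time in msec. */
static void report(const char *label, double bw_before, double bw_after,
                   double ms_before, double ms_after)
{
    printf("%s: %+.1f%% throughput, %+.1f%% run time\n", label,
           pct_change(bw_before, bw_after),
           pct_change(ms_before, ms_after));
}
```

For the 2 jobs case, `report("2 jobs", 33.5, 39.9, 979000, 821283)` prints `2 jobs: +19.1% throughput, -16.1% run time`.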
    • F
      btrfs: release old extent maps during page release · fbc2bd7e
      Filipe Manana committed
      When removing an extent map at try_release_extent_mapping(), called through
      the page release callback (btrfs_releasepage()), we never release an extent
      map that is in the list of modified extents. This is to prevent races with
      a concurrent fsync using the fast path, which could lead to not logging an
      extent created in the current transaction.
      
      However we can safely remove an extent map created in a past transaction
      that is still in the list of modified extents (because no one has fsynced
      the inode since that transaction was committed), because such extents are
      skipped during an fsync as it is pointless to log them. This change does
      that.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
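The release rule described in this commit message can be written as a small predicate. This is a simplified sketch, not the code in try_release_extent_mapping(): `struct em_model`, its fields and `can_release()` are hypothetical names, and only the conditions named in the text (pinned, membership in the list of modified extents, and generation) are modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of an extent map. */
struct em_model {
    bool pinned;            /* ordered extent has not finished yet */
    bool in_modified_list;  /* still in the inode's modified extents list */
    uint64_t generation;    /* transaction that created the extent */
};

/* Decide whether page release may drop this extent map. */
static bool can_release(const struct em_model *em, uint64_t cur_trans_gen)
{
    if (em->pinned)
        return false;

    /* Created in the current transaction and not logged yet:
     * a concurrent fast fsync still needs to find it. */
    if (em->in_modified_list && em->generation >= cur_trans_gen)
        return false;

    /* Either not in the list, or created in a past transaction,
     * in which case an fsync would skip it anyway. */
    return true;
}
```

The generation comparison is what separates the two cases: an extent from a past transaction is already persisted by that transaction's commit, so dropping its map cannot cause a fast fsync to lose data.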
    • F
      btrfs: fix race between page release and a fast fsync · 3d6448e6
      Filipe Manana committed
      When releasing an extent map, done through the page release callback, we
      can race with an ongoing fast fsync and cause the fsync to miss a new
      extent and not log it. The steps for this to happen are the following:
      
      1) A page is dirtied for some inode I;
      
      2) Writeback for that page is triggered by a path other than fsync, for
         example by the system due to memory pressure;
      
      3) When the ordered extent for the extent (a single 4K page) finishes,
         we unpin the corresponding extent map and set its generation to N,
         the current transaction's generation;
      
      4) The btrfs_releasepage() callback is invoked by the system due to
         memory pressure for that no longer dirty page of inode I;
      
      5) At the same time, some task calls fsync on inode I, joins transaction
         N, and at btrfs_log_inode() it sees that the inode does not have the
         full sync flag set, so we proceed with a fast fsync. But before we get
         into btrfs_log_changed_extents() and lock the inode's extent map tree:
      
      6) Through btrfs_releasepage() we end up at try_release_extent_mapping()
         and we remove the extent map for the new 4K extent, because it is
         neither pinned anymore nor locked. By calling remove_extent_mapping(),
         we remove the extent map from the list of modified extents, since the
         extent map does not have the logging flag set. We unlock the inode's
         extent map tree;
      
      7) The task doing the fast fsync now enters btrfs_log_changed_extents(),
         locks the inode's extent map tree and iterates its list of modified
         extents, which no longer has the 4K extent in it, so it does not log
         the extent;
      
      8) The fsync finishes;
      
      9) Before transaction N is committed, a power failure happens. After
         replaying the log, the 4K extent of inode I will be missing, since
         it was not logged due to the race with try_release_extent_mapping().
      
      So fix this by teaching try_release_extent_mapping() to not remove an
      extent map if it's still in the list of modified extents.
      
      Fixes: ff44c6e3 ("Btrfs: do not hold the write_lock on the extent tree while logging")
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • J
      btrfs: return EROFS for BTRFS_FS_STATE_ERROR cases · fbabd4a3
      Josef Bacik committed
      Eric reported seeing this message while running generic/475:
      
        BTRFS: error (device dm-3) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted
      
      Full stack trace:
      
        BTRFS: error (device dm-0) in btrfs_commit_transaction:2323: errno=-5 IO failure (Error while writing out transaction)
        BTRFS info (device dm-0): forced readonly
        BTRFS warning (device dm-0): Skipping commit of aborted transaction.
        ------------[ cut here ]------------
        BTRFS: error (device dm-0) in cleanup_transaction:1894: errno=-5 IO failure
        BTRFS: Transaction aborted (error -117)
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6480 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6488 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6490 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6498 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64c0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85e8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85f0 len 4096 err no 10
        WARNING: CPU: 3 PID: 23985 at fs/btrfs/tree-log.c:3084 btrfs_sync_log+0xbc8/0xd60 [btrfs]
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4288 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4290 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4298 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42d0 len 4096 err no 10
        CPU: 3 PID: 23985 Comm: fsstress Tainted: G        W    L    5.8.0-rc4-default+ #1181
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
        RIP: 0010:btrfs_sync_log+0xbc8/0xd60 [btrfs]
        RSP: 0018:ffff909a44d17bd0 EFLAGS: 00010286
        RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000001
        RDX: ffff8f3be41cb940 RSI: ffffffffb0108d2b RDI: ffffffffb0108ff7
        RBP: ffff909a44d17e70 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000037988 R12: ffff8f3bd20e4000
        R13: ffff8f3bd20e4428 R14: 00000000ffffff8b R15: ffff909a44d17c70
        FS:  00007f6a6ed3fb80(0000) GS:ffff8f3c3dc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f6a6ed3e000 CR3: 00000000525c0003 CR4: 0000000000160ee0
        Call Trace:
         ? finish_wait+0x90/0x90
         ? __mutex_unlock_slowpath+0x45/0x2a0
         ? lock_acquire+0xa3/0x440
         ? lockref_put_or_lock+0x9/0x30
         ? dput+0x20/0x4a0
         ? dput+0x20/0x4a0
         ? do_raw_spin_unlock+0x4b/0xc0
         ? _raw_spin_unlock+0x1f/0x30
         btrfs_sync_file+0x335/0x490 [btrfs]
         do_fsync+0x38/0x70
         __x64_sys_fsync+0x10/0x20
         do_syscall_64+0x50/0xe0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f6a6ef1b6e3
        Code: Bad RIP value.
        RSP: 002b:00007ffd01e20038 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
        RAX: ffffffffffffffda RBX: 000000000007a120 RCX: 00007f6a6ef1b6e3
        RDX: 00007ffd01e1ffa0 RSI: 00007ffd01e1ffa0 RDI: 0000000000000003
        RBP: 0000000000000003 R08: 0000000000000001 R09: 00007ffd01e2004c
        R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000009f
        R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
        irq event stamp: 0
        hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        hardirqs last disabled at (0): [<ffffffffb007fe0b>] copy_process+0x67b/0x1b00
        softirqs last  enabled at (0): [<ffffffffb007fe0b>] copy_process+0x67b/0x1b00
        softirqs last disabled at (0): [<0000000000000000>] 0x0
        ---[ end trace af146e0e38433456 ]---
        BTRFS: error (device dm-0) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted
      
      This ret came from btrfs_write_marked_extents().  If the transaction was
      already aborted because of an EIO, we'll see that in
      btree_write_cache_pages() and return EUCLEAN, which gets printed as
      "Filesystem corrupted".

      Except we shouldn't be returning EUCLEAN here; we need to return EROFS,
      because EUCLEAN is reserved for actual corruption, not IO errors.

      We are inconsistent about our handling of BTRFS_FS_STATE_ERROR
      elsewhere, but we want to use EROFS for this particular case.  The
      original transaction abort has the real error code for why we ended up
      with an aborted transaction; all subsequent actions just need to return
      EROFS because they may not have a trans handle and have no idea about
      the original cause of the abort.
      
      After patch "btrfs: don't WARN if we abort a transaction with EROFS" the
      stacktrace will not be dumped either.
      Reported-by: Eric Sandeen <esandeen@redhat.com>
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add full test stacktrace ]
      Signed-off-by: David Sterba <dsterba@suse.com>
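The error-code policy argued for in this commit can be sketched as follows. It is a user-space model under stated assumptions: `struct fs_model`, `abort_transaction_model()` and `write_marked_extents_model()` are hypothetical names, and the boolean stands in for BTRFS_FS_STATE_ERROR.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the filesystem-wide error state. */
struct fs_model {
    bool error_state;   /* stands in for BTRFS_FS_STATE_ERROR */
    int abort_errno;    /* the real cause, recorded by the first abort */
};

/* The first abort records the real error; later aborts do not overwrite it. */
static void abort_transaction_model(struct fs_model *fs, int err)
{
    if (!fs->error_state) {
        fs->error_state = true;
        fs->abort_errno = err;
    }
}

/* A later writer has no idea why the abort happened, so it reports
 * -EROFS instead of -EUCLEAN, which is kept for actual corruption. */
static int write_marked_extents_model(const struct fs_model *fs)
{
    if (fs->error_state)
        return -EROFS;
    return 0;
}
```

With this split, the log line for the original abort keeps the informative code (for example -EIO), and follow-up failures no longer read as "Filesystem corrupted".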
    • N
      btrfs: remove done label in writepage_delalloc · b69d1ee9
      Nikolay Borisov committed
      Since no common cleanup runs after the label, it is somewhat
      redundant.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: streamline btrfs_get_io_failure_record logic · 3526302f
      Nikolay Borisov committed
      Make the function directly return a pointer to a failure record and
      adjust callers to handle it. Also refactor the logic inside so that
      the case which allocates the failure record for the first time is not
      handled in an 'if' arm, saving us a level of indentation. Finally make
      the function static as it's not used outside of extent_io.c.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: make get_state_failrec return failrec directly · 2279a270
      Nikolay Borisov committed
      The only failure that get_state_failrec can hit is that there is no
      failure record for the given address. There is no reason why the function
      should return a status code and use a separate parameter for returning the
      actual failure rec (if one is found). Simplify it by making the return type
      a pointer and returning an ERR_PTR value in case of errors.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
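The pointer-or-error return convention this commit switches to can be tried out in user space. The sketch below re-creates ERR_PTR(), PTR_ERR() and IS_ERR() along the lines of the kernel helpers so that it is self-contained; `struct failrec_model`, the lookup table and `get_failrec_model()` are hypothetical, and the choice of -ENOENT for "no record at this address" is an assumption of the sketch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* User-space stand-ins modelled on the kernel's err.h helpers. */
static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline bool IS_ERR(const void *ptr)
{
    /* Error values occupy the top MAX_ERRNO addresses. */
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical failure record keyed by start offset. */
struct failrec_model {
    uint64_t start;
    int failed_mirror;
};

static struct failrec_model records[] = {
    { 4096, 1 },
    { 8192, 2 },
};

/* Return the record for `start`, or ERR_PTR(-ENOENT) if none exists.
 * One return value replaces a status code plus an output parameter. */
static struct failrec_model *get_failrec_model(uint64_t start)
{
    for (size_t i = 0; i < sizeof(records) / sizeof(records[0]); i++)
        if (records[i].start == start)
            return &records[i];
    return ERR_PTR(-ENOENT);
}
```

A caller checks `IS_ERR(rec)` first and, on failure, recovers the code with `PTR_ERR(rec)`.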
    • N
      btrfs: make writepage_delalloc take btrfs_inode · cd4c0bf9
      Nikolay Borisov committed
      Only find_lock_delalloc_range uses vfs_inode so let's take the
      btrfs_inode as a parameter.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: make __extent_writepage_io take btrfs_inode · d4580fe2
      Nikolay Borisov committed
      It has only a single use for a generic vfs inode vs 3 for btrfs_inode.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: make btrfs_run_delalloc_range take btrfs_inode · 98456b9c
      Nikolay Borisov committed
      All children now take btrfs_inode so convert it to taking it as a
      parameter as well.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • D
      btrfs: don't use UAPI types for fiemap callback · bab16e21
      David Sterba committed
      The fiemap callback is not part of the UAPI and the prototypes
      don't have the __u64 types either.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: make extent_clear_unlock_delalloc take btrfs_inode · ad7ff17b
      Nikolay Borisov committed
      It has one VFS inode usage and one btrfs_inode usage, but converting it to
      take a btrfs_inode will allow seamless conversion of its callers.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. 22 Jul, 2020 1 commit
  4. 02 Jul, 2020 1 commit
    • B
      btrfs: fix fatal extent_buffer readahead vs releasepage race · 6bf9cd2e
      Boris Burkov committed
      Under somewhat convoluted conditions, it is possible to attempt to
      release an extent_buffer that is under io, which triggers a BUG_ON in
      btrfs_release_extent_buffer_pages.
      
      This relies on a few different factors. First, extent_buffer reads done
      as readahead for searching use WAIT_NONE, so they free the local extent
      buffer reference while the io is outstanding. They should still be
      protected by TREE_REF. However, if the system is doing significant
      reclaim, and simultaneously heavily accessing the extent_buffers, it is
      possible for releasepage to race with two concurrent readahead attempts
      in a way that leaves TREE_REF unset when the readahead extent buffer is
      released.
      
      Essentially, if two tasks race to allocate a new extent_buffer, but the
      winner who attempts the first io is rebuffed by a page being locked
      (likely by the reclaim itself) then the loser will still go ahead with
      issuing the readahead. The loser's call to find_extent_buffer must also
      race with the reclaim task reading the extent_buffer's refcount as 1 in
      a way that allows the reclaim to re-clear the TREE_REF checked by
      find_extent_buffer.
      
      The following represents an example execution demonstrating the race:
      
                  CPU0                                                         CPU1                                           CPU2
      reada_for_search                                            reada_for_search
        readahead_tree_block                                        readahead_tree_block
          find_create_tree_block                                      find_create_tree_block
            alloc_extent_buffer                                         alloc_extent_buffer
                                                                        find_extent_buffer // not found
                                                                        allocates eb
                                                                        lock pages
                                                                        associate pages to eb
                                                                        insert eb into radix tree
                                                                        set TREE_REF, refs == 2
                                                                        unlock pages
                                                                    read_extent_buffer_pages // WAIT_NONE
                                                                      not uptodate (brand new eb)
                                                                                                                  lock_page
                                                                      if !trylock_page
                                                                        goto unlock_exit // not an error
                                                                    free_extent_buffer
                                                                      release_extent_buffer
                                                                        atomic_dec_and_test refs to 1
              find_extent_buffer // found
                                                                                                                  try_release_extent_buffer
                                                                                                                    take refs_lock
                                                                                                                    reads refs == 1; no io
                atomic_inc_not_zero refs to 2
                mark_buffer_accessed
                  check_buffer_tree_ref
                    // not STALE, won't take refs_lock
                    refs == 2; TREE_REF set // no action
          read_extent_buffer_pages // WAIT_NONE
                                                                                                                    clear TREE_REF
                                                                                                                    release_extent_buffer
                                                                                                                      atomic_dec_and_test refs to 1
                                                                                                                      unlock_page
            still not uptodate (CPU1 read failed on trylock_page)
            locks pages
            set io_pages > 0
            submit io
            return
          free_extent_buffer
            release_extent_buffer
              dec refs to 0
              delete from radix tree
              btrfs_release_extent_buffer_pages
                BUG_ON(io_pages > 0)!!!
      
      We observe this at a very low rate in production and were also able to
      reproduce it in a test environment by introducing some spurious delays
      and by introducing probabilistic trylock_page failures.
      
      To fix it, we apply check_buffer_tree_ref at a point where it could not
      possibly be unset by a competing task: after io_pages has been
      incremented. All the codepaths that clear TREE_REF check for io, so they
      would not be able to clear it after this point until the io is done.
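The ordering that the fix relies on can be replayed in a single-threaded model. This is a heavily simplified sketch, not the kernel code: `struct eb_model` and the `*_model()` functions are hypothetical, there are no atomics or refs_lock, and the starting state in the usage note is the one the race leaves behind (reclaim has already cleared TREE_REF while one task still holds a local reference).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified extent buffer state. */
struct eb_model {
    int refs;        /* reference count */
    int io_pages;    /* pages with io outstanding */
    bool tree_ref;   /* models the TREE_REF flag */
};

/* Make sure the tree holds a reference on the buffer. */
static void check_tree_ref_model(struct eb_model *eb)
{
    if (!eb->tree_ref) {
        eb->tree_ref = true;
        eb->refs++;
    }
}

/* Reclaim side: only drops the tree reference when the buffer is idle. */
static bool try_release_model(struct eb_model *eb)
{
    if (eb->io_pages > 0 || eb->refs != 1 || !eb->tree_ref)
        return false;
    eb->tree_ref = false;
    eb->refs--;
    return true;
}

/* Read submission with the fix: take the tree reference only after
 * io_pages is set, so reclaim can no longer clear it behind our back. */
static void submit_read_model(struct eb_model *eb, int nr_pages)
{
    eb->io_pages = nr_pages;
    check_tree_ref_model(eb);
}

/* Drop one reference; returns true if the buffer would be freed. */
static bool put_ref_model(struct eb_model *eb)
{
    if (--eb->refs > 0)
        return false;
    assert(eb->io_pages == 0);   /* models the BUG_ON from the report */
    return true;
}
```

Starting from `refs == 1`, `tree_ref == false`, calling `submit_read_model()` and then `put_ref_model()` leaves the buffer alive with the tree reference held, and `try_release_model()` keeps failing until `io_pages` drops back to zero.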
      
      Stack trace, for reference:
      [1417839.424739] ------------[ cut here ]------------
      [1417839.435328] kernel BUG at fs/btrfs/extent_io.c:4841!
      [1417839.447024] invalid opcode: 0000 [#1] SMP
      [1417839.502972] RIP: 0010:btrfs_release_extent_buffer_pages+0x20/0x1f0
      [1417839.517008] Code: ed e9 ...
      [1417839.558895] RSP: 0018:ffffc90020bcf798 EFLAGS: 00010202
      [1417839.570816] RAX: 0000000000000002 RBX: ffff888102d6def0 RCX: 0000000000000028
      [1417839.586962] RDX: 0000000000000002 RSI: ffff8887f0296482 RDI: ffff888102d6def0
      [1417839.603108] RBP: ffff88885664a000 R08: 0000000000000046 R09: 0000000000000238
      [1417839.619255] R10: 0000000000000028 R11: ffff88885664af68 R12: 0000000000000000
      [1417839.635402] R13: 0000000000000000 R14: ffff88875f573ad0 R15: ffff888797aafd90
      [1417839.651549] FS:  00007f5a844fa700(0000) GS:ffff88885f680000(0000) knlGS:0000000000000000
      [1417839.669810] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [1417839.682887] CR2: 00007f7884541fe0 CR3: 000000049f609002 CR4: 00000000003606e0
      [1417839.699037] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [1417839.715187] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [1417839.731320] Call Trace:
      [1417839.737103]  release_extent_buffer+0x39/0x90
      [1417839.746913]  read_block_for_search.isra.38+0x2a3/0x370
      [1417839.758645]  btrfs_search_slot+0x260/0x9b0
      [1417839.768054]  btrfs_lookup_file_extent+0x4a/0x70
      [1417839.778427]  btrfs_get_extent+0x15f/0x830
      [1417839.787665]  ? submit_extent_page+0xc4/0x1c0
      [1417839.797474]  ? __do_readpage+0x299/0x7a0
      [1417839.806515]  __do_readpage+0x33b/0x7a0
      [1417839.815171]  ? btrfs_releasepage+0x70/0x70
      [1417839.824597]  extent_readpages+0x28f/0x400
      [1417839.833836]  read_pages+0x6a/0x1c0
      [1417839.841729]  ? startup_64+0x2/0x30
      [1417839.849624]  __do_page_cache_readahead+0x13c/0x1a0
      [1417839.860590]  filemap_fault+0x6c7/0x990
      [1417839.869252]  ? xas_load+0x8/0x80
      [1417839.876756]  ? xas_find+0x150/0x190
      [1417839.884839]  ? filemap_map_pages+0x295/0x3b0
      [1417839.894652]  __do_fault+0x32/0x110
      [1417839.902540]  __handle_mm_fault+0xacd/0x1000
      [1417839.912156]  handle_mm_fault+0xaa/0x1c0
      [1417839.921004]  __do_page_fault+0x242/0x4b0
      [1417839.930044]  ? page_fault+0x8/0x30
      [1417839.937933]  page_fault+0x1e/0x30
      [1417839.945631] RIP: 0033:0x33c4bae
      [1417839.952927] Code: Bad RIP value.
      [1417839.960411] RSP: 002b:00007f5a844f7350 EFLAGS: 00010206
      [1417839.972331] RAX: 000000000000006e RBX: 1614b3ff6a50398a RCX: 0000000000000000
      [1417839.988477] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002
      [1417840.004626] RBP: 00007f5a844f7420 R08: 000000000000006e R09: 00007f5a94aeccb8
      [1417840.020784] R10: 00007f5a844f7350 R11: 0000000000000000 R12: 00007f5a94aecc79
      [1417840.036932] R13: 00007f5a94aecc78 R14: 00007f5a94aecc90 R15: 00007f5a94aecc40
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6bf9cd2e
  5. 30 Jun, 2020 1 commit
    • P
      fs/btrfs: Add cond_resched() for try_release_extent_mapping() stalls · 9f47eb54
      Paul E. McKenney authored
      Very large I/Os can cause the following RCU CPU stall warning:
      
      RIP: 0010:rb_prev+0x8/0x50
      Code: 49 89 c0 49 89 d1 48 89 c2 48 89 f8 e9 e5 fd ff ff 4c 89 48 10 c3 4c 89 06 c3 4c 89 40 10 c3 0f 1f 00 48 8b 0f 48 39 cf 74 38 <48> 8b 47 10 48 85 c0 74 22 48 8b 50 08 48 85 d2 74 0c 48 89 d0 48
      RSP: 0018:ffffc9002212bab0 EFLAGS: 00000287 ORIG_RAX: ffffffffffffff13
      RAX: ffff888821f93630 RBX: ffff888821f93630 RCX: ffff888821f937e0
      RDX: 0000000000000000 RSI: 0000000000102000 RDI: ffff888821f93630
      RBP: 0000000000103000 R08: 000000000006c000 R09: 0000000000000238
      R10: 0000000000102fff R11: ffffc9002212bac8 R12: 0000000000000001
      R13: ffffffffffffffff R14: 0000000000102000 R15: ffff888821f937e0
       __lookup_extent_mapping+0xa0/0x110
       try_release_extent_mapping+0xdc/0x220
       btrfs_releasepage+0x45/0x70
       shrink_page_list+0xa39/0xb30
       shrink_inactive_list+0x18f/0x3b0
       shrink_lruvec+0x38e/0x6b0
       shrink_node+0x14d/0x690
       do_try_to_free_pages+0xc6/0x3e0
       try_to_free_mem_cgroup_pages+0xe6/0x1e0
       reclaim_high.constprop.73+0x87/0xc0
       mem_cgroup_handle_over_high+0x66/0x150
       exit_to_usermode_loop+0x82/0xd0
       do_syscall_64+0xd4/0x100
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      On a PREEMPT=n kernel, the try_release_extent_mapping() function's
      "while" loop might run for a very long time on a large I/O.  This commit
      therefore adds a cond_resched() to this loop, providing RCU any needed
      quiescent states.
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      9f47eb54
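The idea in the commit above can be sketched in user space. This is only an illustration under stated assumptions: `cond_resched_sketch()` and `release_range_sketch()` are hypothetical names, and `sched_yield()` stands in for the kernel's cond_resched(); it is not the kernel change itself.

```c
#include <sched.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's cond_resched(): offer the CPU. */
static void cond_resched_sketch(void)
{
	sched_yield();
}

/*
 * Walk [start, end) one page at a time and yield after every step, so a
 * very large range cannot monopolize the CPU. Returns the step count.
 */
static size_t release_range_sketch(size_t start, size_t end, size_t page_size)
{
	size_t steps = 0;

	while (start < end) {
		/* ... per-extent lookup and release work would go here ... */
		start += page_size;
		steps++;
		cond_resched_sketch();
	}
	return steps;
}
```

The yield sits inside the loop body, so the cost of a long walk is spread across many scheduling opportunities instead of one uninterrupted run.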
  6. 03 Jun, 2020 2 commits
  7. 25 May, 2020 8 commits
  8. 24 Mar, 2020 14 commits
    • J
      btrfs: move the root freeing stuff into btrfs_put_root · 8c38938c
      Josef Bacik authored
      There are a few different ways to free roots, either you allocated them
      yourself and you just do
      
      free_extent_buffer(root->node);
      free_extent_buffer(root->commit_node);
      btrfs_put_root(root);
      
      Which is the pattern for log roots.  Or for snapshots/subvolumes that
      are being dropped you simply call btrfs_free_fs_root() which does all
      the cleanup for you.
      
      Unify this all into btrfs_put_root(), so that we don't free up things
      associated with the root until the last reference is dropped.  This
      makes the root freeing code much more significant.
      
      The only caveat is at close_ctree() time we have to free the extent
      buffers for all of our main roots (extent_root, chunk_root, etc) because
      we have to drop the btree_inode and we'll run into issues if we hold
      onto those nodes until ->kill_sb() time.  This will be addressed in the
      future when we kill the btree_inode.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8c38938c
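The "free on last put" pattern described in the commit above can be sketched as follows. This is a simplified user-space illustration: `root_sketch` and its helpers are hypothetical names, the refcount is a plain int rather than an atomic, and the two owned buffers are ordinary heap allocations standing in for root->node and root->commit_node.

```c
#include <stdlib.h>

/* Hypothetical, simplified stand-in for a root that owns two buffers. */
struct root_sketch {
	int refs;          /* plain int here; real code needs an atomic count */
	void *node;
	void *commit_node;
};

static struct root_sketch *root_alloc_sketch(void)
{
	struct root_sketch *root = calloc(1, sizeof(*root));

	if (!root)
		return NULL;
	root->refs = 1;
	root->node = malloc(16);
	root->commit_node = malloc(16);
	return root;
}

static void root_get_sketch(struct root_sketch *root)
{
	root->refs++;
}

/* Drop one reference; free owned resources only on the last put.
 * Returns 1 if the root was freed, 0 otherwise. */
static int root_put_sketch(struct root_sketch *root)
{
	if (!root || --root->refs > 0)
		return 0;
	free(root->node);
	free(root->commit_node);
	free(root);
	return 1;
}
```

Callers never free the owned buffers themselves; they only drop their reference, and the final put performs all of the cleanup in one place.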
    • J
      btrfs: make the extent buffer leak check per fs info · 3fd63727
      Josef Bacik authored
      I'm going to make the entire destruction of btrfs_root's controlled by
      their refcount, so it will be helpful to notice if we're leaking their
      eb's on umount.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3fd63727
    • Q
      btrfs: Don't submit any btree write bio if the fs has errors · b3ff8f1d
      Qu Wenruo authored
      [BUG]
      There is a fuzzed image which could cause KASAN report at unmount time.
      
        BUG: KASAN: use-after-free in btrfs_queue_work+0x2c1/0x390
        Read of size 8 at addr ffff888067cf6848 by task umount/1922
      
        CPU: 0 PID: 1922 Comm: umount Tainted: G        W         5.0.21 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
        Call Trace:
         dump_stack+0x5b/0x8b
         print_address_description+0x70/0x280
         kasan_report+0x13a/0x19b
         btrfs_queue_work+0x2c1/0x390
         btrfs_wq_submit_bio+0x1cd/0x240
         btree_submit_bio_hook+0x18c/0x2a0
         submit_one_bio+0x1be/0x320
         flush_write_bio.isra.41+0x2c/0x70
         btree_write_cache_pages+0x3bb/0x7f0
         do_writepages+0x5c/0x130
         __writeback_single_inode+0xa3/0x9a0
         writeback_single_inode+0x23d/0x390
         write_inode_now+0x1b5/0x280
         iput+0x2ef/0x600
         close_ctree+0x341/0x750
         generic_shutdown_super+0x126/0x370
         kill_anon_super+0x31/0x50
         btrfs_kill_super+0x36/0x2b0
         deactivate_locked_super+0x80/0xc0
         deactivate_super+0x13c/0x150
         cleanup_mnt+0x9a/0x130
         task_work_run+0x11a/0x1b0
         exit_to_usermode_loop+0x107/0x130
         do_syscall_64+0x1e5/0x280
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      [CAUSE]
      The fuzzed image has a completely screwed up extent tree:
      
        leaf 29421568 gen 8 total ptrs 6 free space 3587 owner EXTENT_TREE
        refs 2 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 5938
                item 0 key (12587008 168 4096) itemoff 3942 itemsize 53
                        extent refs 1 gen 9 flags 1
                        ref#0: extent data backref root 5 objectid 259 offset 0 count 1
                item 1 key (12591104 168 8192) itemoff 3889 itemsize 53
                        extent refs 1 gen 9 flags 1
                        ref#0: extent data backref root 5 objectid 271 offset 0 count 1
                item 2 key (12599296 168 4096) itemoff 3836 itemsize 53
                        extent refs 1 gen 9 flags 1
                        ref#0: extent data backref root 5 objectid 259 offset 4096 count 1
                item 3 key (29360128 169 0) itemoff 3803 itemsize 33
                        extent refs 1 gen 9 flags 2
                        ref#0: tree block backref root 5
                item 4 key (29368320 169 1) itemoff 3770 itemsize 33
                        extent refs 1 gen 9 flags 2
                        ref#0: tree block backref root 5
                item 5 key (29372416 169 0) itemoff 3737 itemsize 33
                        extent refs 1 gen 9 flags 2
                        ref#0: tree block backref root 5
      
      Note that leaf 29421568 doesn't have its backref in the extent tree.
      Thus extent allocator can re-allocate leaf 29421568 for other trees.
      
      In short, the bug is caused by:
      
      - Existing tree block gets allocated to log tree
        This got its generation bumped.
      
      - Log tree balance cleaned dirty bit of offending tree block
        It will not be written back to disk, thus no WRITTEN flag.
      
      - Original owner of the tree block gets COWed
        Since the tree block has higher transid, no WRITTEN flag, it's reused,
        and not traced by transaction::dirty_pages.
      
      - Transaction aborted
        Tree blocks get cleaned according to transaction::dirty_pages. But the
        offending tree block is not recorded at all.
      
      - Filesystem unmount
        All pages are assumed to be clean, so all workqueues are destroyed and
        then iput(btree_inode) is called.
        But offending tree block is still dirty, which triggers writeback, and
        causes use-after-free bug.
      
      The detailed sequence looks like this:
      
      - Initial status
        eb: 29421568, header=WRITTEN bflags_dirty=0, page_dirty=0, gen=8,
            not traced by any dirty extent_io_tree.
      
      - New tree block is allocated
        Since there is no backref for 29421568, it's re-allocated as new tree
        block.
        Keep in mind that tree block 29421568 is still referred by extent
        tree.
      
      - Tree block 29421568 is filled for log tree
        eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9 << (gen bumped)
            traced by btrfs_root::dirty_log_pages
      
      - Some log tree operations
        Since the fs is using node size 4096, the log tree can easily go a
        level higher.
      
      - Log tree needs balance
        Tree block 29421568 gets all its content pushed to right, thus now
        it is empty, and we don't need it.
        btrfs_clean_tree_block() from __push_leaf_right() gets called.
      
        eb: 29421568, header=0 bflags_dirty=0, page_dirty=0, gen=9
            traced by btrfs_root::dirty_log_pages
      
      - Log tree write back
        btree_write_cache_pages() goes through dirty pages ranges, but since
        page of tree block 29421568 gets cleaned already, it's not written
        back to disk. Thus it doesn't have WRITTEN bit set.
        But ranges in dirty_log_pages are cleared.
      
        eb: 29421568, header=0 bflags_dirty=0, page_dirty=0, gen=9
            not traced by any dirty extent_io_tree.
      
      - Extent tree update when committing transaction
        Since tree block 29421568 has transid equal to running trans, and has
        no WRITTEN bit, should_cow_block() will use it directly without adding
        it to btrfs_transaction::dirty_pages.
      
        eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9
            not traced by any dirty extent_io_tree.
      
        At this stage, we're doomed. We have a dirty eb not tracked by any
        extent io tree.
      
      - Transaction gets aborted due to corrupted extent tree
        Btrfs cleans up dirty pages according to transaction::dirty_pages and
        btrfs_root::dirty_log_pages.
        But since tree block 29421568 is tracked by neither of them, it's
        still dirty.
      
        eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9
            not traced by any dirty extent_io_tree.
      
      - Filesystem unmount
        Since all cleanup is assumed to be done, all workqueues are destroyed.
        Then iput(btree_inode) is called, expecting no dirty pages.
        But tree 29421568 is still dirty, thus triggering writeback.
        Since all workqueues are already freed, we cause use-after-free.
      
      This shows us that, log tree blocks + bad extent tree can cause wild
      dirty pages.
      
      [FIX]
      To fix the problem, don't submit any btree write bio if the filesystem
      has any error.  This is the last safety net, just in case other cleanup
      hasn't caught it.
      
      Link: https://github.com/bobfuzzer/CVE/tree/master/CVE-2019-19377
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b3ff8f1d
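The guard described in the [FIX] section above can be sketched like this. It is a minimal illustration, not the kernel code: `fs_info_sketch`, `FS_STATE_ERROR_SKETCH` and `flush_write_bio_sketch()` are hypothetical names, the `submitted` counter stands in for real bio submission, and returning -EROFS is an illustrative choice of error code.

```c
#include <errno.h>

/* Hypothetical error bit in a filesystem-wide state word. */
#define FS_STATE_ERROR_SKETCH (1UL << 0)

/* Hypothetical, simplified filesystem state. */
struct fs_info_sketch {
	unsigned long fs_state;
	int submitted;     /* counts write bios that were actually submitted */
};

/*
 * Submit the pending write only if the filesystem has not recorded an
 * error; otherwise refuse and report a read-only style failure.
 */
static int flush_write_bio_sketch(struct fs_info_sketch *fs)
{
	if (fs->fs_state & FS_STATE_ERROR_SKETCH)
		return -EROFS;
	fs->submitted++;
	return 0;
}
```

Placing the check at the final submission point means it still catches a stray dirty block that every earlier cleanup path missed.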
    • J
      btrfs: Add missing lock annotation for release_extent_buffer() · 5ce48d0f
      Jules Irenge authored
      Sparse reports a warning at release_extent_buffer()
      warning: context imbalance in release_extent_buffer() - unexpected unlock
      
      The root cause is the missing annotation at release_extent_buffer()
      Add the missing __releases(&eb->refs_lock) annotation
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5ce48d0f
    • F
      Btrfs: avoid unnecessary splits when setting bits on an extent io tree · 55ffaabe
      Filipe Manana authored
      When attempting to set bits on a range of an extent io tree that already
      has those bits set we can end up splitting an extent state record, use
      the preallocated extent state record, insert it into the red black tree,
      do another search on the red black tree, merge the preallocated extent
      state record with the previous extent state record, remove that previous
      record from the red black tree and then free it. This is all unnecessary
      work that consumes time.
      
      This happens specifically at the following case at __set_extent_bit():
      
        $ cat -n fs/btrfs/extent_io.c
         957  static int __must_check
         958  __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
        (...)
        1044          /*
        1045           *     | ---- desired range ---- |
        1046           * | state |
        1047           *   or
        1048           * | ------------- state -------------- |
        1049           *
        (...)
        1060          if (state->start < start) {
        1061                  if (state->state & exclusive_bits) {
        1062                          *failed_start = start;
        1063                          err = -EEXIST;
        1064                          goto out;
        1065                  }
        1066
        1067                  prealloc = alloc_extent_state_atomic(prealloc);
        1068                  BUG_ON(!prealloc);
        1069                  err = split_state(tree, state, prealloc, start);
        1070                  if (err)
        1071                          extent_io_tree_panic(tree, err);
        1072
        1073                  prealloc = NULL;
      
      So if our extent state represents a range from 0 to 1MiB for example, and
      we want to set bits in the range 128KiB to 256KiB for example, and that
      extent state record already has all those bits set, we end up splitting
      that record, so we end up with extent state records in the tree which
      represent the ranges from 0 to 128KiB and from 128KiB to 1MiB. This is
      temporary because a subsequent iteration in that function will end up
      merging the records.
      
      The splitting requires using the preallocated extent state record, so
      a future iteration that needs to do another split will need to allocate
      another extent state record in an atomic context, something not ideal
      that we try to avoid as much as possible. The splitting also requires
      an insertion in the red black tree, and a subsequent merge will require
      a deletion from the red black tree and freeing an extent state record.
      
      This change just skips the splitting of an extent state record when it
      already has all the bits that we need to set.
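
The skip condition can be sketched in isolation. This is a hedged illustration with hypothetical names (`extent_state_sketch`, `needs_split_sketch`, and the two bit macros); it models only the decision of whether a split is needed, not the tree manipulation in __set_extent_bit().

```c
#include <stdint.h>

/* Hypothetical bit values, for illustration only. */
#define EXTENT_DIRTY_SKETCH 0x1u
#define EXTENT_OTHER_SKETCH 0x2u

/* Hypothetical, simplified extent state record with an inclusive end. */
struct extent_state_sketch {
	uint64_t start;
	uint64_t end;
	unsigned int bits;
};

/* Return 1 when the record must be split at `start` before setting `bits`. */
static int needs_split_sketch(const struct extent_state_sketch *state,
			      uint64_t start, unsigned int bits)
{
	if (state->start >= start)
		return 0;	/* desired range does not begin inside the record */
	if ((state->bits & bits) == bits)
		return 0;	/* every requested bit is already set: skip */
	return 1;
}
```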
      
      Setting a bit that is already set for a range is very common in the
      inode's 'file_extent_tree' extent io tree for example, where we keep
      setting the EXTENT_DIRTY bit every time we replace an extent.
      
      This change also fixes a bug that happens after the recent patchset from
      Josef that avoids having implicit holes after a power failure when not
      using the NO_HOLES feature, more specifically the patch with the subject:
      
        "btrfs: introduce the inode->file_extent_tree"
      
      This patch introduced an extent io tree per inode to keep track of
      completed ordered extents and figure out at any time what is the safe
      value for the inode's disk_i_size. This assumes that for contiguous
      ranges in a file we always end up with a single extent state record in
      the io tree, but that is not the case, as there is a short time window
      where we can have two extent state records representing contiguous
      ranges. When this happens we end up setting an incorrect value for the
      inode's disk_i_size, resulting in data loss after a clean unmount
      of the filesystem. The following example explains how this can happen.
      
      Suppose we have an inode with an i_size and a disk_i_size of 1MiB, so in
      the inode's file_extent_tree we have a single extent state record that
      represents the range [0, 1MiB) with the EXTENT_DIRTY bit set. Then the
      following steps happen:
      
      1) A buffered write against file range [512KiB, 768KiB) is made. At this
         point delalloc was not flushed yet;
      
      2) Deduplication from some other inode into this inode's range
         [128KiB, 256KiB) is made. This causes btrfs_inode_set_file_extent_range()
         to be called, from btrfs_insert_clone_extent(), to mark the range
         [128KiB, 256KiB) with EXTENT_DIRTY in the inode's file_extent_tree;
      
      3) When btrfs_inode_set_file_extent_range() calls set_extent_bits(), we
         end up at __set_extent_bit(). In the first iteration of that function's
         loop we end up in the following branch:
      
         $ cat -n fs/btrfs/extent_io.c
          957  static int __must_check
          958  __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
         (...)
         1044          /*
         1045           *     | ---- desired range ---- |
         1046           * | state |
         1047           *   or
         1048           * | ------------- state -------------- |
         1049           *
         (...)
         1060          if (state->start < start) {
         1061                  if (state->state & exclusive_bits) {
         1062                          *failed_start = start;
         1063                          err = -EEXIST;
         1064                          goto out;
         1065                  }
         1066
         1067                  prealloc = alloc_extent_state_atomic(prealloc);
         1068                  BUG_ON(!prealloc);
         1069                  err = split_state(tree, state, prealloc, start);
         1070                  if (err)
         1071                          extent_io_tree_panic(tree, err);
         1072
         1073                  prealloc = NULL;
         (...)
         1089                  goto search_again;
      
         This splits the state record into two, one for range [0, 128KiB) and
         another for the range [128KiB, 1MiB). Both already have the EXTENT_DIRTY
         bit set. Then we jump to the 'search_again' label, where we unlock
         the spinlock protecting the extent io tree before jumping to the
         'again' label to perform the next iteration;
      
      4) In the meanwhile, delalloc is flushed, the ordered extent for the range
         [512KiB, 768KiB) is created and when it completes, at
         btrfs_finish_ordered_io(), it calls btrfs_inode_safe_disk_i_size_write()
         with a value of 0 for its 'new_size' argument;
      
      5) Before the deduplication task currently at __set_extent_bit() moves to
         the next iteration, the task finishing the ordered extent calls
         find_first_extent_bit() through btrfs_inode_safe_disk_i_size_write()
         and gets 'start' set to 0 and 'end' set to 128KiB - because at this
         moment the io tree has two extent state records, one representing the
         range [0, 128KiB) and another representing the range [128KiB, 1MiB),
         both with EXTENT_DIRTY set. Then we set 'isize' to:
      
         isize = min(isize, end + 1)
               = min(1MiB, 128KiB - 1 + 1)
               = 128KiB
      
         Then we set the inode's disk_i_size to 128KiB (isize).
      
         After a clean unmount of the filesystem and mounting it again, we have
         the file with a size of 128KiB, and effectively lost all the data it
         had before in the range from 128KiB to 1MiB.
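
The arithmetic in step 5 can be checked with a small sketch. The helper name `safe_isize_sketch` and the size macros are hypothetical; the only thing taken from the text is the formula isize = min(isize, end + 1), where 'end' is the inclusive last byte of the first extent state record found.

```c
#include <stdint.h>

#define KIB_SKETCH 1024ULL
#define MIB_SKETCH (1024ULL * KIB_SKETCH)

/* isize = min(isize, end + 1), with `end` being an inclusive offset. */
static uint64_t safe_isize_sketch(uint64_t isize, uint64_t end)
{
	uint64_t limit = end + 1;

	return isize < limit ? isize : limit;
}
```

With the record temporarily split, the first record ends at 128KiB - 1, so the result collapses to 128KiB; with a single record covering [0, 1MiB) the result stays at 1MiB.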
      
      This change fixes that issue too, as we never end up splitting extent
      state records when they already have all the bits we want set.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      55ffaabe
    • D
      btrfs: sink argument tree to __do_readpage · f657a31c
      David Sterba authored
      The tree pointer can be safely read from the inode, use it and drop the
      redundant argument.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f657a31c
    • D
      btrfs: sink argument tree to contiguous_readpages · b6660e80
      David Sterba authored
      The tree pointer can be safely read from the inode, use it and drop the
      redundant argument.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b6660e80
    • D
      btrfs: sink argument tree to __extent_read_full_page · 0d44fea7
      David Sterba authored
      The tree pointer can be safely read from the inode, use it and drop the
      redundant argument.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0d44fea7
    • D
      btrfs: sink argument tree to extent_read_full_page · 71ad38b4
      David Sterba authored
      The tree pointer can be safely read from the page's inode, use it and
      drop the redundant argument.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      71ad38b4
    • D
      btrfs: drop argument tree from btrfs_lock_and_flush_ordered_range · b272ae22
      David Sterba authored
      The tree pointer can be safely read from the inode so we can drop the
      redundant argument from btrfs_lock_and_flush_ordered_range.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b272ae22
    • D
      btrfs: add assertions for tree == inode->io_tree to extent IO helpers · ae6957eb
      David Sterba authored
      Add assertions to all helpers that get tree as argument and verify that
      it's the same that can be obtained from the inode or from its pages. In
      followup patches the redundant arguments and assertions will be removed
      one by one.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ae6957eb
    • D
      btrfs: drop argument tree from submit_extent_page · 0ceb34bf
      David Sterba authored
      Now that we're sure the tree from argument is same as the one we can get
      from the page's inode io_tree, drop the redundant argument.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0ceb34bf
    • D
      btrfs: remove extent_page_data::tree · 45b08405
      David Sterba authored
      All functions that set up extent_page_data::tree set it to the inode
      io_tree. That's passed down the callstack that accesses either the same
      inode or its pages. In the end submit_extent_page can pull the tree out
      of the page and we don't have to store it in the structure.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      45b08405
    • J
      btrfs: introduce per-inode file extent tree · 41a2ee75
      Josef Bacik authored
      In order to keep track of where we have file extents on disk, and thus
      where it is safe to adjust the i_size to, we need to have a tree in
      place to keep track of the contiguous areas we have file extents for.
      
      Add helpers to use this tree, as it's not required for NO_HOLES file
      systems.  We will use this by setting DIRTY for areas we know we have
      file extent items set, and clearing it when we remove file extent items
      for truncation.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      41a2ee75