1. 07 10月, 2020 15 次提交
  2. 27 8月, 2020 1 次提交
    • J
      btrfs: fix potential deadlock in the search ioctl · a48b73ec
      Josef Bacik 提交于
      With the conversion of the tree locks to rwsem I got the following
      lockdep splat:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.8.0-rc7-00165-g04ec4da5f45f-dirty #922 Not tainted
        ------------------------------------------------------
        compsize/11122 is trying to acquire lock:
        ffff889fabca8768 (&mm->mmap_lock#2){++++}-{3:3}, at: __might_fault+0x3e/0x90
      
        but task is already holding lock:
        ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (btrfs-fs-00){++++}-{3:3}:
      	 down_write_nested+0x3b/0x70
      	 __btrfs_tree_lock+0x24/0x120
      	 btrfs_search_slot+0x756/0x990
      	 btrfs_lookup_inode+0x3a/0xb4
      	 __btrfs_update_delayed_inode+0x93/0x270
      	 btrfs_async_run_delayed_root+0x168/0x230
      	 btrfs_work_helper+0xd4/0x570
      	 process_one_work+0x2ad/0x5f0
      	 worker_thread+0x3a/0x3d0
      	 kthread+0x133/0x150
      	 ret_from_fork+0x1f/0x30
      
        -> #1 (&delayed_node->mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x9f/0x930
      	 btrfs_delayed_update_inode+0x50/0x440
      	 btrfs_update_inode+0x8a/0xf0
      	 btrfs_dirty_inode+0x5b/0xd0
      	 touch_atime+0xa1/0xd0
      	 btrfs_file_mmap+0x3f/0x60
      	 mmap_region+0x3a4/0x640
      	 do_mmap+0x376/0x580
      	 vm_mmap_pgoff+0xd5/0x120
      	 ksys_mmap_pgoff+0x193/0x230
      	 do_syscall_64+0x50/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (&mm->mmap_lock#2){++++}-{3:3}:
      	 __lock_acquire+0x1272/0x2310
      	 lock_acquire+0x9e/0x360
      	 __might_fault+0x68/0x90
      	 _copy_to_user+0x1e/0x80
      	 copy_to_sk.isra.32+0x121/0x300
      	 search_ioctl+0x106/0x200
      	 btrfs_ioctl_tree_search_v2+0x7b/0xf0
      	 btrfs_ioctl+0x106f/0x30a0
      	 ksys_ioctl+0x83/0xc0
      	 __x64_sys_ioctl+0x16/0x20
      	 do_syscall_64+0x50/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        other info that might help us debug this:
      
        Chain exists of:
          &mm->mmap_lock#2 --> &delayed_node->mutex --> btrfs-fs-00
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(btrfs-fs-00);
      				 lock(&delayed_node->mutex);
      				 lock(btrfs-fs-00);
          lock(&mm->mmap_lock#2);
      
         *** DEADLOCK ***
      
        1 lock held by compsize/11122:
         #0: ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
      
        stack backtrace:
        CPU: 17 PID: 11122 Comm: compsize Kdump: loaded Not tainted 5.8.0-rc7-00165-g04ec4da5f45f-dirty #922
        Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
        Call Trace:
         dump_stack+0x78/0xa0
         check_noncircular+0x165/0x180
         __lock_acquire+0x1272/0x2310
         lock_acquire+0x9e/0x360
         ? __might_fault+0x3e/0x90
         ? find_held_lock+0x72/0x90
         __might_fault+0x68/0x90
         ? __might_fault+0x3e/0x90
         _copy_to_user+0x1e/0x80
         copy_to_sk.isra.32+0x121/0x300
         ? btrfs_search_forward+0x2a6/0x360
         search_ioctl+0x106/0x200
         btrfs_ioctl_tree_search_v2+0x7b/0xf0
         btrfs_ioctl+0x106f/0x30a0
         ? __do_sys_newfstat+0x5a/0x70
         ? ksys_ioctl+0x83/0xc0
         ksys_ioctl+0x83/0xc0
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x50/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The problem is we're doing a copy_to_user() while holding tree locks,
      which can deadlock if we have to do a page fault for the copy_to_user().
      This exists even without my locking changes, so it needs to be fixed.
      Rework the search ioctl to do the pre-fault and then
      copy_to_user_nofault for the copying.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a48b73ec
  3. 27 7月, 2020 12 次提交
    • F
      btrfs: do not set the full sync flag on the inode during page release · 5e548b32
      Filipe Manana 提交于
      When removing an extent map at try_release_extent_mapping(), called through
      the page release callback (btrfs_releasepage()), we always set the full
      sync flag on the inode, which forces the next fsync to use a slower code
      path.
      
      This hurts performance for workloads that dirty an amount of data that
      exceeds or is very close to the system's RAM memory and do frequent fsync
      operations (like database servers can for example). In particular if there
      are concurrent fsyncs against different files, by falling back to a full
      fsync we do a lot more checksum lookups in the checksums btree, as we do
      it for all the extents created in the current transaction, instead of only
      the new ones since the last fsync. These checksums lookups not only take
      some time but, more importantly, they also cause contention on the
      checksums btree locks due to the concurrency with checksum insertions in
      the btree by ordered extents from other inodes.
      
      We actually don't need to set the full sync flag on the inode, because we
      only remove extent maps that are in the list of modified extents if they
      were created in a past transaction, in which case an fsync skips them as
      it's pointless to log them. So stop setting the full fsync flag on the
      inode whenever we remove an extent map.
      
      This patch is part of a patchset that consists of 3 patches, which have
      the following subjects:
      
      1/3 btrfs: fix race between page release and a fast fsync
      2/3 btrfs: release old extent maps during page release
      3/3 btrfs: do not set the full sync flag on the inode during page release
      
      Performance tests were ran against a branch (misc-next) containing the
      whole patchset. The test exercises a workload where there are multiple
      processes writing to files and fsyncing them (each writing and fsyncing
      its own file), and in total the amount of data dirtied ranges from 2x to
      4x the system's RAM memory (16GiB), so that the page release callback is
      invoked frequently.
      
      The following script, using fio, was used to perform the tests:
      
        $ cat test-fsync.sh
        #!/bin/bash
      
        DEV=/dev/sdk
        MNT=/mnt/sdk
        MOUNT_OPTIONS="-o ssd"
        MKFS_OPTIONS="-d single -m single"
      
        if [ $# -ne 3 ]; then
            echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ"
            exit 1
        fi
      
        NUM_JOBS=$1
        FILE_SIZE=$2
        FSYNC_FREQ=$3
      
        cat <<EOF > /tmp/fio-job.ini
        [writers]
        rw=write
        fsync=$FSYNC_FREQ
        fallocate=none
        group_reporting=1
        direct=0
        bs=64k
        ioengine=sync
        size=$FILE_SIZE
        directory=$MNT
        numjobs=$NUM_JOBS
        thread
        EOF
      
        echo "Using config:"
        echo
        cat /tmp/fio-job.ini
        echo
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
        mount $MOUNT_OPTIONS $DEV $MNT
        fio /tmp/fio-job.ini
        umount $MNT
      
      The tests were performed for different numbers of jobs, file sizes and
      fsync frequency. A qemu VM using kvm was used, with 8 cores (the host has
      12 cores, with cpu governance set to performance mode on all cores), 16GiB
      of ram (the host has 64GiB) and using a NVMe device directly (without an
      intermediary filesystem in the host). While running the tests, the host
      was not used for anything else, to avoid disturbing the tests.
      
      The obtained results were the following, and the last line printed by
      fio is pasted (includes aggregated throughput and test run time).
      
          *****************************************************
          ****     1 job, 32GiB file, fsync frequency 1     ****
          *****************************************************
      
      Before patchset:
      
      WRITE: bw=29.1MiB/s (30.5MB/s), 29.1MiB/s-29.1MiB/s (30.5MB/s-30.5MB/s), io=32.0GiB (34.4GB), run=1127557-1127557msec
      
      After patchset:
      
      WRITE: bw=29.3MiB/s (30.7MB/s), 29.3MiB/s-29.3MiB/s (30.7MB/s-30.7MB/s), io=32.0GiB (34.4GB), run=1119042-1119042msec
      (+0.7% throughput, -0.8% run time)
      
          *****************************************************
          ****     2 jobs, 16GiB files, fsync frequency 1   ****
          *****************************************************
      
      Before patchset:
      
      WRITE: bw=33.5MiB/s (35.1MB/s), 33.5MiB/s-33.5MiB/s (35.1MB/s-35.1MB/s), io=32.0GiB (34.4GB), run=979000-979000msec
      
      After patchset:
      
      WRITE: bw=39.9MiB/s (41.8MB/s), 39.9MiB/s-39.9MiB/s (41.8MB/s-41.8MB/s), io=32.0GiB (34.4GB), run=821283-821283msec
      (+19.1% throughput, -16.1% runtime)
      
          *****************************************************
          ****     4 jobs, 8GiB files, fsync frequency 1    ****
          *****************************************************
      
      Before patchset:
      
      WRITE: bw=52.1MiB/s (54.6MB/s), 52.1MiB/s-52.1MiB/s (54.6MB/s-54.6MB/s), io=32.0GiB (34.4GB), run=629130-629130msec
      
      After patchset:
      
      WRITE: bw=71.8MiB/s (75.3MB/s), 71.8MiB/s-71.8MiB/s (75.3MB/s-75.3MB/s), io=32.0GiB (34.4GB), run=456357-456357msec
      (+37.8% throughput, -27.5% runtime)
      
          *****************************************************
          ****     8 jobs, 4GiB files, fsync frequency 1    ****
          *****************************************************
      
      Before patchset:
      
      WRITE: bw=76.1MiB/s (79.8MB/s), 76.1MiB/s-76.1MiB/s (79.8MB/s-79.8MB/s), io=32.0GiB (34.4GB), run=430708-430708msec
      
      After patchset:
      
      WRITE: bw=133MiB/s (140MB/s), 133MiB/s-133MiB/s (140MB/s-140MB/s), io=32.0GiB (34.4GB), run=245458-245458msec
      (+74.7% throughput, -43.0% run time)
      
          *****************************************************
          ****    16 jobs, 2GiB files, fsync frequency 1    ****
          *****************************************************
      
      Before patchset:
      
      WRITE: bw=74.7MiB/s (78.3MB/s), 74.7MiB/s-74.7MiB/s (78.3MB/s-78.3MB/s), io=32.0GiB (34.4GB), run=438625-438625msec
      
      After patchset:
      
      WRITE: bw=184MiB/s (193MB/s), 184MiB/s-184MiB/s (193MB/s-193MB/s), io=32.0GiB (34.4GB), run=177864-177864msec
      (+146.3% throughput, -59.5% run time)
      
          *****************************************************
          ****    32 jobs, 2GiB files, fsync frequency 1    ****
          *****************************************************
      
      Before patchset:
      
      WRITE: bw=72.6MiB/s (76.1MB/s), 72.6MiB/s-72.6MiB/s (76.1MB/s-76.1MB/s), io=64.0GiB (68.7GB), run=902615-902615msec
      
      After patchset:
      
      WRITE: bw=227MiB/s (238MB/s), 227MiB/s-227MiB/s (238MB/s-238MB/s), io=64.0GiB (68.7GB), run=288936-288936msec
      (+212.7% throughput, -68.0% run time)
      
          *****************************************************
          ****    64 jobs, 1GiB files, fsync frequency 1    ****
          *****************************************************
      
      Before patchset:
      
      WRITE: bw=98.8MiB/s (104MB/s), 98.8MiB/s-98.8MiB/s (104MB/s-104MB/s), io=64.0GiB (68.7GB), run=663126-663126msec
      
      After patchset:
      
      WRITE: bw=294MiB/s (308MB/s), 294MiB/s-294MiB/s (308MB/s-308MB/s), io=64.0GiB (68.7GB), run=222940-222940msec
      (+197.6% throughput, -66.4% run time)
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5e548b32
    • F
      btrfs: release old extent maps during page release · fbc2bd7e
      Filipe Manana 提交于
      When removing an extent map at try_release_extent_mapping(), called through
      the page release callback (btrfs_releasepage()), we never release an extent
      map that is in the list of modified extents. This is to prevent races with
      a concurrent fsync using the fast path, which could lead to not logging an
      extent created in the current transaction.
      
      However we can safely remove an extent map created in a past transaction
      that is still in the list of modified extents (because no one fsynced yet
      the inode after that transaction got commited), because such extents are
      skipped during an fsync as it is pointless to log them. This change does
      that.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fbc2bd7e
    • F
      btrfs: fix race between page release and a fast fsync · 3d6448e6
      Filipe Manana 提交于
      When releasing an extent map, done through the page release callback, we
      can race with an ongoing fast fsync and cause the fsync to miss a new
      extent and not log it. The steps for this to happen are the following:
      
      1) A page is dirtied for some inode I;
      
      2) Writeback for that page is triggered by a path other than fsync, for
         example by the system due to memory pressure;
      
      3) When the ordered extent for the extent (a single 4K page) finishes,
         we unpin the corresponding extent map and set its generation to N,
         the current transaction's generation;
      
      4) The btrfs_releasepage() callback is invoked by the system due to
         memory pressure for that no longer dirty page of inode I;
      
      5) At the same time, some task calls fsync on inode I, joins transaction
         N, and at btrfs_log_inode() it sees that the inode does not have the
         full sync flag set, so we proceed with a fast fsync. But before we get
         into btrfs_log_changed_extents() and lock the inode's extent map tree:
      
      6) Through btrfs_releasepage() we end up at try_release_extent_mapping()
         and we remove the extent map for the new 4Kb extent, because it is
         neither pinned anymore nor locked. By calling remove_extent_mapping(),
         we remove the extent map from the list of modified extents, since the
         extent map does not have the logging flag set. We unlock the inode's
         extent map tree;
      
      7) The task doing the fast fsync now enters btrfs_log_changed_extents(),
         locks the inode's extent map tree and iterates its list of modified
         extents, which no longer has the 4Kb extent in it, so it does not log
         the extent;
      
      8) The fsync finishes;
      
      9) Before transaction N is committed, a power failure happens. After
         replaying the log, the 4K extent of inode I will be missing, since
         it was not logged due to the race with try_release_extent_mapping().
      
      So fix this by teaching try_release_extent_mapping() to not remove an
      extent map if it's still in the list of modified extents.
      
      Fixes: ff44c6e3 ("Btrfs: do not hold the write_lock on the extent tree while logging")
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3d6448e6
    • J
      btrfs: return EROFS for BTRFS_FS_STATE_ERROR cases · fbabd4a3
      Josef Bacik 提交于
      Eric reported seeing this message while running generic/475
      
        BTRFS: error (device dm-3) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted
      
      Full stack trace:
      
        BTRFS: error (device dm-0) in btrfs_commit_transaction:2323: errno=-5 IO failure (Error while writing out transaction)
        BTRFS info (device dm-0): forced readonly
        BTRFS warning (device dm-0): Skipping commit of aborted transaction.
        ------------[ cut here ]------------
        BTRFS: error (device dm-0) in cleanup_transaction:1894: errno=-5 IO failure
        BTRFS: Transaction aborted (error -117)
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6480 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6488 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6490 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6498 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64c0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85e8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85f0 len 4096 err no 10
        WARNING: CPU: 3 PID: 23985 at fs/btrfs/tree-log.c:3084 btrfs_sync_log+0xbc8/0xd60 [btrfs]
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4288 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4290 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4298 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42d0 len 4096 err no 10
        CPU: 3 PID: 23985 Comm: fsstress Tainted: G        W    L    5.8.0-rc4-default+ #1181
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
        RIP: 0010:btrfs_sync_log+0xbc8/0xd60 [btrfs]
        RSP: 0018:ffff909a44d17bd0 EFLAGS: 00010286
        RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000001
        RDX: ffff8f3be41cb940 RSI: ffffffffb0108d2b RDI: ffffffffb0108ff7
        RBP: ffff909a44d17e70 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000037988 R12: ffff8f3bd20e4000
        R13: ffff8f3bd20e4428 R14: 00000000ffffff8b R15: ffff909a44d17c70
        FS:  00007f6a6ed3fb80(0000) GS:ffff8f3c3dc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f6a6ed3e000 CR3: 00000000525c0003 CR4: 0000000000160ee0
        Call Trace:
         ? finish_wait+0x90/0x90
         ? __mutex_unlock_slowpath+0x45/0x2a0
         ? lock_acquire+0xa3/0x440
         ? lockref_put_or_lock+0x9/0x30
         ? dput+0x20/0x4a0
         ? dput+0x20/0x4a0
         ? do_raw_spin_unlock+0x4b/0xc0
         ? _raw_spin_unlock+0x1f/0x30
         btrfs_sync_file+0x335/0x490 [btrfs]
         do_fsync+0x38/0x70
         __x64_sys_fsync+0x10/0x20
         do_syscall_64+0x50/0xe0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f6a6ef1b6e3
        Code: Bad RIP value.
        RSP: 002b:00007ffd01e20038 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
        RAX: ffffffffffffffda RBX: 000000000007a120 RCX: 00007f6a6ef1b6e3
        RDX: 00007ffd01e1ffa0 RSI: 00007ffd01e1ffa0 RDI: 0000000000000003
        RBP: 0000000000000003 R08: 0000000000000001 R09: 00007ffd01e2004c
        R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000009f
        R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
        irq event stamp: 0
        hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        hardirqs last disabled at (0): [<ffffffffb007fe0b>] copy_process+0x67b/0x1b00
        softirqs last  enabled at (0): [<ffffffffb007fe0b>] copy_process+0x67b/0x1b00
        softirqs last disabled at (0): [<0000000000000000>] 0x0
        ---[ end trace af146e0e38433456 ]---
        BTRFS: error (device dm-0) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted
      
      This ret came from btrfs_write_marked_extents().  If we get an aborted
      transaction via EIO before, we'll see it in btree_write_cache_pages()
      and return EUCLEAN, which gets printed as "Filesystem corrupted".
      
      Except we shouldn't be returning EUCLEAN here, we need to be returning
      EROFS because EUCLEAN is reserved for actual corruption, not IO errors.
      
      We are inconsistent about our handling of BTRFS_FS_STATE_ERROR
      elsewhere, but we want to use EROFS for this particular case.  The
      original transaction abort has the real error code for why we ended up
      with an aborted transaction, all subsequent actions just need to return
      EROFS because they may not have a trans handle and have no idea about
      the original cause of the abort.
      
      After patch "btrfs: don't WARN if we abort a transaction with EROFS" the
      stacktrace will not be dumped either.
      Reported-by: NEric Sandeen <esandeen@redhat.com>
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ add full test stacktrace ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fbabd4a3
    • N
      btrfs: remove done label in writepage_delalloc · b69d1ee9
      Nikolay Borisov 提交于
      Since there is not common cleanup run after the label it makes it
      somewhat redundant.
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b69d1ee9
    • N
      btrfs: streamline btrfs_get_io_failure_record logic · 3526302f
      Nikolay Borisov 提交于
      Make the function directly return a pointer to a failure record and
      adjust callers to handle it. Also refactor the logic inside so that
      the case which allocates the failure record for the first time is not
      handled in an 'if' arm, saving us a level of indentation. Finally make
      the function static as it's not used outside of extent_io.c .
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3526302f
    • N
      btrfs: make get_state_failrec return failrec directly · 2279a270
      Nikolay Borisov 提交于
      Only failure that get_state_failrec can get is if there is no failure
      for the given address. There is no reason why the function should return
      a status code and use a separate parameter for returning the actual
      failure rec (if one is found). Simplify it by making the return type
      a pointer and return ERR_PTR value in case of errors.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2279a270
    • N
      btrfs: make writepage_delalloc take btrfs_inode · cd4c0bf9
      Nikolay Borisov 提交于
      Only find_lock_delalloc_range uses vfs_inode so let's take the
      btrfs_inode as a parameter.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      cd4c0bf9
    • N
      btrfs: make __extent_writepage_io take btrfs_inode · d4580fe2
      Nikolay Borisov 提交于
      It has only a single use for a generic vfs inode vs 3 for btrfs_inode.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d4580fe2
    • N
      btrfs: make btrfs_run_delalloc_range take btrfs_inode · 98456b9c
      Nikolay Borisov 提交于
      All children now take btrfs_inode so convert it to taking it as a
      parameter as well.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      98456b9c
    • D
      btrfs: don't use UAPI types for fiemap callback · bab16e21
      David Sterba 提交于
      The fiemap callback is not part of UAPI interface and the prototypes
      don't have the __u64 types either.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bab16e21
    • N
      btrfs: make extent_clear_unlock_delalloc take btrfs_inode · ad7ff17b
      Nikolay Borisov 提交于
      It has one VFS and 1 btrfs inode usages but converting it to btrfs_inode
      interface will allow seamless conversion of its callers.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ad7ff17b
  4. 22 7月, 2020 1 次提交
  5. 02 7月, 2020 1 次提交
    • B
      btrfs: fix fatal extent_buffer readahead vs releasepage race · 6bf9cd2e
      Boris Burkov 提交于
      Under somewhat convoluted conditions, it is possible to attempt to
      release an extent_buffer that is under io, which triggers a BUG_ON in
      btrfs_release_extent_buffer_pages.
      
      This relies on a few different factors. First, extent_buffer reads done
      as readahead for searching use WAIT_NONE, so they free the local extent
      buffer reference while the io is outstanding. However, they should still
      be protected by TREE_REF. However, if the system is doing signficant
      reclaim, and simultaneously heavily accessing the extent_buffers, it is
      possible for releasepage to race with two concurrent readahead attempts
      in a way that leaves TREE_REF unset when the readahead extent buffer is
      released.
      
      Essentially, if two tasks race to allocate a new extent_buffer, but the
      winner who attempts the first io is rebuffed by a page being locked
      (likely by the reclaim itself) then the loser will still go ahead with
      issuing the readahead. The loser's call to find_extent_buffer must also
      race with the reclaim task reading the extent_buffer's refcount as 1 in
      a way that allows the reclaim to re-clear the TREE_REF checked by
      find_extent_buffer.
      
      The following represents an example execution demonstrating the race:
      
                  CPU0                                                         CPU1                                           CPU2
      reada_for_search                                            reada_for_search
        readahead_tree_block                                        readahead_tree_block
          find_create_tree_block                                      find_create_tree_block
            alloc_extent_buffer                                         alloc_extent_buffer
                                                                        find_extent_buffer // not found
                                                                        allocates eb
                                                                        lock pages
                                                                        associate pages to eb
                                                                        insert eb into radix tree
                                                                        set TREE_REF, refs == 2
                                                                        unlock pages
                                                                    read_extent_buffer_pages // WAIT_NONE
                                                                      not uptodate (brand new eb)
                                                                                                                  lock_page
                                                                      if !trylock_page
                                                                        goto unlock_exit // not an error
                                                                    free_extent_buffer
                                                                      release_extent_buffer
                                                                        atomic_dec_and_test refs to 1
              find_extent_buffer // found
                                                                                                                  try_release_extent_buffer
                                                                                                                    take refs_lock
                                                                                                                    reads refs == 1; no io
                atomic_inc_not_zero refs to 2
                mark_buffer_accessed
                  check_buffer_tree_ref
                    // not STALE, won't take refs_lock
                    refs == 2; TREE_REF set // no action
          read_extent_buffer_pages // WAIT_NONE
                                                                                                                    clear TREE_REF
                                                                                                                    release_extent_buffer
                                                                                                                      atomic_dec_and_test refs to 1
                                                                                                                      unlock_page
            still not uptodate (CPU1 read failed on trylock_page)
            locks pages
            set io_pages > 0
            submit io
            return
          free_extent_buffer
            release_extent_buffer
              dec refs to 0
              delete from radix tree
              btrfs_release_extent_buffer_pages
                BUG_ON(io_pages > 0)!!!
      
      We observe this at a very low rate in production and were also able to
      reproduce it in a test environment by introducing some spurious delays
      and by introducing probabilistic trylock_page failures.
      
      To fix it, we apply check_tree_ref at a point where it could not
      possibly be unset by a competing task: after io_pages has been
      incremented. All the codepaths that clear TREE_REF check for io, so they
      would not be able to clear it after this point until the io is done.
      
      Stack trace, for reference:
      [1417839.424739] ------------[ cut here ]------------
      [1417839.435328] kernel BUG at fs/btrfs/extent_io.c:4841!
      [1417839.447024] invalid opcode: 0000 [#1] SMP
      [1417839.502972] RIP: 0010:btrfs_release_extent_buffer_pages+0x20/0x1f0
      [1417839.517008] Code: ed e9 ...
      [1417839.558895] RSP: 0018:ffffc90020bcf798 EFLAGS: 00010202
      [1417839.570816] RAX: 0000000000000002 RBX: ffff888102d6def0 RCX: 0000000000000028
      [1417839.586962] RDX: 0000000000000002 RSI: ffff8887f0296482 RDI: ffff888102d6def0
      [1417839.603108] RBP: ffff88885664a000 R08: 0000000000000046 R09: 0000000000000238
      [1417839.619255] R10: 0000000000000028 R11: ffff88885664af68 R12: 0000000000000000
      [1417839.635402] R13: 0000000000000000 R14: ffff88875f573ad0 R15: ffff888797aafd90
      [1417839.651549] FS:  00007f5a844fa700(0000) GS:ffff88885f680000(0000) knlGS:0000000000000000
      [1417839.669810] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [1417839.682887] CR2: 00007f7884541fe0 CR3: 000000049f609002 CR4: 00000000003606e0
      [1417839.699037] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [1417839.715187] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [1417839.731320] Call Trace:
      [1417839.737103]  release_extent_buffer+0x39/0x90
      [1417839.746913]  read_block_for_search.isra.38+0x2a3/0x370
      [1417839.758645]  btrfs_search_slot+0x260/0x9b0
      [1417839.768054]  btrfs_lookup_file_extent+0x4a/0x70
      [1417839.778427]  btrfs_get_extent+0x15f/0x830
      [1417839.787665]  ? submit_extent_page+0xc4/0x1c0
      [1417839.797474]  ? __do_readpage+0x299/0x7a0
      [1417839.806515]  __do_readpage+0x33b/0x7a0
      [1417839.815171]  ? btrfs_releasepage+0x70/0x70
      [1417839.824597]  extent_readpages+0x28f/0x400
      [1417839.833836]  read_pages+0x6a/0x1c0
      [1417839.841729]  ? startup_64+0x2/0x30
      [1417839.849624]  __do_page_cache_readahead+0x13c/0x1a0
      [1417839.860590]  filemap_fault+0x6c7/0x990
      [1417839.869252]  ? xas_load+0x8/0x80
      [1417839.876756]  ? xas_find+0x150/0x190
      [1417839.884839]  ? filemap_map_pages+0x295/0x3b0
      [1417839.894652]  __do_fault+0x32/0x110
      [1417839.902540]  __handle_mm_fault+0xacd/0x1000
      [1417839.912156]  handle_mm_fault+0xaa/0x1c0
      [1417839.921004]  __do_page_fault+0x242/0x4b0
      [1417839.930044]  ? page_fault+0x8/0x30
      [1417839.937933]  page_fault+0x1e/0x30
      [1417839.945631] RIP: 0033:0x33c4bae
      [1417839.952927] Code: Bad RIP value.
      [1417839.960411] RSP: 002b:00007f5a844f7350 EFLAGS: 00010206
      [1417839.972331] RAX: 000000000000006e RBX: 1614b3ff6a50398a RCX: 0000000000000000
      [1417839.988477] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002
      [1417840.004626] RBP: 00007f5a844f7420 R08: 000000000000006e R09: 00007f5a94aeccb8
      [1417840.020784] R10: 00007f5a844f7350 R11: 0000000000000000 R12: 00007f5a94aecc79
      [1417840.036932] R13: 00007f5a94aecc78 R14: 00007f5a94aecc90 R15: 00007f5a94aecc40
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NBoris Burkov <boris@bur.io>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6bf9cd2e
  6. 30 6月, 2020 1 次提交
    • P
      fs/btrfs: Add cond_resched() for try_release_extent_mapping() stalls · 9f47eb54
      Paul E. McKenney 提交于
      Very large I/Os can cause the following RCU CPU stall warning:
      
      RIP: 0010:rb_prev+0x8/0x50
      Code: 49 89 c0 49 89 d1 48 89 c2 48 89 f8 e9 e5 fd ff ff 4c 89 48 10 c3 4c =
      89 06 c3 4c 89 40 10 c3 0f 1f 00 48 8b 0f 48 39 cf 74 38 <48> 8b 47 10 48 85 c0 74 22 48 8b 50 08 48 85 d2 74 0c 48 89 d0 48
      RSP: 0018:ffffc9002212bab0 EFLAGS: 00000287 ORIG_RAX: ffffffffffffff13
      RAX: ffff888821f93630 RBX: ffff888821f93630 RCX: ffff888821f937e0
      RDX: 0000000000000000 RSI: 0000000000102000 RDI: ffff888821f93630
      RBP: 0000000000103000 R08: 000000000006c000 R09: 0000000000000238
      R10: 0000000000102fff R11: ffffc9002212bac8 R12: 0000000000000001
      R13: ffffffffffffffff R14: 0000000000102000 R15: ffff888821f937e0
       __lookup_extent_mapping+0xa0/0x110
       try_release_extent_mapping+0xdc/0x220
       btrfs_releasepage+0x45/0x70
       shrink_page_list+0xa39/0xb30
       shrink_inactive_list+0x18f/0x3b0
       shrink_lruvec+0x38e/0x6b0
       shrink_node+0x14d/0x690
       do_try_to_free_pages+0xc6/0x3e0
       try_to_free_mem_cgroup_pages+0xe6/0x1e0
       reclaim_high.constprop.73+0x87/0xc0
       mem_cgroup_handle_over_high+0x66/0x150
       exit_to_usermode_loop+0x82/0xd0
       do_syscall_64+0xd4/0x100
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      On a PREEMPT=n kernel, the try_release_extent_mapping() function's
      "while" loop might run for a very long time on a large I/O.  This commit
      therefore adds a cond_resched() to this loop, providing RCU any needed
      quiescent states.
      Signed-off-by: NPaul E. McKenney <paulmck@kernel.org>
      9f47eb54
  7. 03 6月, 2020 2 次提交
  8. 25 5月, 2020 7 次提交