1. 02 Nov, 2021 1 commit
    • L
      btrfs: fix lzo_decompress_bio() kmap leakage · 2cf3f813
      Linus Torvalds committed
      Commit ccaa66c8 reinstated the kmap/kunmap that had been dropped in
      commit 8c945d32 ("btrfs: compression: drop kmap/kunmap from lzo").
      
      However, it seems to have done so incorrectly due to the change not
      reverting cleanly, and lzo_decompress_bio() ended up not having a
      matching "kunmap()" to the "kmap()" that was put back.
      
      Also, any assert that the page pointer is not NULL should be before the
      kmap() of said pointer, since otherwise you'd just oops in the kmap()
      before the assert would even trigger.
      
      I noticed this when trying to verify my btrfs merge, and things not
      adding up.  I'm doing this fixup before re-doing my merge, because this
      commit needs to also be backported to 5.15 (after verification from the
      btrfs people).
      
      Fixes: ccaa66c8 ("Revert 'btrfs: compression: drop kmap/kunmap from lzo'")
      Cc: David Sterba <dsterba@suse.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
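The two points made in this commit message (every kmap() needs a matching kunmap(), and a NULL check must come before the first use of the pointer) can be illustrated with a minimal user-space sketch. The names `fake_page`, `fake_kmap()`, `fake_kunmap()` and `copy_from_page()` are hypothetical stand-ins invented for this example; they are not the kernel API and not the code touched by the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space stand-in for struct page. */
struct fake_page {
	char data[64];
};

static int mapped_count; /* number of mappings currently outstanding */

/* Stand-in for kmap(): dereferences page, so a NULL page would crash here. */
void *fake_kmap(struct fake_page *page)
{
	mapped_count++;
	return page->data;
}

/* Stand-in for kunmap(): releases one mapping. */
void fake_kunmap(struct fake_page *page)
{
	(void)page;
	mapped_count--;
}

/* Copy up to len bytes out of a page, keeping map and unmap balanced. */
size_t copy_from_page(struct fake_page *page, char *dst, size_t len)
{
	char *src;

	assert(page != NULL); /* check first, before the pointer is used */
	if (len > sizeof(page->data))
		len = sizeof(page->data);

	src = fake_kmap(page);
	memcpy(dst, src, len);
	fake_kunmap(page); /* the matching unmap; omitting it leaks a mapping */

	return len;
}

/* Lets a caller verify that no mapping was leaked. */
int outstanding_mappings(void)
{
	return mapped_count;
}
```

After any number of calls to `copy_from_page()`, `outstanding_mappings()` should be back at zero; a nonzero value would be the analogue of the leak described above.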
  2. 01 Nov, 2021 1 commit
  3. 31 Oct, 2021 1 commit
  4. 29 Oct, 2021 13 commits
    • P
      io-wq: remove worker to owner tw dependency · 1d5f5ea7
      Pavel Begunkov committed
      INFO: task iou-wrk-6609:6612 blocked for more than 143 seconds.
            Not tainted 5.15.0-rc5-syzkaller #0
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      task:iou-wrk-6609    state:D stack:27944 pid: 6612 ppid:  6526 flags:0x00004006
      Call Trace:
       context_switch kernel/sched/core.c:4940 [inline]
       __schedule+0xb44/0x5960 kernel/sched/core.c:6287
       schedule+0xd3/0x270 kernel/sched/core.c:6366
       schedule_timeout+0x1db/0x2a0 kernel/time/timer.c:1857
       do_wait_for_common kernel/sched/completion.c:85 [inline]
       __wait_for_common kernel/sched/completion.c:106 [inline]
       wait_for_common kernel/sched/completion.c:117 [inline]
       wait_for_completion+0x176/0x280 kernel/sched/completion.c:138
       io_worker_exit fs/io-wq.c:183 [inline]
       io_wqe_worker+0x66d/0xc40 fs/io-wq.c:597
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
      
      io-wq worker may submit a task_work to the master task and upon
      io_worker_exit() wait for the tw to get executed. The problem appears
      when the master task is waiting in coredump.c:
      
      468                     freezer_do_not_count();
      469                     wait_for_completion(&core_state->startup);
      470                     freezer_count();
      
      Apparently, this dependency on child threads gets everything stuck.
      Work around it by cancelling the task_work callback that causes it
      before going into the io_worker_exit() wait.

      P.S. Probably a better option is to not submit a tw that elevates the
      refcount in the first place, but let's leave this exercise for the future.
      
      Cc: stable@vger.kernel.org
      Reported-and-tested-by: syzbot+27d62ee6f256b186883e@syzkaller.appspotmail.com
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/142a716f4ed936feae868959059154362bfa8c19.1635509451.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • J
      io_uring: harder fdinfo sq/cq ring iterating · f75d1183
      Jens Axboe committed
      The ring iteration is racy, which isn't necessarily a problem except it
      can cause us to iterate the whole thing. That isn't desired or ideal,
      and it can lead to excessive runtimes of reading fdinfo.
      
      Cap the iteration at tail - head OR the ring size. While in there, clean
      up the ring masking and just dump the raw values along with the masks.
      That provides more useful debug info.
      
      Fixes: 83f84356 ("io_uring: add more uring info to fdinfo for debug")
      Reported-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
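The capping rule described above ("tail - head OR the ring size") can be sketched in plain C as follows. This is an illustration under stated assumptions, not the io_uring code: `ring_walk_count()` and `ring_sum()` are hypothetical helpers, and the ring is assumed to have a power-of-two number of entries so that an index mask can be used.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of entries to walk, given possibly racy head/tail snapshots.
 * Unsigned subtraction handles index wraparound; the result is clamped
 * to ring_entries so an inconsistent snapshot cannot make the caller
 * iterate more than one full ring.
 */
uint32_t ring_walk_count(uint32_t head, uint32_t tail, uint32_t ring_entries)
{
	uint32_t pending = tail - head;

	return pending < ring_entries ? pending : ring_entries;
}

/* Sum the pending entries of a power-of-two sized ring. */
uint64_t ring_sum(const uint32_t *ring, uint32_t ring_entries,
		  uint32_t head, uint32_t tail)
{
	uint32_t mask = ring_entries - 1;
	uint32_t n = ring_walk_count(head, tail, ring_entries);
	uint64_t sum = 0;

	/* The masking below is only valid for power-of-two ring sizes. */
	assert(ring_entries != 0 && (ring_entries & mask) == 0);

	for (uint32_t i = 0; i < n; i++)
		sum += ring[(head + i) & mask];
	return sum;
}
```

With a four-entry ring, a snapshot where head appears to be ahead of tail (for example head 10, tail 3) yields a huge unsigned difference, which the clamp reduces to four iterations.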
    • D
      Revert "btrfs: compression: drop kmap/kunmap from lzo" · ccaa66c8
      David Sterba committed
      This reverts commit 8c945d32.

      The kmaps in the compression code are still needed; dropping them causes
      crashes on 32-bit machines (ARM, x86). Reproducible e.g. by running
      fstests btrfs/004 with LZO or ZSTD compression enabled.
      
      The revert does not apply cleanly due to changes in a6e66e6f
      ("btrfs: rework lzo_decompress_bio() to make it subpage compatible")
      that reworked the page iteration so the revert is done to be equivalent
      to the original code.
      
      Link: https://lore.kernel.org/all/CAJCQCtT+OuemovPO7GZk8Y8=qtOObr0XTDp8jh4OHD6y84AFxw@mail.gmail.com/
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=214839
      Tested-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • D
      Revert "btrfs: compression: drop kmap/kunmap from zlib" · 55276e14
      David Sterba committed
      This reverts commit 696ab562.

      The kmaps in the compression code are still needed; dropping them causes
      crashes on 32-bit machines (ARM, x86). Reproducible e.g. by running
      fstests btrfs/004 with LZO or ZSTD compression enabled.
      
      Link: https://lore.kernel.org/all/CAJCQCtT+OuemovPO7GZk8Y8=qtOObr0XTDp8jh4OHD6y84AFxw@mail.gmail.com/
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=214839
      Signed-off-by: David Sterba <dsterba@suse.com>
    • D
      Revert "btrfs: compression: drop kmap/kunmap from zstd" · 56ee254d
      David Sterba committed
      This reverts commit bbaf9715.

      The kmaps in the compression code are still needed; dropping them causes
      crashes on 32-bit machines (ARM, x86). Reproducible e.g. by running
      fstests btrfs/004 with LZO or ZSTD compression enabled.
      
      Example stacktrace with ZSTD on a 32bit ARM machine:
      
        Unable to handle kernel NULL pointer dereference at virtual address 00000000
        pgd = c4159ed3
        [00000000] *pgd=00000000
        Internal error: Oops: 5 [#1] PREEMPT SMP ARM
        Modules linked in:
        CPU: 0 PID: 210 Comm: kworker/u2:3 Not tainted 5.14.0-rc79+ #12
        Hardware name: Allwinner sun4i/sun5i Families
        Workqueue: btrfs-delalloc btrfs_work_helper
        PC is at mmiocpy+0x48/0x330
        LR is at ZSTD_compressStream_generic+0x15c/0x28c
      
        (mmiocpy) from [<c0629648>] (ZSTD_compressStream_generic+0x15c/0x28c)
        (ZSTD_compressStream_generic) from [<c06297dc>] (ZSTD_compressStream+0x64/0xa0)
        (ZSTD_compressStream) from [<c049444c>] (zstd_compress_pages+0x170/0x488)
        (zstd_compress_pages) from [<c0496798>] (btrfs_compress_pages+0x124/0x12c)
        (btrfs_compress_pages) from [<c043c068>] (compress_file_range+0x3c0/0x834)
        (compress_file_range) from [<c043c4ec>] (async_cow_start+0x10/0x28)
        (async_cow_start) from [<c0475c3c>] (btrfs_work_helper+0x100/0x230)
        (btrfs_work_helper) from [<c014ef68>] (process_one_work+0x1b4/0x418)
        (process_one_work) from [<c014f210>] (worker_thread+0x44/0x524)
        (worker_thread) from [<c0156aa4>] (kthread+0x180/0x1b0)
        (kthread) from [<c0100150>]
      
      Link: https://lore.kernel.org/all/CAJCQCtT+OuemovPO7GZk8Y8=qtOObr0XTDp8jh4OHD6y84AFxw@mail.gmail.com/
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=214839
      Signed-off-by: David Sterba <dsterba@suse.com>
    • F
      btrfs: remove root argument from check_item_in_log() · d1ed82f3
      Filipe Manana committed
      The root argument passed to check_item_in_log() always matches the root
      of the given directory, so it can be eliminated.

      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • F
      btrfs: remove root argument from add_link() · 6d9cc072
      Filipe Manana committed
      The root argument for tree-log.c:add_link() always matches the root of the
      given directory and the given inode, so it can be eliminated.

      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • F
      btrfs: remove root argument from btrfs_unlink_inode() · 4467af88
      Filipe Manana committed
      The root argument passed to btrfs_unlink_inode() and its callee,
      __btrfs_unlink_inode(), always matches the root of the given directory and
      the given inode. So remove the argument and make __btrfs_unlink_inode()
      use the root of the directory.

      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • F
      btrfs: remove root argument from drop_one_dir_item() · 9798ba24
      Filipe Manana committed
      The root argument for drop_one_dir_item() always matches the root of the
      given directory inode, since each log tree is associated with one and only
      one subvolume/root, so remove the argument.

      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • L
      btrfs: clear MISSING device status bit in btrfs_close_one_device · 5d03dbeb
      Li Zhang committed
      Reported bug: https://github.com/kdave/btrfs-progs/issues/389

      There's a problem with scrub reporting aborted status but returning
      error code 0, on a filesystem with a missing and re-added device.
      
      Roughly these steps:
      
      - mkfs -d raid1 dev1 dev2
      - fill with data
      - unmount
      - make dev1 disappear
      - mount -o degraded
      - copy more data
      - make dev1 appear again
      
      Running scrub afterwards reports that the command was aborted, but the
      system log message says the exit code was 0.
      
      The cause appears to be that fs_devices->missing_devices is decremented
      without device->dev_state being cleared. Every time the filesystem is
      unmounted, close_ctree() is called and eventually reaches
      btrfs_close_one_device() to close the device, which only decrements
      fs_devices->missing_devices and does not clear the device's
      BTRFS_DEV_STATE_MISSING bit. Worse, this bug causes an integer
      underflow: fs_devices->missing_devices decreases on every umount, and
      once its value has reached 0 the next decrement wraps around.
      
      With added debugging:
      
         loop1: detected capacity change from 0 to 20971520
         BTRFS: device fsid 56ad51f1-5523-463b-8547-c19486c51ebb devid 1 transid 21 /dev/loop1 scanned by systemd-udevd (2311)
         loop2: detected capacity change from 0 to 20971520
         BTRFS: device fsid 56ad51f1-5523-463b-8547-c19486c51ebb devid 2 transid 17 /dev/loop2 scanned by systemd-udevd (2313)
         BTRFS info (device loop1): flagging fs with big metadata feature
         BTRFS info (device loop1): allowing degraded mounts
         BTRFS info (device loop1): using free space tree
         BTRFS info (device loop1): has skinny extents
         BTRFS info (device loop1):  before clear_missing.00000000f706684d /dev/loop1 0
         BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing
         BTRFS info (device loop1):  before clear_missing.0000000000000000 /dev/loop2 1
         BTRFS info (device loop1): flagging fs with big metadata feature
         BTRFS info (device loop1): allowing degraded mounts
         BTRFS info (device loop1): using free space tree
         BTRFS info (device loop1): has skinny extents
         BTRFS info (device loop1):  before clear_missing.00000000f706684d /dev/loop1 0
         BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing
         BTRFS info (device loop1):  before clear_missing.0000000000000000 /dev/loop2 0
         BTRFS info (device loop1): flagging fs with big metadata feature
         BTRFS info (device loop1): allowing degraded mounts
         BTRFS info (device loop1): using free space tree
         BTRFS info (device loop1): has skinny extents
         BTRFS info (device loop1):  before clear_missing.00000000f706684d /dev/loop1 18446744073709551615
         BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing
         BTRFS info (device loop1):  before clear_missing.0000000000000000 /dev/loop2 18446744073709551615
      
      If fs_devices->missing_devices is 0, the next time it will be
      18446744073709551615.

      After applying this patch, fs_devices->missing_devices appears to be
      correct:
      
        $ truncate -s 10g test1
        $ truncate -s 10g test2
        $ losetup /dev/loop1 test1
        $ losetup /dev/loop2 test2
        $ mkfs.btrfs -draid1 -mraid1 /dev/loop1 /dev/loop2 -f
        $ losetup -d /dev/loop2
        $ mount -o degraded /dev/loop1 /mnt/1
        $ umount /mnt/1
        $ mount -o degraded /dev/loop1 /mnt/1
        $ umount /mnt/1
        $ mount -o degraded /dev/loop1 /mnt/1
        $ umount /mnt/1
        $ dmesg
      
         loop1: detected capacity change from 0 to 20971520
         loop2: detected capacity change from 0 to 20971520
         BTRFS: device fsid 15aa1203-98d3-4a66-bcae-ca82f629c2cd devid 1 transid 5 /dev/loop1 scanned by mkfs.btrfs (1863)
         BTRFS: device fsid 15aa1203-98d3-4a66-bcae-ca82f629c2cd devid 2 transid 5 /dev/loop2 scanned by mkfs.btrfs (1863)
         BTRFS info (device loop1): flagging fs with big metadata feature
         BTRFS info (device loop1): allowing degraded mounts
         BTRFS info (device loop1): disk space caching is enabled
         BTRFS info (device loop1): has skinny extents
         BTRFS info (device loop1):  before clear_missing.00000000975bd577 /dev/loop1 0
         BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing
         BTRFS info (device loop1):  before clear_missing.0000000000000000 /dev/loop2 1
         BTRFS info (device loop1): checking UUID tree
         BTRFS info (device loop1): flagging fs with big metadata feature
         BTRFS info (device loop1): allowing degraded mounts
         BTRFS info (device loop1): disk space caching is enabled
         BTRFS info (device loop1): has skinny extents
         BTRFS info (device loop1):  before clear_missing.00000000975bd577 /dev/loop1 0
         BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing
         BTRFS info (device loop1):  before clear_missing.0000000000000000 /dev/loop2 1
         BTRFS info (device loop1): flagging fs with big metadata feature
         BTRFS info (device loop1): allowing degraded mounts
         BTRFS info (device loop1): disk space caching is enabled
         BTRFS info (device loop1): has skinny extents
         BTRFS info (device loop1):  before clear_missing.00000000975bd577 /dev/loop1 0
         BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing
         BTRFS info (device loop1):  before clear_missing.0000000000000000 /dev/loop2 1
      
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: Li Zhang <zhanglikernel@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
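The counter behaviour shown in the logs above can be reproduced with a small user-space model. Everything here is a simplified, hypothetical sketch: `fake_fs_devices`, `fake_device` and the three functions are invented for illustration, and the assumption that mounting only increments the counter when the device is not already flagged as missing is a modelling choice, not a quotation of the btrfs code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified models of the structures involved. */
struct fake_fs_devices {
	uint64_t missing_devices;
};

struct fake_device {
	bool missing; /* models the BTRFS_DEV_STATE_MISSING bit */
};

/* Assumed mount behaviour: count a device as missing only once per flag. */
void mount_degraded(struct fake_fs_devices *fs, struct fake_device *dev)
{
	assert(fs != NULL && dev != NULL);
	if (!dev->missing) {
		dev->missing = true;
		fs->missing_devices++;
	}
}

/* Buggy close: decrements the counter but leaves the device flagged. */
void close_device_buggy(struct fake_fs_devices *fs, struct fake_device *dev)
{
	assert(fs != NULL && dev != NULL);
	if (dev->missing)
		fs->missing_devices--; /* wraps around once the counter is 0 */
}

/* Fixed close: clear the flag together with the counter update. */
void close_device_fixed(struct fake_fs_devices *fs, struct fake_device *dev)
{
	assert(fs != NULL && dev != NULL);
	if (dev->missing) {
		dev->missing = false;
		fs->missing_devices--;
	}
}
```

In this model, two mount/umount cycles with `close_device_buggy()` leave the unsigned 64-bit counter at its maximum value, matching the 18446744073709551615 seen in the first log, while `close_device_fixed()` keeps the counter at 1 during each mount and 0 after each umount.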
    • A
      btrfs: call btrfs_check_rw_degradable only if there is a missing device · 5c78a5e7
      Anand Jain committed
      In open_ctree() in btrfs_check_rw_degradable() [1], we check each block
      group individually if at least the minimum number of devices is available
      for that profile. If all the devices are available, then we don't have to
      check degradable.
      
      [1]
      open_ctree()
      ::
      3559 if (!sb_rdonly(sb) && !btrfs_check_rw_degradable(fs_info, NULL)) {
      
      Also, before calling btrfs_check_rw_degradable() in open_ctree() at the
      line number shown below [2], we call btrfs_read_chunk_tree() and go down
      to add_missing_dev() to record the number of missing devices.
      
      [2]
      open_ctree()
      ::
      3454         ret = btrfs_read_chunk_tree(fs_info);
      
      btrfs_read_chunk_tree()
        read_one_chunk() / read_one_dev()
          add_missing_dev()
      
      So, check if there is any missing device before btrfs_check_rw_degradable()
      in open_ctree().
      
      Also, with this the mount command could save ~16ms [3] in the most
      common case, that is, when no device is missing.
      
      [3]
       1) * 16934.96 us | btrfs_check_rw_degradable [btrfs]();
      
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • D
      btrfs: send: prepare for v2 protocol · e77fbf99
      David Sterba committed
      This is preparatory work for send protocol update to version 2 and
      higher.

      We have many pending protocol update requests but still don't have the
      basic protocol revision in place; the first thing that must happen is
      the actual versioning support.

      The protocol version is a u32 and is a new member of the send ioctl
      struct. Validity of the version field is backed by a new flag bit. Old
      kernels would fail when a higher version is requested. Protocol version
      0 will pick the highest supported version, BTRFS_SEND_STREAM_VERSION,
      which is also exported in sysfs.
      
      The version is still unchanged and will be increased once we have new
      incompatible commands or stream updates.

      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
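The negotiation rules in this message (a flag bit marks the version field as valid, version 0 means "highest supported", higher versions are rejected) might be sketched as below. The constants, the struct and `resolve_send_version()` are hypothetical; the flag value, the choice of -EINVAL, and the fallback to version 1 when the flag is absent are assumptions made for this sketch rather than details stated in the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical values; the real constants live in the btrfs headers. */
#define SKETCH_SEND_FLAG_VERSION   (1U << 3)
#define SKETCH_SEND_STREAM_VERSION 1U

struct sketch_send_args {
	uint32_t flags;
	uint32_t version;
};

/*
 * Resolve the protocol version to use, or return a negative errno.
 * Assumption: without the flag the version field is ignored and
 * version 1 is used, so callers that predate the field keep working.
 */
int resolve_send_version(const struct sketch_send_args *arg, uint32_t *out)
{
	assert(arg != NULL && out != NULL);

	if (!(arg->flags & SKETCH_SEND_FLAG_VERSION)) {
		*out = 1;
		return 0;
	}
	/* A version newer than this side supports is refused. */
	if (arg->version > SKETCH_SEND_STREAM_VERSION)
		return -EINVAL;

	/* Version 0 selects the highest supported version. */
	*out = arg->version ? arg->version : SKETCH_SEND_STREAM_VERSION;
	return 0;
}
```

Tying the field's validity to a flag bit lets the receiver of the struct distinguish "caller did not know about the field" from "caller asked for version 0".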
    • G
      ocfs2: fix race between searching chunks and release journal_head from buffer_head · 6f1b2285
      Gautham Ananthakrishna committed
      Encountered a race between ocfs2_test_bg_bit_allocatable() and
      jbd2_journal_put_journal_head() resulting in the below vmcore.
      
        PID: 106879  TASK: ffff880244ba9c00  CPU: 2   COMMAND: "loop3"
        Call trace:
          panic
          oops_end
          no_context
          __bad_area_nosemaphore
          bad_area_nosemaphore
          __do_page_fault
          do_page_fault
          page_fault
            [exception RIP: ocfs2_block_group_find_clear_bits+316]
          ocfs2_block_group_find_clear_bits [ocfs2]
          ocfs2_cluster_group_search [ocfs2]
          ocfs2_search_chain [ocfs2]
          ocfs2_claim_suballoc_bits [ocfs2]
          __ocfs2_claim_clusters [ocfs2]
          ocfs2_claim_clusters [ocfs2]
          ocfs2_local_alloc_slide_window [ocfs2]
          ocfs2_reserve_local_alloc_bits [ocfs2]
          ocfs2_reserve_clusters_with_limit [ocfs2]
          ocfs2_reserve_clusters [ocfs2]
          ocfs2_lock_refcount_allocators [ocfs2]
          ocfs2_make_clusters_writable [ocfs2]
          ocfs2_replace_cow [ocfs2]
          ocfs2_refcount_cow [ocfs2]
          ocfs2_file_write_iter [ocfs2]
          lo_rw_aio
          loop_queue_work
          kthread_worker_fn
          kthread
          ret_from_fork
      
      When ocfs2_test_bg_bit_allocatable() called bh2jh(bg_bh),
      bg_bh->b_private was NULL because jbd2_journal_put_journal_head() had
      raced and released the journal head from the buffer head.  Take the bit
      lock for the 'BH_JournalHead' bit to fix this race.
      
      Link: https://lkml.kernel.org/r/1634820718-6043-1-git-send-email-gautham.ananthakrishna@oracle.com
      Signed-off-by: Gautham Ananthakrishna <gautham.ananthakrishna@oracle.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: <rajesh.sivaramasubramaniom@oracle.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 27 Oct, 2021 24 commits
    • D
      Revert "btrfs: compression: drop kmap/kunmap from generic helpers" · 3a60f653
      David Sterba committed
      This reverts commit 4c2bf276.

      The kmaps in the compression code are still needed; dropping them causes
      crashes on 32-bit machines (ARM, x86). Reproducible e.g. by running
      fstests btrfs/004 with LZO or ZSTD compression enabled.
      
      Link: https://lore.kernel.org/all/CAJCQCtT+OuemovPO7GZk8Y8=qtOObr0XTDp8jh4OHD6y84AFxw@mail.gmail.com/
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=214839
      Signed-off-by: David Sterba <dsterba@suse.com>
    • J
      io_uring: don't assign write hint in the read path · 3884b83d
      Jens Axboe committed
      Move this out of the generic read/write prep path, and place it in the
      write-specific kiocb setup instead.

      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • A
      btrfs: fix comment about sector sizes supported in 64K systems · 50780d9b
      Anand Jain committed
      Commit 95ea0486 ("btrfs: allow read-write for 4K sectorsize on 64K
      page size systems") added write support for 4K sectorsize on 64K page
      size systems. Fix the now stale comments.

      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • J
      btrfs: update device path inode time instead of bd_inode · 54fde91f
      Josef Bacik committed
      Christoph pointed out that I'm updating bdev->bd_inode for the device
      time when we remove block devices from a btrfs file system, however this
      isn't actually exposed to anything.  The inode we want to update is the
      one that's associated with the path to the device, usually on devtmpfs,
      so that blkid notices the difference.
      
      We still don't want to do the blkdev_open, so use kern_path() to get the
      path to the given device and do the update time on that inode.
      
      Fixes: 8f96a5bf ("btrfs: update the bdev time directly when closing")
      Reported-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • J
      fs: export an inode_update_time helper · e60feb44
      Josef Bacik committed
      If you already have an inode and need to update the time on it, there
      is no way to do this properly.  Export this helper to allow file
      systems to update the time on the inode so that the appropriate handler
      is called, either ->update_time or generic_update_time.

      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
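The dispatch this helper provides (use the filesystem's ->update_time hook if there is one, otherwise fall back to the generic handler) can be pictured with a small user-space sketch. All types and functions below are hypothetical simplifications invented for illustration; the real kernel helper takes different arguments.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_inode;

struct sketch_inode_ops {
	/* Optional filesystem-specific hook; may be NULL. */
	int (*update_time)(struct sketch_inode *inode, long now);
};

struct sketch_inode {
	const struct sketch_inode_ops *ops;
	long mtime;
	int hook_calls; /* counts calls to the filesystem hook, for demonstration */
};

/* Generic fallback: just record the new time. */
int sketch_generic_update_time(struct sketch_inode *inode, long now)
{
	inode->mtime = now;
	return 0;
}

/* Example filesystem hook: records the time and notes that it ran. */
int sketch_fs_update_time(struct sketch_inode *inode, long now)
{
	inode->hook_calls++;
	inode->mtime = now;
	return 0;
}

/* The exported helper: pick the appropriate handler for this inode. */
int sketch_inode_update_time(struct sketch_inode *inode, long now)
{
	assert(inode != NULL);
	if (inode->ops && inode->ops->update_time)
		return inode->ops->update_time(inode, now);
	return sketch_generic_update_time(inode, now);
}
```

Callers only ever use `sketch_inode_update_time()`, so they do not need to know whether the inode's filesystem supplies its own hook.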
    • O
      btrfs: fix deadlock when defragging transparent huge pages · 24bcb454
      Omar Sandoval committed
      Attempting to defragment a Btrfs file containing a transparent huge page
      immediately deadlocks with the following stack trace:
      
        #0  context_switch (kernel/sched/core.c:4940:2)
        #1  __schedule (kernel/sched/core.c:6287:8)
        #2  schedule (kernel/sched/core.c:6366:3)
        #3  io_schedule (kernel/sched/core.c:8389:2)
        #4  wait_on_page_bit_common (mm/filemap.c:1356:4)
        #5  __lock_page (mm/filemap.c:1648:2)
        #6  lock_page (./include/linux/pagemap.h:625:3)
        #7  pagecache_get_page (mm/filemap.c:1910:4)
        #8  find_or_create_page (./include/linux/pagemap.h:420:9)
        #9  defrag_prepare_one_page (fs/btrfs/ioctl.c:1068:9)
        #10 defrag_one_range (fs/btrfs/ioctl.c:1326:14)
        #11 defrag_one_cluster (fs/btrfs/ioctl.c:1421:9)
        #12 btrfs_defrag_file (fs/btrfs/ioctl.c:1523:9)
        #13 btrfs_ioctl_defrag (fs/btrfs/ioctl.c:3117:9)
        #14 btrfs_ioctl (fs/btrfs/ioctl.c:4872:10)
        #15 vfs_ioctl (fs/ioctl.c:51:10)
        #16 __do_sys_ioctl (fs/ioctl.c:874:11)
        #17 __se_sys_ioctl (fs/ioctl.c:860:1)
        #18 __x64_sys_ioctl (fs/ioctl.c:860:1)
        #19 do_syscall_x64 (arch/x86/entry/common.c:50:14)
        #20 do_syscall_64 (arch/x86/entry/common.c:80:7)
        #21 entry_SYSCALL_64+0x7c/0x15b (arch/x86/entry/entry_64.S:113)
      
      A huge page is represented by a compound page, which consists of a
      struct page for each PAGE_SIZE page within the huge page. The first
      struct page is the "head page", and the remaining are "tail pages".
      
      Defragmentation attempts to lock each page in the range. However,
      lock_page() on a tail page actually locks the corresponding head page.
      So, if defragmentation tries to lock more than one struct page in a
      compound page, it tries to lock the same head page twice and deadlocks
      with itself.
      
      Ideally, we should be able to defragment transparent huge pages.
      However, THP for filesystems is currently read-only, so a lot of code is
      not ready to use huge pages for I/O. For now, let's just return
      ETXTBSY.
      
      This can be reproduced with the following on a kernel with
      CONFIG_READ_ONLY_THP_FOR_FS=y:
      
        $ cat create_thp_file.c
        #include <fcntl.h>
        #include <stdbool.h>
        #include <stdio.h>
        #include <stdint.h>
        #include <stdlib.h>
        #include <unistd.h>
        #include <sys/mman.h>
      
        static const char zeroes[1024 * 1024];
        static const size_t FILE_SIZE = 2 * 1024 * 1024;
      
        int main(int argc, char **argv)
        {
                if (argc != 2) {
                        fprintf(stderr, "usage: %s PATH\n", argv[0]);
                        return EXIT_FAILURE;
                }
                int fd = creat(argv[1], 0777);
                if (fd == -1) {
                        perror("creat");
                        return EXIT_FAILURE;
                }
                size_t written = 0;
                while (written < FILE_SIZE) {
                        ssize_t ret = write(fd, zeroes,
                                            sizeof(zeroes) < FILE_SIZE - written ?
                                            sizeof(zeroes) : FILE_SIZE - written);
                        if (ret < 0) {
                                perror("write");
                                return EXIT_FAILURE;
                        }
                        written += ret;
                }
                close(fd);
                fd = open(argv[1], O_RDONLY);
                if (fd == -1) {
                        perror("open");
                        return EXIT_FAILURE;
                }
      
                /*
                 * Reserve some address space so that we can align the file mapping to
                 * the huge page size.
                 */
                void *placeholder_map = mmap(NULL, FILE_SIZE * 2, PROT_NONE,
                                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (placeholder_map == MAP_FAILED) {
                        perror("mmap (placeholder)");
                        return EXIT_FAILURE;
                }
      
                void *aligned_address =
                        (void *)(((uintptr_t)placeholder_map + FILE_SIZE - 1) & ~(FILE_SIZE - 1));
      
                void *map = mmap(aligned_address, FILE_SIZE, PROT_READ | PROT_EXEC,
                                 MAP_SHARED | MAP_FIXED, fd, 0);
                if (map == MAP_FAILED) {
                        perror("mmap");
                        return EXIT_FAILURE;
                }
                if (madvise(map, FILE_SIZE, MADV_HUGEPAGE) < 0) {
                        perror("madvise");
                        return EXIT_FAILURE;
                }
      
                char *line = NULL;
                size_t line_capacity = 0;
                FILE *smaps_file = fopen("/proc/self/smaps", "r");
                if (!smaps_file) {
                        perror("fopen");
                        return EXIT_FAILURE;
                }
                for (;;) {
                        for (size_t off = 0; off < FILE_SIZE; off += 4096)
                                ((volatile char *)map)[off];
      
                        ssize_t ret;
                        bool this_mapping = false;
                        while ((ret = getline(&line, &line_capacity, smaps_file)) > 0) {
                                unsigned long start, end, huge;
                                if (sscanf(line, "%lx-%lx", &start, &end) == 2) {
                                        this_mapping = (start <= (uintptr_t)map &&
                                                        (uintptr_t)map < end);
                                } else if (this_mapping &&
                                           sscanf(line, "FilePmdMapped: %ld", &huge) == 1 &&
                                           huge > 0) {
                                        return EXIT_SUCCESS;
                                }
                        }
      
                        sleep(6);
                        rewind(smaps_file);
                        fflush(smaps_file);
                }
        }
        $ ./create_thp_file huge
        $ btrfs fi defrag -czstd ./huge

      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
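The self-deadlock described in this commit (locking a tail page really locks the head page, so locking two pages of one compound page takes the same lock twice) can be modelled in user space. This is a hypothetical sketch: `sketch_page` and the functions below are invented for illustration, and a non-blocking try-lock stands in for lock_page() so that the example reports the conflict instead of hanging.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model: head is NULL for an ordinary page and points at the
 * head page (possibly itself) for any page inside a compound page.
 */
struct sketch_page {
	struct sketch_page *head;
	bool locked;
};

/* The page whose lock bit is actually used when locking p. */
struct sketch_page *sketch_lock_target(struct sketch_page *p)
{
	assert(p != NULL);
	return p->head ? p->head : p;
}

/* Non-blocking lock attempt; a real lock_page() would sleep instead. */
bool sketch_trylock_page(struct sketch_page *p)
{
	struct sketch_page *target = sketch_lock_target(p);

	if (target->locked)
		return false;
	target->locked = true;
	return true;
}

/* Guard modelled on the fix: refuse compound pages before locking. */
int sketch_defrag_prepare_page(struct sketch_page *p)
{
	assert(p != NULL);
	if (p->head)
		return -ETXTBSY;
	return sketch_trylock_page(p) ? 0 : -EAGAIN;
}
```

Once the head page is locked, a try-lock on any tail page of the same compound page fails; with a blocking lock the same task would wait on itself forever, which is why the guard bails out early.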
    • A
      btrfs: sysfs: convert scnprintf and snprintf to sysfs_emit · 020e5277
      Anand Jain committed
      Commit 2efc459d ("sysfs: Add sysfs_emit and sysfs_emit_at to format
      sysfs out"), merged in 5.10, introduced two new functions, sysfs_emit()
      and sysfs_emit_at(), which are aware of the PAGE_SIZE limit of the output
      buffer.

      Use these two functions instead of scnprintf() and snprintf() in the
      various sysfs show() handlers.

      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
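The idea behind a PAGE_SIZE-aware emit helper can be imitated in user space as follows. `sketch_emit()` and `SKETCH_PAGE_SIZE` are hypothetical names, and the 4096-byte page size is an assumption; this is not the kernel's sysfs_emit() implementation, only a sketch of why callers no longer need to pass a buffer size.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

#define SKETCH_PAGE_SIZE 4096 /* assumed page size for this sketch */

/*
 * User-space imitation of a page-aware emit helper: the helper knows the
 * output buffer is one page long, so callers do not pass a size.
 * Returns the number of characters written, excluding the trailing NUL.
 */
int sketch_emit(char *buf, const char *fmt, ...)
{
	va_list args;
	int len;

	assert(buf != NULL && fmt != NULL);

	va_start(args, fmt);
	len = vsnprintf(buf, SKETCH_PAGE_SIZE, fmt, args);
	va_end(args);

	if (len < 0)
		return 0;
	if (len >= SKETCH_PAGE_SIZE)
		len = SKETCH_PAGE_SIZE - 1; /* output was truncated to one page */
	return len;
}
```

A show()-style caller would simply `return sketch_emit(buf, "%d\n", value);`, which removes the chance of passing a wrong size as can happen with snprintf().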
    • Q
      btrfs: make btrfs_super_block size match BTRFS_SUPER_INFO_SIZE · 38732474
      Qu Wenruo committed
      It's common practice to avoid using sizeof(struct btrfs_super_block)
      (3531) and to use BTRFS_SUPER_INFO_SIZE (4096) instead.

      The problem is that sizeof(struct btrfs_super_block) has never matched
      BTRFS_SUPER_INFO_SIZE, from the very beginning.
      
      Furthermore, for all call sites except selftests, we always allocate
      BTRFS_SUPER_INFO_SIZE space for super block, there isn't any real reason
      to use the smaller value, and it doesn't really save any space.
      
      So let's get rid of such confusing behavior, and unify those two values.
      
      This modification also adds a new static_assert() to verify the size,
      and moves the BTRFS_SUPER_INFO_* macros to the definition of
      btrfs_super_block for the static_assert().

      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
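Padding a structure up to a fixed on-disk size and verifying that at compile time could look roughly like this. The struct below is a hypothetical, heavily abridged layout with invented field choices, not the real btrfs_super_block; only the pattern (explicit padding plus static_assert) reflects what the message describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SUPER_INFO_SIZE 4096

/* Hypothetical, abridged layout; the real structure has many more fields. */
struct sketch_super_block {
	uint8_t  csum[32];
	uint8_t  fsid[16];
	uint64_t bytenr;
	uint64_t flags;
	/* Explicit padding fills the structure up to the on-disk size. */
	uint8_t  padding[SKETCH_SUPER_INFO_SIZE - 32 - 16 - 8 - 8];
};

/* Compilation fails if a later field change breaks the size. */
static_assert(sizeof(struct sketch_super_block) == SKETCH_SUPER_INFO_SIZE,
	      "super block struct must match the on-disk size");

/* Convenience accessor so callers can use one size everywhere. */
size_t sketch_super_size(void)
{
	return sizeof(struct sketch_super_block);
}
```

With the two values unified, `sizeof` and the size macro become interchangeable, and the compile-time check catches any field change that forgets to shrink the padding.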
    • F
      btrfs: update comments for chunk allocation -ENOSPC cases · ecd84d54
      Filipe Manana committed
      Update the comments at btrfs_chunk_alloc() and do_chunk_alloc() that
      describe which cases can lead to a failure to allocate metadata and system
      space despite having previously reserved space. This adds one more reason
      that I previously forgot to mention.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ecd84d54
    • F
      btrfs: fix deadlock between chunk allocation and chunk btree modifications · 2bb2e00e
      Filipe Manana committed
      When a task is doing some modification to the chunk btree and it is not in
      the context of a chunk allocation or a chunk removal, it can deadlock with
      another task that is currently allocating a new data or metadata chunk.
      
      These contexts are the following:
      
      * When relocating a system chunk, when we need to COW the extent buffers
        that belong to the chunk btree;
      
      * When adding a new device (ioctl), where we need to add a new device item
        to the chunk btree;
      
      * When removing a device (ioctl), where we need to remove a device item
        from the chunk btree;
      
      * When resizing a device (ioctl), where we need to update a device item in
        the chunk btree and may need to relocate a system chunk that lies beyond
        the new device size when shrinking a device.
      
      The problem happens due to a sequence of steps like the following:
      
      1) Task A starts a data or metadata chunk allocation and it locks the
         chunk mutex;
      
      2) Task B is relocating a system chunk, and when it needs to COW an extent
         buffer of the chunk btree, it has locked both that extent buffer as
         well as its parent extent buffer;
      
      3) Since there is not enough available system space, either because none
         of the existing system block groups have enough free space or because
         the only one with enough free space is in RO mode due to the relocation,
         task B triggers a new system chunk allocation. It blocks when trying to
         acquire the chunk mutex, currently held by task A;
      
      4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
         the new chunk item into the chunk btree and update the existing device
         items there. But in order to do that, it has to lock the extent buffer
         that task B locked at step 2, or its parent extent buffer, but task B
         is waiting on the chunk mutex, which is currently locked by task A,
         therefore resulting in a deadlock.
      
      One example report when the deadlock happens with system chunk relocation:
      
        INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
              Not tainted 5.15.0-rc3+ #1
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        task:kworker/u9:5    state:D stack:25936 pid:  546 ppid:     2 flags:0x00004000
        Workqueue: events_unbound btrfs_async_reclaim_metadata_space
        Call Trace:
         context_switch kernel/sched/core.c:4940 [inline]
         __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
         schedule+0xd3/0x270 kernel/sched/core.c:6366
         rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
         __down_read_common kernel/locking/rwsem.c:1214 [inline]
         __down_read kernel/locking/rwsem.c:1223 [inline]
         down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
         __btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
         btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
         btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
         btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
         btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
         btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
         btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
         do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
         btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
         flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
         btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
         process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
         worker_thread+0x90/0xed0 kernel/workqueue.c:2444
         kthread+0x3e5/0x4d0 kernel/kthread.c:319
         ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
        INFO: task syz-executor:9107 blocked for more than 143 seconds.
              Not tainted 5.15.0-rc3+ #1
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        task:syz-executor    state:D stack:23200 pid: 9107 ppid:  7792 flags:0x00004004
        Call Trace:
         context_switch kernel/sched/core.c:4940 [inline]
         __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
         schedule+0xd3/0x270 kernel/sched/core.c:6366
         schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
         __mutex_lock_common kernel/locking/mutex.c:669 [inline]
         __mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
         btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
         find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
         find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
         btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
         btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
         __btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
         btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
         btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
         relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
         relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
         relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
         btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
         btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
         __btrfs_balance fs/btrfs/volumes.c:3911 [inline]
         btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
         btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
         btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
         vfs_ioctl fs/ioctl.c:51 [inline]
         __do_sys_ioctl fs/ioctl.c:874 [inline]
         __se_sys_ioctl fs/ioctl.c:860 [inline]
         __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
         do_syscall_x64 arch/x86/entry/common.c:50 [inline]
         do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      So fix this by making sure that whenever we try to modify the chunk btree
      and we are neither in a chunk allocation context nor in a chunk remove
      context, we reserve system space before modifying the chunk btree.
      Reported-by: Hao Sun <sunhao.th@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
      Fixes: 79bd3712 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
      CC: stable@vger.kernel.org # 5.14+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2bb2e00e
    • J
      btrfs: zoned: use greedy gc for auto reclaim · 2ca0ec77
      Johannes Thumshirn committed
      Currently auto reclaim of unusable zones reclaims the block-groups in
      the order they have been added to the reclaim list.
      
      Change this to a greedy algorithm by sorting the list so we have the
      block-groups with the least amount of valid bytes reclaimed first.
      
      Note: we can't splice the block groups from reclaim_bgs to let the sort
      happen outside of the lock. The block groups can still be in use by
      other parts, e.g. via bg_list, and we must hold unused_bgs_lock while
      processing them.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ write note and comment why we can't splice the list ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      2ca0ec77
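The greedy ordering from commit 2ca0ec77 (block groups with the fewest valid bytes are reclaimed first) can be sketched in user space like this. The commit sorts a linked list under a lock; this example assumes a plain array and qsort() instead, and `struct sketch_block_group` with its fields is a hypothetical simplification.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified block group: only what the sort needs. */
struct sketch_block_group {
	uint64_t start;	/* identifies the block group */
	uint64_t used;	/* bytes of still-valid data */
};

/* qsort() comparator: fewer valid bytes sorts first. */
static int cmp_used_ascending(const void *a, const void *b)
{
	const struct sketch_block_group *bg1 = a;
	const struct sketch_block_group *bg2 = b;

	if (bg1->used < bg2->used)
		return -1;
	if (bg1->used > bg2->used)
		return 1;
	return 0;
}

/* Order candidates so the cheapest block group to reclaim comes first. */
static void sort_for_reclaim(struct sketch_block_group *bgs, size_t n)
{
	qsort(bgs, n, sizeof(*bgs), cmp_used_ascending);
}
```

Reclaiming the emptiest block groups first means less valid data has to be relocated for each zone that is freed.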
    • C
      btrfs: check-integrity: stop storing the block device name in btrfsic_dev_state · 813ebc16
      Christoph Hellwig committed
      Just use the %pg format specifier in all the debug printks previously
      using it.  Note that both bdevname and the %pg specifier never print
      a pathname, so the kbasename call wasn't needed to start with.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ adjust messages and indentation ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      813ebc16
    • J
      btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls · 1a15eb72
      Josef Bacik committed
      For device removal and replace we call btrfs_find_device_by_devspec,
      which if we give it a device path and nothing else will call
      btrfs_get_dev_args_from_path, which opens the block device and reads the
      super block and then looks up our device based on that.
      
      However at this point we're holding the sb write "lock", so reading the
      block device pulls in the dependency of ->open_mutex, which produces the
      following lockdep splat
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.14.0-rc2+ #405 Not tainted
      ------------------------------------------------------
      losetup/11576 is trying to acquire lock:
      ffff9bbe8cded938 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0
      
      but task is already holding lock:
      ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #4 (&lo->lo_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7d/0x750
             lo_open+0x28/0x60 [loop]
             blkdev_get_whole+0x25/0xf0
             blkdev_get_by_dev.part.0+0x168/0x3c0
             blkdev_open+0xd2/0xe0
             do_dentry_open+0x161/0x390
             path_openat+0x3cc/0xa20
             do_filp_open+0x96/0x120
             do_sys_openat2+0x7b/0x130
             __x64_sys_openat+0x46/0x70
             do_syscall_64+0x38/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      -> #3 (&disk->open_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7d/0x750
             blkdev_get_by_dev.part.0+0x56/0x3c0
             blkdev_get_by_path+0x98/0xa0
             btrfs_get_bdev_and_sb+0x1b/0xb0
             btrfs_find_device_by_devspec+0x12b/0x1c0
             btrfs_rm_device+0x127/0x610
             btrfs_ioctl+0x2a31/0x2e70
             __x64_sys_ioctl+0x80/0xb0
             do_syscall_64+0x38/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      -> #2 (sb_writers#12){.+.+}-{0:0}:
             lo_write_bvec+0xc2/0x240 [loop]
             loop_process_work+0x238/0xd00 [loop]
             process_one_work+0x26b/0x560
             worker_thread+0x55/0x3c0
             kthread+0x140/0x160
             ret_from_fork+0x1f/0x30
      
      -> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
             process_one_work+0x245/0x560
             worker_thread+0x55/0x3c0
             kthread+0x140/0x160
             ret_from_fork+0x1f/0x30
      
      -> #0 ((wq_completion)loop0){+.+.}-{0:0}:
             __lock_acquire+0x10ea/0x1d90
             lock_acquire+0xb5/0x2b0
             flush_workqueue+0x91/0x5e0
             drain_workqueue+0xa0/0x110
             destroy_workqueue+0x36/0x250
             __loop_clr_fd+0x9a/0x660 [loop]
             block_ioctl+0x3f/0x50
             __x64_sys_ioctl+0x80/0xb0
             do_syscall_64+0x38/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      other info that might help us debug this:
      
      Chain exists of:
        (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(&lo->lo_mutex);
                                     lock(&disk->open_mutex);
                                     lock(&lo->lo_mutex);
        lock((wq_completion)loop0);
      
       *** DEADLOCK ***
      
      1 lock held by losetup/11576:
       #0: ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
      
      stack backtrace:
      CPU: 0 PID: 11576 Comm: losetup Not tainted 5.14.0-rc2+ #405
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
      Call Trace:
       dump_stack_lvl+0x57/0x72
       check_noncircular+0xcf/0xf0
       ? stack_trace_save+0x3b/0x50
       __lock_acquire+0x10ea/0x1d90
       lock_acquire+0xb5/0x2b0
       ? flush_workqueue+0x67/0x5e0
       ? lockdep_init_map_type+0x47/0x220
       flush_workqueue+0x91/0x5e0
       ? flush_workqueue+0x67/0x5e0
       ? verify_cpu+0xf0/0x100
       drain_workqueue+0xa0/0x110
       destroy_workqueue+0x36/0x250
       __loop_clr_fd+0x9a/0x660 [loop]
       ? blkdev_ioctl+0x8d/0x2a0
       block_ioctl+0x3f/0x50
       __x64_sys_ioctl+0x80/0xb0
       do_syscall_64+0x38/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x7f31b02404cb
      
      Instead what we want to do is populate our device lookup args before we
      grab any locks, and then pass these args into btrfs_rm_device().  From
      there we can find the device and do the appropriate removal.
      Suggested-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1a15eb72
    • J
      btrfs: add a btrfs_get_dev_args_from_path helper · faa775c4
      Josef Bacik committed
      We are going to want to populate our device lookup args outside of any
      locks and then do the actual device lookup later, so add a helper to do
      this work and make btrfs_find_device_by_devspec() use this helper for
      now.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      faa775c4
    • J
      btrfs: handle device lookup with btrfs_dev_lookup_args · 562d7b15
      Josef Bacik committed
      We have a lot of device lookup functions that all do something slightly
      different.  Clean this up by adding a struct to hold the different
      lookup criteria, and then pass this around to btrfs_find_device() so it
      can do the proper matching based on the lookup criteria.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      562d7b15
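The approach in commit 562d7b15, one struct carrying the lookup criteria and one find function matching against it, might look roughly like the sketch below. All names, fields, and matching rules here are hypothetical and chosen for illustration; the real btrfs_dev_lookup_args and btrfs_find_device() may differ.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_UUID_SIZE 16

/* Hypothetical lookup criteria; unset fields are ignored when matching. */
struct sketch_lookup_args {
	uint64_t devid;
	const uint8_t *uuid;	/* NULL: do not match on UUID */
	bool missing;		/* true: match only missing devices */
};

struct sketch_device {
	uint64_t devid;
	uint8_t uuid[SKETCH_UUID_SIZE];
	bool missing;
};

/* One matching routine replaces several slightly different lookups. */
static bool dev_matches(const struct sketch_lookup_args *args,
			const struct sketch_device *dev)
{
	if (args->missing)
		return dev->missing;
	if (dev->devid != args->devid)
		return false;
	if (args->uuid && memcmp(args->uuid, dev->uuid, SKETCH_UUID_SIZE) != 0)
		return false;
	return true;
}

/* Return the first device matching the criteria, or NULL if none does. */
static const struct sketch_device *
find_device(const struct sketch_device *devs, size_t n,
	    const struct sketch_lookup_args *args)
{
	for (size_t i = 0; i < n; i++) {
		if (dev_matches(args, &devs[i]))
			return &devs[i];
	}
	return NULL;
}
```

Because the criteria live in a plain struct, they can be filled in early (for example before any locks are taken) and handed to the lookup later.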
    • J
      btrfs: do not call close_fs_devices in btrfs_rm_device · 8b41393f
      Josef Bacik committed
      There's a subtle case where if we're removing the seed device from a
      file system we need to free its private copy of the fs_devices.  However
      we do not need to call close_fs_devices(), because at this point there
      are no devices left to close as we've closed the last one.  The only
      thing that close_fs_devices() does is decrement ->opened, which should
      be 1.  We want to avoid calling close_fs_devices() here because it has a
      lockdep_assert_held(&uuid_mutex), and we are going to stop holding the
      uuid_mutex in this path.
      
      So simply decrement the ->opened counter like we should, and then clean
      up like normal.  Also add a comment explaining what we're doing here as
      I initially removed this code erroneously.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8b41393f
    • A
      btrfs: add comments for device counts in struct btrfs_fs_devices · add9745a
      Anand Jain committed
      A bug was checking a wrong device count before we delete the struct
      btrfs_fs_devices in btrfs_rm_device(). To avoid future confusion and
      for easy reference, add a comment about the various device counts that
      we have in the struct btrfs_fs_devices.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      add9745a
    • A
      btrfs: use num_device to check for the last surviving seed device · 8e906945
      Anand Jain committed
      For both sprout and seed fsids:
       btrfs_fs_devices::num_devices is the device count including missing devices
       btrfs_fs_devices::open_devices is the device count excluding missing devices
      
      We create a dummy struct btrfs_device for the missing device, so
      num_devices != open_devices when there is a missing device.
      
      In btrfs_rm_device() we wrongly check %cur_devices->open_devices
      before freeing the seed fs_devices. Instead we should check
      %cur_devices->num_devices.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8e906945
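The difference between the two counters in commit 8e906945 can be illustrated with a small sketch. The struct and helper names below are hypothetical; only the meaning of num_devices and open_devices is taken from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reduction of the fs_devices counters to the two at issue. */
struct sketch_fs_devices {
	uint64_t num_devices;	/* all devices, including missing ones */
	uint64_t open_devices;	/* opened devices only, excluding missing ones */
};

/* Buggy check: a seed holding only a missing device looks empty. */
static bool seed_empty_buggy(const struct sketch_fs_devices *d)
{
	return d->open_devices == 0;
}

/* Fixed check: the seed is empty only when no device is left at all. */
static bool seed_empty_fixed(const struct sketch_fs_devices *d)
{
	return d->num_devices == 0;
}
```

For a seed with one missing device (`num_devices == 1`, `open_devices == 0`), the buggy check would report it as empty while the fixed check would not.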
    • F
      btrfs: fix lost error handling when replaying directory deletes · 10adb115
      Filipe Manana committed
      At replay_dir_deletes(), if find_dir_range() returns an error we break out
      of the main while loop and then assign a value of 0 (success) to the 'ret'
      variable, resulting in completely ignoring that an error happened. Fix
      that by jumping to the 'out' label when find_dir_range() returns an error
      (negative value).
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      10adb115
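The control-flow bug fixed by commit 10adb115 follows a common pattern, sketched below in user space. `find_range_stub()` is a hypothetical stand-in for find_dir_range(); the assumption that a positive return means "no more ranges" is made for this example and is not stated in the commit message.

```c
#include <errno.h>

/*
 * Hypothetical stand-in for a range lookup:
 * < 0 on error, > 0 when there are no more ranges, 0 when one was found.
 */
static int find_range_stub(int step, int fail_at)
{
	if (step == fail_at)
		return -EIO;
	return step >= 3 ? 1 : 0;
}

/* Buggy shape: any non-zero result breaks out, then ret is reset to 0. */
static int replay_buggy(int fail_at)
{
	int ret;
	int step = 0;

	while (1) {
		ret = find_range_stub(step, fail_at);
		if (ret != 0)
			break;
		step++;
	}
	ret = 0;	/* an error from the loop is lost here */
	return ret;
}

/* Fixed shape: errors jump to 'out' and keep their value. */
static int replay_fixed(int fail_at)
{
	int ret;
	int step = 0;

	while (1) {
		ret = find_range_stub(step, fail_at);
		if (ret < 0)
			goto out;
		else if (ret > 0)
			break;
		step++;
	}
	ret = 0;
out:
	return ret;
}
```

Passing a `fail_at` value that is never reached (such as -1) exercises the success path, where both variants return 0.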
    • Q
      btrfs: remove btrfs_bio::logical member · f4f39fc5
      Qu Wenruo committed
      The member btrfs_bio::logical is only initialized by two call sites:
      
      - btrfs_repair_one_sector()
        No corresponding site to utilize it.
      
      - btrfs_submit_direct()
        The corresponding site to utilize it is btrfs_check_read_dio_bio().
      
      However for btrfs_check_read_dio_bio(), we can grab the file_offset from
      btrfs_dio_private::file_offset directly.
      
      Thus it turns out we don't really need that btrfs_bio::logical member at
      all.
      
      For btrfs_bio, the logical bytenr can be fetched from its
      bio->bi_iter.bi_sector directly.
      
      So let's just remove the member to save 8 bytes for structure btrfs_bio.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f4f39fc5
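Commit f4f39fc5 notes that the logical bytenr can be derived from bio->bi_iter.bi_sector instead of being stored separately. A minimal sketch of that conversion, assuming the usual 512-byte sector unit for bio sectors, is shown below; `sector_to_logical()` and `SKETCH_SECTOR_SHIFT` are hypothetical names for this example.

```c
#include <stdint.h>

/* Assumption: bio sectors are counted in 512-byte units (shift of 9). */
#define SKETCH_SECTOR_SHIFT 9

/* Convert a sector number into a byte offset (the logical bytenr). */
static uint64_t sector_to_logical(uint64_t bi_sector)
{
	return bi_sector << SKETCH_SECTOR_SHIFT;
}
```

Since the value can be recomputed this cheaply, keeping a separate 8-byte member per structure only for it is redundant.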
    • Q
      btrfs: rename btrfs_dio_private::logical_offset to file_offset · 47926ab5
      Qu Wenruo committed
      The name "logical_offset" can be confused with the logical bytenr of
      the dio range.
      
      In fact it is the file offset, and the name "file_offset" is already
      widely used at all other sites.
      
      Just do the rename to avoid confusion.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      47926ab5
    • C
      btrfs: use bvec_kmap_local in btrfs_csum_one_bio · 3dcfbcce
      Christoph Hellwig committed
      Using local kmaps slightly reduces the chances of stray writes, and
      the bvec interface cleans up the code a little bit.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3dcfbcce
    • A
      btrfs: reduce btrfs_update_block_group alloc argument to bool · 11b66fa6
      Anand Jain committed
      btrfs_update_block_group() accounts for the number of bytes allocated or
      freed. Argument @alloc specifies whether the call is for alloc or free.
      Convert the argument @alloc type from int to bool.
      Reviewed-by: Su Yue <l@damenly.su>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      11b66fa6
    • N
      btrfs: make btrfs_ref::real_root optional · eed2037f
      Nikolay Borisov committed
      Now that real_root is only used in the ref-verify core, gate it behind
      the CONFIG_BTRFS_FS_REF_VERIFY ifdef. This shrinks the size of pending
      delayed refs by 8 bytes per ref, of which we can have many at any one
      time depending on the intensity of the workload. Also change the comment
      about the member as it no longer deals with qgroups.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      eed2037f