1. 29 May, 2018 4 commits
    • btrfs: kill btrfs_fs_info::volume_mutex · dccdb07b
      David Sterba committed
      Mutual exclusion of device add/rm and balance was done by the volume
      mutex up to version 3.7. The commit 5ac00add ("Btrfs: disallow
      mutually exclusive admin operations from user mode") added a bit that
      essentially tracked the same information.
      
      The status bit has an advantage over a mutex that it can be set without
      restrictions of function context, so it started to be used in the
      mount-time resuming of balance or device replace.
      
      But we don't really need to track the same information in two ways.
      
      1) After the previous cleanups, the main ioctl handlers for
         add/del/resize copy the EXCL_OP bit next to the volume mutex, here
         it's clearly safe.
      
      2) Resuming balance during mount or after rw remount will set only the
         EXCL_OP bit and the volume_mutex is held in the kernel thread that
         calls btrfs_balance.
      
      3) Resuming device replace during mount or after rw remount is done
         after balance and is excluded by the EXCL_OP bit. It does not take
         the volume_mutex at all and completely relies on the EXCL_OP bit.
      
      4) The resuming of balance and dev-replace cannot happen at the same time
         as the ioctls cannot be started in parallel. Nevertheless, a crafted
         image could trigger that and a warning is printed.
      
      5) Balance is normally excluded by EXCL_OP and also uses own mutex to
         protect against concurrent access to its status data. There's some
         trickery to maintain the right lock nesting in case we need to
         reexamine the status in btrfs_ioctl_balance. The volume_mutex is
         removed and the unlock/lock sequence is left in place as we might
         expect other waiters to proceed.
      
      6) Similar to 5, the unlock/lock sequence is kept in
         btrfs_cancel_balance to allow waiters to continue.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
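The advantage of a status bit over a mutex, as described in the message above, can be illustrated with a minimal userspace sketch. It is not the kernel code: `struct fs_state` and the `excl_op_*` helpers are hypothetical names, and C11 `atomic_flag` stands in for the kernel's atomic bit operations. The point it shows is that claiming the slot never blocks, so it can be attempted from any context, and a second claimant simply sees that an exclusive operation is already running.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the filesystem state; not the kernel's struct. */
struct fs_state {
    atomic_flag excl_op_running;    /* set while an exclusive operation runs */
};

/* Start with no exclusive operation in progress. */
static void fs_state_init(struct fs_state *fs)
{
    assert(fs != NULL);
    atomic_flag_clear(&fs->excl_op_running);
}

/*
 * Try to claim the exclusive-operation slot without blocking.
 * atomic_flag_test_and_set() returns the previous value, so a previous
 * value of false means this caller won the slot.
 */
static bool excl_op_try_start(struct fs_state *fs)
{
    assert(fs != NULL);
    return !atomic_flag_test_and_set(&fs->excl_op_running);
}

/* Release the slot so another exclusive operation may start. */
static void excl_op_finish(struct fs_state *fs)
{
    assert(fs != NULL);
    atomic_flag_clear(&fs->excl_op_running);
}
```

A caller that fails `excl_op_try_start()` would report that another operation is in progress instead of sleeping, which is why a bit like this is usable where taking a mutex would be awkward.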
    • btrfs: Remove btrfs_wait_and_free_delalloc_work · 40012f96
      Nikolay Borisov committed
      This function is called from only 1 place and is effectively a wrapper
      over wait_completion/kfree. It doesn't really bring any value having
      those two calls in a separate function. Just open code it and remove it.
      No functional changes.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Factor out the main deletion process from btrfs_ioctl_snap_destroy() · f60a2364
      Misono Tomohiro committed
      Factor out the second half of btrfs_ioctl_snap_destroy() as
      btrfs_delete_subvolume(), which performs some subvolume specific checks
      before deletion:
      
      1. send is not in progress
      2. the subvolume is not the default subvolume
      3. the subvolume does not contain other subvolumes
      
      and the actual deletion process. btrfs_delete_subvolume() requires
      inode_lock for both @dir and inode of @dentry. The remaining part of
      btrfs_ioctl_snap_destroy() is mainly permission checks.
      
      Note that call of d_delete() is not included in btrfs_delete_subvolume()
      as this function will also be used by btrfs_rmdir() to delete an empty
      subvolume and in that case d_delete() is called in VFS layer.
      
      As a result, btrfs_unlink_subvol() and may_destroy_subvol()
      become static functions. No functional changes.
      Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ minor comment updates ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Move may_destroy_subvol() from ioctl.c to inode.c · ec42f167
      Misono Tomohiro committed
      This is a preparation work to refactor btrfs_ioctl_snap_destroy()
      and to allow rmdir(2) to delete an empty subvolume.
      Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ minor update of the function comment ]
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 28 May, 2018 1 commit
  3. 17 May, 2018 1 commit
  4. 18 Apr, 2018 2 commits
    • btrfs: qgroup: Use independent and accurate per inode qgroup rsv · ff6bc37e
      Qu Wenruo committed
      Unlike reservation calculation used in inode rsv for metadata, qgroup
      doesn't really need to care about things like csum size or extent usage
      for the whole tree COW.
      
      Qgroups care more about net change of the extent usage.
      That's to say, if we're going to insert one file extent, it will mostly
      find its place in COWed tree block, leaving no change in extent usage.
      Or causing a leaf split, resulting in one new net extent and increasing
      qgroup number by nodesize.
      Or in an even more rare case, increase the tree level, increasing qgroup
      number by 2 * nodesize.
      
      So here instead of using the complicated calculation for extent
      allocator, which cares more about accuracy and no error, qgroup doesn't
      need that over-estimated reservation.
      
      This patch will maintain 2 new members in btrfs_block_rsv structure for
      qgroup, using much smaller calculation for qgroup rsv, reducing false
      EDQUOT.
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
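The three cases listed in the message above (no net change, a leaf split, a tree level increase) can be written out as a small sketch. This is only an illustration of the arithmetic in the message, not the patch itself: the enum and the function name are hypothetical, and the nodesize passed in is whatever the filesystem was created with.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical labels for the outcomes described in the commit message. */
enum insert_outcome {
    INSERT_FITS,            /* item lands in an already COWed tree block */
    INSERT_LEAF_SPLIT,      /* a leaf split adds one new tree block */
    INSERT_LEVEL_INCREASE   /* rarer: the tree grows by one level */
};

/* Net change in qgroup-accounted bytes for inserting one file extent item. */
static uint64_t qgroup_net_change(enum insert_outcome outcome, uint64_t nodesize)
{
    assert(nodesize > 0);
    switch (outcome) {
    case INSERT_FITS:
        return 0;
    case INSERT_LEAF_SPLIT:
        return nodesize;
    case INSERT_LEVEL_INCREASE:
        return 2 * nodesize;
    }
    return 0;
}
```

With a nodesize of 16384 bytes (an example value), the worst case here is 32768 bytes per inserted item, which is far smaller than a reservation sized for COWing a whole tree path.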
    • btrfs: qgroup: Commit transaction in advance to reduce early EDQUOT · a514d638
      Qu Wenruo committed
      Unlike previous method that tries to commit transaction inside
      qgroup_reserve(), this time we will try to commit transaction using
      fs_info->transaction_kthread to avoid nested transaction and no need to
      worry about locking context.
      
      Since it's an asynchronous function call and we won't wait for
      transaction commit, unlike previous method, we must call it before we
      hit the qgroup limit.
      
      So this patch will use the ratio and size of qgroup meta_pertrans
      reservation as indicator to check if we should trigger a transaction
      commit. (meta_prealloc won't be cleaned in transaction commit, it's
      useless anyway)
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
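The "ratio and size" indicator mentioned above could look roughly like the following sketch. The message does not give the exact rule or constants, so everything here is an assumption made for illustration: the function name, the parameters `ratio` and `size_cap`, and the choice to compare the per-transaction reservation against the smaller of `limit / ratio` and `size_cap`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether to ask the transaction kthread for an early commit.
 * All parameters are hypothetical:
 *   pertrans_rsv - bytes currently reserved as meta_pertrans
 *   limit        - the qgroup limit in bytes
 *   ratio        - fraction of the limit used as a threshold (limit / ratio)
 *   size_cap     - absolute upper bound for the threshold
 */
static bool should_commit_early(uint64_t pertrans_rsv, uint64_t limit,
                                uint64_t ratio, uint64_t size_cap)
{
    assert(ratio > 0);

    /* Threshold is the smaller of the ratio-based and the absolute bound. */
    uint64_t threshold = limit / ratio;
    if (threshold > size_cap)
        threshold = size_cap;

    /* Commit ahead of time, since the commit is asynchronous. */
    return pertrans_rsv > threshold;
}
```

Because the commit is not waited for, the check has to fire while there is still headroom below the limit. The absolute cap keeps the threshold from growing without bound on qgroups with very large limits.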
  5. 12 Apr, 2018 1 commit
  6. 31 Mar, 2018 12 commits
  7. 26 Mar, 2018 11 commits
  8. 01 Mar, 2018 2 commits
    • Btrfs: fix log replay failure after unlink and link combination · 1f250e92
      Filipe Manana committed
      If we have a file with 2 (or more) hard links in the same directory,
      remove one of the hard links, create a new file (or link an existing file)
      in the same directory with the name of the removed hard link, and then
      finally fsync the new file, we end up with a log that fails to replay,
      causing a mount failure.
      
      Example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/testdir
        $ touch /mnt/testdir/foo
        $ ln /mnt/testdir/foo /mnt/testdir/bar
      
        $ sync
      
        $ unlink /mnt/testdir/bar
        $ touch /mnt/testdir/bar
        $ xfs_io -c "fsync" /mnt/testdir/bar
      
        <power failure>
      
        $ mount /dev/sdb /mnt
        mount: mount(2) failed: /mnt: No such file or directory
      
      When replaying the log, for that example, we also see the following in
      dmesg/syslog:
      
        [71813.671307] BTRFS info (device dm-0): failed to delete reference to bar, inode 258 parent 257
        [71813.674204] ------------[ cut here ]------------
        [71813.675694] BTRFS: Transaction aborted (error -2)
        [71813.677236] WARNING: CPU: 1 PID: 13231 at fs/btrfs/inode.c:4128 __btrfs_unlink_inode+0x17b/0x355 [btrfs]
        [71813.679669] Modules linked in: btrfs xfs f2fs dm_flakey dm_mod dax ghash_clmulni_intel ppdev pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper evdev psmouse i2c_piix4 parport_pc i2c_core pcspkr sg serio_raw parport button sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod ata_generic sd_mod virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel floppy virtio e1000 scsi_mod [last unloaded: btrfs]
        [71813.679669] CPU: 1 PID: 13231 Comm: mount Tainted: G        W        4.15.0-rc9-btrfs-next-56+ #1
        [71813.679669] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
        [71813.679669] RIP: 0010:__btrfs_unlink_inode+0x17b/0x355 [btrfs]
        [71813.679669] RSP: 0018:ffffc90001cef738 EFLAGS: 00010286
        [71813.679669] RAX: 0000000000000025 RBX: ffff880217ce4708 RCX: 0000000000000001
        [71813.679669] RDX: 0000000000000000 RSI: ffffffff81c14bae RDI: 00000000ffffffff
        [71813.679669] RBP: ffffc90001cef7c0 R08: 0000000000000001 R09: 0000000000000001
        [71813.679669] R10: ffffc90001cef5e0 R11: ffffffff8343f007 R12: ffff880217d474c8
        [71813.679669] R13: 00000000fffffffe R14: ffff88021ccf1548 R15: 0000000000000101
        [71813.679669] FS:  00007f7cee84c480(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
        [71813.679669] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [71813.679669] CR2: 00007f7cedc1abf9 CR3: 00000002354b4003 CR4: 00000000001606e0
        [71813.679669] Call Trace:
        [71813.679669]  btrfs_unlink_inode+0x17/0x41 [btrfs]
        [71813.679669]  drop_one_dir_item+0xfa/0x131 [btrfs]
        [71813.679669]  add_inode_ref+0x71e/0x851 [btrfs]
        [71813.679669]  ? __lock_is_held+0x39/0x71
        [71813.679669]  ? replay_one_buffer+0x53/0x53a [btrfs]
        [71813.679669]  replay_one_buffer+0x4a4/0x53a [btrfs]
        [71813.679669]  ? rcu_read_unlock+0x3a/0x57
        [71813.679669]  ? __lock_is_held+0x39/0x71
        [71813.679669]  walk_up_log_tree+0x101/0x1d2 [btrfs]
        [71813.679669]  walk_log_tree+0xad/0x188 [btrfs]
        [71813.679669]  btrfs_recover_log_trees+0x1fa/0x31e [btrfs]
        [71813.679669]  ? replay_one_extent+0x544/0x544 [btrfs]
        [71813.679669]  open_ctree+0x1cf6/0x2209 [btrfs]
        [71813.679669]  btrfs_mount_root+0x368/0x482 [btrfs]
        [71813.679669]  ? trace_hardirqs_on_caller+0x14c/0x1a6
        [71813.679669]  ? __lockdep_init_map+0x176/0x1c2
        [71813.679669]  ? mount_fs+0x64/0x10b
        [71813.679669]  mount_fs+0x64/0x10b
        [71813.679669]  vfs_kern_mount+0x68/0xce
        [71813.679669]  btrfs_mount+0x13e/0x772 [btrfs]
        [71813.679669]  ? trace_hardirqs_on_caller+0x14c/0x1a6
        [71813.679669]  ? __lockdep_init_map+0x176/0x1c2
        [71813.679669]  ? mount_fs+0x64/0x10b
        [71813.679669]  mount_fs+0x64/0x10b
        [71813.679669]  vfs_kern_mount+0x68/0xce
        [71813.679669]  do_mount+0x6e5/0x973
        [71813.679669]  ? memdup_user+0x3e/0x5c
        [71813.679669]  SyS_mount+0x72/0x98
        [71813.679669]  entry_SYSCALL_64_fastpath+0x1e/0x8b
        [71813.679669] RIP: 0033:0x7f7cedf150ba
        [71813.679669] RSP: 002b:00007ffca71da688 EFLAGS: 00000206
        [71813.679669] Code: 7f a0 e8 51 0c fd ff 48 8b 43 50 f0 0f ba a8 30 2c 00 00 02 72 17 41 83 fd fb 74 11 44 89 ee 48 c7 c7 7d 11 7f a0 e8 38 f5 8d e0 <0f> ff 44 89 e9 ba 20 10 00 00 eb 4d 48 8b 4d b0 48 8b 75 88 4c
        [71813.679669] ---[ end trace 83bd473fc5b4663b ]---
        [71813.854764] BTRFS: error (device dm-0) in __btrfs_unlink_inode:4128: errno=-2 No such entry
        [71813.886994] BTRFS: error (device dm-0) in btrfs_replay_log:2307: errno=-2 No such entry (Failed to recover log tree)
        [71813.903357] BTRFS error (device dm-0): cleaner transaction attach returned -30
        [71814.128078] BTRFS error (device dm-0): open_ctree failed
      
      This happens because the log has inode reference items for both inode 258
      (the first file we created) and inode 259 (the second file created), and
      when processing the reference item for inode 258, we replace the
      corresponding item in the subvolume tree (which has two names, "foo" and
      "bar") with the one in the log (which only has one name, "foo") without
      removing the corresponding dir index keys from the parent directory.
      Later, when processing the inode reference item for inode 259, which has
      a name of "bar" associated to it, we notice that dir index entries exist
      for that name and for a different inode, so we attempt to unlink that
      name, which fails because the inode reference item for inode 258 no longer
      has the name "bar" associated to it, making a call to btrfs_unlink_inode()
      fail with a -ENOENT error.
      
      Fix this by unlinking all the names in an inode reference item from a
      subvolume tree that are not present in the inode reference item found in
      the log tree, before overwriting it with the item from the log tree.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
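The fix described above amounts to a set difference over names: anything listed in the subvolume tree's inode reference item but absent from the log's item must be unlinked before the item is overwritten. The sketch below shows only that selection step in plain C with string arrays. The function names are hypothetical and the real code works on on-disk inode reference items, not C strings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if name appears among the n entries of names. */
static bool name_in(const char *name, const char *const *names, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(name, names[i]) == 0)
            return true;
    }
    return false;
}

/*
 * Collect the names present in the subvolume tree's reference item but
 * missing from the log's item. out must have room for subvol_n entries.
 * Returns how many names were written to out.
 */
static size_t names_to_unlink(const char *const *subvol, size_t subvol_n,
                              const char *const *log, size_t log_n,
                              const char **out)
{
    assert(subvol != NULL && log != NULL && out != NULL);

    size_t count = 0;
    for (size_t i = 0; i < subvol_n; i++) {
        if (!name_in(subvol[i], log, log_n))
            out[count++] = subvol[i];
    }
    return count;
}
```

For the example in the message, the subvolume tree has "foo" and "bar" for inode 258 while the log has only "foo", so the single name selected is "bar". Unlinking it first removes the stale dir index entry that later made the replay of inode 259 fail.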
    • btrfs: use kvzalloc to allocate btrfs_fs_info · a8fd1f71
      Jeff Mahoney committed
      The srcu_struct in btrfs_fs_info scales in size with NR_CPUS.  On
      kernels built with NR_CPUS=8192, this can result in kmalloc failures
      that prevent mounting.
      
      There is work in progress to try to resolve this for every user of
      srcu_struct but using kvzalloc will work around the failures until
      that is complete.
      
      As an example with NR_CPUS=512 on x86_64: the overall size of
      subvol_srcu is 3460 bytes, fs_info is 6496.
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  9. 22 Jan, 2018 5 commits
  10. 28 Nov, 2017 1 commit
    • Rename superblock flags (MS_xyz -> SB_xyz) · 1751e8a6
      Linus Torvalds committed
      This is a pure automated search-and-replace of the internal kernel
      superblock flags.
      
      The s_flags are now called SB_*, with the names and the values for the
      moment mirroring the MS_* flags that they're equivalent to.
      
      Note how the MS_xyz flags are the ones passed to the mount system call,
      while the SB_xyz flags are what we then use in sb->s_flags.
      
      The script to do this was:
      
          # places to look in; re security/*: it generally should *not* be
          # touched (that stuff parses mount(2) arguments directly), but
          # there are two places where we really deal with superblock flags.
          FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
                  include/linux/fs.h include/uapi/linux/bfs_fs.h \
                  security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
          # the list of MS_... constants
          SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
                DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
                POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
                I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
                ACTIVE NOUSER"
      
          SED_PROG=
          for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done
      
          # we want files that contain at least one of MS_...,
          # with fs/namespace.c and fs/pnode.c excluded.
          L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')
      
          for f in $L; do sed -i $f $SED_PROG; done
      Requested-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>