1. 31 Mar, 2018 6 commits
  2. 26 Mar, 2018 11 commits
  3. 01 Mar, 2018 2 commits
    • Btrfs: fix log replay failure after unlink and link combination · 1f250e92
      Filipe Manana committed
      If we have a file with 2 (or more) hard links in the same directory,
      remove one of the hard links, create a new file (or link an existing file)
      in the same directory with the name of the removed hard link, and then
      finally fsync the new file, we end up with a log that fails to replay,
      causing a mount failure.
      
      Example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/testdir
        $ touch /mnt/testdir/foo
        $ ln /mnt/testdir/foo /mnt/testdir/bar
      
        $ sync
      
        $ unlink /mnt/testdir/bar
        $ touch /mnt/testdir/bar
        $ xfs_io -c "fsync" /mnt/testdir/bar
      
        <power failure>
      
        $ mount /dev/sdb /mnt
        mount: mount(2) failed: /mnt: No such file or directory
      
      When replaying the log, for that example, we also see the following in
      dmesg/syslog:
      
        [71813.671307] BTRFS info (device dm-0): failed to delete reference to bar, inode 258 parent 257
        [71813.674204] ------------[ cut here ]------------
        [71813.675694] BTRFS: Transaction aborted (error -2)
        [71813.677236] WARNING: CPU: 1 PID: 13231 at fs/btrfs/inode.c:4128 __btrfs_unlink_inode+0x17b/0x355 [btrfs]
        [71813.679669] Modules linked in: btrfs xfs f2fs dm_flakey dm_mod dax ghash_clmulni_intel ppdev pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper evdev psmouse i2c_piix4 parport_pc i2c_core pcspkr sg serio_raw parport button sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod ata_generic sd_mod virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel floppy virtio e1000 scsi_mod [last unloaded: btrfs]
        [71813.679669] CPU: 1 PID: 13231 Comm: mount Tainted: G        W        4.15.0-rc9-btrfs-next-56+ #1
        [71813.679669] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
        [71813.679669] RIP: 0010:__btrfs_unlink_inode+0x17b/0x355 [btrfs]
        [71813.679669] RSP: 0018:ffffc90001cef738 EFLAGS: 00010286
        [71813.679669] RAX: 0000000000000025 RBX: ffff880217ce4708 RCX: 0000000000000001
        [71813.679669] RDX: 0000000000000000 RSI: ffffffff81c14bae RDI: 00000000ffffffff
        [71813.679669] RBP: ffffc90001cef7c0 R08: 0000000000000001 R09: 0000000000000001
        [71813.679669] R10: ffffc90001cef5e0 R11: ffffffff8343f007 R12: ffff880217d474c8
        [71813.679669] R13: 00000000fffffffe R14: ffff88021ccf1548 R15: 0000000000000101
        [71813.679669] FS:  00007f7cee84c480(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
        [71813.679669] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [71813.679669] CR2: 00007f7cedc1abf9 CR3: 00000002354b4003 CR4: 00000000001606e0
        [71813.679669] Call Trace:
        [71813.679669]  btrfs_unlink_inode+0x17/0x41 [btrfs]
        [71813.679669]  drop_one_dir_item+0xfa/0x131 [btrfs]
        [71813.679669]  add_inode_ref+0x71e/0x851 [btrfs]
        [71813.679669]  ? __lock_is_held+0x39/0x71
        [71813.679669]  ? replay_one_buffer+0x53/0x53a [btrfs]
        [71813.679669]  replay_one_buffer+0x4a4/0x53a [btrfs]
        [71813.679669]  ? rcu_read_unlock+0x3a/0x57
        [71813.679669]  ? __lock_is_held+0x39/0x71
        [71813.679669]  walk_up_log_tree+0x101/0x1d2 [btrfs]
        [71813.679669]  walk_log_tree+0xad/0x188 [btrfs]
        [71813.679669]  btrfs_recover_log_trees+0x1fa/0x31e [btrfs]
        [71813.679669]  ? replay_one_extent+0x544/0x544 [btrfs]
        [71813.679669]  open_ctree+0x1cf6/0x2209 [btrfs]
        [71813.679669]  btrfs_mount_root+0x368/0x482 [btrfs]
        [71813.679669]  ? trace_hardirqs_on_caller+0x14c/0x1a6
        [71813.679669]  ? __lockdep_init_map+0x176/0x1c2
        [71813.679669]  ? mount_fs+0x64/0x10b
        [71813.679669]  mount_fs+0x64/0x10b
        [71813.679669]  vfs_kern_mount+0x68/0xce
        [71813.679669]  btrfs_mount+0x13e/0x772 [btrfs]
        [71813.679669]  ? trace_hardirqs_on_caller+0x14c/0x1a6
        [71813.679669]  ? __lockdep_init_map+0x176/0x1c2
        [71813.679669]  ? mount_fs+0x64/0x10b
        [71813.679669]  mount_fs+0x64/0x10b
        [71813.679669]  vfs_kern_mount+0x68/0xce
        [71813.679669]  do_mount+0x6e5/0x973
        [71813.679669]  ? memdup_user+0x3e/0x5c
        [71813.679669]  SyS_mount+0x72/0x98
        [71813.679669]  entry_SYSCALL_64_fastpath+0x1e/0x8b
        [71813.679669] RIP: 0033:0x7f7cedf150ba
        [71813.679669] RSP: 002b:00007ffca71da688 EFLAGS: 00000206
        [71813.679669] Code: 7f a0 e8 51 0c fd ff 48 8b 43 50 f0 0f ba a8 30 2c 00 00 02 72 17 41 83 fd fb 74 11 44 89 ee 48 c7 c7 7d 11 7f a0 e8 38 f5 8d e0 <0f> ff 44 89 e9 ba 20 10 00 00 eb 4d 48 8b 4d b0 48 8b 75 88 4c
        [71813.679669] ---[ end trace 83bd473fc5b4663b ]---
        [71813.854764] BTRFS: error (device dm-0) in __btrfs_unlink_inode:4128: errno=-2 No such entry
        [71813.886994] BTRFS: error (device dm-0) in btrfs_replay_log:2307: errno=-2 No such entry (Failed to recover log tree)
        [71813.903357] BTRFS error (device dm-0): cleaner transaction attach returned -30
        [71814.128078] BTRFS error (device dm-0): open_ctree failed
      
      This happens because the log has inode reference items for both inode 258
      (the first file we created) and inode 259 (the second file created), and
      when processing the reference item for inode 258, we replace the
      corresponding item in the subvolume tree (which has two names, "foo" and
      "bar") with the one in the log (which only has one name, "foo") without
      removing the corresponding dir index keys from the parent directory.
      Later, when processing the inode reference item for inode 259, which has
      a name of "bar" associated to it, we notice that dir index entries exist
      for that name and for a different inode, so we attempt to unlink that
      name, which fails because the inode reference item for inode 258 no longer
      has the name "bar" associated to it, making a call to btrfs_unlink_inode()
      fail with a -ENOENT error.
      
      Fix this by unlinking all the names in an inode reference item from a
      subvolume tree that are not present in the inode reference item found in
      the log tree, before overwriting it with the item from the log tree.
      
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
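The fix described in commit 1f250e92 amounts to a set difference over names: every name in the subvolume tree's inode reference item that is missing from the log tree's item must be unlinked before the item is overwritten. A minimal user-space sketch of that idea follows. The `struct ref_item` type and both function names are hypothetical illustrations, not the kernel's data structures or the code from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Toy model of an inode reference item: only the list of names an inode
 * has in one parent directory. This is not the on-disk btrfs format.
 */
struct ref_item {
    const char *const *names;
    size_t count;
};

/* Return 1 if `name` appears in `item`, 0 otherwise. */
static int ref_item_has_name(const struct ref_item *item, const char *name)
{
    for (size_t i = 0; i < item->count; i++)
        if (strcmp(item->names[i], name) == 0)
            return 1;
    return 0;
}

/*
 * Collect the names present in the subvolume tree item but missing from
 * the log tree item. In the scenario above these are the names that have
 * to be unlinked before the subvolume item is replaced by the logged one.
 * `out` must have room for subvol->count entries; returns how many names
 * were stored.
 */
static size_t names_to_unlink(const struct ref_item *subvol,
                              const struct ref_item *logged,
                              const char **out)
{
    size_t n = 0;

    assert(out != NULL);
    for (size_t i = 0; i < subvol->count; i++)
        if (!ref_item_has_name(logged, subvol->names[i]))
            out[n++] = subvol->names[i];
    return n;
}
```

With the example from the commit message, the subvolume item for inode 258 holds "foo" and "bar" while the logged item holds only "foo", so the only name to unlink is "bar".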
    • btrfs: use kvzalloc to allocate btrfs_fs_info · a8fd1f71
      Jeff Mahoney committed
      The srcu_struct in btrfs_fs_info scales in size with NR_CPUS.  On
      kernels built with NR_CPUS=8192, this can result in kmalloc failures
      that prevent mounting.
      
      There is work in progress to try to resolve this for every user of
      srcu_struct but using kvzalloc will work around the failures until
      that is complete.
      
      As an example with NR_CPUS=512 on x86_64: the overall size of
      subvol_srcu is 3460 bytes, fs_info is 6496.
      
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 22 Jan, 2018 5 commits
  5. 28 Nov, 2017 1 commit
    • Rename superblock flags (MS_xyz -> SB_xyz) · 1751e8a6
      Linus Torvalds committed
      This is a pure automated search-and-replace of the internal kernel
      superblock flags.
      
      The s_flags are now called SB_*, with the names and the values for the
      moment mirroring the MS_* flags that they're equivalent to.
      
      Note how the MS_xyz flags are the ones passed to the mount system call,
      while the SB_xyz flags are what we then use in sb->s_flags.
      
      The script to do this was:
      
          # places to look in; re security/*: it generally should *not* be
          # touched (that stuff parses mount(2) arguments directly), but
          # there are two places where we really deal with superblock flags.
          FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
                  include/linux/fs.h include/uapi/linux/bfs_fs.h \
                  security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
          # the list of MS_... constants
          SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
                DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
                POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
                I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
                ACTIVE NOUSER"
      
          SED_PROG=
          for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done
      
          # we want files that contain at least one of MS_...,
          # with fs/namespace.c and fs/pnode.c excluded.
          L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')
      
          for f in $L; do sed -i $f $SED_PROG; done
      
      Requested-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 16 Nov, 2017 1 commit
    • Btrfs: fix reported number of inode blocks after buffered append writes · e3b8a485
      Filipe Manana committed
      The patch from commit a7e3b975 ("Btrfs: fix reported number of inode
      blocks") introduced a regression where if we do a buffered write starting
      at position equal to or greater than the file's size and then stat(2) the
      file before writeback is triggered, the number of used blocks does not
      change (unless there's a prealloc/unwritten extent). Example:
      
        $ xfs_io -f -c "pwrite -S 0xab 0 64K" foobar
        $ du -h foobar
        0	foobar
        $ sync
        $ du -h foobar
        64K	foobar
      
      The first version of that patch didn't have this regression, and the
      second version, which was the one committed, was made only to address a
      performance regression detected by the Intel test robots using fs_mark.
      
      This fixes the regression by setting the new delalloc bit in the range, and
      doing it at btrfs_dirty_pages() while setting the regular delalloc bit as
      well, so that we set both bits at once and avoid navigating the inode's io
      tree twice. Doing it at btrfs_dirty_pages() is also the most meaningful
      place, as we should set the new delalloc bit when we set the delalloc bit,
      which happens only if we copied bytes into the pages at
      __btrfs_buffered_write().
      
      This was making some of LTP's du tests fail, which can be quickly run
      using a command line like the following:
      
        $ ./runltp -q -p -l /ltp.log -f commands -s du -d /mnt
      
      Fixes: a7e3b975 ("Btrfs: fix reported number of inode blocks")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  7. 02 Nov, 2017 3 commits
    • btrfs: make the delalloc block rsv per inode · 69fe2d75
      Josef Bacik committed
      The way we handle delalloc metadata reservations has gotten
      progressively more complicated over the years.  There is so much cruft
      and weirdness around keeping the reserved count and outstanding counters
      consistent and handling the error cases that it's impossible to
      understand.
      
      Fix this by making the delalloc block rsv per-inode.  This way we can
      calculate the actual size of the outstanding metadata reservations every
      time we make a change, and then reserve the delta based on that amount.
      This greatly simplifies the code everywhere, and makes the error
      handling in btrfs_delalloc_reserve_metadata far less terrifying.
      
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: rework outstanding_extents · 8b62f87b
      Josef Bacik committed
      Right now we jump through a lot of weird hoops around outstanding_extents in order
      to keep the extent count consistent.  This is because we logically
      transfer the outstanding_extent count from the initial reservation
      through the set_delalloc_bits.  This makes it pretty difficult to get a
      handle on how and when we need to mess with outstanding_extents.
      
      Fix this by revamping the rules of how we deal with outstanding_extents.
      Now instead everybody that is holding on to a delalloc extent is
      required to increase the outstanding extents count for itself.  This
      means we'll have something like this
      
      btrfs_delalloc_reserve_metadata	- outstanding_extents = 1
       btrfs_set_extent_delalloc	- outstanding_extents = 2
      btrfs_release_delalloc_extents	- outstanding_extents = 1
      
      for an initial file write.  Now take the append write where we extend an
      existing delalloc range but still under the maximum extent size
      
      btrfs_delalloc_reserve_metadata - outstanding_extents = 2
        btrfs_set_extent_delalloc
          btrfs_set_bit_hook		- outstanding_extents = 3
          btrfs_merge_extent_hook	- outstanding_extents = 2
      btrfs_delalloc_release_extents	- outstanding_extents = 1
      
      In order to make the ordered extent transition we of course must now
      make ordered extents carry their own outstanding_extent reservation, so
      for cow_file_range we end up with
      
      btrfs_add_ordered_extent	- outstanding_extents = 2
      clear_extent_bit		- outstanding_extents = 1
      btrfs_remove_ordered_extent	- outstanding_extents = 0
      
      This makes all manipulations of outstanding_extents much more explicit.
      Every successful call to btrfs_delalloc_reserve_metadata _must_ now be
      combined with btrfs_release_delalloc_extents, even in the error case, as
      that is the only function that actually modifies the
      outstanding_extents counter.
      
      The drawback to this is now we are much more likely to have transient
      cases where outstanding_extents is much larger than it actually should
      be.  This could happen before as we manipulated the delalloc bits, but
      now it happens basically at every write.  This may put more pressure on
      the ENOSPC flushing code, but I think making this code simpler is worth
      the cost.  I have another change coming to mitigate this side-effect
      somewhat.
      
      I also added trace points for the counter manipulation.  These were used
      by a bpf script I wrote to help track down leak issues.
      
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
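The counter sequences listed in commit 8b62f87b can be replayed with a small user-space model in which every holder of a delalloc extent takes its own count and later drops it. This is only a sketch of the accounting rule. The `toy_` names are hypothetical, and each step is labeled with the kernel function the commit message associates with it.

```c
#include <assert.h>

/* Toy stand-in for the per-inode counter; not the kernel's btrfs_inode. */
struct toy_inode {
    unsigned int outstanding_extents;
};

/* A holder takes its own count on the inode. */
static void toy_hold(struct toy_inode *inode)
{
    inode->outstanding_extents++;
}

/* A holder drops the count it took earlier. */
static void toy_drop(struct toy_inode *inode)
{
    assert(inode->outstanding_extents > 0);
    inode->outstanding_extents--;
}

/* Initial file write, starting from a counter of 0. */
static unsigned int toy_initial_write(struct toy_inode *inode)
{
    toy_hold(inode);    /* btrfs_delalloc_reserve_metadata: 1 */
    toy_hold(inode);    /* btrfs_set_extent_delalloc:       2 */
    toy_drop(inode);    /* release of the reservation hold: 1 */
    return inode->outstanding_extents;
}

/* Append write that merges into an existing delalloc range (counter 1). */
static unsigned int toy_append_write(struct toy_inode *inode)
{
    toy_hold(inode);    /* btrfs_delalloc_reserve_metadata: 2 */
    toy_hold(inode);    /* btrfs_set_bit_hook:              3 */
    toy_drop(inode);    /* btrfs_merge_extent_hook:         2 */
    toy_drop(inode);    /* release of the reservation hold: 1 */
    return inode->outstanding_extents;
}

/* Transition to an ordered extent, starting from a counter of 1. */
static unsigned int toy_writeback(struct toy_inode *inode)
{
    toy_hold(inode);    /* btrfs_add_ordered_extent:    2 */
    toy_drop(inode);    /* clear_extent_bit:            1 */
    toy_drop(inode);    /* btrfs_remove_ordered_extent: 0 */
    return inode->outstanding_extents;
}
```

Running the three helpers in order on a zeroed `struct toy_inode` ends with the counter back at 0, which is the invariant the reworked rules are meant to make easy to check.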
    • btrfs: allow to set compression level for zlib · f51d2b59
      David Sterba committed
      Preliminary support for setting compression level for zlib, the
      following works:
      
      $ mount -o compress=zlib                # default
      $ mount -o compress=zlib0               # same
      $ mount -o compress=zlib9               # level 9, slower sync, less data
      $ mount -o compress=zlib1               # level 1, faster sync, more data
      $ mount -o remount,compress=zlib3       # level set by remount
      
      The compress-force option works the same as compress.  The level is visible in
      the same format in /proc/mounts. Level set via file property does not
      work yet.
      
      Required patch: "btrfs: prepare for extensions in compression options"
      
      Signed-off-by: David Sterba <dsterba@suse.com>
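The option values shown in commit f51d2b59 follow a simple pattern: "zlib" optionally followed by one digit, where no digit or 0 means the default level. The sketch below parses such a value in user space. `toy_zlib_level` is a hypothetical helper written for illustration and is not the function added by the patch. Treating anything other than a single trailing digit 1-9 as "default" is an assumption of this sketch.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: extract the level from a "zlib" or "zlibN" option
 * value. Returns the level 1-9 when exactly one such digit follows "zlib",
 * 0 (meaning: use the default level) for any other "zlib..." value, and
 * -1 when the value does not start with "zlib" at all.
 */
static int toy_zlib_level(const char *value)
{
    assert(value != NULL);

    if (strncmp(value, "zlib", 4) != 0)
        return -1;
    /* value[5] is only read when value[4] is a digit, so it is in bounds. */
    if (value[4] >= '1' && value[4] <= '9' && value[5] == '\0')
        return value[4] - '0';
    return 0;
}
```

For the mount examples above, "zlib" and "zlib0" both map to 0 (default), while "zlib9", "zlib1" and "zlib3" map to 9, 1 and 3.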
  8. 30 Oct, 2017 7 commits
  9. 04 Oct, 2017 1 commit
    • Btrfs: fix overlap of fs_info::flags values · 69ad5976
      Tsutomu Itoh committed
      Because the values of BTRFS_FS_EXCL_OP and BTRFS_FS_QUOTA_OVERRIDE overlap,
      we should change the value.
      
      First, BTRFS_FS_EXCL_OP was set to 14.
      
        commit 171938e5 ("btrfs: track exclusive filesystem operation in flags")
      
      Next, the value of BTRFS_FS_QUOTA_OVERRIDE was set to 14.
      
        commit f29efe29 ("btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE")
      
      As a result, both flags ended up with the value 14 by accident.
      This problem is solved by defining the value of BTRFS_FS_EXCL_OP as 16;
      the flags are internal, so the value can be changed.
      
      Fixes: f29efe29 ("btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE")
      CC: stable@vger.kernel.org # 4.13+
      Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ minimize the change, update only BTRFS_FS_EXCL_OP ]
      Signed-off-by: David Sterba <dsterba@suse.com>
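The effect of the overlap fixed in commit 69ad5976 is easy to see with plain bit operations: when two flags share bit number 14, setting one makes a test of the other succeed. The sketch below uses hypothetical `TOY_` macros and user-space helpers in place of the kernel's `set_bit()` and `test_bit()`. Only the bit numbers 14 and 16 come from the commit message.

```c
#include <assert.h>

/* Bit numbers taken from the commit message; all other flags omitted. */
#define TOY_QUOTA_OVERRIDE  14
#define TOY_EXCL_OP_OLD     14  /* before the fix: same bit as above */
#define TOY_EXCL_OP_FIXED   16  /* after the fix */

/* User-space stand-ins for the kernel's set_bit() and test_bit(). */
static void toy_set_bit(int nr, unsigned long *flags)
{
    *flags |= 1UL << nr;
}

static int toy_test_bit(int nr, const unsigned long *flags)
{
    return (int)((*flags >> nr) & 1UL);
}

/*
 * Set only the exclusive-operation flag at bit `excl_op_bit` and report
 * whether the quota override flag then appears to be set as well.
 */
static int toy_excl_op_aliases_quota_override(int excl_op_bit)
{
    unsigned long flags = 0;

    assert(excl_op_bit >= 0 && excl_op_bit < 32);
    toy_set_bit(excl_op_bit, &flags);
    return toy_test_bit(TOY_QUOTA_OVERRIDE, &flags);
}
```

With the old value the helper returns 1 (the two flags alias each other); with the fixed value it returns 0.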
  10. 26 Sep, 2017 1 commit
    • btrfs: remove BTRFS_FS_QUOTA_DISABLING flag · c2faff79
      Misono, Tomohiro committed
      Currently, "btrfs quota enable" fails the first time it is run after
      "btrfs quota disable", with syslog output "qgroup_rescan_init failed with
      -22", but it succeeds the second time.
      
      When "quota disable" is called, the BTRFS_FS_QUOTA_DISABLING flag bit is
      set in fs_info->flags in btrfs_quota_disable(), but it is not dropped
      in btrfs_run_qgroups() (which is called in btrfs_commit_transaction())
      because quota_root has already been freed. If "quota enable" is called
      after that, both the BTRFS_FS_QUOTA_DISABLING and BTRFS_FS_QUOTA_ENABLED
      flags are dropped in btrfs_run_qgroups() since quota_root is not NULL.
      This leads to the failure of "quota enable" the first time.
      
      BTRFS_FS_QUOTA_DISABLING flag is not used outside of "quota disable"
      context and is equivalent to whether quota_root is NULL or not.
      btrfs_run_qgroups() checks whether quota_root is NULL or not in the first
      place.
      
      So, let's remove BTRFS_FS_QUOTA_DISABLING flag.
      
      Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  11. 21 Aug, 2017 2 commits