1. 08 Dec 2020, 6 commits
  2. 07 Oct 2020, 2 commits
  3. 20 Aug 2020, 1 commit
    • btrfs: reset compression level for lzo on remount · 282dd7d7
      Authored by Marcos Paulo de Souza
      Currently a user can mount with "-o compress", which will set the
      compression algorithm to zlib and use the default compress level for
      zlib (3):
      
        relatime,compress=zlib:3,space_cache
      
      If the user remounts the fs using "-o compress=lzo", then the old
      compress_level is used:
      
        relatime,compress=lzo:3,space_cache
      
      But lzo does not expose any tunable compression level. The same stale
      level carries over whenever the compression algorithm is changed on
      remount, including to zstd.
      
      Fix this by resetting the compress_level when compress=lzo is
      specified.  With the fix applied, lzo is shown without compress level:
      
        relatime,compress=lzo,space_cache
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
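      A minimal userspace model of the fix follows; all identifiers are
      invented for illustration and this is not the kernel code:

        #include <stdio.h>
        #include <string.h>

        enum compress_type { COMP_NONE, COMP_ZLIB, COMP_LZO, COMP_ZSTD };

        struct mount_opts {
            enum compress_type compress_type;
            unsigned int compress_level;
        };

        /* Model of the option parser: lzo takes no level, so any level
         * left over from a previous zlib/zstd mount is cleared. */
        static void set_compress(struct mount_opts *o, const char *alg,
                                 unsigned int level)
        {
            if (!strcmp(alg, "zlib")) {
                o->compress_type = COMP_ZLIB;
                o->compress_level = level;
            } else if (!strcmp(alg, "lzo")) {
                o->compress_type = COMP_LZO;
                o->compress_level = 0; /* the fix: reset the stale level */
            } else if (!strcmp(alg, "zstd")) {
                o->compress_type = COMP_ZSTD;
                o->compress_level = level;
            }
        }

        int main(void)
        {
            struct mount_opts o = { 0 };

            set_compress(&o, "zlib", 3);               /* mount -o compress=zlib:3 */
            set_compress(&o, "lzo", o.compress_level); /* remount -o compress=lzo */
            printf("compress=lzo, level=%u\n", o.compress_level); /* level=0 */
            return 0;
        }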
  4. 11 Aug 2020, 3 commits
    • btrfs: make sure SB_I_VERSION doesn't get unset by remount · faa00889
      Authored by Josef Bacik
      There's some inconsistency around SB_I_VERSION handling with mount and
      remount.  Since we never really want it off, work around this by
      making sure the flag is not cleared on remount.
      
      There's a tiny CPU cost to setting the bit; apart from that, all
      changes to i_version also change one of the timestamps (ctime/mtime),
      so the inode needs to be synced anyway. We wouldn't save anything by
      disabling it.
      Reported-by: Eric Sandeen <sandeen@redhat.com>
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add perf impact analysis ]
      Signed-off-by: David Sterba <dsterba@suse.com>
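      A tiny model of the workaround; the flag values here are illustrative
      stand-ins for the kernel's SB_* constants in include/linux/fs.h:

        #include <assert.h>

        #define SB_RDONLY    (1UL << 0)  /* illustrative values */
        #define SB_I_VERSION (1UL << 23)

        /* Whatever flags the remount request carries, SB_I_VERSION is
         * forced back on so it can never be cleared. */
        static unsigned long remount_flags(unsigned long requested)
        {
            return requested | SB_I_VERSION;
        }

        int main(void)
        {
            /* A remount that tries to drop SB_I_VERSION... */
            unsigned long flags = remount_flags(SB_RDONLY);

            /* ...still ends up with it set. */
            assert(flags & SB_I_VERSION);
            return 0;
        }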
    • btrfs: don't show full path of bind mounts in subvol= · 3ef3959b
      Authored by Josef Bacik
      Chris Murphy reported a problem where rpm ostree will bind mount a
      bunch of things for whatever voodoo it's doing.  But when it does
      this, /proc/mounts shows something like:
      
        /dev/sda /mnt/test btrfs rw,relatime,subvolid=256,subvol=/foo 0 0
        /dev/sda /mnt/test/baz btrfs rw,relatime,subvolid=256,subvol=/foo/bar 0 0
      
      Here the bind mount shows subvol=/foo/bar even though subvolid=256 is
      actually subvol=/foo.  This is because we just print the dentry of the
      mount point, which in the case of bind mounts is the source path of
      the mountpoint.  Instead we should print the path of the actual
      subvolume.  Fix this by looking up the name for the subvolid we have
      mounted.  With the fix, the same test looks like this:
      
        /dev/sda /mnt/test btrfs rw,relatime,subvolid=256,subvol=/foo 0 0
        /dev/sda /mnt/test/baz btrfs rw,relatime,subvolid=256,subvol=/foo 0 0
      Reported-by: Chris Murphy <chris@colorremedies.com>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
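      The real code resolves the subvolume path with a root tree lookup;
      here is a toy userspace model of the idea (the table and names are
      invented):

        #include <stdio.h>

        /* Stand-in for the root tree: subvolid -> canonical path. */
        struct subvol { unsigned long long id; const char *path; };

        static const struct subvol table[] = {
            { 5,   "/"    },
            { 256, "/foo" },
        };

        /* The fix in spirit: print the path looked up from the mounted
         * subvolid instead of the bind mount's source dentry. */
        static const char *subvol_name_from_id(unsigned long long subvolid)
        {
            for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++)
                if (table[i].id == subvolid)
                    return table[i].path;
            return NULL;
        }

        int main(void)
        {
            /* Both the plain mount and the bind mount carry subvolid=256,
             * so both now show subvol=/foo. */
            printf("subvolid=256,subvol=%s\n", subvol_name_from_id(256));
            return 0;
        }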
    • btrfs: fix messages after changing compression level by remount · 27942c99
      Authored by David Sterba
      Forza reported on IRC that remounting with compression options does
      not reflect the change in level, or at least does not appear to do so
      according to the messages:
      
        mount -o compress=zstd:1 /dev/sda /mnt
        mount -o remount,compress=zstd:15 /mnt
      
      does not print the change to the level to syslog:
      
        [   41.366060] BTRFS info (device vda): use zstd compression, level 1
        [   41.368254] BTRFS info (device vda): disk space caching is enabled
        [   41.390429] BTRFS info (device vda): disk space caching is enabled
      
      What really happens is that the message is lost but the level is
      actually changed.
      
      There's another weird output, if compression is reset to 'no':
      
        [   45.413776] BTRFS info (device vda): use no compression, level 4
      
      To fix that, save the previous compression level, print the message
      also when only the level changes, and use a separate message for 'no'
      compression.
      
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: David Sterba <dsterba@suse.com>
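      An illustrative model of the message fix (names invented): remember
      the old type and level, log whenever either changed, and use a
      dedicated message when compression is turned off:

        #include <stdio.h>

        enum ctype { NONE, ZLIB, LZO, ZSTD };
        static const char *cname[] = { "no", "zlib", "lzo", "zstd" };

        static void apply_compress(enum ctype *type, int *level,
                                   enum ctype new_type, int new_level)
        {
            enum ctype old_type = *type;
            int old_level = *level;

            *type = new_type;
            *level = new_level;

            if (new_type == NONE) {
                if (old_type != NONE)   /* separate message, no level */
                    printf("BTRFS info: use no compression\n");
            } else if (new_type != old_type || new_level != old_level) {
                printf("BTRFS info: use %s compression, level %d\n",
                       cname[new_type], new_level);
            }
        }

        int main(void)
        {
            enum ctype type = NONE;
            int level = 0;

            apply_compress(&type, &level, ZSTD, 1);  /* mount: zstd:1 */
            apply_compress(&type, &level, ZSTD, 15); /* remount: now logged */
            apply_compress(&type, &level, NONE, 0);  /* remount: compress=no */
            return 0;
        }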
  5. 27 Jul 2020, 7 commits
    • btrfs: open-code remount flag setting in btrfs_remount · 88c4703f
      Authored by Johannes Thumshirn
      When we're (re)mounting a btrfs filesystem we set the
      BTRFS_FS_STATE_REMOUNTING state in fs_info to serialize against async
      reclaim or defrags.
      
      This flag is set in btrfs_remount_prepare() called by btrfs_remount().
      As btrfs_remount_prepare() does nothing but setting this flag and
      doesn't have a second caller, we can just open-code the flag setting in
      btrfs_remount().
      
      Similarly, open-code the clearing of the flag by moving it out of
      btrfs_remount_cleanup() into btrfs_remount(), keeping the two sides
      symmetrical.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
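      A compact model of the open-coded flag handling (the bit value is
      invented; the kernel manipulates fs_info->fs_state with
      set_bit()/clear_bit()):

        #include <stdio.h>

        #define BTRFS_FS_STATE_REMOUNTING (1UL << 1) /* illustrative */

        static unsigned long fs_state;

        static void btrfs_remount_model(void)
        {
            fs_state |= BTRFS_FS_STATE_REMOUNTING;  /* was btrfs_remount_prepare() */
            printf("remounting: async reclaim/defrag held off\n");
            /* ... apply the new mount options ... */
            fs_state &= ~BTRFS_FS_STATE_REMOUNTING; /* was in btrfs_remount_cleanup() */
        }

        int main(void)
        {
            btrfs_remount_model();
            return 0;
        }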
    • btrfs: document special case error codes for fs errors · 59131393
      Authored by Josef Bacik
      We've had some discussions about what to do in certain scenarios for
      error codes, specifically EUCLEAN and EROFS.  Document these near the
      error handling code so it's clear what their intentions are.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
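      As an illustration, the distinction being documented boils down to
      something like the following (a paraphrase, not the kernel's actual
      comment text):

        /*
         * EUCLEAN - an on-disk structure failed a consistency check: the
         *           filesystem itself is corrupted and needs repair.
         * EROFS   - nothing is corrupted, but completing the operation
         *           would require writing while the filesystem is (or must
         *           remain) read-only, e.g. after an error forced it
         *           read-only.
         */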
    • btrfs: don't traverse into the seed devices in show_devname · 4faf55b0
      Authored by Anand Jain
      ->show_devname currently shows the lowest devid in the list. As the
      seed devices have the lowest devids in a sprouted filesystem, userland
      tools such as findmnt end up seeing the seed device instead of a
      device from the read-writable sprouted filesystem, as shown below.
      
       mount /dev/sda /btrfs
       mount: /btrfs: WARNING: device write-protected, mounted read-only.
      
       findmnt --output SOURCE,TARGET,UUID /btrfs
       SOURCE   TARGET UUID
       /dev/sda /btrfs 899f7027-3e46-4626-93e7-7d4c9ad19111
      
       btrfs dev add -f /dev/sdb /btrfs
      
       umount /btrfs
       mount /dev/sdb /btrfs
      
       findmnt --output SOURCE,TARGET,UUID /btrfs
       SOURCE   TARGET UUID
       /dev/sda /btrfs 899f7027-3e46-4626-93e7-7d4c9ad19111
      
      All sprouts from a single seed will show the same seed device and the
      same fsid, which is confusing. It also causes problems in our
      prototype, as there is no reference to the sprouted filesystem(s)
      actually being used for reads and writes.

      This behavior dates back to the patch that implemented show_devname
      in btrfs, commit 9c5085c1 ("Btrfs: implement ->show_devname"). I
      looked for a particular reason why we would need to show the seed
      device and found none.
      
      So instead, do not traverse into the seed devices; just show the
      lowest devid in the sprouted filesystem.
      
      After the patch:
      
       mount /dev/sda /btrfs
       mount: /btrfs: WARNING: device write-protected, mounted read-only.
      
       findmnt --output SOURCE,TARGET,UUID /btrfs
       SOURCE   TARGET UUID
       /dev/sda /btrfs 899f7027-3e46-4626-93e7-7d4c9ad19111
      
       btrfs dev add -f /dev/sdb /btrfs
       mount -o rw,remount /dev/sdb /btrfs
      
       findmnt --output SOURCE,TARGET,UUID /btrfs
       SOURCE   TARGET UUID
       /dev/sdb /btrfs 595ca0e6-b82e-46b5-b9e2-c72a6928be48
      
       mount /dev/sda /btrfs1
       mount: /btrfs1: WARNING: device write-protected, mounted read-only.
      
       btrfs dev add -f /dev/sdc /btrfs1
      
       findmnt --output SOURCE,TARGET,UUID /btrfs1
       SOURCE   TARGET  UUID
       /dev/sdc /btrfs1 ca1dbb7a-8446-4f95-853c-a20f3f82bdbb
      
       cat /proc/self/mounts | grep btrfs
       /dev/sdb /btrfs btrfs rw,relatime,noacl,space_cache,subvolid=5,subvol=/ 0 0
       /dev/sdc /btrfs1 btrfs ro,relatime,noacl,space_cache,subvolid=5,subvol=/ 0 0
      Reported-by: Martin K. Petersen <martin.petersen@oracle.com>
      CC: stable@vger.kernel.org # 4.19+
      Tested-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
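      A toy model of the change (structure names invented): pick the lowest
      devid among the sprout's own devices and do not descend into the
      chained seed list:

        #include <stdio.h>

        struct dev { unsigned long long devid; const char *name; };

        struct fs_devices {
            const struct dev *devs;
            int num;
            const struct fs_devices *seed; /* chained seed fs, if any */
        };

        static const struct dev *lowest_devid(const struct fs_devices *fsd)
        {
            const struct dev *best = NULL;

            /* Before the fix this walk also descended into fsd->seed;
             * now it stops at the sprout's own devices. */
            for (int i = 0; i < fsd->num; i++)
                if (!best || fsd->devs[i].devid < best->devid)
                    best = &fsd->devs[i];
            return best;
        }

        int main(void)
        {
            const struct dev seed_devs[]   = { { 1, "/dev/sda" } };
            const struct dev sprout_devs[] = { { 2, "/dev/sdb" } };
            const struct fs_devices seed   = { seed_devs, 1, NULL };
            const struct fs_devices sprout = { sprout_devs, 1, &seed };

            /* Prints /dev/sdb, not the seed's /dev/sda. */
            printf("show_devname: %s\n", lowest_devid(&sprout)->name);
            return 0;
        }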
    • btrfs: remove deprecated mount option subvolrootid · b90a4ab6
      Authored by David Sterba
      The option subvolrootid used to be a workaround for mounting
      subvolumes and has been ineffective since 5e2a4b25 ("btrfs: deprecate
      subvolrootid mount option"). We have subvol= that works and we don't
      need to keep the cruft, so let's remove it.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove deprecated mount option alloc_start · d801e7a3
      Authored by David Sterba
      The mount option alloc_start has had no effect since 0d0c71b3
      ("btrfs: obsolete and remove mount option alloc_start"), which has the
      details of why it was deprecated. We can remove it.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: start deprecation of mount option inode_cache · b547a88e
      Authored by David Sterba
      The functionality is expected to be removed in 5.11; the option will
      still be parsed but will have no effect.
      
      Reasons for deprecation and removal:
      
      - very poor naming choice of the mount option: it's supposed to cache
        and reuse the inode _numbers_, but it sounds like some generic cache
        for inodes
      
      - the only known usecase where this option would make sense is on a
        32bit architecture where inode numbers in one subvolume would be
        exhausted due to 32bit inode::i_ino
      
      - the cache is stored on disk, consumes space, needs to be loaded and
        written back
      
      - new inode number allocation is slower due to lookups into the cache,
        compared to the default, which is a simple increment (see the sketch
        after this entry)
      
      - uses the free-space-cache code that is going to be deprecated as well
        in the future
      
      Known problems:
      
      - since 2011, returning EEXIST when there's not enough space in a page
        to store all checksums, see commit 4b9465cb ("Btrfs: add mount -o
        inode_cache")
      
      Remaining issues:
      
      - if the option was enabled and new inodes created, then the option
        disabled again, the cache is still stored on the devices and there's
        currently no way to remove it
      Signed-off-by: David Sterba <dsterba@suse.com>
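      A toy contrast of the two allocation strategies mentioned above (all
      names invented):

        #include <stdio.h>

        /* Default: a simple monotonic increment. */
        static unsigned long long next_ino = 256;

        static unsigned long long alloc_ino_increment(void)
        {
            return next_ino++;
        }

        /* inode_cache: reuse freed numbers first, which costs a lookup
         * into a cache of free numbers on every allocation. */
        static unsigned long long free_inos[64];
        static int nfree;

        static unsigned long long alloc_ino_cached(void)
        {
            if (nfree > 0)
                return free_inos[--nfree];
            return next_ino++;
        }

        int main(void)
        {
            free_inos[nfree++] = 260; /* a previously freed inode number */
            printf("increment: %llu\n", alloc_ino_increment()); /* 256 */
            printf("cached:    %llu\n", alloc_ino_cached());    /* 260, reused */
            return 0;
        }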
    • btrfs: introduce "rescue=" mount option · 74ef0018
      Authored by Qu Wenruo
      This patch introduces a new "rescue=" mount option group to collect
      all the mount options used for data recovery.
      
      Different rescue sub-options are separated by ':', e.g.
      "ro,rescue=nologreplay:usebackuproot".
      
      The original plan was to use ';', but ';' needs to be escaped/quoted,
      or it will be interpreted by bash, similar to '|'.
      
      A user can also specify rescue options one by one, like:
      "ro,rescue=nologreplay,rescue=usebackuproot".
      
      The following mount options are converted to "rescue="; the old mount
      options are deprecated but still available for compatibility purposes:
      
      - usebackuproot
        Now it's "rescue=usebackuproot"
      
      - nologreplay
        Now it's "rescue=nologreplay"
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
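      A sketch of splitting "rescue=" sub-options on ':' (invented helper;
      the kernel uses its own option tokenizer, not strtok):

        #include <stdio.h>
        #include <string.h>

        static void handle_rescue_options(char *value)
        {
            char *opt;

            for (opt = strtok(value, ":"); opt; opt = strtok(NULL, ":")) {
                if (!strcmp(opt, "nologreplay"))
                    printf("rescue: disabling log replay\n");
                else if (!strcmp(opt, "usebackuproot"))
                    printf("rescue: trying backup roots\n");
                else
                    printf("rescue: unknown option '%s'\n", opt);
            }
        }

        int main(void)
        {
            char value[] = "nologreplay:usebackuproot";

            /* As in: mount -o ro,rescue=nologreplay:usebackuproot */
            handle_rescue_options(value);
            return 0;
        }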
  6. 02 Jul 2020, 1 commit
  7. 25 May 2020, 4 commits
  8. 24 Mar 2020, 10 commits
  9. 13 Feb 2020, 1 commit
  10. 03 Feb 2020, 1 commit
    • btrfs: do not zero f_bavail if we have available space · d55966c4
      Authored by Josef Bacik
      There was some logic added a while ago to clear out f_bavail in
      statfs() if we did not have enough free metadata space to satisfy our
      global reserve.  This was incorrect at the time, but it didn't really
      pose a problem for normal filesystems because we would often allocate
      chunks when we got this low on free metadata space, and thus wouldn't
      hit this case unless we were actually full.
      
      Fast forward to today: we are now much better about not allocating
      metadata chunks all of the time.  Couple this with d792b0f1 ("btrfs:
      always reserve our entire size for the global reserve"), which means
      we'll easily have a larger global reserve than our free space, and we
      are now much more likely to trip over this while still having plenty
      of space.
      
      Fix this by skipping this logic if the global reserve's space_info is
      not full.  space_info->full is 0 unless we've attempted to allocate a
      chunk for that space_info and that has failed.  If this happens then
      the space for the global reserve is definitely sacred and we need to
      report f_bavail == 0, but before then we can just use our calculated
      f_bavail.
      Reported-by: Martin Steigerwald <martin@lichtvoll.de>
      Fixes: ca8a51b3 ("btrfs: statfs: report zero available if metadata are exhausted")
      CC: stable@vger.kernel.org # 4.5+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Tested-by: Martin Steigerwald <martin@lichtvoll.de>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
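      A toy model of the statfs change (names invented): f_bavail is only
      zeroed when the global reserve cannot be satisfied AND the metadata
      space_info is full, i.e. a chunk allocation has already failed:

        #include <stdio.h>

        struct space_info { int full; };

        static unsigned long long statfs_bavail(unsigned long long calculated,
                                                unsigned long long free_meta,
                                                unsigned long long global_rsv,
                                                const struct space_info *meta)
        {
            if (free_meta < global_rsv && meta->full)
                return 0; /* the reserve space really is sacred now */
            return calculated;
        }

        int main(void)
        {
            struct space_info meta = { .full = 0 };

            /* Reserve exceeds free metadata, but a chunk could still be
             * allocated: report the calculated available space. */
            printf("f_bavail = %llu\n", statfs_bavail(1000, 10, 50, &meta));

            meta.full = 1; /* chunk allocation failed: now report 0 */
            printf("f_bavail = %llu\n", statfs_bavail(1000, 10, 50, &meta));
            return 0;
        }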
  11. 20 Jan 2020, 2 commits
    • btrfs: add the beginning of async discard, discard workqueue · b0643e59
      Authored by Dennis Zhou
      When discard is enabled, every time a pinned extent is released back
      to the block_group's free space cache, a discard is issued for the
      extent.  This approach is overeager for the goal of helping the SSD
      maintain enough free space to prevent severe garbage collection
      situations.
      
      This adds the beginning of async discard.  Instead of issuing a
      discard before an extent is returned to the free space cache, the
      extent is just marked as untrimmed.  The block_group is then added to
      an LRU which feeds into a workqueue that issues discards at a much
      slower rate.  Full discarding of unused block groups is still done
      and will be addressed in a future patch of the series.
      
      For now, we don't persist the discard state of extents and bitmaps.
      Therefore, our failure recovery mode will be to consider extents
      untrimmed.  This lets us handle failure and unmounting as one and the
      same.
      
      On a number of Facebook webservers, I collected data every minute
      accounting the time we spent in btrfs_finish_extent_commit() (col. 1)
      and in btrfs_commit_transaction() (col. 2). btrfs_finish_extent_commit()
      is where we discard extents synchronously before returning them to the
      free space cache.
      
      discard=sync:
                       p99 total per minute       p99 total per minute
            Drive   |   extent_commit() (ms)  |    commit_trans() (ms)
          ---------------------------------------------------------------
           Drive A  |           434           |          1170
           Drive B  |           880           |          2330
           Drive C  |          2943           |          3920
           Drive D  |          4763           |          5701
      
      discard=async:
                       p99 total per minute       p99 total per minute
            Drive   |   extent_commit() (ms)  |    commit_trans() (ms)
          --------------------------------------------------------------
           Drive A  |           134           |           956
           Drive B  |            64           |          1972
           Drive C  |            59           |          1032
           Drive D  |            62           |          1200
      
      While it's not great that the stats are cumulative over 1 minute, all
      of these servers are running the same workload and the delta between
      the two is substantial.  We are spending significantly less time in
      btrfs_finish_extent_commit(), which is responsible for discarding.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
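      A toy model of the async discard flow (names invented): extents are
      marked untrimmed on release, the block group is queued, and a worker
      drains the queue later instead of discarding inline at commit:

        #include <stdio.h>

        #define MAX_BGS 8

        struct block_group { int id; int untrimmed; };

        static struct block_group *queue[MAX_BGS];
        static int head, tail;

        /* Release path: mark and enqueue, no discard issued here. */
        static void release_extent(struct block_group *bg)
        {
            bg->untrimmed = 1;
            queue[tail++ % MAX_BGS] = bg;
        }

        /* Discard worker: runs later; rate-limited in the real code. */
        static void discard_workfn(void)
        {
            while (head < tail) {
                struct block_group *bg = queue[head++ % MAX_BGS];

                printf("discarding block group %d\n", bg->id);
                bg->untrimmed = 0;
            }
        }

        int main(void)
        {
            struct block_group a = { 1, 0 }, b = { 2, 0 };

            release_extent(&a); /* the commit path stays fast */
            release_extent(&b);
            discard_workfn();   /* discards happen asynchronously */
            return 0;
        }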
    • btrfs: rename DISCARD mount option to DISCARD_SYNC · 46b27f50
      Authored by Dennis Zhou
      This series introduces async discard, which will use the flag
      DISCARD_ASYNC, so rename the original flag to DISCARD_SYNC, as that
      discard is done synchronously in transaction commit.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  12. 19 Nov 2019, 2 commits
    • btrfs: add support for 4-copy replication (raid1c4) · 8d6fac00
      Authored by David Sterba
      Add a new block group profile that stores 4 copies, in a similar way
      to what the current RAID1 does.  The profile attributes and
      constraints are defined in the raid table and used by the same code
      that already handles the 2- and 3-copy RAID1.
      
      The minimum number of devices is 4; the maximum number of
      devices/chunks that can be lost/damaged is 3.  There is no comparable
      traditional RAID level; the profile is added for future needs, to
      accompany triple-parity and beyond.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add support for 3-copy replication (raid1c3) · 47e6f742
      Authored by David Sterba
      Add a new block group profile that stores 3 copies, in a similar way
      to what the current RAID1 does.  The profile attributes and
      constraints are defined in the raid table and used by the same code
      that already handles the 2-copy RAID1.
      
      The minimum number of devices is 3; the maximum number of
      devices/chunks that can be lost/damaged is 2.  Like RAID6, but with
      33% space utilization.
      Signed-off-by: David Sterba <dsterba@suse.com>
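      A sketch of the raid-table entries these two commits describe, in the
      spirit of the kernel's btrfs_raid_array (field names approximate,
      values taken from the text above):

        #include <stdio.h>

        struct raid_attr {
            const char *name;
            int ncopies;            /* copies of each block */
            int devs_min;           /* minimum devices for the profile */
            int tolerated_failures; /* devices that may be lost */
        };

        static const struct raid_attr raid_table[] = {
            { "raid1",   2, 2, 1 },
            { "raid1c3", 3, 3, 2 },
            { "raid1c4", 4, 4, 3 },
        };

        int main(void)
        {
            for (size_t i = 0; i < sizeof(raid_table) / sizeof(raid_table[0]); i++) {
                const struct raid_attr *r = &raid_table[i];

                /* Usable space is 1/ncopies: ~33% for raid1c3, 25% for raid1c4. */
                printf("%-8s copies=%d min_devs=%d tolerates=%d util=%d%%\n",
                       r->name, r->ncopies, r->devs_min,
                       r->tolerated_failures, 100 / r->ncopies);
            }
            return 0;
        }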