1. 22 1月, 2018 16 次提交
  2. 28 11月, 2017 2 次提交
    • L
      Rename superblock flags (MS_xyz -> SB_xyz) · 1751e8a6
      Linus Torvalds 提交于
      This is a pure automated search-and-replace of the internal kernel
      superblock flags.
      
      The s_flags are now called SB_*, with the names and the values for the
      moment mirroring the MS_* flags that they're equivalent to.
      
      Note how the MS_xyz flags are the ones passed to the mount system call,
      while the SB_xyz flags are what we then use in sb->s_flags.
      
      The script to do this was:
      
          # places to look in; re security/*: it generally should *not* be
          # touched (that stuff parses mount(2) arguments directly), but
          # there are two places where we really deal with superblock flags.
          FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
                  include/linux/fs.h include/uapi/linux/bfs_fs.h \
                  security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
          # the list of MS_... constants
          SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
                DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
                POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
                I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
                ACTIVE NOUSER"
      
          SED_PROG=
          for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done
      
          # we want files that contain at least one of MS_...,
          # with fs/namespace.c and fs/pnode.c excluded.
          L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')
      
          for f in $L; do sed -i $f $SED_PROG; done
      Requested-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1751e8a6
    • Q
      btrfs: Fix wild memory access in compression level parser · eae8d825
      Qu Wenruo 提交于
      [BUG]
      Kernel panic when mounting with "-o compress" mount option.
      KASAN will report like:
      ------
      ==================================================================
      BUG: KASAN: wild-memory-access in strncmp+0x31/0xc0
      Read of size 1 at addr d86735fce994f800 by task mount/662
      ...
      Call Trace:
       dump_stack+0xe3/0x175
       kasan_report+0x163/0x370
       __asan_load1+0x47/0x50
       strncmp+0x31/0xc0
       btrfs_compress_str2level+0x20/0x70 [btrfs]
       btrfs_parse_options+0xff4/0x1870 [btrfs]
       open_ctree+0x2679/0x49f0 [btrfs]
       btrfs_mount+0x1b7f/0x1d30 [btrfs]
       mount_fs+0x49/0x190
       vfs_kern_mount.part.29+0xba/0x280
       vfs_kern_mount+0x13/0x20
       btrfs_mount+0x31e/0x1d30 [btrfs]
       mount_fs+0x49/0x190
       vfs_kern_mount.part.29+0xba/0x280
       do_mount+0xaad/0x1a00
       SyS_mount+0x98/0xe0
       entry_SYSCALL_64_fastpath+0x1f/0xbe
      ------
      
      [Cause]
      For 'compress' and 'compress_force' options, its token doesn't expect
      any parameter so its args[0] contains uninitialized data.
      Accessing args[0] will cause above wild memory access.
      
      [Fix]
      For Opt_compress and Opt_compress_force, set compression level to
      the default.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ set the default in advance ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      eae8d825
  3. 02 11月, 2017 2 次提交
    • A
      btrfs: allow setting zlib compression level via :9 · fa4d885a
      Adam Borowski 提交于
      This is bikeshedding, but it seems people are drastically more likely to
      understand "zlib:9" as compression level rather than an algorithm
      version compared to "zlib9".
      
      Based on feedback on the mailinglist, the ":9" will be the only accepted
      syntax. The level must be a single digit. Unrecognized format will
      result to the default, for forward compatibility in a similar way the
      compression algorithm specifier was relaxed in commit
      a7164fa4 ("btrfs: prepare for extensions in compression
      options").
      Signed-off-by: NAdam Borowski <kilobyte@angband.pl>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ tighten the accepted format ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fa4d885a
    • D
      btrfs: allow to set compression level for zlib · f51d2b59
      David Sterba 提交于
      Preliminary support for setting compression level for zlib, the
      following works:
      
      $ mount -o compess=zlib                 # default
      $ mount -o compess=zlib0                # same
      $ mount -o compess=zlib9                # level 9, slower sync, less data
      $ mount -o compess=zlib1                # level 1, faster sync, more data
      $ mount -o remount,compress=zlib3	# level set by remount
      
      The compress-force works the same as compress'.  The level is visible in
      the same format in /proc/mounts. Level set via file property does not
      work yet.
      
      Required patch: "btrfs: prepare for extensions in compression options"
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f51d2b59
  4. 30 10月, 2017 5 次提交
  5. 19 10月, 2017 1 次提交
  6. 21 8月, 2017 1 次提交
    • H
      btrfs: Do not use data_alloc_cluster in ssd mode · 583b7231
      Hans van Kranenburg 提交于
          This patch provides a band aid to improve the 'out of the box'
      behaviour of btrfs for disks that are detected as being an ssd.  In a
      general purpose mixed workload scenario, the current ssd mode causes
      overallocation of available raw disk space for data, while leaving
      behind increasing amounts of unused fragmented free space. This
      situation leads to early ENOSPC problems which are harming user
      experience and adoption of btrfs as a general purpose filesystem.
      
      This patch modifies the data extent allocation behaviour of the ssd mode
      to make it behave identical to nossd mode.  The metadata behaviour and
      additional ssd_spread option stay untouched so far.
      
      Recommendations for future development are to reconsider the current
      oversimplified nossd / ssd distinction and the broken detection
      mechanism based on the rotational attribute in sysfs and provide
      experienced users with a more flexible way to choose allocator behaviour
      for data and metadata, optimized for certain use cases, while keeping
      sane 'out of the box' default settings.  The internals of the current
      btrfs code have more potential than what currently gets exposed to the
      user to choose from.
      
          The SSD story...
      
          In the first year of btrfs development, around early 2008, btrfs
      gained a mount option which enables specific functionality for
      filesystems on solid state devices. The first occurance of this
      functionality is in commit e18e4809, labeled "Add mount -o ssd, which
      includes optimizations for seek free storage".
      
      The effect on allocating free space for doing (data) writes is to
      'cluster' writes together, writing them out in contiguous space, as
      opposed to a 'tetris' way of putting all separate writes into any free
      space fragment that fits (which is what the -o nossd behaviour does).
      
      A somewhat simplified explanation of what happens is that, when for
      example, the 'cluster' size is set to 2MiB, when we do some writes, the
      data allocator will search for a free space block that is 2MiB big, and
      put the writes in there. The ssd mode itself might allow a 2MiB cluster
      to be composed of multiple free space extents with some existing data in
      between, while the additional ssd_spread mount option kills off this
      option and requires fully free space.
      
      The idea behind this is (commit 536ac8ae): "The [...] clusters make it
      more likely a given IO will completely overwrite the ssd block, so it
      doesn't have to do an internal rwm cycle."; ssd block meaning nand erase
      block. So, effectively this means applying a "locality based algorithm"
      and trying to outsmart the actual ssd.
      
      Since then, various changes have been made to the involved code, but the
      basic idea is still present, and gets activated whenever the ssd mount
      option is active. This also happens by default, when the rotational flag
      as seen at /sys/block/<device>/queue/rotational is set to 0.
      
          However, there's a number of problems with this approach.
      
          First, what the optimization is trying to do is outsmart the ssd by
      assuming there is a relation between the physical address space of the
      block device as seen by btrfs and the actual physical storage of the
      ssd, and then adjusting data placement. However, since the introduction
      of the Flash Translation Layer (FTL) which is a part of the internal
      controller of an ssd, these attempts are futile. The use of good quality
      FTL in consumer ssd products might have been limited in 2008, but this
      situation has changed drastically soon after that time. Today, even the
      flash memory in your automatic cat feeding machine or your grandma's
      wheelchair has a full featured one.
      
      Second, the behaviour as described above results in the filesystem being
      filled up with badly fragmented free space extents because of relatively
      small pieces of space that are freed up by deletes, but not selected
      again as part of a 'cluster'. Since the algorithm prefers allocating a
      new chunk over going back to tetris mode, the end result is a filesystem
      in which all raw space is allocated, but which is composed of
      underutilized chunks with a 'shotgun blast' pattern of fragmented free
      space. Usually, the next problematic thing that happens is the
      filesystem wanting to allocate new space for metadata, which causes the
      filesystem to fail in spectacular ways.
      
      Third, the default mount options you get for an ssd ('ssd' mode enabled,
      'discard' not enabled), in combination with spreading out writes over
      the full address space and ignoring freed up space leads to worst case
      behaviour in providing information to the ssd itself, since it will
      never learn that all the free space left behind is actually free.  There
      are two ways to let an ssd know previously written data does not have to
      be preserved, which are sending explicit signals using discard or
      fstrim, or by simply overwriting the space with new data.  The worst
      case behaviour is the btrfs ssd_spread mount option in combination with
      not having discard enabled. It has a side effect of minimizing the reuse
      of free space previously written in.
      
      Fourth, the rotational flag in /sys/ does not reliably indicate if the
      device is a locally attached ssd. For example, iSCSI or NBD displays as
      non-rotational, while a loop device on an ssd shows up as rotational.
      
      The combination of the second and third problem effectively means that
      despite all the good intentions, the btrfs ssd mode reliably causes the
      ssd hardware and the filesystem structures and performance to be choked
      to death. The clickbait version of the title of this story would have
      been "Btrfs ssd optimizations considered harmful for ssds".
      
      The current nossd 'tetris' mode (even still without discard) allows a
      pattern of overwriting much more previously used space, causing many
      more implicit discards to happen because of the overwrite information
      the ssd gets. The actual location in the physical address space, as seen
      from the point of view of btrfs is irrelevant, because the actual writes
      to the low level flash are reordered anyway thanks to the FTL.
      
          Changes made in the code
      
      1. Make ssd mode data allocation identical to tetris mode, like nossd.
      2. Adjust and clean up filesystem mount messages so that we can easily
      identify if a kernel has this patch applied or not, when providing
      support to end users. Also, make better use of the *_and_info helpers to
      only trigger messages on actual state changes.
      
          Backporting notes
      
      Notes for whoever wants to backport this patch to their 4.9 LTS kernel:
      * First apply commit 951e7966 "btrfs: drop the nossd flag when
        remounting with -o ssd", or fixup the differences manually.
      * The rest of the conflicts are because of the fs_info refactoring. So,
        for example, instead of using fs_info, it's root->fs_info in
        extent-tree.c
      Signed-off-by: NHans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      583b7231
  7. 16 8月, 2017 6 次提交
    • D
      btrfs: prepare for extensions in compression options · a7164fa4
      David Sterba 提交于
      This is a minimal patch intended to be backported to older kernels.
      We're going to extend the string specifying the compression method and
      this would fail on kernels before that change (the string is compared
      exactly).
      
      Relax the string matching only to the prefix, ie. ignoring anything that
      goes after "zlib" or "lzo", regardless of th format extension we decide
      to use. This applies to the mount options and properties.
      
      That way, patched old kernels could be booted on systems already
      utilizing the new compression spec.
      
      Applicable since commit 63541927, v3.14.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a7164fa4
    • D
      btrfs: use GFP_KERNEL in mount and remount · 3ec83621
      David Sterba 提交于
      We don't need to restrict the allocation flags in btrfs_mount or
      _remount. No big filesystem locks are held (possibly s_umount but that
      does no count here).
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3ec83621
    • Q
      btrfs: Do chunk level check for degraded remount · b382cfe8
      Qu Wenruo 提交于
      Just the same for mount time check, use btrfs_check_rw_degradable() to
      check if we are OK to be remounted rw.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b382cfe8
    • A
      btrfs: resume qgroup rescan on rw remount · 6c6b5a39
      Aleksa Sarai 提交于
      Several distributions mount the "proper root" as ro during initrd and
      then remount it as rw before pivot_root(2). Thus, if a rescan had been
      aborted by a previous shutdown, the rescan would never be resumed.
      
      This issue would manifest itself as several btrfs ioctl(2)s causing the
      entire machine to hang when btrfs_qgroup_wait_for_completion was hit
      (due to the fs_info->qgroup_rescan_running flag being set but the rescan
      itself not being resumed). Notably, Docker's btrfs storage driver makes
      regular use of BTRFS_QUOTA_CTL_DISABLE and BTRFS_IOC_QUOTA_RESCAN_WAIT
      (causing this problem to be manifested on boot for some machines).
      
      Cc: <stable@vger.kernel.org> # v3.11+
      Cc: Jeff Mahoney <jeffm@suse.com>
      Fixes: b382a324 ("Btrfs: fix qgroup rescan resume on mount")
      Signed-off-by: NAleksa Sarai <asarai@suse.de>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Tested-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6c6b5a39
    • J
      btrfs: backref, add tracepoints for prelim_ref insertion and merging · 00142756
      Jeff Mahoney 提交于
      This patch adds a tracepoint event for prelim_ref insertion and
      merging.  For each, the ref being inserted or merged and the count
      of tree nodes is issued.
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      00142756
    • N
      btrfs: Add zstd support · 5c1aab1d
      Nick Terrell 提交于
      Add zstd compression and decompression support to BtrFS. zstd at its
      fastest level compresses almost as well as zlib, while offering much
      faster compression and decompression, approaching lzo speeds.
      
      I benchmarked btrfs with zstd compression against no compression, lzo
      compression, and zlib compression. I benchmarked two scenarios. Copying
      a set of files to btrfs, and then reading the files. Copying a tarball
      to btrfs, extracting it to btrfs, and then reading the extracted files.
      After every operation, I call `sync` and include the sync time.
      Between every pair of operations I unmount and remount the filesystem
      to avoid caching. The benchmark files can be found in the upstream
      zstd source repository under
      `contrib/linux-kernel/{btrfs-benchmark.sh,btrfs-extract-benchmark.sh}`
      [1] [2].
      
      I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM.
      The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor,
      16 GB of RAM, and a SSD.
      
      The first compression benchmark is copying 10 copies of the unzipped
      Silesia corpus [3] into a BtrFS filesystem mounted with
      `-o compress-force=Method`. The decompression benchmark times how long
      it takes to `tar` all 10 copies into `/dev/null`. The compression ratio is
      measured by comparing the output of `df` and `du`. See the benchmark file
      [1] for details. I benchmarked multiple zstd compression levels, although
      the patch uses zstd level 1.
      
      | Method  | Ratio | Compression MB/s | Decompression speed |
      |---------|-------|------------------|---------------------|
      | None    |  0.99 |              504 |                 686 |
      | lzo     |  1.66 |              398 |                 442 |
      | zlib    |  2.58 |               65 |                 241 |
      | zstd 1  |  2.57 |              260 |                 383 |
      | zstd 3  |  2.71 |              174 |                 408 |
      | zstd 6  |  2.87 |               70 |                 398 |
      | zstd 9  |  2.92 |               43 |                 406 |
      | zstd 12 |  2.93 |               21 |                 408 |
      | zstd 15 |  3.01 |               11 |                 354 |
      
      The next benchmark first copies `linux-4.11.6.tar` [4] to btrfs. Then it
      measures the compression ratio, extracts the tar, and deletes the tar.
      Then it measures the compression ratio again, and `tar`s the extracted
      files into `/dev/null`. See the benchmark file [2] for details.
      
      | Method | Tar Ratio | Extract Ratio | Copy (s) | Extract (s)| Read (s) |
      |--------|-----------|---------------|----------|------------|----------|
      | None   |      0.97 |          0.78 |    0.981 |      5.501 |    8.807 |
      | lzo    |      2.06 |          1.38 |    1.631 |      8.458 |    8.585 |
      | zlib   |      3.40 |          1.86 |    7.750 |     21.544 |   11.744 |
      | zstd 1 |      3.57 |          1.85 |    2.579 |     11.479 |    9.389 |
      
      [1] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/btrfs-benchmark.sh
      [2] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/btrfs-extract-benchmark.sh
      [3] http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia
      [4] https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.11.6.tar.xz
      
      zstd source repository: https://github.com/facebook/zstdSigned-off-by: NNick Terrell <terrelln@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      5c1aab1d
  8. 17 7月, 2017 1 次提交
    • D
      VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb) · bc98a42c
      David Howells 提交于
      Firstly by applying the following with coccinelle's spatch:
      
      	@@ expression SB; @@
      	-SB->s_flags & MS_RDONLY
      	+sb_rdonly(SB)
      
      to effect the conversion to sb_rdonly(sb), then by applying:
      
      	@@ expression A, SB; @@
      	(
      	-(!sb_rdonly(SB)) && A
      	+!sb_rdonly(SB) && A
      	|
      	-A != (sb_rdonly(SB))
      	+A != sb_rdonly(SB)
      	|
      	-A == (sb_rdonly(SB))
      	+A == sb_rdonly(SB)
      	|
      	-!(sb_rdonly(SB))
      	+!sb_rdonly(SB)
      	|
      	-A && (sb_rdonly(SB))
      	+A && sb_rdonly(SB)
      	|
      	-A || (sb_rdonly(SB))
      	+A || sb_rdonly(SB)
      	|
      	-(sb_rdonly(SB)) != A
      	+sb_rdonly(SB) != A
      	|
      	-(sb_rdonly(SB)) == A
      	+sb_rdonly(SB) == A
      	|
      	-(sb_rdonly(SB)) && A
      	+sb_rdonly(SB) && A
      	|
      	-(sb_rdonly(SB)) || A
      	+sb_rdonly(SB) || A
      	)
      
      	@@ expression A, B, SB; @@
      	(
      	-(sb_rdonly(SB)) ? 1 : 0
      	+sb_rdonly(SB)
      	|
      	-(sb_rdonly(SB)) ? A : B
      	+sb_rdonly(SB) ? A : B
      	)
      
      to remove left over excess bracketage and finally by applying:
      
      	@@ expression A, SB; @@
      	(
      	-(A & MS_RDONLY) != sb_rdonly(SB)
      	+(bool)(A & MS_RDONLY) != sb_rdonly(SB)
      	|
      	-(A & MS_RDONLY) == sb_rdonly(SB)
      	+(bool)(A & MS_RDONLY) == sb_rdonly(SB)
      	)
      
      to make comparisons against the result of sb_rdonly() (which is a bool)
      work correctly.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      bc98a42c
  9. 06 7月, 2017 1 次提交
    • D
      VFS: Don't use save/replace_mount_options if not using generic_show_options · c3d98ea0
      David Howells 提交于
      btrfs, debugfs, reiserfs and tracefs call save_mount_options() and reiserfs
      calls replace_mount_options(), but they then implement their own
      ->show_options() methods and don't touch s_options, rendering the saved
      options unnecessary.  I'm trying to eliminate s_options to make it easier
      to implement a context-based mount where the mount options can be passed
      individually over a file descriptor.
      
      Remove the calls to save/replace_mount_options() call in these cases.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      cc: Chris Mason <clm@fb.com>
      cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      cc: Steven Rostedt <rostedt@goodmis.org>
      cc: linux-btrfs@vger.kernel.org
      cc: reiserfs-devel@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      c3d98ea0
  10. 30 6月, 2017 1 次提交
  11. 20 6月, 2017 4 次提交