1. 02 Nov, 2017 1 commit
    • J
      btrfs: make the delalloc block rsv per inode · 69fe2d75
      Josef Bacik committed
      The way we handle delalloc metadata reservations has gotten
      progressively more complicated over the years.  There is so much cruft
      and weirdness around keeping the reserved count and outstanding counters
      consistent and handling the error cases that it's impossible to
      understand.
      
      Fix this by making the delalloc block rsv per-inode.  This way we can
      calculate the actual size of the outstanding metadata reservations every
      time we make a change, and then reserve the delta based on that amount.
      This greatly simplifies the code everywhere, and makes the error
      handling in btrfs_delalloc_reserve_metadata far less terrifying.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      69fe2d75
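The "recompute the full size, then reserve the delta" idea from the commit above can be sketched as follows. This is a minimal illustration, not the kernel code: `struct ex_inode_rsv`, `ex_rsv_update()` and the flat `EX_BYTES_PER_EXTENT` cost are hypothetical names and values chosen for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-inode reservation state; not the kernel's structure. */
struct ex_inode_rsv {
	uint64_t reserved;            /* metadata bytes currently reserved */
	uint64_t outstanding_extents; /* delalloc extents not yet written */
};

/* Assumed flat metadata cost per outstanding extent (example value only). */
#define EX_BYTES_PER_EXTENT 16384ULL

/*
 * Apply a change in the number of outstanding extents, recompute the
 * total reservation from scratch, and return the signed delta that the
 * caller would have to reserve (positive) or release (negative).
 */
static int64_t ex_rsv_update(struct ex_inode_rsv *rsv, int64_t extent_change)
{
	uint64_t needed;
	int64_t delta;

	/* Never drop below zero outstanding extents. */
	assert(extent_change >= 0 ||
	       (uint64_t)(-extent_change) <= rsv->outstanding_extents);

	if (extent_change < 0)
		rsv->outstanding_extents -= (uint64_t)(-extent_change);
	else
		rsv->outstanding_extents += (uint64_t)extent_change;

	needed = rsv->outstanding_extents * EX_BYTES_PER_EXTENT;
	delta = (int64_t)needed - (int64_t)rsv->reserved;
	rsv->reserved = needed;
	return delta;
}
```

Because the target size is always derived from the current counters, an error path only has to undo the counter change and call the same function again, rather than tracking partial adjustments.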
  2. 30 Oct, 2017 13 commits
    • N
      btrfs: Replace opencoded sizes with their symbolic constants · d4417e22
      Nikolay Borisov committed
      Currently btrfs' code uses a mix of opencoded sizes and defines from sizes.h.
      Let's unify the code base to always use the symbolic constants. No functional
      changes.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d4417e22
    • J
      btrfs: remove delayed_ref_node from ref_head · d278850e
      Josef Bacik committed
      This is just excessive information in the ref_head, and makes the code
      complicated.  It is a relic from when we had the heads and the refs in
      the same tree, which is no longer the case.  With this removal I've
      cleaned up a bunch of the cruft around this old assumption as well.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d278850e
    • J
      Btrfs: add a extent ref verify tool · fd708b81
      Josef Bacik committed
      We were having corruption issues that were tied back to problems with
      the extent tree.  In order to track them down I built this tool to try
      and find the culprit, which was pretty successful.  If you compile with
      this tool enabled, it will verify every ref update that the fs makes,
      live, and make sure it is consistent and valid.  I've run this through
      xfstests and haven't gotten any false positives.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update error messages, add fixup from Dan Carpenter to handle errors
        of read_tree_block ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      fd708b81
    • L
      Btrfs: remove nr_async_submits and async_submit_draining · 736cd52e
      Liu Bo committed
      Now that we flush twice, we can be sure that IO has started: the
      second flush waits for the page lock, which is not released until
      page writeback is set and the ordered extents are queued.  So we
      don't need %async_submit_draining, %async_delalloc_pages and
      %nr_async_submits to tell whether the IO has actually started.
      
      Moreover, all the flushers in use are followed by functions that wait
      for ordered extents to complete, so %nr_async_submits, which tracks
      whether bio's async submit has made progress, doesn't really make
      sense.
      
      However, %async_delalloc_pages is still required by shrink_delalloc()
      as that function doesn't flush twice in the normal case (just issues a
      writeback with WB_REASON_FS_FREE_SPACE).
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      736cd52e
    • L
      Btrfs: remove nr_async_bios · f851689b
      Liu Bo committed
      This was intended to congest higher layers so that they would not send
      bios, but:

      1) The congested bit has been taken over by writeback.

         Async bios come from buffered writes and DIO writes.

         For DIO writes, we want to submit them ASAP, while for buffered
         writes, writeback uses balance_dirty_pages() to throttle how many
         dirty pages we can have.

      2) No one is waiting for %nr_async_bios to drop to zero.

      Historically, it was introduced along with changes that let the
      checksumming workload spread across different CPUs.  At that time,
      pdflush was used instead of per-bdi flushing, and perhaps pdflush did
      not have the necessary information for writeback to do throttling.

      We can safely remove them now.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      [ additional explanation from mails, removed unused variable 'limit' ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      f851689b
    • Q
      btrfs: Move leaf and node validation checker to tree-checker.c · 557ea5dd
      Qu Wenruo committed
      There is no doubt that the comprehensive tree block checker will grow
      larger, so moving the checkers into their own file is quite reasonable.
      Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
      [ wording adjustments ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      557ea5dd
    • Q
      btrfs: Add checker for EXTENT_CSUM · 4b865cab
      Qu Wenruo committed
      EXTENT_CSUM checker is a relatively easy one, only needs to check:
      
      1) Objectid
         Fixed to BTRFS_EXTENT_CSUM_OBJECTID
      
      2) Key offset alignment
         Must be aligned to sectorsize
      
      3) Item size alignment
         Must be aligned to csum size
      Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4b865cab
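The three EXTENT_CSUM checks listed above can be sketched as a single predicate. This is a simplified illustration under assumptions: `struct ex_key`, `ex_csum_item_ok()` and the `EX_EXTENT_CSUM_OBJECTID` value are stand-ins for this example, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed stand-in for BTRFS_EXTENT_CSUM_OBJECTID (example value). */
#define EX_EXTENT_CSUM_OBJECTID ((uint64_t)-10)

/* Hypothetical, reduced key: only the fields the checks need. */
struct ex_key {
	uint64_t objectid;
	uint64_t offset;
};

/* Return true if a checksum item passes all three checks. */
static bool ex_csum_item_ok(const struct ex_key *key, uint32_t item_size,
			    uint32_t sectorsize, uint32_t csum_size)
{
	assert(sectorsize != 0 && csum_size != 0);

	/* 1) Objectid is fixed. */
	if (key->objectid != EX_EXTENT_CSUM_OBJECTID)
		return false;
	/* 2) Key offset must be aligned to the sector size. */
	if (key->offset % sectorsize != 0)
		return false;
	/* 3) Item size must be a whole number of checksums. */
	if (item_size % csum_size != 0)
		return false;
	return true;
}
```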
    • Q
      btrfs: Add sanity check for EXTENT_DATA when reading out leaf · 40c3c409
      Qu Wenruo committed
      Add extra checks for items with EXTENT_DATA type.  This checks the
      following:
      
      0) Key offset
         All key offsets must be aligned to sectorsize.
         Inline extent must have 0 for key offset.
      
      1) Item size
         Uncompressed inline file extent size must match item size.
         (Compressed inline file extent has no information about its on-disk size.)
         Regular/preallocated file extent size must be a fixed value.
      
      2) Every member of regular file extent item
         Including alignment for bytenr and offset, possible value for
         compression/encryption/type.
      
      3) Type/compression/encoding must be one of the valid values.
      
      This should be the most comprehensive and strict check in the context
      of btrfs_item for EXTENT_DATA.
      Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ switch to BTRFS_FILE_EXTENT_TYPES, similar to what
        BTRFS_COMPRESS_TYPES does ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      40c3c409
    • Q
      btrfs: Check if item pointer overlaps with the item itself · 7f43d4af
      Qu Wenruo committed
      Function check_leaf() checks if any item pointer points outside of the
      leaf, but it doesn't check if the pointer overlaps with the item itself.

      Normally only the last item may be the victim, but adding such a check
      is never a bad idea anyway.
      Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7f43d4af
    • Q
      btrfs: Refactor check_leaf function for later expansion · c3267bba
      Qu Wenruo committed
      The current check_leaf() function does a good job checking key order
      and item offset/size.

      However, it only checks from slot 0 to the second-to-last slot.  This
      works, but makes later expansion hard.

      So this refactoring iterates from slot 0 to the last slot.
      For key comparison, it uses a key with all 0 as the initial key, so all
      valid keys should be larger than that.

      And for item size/offset checks, it compares the current item end with
      the previous item offset.
      For slot 0, use the leaf end as a special case.

      This makes later item/key offset checks and item size checks easier to
      implement.

      Also, make check_leaf() return -EUCLEAN rather than -EIO to indicate
      an error.
      Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c3267bba
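The iteration scheme described in the commit above might look roughly like this. It is a sketch under assumptions, not the kernel's check_leaf(): the structures, `ex_key_cmp()` and `ex_check_leaf()` are hypothetical, and item offsets are taken relative to the start of the leaf data area.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#ifndef EUCLEAN
#define EUCLEAN 117 /* fallback for platforms whose errno.h lacks it */
#endif

struct ex_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

struct ex_item {
	struct ex_key key;
	uint32_t offset; /* start of item data */
	uint32_t size;   /* length of item data */
};

/* Three-way comparison in (objectid, type, offset) order. */
static int ex_key_cmp(const struct ex_key *a, const struct ex_key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/* Walk every slot, including the last one; return 0 or -EUCLEAN. */
static int ex_check_leaf(const struct ex_item *items, size_t nritems,
			 uint32_t leaf_data_size)
{
	struct ex_key prev_key = {0, 0, 0}; /* all-zero initial key */
	size_t slot;

	assert(items != NULL || nritems == 0);
	for (slot = 0; slot < nritems; slot++) {
		const struct ex_item *it = &items[slot];
		/* Slot 0 must end at the leaf end, others at the previous offset. */
		uint64_t expected_end = slot == 0 ? leaf_data_size
						  : items[slot - 1].offset;

		/* Keys must be strictly increasing. */
		if (ex_key_cmp(&prev_key, &it->key) >= 0)
			return -EUCLEAN;
		if ((uint64_t)it->offset + it->size != expected_end)
			return -EUCLEAN;
		prev_key = it->key;
	}
	return 0;
}
```

Starting from an all-zero key means slot 0 needs no special case for the ordering check; only the item-end check treats it specially.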
    • L
      Btrfs: remove bio_flags which indicates a meta block of log-tree · 18fdc679
      Liu Bo committed
      Since both committing transaction and writing log-tree are doing
      plugging on metadata IO, we can unify to use %sync_writers to benefit
      both cases, instead of checking bio_flags while writing meta blocks of
      log-tree.
      
      We can remove this bio_flags because, in order to write dirty blocks,
      the log tree also uses btrfs_write_marked_extents(), inside which we
      have enabled %sync_writers.  Therefore every write is done
      synchronously, and so is checksumming.

      Please also note that bio_flags is applied per-context while
      %sync_writers is applied per-inode, so this might incur some overhead, i.e.
      
      1) while log tree is flushing its dirty blocks via
         btrfs_write_marked_extents(), in which %sync_writers is increased
         by one.
      
      2) in the meantime, some writeback operations may happen upon btrfs's
         metadata inode, so these writes go synchronously, too.
      
      However, AFAICS, the overhead is not a big one, while the win is that
      we unify the two places that need synchronous writes and remove a
      special hack/flag.
      
      This removes the bio_flags related stuff for writing log-tree.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      18fdc679
    • L
      Btrfs: make plug in writing meta blocks really work · 6300463b
      Liu Bo committed
      We start a plug in btrfs_write_and_wait_marked_extents(), but the
      generated IOs actually go to the device's scheduled IO list, where the
      work is done in another task, so the plug has no effect.

      And since we wait for the IOs immediately after writing meta blocks,
      this is the same case as writing the log tree: doing a sync submit can
      merge more IOs.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6300463b
    • A
      btrfs: copy fsid to super_block s_uuid · ee87cf5e
      Anand Jain committed
      We didn't copy fsid to struct super_block.s_uuid so Overlay disables
      index feature with btrfs as the lower FS.
      
      kernel: overlayfs: fs on '/lower' does not support file handles, falling back to index=off.
      
      Fix this by publishing the fsid through struct super_block.s_uuid.
      
      [ dsterba: I think that setting s_uuid is the last missing bit. Overlay
        needs the file handle encoding support from the lower filesystem, which
        is supported. Filling the whole filesystem id is correct, the subvolume
        id is encoded in the file handle buffer from inside btrfs_encode_fh. ]
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ee87cf5e
  3. 26 Sep, 2017 1 commit
  4. 24 Aug, 2017 2 commits
    • O
      Btrfs: fix blk_status_t/errno confusion · 58efbc9f
      Omar Sandoval committed
      This fixes several instances of blk_status_t and bare errno ints being
      mixed up, some of which are real bugs.
      
      In the normal case, 0 matches BLK_STS_OK, so we don't observe any
      effects of the missing conversion, but in case of errors or passes
      through the repair/retry paths, the errors get mixed up.
      
      The changes were identified using 'sparse'; we don't have reports of the
      buggy behaviour.
      
      Fixes: 4e4cbee9 ("block: switch bios to blk_status_t")
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      58efbc9f
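To illustrate why mixing the two error domains only goes unnoticed on success, here is a toy model of a block status type and explicit conversions. The `ex_` names and the numeric status values are assumptions for this sketch; they are not the kernel's blk_status_t definitions or helpers.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Toy status type: small non-negative codes, unlike negative errnos. */
typedef uint8_t ex_blk_status_t;

#define EX_BLK_STS_OK    ((ex_blk_status_t)0)
#define EX_BLK_STS_NOSPC ((ex_blk_status_t)3)  /* example value */
#define EX_BLK_STS_IOERR ((ex_blk_status_t)10) /* example value */

/* Convert a negative errno (or 0) into the toy block status domain. */
static ex_blk_status_t ex_errno_to_blk_status(int err)
{
	assert(err <= 0);
	switch (err) {
	case 0:
		return EX_BLK_STS_OK;
	case -ENOSPC:
		return EX_BLK_STS_NOSPC;
	default:
		return EX_BLK_STS_IOERR;
	}
}

/* Convert a toy block status back into a negative errno (or 0). */
static int ex_blk_status_to_errno(ex_blk_status_t status)
{
	switch (status) {
	case EX_BLK_STS_OK:
		return 0;
	case EX_BLK_STS_NOSPC:
		return -ENOSPC;
	default:
		return -EIO;
	}
}
```

Only the value 0 means the same thing in both domains. Any non-zero value stored into the wrong type without conversion would be read back as a different error, which is the class of mistake a type checker such as sparse can flag.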
    • C
      block: replace bi_bdev with a gendisk pointer and partitions index · 74d46992
      Christoph Hellwig committed
      This way we don't need a block_device structure to submit I/O.  The
      block_device has different lifetime rules from the gendisk and
      request_queue and is usually only available when the block device node
      is open.  Other callers need to explicitly create one (e.g. the lightnvm
      passthrough code, or the new nvme multipathing code).
      
      For the actual I/O path all that we need is the gendisk, which exists
      once per block device.  But given that the block layer also does
      partition remapping we additionally need a partition index, which is
      used for said remapping in generic_make_request.
      
      Note that all the block drivers generally want request_queue or
      sometimes the gendisk, so this removes a layer of indirection all
      over the stack.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      74d46992
  5. 22 Aug, 2017 1 commit
  6. 21 Aug, 2017 2 commits
    • H
      btrfs: Do not use data_alloc_cluster in ssd mode · 583b7231
      Hans van Kranenburg committed
          This patch provides a band aid to improve the 'out of the box'
      behaviour of btrfs for disks that are detected as being an ssd.  In a
      general purpose mixed workload scenario, the current ssd mode causes
      overallocation of available raw disk space for data, while leaving
      behind increasing amounts of unused fragmented free space. This
      situation leads to early ENOSPC problems which are harming user
      experience and adoption of btrfs as a general purpose filesystem.
      
      This patch modifies the data extent allocation behaviour of the ssd mode
      to make it behave identical to nossd mode.  The metadata behaviour and
      additional ssd_spread option stay untouched so far.
      
      Recommendations for future development are to reconsider the current
      oversimplified nossd / ssd distinction and the broken detection
      mechanism based on the rotational attribute in sysfs and provide
      experienced users with a more flexible way to choose allocator behaviour
      for data and metadata, optimized for certain use cases, while keeping
      sane 'out of the box' default settings.  The internals of the current
      btrfs code have more potential than what currently gets exposed to the
      user to choose from.
      
          The SSD story...
      
          In the first year of btrfs development, around early 2008, btrfs
      gained a mount option which enables specific functionality for
      filesystems on solid state devices. The first occurrence of this
      functionality is in commit e18e4809, labeled "Add mount -o ssd, which
      includes optimizations for seek free storage".
      
      The effect on allocating free space for doing (data) writes is to
      'cluster' writes together, writing them out in contiguous space, as
      opposed to a 'tetris' way of putting all separate writes into any free
      space fragment that fits (which is what the -o nossd behaviour does).
      
      A somewhat simplified explanation of what happens is this: when, for
      example, the 'cluster' size is set to 2MiB and we do some writes, the
      data allocator will search for a free space block that is 2MiB big, and
      put the writes in there. The ssd mode itself might allow a 2MiB cluster
      to be composed of multiple free space extents with some existing data in
      between, while the additional ssd_spread mount option kills off this
      option and requires fully free space.
      
      The idea behind this is (commit 536ac8ae): "The [...] clusters make it
      more likely a given IO will completely overwrite the ssd block, so it
      doesn't have to do an internal rwm cycle."; ssd block meaning nand erase
      block. So, effectively this means applying a "locality based algorithm"
      and trying to outsmart the actual ssd.
      
      Since then, various changes have been made to the involved code, but the
      basic idea is still present, and gets activated whenever the ssd mount
      option is active. This also happens by default, when the rotational flag
      as seen at /sys/block/<device>/queue/rotational is set to 0.
      
          However, there are a number of problems with this approach.
      
          First, what the optimization is trying to do is outsmart the ssd by
      assuming there is a relation between the physical address space of the
      block device as seen by btrfs and the actual physical storage of the
      ssd, and then adjusting data placement. However, since the introduction
      of the Flash Translation Layer (FTL) which is a part of the internal
      controller of an ssd, these attempts are futile. The use of good quality
      FTL in consumer ssd products might have been limited in 2008, but this
      situation has changed drastically soon after that time. Today, even the
      flash memory in your automatic cat feeding machine or your grandma's
      wheelchair has a full featured one.
      
      Second, the behaviour as described above results in the filesystem being
      filled up with badly fragmented free space extents because of relatively
      small pieces of space that are freed up by deletes, but not selected
      again as part of a 'cluster'. Since the algorithm prefers allocating a
      new chunk over going back to tetris mode, the end result is a filesystem
      in which all raw space is allocated, but which is composed of
      underutilized chunks with a 'shotgun blast' pattern of fragmented free
      space. Usually, the next problematic thing that happens is the
      filesystem wanting to allocate new space for metadata, which causes the
      filesystem to fail in spectacular ways.
      
      Third, the default mount options you get for an ssd ('ssd' mode enabled,
      'discard' not enabled), in combination with spreading out writes over
      the full address space and ignoring freed up space leads to worst case
      behaviour in providing information to the ssd itself, since it will
      never learn that all the free space left behind is actually free.  There
      are two ways to let an ssd know previously written data does not have to
      be preserved, which are sending explicit signals using discard or
      fstrim, or by simply overwriting the space with new data.  The worst
      case behaviour is the btrfs ssd_spread mount option in combination with
      not having discard enabled. It has a side effect of minimizing the reuse
      of free space previously written in.
      
      Fourth, the rotational flag in /sys/ does not reliably indicate if the
      device is a locally attached ssd. For example, iSCSI or NBD displays as
      non-rotational, while a loop device on an ssd shows up as rotational.
      
      The combination of the second and third problem effectively means that
      despite all the good intentions, the btrfs ssd mode reliably causes the
      ssd hardware and the filesystem structures and performance to be choked
      to death. The clickbait version of the title of this story would have
      been "Btrfs ssd optimizations considered harmful for ssds".
      
      The current nossd 'tetris' mode (even still without discard) allows a
      pattern of overwriting much more previously used space, causing many
      more implicit discards to happen because of the overwrite information
      the ssd gets. The actual location in the physical address space, as seen
      from the point of view of btrfs is irrelevant, because the actual writes
      to the low level flash are reordered anyway thanks to the FTL.
      
          Changes made in the code
      
      1. Make ssd mode data allocation identical to tetris mode, like nossd.
      2. Adjust and clean up filesystem mount messages so that we can easily
      identify if a kernel has this patch applied or not, when providing
      support to end users. Also, make better use of the *_and_info helpers to
      only trigger messages on actual state changes.
      
          Backporting notes
      
      Notes for whoever wants to backport this patch to their 4.9 LTS kernel:
      * First apply commit 951e7966 "btrfs: drop the nossd flag when
        remounting with -o ssd", or fixup the differences manually.
      * The rest of the conflicts are because of the fs_info refactoring. So,
        for example, instead of using fs_info, it's root->fs_info in
        extent-tree.c
      Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      583b7231
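The difference between the 'tetris' and 'cluster' placement styles described in the commit above can be shown with a deliberately simplified model. Everything here is hypothetical: the real allocator is far more involved, and `ex_alloc_tetris()`, `ex_alloc_cluster()` and the flat array of free fragment sizes exist only for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the index of the first free fragment of at least @min_size, or -1. */
static int ex_first_fit(const uint64_t *free_sizes, size_t n, uint64_t min_size)
{
	size_t i;

	assert(free_sizes != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (free_sizes[i] >= min_size)
			return (int)i;
	}
	return -1;
}

/* 'Tetris' style: any fragment that fits the write itself is acceptable. */
static int ex_alloc_tetris(const uint64_t *free_sizes, size_t n,
			   uint64_t write_size)
{
	return ex_first_fit(free_sizes, n, write_size);
}

/*
 * 'Cluster' style: only a fragment that can hold a whole cluster is
 * acceptable.  A result of -1 stands for "no fit, a new chunk would be
 * allocated", which leaves the smaller fragments unused.
 */
static int ex_alloc_cluster(const uint64_t *free_sizes, size_t n,
			    uint64_t write_size, uint64_t cluster_size)
{
	uint64_t need = write_size > cluster_size ? write_size : cluster_size;

	return ex_first_fit(free_sizes, n, need);
}
```

With free fragments of 64KiB, 512KiB and 128KiB, a 16KiB write fits the first fragment in tetris mode, while a 2MiB cluster requirement finds no fit at all, so the small fragments stay behind as fragmented free space.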
    • L
      btrfs: use btrfsic_submit_bio instead of submit_bio in write_dev_flush · 43a01111
      Lu Fengqi committed
      Although this bio has no data attached, it will reach this condition
      (bio->bi_opf & REQ_PREFLUSH) and then update the flush_gen of dev_state
      in __btrfsic_submit_bio. So we should still submit it through integrity
      checker. Otherwise, the integrity checker will throw the following warning
      when I mount a newly created btrfs filesystem.
      
      [10264.755497] btrfs: attempt to write superblock which references block M @29523968 (sdb1/1111654400/0) which is not flushed out of disk's write cache (block flush_gen=1, dev->flush_gen=0)!
      [10264.755498] btrfs: attempt to write superblock which references block M @29523968 (sdb1/37912576/0) which is not flushed out of disk's write cache (block flush_gen=1, dev->flush_gen=0)!
      Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      43a01111
  7. 18 Aug, 2017 1 commit
  8. 16 Aug, 2017 9 commits
  9. 17 Jul, 2017 1 commit
    • D
      VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb) · bc98a42c
      David Howells committed
      Firstly by applying the following with coccinelle's spatch:
      
      	@@ expression SB; @@
      	-SB->s_flags & MS_RDONLY
      	+sb_rdonly(SB)
      
      to effect the conversion to sb_rdonly(sb), then by applying:
      
      	@@ expression A, SB; @@
      	(
      	-(!sb_rdonly(SB)) && A
      	+!sb_rdonly(SB) && A
      	|
      	-A != (sb_rdonly(SB))
      	+A != sb_rdonly(SB)
      	|
      	-A == (sb_rdonly(SB))
      	+A == sb_rdonly(SB)
      	|
      	-!(sb_rdonly(SB))
      	+!sb_rdonly(SB)
      	|
      	-A && (sb_rdonly(SB))
      	+A && sb_rdonly(SB)
      	|
      	-A || (sb_rdonly(SB))
      	+A || sb_rdonly(SB)
      	|
      	-(sb_rdonly(SB)) != A
      	+sb_rdonly(SB) != A
      	|
      	-(sb_rdonly(SB)) == A
      	+sb_rdonly(SB) == A
      	|
      	-(sb_rdonly(SB)) && A
      	+sb_rdonly(SB) && A
      	|
      	-(sb_rdonly(SB)) || A
      	+sb_rdonly(SB) || A
      	)
      
      	@@ expression A, B, SB; @@
      	(
      	-(sb_rdonly(SB)) ? 1 : 0
      	+sb_rdonly(SB)
      	|
      	-(sb_rdonly(SB)) ? A : B
      	+sb_rdonly(SB) ? A : B
      	)
      
      to remove left over excess bracketage and finally by applying:
      
      	@@ expression A, SB; @@
      	(
      	-(A & MS_RDONLY) != sb_rdonly(SB)
      	+(bool)(A & MS_RDONLY) != sb_rdonly(SB)
      	|
      	-(A & MS_RDONLY) == sb_rdonly(SB)
      	+(bool)(A & MS_RDONLY) == sb_rdonly(SB)
      	)
      
      to make comparisons against the result of sb_rdonly() (which is a bool)
      work correctly.
      Signed-off-by: David Howells <dhowells@redhat.com>
      bc98a42c
  10. 15 Jul, 2017 1 commit
  11. 22 Jun, 2017 3 commits
  12. 21 Jun, 2017 1 commit
    • N
      percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batch · 104b4e51
      Nikolay Borisov committed
      Currently, percpu_counter_add is a wrapper around __percpu_counter_add
      which is preempt safe due to explicit calls to preempt_disable.  Given
      how __ prefix is used in percpu related interfaces, the naming
      unfortunately creates the false sense that __percpu_counter_add is
      less safe than percpu_counter_add.  In terms of context-safety,
      they're equivalent.  The only difference is that the __ version takes
      a batch parameter.
      
      Make this a bit more explicit by just renaming __percpu_counter_add to
      percpu_counter_add_batch.
      
      This patch doesn't cause any functional changes.
      
      tj: Minor updates to patch description for clarity.  Cosmetic
          indentation updates.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: David Sterba <dsterba@suse.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: linux-mm@kvack.org
      Cc: "David S. Miller" <davem@davemloft.net>
      104b4e51
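The relationship between the two functions named in the commit above, where the plain add is a wrapper that supplies a default batch, can be modelled with a single-threaded toy. This sketch assumes one local slot instead of real per-CPU storage and omits preemption handling entirely; the `ex_` names and `EX_DEFAULT_BATCH` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed default batch size for the wrapper (example value). */
#define EX_DEFAULT_BATCH 32

/* Toy counter: one shared total plus a single local slot. */
struct ex_counter {
	int64_t global; /* shared total, expensive to touch in the real thing */
	int32_t local;  /* stand-in for one CPU's private delta */
};

/*
 * Accumulate @amount locally and fold the local delta into the shared
 * total only once its magnitude reaches @batch.
 */
static void ex_counter_add_batch(struct ex_counter *c, int64_t amount,
				 int32_t batch)
{
	int64_t count = c->local + amount;

	assert(batch > 0);
	if (count >= batch || count <= -batch) {
		c->global += count;
		c->local = 0;
	} else {
		c->local = (int32_t)count;
	}
}

/* The plain add is the same operation with the default batch. */
static void ex_counter_add(struct ex_counter *c, int64_t amount)
{
	ex_counter_add_batch(c, amount, EX_DEFAULT_BATCH);
}
```

In this model the two entry points differ only in who chooses the batch, which is the point the rename makes explicit.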
  13. 20 Jun, 2017 4 commits