1. 16 Nov, 2017 1 commit
    •
      Btrfs: fix reported number of inode blocks after buffered append writes · e3b8a485
      Committed by Filipe Manana
      The patch from commit a7e3b975 ("Btrfs: fix reported number of inode
      blocks") introduced a regression where if we do a buffered write starting
      at position equal to or greater than the file's size and then stat(2) the
      file before writeback is triggered, the number of used blocks does not
      change (unless there's a prealloc/unwritten extent). Example:
      
        $ xfs_io -f -c "pwrite -S 0xab 0 64K" foobar
        $ du -h foobar
        0	foobar
        $ sync
        $ du -h foobar
        64K	foobar
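
The same before/after comparison can be scripted. The following is a minimal sketch, not part of the patch: the helper name `blocks_before_and_after_sync` is hypothetical, and the values it returns depend on the filesystem and kernel it runs on.

```python
import os
import tempfile


def blocks_before_and_after_sync(directory, size=64 * 1024):
    """Do a buffered write of `size` bytes and sample st_blocks twice.

    Returns (blocks_before, blocks_after) in 512-byte units, which is the
    unit du(1) starts from. Hypothetical helper for illustration only.
    """
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.write(fd, b"\xab" * size)      # buffered write, no writeback forced yet
        before = os.fstat(fd).st_blocks   # what stat(2) reports right now
        os.fsync(fd)                      # force writeback of this file
        after = os.fstat(fd).st_blocks    # what stat(2) reports after writeback
        return before, after
    finally:
        os.close(fd)
        os.unlink(path)
```

On a kernel with the regression, running this against a btrfs mount would be expected to report 0 blocks before the fsync, matching the first `du -h` above.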
      
      The first version of that patch didn't have this regression and the second
      version, which was the one committed, was made only to address some
      performance regression detected by the Intel test robots using fs_mark.
      
      This fixes the regression by setting the new delalloc bit in the range, and
      doing it at btrfs_dirty_pages() while setting the regular delalloc bit as
      well, so that we set both bits at once and avoid navigating the
      inode's io tree twice. Doing it at btrfs_dirty_pages() is also the most
      meaningful place, as we should set the new delalloc bit whenever we set the
      delalloc bit, which happens only if we copied bytes into the pages at
      __btrfs_buffered_write().
      
      This was making some of LTP's du tests fail, which can be quickly run
      using a command line like the following:
      
        $ ./runltp -q -p -l /ltp.log -f commands -s du -d /mnt
      
      Fixes: a7e3b975 ("Btrfs: fix reported number of inode blocks")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 02 Nov, 2017 3 commits
    •
      btrfs: make the delalloc block rsv per inode · 69fe2d75
      Committed by Josef Bacik
      The way we handle delalloc metadata reservations has gotten
      progressively more complicated over the years.  There is so much cruft
      and weirdness around keeping the reserved count and outstanding counters
      consistent and handling the error cases that it's impossible to
      understand.
      
      Fix this by making the delalloc block rsv per-inode.  This way we can
      calculate the actual size of the outstanding metadata reservations every
      time we make a change, and then reserve the delta based on that amount.
      This greatly simplifies the code everywhere, and makes the error
      handling in btrfs_delalloc_reserve_metadata far less terrifying.
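
The "recompute the size, then reserve the delta" idea can be sketched as a toy model. This is not the kernel implementation: the class `InodeBlockRsv`, the flat `bytes_per_extent` cost and the `space_available` argument are all assumptions made for illustration.

```python
import errno


class InodeBlockRsv:
    """Toy per-inode reservation: recompute the target size, reserve the delta."""

    def __init__(self, bytes_per_extent):
        self.bytes_per_extent = bytes_per_extent  # hypothetical flat metadata cost
        self.outstanding_extents = 0
        self.reserved = 0

    def update(self, extent_delta, space_available):
        """Change the extent count and adjust the reservation to match.

        Returns the number of bytes reserved (positive) or released
        (negative). On failure nothing is modified, so there is no
        partial state to unwind.
        """
        new_extents = self.outstanding_extents + extent_delta
        if new_extents < 0:
            raise ValueError("outstanding_extents would go negative")
        # The target size is derived from current state, not tracked separately.
        delta = new_extents * self.bytes_per_extent - self.reserved
        if delta > space_available:
            raise OSError(errno.ENOSPC, "not enough space for metadata reservation")
        self.outstanding_extents = new_extents
        self.reserved += delta
        return delta
```

Because the target is always recomputed from the extent count, the error path only has to avoid committing the new count, which is the simplification the message describes.
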
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      Btrfs: rework outstanding_extents · 8b62f87b
      Committed by Josef Bacik
      Right now we jump through a lot of weird hoops around outstanding_extents in order
      to keep the extent count consistent.  This is because we logically
      transfer the outstanding_extent count from the initial reservation
      through the set_delalloc_bits.  This makes it pretty difficult to get a
      handle on how and when we need to mess with outstanding_extents.
      
      Fix this by revamping the rules of how we deal with outstanding_extents.
      Now instead everybody that is holding on to a delalloc extent is
      required to increase the outstanding extents count for itself.  This
      means we'll have something like this
      
      btrfs_delalloc_reserve_metadata	- outstanding_extents = 1
        btrfs_set_extent_delalloc	- outstanding_extents = 2
      btrfs_delalloc_release_extents	- outstanding_extents = 1
      
      for an initial file write.  Now take the append write where we extend an
      existing delalloc range but still under the maximum extent size
      
      btrfs_delalloc_reserve_metadata - outstanding_extents = 2
        btrfs_set_extent_delalloc
          btrfs_set_bit_hook		- outstanding_extents = 3
          btrfs_merge_extent_hook	- outstanding_extents = 2
      btrfs_delalloc_release_extents	- outstanding_extents = 1
      
      In order to make the ordered extent transition we of course must now
      make ordered extents carry their own outstanding_extent reservation, so
      for cow_file_range we end up with
      
      btrfs_add_ordered_extent	- outstanding_extents = 2
      clear_extent_bit		- outstanding_extents = 1
      btrfs_remove_ordered_extent	- outstanding_extents = 0
      
      This makes all manipulations of outstanding_extents much more explicit.
      Every successful call to btrfs_delalloc_reserve_metadata _must_ now be
      combined with btrfs_delalloc_release_extents, even in the error case, as
      that is the only function that actually modifies the
      outstanding_extents counter.
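
The three traces above can be replayed with a small counter model. This is only a sketch of the accounting rule (each holder adds one, each release drops one); the `replay` helper and the step lists are hypothetical and do not mirror the kernel code paths.

```python
def replay(start, steps):
    """Apply (function_name, delta) steps to a counter.

    Returns a list of (function_name, value_after_step) and refuses to let
    the counter go negative, which would indicate an unbalanced release.
    """
    value = start
    trace = []
    for name, delta in steps:
        value += delta
        if value < 0:
            raise ValueError(f"{name} dropped outstanding_extents below zero")
        trace.append((name, value))
    return trace


# Initial file write, starting from 0 outstanding extents.
INITIAL_WRITE = [
    ("btrfs_delalloc_reserve_metadata", +1),
    ("btrfs_set_extent_delalloc", +1),
    ("btrfs_delalloc_release_extents", -1),
]

# Append write that extends an existing delalloc range (counter starts at 1).
APPEND_WRITE = [
    ("btrfs_delalloc_reserve_metadata", +1),
    ("btrfs_set_bit_hook", +1),
    ("btrfs_merge_extent_hook", -1),
    ("btrfs_delalloc_release_extents", -1),
]

# Ordered extent transition in cow_file_range (counter starts at 1).
COW_FILE_RANGE = [
    ("btrfs_add_ordered_extent", +1),
    ("clear_extent_bit", -1),
    ("btrfs_remove_ordered_extent", -1),
]
```

Replaying `INITIAL_WRITE` from 0 and the other two lists from 1 gives the values listed in the traces.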
      
      The drawback to this is now we are much more likely to have transient
      cases where outstanding_extents is much larger than it actually should
      be.  This could happen before as we manipulated the delalloc bits, but
      now it happens basically at every write.  This may put more pressure on
      the ENOSPC flushing code, but I think making this code simpler is worth
      the cost.  I have another change coming to mitigate this side-effect
      somewhat.
      
      I also added trace points for the counter manipulation.  These were used
      by a bpf script I wrote to help track down leak issues.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: allow to set compression level for zlib · f51d2b59
      Committed by David Sterba
      Preliminary support for setting compression level for zlib, the
      following works:
      
      $ mount -o compress=zlib                # default
      $ mount -o compress=zlib0               # same
      $ mount -o compress=zlib9               # level 9, slower sync, less data
      $ mount -o compress=zlib1               # level 1, faster sync, more data
      $ mount -o remount,compress=zlib3       # level set by remount
      
      The compress-force option works the same as compress.  The level is visible in
      the same format in /proc/mounts. Level set via file property does not
      work yet.
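
How such an option value splits into an algorithm and a level can be sketched as follows. This is an illustrative parser, not the kernel's; the function name `parse_compress_option` and the convention that level 0 stands for "use the default" are assumptions based on the examples above.

```python
def parse_compress_option(value):
    """Split a value such as "zlib9" into (algorithm, level).

    Level 0 means "use the default level", so "zlib" and "zlib0" are
    equivalent. Only a single trailing digit is accepted.
    """
    if not value.startswith("zlib"):
        raise ValueError(f"unsupported compression option: {value!r}")
    suffix = value[len("zlib"):]
    if suffix == "":
        return ("zlib", 0)
    if len(suffix) == 1 and suffix in "0123456789":
        return ("zlib", int(suffix))
    raise ValueError(f"invalid zlib level in {value!r}")
```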
      
      Required patch: "btrfs: prepare for extensions in compression options"
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. 30 Oct, 2017 7 commits
  4. 04 Oct, 2017 1 commit
    •
      Btrfs: fix overlap of fs_info::flags values · 69ad5976
      Committed by Tsutomu Itoh
      Because the values of BTRFS_FS_EXCL_OP and BTRFS_FS_QUOTA_OVERRIDE overlap,
      we should change the value.
      
      First, BTRFS_FS_EXCL_OP was set to 14.
      
        commit 171938e5 ("btrfs: track exclusive filesystem operation in flags")
      
      Next, the value of BTRFS_FS_QUOTA_OVERRIDE was set to 14.
      
        commit f29efe29 ("btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE")
      
      As a result, the value 14 overlapped, by accident.
      This problem is solved by defining the value of BTRFS_FS_EXCL_OP as 16,
      the flags are internal.
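
A collision of this kind is easy to detect mechanically. The sketch below uses only the two flag names and values mentioned in this message; the helper `overlapping_bits` is hypothetical.

```python
from collections import defaultdict


def overlapping_bits(flags):
    """Return {bit_number: [flag names]} for every bit used more than once."""
    by_bit = defaultdict(list)
    for name, bit in flags.items():
        by_bit[bit].append(name)
    return {bit: sorted(names) for bit, names in by_bit.items() if len(names) > 1}


# Values as described in the message, before and after the fix.
BEFORE_FIX = {"BTRFS_FS_EXCL_OP": 14, "BTRFS_FS_QUOTA_OVERRIDE": 14}
AFTER_FIX = {"BTRFS_FS_EXCL_OP": 16, "BTRFS_FS_QUOTA_OVERRIDE": 14}
```

With both flags on bit 14, setting or clearing one silently changes the other, which is why the overlap matters even though the flags are internal.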
      
      Fixes: f29efe29 ("btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE")
      CC: stable@vger.kernel.org # 4.13+
      Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ minimize the change, update only BTRFS_FS_EXCL_OP ]
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 26 Sep, 2017 1 commit
    •
      btrfs: remove BTRFS_FS_QUOTA_DISABLING flag · c2faff79
      Committed by Misono, Tomohiro
      Currently, "btrfs quota enable" fails the first time it is run after
      "btrfs quota disable", with syslog output "qgroup_rescan_init failed with
      -22", but it succeeds the second time.
      
      When "quota disable" is called, the BTRFS_FS_QUOTA_DISABLING flag bit is
      set in fs_info->flags in btrfs_quota_disable(), but it is not dropped
      in btrfs_run_qgroups() (which is called in btrfs_commit_transaction())
      because quota_root has already been freed. If "quota enable" is called
      after that, both the BTRFS_FS_QUOTA_DISABLING and BTRFS_FS_QUOTA_ENABLED
      flags are dropped in btrfs_run_qgroups() since quota_root is not NULL.
      This makes the first "quota enable" fail.
      
      BTRFS_FS_QUOTA_DISABLING flag is not used outside of "quota disable"
      context and is equivalent to whether quota_root is NULL or not.
      btrfs_run_qgroups() checks whether quota_root is NULL or not in the first
      place.
      
      So, let's remove BTRFS_FS_QUOTA_DISABLING flag.
      Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 21 Aug, 2017 5 commits
    •
      btrfs: pass fs_info to btrfs_del_root instead of tree_root · 1cd5447e
      Committed by Jeff Mahoney
      btrfs_del_root always uses the tree_root.  Let's pass fs_info instead.
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      Btrfs: remove BUG() in btrfs_extent_inline_ref_size · 4335958d
      Committed by Liu Bo
      Now that btrfs_get_extent_inline_ref_type() can report if type is a
      valid one and all callers can gracefully deal with that, we don't need
      to crash here.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      Btrfs: add a helper to retrieve extent inline ref type · 167ce953
      Committed by Liu Bo
      An invalid value of extent inline ref type may be read from a
      malicious image which may force btrfs to crash.
      
      This adds a helper which sanity-checks the ref type: if the type is
      sane, the helper returns it, otherwise it returns an error.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ minimal tweak const types, causing warnings due to other cleanup patches ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: Remove chunk_objectid argument from btrfs_make_block_group · 0174484d
      Committed by Nikolay Borisov
      btrfs_make_block_group is always called with chunk_objectid set to
      BTRFS_FIRST_CHUNK_TREE_OBJECTID. There's no reason why this behavior will
      change anytime soon, so let's remove the argument and decrease the cognitive
      load when reading the code path. No functional change.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: Do not use data_alloc_cluster in ssd mode · 583b7231
      Committed by Hans van Kranenburg
          This patch provides a band aid to improve the 'out of the box'
      behaviour of btrfs for disks that are detected as being an ssd.  In a
      general purpose mixed workload scenario, the current ssd mode causes
      overallocation of available raw disk space for data, while leaving
      behind increasing amounts of unused fragmented free space. This
      situation leads to early ENOSPC problems which are harming user
      experience and adoption of btrfs as a general purpose filesystem.
      
      This patch modifies the data extent allocation behaviour of the ssd mode
      to make it behave identically to nossd mode.  The metadata behaviour and
      additional ssd_spread option stay untouched so far.
      
      Recommendations for future development are to reconsider the current
      oversimplified nossd / ssd distinction and the broken detection
      mechanism based on the rotational attribute in sysfs and provide
      experienced users with a more flexible way to choose allocator behaviour
      for data and metadata, optimized for certain use cases, while keeping
      sane 'out of the box' default settings.  The internals of the current
      btrfs code have more potential than what currently gets exposed to the
      user to choose from.
      
          The SSD story...
      
          In the first year of btrfs development, around early 2008, btrfs
      gained a mount option which enables specific functionality for
      filesystems on solid state devices. The first occurrence of this
      functionality is in commit e18e4809, labeled "Add mount -o ssd, which
      includes optimizations for seek free storage".
      
      The effect on allocating free space for doing (data) writes is to
      'cluster' writes together, writing them out in contiguous space, as
      opposed to a 'tetris' way of putting all separate writes into any free
      space fragment that fits (which is what the -o nossd behaviour does).
      
      A somewhat simplified explanation of what happens is that when, for
      example, the 'cluster' size is set to 2MiB and we do some writes, the
      data allocator will search for a free space block that is 2MiB big and
      put the writes in there. The ssd mode itself might allow a 2MiB cluster
      to be composed of multiple free space extents with some existing data in
      between, while the additional ssd_spread mount option kills off this
      option and requires fully free space.
      
      The idea behind this is (commit 536ac8ae): "The [...] clusters make it
      more likely a given IO will completely overwrite the ssd block, so it
      doesn't have to do an internal rwm cycle."; ssd block meaning nand erase
      block. So, effectively this means applying a "locality based algorithm"
      and trying to outsmart the actual ssd.
      
      Since then, various changes have been made to the involved code, but the
      basic idea is still present, and gets activated whenever the ssd mount
      option is active. This also happens by default, when the rotational flag
      as seen at /sys/block/<device>/queue/rotational is set to 0.
      
          However, there are a number of problems with this approach.
      
          First, what the optimization is trying to do is outsmart the ssd by
      assuming there is a relation between the physical address space of the
      block device as seen by btrfs and the actual physical storage of the
      ssd, and then adjusting data placement. However, since the introduction
      of the Flash Translation Layer (FTL) which is a part of the internal
      controller of an ssd, these attempts are futile. The use of good quality
      FTL in consumer ssd products might have been limited in 2008, but this
      situation changed drastically soon after that time. Today, even the
      flash memory in your automatic cat feeding machine or your grandma's
      wheelchair has a full featured one.
      
      Second, the behaviour as described above results in the filesystem being
      filled up with badly fragmented free space extents because of relatively
      small pieces of space that are freed up by deletes, but not selected
      again as part of a 'cluster'. Since the algorithm prefers allocating a
      new chunk over going back to tetris mode, the end result is a filesystem
      in which all raw space is allocated, but which is composed of
      underutilized chunks with a 'shotgun blast' pattern of fragmented free
      space. Usually, the next problematic thing that happens is the
      filesystem wanting to allocate new space for metadata, which causes the
      filesystem to fail in spectacular ways.
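
The difference between the two placement strategies can be illustrated with a toy model. This is a deliberately simplified sketch and not the btrfs allocator: the function names, the list-of-tuples free space representation and the rule that a cluster must be one fully free extent (closer to what is described for ssd_spread) are all assumptions.

```python
MIB = 1024 * 1024


def tetris_alloc(free_extents, size):
    """First-fit: carve `size` bytes out of the first free extent that fits.

    `free_extents` is a list of (start, length) tuples and is updated in
    place. Returns the start offset, or None if nothing fits.
    """
    for i, (start, length) in enumerate(free_extents):
        if length >= size:
            free_extents[i] = (start + size, length - size)
            return start
    return None


def cluster_alloc(free_extents, size, cluster_size=2 * MIB):
    """Only accept free extents that can hold a whole cluster.

    Returns the start offset, or None, in which case a real allocator
    would go on to allocate a new chunk instead of reusing fragments.
    """
    for i, (start, length) in enumerate(free_extents):
        if length >= cluster_size and length >= size:
            free_extents[i] = (start + size, length - size)
            return start
    return None
```

With free space consisting only of fragments smaller than the cluster size, `cluster_alloc` finds nothing and would fall through to a new chunk, while `tetris_alloc` reuses the first fragment that fits. This is the fragmentation pattern described above.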
      
      Third, the default mount options you get for an ssd ('ssd' mode enabled,
      'discard' not enabled), in combination with spreading out writes over
      the full address space and ignoring freed up space leads to worst case
      behaviour in providing information to the ssd itself, since it will
      never learn that all the free space left behind is actually free.  There
      are two ways to let an ssd know previously written data does not have to
      be preserved, which are sending explicit signals using discard or
      fstrim, or by simply overwriting the space with new data.  The worst
      case behaviour is the btrfs ssd_spread mount option in combination with
      not having discard enabled. It has a side effect of minimizing the reuse
      of free space previously written in.
      
      Fourth, the rotational flag in /sys/ does not reliably indicate if the
      device is a locally attached ssd. For example, iSCSI or NBD displays as
      non-rotational, while a loop device on an ssd shows up as rotational.
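
What the kernel reports for a given device can be checked by reading the sysfs attribute mentioned earlier. A minimal sketch follows; the helper name `is_reported_rotational` and the `sysfs_root` parameter (useful for testing against a fake tree) are assumptions.

```python
import os


def is_reported_rotational(device, sysfs_root="/sys"):
    """Return True if <sysfs_root>/block/<device>/queue/rotational reads 1.

    This only reflects what the block layer reports. As noted above, the
    value is not a reliable indicator of a locally attached ssd.
    """
    path = os.path.join(sysfs_root, "block", device, "queue", "rotational")
    with open(path) as f:
        return f.read().strip() == "1"
```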
      
      The combination of the second and third problem effectively means that
      despite all the good intentions, the btrfs ssd mode reliably causes the
      ssd hardware and the filesystem structures and performance to be choked
      to death. The clickbait version of the title of this story would have
      been "Btrfs ssd optimizations considered harmful for ssds".
      
      The current nossd 'tetris' mode (even still without discard) allows a
      pattern of overwriting much more previously used space, causing many
      more implicit discards to happen because of the overwrite information
      the ssd gets. The actual location in the physical address space, as seen
      from the point of view of btrfs is irrelevant, because the actual writes
      to the low level flash are reordered anyway thanks to the FTL.
      
          Changes made in the code
      
      1. Make ssd mode data allocation identical to tetris mode, like nossd.
      2. Adjust and clean up filesystem mount messages so that we can easily
      identify if a kernel has this patch applied or not, when providing
      support to end users. Also, make better use of the *_and_info helpers to
      only trigger messages on actual state changes.
      
          Backporting notes
      
      Notes for whoever wants to backport this patch to their 4.9 LTS kernel:
      * First apply commit 951e7966 "btrfs: drop the nossd flag when
        remounting with -o ssd", or fixup the differences manually.
      * The rest of the conflicts are because of the fs_info refactoring. So,
        for example, instead of using fs_info, it's root->fs_info in
        extent-tree.c
      Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  7. 16 Aug, 2017 9 commits
  8. 30 Jun, 2017 2 commits
    •
      btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges · bc42bda2
      Committed by Qu Wenruo
      [BUG]
      For the following case, btrfs can underflow qgroup reserved space
      at an error path:
      (Page size 4K, function name without "btrfs_" prefix)
      
               Task A                  |             Task B
      ----------------------------------------------------------------------
      Buffered_write [0, 2K)           |
      |- check_data_free_space()       |
      |  |- qgroup_reserve_data()      |
      |     Range aligned to page      |
      |     range [0, 4K)          <<< |
      |     4K bytes reserved      <<< |
      |- copy pages to page cache      |
                                       | Buffered_write [2K, 4K)
                                       | |- check_data_free_space()
                                       | |  |- qgroup_reserve_data()
                                       | |     Range aligned to page
                                       | |     range [0, 4K)
                                       | |     Already reserved by A <<<
                                       | |     0 bytes reserved      <<<
                                       | |- delalloc_reserve_metadata()
                                       | |  And it *FAILED* (Maybe EQUOTA)
                                       | |- free_reserved_data_space()
                                            |- qgroup_free_data()
                                               Range aligned to page range
                                               [0, 4K)
                                               Freeing 4K
      (Special thanks to Chandan for the detailed report and analysis)
      
      [CAUSE]
      Above, Task B is freeing the reserved data range [0, 4K), which was
      actually reserved by Task A.
      
      At writeback time, the page dirtied by Task A goes through the writeback
      routine, which frees 4K of reserved data space at file extent insert
      time, causing the qgroup underflow.
      
      [FIX]
      For btrfs_qgroup_free_data(), add a @reserved parameter to only free
      data ranges reserved by a previous btrfs_qgroup_reserve_data().
      So in the above case, Task B will try to free 0 bytes, so no underflow.
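
The effect of freeing only what a call itself reserved can be shown with a toy model of the Task A / Task B scenario. This is a sketch under stated assumptions: the class `QgroupDataRsv`, its page-granular set and the returned "changeset" are illustrative stand-ins, not the kernel data structures.

```python
PAGE = 4096


class QgroupDataRsv:
    """Toy page-granular data reservation with per-call changesets."""

    def __init__(self):
        self.reserved_pages = set()  # page indexes currently reserved
        self.reserved_bytes = 0

    @staticmethod
    def _pages(start, end):
        # Align [start, end) outwards to page boundaries.
        return range(start // PAGE, (end + PAGE - 1) // PAGE)

    def reserve(self, start, end):
        """Reserve [start, end); return the pages newly reserved by this call."""
        changeset = {p for p in self._pages(start, end) if p not in self.reserved_pages}
        self.reserved_pages |= changeset
        self.reserved_bytes += len(changeset) * PAGE
        return changeset

    def free(self, changeset):
        """Free only the pages recorded in `changeset`; return bytes freed."""
        to_free = changeset & self.reserved_pages
        self.reserved_pages -= to_free
        self.reserved_bytes -= len(to_free) * PAGE
        return len(to_free) * PAGE
```

Task A's `reserve(0, 2048)` records page 0, Task B's `reserve(2048, 4096)` returns an empty changeset, and Task B's error path therefore frees 0 bytes and leaves Task A's 4K in place.
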
      Reported-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
      Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: qgroup: Introduce extent changeset for qgroup reserve functions · 364ecf36
      Committed by Qu Wenruo
      Introduce a new parameter, struct extent_changeset, for
      btrfs_qgroup_reserve_data() and its callers.
      
      The extent_changeset is used in btrfs_qgroup_reserve_data() to record
      which range it reserved in the current reserve, so it can free it in
      error paths.
      
      The reason we need to export it to callers is that, in the buffered write
      error path, without knowing exactly which range we reserved in the current
      allocation, we can free space which was not reserved by us.
      
      This will lead to qgroup reserved space underflow.
      Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  9. 22 Jun, 2017 2 commits
    •
      btrfs: Check name_len with boundary in verify dir_item · e79a3327
      Committed by Su Yue
      Originally, verify_dir_item verified the name_len of a dir_item against
      fixed values but not against the item boundary.
      If a corrupted name_len was not bigger than the fixed value, for example
      255, the function considered the dir_item fine, and the subsequent read
      beyond the item boundary caused a crash.
      
      Example:
        1. Corrupt one dir_item name_len to be 255.
        2. Run 'ls -lar /mnt/test/ > /dev/null'
      dmesg:
      [   48.451449] BTRFS info (device vdb1): disk space caching is enabled
      [   48.451453] BTRFS info (device vdb1): has skinny extents
      [   48.489420] general protection fault: 0000 [#1] SMP
      [   48.489571] Modules linked in: ext4 jbd2 mbcache btrfs xor raid6_pq
      [   48.489716] CPU: 1 PID: 2710 Comm: ls Not tainted 4.10.0-rc1 #5
      [   48.489853] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
      [   48.490008] task: ffff880035df1bc0 task.stack: ffffc90004800000
      [   48.490008] RIP: 0010:read_extent_buffer+0xd2/0x190 [btrfs]
      [   48.490008] RSP: 0018:ffffc90004803d98 EFLAGS: 00010202
      [   48.490008] RAX: 000000000000001b RBX: 000000000000001b RCX: 0000000000000000
      [   48.490008] RDX: ffff880079dbf36c RSI: 0005080000000000 RDI: ffff880079dbf368
      [   48.490008] RBP: ffffc90004803dc8 R08: ffff880078e8cc48 R09: ffff880000000000
      [   48.490008] R10: 0000160000000000 R11: 0000000000001000 R12: ffff880079dbf288
      [   48.490008] R13: ffff880078e8ca88 R14: 0000000000000003 R15: ffffc90004803e20
      [   48.490008] FS:  00007fef50c60800(0000) GS:ffff88007d400000(0000) knlGS:0000000000000000
      [   48.490008] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   48.490008] CR2: 000055f335ac2ff8 CR3: 000000007356d000 CR4: 00000000001406e0
      [   48.490008] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   48.490008] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [   48.490008] Call Trace:
      [   48.490008]  btrfs_real_readdir+0x3b7/0x4a0 [btrfs]
      [   48.490008]  iterate_dir+0x181/0x1b0
      [   48.490008]  SyS_getdents+0xa7/0x150
      [   48.490008]  ? fillonedir+0x150/0x150
      [   48.490008]  entry_SYSCALL_64_fastpath+0x18/0xad
      [   48.490008] RIP: 0033:0x7fef5032546b
      [   48.490008] RSP: 002b:00007ffeafcdb830 EFLAGS: 00000206 ORIG_RAX: 000000000000004e
      [   48.490008] RAX: ffffffffffffffda RBX: 00007fef5061db38 RCX: 00007fef5032546b
      [   48.490008] RDX: 0000000000008000 RSI: 000055f335abaff0 RDI: 0000000000000003
      [   48.490008] RBP: 00007fef5061dae0 R08: 00007fef5061db48 R09: 0000000000000000
      [   48.490008] R10: 000055f335abafc0 R11: 0000000000000206 R12: 00007fef5061db38
      [   48.490008] R13: 0000000000008040 R14: 00007fef5061db38 R15: 000000000000270e
      [   48.490008] RIP: read_extent_buffer+0xd2/0x190 [btrfs] RSP: ffffc90004803d98
      [   48.499455] ---[ end trace 321920d8e8339505 ]---
      
      Fix it by adding a parameter @slot and check name_len with item boundary
      by calling btrfs_is_name_len_valid.
      Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
      rev
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: Introduce btrfs_is_name_len_valid to avoid reading beyond boundary · 19c6dcbf
      Committed by Su Yue
      Introduce function btrfs_is_name_len_valid.
      
      The function compares parameter @name_len with item boundary then
      returns true if name_len is valid.
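
The boundary comparison can be sketched as follows. This is an illustration of the check only: the function `is_name_len_valid` and its offset-based parameters are hypothetical and do not match the signature of the kernel helper.

```python
def is_name_len_valid(item_start, item_size, name_start, name_len):
    """Return True if the name lies entirely inside its item.

    The item covers [item_start, item_start + item_size) and the name
    covers [name_start, name_start + name_len), all as byte offsets
    within the leaf.
    """
    if name_len < 0 or item_size < 0:
        return False
    item_end = item_start + item_size
    return item_start <= name_start and name_start + name_len <= item_end
```

For a 30-byte item whose name starts 10 bytes in, a name_len of 20 passes, while a corrupted value of 255 is rejected even though it would pass a fixed upper limit of 255.
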
      Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ s/btrfs_leaf_data/BTRFS_LEAF_DATA_OFFSET/ ]
      Signed-off-by: David Sterba <dsterba@suse.com>
  10. 20 Jun, 2017 9 commits
    •
      btrfs: Round down values which are written for total_bytes_size · 7dfb8be1
      Committed by Nikolay Borisov
      We got an internal report about a file system not wanting to mount
      following 99e3ecfc ("Btrfs: add more validation checks for
      superblock").
      
      BTRFS error (device sdb1): super_total_bytes 1000203816960 mismatch with
      fs_devices total_rw_bytes 1000203820544
      
      Subtracting the numbers we get a difference of less than 4KiB. Upon
      closer inspection it became apparent that mkfs actually rounds down the
      size of the device to a multiple of sector size. However, the same
      cannot be said for various functions which modify the total size and are
      called from btrfs_balance as well as when adding a new device. So this
      patch ensures that values being saved into on-disk data structures are
      always rounded down to a multiple of sectorsize.
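
The arithmetic behind the reported mismatch can be checked directly. This is a worked sketch that assumes a 4096-byte sector size, which the message does not state; `round_down` is a plain helper written for this example.

```python
def round_down(value, alignment):
    """Round `value` down to a multiple of `alignment` (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return value & ~(alignment - 1)


SUPER_TOTAL_BYTES = 1000203816960   # value from the superblock in the report
TOTAL_RW_BYTES = 1000203820544      # fs_devices total_rw_bytes in the report
SECTORSIZE = 4096                   # assumed sector size

# The two values differ by 3584 bytes, i.e. less than one sector.
DIFFERENCE = TOTAL_RW_BYTES - SUPER_TOTAL_BYTES
```

Rounding the unaligned device total down to the assumed sector size yields exactly the superblock value, which is consistent with the explanation that only one side had been rounded.
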
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: Manually implement device_total_bytes getter/setter · eca152ed
      Committed by Nikolay Borisov
      The device->total_bytes member needs to always be rounded down to sectorsize
      so that it corresponds to the value of super->total_bytes. However, there are
      multiple places where the setter is fed a value which is not rounded which
      can cause a fs to be unmountable due to the check introduced in
      99e3ecfc ("Btrfs: add more validation checks for superblock"). This patch
      implements the getter/setter manually so that in a later patch I can add
      necessary code to catch offenders.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: obsolete and remove mount option alloc_start · 0d0c71b3
      Committed by David Sterba
      The mount option alloc_start was used in the past for debugging and
      stressing the chunk allocator. Not meant to be used by users, so we're
      not breaking anybody's setup.
      
      There was some added complexity handling changes of the value and when
      it was not the same as the default. Such code has likely been untested
      and I think it's better to remove it.
      
      This patch kills all use of alloc_start, and by doing that also fixes
      a bug when alloc_start is set, potentially called from statfs:
      
      in btrfs_calc_avail_data_space, traversing the list in RCU, the RCU
      protection is temporarily dropped so btrfs_account_dev_extents_size can
      be called and then RCU is locked again! Doing that inside
      list_for_each_entry_rcu is just asking for trouble, but unlikely to be
      observed in practice.
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: move fs_info::fs_frozen to the flags · fac03c8d
      Committed by David Sterba
      We can keep the state among the other fs_info flags, there's no reason
      why fs_frozen would need to be separate.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: use generic slab for btrfs_transaction · 4b5faeac
      Committed by David Sterba
      Observing the number of slab objects of btrfs_transaction, there's just
      one active on an almost quiescent filesystem, and the number of objects
      goes to about ten when sync is in progress. Then the number goes down to
      1.  This matches the expectations of the transaction lifetime.
      
      For such use the separate slab cache is not justified, as we do not
      reuse objects frequently. For the shortlived transaction, the generic
      slab (size 512) should be ok. We can optimistically expect that the 512
      slabs are not all used (fragmentation) and there are free slots to take
      when we do the allocation, compared to potentially allocating a whole new
      page for the separate slab.
      
      We'll lose the stats about the object use, which could be added later if
      we really need them.
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: remove __BTRFS_LEAF_DATA_SIZE · 118c701e
      Committed by Nikolay Borisov
      __BTRFS_LEAF_DATA_SIZE is used only by BTRFS_LEAF_DATA_SIZE. Make the
      latter subsume the former.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: rename btrfs_leaf_data to BTRFS_LEAF_DATA_OFFSET · 3d9ec8c4
      Committed by Nikolay Borisov
      Commit 5f39d397 ("Btrfs: Create extent_buffer interface
      for large blocksizes") refactored btrfs_leaf_data function to take
      extent_buffer rather than struct btrfs_leaf. However, as it turns out the
      parameter being passed is never used. Furthermore this function no longer
      returns the leaf data but rather the offset to it. So rename the function
      to BTRFS_LEAF_DATA_OFFSET to make it consistent with other BTRFS_LEAF_*
      helpers and turn it into a macro.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      [ removed () from the macro ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: cleanup root usage by btrfs_get_alloc_profile · 1b86826d
      Committed by Jeff Mahoney
      There are two places where we don't already know what kind of alloc
      profile we need before calling btrfs_get_alloc_profile, but we need
      access to a root everywhere we call it.
      
      This patch adds helpers for btrfs_{data,metadata,system}_alloc_profile()
      and relegates btrfs_system_alloc_profile to a static for use in those
      two cases.  The next patch will eliminate one of those.
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      Btrfs: replace tree->mapping with tree->private_data · c6100a4b
      Committed by Josef Bacik
      For extent_io trees we have carried the address_mapping of the inode
      around in the io tree in order to pull the inode back out for calling
      into various tree ops hooks.  This works fine when everything that has
      an extent_io_tree has an inode.  But we are going to remove the
      btree_inode, so we need to change this.  Instead just have a generic
      void * for private data that we can initialize with, and have all the
      tree ops use that instead.  This had a lot of cascading changes but
      should be relatively straightforward.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ minor reordering of the callback prototypes ]
      Signed-off-by: David Sterba <dsterba@suse.com>