  1. 25 Feb, 2019 5 commits
  2. 31 Jan, 2019 1 commit
  3. 11 Jan, 2019 1 commit
    • btrfs: Use real device structure to verify dev extent · 1b3922a8
      Authored by Qu Wenruo
      [BUG]
      Linux v5.0-rc1 will fail fstests/btrfs/163 with the following kernel
      message:
      
        BTRFS error (device dm-6): dev extent devid 1 physical offset 13631488 len 8388608 is beyond device boundary 0
        BTRFS error (device dm-6): failed to verify dev extents against chunks: -117
        BTRFS error (device dm-6): open_ctree failed
      
      [CAUSE]
      Commit cf90d884 ("btrfs: Introduce mount time chunk <-> dev extent
      mapping check") introduced a strict check on dev extents.
      
      We use btrfs_find_device() with dev uuid and fs uuid set to NULL, and
      depend only on @devid to find the real device.
      
      For seed devices, we call clone_fs_devices() in open_seed_devices() to
      allow us to search seed devices directly.
      
      However clone_fs_devices() just populates devices with devid and dev
      uuid, without populating other essential members, like disk_total_bytes.
      
      This means that any device returned by btrfs_find_device(fs_info, devid,
      NULL, NULL) is just a dummy, with 0 disk_total_bytes, and any dev
      extents on the seed device will not pass the device boundary check.
      
      [FIX]
      This patch will try to verify the device returned by btrfs_find_device()
      and if it's a dummy then re-search in seed devices.
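
The check and the fallback described above can be sketched in userspace C. This is only an illustration under stated assumptions: the `demo_*` structures and functions are hypothetical stand-ins, not the kernel's `struct btrfs_device` or `btrfs_find_device()`, and the way the dummy and real entries are arranged across lists is simplified.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a device entry. */
struct demo_device {
    uint64_t devid;
    uint64_t disk_total_bytes; /* 0 for a dummy entry that was only cloned */
};

/* Hypothetical stand-in for one device list, chained to its seed list. */
struct demo_fs_devices {
    const struct demo_device *devs;
    size_t ndevs;
    const struct demo_fs_devices *seed; /* next seed list, or NULL */
};

static const struct demo_device *
demo_find_in_list(const struct demo_fs_devices *fsd, uint64_t devid)
{
    for (size_t i = 0; i < fsd->ndevs; i++)
        if (fsd->devs[i].devid == devid)
            return &fsd->devs[i];
    return NULL;
}

/* Skip dummy matches (size 0) and keep searching the seed chain. */
const struct demo_device *
demo_find_real_device(const struct demo_fs_devices *fsd, uint64_t devid)
{
    for (; fsd != NULL; fsd = fsd->seed) {
        const struct demo_device *dev = demo_find_in_list(fsd, devid);

        if (dev != NULL && dev->disk_total_bytes != 0)
            return dev;
    }
    return NULL;
}

/* The boundary check: the extent must end at or before the device size. */
bool demo_dev_extent_in_bounds(const struct demo_device *dev,
                               uint64_t physical, uint64_t len)
{
    assert(dev != NULL);
    return physical + len <= dev->disk_total_bytes;
}
```

With the values from the log above (physical offset 13631488, len 8388608), a dummy entry with a size of 0 always fails the check, while the real device found through the seed chain can pass it.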
      
      Fixes: cf90d884 ("btrfs: Introduce mount time chunk <-> dev extent mapping check")
      CC: stable@vger.kernel.org # 4.19+
      Reported-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 17 Dec, 2018 27 commits
    • btrfs: Fix typos in comments and strings · 52042d8e
      Authored by Andrea Gelmini
      The typos accumulate over time so once in a while they get fixed in
      a large patch.
      Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Remove 1st shrink/grow phase from balance · 15c82763
      Authored by Nikolay Borisov
      The first step of the rebalance process ensures there is 1MiB free on
      each device. This number seems rather small. And in fact when talking to
      the original authors their opinions were:
      
      "man that's a little bonkers"
      "i don't think we even need that code anymore"
      "I think it was there to make sure we had room for the blank 1M at the
      beginning. I bet it goes all the way back to v0"
      "we just don't need any of that tho, i say we just delete it"
      
      Clearly, this piece of code has lost its original intent throughout the
      years. It doesn't really bring any real practical benefits to the
      relocation process.
      
      Additionally, this patch makes the balance process more lightweight by
      removing a pair of shrink/grow operations which are rather expensive for
      heavily populated filesystems. This is mainly due to shrink requiring
      relocating block groups, involving heavy use of the btree.
      
      The intermediate shrink/grow can fail and leave the filesystem in a
      middle state that would need to be changed back by the user.
      Suggested-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      [ update changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use offset_in_page instead of open-coding it · 7073017a
      Authored by Johannes Thumshirn
      Constructs like 'var & (PAGE_SIZE - 1)' or 'var & ~PAGE_MASK' can denote an
      offset into a page.
      
      So replace them by the offset_in_page() macro instead of open-coding it if
      they're not used as an alignment check.
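
A small userspace sketch of the equivalence, assuming a 4096-byte page. The `DEMO_*` and `demo_*` names are made up for this illustration; the kernel's own `PAGE_SIZE`, `PAGE_MASK` and `offset_in_page()` come from its headers.

```c
#include <assert.h>

/* Assumed value for illustration; the kernel defines it per architecture. */
#define DEMO_PAGE_SIZE 4096UL
#define DEMO_PAGE_MASK (~(DEMO_PAGE_SIZE - 1))

static_assert((DEMO_PAGE_SIZE & (DEMO_PAGE_SIZE - 1)) == 0,
              "page size must be a power of two");

/* Userspace imitation of the offset_in_page() helper. */
#define demo_offset_in_page(p) ((unsigned long)(p) & ~DEMO_PAGE_MASK)

/* Open-coded form that the patch replaces. */
unsigned long demo_open_coded_offset(unsigned long pos)
{
    return pos & (DEMO_PAGE_SIZE - 1);
}

/* Same result, but the intent is visible in the name. */
unsigned long demo_helper_offset(unsigned long pos)
{
    return demo_offset_in_page(pos);
}

/* An alignment check has a different intent and stays open-coded. */
int demo_is_page_aligned(unsigned long pos)
{
    return (pos & (DEMO_PAGE_SIZE - 1)) == 0;
}
```

Both offset functions return 1 for position 4097, while the alignment check asks a yes/no question about the same bits.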
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: dev-replace: open code trivial locking helpers · cb5583dd
      Authored by David Sterba
      The dev-replace locking functions are now trivial wrappers around rw
      semaphore that can be used directly everywhere. No functional change.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: dev-replace: remove custom read/write blocking scheme · 53176dde
      Authored by David Sterba
      After the rw semaphore has been added, the custom blocking using
      ::blocking_readers and ::read_lock_wq is redundant.
      
      The blocking logic in __btrfs_map_block is replaced by extending the
      time the semaphore is held, which has the same blocking effect on writes
      as the previous custom scheme that waited until ::blocking_readers was
      zero.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Refactor btrfs_merge_bio_hook · da12fe54
      Authored by Nikolay Borisov
      This function really checks whether adding more data to the bio will
      straddle a stripe/chunk. So first let's give it a more appropriate name
      - btrfs_bio_fits_in_stripe. Secondly, the offset parameter was never
      used, so just remove it. Thirdly, pages are submitted to either btree or
      data inodes so it's guaranteed that tree->ops is set so replace the
      check with an ASSERT. Finally, document the parameters of the function.
      No functional changes.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: balance: print to system log when balance ends or is paused · 7333bd02
      Authored by Anand Jain
      Print a kernel log message when the balance ends, whether it was
      cancelled or completed, or when it is paused.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: balance: print args during start and resume · 56fc37d9
      Authored by Anand Jain
      The information about balance arguments is important for system audit,
      this patch prints the textual representation when balance starts or is
      resumed.
      
      Example command:
      
       $ btrfs balance start -f -mprofiles=raid1,convert=single,soft -dlimit=10..20,usage=50 /btrfs
      
      Example kernel log output:
      
       BTRFS info (device sdb): balance: start -f -dusage=50,limit=10..20 -mconvert=single,soft,profiles=raid1 -sconvert=single,soft,profiles=raid1
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update changelog, simplify code ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add helper to describe block group flags · f89e09cf
      Authored by Anand Jain
      Factor out helper that describes block group flags from
      describe_relocation. The result will not be longer than the given size.
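
One way such a size-bounded description helper could look, as a userspace sketch only. The flag bits, the names table and `demo_describe_flags()` are hypothetical and do not reproduce the kernel helper; the point is that output never exceeds the given size and stays NUL-terminated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical flag bits, chosen only for this sketch. */
#define DEMO_BG_DATA     (1ULL << 0)
#define DEMO_BG_METADATA (1ULL << 1)
#define DEMO_BG_RAID1    (1ULL << 2)

/* Write a '|'-separated description of flags into buf, never past size. */
void demo_describe_flags(uint64_t flags, char *buf, size_t size)
{
    static const struct { uint64_t bit; const char *name; } names[] = {
        { DEMO_BG_DATA, "data" },
        { DEMO_BG_METADATA, "metadata" },
        { DEMO_BG_RAID1, "raid1" },
    };
    size_t used = 0;

    assert(buf != NULL && size > 0);
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
        if (!(flags & names[i].bit))
            continue;
        int n = snprintf(buf + used, size - used, "%s%s",
                         used ? "|" : "", names[i].name);
        if (n < 0 || (size_t)n >= size - used)
            return; /* truncated: stop, buf remains NUL-terminated */
        used += (size_t)n;
    }
}
```

With a 32-byte buffer, data plus raid1 yields "data|raid1"; with an 8-byte buffer the text is cut short but the buffer is never overrun.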
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add comments ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix access to available allocation bits when starting balance · 5a8067c0
      Authored by Filipe Manana
      The available allocation bits members from struct btrfs_fs_info are
      protected by a sequence lock, and when starting balance we access them
      incorrectly in two different ways:
      
      1) In the read sequence lock loop at btrfs_balance() we use the values we
         read from fs_info->avail_*_alloc_bits and we can immediately do actions
         that have side effects and can not be undone (printing a message and
         jumping to a label). This is wrong because a retry might be needed, so
         our actions must not have side effects and must be repeatable as long
         as read_seqretry() returns a non-zero value. In other words, we were
         essentially ignoring the sequence lock;
      
      2) Right below the read sequence lock loop, we were reading the values
         from avail_metadata_alloc_bits and avail_data_alloc_bits without any
         protection from concurrent writers, that is, reading them outside of
         the read sequence lock critical section.
      
      So fix this by making sure we only read the available allocation bits
      while in a read sequence lock critical section and that what we do in the
      critical section is repeatable (has nothing that can not be undone) so
      that any eventual retry that is needed is handled properly.
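
The rule above (copy values inside the retry loop, act on them only afterwards) can be sketched with a userspace imitation of a sequence lock built on C11 atomics. Everything named `demo_*` is hypothetical; the kernel uses its own seqlock primitives. The sketch assumes a single writer, and the plain loads of the protected fields would formally be a data race under concurrent use in ISO C.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical userspace imitation of a sequence lock; not the kernel API. */
struct demo_seqlock {
    atomic_uint seq; /* odd while a write is in progress */
};

struct demo_fs_info {
    struct demo_seqlock lock;
    uint64_t avail_data_alloc_bits;
    uint64_t avail_metadata_alloc_bits;
};

static unsigned demo_read_seqbegin(struct demo_seqlock *l)
{
    unsigned s;

    /* Spin until no write is in progress (sequence is even). */
    while ((s = atomic_load_explicit(&l->seq, memory_order_acquire)) & 1u)
        ;
    return s;
}

/* Nonzero means the snapshot may be torn and the loop must run again. */
static int demo_read_seqretry(struct demo_seqlock *l, unsigned start)
{
    atomic_thread_fence(memory_order_acquire);
    return atomic_load_explicit(&l->seq, memory_order_relaxed) != start;
}

void demo_write_bits(struct demo_fs_info *fs, uint64_t data, uint64_t meta)
{
    unsigned prev = atomic_fetch_add_explicit(&fs->lock.seq, 1u,
                                              memory_order_acq_rel);

    assert((prev & 1u) == 0); /* single writer assumed */
    fs->avail_data_alloc_bits = data;
    fs->avail_metadata_alloc_bits = meta;
    atomic_fetch_add_explicit(&fs->lock.seq, 1u, memory_order_release);
}

/* Only copy inside the loop; the body must be safe to repeat. */
uint64_t demo_allowed_bits(struct demo_fs_info *fs)
{
    uint64_t data, meta;
    unsigned seq;

    do {
        seq = demo_read_seqbegin(&fs->lock);
        data = fs->avail_data_alloc_bits;
        meta = fs->avail_metadata_alloc_bits;
    } while (demo_read_seqretry(&fs->lock, seq));

    /* Printing or jumping to an error label belongs here, after the loop. */
    return data | meta;
}
```

The loop body has no side effects, so running it several times is harmless; decisions based on the copied values happen once, after a consistent snapshot has been taken.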
      
      Fixes: de98ced9 ("Btrfs: use seqlock to protect fs_info->avail_{data, metadata, system}_alloc_bits")
      Fixes: 14506127 ("btrfs: fix a bogus warning when converting only data or metadata")
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Handle final split-brain possibility during fsid change · cc5de4e7
      Authored by Nikolay Borisov
      This patch lands the last case which needs to be handled by the fsid
      change code. Namely, this is the case where a multidisk filesystem has
      already undergone at least one successful fsid change, i.e. all disks
      have the METADATA_UUID incompat bit, and a power failure occurs while
      another fsid change is in progress. When such an event occurs, disks could be
      split in 2 groups. One of the groups will have both METADATA_UUID and
      CHANGING_FSID_V2 flags set coupled with old fsid/metadata_uuid pairs.
      The other group of disks will have only METADATA_UUID bit set and their
      fsid will be different than the one in disks in the first group. Here
      we look at the following cases:
      
        a) A disk from the first group is scanned first, so fs_devices is
        created with stale fsid/metadata_uuid. Then when a disk from the
        second group is scanned it needs to first check whether there exists
        such an fs_devices that has fsid_change set to true (because it was
        created with a disk having the CHANGING_FSID_V2 flag), the
        metadata_uuid and fsid of the fs_devices will be different (since it was
        created by a disk which already has had at least 1 successful fsid change)
        and finally the metadata_uuid of the fs_devices will equal that of the
        currently scanned disk (because metadata_uuid never really changes).
        When the correct fs_devices is found the information from the scanned
        disk will replace the current one in fs_devices since the scanned disk
        will have higher generation number.
      
        b) A disk from the second group is scanned so fs_devices is created
        as usual with differing fsid/metadata_uuid. Then when a disk from the
        first group is scanned the code detects that it has both
        CHANGING_FSID_V2 and METADATA_UUID flags set and will search for
        fs_devices that has differing metadata_uuid/fsid and whose
        metadata_uuid is the same as that of the scanned device.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Handle one more split-brain scenario during fsid change · 7a62d0f0
      Authored by Nikolay Borisov
      This commit continues hardening the scanning code to handle cases where
      power loss could have caused disks in a multi-disk filesystem to be
      in an inconsistent state. Namely, handle the situation that can occur when
      some of the disks in a multi-disk fs have completed their fsid change, i.e.
      they have the METADATA_UUID incompat flag set, have cleared the
      CHANGING_FSID_V2 flag and their fsid/metadata_uuid are different. At
      the same time the other half of the disks will have their
      fsid/metadata_uuid unchanged and will only have CHANGING_FSID_V2 flag.
      
      This is handled by introducing code in the scan path which:
      
       a) Handles the case when a device with CHANGING_FSID_V2 flag is
       scanned and as a result btrfs_fs_devices is created with matching
       fsid/metadata_uuid. Subsequently, when a device with completed fsid
       change is scanned it will detect this via the new code in find_fsid,
       i.e. that an fs_devices exists whose fsid_change flag is set to true,
       its metadata_uuid/fsid match and the metadata_uuid of the scanned
       device matches that of the fs_devices. In this case, it's important to
       note that the device which has its fsid change completed will have a
       higher generation number than the device with CHANGING_FSID_V2 flag
       set, so its superblock will be used during mount. To prevent an
       assertion triggering because the sb used for mounting will have
       differing fsid/metadata_uuid than the ones in the fs_devices struct
       also add code in device_list_add which overwrites the values in
       fs_devices.
      
       b) Alternatively we can end up with a device that completed its
       fsid change being scanned first, which will create the respective
       btrfs_fs_devices struct with differing fsid/metadata_uuid. In this
       case when a device with CHANGING_FSID_V2 flag set is scanned it will
       call the newly added find_fsid_inprogress function which will return
       the correct fs_devices.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add members to fs_devices to track fsid changes · d1a63002
      Authored by Nikolay Borisov
      In order to gracefully handle split-brain scenarios during fsid change
      (which are very unlikely, yet possible), two more pieces of information
      will be necessary:
      
      1. The highest generation number among all devices registered to a
         particular btrfs_fs_devices
      
      2. A boolean flag whether a given btrfs_fs_devices was created by a
         device which had the CHANGING_FSID_V2 flag set.
      
      This is a preparatory patch and just introduces the variables as well
      as code which sets them, their actual use is going to happen in a later
      patch.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Remove fsid/metadata_fsid fields from btrfs_info · de37aa51
      Authored by Nikolay Borisov
      Currently btrfs_fs_info structure contains a copy of the
      fsid/metadata_uuid fields. Same values are also contained in the
      btrfs_fs_devices structure which fs_info has a reference to. Let's
      reduce duplication by removing the fields from fs_info and always refer
      to the ones in fs_devices. No functional changes.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Introduce support for FSID change without metadata rewrite · 7239ff4b
      Authored by Nikolay Borisov
      This field is going to be used when the user wants to change the UUID
      of the filesystem without having to rewrite all metadata blocks. This
      field adds another level of indirection such that when the FSID is
      changed what really happens is the current UUID (the one with which the
      fs was created) is copied to the 'metadata_uuid' field in the superblock
      and a new incompat flag, METADATA_UUID, is set. When the kernel
      detects this flag is set it knows that the superblock in fact has 2
      UUIDs:
      
      1. FSID - this is the UUID which is user-visible.
      2. Metadata UUID - this is the UUID which is stamped into all on-disk
         data structures belonging to this file system.
      
      When the new incompat flag is present device scanning checks whether
      both fsid/metadata_uuid of the scanned device match any of the
      registered filesystems. When the flag is not set then both UUIDs are
      equal and only the FSID is retained on disk, metadata_uuid is set only
      in-memory during mount.
      
      Additionally a new metadata_uuid field is also added to the fs_info
      struct. It's initialised either with the FSID in case METADATA_UUID
      incompat flag is not set or with the metadata_uuid of the superblock
      otherwise.
      
      This commit introduces the new fields as well as the new incompat flag
      and switches all users of the fsid to the new logic.
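
The selection rule described above can be sketched as follows. The structure layout, the flag bit value and the `demo_*` names are assumptions made for illustration; they are not the on-disk superblock format or the kernel's accessors.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_UUID_SIZE 16
/* Hypothetical bit value; the real flag value is not given in the text. */
#define DEMO_INCOMPAT_METADATA_UUID (1ULL << 0)

/* Hypothetical, trimmed-down superblock. */
struct demo_super_block {
    uint8_t fsid[DEMO_UUID_SIZE];          /* user-visible UUID */
    uint8_t metadata_uuid[DEMO_UUID_SIZE]; /* valid only if the flag is set */
    uint64_t incompat_flags;
};

/* Pick the UUID that on-disk metadata is stamped with. */
void demo_effective_metadata_uuid(const struct demo_super_block *sb,
                                  uint8_t out[DEMO_UUID_SIZE])
{
    assert(sb != NULL && out != NULL);
    if (sb->incompat_flags & DEMO_INCOMPAT_METADATA_UUID)
        memcpy(out, sb->metadata_uuid, DEMO_UUID_SIZE);
    else
        memcpy(out, sb->fsid, DEMO_UUID_SIZE);
}

/* Does a metadata block stamped with 'stamped' belong to this filesystem? */
int demo_block_belongs(const struct demo_super_block *sb,
                       const uint8_t stamped[DEMO_UUID_SIZE])
{
    uint8_t want[DEMO_UUID_SIZE];

    demo_effective_metadata_uuid(sb, want);
    return memcmp(want, stamped, DEMO_UUID_SIZE) == 0;
}
```

After an FSID change in this model, existing metadata blocks still match because they are compared against metadata_uuid rather than the new fsid, which is why no metadata rewrite is needed.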
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ minor updates in comments ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Remove superfluous check form btrfs_remove_chunk · 64bc6c2a
      Authored by Nikolay Borisov
      It's unnecessary to check map->stripes[i].dev for NULL given its value
      is already set and dereferenced above the check. No functional
      changes.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: harden agaist duplicate fsid on scanned devices · a9261d41
      Authored by Anand Jain
      It's not hard to imagine a device or a btrfs image being copied with
      the dd or the cp command, in which case both copies of the filesystem
      will have the same fsid. On a system with automount enabled, the copied
      filesystem gets scanned.
      
      We have a known bug in btrfs: we let the device path be changed
      after the device has been mounted. Through this loophole the newly
      copied device appears to be mounted immediately after it has
      been copied.
      
      For example:
      
      Initially.. /dev/mmcblk0p4 is mounted as /
      
        $ lsblk
        NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
        mmcblk0     179:0    0 29.2G  0 disk
        |-mmcblk0p4 179:4    0    4G  0 part /
        |-mmcblk0p2 179:2    0  500M  0 part /boot
        |-mmcblk0p3 179:3    0  256M  0 part [SWAP]
        `-mmcblk0p1 179:1    0  256M  0 part /boot/efi
      
        $ btrfs fi show
           Label: none  uuid: 07892354-ddaa-4443-90ea-f76a06accaba
           Total devices 1 FS bytes used 1.40GiB
           devid    1 size 4.00GiB used 3.00GiB path /dev/mmcblk0p4
      
      Copy mmcblk0 to sda
      
        $ dd if=/dev/mmcblk0 of=/dev/sda
      
      Immediately after the copy completes, a notification about the changed
      device superblock is sent, automount scans the device using btrfs device
      scan, and the new device sda becomes the mounted root device.
      
        $ lsblk
        NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
        sda           8:0    1 14.9G  0 disk
        |-sda4        8:4    1    4G  0 part /
        |-sda2        8:2    1  500M  0 part
        |-sda3        8:3    1  256M  0 part
        `-sda1        8:1    1  256M  0 part
        mmcblk0     179:0    0 29.2G  0 disk
        |-mmcblk0p4 179:4    0    4G  0 part
        |-mmcblk0p2 179:2    0  500M  0 part /boot
        |-mmcblk0p3 179:3    0  256M  0 part [SWAP]
        `-mmcblk0p1 179:1    0  256M  0 part /boot/efi
      
        $ btrfs fi show /
          Label: none  uuid: 07892354-ddaa-4443-90ea-f76a06accaba
          Total devices 1 FS bytes used 1.40GiB
          devid    1 size 4.00GiB used 3.00GiB path /dev/sda4
      
      The bug is quite nasty in that you can unmount neither /dev/sda4 nor
      /dev/mmcblk0p4. And the problem does not get solved until you move sda
      to another system and change its fsid there using the
      'btrfstune -u' command.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce nparity raid_attr · b50836ed
      Authored by Hans van Kranenburg
      Instead of hardcoding exceptions for RAID5 and RAID6 in the code, use an
      nparity field in raid_attr.
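
A sketch of how a per-profile parity count can replace special cases in code. The table below is a hypothetical, trimmed-down model with only three profiles and two fields, and the `demo_data_stripes()` formula is an assumption for illustration rather than a quote of the kernel code.

```c
#include <assert.h>

/* Hypothetical, trimmed-down attribute table. */
struct demo_raid_attr {
    const char *name;
    int ncopies; /* copies of the data */
    int nparity; /* parity stripes per full stripe */
};

enum { DEMO_RAID1, DEMO_RAID5, DEMO_RAID6, DEMO_NR_PROFILES };

static const struct demo_raid_attr demo_raid_array[DEMO_NR_PROFILES] = {
    [DEMO_RAID1] = { "raid1", 2, 0 },
    [DEMO_RAID5] = { "raid5", 1, 1 },
    [DEMO_RAID6] = { "raid6", 1, 2 },
};

/* No "if RAID5 ... else if RAID6 ..." branches: the table carries the data. */
int demo_data_stripes(int profile, int num_stripes)
{
    assert(profile >= 0 && profile < DEMO_NR_PROFILES);
    const struct demo_raid_attr *attr = &demo_raid_array[profile];

    return (num_stripes - attr->nparity) / attr->ncopies;
}
```

With four stripes, this model gives three data stripes for raid5 and two for raid6, without naming either profile in the function body.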
      Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix ncopies raid_attr for RAID56 · da612e31
      Authored by Hans van Kranenburg
      RAID5 and RAID6 profiles store one copy of the data, not 2 or 3. These
      values are not yet used anywhere so there's no change.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: alloc_chunk: fix more DUP stripe size handling · baf92114
      Authored by Hans van Kranenburg
      Commit 92e222df "btrfs: alloc_chunk: fix DUP stripe size handling"
      fixed calculating the stripe_size for a new DUP chunk.
      
      However, the same calculation reappears a bit later, and that one was
      not changed yet. The resulting bug that is exposed is that the newly
      allocated device extents ('stripes') can have a few MiB overlap with the
      next thing stored after them, which is another device extent or the end
      of the disk.
      
      The scenario in which this can happen is:
      * The block device for the filesystem is less than 10GiB in size.
      * The amount of contiguous free unallocated disk space chosen to use for
        chunk allocation is 20% of the total device size, or a few MiB more or
        less.
      
      An example:
      - The filesystem device is 7880MiB (max_chunk_size gets set to 788MiB)
      - There's 1578MiB unallocated raw disk space left in one contiguous
        piece.
      
      In this case stripe_size is first calculated as 789MiB, (half of
      1578MiB).
      
      Since 789MiB (stripe_size * data_stripes) > 788MiB (max_chunk_size), we
      enter the if block. Now stripe_size value is immediately overwritten
      while calculating an adjusted value based on max_chunk_size, which ends
      up as 788MiB.
      
      Next, the value is rounded up to a 16MiB boundary, 800MiB, which is
      actually more than the value we had before. However, the last comparison
      fails to detect this, because it's comparing the value with the total
      amount of free space, which is about twice the size of stripe_size.
      
      In the example above, this means that the resulting raw disk space being
      allocated is 1600MiB, while only a gap of 1578MiB has been found. The
      second device extent object for this DUP chunk will overlap for 22MiB
      with whatever comes next.
      
      The underlying problem here is that the stripe_size is reused all the
      time for different things. So, when entering the code in the if block,
      stripe_size is immediately overwritten with something else. If later we
      decide we want to have the previous value back, then the logic to
      compute it was copy pasted in again.
      
      With this change, the value in stripe_size is not unnecessarily
      destroyed, so the duplicated calculation is not needed any more.
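
The arithmetic of the example can be replayed in a few lines of C. The first function follows the steps as described above; the second is one possible reading of "not destroying stripe_size" and is an assumption, not the actual patch. All `demo_*` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MIB (1024ULL * 1024ULL)

static uint64_t demo_round_up(uint64_t v, uint64_t align)
{
    return (v + align - 1) / align * align;
}

static uint64_t demo_min(uint64_t a, uint64_t b)
{
    return a < b ? a : b;
}

/* Steps as described: stripe_size is overwritten inside the if block. */
uint64_t demo_stripe_size_old(uint64_t max_avail, uint64_t max_chunk_size,
                              uint64_t dev_stripes, uint64_t data_stripes)
{
    assert(dev_stripes > 0 && data_stripes > 0);
    uint64_t stripe_size = max_avail / dev_stripes;

    if (stripe_size * data_stripes > max_chunk_size) {
        stripe_size = max_chunk_size / data_stripes;
        stripe_size = demo_round_up(stripe_size, 16 * DEMO_MIB);
        /* Compared with the whole gap, not the per-stripe share. */
        stripe_size = demo_min(max_avail, stripe_size);
    }
    return stripe_size;
}

/* Assumed variant: keep the first value and only ever lower it. */
uint64_t demo_stripe_size_kept(uint64_t max_avail, uint64_t max_chunk_size,
                               uint64_t dev_stripes, uint64_t data_stripes)
{
    assert(dev_stripes > 0 && data_stripes > 0);
    uint64_t stripe_size = max_avail / dev_stripes;

    if (stripe_size * data_stripes > max_chunk_size) {
        uint64_t capped = demo_round_up(max_chunk_size / data_stripes,
                                        16 * DEMO_MIB);

        stripe_size = demo_min(capped, stripe_size);
    }
    return stripe_size;
}
```

For the example (1578MiB gap, 788MiB max_chunk_size, two device stripes, one data stripe) the first function returns 800MiB, so two extents need 1600MiB, 22MiB more than the gap. The second returns 789MiB, and two extents fit in the gap exactly.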
      Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: alloc_chunk: improve chunk size variable name · 23f0ff1e
      Authored by Hans van Kranenburg
      The variable name num_bytes is far too generic for a variable
      in this function. There are a dozen other variables that hold a number
      of bytes as value.
      
      Give it a name that actually describes what it does, which is holding
      the size of the chunk that we're allocating.
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: alloc_chunk: do not refurbish num_bytes · 2f29df4f
      Authored by Hans van Kranenburg
      The variable num_bytes is used to store the chunk length of the chunk
      that we're allocating. Do not reuse it for something really different in
      the same function.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Check for missing device before bio submission in btrfs_map_bio · fc8a168a
      Authored by Nikolay Borisov
      Before btrfs_map_bio submits all stripe bios it does a number of checks
      to ensure the device for every stripe is present. However, it doesn't do
      a DEV_STATE_MISSING check, instead this is relegated to the lower level
      btrfs_schedule_bio (in the async submission case, sync submission
      doesn't check DEV_STATE_MISSING at all). Additionally
      btrfs_schedule_bio does the duplicate device->bdev check which has
      already been performed in btrfs_map_bio.
      
      This patch moves the DEV_STATE_MISSING check into btrfs_map_bio and
      removes the duplicate device->bdev check. Doing so ensures that no bio
      cloning/submission happens for both async/sync requests in the face of
      a missing device. This makes the async io submission path slightly shorter
      in terms of instruction count. No functional changes.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: rename and export get_chunk_map · 60ca842e
      Authored by Omar Sandoval
      The Btrfs swap code is going to need it, so give it a btrfs_ prefix and
      make it non-static.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: prevent ioctls from interfering with a swap file · eede2bf3
      Authored by Omar Sandoval
      A later patch will implement swap file support for Btrfs, but before we
      do that, we need to make sure that the various Btrfs ioctls cannot
      change a swap file.
      
      When a swap file is active, we must make sure that the extents of the
      file are not moved and that they don't become shared. That means that
      the following are not safe:
      
      - chattr +c (enable compression)
      - reflink
      - dedupe
      - snapshot
      - defrag
      
      Don't allow those to happen on an active swap file.
      
      Additionally, balance, resize, device remove, and device replace are
      also unsafe if they affect an active swapfile. Add a red-black tree of
      block groups and devices which contain an active swapfile. Relocation
      checks each block group against this tree and skips it or errors out for
      balance or resize, respectively. Device remove and device replace check
      the tree for the device they will operate on.
      
      Note that we don't have to worry about chattr -C (disable nocow), which
      we ignore for non-empty files, because an active swapfile must be
      non-empty and can't be truncated. We also don't have to worry about
      autodefrag because it's only done on COW files. Truncate and fallocate
      are already taken care of by the generic code. Device add doesn't do
      relocation so it's not an issue, either.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: volumes: Make sure no dev extent is beyond device boundary · 05a37c48
      Authored by Qu Wenruo
      Add extra dev extent end check against device boundary.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: volumes: Make sure there is no overlap of dev extents at mount time · 5eb19381
      Authored by Qu Wenruo
      Enhance btrfs_verify_dev_extents() to remember previous checked dev
      extents, so it can verify no dev extents can overlap.
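
A minimal sketch of the "remember the previous extent" idea, assuming the extents are visited in order of (devid, physical offset). The structure and `demo_first_overlap()` are hypothetical and do not mirror btrfs_verify_dev_extents().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MIB (1024ULL * 1024ULL)

/* Hypothetical, simplified dev extent record. */
struct demo_dev_extent {
    uint64_t devid;
    uint64_t physical; /* start offset on the device */
    uint64_t len;
};

/*
 * Walk extents sorted by (devid, physical) and remember where the previous
 * one ended. Returns the index of the first extent that starts before the
 * previous extent on the same device ends, or -1 if there is no overlap.
 */
long demo_first_overlap(const struct demo_dev_extent *ext, size_t n)
{
    uint64_t prev_devid = 0;
    uint64_t prev_end = 0;
    int have_prev = 0;

    for (size_t i = 0; i < n; i++) {
        assert(i == 0 || ext[i].devid >= ext[i - 1].devid);
        if (have_prev && ext[i].devid == prev_devid &&
            ext[i].physical < prev_end)
            return (long)i;
        prev_devid = ext[i].devid;
        prev_end = ext[i].physical + ext[i].len;
        have_prev = 1;
    }
    return -1;
}
```

Two 800MiB extents placed 778MiB apart on the same device, a shape similar to the DUP case analysed below, are reported as overlapping; the same offsets on different devices are not.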
      
      Analysis from Hans:
      
      "Imagine allocating a DATA|DUP chunk.
      
       In the chunk allocator, we first set...
         max_stripe_size = SZ_1G;
         max_chunk_size = BTRFS_MAX_DATA_CHUNK_SIZE
       ... which is 10GiB.
      
       Then...
         /* we don't want a chunk larger than 10% of writeable space */
         max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
                              max_chunk_size);
      
       Imagine we only have one 7880MiB block device in this filesystem. Now
       max_chunk_size is down to 788MiB.
      
       The next step in the code is to search for max_stripe_size * dev_stripes
       amount of free space on the device, which is in our example 1GiB * 2 =
       2GiB. Imagine the device has exactly 1578MiB free in one contiguous
       piece. This amount of bytes will be put in devices_info[ndevs - 1].max_avail
      
       Next we recalculate the stripe_size (which is actually the device extent
       length), based on the actual maximum amount of available raw disk space:
         stripe_size = div_u64(devices_info[ndevs - 1].max_avail, dev_stripes);
      
       stripe_size is now 789MiB
      
       Next we do...
         data_stripes = num_stripes / ncopies
       ...where data_stripes ends up as 1, because num_stripes is 2 (the amount
       of device extents we're going to have), and DUP has ncopies 2.
      
       Next there's a check...
         if (stripe_size * data_stripes > max_chunk_size)
       ...which matches because 789MiB * 1 > 788MiB.
      
       We go into the if code, and next is...
         stripe_size = div_u64(max_chunk_size, data_stripes);
       ...which resets stripe_size to max_chunk_size: 788MiB
      
       Next is a fun one...
         /* bump the answer up to a 16MB boundary */
         stripe_size = round_up(stripe_size, SZ_16M);
       ...which changes stripe_size from 788MiB to 800MiB.
      
       We're not done changing stripe_size yet...
         /* But don't go higher than the limits we found while searching
          * for free extents
          */
         stripe_size = min(devices_info[ndevs - 1].max_avail,
                           stripe_size);
      
       This is bad. max_avail is twice the stripe_size (we need to fit 2 device
       extents on the same device for DUP).
      
       The result here is that 800MiB < 1578MiB, so it's unchanged. However,
       the resulting DUP chunk will need 1600MiB disk space, which isn't there,
       and the second dev_extent might extend into the next thing (next
       dev_extent? end of device?) for 22MiB.
      
       The last shown line of code relies on a situation where there's twice
       the value of stripe_size present as value for the variable stripe_size
       when it's DUP. This was actually the case before commit 92e222df
       "btrfs: alloc_chunk: fix DUP stripe size handling", from which I quote:
         "[...] in the meantime there's a check to see if the stripe_size does
       not exceed max_chunk_size. Since during this check stripe_size is twice
       the amount as intended, the check will reduce the stripe_size to
       max_chunk_size if the actual correct to be used stripe_size is more than
       half the amount of max_chunk_size."
      
       In the previous version of the code, the 16MiB alignment (why is this
       done, by the way?) would result in a 50% chance that it would actually
       do an 8MiB alignment for the individual dev_extents, since it was
       operating on double the size. Does this matter?
      
       Does it matter that stripe_size can be set to anything which is not
       16MiB aligned because of the amount of remaining available disk space
       which is just taken?
      
       What is the main purpose of this round_up?
      
       The most straightforward thing to do seems something like...
         stripe_size = min(
             div_u64(devices_info[ndevs - 1].max_avail, dev_stripes),
             stripe_size
         )
       ..just putting half of the max_avail into stripe_size."
      
      Link: https://lore.kernel.org/linux-btrfs/b3461a38-e5f8-f41d-c67c-2efac8129054@mendix.com/
      Reported-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      [ add analysis from report ]
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 15 Oct, 2018 6 commits