1. 17 12月, 2018 40 次提交
    • N
      btrfs: Handle one more split-brain scenario during fsid change · 7a62d0f0
      Nikolay Borisov 提交于
      This commit continues hardening the scanning code to handle cases where
      power loss could have caused disks in a multi-disk filesystem to be
      in inconsistent state. Namely handle the situation that can occur when
      some of the disks in multi-disk fs have completed their fsid change i.e
      they have METADATA_UUID incompat flag set, have cleared the
      CHANGING_FSID_V2 flag and their fsid/metadata_uuid are different. At
      the same time the other half of the disks will have their
      fsid/metadata_uuid unchanged and will only have CHANGING_FSID_V2 flag.
      
      This is handled by introducing code in the scan path which:
      
       a) Handles the case when a device with CHANGING_FSID_V2 flag is
       scanned and as a result btrfs_fs_devices is created with matching
       fsid/metdata_uuid. Subsequently, when a device with completed fsid
       change is scanned it will detect this via the new code in find_fsid
       i.e that such an fs_devices exist that fsid_change flag is set to true,
       it's metadata_uuid/fsid match and the metadata_uuid of the scanned
       device matches that of the fs_devices. In this case, it's important to
       note that the devices which has its fsid change completed will have a
       higher generation number than the device with FSID_CHANGING_V2 flag
       set, so its superblock block will be used during mount. To prevent an
       assertion triggering because the sb used for mounting will have
       differing fsid/metadata_uuid than the ones in the fs_devices struct
       also add code in device_list_add which overwrites the values in
       fs_devices.
      
       b) Alternatively we can end up with a device that completed its
       fsid change be scanned first which will create the respective
       btrfs_fs_devices struct with differing fsid/metadata_uuid. In this
       case when a device with FSID_CHANGING_V2 flag set is scanned it will
       call the newly added find_fsid_inprogress function which will return
       the correct fs_devices.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7a62d0f0
    • N
      btrfs: add members to fs_devices to track fsid changes · d1a63002
      Nikolay Borisov 提交于
      In order to gracefully handle split-brain scenario during fsid change
      (which are very unlikely, yet possible), two more pieces of information
      will be necessary:
      
      1. The highest generation number among all devices registered to a
         particular btrfs_fs_devices
      
      2. A boolean flag whether a given btrfs_fs_devices was created by a
         device which had the FSID_CHANGING_V2 flag set.
      
      This is a preparatory patch and just introduces the variables as well
      as code which sets them, their actual use is going to happen in a later
      patch.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d1a63002
    • N
      btrfs: Add handling for disk split-brain scenario during fsid change · fbc6feae
      Nikolay Borisov 提交于
      Even though fsid change without rewrite is a very quick operation it's
      still possible to experience a split-brain scenario if power loss occurs
      at the most inconvenient time. This patch handles the case where power
      failure occurs while the first transaction (the one setting
      CHANGING_FSID_V2) flag is being persisted on disk. This can cause the
      btrfs_fs_devices of this filesystem to be created by a device which:
      
       a) has the CHANGING_FSID_V2 flag set but its fsid value is intact
      
       b) or a device which doesn't have CHANGING_FSID_V2 flag set and its
          fsid value is intact
      
      This situation is trivially handled by the current find_fsid code since
      in both cases the devices are going to be treated like ordinary devices.
      Since btrfs is always mounted using the superblock of the latest
      device (the one with highest generation number), meaning it will have
      the CHANGING_FSID_V2 flag set, ensure it's being cleared on mount. On
      the first transaction commit following mount all disks will have it
      cleared.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fbc6feae
    • N
      btrfs: Remove fsid/metadata_fsid fields from btrfs_info · de37aa51
      Nikolay Borisov 提交于
      Currently btrfs_fs_info structure contains a copy of the
      fsid/metadata_uuid fields. Same values are also contained in the
      btrfs_fs_devices structure which fs_info has a reference to. Let's
      reduce duplication by removing the fields from fs_info and always refer
      to the ones in fs_devices. No functional changes.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      de37aa51
    • N
      btrfs: Add sysfs support for metadata_uuid feature · 56f20f40
      Nikolay Borisov 提交于
      Since the metadata_uuid is a new incompat feature it requires the
      respective sysfs hooks. This patch adds the 'metdata_uuid' feature to
      be shown if it supported by the kernel. Additionally it adds
      /sys/fs/btrfs/UUID/metadata_uuid attribute which allows one to read
      the current metadata_uuid.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      56f20f40
    • N
      btrfs: Introduce support for FSID change without metadata rewrite · 7239ff4b
      Nikolay Borisov 提交于
      This field is going to be used when the user wants to change the UUID
      of the filesystem without having to rewrite all metadata blocks. This
      field adds another level of indirection such that when the FSID is
      changed what really happens is the current UUID (the one with which the
      fs was created) is copied to the 'metadata_uuid' field in the superblock
      as well as a new incompat flag is set METADATA_UUID. When the kernel
      detects this flag is set it knows that the superblock in fact has 2
      UUIDs:
      
      1. Is the UUID which is user-visible, currently known as FSID.
      2. Metadata UUID - this is the UUID which is stamped into all on-disk
         datastructures belonging to this file system.
      
      When the new incompat flag is present device scanning checks whether
      both fsid/metadata_uuid of the scanned device match any of the
      registered filesystems. When the flag is not set then both UUIDs are
      equal and only the FSID is retained on disk, metadata_uuid is set only
      in-memory during mount.
      
      Additionally a new metadata_uuid field is also added to the fs_info
      struct. It's initialised either with the FSID in case METADATA_UUID
      incompat flag is not set or with the metdata_uuid of the superblock
      otherwise.
      
      This commit introduces the new fields as well as the new incompat flag
      and switches all users of the fsid to the new logic.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ minor updates in comments ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7239ff4b
    • J
      btrfs: use EXPORT_FOR_TESTS for conditionally exported functions · ce9f967f
      Johannes Thumshirn 提交于
      Several functions in BTRFS are only used inside the source file they are
      declared if CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not defined. However if
      CONFIG_BTRFS_FS_RUN_SANITY_TESTS is defined these functions are shared
      with the unit tests code.
      
      Before the introduction of the EXPORT_FOR_TESTS macro, these functions
      could not be declared as static and the compiler had a harder task when
      optimizing and inlining them.
      
      As we have EXPORT_FOR_TESTS now, use it where appropriate to support the
      compiler.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ce9f967f
    • J
      btrfs: introduce EXPORT_FOR_TESTS macro · f8f591df
      Johannes Thumshirn 提交于
      Depending on whether CONFIG_BTRFS_FS_RUN_SANITY_TESTS is set, some BTRFS
      functions are either local to the file they are implemented in and thus
      should be declared static or are called from within the test
      implementation defined in a different file.
      
      Introduce an EXPORT_FOR_TESTS macro which depending on
      CONFIG_BTRFS_FS_RUN_SANITY_TESTS either adds the 'static' keyword to a
      function or not.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f8f591df
    • J
      btrfs: remove unused drop_on_err in btrfs_mkdir · e9a05cf3
      Johannes Thumshirn 提交于
      Up to commit 32955c54 ("btrfs: switch to discard_new_inode()") the
      drop_on_err variable in btrfs_mkdir() was used to check whether the
      inode had to be dropped via iput().
      
      After commit 32955c54 ("btrfs: switch to discard_new_inode()")
      discard_new_inode() is called when err is set and inode is non NULL.
      Therefore drop_on_err is not used anymore and thus causes a warning when
      building with -Wunused-but-set-variable.
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e9a05cf3
    • N
      btrfs: Replace BUG_ON with ASSERT in find_lock_delalloc_range · 9bfd61d9
      Nikolay Borisov 提交于
      lock_delalloc_pages should only return 2 values - 0 in case of success
      and -EAGAIN if the range of pages to be locked should be shrunk due to
      some of gone. Manual inspections confirms that this is indeed the case
      since __process_pages_contig is where lock_delalloc_pages gets its
      return value. The latter always returns 0  or -EAGAIN so the invariant
      holds. No functional changes.
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9bfd61d9
    • N
      btrfs: Sink find_lock_delalloc_range's 'max_bytes' argument · 917aacec
      Nikolay Borisov 提交于
      All callers of this function pass BTRFS_MAX_EXTENT_SIZE (128M) so let's
      reduce the argument count and make that a local variable. No functional
      changes.
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      917aacec
    • N
      btrfs: Remove superfluous check form btrfs_remove_chunk · 64bc6c2a
      Nikolay Borisov 提交于
      It's unnecessary to check map->stripes[i].dev for NULL given its value
      is already set and dereferenced above the the check. No functional
      changes.
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      64bc6c2a
    • A
      btrfs: don't report user-requested cancel as an error · f9085abf
      Anand Jain 提交于
      As of now only user requested replace cancel can cancel the
      replace-scrub so no need to log the error.
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f9085abf
    • A
      btrfs: silence warning if replace is canceled · 49365e69
      Anand Jain 提交于
      When we successfully cancel the device replace, its scrub worker returns
      -ECANCELED, which is then passed to btrfs_dev_replace_finishing.
      
      It cleans up based on the returned status and propagates the same
      -ECANCELED back the parent function. As of now only user can cancel the
      replace-scrub, so its ok to silence the warning here.
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      49365e69
    • A
      btrfs: dev-replace: add explicit check for replace result "no error" · 53e62fb5
      Anand Jain 提交于
      We recast the replace return status
      BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS to 0, to indicate no
      error.
      And since BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_ERROR should also return 0,
      which is also declared as 0, so we just return. Instead add it to the if
      statement so that there is enough clarity while reading the code.
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      53e62fb5
    • A
      btrfs: dev-replace: replace's scrub must not be running in suspended state · fe97e2e1
      Anand Jain 提交于
      When the replace state is in the suspended state, btrfs_scrub_cancel()
      should fail with -ENOTCONN as there is no scrub running. As a safety
      catch check if btrfs_scrub_cancel() returns -ENOTCONN and assert if it
      doesn't.
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fe97e2e1
    • A
      btrfs: dev-replace: set result code of cancel by status of scrub · b47dda2e
      Anand Jain 提交于
      The device-replace needs to check the result code of the scrub workers
      in btrfs_dev_replace_cancel and distinguish if successful cancel
      operation and when the there was no operation running.
      
      If btrfs_scrub_cancel() fails, return
      BTRFS_IOCTL_DEV_REPLACE_RESULT_NOT_STARTED so that user can try
      to cancel the replace again.
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b47dda2e
    • A
      btrfs: fix use-after-free due to race between replace start and cancel · d189dd70
      Anand Jain 提交于
      The device replace cancel thread can race with the replace start thread
      and if fs_info::scrubs_running is not yet set, btrfs_scrub_cancel() will
      fail to stop the scrub thread.
      
      The scrub thread continues with the scrub for replace which then will
      try to write to the target device and which is already freed by the
      cancel thread.
      
      scrub_setup_ctx() warns as tgtdev is NULL.
      
        struct scrub_ctx *scrub_setup_ctx(struct btrfs_device *dev, int is_dev_replace)
        {
        ...
      	  if (is_dev_replace) {
      		  WARN_ON(!fs_info->dev_replace.tgtdev);  <===
      		  sctx->pages_per_wr_bio = SCRUB_PAGES_PER_WR_BIO;
      		  sctx->wr_tgtdev = fs_info->dev_replace.tgtdev;
      		  sctx->flush_all_writes = false;
      	  }
      
        [ 6724.497655] BTRFS info (device sdb): dev_replace from /dev/sdb (devid 1) to /dev/sdc started
        [ 6753.945017] BTRFS info (device sdb): dev_replace from /dev/sdb (devid 1) to /dev/sdc canceled
        [ 6852.426700] WARNING: CPU: 0 PID: 4494 at fs/btrfs/scrub.c:622 scrub_setup_ctx.isra.19+0x220/0x230 [btrfs]
        ...
        [ 6852.428928] RIP: 0010:scrub_setup_ctx.isra.19+0x220/0x230 [btrfs]
        ...
        [ 6852.432970] Call Trace:
        [ 6852.433202]  btrfs_scrub_dev+0x19b/0x5c0 [btrfs]
        [ 6852.433471]  btrfs_dev_replace_start+0x48c/0x6a0 [btrfs]
        [ 6852.433800]  btrfs_dev_replace_by_ioctl+0x3a/0x60 [btrfs]
        [ 6852.434097]  btrfs_ioctl+0x2476/0x2d20 [btrfs]
        [ 6852.434365]  ? do_sigaction+0x7d/0x1e0
        [ 6852.434623]  do_vfs_ioctl+0xa9/0x6c0
        [ 6852.434865]  ? syscall_trace_enter+0x1c8/0x310
        [ 6852.435124]  ? syscall_trace_enter+0x1c8/0x310
        [ 6852.435387]  ksys_ioctl+0x60/0x90
        [ 6852.435663]  __x64_sys_ioctl+0x16/0x20
        [ 6852.435907]  do_syscall_64+0x50/0x180
        [ 6852.436150]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Further, as the replace thread enters scrub_write_page_to_dev_replace()
      without the target device it panics:
      
        static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
      				      struct scrub_page *spage)
        {
        ...
      	bio_set_dev(bio, sbio->dev->bdev); <======
      
        [ 6929.715145] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a0
        ..
        [ 6929.717106] Workqueue: btrfs-scrub btrfs_scrub_helper [btrfs]
        [ 6929.717420] RIP: 0010:scrub_write_page_to_dev_replace+0xb4/0x260
        [btrfs]
        ..
        [ 6929.721430] Call Trace:
        [ 6929.721663]  scrub_write_block_to_dev_replace+0x3f/0x60 [btrfs]
        [ 6929.721975]  scrub_bio_end_io_worker+0x1af/0x490 [btrfs]
        [ 6929.722277]  normal_work_helper+0xf0/0x4c0 [btrfs]
        [ 6929.722552]  process_one_work+0x1f4/0x520
        [ 6929.722805]  ? process_one_work+0x16e/0x520
        [ 6929.723063]  worker_thread+0x46/0x3d0
        [ 6929.723313]  kthread+0xf8/0x130
        [ 6929.723544]  ? process_one_work+0x520/0x520
        [ 6929.723800]  ? kthread_delayed_work_timer_fn+0x80/0x80
        [ 6929.724081]  ret_from_fork+0x3a/0x50
      
      Fix this by letting the btrfs_dev_replace_finishing() to do the job of
      cleaning after the cancel, including freeing of the target device.
      btrfs_dev_replace_finishing() is called when btrfs_scub_dev() returns
      along with the scrub return status.
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d189dd70
    • A
      btrfs: dev-replace: go back to suspend state if another EXCL_OP is running · 05c49e6b
      Anand Jain 提交于
      In a secnario where balance and replace co-exists as below,
      
        - start balance
        - pause balance
        - start replace
        - reboot
      
      and when system restarts, balance resumes first. Then the replace is
      attempted to restart but will fail as the EXCL_OP lock is already held
      by the balance. If so place the replace state back to
      BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED state.
      
      Fixes: 010a47bd ("btrfs: add proper safety check before resuming dev-replace")
      CC: stable@vger.kernel.org # 4.18+
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      05c49e6b
    • A
      btrfs: dev-replace: go back to suspended state if target device is missing · 0d228ece
      Anand Jain 提交于
      At the time of forced unmount we place the running replace to
      BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED state, so when the system comes
      back and expect the target device is missing.
      
      Then let the replace state continue to be in
      BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED state instead of
      BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED as there isn't any matching scrub
      running as part of replace.
      
      Fixes: e93c89c1 ("Btrfs: add new sources for device replace code")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0d228ece
    • A
      btrfs: mark btrfs_dev_replace_start as static · 54862d6d
      Anand Jain 提交于
      There isn't any other consumer other than in its own file dev-replace.c.
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      54862d6d
    • A
      btrfs: harden agaist duplicate fsid on scanned devices · a9261d41
      Anand Jain 提交于
      It's not that impossible to imagine that a device OR a btrfs image is
      copied just by using the dd or the cp command. Which in case both the
      copies of the btrfs will have the same fsid. If on the system with
      automount enabled, the copied FS gets scanned.
      
      We have a known bug in btrfs, that we let the device path be changed
      after the device has been mounted. So using this loop hole the new
      copied device would appears as if its mounted immediately after it's
      been copied.
      
      For example:
      
      Initially.. /dev/mmcblk0p4 is mounted as /
      
        $ lsblk
        NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
        mmcblk0     179:0    0 29.2G  0 disk
        |-mmcblk0p4 179:4    0    4G  0 part /
        |-mmcblk0p2 179:2    0  500M  0 part /boot
        |-mmcblk0p3 179:3    0  256M  0 part [SWAP]
        `-mmcblk0p1 179:1    0  256M  0 part /boot/efi
      
        $ btrfs fi show
           Label: none  uuid: 07892354-ddaa-4443-90ea-f76a06accaba
           Total devices 1 FS bytes used 1.40GiB
           devid    1 size 4.00GiB used 3.00GiB path /dev/mmcblk0p4
      
      Copy mmcblk0 to sda
      
        $ dd if=/dev/mmcblk0 of=/dev/sda
      
      And immediately after the copy completes the change in the device
      superblock is notified which the automount scans using btrfs device scan
      and the new device sda becomes the mounted root device.
      
        $ lsblk
        NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
        sda           8:0    1 14.9G  0 disk
        |-sda4        8:4    1    4G  0 part /
        |-sda2        8:2    1  500M  0 part
        |-sda3        8:3    1  256M  0 part
        `-sda1        8:1    1  256M  0 part
        mmcblk0     179:0    0 29.2G  0 disk
        |-mmcblk0p4 179:4    0    4G  0 part
        |-mmcblk0p2 179:2    0  500M  0 part /boot
        |-mmcblk0p3 179:3    0  256M  0 part [SWAP]
        `-mmcblk0p1 179:1    0  256M  0 part /boot/efi
      
        $ btrfs fi show /
          Label: none  uuid: 07892354-ddaa-4443-90ea-f76a06accaba
          Total devices 1 FS bytes used 1.40GiB
          devid    1 size 4.00GiB used 3.00GiB path /dev/sda4
      
      The bug is quite nasty that you can't either unmount /dev/sda4 or
      /dev/mmcblk0p4. And the problem does not get solved until you take sda
      out of the system on to another system to change its fsid using the
      'btrfstune -u' command.
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a9261d41
    • H
      btrfs: introduce nparity raid_attr · b50836ed
      Hans van Kranenburg 提交于
      Instead of hardcoding exceptions for RAID5 and RAID6 in the code, use an
      nparity field in raid_attr.
      Signed-off-by: NHans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b50836ed
    • H
      btrfs: fix ncopies raid_attr for RAID56 · da612e31
      Hans van Kranenburg 提交于
      RAID5 and RAID6 profile store one copy of the data, not 2 or 3. These
      values are not yet used anywhere so there's no change.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NHans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      da612e31
    • H
      btrfs: alloc_chunk: fix more DUP stripe size handling · baf92114
      Hans van Kranenburg 提交于
      Commit 92e222df "btrfs: alloc_chunk: fix DUP stripe size handling"
      fixed calculating the stripe_size for a new DUP chunk.
      
      However, the same calculation reappears a bit later, and that one was
      not changed yet. The resulting bug that is exposed is that the newly
      allocated device extents ('stripes') can have a few MiB overlap with the
      next thing stored after them, which is another device extent or the end
      of the disk.
      
      The scenario in which this can happen is:
      * The block device for the filesystem is less than 10GiB in size.
      * The amount of contiguous free unallocated disk space chosen to use for
        chunk allocation is 20% of the total device size, or a few MiB more or
        less.
      
      An example:
      - The filesystem device is 7880MiB (max_chunk_size gets set to 788MiB)
      - There's 1578MiB unallocated raw disk space left in one contiguous
        piece.
      
      In this case stripe_size is first calculated as 789MiB, (half of
      1578MiB).
      
      Since 789MiB (stripe_size * data_stripes) > 788MiB (max_chunk_size), we
      enter the if block. Now stripe_size value is immediately overwritten
      while calculating an adjusted value based on max_chunk_size, which ends
      up as 788MiB.
      
      Next, the value is rounded up to a 16MiB boundary, 800MiB, which is
      actually more than the value we had before. However, the last comparison
      fails to detect this, because it's comparing the value with the total
      amount of free space, which is about twice the size of stripe_size.
      
      In the example above, this means that the resulting raw disk space being
      allocated is 1600MiB, while only a gap of 1578MiB has been found. The
      second device extent object for this DUP chunk will overlap for 22MiB
      with whatever comes next.
      
      The underlying problem here is that the stripe_size is reused all the
      time for different things. So, when entering the code in the if block,
      stripe_size is immediately overwritten with something else. If later we
      decide we want to have the previous value back, then the logic to
      compute it was copy pasted in again.
      
      With this change, the value in stripe_size is not unnecessarily
      destroyed, so the duplicated calculation is not needed any more.
      Signed-off-by: NHans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      baf92114
    • H
      btrfs: alloc_chunk: improve chunk size variable name · 23f0ff1e
      Hans van Kranenburg 提交于
      The variable num_bytes is really a way too generic name for a variable
      in this function. There are a dozen other variables that hold a number
      of bytes as value.
      
      Give it a name that actually describes what it does, which is holding
      the size of the chunk that we're allocating.
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NHans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      23f0ff1e
    • H
      btrfs: alloc_chunk: do not refurbish num_bytes · 2f29df4f
      Hans van Kranenburg 提交于
      The variable num_bytes is used to store the chunk length of the chunk
      that we're allocating. Do not reuse it for something really different in
      the same function.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NHans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2f29df4f
    • E
      btrfs: use tagged writepage to mitigate livelock of snapshot · 3cd24c69
      Ethan Lien 提交于
      Snapshot is expected to be fast. But if there are writers steadily
      creating dirty pages in our subvolume, the snapshot may take a very long
      time to complete. To fix the problem, we use tagged writepage for
      snapshot flusher as we do in the generic write_cache_pages(), so we can
      omit pages dirtied after the snapshot command.
      
      This does not change the semantics regarding which data get to the
      snapshot, if there are pages being dirtied during the snapshotting
      operation.  There's a sync called before snapshot is taken in old/new
      case, any IO in flight just after that may be in the snapshot but this
      depends on other system effects that might still sync the IO.
      
      We do a simple snapshot speed test on a Intel D-1531 box:
      
      fio --ioengine=libaio --iodepth=32 --bs=4k --rw=write --size=64G
      --direct=0 --thread=1 --numjobs=1 --time_based --runtime=120
      --filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
      time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio
      
      original: 1m58sec
      patched:  6.54sec
      
      This is the best case for this patch since for a sequential write case,
      we omit nearly all pages dirtied after the snapshot command.
      
      For a multi writers, random write test:
      
      fio --ioengine=libaio --iodepth=32 --bs=4k --rw=randwrite --size=64G
      --direct=0 --thread=1 --numjobs=4 --time_based --runtime=120
      --filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
      time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio
      
      original: 15.83sec
      patched:  10.35sec
      
      The improvement is smaller compared to the sequential write case,
      since we omit only half of the pages dirtied after snapshot command.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NEthan Lien <ethanlien@synology.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3cd24c69
    • N
      btrfs: Remove unused extent_state argument from btrfs_writepage_endio_finish_ordered · c629732d
      Nikolay Borisov 提交于
      This parameter was never used, yet was part of the interface of the
      function ever since its introduction as extent_io_ops::writepage_end_io_hook
      in e6dcd2dc ("Btrfs: New data=ordered implementation"). Now that
      NULL is passed everywhere as a value for this parameter let's remove it
      for good. No functional changes.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c629732d
    • N
      btrfs: Remove extent_page_data argument from writepage_delalloc · 8cc0237a
      Nikolay Borisov 提交于
      The only remaining use of the 'epd' argument in writepage_delalloc is
      to reference the extent_io_tree which was set in extent_writepages. Since
      it is guaranteed that page->mapping of any page passed to
      writepage_delalloc (and __extent_writepage as the sole caller) to be
      equal to that passed in extent_writepages we can directly get the
      io_tree via the already passed inode (which is also taken from
      page->mapping->host). No functional changes.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8cc0237a
    • N
      btrfs: Move epd::extent_locked check to writepage_delalloc's caller · 7789a55a
      Nikolay Borisov 提交于
      If epd::extent_locked is set then writepage_delalloc terminates. Make
      this a bit more apparent in the caller by simply bubbling the check up.
      This enables to remove epd as an argument to writepage_delalloc in a
      future patch. No functional change.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7789a55a
    • N
      btrfs: Check for missing device before bio submission in btrfs_map_bio · fc8a168a
      Nikolay Borisov 提交于
      Before btrfs_map_bio submits all stripe bios it does a number of checks
      to ensure the device for every stripe is present. However, it doesn't do
      a DEV_STATE_MISSING check, instead this is relegated to the lower level
      btrfs_schedule_bio (in the async submission case, sync submission
      doesn't check DEV_STATE_MISSING at all). Additionally
      btrfs_schedule_bios does the duplicate device->bdev check which has
      already been performed in btrfs_map_bio.
      
      This patch moves the DEV_STATE_MISSING check in btrfs_map_bio and
      removes the duplicate device->bdev check. Doing so ensures that no bio
      cloning/submission happens for both async/sync requests in the face of
      missing device. This makes the async io submission path slightly shorter
      in terms of instruction count. No functional changes.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fc8a168a
    • A
      btrfs: remove redundant replace_state init · ab457246
      Anand Jain 提交于
      dev_replace::replace_state has been set to
      BTRFS_DEV_REPLACE_ITEM_STATE_NEVER_STARTED (0) in the same function,
      So delete the line which sets replace_state = 0;
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ab457246
    • F
      Btrfs: remove no longer used io_err from btrfs_log_ctx · 6d4cbf79
      Filipe Manana 提交于
      The io_err field of struct btrfs_log_ctx is no longer used after the
      recent simplification of the fast fsync path, where we now wait for
      ordered extents to complete before logging the inode. We did this in
      commit b5e6c3e1 ("btrfs: always wait on ordered extents at fsync
      time") and commit a2120a47 ("btrfs: clean up the left over
      logged_list usage") removed its last use.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6d4cbf79
    • F
      Btrfs: simpler and more efficient cleanup of a log tree's extent io tree · 59b0713a
      Filipe Manana 提交于
      We currently are in a loop finding each range (corresponding to a btree
      node/leaf) in a log root's extent io tree and then clean it up. This is a
      waste of time since we are traversing the extent io tree's rb_tree more
      times then needed (one for a range lookup and another for cleaning it up)
      without any good reason.
      
      We free the log trees when we are in the critical section of a transaction
      commit (the transaction state is set to TRANS_STATE_COMMIT_DOING), so it's
      of great convenience to do everything as fast as possible in order to
      reduce the time we block other tasks from starting a new transaction.
      
      So fix this by traversing the extent io tree once and cleaning up all its
      records in one go while traversing it.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      59b0713a
    • N
      btrfs: Adjust loop in free_extent_buffer · 46cc775e
      Nikolay Borisov 提交于
      The loop construct in free_extent_buffer was added in
      242e18c7 ("Btrfs: reduce lock contention on extent buffer locks")
      as means of reducing the times the eb lock is taken, the non-last ref
      count is decremented and lock is released. As the special handling
      of UNMAPPED extent buffers was removed now there is only one decrement
      op which is happening for EXTENT_BUFFER_UNMAPPED case.
      
      This commit modifies the loop condition so that in case of UNMAPPED
      buffers the eb's lock is taken only if we are 100% sure the eb is going
      to be freed by the current executor of the code. Additionally, remove
      superfluous ref count ops in btrfs test.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      46cc775e
    • N
      btrfs: Remove special handling of EXTENT_BUFFER_UNMAPPED while freeing · 9cfc8ba7
      Nikolay Borisov 提交于
      Now that the whole of btrfs code has been audited for eb reference count
      management it's time to remove the hunk in free_extent_buffer that
      essentially considered the condition
      
        "eb->ref == 2 && EXTENT_BUFFER_DUMMY"
      
      to equal "eb->ref = 1". Also remove the last location
      which takes an extra reference count in alloc_test_extent_buffer.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9cfc8ba7
    • N
      btrfs: Remove unnecessary tree locking code in qgroup_rescan_leaf · df449714
      Nikolay Borisov 提交于
      In qgroup_rescan_leaf a copy is made of the target leaf by calling
      btrfs_clone_extent_buffer. The latter allocates a new buffer and
      attaches a new set of pages and copies the content of the source buffer.
      The new scratch buffer is only used to iterate it's items, it's not
      published anywhere and cannot be accessed by a third party.
      
      Hence, it's not necessary to perform any locking on it whatsoever.
      Furthermore, remove the extra extent_buffer_get call since the new
      buffer is always allocated with a reference count of 1 which is
      sufficient here.  No functional changes.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      df449714
    • N
      btrfs: Remove extra reference count bumps in btrfs_compare_trees · 8c7eeb65
      Nikolay Borisov 提交于
      When the 2 comparison trees roots are initialised they are private to
      the function and already have reference counts of 1 each. There is no
      need to further increment the reference count since the cloned buffers
      are already accessed via struct btrfs_path. Eventually the 2 paths used
      for comparison are going to be released, effectively disposing of the
      cloned buffers.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8c7eeb65
    • N
      btrfs: Remove extraneous extent_buffer_get from tree_mod_log_rewind · 24cee18a
      Nikolay Borisov 提交于
      When a rewound buffer is created it already has a ref count of 1 and the
      dummy flag set. Then another ref is taken bumping the count to 2.
      Finally when this buffer is released from btrfs_release_path the extra
      reference is decremented by the special handling code in
      free_extent_buffer.
      
      However, this special code is in fact redundant sinca ref count of 1 is
      still correct since the buffer is only accessed via btrfs_path struct.
      This paves the way forward of removing the special handling in
      free_extent_buffer.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      24cee18a