1. 30 4月, 2019 27 次提交
  2. 25 4月, 2019 1 次提交
    • N
      btrfs: Switch memory allocations in async csum calculation path to kvmalloc · a3d46aea
      Nikolay Borisov 提交于
      Recent multi-page biovec rework allowed creation of bios that can span
      large regions - up to 128 megabytes in the case of btrfs. OTOH btrfs'
      submission path currently allocates a contiguous array to store the
      checksums for every bio submitted. This means we can request up to
      (128mb / BTRFS_SECTOR_SIZE) * 4 bytes + 32bytes of memory from kmalloc.
      On busy systems with possibly fragmented memory said kmalloc can fail
      which will trigger BUG_ON due to improper error handling IO submission
      context in btrfs.
      
      Until error handling is improved or bios in btrfs limited to a more
      manageable size (e.g. 1m) let's use kvmalloc to fallback to vmalloc for
      such large allocations. There is no hard requirement that the memory
      allocated for checksums during IO submission has to be contiguous, but
      this is a simple fix that does not require several non-contiguous
      allocations.
      
      For small writes this is unlikely to have any visible effect since
      kmalloc will still satisfy allocation requests as usual. For larger
      requests the code will just fallback to vmalloc.
      
      We've performed evaluation on several workload types and there was no
      significant difference kmalloc vs kvmalloc.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a3d46aea
  3. 04 4月, 2019 2 次提交
  4. 29 3月, 2019 1 次提交
  5. 21 3月, 2019 1 次提交
    • F
      Btrfs: fix assertion failure on fsync with NO_HOLES enabled · 0ccc3876
      Filipe Manana 提交于
      Back in commit a89ca6f2 ("Btrfs: fix fsync after truncate when
      no_holes feature is enabled") I added an assertion that is triggered when
      an inline extent is found to assert that the length of the (uncompressed)
      data the extent represents is the same as the i_size of the inode, since
      that is true most of the time I couldn't find or didn't remembered about
      any exception at that time. Later on the assertion was expanded twice to
      deal with a case of a compressed inline extent representing a range that
      matches the sector size followed by an expanding truncate, and another
      case where fallocate can update the i_size of the inode without adding
      or updating existing extents (if the fallocate range falls entirely within
      the first block of the file). These two expansion/fixes of the assertion
      were done by commit 7ed586d0 ("Btrfs: fix assertion on fsync of
      regular file when using no-holes feature") and commit 6399fb5a
      ("Btrfs: fix assertion failure during fsync in no-holes mode").
      These however missed the case where an falloc expands the i_size of an
      inode to exactly the sector size and inline extent exists, for example:
      
       $ mkfs.btrfs -f -O no-holes /dev/sdc
       $ mount /dev/sdc /mnt
      
       $ xfs_io -f -c "pwrite -S 0xab 0 1096" /mnt/foobar
       wrote 1096/1096 bytes at offset 0
       1 KiB, 1 ops; 0.0002 sec (4.448 MiB/sec and 4255.3191 ops/sec)
      
       $ xfs_io -c "falloc 1096 3000" /mnt/foobar
       $ xfs_io -c "fsync" /mnt/foobar
       Segmentation fault
      
       $ dmesg
       [701253.602385] assertion failed: len == i_size || (len == fs_info->sectorsize && btrfs_file_extent_compression(leaf, extent) != BTRFS_COMPRESS_NONE) || (len < i_size && i_size < fs_info->sectorsize), file: fs/btrfs/tree-log.c, line: 4727
       [701253.602962] ------------[ cut here ]------------
       [701253.603224] kernel BUG at fs/btrfs/ctree.h:3533!
       [701253.603503] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
       [701253.603774] CPU: 2 PID: 7192 Comm: xfs_io Tainted: G        W         5.0.0-rc8-btrfs-next-45 #1
       [701253.604054] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
       [701253.604650] RIP: 0010:assfail.constprop.23+0x18/0x1a [btrfs]
       (...)
       [701253.605591] RSP: 0018:ffffbb48c186bc48 EFLAGS: 00010286
       [701253.605914] RAX: 00000000000000de RBX: ffff921d0a7afc08 RCX: 0000000000000000
       [701253.606244] RDX: 0000000000000000 RSI: ffff921d36b16868 RDI: ffff921d36b16868
       [701253.606580] RBP: ffffbb48c186bcf0 R08: 0000000000000000 R09: 0000000000000000
       [701253.606913] R10: 0000000000000003 R11: 0000000000000000 R12: ffff921d05d2de18
       [701253.607247] R13: ffff921d03b54000 R14: 0000000000000448 R15: ffff921d059ecf80
       [701253.607769] FS:  00007f14da906700(0000) GS:ffff921d36b00000(0000) knlGS:0000000000000000
       [701253.608163] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [701253.608516] CR2: 000056087ea9f278 CR3: 00000002268e8001 CR4: 00000000003606e0
       [701253.608880] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       [701253.609250] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       [701253.609608] Call Trace:
       [701253.609994]  btrfs_log_inode+0xdfb/0xe40 [btrfs]
       [701253.610383]  btrfs_log_inode_parent+0x2be/0xa60 [btrfs]
       [701253.610770]  ? do_raw_spin_unlock+0x49/0xc0
       [701253.611150]  btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
       [701253.611537]  btrfs_sync_file+0x3b2/0x440 [btrfs]
       [701253.612010]  ? do_sysinfo+0xb0/0xf0
       [701253.612552]  do_fsync+0x38/0x60
       [701253.612988]  __x64_sys_fsync+0x10/0x20
       [701253.613360]  do_syscall_64+0x60/0x1b0
       [701253.613733]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
       [701253.614103] RIP: 0033:0x7f14da4e66d0
       (...)
       [701253.615250] RSP: 002b:00007fffa670fdb8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
       [701253.615647] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f14da4e66d0
       [701253.616047] RDX: 000056087ea9c260 RSI: 000056087ea9c260 RDI: 0000000000000003
       [701253.616450] RBP: 0000000000000001 R08: 0000000000000020 R09: 0000000000000010
       [701253.616854] R10: 000000000000009b R11: 0000000000000246 R12: 000056087ea9c260
       [701253.617257] R13: 000056087ea9c240 R14: 0000000000000000 R15: 000056087ea9dd10
       (...)
       [701253.619941] ---[ end trace e088d74f132b6da5 ]---
      
      Updating the assertion again to allow for this particular case would result
      in a meaningless assertion, plus there is currently no risk of logging
      content that would result in any corruption after a log replay if the size
      of the data encoded in an inline extent is greater than the inode's i_size
      (which is not currently possibe either with or without compression),
      therefore just remove the assertion.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0ccc3876
  6. 19 3月, 2019 3 次提交
  7. 14 3月, 2019 4 次提交
    • D
      btrfs: don't report readahead errors and don't update statistics · 0cc068e6
      David Sterba 提交于
      As readahead is an optimization, all errors are usually filtered out,
      but still properly handled when the real read call is done. The commit
      5e9d3982 ("btrfs: readpages() should submit IO as read-ahead") added
      REQ_RAHEAD to readpages() because that's only used for readahead
      (despite what one would expect from the callback name).
      
      This causes a flood of messages and inflated read error stats, so skip
      reporting in case it's readahead.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202403Reported-by: NLimeTech <tomm@lime-technology.com>
      Fixes: 5e9d3982 ("btrfs: readpages() should submit IO as read-ahead")
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0cc068e6
    • F
      Btrfs: fix file corruption after snapshotting due to mix of buffered/DIO writes · 609e804d
      Filipe Manana 提交于
      When we are mixing buffered writes with direct IO writes against the same
      file and snapshotting is happening concurrently, we can end up with a
      corrupt file content in the snapshot. Example:
      
      1) Inode/file is empty.
      
      2) Snapshotting starts.
      
      2) Buffered write at offset 0 length 256Kb. This updates the i_size of the
         inode to 256Kb, disk_i_size remains zero. This happens after the task
         doing the snapshot flushes all existing delalloc.
      
      3) DIO write at offset 256Kb length 768Kb. Once the ordered extent
         completes it sets the inode's disk_i_size to 1Mb (256Kb + 768Kb) and
         updates the inode item in the fs tree with a size of 1Mb (which is
         the value of disk_i_size).
      
      4) The dealloc for the range [0, 256Kb[ did not start yet.
      
      5) The transaction used in the DIO ordered extent completion, which updated
         the inode item, is committed by the snapshotting task.
      
      6) Snapshot creation completes.
      
      7) Dealloc for the range [0, 256Kb[ is flushed.
      
      After that when reading the file from the snapshot we always get zeroes for
      the range [0, 256Kb[, the file has a size of 1Mb and the data written by
      the direct IO write is found. From an application's point of view this is
      a corruption, since in the source subvolume it could never read a version
      of the file that included the data from the direct IO write without the
      data from the buffered write included as well. In the snapshot's tree,
      file extent items are missing for the range [0, 256Kb[.
      
      The issue, obviously, does not happen when using the -o flushoncommit
      mount option.
      
      Fix this by flushing delalloc for all the roots that are about to be
      snapshotted when committing a transaction. This guarantees total ordering
      when updating the disk_i_size of an inode since the flush for dealloc is
      done when a transaction is in the TRANS_STATE_COMMIT_START state and wait
      is done once no more external writers exist. This is similar to what we
      do when using the flushoncommit mount option, but we do it only if the
      transaction has snapshots to create and only for the roots of the
      subvolumes to be snapshotted. The bulk of the dealloc is flushed in the
      snapshot creation ioctl, so the flush work we do inside the transaction
      is minimized.
      
      This issue, involving buffered and direct IO writes with snapshotting, is
      often triggered by fstest btrfs/078, and got reported by fsck when not
      using the NO_HOLES features, for example:
      
        $ cat results/btrfs/078.full
        (...)
        _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
        *** fsck.btrfs output ***
        [1/7] checking root items
        [2/7] checking extents
        [3/7] checking free space cache
        [4/7] checking fs roots
        root 258 inode 264 errors 100, file extent discount
        Found file extent holes:
              start: 524288, len: 65536
        ERROR: errors found in fs roots
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      609e804d
    • J
      btrfs: remove WARN_ON in log_dir_items · 2cc83342
      Josef Bacik 提交于
      When Filipe added the recursive directory logging stuff in
      2f2ff0ee ("Btrfs: fix metadata inconsistencies after directory
      fsync") he specifically didn't take the directory i_mutex for the
      children directories that we need to log because of lockdep.  This is
      generally fine, but can lead to this WARN_ON() tripping if we happen to
      run delayed deletion's in between our first search and our second search
      of dir_item/dir_indexes for this directory.  We expect this to happen,
      so the WARN_ON() isn't necessary.  Drop the WARN_ON() and add a comment
      so we know why this case can happen.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2cc83342
    • F
      Btrfs: fix incorrect file size after shrinking truncate and fsync · bf504110
      Filipe Manana 提交于
      If we do a shrinking truncate against an inode which is already present
      in the respective log tree and then rename it, as part of logging the new
      name we end up logging an inode item that reflects the old size of the
      file (the one which we previously logged) and not the new smaller size.
      The decision to preserve the size previously logged was added by commit
      1a4bcf47 ("Btrfs: fix fsync data loss after adding hard link to
      inode") in order to avoid data loss after replaying the log. However that
      decision is only needed for the case the logged inode size is smaller then
      the current size of the inode, as explained in that commit's change log.
      If the current size of the inode is smaller then the previously logged
      size, we know a shrinking truncate happened and therefore need to use
      that smaller size.
      
      Example to trigger the problem:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ xfs_io -f -c "pwrite -S 0xab 0 8000" /mnt/foo
        $ xfs_io -c "fsync" /mnt/foo
        $ xfs_io -c "truncate 3000" /mnt/foo
      
        $ mv /mnt/foo /mnt/bar
        $ xfs_io -c "fsync" /mnt/bar
      
        <power failure>
      
        $ mount /dev/sdb /mnt
        $ od -t x1 -A d /mnt/bar
        0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
        *
        0008000
      
      Once we rename the file, we log its name (and inode item), and because
      the inode was already logged before in the current transaction, we log it
      with a size of 8000 bytes because that is the size we previously logged
      (with the first fsync). As part of the rename, besides logging the inode,
      we do also sync the log, which is done since commit d4682ba0
      ("Btrfs: sync log after logging new name"), so the next fsync against our
      inode is effectively a no-op, since no new changes happened since the
      rename operation. Even if did not sync the log during the rename
      operation, the same problem (fize size of 8000 bytes instead of 3000
      bytes) would be visible after replaying the log if the log ended up
      getting synced to disk through some other means, such as for example by
      fsyncing some other modified file. In the example above the fsync after
      the rename operation is there just because not every filesystem may
      guarantee logging/journalling the inode (and syncing the log/journal)
      during the rename operation, for example it is needed for f2fs, but not
      for ext4 and xfs.
      
      Fix this scenario by, when logging a new name (which is triggered by
      rename and link operations), using the current size of the inode instead
      of the previously logged inode size.
      
      A test case for fstests follows soon.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202695
      CC: stable@vger.kernel.org # 4.4+
      Reported-by: NSeulbae Kim <seulbae@gatech.edu>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bf504110
  8. 13 3月, 2019 1 次提交