1. 19 8月, 2015 12 次提交
    • B
      xfs: add helper to conditionally remove items from the AIL · 146e54b7
      Brian Foster 提交于
      Several areas of code duplicate a pattern where we take the AIL lock,
      check whether an item is in the AIL and remove it if so. Create a new
      helper for this pattern and use it where appropriate.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      146e54b7
    • B
      xfs: fix btree cursor error cleanups · f307080a
      Brian Foster 提交于
      The btree cursor cleanup function takes an error parameter that
      affects how buffers are released from the cursor. All buffers are
      released in the event of error. Several callers do not specify the
      XFS_BTREE_ERROR flag in the event of error, however. This can cause
      buffers to hang around locked or with an elevated hold count and
      thus lead to umount hangs in the event of errors.
      
      Fix up the xfs_btree_del_cursor() callers to pass XFS_BTREE_ERROR if
      the cursor is being torn down due to error.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      f307080a
    • B
      xfs: clean up root inode properly on mount failure · 0ae120f8
      Brian Foster 提交于
      The root inode is read as part of the xfs_mountfs() sequence and the
      reference is dropped in the event of failure after we grab the
      inode.  The reference drop doesn't necessarily free the inode,
      however. It marks it for reclaim and potentially kicks off the
      reclaim workqueue.  The workqueue is destroyed further up the error
      path, which means we are subject to crash if the workqueue job runs
      after this point or a memory leak which is identified if the
      xfs_inode_zone is destroyed (e.g., on module removal). Both of these
      outcomes are reproducible via manual instrumentation of a mount
      error after the root inode xfs_iget() call in xfs_mountfs().
      
      Update the xfs_mountfs() error path to cancel any potential reclaim
      work items and to run a synchronous inode reclaim if the root inode
      is marked for reclaim. This ensures that no jobs remain on the queue
      before it is destroyed and that the root inode is freed before the
      reclaim mechanism is torn down.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      0ae120f8
    • B
      xfs: checksum log record ext headers based on record size · a3f20014
      Brian Foster 提交于
      The first 4 bytes of every basic block in the physical log is stamped
      with the current lsn. To support this mechanism, the log record header
      (first block of each new log record) contains space for the original
      first byte of each log record block before it is replaced with the lsn.
      The log record header has space for 32k worth of blocks. The version 2
      log adds new extended record headers for each additional 32k worth of
      blocks beyond what is supported by the record header.
      
      The log record checksum incorporates the log record header, the extended
      headers and the record payload. xlog_cksum() checksums the extended
      headers based on log->l_iclog_heads, which specifies the number of
      extended headers in a log record based on the log buffer size mount
      option. The log buffer size is variable, however, and thus means the
      checksum can be calculated differently based on how a filesystem is
      mounted. This is problematic if a filesystem crashes and recovery occurs
      on a subsequent mount using a different log buffer size. For example,
      crash an active filesystem that is mounted with the default (32k)
      logbsize, attempt remount/recovery using '-o logbsize=64k' and the mount
      fails on or warns about log checksum failures.
      
      To avoid this problem, update xlog_cksum() to calculate the checksum
      based on the size of the log buffer according to the log record. The
      size is already included in the h_size field of the log record header
      and thus is available at log recovery time. Extended log record headers
      are also only written when the log record is large enough to require
      them. This makes checksum calculation of log records consistent with the
      extended record header mechanism as well as how on-disk records are
      checksummed with various log buffer size mount options.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      a3f20014
    • B
      xfs: fix broken icreate log item cancellation · fc0d1656
      Brian Foster 提交于
      Inode cluster buffers are invalidated and cancelled when inode chunks
      are freed to notify log recovery that previous logged updates to the
      metadata buffer should be skipped. This ensures that log recovery does
      not overwrite buffers that might have already been reused.
      
      On v4 filesystems, inode chunk allocation and inode updates are logged
      via the cluster buffers and thus cancellation is easily detected via
      buffer cancellation items. v5 filesystems use the new icreate
      transaction, which uses logical logging and ordered buffers to log a
      full inode chunk allocation at once. The resulting icreate item often
      spans multiple inode cluster buffers.
      
      Log recovery checks for cancelled buffers when processing icreate log
      items, but it has a couple problems. First, it uses the full length of
      the inode chunk rather than the cluster size. Second, it uses the length
      in FSB units rather than BB units. Either of these problems prevent
      icreate recovery from identifying cancelled buffers and thus inode
      initialization proceeds unconditionally.
      
      Update xlog_recover_do_icreate_pass2() to iterate the icreate range in
      cluster sized increments and check each increment for cancellation.
      Since icreate is currently only used for the minimum atomic inode chunk
      allocation, we expect that either all or none of the buffers will be
      cancelled. Cancel the icreate if at least one buffer is cancelled to
      avoid making a bad situation worse by initializing a partial inode
      chunk, but detect such anomalies and warn the user.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      fc0d1656
    • B
      xfs: icreate log item recovery and cancellation tracepoints · 78d57e45
      Brian Foster 提交于
      Various log items have recovery tracepoints to identify whether a
      particular log item is recovered or cancelled. Add the equivalent
      tracepoints for the icreate transaction.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      78d57e45
    • B
      xfs: don't leave EFIs on AIL on mount failure · f0b2efad
      Brian Foster 提交于
      Log recovery occurs in two phases at mount time. In the first phase,
      EFIs and EFDs are processed and potentially cancelled out. EFIs without
      EFD objects are inserted into the AIL for processing and recovery in the
      second phase. xfs_mountfs() runs various other operations between the
      phases and is thus subject to failure. If failure occurs after the first
      phase but before the second, pending EFIs sit on the AIL, pin it and
      cause the mount to hang.
      
      Update the mount sequence to ensure that pending EFIs are cancelled in
      the event of failure. Add a recovery cancellation mechanism to iterate
      the AIL and cancel all EFI items when requested. Plumb cancellation
      support through the log mount finish helper and update xfs_mountfs() to
      invoke cancellation in the event of failure after recovery has started.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      f0b2efad
    • B
      xfs: use EFI refcount consistently in log recovery · e32a1d1f
      Brian Foster 提交于
      The EFI is initialized with a reference count of 2. One for the EFI to
      ensure the item makes it to the AIL and one for the subsequently created
      EFD to release the EFI once the EFD is committed. Log recovery uses the
      EFI in a similar manner, but implements a hack to remove both references
      in one call once the EFD is handled.
      
      Update log recovery to use EFI reference counting in a manner consistent
      with the log. When an EFI is encountered during recovery, an EFI item is
      allocated and inserted to the AIL directly. Since the EFI reference is
      typically dropped when the EFI is unpinned and this is analogous with
      AIL insertion, drop the EFI reference at this point.
      
      When a corresponding EFD is encountered in the log, this indicates that
      the extents were freed, no processing is required and the EFI can be
      dropped. Update xlog_recover_efd_pass2() to simply drop the EFD
      reference at this point rather than open code the AIL removal and EFI
      free.
      
      Remaining EFIs (i.e., with no corresponding EFD) are processed in
      xlog_recover_finish(). An EFD transaction is allocated and the extents
      are freed, which transfers ownership of the EFI reference to the EFD
      item in the log.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      e32a1d1f
    • B
      xfs: ensure EFD trans aborts on log recovery extent free failure · 6bc43af3
      Brian Foster 提交于
      Log recovery attempts to free extents with leftover EFIs in the AIL
      after initial processing. If the extent free fails (e.g., due to
      unrelated fs corruption), the transaction is cancelled, though it
      might not be dirtied at the time. If this is the case, the EFD does
      not abort and thus does not release the EFI. This can lead to hangs
      as the EFI pins the AIL.
      
      Update xlog_recover_process_efi() to log the EFD in the transaction
      before xfs_free_extent() errors are handled to ensure the
      transaction is dirty, aborts the EFD and releases the EFI on error.
      Since this is a requirement for EFD processing (and consistent with
      xfs_bmap_finish()), update the EFD logging helper to do the extent
      free and unconditionally log the EFD. This encodes the required EFD
      logging behavior into the helper and reduces the likelihood of
      errors down the road.
      
      [dchinner: re-add xfs_alloc.h to xfs_log_recover.c to fix build
       failure.]
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      6bc43af3
    • B
      xfs: fix efi/efd error handling to avoid fs shutdown hangs · 8d99fe92
      Brian Foster 提交于
      Freeing an extent in XFS involves logging an EFI (extent free
      intention), freeing the actual extent, and logging an EFD (extent
      free done). The EFI object is created with a reference count of 2:
      one for the current transaction and one for the subsequently created
      EFD. Under normal circumstances, the first reference is dropped when
      the EFI is unpinned and the second reference is dropped when the EFD
      is committed to the on-disk log.
      
      In event of errors or filesystem shutdown, there are various
      potential cleanup scenarios depending on the state of the EFI/EFD.
      The cleanup scenarios are confusing and racy, as demonstrated by the
      following test sequence:
      
      	# mount $dev $mnt
      	# fsstress -d $mnt -n 99999 -p 16 -z -f fallocate=1 \
      		-f punch=1 -f creat=1 -f unlink=1 &
      	# sleep 5
      	# killall -9 fsstress; wait
      	# godown -f $mnt
      	# umount
      
      ... in which the final umount can hang due to the AIL being pinned
      indefinitely by one or more EFI items. This can occur due to several
      conditions. For example, if the shutdown occurs after the EFI is
      committed to the on-disk log and the EFD committed to the CIL, but
      before the EFD committed to the log, the EFD iop_committed() abort
      handler does not drop its reference to the EFI. Alternatively,
      manual error injection in the xfs_bmap_finish() codepath shows that
      if an error occurs after the EFI transaction is committed but before
      the EFD is constructed and logged, the EFI is never released from
      the AIL.
      
      Update the EFI/EFD item handling code to use a more straightforward
      and reliable approach to error handling. If an error occurs after
      the EFI transaction is committed and before the EFD is constructed,
      release the EFI explicitly from xfs_bmap_finish(). If the EFI
      transaction is cancelled, release the EFI in the unlock handler.
      
      Once the EFD is constructed, it is responsible for releasing the EFI
      under any circumstances (including whether the EFI item aborts due
      to log I/O error). Update the EFD item handlers to release the EFI
      if the transaction is cancelled or aborts due to log I/O error.
      Finally, update xfs_bmap_finish() to log at least one EFD extent to
      the transaction before xfs_free_extent() errors are handled to
      ensure the transaction is dirty and EFD item error handling is
      triggered.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      8d99fe92
    • B
      xfs: return committed status from xfs_trans_roll() · d43ac29b
      Brian Foster 提交于
      Some callers need to make error handling decisions based on whether
      the current transaction successfully committed or not. Rename
      xfs_trans_roll(), add a new parameter and provide a wrapper to
      preserve existing callers.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      d43ac29b
    • B
      xfs: disentagle EFI release from the extent count · 5e4b5386
      Brian Foster 提交于
      Release of the EFI either occurs based on the reference count or the
      extent count. The extent count used is either the count tracked in
      the EFI or EFD, depending on the particular situation. In either
      case, the count is initialized to the final value and thus always
      matches the current efi_next_extent value once the EFI is completely
      constructed.  For example, the EFI extent count is increased as the
      extents are logged in xfs_bmap_finish() and the full free list is
      always completely processed. Therefore, the count is guaranteed to
      be complete once the EFI transaction is committed. The EFD uses the
      efd_nextents counter to release the EFI. This counter is initialized
      to the count of the EFI when the EFD is created. Thus the EFD, as
      currently used, has no concept of partial EFI release based on
      extent count.
      
      Given that the EFI extent count is always released in whole, use of
      the extent count for reference counting is unnecessary. Remove this
      level of the API and release the EFI based on the core reference
      count. The efi_next_extent counter remains because it is still used
      to track the slot to log the next extent to free.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      5e4b5386
  2. 12 7月, 2015 3 次提交
    • A
      freeing unlinked file indefinitely delayed · 75a6f82a
      Al Viro 提交于
      	Normally opening a file, unlinking it and then closing will have
      the inode freed upon close() (provided that it's not otherwise busy and
      has no remaining links, of course).  However, there's one case where that
      does *not* happen.  Namely, if you open it by fhandle with cold dcache,
      then unlink() and close().
      
      	In normal case you get d_delete() in unlink(2) notice that dentry
      is busy and unhash it; on the final dput() it will be forcibly evicted from
      dcache, triggering iput() and inode removal.  In this case, though, we end
      up with *two* dentries - disconnected (created by open-by-fhandle) and
      regular one (used by unlink()).  The latter will have its reference to inode
      dropped just fine, but the former will not - it's considered hashed (it
      is on the ->s_anon list), so it will stay around until the memory pressure
      will finally do it in.  As the result, we have the final iput() delayed
      indefinitely.  It's trivial to reproduce -
      
      void flush_dcache(void)
      {
              system("mount -o remount,rw /");
      }
      
      static char buf[20 * 1024 * 1024];
      
      main()
      {
              int fd;
              union {
                      struct file_handle f;
                      char buf[MAX_HANDLE_SZ];
              } x;
              int m;
      
              x.f.handle_bytes = sizeof(x);
              chdir("/root");
              mkdir("foo", 0700);
              fd = open("foo/bar", O_CREAT | O_RDWR, 0600);
              close(fd);
              name_to_handle_at(AT_FDCWD, "foo/bar", &x.f, &m, 0);
              flush_dcache();
              fd = open_by_handle_at(AT_FDCWD, &x.f, O_RDWR);
              unlink("foo/bar");
              write(fd, buf, sizeof(buf));
              system("df .");			/* 20Mb eaten */
              close(fd);
              system("df .");			/* should've freed those 20Mb */
              flush_dcache();
              system("df .");			/* should be the same as #2 */
      }
      
      will spit out something like
      Filesystem     1K-blocks   Used Available Use% Mounted on
      /dev/root         322023 303843      1131 100% /
      Filesystem     1K-blocks   Used Available Use% Mounted on
      /dev/root         322023 303843      1131 100% /
      Filesystem     1K-blocks   Used Available Use% Mounted on
      /dev/root         322023 283282     21692  93% /
      - inode gets freed only when dentry is finally evicted (here we trigger
      than by remount; normally it would've happened in response to memory
      pressure hell knows when).
      
      Cc: stable@vger.kernel.org # v2.6.38+; earlier ones need s/kill_it/unhash_it/
      Acked-by: NJ. Bruce Fields <bfields@fieldses.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      75a6f82a
    • A
      fix a braino in ovl_d_select_inode() · 9391dd00
      Al Viro 提交于
      when opening a directory we want the overlayfs inode, not one from
      the topmost layer.
      Reported-By: NAndrey Jr. Melnikov <temnota.am@gmail.com>
      Tested-By: NAndrey Jr. Melnikov <temnota.am@gmail.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      9391dd00
    • A
      9p: don't leave a half-initialized inode sitting around · 0a73d0a2
      Al Viro 提交于
      Cc: stable@vger.kernel.org # all branches
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      0a73d0a2
  3. 10 7月, 2015 5 次提交
  4. 06 7月, 2015 1 次提交
  5. 05 7月, 2015 3 次提交
  6. 04 7月, 2015 4 次提交
    • N
      sched/stat: Expose /proc/pid/schedstat if CONFIG_SCHED_INFO=y · 5968cece
      Naveen N. Rao 提交于
      Expand /proc/pid/schedstat output:
      
       - enable it on CONFIG_TASK_DELAY_ACCT=y && !CONFIG_SCHEDSTATS kernels.
      
       - dump all zeroes on kernels that are booted with the 'nodelayacct'
         option, which boot option disables delay accounting on
         CONFIG_TASK_DELAY_ACCT=y kernels.
      Signed-off-by: NNaveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: a.p.zijlstra@chello.nl
      Cc: ricklind@us.ibm.com
      Link: http://lkml.kernel.org/r/5ccbef17d4bc841084ea6e6421d4e4a23b7b806f.1435654789.git.naveen.n.rao@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      5968cece
    • E
      ext4: correctly migrate a file with a hole at the beginning · 8974fec7
      Eryu Guan 提交于
      Currently ext4_ind_migrate() doesn't correctly handle a file which
      contains a hole at the beginning of the file.  This caused the migration
      to be done incorrectly, and then if there is a subsequent following
      delayed allocation write to the "hole", this would reclaim the same data
      blocks again and results in fs corruption.
      
        # assmuing 4k block size ext4, with delalloc enabled
        # skip the first block and write to the second block
        xfs_io -fc "pwrite 4k 4k" -c "fsync" /mnt/ext4/testfile
      
        # converting to indirect-mapped file, which would move the data blocks
        # to the beginning of the file, but extent status cache still marks
        # that region as a hole
        chattr -e /mnt/ext4/testfile
      
        # delayed allocation writes to the "hole", reclaim the same data block
        # again, results in i_blocks corruption
        xfs_io -c "pwrite 0 4k" /mnt/ext4/testfile
        umount /mnt/ext4
        e2fsck -nf /dev/sda6
        ...
        Inode 53, i_blocks is 16, should be 8.  Fix? no
        ...
      Signed-off-by: NEryu Guan <guaneryu@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      8974fec7
    • E
      ext4: be more strict when migrating to non-extent based file · d6f123a9
      Eryu Guan 提交于
      Currently the check in ext4_ind_migrate() is not enough before doing the
      real conversion:
      
      a) delayed allocated extents could bypass the check on eh->eh_entries
         and eh->eh_depth
      
      This can be demonstrated by this script
      
        xfs_io -fc "pwrite 0 4k" -c "pwrite 8k 4k" /mnt/ext4/testfile
        chattr -e /mnt/ext4/testfile
      
      where testfile has two extents but still be converted to non-extent
      based file format.
      
      b) only extent length is checked but not the offset, which would result
         in data lose (delalloc) or fs corruption (nodelalloc), because
         non-extent based file only supports at most (12 + 2^10 + 2^20 + 2^30)
         blocks
      
      This can be demostrated by
      
        xfs_io -fc "pwrite 5T 4k" /mnt/ext4/testfile
        chattr -e /mnt/ext4/testfile
        sync
      
      If delalloc is enabled, dmesg prints
        EXT4-fs warning (device dm-4): ext4_block_to_path:105: block 1342177280 > max in inode 53
        EXT4-fs (dm-4): Delayed block allocation failed for inode 53 at logical offset 1342177280 with max blocks 1 with error 5
        EXT4-fs (dm-4): This should not happen!! Data will be lost
      
      If delalloc is disabled, e2fsck -nf shows corruption
        Inode 53, i_size is 5497558142976, should be 4096.  Fix? no
      
      Fix the two issues by
      
      a) forcing all delayed allocation blocks to be allocated before checking
         eh->eh_depth and eh->eh_entries
      b) limiting the last logical block of the extent is within direct map
      Signed-off-by: NEryu Guan <guaneryu@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      d6f123a9
    • L
      ext4: fix reservation release on invalidatepage for delalloc fs · 9705acd6
      Lukas Czerner 提交于
      On delalloc enabled file system on invalidatepage operation
      in ext4_da_page_release_reservation() we want to clear the delayed
      buffer and remove the extent covering the delayed buffer from the extent
      status tree.
      
      However currently there is a bug where on the systems with page size >
      block size we will always remove extents from the start of the page
      regardless where the actual delayed buffers are positioned in the page.
      This leads to the errors like this:
      
      EXT4-fs warning (device loop0): ext4_da_release_space:1225:
      ext4_da_release_space: ino 13, to_free 1 with only 0 reserved data
      blocks
      
      This however can cause data loss on writeback time if the file system is
      in ENOSPC condition because we're releasing reservation for someones
      else delayed buffer.
      
      Fix this by only removing extents that corresponds to the part of the
      page we want to invalidate.
      
      This problem is reproducible by the following fio receipt (however I was
      only able to reproduce it with fio-2.1 or older.
      
      [global]
      bs=8k
      iodepth=1024
      iodepth_batch=60
      randrepeat=1
      size=1m
      directory=/mnt/test
      numjobs=20
      [job1]
      ioengine=sync
      bs=1k
      direct=1
      rw=randread
      filename=file1:file2
      [job2]
      ioengine=libaio
      rw=randwrite
      direct=1
      filename=file1:file2
      [job3]
      bs=1k
      ioengine=posixaio
      rw=randwrite
      direct=1
      filename=file1:file2
      [job5]
      bs=1k
      ioengine=sync
      rw=randread
      filename=file1:file2
      [job7]
      ioengine=libaio
      rw=randwrite
      filename=file1:file2
      [job8]
      ioengine=posixaio
      rw=randwrite
      filename=file1:file2
      [job10]
      ioengine=mmap
      rw=randwrite
      bs=1k
      filename=file1:file2
      [job11]
      ioengine=mmap
      rw=randwrite
      direct=1
      filename=file1:file2
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: stable@vger.kernel.org
      9705acd6
  7. 02 7月, 2015 12 次提交
    • N
      ext4: avoid deadlocks in the writeback path by using sb_getblk_gfp · c45653c3
      Nikolay Borisov 提交于
      Switch ext4 to using sb_getblk_gfp with GFP_NOFS added to fix possible
      deadlocks in the page writeback path.
      Signed-off-by: NNikolay Borisov <kernel@kyup.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      c45653c3
    • T
      ext4: fix fencepost error in lazytime optimization · 0f0ff9a9
      Theodore Ts'o 提交于
      Commit 8f4d8558: "ext4: fix lazytime optimization" was not a
      complete fix.  In the case where the inode number is a multiple of 16,
      and we could still end up updating an inode with dirty timestamps
      written to the wrong inode on disk.  Oops.
      
      This can be easily reproduced by using generic/005 with a file system
      with metadata_csum and lazytime enabled.
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      0f0ff9a9
    • S
      Btrfs: fix wrong check for btrfs_force_chunk_alloc() · 9689457b
      Shilong Wang 提交于
      btrfs_force_chunk_alloc() return 1 for allocation chunk successfully.
      This problem exists since commit c87f08ca.
      
      With this patch, we might fix some enospc problems for balances.
      Signed-off-by: NWang Shilong <wangshilong1991@gmail.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Tested-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      9689457b
    • L
      Btrfs: fix warning of bytes_may_use · ddba1bfc
      Liu Bo 提交于
      While running generic/019, dmesg got several warnings from
      btrfs_free_reserved_data_space().
      
      Test generic/019 produces some disk failures so sumbit dio will get errors,
      in which case, btrfs_direct_IO() goes to the error handling and free
      bytes_may_use, but the problem is that bytes_may_use has been free'd
      during get_block().
      
      This adds a runtime flag to show if we've gone through get_block(), if so,
      don't do the cleanup work.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Tested-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      ddba1bfc
    • L
      Btrfs: fix hang when failing to submit bio of directIO · ad9ee205
      Liu Bo 提交于
      The hang is uncoverd by generic/019.
      
      btrfs_endio_direct_write() skips the "finish_ordered_fn" part when it hits
      an error, thus those added ordered extents will never get processed, which
      block processes that waiting for them via btrfs_start_ordered_extent().
      
      This fixes the above, and meanwhile finish_ordered_fn will do the space
      accounting work.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Tested-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      ad9ee205
    • F
      Btrfs: fix a comment in inode.c:evict_inode_truncate_pages() · 9c6429d9
      Filipe Manana 提交于
      The comment was not correct about the part where it says the endio
      callback of the bio might have not yet been called - update it
      to mention that by that time the endio callback execution might
      still be in progress only.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      9c6429d9
    • F
      Btrfs: fix memory corruption on failure to submit bio for direct IO · 61de718f
      Filipe Manana 提交于
      If we fail to submit a bio for a direct IO request, we were grabbing the
      corresponding ordered extent and decrementing its reference count twice,
      once for our lookup reference and once for the ordered tree reference.
      This was a problem because it caused the ordered extent to be freed
      without removing it from the ordered tree and any lists it might be
      attached to, leaving dangling pointers to the ordered extent around.
      Example trace with CONFIG_DEBUG_PAGEALLOC=y:
      
      [161779.858707] BUG: unable to handle kernel paging request at 0000000087654330
      [161779.859983] IP: [<ffffffff8124ca68>] rb_prev+0x22/0x3b
      [161779.860636] PGD 34d818067 PUD 0
      [161779.860636] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      (...)
      [161779.860636] Call Trace:
      [161779.860636]  [<ffffffffa06b36a6>] __tree_search+0xd9/0xf9 [btrfs]
      [161779.860636]  [<ffffffffa06b3708>] tree_search+0x42/0x63 [btrfs]
      [161779.860636]  [<ffffffffa06b4868>] ? btrfs_lookup_ordered_range+0x2d/0xa5 [btrfs]
      [161779.860636]  [<ffffffffa06b4873>] btrfs_lookup_ordered_range+0x38/0xa5 [btrfs]
      [161779.860636]  [<ffffffffa06aab8e>] btrfs_get_blocks_direct+0x11b/0x615 [btrfs]
      [161779.860636]  [<ffffffff8119727f>] do_blockdev_direct_IO+0x5ff/0xb43
      [161779.860636]  [<ffffffffa06aaa73>] ? btrfs_page_exists_in_range+0x1ad/0x1ad [btrfs]
      [161779.860636]  [<ffffffffa06a2c9a>] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
      [161779.860636]  [<ffffffff811977f5>] __blockdev_direct_IO+0x32/0x34
      [161779.860636]  [<ffffffffa06a2c9a>] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
      [161779.860636]  [<ffffffffa06a10ae>] btrfs_direct_IO+0x198/0x21f [btrfs]
      [161779.860636]  [<ffffffffa06a2c9a>] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
      [161779.860636]  [<ffffffff81112ca1>] generic_file_direct_write+0xb3/0x128
      [161779.860636]  [<ffffffffa06affaa>] ? btrfs_file_write_iter+0x15f/0x3e0 [btrfs]
      [161779.860636]  [<ffffffffa06b004c>] btrfs_file_write_iter+0x201/0x3e0 [btrfs]
      (...)
      
      We were also not freeing the btrfs_dio_private we allocated previously,
      which kmemleak reported with the following trace in its sysfs file:
      
      unreferenced object 0xffff8803f553bf80 (size 96):
        comm "xfs_io", pid 4501, jiffies 4295039588 (age 173.936s)
        hex dump (first 32 bytes):
          88 6c 9b f5 02 88 ff ff 00 00 00 00 00 00 00 00  .l..............
          00 00 00 00 00 00 00 00 00 00 c4 00 00 00 00 00  ................
        backtrace:
          [<ffffffff81161ffe>] create_object+0x172/0x29a
          [<ffffffff8145870f>] kmemleak_alloc+0x25/0x41
          [<ffffffff81154e64>] kmemleak_alloc_recursive.constprop.40+0x16/0x18
          [<ffffffff811579ed>] kmem_cache_alloc_trace+0xfb/0x148
          [<ffffffffa03d8cff>] btrfs_submit_direct+0x65/0x16a [btrfs]
          [<ffffffff811968dc>] dio_bio_submit+0x62/0x8f
          [<ffffffff811975fe>] do_blockdev_direct_IO+0x97e/0xb43
          [<ffffffff811977f5>] __blockdev_direct_IO+0x32/0x34
          [<ffffffffa03d70ae>] btrfs_direct_IO+0x198/0x21f [btrfs]
          [<ffffffff81112ca1>] generic_file_direct_write+0xb3/0x128
          [<ffffffffa03e604d>] btrfs_file_write_iter+0x201/0x3e0 [btrfs]
          [<ffffffff8116586a>] __vfs_write+0x7c/0xa5
          [<ffffffff81165da9>] vfs_write+0xa0/0xe4
          [<ffffffff81166675>] SyS_pwrite64+0x64/0x82
          [<ffffffff81464fd7>] system_call_fastpath+0x12/0x6f
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      For read requests we weren't doing any cleanup either (none of the work
      done by btrfs_endio_direct_read()), so a failure submitting a bio for a
      read request would leave a range in the inode's io_tree locked forever,
      blocking any future operations (both reads and writes) against that range.
      
      So fix this by making sure we do the same cleanup that we do for the case
      where the bio submission succeeds.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      61de718f
    • M
      btrfs: don't update mtime/ctime on deduped inodes · 1c919a5e
      Mark Fasheh 提交于
      One issue users have reported is that dedupe changes mtime on files,
      resulting in tools like rsync thinking that their contents have changed when
      in fact the data is exactly the same. We also skip the ctime update as no
      user-visible metadata changes here and we want dedupe to be transparent to
      the user.
      
      Clone still wants time changes, so we special case this in the code.
      
      This was tested with the btrfs-extent-same tool.
      Signed-off-by: NMark Fasheh <mfasheh@suse.de>
      Signed-off-by: NChris Mason <clm@fb.com>
      1c919a5e
    • M
      btrfs: allow dedupe of same inode · 0efa9f48
      Mark Fasheh 提交于
      clone() supports cloning within an inode so extent-same can do
      the same now. This patch fixes up the locking in extent-same to
      know about the single-inode case. In addition to that, we add a
      check for overlapping ranges, which clone does not allow.
      Signed-off-by: NMark Fasheh <mfasheh@suse.de>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      0efa9f48
    • M
      btrfs: fix deadlock with extent-same and readpage · f4414602
      Mark Fasheh 提交于
      ->readpage() does page_lock() before extent_lock(), we do the opposite in
      extent-same. We want to reverse the order in btrfs_extent_same() but it's
      not quite straightforward since the page locks are taken inside btrfs_cmp_data().
      
      So I split btrfs_cmp_data() into 3 parts with a small context structure that
      is passed between them. The first, btrfs_cmp_data_prepare() gathers up the
      pages needed (taking page lock as required) and puts them on our context
      structure. At this point, we are safe to lock the extent range. Afterwards,
      we use btrfs_cmp_data() to do the data compare as usual and btrfs_cmp_data_free()
      to clean up our context.
      Signed-off-by: NMark Fasheh <mfasheh@suse.de>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      f4414602
    • M
      btrfs: pass unaligned length to btrfs_cmp_data() · 207910dd
      Mark Fasheh 提交于
      In the case that we dedupe the tail of a file, we might expand the dedupe
      len out to the end of our last block. We don't want to compare data past
      i_size however, so pass the original length to btrfs_cmp_data().
      Signed-off-by: NMark Fasheh <mfasheh@suse.de>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      207910dd
    • F
      Btrfs: fix fsync after truncate when no_holes feature is enabled · a89ca6f2
      Filipe Manana 提交于
      When we have the no_holes feature enabled, if a we truncate a file to a
      smaller size, truncate it again but to a size greater than or equals to
      its original size and fsync it, the log tree will not have any information
      about the hole covering the range [truncate_1_offset, new_file_size[.
      Which means if the fsync log is replayed, the file will remain with the
      state it had before both truncate operations.
      
      Without the no_holes feature this does not happen, since when the inode
      is logged (full sync flag is set) it will find in the fs/subvol tree a
      leaf with a generation matching the current transaction id that has an
      explicit extent item representing the hole.
      
      Fix this by adding an explicit extent item representing a hole between
      the last extent and the inode's i_size if we are doing a full sync.
      
      The issue is easy to reproduce with the following test case for fstests:
      
        . ./common/rc
        . ./common/filter
        . ./common/dmflakey
      
        _need_to_be_root
        _supported_fs generic
        _supported_os Linux
        _require_scratch
        _require_dm_flakey
      
        # This test was motivated by an issue found in btrfs when the btrfs
        # no-holes feature is enabled (introduced in kernel 3.14). So enable
        # the feature if the fs being tested is btrfs.
        if [ $FSTYP == "btrfs" ]; then
            _require_btrfs_fs_feature "no_holes"
            _require_btrfs_mkfs_feature "no-holes"
            MKFS_OPTIONS="$MKFS_OPTIONS -O no-holes"
        fi
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create our test files and make sure everything is durably persisted.
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 64K"         \
                        -c "pwrite -S 0xbb 64K 61K"       \
                        $SCRATCH_MNT/foo | _filter_xfs_io
        $XFS_IO_PROG -f -c "pwrite -S 0xee 0 64K"         \
                        -c "pwrite -S 0xff 64K 61K"       \
                        $SCRATCH_MNT/bar | _filter_xfs_io
        sync
      
        # Now truncate our file foo to a smaller size (64Kb) and then truncate
        # it to the size it had before the shrinking truncate (125Kb). Then
        # fsync our file. If a power failure happens after the fsync, we expect
        # our file to have a size of 125Kb, with the first 64Kb of data having
        # the value 0xaa and the second 61Kb of data having the value 0x00.
        $XFS_IO_PROG -c "truncate 64K" \
                     -c "truncate 125K" \
                     -c "fsync" \
                     $SCRATCH_MNT/foo
      
        # Do something similar to our file bar, but the first truncation sets
        # the file size to 0 and the second truncation expands the size to the
        # double of what it was initially.
        $XFS_IO_PROG -c "truncate 0" \
                     -c "truncate 253K" \
                     -c "fsync" \
                     $SCRATCH_MNT/bar
      
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        # Allow writes again, mount to trigger log replay and validate file
        # contents.
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        # We expect foo to have a size of 125Kb, the first 64Kb of data all
        # having the value 0xaa and the remaining 61Kb to be a hole (all bytes
        # with value 0x00).
        echo "File foo content after log replay:"
        od -t x1 $SCRATCH_MNT/foo
      
        # We expect bar to have a size of 253Kb and no extents (any byte read
        # from bar has the value 0x00).
        echo "File bar content after log replay:"
        od -t x1 $SCRATCH_MNT/bar
      
        status=0
        exit
      
      The expected file contents in the golden output are:
      
        File foo content after log replay:
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0200000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        *
        0372000
        File bar content after log replay:
        0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        *
        0772000
      
      Without this fix, their contents are:
      
        File foo content after log replay:
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0200000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
        *
        0372000
        File bar content after log replay:
        0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
        *
        0200000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
        *
        0372000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        *
        0772000
      
      A test case submission for fstests follows soon.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      a89ca6f2