1. 19 Aug 2015, 12 commits
    • xfs: flush entire file on dio read/write to cached file · 3d751af2
      Committed by Brian Foster
      Filesystems are responsible for managing file coherency between the page
      cache and direct I/O. The generic dio code flushes dirty pages over the
      range of a dio to ensure that the dio read or a future buffered read
      returns the correct data. XFS has generally followed this pattern,
      though traditionally has flushed and invalidated the range from the
      start of the I/O all the way to the end of the file. This changed after
      the following commit:
      
      	7d4ea3ce xfs: use ranged writeback and invalidation for direct IO
      
      ... as the full file flush was no longer necessary to deal with the
      strange post-eof delalloc issues that were since fixed. Unfortunately,
      we have since received complaints about performance degradation due to
      the increased exclusive iolock cycles (which locks out parallel dio
      submission) that occur when a file has cached pages. This does not occur
      on filesystems that use the generic code as it also does not incorporate
      locking.
      
      The exclusive iolock is acquired any time the inode mapping has cached
      pages, regardless of whether they reside in the range of the I/O or not.
      If not, the flush/inval calls do no work and the lock is cycled for no
      reason.
      
      Under consideration of the cost of the exclusive iolock, update the dio
      read and write handlers to flush and invalidate the entire mapping when
      cached pages exist. In most cases, this increases the cost of the
      initial flush sequence but eliminates the need for further lock cycles
      and flushes so long as the workload does not actively mix direct and
      buffered I/O. This also more closely matches historical behavior and
      performance characteristics that users have come to expect.
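      To make the idea concrete, here is a minimal sketch using generic
      page-cache helpers; the function name is made up and this is not the
      actual patch, which lives in the XFS dio read/write handlers:

      	#include <linux/fs.h>

      	/* Sketch: flush and invalidate the whole mapping, not just the
      	 * byte range of this direct I/O, whenever cached pages exist. */
      	static int flush_whole_file_for_dio(struct inode *inode)
      	{
      		struct address_space	*mapping = inode->i_mapping;
      		int			error = 0;

      		if (mapping->nrpages) {
      			error = filemap_write_and_wait(mapping);
      			if (!error)
      				error = invalidate_inode_pages2(mapping);
      		}
      		return error;
      	}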
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: Fix xfs_attr_leafblock definition · ffeecc52
      Committed by Jan Kara
      struct xfs_attr_leafblock contains an 'entries' array which is declared
      with size 1 although it can in fact contain many more entries. Since this
      array is followed by further struct members, gcc (at least in version
      4.8.3) thinks that the array has a fixed size of 1 element and thus
      may optimize away all accesses beyond the end of the array, resulting in
      non-working code. This problem was only observed with userspace code in
      xfsprogs; however, it's better to be safe in the kernel as well and have
      matching kernel and xfsprogs definitions.
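      To illustrate the class of problem, here is a simplified, hypothetical
      layout (not the real on-disk xfs_attr_leafblock, and the "robust"
      variant below is just one common remedy, not necessarily what this
      commit does):

      	/* Hypothetical example of the hazard. */
      	struct leaf_fragile {
      		int	count;
      		int	entries[1];	/* really 'count' entries on disk */
      		char	name_area[64];	/* member after the array: gcc may
      					   assume entries[] has exactly one
      					   element and mis-optimise accesses
      					   to entries[i] for i > 0 */
      	};

      	/* One common remedy: no members after the over-indexed array. */
      	struct leaf_robust {
      		int	count;
      		char	name_area[64];
      		int	entries[];	/* flexible array member */
      	};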
      
      cc: <stable@vger.kernel.org>
      Signed-off-by: Jan Kara <jack@suse.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • libxfs: readahead of dir3 data blocks should use the read verifier · 2f123bce
      Committed by Darrick J. Wong
      In the dir3 data block readahead function, use the regular read
      verifier to check the block's CRC and spot-check the block contents
      instead of directly calling only the spot-checking routine.  This
      prevents corrupted directory data blocks from being read into the
      kernel, which can lead to garbage ls output and directory loops (if
      say one of the entries contains slashes and other junk).
      
      cc: <stable@vger.kernel.org> # 3.12 - 4.2
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: stop holding ILOCK over filldir callbacks · dbad7c99
      Committed by Dave Chinner
      The recent change to the readdir locking made in 40194ecc ("xfs:
      reinstate the ilock in xfs_readdir") for CXFS directory sanity was
      probably the wrong thing to do. Deep in the readdir code we
      can take page faults in the filldir callback, and so taking a page
      fault while holding an inode ilock creates a new set of locking
      issues that lockdep warns all over the place about.
      
      The locking order for regular inodes w.r.t. page faults is io_lock
      -> pagefault -> mmap_sem -> ilock. The directory readdir code now
      triggers ilock -> page fault -> mmap_sem. While we cannot deadlock
      at this point, it inverts all the locking patterns that lockdep
      normally sees on XFS inodes, and so triggers lockdep. We worked
      around this with commit 93a8614e ("xfs: fix directory inode iolock
      lockdep false positive"), but that then just moved the lockdep
      warning to deeper in the page fault path and triggered on security
      inode locks. Fixing the shmem issue there just moved the lockdep
      reports somewhere else, and now we are getting false positives from
      filesystem freezing annotations getting confused.
      
      Further, if we enter memory reclaim in a readdir path, we now get
      lockdep warning about potential deadlocks because the ilock is held
      when we enter reclaim. This, again, is different to a regular file
      in that we never allow memory reclaim to run while holding the ilock
      for regular files. Hence lockdep now throws
      ilock->kmalloc->reclaim->ilock warnings.
      
      Basically, the problem is that the ilock is being used to protect
      the directory data and the inode metadata, whereas for a regular
      file the iolock protects the data and the ilock protects the
      metadata. From the VFS perspective, the i_mutex serialises all
      accesses to the directory data, and so not holding the ilock for
      readdir doesn't matter. The issue is that CXFS doesn't access
      directory data via the VFS, so it has no "data serialisation"
      mechanism. Hence we need to hold the IOLOCK in the correct places to
      provide this low level directory data access serialisation.
      
      The ilock can then be used just when the extent list needs to be
      read, just like we do for regular files. The directory modification
      code can take the iolock exclusive when the ilock is also taken,
      and this then ensures that readdir is correctly excluded while
      modifications are in progress.
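      A rough sketch of the locking shape described above (not the actual
      XFS readdir code; error handling and the directory iteration itself
      are elided):

      	static int readdir_locking_sketch(struct xfs_inode *dp)
      	{
      		int	error = 0;

      		/* IOLOCK serialises directory data for the whole call */
      		xfs_ilock(dp, XFS_IOLOCK_SHARED);

      		/* ILOCK is only held while the extent list is read in */
      		if (!(dp->i_df.if_flags & XFS_IFEXTENTS)) {
      			xfs_ilock(dp, XFS_ILOCK_SHARED);
      			error = xfs_iread_extents(NULL, dp, XFS_DATA_FORK);
      			xfs_iunlock(dp, XFS_ILOCK_SHARED);
      		}

      		/* filldir callbacks (and any page faults they take) run
      		 * here with no ILOCK held */

      		xfs_iunlock(dp, XFS_IOLOCK_SHARED);
      		return error;
      	}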
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: clean up inode lockdep annotations · 0952c818
      Committed by Dave Chinner
      Lockdep annotations are a maintenance nightmare. Locking has to be
      modified to suit the limitations of the annotations, and we're
      always having to fix the annotations because they are unable to
      express the complexity of locking hierarchies correctly.
      
      So, next up, we've got more issues with lockdep annotations for
      inode locking w.r.t. XFS_LOCK_PARENT:
      
      	- lockdep classes are exclusive and can't be ORed together
      	  to form new classes.
      	- IOLOCK needs multiple PARENT subclasses to express the
      	  changes needed for the readdir locking rework needed to
      	  stop the endless flow of lockdep false positives involving
      	  readdir calling filldir under the ILOCK.
      	- there are only 8 unique lockdep subclasses available,
      	  so we can't create a generic solution.
      
      IOWs we need to treat the 3-bit space available to each lock type
      differently:
      
      	- IOLOCK uses xfs_lock_two_inodes(), so needs:
      		- at least 2 IOLOCK subclasses
      		- at least 2 IOLOCK_PARENT subclasses
      	- MMAPLOCK uses xfs_lock_two_inodes(), so needs:
      		- at least 2 MMAPLOCK subclasses
      	- ILOCK uses xfs_lock_inodes with up to 5 inodes, so needs:
      		- at least 5 ILOCK subclasses
      		- one ILOCK_PARENT subclass
      		- one RTBITMAP subclass
      		- one RTSUM subclass
      
      For the IOLOCK, split the space into two sets of subclasses.
      For the MMAPLOCK, just use half the space for the one subclass to
      match the non-parent lock classes of the IOLOCK.
      For the ILOCK, use 0-4 as the ILOCK subclasses, 5-7 for the
      remaining individual subclasses.
      
      Because they are now all different, modify xfs_lock_inumorder() to
      handle the nested subclasses, and to assert fail if passed an
      invalid subclass. Further, annotate xfs_lock_inodes() to assert fail
      if an invalid combination of lock primitives and inode counts is
      passed that would result in a lockdep subclass annotation overflow.
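      For context, lockdep itself only provides 8 subclasses per lock class
      (MAX_LOCKDEP_SUBCLASSES), which is where the 3-bit space above comes
      from. A generic illustration of how a subclass is consumed when two
      locks of the same class nest (this is not the XFS macro scheme):

      	#include <linux/rwsem.h>

      	static void lock_two_of_same_class(struct rw_semaphore *first,
      					   struct rw_semaphore *second)
      	{
      		/* the inner acquisition needs its own subclass (0..7) or
      		 * lockdep reports a false self-deadlock */
      		down_write_nested(first, 0);
      		down_write_nested(second, 1);

      		/* ... critical section ... */

      		up_write(second);
      		up_write(first);
      	}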
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: swap leaf buffer into path struct atomically during path shift · 7df1c170
      Committed by Brian Foster
      The node directory lookup code uses a state structure that tracks the
      path of buffers used to search for the hash of a filename through the
      leaf blocks. When the lookup encounters a block that ends with the
      requested hash, but the entry has not yet been found, it must shift over
      to the next block and continue looking for the entry (i.e., duplicate
      hashes could continue over into the next block). This shift mechanism
      involves walking back up and down the state structure, replacing buffers
      at the appropriate btree levels as necessary.
      
      When a buffer is replaced, the old buffer is released and the new buffer
      read into the active slot in the path structure. Because the buffer is
      read directly into the path slot, a buffer read failure can result in
      setting a NULL buffer pointer in an active slot. This throws off the
      state cleanup code in xfs_dir2_node_lookup(), which expects to release a
      buffer from each active slot. Instead, a BUG occurs due to a NULL
      pointer dereference:
      
        BUG: unable to handle kernel NULL pointer dereference at 00000000000001e8
        IP: [<ffffffffa0585063>] xfs_trans_brelse+0x2a3/0x3c0 [xfs]
        ...
        RIP: 0010:[<ffffffffa0585063>]  [<ffffffffa0585063>] xfs_trans_brelse+0x2a3/0x3c0 [xfs]
        ...
        Call Trace:
         [<ffffffffa05250c6>] xfs_dir2_node_lookup+0xa6/0x2c0 [xfs]
         [<ffffffffa0519f7c>] xfs_dir_lookup+0x1ac/0x1c0 [xfs]
         [<ffffffffa055d0e1>] xfs_lookup+0x91/0x290 [xfs]
         [<ffffffffa05580b3>] xfs_vn_lookup+0x73/0xb0 [xfs]
         [<ffffffff8122de8d>] lookup_real+0x1d/0x50
         [<ffffffff8123330e>] path_openat+0x91e/0x1490
         [<ffffffff81235079>] do_filp_open+0x89/0x100
         ...
      
      This has been reproduced via a parallel fsstress and filesystem shutdown
      workload in a loop. The shutdown triggers the read error in the
      aforementioned codepath and causes the BUG in xfs_dir2_node_lookup().
      
      Update xfs_da3_path_shift() to update the active path slot atomically
      with respect to the caller when a buffer is replaced. This ensures that
      the caller always sees the old or new buffer in the slot and prevents
      the NULL pointer dereference.
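      The shape of the fix, sketched with hypothetical stand-ins for the
      xfs_da_state path machinery (the real change is in xfs_da3_path_shift()):

      	struct buffer;
      	struct path_slot {
      		struct buffer		*bp;
      		unsigned long long	blkno;
      	};
      	int read_sibling_block(unsigned long long blkno, struct buffer **bpp);
      	void release_block(struct buffer *bp);

      	static int shift_path_slot(struct path_slot *slot,
      				   unsigned long long next_blkno)
      	{
      		struct buffer	*new_bp;
      		int		error;

      		/* read the replacement into a local pointer first ... */
      		error = read_sibling_block(next_blkno, &new_bp);
      		if (error)
      			return error;	/* slot->bp still holds the old buffer */

      		/* ... and swap it into the active slot only once the read
      		 * has succeeded, so the caller never sees a NULL buffer */
      		release_block(slot->bp);
      		slot->bp = new_bp;
      		slot->blkno = next_blkno;
      		return 0;
      	}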
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: relocate sparse inode mount warning · 1b867d3a
      Committed by Brian Foster
      The sparse inodes feature is currently considered experimental. We warn
      at mount time from xfs_mount_validate_sb(). This function is part of the
      superblock verifier codepath, however, which means it could be invoked
      repeatedly on superblock reads or writes. This is currently only
      noticeable from userspace, where mkfs produces multiple warnings at
      format time.
      
      As mkfs warnings were not the intent of this change, relocate the mount
      time warning to xfs_fs_fill_super(), which is only invoked once and only
      in kernel space.
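      A sketch of the relocated warning (assuming the usual v5 feature-check
      helper; the exact message and placement inside xfs_fs_fill_super() may
      differ):

      	/* Sketch: warn once per mount, not from the superblock verifier. */
      	if (xfs_sb_version_hassparseinodes(&mp->m_sb))
      		xfs_warn(mp,
      	"EXPERIMENTAL sparse inode feature enabled. Use at your own risk!");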
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: dquots should be stamped with sb_meta_uuid · 92863451
      Committed by Dave Chinner
      Once the sb_uuid is changed, the wrong uuid is stamped into new
      dquots on disk. Found by inspection, verified by generic/219.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: log recovery needs to validate against sb_meta_uuid · fcfbe2c4
      Committed by Dave Chinner
      Now that sb_uuid can be changed by the user, we cannot use this to
      validate the metadata blocks being recovered belong to this
      filesystem. We must check against the sb_meta_uuid as that will
      remain unchanged.
      
      There is a complication in this code - the superblock itself. We can
      not check the sb_meta_uuid unconditionally, as that may not be set
      on disk. Hence we must verify the superblock sb_uuid matches between
      the log record and the in-core superblock.
      
      Found by inspection after the previous two problems were found.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: growfs not aware of sb_meta_uuid · ac383de2
      Committed by Dave Chinner
      Adding this simple change to xfstests:common/rc::_scratch_mkfs_xfs:
      
      +       if [ $mkfs_status -eq 0 ]; then
      +               xfs_admin -U generate $SCRATCH_DEV > /dev/null
      +       fi
      
      triggers all sorts of errors in xfstests. xfs/104 is an example,
      where growfs fails with a UUID mismatch corruption detected by
      xfs_agf_write_verify() when trying to write the first new AG
      headers.
      
      Fix this problem by making sure we copy the sb_meta_uuid into new
      metadata written by growfs.
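      A sketch of the kind of change implied for the new AG headers written
      by growfs (treat the exact field names and call sites as assumptions
      here):

      	if (xfs_sb_version_hascrc(&mp->m_sb)) {
      		uuid_copy(&agf->agf_uuid, &mp->m_sb.sb_meta_uuid);
      		uuid_copy(&agi->agi_uuid, &mp->m_sb.sb_meta_uuid);
      		uuid_copy(&agfl->agfl_uuid, &mp->m_sb.sb_meta_uuid);
      	}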
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: fix sb_meta_uuid usage · bbf155ad
      Committed by Dave Chinner
      After changing the UUID on a v5 filesystem, xfstests fails
      immediately on a debug kernel with:
      
      XFS: Assertion failed: uuid_equal(&ip->i_d.di_uuid, &mp->m_sb.sb_uuid), file: fs/xfs/xfs_inode.c, line: 799
      
      This needs to check against the sb_meta_uuid, not the user visible
      UUID that was changed.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: set XFS_DA_OP_OKNOENT in xfs_attr_get · c400ee3e
      Committed by Eric Sandeen
      It's entirely possible for userspace to ask for an xattr which
      does not exist.
      
      Normally, there is no problem whatsoever when we ask for such
      a thing, but when we look at an obfuscated metadump image
      on a debug kernel with selinux, we trip over this ASSERT in
      xfs_da3_path_shift():
      
              *result = -ENOENT;      /* we're out of our tree */
              ASSERT(args->op_flags & XFS_DA_OP_OKNOENT);
      
      It (more or less) only shows up in the above scenario, because
      xfs_metadump obfuscates attr names, but chooses names which
      keep the same hash value - and xfs_da3_node_lookup_int does:
      
              if (((retval == -ENOENT) || (retval == -ENOATTR)) &&
                  (blk->hashval == args->hashval)) {
                      error = xfs_da3_path_shift(state, &state->path, 1, 1,
                                                       &retval);
      
      IOWs, we only get down to the xfs_da3_path_shift() ASSERT
      if we are looking for an xattr which doesn't exist, but we
      find xattrs on disk which have the same hash, and so might be
      a hash collision, so we try the path shift.  When *that*
      fails to find what we're looking for, we hit the assert about
      XFS_DA_OP_OKNOENT.
      
      Simply setting XFS_DA_OP_OKNOENT in xfs_attr_get solves this
      rather corner-case problem with no ill side effects.  It's
      fine for an attr name lookup to fail.
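      A sketch of the one-flag change described above (the surrounding
      xfs_da_args initialisation in xfs_attr_get() is simplified away):

      	struct xfs_da_args	args;

      	memset(&args, 0, sizeof(args));
      	/* ... name, namelen, value and inode filled in as before ... */
      	args.op_flags = XFS_DA_OP_OKNOENT;	/* ENOENT is expected here */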
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  2. 29 Jul 2015, 1 commit
    • xfs: create new metadata UUID field and incompat flag · ce748eaa
      Committed by Eric Sandeen
      This adds a new superblock field, sb_meta_uuid.  If set, along with
      a new incompat flag, the code will use that field on a V5 filesystem
      to compare to metadata UUIDs, which allows us to change the user-
      visible UUID at will.  Userspace handles the setting and clearing
      of the incompat flag as appropriate, as the UUID gets changed; i.e.
      setting the user-visible UUID back to the original UUID (as stored in
      the new field) will remove the incompatible feature flag.
      
      If the incompat flag is not set, this copies the user-visible UUID into
      the meta_uuid slot in memory when the superblock is read from disk;
      the meta_uuid field is not written back to disk in this case.
      
      The remainder of this patch simply switches verifiers, initializers,
      etc to use the new sb_meta_uuid field.
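      A sketch of the copy-on-read behaviour described above (the feature
      check helper name follows my reading of the patch and should be
      treated as approximate):

      	/* If the incompat flag is absent, sb_meta_uuid simply mirrors the
      	 * user-visible UUID in memory and is never written back. */
      	if (!xfs_sb_version_hasmetauuid(sbp))
      		uuid_copy(&sbp->sb_meta_uuid, &sbp->sb_uuid);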
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  3. 12 Jul 2015, 3 commits
    • freeing unlinked file indefinitely delayed · 75a6f82a
      Committed by Al Viro
      	Normally opening a file, unlinking it and then closing will have
      the inode freed upon close() (provided that it's not otherwise busy and
      has no remaining links, of course).  However, there's one case where that
      does *not* happen.  Namely, if you open it by fhandle with cold dcache,
      then unlink() and close().
      
      	In the normal case, d_delete() in unlink(2) notices that the dentry
      is busy and unhashes it; on the final dput() it will be forcibly evicted from
      the dcache, triggering iput() and inode removal.  In this case, though, we end
      up with *two* dentries - disconnected (created by open-by-fhandle) and
      regular one (used by unlink()).  The latter will have its reference to inode
      dropped just fine, but the former will not - it's considered hashed (it
      is on the ->s_anon list), so it will stay around until memory pressure
      finally does it in.  As a result, we have the final iput() delayed
      indefinitely.  It's trivial to reproduce -
      
      #define _GNU_SOURCE		/* for name_to_handle_at() / open_by_handle_at() */
      #include <fcntl.h>
      #include <stdlib.h>
      #include <unistd.h>
      #include <sys/stat.h>
      #include <sys/types.h>

      void flush_dcache(void)
      {
              system("mount -o remount,rw /");
      }
      
      static char buf[20 * 1024 * 1024];
      
      int main(void)
      {
              int fd;
              union {
                      struct file_handle f;
                      char buf[MAX_HANDLE_SZ];
              } x;
              int m;
      
              x.f.handle_bytes = sizeof(x);
              chdir("/root");
              mkdir("foo", 0700);
              fd = open("foo/bar", O_CREAT | O_RDWR, 0600);
              close(fd);
              name_to_handle_at(AT_FDCWD, "foo/bar", &x.f, &m, 0);
              flush_dcache();
              fd = open_by_handle_at(AT_FDCWD, &x.f, O_RDWR);
              unlink("foo/bar");
              write(fd, buf, sizeof(buf));
              system("df .");			/* 20Mb eaten */
              close(fd);
              system("df .");			/* should've freed those 20Mb */
              flush_dcache();
              system("df .");			/* should be the same as #2 */
      }
      
      will spit out something like
      Filesystem     1K-blocks   Used Available Use% Mounted on
      /dev/root         322023 303843      1131 100% /
      Filesystem     1K-blocks   Used Available Use% Mounted on
      /dev/root         322023 303843      1131 100% /
      Filesystem     1K-blocks   Used Available Use% Mounted on
      /dev/root         322023 283282     21692  93% /
      - inode gets freed only when dentry is finally evicted (here we trigger
      that by remount; normally it would've happened in response to memory
      pressure hell knows when).
      
      Cc: stable@vger.kernel.org # v2.6.38+; earlier ones need s/kill_it/unhash_it/
      Acked-by: J. Bruce Fields <bfields@fieldses.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • fix a braino in ovl_d_select_inode() · 9391dd00
      Committed by Al Viro
      when opening a directory we want the overlayfs inode, not one from
      the topmost layer.
      Reported-By: Andrey Jr. Melnikov <temnota.am@gmail.com>
      Tested-By: Andrey Jr. Melnikov <temnota.am@gmail.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • 9p: don't leave a half-initialized inode sitting around · 0a73d0a2
      Committed by Al Viro
      Cc: stable@vger.kernel.org # all branches
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  4. 10 Jul 2015, 5 commits
  5. 06 Jul 2015, 1 commit
  6. 05 Jul 2015, 3 commits
  7. 04 Jul 2015, 4 commits
    • sched/stat: Expose /proc/pid/schedstat if CONFIG_SCHED_INFO=y · 5968cece
      Committed by Naveen N. Rao
      Expand /proc/pid/schedstat output:
      
       - enable it on CONFIG_TASK_DELAY_ACCT=y && !CONFIG_SCHEDSTATS kernels.
      
       - dump all zeroes on kernels that are booted with the 'nodelayacct'
         option, a boot option that disables delay accounting on
         CONFIG_TASK_DELAY_ACCT=y kernels.
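      For reference, a minimal userspace reader for this file; the field
      meanings (on-CPU time in ns, runqueue wait time in ns, number of
      timeslices) reflect my reading of the format and are worth checking
      against Documentation/scheduler/sched-stats.txt:

      	#include <stdio.h>

      	int main(void)
      	{
      		unsigned long long run_ns, wait_ns, slices;
      		FILE *f = fopen("/proc/self/schedstat", "r");

      		if (!f || fscanf(f, "%llu %llu %llu",
      				 &run_ns, &wait_ns, &slices) != 3) {
      			perror("/proc/self/schedstat");
      			return 1;
      		}
      		printf("run %llu ns, wait %llu ns, %llu timeslices\n",
      		       run_ns, wait_ns, slices);
      		fclose(f);
      		return 0;
      	}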
      Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: a.p.zijlstra@chello.nl
      Cc: ricklind@us.ibm.com
      Link: http://lkml.kernel.org/r/5ccbef17d4bc841084ea6e6421d4e4a23b7b806f.1435654789.git.naveen.n.rao@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • ext4: correctly migrate a file with a hole at the beginning · 8974fec7
      Committed by Eryu Guan
      Currently ext4_ind_migrate() doesn't correctly handle a file which
      contains a hole at the beginning of the file.  This caused the migration
      to be done incorrectly, and then if there is a subsequent delayed
      allocation write to the "hole", this would reclaim the same data blocks
      again and result in fs corruption.
      
        # assuming 4k block size ext4, with delalloc enabled
        # skip the first block and write to the second block
        xfs_io -fc "pwrite 4k 4k" -c "fsync" /mnt/ext4/testfile
      
        # converting to indirect-mapped file, which would move the data blocks
        # to the beginning of the file, but extent status cache still marks
        # that region as a hole
        chattr -e /mnt/ext4/testfile
      
        # delayed allocation writes to the "hole", reclaim the same data block
        # again, results in i_blocks corruption
        xfs_io -c "pwrite 0 4k" /mnt/ext4/testfile
        umount /mnt/ext4
        e2fsck -nf /dev/sda6
        ...
        Inode 53, i_blocks is 16, should be 8.  Fix? no
        ...
      Signed-off-by: Eryu Guan <guaneryu@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
    • ext4: be more strict when migrating to non-extent based file · d6f123a9
      Committed by Eryu Guan
      Currently the check in ext4_ind_migrate() is not enough before doing the
      real conversion:
      
      a) delayed allocated extents could bypass the check on eh->eh_entries
         and eh->eh_depth
      
      This can be demonstrated by this script
      
        xfs_io -fc "pwrite 0 4k" -c "pwrite 8k 4k" /mnt/ext4/testfile
        chattr -e /mnt/ext4/testfile
      
      where testfile has two extents but is still converted to the non-extent
      based file format.
      
      b) only extent length is checked but not the offset, which would result
         in data loss (delalloc) or fs corruption (nodelalloc), because
         non-extent based file only supports at most (12 + 2^10 + 2^20 + 2^30)
         blocks
      
      This can be demonstrated by
      
        xfs_io -fc "pwrite 5T 4k" /mnt/ext4/testfile
        chattr -e /mnt/ext4/testfile
        sync
      
      If delalloc is enabled, dmesg prints
        EXT4-fs warning (device dm-4): ext4_block_to_path:105: block 1342177280 > max in inode 53
        EXT4-fs (dm-4): Delayed block allocation failed for inode 53 at logical offset 1342177280 with max blocks 1 with error 5
        EXT4-fs (dm-4): This should not happen!! Data will be lost
      
      If delalloc is disabled, e2fsck -nf shows corruption
        Inode 53, i_size is 5497558142976, should be 4096.  Fix? no
      
      Fix the two issues by
      
      a) forcing all delayed allocation blocks to be allocated before checking
         eh->eh_depth and eh->eh_entries
      b) ensuring that the last logical block of the extent is within the direct map
      Signed-off-by: Eryu Guan <guaneryu@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
    • ext4: fix reservation release on invalidatepage for delalloc fs · 9705acd6
      Committed by Lukas Czerner
      On a delalloc enabled file system, during the invalidatepage operation
      in ext4_da_page_release_reservation() we want to clear the delayed
      buffer and remove the extent covering the delayed buffer from the extent
      status tree.
      
      However, currently there is a bug where on systems with page size >
      block size we will always remove extents from the start of the page
      regardless where the actual delayed buffers are positioned in the page.
      This leads to the errors like this:
      
      EXT4-fs warning (device loop0): ext4_da_release_space:1225:
      ext4_da_release_space: ino 13, to_free 1 with only 0 reserved data
      blocks
      
      This however can cause data loss at writeback time if the file system is
      in an ENOSPC condition, because we're releasing the reservation for
      someone else's delayed buffer.
      
      Fix this by only removing extents that correspond to the part of the
      page we want to invalidate.
      
      This problem is reproducible with the following fio recipe (however, I was
      only able to reproduce it with fio-2.1 or older):
      
      [global]
      bs=8k
      iodepth=1024
      iodepth_batch=60
      randrepeat=1
      size=1m
      directory=/mnt/test
      numjobs=20
      [job1]
      ioengine=sync
      bs=1k
      direct=1
      rw=randread
      filename=file1:file2
      [job2]
      ioengine=libaio
      rw=randwrite
      direct=1
      filename=file1:file2
      [job3]
      bs=1k
      ioengine=posixaio
      rw=randwrite
      direct=1
      filename=file1:file2
      [job5]
      bs=1k
      ioengine=sync
      rw=randread
      filename=file1:file2
      [job7]
      ioengine=libaio
      rw=randwrite
      filename=file1:file2
      [job8]
      ioengine=posixaio
      rw=randwrite
      filename=file1:file2
      [job10]
      ioengine=mmap
      rw=randwrite
      bs=1k
      filename=file1:file2
      [job11]
      ioengine=mmap
      rw=randwrite
      direct=1
      filename=file1:file2
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: stable@vger.kernel.org
  8. 02 Jul 2015, 11 commits
    • ext4: avoid deadlocks in the writeback path by using sb_getblk_gfp · c45653c3
      Committed by Nikolay Borisov
      Switch ext4 to using sb_getblk_gfp with GFP_NOFS added to fix possible
      deadlocks in the page writeback path.
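      A sketch of the call pattern (the gfp flags and the call sites
      converted by the actual patch are assumptions here):

      	#include <linux/buffer_head.h>

      	/* GFP_NOFS keeps the buffer-head allocation from recursing back
      	 * into the filesystem while we are already in writeback. */
      	static struct buffer_head *getblk_nofs(struct super_block *sb,
      					       sector_t block)
      	{
      		return sb_getblk_gfp(sb, block, __GFP_MOVABLE | GFP_NOFS);
      	}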
      Signed-off-by: Nikolay Borisov <kernel@kyup.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
    • ext4: fix fencepost error in lazytime optimization · 0f0ff9a9
      Committed by Theodore Ts'o
      Commit 8f4d8558: "ext4: fix lazytime optimization" was not a
      complete fix.  In the case where the inode number is a multiple of 16,
      we could still end up updating an inode with dirty timestamps
      written to the wrong inode on disk.  Oops.
      
      This can be easily reproduced by using generic/005 with a file system
      with metadata_csum and lazytime enabled.
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
    • Btrfs: fix wrong check for btrfs_force_chunk_alloc() · 9689457b
      Committed by Shilong Wang
      btrfs_force_chunk_alloc() returns 1 when it allocates a chunk
      successfully, so the caller must not treat that return value as an
      error.  This problem has existed since commit c87f08ca.
      
      With this patch, we might fix some enospc problems for balance operations.
      Signed-off-by: Wang Shilong <wangshilong1991@gmail.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Tested-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix warning of bytes_may_use · ddba1bfc
      Committed by Liu Bo
      While running generic/019, dmesg got several warnings from
      btrfs_free_reserved_data_space().
      
      Test generic/019 produces some disk failures, so submitted dio will get
      errors, in which case btrfs_direct_IO() goes to the error handling and
      frees bytes_may_use, but the problem is that bytes_may_use has already
      been freed during get_block().
      
      This adds a runtime flag to record whether we've gone through get_block();
      if so, don't do the cleanup work.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Tested-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix hang when failing to submit bio of directIO · ad9ee205
      Committed by Liu Bo
      The hang is uncovered by generic/019.
      
      btrfs_endio_direct_write() skips the "finish_ordered_fn" part when it hits
      an error, thus those added ordered extents will never get processed, which
      blocks processes that are waiting for them via btrfs_start_ordered_extent().
      
      This fixes the above, and meanwhile finish_ordered_fn will do the space
      accounting work.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Tested-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix a comment in inode.c:evict_inode_truncate_pages() · 9c6429d9
      Committed by Filipe Manana
      The comment was not correct in the part where it says the bio's endio
      callback might not have been called yet - update it to mention that,
      by that time, the endio callback's execution might merely still be
      in progress.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix memory corruption on failure to submit bio for direct IO · 61de718f
      Committed by Filipe Manana
      If we fail to submit a bio for a direct IO request, we were grabbing the
      corresponding ordered extent and decrementing its reference count twice,
      once for our lookup reference and once for the ordered tree reference.
      This was a problem because it caused the ordered extent to be freed
      without removing it from the ordered tree and any lists it might be
      attached to, leaving dangling pointers to the ordered extent around.
      Example trace with CONFIG_DEBUG_PAGEALLOC=y:
      
      [161779.858707] BUG: unable to handle kernel paging request at 0000000087654330
      [161779.859983] IP: [<ffffffff8124ca68>] rb_prev+0x22/0x3b
      [161779.860636] PGD 34d818067 PUD 0
      [161779.860636] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      (...)
      [161779.860636] Call Trace:
      [161779.860636]  [<ffffffffa06b36a6>] __tree_search+0xd9/0xf9 [btrfs]
      [161779.860636]  [<ffffffffa06b3708>] tree_search+0x42/0x63 [btrfs]
      [161779.860636]  [<ffffffffa06b4868>] ? btrfs_lookup_ordered_range+0x2d/0xa5 [btrfs]
      [161779.860636]  [<ffffffffa06b4873>] btrfs_lookup_ordered_range+0x38/0xa5 [btrfs]
      [161779.860636]  [<ffffffffa06aab8e>] btrfs_get_blocks_direct+0x11b/0x615 [btrfs]
      [161779.860636]  [<ffffffff8119727f>] do_blockdev_direct_IO+0x5ff/0xb43
      [161779.860636]  [<ffffffffa06aaa73>] ? btrfs_page_exists_in_range+0x1ad/0x1ad [btrfs]
      [161779.860636]  [<ffffffffa06a2c9a>] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
      [161779.860636]  [<ffffffff811977f5>] __blockdev_direct_IO+0x32/0x34
      [161779.860636]  [<ffffffffa06a2c9a>] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
      [161779.860636]  [<ffffffffa06a10ae>] btrfs_direct_IO+0x198/0x21f [btrfs]
      [161779.860636]  [<ffffffffa06a2c9a>] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
      [161779.860636]  [<ffffffff81112ca1>] generic_file_direct_write+0xb3/0x128
      [161779.860636]  [<ffffffffa06affaa>] ? btrfs_file_write_iter+0x15f/0x3e0 [btrfs]
      [161779.860636]  [<ffffffffa06b004c>] btrfs_file_write_iter+0x201/0x3e0 [btrfs]
      (...)
      
      We were also not freeing the btrfs_dio_private we allocated previously,
      which kmemleak reported with the following trace in its sysfs file:
      
      unreferenced object 0xffff8803f553bf80 (size 96):
        comm "xfs_io", pid 4501, jiffies 4295039588 (age 173.936s)
        hex dump (first 32 bytes):
          88 6c 9b f5 02 88 ff ff 00 00 00 00 00 00 00 00  .l..............
          00 00 00 00 00 00 00 00 00 00 c4 00 00 00 00 00  ................
        backtrace:
          [<ffffffff81161ffe>] create_object+0x172/0x29a
          [<ffffffff8145870f>] kmemleak_alloc+0x25/0x41
          [<ffffffff81154e64>] kmemleak_alloc_recursive.constprop.40+0x16/0x18
          [<ffffffff811579ed>] kmem_cache_alloc_trace+0xfb/0x148
          [<ffffffffa03d8cff>] btrfs_submit_direct+0x65/0x16a [btrfs]
          [<ffffffff811968dc>] dio_bio_submit+0x62/0x8f
          [<ffffffff811975fe>] do_blockdev_direct_IO+0x97e/0xb43
          [<ffffffff811977f5>] __blockdev_direct_IO+0x32/0x34
          [<ffffffffa03d70ae>] btrfs_direct_IO+0x198/0x21f [btrfs]
          [<ffffffff81112ca1>] generic_file_direct_write+0xb3/0x128
          [<ffffffffa03e604d>] btrfs_file_write_iter+0x201/0x3e0 [btrfs]
          [<ffffffff8116586a>] __vfs_write+0x7c/0xa5
          [<ffffffff81165da9>] vfs_write+0xa0/0xe4
          [<ffffffff81166675>] SyS_pwrite64+0x64/0x82
          [<ffffffff81464fd7>] system_call_fastpath+0x12/0x6f
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      For read requests we weren't doing any cleanup either (none of the work
      done by btrfs_endio_direct_read()), so a failure submitting a bio for a
      read request would leave a range in the inode's io_tree locked forever,
      blocking any future operations (both reads and writes) against that range.
      
      So fix this by making sure we do the same cleanup that we do for the case
      where the bio submission succeeds.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: don't update mtime/ctime on deduped inodes · 1c919a5e
      Committed by Mark Fasheh
      One issue users have reported is that dedupe changes mtime on files,
      resulting in tools like rsync thinking that their contents have changed when
      in fact the data is exactly the same. We also skip the ctime update as no
      user-visible metadata changes here and we want dedupe to be transparent to
      the user.
      
      Clone still wants time changes, so we special case this in the code.
      
      This was tested with the btrfs-extent-same tool.
      Signed-off-by: Mark Fasheh <mfasheh@suse.de>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: allow dedupe of same inode · 0efa9f48
      Committed by Mark Fasheh
      clone() supports cloning within an inode so extent-same can do
      the same now. This patch fixes up the locking in extent-same to
      know about the single-inode case. In addition to that, we add a
      check for overlapping ranges, which clone does not allow.
      Signed-off-by: Mark Fasheh <mfasheh@suse.de>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: fix deadlock with extent-same and readpage · f4414602
      Committed by Mark Fasheh
      ->readpage() does page_lock() before extent_lock(); we do the opposite in
      extent-same. We want to reverse the order in btrfs_extent_same() but it's
      not quite straightforward since the page locks are taken inside btrfs_cmp_data().
      
      So I split btrfs_cmp_data() into 3 parts with a small context structure that
      is passed between them. The first, btrfs_cmp_data_prepare() gathers up the
      pages needed (taking page lock as required) and puts them on our context
      structure. At this point, we are safe to lock the extent range. Afterwards,
      we use btrfs_cmp_data() to do the data compare as usual and btrfs_cmp_data_free()
      to clean up our context.
      Signed-off-by: Mark Fasheh <mfasheh@suse.de>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: pass unaligned length to btrfs_cmp_data() · 207910dd
      Committed by Mark Fasheh
      In the case that we dedupe the tail of a file, we might expand the dedupe
      len out to the end of our last block. We don't want to compare data past
      i_size however, so pass the original length to btrfs_cmp_data().
      Signed-off-by: Mark Fasheh <mfasheh@suse.de>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>