1. 29 5月, 2015 2 次提交
    • B
      xfs: fix broken i_nlink accounting for whiteout tmpfile inode · 22419ac9
      Brian Foster 提交于
      XFS uses the internal tmpfile() infrastructure for the whiteout inode
      used for RENAME_WHITEOUT operations. For tmpfile inodes, XFS allocates
      the inode, drops di_nlink, adds the inode to the agi unlinked list,
      calls d_tmpfile() which correspondingly drops i_nlink of the vfs inode,
      and then finishes the common inode setup (e.g., clear I_NEW and unlock).
      
      The d_tmpfile() call was originally made inxfs_create_tmpfile(), but was
      pulled up out of that function as part of the following commit to
      resolve a deadlock issue:
      
      	330033d6 xfs: fix tmpfile/selinux deadlock and initialize security
      
      As a result, callers of xfs_create_tmpfile() are responsible for either
      calling d_tmpfile() or fixing up i_nlink appropriately. The whiteout
      tmpfile allocation helper does neither. As a result, the vfs ->i_nlink
      becomes inconsistent with the on-disk ->di_nlink once xfs_rename() links
      it back into the source dentry and calls xfs_bumplink().
      
      Update the assert in xfs_rename() to help detect this problem in the
      future and update xfs_rename_alloc_whiteout() to decrement the link
      count as part of the manual tmpfile inode setup.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      22419ac9
    • D
      xfs: xfs_attr_inactive leaves inconsistent attr fork state behind · 6dfe5a04
      Dave Chinner 提交于
      xfs_attr_inactive() is supposed to clean up the attribute fork when
      the inode is being freed. While it removes attribute fork extents,
      it completely ignores attributes in local format, which means that
      there can still be active attributes on the inode after
      xfs_attr_inactive() has run.
      
      This leads to problems with concurrent inode writeback - the in-core
      inode attribute fork is removed without locking on the assumption
      that nothing will be attempting to access the attribute fork after a
      call to xfs_attr_inactive() because it isn't supposed to exist on
      disk any more.
      
      To fix this, make xfs_attr_inactive() completely remove all traces
      of the attribute fork from the inode, regardless of it's state.
      Further, also remove the in-core attribute fork structure safely so
      that there is nothing further that needs to be done by callers to
      clean up the attribute fork. This means we can remove the in-core
      and on-disk attribute forks atomically.
      
      Also, on error simply remove the in-memory attribute fork. There's
      nothing that can be done with it once we have failed to remove the
      on-disk attribute fork, so we may as well just blow it away here
      anyway.
      
      cc: <stable@vger.kernel.org> # 3.12 to 4.0
      Reported-by: NWaiman Long <waiman.long@hp.com>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      6dfe5a04
  2. 25 3月, 2015 6 次提交
    • F
      xfs: use bool instead of int in xfs_rename() · 86aaf02e
      Fabian Frederick 提交于
      new_parent and src_is_directory are only used in 0/1 context.
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      86aaf02e
    • D
      xfs: add RENAME_WHITEOUT support · 7dcf5c3e
      Dave Chinner 提交于
      Whiteouts are used by overlayfs -  it has a crazy convention that a
      whiteout is a character device inode with a major:minor of 0:0.
      Because it's not documented anywhere, here's an example of what
      RENAME_WHITEOUT does on ext4:
      
      # echo foo > /mnt/scratch/foo
      # echo bar > /mnt/scratch/bar
      # ls -l /mnt/scratch
      total 24
      -rw-r--r-- 1 root root     4 Feb 11 20:22 bar
      -rw-r--r-- 1 root root     4 Feb 11 20:22 foo
      drwx------ 2 root root 16384 Feb 11 20:18 lost+found
      # src/renameat2 -w /mnt/scratch/foo /mnt/scratch/bar
      # ls -l /mnt/scratch
      total 20
      -rw-r--r-- 1 root root     4 Feb 11 20:22 bar
      c--------- 1 root root  0, 0 Feb 11 20:23 foo
      drwx------ 2 root root 16384 Feb 11 20:18 lost+found
      # cat /mnt/scratch/bar
      foo
      #
      
      In XFS rename terms, the operation that has been done is that source
      (foo) has been moved to the target (bar), which is like a nomal
      rename operation, but rather than the source being removed, it have
      been replaced with a whiteout.
      
      We can't allocate whiteout inodes within the rename transaction due
      to allocation being a multi-commit transaction: rename needs to
      be a single, atomic commit. Hence we have several options here, form
      most efficient to least efficient:
      
          - use DT_WHT in the target dirent and do no whiteout inode
            allocation.  The main issue with this approach is that we need
            hooks in lookup to create a virtual chardev inode to present
            to userspace and in places where we might need to modify the
            dirent e.g. unlink.  Overlayfs also needs to be taught about
            DT_WHT. Most invasive change, lowest overhead.
      
          - create a special whiteout inode in the root directory (e.g. a
            ".wino" dirent) and then hardlink every new whiteout to it.
            This means we only need to create a single whiteout inode, and
            rename simply creates a hardlink to it. We can use DT_WHT for
            these, though using DT_CHR means we won't have to modify
            overlayfs, nor anything in userspace. Downside is we have to
            look up the whiteout inode on every operation and create it if
            it doesn't exist.
      
          - copy ext4: create a special whiteout chardev inode for every
            whiteout.  This is more complex than the above options because
            of the lack of atomicity between inode creation and the rename
            operation, requiring us to create a tmpfile inode and then
            linking it into the directory structure during the rename. At
            least with a tmpfile inode crashes between the create and
            rename doesn't leave unreferenced inodes or directory
            pollution around.
      
      By far the simplest thing to do in the short term is to copy ext4.
      While it is the most inefficient way of supporting whiteouts, but as
      an initial implementation we can simply reuse existing functions and
      add a small amount of extra code the the rename operation.
      
      When we get full whiteout support in the VFS (via the dentry cache)
      we can then look to supporting DT_WHT method outlined as the first
      method of supporting whiteouts. But until then, we'll stick with
      what overlayfs expects us to be: dumb and stupid.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      7dcf5c3e
    • D
      xfs: make xfs_cross_rename() complete fully · eeacd321
      Dave Chinner 提交于
      Now that xfs_finish_rename() exists, there is no reason for
      xfs_cross_rename() to return to xfs_rename() to finish off the
      rename transaction. Drive the completion code into
      xfs_cross_rename() and handle all errors there so as to simplify
      the xfs_rename() code.
      
      Further, push the rename exchange target_ip check to early in the
      rename code so as to make the error handling easy and obviously
      correct.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      eeacd321
    • D
      xfs: factor out xfs_finish_rename() · 310606b0
      Dave Chinner 提交于
      Rather than use a jump label for the final transaction commit in
      the rename, factor it into a simple helper function and call it
      appropriately. This slightly reduces the spaghetti nature of
      xfs_rename.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      310606b0
    • D
      xfs: cleanup xfs_rename error handling · 445883e8
      Dave Chinner 提交于
      The jump labels are ambiguous and unclear and some of the error
      paths are used inconsistently. Rules for error jumps are:
      
      - use out_trans_cancel for unmodified transaction context
      - use out_bmap_cancel on ENOSPC errors
      - use out_trans_abort when transaction is likely to be dirty.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      445883e8
    • D
      xfs: clean up inode locking for RENAME_WHITEOUT · 95afcf5c
      Dave Chinner 提交于
      When doing RENAME_WHITEOUT, we now have to lock 5 inodes into the
      rename transaction. This means we need to update
      xfs_sort_for_rename() and xfs_lock_inodes() to handle up to 5
      inodes. Because of the vagaries of rename, this means we could have
      anywhere between 3 and 5 inodes locked into the transaction....
      
      While xfs_lock_inodes() does not need anything other than an assert
      telling us we are passing more inodes that we ever thought we should
      see, it could do with a logic rework to remove all the indenting.
      This is not a functional change - it just makes the code a lot
      easier to read.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      95afcf5c
  3. 24 2月, 2015 1 次提交
  4. 23 2月, 2015 2 次提交
    • D
      xfs: inodes are new until the dentry cache is set up · 58c90473
      Dave Chinner 提交于
      Al Viro noticed a generic set of issues to do with filehandle lookup
      racing with dentry cache setup. They involve a filehandle lookup
      occurring while an inode is being created and the filehandle lookup
      racing with the dentry creation for the real file. This can lead to
      multiple dentries for the one path being instantiated. There are a
      host of other issues around this same set of paths.
      
      The underlying cause is that file handle lookup only waits on inode
      cache instantiation rather than full dentry cache instantiation. XFS
      is mostly immune to the problems discovered due to it's own internal
      inode cache, but there are a couple of corner cases where races can
      happen.
      
      We currently clear the XFS_INEW flag when the inode is fully set up
      after insertion into the cache. Newly allocated inodes are inserted
      locked and so aren't usable until the allocation transaction
      commits. This, however, occurs before the dentry and security
      information is fully initialised and hence the inode is unlocked and
      available for lookups to find too early.
      
      To solve the problem, only clear the XFS_INEW flag for newly created
      inodes once the dentry is fully instantiated. This means lookups
      will retry until the XFS_INEW flag is removed from the inode and
      hence avoids the race conditions in questions.
      
      THis also means that xfs_create(), xfs_create_tmpfile() and
      xfs_symlink() need to finish the setup of the inode in their error
      paths if we had allocated the inode but failed later in the creation
      process. xfs_symlink(), in particular, needed a lot of help to make
      it's error handling match that of xfs_create().
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      58c90473
    • D
      xfs: introduce mmap/truncate lock · 653c60b6
      Dave Chinner 提交于
      Right now we cannot serialise mmap against truncate or hole punch
      sanely. ->page_mkwrite is not able to take locks that the read IO
      path normally takes (i.e. the inode iolock) because that could
      result in lock inversions (read - iolock - page fault - page_mkwrite
      - iolock) and so we cannot use an IO path lock to serialise page
      write faults against truncate operations.
      
      Instead, introduce a new lock that is used *only* in the
      ->page_mkwrite path that is the equivalent of the iolock. The lock
      ordering in a page fault is i_mmaplock -> page lock -> i_ilock,
      and so in truncate we can i_iolock -> i_mmaplock and so lock out
      new write faults during the process of truncation.
      
      Because i_mmap_lock is outside the page lock, we can hold it across
      all the same operations we hold the i_iolock for. The only
      difference is that we never hold the i_mmaplock in the normal IO
      path and so do not ever have the possibility that we can page fault
      inside it. Hence there are no recursion issues on the i_mmap_lock
      and so we can use it to serialise page fault IO against inode
      modification operations that affect the IO path.
      
      This patch introduces the i_mmaplock infrastructure, lockdep
      annotations and initialisation/destruction code. Use of the new lock
      will be in subsequent patches.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      653c60b6
  5. 22 1月, 2015 1 次提交
  6. 24 12月, 2014 1 次提交
  7. 04 12月, 2014 1 次提交
  8. 28 11月, 2014 3 次提交
  9. 02 10月, 2014 3 次提交
  10. 09 9月, 2014 1 次提交
  11. 04 8月, 2014 1 次提交
  12. 25 6月, 2014 1 次提交
    • D
      xfs: global error sign conversion · 2451337d
      Dave Chinner 提交于
      Convert all the errors the core XFs code to negative error signs
      like the rest of the kernel and remove all the sign conversion we
      do in the interface layers.
      
      Errors for conversion (and comparison) found via searches like:
      
      $ git grep " E" fs/xfs
      $ git grep "return E" fs/xfs
      $ git grep " E[A-Z].*;$" fs/xfs
      
      Negation points found via searches like:
      
      $ git grep "= -[a-z,A-Z]" fs/xfs
      $ git grep "return -[a-z,A-D,F-Z]" fs/xfs
      $ git grep " -[a-z].*;" fs/xfs
      
      [ with some bits I missed from Brian Foster ]
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      2451337d
  13. 22 6月, 2014 1 次提交
  14. 20 5月, 2014 1 次提交
    • D
      xfs: turn NLINK feature on by default · 263997a6
      Dave Chinner 提交于
      mkfs has turned on the XFS_SB_VERSION_NLINKBIT feature bit by
      default since November 2007. It's about time we simply made the
      kernel code turn it on by default and so always convert v1 inodes to
      v2 inodes when reading them in from disk or allocating them. This
      This removes needless version checks and modification when bumping
      link counts on inodes, and will take code out of a few common code
      paths.
      
         text    data     bss     dec     hex filename
       783251  100867     616  884734   d7ffe fs/xfs/xfs.o.orig
       782664  100867     616  884147   d7db3 fs/xfs/xfs.o.patched
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      263997a6
  15. 24 4月, 2014 1 次提交
  16. 23 4月, 2014 3 次提交
  17. 17 4月, 2014 1 次提交
    • B
      xfs: fix tmpfile/selinux deadlock and initialize security · 330033d6
      Brian Foster 提交于
      xfstests generic/004 reproduces an ilock deadlock using the tmpfile
      interface when selinux is enabled. This occurs because
      xfs_create_tmpfile() takes the ilock and then calls d_tmpfile(). The
      latter eventually calls into xfs_xattr_get() which attempts to get the
      lock again. E.g.:
      
      xfs_io          D ffffffff81c134c0  4096  3561   3560 0x00000080
      ffff8801176a1a68 0000000000000046 ffff8800b401b540 ffff8801176a1fd8
      00000000001d5800 00000000001d5800 ffff8800b401b540 ffff8800b401b540
      ffff8800b73a6bd0 fffffffeffffffff ffff8800b73a6bd8 ffff8800b5ddb480
      Call Trace:
      [<ffffffff8177f969>] schedule+0x29/0x70
      [<ffffffff81783a65>] rwsem_down_read_failed+0xc5/0x120
      [<ffffffffa05aa97f>] ? xfs_ilock_attr_map_shared+0x1f/0x50 [xfs]
      [<ffffffff813b3434>] call_rwsem_down_read_failed+0x14/0x30
      [<ffffffff810ed179>] ? down_read_nested+0x89/0xa0
      [<ffffffffa05aa7f2>] ? xfs_ilock+0x122/0x250 [xfs]
      [<ffffffffa05aa7f2>] xfs_ilock+0x122/0x250 [xfs]
      [<ffffffffa05aa97f>] xfs_ilock_attr_map_shared+0x1f/0x50 [xfs]
      [<ffffffffa05701d0>] xfs_attr_get+0x90/0xe0 [xfs]
      [<ffffffffa0565e07>] xfs_xattr_get+0x37/0x50 [xfs]
      [<ffffffff8124842f>] generic_getxattr+0x4f/0x70
      [<ffffffff8133fd9e>] inode_doinit_with_dentry+0x1ae/0x650
      [<ffffffff81340e0c>] selinux_d_instantiate+0x1c/0x20
      [<ffffffff813351bb>] security_d_instantiate+0x1b/0x30
      [<ffffffff81237db0>] d_instantiate+0x50/0x70
      [<ffffffff81237e85>] d_tmpfile+0xb5/0xc0
      [<ffffffffa05add02>] xfs_create_tmpfile+0x362/0x410 [xfs]
      [<ffffffffa0559ac8>] xfs_vn_tmpfile+0x18/0x20 [xfs]
      [<ffffffff81230388>] path_openat+0x228/0x6a0
      [<ffffffff810230f9>] ? sched_clock+0x9/0x10
      [<ffffffff8105a427>] ? kvm_clock_read+0x27/0x40
      [<ffffffff8124054f>] ? __alloc_fd+0xaf/0x1f0
      [<ffffffff8123101a>] do_filp_open+0x3a/0x90
      [<ffffffff817845e7>] ? _raw_spin_unlock+0x27/0x40
      [<ffffffff8124054f>] ? __alloc_fd+0xaf/0x1f0
      [<ffffffff8121e3ce>] do_sys_open+0x12e/0x210
      [<ffffffff8121e4ce>] SyS_open+0x1e/0x20
      [<ffffffff8178eda9>] system_call_fastpath+0x16/0x1b
      
      xfs_vn_tmpfile() also fails to initialize security on the newly created
      inode.
      
      Pull the d_tmpfile() call up into xfs_vn_tmpfile() after the transaction
      has been committed and the inode unlocked. Also, initialize security on
      the inode based on the parent directory provided via the tmpfile call.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      330033d6
  18. 14 4月, 2014 1 次提交
  19. 07 1月, 2014 3 次提交
  20. 19 12月, 2013 3 次提交
  21. 13 12月, 2013 3 次提交