1. 25 Mar, 2015 5 commits
    • xfs: add RENAME_WHITEOUT support · 7dcf5c3e
      Dave Chinner authored
      Whiteouts are used by overlayfs -  it has a crazy convention that a
      whiteout is a character device inode with a major:minor of 0:0.
      Because it's not documented anywhere, here's an example of what
      RENAME_WHITEOUT does on ext4:
      
      # echo foo > /mnt/scratch/foo
      # echo bar > /mnt/scratch/bar
      # ls -l /mnt/scratch
      total 24
      -rw-r--r-- 1 root root     4 Feb 11 20:22 bar
      -rw-r--r-- 1 root root     4 Feb 11 20:22 foo
      drwx------ 2 root root 16384 Feb 11 20:18 lost+found
      # src/renameat2 -w /mnt/scratch/foo /mnt/scratch/bar
      # ls -l /mnt/scratch
      total 20
      -rw-r--r-- 1 root root     4 Feb 11 20:22 bar
      c--------- 1 root root  0, 0 Feb 11 20:23 foo
      drwx------ 2 root root 16384 Feb 11 20:18 lost+found
      # cat /mnt/scratch/bar
      foo
      #
      
      In XFS rename terms, the operation that has been done is that source
      (foo) has been moved to the target (bar), which is like a normal
      rename operation, but rather than the source being removed, it has
      been replaced with a whiteout.
      
      We can't allocate whiteout inodes within the rename transaction due
      to allocation being a multi-commit transaction: rename needs to
      be a single, atomic commit. Hence we have several options here, from
      most efficient to least efficient:
      
          - use DT_WHT in the target dirent and do no whiteout inode
            allocation.  The main issue with this approach is that we need
            hooks in lookup to create a virtual chardev inode to present
            to userspace and in places where we might need to modify the
            dirent e.g. unlink.  Overlayfs also needs to be taught about
            DT_WHT. Most invasive change, lowest overhead.
      
          - create a special whiteout inode in the root directory (e.g. a
            ".wino" dirent) and then hardlink every new whiteout to it.
            This means we only need to create a single whiteout inode, and
            rename simply creates a hardlink to it. We can use DT_WHT for
            these, though using DT_CHR means we won't have to modify
            overlayfs, nor anything in userspace. Downside is we have to
            look up the whiteout inode on every operation and create it if
            it doesn't exist.
      
          - copy ext4: create a special whiteout chardev inode for every
            whiteout.  This is more complex than the above options because
            of the lack of atomicity between inode creation and the rename
            operation, requiring us to create a tmpfile inode and then
            link it into the directory structure during the rename. At
            least with a tmpfile inode, crashes between the create and
            rename don't leave unreferenced inodes or directory
            pollution around.
      
      By far the simplest thing to do in the short term is to copy ext4.
      While it is the most inefficient way of supporting whiteouts, as
      an initial implementation we can simply reuse existing functions and
      add a small amount of extra code to the rename operation.
      
      When we get full whiteout support in the VFS (via the dentry cache)
      we can then look to supporting the DT_WHT method outlined as the
      first method of supporting whiteouts. But until then, we'll stick with
      what overlayfs expects us to be: dumb and stupid.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
    • xfs: make xfs_cross_rename() complete fully · eeacd321
      Dave Chinner authored
      Now that xfs_finish_rename() exists, there is no reason for
      xfs_cross_rename() to return to xfs_rename() to finish off the
      rename transaction. Drive the completion code into
      xfs_cross_rename() and handle all errors there so as to simplify
      the xfs_rename() code.
      
      Further, push the rename exchange target_ip check to early in the
      rename code so as to make the error handling easy and obviously
      correct.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: factor out xfs_finish_rename() · 310606b0
      Dave Chinner authored
      Rather than use a jump label for the final transaction commit in
      the rename, factor it into a simple helper function and call it
      appropriately. This slightly reduces the spaghetti nature of
      xfs_rename.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: cleanup xfs_rename error handling · 445883e8
      Dave Chinner authored
      The jump labels are ambiguous and unclear and some of the error
      paths are used inconsistently. Rules for error jumps are:
      
      - use out_trans_cancel for unmodified transaction context
      - use out_bmap_cancel on ENOSPC errors
      - use out_trans_abort when transaction is likely to be dirty.
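
The three rules above can be sketched as a chain of fall-through labels. This is a toy model only: struct toy_trans, enum toy_fail and toy_rename() are hypothetical stand-ins, and the assumption that each label falls through into the next is made for illustration rather than taken from the patch.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a transaction; not the XFS structure. */
struct toy_trans {
    bool bmap_active;   /* a block-freeing list is live and needs cancelling */
    bool dirty;         /* the transaction has modified something */
    bool aborted;       /* cancelled with the abort flag set */
    bool cancelled;
    bool committed;
};

/* Where the toy operation should fail, if at all. */
enum toy_fail { TOY_FAIL_NONE, TOY_FAIL_EARLY, TOY_FAIL_NOSPC, TOY_FAIL_DIRTY };

int toy_rename(struct toy_trans *tp, enum toy_fail fail)
{
    bool abort = false;
    int error;

    /* Nothing has been touched yet: plain cancel. */
    if (fail == TOY_FAIL_EARLY) {
        error = -ENOMEM;
        goto out_trans_cancel;
    }

    tp->bmap_active = true;

    /* Space check failed before any modification was made. */
    if (fail == TOY_FAIL_NOSPC) {
        error = -ENOSPC;
        goto out_bmap_cancel;
    }

    tp->dirty = true;

    /* Failure after modifications: the transaction must be aborted. */
    if (fail == TOY_FAIL_DIRTY) {
        error = -EIO;
        goto out_trans_abort;
    }

    tp->bmap_active = false;
    tp->committed = true;
    return 0;

out_trans_abort:
    abort = true;
out_bmap_cancel:
    tp->bmap_active = false;
out_trans_cancel:
    tp->aborted = abort;
    tp->cancelled = true;
    return error;
}
```

Ordering the labels from "most cleanup" to "least cleanup" means each error site only has to pick the label matching how far the operation got.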
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: clean up inode locking for RENAME_WHITEOUT · 95afcf5c
      Dave Chinner authored
      When doing RENAME_WHITEOUT, we now have to lock 5 inodes into the
      rename transaction. This means we need to update
      xfs_sort_for_rename() and xfs_lock_inodes() to handle up to 5
      inodes. Because of the vagaries of rename, this means we could have
      anywhere between 3 and 5 inodes locked into the transaction....
      
      While xfs_lock_inodes() does not need anything other than an assert
      telling us we are passing more inodes than we ever thought we should
      see, it could do with a logic rework to remove all the indenting.
      This is not a functional change - it just makes the code a lot
      easier to read.
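
The idea of sorting a small set of inodes into a fixed lock order can be sketched as follows. This is a simplified model under stated assumptions: struct toy_inode, TOY_MAX_LOCK_INODES and toy_sort_for_lock() are hypothetical names, ordering by inode number is assumed as the sort key, and duplicate entries (for example when the source and target directory are the same inode) are collapsed so each inode would be locked once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_LOCK_INODES 5

/* Hypothetical inode stand-in: only the inode number matters here. */
struct toy_inode {
    uint64_t ino;
};

/*
 * Sort tab[] in place by ascending inode number, dropping NULL slots
 * and duplicate pointers. Returns the number of distinct inodes left
 * at the front of tab[], in the order they would be locked.
 */
size_t toy_sort_for_lock(struct toy_inode *tab[], size_t n)
{
    size_t i, j, out = 0;

    /* Mirrors the "more inodes than we ever expected" assert. */
    assert(n <= TOY_MAX_LOCK_INODES);

    /* Compact away optional inodes that were not supplied. */
    for (i = 0; i < n; i++)
        if (tab[i])
            tab[out++] = tab[i];
    n = out;
    if (n == 0)
        return 0;

    /* Insertion sort: the array is tiny, so simplicity wins. */
    for (i = 1; i < n; i++) {
        struct toy_inode *key = tab[i];

        for (j = i; j > 0 && tab[j - 1]->ino > key->ino; j--)
            tab[j] = tab[j - 1];
        tab[j] = key;
    }

    /* Equal pointers are now adjacent; keep one of each. */
    out = 1;
    for (i = 1; i < n; i++)
        if (tab[i] != tab[out - 1])
            tab[out++] = tab[i];
    return out;
}
```

Locking in one globally consistent order is what prevents two concurrent renames from deadlocking against each other on the same pair of inodes.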
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  2. 24 Feb, 2015 1 commit
  3. 23 Feb, 2015 2 commits
    • xfs: inodes are new until the dentry cache is set up · 58c90473
      Dave Chinner authored
      Al Viro noticed a generic set of issues to do with filehandle lookup
      racing with dentry cache setup. They involve a filehandle lookup
      occurring while an inode is being created and the filehandle lookup
      racing with the dentry creation for the real file. This can lead to
      multiple dentries for the one path being instantiated. There are a
      host of other issues around this same set of paths.
      
      The underlying cause is that file handle lookup only waits on inode
      cache instantiation rather than full dentry cache instantiation. XFS
      is mostly immune to the problems discovered due to its own internal
      inode cache, but there are a couple of corner cases where races can
      happen.
      
      We currently clear the XFS_INEW flag when the inode is fully set up
      after insertion into the cache. Newly allocated inodes are inserted
      locked and so aren't usable until the allocation transaction
      commits. This, however, occurs before the dentry and security
      information is fully initialised and hence the inode is unlocked and
      available for lookups to find too early.
      
      To solve the problem, only clear the XFS_INEW flag for newly created
      inodes once the dentry is fully instantiated. This means lookups
      will retry until the XFS_INEW flag is removed from the inode and
      hence avoids the race conditions in question.
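
The retry behaviour described above can be modelled in a few lines. This is a single-threaded toy, not the XFS inode cache: TOY_INEW, struct toy_cached_inode and the functions below are hypothetical, and the creator finishing its setup is simulated by a parameter rather than by a second thread.

```c
#include <errno.h>

#define TOY_INEW 0x1u   /* hypothetical "still being set up" flag */

struct toy_cached_inode {
    unsigned int flags;
};

/* A cache hit on an inode that is still new asks the caller to retry. */
int toy_cache_hit(const struct toy_cached_inode *ip)
{
    return (ip->flags & TOY_INEW) ? -EAGAIN : 0;
}

/* Called by the creator once the dentry is fully instantiated. */
void toy_finish_setup(struct toy_cached_inode *ip)
{
    ip->flags &= ~TOY_INEW;
}

/*
 * Retry the lookup up to max_tries times. creator_done_at is the
 * attempt number on which the simulated creator clears the flag
 * (0 means it never does). *tries reports how many attempts ran.
 */
int toy_lookup_retry(struct toy_cached_inode *ip, int max_tries,
                     int creator_done_at, int *tries)
{
    int error = -EAGAIN;

    *tries = 0;
    while (*tries < max_tries) {
        (*tries)++;
        if (*tries == creator_done_at)
            toy_finish_setup(ip);
        error = toy_cache_hit(ip);
        if (error != -EAGAIN)
            break;
    }
    return error;
}
```

The point of the model is that a lookup cannot succeed on the inode until the creator has cleared the flag, however many times it retries.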
      
      This also means that xfs_create(), xfs_create_tmpfile() and
      xfs_symlink() need to finish the setup of the inode in their error
      paths if we had allocated the inode but failed later in the creation
      process. xfs_symlink(), in particular, needed a lot of help to make
      its error handling match that of xfs_create().
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: introduce mmap/truncate lock · 653c60b6
      Dave Chinner authored
      Right now we cannot serialise mmap against truncate or hole punch
      sanely. ->page_mkwrite is not able to take locks that the read IO
      path normally takes (i.e. the inode iolock) because that could
      result in lock inversions (read - iolock - page fault - page_mkwrite
      - iolock) and so we cannot use an IO path lock to serialise page
      write faults against truncate operations.
      
      Instead, introduce a new lock that is used *only* in the
      ->page_mkwrite path that is the equivalent of the iolock. The lock
      ordering in a page fault is i_mmaplock -> page lock -> i_ilock,
      and so in truncate we can take i_iolock -> i_mmaplock and so lock
      out new write faults during the process of truncation.

      Because i_mmaplock is outside the page lock, we can hold it across
      all the same operations we hold the i_iolock for. The only
      difference is that we never hold the i_mmaplock in the normal IO
      path and so do not ever have the possibility that we can page fault
      inside it. Hence there are no recursion issues on the i_mmaplock
      and so we can use it to serialise page fault IO against inode
      modification operations that affect the IO path.
      
      This patch introduces the i_mmaplock infrastructure, lockdep
      annotations and initialisation/destruction code. Use of the new lock
      will be in subsequent patches.
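
The lock ordering described in this message can be illustrated with a small single-threaded model. Everything here is hypothetical: struct toy_rwlock is a non-blocking state tracker rather than a real kernel rwsem, and the toy_* functions only show who takes which lock in which order.

```c
#include <stdbool.h>

/* Non-blocking reader/writer lock state, for modelling only. */
struct toy_rwlock {
    int readers;
    bool writer;
};

static bool toy_try_shared(struct toy_rwlock *l)
{
    if (l->writer)
        return false;
    l->readers++;
    return true;
}

static bool toy_try_excl(struct toy_rwlock *l)
{
    if (l->writer || l->readers > 0)
        return false;
    l->writer = true;
    return true;
}

static void toy_unlock_shared(struct toy_rwlock *l) { l->readers--; }
static void toy_unlock_excl(struct toy_rwlock *l) { l->writer = false; }

/* Hypothetical inode carrying both locks. */
struct toy_lock_inode {
    struct toy_rwlock iolock;
    struct toy_rwlock mmaplock;
};

/* Truncate: iolock first, then mmaplock, both exclusive. */
bool toy_truncate_begin(struct toy_lock_inode *ip)
{
    if (!toy_try_excl(&ip->iolock))
        return false;
    if (!toy_try_excl(&ip->mmaplock)) {
        toy_unlock_excl(&ip->iolock);   /* back out in reverse order */
        return false;
    }
    return true;
}

void toy_truncate_end(struct toy_lock_inode *ip)
{
    toy_unlock_excl(&ip->mmaplock);
    toy_unlock_excl(&ip->iolock);
}

/* Write fault: only the mmaplock, shared; the iolock is never touched. */
bool toy_write_fault_begin(struct toy_lock_inode *ip)
{
    return toy_try_shared(&ip->mmaplock);
}

void toy_write_fault_end(struct toy_lock_inode *ip)
{
    toy_unlock_shared(&ip->mmaplock);
}
```

Because the fault path never takes the iolock, the read - iolock - page fault - page_mkwrite - iolock inversion cannot occur, while truncate still excludes new write faults through the second lock.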
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  4. 22 Jan, 2015 1 commit
  5. 24 Dec, 2014 1 commit
  6. 04 Dec, 2014 1 commit
  7. 28 Nov, 2014 3 commits
  8. 02 Oct, 2014 3 commits
  9. 09 Sep, 2014 1 commit
  10. 04 Aug, 2014 1 commit
  11. 25 Jun, 2014 1 commit
    • xfs: global error sign conversion · 2451337d
      Dave Chinner authored
      Convert all the errors in the core XFS code to negative error signs
      like the rest of the kernel and remove all the sign conversion we
      do in the interface layers.
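
The convention being adopted can be sketched as below. The function names are hypothetical and the "interface layer" is reduced to a pass-through; the only point illustrated is that the core helper already returns a negative errno value, so no caller has to flip the sign.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical core helper: 0 on success, negative errno on failure. */
int toy_core_lookup(const char *name)
{
    if (name == NULL || name[0] == '\0')
        return -EINVAL;
    if (strcmp(name, "missing") == 0)
        return -ENOENT;
    return 0;
}

/*
 * Hypothetical interface-layer wrapper. With positive errors in the
 * core this would have been "return -toy_core_lookup(name);". After
 * the conversion the value is passed through unchanged.
 */
int toy_vfs_lookup(const char *name)
{
    return toy_core_lookup(name);
}
```

Callers can keep testing `if (error)` either way; what changes is that comparisons against specific codes must now use the negative form, such as `error == -ENOENT`.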
      
      Errors for conversion (and comparison) found via searches like:
      
      $ git grep " E" fs/xfs
      $ git grep "return E" fs/xfs
      $ git grep " E[A-Z].*;$" fs/xfs
      
      Negation points found via searches like:
      
      $ git grep "= -[a-z,A-Z]" fs/xfs
      $ git grep "return -[a-z,A-D,F-Z]" fs/xfs
      $ git grep " -[a-z].*;" fs/xfs
      
      [ with some bits I missed from Brian Foster ]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  12. 22 Jun, 2014 1 commit
  13. 20 May, 2014 1 commit
    • xfs: turn NLINK feature on by default · 263997a6
      Dave Chinner authored
      mkfs has turned on the XFS_SB_VERSION_NLINKBIT feature bit by
      default since November 2007. It's about time we simply made the
      kernel code turn it on by default and so always convert v1 inodes to
      v2 inodes when reading them in from disk or allocating them. This
      removes needless version checks and modification when bumping
      link counts on inodes, and will take code out of a few common code
      paths.
      
         text    data     bss     dec     hex filename
       783251  100867     616  884734   d7ffe fs/xfs/xfs.o.orig
       782664  100867     616  884147   d7db3 fs/xfs/xfs.o.patched
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  14. 24 Apr, 2014 1 commit
  15. 23 Apr, 2014 3 commits
  16. 17 Apr, 2014 1 commit
    • xfs: fix tmpfile/selinux deadlock and initialize security · 330033d6
      Brian Foster authored
      xfstests generic/004 reproduces an ilock deadlock using the tmpfile
      interface when selinux is enabled. This occurs because
      xfs_create_tmpfile() takes the ilock and then calls d_tmpfile(). The
      latter eventually calls into xfs_xattr_get() which attempts to get the
      lock again. E.g.:
      
      xfs_io          D ffffffff81c134c0  4096  3561   3560 0x00000080
      ffff8801176a1a68 0000000000000046 ffff8800b401b540 ffff8801176a1fd8
      00000000001d5800 00000000001d5800 ffff8800b401b540 ffff8800b401b540
      ffff8800b73a6bd0 fffffffeffffffff ffff8800b73a6bd8 ffff8800b5ddb480
      Call Trace:
      [<ffffffff8177f969>] schedule+0x29/0x70
      [<ffffffff81783a65>] rwsem_down_read_failed+0xc5/0x120
      [<ffffffffa05aa97f>] ? xfs_ilock_attr_map_shared+0x1f/0x50 [xfs]
      [<ffffffff813b3434>] call_rwsem_down_read_failed+0x14/0x30
      [<ffffffff810ed179>] ? down_read_nested+0x89/0xa0
      [<ffffffffa05aa7f2>] ? xfs_ilock+0x122/0x250 [xfs]
      [<ffffffffa05aa7f2>] xfs_ilock+0x122/0x250 [xfs]
      [<ffffffffa05aa97f>] xfs_ilock_attr_map_shared+0x1f/0x50 [xfs]
      [<ffffffffa05701d0>] xfs_attr_get+0x90/0xe0 [xfs]
      [<ffffffffa0565e07>] xfs_xattr_get+0x37/0x50 [xfs]
      [<ffffffff8124842f>] generic_getxattr+0x4f/0x70
      [<ffffffff8133fd9e>] inode_doinit_with_dentry+0x1ae/0x650
      [<ffffffff81340e0c>] selinux_d_instantiate+0x1c/0x20
      [<ffffffff813351bb>] security_d_instantiate+0x1b/0x30
      [<ffffffff81237db0>] d_instantiate+0x50/0x70
      [<ffffffff81237e85>] d_tmpfile+0xb5/0xc0
      [<ffffffffa05add02>] xfs_create_tmpfile+0x362/0x410 [xfs]
      [<ffffffffa0559ac8>] xfs_vn_tmpfile+0x18/0x20 [xfs]
      [<ffffffff81230388>] path_openat+0x228/0x6a0
      [<ffffffff810230f9>] ? sched_clock+0x9/0x10
      [<ffffffff8105a427>] ? kvm_clock_read+0x27/0x40
      [<ffffffff8124054f>] ? __alloc_fd+0xaf/0x1f0
      [<ffffffff8123101a>] do_filp_open+0x3a/0x90
      [<ffffffff817845e7>] ? _raw_spin_unlock+0x27/0x40
      [<ffffffff8124054f>] ? __alloc_fd+0xaf/0x1f0
      [<ffffffff8121e3ce>] do_sys_open+0x12e/0x210
      [<ffffffff8121e4ce>] SyS_open+0x1e/0x20
      [<ffffffff8178eda9>] system_call_fastpath+0x16/0x1b
      
      xfs_vn_tmpfile() also fails to initialize security on the newly created
      inode.
      
      Pull the d_tmpfile() call up into xfs_vn_tmpfile() after the transaction
      has been committed and the inode unlocked. Also, initialize security on
      the inode based on the parent directory provided via the tmpfile call.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  17. 14 Apr, 2014 1 commit
  18. 07 Jan, 2014 3 commits
  19. 19 Dec, 2013 3 commits
  20. 13 Dec, 2013 3 commits
  21. 05 Nov, 2013 1 commit
    • xfs: xfs_remove deadlocks due to inverted AGF vs AGI lock ordering · 27320369
      Dave Chinner authored
      Removing an inode from the namespace involves removing the directory
      entry and dropping the link count on the inode. Removing the
      directory entry can result in locking an AGF (directory blocks were
      freed) and removing a link count can result in placing the inode on
      an unlinked list which results in locking an AGI.
      
      The big problem here is that we have an ordering constraint on AGF
      and AGI locking - inode allocation locks the AGI, then can allocate
      a new extent for new inodes, locking the AGF after the AGI.
      Similarly, freeing the inode removes the inode from the unlinked
      list, requiring that we lock the AGI first, and then freeing the
      inode can result in an inode chunk being freed and hence freeing
      disk space requiring that we lock an AGF.
      
      Hence the ordering that is imposed by other parts of the code is AGI
      before AGF. This means we cannot remove the directory entry before
      we drop the inode reference count and put it on the unlinked list as
      this results in a lock order of AGF then AGI, and this can deadlock
      against inode allocation and freeing. Therefore we must drop the
      link counts before we remove the directory entry.
      
      This is still safe from a transactional point of view - it is not
      until we get to xfs_bmap_finish() that we have the possibility of
      multiple transactions in this operation. Hence as long as we remove
      the directory entry and drop the link count in the first transaction
      of the remove operation, there are no transactional constraints on
      the ordering here.
      
      Change the ordering of the operations in the xfs_remove() function
      to align the ordering of AGI and AGF locking to match that of the
      rest of the code.
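
The ordering constraint argued above (AGI before AGF) can be expressed as a tiny lock-rank checker. This is a toy model with hypothetical names; it does not lock anything, it only records whether a sequence of acquisitions ever takes a lower-ranked lock after a higher-ranked one.

```c
#include <stdbool.h>

/* Rank encodes the required order: AGI must be taken before AGF. */
enum toy_ag_lock { TOY_AGI = 1, TOY_AGF = 2 };

struct toy_lock_ctx {
    int highest;     /* highest rank acquired so far */
    bool violated;   /* set when a lower rank follows a higher one */
};

void toy_acquire(struct toy_lock_ctx *c, enum toy_ag_lock l)
{
    if ((int)l < c->highest)
        c->violated = true;
    if ((int)l > c->highest)
        c->highest = (int)l;
}

/* Old order: remove the directory entry first, then drop the link. */
void toy_remove_old_order(struct toy_lock_ctx *c)
{
    toy_acquire(c, TOY_AGF);   /* freeing a directory block */
    toy_acquire(c, TOY_AGI);   /* inode goes onto the unlinked list */
}

/* New order: drop the link count first, then remove the entry. */
void toy_remove_new_order(struct toy_lock_ctx *c)
{
    toy_acquire(c, TOY_AGI);
    toy_acquire(c, TOY_AGF);
}
```

Running both sequences through the checker flags only the old order, which is the inversion against inode allocation and freeing that the message describes.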
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
  22. 24 Oct, 2013 2 commits
    • xfs: decouple inode and bmap btree header files · a4fbe6ab
      Dave Chinner authored
      Currently the xfs_inode.h header has a dependency on the definition
      of the BMAP btree records as the inode fork includes an array of
      xfs_bmbt_rec_host_t objects in its definition.
      
      Move all the btree format definitions from xfs_btree.h,
      xfs_bmap_btree.h, xfs_alloc_btree.h and xfs_ialloc_btree.h to
      xfs_format.h to continue the process of centralising the on-disk
      format definitions. With this done, the xfs inode definitions are no
      longer dependent on btree header files.
      
      This enables a massive culling of unnecessary includes, with close to
      200 #include directives removed from the XFS kernel code base.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: decouple log and transaction headers · 239880ef
      Dave Chinner authored
      xfs_trans.h has a dependency on xfs_log.h for a couple of
      structures. Most code that does transactions doesn't need to know
      anything about the log, but this dependency means that they have to
      include xfs_log.h. Decouple the xfs_trans.h and xfs_log.h header
      files and clean up the includes to be in dependency order.
      
      In doing this, remove the direct include of xfs_trans_reserve.h from
      xfs_trans.h so that we remove the dependency between xfs_trans.h and
      xfs_mount.h. Hence the xfs_trans.h include can be moved to
      indicate the actual dependencies other header files have on it.
      
      Note that these are kernel only header files, so this does not
      translate to any userspace changes at all.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>