1. 19 Aug, 2015 1 commit
    • xfs: clean up inode lockdep annotations · 0952c818
      Committed by Dave Chinner
      Lockdep annotations are a maintenance nightmare. Locking has to be
      modified to suit the limitations of the annotations, and we're
      always having to fix the annotations because they are unable to
      express the complexity of locking hierarchies correctly.
      
      So, next up, we've got more issues with lockdep annotations for
      inode locking w.r.t. XFS_LOCK_PARENT:
      
      	- lockdep classes are exclusive and can't be ORed together
      	  to form new classes.
      	- IOLOCK needs multiple PARENT subclasses to express the
      	  changes needed for the readdir locking rework needed to
      	  stop the endless flow of lockdep false positives involving
      	  readdir calling filldir under the ILOCK.
      	- there are only 8 unique lockdep subclasses available,
      	  so we can't create a generic solution.
      
      IOWs we need to treat the 3-bit space available to each lock type
      differently:
      
      	- IOLOCK uses xfs_lock_two_inodes(), so needs:
      		- at least 2 IOLOCK subclasses
      		- at least 2 IOLOCK_PARENT subclasses
      	- MMAPLOCK uses xfs_lock_two_inodes(), so needs:
      		- at least 2 MMAPLOCK subclasses
      	- ILOCK uses xfs_lock_inodes with up to 5 inodes, so needs:
      		- at least 5 ILOCK subclasses
      		- one ILOCK_PARENT subclass
      		- one RTBITMAP subclass
      		- one RTSUM subclass
      
      For the IOLOCK, split the space into two sets of subclasses.
      For the MMAPLOCK, just use half the space for the one subclass to
      match the non-parent lock classes of the IOLOCK.
      For the ILOCK, use 0-4 as the ILOCK subclasses, 5-7 for the
      remaining individual subclasses.
      
      Because they are now all different, modify xfs_lock_inumorder() to
      handle the nested subclasses, and to assert fail if passed an
      invalid subclass. Further, annotate xfs_lock_inodes() to assert fail
      if an invalid combination of lock primitives and inode counts is
      passed that would result in a lockdep subclass annotation overflow.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
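
The per-lock-type split of the 3-bit subclass space described in this commit can be modelled roughly as below. This is a standalone sketch, not the kernel code: every identifier is hypothetical, the exact placement of the PARENT, RTBITMAP and RTSUM values within 5-7 is an assumption, and so is the even 0-3 / 4-7 split for the IOLOCK. Where the commit makes xfs_lock_inumorder() assert on an invalid subclass, the sketch returns -1 so the rejection is easy to observe.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: every lock type has its own 3-bit subclass space. */
enum lock_type { LT_IOLOCK, LT_MMAPLOCK, LT_ILOCK };

#define SUBCLASS_BITS       3
#define MAX_SUBCLASS        ((1 << SUBCLASS_BITS) - 1)  /* 7 */

/* IOLOCK: assumed even split, 0-3 for normal locks, 4-7 for PARENT locks. */
#define IOLOCK_MAX_ORDER    3
#define IOLOCK_PARENT_BASE  4

/* MMAPLOCK: only half the space (0-3), matching the non-parent IOLOCK range. */
#define MMAPLOCK_MAX_ORDER  3

/* ILOCK: 0-4 for inode-number ordering, 5-7 for the individual subclasses. */
#define ILOCK_MAX_ORDER     4
#define ILOCK_PARENT        5   /* assumed position within 5-7 */
#define ILOCK_RTBITMAP      6   /* assumed position within 5-7 */
#define ILOCK_RTSUM         7   /* assumed position within 5-7 */

/*
 * Return the subclass for the order-th inode (0-based) locked in
 * inode-number order, or -1 if the request does not fit the space
 * reserved for that lock type.
 */
int nested_subclass(enum lock_type type, bool parent, int order)
{
    int sub = -1;

    if (order < 0)
        return -1;

    switch (type) {
    case LT_IOLOCK:
        if (order <= IOLOCK_MAX_ORDER)
            sub = parent ? IOLOCK_PARENT_BASE + order : order;
        break;
    case LT_MMAPLOCK:
        /* No PARENT subclasses exist for the MMAPLOCK. */
        if (!parent && order <= MMAPLOCK_MAX_ORDER)
            sub = order;
        break;
    case LT_ILOCK:
        if (parent) {
            /* Only a single ILOCK_PARENT subclass is available. */
            if (order == 0)
                sub = ILOCK_PARENT;
        } else if (order <= ILOCK_MAX_ORDER) {
            sub = order;
        }
        break;
    }

    /* Whatever was chosen must still fit in the 3-bit space. */
    assert(sub <= MAX_SUBCLASS);
    return sub;
}
```

In this model a sixth ILOCK in one ordered set (order 5) is rejected, which is the kind of overflow the commit guards against in xfs_lock_inodes().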
  2. 23 Feb, 2015 3 commits
    • xfs: inodes are new until the dentry cache is set up · 58c90473
      Committed by Dave Chinner
      Al Viro noticed a generic set of issues to do with filehandle lookup
      racing with dentry cache setup. They involve a filehandle lookup
      occurring while an inode is being created and the filehandle lookup
      racing with the dentry creation for the real file. This can lead to
      multiple dentries for the one path being instantiated. There are a
      host of other issues around this same set of paths.
      
      The underlying cause is that file handle lookup only waits on inode
      cache instantiation rather than full dentry cache instantiation. XFS
      is mostly immune to the problems discovered due to its own internal
      inode cache, but there are a couple of corner cases where races can
      happen.
      
      We currently clear the XFS_INEW flag when the inode is fully set up
      after insertion into the cache. Newly allocated inodes are inserted
      locked and so aren't usable until the allocation transaction
      commits. This, however, occurs before the dentry and security
      information is fully initialised and hence the inode is unlocked and
      available for lookups to find too early.
      
      To solve the problem, only clear the XFS_INEW flag for newly created
      inodes once the dentry is fully instantiated. This means lookups
      will retry until the XFS_INEW flag is removed from the inode and
      hence avoids the race conditions in question.
      
      This also means that xfs_create(), xfs_create_tmpfile() and
      xfs_symlink() need to finish the setup of the inode in their error
      paths if we had allocated the inode but failed later in the creation
      process. xfs_symlink(), in particular, needed a lot of help to make
      its error handling match that of xfs_create().
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: ensure truncate forces zeroed blocks to disk · 5885ebda
      Committed by Dave Chinner
      A new fsync vs power fail test in xfstests indicated that XFS can
      have unreliable data consistency when doing extending truncates that
      require block zeroing. The blocks beyond EOF get zeroed in memory,
      but we never force those changes to disk before we run the
      transaction that extends the file size and exposes those blocks to
      userspace. This can result in the blocks not being correctly zeroed
      after a crash.
      
      Because in-memory behaviour is correct, tools like fsx don't pick up
      any coherency problems - it's not until the filesystem is shut down
      or the system crashes after writing the truncate transaction to the
      journal but before the zeroed data in the page cache is flushed that
      the issue is exposed.
      
      Fix this by also flushing the dirty in-memory data in the region between
      the old size and new size when we've found blocks that need zeroing
      in the truncate process.
      Reported-by: Liu Bo <bo.li.liu@oracle.com>
      cc: <stable@vger.kernel.org>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: introduce mmap/truncate lock · 653c60b6
      Committed by Dave Chinner
      Right now we cannot serialise mmap against truncate or hole punch
      sanely. ->page_mkwrite is not able to take locks that the read IO
      path normally takes (i.e. the inode iolock) because that could
      result in lock inversions (read - iolock - page fault - page_mkwrite
      - iolock) and so we cannot use an IO path lock to serialise page
      write faults against truncate operations.
      
      Instead, introduce a new lock that is used *only* in the
      ->page_mkwrite path that is the equivalent of the iolock. The lock
      ordering in a page fault is i_mmaplock -> page lock -> i_ilock,
      and so in truncate we can take i_iolock -> i_mmaplock and so lock out
      new write faults during the process of truncation.
      
      Because i_mmaplock is outside the page lock, we can hold it across
      all the same operations we hold the i_iolock for. The only
      difference is that we never hold the i_mmaplock in the normal IO
      path and so do not ever have the possibility that we can page fault
      inside it. Hence there are no recursion issues on the i_mmaplock
      and so we can use it to serialise page fault IO against inode
      modification operations that affect the IO path.
      
      This patch introduces the i_mmaplock infrastructure, lockdep
      annotations and initialisation/destruction code. Use of the new lock
      will be in subsequent patches.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  3. 02 Feb, 2015 1 commit
  4. 24 Dec, 2014 1 commit
  5. 28 Nov, 2014 1 commit
  6. 02 Oct, 2014 1 commit
    • xfs: check for inode size overflow in xfs_new_eof() · ce57bcf6
      Committed by Brian Foster
      If we write to the maximum file offset (2^63-2), XFS fails to log the
      inode size update when the page is flushed. For example:
      
      $ xfs_io -fc "pwrite `echo "2^63-1-1" | bc` 1" /mnt/file
      wrote 1/1 bytes at offset 9223372036854775806
      1.000000 bytes, 1 ops; 0.0000 sec (22.711 KiB/sec and 23255.8140 ops/sec)
      $ stat -c %s /mnt/file
      9223372036854775807
      $ umount /mnt ; mount <dev> /mnt/
      $ stat -c %s /mnt/file
      0
      
      This occurs because XFS calculates the new file size as io_offset +
      io_size, I/O occurs in block sized requests, and the maximum supported
      file size is not block aligned. Therefore, a write to the max allowable
      offset on a 4k blocksize fs results in a write of size 4k to offset
      2^63-4096 (e.g., equivalent to round_down(2^63-1, 4096), or IOW the
      offset of the block that contains the max file size). The offset plus
      size calculation (2^63 - 4096 + 4096 == 2^63) overflows the signed
      64-bit variable which goes negative and causes the > comparison to the
      on-disk inode size to fail. This returns 0 from xfs_new_eof() and
      results in no change to the inode on-disk.
      
      Update xfs_new_eof() to explicitly detect overflow of the local
      calculation and use the VFS inode size in this scenario. The VFS inode
      size is capped to the maximum and thus XFS writes the correct inode size
      to disk.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
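
The overflow described in this commit, and the fallback to the VFS inode size, can be sketched in standalone C as follows. This is a hypothetical model rather than the actual xfs_new_eof(): the function name and parameters are invented for illustration, and clamping a non-overflowing result to the VFS size is an assumption of the sketch. Because signed overflow is undefined behaviour in C, the check is done before the addition instead of testing for a negative result afterwards.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the end-of-file calculation.
 * Returns the size that should be logged to disk, or 0 if the
 * on-disk size does not need to change.
 */
int64_t new_eof(int64_t io_offset, int64_t io_size,
                int64_t vfs_size, int64_t disk_size)
{
    int64_t new_size;

    /* The VFS size is expected to be capped to a valid maximum. */
    assert(vfs_size >= 0);

    if (io_offset < 0 || io_size < 0 ||
        io_offset > INT64_MAX - io_size) {
        /* io_offset + io_size would overflow: use the VFS size. */
        new_size = vfs_size;
    } else {
        new_size = io_offset + io_size;
        /* Assumption: never report more than the VFS inode size. */
        if (new_size > vfs_size)
            new_size = vfs_size;
    }

    return new_size > disk_size ? new_size : 0;
}
```

With a 4k block size, the failing case from the commit message corresponds to io_offset = 2^63 - 4096 and io_size = 4096; the sketch then falls back to the VFS size (2^63 - 1) instead of producing a negative value that compares below the on-disk size.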
  7. 04 Aug, 2014 1 commit
    • xfs: kill xfs_vnode.h · b92cc59f
      Committed by Dave Chinner
      Move the IO flag definitions to xfs_inode.h and kill the header file
      as it is now empty.
      
      Removing the xfs_vnode.h file showed up an implicit header include
      path:
      	xfs_linux.h -> xfs_vnode.h -> xfs_fs.h
      
      And so every xfs header file has been implicitly including
      xfs_fs.h whether it is needed or not. Hence the removal of xfs_vnode.h
      causes all sorts of build issues because BBTOB() and friends are no
      longer automatically included in the build. This also gets fixed.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  8. 20 May, 2014 1 commit
    • xfs: turn NLINK feature on by default · 263997a6
      Committed by Dave Chinner
      mkfs has turned on the XFS_SB_VERSION_NLINKBIT feature bit by
      default since November 2007. It's about time we simply made the
      kernel code turn it on by default and so always convert v1 inodes to
      v2 inodes when reading them in from disk or allocating them. This
      removes needless version checks and modification when bumping
      link counts on inodes, and will take code out of a few common code
      paths.
      
         text    data     bss     dec     hex filename
       783251  100867     616  884734   d7ffe fs/xfs/xfs.o.orig
       782664  100867     616  884147   d7db3 fs/xfs/xfs.o.patched
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  9. 23 Apr, 2014 1 commit
  10. 17 Apr, 2014 1 commit
    • xfs: fix tmpfile/selinux deadlock and initialize security · 330033d6
      Committed by Brian Foster
      xfstests generic/004 reproduces an ilock deadlock using the tmpfile
      interface when selinux is enabled. This occurs because
      xfs_create_tmpfile() takes the ilock and then calls d_tmpfile(). The
      latter eventually calls into xfs_xattr_get() which attempts to get the
      lock again. E.g.:
      
      xfs_io          D ffffffff81c134c0  4096  3561   3560 0x00000080
      ffff8801176a1a68 0000000000000046 ffff8800b401b540 ffff8801176a1fd8
      00000000001d5800 00000000001d5800 ffff8800b401b540 ffff8800b401b540
      ffff8800b73a6bd0 fffffffeffffffff ffff8800b73a6bd8 ffff8800b5ddb480
      Call Trace:
      [<ffffffff8177f969>] schedule+0x29/0x70
      [<ffffffff81783a65>] rwsem_down_read_failed+0xc5/0x120
      [<ffffffffa05aa97f>] ? xfs_ilock_attr_map_shared+0x1f/0x50 [xfs]
      [<ffffffff813b3434>] call_rwsem_down_read_failed+0x14/0x30
      [<ffffffff810ed179>] ? down_read_nested+0x89/0xa0
      [<ffffffffa05aa7f2>] ? xfs_ilock+0x122/0x250 [xfs]
      [<ffffffffa05aa7f2>] xfs_ilock+0x122/0x250 [xfs]
      [<ffffffffa05aa97f>] xfs_ilock_attr_map_shared+0x1f/0x50 [xfs]
      [<ffffffffa05701d0>] xfs_attr_get+0x90/0xe0 [xfs]
      [<ffffffffa0565e07>] xfs_xattr_get+0x37/0x50 [xfs]
      [<ffffffff8124842f>] generic_getxattr+0x4f/0x70
      [<ffffffff8133fd9e>] inode_doinit_with_dentry+0x1ae/0x650
      [<ffffffff81340e0c>] selinux_d_instantiate+0x1c/0x20
      [<ffffffff813351bb>] security_d_instantiate+0x1b/0x30
      [<ffffffff81237db0>] d_instantiate+0x50/0x70
      [<ffffffff81237e85>] d_tmpfile+0xb5/0xc0
      [<ffffffffa05add02>] xfs_create_tmpfile+0x362/0x410 [xfs]
      [<ffffffffa0559ac8>] xfs_vn_tmpfile+0x18/0x20 [xfs]
      [<ffffffff81230388>] path_openat+0x228/0x6a0
      [<ffffffff810230f9>] ? sched_clock+0x9/0x10
      [<ffffffff8105a427>] ? kvm_clock_read+0x27/0x40
      [<ffffffff8124054f>] ? __alloc_fd+0xaf/0x1f0
      [<ffffffff8123101a>] do_filp_open+0x3a/0x90
      [<ffffffff817845e7>] ? _raw_spin_unlock+0x27/0x40
      [<ffffffff8124054f>] ? __alloc_fd+0xaf/0x1f0
      [<ffffffff8121e3ce>] do_sys_open+0x12e/0x210
      [<ffffffff8121e4ce>] SyS_open+0x1e/0x20
      [<ffffffff8178eda9>] system_call_fastpath+0x16/0x1b
      
      xfs_vn_tmpfile() also fails to initialize security on the newly created
      inode.
      
      Pull the d_tmpfile() call up into xfs_vn_tmpfile() after the transaction
      has been committed and the inode unlocked. Also, initialize security on
      the inode based on the parent directory provided via the tmpfile call.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  11. 07 Jan, 2014 2 commits
  12. 19 Dec, 2013 3 commits
  13. 31 Oct, 2013 1 commit
    • xfs: abstract the differences in dir2/dir3 via an ops vector · 32c5483a
      Committed by Dave Chinner
      Lots of the dir code now goes through switches to determine what is
      the correct on-disk format to parse. It generally involves a
      "xfs_sbversion_hasfoo" check, dereferencing the superblock version and
      feature fields and hence touching several cache lines per operation
      in the process. Some operations do multiple checks because they nest
      conditional operations and they don't pass the information in a
      direct fashion between each other.
      
      Hence, add an ops vector to the xfs_inode structure that is
      configured when the inode is initialised to point to all the correct
      decode and encoding operations.  This will significantly reduce the
      branchiness and cacheline footprint of the directory object decoding
      and encoding.
      
      This is the first patch in a series of conversion patches. It will
      introduce the ops structure, the setup of it and add the first
      operation to the vector. Subsequent patches will convert directory
      ops one at a time to keep the changes simple and obvious.
      
      Just this patch shows the benefit of such an approach on code size.
      Just converting the two shortform dir operations as this patch does
      decreases the built binary size by ~1500 bytes:
      
      $ size fs/xfs/xfs.o.orig fs/xfs/xfs.o.p1
         text    data     bss     dec     hex filename
       794490   96802    1096  892388   d9de4 fs/xfs/xfs.o.orig
       792986   96802    1096  890884   d9804 fs/xfs/xfs.o.p1
      $
      
      That's a significant decrease in the instruction cache footprint of
      the directory code for such a simple change, and indicates that this
      approach is definitely worth pursuing further.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ben Myers <bpm@sgi.com>
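
The idea of replacing repeated feature checks with an ops vector chosen once at inode setup can be illustrated with the sketch below. It is not the XFS code: the structure, function names and entry sizes are all hypothetical, and the sizes in particular are illustrative numbers rather than the real on-disk layout. The point is only that the format decision is made once, in inode_init_ops(), and every later call is a single indirect call with no superblock lookups.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ops vector holding one format-dependent operation. */
struct dir_ops {
    size_t (*sf_entsize)(size_t namelen);
};

/* Illustrative sizes only; not the real shortform entry layout. */
static size_t dir2_sf_entsize(size_t namelen)
{
    return 3 + namelen + 4;
}

/* The newer format is modelled as carrying one extra byte per entry. */
static size_t dir3_sf_entsize(size_t namelen)
{
    return 3 + namelen + 1 + 4;
}

static const struct dir_ops dir2_ops = { .sf_entsize = dir2_sf_entsize };
static const struct dir_ops dir3_ops = { .sf_entsize = dir3_sf_entsize };

/* Hypothetical in-core inode carrying a pointer to its ops vector. */
struct inode_model {
    const struct dir_ops *d_ops;
};

/* Decide the on-disk format once, when the inode is initialised. */
void inode_init_ops(struct inode_model *ip, bool new_format)
{
    ip->d_ops = new_format ? &dir3_ops : &dir2_ops;
}

/* Callers no longer branch on the format; they just call through. */
size_t dir_sf_entsize(const struct inode_model *ip, size_t namelen)
{
    assert(ip->d_ops != NULL);
    return ip->d_ops->sf_entsize(namelen);
}
```

Each additional operation converted in later patches would become one more function pointer in the hypothetical struct dir_ops, with no change to the callers' control flow.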
  14. 24 Oct, 2013 1 commit
    • xfs: decouple inode and bmap btree header files · a4fbe6ab
      Committed by Dave Chinner
      Currently the xfs_inode.h header has a dependency on the definition
      of the BMAP btree records as the inode fork includes an array of
      xfs_bmbt_rec_host_t objects in its definition.
      
      Move all the btree format definitions from xfs_btree.h,
      xfs_bmap_btree.h, xfs_alloc_btree.h and xfs_ialloc_btree.h to
      xfs_format.h to continue the process of centralising the on-disk
      format definitions. With this done, the xfs inode definitions are no
      longer dependent on btree header files.
      
      This enables a massive culling of unnecessary includes, with close to
      200 #include directives removed from the XFS kernel code base.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
  15. 09 Oct, 2013 1 commit
  16. 13 Aug, 2013 6 commits
  17. 11 Jul, 2013 1 commit
    • xfs: Add pquota fields where gquota is used. · 92f8ff73
      Committed by Chandra Seetharaman
      Add project quota changes to all the places where the group quota
      field is used:
         * add separate project quota members into various structures
         * split project quota and group quotas so that instead of overriding
           the group quota members incore, the new project quota members are
           used instead
         * get rid of usage of the OQUOTA flag incore, in favor of separate
           group and project quota flags.
         * add a project dquot argument to various functions.
      
      Not using the pquotino field from superblock yet.
      Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
  18. 22 Apr, 2013 1 commit
    • xfs: add version 3 inode format with CRCs · 93848a99
      Committed by Christoph Hellwig
      Add a new inode version with a larger core.  The primary objective is
      to allow for a crc of the inode, and location information (uuid and ino)
      to verify it was written in the right place.  We also extend it by:
      
      	a creation time (for Samba);
      	a changecount (for NFSv4);
      	a flush sequence (in LSN format for recovery);
      	an additional inode flags field; and
      	some additional padding.
      
      These additional fields are not implemented yet, but already laid
      out in the structure.
      
      [dchinner@redhat.com] Added LSN and flags field, some factoring and rework to
      capture all the necessary information in the crc calculation.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
  19. 15 Mar, 2013 1 commit
  20. 07 Feb, 2013 1 commit
  21. 16 Nov, 2012 4 commits
    • xfs: convert buffer verifiers to an ops structure. · 1813dd64
      Committed by Dave Chinner
      To separate the verifiers from iodone functions and associate read
      and write verifiers at the same time, introduce a buffer verifier
      operations structure to the xfs_buf.
      
      This avoids the need for assigning the write verifier, clearing the
      iodone function and re-running ioend processing in the read
      verifier, and gets rid of the nasty "b_pre_io" name for the write
      verifier function pointer. If we ever need to, it will also be
      easier to add further content specific callbacks to a buffer with an
      ops structure in place.
      
      We also avoid needing to export verifier functions, instead we
      can simply export the ops structures for those that are needed
      outside the file they are defined in.
      
      This patch also fixes a directory block readahead verifier issue
      it exposed.
      
      This patch also adds ops callbacks to the inode/alloc btree blocks
      initialised by growfs. These will need more work before they will
      work with CRCs.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Phil White <pwhite@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
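
A buffer that carries its read and write verifiers together in one ops structure might look roughly like the sketch below. It is a simplified standalone model, not the xfs_buf code: the structures, the magic number and all function names are hypothetical. It shows the two properties the commit message highlights: both verifiers are attached in a single step, and only the ops structure needs to be visible outside the file that defines the verifier functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct buf;

/* Hypothetical verifier ops: read and write checks travel together. */
struct buf_ops {
    void (*verify_read)(struct buf *bp);
    void (*verify_write)(struct buf *bp);
};

/* Hypothetical, heavily simplified buffer. */
struct buf {
    uint32_t magic;             /* stand-in for the block contents */
    int error;                  /* 0 if the last verification passed */
    const struct buf_ops *ops;
};

#define EXAMPLE_MAGIC 0x58414746u   /* made-up magic for this sketch */

/* Shared check used by both verifiers; private to this file. */
static bool example_verify(const struct buf *bp)
{
    return bp->magic == EXAMPLE_MAGIC;
}

static void example_read_verify(struct buf *bp)
{
    if (!example_verify(bp))
        bp->error = -1;
}

static void example_write_verify(struct buf *bp)
{
    if (!example_verify(bp))
        bp->error = -1;
}

/* Only this structure is exported; the functions above stay static. */
const struct buf_ops example_buf_ops = {
    .verify_read = example_read_verify,
    .verify_write = example_write_verify,
};

/* Attach both verifiers to a buffer in one step. */
void buf_attach_ops(struct buf *bp, const struct buf_ops *ops)
{
    assert(ops != NULL && ops->verify_read && ops->verify_write);
    bp->ops = ops;
}

/* Called when a read completes: run the read verifier if one is set. */
void buf_read_done(struct buf *bp)
{
    if (bp->ops)
        bp->ops->verify_read(bp);
}

/* Called before a write is issued; returns the verification result. */
int buf_write(struct buf *bp)
{
    if (bp->ops)
        bp->ops->verify_write(bp);
    return bp->error;
}
```

Adding a further content-specific callback in this model would mean adding one more member to the hypothetical struct buf_ops, without touching the buffer structure itself.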
    • xfs: connect up write verifiers to new buffers · b0f539de
      Committed by Dave Chinner
      Metadata buffers that are read from disk have write verifiers
      already attached to them, but newly allocated buffers do not. Add
      appropriate write verifiers to all new metadata buffers.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: add pre-write metadata buffer verifier callbacks · 612cfbfe
      Committed by Dave Chinner
      These verifiers are essentially the same code as the read verifiers,
      but do not require ioend processing. Hence factor the read verifier
      functions and add a new write verifier wrapper that is used as the
      callback.
      
      This is done as one large patch for all verifiers rather than one
      patch per verifier as the change is largely mechanical. This
      includes hooking up the write verifier via the read verifier
      function.
      
      Hooking up the write verifier for buffers obtained via
      xfs_trans_get_buf() will be done in a separate patch as that touches
      code in many different places rather than just the verifier
      functions.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: verify btree blocks as they are read from disk · 3d3e6f64
      Committed by Dave Chinner
      Add a btree block verify callback function and pass it into the
      buffer read functions. Because each different btree block type
      requires different verification, add a function to the ops structure
      that is called from the generic code.
      
      Also, propagate the verification callback functions through the
      readahead functions, and into the external bmap and bulkstat inode
      readahead code that uses the generic btree buffer read functions.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Phil White <pwhite@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
  22. 09 Nov, 2012 1 commit
  23. 18 Oct, 2012 1 commit
  24. 30 Jul, 2012 2 commits
  25. 22 Jul, 2012 2 commits