1. 26 4月, 2017 1 次提交
    • E
      xfs: handle array index overrun in xfs_dir2_leaf_readbuf() · 023cc840
      Eric Sandeen 提交于
      Carlos had a case where "find" seemed to start spinning
      forever and never return.
      
      This was on a filesystem with non-default multi-fsb (8k)
      directory blocks, and a fragmented directory with extents
      like this:
      
      0:[0,133646,2,0]
      1:[2,195888,1,0]
      2:[3,195890,1,0]
      3:[4,195892,1,0]
      4:[5,195894,1,0]
      5:[6,195896,1,0]
      6:[7,195898,1,0]
      7:[8,195900,1,0]
      8:[9,195902,1,0]
      9:[10,195908,1,0]
      10:[11,195910,1,0]
      11:[12,195912,1,0]
      12:[13,195914,1,0]
      ...
      
      i.e. the first extent is a contiguous 2-fsb dir block, but
      after that it is fragmented into 1 block extents.
      
      At the top of the readdir path, we allocate a mapping array
      which (for this filesystem geometry) can hold 10 extents; see
      the assignment to map_info->map_size.  During readdir, we are
      therefore able to map extents 0 through 9 above into the array
      for readahead purposes.  If we count by 2, we see that the last
      mapped index (9) is the first block of a 2-fsb directory block.
      
      At the end of xfs_dir2_leaf_readbuf() we have 2 loops to fill
      more readahead; the outer loop assumes one full dir block is
      processed each loop iteration, and an inner loop that ensures
      that this is so by advancing to the next extent until a full
      directory block is mapped.
      
      The problem is that this inner loop may step past the last
      extent in the mapping array as it tries to reach the end of
      the directory block.  This will read garbage for the extent
      length, and as a result the loop control variable 'j' may
      become corrupted and never fail the loop conditional.
      
      The number of valid mappings we have in our array is stored
      in map->map_valid, so stop this inner loop based on that limit.
      
      There is an ASSERT at the top of the outer loop for this
      same condition, but we never made it out of the inner loop,
      so the ASSERT never fired.
      
      Huge appreciation for Carlos for debugging and isolating
      the problem.
      Debugged-and-analyzed-by: NCarlos Maiolino <cmaiolino@redhat.com>
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Tested-by: NCarlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: NCarlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: NBill O'Donnell <billodo@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      023cc840
  2. 15 3月, 2017 1 次提交
  3. 30 11月, 2016 1 次提交
  4. 04 10月, 2016 1 次提交
  5. 19 5月, 2016 1 次提交
    • D
      xfs: concurrent readdir hangs on data buffer locks · 9f541801
      Dave Chinner 提交于
      There's a three-process deadlock involving shared/exclusive barriers
      and inverted lock orders in the directory readdir implementation.
      It's a pre-existing problem with lock ordering, exposed by the
      VFS parallelisation code.
      
      process 1               process 2               process 3
      ---------               ---------               ---------
      readdir
      iolock(shared)
        get_leaf_dents
          iterate entries
             ilock(shared)
             map, lock and read buffer
             iunlock(shared)
             process entries in buffer
             .....
                                                      readdir
                                                      iolock(shared)
                                                        get_leaf_dents
                                                          iterate entries
                                                            ilock(shared)
                                                            map, lock buffer
                                                            <blocks>
                              finish ->iterate_shared
                              file_accessed()
                                ->update_time
                                  start transaction
                                  ilock(excl)
                                  <blocks>
              .....
              finishes processing buffer
              get next buffer
                ilock(shared)
                <blocks>
      
      And that's the deadlock.
      
      Fix this by dropping the current buffer lock in process 1 before
      trying to map the next buffer. This means we keep the lock order of
      ilock -> buffer lock intact and hence will allow process 3 to make
      progress and drop it's ilock(shared) once it is done.
      Reported-by: NXiong Zhou <xzhou@redhat.com>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      9f541801
  6. 09 2月, 2016 1 次提交
  7. 12 10月, 2015 1 次提交
    • B
      xfs: per-filesystem stats counter implementation · ff6d6af2
      Bill O'Donnell 提交于
      This patch modifies the stats counting macros and the callers
      to those macros to properly increment, decrement, and add-to
      the xfs stats counts. The counts for global and per-fs stats
      are correctly advanced, and cleared by writing a "1" to the
      corresponding clear file.
      
      global counts: /sys/fs/xfs/stats/stats
      per-fs counts: /sys/fs/xfs/sda*/stats/stats
      
      global clear:  /sys/fs/xfs/stats/stats_clear
      per-fs clear:  /sys/fs/xfs/sda*/stats/stats_clear
      
      [dchinner: cleaned up macro variables, removed CONFIG_FS_PROC around
       stats structures and macros. ]
      Signed-off-by: NBill O'Donnell <billodo@redhat.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      ff6d6af2
  8. 19 8月, 2015 1 次提交
    • D
      xfs: stop holding ILOCK over filldir callbacks · dbad7c99
      Dave Chinner 提交于
      The recent change to the readdir locking made in 40194ecc ("xfs:
      reinstate the ilock in xfs_readdir") for CXFS directory sanity was
      probably the wrong thing to do. Deep in the readdir code we
      can take page faults in the filldir callback, and so taking a page
      fault while holding an inode ilock creates a new set of locking
      issues that lockdep warns all over the place about.
      
      The locking order for regular inodes w.r.t. page faults is io_lock
      -> pagefault -> mmap_sem -> ilock. The directory readdir code now
      triggers ilock -> page fault -> mmap_sem. While we cannot deadlock
      at this point, it inverts all the locking patterns that lockdep
      normally sees on XFS inodes, and so triggers lockdep. We worked
      around this with commit 93a8614e ("xfs: fix directory inode iolock
      lockdep false positive"), but that then just moved the lockdep
      warning to deeper in the page fault path and triggered on security
      inode locks. Fixing the shmem issue there just moved the lockdep
      reports somewhere else, and now we are getting false positives from
      filesystem freezing annotations getting confused.
      
      Further, if we enter memory reclaim in a readdir path, we now get
      lockdep warning about potential deadlocks because the ilock is held
      when we enter reclaim. This, again, is different to a regular file
      in that we never allow memory reclaim to run while holding the ilock
      for regular files. Hence lockdep now throws
      ilock->kmalloc->reclaim->ilock warnings.
      
      Basically, the problem is that the ilock is being used to protect
      the directory data and the inode metadata, whereas for a regular
      file the iolock protects the data and the ilock protects the
      metadata. From the VFS perspective, the i_mutex serialises all
      accesses to the directory data, and so not holding the ilock for
      readdir doesn't matter. The issue is that CXFS doesn't access
      directory data via the VFS, so it has no "data serialisaton"
      mechanism. Hence we need to hold the IOLOCK in the correct places to
      provide this low level directory data access serialisation.
      
      The ilock can then be used just when the extent list needs to be
      read, just like we do for regular files. The directory modification
      code can take the iolock exclusive when the ilock is also taken,
      and this then ensures that readdir is correct excluded while
      modifications are in progress.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      dbad7c99
  9. 04 12月, 2014 1 次提交
  10. 28 11月, 2014 3 次提交
  11. 25 6月, 2014 1 次提交
    • D
      xfs: global error sign conversion · 2451337d
      Dave Chinner 提交于
      Convert all the errors the core XFs code to negative error signs
      like the rest of the kernel and remove all the sign conversion we
      do in the interface layers.
      
      Errors for conversion (and comparison) found via searches like:
      
      $ git grep " E" fs/xfs
      $ git grep "return E" fs/xfs
      $ git grep " E[A-Z].*;$" fs/xfs
      
      Negation points found via searches like:
      
      $ git grep "= -[a-z,A-Z]" fs/xfs
      $ git grep "return -[a-z,A-D,F-Z]" fs/xfs
      $ git grep " -[a-z].*;" fs/xfs
      
      [ with some bits I missed from Brian Foster ]
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      2451337d
  12. 22 6月, 2014 1 次提交
  13. 10 6月, 2014 1 次提交
  14. 06 6月, 2014 7 次提交
  15. 07 5月, 2014 1 次提交
    • D
      xfs: fix directory readahead offset off-by-one · 8cfcc3e5
      Dave Chinner 提交于
      Directory readahead can throw loud scary but harmless warnings
      when multiblock directories are in use a specific pattern of
      discontiguous blocks are found in the directory. That is, if a hole
      follows a discontiguous block, it will throw a warning like:
      
      XFS (dm-1): xfs_da_do_buf: bno 637 dir: inode 34363923462
      XFS (dm-1): [00] br_startoff 637 br_startblock 1917954575 br_blockcount 1 br_state 0
      XFS (dm-1): [01] br_startoff 638 br_startblock -2 br_blockcount 1 br_state 0
      
      And dump a stack trace.
      
      This is because the readahead offset increment loop does a double
      increment of the block index - it does an increment for the loop
      iteration as well as increase the loop counter by the number of
      blocks in the extent. As a result, the readahead offset does not get
      incremented correctly for discontiguous blocks and hence can ask for
      readahead of a directory block from an offset part way through a
      directory block.  If that directory block is followed by a hole, it
      will trigger a mapping warning like the above.
      
      The bad readahead will be ignored, though, because the main
      directory block read loop uses the correct mapping offsets rather
      than the readahead offset and so will ignore the bad readahead
      altogether.
      
      Fix the warning by ensuring that the readahead offset is correctly
      incremented.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      8cfcc3e5
  16. 14 4月, 2014 3 次提交
  17. 19 12月, 2013 1 次提交
    • B
      xfs: reinstate the ilock in xfs_readdir · 40194ecc
      Ben Myers 提交于
      Although it was removed in commit 051e7cd4, ilock needs to be taken in
      xfs_readdir because we might have to read the extent list in from disk.  This
      keeps other threads from reading from or writing to the extent list while it is
      being read in and is still in a transitional state.
      
      This has been associated with "Access to block zero" messages on directories
      with large numbers of extents resulting from excessive filesytem fragmentation,
      as well as extent list corruption.  Unfortunately no test case at this point.
      Signed-off-by: NBen Myers <bpm@sgi.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      40194ecc
  18. 31 10月, 2013 5 次提交
    • D
      xfs: convert directory vector functions to constants · 1c9a5b2e
      Dave Chinner 提交于
      Many of the vectorised function calls now take no parameters and
      return a constant value. There is no reason for these to be vectored
      functions, so convert them to constants
      
      Binary sizes:
      
         text    data     bss     dec     hex filename
       794490   96802    1096  892388   d9de4 fs/xfs/xfs.o.orig
       792986   96802    1096  890884   d9804 fs/xfs/xfs.o.p1
       792350   96802    1096  890248   d9588 fs/xfs/xfs.o.p2
       789293   96802    1096  887191   d8997 fs/xfs/xfs.o.p3
       789005   96802    1096  886903   d8997 fs/xfs/xfs.o.p4
       789061   96802    1096  886959   d88af fs/xfs/xfs.o.p5
       789733   96802    1096  887631   d8b4f fs/xfs/xfs.o.p6
       791421   96802    1096  889319   d91e7 fs/xfs/xfs.o.p7
       791701   96802    1096  889599   d92ff fs/xfs/xfs.o.p8
       791205   96802    1096  889103   d91cf fs/xfs/xfs.o.p9
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      1c9a5b2e
    • D
      xfs: vectorise directory data operations part 2 · 2ca98774
      Dave Chinner 提交于
      Convert the rest of the directory data block encode/decode
      operations to vector format.
      
      This further reduces the size of the built binary:
      
         text    data     bss     dec     hex filename
       794490   96802    1096  892388   d9de4 fs/xfs/xfs.o.orig
       792986   96802    1096  890884   d9804 fs/xfs/xfs.o.p1
       792350   96802    1096  890248   d9588 fs/xfs/xfs.o.p2
       789293   96802    1096  887191   d8997 fs/xfs/xfs.o.p3
       789005   96802    1096  886903   d8997 fs/xfs/xfs.o.p4
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      2ca98774
    • D
      xfs: vectorise directory data operations · 9d23fc85
      Dave Chinner 提交于
      Following from the initial patches to vectorise the shortform
      directory encode/decode operations, convert half the data block
      operations to use the vector. The rest will be done in a second
      patch.
      
      This further reduces the size of the built binary:
      
         text    data     bss     dec     hex filename
       794490   96802    1096  892388   d9de4 fs/xfs/xfs.o.orig
       792986   96802    1096  890884   d9804 fs/xfs/xfs.o.p1
       792350   96802    1096  890248   d9588 fs/xfs/xfs.o.p2
       789293   96802    1096  887191   d8997 fs/xfs/xfs.o.p3
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      9d23fc85
    • D
      xfs: vectorise remaining shortform dir2 ops · 4740175e
      Dave Chinner 提交于
      Following from the initial patch to introduce the directory
      operations vector, convert the rest of the shortform directory
      operations to use vectored ops rather than superblock feature
      checks. This further reduces the size of the built binary:
      
         text    data     bss     dec     hex filename
       794490   96802    1096  892388   d9de4 fs/xfs/xfs.o.orig
       792986   96802    1096  890884   d9804 fs/xfs/xfs.o.p1
       792350   96802    1096  890248   d9588 fs/xfs/xfs.o.p2
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      4740175e
    • D
      xfs: abstract the differences in dir2/dir3 via an ops vector · 32c5483a
      Dave Chinner 提交于
      Lots of the dir code now goes through switches to determine what is
      the correct on-disk format to parse. It generally involves a
      "xfs_sbversion_hasfoo" check, deferencing the superblock version and
      feature fields and hence touching several cache lines per operation
      in the process. Some operations do multiple checks because they nest
      conditional operations and they don't pass the information in a
      direct fashion between each other.
      
      Hence, add an ops vector to the xfs_inode structure that is
      configured when the inode is initialised to point to all the correct
      decode and encoding operations.  This will significantly reduce the
      branchiness and cacheline footprint of the directory object decoding
      and encoding.
      
      This is the first patch in a series of conversion patches. It will
      introduce the ops structure, the setup of it and add the first
      operation to the vector. Subsequent patches will convert directory
      ops one at a time to keep the changes simple and obvious.
      
      Just this patch shows the benefit of such an approach on code size.
      Just converting the two shortform dir operations as this patch does
      decreases the built binary size by ~1500 bytes:
      
      $ size fs/xfs/xfs.o.orig fs/xfs/xfs.o.p1
         text    data     bss     dec     hex filename
       794490   96802    1096  892388   d9de4 fs/xfs/xfs.o.orig
       792986   96802    1096  890884   d9804 fs/xfs/xfs.o.p1
      $
      
      That's a significant decrease in the instruction cache footprint of
      the directory code for such a simple change, and indicates that this
      approach is definitely worth pursuing further.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      32c5483a
  19. 24 10月, 2013 3 次提交
    • D
      xfs: decouple inode and bmap btree header files · a4fbe6ab
      Dave Chinner 提交于
      Currently the xfs_inode.h header has a dependency on the definition
      of the BMAP btree records as the inode fork includes an array of
      xfs_bmbt_rec_host_t objects in it's definition.
      
      Move all the btree format definitions from xfs_btree.h,
      xfs_bmap_btree.h, xfs_alloc_btree.h and xfs_ialloc_btree.h to
      xfs_format.h to continue the process of centralising the on-disk
      format definitions. With this done, the xfs inode definitions are no
      longer dependent on btree header files.
      
      The enables a massive culling of unnecessary includes, with close to
      200 #include directives removed from the XFS kernel code base.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      a4fbe6ab
    • D
      xfs: decouple log and transaction headers · 239880ef
      Dave Chinner 提交于
      xfs_trans.h has a dependency on xfs_log.h for a couple of
      structures. Most code that does transactions doesn't need to know
      anything about the log, but this dependency means that they have to
      include xfs_log.h. Decouple the xfs_trans.h and xfs_log.h header
      files and clean up the includes to be in dependency order.
      
      In doing this, remove the direct include of xfs_trans_reserve.h from
      xfs_trans.h so that we remove the dependency between xfs_trans.h and
      xfs_mount.h. Hence the xfs_trans.h include can be moved to the
      indicate the actual dependencies other header files have on it.
      
      Note that these are kernel only header files, so this does not
      translate to any userspace changes at all.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      239880ef
    • D
      xfs: unify directory/attribute format definitions · 57062787
      Dave Chinner 提交于
      The on-disk format definitions for the directory and attribute
      structures are spread across 3 header files right now, only one of
      which is dedicated to defining on-disk structures and their
      manipulation (xfs_dir2_format.h). Pull all the format definitions
      into a single header file - xfs_da_format.h - and switch all the
      code over to point at that.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      57062787
  20. 05 10月, 2013 1 次提交
    • D
      xfs: dirent dtype presence is dependent on directory magic numbers · 6d313498
      Dave Chinner 提交于
      The determination of whether a directory entry contains a dtype
      field originally was dependent on the filesystem having CRCs
      enabled. This meant that the format for dtype beign enabled could be
      determined by checking the directory block magic number rather than
      doing a feature bit check. This was useful in that it meant that we
      didn't need to pass a struct xfs_mount around to functions that
      were already supplied with a directory block header.
      
      Unfortunately, the introduction of dtype fields into the v4
      structure via a feature bit meant this "use the directory block
      magic number" method of discriminating the dirent entry sizes is
      broken. Hence we need to convert the places that use magic number
      checks to use feature bit checks so that they work correctly and not
      by chance.
      
      The current code works on v4 filesystems only because the dirent
      size roundup covers the extra byte needed by the dtype field in the
      places where this problem occurs.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      
      (cherry picked from commit 367993e7)
      6d313498
  21. 01 10月, 2013 1 次提交
    • D
      xfs: dirent dtype presence is dependent on directory magic numbers · 367993e7
      Dave Chinner 提交于
      The determination of whether a directory entry contains a dtype
      field originally was dependent on the filesystem having CRCs
      enabled. This meant that the format for dtype beign enabled could be
      determined by checking the directory block magic number rather than
      doing a feature bit check. This was useful in that it meant that we
      didn't need to pass a struct xfs_mount around to functions that
      were already supplied with a directory block header.
      
      Unfortunately, the introduction of dtype fields into the v4
      structure via a feature bit meant this "use the directory block
      magic number" method of discriminating the dirent entry sizes is
      broken. Hence we need to convert the places that use magic number
      checks to use feature bit checks so that they work correctly and not
      by chance.
      
      The current code works on v4 filesystems only because the dirent
      size roundup covers the extra byte needed by the dtype field in the
      places where this problem occurs.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      367993e7
  22. 22 8月, 2013 1 次提交
    • D
      xfs: Add read-only support for dirent filetype field · 0cb97766
      Dave Chinner 提交于
      Add support for the file type field in directory entries so that
      readdir can return the type of the inode the dirent points to to
      userspace without first having to read the inode off disk.
      
      The encoding of the type field is a single byte that is added to the
      end of the directory entry name length. For all intents and
      purposes, it appends a "hidden" byte to the name field which
      contains the type information. As the directory entry is already of
      dynamic size, helpers are already required to access and decode the
      direct entry structures.
      
      Hence the relevent extraction and iteration helpers are updated to
      understand the hidden byte.  Helpers for reading and writing the
      filetype field from the directory entries are also added. Only the
      read helpers are used by this patch.  It also adds all the code
      necessary to read the type information out of the dirents on disk.
      
      Further we add the superblock feature bit and helpers to indicate
      that we understand the on-disk format change. This is not a
      compatible change - existing kernels cannot read the new format
      successfully - so an incompatible feature flag is added. We don't
      yet allow filesystems to mount with this flag yet - that will be
      added once write support is added.
      
      Finally, the code to take the type from the VFS, convert it to an
      XFS on-disk type and put it into the xfs_name structures passed
      around is added, but the directory code does not use this field yet.
      That will be in the next patch.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      0cb97766
  23. 13 8月, 2013 2 次提交