1. 25 5月, 2013 5 次提交
    • D
      xfs: xfs_da3_node_read_verify() doesn't handle XFS_ATTR3_LEAF_MAGIC · 7ced60ca
      Dave Chinner 提交于
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      
      (cherry picked from commit 72916fb8)
      7ced60ca
    • D
      xfs: fix missing KM_NOFS tags to keep lockdep happy · b17cb364
      Dave Chinner 提交于
      There are several places where we use KM_SLEEP allocation contexts
      and use the fact that they are called from transaction context to
      add KM_NOFS where appropriate. Unfortunately, there are several
      places where the code makes this assumption but can be called from
      outside transaction context but with filesystem locks held. These
      places need explicit KM_NOFS annotations to avoid lockdep
      complaining about reclaim contexts.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      
      (cherry picked from commit ac14876c)
      b17cb364
    • D
      xfs: Don't reference the EFI after it is freed · 509e708a
      Dave Chinner 提交于
      Checking the EFI for whether it is being released from recovery
      after we've already released the known active reference is a mistake
      worthy of a brown paper bag. Fix the (now) obvious use after free
      that it can cause.
      Reported-by: NDave Jones <davej@redhat.com>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      
      (cherry picked from commit 52c24ad3)
      509e708a
    • D
      xfs: fix rounding in xfs_free_file_space · 7031d0e1
      Dave Chinner 提交于
      The offset passed into xfs_free_file_space() needs to be rounded
      down to a certain size, but the rounding mask is built by a 32 bit
      variable. Hence the mask will always mask off the upper 32 bits of
      the offset and lead to incorrect writeback and invalidation ranges.
      
      This is not actually exposed as a bug because we writeback and
      invalidate from the rounded offset to the end of the file, and hence
      the offset we are actually punching a hole out of will always be
      covered by the code. This needs fixing, however, if we ever want to
      use exact ranges for writeback/invalidation here...
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      
      (cherry picked from commit 28ca489c)
      7031d0e1
    • D
      xfs: fix sub-page blocksize data integrity writes · 480d7467
      Dave Chinner 提交于
      FSX on 512 byte block size filesystems has been failing for some
      time with corrupted data. The fault dates back to the change in
      the writeback data integrity algorithm that uses a mark-and-sweep
      approach to avoid data writeback livelocks.
      
      Unfortunately, a side effect of this mark-and-sweep approach is that
      each page will only be written once for a data integrity sync, and
      there is a condition in writeback in XFS where a page may require
      two writeback attempts to be fully written. As a result of the high
      level change, we now only get a partial page writeback during the
      integrity sync because the first pass through writeback clears the
      mark left on the page index to tell writeback that the page needs
      writeback....
      
      The cause is writing a partial page in the clustering code. This can
      happen when a mapping boundary falls in the middle of a page - we
      end up writing back the first part of the page that the mapping
      covers, but then never revisit the page to have the remainder mapped
      and written.
      
      The fix is simple - if the mapping boundary falls inside a page,
      then simple abort clustering without touching the page. This means
      that the next ->writepage entry that write_cache_pages() will make
      is the page we aborted on, and xfs_vm_writepage() will map all
      sections of the page correctly. This behaviour is also optimal for
      non-data integrity writes, as it results in contiguous sequential
      writeback of the file rather than missing small holes and having to
      write them a "random" writes in a future pass.
      
      With this fix, all the fsx tests in xfstests now pass on a 512 byte
      block size filesystem on a 4k page machine.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      
      (cherry picked from commit 49b137cb)
      480d7467
  2. 08 5月, 2013 4 次提交
    • K
      aio: don't include aio.h in sched.h · a27bb332
      Kent Overstreet 提交于
      Faster kernel compiles by way of fewer unnecessary includes.
      
      [akpm@linux-foundation.org: fix fallout]
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: NKent Overstreet <koverstreet@google.com>
      Cc: Zach Brown <zab@redhat.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Reviewed-by: N"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a27bb332
    • E
      xfs: fallback to vmalloc for large buffers in xfs_compat_attrlist_by_handle · 7dfbcbef
      Eric Sandeen 提交于
      Shamelessly copied from dchinner's:
      ad650f5b xfs: fallback to vmalloc for large buffers in xfs_attrmulti_attr_get
          
      xfsdump uses a large buffer for extended attributes, which has a
      kmalloc'd shadow buffer in the kernel. This can fail after the
      system has been running for some time as it is a high order
      allocation. Add a fallback to vmalloc so that it doesn't require
      contiguous memory and so won't randomly fail while xfsdump is
      running.
      
      This was done for xfs_attrlist_by_handle but
      xfs_compat_attrlist_by_handle (the 32-bit version) needs the same
      attention.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      7dfbcbef
    • E
      xfs: fallback to vmalloc for large buffers in xfs_attrlist_by_handle · dd700d94
      Eric Sandeen 提交于
      Shamelessly copied from dchinner's:
      ad650f5b xfs: fallback to vmalloc for large buffers in xfs_attrmulti_attr_get
          
      xfsdump uses for a large buffer for extended attributes, which has a
      kmalloc'd shadow buffer in the kernel. This can fail after the
      system has been running for some time as it is a high order
      allocation. Add a fallback to vmalloc so that it doesn't require
      contiguous memory and so won't randomly fail while xfsdump is
      running.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      dd700d94
    • D
      xfs: introduce CONFIG_XFS_WARN · 742ae1e3
      Dave Chinner 提交于
      Running a CONFIG_XFS_DEBUG kernel in production environments is not
      the best idea as it introduces significant overhead, can change
      the behaviour of algorithms (such as allocation) to improve test
      coverage, and (most importantly) panic the machine on non-fatal
      errors.
      
      There are many cases where all we want to do is run a
      kernel with more bounds checking enabled, such as is provided by the
      ASSERT() statements throughout the code, but without all the
      potential overhead and drawbacks.
      
      This patch converts all the ASSERT statements to evaluate as
      WARN_ON(1) statements and hence if they fail dump a warning and a
      stack trace to the log. This has minimal overhead and does not
      change any algorithms, and will allow us to find strange "out of
      bounds" problems more easily on production machines.
      
      There are a few places where assert statements contain debug only
      code. These are converted to be debug-or-warn only code so that we
      still get all the assert checks in the code.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      742ae1e3
  3. 02 5月, 2013 2 次提交
  4. 01 5月, 2013 1 次提交
  5. 28 4月, 2013 14 次提交
    • D
      xfs: implement extended feature masks · e721f504
      Dave Chinner 提交于
      The version 5 superblock has extended feature masks for compatible,
      incompatible and read-only compatible feature sets. Implement the
      masking and mount-time checking for these feature masks.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      e721f504
    • D
      xfs: add CRC checks to the superblock · 04a1e6c5
      Dave Chinner 提交于
      With the addition of CRCs, there is such a wide and varied change to
      the on disk format that it makes sense to bump the superblock
      version number rather than try to use feature bits for all the new
      functionality.
      
      This commit introduces all the new superblock fields needed for all
      the new functionality: feature masks similar to ext4, separate
      project quota inodes, a LSN field for recovery and the CRC field.
      
      This commit does not bump the superblock version number, however.
      That will be done as a separate commit at the end of the series
      after all the new functionality is present so we switch it all on in
      one commit. This means that we can slowly introduce the changes
      without them being active and hence maintain bisectability of the
      tree.
      
      This patch is based on a patch originally written by myself back
      from SGI days, which was subsequently modified by Christoph Hellwig.
      There is relatively little of that patch remaining, but the history
      of the patch still should be acknowledged here.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      04a1e6c5
    • D
      xfs: buffer type overruns blf_flags field · 61fe135c
      Dave Chinner 提交于
      The buffer type passed to log recvoery in the buffer log item
      overruns the blf_flags field. I had assumed that flags field was a
      32 bit value, and it turns out it is a unisgned short. Therefore
      having 19 flags doesn't really work.
      
      Convert the buffer type field to numeric value, and use the top 5
      bits of the flags field for it. We currently have 17 types of
      buffers, so using 5 bits gives us plenty of room for expansion in
      future....
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      61fe135c
    • D
      xfs: add buffer types to directory and attribute buffers · d75afeb3
      Dave Chinner 提交于
      Add buffer types to the buffer log items so that log recovery can
      validate the buffers and calculate CRCs correctly after the buffers
      are recovered.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      d75afeb3
    • D
      xfs: add CRC protection to remote attributes · d2e448d5
      Dave Chinner 提交于
      There are two ways of doing this - the first is to add a CRC to the
      remote attribute entry in the attribute block. The second is to
      treat them similar to the remote symlink, where each fragment has
      it's own header and identifies fragment location in the attribute.
      
      The problem with the CRC in the remote attr entry is that we cannot
      identify the owner of the metadata from the metadata blocks
      themselves, or where the blocks fit into the remote attribute. The
      down side to this approach is that we never know when the attribute
      has been read from disk or not and so we have to verify it every
      time it is read, and we must calculate it during the create
      transaction and log it. We do not log CRCs for any other metadata,
      and so this creates a unique set of coherency problems that, in
      general, are best avoided.
      
      Adding an identifying header to each allocated block allows us to
      identify each fragment and where in the attribute it is located. It
      enables us to rebuild the remote attribute from just the raw blocks
      containing the attribute. It also provides us to do per-block CRCs
      verification at IO time rather than during the transaction context
      that creates it or every time it is read into a user buffer. Hence
      it avoids all the problems that an external, logged CRC has, and
      provides all the benefits of self identifying metadata.
      
      The only complexity is that we have to add a header per fragment,
      and we don't know how many fragments will be needed prior to
      allocations. If we take the symlink example, the header is 56 bytes
      and hence for a 4k block size filesystem, in the worst case 16
      headers requires 1 extra block for the 64k attribute data. For 512
      byte filesystems the worst case is an extra block for every 9
      fragments (i.e. 16 extra blocks in the worse case). This will be
      very rare and so it's not really a major concern.
      
      Because allocation is done in two steps - the first finds a hole
      large enough in the attribute file, the second does the allocation -
      we only need to find a hole big enough for a worst case allocation.
      We only need to allocate enough extra blocks for number of headers
      required by the fragments, and we can calculate that as we go....
      
      Hence it really only makes sense to use the same model as for
      symlinks - it doesn't add that much complexity, does not require an
      attribute tree format change, and does not require logging
      calculated CRC values.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      d2e448d5
    • D
      xfs: split remote attribute code out · 95920cd6
      Dave Chinner 提交于
      Adding CRC support to remote attributes adds a significant amount of
      remote attribute specific code. Split the existing remote attribute
      code out into it's own file so that all the relevant remote
      attribute code is in a single, easy to find place.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      95920cd6
    • D
      xfs: add CRCs to attr leaf blocks · 517c2220
      Dave Chinner 提交于
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      517c2220
    • D
      xfs: add CRCs to dir2/da node blocks · f5ea1100
      Dave Chinner 提交于
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      f5ea1100
    • D
      xfs: shortform directory offsets change for dir3 format · 6b2647a1
      Dave Chinner 提交于
      Because the header size for the CRC enabled directory blocks is
      larger, the offset of the first entry into a directory block is
      different to the dir2 format. The shortform directory stores the
      dirent's offset so that it doesn't change when moving from shortform
      to block form and back again, and hence it needs to take into
      account the different header sizes to maintain the correct offsets.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      6b2647a1
    • D
      xfs: add CRC checking to dir2 leaf blocks · 24df33b4
      Dave Chinner 提交于
      This addition follows the same pattern as the dir2 block CRCs.
      Seeing as both LEAF1 and LEAFN types need to changed at the same
      time, this is a pretty large amount of change. leaf block headers
      need to be abstracted away from the on-disk structures (struct
      xfs_dir3_icleaf_hdr), as do the base leaf entry locations.
      
      This header abstract allows the in-core header and leaf entry
      location to be passed around instead of the leaf block itself. This
      saves a lot of converting individual variables from on-disk format
      to host format where they are used, so there's a good chance that
      the compiler will be able to produce much more optimal code as it's
      not having to byteswap variables all over the place.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      24df33b4
    • D
      xfs: add CRC checking to dir2 data blocks · 33363fee
      Dave Chinner 提交于
      This addition follows the same pattern as the dir2 block CRCs.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      33363fee
    • D
      xfs: add CRC checking to dir2 free blocks · cbc8adf8
      Dave Chinner 提交于
      This addition follows the same pattern as the dir2 block CRCs, but
      with a few differences. The main difference is that the free block
      header is different between the v2 and v3 formats, so an "in-core"
      free block header has been added and _todisk/_from_disk functions
      used to abstract the differences in structure format from the code.
      This is similar to the on-disk superblock versus the in-core
      superblock setup. The in-core strucutre is populated when the buffer
      is read from disk, all the in memory checks and modifications are
      done on the in-core version of the structure which is written back
      to the buffer before the buffer is logged.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      cbc8adf8
    • D
      xfs: add CRC checks to block format directory blocks · f5f3d9b0
      Dave Chinner 提交于
      Now that directory buffers are made from a single struct xfs_buf, we
      can add CRC calculation and checking callbacks. While there, add all
      the fields to the on disk structures for future functionality such
      as d_type support, uuids, block numbers, owner inode, etc.
      
      To distinguish between the different on disk formats, change the
      magic numbers for the new format directory blocks.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      f5f3d9b0
    • D
      xfs: add CRC checks to remote symlinks · f948dd76
      Dave Chinner 提交于
      Add a header to the remote symlink block, containing location and
      owner information, as well as CRCs and LSN fields. This requires
      verifiers to be added to the remote symlink buffers for CRC enabled
      filesystems.
      
      This also fixes a bug reading multiple block symlinks, where the second
      block overwrites the first block when copying out the link name.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      f948dd76
  6. 22 4月, 2013 8 次提交
  7. 17 4月, 2013 2 次提交
  8. 10 4月, 2013 1 次提交
  9. 06 4月, 2013 1 次提交
    • D
      xfs: don't free EFIs before the EFDs are committed · 666d644c
      Dave Chinner 提交于
      Filesystems are occasionally being shut down with this error:
      
      xfs_trans_ail_delete_bulk: attempting to delete a log item that is
      not in the AIL.
      
      It was diagnosed to be related to the EFI/EFD commit order when the
      EFI and EFD are in different checkpoints and the EFD is committed
      before the EFI here:
      
      http://oss.sgi.com/archives/xfs/2013-01/msg00082.html
      
      The real problem is that a single bit cannot fully describe the
      states that the EFI/EFD processing can be in. These completion
      states are:
      
      EFI			EFI in AIL	EFD		Result
      committed/unpinned	Yes		committed	OK
      committed/pinned	No		committed	Shutdown
      uncommitted		No		committed	Shutdown
      
      
      Note that the "result" field is what should happen, not what does
      happen. The current logic is broken and handles the first two cases
      correctly by luck.  That is, the code will free the EFI if the
      XFS_EFI_COMMITTED bit is *not* set, rather than if it is set. The
      inverted logic "works" because if both EFI and EFD are committed,
      then the first __xfs_efi_release() call clears the XFS_EFI_COMMITTED
      bit, and the second frees the EFI item. Hence as long as
      xfs_efi_item_committed() has been called, everything appears to be
      fine.
      
      It is the third case where the logic fails - where
      xfs_efd_item_committed() is called before xfs_efi_item_committed(),
      and that results in the EFI being freed before it has been
      committed. That is the bug that triggered the shutdown, and hence
      keeping track of whether the EFI has been committed or not is
      insufficient to correctly order the EFI/EFD operations w.r.t. the
      AIL.
      
      What we really want is this: the EFI is always placed into the
      AIL before the last reference goes away. The only way to guarantee
      that is that the EFI is not freed until after it has been unpinned
      *and* the EFD has been committed. That is, restructure the logic so
      that the only case that can occur is the first case.
      
      This can be done easily by replacing the XFS_EFI_COMMITTED with an
      EFI reference count. The EFI is initialised with it's own count, and
      that is not released until it is unpinned. However, there is a
      complication to this method - the high level EFI/EFD code in
      xfs_bmap_finish() does not hold direct references to the EFI
      structure, and runs a transaction commit between the EFI and EFD
      processing. Hence the EFI can be freed even before the EFD is
      created using such a method.
      
      Further, log recovery uses the AIL for tracking EFI/EFDs that need
      to be recovered, but it uses the AIL *differently* to the EFI
      transaction commit. Hence log recovery never pins or unpins EFIs, so
      we can't drop the EFI reference count indirectly to free the EFI.
      
      However, this doesn't prevent us from using a reference count here.
      There is a 1:1 relationship between EFIs and EFDs, so when we
      initialise the EFI we can take a reference count for the EFD as
      well. This solves the xfs_bmap_finish() issue - the EFI will never
      be freed until the EFD is processed. In terms of log recovery,
      during the committing of the EFD we can look for the
      XFS_EFI_RECOVERED bit being set and drop the EFI reference as well,
      thereby ensuring everything works correctly there as well.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      666d644c
  10. 04 4月, 2013 1 次提交
  11. 23 3月, 2013 1 次提交
    • J
      xfs: Fix WARN_ON(delalloc) in xfs_vm_releasepage() · ff9a28f6
      Jan Kara 提交于
      When a dirty page is truncated from a file but reclaim gets to it before
      truncate_inode_pages(), we hit WARN_ON(delalloc) in
      xfs_vm_releasepage(). This is because reclaim tries to write the page,
      xfs_vm_writepage() just bails out (leaving page clean) and thus reclaim
      thinks it can continue and calls xfs_vm_releasepage() on page with dirty
      buffers.
      
      Fix the issue by redirtying the page in xfs_vm_writepage(). This makes
      reclaim stop reclaiming the page and also logically it keeps page in a
      more consistent state where page with dirty buffers has PageDirty set.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NCarlos Maiolino <cmaiolino@redhat.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      ff9a28f6