1. 31 May 2013 (7 commits)
    • xfs: fix dir3 freespace block corruption · 5ae6e6a4
      Committed by Dave Chinner
      When the directory freespace index grows to a second block (2017
      4k data blocks in the directory), the initialisation of the second
      new block header goes wrong. The write verifier fires a corruption
      error indicating that the block number in the header is zero. This
      was being tripped by xfs/110.
      
      The problem is that the initialisation of the new block is done just
      fine in xfs_dir3_free_get_buf(), but the caller then uses a dir v2
      structure to zero on-disk header fields that xfs_dir3_free_get_buf()
      has already zeroed. These lined up with the block number in the dir
      v3 header format.
      
      While looking at this, I noticed that struct xfs_dir3_free_hdr
      had 4 bytes of padding in it that wasn't defined as padding or being
      zeroed by the initialisation. Add a pad field declaration and fully
      zero the on disk and in-core headers in xfs_dir3_free_get_buf() so
      that this is never an issue in the future. Note that this doesn't
      change the on-disk layout, just makes the 32 bits of padding in the
      layout explicit.
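      
      As a sketch, the padded on-disk header described above would look like
      this (the layout follows the commit text; treat it as an illustration
      rather than a verbatim copy of the kernel source):
      
          struct xfs_dir3_free_hdr {
                  struct xfs_dir3_blk_hdr hdr;
                  __be32                  firstdb;
                  __be32                  nvalid;
                  __be32                  nused;
                  __be32                  pad;    /* 64 bit alignment - now
                                                     declared and zeroed */
          };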
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: kill suid/sgid through the truncate path. · 56c19e89
      Committed by Dave Chinner
      XFS has failed to kill suid/sgid bits correctly when truncating
      files of non-zero size since commit c4ed4243 ("xfs: split
      xfs_setattr") introduced in the 3.1 kernel. Fix it.
      
      cc: stable kernel <stable@vger.kernel.org>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: add fsgeom flag for v5 superblock support. · 74137fff
      Committed by Dave Chinner
      Currently userspace has no way of determining that a filesystem is
      CRC enabled. Add a flag to the XFS_IOC_FSGEOMETRY ioctl output to
      indicate that the filesystem has v5 superblock support enabled.
      This will allow xfs_info to correctly report the state of the
      filesystem.
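      
      A minimal userspace sketch of the check this enables, assuming the
      flag is exposed as XFS_FSOP_GEOM_FLAGS_V5SB (name taken from the
      patch summary):
      
          #include <stdio.h>
          #include <sys/ioctl.h>
          #include <xfs/xfs.h>
      
          /* print whether the filesystem behind fd has a v5 (CRC) sb */
          int report_v5(int fd)
          {
                  struct xfs_fsop_geom geo;
      
                  if (ioctl(fd, XFS_IOC_FSGEOMETRY, &geo) < 0)
                          return -1;
                  printf("crc=%d\n",
                         !!(geo.flags & XFS_FSOP_GEOM_FLAGS_V5SB));
                  return 0;
          }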
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: disable swap extents ioctl on CRC enabled filesystems · 02f75405
      Committed by Dave Chinner
      Currently, swapping extents from one inode to another is a simple
      act of switching data and attribute forks from one inode to another.
      This, unfortunately, is no longer so simple with CRC enabled
      filesystems as there is owner information embedded into the BMBT
      blocks that are swapped between inodes. Hence swapping the forks
      between inodes results in the inodes having mapping blocks that
      point to the wrong owner and hence are considered corrupt.
      
      To fix this we need an extent tree block or record based swap
      algorithm so that the BMBT block owner information can be updated
      atomically in the swap transaction. This is a significant piece of
      new work, so for the moment simply don't allow swap extent
      operations to succeed on CRC enabled filesystems.
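      
      A sketch of the interim guard this describes (placement and error
      value are assumptions based on the description):
      
          /* in xfs_swap_extents(): forks cannot be swapped while BMBT
           * blocks carry owner information */
          if (xfs_sb_version_hascrc(&mp->m_sb)) {
                  error = XFS_ERROR(EINVAL);
                  goto out_unlock;
          }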
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: fix split buffer vector log recovery support · 709da6a6
      Committed by Dave Chinner
      A long time ago in a galaxy far away....
      
      .. there was a commit made to fix some Irix-specific "fragmented
      buffer" log recovery problem:
      
      http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=commitdiff;h=b29c0bece51da72fb3ff3b61391a391ea54e1603
      
      That problem occurred when a contiguous dirty region of a buffer was
      split across two pages of an unmapped buffer. It's been a
      long time since that has been done in XFS, but the changes to log
      entire inode buffers for CRC enabled filesystems have
      re-introduced that corner case.
      
      And, of course, it turns out that the above commit didn't actually
      fix anything - it just ensured that log recovery is guaranteed to
      fail when this situation occurs. And now for the gory details.
      
      xfstest xfs/085 is failing with this assert:
      
      XFS (vdb): bad number of regions (0) in inode log format
      XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 1583
      
      Largely undocumented factoid #1: Log recovery depends on all log
      buffer format items starting with this format:
      
      struct foo_log_format {
      	__uint16_t	type;
      	__uint16_t	size;
      	....
      
      because recovery uses the size field and assumptions about 32 bit
      alignment when decoding format items. So don't pay much attention to
      the fact that log recovery thinks it is decoding an inode log format
      item - it just uses those fields to determine the size of the item.
      
      But why would it see a log format item with a zero size? Well,
      luckily enough xfs_logprint uses the same code and gives the same
      error, so with a bit of gdb magic, it turns out that it isn't a log
      format that is being decoded. What logprint tells us is this:
      
      Oper (130): tid: a0375e1a  len: 28  clientid: TRANS  flags: none
      BUF:  #regs: 2   start blkno: 144 (0x90)  len: 16  bmap size: 2  flags: 0x4000
      Oper (131): tid: a0375e1a  len: 4096  clientid: TRANS  flags: none
      BUF DATA
      ----------------------------------------------------------------------------
      Oper (132): tid: a0375e1a  len: 4096  clientid: TRANS  flags: none
      xfs_logprint: unknown log operation type (4e49)
      **********************************************************************
      * ERROR: data block=2                                                 *
      **********************************************************************
      
      So logprint tells us we've got a buffer format item (oper 130) that
      has two regions: the format item itself and one dirty region. The
      subsequent region after the buffer format item and its data is what
      we are tripping over, and the first bytes of it are an inode magic
      number, not a log opheader like there is supposed to be.
      
      That means there's a problem with the buffer format item. Its dirty
      data region is 4096 bytes, and it contains - you guessed it -
      initialised inodes. But inode buffers are 8k, not 4k, and we log
      them in their entirety. So something is wrong here. The buffer
      format item contains:
      
      (gdb) p /x *(struct xfs_buf_log_format *)in_f
      $22 = {blf_type = 0x123c, blf_size = 0x2, blf_flags = 0x4000,
             blf_len = 0x10, blf_blkno = 0x90, blf_map_size = 0x2,
             blf_data_map = {0xffffffff, 0xffffffff, .... }}
      
      Two regions, and a single contiguous dirty region of 64 bits.  64 *
      128 = 8k, so this should be followed by a single 8k region of data.
      And the blf_flags tell us that the type of buffer is a
      XFS_BLFT_DINO_BUF. It contains inodes. And because it doesn't have
      the XFS_BLF_INODE_BUF flag set, that means it's an inode allocation
      buffer. So, it should be followed by 8k of inode data.
      
      But we know that the next region has a header of:
      
      (gdb) p /x *ohead
      $25 = {oh_tid = 0x1a5e37a0, oh_len = 0x100000, oh_clientid = 0x69,
             oh_flags = 0x0, oh_res2 = 0x0}
      
      and so be32_to_cpu(oh_len) = 0x1000 = 4096 bytes. It's simply not
      long enough to hold all the logged data. There must be another
      region. There is - there's a following opheader for another 4k of
      data that contains the other half of the inode cluster data - the
      one we assert fail on because it's not a log format header.
      
      So why is the second part of the data not being accounted to the
      correct buffer log format structure? It took a little more work with
      gdb to work out that the buffer log format structure was both
      expecting it to be there but hadn't accounted for it. It was at that
      point I went to the kernel code, as clearly this wasn't a bug in
      xfs_logprint and the kernel was writing bad stuff to the log.
      
      First port of call was the buffer item formatting code, and the
      discontiguous memory/contiguous dirty region handling code
      immediately stood out. I've wondered for a long time why the code
      had this comment in it:
      
                              vecp->i_addr = xfs_buf_offset(bp, buffer_offset);
                              vecp->i_len = nbits * XFS_BLF_CHUNK;
                              vecp->i_type = XLOG_REG_TYPE_BCHUNK;
      /*
       * You would think we need to bump the nvecs here too, but we do not
       * this number is used by recovery, and it gets confused by the boundary
       * split here
       *                      nvecs++;
       */
                              vecp++;
      
      And it didn't account for the extra vector pointer. The case being
      handled here is that a contiguous dirty region lies across a
      boundary that cannot be memcpy()d across, and so has to be split
      into two separate operations for xlog_write() to perform.
      
      What this code assumes is that what is written to the log is two
      consecutive blocks of data that are accounted in the buf log format
      item as the same contiguous dirty region and so will get decoded as
      such by the log recovery code.
      
      The thing is, xlog_write() knows nothing about this, and so just
      does its normal thing of adding an opheader for each vector. That
      means the 8k region gets written to the log as two separate regions
      of 4k each, but because nvecs has not been incremented, the buf log
      format item accounts for only one of them.
      
      Hence when we come to log recovery, we process the first 4k region
      and then expect to come across a new item that starts with a log
      format structure of some kind that tells us what the next data is
      going to be. Instead, we hit raw buffer data and things go bad real
      quick.
      
      So, the commit from 2002 that commented out nvecs++ is just plain
      wrong. It breaks log recovery completely, and it would seem the only
      reason this hasn't been noticed since then is that we don't log large
      contiguous regions of multi-page unmapped buffers very often. Never
      would be a closer estimate, at least until the CRC code came along....
      
      So, let's fix that by restoring the nvecs accounting for the extra
      region when we hit this case.....
      
      .... and there's the problem in log recovery that it is apparently
      around:
      
      XFS: Assertion failed: i == item->ri_total, file: fs/xfs/xfs_log_recover.c, line: 2135
      
      Yup, xlog_recover_do_reg_buffer() doesn't handle contiguous dirty
      regions being broken up into multiple regions by the log formatting
      code. That's an easy fix, though - if the number of contiguous dirty
      bits exceeds the length of the region being copied out of the log,
      only account for the number of dirty bits that region covers, and
      then loop again and copy more from the next region. It's a 2 line
      fix.
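      
      A hedged sketch of that recovery-side fix (variable names follow
      xlog_recover_do_reg_buffer(); the exact form is an assumption from
      the description above):
      
          nbits = xfs_contig_bits(buf_f->blf_data_map,
                                  buf_f->blf_map_size, bit);
          /* a contiguous dirty range may have been split into two log
           * regions, so copy no more than this region holds; the loop
           * picks up the remainder from the next region */
          if (item->ri_buf[i].i_len < (nbits << XFS_BLF_SHIFT))
                  nbits = item->ri_buf[i].i_len >> XFS_BLF_SHIFT;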
      
      Now xfstests xfs/085 passes, we have one less piece of mystery
      code, and one more important piece of knowledge about how to
      structure new log format items.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: fix incorrect remote symlink block count · 321a9583
      Committed by Dave Chinner
      When CRCs are enabled, the number of blocks needed to hold a remote
      symlink on a 1k block size filesystem may be 2 instead of 1. The
      transaction reservation for the allocated blocks was not taking this
      into account and only allocating one block. Hence when trying to
      read or invalidate such symlinks, we are mapping a hole where there
      should be a block and things go bad at that point.
      
      Fix the reservation to use the correct block count, clean up the
      block count calculation similar to the remote attribute calculation,
      and add a debug guard to detect when we don't write the entire
      symlink to disk.
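      
      As a worked example of why two blocks can be needed (the header size
      is an assumption for illustration): with 1k blocks and a v5 per-block
      header of ~56 bytes, each block holds ~968 bytes of payload, so a
      target near the 1024-byte maximum spills into a second block:
      
          /* illustrative calculation, not the kernel's exact helper */
          int buf_space = blocksize - sizeof(struct xfs_dsymlink_hdr);
          int blocks = (pathlen + buf_space - 1) / buf_space;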
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: don't emit v5 superblock warnings on write · 34510185
      Committed by Dave Chinner
      We write the superblock every 30s or so which results in the
      verifier being called. Right now that results in this output
      every 30s:
      
      XFS (vda): Version 5 superblock detected. This kernel has EXPERIMENTAL support enabled!
      Use of these features in this kernel is at your own risk!
      
      That spams the logs.
      
      We don't need to check for whether we support v5 superblocks or
      whether there are feature bits we don't support set as these are
      only relevant when we first mount the filesystem, i.e. on superblock
      read. Hence for the write verification we can just skip all the
      checks (and hence verbose output) altogether.
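      
      A minimal sketch of the verifier split this implies (the
      check_version flag is an assumption based on the description):
      
          static void
          xfs_sb_write_verify(struct xfs_buf *bp)
          {
                  /* version/feature-bit checks only matter at mount
                   * (read) time, so skip them - and their warnings -
                   * when writing the superblock back */
                  xfs_sb_verify(bp, false /* check_version */);
          }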
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
  2. 24 May 2013 (6 commits)
    • xfs: rework remote attr CRCs · ad1858d7
      Committed by Dave Chinner
      Note: this changes the on-disk remote attribute format. I assert
      that this is OK to do as CRCs are marked experimental and the first
      kernel it is included in has not yet reached release. Further,
      the userspace utilities are still evolving and so anyone using this
      stuff right now is a developer or tester using volatile filesystems
      for testing this feature. Hence changing the format right now to
      save longer term pain is the right thing to do.
      
      The fundamental change is to move from a header per extent in the
      attribute to a header per filesystem block in the attribute. This
      means there are more header blocks and the parsing of the attribute
      data is slightly more complex, but it has the advantage that we
      always know the size of the attribute on disk based on the length of
      the data it contains.
      
      This is where the header-per-extent method has problems. We don't
      know the size of the attribute on disk without first knowing how
      many extents are used to hold it. And we can't tell from a
      mapping lookup, either, because remote attributes can be allocated
      contiguously with other attribute blocks and so there is no obvious
      way of determining the actual size of the attribute on disk short of
      walking and mapping buffers.
      
      The problem with this approach is that if we map a buffer
      incorrectly (e.g. we make the last buffer for the attribute data too
      long), we then get buffer cache lookup failure when we map it
      correctly. i.e. we get a size mismatch on lookup. This is not
      necessarily fatal, but it's a cache coherency problem that can lead
      to returning the wrong data to userspace or writing the wrong data
      to disk. And debug kernels will assert fail if this occurs.
      
      I found lots of niggly little problems trying to fix this issue on a
      4k block size filesystem, finally getting it to pass with lots of
      fixes. The thing is, 1024 byte filesystems still failed, and it was
      getting really complex handling all the corner cases that were
      showing up. And there were clearly more that I hadn't found yet.
      
      It is complex, fragile code, and if we don't fix it now, it will be
      complex, fragile code forever more.
      
      Hence the simple fix is to add a header to each filesystem block.
      This gives us the same relationship between the attribute data
      length and the number of blocks on disk as we have without CRCs -
      it's a linear mapping and doesn't require us to guess anything. It
      is simple to implement, too - the remote block count calculated at
      lookup time can be used by the remote attribute set/get/remove code
      without modification for both CRC and non-CRC filesystems. The world
      becomes sane again.
      
      Because the copy-in and copy-out now need to iterate over each
      filesystem block, I moved them into helper functions so we separate
      the block mapping and buffer manipulations from the attribute data
      and CRC header manipulations. The code becomes much clearer as a
      result, and it is a lot easier to understand and debug. It also
      appears to be much more robust - once it worked on 4k block size
      filesystems, it has worked without failure on 1k block size
      filesystems, too.
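      
      A sketch of the linear mapping this buys us (helper and macro names
      are modelled on the commit description and should be treated as
      assumptions):
      
          /* with a header in every fs block, the attribute length maps
           * directly to an on-disk block count */
          static int
          xfs_attr3_rmt_blocks(struct xfs_mount *mp, int attrlen)
          {
                  if (xfs_sb_version_hascrc(&mp->m_sb)) {
                          int buflen = XFS_ATTR3_RMT_BUF_SPACE(mp,
                                                  mp->m_sb.sb_blocksize);
                          return (attrlen + buflen - 1) / buflen;
                  }
                  return XFS_B_TO_FSB(mp, attrlen);
          }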
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: fully initialise temp leaf in xfs_attr3_leaf_compact · d4c712bc
      Committed by Dave Chinner
      xfs_attr3_leaf_compact() uses a temporary buffer for compacting
      the entries in a leaf. It copies the original buffer into the
      temporary buffer, then zeros the original buffer completely. It then
      copies the entries back into the original buffer.  However, the
      original buffer has not been correctly initialised, and so the
      movement of the entries goes horribly wrong.
      
      Make sure the zeroed destination buffer is fully initialised, and
      once we've set up the destination incore header appropriately, write
      it back to the buffer before starting to move entries around.
      
      While debugging this, the _d/_s prefixes weren't sufficient to
      remind me which buffer was which, so rename them all to _src/_dst.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: fully initialise temp leaf in xfs_attr3_leaf_unbalance · 8517de2a
      Committed by Dave Chinner
      xfs_attr3_leaf_unbalance() uses a temporary buffer for recombining
      the entries in two leaves when the destination leaf requires
      compaction. The temporary buffer ends up being copied back over the
      original destination buffer, so the header in the temporary buffer
      needs to contain all the information that is in the destination
      buffer.
      
      To make sure the temporary buffer is fully initialised, once we've
      set up the temporary incore header appropriately, write it back to
      the temporary buffer before starting to move entries around.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: correctly map remote attr buffers during removal · 6863ef84
      Committed by Dave Chinner
      If we don't map the buffers correctly (same as for get/set
      operations) then the incore buffer lookup will fail. If a block
      number matches but a length is wrong, then debug kernels will ASSERT
      fail in _xfs_buf_find() due to the length mismatch. Ensure that we
      map the buffers correctly by basing the length of the buffer on the
      attribute data length rather than the remote block count.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: remote attribute tail zeroing does too much · 4af3644c
      Committed by Dave Chinner
      When the attribute data does not fill the entire remote block, we
      zero the remaining part of the buffer. This, however, needs to take
      into account that the buffer has a header, and so the offset where
      zeroing starts and the length of zeroing need to take this into
      account. Otherwise we end up with zeros over the end of the
      attribute value when CRCs are enabled.
      
      While there, make sure we only ask to map an extent that covers the
      remaining range of the attribute, rather than asking every time for
      the full length of remote data. If the remote attribute blocks are
      contiguous with other parts of the attribute tree, it will map those
      blocks as well and we can potentially zero them incorrectly. We can
      also get buffer size mismatches when trying to read or remove the
      remote attribute, and this can lead to not finding the correct
      buffer when looking it up in cache.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: remote attribute read too short · 913e96bc
      Committed by Dave Chinner
      Reading a maximally sized remote attribute fails when CRCs are
      enabled with this verification error:
      
      XFS (vdb): remote attribute header does not match required off/len/owner)
      
      There are two reasons for this, the first being that the
      length of the buffer being read is determined from the
      args->rmtblkcnt which doesn't take into account CRC headers. Hence
      the mapped length ends up being too short and so we need to
      calculate it directly from the value length.
      
      The second is that the byte count of valid data within a buffer is
      capped by the length of the data and so doesn't take into account
      that the buffer might be longer due to headers. Hence we need to
      calculate the data space in the buffer first before calculating the
      actual byte count of data.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
  3. 22 May 2013 (2 commits)
    • xfs: remote attribute allocation may be contiguous · 90253cf1
      Committed by Dave Chinner
      When CRCs are enabled, there may be multiple allocations made if the
      headers cause a length overflow. This, however, does not mean that
      the number of headers required increases, as the second and
      subsequent extents may be contiguous with the previous extent. Hence
      when we map the extents to write the attribute data, we may end up
      with fewer extents than allocations made. Hence the assertion that we
      consume the number of headers we calculated in the allocation loop
      is incorrect and needs to be removed.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: avoid nesting transactions in xfs_qm_scall_setqlim() · f648167f
      Committed by Dave Chinner
      Lockdep reports:
      
      =============================================
      [ INFO: possible recursive locking detected ]
      3.9.0+ #3 Not tainted
      ---------------------------------------------
      setquota/28368 is trying to acquire lock:
       (sb_internal){++++.?}, at: [<c11e8846>] xfs_trans_alloc+0x26/0x50
      
      but task is already holding lock:
       (sb_internal){++++.?}, at: [<c11e8846>] xfs_trans_alloc+0x26/0x50
      
      from xfs_qm_scall_setqlim()->xfs_dqread() when a dquot needs to be
      allocated.
      
      xfs_qm_scall_setqlim() is starting a transaction and then not
      passing it into xfs_qm_dqget(), and so it starts its own transaction
      when allocating the dquot.  Splat!
      
      Fix this by not allocating the dquot in xfs_qm_scall_setqlim()
      inside the setqlim transaction. This requires getting the dquot
      first (and allocating it if necessary) then dropping and relocking
      the dquot before joining it to the setqlim transaction.
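      
      A sketch of the reordering this describes (reservation details and
      error handling elided; the call shapes are assumptions):
      
          /* get (and if necessary allocate) the dquot outside any
           * transaction context */
          error = xfs_qm_dqget(mp, NULL, id, type, XFS_QMOPT_DQALLOC, &dqp);
          if (error)
                  return error;
          xfs_dqunlock(dqp);
      
          tp = xfs_trans_alloc(mp, XFS_TRANS_QM_SETQLIM);
          /* ... xfs_trans_reserve() ... */
      
          /* relock the dquot and join it to the setqlim transaction */
          xfs_dqlock(dqp);
          xfs_trans_dqjoin(tp, dqp);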
      Reported-by: Michael L. Semon <mlsemon35@gmail.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
  4. 21 May 2013 (8 commits)
    • xfs: remote attribute lookups require the value length · e461fcb1
      Committed by Dave Chinner
      When reading a remote attribute, to correctly calculate the length
      of the data buffer for CRC enabled filesystems, we need to know the
      length of the attribute data. We get this information when we look
      up the attribute, but we don't store it in the args structure along
      with the other remote attr information we get from the lookup. Add
      this information to the args structure so we can use it
      appropriately.
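      
      In sketch form (the field name here is hypothetical; the commit only
      says the value length is stored in the args structure):
      
          /* at lookup time, remember how long the remote value is so
           * the read path can size its buffer mappings correctly */
          args->rmtvaluelen = be32_to_cpu(name_rmt->valuelen);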
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: xfs_attr_shortform_allfit() does not handle attr3 format. · b38958d7
      Committed by Dave Chinner
      xfstests generic/117 fails with:
      
      XFS: Assertion failed: leaf->hdr.info.magic == cpu_to_be16(XFS_ATTR_LEAF_MAGIC)
      
      indicating a function that does not handle the attr3 format
      correctly. Fix it.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • 72916fb8
    • xfs: fix missing KM_NOFS tags to keep lockdep happy · ac14876c
      Committed by Dave Chinner
      There are several places where we use KM_SLEEP allocation contexts
      and use the fact that they are called from transaction context to
      add KM_NOFS where appropriate. Unfortunately, there are several
      places where the code makes this assumption but can be called from
      outside transaction context but with filesystem locks held. These
      places need explicit KM_NOFS annotations to avoid lockdep
      complaining about reclaim contexts.
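      
      The shape of the annotation in question, as a minimal sketch:
      
          /* reachable with filesystem locks held but outside transaction
           * context, so reclaim must not recurse into the filesystem */
          ptr = kmem_alloc(size, KM_SLEEP | KM_NOFS);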
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: Don't reference the EFI after it is freed · 52c24ad3
      Committed by Dave Chinner
      Checking the EFI for whether it is being released from recovery
      after we've already released the known active reference is a mistake
      worthy of a brown paper bag. Fix the (now) obvious use after free
      that it can cause.
      Reported-by: Dave Jones <davej@redhat.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: fix rounding in xfs_free_file_space · 28ca489c
      Committed by Dave Chinner
      The offset passed into xfs_free_file_space() needs to be rounded
      down to a certain size, but the rounding mask is built by a 32 bit
      variable. Hence the mask will always mask off the upper 32 bits of
      the offset and lead to incorrect writeback and invalidation ranges.
      
      This is not actually exposed as a bug because we writeback and
      invalidate from the rounded offset to the end of the file, and hence
      the offset we are actually punching a hole out of will always be
      covered by the code. This needs fixing, however, if we ever want to
      use exact ranges for writeback/invalidation here...
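      
      A sketch of the fix this implies (the PAGE_CACHE_SIZE bound matches
      the code of that era; treat the details as assumptions):
      
          /* build the rounding mask in 64 bits so that large offsets
           * are not truncated */
          xfs_off_t rounding = max_t(xfs_off_t, 1 << mp->m_sb.sb_blocklog,
                                     PAGE_CACHE_SIZE);
          ioffset = offset & ~(rounding - 1);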
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: fix sub-page blocksize data integrity writes · 49b137cb
      Committed by Dave Chinner
      FSX on 512 byte block size filesystems has been failing for some
      time with corrupted data. The fault dates back to the change in
      the writeback data integrity algorithm that uses a mark-and-sweep
      approach to avoid data writeback livelocks.
      
      Unfortunately, a side effect of this mark-and-sweep approach is that
      each page will only be written once for a data integrity sync, and
      there is a condition in writeback in XFS where a page may require
      two writeback attempts to be fully written. As a result of the high
      level change, we now only get a partial page writeback during the
      integrity sync because the first pass through writeback clears the
      mark left on the page index to tell writeback that the page needs
      writeback....
      
      The cause is writing a partial page in the clustering code. This can
      happen when a mapping boundary falls in the middle of a page - we
      end up writing back the first part of the page that the mapping
      covers, but then never revisit the page to have the remainder mapped
      and written.
      
      The fix is simple - if the mapping boundary falls inside a page,
      then simply abort clustering without touching the page. This means
      that the next ->writepage entry that write_cache_pages() will make
      is the page we aborted on, and xfs_vm_writepage() will map all
      sections of the page correctly. This behaviour is also optimal for
      non-data integrity writes, as it results in contiguous sequential
      writeback of the file rather than missing small holes and having to
      write them as "random" writes in a future pass.
      
      With this fix, all the fsx tests in xfstests now pass on a 512 byte
      block size filesystem on a 4k page machine.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: Avoid pathological backwards allocation · 211d022c
      Committed by Jan Kara
      Writing a large file using direct IO in 16 MB chunks sometimes results
      in a pathological allocation pattern where 16 MB chunks of a large free
      extent are allocated to a file in reversed order. The extents of such a
      file look, for example, like this:
      
       ext logical physical expected length flags
         0        0        13          4550656
         1  4550656 188136807   4550668 12562432
         2 17113088 200699240 200699238 622592
         3 17735680 182046055 201321831   4096
         4 17739776 182041959 182050150   4096
         5 17743872 182037863 182046054   4096
         6 17747968 182033767 182041958   4096
         7 17752064 182029671 182037862   4096
      ...
      6757 45400064 154381644 154389835   4096
      6758 45404160 154377548 154385739   4096
      6759 45408256 252951571 154381643  73728 eof
      
      This happens because XFS_ALLOCTYPE_THIS_BNO allocation fails (the last
      extent in the file cannot be further extended) so we fall back to
      XFS_ALLOCTYPE_NEAR_BNO allocation which picks end of a large free
      extent as the best place to continue the file. Since the chunk at the
      end of the free extent again cannot be further extended, this behavior
      repeats until the whole free extent is consumed in a reversed order.
      
      For data allocations this backward allocation isn't beneficial, so make
      xfs_alloc_compute_diff() pick the start of a free extent instead of its
      end for them. That avoids the backward allocation pattern.
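      
      A sketch of the policy change (variable names are assumptions; the
      description implies a userdata-style flag threaded into
      xfs_alloc_compute_diff()):
      
          /* for data allocations, prefer the start of the candidate free
           * extent so successive chunks walk the extent forwards */
          if (userdata)
                  newbno = freebno;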
      
      See thread at http://oss.sgi.com/archives/xfs/2013-03/msg00144.html for
      more details about the reproduction case and why this solution was
      chosen.
      
      Based on idea by Dave Chinner <dchinner@redhat.com>.
      
      CC: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
  5. 10 May 2013 (4 commits)
    • eCryptfs: Use the ablkcipher crypto API · 4dfea4f0
      Committed by Tyler Hicks
      Make the switch from the blkcipher kernel crypto interface to the
      ablkcipher interface.
      
      encrypt_scatterlist() and decrypt_scatterlist() now use the ablkcipher
      interface but, from the eCryptfs standpoint, still treat the crypto
      operation as a synchronous operation. They submit the async request and
      then wait until the operation is finished before they return. Most of
      the changes are contained inside those two functions.
      
      Despite waiting for the completion of the crypto operation, the
      ablkcipher interface provides performance increases in most cases when
      used on AES-NI capable hardware.
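      
      A condensed sketch of the submit-and-wait pattern described above,
      using the ablkcipher API of that era (the completion structure and
      callback name are assumptions):
      
          struct ablkcipher_request *req;
          int rc;
      
          req = ablkcipher_request_alloc(tfm, GFP_NOFS);
          ablkcipher_request_set_callback(req,
                          CRYPTO_TFM_REQ_MAY_BACKLOG |
                          CRYPTO_TFM_REQ_MAY_SLEEP,
                          extent_crypt_complete, &ecr);
          ablkcipher_request_set_crypt(req, src_sg, dst_sg, size, iv);
          rc = crypto_ablkcipher_encrypt(req);
          if (rc == -EINPROGRESS || rc == -EBUSY) {
                  /* async submission: block until the completion fires */
                  wait_for_completion(&ecr.completion);
                  rc = ecr.rc;
          }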
      Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
      Acked-by: Colin King <colin.king@canonical.com>
      Reviewed-by: Zeev Zilberman <zeev@annapurnaLabs.com>
      Cc: Dustin Kirkland <dustin.kirkland@gazzang.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Ying Huang <ying.huang@intel.com>
      Cc: Thieu Le <thieule@google.com>
      Cc: Li Wang <dragonylffly@163.com>
      Cc: Jarkko Sakkinen <jarkko.sakkinen@iki.fi>
    • 91c2e0bc
    • ecryptfs: don't open-code kernel_read() · 39dfe6c6
      Committed by Al Viro
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • nfsd: fix oops when legacy_recdir_name_error is passed a -ENOENT error · 7255e716
      Committed by Jeff Layton
      Toralf reported the following oops to the linux-nfs mailing list:
      
          -----------------[snip]------------------
          NFSD: unable to generate recoverydir name (-2).
          NFSD: disabling legacy clientid tracking. Reboot recovery will not function correctly!
          BUG: unable to handle kernel NULL pointer dereference at 000003c8
          IP: [<f90a3d91>] nfsd4_client_tracking_exit+0x11/0x50 [nfsd]
          *pdpt = 000000002ba33001 *pde = 0000000000000000
          Oops: 0000 [#1] SMP
        Modules linked in: loop nfsd auth_rpcgss ipt_MASQUERADE xt_owner xt_multiport ipt_REJECT xt_tcpudp xt_recent xt_conntrack nf_conntrack_ftp xt_limit xt_LOG iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_filter ip_tables x_tables af_packet pppoe pppox ppp_generic slhc bridge stp llc tun arc4 iwldvm mac80211 coretemp kvm_intel uvcvideo sdhci_pci sdhci mmc_core videobuf2_vmalloc videobuf2_memops usblp videobuf2_core i915 iwlwifi psmouse videodev cfg80211 kvm fbcon bitblit cfbfillrect acpi_cpufreq mperf evdev softcursor font cfbimgblt i2c_algo_bit cfbcopyarea intel_agp intel_gtt drm_kms_helper snd_hda_codec_conexant drm agpgart fb fbdev tpm_tis thinkpad_acpi tpm nvram e1000e rfkill thermal ptp wmi pps_core tpm_bios 8250_pci processor 8250 ac snd_hda_intel snd_hda_codec snd_pcm battery video i2c_i801 snd_page_alloc snd_timer button serial_core i2c_core snd soundcore thermal_sys hwmon aesni_intel ablk_helper cryptd lrw aes_i586 xts gf128mul cbc fuse nfs lockd sunrpc dm_crypt dm_mod hid_monterey hid_microsoft hid_logitech hid_ezkey hid_cypress hid_chicony hid_cherry hid_belkin hid_apple hid_a4tech hid_generic usbhid hid sr_mod cdrom sg [last unloaded: microcode]
          Pid: 6374, comm: nfsd Not tainted 3.9.1 #6 LENOVO 4180F65/4180F65
          EIP: 0060:[<f90a3d91>] EFLAGS: 00010202 CPU: 0
          EIP is at nfsd4_client_tracking_exit+0x11/0x50 [nfsd]
          EAX: 00000000 EBX: fffffffe ECX: 00000007 EDX: 00000007
          ESI: eb9dcb00 EDI: eb2991c0 EBP: eb2bde38 ESP: eb2bde34
          DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
          CR0: 80050033 CR2: 000003c8 CR3: 2ba80000 CR4: 000407f0
          DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
          DR6: ffff0ff0 DR7: 00000400
          Process nfsd (pid: 6374, ti=eb2bc000 task=eb2711c0 task.ti=eb2bc000)
          Stack:
          fffffffe eb2bde4c f90a3e0c f90a7754 fffffffe eb0a9c00 eb2bdea0 f90a41ed
          eb2991c0 1b270000 eb2991c0 eb2bde7c f9099ce9 eb2bde98 0129a020 eb29a020
          eb2bdecc eb2991c0 eb2bdea8 f9099da5 00000000 eb9dcb00 00000001 67822f08
          Call Trace:
          [<f90a3e0c>] legacy_recdir_name_error+0x3c/0x40 [nfsd]
          [<f90a41ed>] nfsd4_create_clid_dir+0x15d/0x1c0 [nfsd]
          [<f9099ce9>] ? nfsd4_lookup_stateid+0x99/0xd0 [nfsd]
          [<f9099da5>] ? nfs4_preprocess_seqid_op+0x85/0x100 [nfsd]
          [<f90a4287>] nfsd4_client_record_create+0x37/0x50 [nfsd]
          [<f909d6ce>] nfsd4_open_confirm+0xfe/0x130 [nfsd]
          [<f90980b1>] ? nfsd4_encode_operation+0x61/0x90 [nfsd]
          [<f909d5d0>] ? nfsd4_free_stateid+0xc0/0xc0 [nfsd]
          [<f908fd0b>] nfsd4_proc_compound+0x41b/0x530 [nfsd]
          [<f9081b7b>] nfsd_dispatch+0x8b/0x1a0 [nfsd]
          [<f857b85d>] svc_process+0x3dd/0x640 [sunrpc]
          [<f908165d>] nfsd+0xad/0x110 [nfsd]
          [<f90815b0>] ? nfsd_destroy+0x70/0x70 [nfsd]
          [<c1054824>] kthread+0x94/0xa0
          [<c1486937>] ret_from_kernel_thread+0x1b/0x28
          [<c1054790>] ? flush_kthread_work+0xd0/0xd0
          Code: 86 b0 00 00 00 90 c5 0a f9 c7 04 24 70 76 0a f9 e8 74 a9 3d c8 eb ba 8d 76 00 55 89 e5 53 66 66 66 66 90 8b 15 68 c7 0a f9 85 d2 <8b> 88 c8 03 00 00 74 2c 3b 11 77 28 8b 5c 91 08 85 db 74 22 8b
          EIP: [<f90a3d91>] nfsd4_client_tracking_exit+0x11/0x50 [nfsd] SS:ESP 0068:eb2bde34
          CR2: 00000000000003c8
          ---[ end trace 09e54015d145c9c6 ]---
      
      The problem appears to be a regression that was introduced in commit
      9a9c6478 "nfsd: make NFSv4 recovery client tracking options per net".
      Prior to that commit, it was safe to pass a NULL net pointer to
      nfsd4_client_tracking_exit in the legacy recdir case, and
      legacy_recdir_name_error did so. After that commit, the net pointer must
      be valid.
      
      This patch just fixes legacy_recdir_name_error to pass in a valid net
      pointer to that function.
      
      Cc: <stable@vger.kernel.org> # v3.8+
      Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
      Reported-and-tested-by: Toralf Förster <toralf.foerster@gmx.de>
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
  6. 09 May 2013 (2 commits)
  7. 08 May 2013 (11 commits)
    • f2fs: cover free_nid management with spin_lock · 59bbd474
      Committed by Jaegeuk Kim
      After build_free_nids() searches free nid candidates from nat pages and
      current journal blocks, it checks each candidate against the nat cache
      to see whether its nid already has an allocated block address.
      
      In this procedure, previously we used
          list_for_each_entry_safe(fnid, next_fnid, &nm_i->free_nid_list, list).
      But, this is not covered by free_nid_list_lock, resulting in a null pointer bug.
      
      This patch moves this checking routine inside add_free_nid() in order not to use
      the spin_lock.
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
    • f2fs: optimize scan_nat_page() · 23d38844
      Committed by Haicheng Li
      When nm_i->fcnt > 2 * MAX_FREE_NIDS, stop scanning other NAT entries.
      Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
      [Jaegeuk Kim: fix handling the return value of add_free_nid()]
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
    • f2fs: code cleanup for scan_nat_page() and build_free_nids() · 8760952d
      Committed by Haicheng Li
      This patch does two cleanups:
      1. remove unused variable "fcnt" in build_free_nids().
      2. make scan_nat_page() as void type and remove useless variable "fcnt".
      Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
    • f2fs: bugfix for alloc_nid_failed() · 95630cba
      Committed by Haicheng Li
      Directly drop the free_nid cache when nm_i->fcnt > 2 * MAX_FREE_NIDS
      
      Since there is no nm_i->free_nid_list_lock spinlock protection between
      a sequential calling of alloc_nid() and alloc_nid_failed(), some other
      threads may already have added new free_nids to the free_nid_list
      during this period.
      
      We need to make sure nm_i->fcnt never exceeds 2 * MAX_FREE_NIDS.
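      
      A sketch of the resulting check in alloc_nid_failed() (lookup and
      list handling elided; details are assumptions):
      
          spin_lock(&nm_i->free_nid_list_lock);
          if (nm_i->fcnt > 2 * MAX_FREE_NIDS) {
                  /* cache already over the cap: drop the entry rather
                   * than returning it to the free list */
                  __del_from_free_nid_list(i);
          } else {
                  i->state = NID_NEW;
                  nm_i->fcnt++;
          }
          spin_unlock(&nm_i->free_nid_list_lock);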
      Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
      [Jaegeuk Kim: fit the coding style]
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
    • f2fs: recover when journal contains deleted files · 047184b4
      Committed by Chris Fries
      When recovering a journal file with fsync data for files that have
      been deleted, don't bail out on recovery.
      Signed-off-by: Chris Fries <C.Fries@motorola.com>
      Reviewed-by: Russell Knize <rknize2@motorola.com>
      Reviewed-by: Jason Hrycay <jason.hrycay@motorola.com>
      [Jaegeuk Kim: fit the coding style]
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
    • f2fs: continue to mount after failing recovery · bde582b2
      Committed by Chris Fries
      When unable to roll forward the journal, we shouldn't bail out and
      not mount, we should continue to attempt the mount.  Bad recovery data
      is likely unrecoverable at this point, and requiring the user to try
      to mount again doesn't solve any issues.
      Signed-off-by: Chris Fries <C.Fries@motorola.com>
      Reviewed-by: Russell Knize <rknize2@motorola.com>
      Reviewed-by: Jason Hrycay <jason.hrycay@motorola.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
    • f2fs: avoid deadlock during evict after f2fs_gc · 531ad7d5
      Committed by Jaegeuk Kim
      o Deadlock case #1
      
      Thread 1:
      - writeback_sb_inodes
       - do_writepages
        - f2fs_write_data_pages
         - write_cache_pages
          - f2fs_write_data_page
           - f2fs_balance_fs
            - wait mutex_lock(gc_mutex)
      
      Thread 2:
      - f2fs_balance_fs
       - mutex_lock(gc_mutex)
       - f2fs_gc
        - f2fs_iget
         - wait iget_locked(inode->i_lock)
      
      Thread 3:
      - do_unlinkat
       - iput
        - lock(inode->i_lock)
         - evict
          - inode_wait_for_writeback
      
      o Deadlock case #2
      
      Thread 1:
      - __writeback_single_inode
       : set I_SYNC
        - do_writepages
         - f2fs_write_data_page
          - f2fs_balance_fs
           - f2fs_gc
            - iput
             - evict
              - inode_wait_for_writeback(I_SYNC)
      
      In order to avoid this, even though iput is called with a zero
      reference count, we need to stop the eviction procedure if the inode
      is under writeback. So this patch hooks up f2fs_drop_inode, which
      checks the I_SYNC flag.
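      
      A sketch of that hook (it mirrors the standard drop_inode pattern;
      treat the exact body as an assumption):
      
          static int f2fs_drop_inode(struct inode *inode)
          {
                  /* keep the inode out of eviction while writeback
                   * still holds I_SYNC on it */
                  if (!inode_unhashed(inode) && inode->i_state & I_SYNC)
                          return 0;
                  return generic_drop_inode(inode);
          }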
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
    • aio: don't include aio.h in sched.h · a27bb332
      Committed by Kent Overstreet
      Faster kernel compiles by way of fewer unnecessary includes.
      
      [akpm@linux-foundation.org: fix fallout]
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      Cc: Zach Brown <zab@redhat.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Reviewed-by: N"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a27bb332
    • aio: kill ki_retry · 41ef4eb8
      Committed by Kent Overstreet
      Thanks to Zach Brown's work to rip out the retry infrastructure, we don't
      need this anymore - ki_retry was only called right after the kiocb was
      initialized.
      
      This also refactors and trims some duplicated code, as well as cleaning up
      the refcounting/error handling a bit.
      
      [akpm@linux-foundation.org: use fmode_t in aio_run_iocb()]
      [akpm@linux-foundation.org: fix file_start_write/file_end_write tests]
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      Cc: Zach Brown <zab@redhat.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Reviewed-by: N"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      41ef4eb8
    • aio: kill ki_key · 8a660890
      Committed by Kent Overstreet
      ki_key wasn't actually used for anything previously - it was always 0.
      Drop it to trim struct kiocb a bit.
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      Cc: Zach Brown <zab@redhat.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Reviewed-by: N"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8a660890
    • audit: vfs: fix audit_inode call in O_CREAT case of do_last · 33e2208a
      Committed by Jeff Layton
      Jiri reported a regression in auditing of open(..., O_CREAT) syscalls.
      In older kernels, creating a file with open(..., O_CREAT) created
      audit_name records that looked like this:
      
      type=PATH msg=audit(1360255720.628:64): item=1 name="/abc/foo" inode=138810 dev=fd:00 mode=0100640 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
      type=PATH msg=audit(1360255720.628:64): item=0 name="/abc/" inode=138635 dev=fd:00 mode=040750 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
      
      ...in recent kernels though, they look like this:
      
      type=PATH msg=audit(1360255402.886:12574): item=2 name=(null) inode=264599 dev=fd:00 mode=0100640 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
      type=PATH msg=audit(1360255402.886:12574): item=1 name=(null) inode=264598 dev=fd:00 mode=040750 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
      type=PATH msg=audit(1360255402.886:12574): item=0 name="/abc/foo" inode=264598 dev=fd:00 mode=040750 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
      
      Richard bisected to determine that the problems started with commit
      bfcec708, but the log messages have changed with some later
      audit-related patches.
      
      The problem is that this audit_inode call is passing in the parent of
      the dentry being opened, but audit_inode is being called with the parent
      flag false. This causes later audit_inode and audit_inode_child calls to
      match the wrong entry in the audit_names list.
      
      This patch simply sets the flag to properly indicate that this inode
      represents the parent. With this, the audit_names entries are back to
      looking like they did before.
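      
      In sketch form, the fix is the flag on the existing audit_inode()
      call in do_last()'s O_CREAT path (the exact flag value is an
      assumption):
      
          /* audit the parent directory as a parent, so that later
           * audit_inode()/audit_inode_child() calls match the right
           * audit_names entry */
          audit_inode(name, dir, LOOKUP_PARENT);  /* was: ..., 0 */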
      
      Cc: <stable@vger.kernel.org> # v3.7+
      Reported-by: Jiri Jaburek <jjaburek@redhat.com>
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Tested-by: Richard Guy Briggs <rbriggs@redhat.com>
      Signed-off-by: Eric Paris <eparis@redhat.com>