1. 31 5月, 2013 13 次提交
    • D
      xfs: fully initialise temp leaf in xfs_attr3_leaf_compact · 634fd532
      Dave Chinner 提交于
      xfs_attr3_leaf_compact() uses a temporary buffer for compacting the
      the entries in a leaf. It copies the the original buffer into the
      temporary buffer, then zeros the original buffer completely. It then
      copies the entries back into the original buffer.  However, the
      original buffer has not been correctly initialised, and so the
      movement of the entries goes horribly wrong.
      
      Make sure the zeroed destination buffer is fully initialised, and
      once we've set up the destination incore header appropriately, write
      is back to the buffer before starting to move entries around.
      
      While debugging this, the _d/_s prefixes weren't sufficient to
      remind me what buffer was what, so rename then all _src/_dst.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      
      (cherry picked from commit d4c712bc)
      634fd532
    • D
      xfs: fully initialise temp leaf in xfs_attr3_leaf_unbalance · 9e80c762
      Dave Chinner 提交于
      xfs_attr3_leaf_unbalance() uses a temporary buffer for recombining
      the entries in two leaves when the destination leaf requires
      compaction. The temporary buffer ends up being copied back over the
      original destination buffer, so the header in the temporary buffer
      needs to contain all the information that is in the destination
      buffer.
      
      To make sure the temporary buffer is fully initialised, once we've
      set up the temporary incore header appropriately, write is back to
      the temporary buffer before starting to move entries around.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      
      (cherry picked from commit 8517de2a)
      9e80c762
    • D
      xfs: correctly map remote attr buffers during removal · 58a72281
      Dave Chinner 提交于
      If we don't map the buffers correctly (same as for get/set
      operations) then the incore buffer lookup will fail. If a block
      number matches but a length is wrong, then debug kernels will ASSERT
      fail in _xfs_buf_find() due to the length mismatch. Ensure that we
      map the buffers correctly by basing the length of the buffer on the
      attribute data length rather than the remote block count.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      
      (cherry picked from commit 6863ef84)
      58a72281
    • D
      xfs: remote attribute tail zeroing does too much · 26f71445
      Dave Chinner 提交于
      When an attribute data does not fill then entire remote block, we
      zero the remaining part of the buffer. This, however, needs to take
      into account that the buffer has a header, and so the offset where
      zeroing starts and the length of zeroing need to take this into
      account. Otherwise we end up with zeros over the end of the
      attribute value when CRCs are enabled.
      
      While there, make sure we only ask to map an extent that covers the
      remaining range of the attribute, rather than asking every time for
      the full length of remote data. If the remote attribute blocks are
      contiguous with other parts of the attribute tree, it will map those
      blocks as well and we can potentially zero them incorrectly. We can
      also get buffer size mistmatches when trying to read or remove the
      remote attribute, and this can lead to not finding the correct
      buffer when looking it up in cache.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      
      (cherry picked from commit 4af3644c)
      26f71445
    • D
      xfs: remote attribute read too short · 551b382f
      Dave Chinner 提交于
      Reading a maximally size remote attribute fails when CRCs are
      enabled with this verification error:
      
      XFS (vdb): remote attribute header does not match required off/len/owner)
      
      There are two reasons for this, the first being that the
      length of the buffer being read is determined from the
      args->rmtblkcnt which doesn't take into account CRC headers. Hence
      the mapped length ends up being too short and so we need to
      calculate it directly from the value length.
      
      The second is that the byte count of valid data within a buffer is
      capped by the length of the data and so doesn't take into account
      that the buffer might be longer due to headers. Hence we need to
      calculate the data space in the buffer first before calculating the
      actual byte count of data.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      
      (cherry picked from commit 913e96bc)
      551b382f
    • D
      xfs: remote attribute allocation may be contiguous · 9531e2de
      Dave Chinner 提交于
      When CRCs are enabled, there may be multiple allocations made if the
      headers cause a length overflow. This, however, does not mean that
      the number of headers required increases, as the second and
      subsequent extents may be contiguous with the previous extent. Hence
      when we map the extents to write the attribute data, we may end up
      with less extents than allocations made. Hence the assertion that we
      consume the number of headers we calculated in the allocation loop
      is incorrect and needs to be removed.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      
      (cherry picked from commit 90253cf1)
      9531e2de
    • D
      xfs: fix dir3 freespace block corruption · e400d27d
      Dave Chinner 提交于
      When the directory freespace index grows to a second block (2017
      4k data blocks in the directory), the initialisation of the second
      new block header goes wrong. The write verifier fires a corruption
      error indicating that the block number in the header is zero. This
      was being tripped by xfs/110.
      
      The problem is that the initialisation of the new block is done just
      fine in xfs_dir3_free_get_buf(), but the caller then users a dirv2
      structure to zero on-disk header fields that xfs_dir3_free_get_buf()
      has already zeroed. These lined up with the block number in the dir
      v3 header format.
      
      While looking at this, I noticed that the struct xfs_dir3_free_hdr()
      had 4 bytes of padding in it that wasn't defined as padding or being
      zeroed by the initialisation. Add a pad field declaration and fully
      zero the on disk and in-core headers in xfs_dir3_free_get_buf() so
      that this is never an issue in the future. Note that this doesn't
      change the on-disk layout, just makes the 32 bits of padding in the
      layout explicit.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      
      (cherry picked from commit 5ae6e6a4)
      e400d27d
    • D
      xfs: disable swap extents ioctl on CRC enabled filesystems · 7c9950fd
      Dave Chinner 提交于
      Currently, swapping extents from one inode to another is a simple
      act of switching data and attribute forks from one inode to another.
      This, unfortunately in no longer so simple with CRC enabled
      filesystems as there is owner information embedded into the BMBT
      blocks that are swapped between inodes. Hence swapping the forks
      between inodes results in the inodes having mapping blocks that
      point to the wrong owner and hence are considered corrupt.
      
      To fix this we need an extent tree block or record based swap
      algorithm so that the BMBT block owner information can be updated
      atomically in the swap transaction. This is a significant piece of
      new work, so for the moment simply don't allow swap extent
      operations to succeed on CRC enabled filesystems.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      
      (cherry picked from commit 02f75405)
      7c9950fd
    • D
      xfs: add fsgeom flag for v5 superblock support. · e7927e87
      Dave Chinner 提交于
      Currently userspace has no way of determining that a filesystem is
      CRC enabled. Add a flag to the XFS_IOC_FSGEOMETRY ioctl output to
      indicate that the filesystem has v5 superblock support enabled.
      This will allow xfs_info to correctly report the state of the
      filesystem.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      
      (cherry picked from commit 74137fff)
      e7927e87
    • D
      xfs: fix incorrect remote symlink block count · 1de09d1a
      Dave Chinner 提交于
      When CRCs are enabled, the number of blocks needed to hold a remote
      symlink on a 1k block size filesystem may be 2 instead of 1. The
      transaction reservation for the allocated blocks was not taking this
      into account and only allocating one block. Hence when trying to
      read or invalidate such symlinks, we are mapping a hole where there
      should be a block and things go bad at that point.
      
      Fix the reservation to use the correct block count, clean up the
      block count calculation similar to the remote attribute calculation,
      and add a debug guard to detect when we don't write the entire
      symlink to disk.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      
      (cherry picked from commit 321a9583)
      1de09d1a
    • D
      xfs: fix split buffer vector log recovery support · 7d2ffe80
      Dave Chinner 提交于
      A long time ago in a galaxy far away....
      
      .. the was a commit made to fix some ilinux specific "fragmented
      buffer" log recovery problem:
      
      http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=commitdiff;h=b29c0bece51da72fb3ff3b61391a391ea54e1603
      
      That problem occurred when a contiguous dirty region of a buffer was
      split across across two pages of an unmapped buffer. It's been a
      long time since that has been done in XFS, and the changes to log
      the entire inode buffers for CRC enabled filesystems has
      re-introduced that corner case.
      
      And, of course, it turns out that the above commit didn't actually
      fix anything - it just ensured that log recovery is guaranteed to
      fail when this situation occurs. And now for the gory details.
      
      xfstest xfs/085 is failing with this assert:
      
      XFS (vdb): bad number of regions (0) in inode log format
      XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 1583
      
      Largely undocumented factoid #1: Log recovery depends on all log
      buffer format items starting with this format:
      
      struct foo_log_format {
      	__uint16_t	type;
      	__uint16_t	size;
      	....
      
      As recoery uses the size field and assumptions about 32 bit
      alignment in decoding format items.  So don't pay much attention to
      the fact log recovery thinks that it decoding an inode log format
      item - it just uses them to determine what the size of the item is.
      
      But why would it see a log format item with a zero size? Well,
      luckily enough xfs_logprint uses the same code and gives the same
      error, so with a bit of gdb magic, it turns out that it isn't a log
      format that is being decoded. What logprint tells us is this:
      
      Oper (130): tid: a0375e1a  len: 28  clientid: TRANS  flags: none
      BUF:  #regs: 2   start blkno: 144 (0x90)  len: 16  bmap size: 2  flags: 0x4000
      Oper (131): tid: a0375e1a  len: 4096  clientid: TRANS  flags: none
      BUF DATA
      ----------------------------------------------------------------------------
      Oper (132): tid: a0375e1a  len: 4096  clientid: TRANS  flags: none
      xfs_logprint: unknown log operation type (4e49)
      **********************************************************************
      * ERROR: data block=2                                                 *
      **********************************************************************
      
      That we've got a buffer format item (oper 130) that has two regions;
      the format item itself and one dirty region. The subsequent region
      after the buffer format item and it's data is them what we are
      tripping over, and the first bytes of it at an inode magic number.
      Not a log opheader like there is supposed to be.
      
      That means there's a problem with the buffer format item. It's dirty
      data region is 4096 bytes, and it contains - you guessed it -
      initialised inodes. But inode buffers are 8k, not 4k, and we log
      them in their entirety. So something is wrong here. The buffer
      format item contains:
      
      (gdb) p /x *(struct xfs_buf_log_format *)in_f
      $22 = {blf_type = 0x123c, blf_size = 0x2, blf_flags = 0x4000,
             blf_len = 0x10, blf_blkno = 0x90, blf_map_size = 0x2,
             blf_data_map = {0xffffffff, 0xffffffff, .... }}
      
      Two regions, and a signle dirty contiguous region of 64 bits.  64 *
      128 = 8k, so this should be followed by a single 8k region of data.
      And the blf_flags tell us that the type of buffer is a
      XFS_BLFT_DINO_BUF. It contains inodes. And because it doesn't have
      the XFS_BLF_INODE_BUF flag set, that means it's an inode allocation
      buffer. So, it should be followed by 8k of inode data.
      
      But we know that the next region has a header of:
      
      (gdb) p /x *ohead
      $25 = {oh_tid = 0x1a5e37a0, oh_len = 0x100000, oh_clientid = 0x69,
             oh_flags = 0x0, oh_res2 = 0x0}
      
      and so be32_to_cpu(oh_len) = 0x1000 = 4096 bytes. It's simply not
      long enough to hold all the logged data. There must be another
      region. There is - there's a following opheader for another 4k of
      data that contains the other half of the inode cluster data - the
      one we assert fail on because it's not a log format header.
      
      So why is the second part of the data not being accounted to the
      correct buffer log format structure? It took a little more work with
      gdb to work out that the buffer log format structure was both
      expecting it to be there but hadn't accounted for it. It was at that
      point I went to the kernel code, as clearly this wasn't a bug in
      xfs_logprint and the kernel was writing bad stuff to the log.
      
      First port of call was the buffer item formatting code, and the
      discontiguous memory/contiguous dirty region handling code
      immediately stood out. I've wondered for a long time why the code
      had this comment in it:
      
                              vecp->i_addr = xfs_buf_offset(bp, buffer_offset);
                              vecp->i_len = nbits * XFS_BLF_CHUNK;
                              vecp->i_type = XLOG_REG_TYPE_BCHUNK;
      /*
       * You would think we need to bump the nvecs here too, but we do not
       * this number is used by recovery, and it gets confused by the boundary
       * split here
       *                      nvecs++;
       */
                              vecp++;
      
      And it didn't account for the extra vector pointer. The case being
      handled here is that a contiguous dirty region lies across a
      boundary that cannot be memcpy()d across, and so has to be split
      into two separate operations for xlog_write() to perform.
      
      What this code assumes is that what is written to the log is two
      consecutive blocks of data that are accounted in the buf log format
      item as the same contiguous dirty region and so will get decoded as
      such by the log recovery code.
      
      The thing is, xlog_write() knows nothing about this, and so just
      does it's normal thing of adding an opheader for each vector. That
      means the 8k region gets written to the log as two separate regions
      of 4k each, but because nvecs has not been incremented, the buf log
      format item accounts for only one of them.
      
      Hence when we come to log recovery, we process the first 4k region
      and then expect to come across a new item that starts with a log
      format structure of some kind that tells us whenteh next data is
      going to be. Instead, we hit raw buffer data and things go bad real
      quick.
      
      So, the commit from 2002 that commented out nvecs++ is just plain
      wrong. It breaks log recovery completely, and it would seem the only
      reason this hasn't been since then is that we don't log large
      contigous regions of multi-page unmapped buffers very often. Never
      would be a closer estimate, at least until the CRC code came along....
      
      So, lets fix that by restoring the nvecs accounting for the extra
      region when we hit this case.....
      
      .... and there's the problemin log recovery it is apparently working
      around:
      
      XFS: Assertion failed: i == item->ri_total, file: fs/xfs/xfs_log_recover.c, line: 2135
      
      Yup, xlog_recover_do_reg_buffer() doesn't handle contigous dirty
      regions being broken up into multiple regions by the log formatting
      code. That's an easy fix, though - if the number of contiguous dirty
      bits exceeds the length of the region being copied out of the log,
      only account for the number of dirty bits that region covers, and
      then loop again and copy more from the next region. It's a 2 line
      fix.
      
      Now xfstests xfs/085 passes, we have one less piece of mystery
      code, and one more important piece of knowledge about how to
      structure new log format items..
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      
      (cherry picked from commit 709da6a6)
      7d2ffe80
    • D
      xfs: kill suid/sgid through the truncate path. · 2962f5a5
      Dave Chinner 提交于
      XFS has failed to kill suid/sgid bits correctly when truncating
      files of non-zero size since commit c4ed4243 ("xfs: split
      xfs_setattr") introduced in the 3.1 kernel. Fix it.
      
      Fix it.
      
      cc: stable kernel <stable@vger.kernel.org>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      
      (cherry picked from commit 56c19e89)
      2962f5a5
    • D
      xfs: avoid nesting transactions in xfs_qm_scall_setqlim() · 08fb3905
      Dave Chinner 提交于
      Lockdep reports:
      
      =============================================
      [ INFO: possible recursive locking detected ]
      3.9.0+ #3 Not tainted
      ---------------------------------------------
      setquota/28368 is trying to acquire lock:
       (sb_internal){++++.?}, at: [<c11e8846>] xfs_trans_alloc+0x26/0x50
      
      but task is already holding lock:
       (sb_internal){++++.?}, at: [<c11e8846>] xfs_trans_alloc+0x26/0x50
      
      from xfs_qm_scall_setqlim()->xfs_dqread() when a dquot needs to be
      allocated.
      
      xfs_qm_scall_setqlim() is starting a transaction and then not
      passing it into xfs_qm_dqet() and so it starts it's own transaction
      when allocating the dquot.  Splat!
      
      Fix this by not allocating the dquot in xfs_qm_scall_setqlim()
      inside the setqlim transaction. This requires getting the dquot
      first (and allocating it if necessary) then dropping and relocking
      the dquot before joining it to the setqlim transaction.
      Reported-by: NMichael L. Semon <mlsemon35@gmail.com>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      (cherry picked from commit f648167f)
      08fb3905
  2. 25 5月, 2013 7 次提交
    • D
      xfs: remote attribute lookups require the value length · 7ae07780
      Dave Chinner 提交于
      When reading a remote attribute, to correctly calculate the length
      of the data buffer for CRC enable filesystems, we need to know the
      length of the attribute data. We get this information when we look
      up the attribute, but we don't store it in the args structure along
      with the other remote attr information we get from the lookup. Add
      this information to the args structure so we can use it
      appropriately.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      
      (cherry picked from commit e461fcb1)
      7ae07780
    • D
      xfs: xfs_attr_shortform_allfit() does not handle attr3 format. · cf257abf
      Dave Chinner 提交于
      xfstests generic/117 fails with:
      
      XFS: Assertion failed: leaf->hdr.info.magic == cpu_to_be16(XFS_ATTR_LEAF_MAGIC)
      
      indicating a function that does not handle the attr3 format
      correctly. Fix it.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      (cherry picked from commit b38958d7)
      cf257abf
    • D
      xfs: xfs_da3_node_read_verify() doesn't handle XFS_ATTR3_LEAF_MAGIC · 7ced60ca
      Dave Chinner 提交于
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      
      (cherry picked from commit 72916fb8)
      7ced60ca
    • D
      xfs: fix missing KM_NOFS tags to keep lockdep happy · b17cb364
      Dave Chinner 提交于
      There are several places where we use KM_SLEEP allocation contexts
      and use the fact that they are called from transaction context to
      add KM_NOFS where appropriate. Unfortunately, there are several
      places where the code makes this assumption but can be called from
      outside transaction context but with filesystem locks held. These
      places need explicit KM_NOFS annotations to avoid lockdep
      complaining about reclaim contexts.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      
      (cherry picked from commit ac14876c)
      b17cb364
    • D
      xfs: Don't reference the EFI after it is freed · 509e708a
      Dave Chinner 提交于
      Checking the EFI for whether it is being released from recovery
      after we've already released the known active reference is a mistake
      worthy of a brown paper bag. Fix the (now) obvious use after free
      that it can cause.
      Reported-by: NDave Jones <davej@redhat.com>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      
      (cherry picked from commit 52c24ad3)
      509e708a
    • D
      xfs: fix rounding in xfs_free_file_space · 7031d0e1
      Dave Chinner 提交于
      The offset passed into xfs_free_file_space() needs to be rounded
      down to a certain size, but the rounding mask is built by a 32 bit
      variable. Hence the mask will always mask off the upper 32 bits of
      the offset and lead to incorrect writeback and invalidation ranges.
      
      This is not actually exposed as a bug because we writeback and
      invalidate from the rounded offset to the end of the file, and hence
      the offset we are actually punching a hole out of will always be
      covered by the code. This needs fixing, however, if we ever want to
      use exact ranges for writeback/invalidation here...
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      
      (cherry picked from commit 28ca489c)
      7031d0e1
    • D
      xfs: fix sub-page blocksize data integrity writes · 480d7467
      Dave Chinner 提交于
      FSX on 512 byte block size filesystems has been failing for some
      time with corrupted data. The fault dates back to the change in
      the writeback data integrity algorithm that uses a mark-and-sweep
      approach to avoid data writeback livelocks.
      
      Unfortunately, a side effect of this mark-and-sweep approach is that
      each page will only be written once for a data integrity sync, and
      there is a condition in writeback in XFS where a page may require
      two writeback attempts to be fully written. As a result of the high
      level change, we now only get a partial page writeback during the
      integrity sync because the first pass through writeback clears the
      mark left on the page index to tell writeback that the page needs
      writeback....
      
      The cause is writing a partial page in the clustering code. This can
      happen when a mapping boundary falls in the middle of a page - we
      end up writing back the first part of the page that the mapping
      covers, but then never revisit the page to have the remainder mapped
      and written.
      
      The fix is simple - if the mapping boundary falls inside a page,
      then simple abort clustering without touching the page. This means
      that the next ->writepage entry that write_cache_pages() will make
      is the page we aborted on, and xfs_vm_writepage() will map all
      sections of the page correctly. This behaviour is also optimal for
      non-data integrity writes, as it results in contiguous sequential
      writeback of the file rather than missing small holes and having to
      write them a "random" writes in a future pass.
      
      With this fix, all the fsx tests in xfstests now pass on a 512 byte
      block size filesystem on a 4k page machine.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      
      (cherry picked from commit 49b137cb)
      480d7467
  3. 10 5月, 2013 4 次提交
    • T
      eCryptfs: Use the ablkcipher crypto API · 4dfea4f0
      Tyler Hicks 提交于
      Make the switch from the blkcipher kernel crypto interface to the
      ablkcipher interface.
      
      encrypt_scatterlist() and decrypt_scatterlist() now use the ablkcipher
      interface but, from the eCryptfs standpoint, still treat the crypto
      operation as a synchronous operation. They submit the async request and
      then wait until the operation is finished before they return. Most of
      the changes are contained inside those two functions.
      
      Despite waiting for the completion of the crypto operation, the
      ablkcipher interface provides performance increases in most cases when
      used on AES-NI capable hardware.
      Signed-off-by: NTyler Hicks <tyhicks@canonical.com>
      Acked-by: NColin King <colin.king@canonical.com>
      Reviewed-by: NZeev Zilberman <zeev@annapurnaLabs.com>
      Cc: Dustin Kirkland <dustin.kirkland@gazzang.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Ying Huang <ying.huang@intel.com>
      Cc: Thieu Le <thieule@google.com>
      Cc: Li Wang <dragonylffly@163.com>
      Cc: Jarkko Sakkinen <jarkko.sakkinen@iki.fi>
      4dfea4f0
    • A
      91c2e0bc
    • A
      ecryptfs: don't open-code kernel_read() · 39dfe6c6
      Al Viro 提交于
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      39dfe6c6
    • J
      nfsd: fix oops when legacy_recdir_name_error is passed a -ENOENT error · 7255e716
      Jeff Layton 提交于
      Toralf reported the following oops to the linux-nfs mailing list:
      
          -----------------[snip]------------------
          NFSD: unable to generate recoverydir name (-2).
          NFSD: disabling legacy clientid tracking. Reboot recovery will not function correctly!
          BUG: unable to handle kernel NULL pointer dereference at 000003c8
          IP: [<f90a3d91>] nfsd4_client_tracking_exit+0x11/0x50 [nfsd]
          *pdpt = 000000002ba33001 *pde = 0000000000000000
          Oops: 0000 [#1] SMP
          Modules linked in: loop nfsd auth_rpcgss ipt_MASQUERADE xt_owner xt_multiport ipt_REJECT xt_tcpudp xt_recent xt_conntrack nf_conntrack_ftp xt_limit xt_LOG iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_filter ip_tables x_tables af_packet pppoe pppox ppp_generic slhc bridge stp llc tun arc4 iwldvm mac80211 coretemp kvm_intel uvcvideo sdhci_pci sdhci mmc_core videobuf2_vmalloc videobuf2_memops usblp videobuf2_core i915 iwlwifi psmouse videodev cfg80211 kvm fbcon bitblit cfbfillrect acpi_cpufreq mperf evdev softcursor font cfbimgblt i2c_algo_bit cfbcopyarea intel_agp intel_gtt drm_kms_helper snd_hda_codec_conexant drm agpgart fb fbdev tpm_tis thinkpad_acpi tpm nvram e1000e rfkill thermal ptp wmi pps_core tpm_bios 8250_pci processor 8250 ac snd_hda_intel snd_hda_codec snd_pcm battery video i2c_i801 snd_page_alloc snd_timer button serial_core i2c_core snd soundcore thermal_sys hwmon aesni_intel ablk_helper cryp
      td lrw aes_i586 xts gf128mul cbc fuse nfs lockd sunrpc dm_crypt dm_mod hid_monterey hid_microsoft hid_logitech hid_ezkey hid_cypress hid_chicony hid_cherry hid_belkin hid_apple hid_a4tech hid_generic usbhid hid sr_mod cdrom sg [last unloaded: microcode]
          Pid: 6374, comm: nfsd Not tainted 3.9.1 #6 LENOVO 4180F65/4180F65
          EIP: 0060:[<f90a3d91>] EFLAGS: 00010202 CPU: 0
          EIP is at nfsd4_client_tracking_exit+0x11/0x50 [nfsd]
          EAX: 00000000 EBX: fffffffe ECX: 00000007 EDX: 00000007
          ESI: eb9dcb00 EDI: eb2991c0 EBP: eb2bde38 ESP: eb2bde34
          DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
          CR0: 80050033 CR2: 000003c8 CR3: 2ba80000 CR4: 000407f0
          DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
          DR6: ffff0ff0 DR7: 00000400
          Process nfsd (pid: 6374, ti=eb2bc000 task=eb2711c0 task.ti=eb2bc000)
          Stack:
          fffffffe eb2bde4c f90a3e0c f90a7754 fffffffe eb0a9c00 eb2bdea0 f90a41ed
          eb2991c0 1b270000 eb2991c0 eb2bde7c f9099ce9 eb2bde98 0129a020 eb29a020
          eb2bdecc eb2991c0 eb2bdea8 f9099da5 00000000 eb9dcb00 00000001 67822f08
          Call Trace:
          [<f90a3e0c>] legacy_recdir_name_error+0x3c/0x40 [nfsd]
          [<f90a41ed>] nfsd4_create_clid_dir+0x15d/0x1c0 [nfsd]
          [<f9099ce9>] ? nfsd4_lookup_stateid+0x99/0xd0 [nfsd]
          [<f9099da5>] ? nfs4_preprocess_seqid_op+0x85/0x100 [nfsd]
          [<f90a4287>] nfsd4_client_record_create+0x37/0x50 [nfsd]
          [<f909d6ce>] nfsd4_open_confirm+0xfe/0x130 [nfsd]
          [<f90980b1>] ? nfsd4_encode_operation+0x61/0x90 [nfsd]
          [<f909d5d0>] ? nfsd4_free_stateid+0xc0/0xc0 [nfsd]
          [<f908fd0b>] nfsd4_proc_compound+0x41b/0x530 [nfsd]
          [<f9081b7b>] nfsd_dispatch+0x8b/0x1a0 [nfsd]
          [<f857b85d>] svc_process+0x3dd/0x640 [sunrpc]
          [<f908165d>] nfsd+0xad/0x110 [nfsd]
          [<f90815b0>] ? nfsd_destroy+0x70/0x70 [nfsd]
          [<c1054824>] kthread+0x94/0xa0
          [<c1486937>] ret_from_kernel_thread+0x1b/0x28
          [<c1054790>] ? flush_kthread_work+0xd0/0xd0
          Code: 86 b0 00 00 00 90 c5 0a f9 c7 04 24 70 76 0a f9 e8 74 a9 3d c8 eb ba 8d 76 00 55 89 e5 53 66 66 66 66 90 8b 15 68 c7 0a f9 85 d2 <8b> 88 c8 03 00 00 74 2c 3b 11 77 28 8b 5c 91 08 85 db 74 22 8b
          EIP: [<f90a3d91>] nfsd4_client_tracking_exit+0x11/0x50 [nfsd] SS:ESP 0068:eb2bde34
          CR2: 00000000000003c8
          ---[ end trace 09e54015d145c9c6 ]---
      
      The problem appears to be a regression that was introduced in commit
      9a9c6478 "nfsd: make NFSv4 recovery client tracking options per net".
      Prior to that commit, it was safe to pass a NULL net pointer to
      nfsd4_client_tracking_exit in the legacy recdir case, and
      legacy_recdir_name_error did so. After that comit, the net pointer must
      be valid.
      
      This patch just fixes legacy_recdir_name_error to pass in a valid net
      pointer to that function.
      
      Cc: <stable@vger.kernel.org> # v3.8+
      Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
      Reported-and-tested-by: NToralf Förster <toralf.foerster@gmx.de>
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      7255e716
  4. 09 5月, 2013 2 次提交
  5. 08 5月, 2013 14 次提交
    • J
      f2fs: cover free_nid management with spin_lock · 59bbd474
      Jaegeuk Kim 提交于
      After build_free_nids() searches free nid candidates from nat pages and
      current journal blocks, it checks all the candidates if they are allocated
      so that the nat cache has its nid with an allocated block address.
      
      In this procedure, previously we used
          list_for_each_entry_safe(fnid, next_fnid, &nm_i->free_nid_list, list).
      But, this is not covered by free_nid_list_lock, resulting in null pointer bug.
      
      This patch moves this checking routine inside add_free_nid() in order not to use
      the spin_lock.
      Signed-off-by: NJaegeuk Kim <jaegeuk.kim@samsung.com>
      59bbd474
    • H
      f2fs: optimize scan_nat_page() · 23d38844
      Haicheng Li 提交于
      When nm_i->fcnt > 2 * MAX_FREE_NIDS, stop scanning other NAT entries.
      Signed-off-by: NHaicheng Li <haicheng.li@linux.intel.com>
      [Jaegeuk Kim: fix handling the return value of add_free_nid()]
      Signed-off-by: NJaegeuk Kim <jaegeuk.kim@samsung.com>
      23d38844
    • H
      f2fs: code cleanup for scan_nat_page() and build_free_nids() · 8760952d
      Haicheng Li 提交于
      This patch does two cleanups:
      1. remove unused variable "fcnt" in build_free_nids().
      2. make scan_nat_page() as void type and remove useless variable "fcnt".
      Signed-off-by: NHaicheng Li <haicheng.li@linux.intel.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk.kim@samsung.com>
      8760952d
    • H
      f2fs: bugfix for alloc_nid_failed() · 95630cba
      Haicheng Li 提交于
      Directly drop the free_nid cache when nm_i->fcnt > 2 * MAX_FREE_NIDS
      
      Since there is NOT nmi->free_nid_list_lock spinlock protection between
      a sequential calling of alloc_nid() and alloc_nid_failed(), some other
      threads may already add new free_nid to the free_nid_list during this
      period.
      
      We need to make sure nmi->fcnt is never > 2 * MAX_FREE_NIDS.
      Signed-off-by: NHaicheng Li <haicheng.li@linux.intel.com>
      [Jaegeuk Kim: fit the coding style]
      Signed-off-by: NJaegeuk Kim <jaegeuk.kim@samsung.com>
      95630cba
    • C
      f2fs: recover when journal contains deleted files · 047184b4
      Chris Fries 提交于
      When recovering a journal file with fsync data for files that have
      been deleted, don't bail out on recovery.
      Signed-off-by: NChris Fries <C.Fries@motorola.com>
      Reviewed-by: NRussell Knize <rknize2@motorola.com>
      Reviewed-by: NJason Hrycay <jason.hrycay@motorola.com>
      [Jaegeuk Kim: fit the coding style]
      Signed-off-by: NJaegeuk Kim <jaegeuk.kim@samsung.com>
      047184b4
    • C
      f2fs: continue to mount after failing recovery · bde582b2
      Chris Fries 提交于
      When unable to roll forward the journal, we shouldn't bail out and
      not mount, we should continue to attempt the mount.  Bad recovery data
      is likely unrecoverable at this point, and requiring the user to try
      to mount again doesn't solve any issues.
      Signed-off-by: NChris Fries <C.Fries@motorola.com>
      Reviewed-by: NRussell Knize <rknize2@motorola.com>
      Reviewed-by: NJason Hrycay <jason.hrycay@motorola.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk.kim@samsung.com>
      bde582b2
    • J
      f2fs: avoid deadlock during evict after f2fs_gc · 531ad7d5
      Jaegeuk Kim 提交于
      o Deadlock case #1
      
      Thread 1:
      - writeback_sb_inodes
       - do_writepages
        - f2fs_write_data_pages
         - write_cache_pages
          - f2fs_write_data_page
           - f2fs_balance_fs
            - wait mutex_lock(gc_mutex)
      
      Thread 2:
      - f2fs_balance_fs
       - mutex_lock(gc_mutex)
       - f2fs_gc
        - f2fs_iget
         - wait iget_locked(inode->i_lock)
      
      Thread 3:
      - do_unlinkat
       - iput
        - lock(inode->i_lock)
         - evict
          - inode_wait_for_writeback
      
      o Deadlock case #2
      
      Thread 1:
      - __writeback_single_inode
       : set I_SYNC
        - do_writepages
         - f2fs_write_data_page
          - f2fs_balance_fs
           - f2fs_gc
            - iput
             - evict
              - inode_wait_for_writeback(I_SYNC)
      
      In order to avoid this, even though iput is called with the zero-reference
      count, we need to stop the eviction procedure if the inode is on writeback.
      So this patch links f2fs_drop_inode which checks the I_SYNC flag.
      Signed-off-by: NJaegeuk Kim <jaegeuk.kim@samsung.com>
      531ad7d5
    • K
      aio: don't include aio.h in sched.h · a27bb332
      Kent Overstreet 提交于
      Faster kernel compiles by way of fewer unnecessary includes.
      
      [akpm@linux-foundation.org: fix fallout]
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: NKent Overstreet <koverstreet@google.com>
      Cc: Zach Brown <zab@redhat.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Reviewed-by: N"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a27bb332
    • K
      aio: kill ki_retry · 41ef4eb8
      Kent Overstreet 提交于
      Thanks to Zach Brown's work to rip out the retry infrastructure, we don't
      need this anymore - ki_retry was only called right after the kiocb was
      initialized.
      
      This also refactors and trims some duplicated code, as well as cleaning up
      the refcounting/error handling a bit.
      
      [akpm@linux-foundation.org: use fmode_t in aio_run_iocb()]
      [akpm@linux-foundation.org: fix file_start_write/file_end_write tests]
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NKent Overstreet <koverstreet@google.com>
      Cc: Zach Brown <zab@redhat.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Reviewed-by: N"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      41ef4eb8
    • K
      aio: kill ki_key · 8a660890
      Kent Overstreet 提交于
      ki_key wasn't actually used for anything previously - it was always 0.
      Drop it to trim struct kiocb a bit.
      Signed-off-by: NKent Overstreet <koverstreet@google.com>
      Cc: Zach Brown <zab@redhat.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Reviewed-by: N"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8a660890
    • J
      audit: vfs: fix audit_inode call in O_CREAT case of do_last · 33e2208a
      Jeff Layton 提交于
      Jiri reported a regression in auditing of open(..., O_CREAT) syscalls.
      In older kernels, creating a file with open(..., O_CREAT) created
      audit_name records that looked like this:
      
      type=PATH msg=audit(1360255720.628:64): item=1 name="/abc/foo" inode=138810 dev=fd:00 mode=0100640 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
      type=PATH msg=audit(1360255720.628:64): item=0 name="/abc/" inode=138635 dev=fd:00 mode=040750 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
      
      ...in recent kernels though, they look like this:
      
      type=PATH msg=audit(1360255402.886:12574): item=2 name=(null) inode=264599 dev=fd:00 mode=0100640 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
      type=PATH msg=audit(1360255402.886:12574): item=1 name=(null) inode=264598 dev=fd:00 mode=040750 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
      type=PATH msg=audit(1360255402.886:12574): item=0 name="/abc/foo" inode=264598 dev=fd:00 mode=040750 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
      
      Richard bisected to determine that the problems started with commit
      bfcec708, but the log messages have changed with some later
      audit-related patches.
      
      The problem is that this audit_inode call is passing in the parent of
      the dentry being opened, but audit_inode is being called with the parent
      flag false. This causes later audit_inode and audit_inode_child calls to
      match the wrong entry in the audit_names list.
      
      This patch simply sets the flag to properly indicate that this inode
      represents the parent. With this, the audit_names entries are back to
      looking like they did before.
      
      Cc: <stable@vger.kernel.org> # v3.7+
      Reported-by: NJiri Jaburek <jjaburek@redhat.com>
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      Test By: Richard Guy Briggs <rbriggs@redhat.com>
      Signed-off-by: NEric Paris <eparis@redhat.com>
      33e2208a
    • K
      aio: give shared kioctx fields their own cachelines · 4e23bcae
      Kent Overstreet 提交于
      [akpm@linux-foundation.org: make reqs_active __cacheline_aligned_in_smp]
      Signed-off-by: NKent Overstreet <koverstreet@google.com>
      Cc: Zach Brown <zab@redhat.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Reviewed-by: N"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4e23bcae
    • K
      aio: kill struct aio_ring_info · 58c85dc2
      Kent Overstreet 提交于
      struct aio_ring_info was kind of odd, the only place it's used is where
      it's embedded in struct kioctx - there's no real need for it.
      
      The next patch rearranges struct kioctx and puts various things on their
      own cachelines - getting rid of struct aio_ring_info now makes that
      reordering a bit clearer.
      Signed-off-by: NKent Overstreet <koverstreet@google.com>
      Cc: Zach Brown <zab@redhat.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Reviewed-by: N"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      58c85dc2
    • K
      aio: kill batch allocation · a1c8eae7
      Kent Overstreet 提交于
      Previously, allocating a kiocb required touching quite a few global
      (well, per kioctx) cachelines...  so batching up allocation to amortize
      those was worthwhile.  But we've gotten rid of some of those, and in
      another couple of patches kiocb allocation won't require writing to any
      shared cachelines, so that means we can just rip this code out.
      Signed-off-by: NKent Overstreet <koverstreet@google.com>
      Cc: Zach Brown <zab@redhat.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Reviewed-by: N"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a1c8eae7