1. 13 Aug, 2013 5 commits
  2. 31 Jul, 2013 1 commit
  3. 25 Jul, 2013 1 commit
    • D
      xfs: di_flushiter considered harmful · e60896d8
      Dave Chinner committed
      When we made all inode updates transactional, we no longer needed
      the log recovery detection for inodes being newer on disk than the
      transaction being replayed - it was redundant as replay of the log
      would always result in the latest version of the inode being on
      disk. It was redundant, but left in place because it wasn't
      considered to be a problem.
      
      However, with the new "don't read inodes on create" optimisation,
      flushiter has come back to bite us. Essentially, the optimisation
      always initialises flushiter to zero in the create transaction,
      and so if we then crash and run recovery and the inode already on
      disk has a non-zero flushiter it will skip recovery of that inode.
      As a result, log recovery does the wrong thing and we end up with a
      corrupt filesystem.
      
      Because we have to support old kernel to new kernel upgrades, we
      can't just get rid of the flushiter support in log recovery as we
      might be upgrading from a kernel that doesn't have fully transactional
      inode updates.  Unfortunately, for v4 superblocks there is no way to
      guarantee that log recovery knows about this fact.
      
      We cannot add a new inode format flag to say it's a "special inode
      create" because it won't be understood by older kernels and so
      recovery could do the wrong thing on downgrade. We cannot specially
      detect the combination of zero mode/non-zero flushiter on disk to
      non-zero mode, zero flushiter in the log item during recovery
      because wrapping of the flushiter can result in false detection.
      
      Hence that makes this "don't use flushiter" optimisation limited to
      a disk format that guarantees that we don't need it. And that means
      the only fix here is to limit the "no read IO on create"
      optimisation to version 5 superblocks....
      Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      e60896d8
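The flushiter failure described in the commit above can be illustrated with a small model. This is only a sketch: `skips_replay` and the constant `DI_MAX_FLUSH` value are illustrative assumptions about how a 16-bit wrapping counter comparison could behave, not the kernel's recovery code.

```python
DI_MAX_FLUSH = 0xFFFF  # assumed maximum of a 16-bit flush counter


def skips_replay(log_flushiter: int, disk_flushiter: int) -> bool:
    """Return True if recovery would treat the on-disk inode as newer.

    Simplified model: a logged inode is skipped when its flushiter is
    lower than the one on disk, except in the counter wrap case.
    """
    if log_flushiter < disk_flushiter:
        # Wrap case: disk counter is at the maximum and the logged
        # counter is small, so the logged inode is actually newer.
        if disk_flushiter == DI_MAX_FLUSH and log_flushiter < (DI_MAX_FLUSH >> 1):
            return False
        return True
    return False
```

In this model a create transaction that logs flushiter 0 over a stale on-disk inode with flushiter 5 gives `skips_replay(0, 5) == True`, which is the skipped recovery the commit message describes.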
  4. 23 Jul, 2013 4 commits
    • C
      xfs: Start using pquotaino from the superblock. · d892d586
      Chandra Seetharaman committed
      Start using pquotino and define a macro to check if the
      superblock has pquotino.
      
      Keep backward compatibility by allowing mount of older superblock
      with no separate pquota inode.
      Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      d892d586
    • C
      xfs: Initialize all quota inodes to be NULLFSINO · 01026297
      Chandra Seetharaman committed
      mkfs doesn't initialize the quota inodes to NULLFSINO as it does for the
      other internal inodes. This leads to two in-core values (0 and NULLFSINO)
      that must be checked to determine whether a quota inode is valid.
      
      Solve that problem by initializing the in-core values of all quotaino
      values to NULLFSINO if they are 0 on disk.
      
      Note that these values are not written back to the on-disk superblock unless
      some quota is enabled on the filesystem. Even in that case sb_pquotino is
      written to disk only if the on-disk superblock supports pquotino.
      Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      01026297
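The normalisation described in the commit above can be sketched as follows. The helper names are hypothetical, and `NULLFSINO` is modelled here as an all-ones 64-bit value, which is an assumption made for illustration.

```python
NULLFSINO = (1 << 64) - 1  # assumed model of an all-ones 64-bit inode number


def incore_quota_ino(ondisk_ino: int) -> int:
    """Map the on-disk value 0 to NULLFSINO so only one 'unset' value exists in core."""
    return NULLFSINO if ondisk_ino == 0 else ondisk_ino


def quota_ino_valid(incore_ino: int) -> bool:
    """After normalisation, a single comparison decides validity."""
    return incore_ino != NULLFSINO
```

With this mapping, callers no longer need to test for both 0 and NULLFSINO.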
    • C
      xfs: Fix a deadlock in xfs_log_commit_cil() code path · 297aa637
      Chandra Seetharaman committed
      While testing and rearranging pquota/gquota code, I stumbled
      on an xfs_shutdown() during a mount. But the mount just hung.
      
      Debugged and found that there is a deadlock involving
      &log->l_cilp->xc_ctx_lock.
      
      It is in a code path where &log->l_cilp->xc_ctx_lock is first
      acquired in read mode and some levels down the same semaphore
      is being acquired in write mode causing a deadlock.
      
      This is the stack:
      xfs_log_commit_cil -> acquires &log->l_cilp->xc_ctx_lock in read mode
        xlog_print_tic_res
          xfs_force_shutdown
            xfs_log_force_umount
              xlog_cil_force
                xlog_cil_force_lsn
                  xlog_cil_push_foreground
                    xlog_cil_push - tries to acquire same semaphore in write mode
      
      This patch fixes the deadlock by changing the reason code for
      xfs_force_shutdown in xlog_print_tic_res() to SHUTDOWN_LOG_IO_ERROR.
      
      SHUTDOWN_LOG_IO_ERROR is the right reason code to be set since
      we are in the log path.
      
      Thanks to Dave for suggesting this solution.
      Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      297aa637
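The read-then-write deadlock in the stack above can be modelled with a toy reader/writer semaphore. Everything here is a hypothetical sketch: `ToyRWSem`, `force_shutdown`, `commit_cil_overrun` and the `SHUTDOWN_OTHER` label are invented for illustration, and the assumption that a log-error shutdown does not push the CIL is inferred from the commit message rather than taken from the kernel source.

```python
SHUTDOWN_OTHER = "other"                 # hypothetical stand-in for any non-log reason
SHUTDOWN_LOG_IO_ERROR = "log_io_error"


class ToyRWSem:
    """Non-blocking model of a reader/writer semaphore."""

    def __init__(self):
        self.readers = 0
        self.writer = False

    def down_read(self):
        self.readers += 1

    def up_read(self):
        self.readers -= 1

    def try_down_write(self) -> bool:
        # A writer needs exclusive access: it cannot proceed while any
        # reader (including the caller itself) still holds the lock.
        if self.readers or self.writer:
            return False
        self.writer = True
        return True

    def up_write(self):
        self.writer = False


def force_shutdown(ctx_lock: ToyRWSem, reason: str) -> bool:
    """Return True if shutdown completes, False if it would block forever."""
    if reason == SHUTDOWN_LOG_IO_ERROR:
        return True  # assumed: log-error shutdowns skip the CIL push
    if not ctx_lock.try_down_write():  # the CIL push wants write mode
        return False
    ctx_lock.up_write()
    return True


def commit_cil_overrun(reason: str) -> bool:
    """Model the commit path: hold the lock in read mode, then shut down."""
    lock = ToyRWSem()
    lock.down_read()
    try:
        return force_shutdown(lock, reason)
    finally:
        lock.up_read()
```

In the model, `commit_cil_overrun(SHUTDOWN_OTHER)` reports that the shutdown would block, while the log-error reason code completes.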
    • J
      xfs: fix assertion failure in xfs_vm_write_failed() · 58e59854
      Jie Liu committed
      In xfs_vm_write_failed(), we evaluate the block_offset of pos with
      PAGE_MASK which is an unsigned long.  That is fine on 64-bit platforms
      regardless of whether the request pos is 32-bit or 64-bit.  However, on
      32-bit platforms the value is 0xfffff000 and so the high 32 bits in it
      will be masked off with (pos & PAGE_MASK) for a 64-bit pos.
      
      As a result, the evaluated block_offset is incorrect which will cause
      this failure ASSERT(block_offset + from == pos); and potentially pass
      the wrong block to xfs_vm_kill_delalloc_range().
      
      In this case, we can get a kernel panic if CONFIG_XFS_DEBUG is enabled:
      
      XFS: Assertion failed: block_offset + from == pos, file: fs/xfs/xfs_aops.c, line: 1504
      
      ------------[ cut here ]------------
       kernel BUG at fs/xfs/xfs_message.c:100!
       invalid opcode: 0000 [#1] SMP
       ........
       Pid: 4057, comm: mkfs.xfs Tainted: G           O 3.9.0-rc2 #1
       EIP: 0060:[<f94a7e8b>] EFLAGS: 00010282 CPU: 0
       EIP is at assfail+0x2b/0x30 [xfs]
       EAX: 00000056 EBX: f6ef28a0 ECX: 00000007 EDX: f57d22a4
       ESI: 1c2fb000 EDI: 00000000 EBP: ea6b5d30 ESP: ea6b5d1c
       DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
       CR0: 8005003b CR2: 094f3ff4 CR3: 2bcb4000 CR4: 000006f0
       DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
       DR6: ffff0ff0 DR7: 00000400
       Process mkfs.xfs (pid: 4057, ti=ea6b4000 task=ea5799e0 task.ti=ea6b4000)
       Stack:
       00000000 f9525c48 f951fa80 f951f96b 000005e4 ea6b5d7c f9494b34 c19b0ea2
       00000066 f3d6c620 c19b0ea2 00000000 e9a91458 00001000 00000000 00000000
       00000000 c15c7e89 00000000 1c2fb000 00000000 00000000 1c2fb000 00000080
       Call Trace:
       [<f9494b34>] xfs_vm_write_failed+0x74/0x1b0 [xfs]
       [<c15c7e89>] ? printk+0x4d/0x4f
       [<f9494d7d>] xfs_vm_write_begin+0x10d/0x170 [xfs]
       [<c110a34c>] generic_file_buffered_write+0xdc/0x210
       [<f949b669>] xfs_file_buffered_aio_write+0xf9/0x190 [xfs]
       [<f949b7f3>] xfs_file_aio_write+0xf3/0x160 [xfs]
       [<c115e504>] do_sync_write+0x94/0xd0
       [<c115ed1f>] vfs_write+0x8f/0x160
       [<c115e470>] ? wait_on_retry_sync_kiocb+0x50/0x50
       [<c115f017>] sys_write+0x47/0x80
       [<c15d860d>] sysenter_do_call+0x12/0x28
       .............
       EIP: [<f94a7e8b>] assfail+0x2b/0x30 [xfs] SS:ESP 0068:ea6b5d1c
       ---[ end trace cdd9af4f4ecab42f ]---
       Kernel panic - not syncing: Fatal exception
      
      In order to avoid this, we can evaluate the block_offset of the start
      of the page by using shifts rather than masks, which avoids the
      mismatch problem.
      
      Thanks to Dave Chinner for help finding and fixing this bug.
      Reported-by: Michael L. Semon <mlsemon35@gmail.com>
      Reviewed-by: Dave Chinner <david@fromorbit.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      58e59854
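The truncation described in the commit above can be reproduced with plain integer arithmetic. This sketch assumes 4 KiB pages and models the 32-bit `unsigned long` mask as the constant `0xfffff000` quoted in the message; the function names and the sample `pos` value are illustrative.

```python
PAGE_SHIFT = 12                                  # assumed 4 KiB pages
PAGE_SIZE = 1 << PAGE_SHIFT
PAGE_MASK_32 = ~(PAGE_SIZE - 1) & 0xFFFFFFFF     # 0xfffff000 on a 32-bit platform


def block_offset_mask(pos: int) -> int:
    """Masking a 64-bit pos with a 32-bit mask drops the high 32 bits."""
    return pos & PAGE_MASK_32


def block_offset_shift(pos: int) -> int:
    """Shifting down and back up keeps every high bit of pos."""
    return (pos >> PAGE_SHIFT) << PAGE_SHIFT


pos = (1 << 32) | 0x1C2FB080      # a file offset above 4 GiB
offset_in_page = pos & (PAGE_SIZE - 1)
```

For this `pos`, the masked result plus the in-page offset no longer equals `pos`, which is the condition the failing ASSERT checks; the shift-based result does satisfy it.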
  5. 12 Jul, 2013 1 commit
  6. 11 Jul, 2013 1 commit
    • C
      xfs: Add pquota fields where gquota is used. · 92f8ff73
      Chandra Seetharaman committed
      Add project quota changes to all the places where group quota field
      is used:
         * add separate project quota members into various structures
         * split project quota and group quotas so that instead of overriding
           the group quota members incore, the new project quota members are
           used instead
         * get rid of usage of the OQUOTA flag incore, in favor of separate
           group and project quota flags.
         * add a project dquot argument to various functions.
      
      Not using the pquotino field from superblock yet.
      Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      92f8ff73
  7. 10 Jul, 2013 6 commits
    • C
      xfs: fix sgid inheritance for subdirectories inheriting default acls [V3] · 42c49d7f
      Carlos Maiolino committed
      XFS removes sgid bits of subdirectories under a directory containing a default
      acl.
      
      When a default acl is set, XFS calls xfs_setattr_nonsize() in its
      code path. That function is shared by the mkdir and chmod system calls, and
      does some checks that mkdir does not need (calling inode_change_ok()). Those
      checks remove the sgid bit from the inode after it has been granted.
      
      With this patch, we extend the meaning of XFS_ATTR_NOACL flag to avoid these
      checks when acls are being inherited (thanks hch).
      
      Also, xfs_setattr_mode() doesn't need to re-check for group id and capabilities
      permissions; this only results in another attempt to remove the sgid bit from
      the directories. Such a check is already done either in inode_change_ok() or
      xfs_setattr_nonsize().
      
      Changelog:
      
      V2: Extends the meaning of XFS_ATTR_NOACL instead of wrapping the tests in another
          function
      
      V3: Remove S_ISDIR check in xfs_setattr_nonsize() from the patch
      Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      42c49d7f
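The effect of the unneeded check on an inherited sgid bit can be sketched with mode bits alone. The functions `checked_mode` and `mkdir_mode` and their parameters are hypothetical; they model the behaviour described in the commit message, not the actual XFS or VFS code.

```python
import stat


def checked_mode(mode: int, caller_in_group: bool, has_cap_fsetid: bool) -> int:
    """Model of a chmod-style check: strip S_ISGID for callers outside the group."""
    if not caller_in_group and not has_cap_fsetid:
        mode &= ~stat.S_ISGID
    return mode


def mkdir_mode(parent_mode: int, requested: int,
               caller_in_group: bool, skip_checks: bool) -> int:
    """Compute the mode of a new subdirectory under parent_mode."""
    mode = stat.S_IFDIR | requested
    if parent_mode & stat.S_ISGID:
        mode |= stat.S_ISGID  # subdirectories inherit the sgid bit
    if not skip_checks:
        # Without the fix, the ACL inheritance path runs the chmod-style check.
        mode = checked_mode(mode, caller_in_group, has_cap_fsetid=False)
    return mode


parent = stat.S_IFDIR | 0o2775
before = mkdir_mode(parent, 0o755, caller_in_group=False, skip_checks=False)
after = mkdir_mode(parent, 0o755, caller_in_group=False, skip_checks=True)
```

Here `before` has lost the sgid bit while `after` keeps it, which corresponds to extending the flag so that the checks are skipped during ACL inheritance.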
    • D
      xfs: dquot log reservations are too small · b0a9dab7
      Dave Chinner committed
      During review of the separate project quota inode patches, it became
      obvious that the dquot log reservation calculation underestimated
      the number of dquots that can be modified in a transaction. This has
      its roots way back in the Irix quota implementation.
      
      That is, when quotas were first implemented in XFS, it only
      supported user and project quotas as Irix did not have group quotas.
      Hence the worst case operation involving dquot modification was
      calculated to involve 2 user dquots and 1 project dquot or 1 user
      dquot and 2 project dquots. i.e. 3 dquots. This was determined back
      in 1996, and has remained unchanged ever since.
      
      However, back in 2001, the Linux XFS port dropped all support for
      project quota and implemented group quotas over the top. This was
      effectively done with a search-and-replace of project with group,
      and as such the log reservation was not changed. However, with the
      advent of group quotas, chmod and rename now could modify more than
      3 dquots in a single transaction - both could modify 4 dquots. Hence
      this log reservation has been wrong for a long time.
      
      In 2005, project quota support was reintroduced into Linux, but it
      was implemented to be mutually exclusive to group quotas and so this
      didn't add any new changes to the dquot log reservation. Hence when
      project quotas were in use (rather than group quotas) the log
      reservation was again valid, just like in the Irix days.
      
      Now, with the addition of the separate project quota inode, group
      and project quotas are no longer mutually exclusive, and hence
      operations can now modify three dquots per inode where previously it
      was only two. The worst case here is the rename transaction, which
      can allocate/free space on two different directory inodes, and if
      they have different uid/gid/prid configurations and are world
      writeable, then rename can actually modify 6 different dquots now.
      
      Further, the dquot log reservation doesn't take into account the
      space used by the dquot log format structure that precedes the dquot
      that is logged, and hence further underestimates the worst case
      log space required by dquots during a transaction. This has been
      missing since the first commit in 1996.
      
      Hence the worst case log reservation needs to be increased from 3 to
      6, and it needs to take into account a log format header for each of
      those dquots.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      b0a9dab7
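The counting argument in the commit above reduces to simple arithmetic. In this sketch the function names are hypothetical, and the byte sizes used in the example are placeholders chosen for illustration, not the real structure sizes.

```python
def worst_case_dquots(inodes_charged: int, quota_types_per_inode: int) -> int:
    """Each charged inode can touch one dquot per enabled quota type."""
    return inodes_charged * quota_types_per_inode


def dquot_log_reservation(num_dquots: int, dquot_bytes: int,
                          logformat_header_bytes: int) -> int:
    """Each logged dquot is preceded by a log format header."""
    return num_dquots * (dquot_bytes + logformat_header_bytes)


# Rename touching two directories with user, group and project quotas.
new_count = worst_case_dquots(inodes_charged=2, quota_types_per_inode=3)

# Placeholder sizes (assumptions): 100-byte dquot, 20-byte header.
old_bytes = 3 * 100
new_bytes = dquot_log_reservation(new_count, dquot_bytes=100,
                                  logformat_header_bytes=20)
```

With two quota types per inode the same rename gives 4 dquots, which matches the group quota case described for chmod and rename.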
    • D
      xfs: remove local fork format handling from xfs_bmapi_write() · f3508bcd
      Dave Chinner committed
      The conversion from local format to extent format requires
      interpretation of the data in the fork being converted, so it cannot
      be done in a generic way. It is up to the caller to convert the fork
      format to extent format before calling into xfs_bmapi_write() so
      format conversion can be done correctly.
      
      The code in xfs_bmapi_write() to convert the format is used
      implicitly by the attribute and directory code, but they
      specifically zero the fork size so that the conversion does not do
      any allocation or manipulation. Move this conversion into the
      shortform to leaf functions for the dir/attr code so the conversions
      are explicitly controlled by all callers.
      
      Now we can remove the conversion code in xfs_bmapi_write.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      f3508bcd
    • Y
      xfs: use get_unused_fd_flags(0) instead of get_unused_fd() · 862a6293
      Yann Droneaud committed
      Macro get_unused_fd() is used to allocate a file descriptor with
      default flags. Those default flags (0) can be "unsafe":
      O_CLOEXEC must be used by default to not leak file descriptors
      across exec().
      
      Instead of macro get_unused_fd(), functions anon_inode_getfd()
      or get_unused_fd_flags() should be used with flags given by userspace.
      If not possible, flags should be set to O_CLOEXEC to provide userspace
      with a default safe behavior.
      
      In a further patch, get_unused_fd() will be removed so that
      new code starts using anon_inode_getfd() or get_unused_fd_flags()
      with correct flags.
      
      This patch replaces calls to get_unused_fd() with equivalent calls to
      get_unused_fd_flags(0) to preserve current behavior for existing code.
      
      The hard coded flag value (0) should be reviewed on a per-subsystem basis,
      and, if possible, set to O_CLOEXEC.
      Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      862a6293
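The close-on-exec distinction the commit above refers to is visible from userspace. This is a userspace analogue under stated assumptions, not the kernel API: `open_fd` and `has_cloexec` are hypothetical helpers, and the sketch assumes a POSIX system where Python's `fcntl` module is available.

```python
import fcntl
import os


def has_cloexec(fd: int) -> bool:
    """Report whether the descriptor is closed automatically across exec()."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)


def open_fd(cloexec: bool) -> int:
    """Open a descriptor and set its inheritance explicitly.

    cloexec=False mimics a descriptor allocated with flags 0,
    cloexec=True mimics one allocated with O_CLOEXEC.
    """
    fd = os.open(os.devnull, os.O_RDONLY)
    os.set_inheritable(fd, not cloexec)
    return fd
```

A descriptor from `open_fd(False)` would remain open in a program started with exec(), which is the leak the message warns about.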
    • J
      xfs: clean up unused codes at xfs_bulkstat() · 9cee4c5b
      Jie Liu committed
      There is some unused code in xfs_bulkstat():
      
      - Variable bp is defined to point to the on-disk inode cluster
        buffer, but it proved to be of no practical help.
      
      - We process the chunks of good inodes which were fetched by iterating
        btree records from an AG.  When processing inodes from each chunk,
        the code recomputes agbno when it reaches the first inode of a
        cluster; however, agbno is not used thereafter.
      
      This patch tries to clean up those things.
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      9cee4c5b
    • E
      xfs: use XFS_BMAP_BMDR_SPACE vs. XFS_BROOT_SIZE_ADJ · a69c7c07
      Eric Sandeen committed
      XFS_BROOT_SIZE_ADJ is an undocumented macro which accounts for
      the difference in size between the on-disk and in-core btree
      root.  It's much clearer to just use the newly-added
      XFS_BMAP_BMDR_SPACE macro which gives us the on-disk size
      directly.
      
      In one case, we must test that the if_broot exists before
      applying the macro, however.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      a69c7c07
  8. 03 Jul, 2013 1 commit
    • J
      vfs: export lseek_execute() to modules · 46a1c2c7
      Jie Liu committed
      For those file systems (btrfs/ext4/ocfs2/tmpfs) that support
      SEEK_DATA/SEEK_HOLE functions, we end up handling a similar
      matter in lseek_execute() to update the current file offset
      to the desired offset if it is valid; ceph also does
      similar things in ceph_llseek().
      
      To reduce the duplication, this patch makes lseek_execute()
      publicly accessible so that we can call it directly from the
      underlying file systems.
      
      Thanks to Dave Chinner for this suggestion.
      
      [AV: call it vfs_setpos(), don't bring the removed 'inode' argument back]
      
      v2->v1:
      - Add kernel-doc comments for lseek_execute()
      - Call lseek_execute() in ceph->llseek()
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chris Mason <chris.mason@fusionio.com>
      Cc: Josef Bacik <jbacik@fusionio.com>
      Cc: Ben Myers <bpm@sgi.com>
      Cc: Ted Tso <tytso@mit.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Sage Weil <sage@inktank.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      46a1c2c7
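The shared "validate, then update the file offset" step described above can be sketched as a small function. `ToyFile` and `setpos` are hypothetical names, and the field names are assumptions modelled loosely on a file structure; this is not the exported kernel helper itself.

```python
import errno
from dataclasses import dataclass


@dataclass
class ToyFile:
    f_pos: int = 0                  # current file offset
    f_version: int = 0              # bumped or reset when the position changes
    unsigned_offsets: bool = False  # whether negative-looking offsets are allowed


def setpos(file: ToyFile, offset: int, maxsize: int) -> int:
    """Update the offset if it is valid, else return a negative errno."""
    if offset < 0 and not file.unsigned_offsets:
        return -errno.EINVAL
    if offset > maxsize:
        return -errno.EINVAL
    if offset != file.f_pos:
        file.f_pos = offset
        file.f_version = 0
    return offset
```

A filesystem's SEEK_DATA/SEEK_HOLE handler would compute the target offset itself and then hand it to a helper of this shape instead of duplicating the checks.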
  9. 29 Jun, 2013 8 commits
  10. 28 Jun, 2013 11 commits
    • D
      xfs: Use inode create transaction · ddf6ad01
      Dave Chinner committed
      Replace the use of buffer based logging of inode initialisation
      with the new logical form that describes the range to be initialised
      in recovery. We continue to "log" the inode buffers to push them
      into the AIL and ensure that the inode create transaction is not
      removed from the log before the inode buffers are written to disk.
      
      Update the transaction identifier and reservations to match the
      changed implementation.
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      ddf6ad01
    • D
      xfs: Inode create item recovery · 28c8e41a
      Dave Chinner committed
      When we find an icreate transaction, we need to get and initialise
      the buffers in the range that has been passed. Extract and verify
      the information in the item record, then loop over the range
      initialising the buffers and issuing them as delayed writes.
      
      Support an arbitrary size range to initialise so that in
      future when we allocate inodes in much larger chunks all kernels
      that understand this transaction can still recover them.
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      28c8e41a
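The "verify the record, then loop over the range" step above can be sketched as follows. The function `cluster_buffers` and its parameters are hypothetical; they stand in for whatever fields the item record actually carries, and the sketch only shows how an arbitrary-length range splits into cluster-sized buffers.

```python
def cluster_buffers(start_agbno: int, length_blocks: int,
                    blocks_per_cluster: int) -> list[tuple[int, int]]:
    """Return (agbno, nblocks) for each cluster buffer in the logged range."""
    # Verify the record before touching any buffers.
    if length_blocks <= 0 or blocks_per_cluster <= 0:
        raise ValueError("range and cluster size must be positive")
    if length_blocks % blocks_per_cluster:
        raise ValueError("range must be a whole number of clusters")
    # Walk the range one cluster buffer at a time.
    return [(start_agbno + off, blocks_per_cluster)
            for off in range(0, length_blocks, blocks_per_cluster)]
```

Because the loop depends only on the logged length, a larger allocation chunk simply produces more buffers, which is the forward compatibility the message aims for.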
    • D
      xfs: Inode create transaction reservations · b8402b47
      Dave Chinner committed
      Define the log and space transaction sizes. Factor the current
      create log reservation macro into the two logical halves and reuse
      one half for the new icreate transactions. The icreate transaction
      is transparent to all the high level create code - the
      pre-calculated reservations will correctly set the reservations
      dependent on whether the filesystem supports the icreate
      transaction.
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      b8402b47
    • D
      xfs: Inode create log items · 3ebe7d2d
      Dave Chinner committed
      Introduce the inode create log item type for logical inode create logging.
      Instead of logging the changes in buffers, pass the range to be
      initialised through the log by a new transaction type.  This reduces
      the amount of log space required to record initialisation during
      allocation from about 128 bytes per inode to a small fixed amount
      per inode extent to be initialised.
      
      This requires a new log item type to track it through the log
      and the AIL. This is a relatively simple item - most callbacks are
      noops as this item has the same life cycle as the transaction.
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      3ebe7d2d
    • D
      xfs: Introduce an ordered buffer item · 5f6bed76
      Dave Chinner committed
      If we have a buffer that we have modified but we do not wish to
      physically log in a transaction (e.g. we've logged a logical
      change), we still need to ensure that transactional integrity is
      maintained. Hence we must not move the tail of the log past the
      transaction that the buffer is associated with before the buffer is
      written to disk.
      
      This means these special buffers still need to be included in the
      transaction and added to the AIL just like a normal buffer, but we
      do not want the modifications to the buffer written into the
      transaction. IOWs, what we want is an "ordered buffer" that
      maintains the same transactional life cycle as a physically logged
      buffer, just without the transcribing of the modifications to the
      log.
      
      Hence we need to flag the buffer as an "ordered buffer" to avoid
      including it in vector size calculations or formatting during the
      transaction. Once the transaction is committed, the buffer appears
      for all intents to be the same as a physically logged buffer as it
      transitions through the log and AIL.
      
      Relogging will also work just fine for such an ordered buffer - the
      logical transaction will be replayed before the subsequent
      modifications that relog the buffer, so everything will be
      reconstructed correctly by recovery.
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      5f6bed76
    • D
      xfs: Introduce ordered log vector support · fd63875c
      Dave Chinner committed
      An "ordered log vector" is a log vector that is used for
      tracking a log item through the CIL and into the AIL as part of the
      log checkpointing. These ordered log vectors are special in that
      they are not written to the journal in any way, and are not accounted
      to the checkpoint being written.
      
      The reason for this behaviour is to allow operations to attach items
      to transactions and have them follow the normal transactional
      lifecycle without actually having to write them to the journal. This
      allows logging of items that track high level logical changes and
      writing them to the log, while the physical items being modified
      pass through into the AIL and pin the tail of the log (and therefore
      the logical item in the log) until all the modified items are
      physically written to disk.
      
      IOWs, it allows us to write metadata without physically logging
      every individual change but still maintain the full transactional
      integrity guarantees we currently have w.r.t. crash recovery.
      
      This change modifies some of the CIL item insertion loops, as
      ordered log vectors introduce some new constraints as they don't
      track any data. One advantage of this change is that it combines
      two log vector chain walks into a single pass, so there is less
      overhead in the transaction commit pass as well. It also kills some
      unused code in the log vector walk loop when committing the CIL.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      fd63875c
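The distinction between ordered and physically logged items can be sketched as a toy checkpoint. `LogItem` and `checkpoint` are hypothetical names and the byte counts are placeholders; the sketch only shows that ordered items are tracked without contributing to the journal.

```python
from dataclasses import dataclass


@dataclass
class LogItem:
    name: str
    nbytes: int            # size of the modification if it were logged
    ordered: bool = False  # ordered items are tracked but never written


def checkpoint(items: list[LogItem]) -> tuple[int, list[str]]:
    """Return (bytes written to the journal, items tracked in the AIL)."""
    journal_bytes = sum(item.nbytes for item in items if not item.ordered)
    ail = [item.name for item in items]  # every item still pins the log tail
    return journal_bytes, ail


items = [
    LogItem("logical create record", 32),                 # placeholder size
    LogItem("inode cluster buffer", 8192, ordered=True),  # placeholder size
]
written, tracked = checkpoint(items)
```

Only the small logical record is written, yet both items are tracked, so the tail of the log cannot move past the transaction until the buffer reaches disk.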
    • D
      xfs: xfs_ifree doesn't need to modify the inode buffer · 1baaed8f
      Dave Chinner committed
      Long ago, bulkstat used to read inodes directly from the backing
      buffer for speed. This had the unfortunate problem of being cache
      incoherent with unlinks, and so xfs_ifree() had to mark the inode
      as free directly in the backing buffer. bulkstat was changed some
      time ago to use inode cache coherent lookups, and so will never see
      unlinked inodes in its lookups. Hence xfs_ifree() does not need to
      touch the inode backing buffer anymore.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      1baaed8f
    • D
      xfs: don't do IO when creating an new inode · cca9f93a
      Dave Chinner committed
      When we are allocating a new inode, we read the inode cluster off
      disk to increment the generation number. We are already using a
      random generation number for newly allocated inodes, so if we are not
      using the ikeep mode, we can just generate a new generation number
      when we initialise the newly allocated inode.
      
      This avoids the need for reading the inode buffer during inode
      creation. This will speed up allocation of inodes in cold, partially
      allocated clusters as they will no longer need to be read from disk
      during allocation. It will also reduce the CPU overhead of inode
      allocation by not having to process the buffer read, even on cache
      hits.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      cca9f93a
    • D
      xfs: don't use speculative prealloc for small files · 133eeb17
      Dave Chinner committed
      Dedicated small file workloads have been seeing significant free
      space fragmentation causing premature inode allocation failure
      when large inode sizes are in use. A particular test case showed
      that a workload that runs to a real ENOSPC on 256 byte inodes would
      fail inode allocation with ENOSPC at about 80% full with 512 byte
      inodes, and at about 50% full with 1024 byte inodes.
      
      The same workload, when run with -o allocsize=4096 on 1024 byte
      inodes would run to being 100% full before giving ENOSPC. That is,
      no freespace fragmentation at all.
      
      The issue was caused by the specific IO pattern the application had
      - the framework it was using did not support direct IO, and so it
      was emulating it by using fadvise(DONT_NEED). The result was that
      the data was getting written back before the speculative prealloc
      had been trimmed from memory by the close(), and so small single
      block files were being allocated with 2 blocks, and then having one
      truncated away. The result was lots of small 4k free space extents,
      and hence each new 8k allocation would take another 8k from
      contiguous free space and turn it into 4k of allocated space and 4k
      of free space.
      
      Hence inode allocation, which requires contiguous, aligned
      allocation of 16k (256 byte inodes), 32k (512 byte inodes) or 64k
      (1024 byte inodes) can fail to find sufficiently large freespace and
      hence fail while there is still lots of free space available.
      
      There's a simple fix for this, and one that has precedent in the
      allocator code already - don't do speculative allocation unless the
      size of the file is larger than a certain size. In this case, that
      size is the minimum default preallocation size:
      mp->m_writeio_blocks. And to keep with the concept of being nice to
      people when the files are still relatively small, cap the prealloc
      to mp->m_writeio_blocks until the file goes over a stripe unit in
      size, at which point we'll fall back to the current behaviour based
      on the last extent size.
      
      This will effectively turn off speculative prealloc for very small
      files, keep preallocation low for small files, and behave as it
      currently does for any file larger than a stripe unit. This
      completely avoids the freespace fragmentation problem this
      particular IO pattern was causing.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      133eeb17
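The three-tier policy described above can be written as a small decision function. This is a sketch under assumptions: the function and parameter names are hypothetical, the exact boundary comparisons are guesses, and `extent_based_blocks` stands in for whatever the existing last-extent heuristic would return. The chunk size helper assumes 64 inodes per chunk, which matches the 16k/32k/64k figures in the message.

```python
INODES_PER_CHUNK = 64  # consistent with 16k chunks for 256 byte inodes


def inode_chunk_bytes(inode_size: int) -> int:
    """Contiguous, aligned space needed to allocate one inode chunk."""
    return INODES_PER_CHUNK * inode_size


def prealloc_blocks(file_size: int, writeio_bytes: int, stripe_unit_bytes: int,
                    writeio_blocks: int, extent_based_blocks: int) -> int:
    """Choose the speculative preallocation size for a write."""
    if file_size < writeio_bytes:
        return 0                   # very small file: no speculative prealloc
    if file_size < stripe_unit_bytes:
        return writeio_blocks      # small file: stay at the minimum size
    return extent_based_blocks     # larger file: existing behaviour
```

For a single-block file this returns 0, so writeback no longer allocates a second block that is later truncated away, which is the source of the 4k free space fragments.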
    • D
      xfs: plug directory buffer readahead · 34eefc06
      Dave Chinner committed
      Similar to bulkstat inode chunk readahead, we need to plug directory
      data buffer readahead during getdents to ensure that we can merge
      adjacent readahead requests and sort out of order requests optimally
      before they are dispatched. This improves the readahead efficiency
      and reduces the IO load it generates as the IO patterns are
      significantly better for both contiguous and fragmented directories.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      34eefc06
    • D
      xfs: add pluging for bulkstat readahead · cbb2864a
      Dave Chinner committed
      I was running some tests on bulkstat on CRC enabled filesystems when
      I noticed that all the IO being issued was 8k in size, regardless of
      the fact that we are issuing sequential 8k buffers for inode
      clusters. The IO size should be 16k for 256 byte inodes, and 32k for
      512 byte inodes, but this wasn't happening.
      
      blktrace showed that there was an explicit plug and unplug happening
      around each readahead IO from _xfs_buf_ioapply, and the unplug was
      causing the IO to be issued immediately. Hence no opportunity was
      being given to the elevator to merge adjacent readahead requests and
      dispatch them as a single IO.
      
      Add plugging around the inode chunk readahead dispatch loop in
      bulkstat to ensure that we don't unplug the queue between adjacent
      inode buffer readahead IOs and so we get fewer, larger IO requests
      hitting the storage subsystem for bulkstat.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      cbb2864a
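The effect of plugging on request merging can be sketched with a toy dispatcher. `dispatch` is a hypothetical function that models the queue as a list of `(start_byte, length)` requests; it illustrates the merging opportunity only and does not reflect the block layer's actual algorithm.

```python
def dispatch(requests: list[tuple[int, int]], plugged: bool) -> list[tuple[int, int]]:
    """Return the IOs that reach storage as (start, length) pairs."""
    if not plugged:
        # Each request is unplugged immediately, so nothing can merge.
        return list(requests)
    merged: list[tuple[int, int]] = []
    for start, length in sorted(requests):
        if merged and merged[-1][0] + merged[-1][1] == start:
            # Adjacent to the previous request: extend it.
            merged[-1] = (merged[-1][0], merged[-1][1] + length)
        else:
            merged.append((start, length))
    return merged


# Two inode chunks of 256 byte inodes, each read as two sequential 8k buffers.
readahead = [(0, 8192), (8192, 8192), (32768, 8192), (40960, 8192)]
```

Unplugged, the model issues four 8k IOs; plugged, it issues two 16k IOs, matching the IO sizes the message expects for 256 byte inodes.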
  11. 27 Jun, 2013 1 commit