1. 27 Jul, 2010 6 commits
    • D
      xfs: fix xfs_trans_add_item() lockdep warnings · 43869706
      Dave Chinner committed
      xfs_trans_add_item() is called with ip->i_ilock held, which means it
      is unsafe for memory reclaim to recurse back into the filesystem
      (ilock is required in writeback). Hence the allocation needs to be
      KM_NOFS to avoid recursion.
      
      Lockdep report indicating memory allocation being called with the
      ip->i_ilock held is as follows:
      
      [ 1749.866796] =================================
      [ 1749.867788] [ INFO: inconsistent lock state ]
      [ 1749.868327] 2.6.35-rc3-dgc+ #25
      [ 1749.868741] ---------------------------------
      [ 1749.868741] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
      [ 1749.868741] dd/2835 [HC0[0]:SC0[0]:HE1:SE1] takes:
      [ 1749.868741]  (&(&ip->i_lock)->mr_lock){++++?.}, at: [<ffffffff813170fb>] xfs_ilock+0x10b/0x190
      [ 1749.868741] {IN-RECLAIM_FS-W} state was registered at:
      [ 1749.868741]   [<ffffffff810b3a97>] __lock_acquire+0x437/0x1450
      [ 1749.868741]   [<ffffffff810b4b56>] lock_acquire+0xa6/0x160
      [ 1749.868741]   [<ffffffff810a20b5>] down_write_nested+0x65/0xb0
      [ 1749.868741]   [<ffffffff813170fb>] xfs_ilock+0x10b/0x190
      [ 1749.868741]   [<ffffffff8134e819>] xfs_reclaim_inode+0x99/0x310
      [ 1749.868741]   [<ffffffff8134f56b>] xfs_inode_ag_walk+0x8b/0x150
      [ 1749.868741]   [<ffffffff8134f6bb>] xfs_inode_ag_iterator+0x8b/0xf0
      [ 1749.868741]   [<ffffffff8134f7a8>] xfs_reclaim_inode_shrink+0x88/0x90
      [ 1749.868741]   [<ffffffff81119d07>] shrink_slab+0x137/0x1a0
      [ 1749.868741]   [<ffffffff8111bbe1>] balance_pgdat+0x421/0x6a0
      [ 1749.868741]   [<ffffffff8111bf7d>] kswapd+0x11d/0x320
      [ 1749.868741]   [<ffffffff8109ce56>] kthread+0x96/0xa0
      [ 1749.868741]   [<ffffffff81035de4>] kernel_thread_helper+0x4/0x10
      [ 1749.868741] irq event stamp: 4234335
      [ 1749.868741] hardirqs last  enabled at (4234335): [<ffffffff81147d25>] kmem_cache_free+0x115/0x220
      [ 1749.868741] hardirqs last disabled at (4234334): [<ffffffff81147c4d>] kmem_cache_free+0x3d/0x220
      [ 1749.868741] softirqs last  enabled at (4233112): [<ffffffff81084dd2>] __do_softirq+0x142/0x260
      [ 1749.868741] softirqs last disabled at (4233095): [<ffffffff81035edc>] call_softirq+0x1c/0x50
      [ 1749.868741] 
      [ 1749.868741] other info that might help us debug this:
      [ 1749.868741] 2 locks held by dd/2835:
      [ 1749.868741]  #0:  (&(&ip->i_iolock)->mr_lock#2){+.+.+.}, at: [<ffffffff81316edd>] xfs_ilock_nowait+0xed/0x200
      [ 1749.868741]  #1:  (&(&ip->i_lock)->mr_lock){++++?.}, at: [<ffffffff813170fb>] xfs_ilock+0x10b/0x190
      [ 1749.868741] 
      [ 1749.868741] stack backtrace:
      [ 1749.868741] Pid: 2835, comm: dd Not tainted 2.6.35-rc3-dgc+ #25
      [ 1749.868741] Call Trace:
      [ 1749.868741]  [<ffffffff810b1faa>] print_usage_bug+0x18a/0x190
      [ 1749.868741]  [<ffffffff8104264f>] ? save_stack_trace+0x2f/0x50
      [ 1749.868741]  [<ffffffff810b2400>] ? check_usage_backwards+0x0/0xf0
      [ 1749.868741]  [<ffffffff810b2f11>] mark_lock+0x331/0x400
      [ 1749.868741]  [<ffffffff810b3047>] mark_held_locks+0x67/0x90
      [ 1749.868741]  [<ffffffff810b3111>] lockdep_trace_alloc+0xa1/0xe0
      [ 1749.868741]  [<ffffffff81147419>] kmem_cache_alloc+0x39/0x1e0
      [ 1749.868741]  [<ffffffff8133f954>] kmem_zone_alloc+0x94/0xe0
      [ 1749.868741]  [<ffffffff8133f9be>] kmem_zone_zalloc+0x1e/0x50
      [ 1749.868741]  [<ffffffff81335f02>] xfs_trans_add_item+0x72/0xb0
      [ 1749.868741]  [<ffffffff81339e41>] xfs_trans_ijoin+0xa1/0xd0
      [ 1749.868741]  [<ffffffff81319f82>] xfs_itruncate_finish+0x312/0x5d0
      [ 1749.868741]  [<ffffffff8133cb87>] xfs_free_eofblocks+0x227/0x280
      [ 1749.868741]  [<ffffffff8133cd18>] xfs_release+0x138/0x190
      [ 1749.868741]  [<ffffffff813464c5>] xfs_file_release+0x15/0x20
      [ 1749.868741]  [<ffffffff81150ebf>] fput+0x13f/0x260
      [ 1749.868741]  [<ffffffff8114d8c2>] filp_close+0x52/0x80
      [ 1749.868741]  [<ffffffff8114d9a9>] sys_close+0xb9/0x120
      [ 1749.868741]  [<ffffffff81034ff2>] system_call_fastpath+0x16/0x1b
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
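The recursion hazard described in this commit can be modelled in user space. The following is a minimal sketch, not kernel code: the `sketch_*` names and flag values are hypothetical stand-ins, and the "reclaim" step is reduced to a check of whether the inode lock is already held.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for kernel allocation flags; not the real KM_* values. */
enum sketch_alloc_flags {
    SKETCH_ALLOC_DEFAULT = 0,      /* reclaim may re-enter the filesystem */
    SKETCH_ALLOC_NOFS    = 1 << 0  /* reclaim must stay out of the filesystem */
};

struct sketch_inode {
    bool ilock_held;  /* models ip->i_ilock being held by the caller */
};

/* Models inode reclaim, which needs the ilock; fails if it is already held. */
static bool sketch_reclaim_inode(const struct sketch_inode *ip)
{
    return !ip->ilock_held;  /* false: we would block on our own lock */
}

/* Returns true if an allocation with these flags is safe under memory pressure. */
bool sketch_alloc_is_safe(const struct sketch_inode *ip, int flags)
{
    if (flags & SKETCH_ALLOC_NOFS)
        return true;  /* reclaim never recurses into the filesystem */
    return sketch_reclaim_inode(ip);
}
```

With the lock held, only the NOFS-style allocation is reported as safe, which is the situation the lockdep report above flags.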
    • C
      xfs: simplify inode to transaction joining · 898621d5
      Christoph Hellwig committed
      Currently we need to either call IHOLD or xfs_trans_ihold on an inode when
      joining it to a transaction via xfs_trans_ijoin.
      
      This patch instead makes xfs_trans_ijoin usable on its own by doing
      an implicit xfs_trans_ihold, which also allows us to drop the third
      argument.  For the case where we want to hold a reference on the inode
      a xfs_trans_ijoin_ref wrapper is added which does the IHOLD and marks
      the inode for needing an xfs_iput.  In addition to the cleaner interface
      to the caller this also simplifies the implementation.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • C
      xfs: merge iop_unpin_remove into iop_unpin · 9412e318
      Christoph Hellwig committed
      The unpin_remove item operation instances always share most of the
      implementation with the respective unpin implementation.  So instead
      of keeping two different entry points add a remove flag to the unpin
      operation and share the code more easily.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • C
      xfs: simplify log item descriptor tracking · e98c414f
      Christoph Hellwig committed
      Currently we track log item descriptors belonging to a transaction using a
      complex opencoded chunk allocator.  This code has been there since day one
      and seems to work around the lack of an efficient slab allocator.
      
      This patch replaces it with dynamically allocated log item descriptors
      from a dedicated slab pool, linked to the transaction by a linked list.
      
      This greatly simplifies the log item descriptor tracking, to the
      point where it's just a couple hundred lines in xfs_trans.c instead of
      a separate file.  The external API has also been simplified while we're
      at it - the xfs_trans_add_item and xfs_trans_del_item functions to add/
      delete items from a transaction have been simplified to the bare minimum,
      and the xfs_trans_find_item function is replaced with a direct dereference
      of the li_desc field.  All debug code walking the list of log items in
      a transaction is down to a simple list_for_each_entry.
      
      Note that we could easily use a singly linked list here instead of the
      doubly linked list from list.h as the fastpath only does deletion from
      sequential traversal.  But given that we don't have one available as
      a library function yet I use the list.h functions for simplicity.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
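The descriptor list described in this commit can be sketched in user space as follows. This is an illustration under stated assumptions: the `sketch_*` names are hypothetical, `malloc` stands in for the dedicated slab pool, and a hand-rolled circular list stands in for the kernel's list.h helpers.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical descriptor: one per log item joined to a transaction. */
struct sketch_desc {
    struct sketch_desc *prev, *next;  /* doubly linked, circular via the head */
    int item_id;
};

struct sketch_trans {
    struct sketch_desc head;  /* sentinel; the list is empty when head.next == &head */
};

void sketch_trans_init(struct sketch_trans *tp)
{
    tp->head.prev = tp->head.next = &tp->head;
}

/* Allocate a descriptor and append it to the transaction's list. */
struct sketch_desc *sketch_trans_add_item(struct sketch_trans *tp, int item_id)
{
    struct sketch_desc *d = malloc(sizeof(*d));  /* the kernel would use a slab pool */
    if (!d)
        return NULL;
    d->item_id = item_id;
    d->prev = tp->head.prev;
    d->next = &tp->head;
    tp->head.prev->next = d;
    tp->head.prev = d;
    return d;
}

/* Unlink and free a descriptor; O(1) because the caller holds a direct pointer. */
void sketch_trans_del_item(struct sketch_desc *d)
{
    d->prev->next = d->next;
    d->next->prev = d->prev;
    free(d);
}

/* Walk the list, as the debug code in the commit message does. */
size_t sketch_trans_count(const struct sketch_trans *tp)
{
    size_t n = 0;
    for (const struct sketch_desc *d = tp->head.next; d != &tp->head; d = d->next)
        n++;
    return n;
}
```

Because each log item would keep a pointer to its own descriptor, no lookup function is needed: deletion follows the pointer directly.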
    • C
      xfs: remove unneeded #include statements · 3400777f
      Christoph Hellwig committed
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <david@fromorbit.com>
    • C
      xfs: drop dmapi hooks · 288699fe
      Christoph Hellwig committed
      Dmapi support was never merged upstream, but we still have a lot of hooks
      bloating XFS for it, all over the fast paths of the filesystem.
      
      This patch drops over 700 lines of dmapi overhead.  If we ever get HSM
      support in mainline at least the namespace events can be done much saner
      in the VFS instead of the individual filesystem, so it's not like this
      is much help for future work.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
  2. 29 May, 2010 1 commit
  3. 24 May, 2010 2 commits
    • D
      xfs: Introduce delayed logging core code · 71e330b5
      Dave Chinner committed
      The delayed logging code only changes in-memory structures and as
      such can be enabled and disabled with a mount option. Add the mount
      option and emit a warning that this is an experimental feature that
      should not be used in production yet.
      
      We also need infrastructure to track committed items that have not
      yet been written to the log. This is what the Committed Item List
      (CIL) is for.
      
      The log item also needs to be extended to track the current log
      vector, the associated memory buffer and its location in the
      Committed Item List. Extend the log item and log vector structures
      to enable this tracking.
      
      To maintain the current log format for transactions with delayed
      logging, we need to introduce a checkpoint transaction and a context
      for tracking each checkpoint from initiation to transaction
      completion.  This includes adding a log ticket for tracking the log
      space required/used by the context checkpoint.
      
      To track all the changes we need an io vector array per log item,
      rather than a single array for the entire transaction. Using the new
      log vector structure for this requires two passes - the first to
      allocate the log vector structures and chain them together, and the
      second to fill them out.  This log vector chain can then be passed
      to the CIL for formatting, pinning and insertion into the CIL.
      
      Formatting of the log vector chain is relatively simple - it's just
      a loop over the iovecs on each log vector, but it is made slightly
      more complex because we re-write the iovec after the copy to point
      back at the memory buffer we just copied into.
      
      This code also needs to pin log items. If the log item is not
      already tracked in this checkpoint context, then it needs to be
      pinned. Otherwise it is already pinned and we don't need to pin it
      again.
      
      The only other complexity is calculating the amount of new log space
      the formatting has consumed. This needs to be accounted to the
      transaction in progress, and the accounting is made more complex
      because we also need to steal space from it for log metadata in the
      checkpoint transaction. Calculate all this at insert time and update
      all the tickets, counters, etc correctly.
      
      Once we've formatted all the log items in the transaction, attach
      the busy extents to the checkpoint context so the busy extents live
      until checkpoint completion and can be processed at that point in
      time. Transactions can then be freed at this point in time.
      
      Now we need to issue checkpoints - we are tracking the amount of log space
      used by the items in the CIL, so we can trigger background checkpoints when the
      space usage gets to a certain threshold. Otherwise, checkpoints need to be
      triggered when a log synchronisation point is reached - a log force event.
      
      Because the log write code already handles chained log vectors, writing the
      transaction is trivial, too. Construct a transaction header, add it
      to the head of the chain and write it into the log, then issue a
      commit record write. Then we can release the checkpoint log ticket
      and attach the context to the log buffer so it can be called during
      I/O completion to complete the checkpoint.
      
      We also need to allow for synchronising multiple in-flight
      checkpoints. This is needed for two things - the first is to ensure
      that checkpoint commit records appear in the log in the correct
      sequence order (so they are replayed in the correct order). The
      second is so that xfs_log_force_lsn() operates correctly and only
      flushes and/or waits for the specific sequence it was provided with.
      
      To do this we need a wait variable and a list tracking the
      checkpoint commits in progress. We can walk this list and wait for
      the checkpoints to change state or complete easily, and this provides
      the necessary synchronisation for correct operation in both cases.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
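The pinning rule in this commit (pin an item only the first time it is tracked in a given checkpoint context) can be sketched as below. This is a simplified model with hypothetical `sketch_*` names; it represents a checkpoint context by a sequence number only and leaves out formatting, space accounting and unpinning.

```c
#include <stdbool.h>

/* Hypothetical log item: tracks which checkpoint context currently holds it. */
struct sketch_log_item {
    int  pin_count;  /* number of outstanding pins */
    long ctx_seq;    /* sequence of the tracking checkpoint context, 0 if none */
};

/*
 * Insert an item into the CIL for checkpoint context `seq`.
 * Returns true if the item was pinned by this call, false if it was
 * already tracked (and therefore already pinned) in the same context.
 */
bool sketch_cil_insert(struct sketch_log_item *lip, long seq)
{
    if (lip->ctx_seq == seq)
        return false;  /* relogged within the same checkpoint: no extra pin */
    lip->ctx_seq = seq;
    lip->pin_count++;
    return true;
}
```

Relogging the same item many times within one checkpoint therefore costs a single pin, which is what makes repeated in-memory modification cheap in this model.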
    • D
      xfs: Improve scalability of busy extent tracking · ed3b4d6c
      Dave Chinner committed
      When we free a metadata extent, we record it in the per-AG busy
      extent array so that it is not re-used before the freeing
      transaction hits the disk. This array is fixed size, so when it
      overflows we make further allocation transactions synchronous
      because we cannot track more freed extents until those transactions
      hit the disk and are completed. Under heavy mixed allocation and
      freeing workloads with large log buffers, we can overflow this array
      quite easily.
      
      Further, the array is sparsely populated, which means that inserts
      need to search for a free slot, and array searches often have to
      search many more slots than are actually used to check all the
      busy extents. Quite inefficient, really.
      
      To enable this aspect of extent freeing to scale better, we need
      a structure that can grow dynamically. While in other areas of
      XFS we have used radix trees, the extents being freed are at random
      locations on disk so are better suited to being indexed by an rbtree.
      
      So, use a per-AG rbtree indexed by block number to track busy
      extents.  This incurs a memory allocation when marking an extent
      busy, but should not occur too often in low memory situations. This
      should scale to an arbitrary number of extents so should not be a
      limitation for features such as in-memory aggregation of
      transactions.
      
      However, there are still situations where we can't avoid allocating
      busy extents (such as allocation from the AGFL). To minimise the
      overhead of such occurrences, we need to avoid doing a synchronous
      log force while holding the AGF locked to ensure that the previous
      transactions are safely on disk before we use the extent. We can do
      this by marking the transaction doing the allocation as synchronous
      rather than issuing a log force.
      
      Because of the locking involved and the ordering of transactions,
      the synchronous transaction provides the same guarantees as a
      synchronous log force because it ensures that all the prior
      transactions are already on disk when the synchronous transaction
      hits the disk. i.e. it preserves the free->allocate order of the
      extent correctly in recovery.
      
      By doing this, we avoid holding the AGF locked while log writes are
      in progress, hence reducing the length of time the lock is held and
      therefore we increase the rate at which we can allocate and free
      from the allocation group, thereby increasing overall throughput.
      
      The only problem with this approach is that when a metadata buffer is
      marked stale (e.g. a directory block is removed), the buffer remains
      pinned and locked until the log goes to disk. The issue here is that
      if that stale buffer is reallocated in a subsequent transaction, the
      attempt to lock that buffer in the transaction will hang waiting for
      the log to go to disk to unlock and unpin the buffer. Hence if
      someone tries to lock a pinned, stale, locked buffer we need to
      push on the log to get it unlocked ASAP. Effectively we are trading
      off a guaranteed log force for a much less common trigger for log
      force to occur.
      
      Ideally we should not reallocate busy extents. That is a much more
      complex fix to the problem as it involves direct intervention in the
      allocation btree searches in many places. This is left to a future
      set of modifications.
      
      Finally, now that we track busy extents in allocated memory, we
      don't need the descriptors in the transaction structure to point to
      them. We can replace the complex busy chunk infrastructure with a
      simple linked list of busy extents. This allows us to remove a large
      chunk of code, making the overall change a net reduction in code
      size.
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
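A dynamically growing tree of busy extents indexed by block number can be sketched as follows. Note the substitution: this uses a plain unbalanced binary search tree rather than the kernel rbtree the commit describes, and the `sketch_*` names are hypothetical. It assumes stored extents never overlap, which is what lets a single root-to-leaf walk detect an overlap.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical busy extent node covering blocks [bno, bno + len). */
struct sketch_busy {
    unsigned long bno, len;
    struct sketch_busy *left, *right;
};

/*
 * Mark [bno, bno + len) busy. Returns false if the range overlaps an
 * existing busy extent or if memory allocation fails.
 */
bool sketch_busy_insert(struct sketch_busy **root, unsigned long bno,
                        unsigned long len)
{
    while (*root) {
        struct sketch_busy *n = *root;
        if (bno + len <= n->bno)
            root = &n->left;           /* entirely below this node */
        else if (bno >= n->bno + n->len)
            root = &n->right;          /* entirely above this node */
        else
            return false;              /* overlap */
    }
    struct sketch_busy *n = calloc(1, sizeof(*n));
    if (!n)
        return false;
    n->bno = bno;
    n->len = len;
    *root = n;
    return true;
}

/* Returns true if any block in [bno, bno + len) is currently busy. */
bool sketch_busy_search(const struct sketch_busy *n, unsigned long bno,
                        unsigned long len)
{
    while (n) {
        if (bno + len <= n->bno)
            n = n->left;
        else if (bno >= n->bno + n->len)
            n = n->right;
        else
            return true;
    }
    return false;
}
```

Unlike the fixed-size array, nothing here overflows: the cost is one allocation per busy extent, as the commit message notes.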
  4. 19 May, 2010 5 commits
  5. 02 Mar, 2010 1 commit
  6. 22 Jan, 2010 2 commits
  7. 12 Dec, 2009 1 commit
  8. 12 Jun, 2009 1 commit
  9. 08 Jun, 2009 1 commit
    • C
      xfs: kill xfs_qmops · 7d095257
      Christoph Hellwig committed
      Kill the quota ops function vector and replace it with direct calls or
      stubs in the CONFIG_XFS_QUOTA=n case.
      
      Make sure we check XFS_IS_QUOTA_RUNNING in the right spots.  We can remove
      a number of those checks because the XFS_TRANS_DQ_DIRTY flag can't be set
      otherwise.
      
      This brings us back closer to the way this code worked in IRIX and earlier
      Linux versions, but we keep a lot of the more useful factoring of common
      code.
      
      Eventually we should also kill xfs_qm_bhv.c, but that's left for a later
      patch.
      
      Reduces the size of the source code by about 250 lines and the size of
      XFS module by about 1.5 kilobytes with quotas enabled:
      
         text	   data	    bss	    dec	    hex	filename
       615957	   2960	   3848	 622765	  980ad	fs/xfs/xfs.o
       617231	   3152	   3848	 624231	  98667	fs/xfs/xfs.o.old
      
      Fallout:
      
       - xfs_qm_dqattach is split into xfs_qm_dqattach_locked which expects
         the inode locked and xfs_qm_dqattach which does the locking around it,
         thus removing XFS_QMOPT_ILOCKED.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
  10. 17 Nov, 2008 1 commit
    • D
      [XFS] Fix double free of log tickets · cc09c0dc
      Dave Chinner committed
      When an I/O error occurs during an intermediate commit on a rolling
      transaction, xfs_trans_commit() will free the transaction structure
      and the related ticket. However, the duplicate transaction that
      gets used as the transaction continues still contains a pointer
      to the ticket. Hence when the duplicate transaction is cancelled
      and freed, we free the ticket a second time.
      
      Add reference counting to the ticket so that we hold an extra
      reference to the ticket over the transaction commit. We drop the
      extra reference once we have checked that the transaction commit
      did not return an error, thus avoiding a double free on commit
      error.
      
      Credit to Nick Piggin for tripping over the problem.
      
      SGI-PV: 989741
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
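The reference-counting fix described in this commit can be sketched in user space as below. All `sketch_*` names are hypothetical, the `free_count` field is a test hook added for illustration, and the model assumes (as a simplification) that a successful commit leaves the ticket reference in place while a failed commit drops it.

```c
#include <errno.h>
#include <stdlib.h>

struct sketch_ticket {
    int  refcount;
    int *free_count;  /* illustration hook: incremented when the ticket is freed */
};

struct sketch_ticket *sketch_ticket_alloc(int *free_count)
{
    struct sketch_ticket *t = malloc(sizeof(*t));
    if (!t)
        return NULL;
    t->refcount = 1;
    t->free_count = free_count;
    return t;
}

void sketch_ticket_get(struct sketch_ticket *t)
{
    t->refcount++;
}

/* Drop a reference; the ticket is freed only when the last one goes away. */
void sketch_ticket_put(struct sketch_ticket *t)
{
    if (--t->refcount == 0) {
        (*t->free_count)++;
        free(t);
    }
}

/* Models a commit that releases the ticket when an I/O error occurs. */
int sketch_commit(struct sketch_ticket *t, int io_error)
{
    if (io_error) {
        sketch_ticket_put(t);
        return -EIO;
    }
    return 0;
}

/* Intermediate commit of a rolling transaction, holding an extra reference. */
int sketch_roll(struct sketch_ticket *t, int io_error)
{
    int error;

    sketch_ticket_get(t);              /* extra reference over the commit */
    error = sketch_commit(t, io_error);
    if (error)
        return error;                  /* caller cancels the duplicate transaction */
    sketch_ticket_put(t);              /* commit succeeded: drop the extra reference */
    return 0;
}

/* Cancelling the duplicate transaction drops its reference to the ticket. */
void sketch_cancel(struct sketch_ticket *t)
{
    sketch_ticket_put(t);
}
```

On the error path the failed commit and the later cancel each drop one reference, so the ticket is freed exactly once instead of twice.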
  11. 30 Oct, 2008 3 commits
  12. 13 Aug, 2008 2 commits
  13. 28 Jul, 2008 1 commit
  14. 14 Feb, 2008 1 commit
  15. 07 Feb, 2008 1 commit
  16. 16 Oct, 2007 1 commit
  17. 15 Oct, 2007 1 commit
    • C
      [XFS] superblock endianess annotations · 2bdf7cd0
      Christoph Hellwig committed
      Creates a new xfs_dsb_t that is __be annotated and keeps xfs_sb_t for the
      incore one. xfs_xlatesb is renamed to xfs_sb_to_disk and only handles the
      incore -> disk conversion. A new helper xfs_sb_from_disk handles the other
      direction and doesn't need the slightly hacky table-driven approach
      because we only ever read the full sb from disk.
      
      The handling of shared r/o filesystems has been buggy on little endian
      systems and fixing this required shuffling around of some code in that
      area.
      
      SGI-PV: 968563
      SGI-Modid: xfs-linux-melb:xfs-kern:29477a
      Signed-off-by: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: David Chinner <dgc@sgi.com>
      Signed-off-by: Tim Shimmin <tes@sgi.com>
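The split between an on-disk big-endian structure and an in-core native structure can be sketched as follows. The two-field structures and all `sketch_*` names are hypothetical (the real superblock has many more fields), and portable byte shifts stand in for the kernel's endian helpers and `__be` annotations.

```c
#include <stdint.h>

/* Hypothetical on-disk layout: every field stored big-endian, byte by byte. */
struct sketch_dsb {
    uint8_t magic[4];
    uint8_t blocksize[4];
};

/* Hypothetical in-core layout: fields in native CPU byte order. */
struct sketch_sb {
    uint32_t magic;
    uint32_t blocksize;
};

static void sketch_put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

static uint32_t sketch_get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* In-core -> disk conversion. */
void sketch_sb_to_disk(struct sketch_dsb *to, const struct sketch_sb *from)
{
    sketch_put_be32(to->magic, from->magic);
    sketch_put_be32(to->blocksize, from->blocksize);
}

/* Disk -> in-core conversion; always converts the whole structure. */
void sketch_sb_from_disk(struct sketch_sb *to, const struct sketch_dsb *from)
{
    to->magic = sketch_get_be32(from->magic);
    to->blocksize = sketch_get_be32(from->blocksize);
}
```

Keeping the two layouts as distinct types means a missing conversion shows up as a type mismatch rather than as a silent byte-order bug on little-endian machines.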
  18. 14 Jul, 2007 3 commits
    • D
      [XFS] Apply transaction delta counts atomically to incore counters · 45c34141
      David Chinner committed
      With the per-cpu superblock counters, batch updates are no longer atomic
      across the entire batch of changes. This is not an issue if each
      individual change in the batch is applied atomically. Unfortunately, free
      block count changes are not applied atomically, and they are applied in a
      manner guaranteed to cause problems.
      
      Essentially, the free block count reservation that the transaction took
      initially is returned to the in core counters before a second delta takes
      away what is used. Because these two operations are not atomic, we can
      race with another thread that can use the returned transaction reservation
      before the transaction takes the space away again and we can then get
      ENOSPC being reported in a spot where we don't have an ENOSPC condition,
      nor should we ever see one there.
      
      Fix it up by rolling the two deltas into the one so it can be applied
      safely (i.e. atomically) to the incore counters.
      
      SGI-PV: 964465
      SGI-Modid: xfs-linux-melb:xfs-kern:28796a
      Signed-off-by: David Chinner <dgc@sgi.com>
      Signed-off-by: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Tim Shimmin <tes@sgi.com>
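Rolling the two deltas into one can be sketched with C11 atomics. This is a user-space illustration with hypothetical `sketch_*` names; a single shared atomic counter stands in for the per-cpu superblock counters.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Net change to the free block count: reservation returned minus blocks used. */
int64_t sketch_net_delta(int64_t reserved, int64_t used)
{
    return reserved - used;
}

/*
 * Apply the combined delta in one atomic step, so no other thread can
 * observe the state where the reservation has been returned but the used
 * blocks have not yet been subtracted. Returns the new counter value.
 */
int64_t sketch_apply_delta(_Atomic int64_t *free_blocks, int64_t reserved,
                           int64_t used)
{
    int64_t delta = sketch_net_delta(reserved, used);
    return atomic_fetch_add(free_blocks, delta) + delta;
}
```

For example, with 100 free blocks remaining after a 10-block reservation of which 7 were used, the counter moves from 100 to 103 in one step and never exposes the transient value 110.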
    • D
      [XFS] Fix the transaction flags to make lazy superblock counters work. · 210c6f1c
      David Chinner committed
      SGI-PV: 964999
      SGI-Modid: xfs-linux-melb:xfs-kern:28653a
      Signed-off-by: David Chinner <dgc@sgi.com>
      Signed-off-by: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Tim Shimmin <tes@sgi.com>
    • D
      [XFS] Lazy Superblock Counters · 92821e2b
      David Chinner committed
      When we have a couple of hundred transactions on the fly at once, they all
      typically modify the on disk superblock in some way.
      create/unlink/mkdir/rmdir modify inode counts, allocation/freeing modify
      free block counts.
      
      When these counts are modified in a transaction, they must eventually lock
      the superblock buffer and apply the mods. The buffer then remains locked
      until the transaction is committed into the incore log buffer. The result
      of this is that with enough transactions on the fly the incore superblock
      buffer becomes a bottleneck.
      
      The result of contention on the incore superblock buffer is that
      transaction rates fall - the more pressure that is put on the superblock
      buffer, the slower things go.
      
      The key to removing the contention is to not require the superblock fields
      in question to be locked. We do that by not marking the superblock dirty
      in the transaction. IOWs, we modify the incore superblock but do not
      modify the cached superblock buffer. In short, we do not log superblock
      modifications to critical fields in the superblock on every transaction.
      In fact we only do it just before we write the superblock to disk every
      sync period or just before unmount.
      
      This creates an interesting problem - if we don't log or write out the
      fields in every transaction, then how do the values get recovered after a
      crash? The answer is simple - we keep enough duplicate, logged information
      in other structures that we can reconstruct the correct count after log
      recovery has been performed.
      
      It is the AGF and AGI structures that contain the duplicate information;
      after recovery, we walk every AGI and AGF and sum their individual
      counters to get the correct value, and we do a transaction into the log to
      correct them. An optimisation of this is that if we have a clean unmount
      record, we know the value in the superblock is correct, so we can avoid
      the summation walk under normal conditions and so mount/recovery times do
      not change under normal operation.
      
      One wrinkle that was discovered during development was that the blocks
      used in the freespace btrees are never accounted for in the AGF counters.
      This was once a valid optimisation to make; when the filesystem is full,
      the free space btrees are empty and consume no space. Hence when it
      matters, the "accounting" is correct. But that means that when we do the
      AGF summations, we would not have a correct count and xfs_check would
      complain. Hence a new counter was added to track the number of blocks used
      by the free space btrees. This is an *on-disk format change*.
      
      As a result of this, lazy superblock counters are a mkfs option and at the
      moment on linux there is no way to convert an old filesystem. This is
      possible - xfs_db can be used to twiddle the right bits and then
      xfs_repair will do the format conversion for you. Similarly, you can
      convert backwards as well. At some point we'll add functionality to
      xfs_admin to do the bit twiddling easily....
      
      SGI-PV: 964999
      SGI-Modid: xfs-linux-melb:xfs-kern:28652a
      Signed-off-by: David Chinner <dgc@sgi.com>
      Signed-off-by: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Tim Shimmin <tes@sgi.com>
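The recovery-time summation described in this commit can be sketched as below. The structures and `sketch_*` names are hypothetical. How the free space btree block counter enters the sum is an assumption made for illustration: the commit message only says the counter was added so that the AGF summation comes out correct.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-allocation-group counters (AGF and AGI combined). */
struct sketch_ag {
    uint64_t freeblks;    /* free blocks recorded in the AGF */
    uint64_t btree_blks;  /* blocks used by the free space btrees */
    uint64_t icount;      /* allocated inodes recorded in the AGI */
    uint64_t ifree;       /* free inodes recorded in the AGI */
};

/* Hypothetical in-core superblock counters. */
struct sketch_sb_counts {
    uint64_t fdblocks, icount, ifree;
};

/*
 * Rebuild the superblock counters after log recovery. After a clean
 * unmount the superblock values are trusted and the walk is skipped.
 */
void sketch_recover_counts(struct sketch_sb_counts *sb,
                           const struct sketch_ag *ags, size_t nag,
                           bool clean_unmount)
{
    if (clean_unmount)
        return;

    sb->fdblocks = sb->icount = sb->ifree = 0;
    for (size_t i = 0; i < nag; i++) {
        /* Assumption: btree blocks are added back into the free count. */
        sb->fdblocks += ags[i].freeblks + ags[i].btree_blks;
        sb->icount   += ags[i].icount;
        sb->ifree    += ags[i].ifree;
    }
}
```

The clean-unmount shortcut is what keeps mount times unchanged in the common case; the walk over every AG only happens after a crash.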
  19. 08 May, 2007 1 commit
  20. 10 Feb, 2007 1 commit
  21. 20 Jun, 2006 1 commit
  22. 09 Jun, 2006 3 commits