1. 20 5月, 2011 1 次提交
    • D
      xfs: reset buffer pointers before freeing them · 44396476
      Dave Chinner 提交于
      When we free a vmapped buffer, we need to ensure the vmap address
      and length we free is the same as when it was allocated. In various
      places in the log code we change the memory the buffer is pointing
      to before issuing IO, but we never reset the buffer to point back to
      it's original memory (or no memory, if that is the case for the
      buffer).
      
      As a result, when we free the buffer it points to memory that is
      owned by something else and attempts to unmap and free it. Because
      the range does not match any known mapped range, it can trigger
      BUG_ON() traps in the vmap code, and potentially corrupt the vmap
      area tracking.
      
      Fix this by always resetting these buffers to their original state
      before freeing them.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      44396476
  2. 29 4月, 2011 1 次提交
    • C
      xfs: exact busy extent tracking · 97d3ac75
      Christoph Hellwig 提交于
      Update the extent tree in case we have to reuse a busy extent, so that it
      always is kept uptodate.  This is done by replacing the busy list searches
      with a new xfs_alloc_busy_reuse helper, which updates the busy extent tree
      in case of a reuse.  This allows us to allow reusing metadata extents
      unconditionally, and thus avoid log forces especially for allocation btree
      blocks.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      97d3ac75
  3. 08 4月, 2011 2 次提交
    • D
      xfs: convert log tail checking to a warning · da8a1a4a
      Dave Chinner 提交于
      On the Power platform, the log tail debug checks fire excessively
      causing the system to panic early in testing. The debug checks are
      known to be racy, though on x86_64 there is no evidence that they
      trigger at all.
      
      We want to keep the checks active on debug systems to alert us to
      problems with log space accounting, but we need to reduce the impact
      of a racy check on testing on the Power platform.
      
      As a result, convert the ASSERT conditions to warnings, and
      allow them to fire only once per filesystem mount. This will prevent
      false positives from interfering with testing, whilst still
      providing us with the indication that they may be a problem with log
      space accounting should that occur.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NAlex Elder <aelder@sgi.com>
      da8a1a4a
    • D
      xfs: push the AIL from memory reclaim and periodic sync · fd074841
      Dave Chinner 提交于
      When we are short on memory, we want to expedite the cleaning of
      dirty objects.  Hence when we run short on memory, we need to kick
      the AIL flushing into action to clean as many dirty objects as
      quickly as possible.  To implement this, sample the lsn of the log
      item at the head of the AIL and use that as the push target for the
      AIL flush.
      
      Further, we keep items in the AIL that are dirty that are not
      tracked any other way, so we can get objects sitting in the AIL that
      don't get written back until the AIL is pushed. Hence to get the
      filesystem to the idle state, we might need to push the AIL to flush
      out any remaining dirty objects sitting in the AIL. This requires
      the same push mechanism as the reclaim push.
      
      This patch also renames xfs_trans_ail_tail() to xfs_ail_min_lsn() to
      match the new xfs_ail_max_lsn() function introduced in this patch.
      Similarly for xfs_trans_ail_push -> xfs_ail_push.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NAlex Elder <aelder@sgi.com>
      fd074841
  4. 07 3月, 2011 1 次提交
  5. 12 1月, 2011 1 次提交
    • D
      xfs: prevent NMI timeouts in cmn_err · 73efe4a4
      Dave Chinner 提交于
      We currently have a global error message buffer in cmn_err that is
      protected by a spin lock that disables interrupts.  Recently there
      have been reports of NMI timeouts occurring when the console is
      being flooded by SCSI error reports due to cmn_err() getting stuck
      trying to print to the console while holding this lock (i.e. with
      interrupts disabled). The NMI watchdog is seeing this CPU as
      non-responding and so is triggering a panic.  While the trigger for
      the reported case is SCSI errors, pretty much anything that spams
      the kernel log could cause this to occur.
      
      Realistically the only reason that we have the intemediate message
      buffer is to prepend the correct kernel log level prefix to the log
      message. The only reason we have the lock is to protect the global
      message buffer and the only reason the message buffer is global is
      to keep it off the stack. Hence if we can avoid needing a global
      message buffer we avoid needing the lock, and we can do this with a
      small amount of cleanup and some preprocessor tricks:
      
      	1. clean up xfs_cmn_err() panic mask functionality to avoid
      	   needing debug code in xfs_cmn_err()
      	2. remove the couple of "!" message prefixes that still exist that
      	   the existing cmn_err() code steps over.
      	3. redefine CE_* levels directly to KERN_*
      	4. redefine cmn_err() and friends to use printk() directly
      	   via variable argument length macros.
      
      By doing this, we can completely remove the cmn_err() code and the
      lock that is causing the problems, and rely solely on printk()
      serialisation to ensure that we don't get garbled messages.
      
      A series of followup patches is really needed to clean up all the
      cmn_err() calls and related messages properly, but that results in a
      series that is not easily back portable to enterprise kernels. Hence
      this initial fix is only to address the direct problem in the lowest
      impact way possible.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      73efe4a4
  6. 21 12月, 2010 2 次提交
    • D
      xfs: convert grant head manipulations to lockless algorithm · d0eb2f38
      Dave Chinner 提交于
      The only thing that the grant lock remains to protect is the grant head
      manipulations when adding or removing space from the log. These calculations
      are already based on atomic variables, so we can already update them safely
      without locks. However, the grant head manpulations require atomic multi-step
      calculations to be executed, which the algorithms currently don't allow.
      
      To make these multi-step calculations atomic, convert the algorithms to
      compare-and-exchange loops on the atomic variables. That is, we sample the old
      value, perform the calculation and use atomic64_cmpxchg() to attempt to update
      the head with the new value. If the head has not changed since we sampled it,
      it will succeed and we are done. Otherwise, we rerun the calculation again from
      a new sample of the head.
      
      This allows us to remove the grant lock from around all the grant head space
      manipulations, and that effectively removes the grant lock from the log
      completely. Hence we can remove the grant lock completely from the log at this
      point.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      d0eb2f38
    • D
      xfs: introduce new locks for the log grant ticket wait queues · 3f16b985
      Dave Chinner 提交于
      The log grant ticket wait queues are currently protected by the log
      grant lock.  However, the queues are functionally independent from
      each other, and operations on them only require serialisation
      against other queue operations now that all of the other log
      variables they use are atomic values.
      
      Hence, we can make them independent of the grant lock by introducing
      new locks just to protect the lists operations. because the lists
      are independent, we can use a lock per list and ensure that reserve
      and write head queuing do not contend.
      
      To ensure forced shutdowns work correctly in conjunction with the
      new fast paths, ensure that we check whether the log has been shut
      down in the grant functions once we hold the relevant spin locks but
      before we go to sleep. This is needed to co-ordinate correctly with
      the wakeups that are issued on the ticket queues so we don't leave
      any processes sleeping on the queues during a shutdown.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      3f16b985
  7. 03 12月, 2010 1 次提交
  8. 21 12月, 2010 1 次提交
    • D
      xfs: convert l_tail_lsn to an atomic variable. · 1c3cb9ec
      Dave Chinner 提交于
      log->l_tail_lsn is currently protected by the log grant lock. The
      lock is only needed for serialising readers against writers, so we
      don't really need the lock if we make the l_tail_lsn variable an
      atomic. Converting the l_tail_lsn variable to an atomic64_t means we
      can start to peel back the grant lock from various operations.
      
      Also, provide functions to safely crack an atomic LSN variable into
      it's component pieces and to recombined the components into an
      atomic variable. Use them where appropriate.
      
      This also removes the need for explicitly holding a spinlock to read
      the l_tail_lsn on 32 bit platforms.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      
      1c3cb9ec
  9. 03 12月, 2010 1 次提交
    • D
      xfs: convert l_last_sync_lsn to an atomic variable · 84f3c683
      Dave Chinner 提交于
      log->l_last_sync_lsn is updated in only one critical spot - log
      buffer Io completion - and is protected by the grant lock here. This
      requires the grant lock to be taken for every log buffer IO
      completion. Converting the l_last_sync_lsn variable to an atomic64_t
      means that we do not need to take the grant lock in log buffer IO
      completion to update it.
      
      This also removes the need for explicitly holding a spinlock to read
      the l_last_sync_lsn on 32 bit platforms.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      84f3c683
  10. 21 12月, 2010 6 次提交
  11. 19 10月, 2010 2 次提交
  12. 10 9月, 2010 1 次提交
  13. 24 8月, 2010 1 次提交
    • D
      xfs: Reduce log force overhead for delayed logging · a44f13ed
      Dave Chinner 提交于
      Delayed logging adds some serialisation to the log force process to
      ensure that it does not deference a bad commit context structure
      when determining if a CIL push is necessary or not. It does this by
      grabing the CIL context lock exclusively, then dropping it before
      pushing the CIL if necessary. This causes serialisation of all log
      forces and pushes regardless of whether a force is necessary or not.
      As a result fsync heavy workloads (like dbench) can be significantly
      slower with delayed logging than without.
      
      To avoid this penalty, copy the current sequence from the context to
      the CIL structure when they are swapped. This allows us to do
      unlocked checks on the current sequence without having to worry
      about dereferencing context structures that may have already been
      freed. Hence we can remove the CIL context locking in the forcing
      code and only call into the push code if the current context matches
      the sequence we need to force.
      
      By passing the sequence into the push code, we can check the
      sequence again once we have the CIL lock held exclusive and abort if
      the sequence has already been pushed. This avoids a lock round-trip
      and unnecessary CIL pushes when we have racing push calls.
      
      The result is that the regression in dbench performance goes away -
      this change improves dbench performance on a ramdisk from ~2100MB/s
      to ~2500MB/s. This compares favourably to not using delayed logging
      which retuns ~2500MB/s for the same workload.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      a44f13ed
  14. 27 7月, 2010 6 次提交
  15. 24 5月, 2010 6 次提交
    • D
      xfs: forced unmounts need to push the CIL · 9da1ab18
      Dave Chinner 提交于
      If the filesystem is being shut down and the there is no log error,
      the current code forces out the current log buffers. This code now needs
      to push the CIL before it forces out the log buffers to acheive the same
      result.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      9da1ab18
    • D
      xfs: Introduce delayed logging core code · 71e330b5
      Dave Chinner 提交于
      The delayed logging code only changes in-memory structures and as
      such can be enabled and disabled with a mount option. Add the mount
      option and emit a warning that this is an experimental feature that
      should not be used in production yet.
      
      We also need infrastructure to track committed items that have not
      yet been written to the log. This is what the Committed Item List
      (CIL) is for.
      
      The log item also needs to be extended to track the current log
      vector, the associated memory buffer and it's location in the Commit
      Item List. Extend the log item and log vector structures to enable
      this tracking.
      
      To maintain the current log format for transactions with delayed
      logging, we need to introduce a checkpoint transaction and a context
      for tracking each checkpoint from initiation to transaction
      completion.  This includes adding a log ticket for tracking space
      log required/used by the context checkpoint.
      
      To track all the changes we need an io vector array per log item,
      rather than a single array for the entire transaction. Using the new
      log vector structure for this requires two passes - the first to
      allocate the log vector structures and chain them together, and the
      second to fill them out.  This log vector chain can then be passed
      to the CIL for formatting, pinning and insertion into the CIL.
      
      Formatting of the log vector chain is relatively simple - it's just
      a loop over the iovecs on each log vector, but it is made slightly
      more complex because we re-write the iovec after the copy to point
      back at the memory buffer we just copied into.
      
      This code also needs to pin log items. If the log item is not
      already tracked in this checkpoint context, then it needs to be
      pinned. Otherwise it is already pinned and we don't need to pin it
      again.
      
      The only other complexity is calculating the amount of new log space
      the formatting has consumed. This needs to be accounted to the
      transaction in progress, and the accounting is made more complex
      becase we need also to steal space from it for log metadata in the
      checkpoint transaction. Calculate all this at insert time and update
      all the tickets, counters, etc correctly.
      
      Once we've formatted all the log items in the transaction, attach
      the busy extents to the checkpoint context so the busy extents live
      until checkpoint completion and can be processed at that point in
      time. Transactions can then be freed at this point in time.
      
      Now we need to issue checkpoints - we are tracking the amount of log space
      used by the items in the CIL, so we can trigger background checkpoints when the
      space usage gets to a certain threshold. Otherwise, checkpoints need ot be
      triggered when a log synchronisation point is reached - a log force event.
      
      Because the log write code already handles chained log vectors, writing the
      transaction is trivial, too. Construct a transaction header, add it
      to the head of the chain and write it into the log, then issue a
      commit record write. Then we can release the checkpoint log ticket
      and attach the context to the log buffer so it can be called during
      Io completion to complete the checkpoint.
      
      We also need to allow for synchronising multiple in-flight
      checkpoints. This is needed for two things - the first is to ensure
      that checkpoint commit records appear in the log in the correct
      sequence order (so they are replayed in the correct order). The
      second is so that xfs_log_force_lsn() operates correctly and only
      flushes and/or waits for the specific sequence it was provided with.
      
      To do this we need a wait variable and a list tracking the
      checkpoint commits in progress. We can walk this list and wait for
      the checkpoints to change state or complete easily, an this provides
      the necessary synchronisation for correct operation in both cases.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      71e330b5
    • D
      xfs: make the log ticket ID available outside the log infrastructure · 955833cf
      Dave Chinner 提交于
      The ticket ID is needed to uniquely identify transactions when doing busy
      extent matching. Delayed logging changes the lifecycle of busy extents with
      respect to the transaction structure lifecycle. Hence we can no longer use
      the transaction structure as a means of determining the owner of the busy
      extent as it may be freed and reused while the busy extent is still active.
      
      This commit provides the infrastructure to access the xlog_tid_t held in the
      ticket from a transaction handle. This avoids the need for callers to peek
      into the transaction and log structures to find this out.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      955833cf
    • D
      xfs: clean up log ticket overrun debug output · 169a7b07
      Dave Chinner 提交于
      Push the error message output when a ticket overrun is detected
      into the ticket printing functions. Also remove the debug version
      of the code as the production version will still panic just as
      effectively on a debug kernel via the panic mask being set.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      169a7b07
    • D
      xfs: allow log ticket allocation to take allocation flags · 3383ca57
      Dave Chinner 提交于
      Delayed logging currently requires ticket allocation to succeed, so
      we need to be able to sleep on allocation. It also should not allow
      memory allocation to recurse into the filesystem. hence we need to
      pass allocation flags directing the type of allocation the caller
      requires.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      3383ca57
    • D
      xfs: Don't reuse the same transaction ID for duplicated transactions. · 524ee36f
      Dave Chinner 提交于
      The transaction ID is written into the log as the unique identifier
      for transactions during recover. When duplicating a transaction, we
      reuse the log ticket, which means it has the same transaction ID as
      the previous transaction.
      
      Rather than regenerating a random transaction ID for the duplicated
      transaction, just add one to the current ID so that duplicated
      transaction can be easily spotted in the log and during recovery
      during problem diagnosis.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      524ee36f
  16. 19 5月, 2010 7 次提交