1. 26 Sep, 2016 7 commits
    • B
      xfs: log recovery tracepoints to track current lsn and buffer submission · 5cd9cee9
      Brian Foster committed
      Log recovery has particular rules around buffer submission along with
      tricky corner cases where independent transactions can share an LSN. As
      such, it can be difficult to follow when/why buffers are submitted
      during recovery.
      
      Add a couple tracepoints to post the current LSN of a record when a new
      record is being processed and when a buffer is being skipped due to LSN
      ordering. Also, update the recover item class to include the LSN of the
      current transaction for the item being processed.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • B
      xfs: update metadata LSN in buffers during log recovery · 60a4a222
      Brian Foster committed
      Log recovery is currently broken for v5 superblocks in that it never
      updates the metadata LSN of buffers written out during recovery. The
      metadata LSN is recorded in various bits of metadata to provide recovery
      ordering criteria that prevents transient corruption states reported by
      buffer write verifiers. Without such ordering logic, buffer updates can
      be replayed out of order and lead to false positive transient corruption
      states. This is generally not a corruption vector on its own, but
      corruption detection shuts down the filesystem and ultimately prevents a
      mount if it occurs during log recovery. This requires an xfs_repair run
      that clears the log and potentially loses filesystem updates.
      
      This problem is avoided in most cases as metadata writes during normal
      filesystem operation update the metadata LSN appropriately. The problem
      with log recovery not updating metadata LSNs manifests if the system
      happens to crash shortly after log recovery itself. In this scenario, it
      is possible for log recovery to complete all metadata I/O such that the
      filesystem is consistent. If a crash occurs after that point but before
      the log tail is pushed forward by subsequent operations, however, the
      next mount performs the same log recovery over again. If a buffer is
      updated multiple times in the dirty range of the log, an earlier update
      in the log might not be valid based on the current state of the
      associated buffer after all of the updates in the log had been replayed
      (before the previous crash). If a verifier happens to detect such a
      problem, the filesystem claims corruption and immediately shuts down.
      
      This commonly manifests in practice as directory block verifier failures
      such as the following, likely due to directory verifiers being
      particularly detailed in their checks as compared to most others:
      
        ...
        Mounting V5 Filesystem
        XFS (dm-0): Starting recovery (logdev: internal)
        XFS (dm-0): Internal error XFS_WANT_CORRUPTED_RETURN at line ... of \
          file fs/xfs/libxfs/xfs_dir2_data.c.  Caller xfs_dir3_data_verify ...
        ...
      
      Update log recovery to update the metadata LSN of recovered buffers.
      Since metadata LSNs are already updated by write verifier functions via
      attached log items, attach a dummy log item to the buffer during
      validation and explicitly set the LSN of the current transaction. This
      ensures that the metadata LSN of a buffer is updated based on whether
      the recovery I/O actually completes, and if so, that subsequent recovery
      attempts identify that the buffer is already up to date with respect to
      the current transaction.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
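The ordering rule this commit message relies on can be sketched in isolation. This is a toy model, not the kernel code: `struct toy_buf` and `toy_replay` are hypothetical names, and LSNs are treated as plain 64-bit integers rather than the packed cycle/block values XFS uses. It shows why stamping the buffer with the LSN of the replayed update lets a repeated recovery pass skip updates that are already reflected in the buffer.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a metadata buffer; not an XFS structure. */
struct toy_buf {
    uint64_t lsn;      /* LSN of the last update written to this buffer */
    int      payload;  /* stands in for the metadata contents */
};

/*
 * Replay one logged update. The update is applied only if it is newer than
 * what the buffer already holds. On success the buffer is stamped with the
 * LSN of the update, which models the step this commit adds to recovery.
 * Returns true if the update was applied, false if it was skipped.
 */
static bool toy_replay(struct toy_buf *bp, uint64_t update_lsn, int new_payload)
{
    if (bp->lsn >= update_lsn)
        return false;          /* buffer already up to date: skip */
    bp->payload = new_payload;
    bp->lsn = update_lsn;      /* record the metadata LSN */
    return true;
}
```

Replaying updates at LSN 10 and then 20 applies both. Replaying the LSN 10 update again, as a second recovery pass would, is skipped because the buffer already carries LSN 20.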
    • B
      xfs: don't warn on buffers not being recovered due to LSN · 040c52c0
      Brian Foster committed
      The log recovery buffer validation function is invoked in cases where a
      buffer update may be skipped due to LSN ordering. If the validation
      function happens to come across directory conversion situations (e.g., a
      dir3 block to data conversion), it may warn about seeing a buffer log
      format of one type and a buffer with a magic number of another.
      
      This warning is not valid as the buffer update is ultimately skipped.
      This is indicated by a current_lsn of NULLCOMMITLSN provided by the
      caller. As such, update xlog_recover_validate_buf_type() to only warn in
      such cases when a buffer update is expected.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
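The condition described in this message can be written as a small predicate. This is only an illustration: `toy_should_warn` and `TOY_NULLCOMMITLSN` are hypothetical names, and the sentinel value of -1 is an assumption made for the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed sentinel meaning "no update will be applied"; hypothetical name. */
#define TOY_NULLCOMMITLSN ((int64_t)-1)

/*
 * Decide whether a mismatch between the logged buffer type and the magic
 * number found in the buffer deserves a warning. If the caller passed the
 * sentinel LSN, the update is going to be skipped, so the mismatch is
 * expected and harmless.
 */
static bool toy_should_warn(int64_t current_lsn, bool type_matches_magic)
{
    if (current_lsn == TOY_NULLCOMMITLSN)
        return false;
    return !type_matches_magic;
}
```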
    • B
      xfs: pass current lsn to log recovery buffer validation · 22db9af2
      Brian Foster committed
      The current LSN must be available to the buffer validation function to
      provide the ability to update the metadata LSN of the buffer. Pass the
      current_lsn value down to xlog_recover_validate_buf_type() in
      preparation.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • B
      xfs: rework log recovery to submit buffers on LSN boundaries · 12818d24
      Brian Foster committed
      The fix to log recovery to update the metadata LSN in recovered buffers
      introduces the requirement that a buffer is submitted only once per
      current LSN. Log recovery currently submits buffers on transaction
      boundaries. This is not sufficient as the abstraction between log
      records and transactions allows for various scenarios where multiple
      transactions can share the same current LSN. If independent transactions
      share an LSN and both modify the same buffer, log recovery can
      incorrectly skip updates and leave the filesystem in an inconsistent
      state.
      
      In preparation for proper metadata LSN updates during log recovery,
      update log recovery to submit buffers for write on LSN change boundaries
      rather than transaction boundaries. Explicitly track the current LSN in
      a new struct xlog field to handle the various corner cases of when the
      current LSN may or may not change.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
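The submission policy described here can be modelled with a toy recovery loop. Everything in it is hypothetical (`struct toy_update`, `struct toy_recovery`, `toy_submit`, `toy_recover`); LSNs are plain nonzero integers and the "disk" is an array. The point of the sketch is that queued buffers are written, and stamped with an LSN, only when the LSN of the incoming record changes, so two updates that share an LSN and touch the same buffer are both applied before the stamp is set.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_NBUFS 4

/* One logged buffer update; LSNs are assumed to be nonzero. */
struct toy_update {
    uint64_t lsn;
    int      buf;    /* index of the buffer being modified */
    int      value;  /* new contents */
};

struct toy_recovery {
    uint64_t current_lsn;          /* LSN of the record being replayed */
    uint64_t disk_lsn[TOY_NBUFS];  /* metadata LSN stamped on "disk" */
    int      disk_val[TOY_NBUFS];  /* contents on "disk" */
    int      pend_val[TOY_NBUFS];  /* contents queued for write */
    int      pending[TOY_NBUFS];   /* nonzero if the buffer is queued */
    int      submits;              /* number of non-empty submissions */
};

/* Write out every queued buffer and stamp it with the current LSN. */
static void toy_submit(struct toy_recovery *r)
{
    int wrote = 0;

    for (int i = 0; i < TOY_NBUFS; i++) {
        if (!r->pending[i])
            continue;
        r->disk_val[i] = r->pend_val[i];
        r->disk_lsn[i] = r->current_lsn;
        r->pending[i] = 0;
        wrote = 1;
    }
    if (wrote)
        r->submits++;
}

/* Replay a log, submitting queued buffers whenever the LSN changes. */
static void toy_recover(struct toy_recovery *r,
                        const struct toy_update *log, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (log[i].lsn != r->current_lsn) {
            toy_submit(r);                 /* flush under the old LSN */
            r->current_lsn = log[i].lsn;
        }
        if (r->disk_lsn[log[i].buf] >= log[i].lsn)
            continue;                      /* already on disk: skip */
        r->pend_val[log[i].buf] = log[i].value;
        r->pending[log[i].buf] = 1;
    }
    toy_submit(r);                         /* flush the final LSN */
}
```

If the queue were flushed after every transaction instead, the first of two same-LSN updates would stamp the buffer and the second would be skipped by the LSN check, which is the failure mode the message describes.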
    • D
      xfs: quiesce the filesystem after recovery on readonly mount · ddeb14f4
      Dave Chinner committed
      Recently we've had a number of reports where log recovery on a v5
      filesystem has reported corruptions that looked to be caused by
      recovery being re-run over the top of already-recovered
      metadata. This has uncovered a bug in recovery (fixed elsewhere)
      but the vector that caused this was largely unknown.
      
      A kdump test started tripping over this problem - the system
      would be crashed, the kdump kernel and environment would boot and
      dump the kernel core image, and then the system would reboot. After
      reboot, the root filesystem was triggering log recovery and
      corruptions were being detected. The metadumps indicated the above
      log recovery issue.
      
      What is happening is that the kdump kernel and environment is
      mounting the root device read-only to find the binaries needed to do
      its work. The result of this is that it is running log recovery.
      However, because there were unlinked files and EFIs to be processed
      by recovery, the completion of phase 1 of log recovery could not
      mark the log clean. And because it's a read-only mount, the unmount
      process does not write records to the log to mark it clean, either.
      Hence on the next mount of the filesystem, log recovery was run
      again across all the metadata that had already been recovered and
      this is what triggered corruption warnings.
      
      To avoid this problem, we need to ensure that a read-only mount
      always updates the log when it completes the second phase of
      recovery. We already handle this sort of issue with rw->ro remount
      transitions, so the solution is as simple as quiescing the
      filesystem at the appropriate time during the mount process. This
      results in the log being marked clean so the mount behaviour
      recorded in the logs on repeated RO mounts will change (i.e. log
      recovery will no longer be run on every mount until a RW mount is
      done). This is a user visible change in behaviour, but it is
      harmless.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • D
      xfs: remote attribute blocks aren't really userdata · 292378ed
      Dave Chinner committed
      When adding a new remote attribute, we write the attribute to the
      new extent before the allocation transaction is committed. This
      means we cannot reuse busy extents as that violates crash
      consistency semantics. Hence we currently treat remote attribute
      extent allocation like userdata because it has the same overwrite
      ordering constraints as userdata.
      
      Unfortunately, this also allows the allocator to incorrectly apply
      extent size hints to the remote attribute extent allocation. This
      results in interesting failures, such as transaction block
      reservation overruns and in-memory inode attribute fork corruption.
      
      To fix this, we need to separate the busy extent reuse configuration
      from the userdata configuration. This changes the definition of
      XFS_BMAPI_METADATA slightly - it now means that allocation is
      metadata and reuse of busy extents is acceptable due to the metadata
      ordering semantics of the journal. If this flag is not set, it
      means the allocation has unordered data writeback, and hence
      busy extent reuse is not allowed. It no longer implies the
      allocation is for user data, just that the data write will not be
      strictly ordered. This matches the semantics for both user data
      and remote attribute block allocation.
      
      As such, this patch changes the "userdata" field to a "datatype"
      field, and adds a "no busy reuse" flag to the field.
      When we detect an unordered data extent allocation, we immediately set
      the no reuse flag. We then set the "user data" flags based on the
      inode fork we are allocating the extent to. Hence we only set
      userdata flags on data fork allocations now and consider attribute
      fork remote extents to be an unordered metadata extent.
      
      The result is that remote attribute extents now have the expected
      allocation semantics, and the data fork allocation behaviour is
      completely unchanged.
      
      It should be noted that there may be other ways to fix this (e.g.
      use ordered metadata buffers for the remote attribute extent data
      write) but they are more invasive and difficult to validate both
      from a design and implementation POV. Hence this patch takes the
      simple, obvious route to fixing the problem...
      Reported-and-tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
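The flag split described in this message can be illustrated with a small bitfield. The names `TOY_USERDATA`, `TOY_NOBUSY`, `toy_datatype`, `toy_may_reuse_busy` and `toy_applies_extsize_hint` are hypothetical and the bit values are arbitrary; this is a sketch of the decision logic as the message describes it, not the kernel's definitions.

```c
#include <stdbool.h>

/* Hypothetical "datatype" bits; values chosen for the sketch only. */
#define TOY_USERDATA (1u << 0)  /* extent holds user data */
#define TOY_NOBUSY   (1u << 1)  /* busy extents must not be reused */

/*
 * Build the datatype field for an allocation. An allocation that is not
 * journalled metadata has unordered writeback, so busy extent reuse is
 * forbidden. Only data fork allocations are additionally marked as user
 * data; attribute fork remote extents are unordered metadata.
 */
static unsigned int toy_datatype(bool bmapi_metadata, bool data_fork)
{
    unsigned int datatype = 0;

    if (!bmapi_metadata) {
        datatype |= TOY_NOBUSY;
        if (data_fork)
            datatype |= TOY_USERDATA;
    }
    return datatype;
}

/* Busy extents may be reused only when the no-reuse bit is clear. */
static bool toy_may_reuse_busy(unsigned int datatype)
{
    return !(datatype & TOY_NOBUSY);
}

/* Extent size hints are a user data concept in this model. */
static bool toy_applies_extsize_hint(unsigned int datatype)
{
    return (datatype & TOY_USERDATA) != 0;
}
```

With this split, a remote attribute allocation gets only the no-reuse bit, so it keeps the overwrite ordering constraint without picking up extent size hints.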
  2. 30 Aug, 2016 1 commit
    • D
      xfs: track log done items directly in the deferred pending work item · ea78d808
      Darrick J. Wong committed
      Christoph reports slab corruption when a deferred refcount update
      aborts during _defer_finish().  The cause of this was broken log item
      state tracking in xfs_defer_pending -- upon an abort,
      _defer_trans_abort() will call abort_intent on all intent items,
      including the ones that have already had a done item attached.
      
      This is incorrect because each intent item has 2 refcounts: the first
      is released when the intent item is committed to the log; and the
      second is released when the _done_ item is committed to the log, or
      by the intent creator if there is no done item.  In other words, once
      we log the done item, responsibility for releasing the intent item's
      second refcount is transferred to the done item and /must not/ be
      performed by anything else.
      
      The dfp_committed flag should have been tracking whether or not we had
      a done item so that _defer_trans_abort could decide if it needs to
      abort the intent item, but due to a thinko this was not the case.  Rip
      it out and track the done item directly so that we do the right thing
      w.r.t. intent item freeing.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reported-by: Christoph Hellwig <hch@infradead.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
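The ownership rule for the intent item's second reference can be sketched as follows. All names (`struct toy_intent`, `struct toy_pending`, `toy_intent_put`, `toy_abort`, `toy_done_committed`) are hypothetical; the sketch only models the rule stated in the message, where tracking the done item directly tells the abort path whether it still owns the second reference.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical intent item with a simple reference count. */
struct toy_intent {
    int  refcount;
    bool freed;
};

/* Hypothetical pending work item that tracks its done item directly. */
struct toy_pending {
    struct toy_intent *intent;
    void              *done;   /* non-NULL once a done item is logged */
};

/* Drop one reference and mark the intent freed when the last one goes. */
static void toy_intent_put(struct toy_intent *ip)
{
    if (--ip->refcount == 0)
        ip->freed = true;
}

/*
 * Abort path: release the second reference only if no done item exists.
 * Once a done item has been logged, that reference belongs to it.
 */
static void toy_abort(struct toy_pending *dfp)
{
    if (dfp->done == NULL)
        toy_intent_put(dfp->intent);
}

/* Called when the done item commits: it releases the reference it owns. */
static void toy_done_committed(struct toy_pending *dfp)
{
    toy_intent_put(dfp->intent);
}
```

In this model an abort after the done item is logged leaves the count alone, so the later done-item commit performs the only release and the intent is freed exactly once.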
  3. 29 Aug, 2016 1 commit
  4. 26 Aug, 2016 7 commits
  5. 17 Aug, 2016 13 commits
  6. 15 Aug, 2016 3 commits
  7. 14 Aug, 2016 4 commits
    • L
      Merge tag 'fixes-for-linus-4.8' of... · 118253a5
      Linus Torvalds committed
      Merge tag 'fixes-for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
      
      Pull h8300 and unicore32 architecture fixes from Guenter Roeck:
       "Two patches to fix h8300 and unicore32 builds.
      
        unicore32 builds have been broken since v4.6.  The fix has been
        available in -next since March of this year.
      
        h8300 builds have been broken since the last commit window.  The fix
        has been available in -next since June of this year"
      
      * tag 'fixes-for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
        h8300: Add missing include file to asm/io.h
        unicore32: mm: Add missing parameter to arch_vma_access_permitted
    • L
      Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · 120c5475
      Linus Torvalds committed
      Pull arm64 fixes from Catalin Marinas:
      
       - support for nr_cpus= command line argument (maxcpus was previously
         changed to allow secondary CPUs to be hot-plugged)
      
       - ARM PMU interrupt handling fix
      
       - fix potential TLB conflict in the hibernate code
      
       - improved handling of EL1 instruction aborts (better error reporting)
      
       - removal of useless jprobes code for stack saving/restoring
      
       - defconfig updates
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        arm64: defconfig: enable CONFIG_LOCALVERSION_AUTO
        arm64: defconfig: add options for virtualization and containers
        arm64: hibernate: handle allocation failures
        arm64: hibernate: avoid potential TLB conflict
        arm64: Handle el1 synchronous instruction aborts cleanly
        arm64: Remove stack duplicating code from jprobes
        drivers/perf: arm-pmu: Fix handling of SPI lacking "interrupt-affinity" property
        drivers/perf: arm-pmu: convert arm_pmu_mutex to spinlock
        arm64: Support hard limit of cpu count by nr_cpus
    • L
      Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 329f4152
      Linus Torvalds committed
      Pull KVM fixes from Radim Krčmář:
       "KVM:
         - lock kvm_device list to prevent corruption on device creation.
      
        PPC:
         - split debugfs initialization from creation of the xics device to
           unlock the newly taken kvm lock earlier.
      
        s390:
         - prevent userspace from triggering two WARN_ON_ONCE.
      
        MIPS:
         - fix several issues in the management of TLB faults (Cc: stable)"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        MIPS: KVM: Propagate kseg0/mapped tlb fault errors
        MIPS: KVM: Fix gfn range check in kseg0 tlb faults
        MIPS: KVM: Add missing gfn range check
        MIPS: KVM: Fix mapped fault broken commpage handling
        KVM: Protect device ops->create and list_add with kvm->lock
        KVM: PPC: Move xics_debugfs_init out of create
        KVM: s390: reset KVM_REQ_MMU_RELOAD if mapping the prefix failed
        KVM: s390: set the prefix initially properly
    • L
      Merge branch 'for-linus' of git://git.kernel.dk/linux-block · a1e21033
      Linus Torvalds committed
      Pull block fixes from Jens Axboe:
      
       - an NVMe fix from Gabriel, fixing a suspend/resume issue on some
         setups
      
       - addition of a few missing entries in the block queue sysfs
         documentation, from Joe
      
       - a fix for a sparse shadow warning for the bvec iterator, from
         Johannes
      
       - a writeback deadlock involving raid issuing barriers, and not
         flushing the plug when we wakeup the flusher threads.  From
         Konstantin
      
       - a set of patches for the NVMe target/loop/rdma code, from Roland and
         Sagi
      
      * 'for-linus' of git://git.kernel.dk/linux-block:
        bvec: avoid variable shadowing warning
        doc: update block/queue-sysfs.txt entries
        nvme: Suspend all queues before deletion
        mm, writeback: flush plugged IO in wakeup_flusher_threads()
        nvme-rdma: Remove unused includes
        nvme-rdma: start async event handler after reconnecting to a controller
        nvmet: Fix controller serial number inconsistency
        nvmet-rdma: Don't use the inline buffer in order to avoid allocation for small reads
        nvmet-rdma: Correctly handle RDMA device hot removal
        nvme-rdma: Make sure to shutdown the controller if we can
        nvme-loop: Remove duplicate call to nvme_remove_namespaces
        nvme-rdma: Free the I/O tags when we delete the controller
        nvme-rdma: Remove duplicate call to nvme_remove_namespaces
        nvme-rdma: Fix device removal handling
        nvme-rdma: Queue ns scanning after a sucessful reconnection
        nvme-rdma: Don't leak uninitialized memory in connect request private data
  8. 13 Aug, 2016 4 commits