1. 04 10月, 2016 8 次提交
  2. 03 10月, 2016 1 次提交
  3. 26 9月, 2016 7 次提交
    • B
      xfs: log recovery tracepoints to track current lsn and buffer submission · 5cd9cee9
      Brian Foster 提交于
      Log recovery has particular rules around buffer submission along with
      tricky corner cases where independent transactions can share an LSN. As
      such, it can be difficult to follow when/why buffers are submitted
      during recovery.
      
      Add a couple tracepoints to post the current LSN of a record when a new
      record is being processed and when a buffer is being skipped due to LSN
      ordering. Also, update the recover item class to include the LSN of the
      current transaction for the item being processed.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      5cd9cee9
    • B
      xfs: update metadata LSN in buffers during log recovery · 60a4a222
      Brian Foster 提交于
      Log recovery is currently broken for v5 superblocks in that it never
      updates the metadata LSN of buffers written out during recovery. The
      metadata LSN is recorded in various bits of metadata to provide recovery
      ordering criteria that prevents transient corruption states reported by
      buffer write verifiers. Without such ordering logic, buffer updates can
      be replayed out of order and lead to false positive transient corruption
      states. This is generally not a corruption vector on its own, but
      corruption detection shuts down the filesystem and ultimately prevents a
      mount if it occurs during log recovery. This requires an xfs_repair run
      that clears the log and potentially loses filesystem updates.
      
      This problem is avoided in most cases as metadata writes during normal
      filesystem operation update the metadata LSN appropriately. The problem
      with log recovery not updating metadata LSNs manifests if the system
      happens to crash shortly after log recovery itself. In this scenario, it
      is possible for log recovery to complete all metadata I/O such that the
      filesystem is consistent. If a crash occurs after that point but before
      the log tail is pushed forward by subsequent operations, however, the
      next mount performs the same log recovery over again. If a buffer is
      updated multiple times in the dirty range of the log, an earlier update
      in the log might not be valid based on the current state of the
      associated buffer after all of the updates in the log had been replayed
      (before the previous crash). If a verifier happens to detect such a
      problem, the filesystem claims corruption and immediately shuts down.
      
      This commonly manifests in practice as directory block verifier failures
      such as the following, likely due to directory verifiers being
      particularly detailed in their checks as compared to most others:
      
        ...
        Mounting V5 Filesystem
        XFS (dm-0): Starting recovery (logdev: internal)
        XFS (dm-0): Internal error XFS_WANT_CORRUPTED_RETURN at line ... of \
          file fs/xfs/libxfs/xfs_dir2_data.c.  Caller xfs_dir3_data_verify ...
        ...
      
      Update log recovery to update the metadata LSN of recovered buffers.
      Since metadata LSNs are already updated by write verifer functions via
      attached log items, attach a dummy log item to the buffer during
      validation and explicitly set the LSN of the current transaction. This
      ensures that the metadata LSN of a buffer is updated based on whether
      the recovery I/O actually completes, and if so, that subsequent recovery
      attempts identify that the buffer is already up to date with respect to
      the current transaction.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      60a4a222
    • B
      xfs: don't warn on buffers not being recovered due to LSN · 040c52c0
      Brian Foster 提交于
      The log recovery buffer validation function is invoked in cases where a
      buffer update may be skipped due to LSN ordering. If the validation
      function happens to come across directory conversion situations (e.g., a
      dir3 block to data conversion), it may warn about seeing a buffer log
      format of one type and a buffer with a magic number of another.
      
      This warning is not valid as the buffer update is ultimately skipped.
      This is indicated by a current_lsn of NULLCOMMITLSN provided by the
      caller. As such, update xlog_recover_validate_buf_type() to only warn in
      such cases when a buffer update is expected.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      040c52c0
    • B
      xfs: pass current lsn to log recovery buffer validation · 22db9af2
      Brian Foster 提交于
      The current LSN must be available to the buffer validation function to
      provide the ability to update the metadata LSN of the buffer. Pass the
      current_lsn value down to xlog_recover_validate_buf_type() in
      preparation.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      22db9af2
    • B
      xfs: rework log recovery to submit buffers on LSN boundaries · 12818d24
      Brian Foster 提交于
      The fix to log recovery to update the metadata LSN in recovered buffers
      introduces the requirement that a buffer is submitted only once per
      current LSN. Log recovery currently submits buffers on transaction
      boundaries. This is not sufficient as the abstraction between log
      records and transactions allows for various scenarios where multiple
      transactions can share the same current LSN. If independent transactions
      share an LSN and both modify the same buffer, log recovery can
      incorrectly skip updates and leave the filesystem in an inconsisent
      state.
      
      In preparation for proper metadata LSN updates during log recovery,
      update log recovery to submit buffers for write on LSN change boundaries
      rather than transaction boundaries. Explicitly track the current LSN in
      a new struct xlog field to handle the various corner cases of when the
      current LSN may or may not change.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      12818d24
    • D
      xfs: quiesce the filesystem after recovery on readonly mount · ddeb14f4
      Dave Chinner 提交于
      Recently we've had a number of reports where log recovery on a v5
      filesystem has reported corruptions that looked to be caused by
      recovery being re-run over the top of an already-recovered
      metadata. This has uncovered a bug in recovery (fixed elsewhere)
      but the vector that caused this was largely unknown.
      
      A kdump test started tripping over this problem - the system
      would be crashed, the kdump kernel and environment would boot and
      dump the kernel core image, and then the system would reboot. After
      reboot, the root filesystem was triggering log recovery and
      corruptions were being detected. The metadumps indicated the above
      log recovery issue.
      
      What is happening is that the kdump kernel and environment is
      mounting the root device read-only to find the binaries needed to do
      it's work. The result of this is that it is running log recovery.
      However, because there were unlinked files and EFIs to be processed
      by recovery, the completion of phase 1 of log recovery could not
      mark the log clean. And because it's a read-only mount, the unmount
      process does not write records to the log to mark it clean, either.
      Hence on the next mount of the filesystem, log recovery was run
      again across all the metadata that had already been recovered and
      this is what triggered corruption warnings.
      
      To avoid this problem, we need to ensure that a read-only mount
      always updates the log when it completes the second phase of
      recovery. We already handle this sort of issue with rw->ro remount
      transitions, so the solution is as simple as quiescing the
      filesystem at the appropriate time during the mount process. This
      results in the log being marked clean so the mount behaviour
      recorded in the logs on repeated RO mounts will change (i.e. log
      recovery will no longer be run on every mount until a RW mount is
      done). This is a user visible change in behaviour, but it is
      harmless.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      ddeb14f4
    • D
      xfs: remote attribute blocks aren't really userdata · 292378ed
      Dave Chinner 提交于
      When adding a new remote attribute, we write the attribute to the
      new extent before the allocation transaction is committed. This
      means we cannot reuse busy extents as that violates crash
      consistency semantics. Hence we currently treat remote attribute
      extent allocation like userdata because it has the same overwrite
      ordering constraints as userdata.
      
      Unfortunately, this also allows the allocator to incorrectly apply
      extent size hints to the remote attribute extent allocation. This
      results in interesting failures, such as transaction block
      reservation overruns and in-memory inode attribute fork corruption.
      
      To fix this, we need to separate the busy extent reuse configuration
      from the userdata configuration. This changes the definition of
      XFS_BMAPI_METADATA slightly - it now means that allocation is
      metadata and reuse of busy extents is acceptible due to the metadata
      ordering semantics of the journal. If this flag is not set, it
      means the allocation is that has unordered data writeback, and hence
      busy extent reuse is not allowed. It no longer implies the
      allocation is for user data, just that the data write will not be
      strictly ordered. This matches the semantics for both user data
      and remote attribute block allocation.
      
      As such, This patch changes the "userdata" field to a "datatype"
      field, and adds a "no busy reuse" flag to the field.
      When we detect an unordered data extent allocation, we immediately set
      the no reuse flag. We then set the "user data" flags based on the
      inode fork we are allocating the extent to. Hence we only set
      userdata flags on data fork allocations now and consider attribute
      fork remote extents to be an unordered metadata extent.
      
      The result is that remote attribute extents now have the expected
      allocation semantics, and the data fork allocation behaviour is
      completely unchanged.
      
      It should be noted that there may be other ways to fix this (e.g.
      use ordered metadata buffers for the remote attribute extent data
      write) but they are more invasive and difficult to validate both
      from a design and implementation POV. Hence this patch takes the
      simple, obvious route to fixing the problem...
      Reported-and-tested-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      292378ed
  4. 19 9月, 2016 15 次提交
  5. 14 9月, 2016 4 次提交
    • E
      xfs: normalize "infinite" retries in error configs · 77169812
      Eric Sandeen 提交于
      As it stands today, the "fail immediately" vs. "retry forever"
      values for max_retries and retry_timeout_seconds in the xfs metadata
      error configurations are not consistent.
      
      A retry_timeout_seconds of 0 means "retry forever," but a
      max_retries of 0 means "fail immediately."
      
      retry_timeout_seconds < 0 is disallowed, while max_retries == -1
      means "retry forever."
      
      Make this consistent across the error configs, such that a value of
      0 means "fail immediately" (i.e. wait 0 seconds, or retry 0 times),
      and a value of -1 always means "retry forever."
      
      This makes retry_timeout a signed long to accommodate the -1, even
      though it stores jiffies.  Given our limit of a 1 day maximum
      timeout, this should be sufficient even at much higher HZ values
      than we have available today.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      77169812
    • X
      xfs: fix signed integer overflow · 79c350e4
      Xie XiuQi 提交于
      Use 1U for unsigned int to avoid a overflow warning from UBSAN.
      
      [   31.910858] UBSAN: Undefined behaviour in fs/xfs/xfs_buf_item.c:889:25
      [   31.911252] signed integer overflow:
      [   31.911478] -2147483648 - 1 cannot be represented in type 'int'
      [   31.911846] CPU: 1 PID: 1011 Comm: tuned Tainted: G    B          ---- -------   3.10.0-327.28.3.el7.x86_64 #1
      [   31.911857] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 01/07/2011
      [   31.911866]  1ffff1004069cd3b 0000000076bec3fd ffff8802034e69a0 ffffffff81ee3140
      [   31.911883]  ffff8802034e69b8 ffffffff81ee31fd ffffffffa0ad79e0 ffff8802034e6b20
      [   31.911898]  ffffffff81ee46e2 0000002d515470c0 0000000000000001 0000000041b58ab3
      [   31.911913] Call Trace:
      [   31.911932]  [<ffffffff81ee3140>] dump_stack+0x1e/0x20
      [   31.911947]  [<ffffffff81ee31fd>] ubsan_epilogue+0x12/0x55
      [   31.911964]  [<ffffffff81ee46e2>] handle_overflow+0x1ba/0x215
      [   31.912083]  [<ffffffff81ee4798>] __ubsan_handle_sub_overflow+0x2a/0x31
      [   31.912204]  [<ffffffffa08676fb>] xfs_buf_item_log+0x34b/0x3f0 [xfs]
      [   31.912314]  [<ffffffffa0880490>] xfs_trans_log_buf+0x120/0x260 [xfs]
      [   31.912402]  [<ffffffffa079a890>] xfs_btree_log_recs+0x80/0xc0 [xfs]
      [   31.912490]  [<ffffffffa07a29f8>] xfs_btree_delrec+0x11a8/0x2d50 [xfs]
      [   31.913589]  [<ffffffffa07a86f9>] xfs_btree_delete+0xc9/0x260 [xfs]
      [   31.913762]  [<ffffffffa075b5cf>] xfs_free_ag_extent+0x63f/0xe20 [xfs]
      [   31.914339]  [<ffffffffa075ec0f>] xfs_free_extent+0x2af/0x3e0 [xfs]
      [   31.914641]  [<ffffffffa0801b2b>] xfs_bmap_finish+0x32b/0x4b0 [xfs]
      [   31.914841]  [<ffffffffa083c2e7>] xfs_itruncate_extents+0x3b7/0x740 [xfs]
      [   31.915216]  [<ffffffffa08342fa>] xfs_setattr_size+0x60a/0x860 [xfs]
      [   31.915471]  [<ffffffffa08345ea>] xfs_vn_setattr+0x9a/0xe0 [xfs]
      [   31.915590]  [<ffffffff8149ad38>] notify_change+0x5c8/0x8a0
      [   31.915607]  [<ffffffff81450f22>] do_truncate+0x122/0x1d0
      [   31.915640]  [<ffffffff8147beee>] do_last+0x15de/0x2c80
      [   31.915707]  [<ffffffff8147d777>] path_openat+0x1e7/0xcc0
      [   31.915802]  [<ffffffff81480824>] do_filp_open+0xa4/0x160
      [   31.915848]  [<ffffffff81453127>] do_sys_open+0x1b7/0x3f0
      [   31.915879]  [<ffffffff81453392>] SyS_open+0x32/0x40
      [   31.915897]  [<ffffffff81f08989>] system_call_fastpath+0x16/0x1b
      
      [  240.086809] UBSAN: Undefined behaviour in fs/xfs/xfs_buf_item.c:866:34
      [  240.086820] signed integer overflow:
      [  240.086830] -2147483648 - 1 cannot be represented in type 'int'
      [  240.086846] CPU: 1 PID: 12969 Comm: rm Tainted: G    B          ---- -------   3.10.0-327.28.3.el7.x86_64 #1
      [  240.086857] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 01/07/2011
      [  240.086868]  1ffff10040491def 00000000e2ea59c1 ffff88020248ef40 ffffffff81ee3140
      [  240.086885]  ffff88020248ef58 ffffffff81ee31fd ffffffffa0ad79e0 ffff88020248f0c0
      [  240.086901]  ffffffff81ee46e2 0000002d02488000 0000000000000001 0000000041b58ab3
      [  240.086915] Call Trace:
      [  240.086938]  [<ffffffff81ee3140>] dump_stack+0x1e/0x20
      [  240.086953]  [<ffffffff81ee31fd>] ubsan_epilogue+0x12/0x55
      [  240.086971]  [<ffffffff81ee46e2>] handle_overflow+0x1ba/0x215
      ...
      Signed-off-by: NXie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      79c350e4
    • A
      Make __xfs_xattr_put_listen preperly report errors. · 791cc43b
      Artem Savkov 提交于
      Commit 2a6fba6d "xfs: only return -errno or success from attr ->put_listent"
      changes the returnvalue of __xfs_xattr_put_listen to 0 in case when there is
      insufficient space in the buffer assuming that setting context->count to -1
      would be enough, but all of the ->put_listent callers only check seen_enough.
      This results in a failed assertion:
      XFS: Assertion failed: context->count >= 0, file: fs/xfs/xfs_xattr.c, line: 175
      in insufficient buffer size case.
      
      This is only reproducible with at least 2 xattrs and only when the buffer
      gets depleted before the last one.
      
      Furthermore if buffersize is such that it is enough to hold the last xattr's
      name, but not enough to hold the sum of preceeding xattr names listxattr won't
      fail with ERANGE, but will suceed returning last xattr's name without the
      first character. The first character end's up overwriting data stored at
      (context->alist - 1).
      Signed-off-by: NArtem Savkov <asavkov@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      791cc43b
    • E
      xfs: undo block reservation correctly in xfs_trans_reserve() · a27f6ef4
      Eryu Guan 提交于
      "blocks" should be added back to fdblocks at undo time, not taken
      away, i.e. the minus sign should not be used.
      
      This is a regression introduced by commit 0d485ada ("xfs: use
      generic percpu counters for free block counter"). And it's found by
      code inspection, I didn't it in real world, so there's no
      reproducer.
      Signed-off-by: NEryu Guan <eguan@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      a27f6ef4
  6. 30 8月, 2016 1 次提交
    • D
      xfs: track log done items directly in the deferred pending work item · ea78d808
      Darrick J. Wong 提交于
      Christoph reports slab corruption when a deferred refcount update
      aborts during _defer_finish().  The cause of this was broken log item
      state tracking in xfs_defer_pending -- upon an abort,
      _defer_trans_abort() will call abort_intent on all intent items,
      including the ones that have already had a done item attached.
      
      This is incorrect because each intent item has 2 refcount: the first
      is released when the intent item is committed to the log; and the
      second is released when the _done_ item is committed to the log, or
      by the intent creator if there is no done item.  In other words, once
      we log the done item, responsibility for releasing the intent item's
      second refcount is transferred to the done item and /must not/ be
      performed by anything else.
      
      The dfp_committed flag should have been tracking whether or not we had
      a done item so that _defer_trans_abort could decide if it needs to
      abort the intent item, but due to a thinko this was not the case.  Rip
      it out and track the done item directly so that we do the right thing
      w.r.t. intent item freeing.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reported-by: NChristoph Hellwig <hch@infradead.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      ea78d808
  7. 26 8月, 2016 4 次提交
    • B
      xfs: prevent dropping ioend completions during buftarg wait · 800b2694
      Brian Foster 提交于
      xfs_wait_buftarg() waits for all pending I/O, drains the ioend
      completion workqueue and walks the LRU until all buffers in the cache
      have been released. This is traditionally an unmount operation` but the
      mechanism is also reused during filesystem freeze.
      
      xfs_wait_buftarg() invokes drain_workqueue() as part of the quiesce,
      which is intended more for a shutdown sequence in that it indicates to
      the queue that new operations are not expected once the drain has begun.
      New work jobs after this point result in a WARN_ON_ONCE() and are
      otherwise dropped.
      
      With filesystem freeze, however, read operations are allowed and can
      proceed during or after the workqueue drain. If such a read occurs
      during the drain sequence, the workqueue infrastructure complains about
      the queued ioend completion work item and drops it on the floor. As a
      result, the buffer remains on the LRU and the freeze never completes.
      
      Despite the fact that the overall buffer cache cleanup is not necessary
      during freeze, fix up this operation such that it is safe to invoke
      during non-unmount quiesce operations. Replace the drain_workqueue()
      call with flush_workqueue(), which runs a similar serialization on
      pending workqueue jobs without causing new jobs to be dropped. This is
      safe for unmount as unmount independently locks out new operations by
      the time xfs_wait_buftarg() is invoked.
      
      cc: <stable@vger.kernel.org>
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      800b2694
    • D
      xfs: fix superblock inprogress check · f3d7ebde
      Dave Chinner 提交于
      From inspection, the superblock sb_inprogress check is done in the
      verifier and triggered only for the primary superblock via a
      "bp->b_bn == XFS_SB_DADDR" check.
      
      Unfortunately, the primary superblock is an uncached buffer, and
      hence it is configured by xfs_buf_read_uncached() with:
      
      	bp->b_bn = XFS_BUF_DADDR_NULL;  /* always null for uncached buffers */
      
      And so this check never triggers. Fix it.
      
      cc: <stable@vger.kernel.org>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      f3d7ebde
    • D
      xfs: simple btree query range should look right if LE lookup fails · 5b5c2dbd
      Darrick J. Wong 提交于
      If the initial LOOKUP_LE in the simple query range fails to find
      anything, we should attempt to increment the btree cursor to see
      if there actually /are/ records for what we're trying to find.
      Without this patch, a bnobt range query of (0, $agsize) returns
      no results because the leftmost record never has a startblock
      of zero.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      5b5c2dbd
    • D
      xfs: fix some key handling problems in _btree_simple_query_range · 72227899
      Darrick J. Wong 提交于
      We only need the record's high key for the first record that we look
      at; for all records, we /definitely/ need the regular record key.
      Therefore, fix how the simple range query function gets its keys.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      72227899