1. 22 Oct 2019 (2 commits)
  2. 21 Oct 2019 (10 commits)
  3. 18 Oct 2019 (1 commit)
    • iomap: iomap that extends beyond EOF should be marked dirty · 7684e2c4
      By Dave Chinner
      When doing a direct IO that spans the current EOF, and there are
      written blocks beyond EOF that extend beyond the current write, the
      only metadata update that needs to be done is a file size extension.
      
      However, we don't mark such iomaps as IOMAP_F_DIRTY to indicate that
      there are IO completion metadata updates required, and hence we may
      fail to correctly sync file size extensions made in IO completion
      when O_DSYNC writes are being used and the hardware supports FUA.
      
      Hence when setting IOMAP_F_DIRTY, we need to also take into account
      whether the iomap spans the current EOF. If it does, then we need to
      mark it dirty so that IO completion will call generic_write_sync()
      to flush the inode size update to stable storage correctly.
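
      A minimal sketch of the check, assuming the mapping code has the
      inode and the mapped offset/length in scope (the exact condition
      and call site in the real patch may differ):

          /* C sketch: an iomap that extends past EOF must be marked
           * dirty so IO completion calls generic_write_sync() for the
           * size update instead of taking the pure-data FUA path. */
          if (offset + length > i_size_read(inode))
                  iomap->flags |= IOMAP_F_DIRTY;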
      
      Fixes: 3460cac1 ("iomap: Use FUA for pure data O_DSYNC DIO writes")
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      [darrick: removed the ext4 part; they'll handle it separately]
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  4. 15 Oct 2019 (2 commits)
  5. 09 Oct 2019 (3 commits)
    • xfs: move local to extent inode logging into bmap helper · aeea4b75
      By Brian Foster
      The callers of xfs_bmap_local_to_extents_empty() log the inode
      outside the function, yet the function is where the on-disk
      format value is updated. Push the inode logging down into the
      function itself to help prevent future mistakes.
      
      Note that internal bmap callers track the inode logging flags
      independently and thus may log the inode core twice due to this
      change. This is harmless, so leave this code around for consistency
      with the other attr fork conversion functions.
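
      A hedged sketch of the resulting helper shape (the signature and
      body are illustrative, not the exact patch):

          /* C sketch: the helper flips the fork format and logs the
           * inode core itself, so callers cannot forget to do it. */
          void
          xfs_bmap_local_to_extents_empty(
                  struct xfs_trans        *tp,    /* illustrative */
                  struct xfs_inode        *ip,
                  int                     whichfork)
          {
                  /* ... free the local-format data and switch the
                   * fork to XFS_DINODE_FMT_EXTENTS ... */
                  xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
          }
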
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: remove broken error handling on failed attr sf to leaf change · 603efebd
      By Brian Foster
      xfs_attr_shortform_to_leaf() attempts to put the shortform fork back
      together after a failed attempt to convert from shortform to leaf
      format. While this code reallocates and copies back the shortform
      attr fork data, it never resets the inode format field back to local
      format. Further, now that the inode is properly logged after the
      initial switch from local format, any error that triggers the
      recovery code will eventually abort the transaction and shut
      down the fs. Therefore, remove the broken and unnecessary error
      handling code.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: log the inode on directory sf to block format change · 0b10d8a8
      By Brian Foster
      When a directory changes from shortform (sf) to block format, the sf
      format is copied to a temporary buffer, the inode format is
      modified, and the updated format is filled with the dentries from
      the temporary buffer. If the inode format is modified and the
      attempt to grow the inode fails (due to an I/O error, for example),
      it is possible to return an error while leaving the directory in an
      inconsistent state and with an otherwise clean transaction. This
      results in corruption of the associated directory and leads to
      xfs_dabuf_map() errors as subsequent lookups cannot accurately
      determine the format of the directory. This problem is reproduced
      occasionally by generic/475.
      
      The fundamental problem is that xfs_dir2_sf_to_block() changes the
      on-disk inode format without logging the inode. The inode is
      eventually logged by the bmapi layer in the common case, but error
      checking introduces the possibility of failing the high level
      request before this happens.
      
      Update both the dir2 and attr callers of
      xfs_bmap_local_to_extents_empty() to log the inode core,
      consistent with the bmap local-to-extents format change codepath.
      This ensures that any subsequent errors after the format has
      changed cause the transaction to abort.
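
      A minimal sketch of the caller-side fix, assuming the dir2
      conversion path (placement and argument names are illustrative):

          /* C sketch: log the inode core right after the format flips
           * so a later failure aborts the now-dirty transaction rather
           * than leaving a clean-but-inconsistent directory behind. */
          xfs_bmap_local_to_extents_empty(dp, XFS_DATA_FORK);
          xfs_trans_log_inode(args->trans, dp, XFS_ILOG_CORE);
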
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  6. 07 Oct 2019 (4 commits)
  7. 27 Sep 2019 (1 commit)
  8. 25 Sep 2019 (1 commit)
  9. 24 Sep 2019 (4 commits)
  10. 20 Sep 2019 (2 commits)
  11. 06 Sep 2019 (9 commits)
    • xfs: push the grant head when the log head moves forward · 14e15f1b
      By Dave Chinner
      When the log fills up, we can get into the state where the
      outstanding items in the CIL being committed and aggregated are
      larger than the range that the reservation grant head tail pushing
      will attempt to clean. This can result in the tail pushing range
      being trimmed back to the log head (l_last_sync_lsn) and so
      may not actually move the push target at all.
      
      When the iclogs associated with the CIL commit finally land, the
      log head moves forward, and this removes the restriction on the AIL
      push target. However, if we already have transactions sleeping on
      the grant head, and there's nothing in the AIL still to flush from
      the current push target, then nothing will move the tail of the log
      and trigger a log reservation wakeup.
      
      Hence there is nothing that will trigger xlog_grant_push_ail()
      to recalculate the AIL push target and start pushing on the AIL
      again to write back the metadata objects that pin the tail of the
      log and hence free up space and allow the transaction reservations
      to be woken and make progress.
      
      Hence we need to push on the grant head when we move the log head
      forward, as this may be the only trigger we have that can move the
      AIL push target forwards in this situation.
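
      A hedged sketch of where such a push could hook in, assuming the
      iclog completion path that advances the log head (the exact call
      site in the real patch may differ; header_lsn is illustrative):

          /* C sketch: after the log head (l_last_sync_lsn) advances at
           * iclog IO completion, recalculate the AIL push target so
           * sleeping reservation waiters can make progress again. */
          atomic64_set(&log->l_last_sync_lsn, header_lsn);
          xlog_grant_push_ail(log, 0);
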
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: push iclog state cleaning into xlog_state_clean_log · 0383f543
      By Dave Chinner
      xlog_state_clean_log() is only called from one place, and it occurs
      when an iclog is transitioning back to ACTIVE. Prior to calling
      xlog_state_clean_log(), the iclog we are processing has a hard
      coded state check to DIRTY so that xlog_state_clean_log()
      processes it correctly. We also have a hard coded wakeup after
      xlog_state_clean_log() to ensure log force waiters on that iclog
      are woken correctly.
      
      Both of these things are operations required to finish processing
      an iclog and return it to the ACTIVE state again, so it makes
      little sense for them to be separated from the rest of the clean
      state transition code.
      
      Hence push these things inside xlog_state_clean_log(), document the
      behaviour and rename it xlog_state_clean_iclog() to indicate that
      it's being driven by an iclog state change and does the iclog state
      change work itself.
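
      A hedged sketch of the consolidated helper (the state and field
      names are real, the body is illustrative):

          /* C sketch: one helper drives the DIRTY -> ACTIVE transition
           * and the log force wakeup for the iclog being retired. */
          static void
          xlog_state_clean_iclog(struct xlog *log,
                                 struct xlog_in_core *dirty_iclog)
          {
                  dirty_iclog->ic_state = XLOG_STATE_DIRTY;
                  /* ... return DIRTY iclogs to ACTIVE, update the
                   * last-cleaned LSN bookkeeping ... */
                  wake_up_all(&dirty_iclog->ic_force_wait);
          }
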
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: factor iclog state processing out of xlog_state_do_callback() · 5e96fa8d
      By Dave Chinner
      The iclog IO completion state processing is somewhat complex, and
      because it's inside two nested loops it is highly indented and very
      hard to read. Factor it out, flatten the logic flow and clean up
      the comments so that it is much easier to see what the code is
      doing, both in processing the individual iclogs and in the overall
      xlog_state_do_callback() operation.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: factor callbacks out of xlog_state_do_callback() · 6546818c
      By Dave Chinner
      Simplify the code flow by lifting the iclog callback work out of
      the main iclog iteration loop. This isolates the log juggling and
      callbacks from the iclog state change logic in the loop.
      
      Note that the loopdidcallbacks variable is not actually tracking
      whether callbacks are actually run - it is tracking whether the
      icloglock was dropped during the loop and so determines if we
      completed the entire iclog scan loop atomically. Hence we know for
      certain there are either no more ordered completions to run or
      that the next completion will run the remaining ordered iclog
      completions. Hence rename that variable appropriately for its
      function.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: factor debug code out of xlog_state_do_callback() · 6769aa2a
      By Dave Chinner
      Start making this function readable by lifting the debug code into
      a conditional function.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: prevent CIL push holdoff in log recovery · 8ab39f11
      By Dave Chinner
      generic/530 on a machine with enough RAM and a non-preemptible
      kernel can run the AGI processing phase of log recovery entirely
      out of cache. This means it never blocks on locks, never waits for
      IO and runs entirely through the unlinked lists until it either
      completes or blocks and hangs because it has run out of log space.
      
      It runs out of log space because the background CIL push is
      scheduled but never runs. queue_work() queues the CIL work on the
      current CPU that is busy, and the workqueue code will not run it on
      any other CPU. Hence if the unlinked list processing never yields
      the CPU voluntarily, the push work is delayed indefinitely. This
      results in the CIL aggregating changes until all the log space is
      consumed.
      
      When the log recovery processing eventually blocks, the CIL
      flushes, but because the last iclog isn't full it is not submitted
      for IO. Hence the CIL flush never completes, nothing moves the log
      head forwards or indeed inserts anything into the tail of the log,
      and so nothing is able to get the log moving again and recovery
      hangs.
      
      There are several problems here, but the two obvious ones from
      the trace are that:
      	a) log recovery does not yield the CPU for over 4 seconds,
      	b) binding CIL pushes to a single CPU is a really bad idea.
      
      This patch addresses just these two aspects of the problem, and is
      suitable for backporting to work around any issues in older
      kernels. The more fundamental problem of preventing the CIL from
      consuming more than 50% of the log without committing will take
      more invasive and complex work, so will be done as followup work.
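
      A hedged sketch of the two workarounds (the cond_resched()
      placement is illustrative; making the CIL workqueue unbound is
      the mechanism for letting the push run on another CPU):

          /* C sketch, part a: yield the CPU periodically while log
           * recovery walks the unlinked lists. */
          cond_resched();

          /* C sketch, part b: an unbound workqueue lets the scheduler
           * run the CIL push work on an idle CPU instead of queueing
           * it behind the busy, non-yielding recovery thread. */
          mp->m_cil_workqueue = alloc_workqueue("xfs-cil/%s",
                          WQ_MEM_RECLAIM | WQ_FREEZABLE | WQ_UNBOUND,
                          0, mp->m_fsname);
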
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: fix missed wakeup on l_flush_wait · cdea5459
      By Rik van Riel
      The code in xlog_wait() uses the spinlock to make adding the task
      to the wait queue and setting the task state to UNINTERRUPTIBLE
      atomic with respect to the waker.
      
      Doing the wakeup after releasing the spinlock opens up the following
      race condition:
      
      Task 1                                  Task 2
      add task to wait queue
                                              wake up task
      set task state to UNINTERRUPTIBLE
      
      This issue was found through code inspection as a result of
      kworkers being observed stuck in UNINTERRUPTIBLE state with an
      empty wait queue. It is rare and largely unreproducible.
      
      Simply moving the spin_unlock to after the wake_up_all means the
      waker can never observe a task on the waitqueue before that task
      has set its state to UNINTERRUPTIBLE.
      
      This bug dates back to the conversion of this code to generic
      waitqueue infrastructure from a counting semaphore back in 2008,
      which didn't place the wakeups consistently w.r.t. the relevant
      spin locks.
      
      [dchinner: Also fix a similar issue in the shutdown path on
      xc_commit_wait. Update commit log with more details of the issue.]
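
      A minimal sketch of the l_flush_wait side of the fix, assuming
      the iclog completion context (exact placement is illustrative):

          /* C sketch: wake waiters while still holding the lock that
           * serializes waitqueue adds, so the waker always sees a
           * fully registered waiter. */
          spin_lock(&log->l_icloglock);
          /* ... state change that lets flush waiters proceed ... */
          wake_up_all(&log->l_flush_wait);   /* was after the unlock */
          spin_unlock(&log->l_icloglock);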
      
      Fixes: d748c623 ("[XFS] Convert l_flushsema to a sv_t")
      Reported-by: Chris Mason <clm@fb.com>
      Signed-off-by: Rik van Riel <riel@surriel.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: push the AIL in xlog_grant_head_wake · 7c107afb
      By Dave Chinner
      In the situation where the log is full and the CIL has not recently
      flushed, the AIL push threshold is throttled back to where the
      last write of the head of the log was completed. This is stored in
      log->l_last_sync_lsn. Hence if the CIL holds > 25% of the log space
      pinned by flushes and/or aggregation in progress, we can get the
      situation where the head of the log lags a long way behind the
      reservation grant head.
      
      When this happens, the AIL push target is trimmed back from where
      the reservation grant head wants to push the log tail to, back to
      where the head of the log currently is. This means the push target
      doesn't reach far enough into the log to actually move the tail
      before the transaction reservation goes to sleep.
      
      When the CIL push completes, it moves the log head forward such that
      the AIL push target can now be moved, but that has no mechanism for
      pushing the log tail. Further, if the next tail movement of the log
      is not large enough to wake the waiter (i.e. still not enough space
      for it to have a reservation granted), we don't wake anything up,
      and hence we do not update the AIL push target to take into account
      the head of the log moving and allowing the push target to be moved
      forwards.
      
      To avoid this particular condition, if we fail to wake the first
      waiter on the grant head because we don't have enough space,
      push on the AIL again. This will pick up any movement of the log
      head and allow the push target to move forward due to completion of
      CIL pushing.
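
      A hedged sketch of the check inside the grant head wake path
      (variable names are illustrative):

          /* C sketch: if the first waiter still cannot be granted its
           * reservation, push the AIL again so any log head movement
           * is reflected in a recalculated push target. */
          need_bytes = xlog_ticket_reservation(log, head, tic);
          if (free_bytes < need_bytes) {
                  xlog_grant_push_ail(log, need_bytes);
                  return false;
          }
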
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: Use WARN_ON_ONCE for bailout mount-operation · eb2e9994
      By Austin Kim
      If CONFIG_BUG is enabled, BUG() crashes the system, and the
      bailout path for the failed mount operation never proceeds.

      Using WARN_ON_ONCE rather than BUG logs the problem while still
      letting the mount bail out gracefully.
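
      A minimal sketch of the pattern, with a hypothetical error label:

          /* C sketch: warn loudly but let the mount error path run
           * instead of BUG() taking the whole machine down. */
          if (WARN_ON_ONCE(error))
                  goto out_unmount;   /* hypothetical bailout label */
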
      Signed-off-by: Austin Kim <austindh.kim@gmail.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  12. 04 Sep 2019 (1 commit)
    • xfs: Fix deadlock between AGI and AGF with RENAME_WHITEOUT · bc56ad8c
      By kaixuxia
      When performing a rename operation with the RENAME_WHITEOUT flag,
      we first hold the AGF lock to allocate or free extents while
      manipulating the dirents, and then do the xfs_iunlink_remove()
      call last, holding the AGI lock to modify the tmpfile info, so
      the lock order here is AGF->AGI.
      
      The big problem here is that we have an ordering constraint on AGF
      and AGI locking - inode allocation locks the AGI, then can allocate
      a new extent for new inodes, locking the AGF after the AGI. Hence
      the ordering that is imposed by other parts of the code is AGI before
      AGF. So we get an ABBA deadlock between the AGI and AGF here.
      
      Process A:
      Call trace:
       ? __schedule+0x2bd/0x620
       schedule+0x33/0x90
       schedule_timeout+0x17d/0x290
       __down_common+0xef/0x125
       ? xfs_buf_find+0x215/0x6c0 [xfs]
       down+0x3b/0x50
       xfs_buf_lock+0x34/0xf0 [xfs]
       xfs_buf_find+0x215/0x6c0 [xfs]
       xfs_buf_get_map+0x37/0x230 [xfs]
       xfs_buf_read_map+0x29/0x190 [xfs]
       xfs_trans_read_buf_map+0x13d/0x520 [xfs]
       xfs_read_agf+0xa6/0x180 [xfs]
       ? schedule_timeout+0x17d/0x290
       xfs_alloc_read_agf+0x52/0x1f0 [xfs]
       xfs_alloc_fix_freelist+0x432/0x590 [xfs]
       ? down+0x3b/0x50
       ? xfs_buf_lock+0x34/0xf0 [xfs]
       ? xfs_buf_find+0x215/0x6c0 [xfs]
       xfs_alloc_vextent+0x301/0x6c0 [xfs]
       xfs_ialloc_ag_alloc+0x182/0x700 [xfs]
       ? _xfs_trans_bjoin+0x72/0xf0 [xfs]
       xfs_dialloc+0x116/0x290 [xfs]
       xfs_ialloc+0x6d/0x5e0 [xfs]
       ? xfs_log_reserve+0x165/0x280 [xfs]
       xfs_dir_ialloc+0x8c/0x240 [xfs]
       xfs_create+0x35a/0x610 [xfs]
       xfs_generic_create+0x1f1/0x2f0 [xfs]
       ...
      
      Process B:
      Call trace:
       ? __schedule+0x2bd/0x620
       ? xfs_bmapi_allocate+0x245/0x380 [xfs]
       schedule+0x33/0x90
       schedule_timeout+0x17d/0x290
       ? xfs_buf_find+0x1fd/0x6c0 [xfs]
       __down_common+0xef/0x125
       ? xfs_buf_get_map+0x37/0x230 [xfs]
       ? xfs_buf_find+0x215/0x6c0 [xfs]
       down+0x3b/0x50
       xfs_buf_lock+0x34/0xf0 [xfs]
       xfs_buf_find+0x215/0x6c0 [xfs]
       xfs_buf_get_map+0x37/0x230 [xfs]
       xfs_buf_read_map+0x29/0x190 [xfs]
       xfs_trans_read_buf_map+0x13d/0x520 [xfs]
       xfs_read_agi+0xa8/0x160 [xfs]
       xfs_iunlink_remove+0x6f/0x2a0 [xfs]
       ? current_time+0x46/0x80
       ? xfs_trans_ichgtime+0x39/0xb0 [xfs]
       xfs_rename+0x57a/0xae0 [xfs]
       xfs_vn_rename+0xe4/0x150 [xfs]
       ...
      
      In this patch we move the xfs_iunlink_remove() call to
      before acquiring the AGF lock to preserve correct AGI/AGF locking
      order.
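
      A hedged sketch of the reordering inside xfs_rename() (the error
      label and surrounding context are illustrative):

          /* C sketch: pull the whiteout tmpfile off the unlinked list
           * first, taking the AGI lock before any dirent work can take
           * the AGF lock, preserving the AGI -> AGF order. */
          if (wip) {
                  error = xfs_iunlink_remove(tp, wip);
                  if (error)
                          goto out_trans_cancel;
          }
          /* ... dirent add/remove that may allocate or free extents
           * (and hence lock the AGF) happens only after this ... */
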
      Signed-off-by: kaixuxia <kaixuxia@tencent.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>