1. 07 Jul, 2020 31 commits
    • D
      xfs: make inode reclaim almost non-blocking · 993f951f
      Dave Chinner committed
      Now that dirty inode writeback doesn't cause read-modify-write
      cycles on the inode cluster buffer under memory pressure, the need
      to throttle memory reclaim to the rate at which we can clean dirty
      inodes goes away. That is due to the fact that we no longer thrash
      inode cluster buffers under memory pressure to clean dirty inodes.
      
      This means inode writeback no longer stalls on memory allocation
      or read IO, and hence can be done asynchronously without generating
      memory pressure. As a result, blocking inode writeback in reclaim is
      no longer necessary to prevent reclaim priority windup as cleaning
      dirty inodes is no longer dependent on having memory reserves
      available for the filesystem to make progress reclaiming inodes.
      
      Hence we can convert inode reclaim to be non-blocking for shrinker
      callouts, both for direct reclaim and kswapd.
      
      On a vanilla kernel, running a 16-way fsmark create workload on a
      4 node/16p/16GB RAM machine, I can reliably pin 14.75GB of RAM via
      userspace mlock(). The OOM killer gets invoked at 15GB of
      pinned RAM.
      
      Without the inode cluster pinning, this non-blocking reclaim patch
      triggers premature OOM killer invocation with the same memory
      pinning, sometimes with as much as 45% of RAM being free.  It's
      trivially easy to trigger the OOM killer when reclaim does not
      block.
      
      With pinning inode clusters in RAM and then adding this patch, I can
      reliably pin 14.5GB of RAM and still have the fsmark workload run to
      completion. The OOM killer gets invoked at 14.75GB of pinned RAM, which
      is only a small amount of memory less than the vanilla kernel. It is
      much more reliable than just with async reclaim alone.
      
      simoops shows that allocation stalls go away when async reclaim is
      used. Vanilla kernel:
      
      Run time: 1924 seconds
      Read latency (p50: 3,305,472) (p95: 3,723,264) (p99: 4,001,792)
      Write latency (p50: 184,064) (p95: 553,984) (p99: 807,936)
      Allocation latency (p50: 2,641,920) (p95: 3,911,680) (p99: 4,464,640)
      work rate = 13.45/sec (avg 13.44/sec) (p50: 13.46) (p95: 13.58) (p99: 13.70)
      alloc stall rate = 3.80/sec (avg: 2.59) (p50: 2.54) (p95: 2.96) (p99: 3.02)
      
      With inode cluster pinning and async reclaim:
      
      Run time: 1924 seconds
      Read latency (p50: 3,305,472) (p95: 3,715,072) (p99: 3,977,216)
      Write latency (p50: 187,648) (p95: 553,984) (p99: 789,504)
      Allocation latency (p50: 2,748,416) (p95: 3,919,872) (p99: 4,448,256)
      work rate = 13.28/sec (avg 13.32/sec) (p50: 13.26) (p95: 13.34) (p99: 13.34)
      alloc stall rate = 0.02/sec (avg: 0.02) (p50: 0.01) (p95: 0.03) (p99: 0.03)
      
      Latencies don't really change much, nor does the work rate. However,
      allocation almost never stalls with these changes, whilst the
      vanilla kernel is sometimes reporting 20 stalls/s over a 60s sample
      period. This difference is due to inode reclaim being largely
      non-blocking now.
      
      IOWs, once we have pinned inode cluster buffers, we can make inode
      reclaim non-blocking without a major risk of premature and/or
      spurious OOM killer invocation, and without any changes to memory
      reclaim infrastructure.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Amir Goldstein <amir73il@gmail.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      993f951f
    • D
      xfs: pin inode backing buffer to the inode log item · 298f7bec
      Dave Chinner committed
      When we dirty an inode, we are going to have to write it to disk at
      some point in the near future. This requires the inode cluster
      backing buffer to be present in memory. Unfortunately, under severe
      memory pressure we can reclaim the inode backing buffer while the
      inode is dirty in memory, resulting in stalling the AIL pushing
      because it has to do a read-modify-write cycle on the cluster
      buffer.
      
      When we have no memory available, the read of the cluster buffer
      blocks the AIL pushing process, and this causes all sorts of issues
      for memory reclaim as it requires inode writeback to make forwards
      progress. Allocating a cluster buffer causes more memory pressure,
      and results in more cluster buffers being reclaimed, resulting in
      more RMW cycles to be done in the AIL context and everything then
      backs up on AIL progress. Only the synchronous inode cluster
      writeback in the inode reclaim code provides some level of
      forwards progress guarantees that prevent OOM-killer rampages in
      this situation.
      
      Fix this by pinning the inode backing buffer to the inode log item
      when the inode is first dirtied (i.e. in xfs_trans_log_inode()).
      This may mean the first modification of an inode that has been held
      in cache for a long time may block on a cluster buffer read, but
      we can do that in transaction context and block safely until the
      buffer has been allocated and read.
      
      Once we have the cluster buffer, the inode log item takes a
      reference to it, pinning it in memory, and attaches it to the log
      item for future reference. This means we can always grab the cluster
      buffer from the inode log item when we need it.
      
      When the inode is finally cleaned and removed from the AIL, we can
      drop the reference the inode log item holds on the cluster buffer.
      Once all inodes on the cluster buffer are clean, the cluster buffer
      will be unpinned and it will be available for memory reclaim to
      reclaim again.
      
      This avoids the issues with needing to do RMW cycles in the AIL
      pushing context, and hence allows complete non-blocking inode
      flushing to be performed by the AIL pushing context.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      298f7bec
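The reference-taking scheme described in 298f7bec can be sketched in userspace C roughly as follows. This is only an illustration under stated assumptions: `cluster_buf`, `inode_log_item`, `log_item_dirty()`, `log_item_clean()` and `buf_reclaimable()` are hypothetical names, not the kernel's, and all locking and buffer reads are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an inode cluster buffer. */
struct cluster_buf {
    int pin_count;              /* references held by dirty inode log items */
};

/* Hypothetical stand-in for an inode log item. */
struct inode_log_item {
    struct cluster_buf *bp;     /* non-NULL while the inode is dirty */
};

/* First dirtying of the inode: take a reference so the buffer stays resident. */
static void log_item_dirty(struct inode_log_item *lip, struct cluster_buf *bp)
{
    if (lip->bp != NULL)
        return;                 /* already dirty, buffer already pinned */
    bp->pin_count++;
    lip->bp = bp;
}

/* Inode cleaned and removed from the AIL: drop the reference. */
static void log_item_clean(struct inode_log_item *lip)
{
    if (lip->bp == NULL)
        return;                 /* inode was not dirty */
    assert(lip->bp->pin_count > 0);
    lip->bp->pin_count--;
    lip->bp = NULL;
}

/* The buffer may be reclaimed only once no dirty inode references it. */
static bool buf_reclaimable(const struct cluster_buf *bp)
{
    return bp->pin_count == 0;
}
```

With two inodes dirtied against the same buffer, the buffer stays pinned until both have been cleaned, which mirrors the "once all inodes on the cluster buffer are clean" condition in the commit message.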
    • D
      xfs: move xfs_clear_li_failed out of xfs_ail_delete_one() · e98084b8
      Dave Chinner committed
      xfs_ail_delete_one() is called directly from dquot and inode IO
      completion, as well as from the generic xfs_trans_ail_delete()
      function. Inodes are about to have their own failure handling, and
      dquots will in future, too. Pull the clearing of the LI_FAILED flag
      up into the callers so we can customise the code appropriately.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      e98084b8
    • D
      xfs: unwind log item error flagging · 3536b61e
      Dave Chinner committed
      When a buffer IO error occurs, we want to mark all
      the log items attached to the buffer as failed. Open code
      the error handling loop so that we can modify the flagging for the
      different types of objects directly and independently of each other.
      
      This also allows us to remove the ->iop_error method from the log
      item operations.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      3536b61e
    • D
      xfs: handle buffer log item IO errors directly · 428947e9
      Dave Chinner committed
      Currently when a buffer with attached log items has an IO error
      it calls ->iop_error for each attached log item. These all call
      xfs_set_li_failed() to handle the error, but we are about to change
      the way log items manage buffers. Hence we first need to remove the
      per-item dependency on buffer handling done by xfs_set_li_failed().
      
      We already have specific buffer type IO completion routines, so move
      the log item error handling out of the generic error handling and
      into the log item specific functions so we can implement per-type
      error handling easily.
      
      This requires a more complex return value from the error handling
      code so that we can take the correct action the failure handling
      requires.  This results in some repeated boilerplate in the
      functions, but that can be cleaned up later once all the changes
      cascade through this code.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      428947e9
    • D
      xfs: get rid of log item callbacks · 2ef3f7f5
      Dave Chinner committed
      They are not used anymore, so remove them from the log item and the
      buffer iodone attachment interfaces.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      2ef3f7f5
    • D
      xfs: clean up the buffer iodone callback functions · fec671cd
      Dave Chinner committed
      Now that we've sorted inode and dquot buffers, we can apply the same
      cleanups to dirty buffers with buffer log items. They only have one
      callback, too, so we don't need the log item callback. Collapse the
      iodone functions and remove all the now unnecessary infrastructure
      around callback processing.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      fec671cd
    • D
      xfs: use direct calls for dquot IO completion · 6f5de180
      Dave Chinner committed
      Similar to inodes, we can call the dquot IO completion functions
      directly from the buffer completion code, removing another user of
      log item callbacks for IO completion processing.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      6f5de180
    • D
      xfs: make inode IO completion buffer centric · aac855ab
      Dave Chinner committed
      Having different io completion callbacks for different inode states
      makes things complex. We can detect if the inode is stale via the
      XFS_ISTALE flag in IO completion, so we don't need a special
      callback just for this.
      
      This means inodes only have a single iodone callback, and inode IO
      completion is entirely buffer centric at this point. Hence we no
      longer need to use a log item callback at all as we can just call
      xfs_iflush_done() directly from the buffer completions and walk the
      buffer log item list to complete all the inodes under IO.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      aac855ab
    • D
      xfs: clean up whacky buffer log item list reinit · a7e134ef
      Dave Chinner committed
      When we've emptied the buffer log item list, it does a list_del_init
      on itself to reset its pointers to itself. This is unnecessary as
      the list is already empty at this point - it was a left-over
      fragment from the list_head conversion of the buffer log item list.
      Remove them.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      a7e134ef
    • D
      xfs: call xfs_buf_iodone directly · b01d1461
      Dave Chinner committed
      All unmarked dirty buffers should be in the AIL and have log items
      attached to them. Hence when they are written, we will run a
      callback to remove the item from the AIL if appropriate. Now that
      we've handled inode and dquot buffers, all remaining calls are to
      xfs_buf_iodone() and so we can hard code this rather than use an
      indirect call.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Amir Goldstein <amir73il@gmail.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      b01d1461
    • D
      xfs: mark log recovery buffers for completion · 9fe5c77c
      Dave Chinner committed
      Log recovery has its own buffer write completion handler for
      buffers that it directly recovers. Convert these to direct calls by
      flagging these buffers as being log recovery buffers. The flag will
      get cleared by the log recovery IO completion routine, so it will
      never leak out of log recovery.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      9fe5c77c
    • D
      xfs: mark dquot buffers in cache · 0c7e5afb
      Dave Chinner committed
      dquot buffers always have write IO callbacks, so by marking them
      directly we can avoid needing to attach ->b_iodone functions to
      them. This avoids an indirect call, and makes future modifications
      much simpler.
      
      This is largely a rearrangement of the code at this point - no IO
      completion functionality changes at this point, just how the
      code is run is modified.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      0c7e5afb
    • D
      xfs: mark inode buffers in cache · f593bf14
      Dave Chinner committed
      Inode buffers always have write IO callbacks, so by marking them
      directly we can avoid needing to attach ->b_iodone functions to
      them. This avoids an indirect call, and makes future modifications
      much simpler.
      
      While this is largely a refactor of existing functionality, we
      broaden the scope of the flag to beyond where inodes are explicitly
      attached because future changes need to know what type of log items
      are attached to the buffer. Adding this buffer flag may invoke the
      inode iodone callback in cases where it wouldn't have been
      previously, but this is not a functional change because the callback
      is identical to the normal buffer write iodone callback when inodes
      are not attached.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      f593bf14
    • D
      xfs: add an inode item lock · 1319ebef
      Dave Chinner committed
      The inode log item is kind of special in that it can be aggregating
      new changes in memory at the same time that existing changes are
      being written back to disk. This means there are fields in the log
      item that are accessed concurrently from contexts that don't share
      any locking at all.
      
      e.g. updating ili_last_fields occurs under the ILOCK_EXCL and flush
      lock at flush time, under the flush lock at IO completion time, and
      it is read under the ILOCK_EXCL when the inode is logged.  Hence
      there is no actual serialisation between reading the
      field during logging of the inode in transactions vs clearing the
      field in IO completion.
      
      We currently get away with this by the fact that we are only
      clearing fields in IO completion, and nothing bad happens if we
      accidentally log more of the inode than we actually modify. Worst
      case is we consume a tiny bit more memory and log bandwidth.
      
      However, if we want to do more complex state manipulations on the
      log item that requires updates at all three of these potential
      locations, we need to have some mechanism of serialising those
      operations. To do this, introduce a spinlock into the log item to
      serialise internal state.
      
      This could be done via the xfs_inode i_flags_lock, but this then
      leads to potential lock inversion issues where inode flag updates
      need to occur inside locks that best nest inside the inode log item
      locks (e.g. marking inodes stale during inode cluster freeing).
      Using a separate spinlock avoids these sorts of problems and
      simplifies future code.
      
      This does not touch the use of ili_fields in the item formatting
      code - that is entirely protected by the ILOCK_EXCL at this point in
      time, so it remains untouched.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      1319ebef
    • D
      xfs: remove logged flag from inode log item · 1dfde687
      Dave Chinner committed
      This was used to track if the item had logged fields being flushed
      to disk. We log everything in the inode these days, so this logic is
      no longer needed. Remove it.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      1dfde687
    • D
      xfs: Don't allow logging of XFS_ISTALE inodes · 96355d5a
      Dave Chinner committed
      In tracking down a problem in this patchset, I discovered we are
      reclaiming dirty stale inodes. This wasn't discovered until inodes
      were always attached to the cluster buffer and then the rcu callback
      that freed inodes was assert failing because the inode still had an
      active pointer to the cluster buffer after it had been reclaimed.
      
      Debugging the issue indicated that this was a pre-existing issue
      resulting from the way the inodes are handled in xfs_inactive_ifree.
      When we free a cluster buffer from xfs_ifree_cluster, all the inodes
      in cache are marked XFS_ISTALE. Those that are clean have nothing
      else done to them and so eventually get cleaned up by background
      reclaim. i.e. it is assumed we'll never dirty/relog an inode marked
      XFS_ISTALE.
      
      On journal commit, dirty stale inodes are handled by both buffer
      and inode log items: they are either run through xfs_istale_done()
      and removed from the AIL (buffer log item commit), or the inode log
      item simply unpins them because the buffer log item will clean
      them. What happens
      to any specific inode is entirely dependent on which log item wins
      the commit race, but the result is the same - stale inodes are
      clean, not attached to the cluster buffer, and not in the AIL. Hence
      inode reclaim can just free these inodes without further care.
      
      However, if the stale inode is relogged, it gets dirtied again and
      relogged into the CIL. Most of the time this isn't an issue, because
      relogging simply changes the inode's location in the current
      checkpoint. Problems arise, however, when the CIL checkpoints
      between two transactions in the xfs_inactive_ifree() deferops
      processing. This results in the XFS_ISTALE inode being redirtied
      and inserted into the CIL without any of the other stale cluster
      buffer infrastructure being in place.
      
      Hence on journal commit, it simply gets unpinned, so it remains
      dirty in memory. Everything in inode writeback avoids XFS_ISTALE
      inodes so it can't be written back, and it is not tracked in the AIL
      so there's not even a trigger to attempt to clean the inode. Hence
      the inode just sits dirty in memory until inode reclaim comes along,
      sees that it is XFS_ISTALE, and goes to reclaim it. This reclaiming
      of a dirty inode caused use after free, list corruptions and other
      nasty issues later in this patchset.
      
      Hence this patch addresses a violation of the "never log XFS_ISTALE
      inodes" rule caused by the deferops processing rolling a transaction
      and relogging a stale inode in xfs_inactive_ifree. It also adds a
      bunch of asserts to catch this problem in debug kernels so that
      we don't reintroduce this problem in future.
      
      Reproducer for this issue was generic/558 on a v4 filesystem.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      96355d5a
    • Y
      xfs: remove useless definitions in xfs_linux.h · 0d5a5714
      Yafang Shao committed
      Remove current_pid(), current_test_flags() and
      current_clear_flags_nested(), because they are useless.
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      0d5a5714
    • D
      xfs: use MMAPLOCK around filemap_map_pages() · cd647d56
      Dave Chinner committed
      The page faultaround path ->map_pages is implemented in XFS via
      filemap_map_pages(). This function checks that pages found in page
      cache lookups have not raced with truncate based invalidation by
      checking page->mapping is correct and page->index is within EOF.
      
      However, we've known for a long time that this is not sufficient to
      protect against races with invalidations done by operations that do
      not change EOF. e.g. hole punching and other fallocate() based
      direct extent manipulations. The way we protect against these
      races is we wrap the page fault operations in an XFS_MMAPLOCK_SHARED
      lock so they serialise against fallocate and truncate before calling
      into the filemap function that processes the fault.
      
      Do the same for XFS's ->map_pages implementation to close this
      potential data corruption issue.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Amir Goldstein <amir73il@gmail.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      cd647d56
    • D
      xfs: move helpers that lock and unlock two inodes against userspace IO · e2aaee9c
      Darrick J. Wong committed
      Move the double-inode locking helpers to xfs_inode.c since they're not
      specific to reflink.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      e2aaee9c
    • D
      xfs: refactor locking and unlocking two inodes against userspace IO · 10b4bd6c
      Darrick J. Wong committed
      Refactor the two functions that we use to lock and unlock two inodes to
      block userspace from initiating IO against a file, whether via system
      calls or mmap activity.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      10b4bd6c
    • D
      xfs: fix xfs_reflink_remap_prep calling conventions · 451d34ee
      Darrick J. Wong committed
      Fix the return value of xfs_reflink_remap_prep so that its return value
      conventions match the rest of xfs.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      451d34ee
    • D
      xfs: reflink can skip remap existing mappings · 168eae80
      Darrick J. Wong committed
      If the source and destination map are identical, we can skip the remap
      step to save some time.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      168eae80
    • D
      xfs: only reserve quota blocks if we're mapping into a hole · 94b941fd
      Darrick J. Wong committed
      When logging quota block count updates during a reflink operation, we
      only log the /delta/ of the block count changes to the dquot.  Since we
      now know ahead of time the extent type of both dmap and smap (and that
      they have the same length), we know that we only need to reserve quota
      blocks for dmap's blockcount if we're mapping it into a hole.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      94b941fd
    • D
      xfs: only reserve quota blocks for bmbt changes if we're changing the data fork · aa5d0ba0
      Darrick J. Wong committed
      Now that we've reworked xfs_reflink_remap_extent to remap only one
      extent per transaction, we actually know if the extent being removed is
      an allocated mapping.  This means that we now know ahead of time if
      we're going to be touching the data fork.
      
      Since we only need blocks for a bmbt split if we're going to update the
      data fork, we only need to get quota reservation if we know we're going
      to touch the data fork.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      aa5d0ba0
    • D
      xfs: redesign the reflink remap loop to fix blkres depletion crash · 00fd1d56
      Darrick J. Wong committed
      The existing reflink remapping loop has some structural problems that
      need addressing:
      
      The biggest problem is that we create one transaction for each extent in
      the source file without accounting for the number of mappings there are
      for the same range in the destination file.  In other words, we don't
      know the number of remap operations that will be necessary and we
      therefore cannot guess the block reservation required.  On highly
      fragmented filesystems (e.g. ones with active dedupe) we guess wrong,
      run out of block reservation, and fail.
      
      The second problem is that we don't actually use the bmap intents to
      their full potential -- instead of calling bunmapi directly and having
      to deal with its backwards operation, we could call the deferred ops
      xfs_bmap_unmap_extent and xfs_refcount_decrease_extent instead.  This
      makes the frontend loop much simpler.
      
      Solve all of these problems by refactoring the remapping loops so that
      we only perform one remapping operation per transaction, and each
      operation only tries to remap a single extent from source to dest.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reported-by: Edwin Török <edwin@etorok.net>
      Tested-by: Edwin Török <edwin@etorok.net>
      00fd1d56
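The "one remap operation per transaction, one extent per operation" structure described in 00fd1d56 can be sketched as a loop that advances by whichever of the current source or destination mapping ends first. This is a simplified userspace illustration, not the kernel implementation: `struct mapping` and `remap_range()` are hypothetical, and both arrays are assumed to tile the range `[0, total)` contiguously.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified extent mapping covering blocks [off, off + len). */
struct mapping {
    unsigned long off;
    unsigned long len;
};

/*
 * Walk [0, total) and count how many single-extent remap steps (one
 * "transaction" each) are needed. Each step covers at most one source
 * mapping and at most one destination mapping, so the work done per
 * step is bounded no matter how fragmented either file is.
 */
static unsigned long remap_range(const struct mapping *src, size_t nsrc,
                                 const struct mapping *dst, size_t ndst,
                                 unsigned long total)
{
    unsigned long pos = 0, ntrans = 0;
    size_t si = 0, di = 0;

    while (pos < total && si < nsrc && di < ndst) {
        unsigned long send = src[si].off + src[si].len;
        unsigned long dend = dst[di].off + dst[di].len;
        unsigned long end = send < dend ? send : dend;

        if (end > total)
            end = total;

        /* One step: unmap dest [pos, end) and map source [pos, end). */
        ntrans++;
        pos = end;

        /* Move past whichever mappings this step fully consumed. */
        if (pos >= send)
            si++;
        if (pos >= dend)
            di++;
    }
    return ntrans;
}
```

Because every step touches a bounded amount of metadata, a fixed per-transaction block reservation is enough, which is the property the old extent-per-source-mapping loop lacked.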
    • D
      xfs: rename xfs_bmap_is_real_extent to is_written_extent · 877f58f5
      Darrick J. Wong committed
      The name of this predicate is a little misleading -- it decides if the
      extent mapping is allocated and written.  Change the name to be more
      direct, as we're going to add a new predicate in the next patch.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      877f58f5
    • D
      xfs: fix reflink quota reservation accounting error · 83895227
      Darrick J. Wong committed
      Quota reservations are supposed to account for the blocks that might be
      allocated due to a bmap btree split.  Reflink doesn't do this, so fix
      this to make the quota accounting more accurate before we start
      rearranging things.
      
      Fixes: 862bb360 ("xfs: reflink extents from one file to another")
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      83895227
    • D
      xfs: don't eat an EIO/ENOSPC writeback error when scrubbing data fork · eb0efe50
      Darrick J. Wong committed
      The data fork scrubber calls filemap_write_and_wait to flush dirty pages
      and delalloc reservations out to disk prior to checking the data fork's
      extent mappings.  Unfortunately, this means that scrub can consume the
      EIO/ENOSPC errors that would otherwise have stayed around in the address
      space until (we hope) the writer application calls fsync to persist data
      and collect errors.  The end result is that programs that wrote to a
      file might never see the error code and proceed as if nothing were
      wrong.
      
      xfs_scrub is not in a position to notify file writers about the
      writeback failure, and it's only here to check metadata, not file
      contents.  Therefore, if writeback fails, we should stuff the error code
      back into the address space so that an fsync by the writer application
      can pick that up.
      
      Fixes: 99d9d8d0 ("xfs: scrub inode block mappings")
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      eb0efe50
    • B
      xfs: preserve rmapbt swapext block reservation from freed blocks · f74681ba
      Brian Foster committed
      The rmapbt extent swap algorithm remaps individual extents between
      the source inode and the target to trigger reverse mapping metadata
      updates. If either inode straddles a format or other bmap allocation
      boundary, the individual unmap and map cycles can trigger repeated
      bmap block allocations and frees as the extent count bounces back
      and forth across the boundary. While net block usage is bounded across
      the swap operation, this behavior can prematurely exhaust the
      transaction block reservation because it continuously drains as the
      transaction rolls. Each allocation accounts against the reservation
      and each free returns to global free space on transaction roll.
      
      The previous workaround to this problem attempted to detect this
      boundary condition and provide surplus block reservation to
      accommodate it. This is insufficient because more remaps can occur
      than implied by the extent counts; if start offset boundaries are
      not aligned between the two inodes, for example.
      
      To address this problem more generically and dynamically, add a
      transaction accounting mode that returns freed blocks to the
      transaction reservation instead of the superblock counters on
      transaction roll and use it when the rmapbt based algorithm is
      active. This allows the chain of remap transactions to preserve the
      block reservation based on its own frees and prevent premature
      exhaustion regardless of the remap pattern. Note that this is only
      safe for superblocks with lazy sb accounting, but the latter is
      required for v5 supers and the rmap feature depends on v5.
      
      Fixes: b3fed434 ("xfs: account format bouncing into rmapbt swapext tx reservation")
      Root-caused-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      f74681ba
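      The accounting change described in this commit can be illustrated
      with a toy userspace model. This is only a sketch under stated
      assumptions: `toy_fs`, `toy_tx`, `preserve_frees` and the `toy_*`
      helpers are hypothetical names invented for illustration and are
      not the XFS structures or functions. It shows why a reservation
      drains when frees go to global free space on roll, and why it holds
      steady when frees are returned to the reservation instead.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model; none of these names come from XFS. */
struct toy_fs {
    long free_blocks;     /* global free space counter */
};

struct toy_tx {
    long res;             /* blocks left in the transaction reservation */
    long freed;           /* blocks freed since the last roll */
    bool preserve_frees;  /* on roll, return frees to the reservation */
};

/* Each allocation is charged against the reservation. */
static bool toy_alloc(struct toy_tx *tx, long n)
{
    assert(n >= 0);
    if (tx->res < n)
        return false;     /* reservation exhausted */
    tx->res -= n;
    return true;
}

/* Frees are only accounted when the transaction rolls. */
static void toy_free(struct toy_tx *tx, long n)
{
    assert(n >= 0);
    tx->freed += n;
}

/* On roll, freed blocks go either back to the reservation or to the fs. */
static void toy_roll(struct toy_fs *fs, struct toy_tx *tx)
{
    if (tx->preserve_frees)
        tx->res += tx->freed;
    else
        fs->free_blocks += tx->freed;
    tx->freed = 0;
}

/*
 * Simulate the extent count bouncing across a format boundary:
 * every cycle allocates one block, frees one block, then rolls.
 * Returns the number of cycles completed before the reservation ran out.
 */
static int toy_bounce(struct toy_fs *fs, struct toy_tx *tx, int cycles)
{
    int done = 0;

    for (int i = 0; i < cycles; i++) {
        if (!toy_alloc(tx, 1))
            break;
        toy_free(tx, 1);
        toy_roll(fs, tx);
        done++;
    }
    return done;
}
```

      In this model, a reservation of 4 blocks survives only 4 bounce
      cycles when `preserve_frees` is false, even though net block usage
      never grows, while with `preserve_frees` set it survives any number
      of cycles.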
    • K
      xfs: Couple of typo fixes in comments · 06734e3c
      Keyur Patel committed
      ./xfs/libxfs/xfs_inode_buf.c:56: unnecssary ==> unnecessary
      ./xfs/libxfs/xfs_inode_buf.c:59: behavour ==> behaviour
      ./xfs/libxfs/xfs_inode_buf.c:206: unitialized ==> uninitialized
      Signed-off-by: Keyur Patel <iamkeyur96@gmail.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      06734e3c
  2. 06 Jul, 2020 9 commits
    • L
      Linux 5.8-rc4 · dcb7fd82
      Linus Torvalds committed
      dcb7fd82
    • L
      x86/ldt: use "pr_info_once()" instead of open-coding it badly · bb5a93aa
      Linus Torvalds committed
      Using a mutex for "print this warning only once" is so overdesigned as
      to be actively offensive to my sensitive stomach.
      
      Just use "pr_info_once()" that already does this, although in a
      (harmlessly) racy manner that can in theory cause the message to be
      printed twice if more than one CPU races on that "is this the first
      time" test.
      
      [ If somebody really cares about that harmless data race (which sounds
        very unlikely indeed), that person can trivially fix printk_once() by
        using a simple atomic access, preferably with an optimistic non-atomic
        test first before even bothering to treat the pointless "make sure it
        is _really_ just once" case.
      
        A mutex is most definitely never the right primitive to use for
        something like this. ]
      
      Yes, this is a small and meaningless detail in a code path that hardly
      matters.  But let's keep some code quality standards here, and not
      accept outrageously bad code.
      
      Link: https://lore.kernel.org/lkml/CAHk-=wgV9toS7GU3KmNpj8hCS9SeF+A0voHS8F275_mgLhL4Lw@mail.gmail.com/
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb5a93aa
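      The bracketed note in this commit message describes a "print once"
      pattern: a cheap optimistic test first, then a single atomic access
      to decide which caller really is first. A minimal userspace sketch
      of that idea with C11 atomics follows. It is an assumption-laden
      illustration, not the kernel's printk_once() or pr_info_once();
      `warn_once`, `warned` and `times_printed` are hypothetical names,
      and a relaxed atomic load stands in for the plain non-atomic read
      mentioned in the message.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical state for one call site; not kernel code. */
static atomic_bool warned;
static int times_printed;   /* only touched by the single winning caller */

/* Print msg at most once; returns true only for the caller that printed. */
static bool warn_once(const char *msg)
{
    assert(msg != NULL);

    /* Optimistic cheap check: the common case after the first print. */
    if (atomic_load_explicit(&warned, memory_order_relaxed))
        return false;

    /* Atomic exchange: exactly one racing caller sees the old value false. */
    if (atomic_exchange(&warned, true))
        return false;

    fprintf(stderr, "%s\n", msg);
    times_printed++;
    return true;
}
```

      No lock is needed: the exchange alone guarantees a single winner,
      which is the point the message makes about a mutex being the wrong
      primitive for this job.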
    • L
      Merge tag 'x86-urgent-2020-07-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 72674d48
      Linus Torvalds committed
      Pull x86 fixes from Thomas Gleixner:
       "A series of fixes for x86:
      
         - Reset MXCSR in kernel_fpu_begin() to prevent using a stale user
           space value.
      
         - Prevent writing MSR_TEST_CTRL on CPUs which are not explicitly
           whitelisted for split lock detection. Some CPUs which do not
           support it crash even when the MSR is written to 0 which is the
           default value.
      
         - Fix the XEN PV fallout of the entry code rework
      
         - Fix the 32bit fallout of the entry code rework
      
         - Add more selftests to ensure that these entry problems don't come
           back.
      
         - Disable 16 bit segments on XEN PV. It's not supported because XEN
           PV does not implement ESPFIX64"
      
      * tag 'x86-urgent-2020-07-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/ldt: Disable 16-bit segments on Xen PV
        x86/entry/32: Fix #MC and #DB wiring on x86_32
        x86/entry/xen: Route #DB correctly on Xen PV
        x86/entry, selftests: Further improve user entry sanity checks
        x86/entry/compat: Clear RAX high bits on Xen PV SYSENTER
        selftests/x86: Consolidate and fix get/set_eflags() helpers
        selftests/x86/syscall_nt: Clear weird flags after each test
        selftests/x86/syscall_nt: Add more flag combinations
        x86/entry/64/compat: Fix Xen PV SYSENTER frame setup
        x86/entry: Move SYSENTER's regs->sp and regs->flags fixups into C
        x86/entry: Assert that syscalls are on the right stack
        x86/split_lock: Don't write MSR_TEST_CTRL on CPUs that aren't whitelisted
        x86/fpu: Reset MXCSR to default in kernel_fpu_begin()
      72674d48
    • L
      Merge tag 'irq-urgent-2020-07-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · f23dbe18
      Linus Torvalds committed
      Pull irq fixes from Thomas Gleixner:
       "A set of interrupt chip driver fixes:
      
         - Ensure the atomicity of affinity updates in the GIC driver
      
         - Don't try to sleep in atomic context when waiting for the GICv4.1
           to respond. Use polling instead.
      
         - Typo fixes in Kconfig and warnings"
      
      * tag 'irq-urgent-2020-07-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        irqchip/gic: Atomically update affinity
        irqchip/riscv-intc: Fix a typo in a pr_warn()
        irqchip/gic-v4.1: Use readx_poll_timeout_atomic() to fix sleep in atomic
        irqchip/loongson-pci-msi: Fix a typo in Kconfig
      f23dbe18
    • L
      Merge tag 'core-urgent-2020-07-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 5465a324
      Linus Torvalds committed
      Pull rcu fixlet from Thomas Gleixner:
       "A single fix for a printk format warning in RCU"
      
      * tag 'core-urgent-2020-07-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        rcuperf: Fix printk format warning
      5465a324
    • L
      Merge tag 'kbuild-fixes-v5.8-2' of... · 4bc92736
      Linus Torvalds committed
      Merge tag 'kbuild-fixes-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
      
      Pull Kbuild fixes from Masahiro Yamada:
      
       - fix various bugs in xconfig
      
       - fix some issues in cross-compilation using Clang
      
       - fix documentation
      
      * tag 'kbuild-fixes-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
        .gitignore: Do not track `defconfig` from `make savedefconfig`
        kbuild: make Clang build userprogs for target architecture
        kbuild: fix CONFIG_CC_CAN_LINK(_STATIC) for cross-compilation with Clang
        kconfig: qconf: parse newer types at debug info
        kconfig: qconf: navigate menus on hyperlinks
        kconfig: qconf: don't show goback button on splitMode
        kconfig: qconf: simplify the goBack() logic
        kconfig: qconf: re-implement setSelected()
        kconfig: qconf: make debug links work again
        kconfig: qconf: make search fully work again on split mode
        kconfig: qconf: cleanup includes
        docs: kbuild: fix ReST formatting
        gcc-plugins: fix gcc-plugins directory path in documentation
      4bc92736
    • L
      Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 19a61a75
      Linus Torvalds committed
      Pull SCSI fixes from James Bottomley:
       "Four small fixes in three drivers.
      
        The mptfusion one has actually caused user visible issues in certain
        kernel configurations"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: mptfusion: Don't use GFP_ATOMIC for larger DMA allocations
        scsi: libfc: Skip additional kref updating work event
        scsi: libfc: Handling of extra kref
        scsi: qla2xxx: Fix a condition in qla2x00_find_all_fabric_devs()
      19a61a75
    • L
      Merge tag 'block-5.8-2020-07-05' of git://git.kernel.dk/linux-block · 29206c63
      Linus Torvalds committed
      Pull block fixes from Jens Axboe:
      
       - NVMe fixes from Christoph:
          - Fix crash in multi-path disk add (Christoph)
          - Fix ignore of identify error (Sagi)
      
       - Fix a compiler complaint that a function should be static (Wei)
      
      * tag 'block-5.8-2020-07-05' of git://git.kernel.dk/linux-block:
        block: make function __bio_integrity_free() static
        nvme: fix a crash in nvme_mpath_add_disk
        nvme: fix identify error status silent ignore
      29206c63
    • L
      Merge tag 'io_uring-5.8-2020-07-05' of git://git.kernel.dk/linux-block · 9fbe565c
      Linus Torvalds committed
      Pull io_uring fix from Jens Axboe:
       "Andres reported a regression with the fix that was merged earlier this
        week, where his setup of using signals to interrupt io_uring CQ waits
        no longer worked correctly.
      
        Fix this, and also limit our use of TWA_SIGNAL to the case where we
        need it, and continue using TWA_RESUME for task_work as before.
      
        Since the original is marked for 5.7 stable, let's flush this one out
        early"
      
      * tag 'io_uring-5.8-2020-07-05' of git://git.kernel.dk/linux-block:
        io_uring: fix regression with always ignoring signals in io_cqring_wait()
      9fbe565c