1. 10 Feb 2022 (1 commit)
  2. 02 Feb 2022 (5 commits)
  3. 01 Feb 2022 (1 commit)
  4. 31 Jan 2022 (1 commit)
  5. 27 Jan 2022 (1 commit)
    • xfs, iomap: limit individual ioend chain lengths in writeback · ebb7fb15
      Committed by Dave Chinner
      Trond Myklebust reported soft lockups in XFS IO completion such as
      this:
      
       watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [kworker/12:1:3106]
       CPU: 12 PID: 3106 Comm: kworker/12:1 Not tainted 4.18.0-305.10.2.el8_4.x86_64 #1
       Workqueue: xfs-conv/md127 xfs_end_io [xfs]
       RIP: 0010:_raw_spin_unlock_irqrestore+0x11/0x20
       Call Trace:
        wake_up_page_bit+0x8a/0x110
        iomap_finish_ioend+0xd7/0x1c0
        iomap_finish_ioends+0x7f/0xb0
        xfs_end_ioend+0x6b/0x100 [xfs]
        xfs_end_io+0xb9/0xe0 [xfs]
        process_one_work+0x1a7/0x360
        worker_thread+0x1fa/0x390
        kthread+0x116/0x130
        ret_from_fork+0x35/0x40
      
      Ioends are processed as an atomic completion unit when all the
      chained bios in the ioend have completed their IO. Logically
      contiguous ioends can also be merged and completed as a single,
      larger unit.  Both of these things can be problematic because the
      bio chains per ioend and the size of the merged ioends processed as
      a single completion are unbound.
      
      If we have a large sequential dirty region in the page cache,
      write_cache_pages() will keep feeding us sequential pages and we
      will keep mapping them into ioends and bios until we get a dirty
      page at a non-sequential file offset. These large sequential runs
      will result in bio and ioend chaining to optimise the IO
      patterns. The pages under writeback are pinned within these chains
      until the submission chaining is broken, allowing the entire chain
      to be completed. This can result in huge chains being processed
      in IO completion context.
      
      We get deep bio chaining if we have large contiguous physical
      extents. We will keep adding pages to the current bio until it is
      full, then we'll chain a new bio to keep adding pages for writeback.
      Hence we can build bio chains that map millions of pages and tens of
      gigabytes of RAM if the page cache contains big enough contiguous
      dirty file regions. This long bio chain pins those pages until the
      final bio in the chain completes and the ioend can iterate all the
      chained bios and complete them.
      
      OTOH, if we have a physically fragmented file, we end up submitting
      one ioend per physical fragment that each have a small bio or bio
      chain attached to them. We do not chain these at IO submission time,
      but instead we chain them at completion time based on file
      offset via iomap_ioend_try_merge(). Hence we can end up with unbound
      ioend chains being built via completion merging.
      
      XFS can then do COW remapping or unwritten extent conversion on that
      merged chain, which involves walking an extent fragment at a time
      and running a transaction to modify the physical extent information.
      IOWs, we merge all the discontiguous ioends together into a
      contiguous file range, only to then process them individually as
      discontiguous extents.
      
      This extent manipulation is computationally expensive and can run in
      a tight loop, so merging logically contiguous but physically
      discontiguous ioends gains us nothing except hiding the fact that
      we broke the ioends up into individual physical extents at
      submission and then need to loop over those individual physical
      extents at completion.
      
      Hence we need to have mechanisms to limit ioend sizes and to break
      up completion processing of large merged ioend chains (a sketch
      follows the list):
      
      1. bio chains per ioend need to be bound in length. Pure overwrites
      go straight to iomap_finish_ioend() in softirq context with the
      exact bio chain attached to the ioend by submission. Hence the only
      way to prevent long holdoffs here is to bound ioend submission
      sizes because we can't reschedule in softirq context.
      
      2. iomap_finish_ioends() has to handle unbound merged ioend chains
      correctly. This relies on any one call to iomap_finish_ioend() being
      bound in runtime so that cond_resched() can be issued regularly as
      the long ioend chain is processed. i.e. this relies on mechanism #1
      to limit individual ioend sizes to work correctly.
      
      3. filesystems have to loop over the merged ioends to process
      physical extent manipulations. This means they can loop internally,
      and so we break merging at physical extent boundaries so the
      filesystem can easily insert reschedule points between individual
      extent manipulations.
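      
      A hedged sketch of mechanisms #1 and #2 (IOEND_BATCH_SIZE and the
      io_folios counter are names this sketch assumes the patch
      introduces, not the exact upstream diff):
      
      	#define IOEND_BATCH_SIZE	4096	/* #1: cap folios per ioend */
      
      	/* in iomap_can_add_to_ioend(): force a new ioend once full */
      	if (wpc->ioend->io_folios >= IOEND_BATCH_SIZE)
      		return false;
      
      	/* #2: iomap_finish_ioends() can now sleep between the bounded
      	 * ioends on a merged completion list */
      	list_replace_init(&ioend->io_list, &tmp);
      	iomap_finish_ioend(ioend, error);
      	while ((ioend = list_first_entry_or_null(&tmp,
      				struct iomap_ioend, io_list))) {
      		list_del_init(&ioend->io_list);
      		iomap_finish_ioend(ioend, error);
      		cond_resched();
      	}
      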
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reported-and-tested-by: Trond Myklebust <trondmy@hammerspace.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
  6. 20 Jan 2022 (1 commit)
    • xfs: flush inodegc workqueue tasks before cancel · 6191cf3a
      Committed by Brian Foster
      The xfs_inodegc_stop() helper performs a high level flush of pending
      work on the percpu queues and then runs a cancel_work_sync() on each
      of the percpu work tasks to ensure all work has completed before
      returning.  While cancel_work_sync() waits for wq tasks to complete,
      it does not guarantee work tasks have started. This means that the
      _stop() helper can queue and instantly cancel a wq task without
      having completed the associated work. This can be observed by
      tracepoint inspection of a simple "rm -f <file>; fsfreeze -f <mnt>"
      test:
      
      	xfs_destroy_inode: ... ino 0x83 ...
      	xfs_inode_set_need_inactive: ... ino 0x83 ...
      	xfs_inodegc_stop: ...
      	...
      	xfs_inodegc_start: ...
      	xfs_inodegc_worker: ...
      	xfs_inode_inactivating: ... ino 0x83 ...
      
      The first few lines show that the inode is removed and need inactive
      state set, but the inactivation work has not completed before the
      inodegc mechanism stops. The inactivation doesn't actually occur
      until the fs is unfrozen and the gc mechanism starts back up. Note
      that this test requires fsfreeze to reproduce because xfs_freeze
      indirectly invokes xfs_fs_statfs(), which calls xfs_inodegc_flush().
      
      When this occurs, the workqueue try_to_grab_pending() logic first
      tries to steal the pending bit, which does not succeed because the
      bit has been set by queue_work_on(). Subsequently, it checks for
      association of a pool workqueue from the work item under the pool
      lock. This association is set at the point a work item is queued and
      cleared when dequeued for processing. If the association exists, the
      work item is removed from the queue and cancel_work_sync() returns
      true. If the pwq association is cleared, the remove attempt assumes
      the task is busy and retries (eventually returning false to the
      caller after waiting for the work task to complete).
      
      To avoid this race, we can flush each work item explicitly before
      cancel. However, since the _queue_all() already schedules each
      underlying work item, the workqueue level helpers are sufficient to
      achieve the same ordering effect. E.g., the inodegc enabled flag
      prevents scheduling any further work in the _stop() case. Use the
      drain_workqueue() helper in this particular case to make the intent
      a bit more self-explanatory.
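      
      A hedged sketch of the resulting _stop() ordering (m_inodegc_wq is
      an assumed name for the inodegc workqueue):
      
      	void
      	xfs_inodegc_stop(struct xfs_mount *mp)
      	{
      		/* clearing the enabled flag stops new work being queued */
      		if (!xfs_clear_inodegc_enabled(mp))
      			return;
      
      		/* schedule anything still sitting on the percpu lists... */
      		xfs_inodegc_queue_all(mp);
      		/* ...then wait for all of it to actually run */
      		drain_workqueue(mp->m_inodegc_wq);
      	}
      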
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
  7. 19 Jan 2022 (1 commit)
  8. 18 Jan 2022 (3 commits)
  9. 15 Jan 2022 (1 commit)
    • mm: introduce memalloc_retry_wait() · 4034247a
      Committed by NeilBrown
      Various places in the kernel - largely in filesystems - respond to a
      memory allocation failure by looping around and re-trying.  Some of
      these cannot conveniently use __GFP_NOFAIL, for reasons such as:
      
       - a GFP_ATOMIC allocation, which __GFP_NOFAIL doesn't work on
       - a need to check for the process being signalled between failures
       - the possibility that other recovery actions could be performed
       - the allocation is quite deep in support code, and passing down an
         extra flag to say if __GFP_NOFAIL is wanted would be clumsy.
      
      Many of these currently use congestion_wait() which (in almost all
      cases) simply waits the given timeout - congestion isn't tracked for
      most devices.
      
      It isn't clear what the best delay is for loops, but it is clear that
      the various filesystems shouldn't be responsible for choosing a timeout.
      
      This patch introduces memalloc_retry_wait(), which takes on that
      responsibility.  Code that wants to retry a memory allocation can call
      this function, passing the GFP flags that were used.  It will wait
      however long is appropriate.
      
      For now, it only considers __GFP_NORETRY and whatever
      gfpflags_allow_blocking() tests.  If blocking is allowed without
      __GFP_NORETRY, then alloc_page either made some reclaim progress, or
      waited for a while, before failing.  So there is no need for much
      further waiting.  memalloc_retry_wait() will wait until the current
      jiffie ends.  If this condition is not met, then alloc_page() won't have
      waited much if at all.  In that case memalloc_retry_wait() waits about
      200ms.  This is the delay that most current loops use.
      
      linux/sched/mm.h needs to be included in some files now,
      but linux/backing-dev.h does not.
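      
      As a usage illustration, a hedged sketch of a filesystem retry loop
      converted to the new helper (the surrounding function is invented
      for the example):
      
      	#include <linux/sched/mm.h>
      	#include <linux/slab.h>
      
      	static void *alloc_eventually(size_t size, gfp_t gfp)
      	{
      		void *p;
      
      		do {
      			p = kmalloc(size, gfp);
      			if (!p)
      				/* replaces congestion_wait(); derives the
      				 * appropriate delay from the gfp flags */
      				memalloc_retry_wait(gfp);
      		} while (!p);
      
      		return p;
      	}
      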
      
      Link: https://lkml.kernel.org/r/163754371968.13692.1277530886009912421@noble.neil.brown.name
      Signed-off-by: NeilBrown <neilb@suse.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 13 Jan 2022 (1 commit)
    • xfs: fix online fsck handling of v5 feature bits on secondary supers · 4a9bca86
      Committed by Darrick J. Wong
      While I was auditing the code in xfs_repair that adds feature bits to
      existing V5 filesystems, I decided to have a look at how online fsck
      handles feature bits, and I found a few problems (a sketch follows
      the list):
      
      1) ATTR2 is added to the primary super when an xattr is set to a file,
      but that isn't consistently propagated to secondary supers.  This isn't
      a corruption, merely a discrepancy that repair will fix if it ever has
      to restore the primary from a secondary.  Hence, if we find a mismatch
      on a secondary, this is a preen condition, not a corruption.
      
      2) There are more compat and ro_compat features now than there used to
      be, but we mask off the newer features from testing.  This means we
      ignore inconsistencies in the INOBTCOUNT and BIGTIME features, which is
      wrong.  Get rid of the masking and compare directly.
      
      3) NEEDSREPAIR, when set on a secondary, is ignored by everyone.  Hence
      a mismatch here should also be flagged for preening, and online repair
      should clear the flag.  Right now we ignore it due to (2).
      
      4) log_incompat features are ephemeral, since we can clear the feature
      bit as soon as the log no longer contains live records for a particular
      log feature.  As such, the only copy we care about is the one in the
      primary super.  If we find any bits set in the secondary super, we
      should flag that for preening, and clear the bits if the user elects to
      repair it.
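      
      A hedged sketch of the checks (field and helper names follow the
      scrub code; this is illustrative, not the exact diff):
      
      	struct xfs_dsb	*sb = bp->b_addr;	/* ondisk secondary super */
      
      	/* (2) compare feature words directly instead of masking */
      	if (sb->sb_features_compat !=
      	    cpu_to_be32(mp->m_sb.sb_features_compat))
      		xchk_block_set_preen(sc, bp);
      
      	/* (4) log_incompat bits only matter on the primary super */
      	if (sb->sb_features_log_incompat)
      		xchk_block_set_preen(sc, bp);
      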
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
  11. 12 Jan 2022 (1 commit)
    • xfs: take the ILOCK when readdir inspects directory mapping data · 65552b02
      Committed by Darrick J. Wong
      I was poking around in the directory code while diagnosing online fsck
      bugs, and noticed that xfs_readdir doesn't actually take the directory
      ILOCK when it calls xfs_dir2_isblock.  xfs_dir_open most probably loaded
      the data fork mappings and the VFS took i_rwsem (aka IOLOCK_SHARED) so
      we're protected against writer threads, but we really need to follow the
      locking model like we do in other places.
      
      To avoid unnecessarily cycling the ILOCK for fairly small directories,
      change the block/leaf _getdents functions to consume the ILOCK hold that
      the parent readdir function took to decide on a _getdents implementation.
      
      It is ok to cycle the ILOCK in readdir because the VFS takes the IOLOCK
      in the appropriate mode during lookups and writes, and we don't want to
      be holding the ILOCK when we copy directory entries to userspace in case
      there's a page fault.  We really only need it to protect against data
      fork lookups, like we do for other files.
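      
      A hedged sketch of the flow described above (the lock_mode parameter
      threading is an assumption of this sketch, not the exact diff):
      
      	lock_mode = xfs_ilock_data_map_shared(dp);
      	error = xfs_dir2_isblock(&args, &isblock);
      	if (error) {
      		xfs_iunlock(dp, lock_mode);
      		return error;
      	}
      
      	if (isblock)
      		/* consumes (and eventually drops) the ILOCK hold above */
      		return xfs_dir2_block_getdents(&args, ctx, lock_mode);
      	return xfs_dir2_leaf_getdents(&args, ctx, bufsize, lock_mode);
      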
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
  12. 07 Jan 2022 (5 commits)
    • xfs: warn about inodes with project id of -1 · 7e937bb3
      Committed by Darrick J. Wong
      Inodes aren't supposed to have a project id of -1U (aka 4294967295) but
      the kernel hasn't always validated FSSETXATTR correctly.  Flag this as
      something for the sysadmin to check out.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: hold quota inode ILOCK_EXCL until the end of dqalloc · eae44cb3
      Committed by Darrick J. Wong
      Online fsck depends on callers holding ILOCK_EXCL from the time they
      decide to update a block mapping until after they've updated the reverse
      mapping records to guarantee the stability of both mapping records.
      Unfortunately, the quota code drops ILOCK_EXCL at the first transaction
      roll in the dquot allocation process, which breaks that assertion.  This
      leads to sporadic failures in the online rmap repair code if the repair
      code grabs the AGF after bmapi_write maps a new block into the quota
      file's data fork but before it can finish the deferred rmap update.
      
      Fix this by rewriting the function to hold the ILOCK until after the
      transaction commit like all other bmap updates do, and get rid of the
      dqread wrapper that does nothing but complicate the codebase.
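      
      A hedged sketch of the locking shape (the 0 join flag means the
      caller keeps the ILOCK across the commit; details simplified):
      
      	xfs_ilock(quotip, XFS_ILOCK_EXCL);
      	xfs_trans_ijoin(tp, quotip, 0);	/* do not auto-unlock on commit */
      
      	/* xfs_bmapi_write() maps the dquot block; the deferred rmap
      	 * update is finished as part of the same commit */
      
      	error = xfs_trans_commit(tp);
      	xfs_iunlock(quotip, XFS_ILOCK_EXCL);
      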
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: Remove redundant assignment of mp · f4901a18
      Committed by Jiapeng Chong
      mp is being initialized to log->l_mp, but that value is never read
      because mp is overwritten later on.  Remove the redundant
      assignment.
      
      Cleans up the following clang-analyzer warning:
      
      fs/xfs/xfs_log_recover.c:3543:20: warning: Value stored to 'mp' during
      its initialization is never read [clang-analyzer-deadcode.DeadStores].
      Reported-by: Abaci Robot <abaci@linux.alibaba.com>
      Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: reduce kvmalloc overhead for CIL shadow buffers · 8dc9384b
      Committed by Dave Chinner
      Oh, let me count the ways that the kvmalloc API sucks dog eggs.
      
      The problem is when we are logging lots of large objects, we hit
      kvmalloc really damn hard with costly order allocations, and
      behaviour utterly sucks:
      
           - 49.73% xlog_cil_commit
      	 - 31.62% kvmalloc_node
      	    - 29.96% __kmalloc_node
      	       - 29.38% kmalloc_large_node
      		  - 29.33% __alloc_pages
      		     - 24.33% __alloc_pages_slowpath.constprop.0
      			- 18.35% __alloc_pages_direct_compact
      			   - 17.39% try_to_compact_pages
      			      - compact_zone_order
      				 - 15.26% compact_zone
      				      5.29% __pageblock_pfn_to_page
      				      3.71% PageHuge
      				    - 1.44% isolate_migratepages_block
      					 0.71% set_pfnblock_flags_mask
      				   1.11% get_pfnblock_flags_mask
      			   - 0.81% get_page_from_freelist
      			      - 0.59% _raw_spin_lock_irqsave
      				 - do_raw_spin_lock
      				      __pv_queued_spin_lock_slowpath
      			- 3.24% try_to_free_pages
      			   - 3.14% shrink_node
      			      - 2.94% shrink_slab.constprop.0
      				 - 0.89% super_cache_count
      				    - 0.66% xfs_fs_nr_cached_objects
      				       - 0.65% xfs_reclaim_inodes_count
      					    0.55% xfs_perag_get_tag
      				   0.58% kfree_rcu_shrink_count
      			- 2.09% get_page_from_freelist
      			   - 1.03% _raw_spin_lock_irqsave
      			      - do_raw_spin_lock
      				   __pv_queued_spin_lock_slowpath
      		     - 4.88% get_page_from_freelist
      			- 3.66% _raw_spin_lock_irqsave
      			   - do_raw_spin_lock
      				__pv_queued_spin_lock_slowpath
      	    - 1.63% __vmalloc_node
      	       - __vmalloc_node_range
      		  - 1.10% __alloc_pages_bulk
      		     - 0.93% __alloc_pages
      			- 0.92% get_page_from_freelist
      			   - 0.89% rmqueue_bulk
      			      - 0.69% _raw_spin_lock
      				 - do_raw_spin_lock
      				      __pv_queued_spin_lock_slowpath
      	   13.73% memcpy_erms
      	 - 2.22% kvfree
      
      On this workload, that's almost a dozen CPUs all trying to compact
      and reclaim memory inside kvmalloc_node at the same time. Yet it is
      regularly falling back to vmalloc despite all that compaction, page
      and shrinker reclaim that direct reclaim is doing. Copying all the
      metadata is taking far less CPU time than allocating the storage!
      
      Direct reclaim should be considered extremely harmful.
      
      This is a high frequency, high throughput, CPU usage and latency
      sensitive allocation. We've got memory there, and we're using
      kvmalloc to allow memory allocation to avoid doing lots of work to
      try to do contiguous allocations.
      
      Except it still does *lots of costly work* that is unnecessary.
      
      Worse: the only way to avoid the slowpath page allocation trying to
      do compaction on costly allocations is to turn off direct reclaim
      (i.e. remove __GFP_RECLAIM_DIRECT from the gfp flags).
      
      Unfortunately, the stupid kvmalloc API then says "oh, this isn't a
      GFP_KERNEL allocation context, so you only get kmalloc!". This
      cuts off the vmalloc fallback, and this leads to almost instant OOM
      problems, which end up in filesystem deadlocks, shutdowns and/or
      kernel crashes.
      
      I want some basic kvmalloc behaviour (sketched after this list):
      
      - kmalloc for a contiguous range with fail fast semantics - no
        compaction direct reclaim if the allocation enters the slow path.
      - run normal vmalloc (i.e. GFP_KERNEL) if kmalloc fails
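      
      A sketch of the open-coded behaviour (xlog_kvmalloc is the helper
      name this sketch assumes the patch introduces in the CIL commit
      path):
      
      	static inline void *
      	xlog_kvmalloc(size_t buf_size)
      	{
      		gfp_t	flags = GFP_KERNEL;
      		void	*p;
      
      		/* fail fast: no direct reclaim, no compaction, no warning */
      		flags &= ~__GFP_DIRECT_RECLAIM;
      		flags |= __GFP_NOWARN | __GFP_NORETRY;
      
      		do {
      			p = kmalloc(buf_size, flags);
      			if (!p)
      				p = vmalloc(buf_size);	/* plain GFP_KERNEL vmalloc */
      		} while (!p);
      
      		return p;
      	}
      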
      
      The really, really stupid part about this is these kvmalloc() calls
      are run under memalloc_nofs task context, so all the allocations are
      always reduced to GFP_NOFS regardless of the fact that kvmalloc
      requires GFP_KERNEL to be passed in. IOWs, we're already telling
      kvmalloc to behave differently to the gfp flags we pass in, but it
      still won't allow vmalloc to be run with anything other than
      GFP_KERNEL.
      
      So, this patch open codes the kvmalloc() in the commit path to have
      the behaviour described above. The result is that we more than halve
      the CPU time spent doing kvmalloc() in this path, and the rate of
      transaction commits with 64kB objects in them more than doubles. i.e. we get ~5x
      reduction in CPU usage per costly-sized kvmalloc() invocation and
      the profile looks like this:
      
        - 37.60% xlog_cil_commit
      	16.01% memcpy_erms
            - 8.45% __kmalloc
      	 - 8.04% kmalloc_order_trace
      	    - 8.03% kmalloc_order
      	       - 7.93% alloc_pages
      		  - 7.90% __alloc_pages
      		     - 4.05% __alloc_pages_slowpath.constprop.0
      			- 2.18% get_page_from_freelist
      			- 1.77% wake_all_kswapds
      ....
      				    - __wake_up_common_lock
      				       - 0.94% _raw_spin_lock_irqsave
      		     - 3.72% get_page_from_freelist
      			- 2.43% _raw_spin_lock_irqsave
            - 5.72% vmalloc
      	 - 5.72% __vmalloc_node_range
      	    - 4.81% __get_vm_area_node.constprop.0
      	       - 3.26% alloc_vmap_area
      		  - 2.52% _raw_spin_lock
      	       - 1.46% _raw_spin_lock
      	      0.56% __alloc_pages_bulk
            - 4.66% kvfree
      	 - 3.25% vfree
      	    - __vfree
      	       - 3.23% __vunmap
      		  - 1.95% remove_vm_area
      		     - 1.06% free_vmap_area_noflush
      			- 0.82% _raw_spin_lock
      		     - 0.68% _raw_spin_lock
      		  - 0.92% _raw_spin_lock
      	 - 1.40% kfree
      	    - 1.36% __free_pages
      	       - 1.35% __free_pages_ok
      		  - 1.02% _raw_spin_lock_irqsave
      
      It's worth noting that over 50% of the CPU time spent allocating
      these shadow buffers is now spent on spinlocks. So the shadow buffer
      allocation overhead is greatly reduced by getting rid of direct
      reclaim from kmalloc, and could probably be made even less costly if
      vmalloc() didn't use global spinlocks to protect its structures.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: sysfs: use default_groups in kobj_type · 219aac5d
      Committed by Greg Kroah-Hartman
      There are currently 2 ways to create a set of sysfs files for a
      kobj_type: through the default_attrs field and the default_groups
      field.  Move the xfs sysfs code to use the default_groups field, which
      has been the preferred way since aa30f47c ("kobject: Add support for
      default attribute groups to kobj_type") so that we can soon get rid of
      the obsolete default_attrs field.
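      
      A hedged sketch of the conversion pattern for one of the xfs ktypes
      (the attribute name is illustrative):
      
      	static struct attribute *xfs_mp_attrs[] = {
      		ATTR_LIST(fail_at_unmount),
      		NULL,
      	};
      	ATTRIBUTE_GROUPS(xfs_mp);	/* generates xfs_mp_groups */
      
      	static struct kobj_type xfs_mp_ktype = {
      		.release	= xfs_sysfs_release,
      		.sysfs_ops	= &xfs_sysfs_ops,
      		.default_groups	= xfs_mp_groups,	/* was: .default_attrs */
      	};
      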
      
      Cc: "Darrick J. Wong" <djwong@kernel.org>
      Cc: linux-xfs@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
  13. 23 Dec 2021 (2 commits)
    • xfs: map unwritten blocks in XFS_IOC_{ALLOC,FREE}SP just like fallocate · 983d8e60
      Committed by Darrick J. Wong
      The old ALLOCSP/FREESP ioctls in XFS can be used to preallocate space at
      the end of files, just like fallocate and RESVSP.  Make the behavior
      consistent with the other ioctls.
      Reported-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
    • xfs: prevent UAF in xfs_log_item_in_current_chkpt · f8d92a66
      Committed by Darrick J. Wong
      While I was running with KASAN and lockdep enabled, I stumbled upon a
      KASAN report about a UAF to a freed CIL checkpoint.  Looking at the
      comment for xfs_log_item_in_current_chkpt, it seems pretty obvious to me
      that the original patch to xfs_defer_finish_noroll should have done
      something to lock the CIL to prevent it from switching the CIL contexts
      while the predicate runs.
      
      For upper level code that needs to know if a given log item is new
      enough not to need relogging, add a new wrapper that takes the CIL
      context lock long enough to sample the current CIL context.  This is
      kind of racy in that the CIL can switch the contexts immediately after
      sampling, but that's ok because the consequence is that the defer ops
      code is a little slow to relog items.
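      
      A hedged sketch of the wrapper (field names follow the CIL code; the
      exact upstream fix may differ in detail):
      
      	bool
      	xfs_log_item_in_current_chkpt(struct xfs_log_item *lip)
      	{
      		struct xfs_cil	*cil = lip->li_mountp->m_log->l_cilp;
      		xfs_csn_t	seq;
      
      		if (list_empty(&lip->li_cil))
      			return false;
      
      		/* hold the context lock only to sample the sequence */
      		down_read(&cil->xc_ctx_lock);
      		seq = cil->xc_ctx->sequence;
      		up_read(&cil->xc_ctx_lock);
      
      		/* racy after unlock; worst case we relog an item late */
      		return lip->li_seq == seq;
      	}
      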
      
       ==================================================================
       BUG: KASAN: use-after-free in xfs_log_item_in_current_chkpt+0x139/0x160 [xfs]
       Read of size 8 at addr ffff88804ea5f608 by task fsstress/527999
      
       CPU: 1 PID: 527999 Comm: fsstress Tainted: G      D      5.16.0-rc4-xfsx #rc4
       Call Trace:
        <TASK>
        dump_stack_lvl+0x45/0x59
        print_address_description.constprop.0+0x1f/0x140
        kasan_report.cold+0x83/0xdf
        xfs_log_item_in_current_chkpt+0x139/0x160
        xfs_defer_finish_noroll+0x3bb/0x1e30
        __xfs_trans_commit+0x6c8/0xcf0
        xfs_reflink_remap_extent+0x66f/0x10e0
        xfs_reflink_remap_blocks+0x2dd/0xa90
        xfs_file_remap_range+0x27b/0xc30
        vfs_dedupe_file_range_one+0x368/0x420
        vfs_dedupe_file_range+0x37c/0x5d0
        do_vfs_ioctl+0x308/0x1260
        __x64_sys_ioctl+0xa1/0x170
        do_syscall_64+0x35/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xae
       RIP: 0033:0x7f2c71a2950b
       Code: 0f 1e fa 48 8b 05 85 39 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff
      ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01
      f0 ff ff 73 01 c3 48 8b 0d 55 39 0d 00 f7 d8 64 89 01 48
       RSP: 002b:00007ffe8c0e03c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
       RAX: ffffffffffffffda RBX: 00005600862a8740 RCX: 00007f2c71a2950b
       RDX: 00005600862a7be0 RSI: 00000000c0189436 RDI: 0000000000000004
       RBP: 000000000000000b R08: 0000000000000027 R09: 0000000000000003
       R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000005a
       R13: 00005600862804a8 R14: 0000000000016000 R15: 00005600862a8a20
        </TASK>
      
       Allocated by task 464064:
        kasan_save_stack+0x1e/0x50
        __kasan_kmalloc+0x81/0xa0
        kmem_alloc+0xcd/0x2c0 [xfs]
        xlog_cil_ctx_alloc+0x17/0x1e0 [xfs]
        xlog_cil_push_work+0x141/0x13d0 [xfs]
        process_one_work+0x7f6/0x1380
        worker_thread+0x59d/0x1040
        kthread+0x3b0/0x490
        ret_from_fork+0x1f/0x30
      
       Freed by task 51:
        kasan_save_stack+0x1e/0x50
        kasan_set_track+0x21/0x30
        kasan_set_free_info+0x20/0x30
        __kasan_slab_free+0xed/0x130
        slab_free_freelist_hook+0x7f/0x160
        kfree+0xde/0x340
        xlog_cil_committed+0xbfd/0xfe0 [xfs]
        xlog_cil_process_committed+0x103/0x1c0 [xfs]
        xlog_state_do_callback+0x45d/0xbd0 [xfs]
        xlog_ioend_work+0x116/0x1c0 [xfs]
        process_one_work+0x7f6/0x1380
        worker_thread+0x59d/0x1040
        kthread+0x3b0/0x490
        ret_from_fork+0x1f/0x30
      
       Last potentially related work creation:
        kasan_save_stack+0x1e/0x50
        __kasan_record_aux_stack+0xb7/0xc0
        insert_work+0x48/0x2e0
        __queue_work+0x4e7/0xda0
        queue_work_on+0x69/0x80
        xlog_cil_push_now.isra.0+0x16b/0x210 [xfs]
        xlog_cil_force_seq+0x1b7/0x850 [xfs]
        xfs_log_force_seq+0x1c7/0x670 [xfs]
        xfs_file_fsync+0x7c1/0xa60 [xfs]
        __x64_sys_fsync+0x52/0x80
        do_syscall_64+0x35/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      
       The buggy address belongs to the object at ffff88804ea5f600
        which belongs to the cache kmalloc-256 of size 256
       The buggy address is located 8 bytes inside of
        256-byte region [ffff88804ea5f600, ffff88804ea5f700)
       The buggy address belongs to the page:
       page:ffffea00013a9780 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88804ea5ea00 pfn:0x4ea5e
       head:ffffea00013a9780 order:1 compound_mapcount:0
       flags: 0x4fff80000010200(slab|head|node=1|zone=1|lastcpupid=0xfff)
       raw: 04fff80000010200 ffffea0001245908 ffffea00011bd388 ffff888004c42b40
       raw: ffff88804ea5ea00 0000000000100009 00000001ffffffff 0000000000000000
       page dumped because: kasan: bad access detected
      
       Memory state around the buggy address:
        ffff88804ea5f500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
        ffff88804ea5f580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
       >ffff88804ea5f600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                             ^
        ffff88804ea5f680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ffff88804ea5f700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
       ==================================================================
      
      Fixes: 4e919af7 ("xfs: periodically relog deferred intent items")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
  14. 22 Dec 2021 (8 commits)
    • xfs: prevent a WARN_ONCE() in xfs_ioc_attr_list() · 6ed6356b
      Committed by Dan Carpenter
      The "bufsize" comes from the root user.  If "bufsize" is negative then,
      because of type promotion, neither of the validation checks at the start
      of the function are able to catch it:
      
      	if (bufsize < sizeof(struct xfs_attrlist) ||
      	    bufsize > XFS_XATTR_LIST_MAX)
      		return -EINVAL;
      
      This means "bufsize" will trigger (WARN_ON_ONCE(size > INT_MAX)) in
      kvmalloc_node().  Fix this by changing the type from int to size_t.
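      
      A hedged illustration of the promotion problem and the fix (context
      simplified from the function described above):
      
      	size_t bufsize;		/* was: int bufsize */
      
      	/*
      	 * With int, a negative bufsize passes both checks: compared
      	 * against sizeof() it converts to a huge size_t (not less), and
      	 * compared against the int limit it is negative (not greater).
      	 * With size_t, the negative value becomes huge and the upper
      	 * bound check now rejects it with -EINVAL.
      	 */
      	if (bufsize < sizeof(struct xfs_attrlist) ||
      	    bufsize > XFS_XATTR_LIST_MAX)
      		return -EINVAL;
      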
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: Fix comments mentioning xfs_ialloc · 132c460e
      Committed by Yang Xu
      Since kernel commit 1abcf261 ("xfs: move on-disk inode allocation out of xfs_ialloc()"),
      xfs_ialloc() has been renamed to xfs_init_new_inode().  Update the comments accordingly.
      Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: check sb_meta_uuid for dabuf buffer recovery · 09654ed8
      Committed by Dave Chinner
      Got a report that a repeated crash test of a container host would
      eventually fail with a log recovery error preventing the system from
      mounting the root filesystem. It manifested as a directory leaf node
      corruption on writeback like so:
      
       XFS (loop0): Mounting V5 Filesystem
       XFS (loop0): Starting recovery (logdev: internal)
       XFS (loop0): Metadata corruption detected at xfs_dir3_leaf_check_int+0x99/0xf0, xfs_dir3_leaf1 block 0x12faa158
       XFS (loop0): Unmount and run xfs_repair
       XFS (loop0): First 128 bytes of corrupted metadata buffer:
       00000000: 00 00 00 00 00 00 00 00 3d f1 00 00 e1 9e d5 8b  ........=.......
       00000010: 00 00 00 00 12 fa a1 58 00 00 00 29 00 00 1b cc  .......X...)....
       00000020: 91 06 78 ff f7 7e 4a 7d 8d 53 86 f2 ac 47 a8 23  ..x..~J}.S...G.#
       00000030: 00 00 00 00 17 e0 00 80 00 43 00 00 00 00 00 00  .........C......
       00000040: 00 00 00 2e 00 00 00 08 00 00 17 2e 00 00 00 0a  ................
       00000050: 02 35 79 83 00 00 00 30 04 d3 b4 80 00 00 01 50  .5y....0.......P
       00000060: 08 40 95 7f 00 00 02 98 08 41 fe b7 00 00 02 d4  .@.......A......
       00000070: 0d 62 ef a7 00 00 01 f2 14 50 21 41 00 00 00 0c  .b.......P!A....
       XFS (loop0): Corruption of in-memory data (0x8) detected at xfs_do_force_shutdown+0x1a/0x20 (fs/xfs/xfs_buf.c:1514).  Shutting down.
       XFS (loop0): Please unmount the filesystem and rectify the problem(s)
       XFS (loop0): log mount/recovery failed: error -117
       XFS (loop0): log mount failed
      
      Tracing indicated that we were recovering changes from a transaction
      at LSN 0x29/0x1c16 into a buffer that had an LSN of 0x29/0x1d57.
      That is, log recovery was overwriting a buffer with newer changes on
      disk than was in the transaction. Tracing indicated that we were
      hitting the "recovery immediately" case in
      xfs_buf_log_recovery_lsn(), and hence it was ignoring the LSN in the
      buffer.
      
      The code was extracting the LSN correctly, then ignoring it because
      the UUID in the buffer did not match the superblock UUID. The
      problem arises because the UUID check uses the wrong UUID - it
      should be checking the sb_meta_uuid, not sb_uuid. This filesystem
      has sb_uuid != sb_meta_uuid (which is fine), and the buffer has the
      correct matching sb_meta_uuid in it, it's just the code checked it
      against the wrong superblock uuid.
      
      There is no corruption in the filesystem, and failing to recover the
      buffer due to a write verifier failure means the recovery bug did
      not propagate the corruption to disk. Hence there is no corruption
      before or after this bug has manifested, the impact is limited
      simply to an unmountable filesystem....
      
      This was missed back in 2015 during an audit of incorrect sb_uuid
      usage that resulted in commit fcfbe2c4 ("xfs: log recovery needs
      to validate against sb_meta_uuid") that fixed the magic32 buffers to
      validate against sb_meta_uuid instead of sb_uuid. It missed the
      magicda buffers....
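      
      A hedged sketch of the one-line fix (the surrounding logic is the
      buffer recovery LSN check described above):
      
      	/* after extracting lsn and uuid from the da blkinfo cases: */
      	if (lsn != (xfs_lsn_t)-1) {
      		if (!uuid_equal(&mp->m_sb.sb_meta_uuid, uuid))	/* was sb_uuid */
      			goto recover_immediately;
      		return lsn;
      	}
      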
      
      Fixes: ce748eaa ("xfs: create new metadata UUID field and incompat flag")
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: fix a bug in the online fsck directory leaf1 bestcount check · e5d1802c
      Committed by Darrick J. Wong
      When xfs_scrub encounters a directory with a leaf1 block, it tries to
      validate that the leaf1 block's bestcount (aka the best free count of
      each directory data block) is the correct size.  Previously, this author
      believed that comparing bestcount to the directory isize (since
      directory data blocks are under isize, and leaf/bestfree blocks are
      above it) was sufficient.
      
      Unfortunately during testing of online repair, it was discovered that it
      is possible to create a directory with a hole between the last directory
      block and isize.  The directory code seems to handle this situation just
      fine and xfs_repair doesn't complain, which effectively makes this quirk
      part of the disk format.
      
      Fix the check to work properly.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: only run COW extent recovery when there are no live extents · 7993f1a4
      Committed by Darrick J. Wong
      As part of multiple customer escalations due to file data corruption
      after copy on write operations, I wrote some fstests that use fsstress
      to hammer on COW to shake things loose.  Regrettably, I caught some
      filesystem shutdowns due to incorrect rmap operations with the following
      loop:
      
      mount <filesystem>				# (0)
      fsstress <run only readonly ops> &		# (1)
      while true; do
      	fsstress <run all ops>
      	mount -o remount,ro			# (2)
      	fsstress <run only readonly ops>
      	mount -o remount,rw			# (3)
      done
      
      When (2) happens, notice that (1) is still running.  xfs_remount_ro will
      call xfs_blockgc_stop to walk the inode cache to free all the COW
      extents, but the blockgc mechanism races with (1)'s reader threads to
      take IOLOCKs and loses, which means that it doesn't clean them all out.
      Call such a file (A).
      
      When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
      walks the ondisk refcount btree and frees any COW extent that it finds.
      This function does not check the inode cache, which means that incore
      COW forks of inode (A) is now inconsistent with the ondisk metadata.  If
      one of those former COW extents are allocated and mapped into another
      file (B) and someone triggers a COW to the stale reservation in (A), A's
      dirty data will be written into (B) and once that's done, those blocks
      will be transferred to (A)'s data fork without bumping the refcount.
      
      The results are catastrophic -- file (B) and the refcount btree are now
      corrupt.  In the first patch, we fixed the race condition in (2) so that
      (A) will always flush the COW fork.  In this second patch, we move the
      _recover_cow call to the initial mount call in (0) for safety.
      
      As mentioned previously, xfs_reflink_recover_cow walks the refcount
      btree looking for COW staging extents, and frees them.  This was
      intended to be run at mount time (when we know there are no live inodes)
      to clean up any leftover staging events that may have been left behind
      during an unclean shutdown.  As a time "optimization" for readonly
      mounts, we deferred this to the ro->rw transition, not realizing that
      any failure to clean all COW forks during a rw->ro transition would
      result in catastrophic corruption.
      
      Therefore, remove this optimization and only run the recovery routine
      when we're guaranteed not to have any COW staging extents anywhere,
      which means we always run this at mount time.  While we're at it, move
      the callsite to xfs_log_mount_finish because any refcount btree
      expansion (however unlikely given that we're removing records from the
      right side of the index) must be fed by a per-AG reservation, which
      doesn't exist in its current location.
      
      Fixes: 174edb0e ("xfs: store in-progress CoW allocations in the refcount btree")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: don't expose internal symlink metadata buffers to the vfs · 7b7820b8
      Committed by Darrick J. Wong
      Ian Kent reported that for inline symlinks, it's possible for
      vfs_readlink to hang on to the target buffer returned by
      _vn_get_link_inline long after it's been freed by xfs inode reclaim.
      This is a layering violation -- we should never expose XFS internals to
      the VFS.
      
      When the symlink has a remote target, we allocate a separate buffer,
      copy the internal information, and let the VFS manage the new buffer's
      lifetime.  Let's adapt the inline code paths to do this too.  It's
      less efficient, but fixes the layering violation and avoids the need to
      adapt the if_data lifetime to rcu rules.  Clearly I don't care about
      readlink benchmarks.
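      
      A hedged sketch of the resulting get_link path (names follow the xfs
      code, but this is illustrative rather than the exact diff):
      
      	static const char *
      	xfs_vn_get_link(struct dentry *dentry, struct inode *inode,
      			struct delayed_call *done)
      	{
      		char	*link;
      		int	error;
      
      		if (!dentry)
      			return ERR_PTR(-ECHILD);
      
      		link = kmalloc(XFS_SYMLINK_MAXLEN + 1, GFP_KERNEL);
      		if (!link)
      			return ERR_PTR(-ENOMEM);
      
      		error = xfs_readlink(XFS_I(d_inode(dentry)), link);
      		if (error) {
      			kfree(link);
      			return ERR_PTR(error);
      		}
      
      		/* the VFS now owns the buffer and frees it via kfree_link */
      		set_delayed_call(done, kfree_link, link);
      		return link;
      	}
      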
      
      As a side note, this fixes the minor locking violation where we can
      access the inode data fork without taking any locks; proper locking (and
      eliminating the possibility of having to switch inode_operations on a
      live inode) is essential to online repair coordinating repairs
      correctly.
      Reported-by: Ian Kent <raven@themaw.net>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: fix quotaoff mutex usage now that we don't support disabling it · 59d7fab2
      Committed by Darrick J. Wong
      Prior to commit 40b52225 ("xfs: remove support for disabling quota
      accounting on a mounted file system"), we used the quotaoff mutex to
      protect dquot operations against quotaoff trying to pull down dquots as
      part of disabling quota.
      
      Now that we only support turning off quota enforcement, the quotaoff
      mutex only protects changes in m_qflags/sb_qflags.  We don't need it to
      protect dquots, which means we can remove it from setqlimits and the
      dquot scrub code.  While we're at it, fix the function that forces
      quotacheck, since it should have been taking the quotaoff mutex.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: shut down filesystem if we xfs_trans_cancel with deferred work items · 47a6df7c
      Committed by Darrick J. Wong
      While debugging some very strange rmap corruption reports in connection
      with the online directory repair code, I root-caused the error to the
      following incorrect sequence:
      
      <start repair transaction>
      <expand directory, causing a deferred rmap to be queued>
      <roll transaction>
      <cancel transaction>
      
      Obviously, we should have committed the transaction instead of
      cancelling it.  Thinking more broadly, however, xfs_trans_cancel should
      have warned us that we were throwing away a work item that we had already
      committed to performing.  Cancelling such a transaction is never correct,
      and we need to shut down the filesystem.
      
      Change xfs_trans_cancel to complain in the loudest manner if we're
      cancelling any transaction with deferred work items attached.
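      
      A hedged sketch of the check in xfs_trans_cancel() (flag plumbing
      simplified):
      
      	if (!list_empty(&tp->t_dfops)) {
      		/* cancelling committed work corrupts in-memory state */
      		ASSERT(xfs_is_shutdown(mp) || list_empty(&tp->t_dfops));
      		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
      		xfs_defer_cancel(tp);
      	}
      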
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
  15. 18 Dec 2021 (2 commits)
  16. 08 Dec 2021 (1 commit)
    • xfs: remove all COW fork extents when remounting readonly · 089558bc
      Committed by Darrick J. Wong
      As part of multiple customer escalations due to file data corruption
      after copy on write operations, I wrote some fstests that use fsstress
      to hammer on COW to shake things loose.  Regrettably, I caught some
      filesystem shutdowns due to incorrect rmap operations with the following
      loop:
      
      mount <filesystem>				# (0)
      fsstress <run only readonly ops> &		# (1)
      while true; do
      	fsstress <run all ops>
      	mount -o remount,ro			# (2)
      	fsstress <run only readonly ops>
      	mount -o remount,rw			# (3)
      done
      
      When (2) happens, notice that (1) is still running.  xfs_remount_ro will
      call xfs_blockgc_stop to walk the inode cache to free all the COW
      extents, but the blockgc mechanism races with (1)'s reader threads to
      take IOLOCKs and loses, which means that it doesn't clean them all out.
      Call such a file (A).
      
      When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
      walks the ondisk refcount btree and frees any COW extent that it finds.
      This function does not check the inode cache, which means that the incore
      COW fork of inode (A) is now inconsistent with the ondisk metadata.  If
      one of those former COW extents is allocated and mapped into another
      file (B) and someone triggers a COW to the stale reservation in (A), A's
      dirty data will be written into (B) and once that's done, those blocks
      will be transferred to (A)'s data fork without bumping the refcount.
      
      The results are catastrophic -- file (B) and the refcount btree are now
      corrupt.  Solve this race by forcing the xfs_blockgc_free_space to run
      synchronously, which causes xfs_icwalk to return to inodes that were
      skipped because the blockgc code couldn't take the IOLOCK.  This is safe
      to do here because the VFS has already prohibited new writer threads.
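      
      A hedged sketch of the fix in xfs_remount_ro(), matching the
      description above (a synchronous inode cache walk):
      
      	struct xfs_icwalk	icw = {
      		.icw_flags	= XFS_ICWALK_FLAG_SYNC,
      	};
      	int			error;
      
      	/* wait for blockgc to revisit inodes whose IOLOCK was contended */
      	error = xfs_blockgc_free_space(mp, &icw);
      	if (error)
      		return error;
      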
      
      Fixes: 10ddf64e ("xfs: remove leftover CoW reservations when remounting ro")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
  17. 05 Dec 2021 (5 commits)