1. 31 May, 2018 2 commits
  2. 16 May, 2018 3 commits
    • D
      xfs: clear sb->s_fs_info on mount failure · c9fbd7bb
      Dave Chinner committed
      We recently had an oops reported on a 4.14 kernel in
      xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage
      and so the m_perag_tree lookup walked into lala land.
      
      Essentially, the machine was under memory pressure when the mount
      was being run, xfs_fs_fill_super() failed after allocating the
      xfs_mount and attaching it to sb->s_fs_info. It then cleaned up and
      freed the xfs_mount, but the sb->s_fs_info field still pointed to
      the freed memory. Hence when the superblock shrinker then ran
      it fell off the bad pointer.
      
      With the superblock shrinker problem fixed at the VFS level, this
      stale s_fs_info pointer is still a problem - we use it
      unconditionally in ->put_super when the superblock is being torn
      down, and hence we can still trip over it after a ->fill_super
      call failure. Hence we need to clear s_fs_info if
      xfs_fs_fill_super() fails, and we need to check if it's valid in
      the places it can potentially be dereferenced after a ->fill_super
      failure.
      Signed-Off-By: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
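The failure path described in the commit above can be sketched in userspace C. This is a minimal model, not the kernel code: `demo_super_block`, `demo_mount`, `demo_fill_super()` and `demo_put_super()` are hypothetical stand-ins for the real VFS and XFS structures and callbacks. It only shows the two halves of the fix: clear the private pointer when setup fails, and check it before dereferencing during teardown.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct demo_mount { int ag_count; };
struct demo_super_block { void *s_fs_info; };

/* Models a ->fill_super that can fail after attaching its private data. */
int demo_fill_super(struct demo_super_block *sb, int fail_late)
{
    struct demo_mount *mp = calloc(1, sizeof(*mp));

    if (!mp)
        return -1;
    sb->s_fs_info = mp;

    if (fail_late) {
        /* Error path: free the private data AND clear the stale pointer. */
        free(mp);
        sb->s_fs_info = NULL;
        return -1;
    }

    mp->ag_count = 4;
    return 0;
}

/* Models a ->put_super that tolerates an earlier fill_super failure.
 * Returns 1 if it tore something down, 0 if there was nothing to do. */
int demo_put_super(struct demo_super_block *sb)
{
    struct demo_mount *mp = sb->s_fs_info;

    if (!mp)
        return 0;

    free(mp);
    sb->s_fs_info = NULL;
    return 1;
}
```

Without the `sb->s_fs_info = NULL` assignment in the error path, `demo_put_super()` would free the same memory a second time, which is the userspace analogue of the stale-pointer dereference the commit describes.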
    • D
      xfs: add mount delay debug option · dae5cd81
      Dave Chinner committed
      Similar to log_recovery_delay, this delay occurs between the VFS
      superblock being initialised and the xfs_mount being fully
      initialised. It also poisons the per-ag radix tree node so that it
      can be used for triggering shrinker races during mount
      such as the following:
      
      <run memory pressure workload in background>
      
      $ cat dirty-mount.sh
      #! /bin/bash
      
      umount -f /dev/pmem0
      mkfs.xfs -f /dev/pmem0
      mount /dev/pmem0 /mnt/test
      rm -f /mnt/test/foo
      xfs_io -fxc "pwrite 0 4k" -c fsync -c "shutdown" /mnt/test/foo
      umount /dev/pmem0
      
      # let's crash it now!
      echo 30 > /sys/fs/xfs/debug/mount_delay
      mount /dev/pmem0 /mnt/test
      echo 0 > /sys/fs/xfs/debug/mount_delay
      umount /dev/pmem0
      $ sudo ./dirty-mount.sh
      .....
      [   60.378118] CPU: 3 PID: 3577 Comm: fs_mark Tainted: G      D W        4.16.0-rc5-dgc #440
      [   60.378120] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
      [   60.378124] RIP: 0010:radix_tree_next_chunk+0x76/0x320
      [   60.378127] RSP: 0018:ffffc9000276f4f8 EFLAGS: 00010282
      [   60.383670] RAX: a5a5a5a5a5a5a5a4 RBX: 0000000000000010 RCX: 000000000000001a
      [   60.385277] RDX: 0000000000000000 RSI: ffffc9000276f540 RDI: 0000000000000000
      [   60.386554] RBP: 0000000000000000 R08: 0000000000000000 R09: a5a5a5a5a5a5a5a5
      [   60.388194] R10: 0000000000000006 R11: 0000000000000001 R12: ffffc9000276f598
      [   60.389288] R13: 0000000000000040 R14: 0000000000000228 R15: ffff880816cd6458
      [   60.390827] FS:  00007f5c124b9740(0000) GS:ffff88083fc00000(0000) knlGS:0000000000000000
      [   60.392253] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   60.393423] CR2: 00007f5c11bba0b8 CR3: 000000035580e001 CR4: 00000000000606e0
      [   60.394519] Call Trace:
      [   60.395252]  radix_tree_gang_lookup_tag+0xc4/0x130
      [   60.395948]  xfs_perag_get_tag+0x37/0xf0
      [   60.396522]  xfs_reclaim_inodes_count+0x32/0x40
      [   60.397178]  xfs_fs_nr_cached_objects+0x11/0x20
      [   60.397837]  super_cache_count+0x35/0xc0
      [   60.399159]  shrink_slab.part.66+0xb1/0x370
      [   60.400194]  shrink_node+0x7e/0x1a0
      [   60.401058]  try_to_free_pages+0x199/0x470
      [   60.402081]  __alloc_pages_slowpath+0x3a1/0xd20
      [   60.403729]  __alloc_pages_nodemask+0x1c3/0x200
      [   60.404941]  cache_grow_begin+0x20b/0x2e0
      [   60.406164]  fallback_alloc+0x160/0x200
      [   60.407088]  kmem_cache_alloc+0x111/0x4e0
      [   60.408038]  ? xfs_buf_rele+0x61/0x430
      [   60.408925]  kmem_zone_alloc+0x61/0xe0
      [   60.409965]  xfs_inode_alloc+0x24/0x1d0
      .....
      Signed-Off-By: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • D
      xfs: halt auto-reclamation activities while rebuilding rmap · d6b636eb
      Darrick J. Wong committed
      Rebuilding the reverse-mapping tree requires us to quiesce all inodes in
      the filesystem, so we must stop background reclamation of post-EOF and
      CoW prealloc blocks.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
  3. 10 May, 2018 1 commit
  4. 10 April, 2018 1 commit
  5. 26 March, 2018 1 commit
  6. 12 March, 2018 2 commits
  7. 27 February, 2018 1 commit
  8. 02 February, 2018 3 commits
  9. 29 January, 2018 1 commit
  10. 09 January, 2018 1 commit
  11. 22 December, 2017 1 commit
  12. 28 November, 2017 1 commit
    • L
      Rename superblock flags (MS_xyz -> SB_xyz) · 1751e8a6
      Linus Torvalds committed
      This is a pure automated search-and-replace of the internal kernel
      superblock flags.
      
      The s_flags are now called SB_*, with the names and the values for the
      moment mirroring the MS_* flags that they're equivalent to.
      
      Note how the MS_xyz flags are the ones passed to the mount system call,
      while the SB_xyz flags are what we then use in sb->s_flags.
      
      The script to do this was:
      
          # places to look in; re security/*: it generally should *not* be
          # touched (that stuff parses mount(2) arguments directly), but
          # there are two places where we really deal with superblock flags.
          FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
                  include/linux/fs.h include/uapi/linux/bfs_fs.h \
                  security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
          # the list of MS_... constants
          SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
                DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
                POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
                I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
                ACTIVE NOUSER"
      
          SED_PROG=
          for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done
      
          # we want files that contain at least one of MS_...,
          # with fs/namespace.c and fs/pnode.c excluded.
          L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')
      
          for f in $L; do sed -i $f $SED_PROG; done
      Requested-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
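The split the commit above describes, MS_* for what is passed to the mount system call and SB_* for what is kept in `sb->s_flags`, can be illustrated with a small C sketch. All names carry a `DEMO_` prefix because they are stand-ins defined here for illustration, not the kernel's headers, and `demo_apply_mount_flags()` is a hypothetical helper. The only property taken from the commit message is that the SB_* values mirror the corresponding MS_* values for the moment.

```c
/* Flags as a caller would pass them to mount(2) (illustrative values). */
#define DEMO_MS_RDONLY 1UL
#define DEMO_MS_NOSUID 2UL

/* Internal superblock flags; for now they mirror the MS_* values. */
#define DEMO_SB_RDONLY DEMO_MS_RDONLY
#define DEMO_SB_NOSUID DEMO_MS_NOSUID

struct demo_super_block { unsigned long s_flags; };

/* Hypothetical helper: translate mount-call flags into superblock flags.
 * Keeping the two namespaces distinct means the values could diverge
 * later without touching the system call interface. */
void demo_apply_mount_flags(struct demo_super_block *sb,
                            unsigned long ms_flags)
{
    sb->s_flags = 0;
    if (ms_flags & DEMO_MS_RDONLY)
        sb->s_flags |= DEMO_SB_RDONLY;
    if (ms_flags & DEMO_MS_NOSUID)
        sb->s_flags |= DEMO_SB_NOSUID;
}
```

Code that inspects a superblock would then test `DEMO_SB_*` bits only, while code that parses mount arguments keeps using the `DEMO_MS_*` names.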
  13. 19 October, 2017 1 commit
  14. 26 September, 2017 1 commit
  15. 04 September, 2017 1 commit
  16. 01 September, 2017 1 commit
  17. 17 July, 2017 1 commit
    • D
      VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb) · bc98a42c
      David Howells committed
      Firstly by applying the following with coccinelle's spatch:
      
      	@@ expression SB; @@
      	-SB->s_flags & MS_RDONLY
      	+sb_rdonly(SB)
      
      to effect the conversion to sb_rdonly(sb), then by applying:
      
      	@@ expression A, SB; @@
      	(
      	-(!sb_rdonly(SB)) && A
      	+!sb_rdonly(SB) && A
      	|
      	-A != (sb_rdonly(SB))
      	+A != sb_rdonly(SB)
      	|
      	-A == (sb_rdonly(SB))
      	+A == sb_rdonly(SB)
      	|
      	-!(sb_rdonly(SB))
      	+!sb_rdonly(SB)
      	|
      	-A && (sb_rdonly(SB))
      	+A && sb_rdonly(SB)
      	|
      	-A || (sb_rdonly(SB))
      	+A || sb_rdonly(SB)
      	|
      	-(sb_rdonly(SB)) != A
      	+sb_rdonly(SB) != A
      	|
      	-(sb_rdonly(SB)) == A
      	+sb_rdonly(SB) == A
      	|
      	-(sb_rdonly(SB)) && A
      	+sb_rdonly(SB) && A
      	|
      	-(sb_rdonly(SB)) || A
      	+sb_rdonly(SB) || A
      	)
      
      	@@ expression A, B, SB; @@
      	(
      	-(sb_rdonly(SB)) ? 1 : 0
      	+sb_rdonly(SB)
      	|
      	-(sb_rdonly(SB)) ? A : B
      	+sb_rdonly(SB) ? A : B
      	)
      
      to remove left over excess bracketage and finally by applying:
      
      	@@ expression A, SB; @@
      	(
      	-(A & MS_RDONLY) != sb_rdonly(SB)
      	+(bool)(A & MS_RDONLY) != sb_rdonly(SB)
      	|
      	-(A & MS_RDONLY) == sb_rdonly(SB)
      	+(bool)(A & MS_RDONLY) == sb_rdonly(SB)
      	)
      
      to make comparisons against the result of sb_rdonly() (which is a bool)
      work correctly.
      Signed-off-by: David Howells <dhowells@redhat.com>
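The last semantic patch in the commit above adds a `(bool)` cast when a masked flag word is compared against `sb_rdonly()`. A small C sketch shows why comparing a raw masked value with a bool can go wrong. The names are hypothetical, and the flag value is deliberately set to 2 here (an assumption chosen to make the pitfall visible, not the kernel's value).

```c
#include <stdbool.h>

/* Hypothetical flag bit; deliberately not 1 so the pitfall shows up. */
#define DEMO_SB_RDONLY 0x2UL

struct demo_super_block { unsigned long s_flags; };

/* Bool-returning accessor in the spirit of sb_rdonly(). */
static inline bool demo_sb_rdonly(const struct demo_super_block *sb)
{
    return sb->s_flags & DEMO_SB_RDONLY;
}

/* Unsafe: the left side is the raw masked value (0 or 2), the right
 * side is a bool (0 or 1), so equal states can compare as different. */
bool demo_state_differs_unsafe(unsigned long flags,
                               const struct demo_super_block *sb)
{
    return (flags & DEMO_SB_RDONLY) != demo_sb_rdonly(sb);
}

/* Safe: both sides are normalised to bool before comparing. */
bool demo_state_differs(unsigned long flags,
                        const struct demo_super_block *sb)
{
    return (bool)(flags & DEMO_SB_RDONLY) != demo_sb_rdonly(sb);
}
```

With a read-only superblock and a flag word that also has the bit set, the unsafe version reports a difference while the cast version does not.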
  18. 20 June, 2017 1 commit
    • D
      xfs: remove double-underscore integer types · c8ce540d
      Darrick J. Wong committed
      This is a purely mechanical patch that removes the private
      __{u,}int{8,16,32,64}_t typedefs in favor of using the system
      {u,}int{8,16,32,64}_t typedefs.  This is the sed script used to perform
      the transformation and fix the resulting whitespace and indentation
      errors:
      
      s/typedef\t__uint8_t/typedef __uint8_t\t/g
      s/typedef\t__uint/typedef __uint/g
      s/typedef\t__int\([0-9]*\)_t/typedef int\1_t\t/g
      s/__uint8_t\t/__uint8_t\t\t/g
      s/__uint/uint/g
      s/__int\([0-9]*\)_t\t/__int\1_t\t\t/g
      s/__int/int/g
      /^typedef.*int[0-9]*_t;$/d
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
  19. 19 June, 2017 1 commit
  20. 09 May, 2017 1 commit
    • D
      block, dax: move "select DAX" from BLOCK to FS_DAX · ef510424
      Dan Williams committed
      For configurations that do not enable DAX filesystems or drivers, do not
      require the DAX core to be built.
      
      Given that the 'direct_access' method has been removed from
      'block_device_operations', we can also go ahead and move the
      block-related dax helper functions from fs/block_dev.c to
      drivers/dax/super.c. This keeps dax details out of the block layer and
      lets the DAX core be built as a module in the FS_DAX=n case.
      
      Filesystems need to include dax.h to call bdev_dax_supported().
      
      Cc: linux-xfs@vger.kernel.org
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.com>
      Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  21. 04 April, 2017 1 commit
    • B
      xfs: use dedicated log worker wq to avoid deadlock with cil wq · 696a5620
      Brian Foster committed
      The log covering background task used to be part of the xfssyncd
      workqueue. That workqueue was removed as of commit 5889608d ("xfs:
      syncd workqueue is no more") and the associated work item scheduled
      to the xfs-log wq. The latter is used for log buffer I/O completion.
      
      Since xfs_log_worker() can invoke a log flush, a deadlock is
      possible between the xfs-log and xfs-cil workqueues. Consider the
      following codepath from xfs_log_worker():
      
      xfs_log_worker()
        xfs_log_force()
          _xfs_log_force()
            xlog_cil_force()
              xlog_cil_force_lsn()
                xlog_cil_push_now()
                  flush_work()
      
      The above is in xfs-log wq context and blocked waiting on the
      completion of an xfs-cil work item. Concurrently, the cil push in
      progress can end up blocked here:
      
      xlog_cil_push_work()
        xlog_cil_push()
          xlog_write()
            xlog_state_get_iclog_space()
              xlog_wait(&log->l_flush_wait, ...)
      
      The above is in xfs-cil context waiting on log buffer I/O
      completion, which executes in xfs-log wq context. In this scenario
      both workqueues are deadlocked waiting on each other.
      
      Add a new workqueue specifically for the high level log covering and
      ail pushing worker, as was the case prior to commit 5889608d.
      Diagnosed-by: David Jeffery <djeffery@redhat.com>
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
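The circular wait in the commit above can be modelled, very coarsely, as a wait-for graph between workqueues. The sketch below is a hypothetical userspace model rather than kernel workqueue code: each node is a queue, an edge means work running on one queue blocks on work served by another, and a cycle stands for the deadlock. The queue identifiers and the `demo_` helpers are assumptions made for illustration.

```c
#include <stdbool.h>

#define DEMO_MAX_WQ 4

/* Illustrative queue identifiers; DEMO_WQ_WORKER is the new dedicated
 * queue for the log worker. */
enum { DEMO_WQ_LOG, DEMO_WQ_CIL, DEMO_WQ_WORKER };

/* waits_on[a][b] is true when work running on queue a blocks on queue b. */
struct demo_wait_graph { bool waits_on[DEMO_MAX_WQ][DEMO_MAX_WQ]; };

/* Depth-first search: can 'target' be reached from 'from'? */
static bool demo_reaches(const struct demo_wait_graph *g, int from,
                         int target, bool *seen)
{
    for (int next = 0; next < DEMO_MAX_WQ; next++) {
        if (!g->waits_on[from][next])
            continue;
        if (next == target)
            return true;
        if (!seen[next]) {
            seen[next] = true;
            if (demo_reaches(g, next, target, seen))
                return true;
        }
    }
    return false;
}

/* A cycle exists if any queue can reach itself through wait edges. */
bool demo_has_cycle(const struct demo_wait_graph *g)
{
    for (int wq = 0; wq < DEMO_MAX_WQ; wq++) {
        bool seen[DEMO_MAX_WQ] = { false };

        if (demo_reaches(g, wq, wq, seen))
            return true;
    }
    return false;
}
```

With the log worker on the log queue, the edges log -> cil and cil -> log form a cycle. Moving the worker to its own queue leaves worker -> cil -> log, which has no cycle.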
  22. 08 March, 2017 1 commit
  23. 10 February, 2017 1 commit
  24. 09 December, 2016 1 commit
  25. 30 November, 2016 1 commit
  26. 10 October, 2016 1 commit
  27. 06 October, 2016 5 commits
    • D
      xfs: recognize the reflink feature bit · e54b5bf9
      Darrick J. Wong committed
      Add the reflink feature flag to the set of recognized feature flags.
      This enables users to write to reflink filesystems.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • D
      xfs: garbage collect old cowextsz reservations · 83104d44
      Darrick J. Wong committed
      Trim CoW reservations made on behalf of a cowextsz hint if they get too
      old or we run low on quota, so long as we don't have dirty data awaiting
      writeback or directio operations in progress.
      
      Garbage collection of the cowextsize extents is kept separate from
      prealloc extent reaping because setting the CoW prealloc lifetime to a
      (much) higher value than the regular prealloc extent lifetime has been
      useful for combatting CoW fragmentation on VM hosts where the VMs
      experience bursty write behaviors and we can keep the utilization ratios
      low enough that we don't start to run out of space.  IOWs, it benefits
      us to keep the CoW fork reservations around for as long as we can unless
      we run out of blocks or hit inode reclaim.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
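The trimming rule stated in the commit above (trim when the reservation is too old or quota is low, but only if no dirty data awaits writeback and no directio is in progress) can be written as a small predicate. This is a sketch under assumptions: the `demo_cow_state` fields and the `lifetime_secs` parameter are hypothetical names, not the XFS inode fields or tunables.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-inode snapshot used only for this illustration. */
struct demo_cow_state {
    uint64_t age_secs;      /* how long the CoW reservation has existed */
    bool     low_on_quota;  /* the owner is close to a quota limit */
    bool     dirty_pages;   /* dirty data still awaiting writeback */
    bool     dio_in_flight; /* direct I/O currently in progress */
};

/* Returns true when the CoW reservation may be trimmed. */
bool demo_should_trim_cow(const struct demo_cow_state *s,
                          uint64_t lifetime_secs)
{
    /* Never trim while writes could still need the reservation. */
    if (s->dirty_pages || s->dio_in_flight)
        return false;

    return s->age_secs > lifetime_secs || s->low_on_quota;
}
```

Keeping `lifetime_secs` as a separate argument reflects the point made in the message: the CoW prealloc lifetime can be set much higher than the regular prealloc lifetime.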
    • D
      xfs: preallocate blocks for worst-case btree expansion · 84d69619
      Darrick J. Wong committed
      To gracefully handle the situation where a CoW operation turns a
      single refcount extent into a lot of tiny ones and then runs out of
      space when a tree split has to happen, use the per-AG reserved block
      pool to pre-allocate all the space we'll ever need for a maximal
      btree.  For a 4K block size, this only costs an overhead of 0.3% of
      available disk space.
      
      When reflink is enabled, we have an unfortunate problem with rmap --
      since we can share a block billions of times, this means that the
      reverse mapping btree can expand basically infinitely.  When an AG is
      so full that there are no free blocks with which to expand the rmapbt,
      the filesystem will shut down hard.
      
      This is rather annoying to the user, so use the AG reservation code to
      reserve a "reasonable" amount of space for rmap.  We'll prevent
      reflinks and CoW operations if we think we're getting close to
      exhausting an AG's free space rather than shutting down, but this
      permanent reservation should be enough for "most" users.  Hopefully.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      [hch@lst.de: ensure that we invalidate the freed btree buffer]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • D
      xfs: store in-progress CoW allocations in the refcount btree · 174edb0e
      Darrick J. Wong committed
      Due to the way the CoW algorithm in XFS works, there's an interval
      during which blocks allocated to handle a CoW can be lost -- if the FS
      goes down after the blocks are allocated but before the block
      remapping takes place.  This is exacerbated by the cowextsz hint --
      allocated reservations can sit around for a while, waiting to get
      used.
      
      Since the refcount btree doesn't normally store records with refcount
      of 1, we can use it to record these in-progress extents.  In-progress
      blocks cannot be shared because they're not user-visible, so there
      shouldn't be any conflicts with other programs.  This is a better
      solution than holding EFIs during writeback because (a) EFIs can't be
      relogged currently, (b) even if they could, EFIs are bound by
      available log space, which puts an unnecessary upper bound on how much
      CoW we can have in flight, and (c) we already have a mechanism to
      track blocks.
      
      At mount time, read the refcount records and free anything we find
      with a refcount of 1 because those were in-progress when the FS went
      down.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
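The mount-time cleanup described in the commit above, "free anything we find with a refcount of 1", can be sketched as a filter over a flat array of records. This is a simplified model and not the on-disk btree code: `demo_refcount_rec` and `demo_recover_cow_leftovers()` are hypothetical names, and "freeing" is modelled by dropping the record and counting its blocks.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flat view of a refcount record. */
struct demo_refcount_rec {
    uint32_t start_block;
    uint32_t block_count;
    uint32_t refcount;
};

/* Drops every refcount == 1 record (an in-progress CoW allocation left
 * over from before the filesystem went down), compacts the array in
 * place, updates *nrecs, and returns the number of blocks freed. */
uint64_t demo_recover_cow_leftovers(struct demo_refcount_rec *recs,
                                    size_t *nrecs)
{
    uint64_t freed = 0;
    size_t kept = 0;

    for (size_t i = 0; i < *nrecs; i++) {
        if (recs[i].refcount == 1) {
            freed += recs[i].block_count;
            continue;
        }
        recs[kept++] = recs[i];
    }

    *nrecs = kept;
    return freed;
}
```

The scheme works because, as the message notes, the refcount btree does not normally store records with a refcount of 1, so any such record found at mount time can only be a leftover reservation.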
    • D
      xfs: cancel pending CoW reservations when destroying inodes · 5e7e605c
      Darrick J. Wong committed
      When destroying the inode, cancel all pending reservations in the CoW
      fork so that all the reserved blocks go back to the free pile.  In
      theory this sort of cleanup is only needed to clean up after write
      errors.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
  28. 05 October, 2016 3 commits