1. 06 Oct 2016, 5 commits
    • xfs: recognize the reflink feature bit · e54b5bf9
      Darrick J. Wong authored
      Add the reflink feature flag to the set of recognized feature flags.
      This enables users to write to reflink filesystems.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: garbage collect old cowextsz reservations · 83104d44
      Darrick J. Wong authored
      Trim CoW reservations made on behalf of a cowextsz hint if they get too
      old or we run low on quota, so long as we don't have dirty data awaiting
      writeback or directio operations in progress.
      
      Garbage collection of the cowextsize extents is kept separate from
      prealloc extent reaping because setting the CoW prealloc lifetime to a
      (much) higher value than the regular prealloc extent lifetime has been
      useful for combatting CoW fragmentation on VM hosts where the VMs
      experience bursty write behaviors and we can keep the utilization ratios
      low enough that we don't start to run out of space.  IOWs, it benefits
      us to keep the CoW fork reservations around for as long as we can unless
      we run out of blocks or hit inode reclaim.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
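      For illustration, a minimal sketch of the reaping policy described above;
      the struct, fields, and lifetime value are hypothetical stand-ins, not the
      actual XFS code:

      /* Hypothetical sketch of when a CoW reservation may be reaped. */
      #include <stdbool.h>

      #define COW_PREALLOC_LIFETIME_SECS (30 * 60) /* assumed; much longer than regular prealloc */

      struct cow_reservation {
              long age_seconds;        /* how long the reservation has existed      */
              bool dirty_data_pending; /* dirty pagecache awaiting writeback        */
              bool dio_in_progress;    /* direct I/O currently running on the inode */
              bool quota_low;          /* owner is running low on quota             */
      };

      bool should_reap(const struct cow_reservation *r)
      {
              /* Never trim while data is in flight. */
              if (r->dirty_data_pending || r->dio_in_progress)
                      return false;
              /* Trim if the reservation is stale or quota is getting tight. */
              return r->age_seconds > COW_PREALLOC_LIFETIME_SECS || r->quota_low;
      }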
    • xfs: preallocate blocks for worst-case btree expansion · 84d69619
      Darrick J. Wong authored
      To gracefully handle the situation where a CoW operation turns a
      single refcount extent into a lot of tiny ones and we then run out of
      space when a tree split has to happen, use the per-AG reserved block
      pool to pre-allocate all the space we'll ever need for a maximal
      btree.  For a 4K block size, this only costs an overhead of 0.3% of
      available disk space.
      
      When reflink is enabled, we have an unfortunate problem with rmap --
      since we can share a block billions of times, this means that the
      reverse mapping btree can expand basically infinitely.  When an AG is
      so full that there are no free blocks with which to expand the rmapbt,
      the filesystem will shut down hard.
      
      This is rather annoying to the user, so use the AG reservation code to
      reserve a "reasonable" amount of space for rmap.  We'll prevent
      reflinks and CoW operations if we think we're getting close to
      exhausting an AG's free space rather than shutting down, but this
      permanent reservation should be enough for "most" users.  Hopefully.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      [hch@lst.de: ensure that we invalidate the freed btree buffer]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
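      For illustration, a minimal sketch of the reservation idea; the names and
      fields are invented, not the kernel's per-AG reservation code:

      /* Hypothetical per-AG accounting: hold back enough blocks for worst-case
       * btree splits and refuse new sharing work before dipping into them. */
      #include <stdbool.h>
      #include <stdint.h>

      struct ag_space {
              uint64_t free_blocks;   /* blocks currently free in the AG          */
              uint64_t btree_reserve; /* blocks reserved for refcount/rmap splits */
      };

      /* Fail a reflink/CoW request gracefully instead of letting a later
       * btree split run the AG completely out of space. */
      bool can_start_shared_write(const struct ag_space *ag, uint64_t needed)
      {
              return ag->free_blocks >= ag->btree_reserve + needed;
      }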
    • xfs: store in-progress CoW allocations in the refcount btree · 174edb0e
      Darrick J. Wong authored
      Due to the way the CoW algorithm in XFS works, there's an interval
      during which blocks allocated to handle a CoW can be lost -- if the FS
      goes down after the blocks are allocated but before the block
      remapping takes place.  This is exacerbated by the cowextsz hint --
      allocated reservations can sit around for a while, waiting to get
      used.
      
      Since the refcount btree doesn't normally store records with refcount
      of 1, we can use it to record these in-progress extents.  In-progress
      blocks cannot be shared because they're not user-visible, so there
      shouldn't be any conflicts with other programs.  This is a better
      solution than holding EFIs during writeback because (a) EFIs can't be
      relogged currently, (b) even if they could, EFIs are bound by
      available log space, which puts an unnecessary upper bound on how much
      CoW we can have in flight, and (c) we already have a mechanism to
      track blocks.
      
      At mount time, read the refcount records and free anything we find
      with a refcount of 1 because those were in-progress when the FS went
      down.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
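      For illustration, a minimal sketch of the mount-time cleanup described
      above; the record layout and callback are invented, not the on-disk
      refcount btree format:

      /* Hypothetical walk over refcount records: refcount == 1 marks a CoW
       * staging extent that was in progress when the filesystem went down. */
      #include <stddef.h>
      #include <stdint.h>

      struct refcount_rec {
              uint64_t start_block;
              uint64_t block_count;
              uint32_t refcount;
      };

      typedef void (*free_extent_fn)(uint64_t start, uint64_t count);

      void recover_cow_leftovers(const struct refcount_rec *recs, size_t nrecs,
                                 free_extent_fn free_extent)
      {
              for (size_t i = 0; i < nrecs; i++) {
                      if (recs[i].refcount == 1)
                              free_extent(recs[i].start_block, recs[i].block_count);
              }
      }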
    • xfs: cancel pending CoW reservations when destroying inodes · 5e7e605c
      Darrick J. Wong authored
      When destroying the inode, cancel all pending reservations in the CoW
      fork so that all the reserved blocks go back to the free pile.  In
      theory this sort of cleanup is only needed to clean up after write
      errors.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
  2. 05 Oct 2016, 3 commits
  3. 04 Oct 2016, 2 commits
  4. 26 Sep 2016, 1 commit
    • xfs: quiesce the filesystem after recovery on readonly mount · ddeb14f4
      Dave Chinner authored
      Recently we've had a number of reports where log recovery on a v5
      filesystem has reported corruptions that looked to be caused by
      recovery being re-run over the top of already-recovered
      metadata. This has uncovered a bug in recovery (fixed elsewhere)
      but the vector that caused this was largely unknown.
      
      A kdump test started tripping over this problem - the system
      would be crashed, the kdump kernel and environment would boot and
      dump the kernel core image, and then the system would reboot. After
      reboot, the root filesystem was triggering log recovery and
      corruptions were being detected. The metadumps indicated the above
      log recovery issue.
      
      What is happening is that the kdump kernel and environment is
      mounting the root device read-only to find the binaries needed to do
      its work. The result of this is that it is running log recovery.
      However, because there were unlinked files and EFIs to be processed
      by recovery, the completion of phase 1 of log recovery could not
      mark the log clean. And because it's a read-only mount, the unmount
      process does not write records to the log to mark it clean, either.
      Hence on the next mount of the filesystem, log recovery was run
      again across all the metadata that had already been recovered and
      this is what triggered corruption warnings.
      
      To avoid this problem, we need to ensure that a read-only mount
      always updates the log when it completes the second phase of
      recovery. We already handle this sort of issue with rw->ro remount
      transitions, so the solution is as simple as quiescing the
      filesystem at the appropriate time during the mount process. This
      results in the log being marked clean so the mount behaviour
      recorded in the logs on repeated RO mounts will change (i.e. log
      recovery will no longer be run on every mount until a RW mount is
      done). This is a user visible change in behaviour, but it is
      harmless.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
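      For illustration, a minimal sketch of the shape of the fix; the names are
      hypothetical, and the real change quiesces the filesystem from the XFS
      mount path:

      /* Hypothetical mount-path logic: a read-only mount that ran log recovery
       * quiesces the filesystem so the log is written back as clean. */
      #include <stdbool.h>

      struct mount_ctx {
              bool readonly;
              bool log_recovery_ran;
      };

      /* Stand-in for the real quiesce: drain background work, force the log,
       * and record that the log is clean. */
      static void quiesce_fs(struct mount_ctx *mp) { (void)mp; }

      void finish_mount(struct mount_ctx *mp)
      {
              /* rw mounts mark the log clean at unmount; ro mounts must do it
               * here or recovery reruns over already-recovered metadata. */
              if (mp->readonly && mp->log_recovery_ran)
                      quiesce_fs(mp);
      }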
  5. 19 Sep 2016, 1 commit
  6. 26 Aug 2016, 1 commit
  7. 03 Aug 2016, 7 commits
  8. 22 Jul 2016, 1 commit
  9. 21 Jun 2016, 2 commits
    • xfs: convert list of extents to free into a regular list · e66a4c67
      Darrick J. Wong authored
      In struct xfs_bmap_free, convert the open-coded free extent list to
      a regular list, then use list_sort to sort it prior to processing.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
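      For illustration, a generic sketch of the list_head + list_sort() pattern;
      the extent struct and comparator here are invented, not the actual
      xfs_bmap_free conversion:

      #include <linux/list.h>
      #include <linux/list_sort.h>
      #include <linux/types.h>

      /* Invented example type carrying a regular list_head. */
      struct free_extent {
              struct list_head list;
              u64              start_block;
      };

      /* Order extents by starting block so they are processed in disk order. */
      static int extent_cmp(void *priv, struct list_head *a, struct list_head *b)
      {
              struct free_extent *ea = list_entry(a, struct free_extent, list);
              struct free_extent *eb = list_entry(b, struct free_extent, list);

              if (ea->start_block < eb->start_block)
                      return -1;
              return ea->start_block > eb->start_block;
      }

      static void process_free_list(struct list_head *free_list)
      {
              struct free_extent *ext, *next;

              list_sort(NULL, free_list, extent_cmp);
              list_for_each_entry_safe(ext, next, free_list, list) {
                      /* ... free ext->start_block here ... */
                      list_del(&ext->list);
              }
      }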
    • xfs: cancel eofblocks background trimming on remount read-only · fa5a4f57
      Brian Foster authored
      The filesystem quiesce sequence performs the operations necessary to
      drain all background work, push pending transactions through the log
      infrastructure and wait on I/O resulting from the final AIL push. We
      have had reports of remount,ro hangs in xfs_log_quiesce() ->
      xfs_wait_buftarg(), however, and some instrumentation code to detect
      transaction commits at this point in the quiesce sequence has implicated
      the eofblocks background scanner as a cause.
      
      While higher level remount code generally prevents user modifications by
      the time the filesystem has made it to xfs_log_quiesce(), the background
      scanner may still be alive and can perform pending work at any time. If
      this occurs between the xfs_log_force() and xfs_wait_buftarg() calls
      within xfs_log_quiesce(), this can lead to an indefinite lockup in
      xfs_wait_buftarg().
      
      To prevent this problem, cancel the background eofblocks scan worker
      during the remount read-only quiesce sequence. This suspends background
      trimming when a filesystem is remounted read-only. This is only done in
      the remount path because the freeze codepath has already locked out new
      transactions by the time the filesystem attempts to quiesce (and thus
      waiting on an active work item could deadlock). Kick the eofblocks
      worker to pick up where it left off once an fs is remounted back to
      read-write.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
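      For illustration, a hedged sketch of the remount hooks using the generic
      delayed-work API; the structure and interval are placeholders, not the
      XFS eofblocks scanner itself:

      #include <linux/workqueue.h>
      #include <linux/jiffies.h>

      struct example_mount {
              struct delayed_work eofblocks_work;   /* background trim worker */
      };

      /* remount rw -> ro: make sure the scanner cannot commit transactions
       * once the quiesce sequence has started. */
      static void example_remount_ro(struct example_mount *mp)
      {
              cancel_delayed_work_sync(&mp->eofblocks_work);
      }

      /* remount ro -> rw: kick the worker so trimming picks up where it
       * left off. */
      static void example_remount_rw(struct example_mount *mp)
      {
              schedule_delayed_work(&mp->eofblocks_work, 5 * 60 * HZ);
      }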
  10. 01 Jun 2016, 1 commit
  11. 18 May 2016, 1 commit
    • xfs: remove xfs_fs_evict_inode() · 8179c036
      Dave Chinner authored
      Joe Lawrence reported a list_add corruption with 4.6-rc1 when
      testing some custom md administration code that made its own
      block device nodes for the md array. The simple test loop of:
      
      for i in {0..100}; do
      	mknod --mode=0600 $tmp/tmp_node b $MAJOR $MINOR
      	mdadm --detail --export $tmp/tmp_node > /dev/null
      	rm -f $tmp/tmp_node
      done
      
      would produce this warning in bd_acquire() when mdadm opened the
      device node:
      
      list_add double add: new=ffff88043831c7b8, prev=ffff8804380287d8, next=ffff88043831c7b8.
      
      And then this from bd_forget(), when kdevtmpfs evicted a block
      device inode:
      
      list_del corruption. prev->next should be ffff8800bb83eb10, but was ffff88043831c7b8
      
      This is a regression caused by commit c19b3b05 ("xfs: mode di_mode
      to vfs inode"). The issue is that xfs_inactive() frees the
      unlinked inode, and the above commit meant that this freeing zeroed
      the mode in the struct inode. The problem is that after evict() has
      called ->evict_inode, it expects the i_mode to be intact so that it
      can call bd_forget() or cd_forget() to drop the reference to the
      block device inode attached to the XFS inode.
      
      In reality, the only thing we do in xfs_fs_evict_inode() that is not
      generic is call xfs_inactive(). We can move the xfs_inactive() call
      to xfs_fs_destroy_inode() without any problems at all, and this
      will leave the VFS inode intact until it is completely done with it.
      
      So, remove xfs_fs_evict_inode(), and do the work it used to do in
      ->destroy_inode instead.
      
      cc: <stable@vger.kernel.org> # 4.6
      Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
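      For illustration, a hedged sketch of the resulting shape of the
      super_operations; the function body is a placeholder for what
      xfs_inactive() does, not the actual patch:

      #include <linux/fs.h>

      static void example_destroy_inode(struct inode *inode)
      {
              /* Inode teardown (the xfs_inactive() work) runs here, after the
               * VFS has finished evict() and no longer needs i_mode intact
               * for bd_forget()/cd_forget() on special inodes. */
      }

      static const struct super_operations example_sops = {
              .destroy_inode = example_destroy_inode,
              /* no .evict_inode: the generic eviction path is enough */
      };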
  12. 17 May 2016, 1 commit
    • xfs: Add alignment check for DAX mount · 1e937cdd
      Toshi Kani authored
      When a partition is not aligned to 4KB, mount -o dax succeeds,
      but any read/write access to the filesystem fails, except for
      metadata updates.
      
      Call bdev_dax_supported() to perform proper precondition checks
      which includes this partition alignment check.
      Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
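      For illustration, a simplified, hypothetical version of the alignment
      precondition; the real check is the kernel's bdev_dax_supported() helper,
      which covers more than this:

      #include <stdbool.h>
      #include <stdint.h>

      #define DAX_ALIGNMENT 4096u   /* page-sized alignment assumed here */

      /* A partition whose start offset or length is not 4KB-aligned cannot
       * back page-sized DAX mappings. */
      bool partition_dax_alignable(uint64_t start_bytes, uint64_t len_bytes)
      {
              return (start_bytes % DAX_ALIGNMENT) == 0 &&
                     (len_bytes % DAX_ALIGNMENT) == 0;
      }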
  13. 06 Apr 2016, 3 commits
  14. 05 Apr 2016, 2 commits
    • mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage · ea1754a0
      Kirill A. Shutemov authored
      Mostly direct substitution, with occasional adjustments or removal of
      outdated comments.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov authored
      The PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
      time ago with the promise that one day it would be possible to implement
      the page cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized, and it is unlikely it ever will.
      
      We have many places where PAGE_CACHE_SIZE is assumed to be equal to
      PAGE_SIZE.  And it's a constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
      especially on the border between fs and mm.
      
      Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straightforward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using the
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is reverting the changes to the
      PAGE_CACHE_ALIGN definition: we are going to drop it later.
      
      There are a few places in the code that coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation will
      also be addressed in a separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
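      For illustration, an invented helper showing how code looks after the
      conversion, with the old spellings noted in comments:

      #include <linux/mm.h>
      #include <linux/pagemap.h>

      static void example_after_conversion(struct page *page, loff_t pos,
                                           pgoff_t *index)
      {
              /* was: *index = pos >> PAGE_CACHE_SHIFT; */
              *index = pos >> PAGE_SHIFT;

              /* was: page_cache_release(page); */
              put_page(page);
      }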
  15. 09 Mar 2016, 1 commit
  16. 02 Mar 2016, 3 commits
  17. 15 Jan 2016, 1 commit
    • kmemcg: account certain kmem allocations to memcg · 5d097056
      Vladimir Davydov authored
      Mark those kmem allocations that are known to be easily triggered from
      userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
      memcg.  For the list, see below:
      
       - threadinfo
       - task_struct
       - task_delay_info
       - pid
       - cred
       - mm_struct
       - vm_area_struct and vm_region (nommu)
       - anon_vma and anon_vma_chain
       - signal_struct
       - sighand_struct
       - fs_struct
       - files_struct
       - fdtable and fdtable->full_fds_bits
       - dentry and external_name
       - inode for all filesystems. This is the most tedious part, because
         most filesystems overwrite the alloc_inode method.
      
      The list is far from complete, so feel free to add more objects.
      Nevertheless, it should be close to the "account everything" approach and
      keep most workloads within bounds.  Malevolent users will be able to
      breach the limit, but this was possible even with the former "account
      everything" approach (simply because it did not account everything in
      fact).
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
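      For illustration, an invented cache showing the opt-in accounting named
      above; SLAB_ACCOUNT and __GFP_ACCOUNT are the real flags, everything else
      here is made up:

      #include <linux/slab.h>
      #include <linux/gfp.h>
      #include <linux/errno.h>

      struct example_obj {
              int data;
      };

      static struct kmem_cache *example_cachep;

      static int example_cache_init(void)
      {
              /* Objects from this cache are charged to the allocating task's
               * memcg, like the inode caches converted by this patch. */
              example_cachep = kmem_cache_create("example_obj",
                                                 sizeof(struct example_obj), 0,
                                                 SLAB_ACCOUNT, NULL);
              if (!example_cachep)
                      return -ENOMEM;

              /* One-off allocations opt in with __GFP_ACCOUNT, e.g.
               * kmalloc(size, GFP_KERNEL | __GFP_ACCOUNT). */
              return 0;
      }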
  18. 04 Jan 2016, 1 commit
  19. 10 Nov 2015, 1 commit
    • xfs: give all workqueues rescuer threads · 7a29ac47
      Chris Mason authored
      We're consistently hitting deadlocks here with XFS on recent kernels.
      After some digging through the crash files, it looks like everyone in
      the system is waiting for XFS to reclaim memory.
      
      Something like this:
      
      PID: 2733434  TASK: ffff8808cd242800  CPU: 19  COMMAND: "java"
       #0 [ffff880019c53588] __schedule at ffffffff818c4df2
       #1 [ffff880019c535d8] schedule at ffffffff818c5517
       #2 [ffff880019c535f8] _xfs_log_force_lsn at ffffffff81316348
       #3 [ffff880019c53688] xfs_log_force_lsn at ffffffff813164fb
       #4 [ffff880019c536b8] xfs_iunpin_wait at ffffffff8130835e
       #5 [ffff880019c53728] xfs_reclaim_inode at ffffffff812fd453
       #6 [ffff880019c53778] xfs_reclaim_inodes_ag at ffffffff812fd8c7
       #7 [ffff880019c53928] xfs_reclaim_inodes_nr at ffffffff812fe433
       #8 [ffff880019c53958] xfs_fs_free_cached_objects at ffffffff8130d3b9
       #9 [ffff880019c53968] super_cache_scan at ffffffff811a6f73
      #10 [ffff880019c539c8] shrink_slab at ffffffff811460e6
      #11 [ffff880019c53aa8] shrink_zone at ffffffff8114a53f
      #12 [ffff880019c53b48] do_try_to_free_pages at ffffffff8114a8ba
      #13 [ffff880019c53be8] try_to_free_pages at ffffffff8114ad5a
      #14 [ffff880019c53c78] __alloc_pages_nodemask at ffffffff8113e1b8
      #15 [ffff880019c53d88] alloc_kmem_pages_node at ffffffff8113e671
      #16 [ffff880019c53dd8] copy_process at ffffffff8104f781
      #17 [ffff880019c53ec8] do_fork at ffffffff8105129c
      #18 [ffff880019c53f38] sys_clone at ffffffff810515b6
      #19 [ffff880019c53f48] stub_clone at ffffffff818c8e4d
      
      xfs_log_force_lsn is waiting for logs to get cleaned, which is waiting
      for IO, which is waiting for workers to complete the IO which is waiting
      for worker threads that don't exist yet:
      
      PID: 2752451  TASK: ffff880bd6bdda00  CPU: 37  COMMAND: "kworker/37:1"
       #0 [ffff8808d20abbb0] __schedule at ffffffff818c4df2
       #1 [ffff8808d20abc00] schedule at ffffffff818c5517
       #2 [ffff8808d20abc20] schedule_timeout at ffffffff818c7c6c
       #3 [ffff8808d20abcc0] wait_for_completion_killable at ffffffff818c6495
       #4 [ffff8808d20abd30] kthread_create_on_node at ffffffff8106ec82
       #5 [ffff8808d20abdf0] create_worker at ffffffff8106752f
       #6 [ffff8808d20abe40] worker_thread at ffffffff810699be
       #7 [ffff8808d20abec0] kthread at ffffffff8106ef59
       #8 [ffff8808d20abf50] ret_from_fork at ffffffff818c8ac8
      
      I think we should be using WQ_MEM_RECLAIM to make sure this thread
      pool makes progress when we're not able to allocate new workers.
      
      [dchinner: make all workqueues WQ_MEM_RECLAIM]
      Signed-off-by: Chris Mason <clm@fb.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
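      For illustration, an invented workqueue created with WQ_MEM_RECLAIM,
      which is the real flag that guarantees a rescuer thread:

      #include <linux/workqueue.h>
      #include <linux/errno.h>

      static struct workqueue_struct *example_wq;

      static int example_wq_init(void)
      {
              /* WQ_MEM_RECLAIM pre-creates a rescuer thread, so queued work
               * can still make progress when forking new workers would need
               * the very memory this workqueue is trying to reclaim. */
              example_wq = alloc_workqueue("example-io", WQ_MEM_RECLAIM, 0);
              if (!example_wq)
                      return -ENOMEM;
              return 0;
      }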
  20. 03 Nov 2015, 1 commit
  21. 19 Oct 2015, 1 commit