1. 10 May 2018, 3 commits
  2. 03 April 2018, 1 commit
  3. 13 January 2018, 1 commit
    • D
      xfs: use %px for data pointers when debugging · c9690043
      Darrick J. Wong authored
      Starting with commit 57e73442 ("vsprintf: refactor %pK code out of
      pointer"), the behavior of the raw '%p' printk format specifier was
      changed to print a 32-bit hash of the pointer value to avoid leaking
      kernel pointers into dmesg.  For most situations that's good.
      
      This is /undesirable/ behavior when we're trying to debug XFS, however,
      so define a PTR_FMT that prints the actual pointer when we're in debug
      mode.
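      
      A minimal sketch of the idea, assuming the DEBUG-conditional macro
      this commit describes:
      
          #ifdef DEBUG
          # define PTR_FMT "%px"  /* raw pointer value, debug builds only */
          #else
          # define PTR_FMT "%p"   /* hashed pointer, safe for production */
          #endif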
      
      Note that %p for tracepoints still prints the raw pointer, so in the
      long run we could consider rewriting some of these messages as
      tracepoints.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
  4. 09 January 2018, 2 commits
  5. 03 January 2018, 2 commits
  6. 09 December 2017, 1 commit
    • C
      xfs: remove "no-allocation" reservations for file creations · f59cf5c2
      Christoph Hellwig authored
      If we create a new file we will need an inode, and usually some metadata
      in the parent directory.  Aiming for everything to go well despite the
      lack of a reservation leads to dirty transactions being cancelled under
      a heavy create/delete load.  This patch removes those nospace
      transactions, which leads to a slightly earlier ENOSPC on some
      workloads, but prevents filesystem shutdowns due to cancelling dirty
      transactions on others.
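      
      A sketch of the kind of "no-allocation" fallback this commit removes
      (illustrative, not the exact pre-patch code): on ENOSPC, creation used
      to retry with a zero-block reservation and hope no allocation was
      needed.
      
          error = xfs_trans_alloc(mp, tres, resblks, 0, 0, &tp);
          if (error == -ENOSPC) {
                  /* retry without a block reservation ("no-allocation") */
                  resblks = 0;
                  error = xfs_trans_alloc(mp, tres, 0, 0, 0, &tp);
          }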
      
      A customer observed assertion failures and shutdowns due to the
      cancellation of dirty transactions during heavy NFS workloads, as
      shown below:
      
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728125] XFS: Assertion failed: error != -ENOSPC, file: fs/xfs/xfs_inode.c, line: 1262
      
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728222] Call Trace:
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728246]  [<ffffffff81795daf>] dump_stack+0x63/0x81
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728262]  [<ffffffff810a1a5a>] warn_slowpath_common+0x8a/0xc0
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728264]  [<ffffffff810a1b8a>] warn_slowpath_null+0x1a/0x20
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728285]  [<ffffffffa01bf403>] asswarn+0x33/0x40 [xfs]
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728308]  [<ffffffffa01bb07e>] xfs_create+0x7be/0x7d0 [xfs]
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728329]  [<ffffffffa01b6ffb>] xfs_generic_create+0x1fb/0x2e0 [xfs]
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728348]  [<ffffffffa01b7114>] xfs_vn_mknod+0x14/0x20 [xfs]
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728366]  [<ffffffffa01b7153>] xfs_vn_create+0x13/0x20 [xfs]
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728380]  [<ffffffff81231de5>] vfs_create+0xd5/0x140
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728390]  [<ffffffffa045ddb9>] do_nfsd_create+0x499/0x610 [nfsd]
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728396]  [<ffffffffa0465fa5>] nfsd3_proc_create+0x135/0x210 [nfsd]
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728401]  [<ffffffffa04561e3>] nfsd_dispatch+0xc3/0x210 [nfsd]
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728416]  [<ffffffffa03bfa43>] svc_process_common+0x453/0x6f0 [sunrpc]
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728423]  [<ffffffffa03bfdf3>] svc_process+0x113/0x1f0 [sunrpc]
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728427]  [<ffffffffa0455bcf>] nfsd+0x10f/0x180 [nfsd]
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728432]  [<ffffffffa0455ac0>] ? nfsd_destroy+0x80/0x80 [nfsd]
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728438]  [<ffffffff810c0d58>] kthread+0xd8/0xf0
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728441]  [<ffffffff810c0c80>] ? kthread_create_on_node+0x1b0/0x1b0
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728451]  [<ffffffff8179d962>] ret_from_fork+0x42/0x70
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728453]  [<ffffffff810c0c80>] ? kthread_create_on_node+0x1b0/0x1b0
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728454] ---[ end trace f9822c842fec81d4 ]---
      
      2017-05-30 21:17:06 kernel: ALERT: [ 2670.728477] XFS (sdb): Internal error xfs_trans_cancel at line 983 of file fs/xfs/xfs_trans.c.  Caller xfs_create+0x4ee/0x7d0 [xfs]
      
      2017-05-30 21:17:06 kernel: ALERT: [ 2670.728684] XFS (sdb): Corruption of in-memory data detected. Shutting down filesystem
      2017-05-30 21:17:06 kernel: ALERT: [ 2670.728685] XFS (sdb): Please umount the filesystem and rectify the problem(s)
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  7. 02 September 2017, 1 commit
  8. 24 July 2017, 1 commit
  9. 19 June 2017, 1 commit
    • B
      xfs: push buffer of flush locked dquot to avoid quotacheck deadlock · 7912e7fe
      Brian Foster authored
      Reclaim during quotacheck can lead to deadlocks on the dquot flush
      lock:
      
       - Quotacheck populates a local delwri queue with the physical dquot
         buffers.
       - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and
         dirties all of the dquots.
       - Reclaim kicks in and attempts to flush a dquot whose buffer is
         already queued on the quotacheck queue. The flush succeeds, but
         queueing to the reclaim delwri queue fails as the backing buffer is
         already queued. The flush unlock is now deferred to I/O completion
         of the buffer from the quotacheck queue.
       - The dqadjust bulkstat continues and dirties the recently flushed
         dquot once again.
       - Quotacheck proceeds to the xfs_qm_flush_one() walk which requires
         the flush lock to update the backing buffers with the in-core
         recalculated values. It deadlocks on the redirtied dquot as the
         flush lock was already acquired by reclaim, but the buffer resides
         on the local delwri queue which isn't submitted until the end of
         quotacheck.
      
      This is reproduced by running quotacheck on a filesystem with a
      couple of million inodes in low-memory (512MB-1GB) situations. This is
      a regression as of commit 43ff2122 ("xfs: on-stack delayed write
      buffer lists"), which removed a trylock and buffer I/O submission
      from the quotacheck dquot flush sequence.
      
      Quotacheck first resets and collects the physical dquot buffers in a
      delwri queue. Then, it traverses the filesystem inodes via bulkstat,
      updates the in-core dquots, flushes the corrected dquots to the
      backing buffers and finally submits the delwri queue for I/O. Since
      the backing buffers are queued across the entire quotacheck
      operation, dquot reclaim cannot possibly complete a dquot flush
      before quotacheck completes.
      
      Therefore, quotacheck must submit the buffer for I/O in order to
      cycle the flush lock and flush the dirty in-core dquot to the
      buffer. Add a delwri queue buffer push mechanism to submit an
      individual buffer for I/O without losing the delwri queue status, and
      use it from quotacheck to avoid the deadlock. This restores
      quotacheck behavior to what it was before the regression was
      introduced.
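      
      A minimal sketch of how the new push helper is used from the
      quotacheck flush walk (simplified; buffer lookup and error handling
      omitted):
      
          if (!xfs_dqflock_nowait(dqp)) {
                  /*
                   * The dquot is flush locked by reclaim and its backing
                   * buffer sits on our local delwri queue: push that one
                   * buffer for I/O (keeping its delwri queue status) so
                   * I/O completion cycles the flush lock, then retry.
                   */
                  xfs_buf_delwri_pushbuf(bp, buffer_list);
                  return -EAGAIN;
          }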
      Reported-by: Martin Svec <martin.svec@zoner.cz>
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  10. 26 April 2017, 2 commits
  11. 28 January 2017, 1 commit
    • B
      xfs: prevent quotacheck from overloading inode lru · e0d76fa4
      Brian Foster authored
      Quotacheck runs at mount time in situations where quota accounting must
      be recalculated. In doing so, it uses bulkstat to visit every inode in
      the filesystem. Historically, every inode processed during quotacheck
      was released and immediately tagged for reclaim because quotacheck runs
      before the superblock is marked active by the VFS. In other words,
      the final iput() led to an immediate ->destroy_inode() call, which
      allowed the XFS background reclaim worker to start reclaiming inodes.
      
      Commit 17c12bcd ("xfs: when replaying bmap operations, don't let
      unlinked inodes get reaped") marks the XFS superblock active sooner as
      part of the mount process to support caching inodes processed during log
      recovery. This occurs before quotacheck and thus means all inodes
      processed by quotacheck are inserted into the LRU on release.  The
      s_umount lock is held until the mount has completed and thus prevents
      the shrinkers from operating on the sb. This means that quotacheck can
      excessively populate the inode LRU and lead to OOM conditions on systems
      without sufficient RAM.
      
      Update the quotacheck bulkstat handler to set XFS_IGET_DONTCACHE on
      inodes processed by quotacheck. This causes ->drop_inode() to return 1
      and in turn causes iput_final() to evict the inode. This preserves the
      original quotacheck behavior and prevents it from overloading the LRU
      and running out of memory.
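      
      A sketch of the change (assuming the bulkstat callback's iget call;
      illustrative, not the full patch):
      
          /* hint that this inode should not linger on the LRU at iput */
          error = xfs_iget(mp, NULL, ino, XFS_IGET_DONTCACHE,
                           XFS_ILOCK_EXCL, &ip);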
      
      CC: stable@vger.kernel.org # v4.9
      Reported-by: Martin Svec <martin.svec@zoner.cz>
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  12. 08 November 2016, 1 commit
  13. 06 April 2016, 1 commit
    • C
      xfs: better xfs_trans_alloc interface · 253f4911
      Christoph Hellwig authored
      Merge xfs_trans_reserve and xfs_trans_alloc into a single function call
      that returns a transaction with all the required log and block reservations,
      and which allows passing transaction flags directly to avoid the cumbersome
      _xfs_trans_alloc interface.
      
      While we're at it we also get rid of the transaction type argument that has
      been superfluous since we stopped supporting the non-CIL logging mode.  The
      guts of it will be removed in another patch.
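      
      A sketch of the combined interface this commit introduces (the
      reservation chosen here is illustrative):
      
          struct xfs_trans *tp;
          int error;
      
          /* allocate and reserve in a single call */
          error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ichange, 0, 0, 0, &tp);
          if (error)
                  return error;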
      
      [dchinner: fixed transaction leak in error path in xfs_setattr_nonsize]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  14. 08 February 2016, 1 commit
    • C
      xfs: Split default quota limits by quota type · be607946
      Carlos Maiolino authored
      Default quotas are set globally for historical reasons: IRIX only
      supported user and project quotas, and the default quota was only
      applied to user quotas.
      
      In Linux, when a default quota is set, all quota types inherit the
      same default value.
      
      A user with a quota limit larger than the default quota value will
      still be limited to the default value, because the group quotas also
      inherit the default quotas, unless the group the user belongs to has
      a custom quota limit set.
      
      This patch splits the default quota values by quota type, allowing
      each quota type to have different default values.
      
      Default time limits are still set globally: XFS does not set a
      per-user/group timer, but a single global timer. Changing this
      behavior would require changes to the user-space tools and other
      bugs to be fixed first.
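      
      An illustrative sketch of the per-type split (struct and field names
      here are assumptions, not the exact patch):
      
          /* one set of default limits per quota type instead of one global */
          struct xfs_def_quota {
                  xfs_qcnt_t      bhardlimit;     /* default blk hard limit */
                  xfs_qcnt_t      bsoftlimit;     /* default blk soft limit */
                  xfs_qcnt_t      ihardlimit;     /* default inode hard limit */
                  xfs_qcnt_t      isoftlimit;     /* default inode soft limit */
          };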
      Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
  15. 07 November 2015, 1 commit
    • M
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep... · d0164adc
      Mel Gorman authored
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access to one of two watermarks lower than "min", which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT, leading to a situation where
      an optimistic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM identifies
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites:
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible (see the sketch
        after this list). This is because checking for __GFP_WAIT as was
        done historically can now trigger false positives. Some exceptions
        like dm-crypt.c exist where the code intent is clearer if
        __GFP_DIRECT_RECLAIM is used instead of the helper due to flag
        manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
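      
      A minimal sketch of the helper's use (the helper is real; the wrapper
      function is illustrative):
      
          #include <linux/gfp.h>
      
          static bool demo_may_sleep(gfp_t gfp_mask)
          {
                  /* true iff __GFP_DIRECT_RECLAIM is set in gfp_mask */
                  return gfpflags_allow_blocking(gfp_mask);
          }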
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and were depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 12 October 2015, 1 commit
    • B
      xfs: per-filesystem stats counter implementation · ff6d6af2
      Bill O'Donnell authored
      This patch modifies the stats counting macros and the callers
      of those macros to properly increment, decrement, and add to
      the xfs stats counts. The counts for global and per-fs stats
      are correctly advanced, and cleared by writing a "1" to the
      corresponding clear file.
      
      global counts: /sys/fs/xfs/stats/stats
      per-fs counts: /sys/fs/xfs/sda*/stats/stats
      
      global clear:  /sys/fs/xfs/stats/stats_clear
      per-fs clear:  /sys/fs/xfs/sda*/stats/stats_clear
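      
      An approximate sketch of the macro shape after this change (not the
      exact patch; current_cpu() is XFS's wrapper for
      raw_smp_processor_id()) — each event bumps both the global and the
      per-mount counter:
      
          #define XFS_STATS_INC(mp, v) \
          do { \
                  per_cpu_ptr(xfsstats.xs_stats, current_cpu())->v++; \
                  per_cpu_ptr((mp)->m_stats.xs_stats, current_cpu())->v++; \
          } while (0)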
      
      [dchinner: cleaned up macro variables, removed CONFIG_FS_PROC around
       stats structures and macros. ]
      Signed-off-by: Bill O'Donnell <billodo@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  17. 04 June 2015, 2 commits
    • C
      xfs: saner xfs_trans_commit interface · 70393313
      Christoph Hellwig authored
      The flags argument to xfs_trans_commit is not useful for most callers, as
      a commit of a transaction without a permanent log reservation must pass
      0 here, and all callers for a transaction with a permanent log reservation
      except for xfs_trans_roll must pass XFS_TRANS_RELEASE_LOG_RES.  So remove
      the flags argument from the public xfs_trans_commit interfaces, and
      introduce a low-level __xfs_trans_commit variant just for xfs_trans_roll
      that regrants a log reservation instead of releasing it.
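      
      A sketch of the interface change:
      
          /* before: error = xfs_trans_commit(tp, XFS_TRANS_RELEASE_LOG_RES); */
          error = xfs_trans_commit(tp);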
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • C
      xfs: remove the flags argument to xfs_trans_cancel · 4906e215
      Christoph Hellwig authored
      xfs_trans_cancel takes two flags arguments: XFS_TRANS_RELEASE_LOG_RES and
      XFS_TRANS_ABORT.  Both of them are a direct product of the transaction
      state, and can be deduced:
      
       - any dirty transaction needs XFS_TRANS_ABORT to be properly canceled,
         and XFS_TRANS_ABORT is a noop for a transaction that is not dirty.
       - any transaction with a permanent log reservation needs
         XFS_TRANS_RELEASE_LOG_RES to be properly canceled, and passing
         XFS_TRANS_RELEASE_LOG_RES for a transaction without a permanent
         log reservation is invalid.
      
      So just remove the flags argument and do the right thing.
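      
      A sketch of the resulting call-site change:
      
          /* before: xfs_trans_cancel(tp, XFS_TRANS_RELEASE_LOG_RES | XFS_TRANS_ABORT); */
          xfs_trans_cancel(tp);   /* abort/release now deduced from state */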
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  18. 23 February 2015, 2 commits
    • D
      xfs: inodes are new until the dentry cache is set up · 58c90473
      Dave Chinner authored
      Al Viro noticed a generic set of issues to do with filehandle lookup
      racing with dentry cache setup. They involve a filehandle lookup
      occurring while an inode is being created and the filehandle lookup
      racing with the dentry creation for the real file. This can lead to
      multiple dentries for the one path being instantiated. There are a
      host of other issues around this same set of paths.
      
      The underlying cause is that file handle lookup only waits on inode
      cache instantiation rather than full dentry cache instantiation. XFS
      is mostly immune to the problems discovered due to its own internal
      inode cache, but there are a couple of corner cases where races can
      happen.
      
      We currently clear the XFS_INEW flag when the inode is fully set up
      after insertion into the cache. Newly allocated inodes are inserted
      locked and so aren't usable until the allocation transaction
      commits. This, however, occurs before the dentry and security
      information is fully initialised and hence the inode is unlocked and
      available for lookups to find too early.
      
      To solve the problem, only clear the XFS_INEW flag for newly created
      inodes once the dentry is fully instantiated. This means lookups
      will retry until the XFS_INEW flag is removed from the inode, and
      hence avoids the race conditions in question.
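      
      A minimal sketch of the ordering this enforces (helper name assumed
      from this commit's changes):
      
          /* publish the dentry first... */
          d_instantiate(dentry, inode);
          /* ...then clear XFS_INEW so filehandle lookups may proceed */
          xfs_finish_inode_setup(ip);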
      
      This also means that xfs_create(), xfs_create_tmpfile() and
      xfs_symlink() need to finish the setup of the inode in their error
      paths if we had allocated the inode but failed later in the creation
      process. xfs_symlink(), in particular, needed a lot of help to make
      its error handling match that of xfs_create().
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • J
      xfs: Fix quota type in quota structures when reusing quota file · dfcc70a8
      Jan Kara authored
      For filesystems without a separate project quota inode field in the
      superblock, we just reuse the project quota file for group quotas (and
      vice versa) if the project quota file is allocated and we need the
      group quota file. When we reuse the file, quota structures on disk
      suddenly have the wrong type stored in d_flags, though. Nobody really
      cares about this (although the structure type reported to userspace
      was wrong as well) except that after commit 14bf61ff (quota: Switch
      ->get_dqblk() and ->set_dqblk() to use bytes as space units) an
      assertion in xfs_qm_scall_getquota() started to trigger on the xfs/106
      test (apparently I was testing without XFS_DEBUG, so I didn't notice
      when submitting the above commit).
      
      Fix the problem by properly resetting ddq->d_flags when running quotacheck
      for a quota file.
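      
      A sketch of the fix in the quotacheck reset path (simplified):
      
          /* stamp the expected quota type into each on-disk dquot */
          ddq->d_flags = type;    /* XFS_DQ_USER, XFS_DQ_GROUP or XFS_DQ_PROJ */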
      
      CC: stable@vger.kernel.org
      Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  19. 13 February 2015, 2 commits
    • V
      list_lru: add helpers to isolate items · 3f97b163
      Vladimir Davydov authored
      Currently, the isolate callback passed to the list_lru_walk family of
      functions is supposed to just delete an item from the list upon returning
      LRU_REMOVED or LRU_REMOVED_RETRY, while the nr_items counter is fixed by
      __list_lru_walk_one after the callback returns.  Since the callback is
      allowed to drop the lock after removing an item (it has to return
      LRU_REMOVED_RETRY then), nr_items can be less than the actual number
      of elements on the list even if we check them under the lock.  This makes
      it difficult to move items from one list_lru_one to another, which is
      required for per-memcg list_lru reparenting - we can't just splice the
      lists, we have to move entries one by one.
      
      This patch therefore introduces helpers that must be used by callback
      functions to isolate items instead of raw list_del/list_move.  These are
      list_lru_isolate and list_lru_isolate_move.  They not only remove the
      entry from the list, but also fix the nr_items counter, making sure
      nr_items always reflects the actual number of elements on the list if
      checked under the appropriate lock.
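      
      A minimal sketch of an isolate callback using the new helper (callback
      signature as of this commit; the callback itself is illustrative):
      
          static enum lru_status demo_isolate(struct list_head *item,
                          struct list_lru_one *lru, spinlock_t *lock, void *arg)
          {
                  /* removes the entry and fixes nr_items under the lock */
                  list_lru_isolate(lru, item);
                  return LRU_REMOVED;
          }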
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • V
      list_lru: introduce list_lru_shrink_{count,walk} · 503c358c
      Vladimir Davydov authored
      Kmem accounting of memcg is unusable now, because it lacks slab shrinker
      support.  That means when we hit the limit we will get ENOMEM w/o any
      chance to recover.  What we should do then is to call shrink_slab, which
      would reclaim old inode/dentry caches from this cgroup.  This is what
      this patch set is intended to do.
      
      Basically, it does two things.  First, it introduces the notion of
      per-memcg slab shrinker.  A shrinker that wants to reclaim objects per
      cgroup should mark itself as SHRINKER_MEMCG_AWARE.  Then it will be
      passed the memory cgroup to scan from in shrink_control->memcg.  For
      such shrinkers shrink_slab iterates over the whole cgroup subtree under
      the target cgroup and calls the shrinker for each kmem-active memory
      cgroup.
      
      Secondly, this patch set makes the list_lru structure per-memcg.  It's
      done transparently to list_lru users - everything they have to do is to
      tell list_lru_init that they want a memcg-aware list_lru.  Then the
      list_lru will automatically distribute objects among per-memcg lists
      based on which cgroup the object is accounted to.  This way, to make FS
      shrinkers (icache, dcache) memcg-aware we only need to make them use a
      memcg-aware list_lru, and this is what this patch set does.
      
      As before, this patch set only enables per-memcg kmem reclaim when the
      pressure goes from memory.limit, not from memory.kmem.limit.  Handling
      memory.kmem.limit is going to be tricky due to GFP_NOFS allocations, and
      it is still unclear whether we will have this knob in the unified
      hierarchy.
      
      This patch (of 9):
      
      NUMA aware slab shrinkers use the list_lru structure to distribute
      objects coming from different NUMA nodes to different lists.  Whenever
      such a shrinker needs to count or scan objects from a particular node,
      it issues commands like this:
      
              count = list_lru_count_node(lru, sc->nid);
              freed = list_lru_walk_node(lru, sc->nid, isolate_func,
                                         isolate_arg, &sc->nr_to_scan);
      
      where sc is an instance of the shrink_control structure passed to it
      from vmscan.
      
      To simplify this, let's add special list_lru functions to be used by
      shrinkers, list_lru_shrink_count() and list_lru_shrink_walk(), which
      consolidate the nid and nr_to_scan arguments in the shrink_control
      structure.
      
      This will also allow us to avoid patching shrinkers that use list_lru
      when we make shrink_slab() per-memcg - all we will have to do is extend
      the shrink_control structure to include the target memcg and make
      list_lru_shrink_{count,walk} handle this appropriately.
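      
      A sketch of the consolidated calls (signatures per this commit):
      
          count = list_lru_shrink_count(lru, sc);
          freed = list_lru_shrink_walk(lru, sc, isolate_func, isolate_arg);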
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Suggested-by: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 22 January 2015, 2 commits
    • D
      xfs: consolidate superblock logging functions · 61e63ecb
      Dave Chinner authored
      We now have several superblock logging functions that are identical
      except for the transaction reservation and whether it should be a
      synchronous transaction or not. Consolidate these all into a single
      function, a single reservation and a sync flag, and call it
      xfs_sync_sb().
      
      Also, xfs_mod_sb() is not really a modification function - it's the
      operation of logging the superblock buffer. Hence, change its name
      to reflect this.
      
      Note that we have to change the mp->m_update_flags that are passed
      around at mount time to a boolean simply to indicate a superblock
      update is needed.
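      
      A sketch of the consolidated helper's use:
      
          /* log the superblock; pass true to wait for the commit */
          error = xfs_sync_sb(mp, false);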
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • D
      xfs: remove bitfield based superblock updates · 4d11a402
      Dave Chinner authored
      When we log changes to the superblock, we first have to write them
      to the on-disk buffer, and then log that. Right now we have a
      complex bitfield based arrangement to only write the modified field
      to the buffer before we log it.
      
      This used to be necessary as a performance optimisation because we
      logged the superblock buffer in every extent or inode allocation or
      freeing, and so performance was extremely important. We haven't done
      this for years, however, ever since the lazy superblock counters
      pulled the superblock logging out of the transaction commit
      fast path.
      
      Hence we have a bunch of unnecessary complexity that makes writing
      the in-core superblock to disk much more complex than it needs to
      be. We only need to log the superblock now during management
      operations (e.g. during mount, unmount or quota control operations),
      so it is not a performance critical path anymore.
      
      As such, remove the complex field based logging mechanism and
      replace it with a simple conversion function similar to what we use
      for all other on-disk structures.
      
      This means we always log the entirety of the superblock, but again,
      because we rarely modify the superblock, this is not an issue for log
      bandwidth or CPU time. Indeed, if we do log the superblock
      frequently, delayed logging will minimise the impact of this
      overhead.
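      
      A sketch of the replacement pattern (simplified): convert the whole
      in-core superblock with the usual formatting helper, then log the
      entire buffer.
      
          xfs_sb_to_disk(XFS_BUF_TO_SBP(bp), &mp->m_sb);
          xfs_trans_log_buf(tp, bp, 0, sizeof(struct xfs_dsb) - 1);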
      
      [Fixed gquota/pquota inode sharing regression noticed by bfoster.]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  21. 01 December 2014, 1 commit
  22. 28 November 2014, 2 commits
  23. 29 September 2014, 1 commit
  24. 04 August 2014, 1 commit
    • D
      xfs: quotacheck leaves dquot buffers without verifiers · 5fd364fe
      Dave Chinner authored
      When running xfs/305, I noticed that quotacheck was flushing dquot
      buffers that did not have the xfs_dquot_buf_ops verifiers attached:
      
      XFS (vdb): _xfs_buf_ioapply: no ops on block 0x1dc8/0x1dc8
      ffff880052489000: 44 51 01 04 00 00 65 b8 00 00 00 00 00 00 00 00  DQ....e.........
      ffff880052489010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      ffff880052489020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      ffff880052489030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      CPU: 1 PID: 2376 Comm: mount Not tainted 3.16.0-rc2-dgc+ #306
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
       ffff88006fe38000 ffff88004a0ffae8 ffffffff81cf1cca 0000000000000001
       ffff88004a0ffb88 ffffffff814d50ca 000010004a0ffc70 0000000000000000
       ffff88006be56dc4 0000000000000021 0000000000001dc8 ffff88007c773d80
      Call Trace:
       [<ffffffff81cf1cca>] dump_stack+0x45/0x56
       [<ffffffff814d50ca>] _xfs_buf_ioapply+0x3ca/0x3d0
       [<ffffffff810db520>] ? wake_up_state+0x20/0x20
       [<ffffffff814d51f5>] ? xfs_bdstrat_cb+0x55/0xb0
       [<ffffffff814d513b>] xfs_buf_iorequest+0x6b/0xd0
       [<ffffffff814d51f5>] xfs_bdstrat_cb+0x55/0xb0
       [<ffffffff814d53ab>] __xfs_buf_delwri_submit+0x15b/0x220
       [<ffffffff814d6040>] ? xfs_buf_delwri_submit+0x30/0x90
       [<ffffffff814d6040>] xfs_buf_delwri_submit+0x30/0x90
       [<ffffffff8150f89d>] xfs_qm_quotacheck+0x17d/0x3c0
       [<ffffffff81510591>] xfs_qm_mount_quotas+0x151/0x1e0
       [<ffffffff814ed01c>] xfs_mountfs+0x56c/0x7d0
       [<ffffffff814f0f12>] xfs_fs_fill_super+0x2c2/0x340
       [<ffffffff811c9fe4>] mount_bdev+0x194/0x1d0
       [<ffffffff814f0c50>] ? xfs_finish_flags+0x170/0x170
       [<ffffffff814ef0f5>] xfs_fs_mount+0x15/0x20
       [<ffffffff811ca8c9>] mount_fs+0x39/0x1b0
       [<ffffffff811e4d67>] vfs_kern_mount+0x67/0x120
       [<ffffffff811e757e>] do_mount+0x23e/0xad0
       [<ffffffff8117abde>] ? __get_free_pages+0xe/0x50
       [<ffffffff811e71e6>] ? copy_mount_options+0x36/0x150
       [<ffffffff811e8103>] SyS_mount+0x83/0xc0
       [<ffffffff81cfd40b>] tracesys+0xdd/0xe2
      
      This was caused by dquot buffer readahead not attaching a verifier
      structure to the buffer when readahead was issued, resulting in the
      followup read of the buffer finding a valid buffer and so not
      attaching new verifiers to the buffer as part of the read.
      
      Also, when a verifier failure occurs, we then read the buffer
      without verifiers. Attach the verifiers manually after this read so
      that if the buffer is then written it will be verified that the
      corruption has been repaired.
      
      Further, when flushing a dquot we don't ask for a verifier when
      reading in the dquot buffer the dquot belongs to. Most of the time
      this isn't an issue because the buffer is still cached, but when it
      is not cached it will result in writing the dquot buffer without
      having the verifier attached.
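      
      A sketch of the manual re-attach after an unverified read
      (simplified):
      
          /* ensure any later write of this buffer is verified */
          if (!bp->b_ops)
                  bp->b_ops = &xfs_dquot_buf_ops;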
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  25. 24 July 2014, 1 commit
  26. 25 June 2014, 1 commit
    • D
      xfs: global error sign conversion · 2451337d
      Dave Chinner authored
      Convert all the errors in the core XFS code to negative error signs,
      like the rest of the kernel, and remove all the sign conversion we
      do in the interface layers.
      
      Errors for conversion (and comparison) found via searches like:
      
      $ git grep " E" fs/xfs
      $ git grep "return E" fs/xfs
      $ git grep " E[A-Z].*;$" fs/xfs
      
      Negation points found via searches like:
      
      $ git grep "= -[a-z,A-Z]" fs/xfs
      $ git grep "return -[a-z,A-D,F-Z]" fs/xfs
      $ git grep " -[a-z].*;" fs/xfs
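      
      A sketch of the conversion pattern:
      
          /* before: return ENOMEM;  after: */
          return -ENOMEM;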
      
      [ with some bits I missed from Brian Foster ]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  27. 22 June 2014, 2 commits
  28. 15 May 2014, 1 commit
  29. 05 May 2014, 1 commit
    • D
      xfs: remove dquot hints · 3c353375
      Dave Chinner authored
      Group and project quota hints are currently stored on the user
      dquot. If we are attaching quotas to the inode, then the group and
      project dquots are stored as hints on the user dquot to save having
      to look them up again later.
      
      The thing is, the hints are not used for that inode for the rest of
      the life of the inode - the dquots are attached directly to the
      inode itself - so the only time the hints are used is when an inode
      first has dquots attached.
      
      When the hints on the user dquot don't match the dquots being
      attached to the inode, they are then removed and replaced with the
      new hints. If a user is concurrently modifying files in different
      group and/or project contexts, then this leads to thrashing of the
      hints attached to the user dquot.
      
      If user quotas are not enabled, then hints are never even used.
      
      So, if the hints are used to avoid the cost of the lookup, is the
      cost of the lookup significant enough to justify the hint
      infrastructure? Maybe it was once, when there was a global quota
      manager shared between all XFS filesystems and it was hash table
      based.
      
      However, lookups are now much simpler, requiring only a single lock and
      a radix tree lookup local to the filesystem, and no hash or LRU
      manipulations to be made. Hence the cost of lookup is much lower
      than when hints were implemented. It turns out that benchmarks show
      that, too, with there being no difference in performance when doing
      file creation workloads as a single user with user, group and
      project quotas enabled - the hints do not make the code go any
      faster. In fact, removing the hints shows a 2-3% reduction in the
      time it takes to create 50 million inodes....
      
      So, let's just get rid of the hints and the complexity around them.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>