1. 04 Aug, 2018 1 commit
    • new primitive: discard_new_inode() · c2b6d621
      Committed by Al Viro
      	We don't want open-by-handle picking half-set-up in-core
      struct inode from e.g. mkdir() having failed halfway through.
      In other words, we don't want such inodes returned by iget_locked()
      on their way to extinction.  However, we can't just have them
      unhashed - otherwise open-by-handle immediately *after* that would've
      ended up creating a new in-core inode over the on-disk one that
      is in process of being freed right under us.
      
      	Solution: new flag (I_CREATING) set by insert_inode_locked() and
      removed by unlock_new_inode() and a new primitive (discard_new_inode())
      to be used by such halfway-through-setup failure exits instead of
      unlock_new_inode() / iput() combinations.  That primitive unlocks new
      inode, but leaves I_CREATING in place.
      
      	iget_locked() treats finding an I_CREATING inode as failure
      (-ESTALE, once we sort out the error propagation).
      	insert_inode_locked() treats the same as instant -EBUSY.
      	ilookup() treats those as icache miss.
      
      [Fix by Dan Carpenter <dan.carpenter@oracle.com> folded in]
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
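
The I_CREATING life cycle described in the discard_new_inode() commit above can be modeled outside the kernel. The following is a minimal userspace sketch in C, not the kernel implementation: every `toy_` name, the flag values, the fixed-size array standing in for the inode hash, and the -ENOMEM return for a full table are assumptions made for illustration. It only shows which operations set, keep, or clear I_CREATING and how the three lookups react to it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy flag bits: names follow the commit message, values are arbitrary. */
#define TOY_I_NEW      0x1u
#define TOY_I_CREATING 0x2u

struct toy_inode {
    unsigned long ino;
    unsigned int state;
};

/* A tiny array stands in for the kernel's inode hash table (assumption). */
#define TOY_CACHE_SLOTS 8
static struct toy_inode *toy_cache[TOY_CACHE_SLOTS];

static struct toy_inode *toy_find(unsigned long ino)
{
    for (size_t i = 0; i < TOY_CACHE_SLOTS; i++)
        if (toy_cache[i] && toy_cache[i]->ino == ino)
            return toy_cache[i];
    return NULL;
}

/* Models insert_inode_locked(): hashes the inode and marks it new and
 * still being created. Any hashed inode with the same number yields
 * -EBUSY here; the toy never waits for an older inode to go away. */
static int toy_insert_inode_locked(struct toy_inode *inode)
{
    assert(inode != NULL);
    if (toy_find(inode->ino))
        return -EBUSY;
    for (size_t i = 0; i < TOY_CACHE_SLOTS; i++) {
        if (!toy_cache[i]) {
            toy_cache[i] = inode;
            inode->state |= TOY_I_NEW | TOY_I_CREATING;
            return 0;
        }
    }
    return -ENOMEM; /* toy table is full */
}

/* Models unlock_new_inode(): setup succeeded, both flags are cleared. */
static void toy_unlock_new_inode(struct toy_inode *inode)
{
    inode->state &= ~(TOY_I_NEW | TOY_I_CREATING);
}

/* Models discard_new_inode(): setup failed halfway. The inode is
 * unlocked but I_CREATING stays, and the inode stays hashed. Dropping
 * the reference is left out of the toy. */
static void toy_discard_new_inode(struct toy_inode *inode)
{
    inode->state &= ~TOY_I_NEW;
}

/* Models iget_locked(): finding an I_CREATING inode is a failure. */
static int toy_iget_locked(unsigned long ino, struct toy_inode **out)
{
    struct toy_inode *inode = toy_find(ino);

    *out = NULL;
    if (inode && (inode->state & TOY_I_CREATING))
        return -ESTALE;
    *out = inode; /* NULL here means "not cached" in this toy */
    return 0;
}

/* Models ilookup(): an I_CREATING inode counts as a cache miss. */
static struct toy_inode *toy_ilookup(unsigned long ino)
{
    struct toy_inode *inode = toy_find(ino);

    if (inode && (inode->state & TOY_I_CREATING))
        return NULL;
    return inode;
}
```

In this model an inode that went through toy_insert_inode_locked() and then toy_discard_new_inode() remains hashed with TOY_I_CREATING set, so toy_ilookup() reports a miss, toy_iget_locked() reports -ESTALE, and a second toy_insert_inode_locked() for the same number reports -EBUSY.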
  2. 02 Aug, 2018 1 commit
    • kill d_instantiate_no_diralias() · c971e6a0
      Committed by Al Viro
      The only user is fuse_create_new_entry(), and there it's used to
      mitigate the same mkdir/open-by-handle race as in nfs_mkdir().
      The same solution applies - unhash the mkdir argument, then
      call d_splice_alias() and if that returns a reference to preexisting
      alias, dput() and report success.  ->mkdir() argument left unhashed
      negative with the preexisting alias moved in the right place is just
      fine from the ->mkdir() callers' point of view.
      
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  3. 14 May, 2018 1 commit
    • get rid of dead code in d_find_alias() · 61fec493
      Committed by Al Viro
      All "try disconnected alias if nothing else fits" logics in d_find_alias()
      got accidentally disabled by Neil a while ago; for most of the callers it
      was the right thing to do, so fixes belong in few callers that *do* want
      disconnected aliases.  This just takes the now-dead code in d_find_alias()
      out.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  4. 12 May, 2018 1 commit
    • do d_instantiate/unlock_new_inode combinations safely · 1e2e547a
      Committed by Al Viro
      For anything NFS-exported we do _not_ want to unlock new inode
      before it has grown an alias; original set of fixes got the
      ordering right, but missed the nasty complication in case of
      lockdep being enabled - unlock_new_inode() does
      	lockdep_annotate_inode_mutex_key(inode)
      which can only be done before anyone gets a chance to touch
      ->i_mutex.  Unfortunately, flipping the order and doing
      unlock_new_inode() before d_instantiate() opens a window when
      mkdir can race with open-by-fhandle on a guessed fhandle, leading
      to multiple aliases for a directory inode and all the breakage
      that follows from that.
      
      	Correct solution: a new primitive (d_instantiate_new())
      combining these two in the right order - lockdep annotate, then
      d_instantiate(), then the rest of unlock_new_inode().  All
      combinations of d_instantiate() with unlock_new_inode() should
      be converted to that.
      
      Cc: stable@kernel.org	# 2.6.29 and later
      Tested-by: Mike Marshall <hubcap@omnibond.com>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  5. 20 Apr, 2018 1 commit
  6. 16 Apr, 2018 4 commits
  7. 12 Apr, 2018 2 commits
    • fs/dcache.c: add cond_resched() in shrink_dentry_list() · 32785c05
      Committed by Nikolay Borisov
      As previously reported (https://patchwork.kernel.org/patch/8642031/)
      it's possible to call shrink_dentry_list with a large number of dentries
      (> 10000).  This, in turn, could trigger the softlockup detector and
      possibly trigger a panic.  In addition to the unmount path being
      vulnerable to this scenario, at SuSE we've observed a similar situation
      happening during process exit on processes that touch a lot of dentries.
      Here is an excerpt from a crash dump.  The numbers after the colons are
      the number of dentries on the list passed to shrink_dentry_list:
      
      PID 99760: 10722
      PID 107530: 215
      PID 108809: 24134
      PID 108877: 21331
      PID 141708: 16487
      
      So we want to kill between 15k-25k dentries without yielding.
      
      And one possible call stack looks like:
      
      4 [ffff8839ece41db0] _raw_spin_lock at ffffffff8152a5f8
      5 [ffff8839ece41db0] evict at ffffffff811c3026
      6 [ffff8839ece41dd0] __dentry_kill at ffffffff811bf258
      7 [ffff8839ece41df0] shrink_dentry_list at ffffffff811bf593
      8 [ffff8839ece41e18] shrink_dcache_parent at ffffffff811bf830
      9 [ffff8839ece41e50] proc_flush_task at ffffffff8120dd61
      10 [ffff8839ece41ec0] release_task at ffffffff81059ebd
      11 [ffff8839ece41f08] do_exit at ffffffff8105b8ce
      12 [ffff8839ece41f78] sys_exit at ffffffff8105bd53
      13 [ffff8839ece41f80] system_call_fastpath at ffffffff81532909
      
      While some of the callers of shrink_dentry_list do use cond_resched,
      this is not sufficient to prevent softlockups.  So just move
      cond_resched into shrink_dentry_list from its callers.
      
      David said: I've found hundreds of occurrences of warnings that we emit
      when need_resched stays set for a prolonged period of time with the
      stack trace that is included in the change log.
      
      Link: http://lkml.kernel.org/r/1521718946-31521-1-git-send-email-nborisov@suse.com
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
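
The change in the shrink_dentry_list() commit above amounts to putting the yield point inside the loop that walks the list, so that no caller can process tens of thousands of entries without giving the scheduler a chance to run. A minimal userspace analogy in C follows. It assumes a singly linked list and uses POSIX sched_yield() as a stand-in for the kernel's cond_resched(); all `toy_` names and the yield counter are hypothetical and exist only to make the behavior observable.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

struct toy_dentry {
    struct toy_dentry *next;
    int killed;
};

/* Counts yield points so the effect of the loop can be checked. */
static size_t toy_yield_count;

/* Stand-in for cond_resched(): offer the CPU to other runnable tasks. */
static void toy_cond_resched(void)
{
    toy_yield_count++;
    sched_yield();
}

/* Walks the whole list and "kills" each entry. The yield point sits at
 * the top of every iteration, inside the function itself, so callers do
 * not have to remember to yield between batches. Returns the number of
 * entries processed. */
static size_t toy_shrink_dentry_list(struct toy_dentry *head)
{
    size_t processed = 0;

    while (head) {
        struct toy_dentry *d = head;

        toy_cond_resched();
        head = d->next;
        assert(!d->killed); /* each entry is killed exactly once */
        d->killed = 1;
        processed++;
    }
    return processed;
}
```

Yielding once per entry keeps the worst-case time between yield points proportional to the cost of killing a single entry rather than to the length of the list.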
    • dcache: account external names as indirectly reclaimable memory · f1782c9b
      Committed by Roman Gushchin
      I received a report about suspicious growth of unreclaimable slabs on
      some machines.  I've found that it happens on machines with low memory
      pressure, and these unreclaimable slabs are external names attached to
      dentries.
      
      External names are allocated using generic kmalloc() function, so they
      are accounted as unreclaimable.  But they are held by dentries, which
      are reclaimable, and they will be reclaimed under the memory pressure.
      
      In particular, this breaks MemAvailable calculation, as it doesn't take
      unreclaimable slabs into account.  This leads to a silly situation, when
      a machine is almost idle, has no memory pressure and therefore has a big
      dentry cache.  And the resulting MemAvailable is too low to start a new
      workload.
      
      To address the issue, the NR_INDIRECTLY_RECLAIMABLE_BYTES counter is
      used to track the amount of memory, consumed by external names.  The
      counter is increased in the dentry allocation path, if an external name
      structure is allocated; and it's decreased in the dentry freeing path.
      
      To reproduce the problem I've used the following Python script:
      
        import os
      
        for iter in range (0, 10000000):
            try:
                name = ("/some_long_name_%d" % iter) + "_" * 220
                os.stat(name)
            except Exception:
                pass
      
      Without this patch:
        $ cat /proc/meminfo | grep MemAvailable
        MemAvailable:    7811688 kB
        $ python indirect.py
        $ cat /proc/meminfo | grep MemAvailable
        MemAvailable:    2753052 kB
      
      With the patch:
        $ cat /proc/meminfo | grep MemAvailable
        MemAvailable:    7809516 kB
        $ python indirect.py
        $ cat /proc/meminfo | grep MemAvailable
        MemAvailable:    7749144 kB
      
      [guro@fb.com: fix indirectly reclaimable memory accounting for CONFIG_SLOB]
        Link: http://lkml.kernel.org/r/20180312194140.19517-1-guro@fb.com
      [guro@fb.com: fix indirectly reclaimable memory accounting]
        Link: http://lkml.kernel.org/r/20180313125701.7955-1-guro@fb.com
      Link: http://lkml.kernel.org/r/20180305133743.12746-5-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 30 Mar, 2018 12 commits
    • d_genocide: move export to definition · cbd4a5bc
      Committed by Al Viro
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • 42177007
    • make non-exchanging __d_move() copy ->d_parent rather than swap them · 076515fc
      Committed by Al Viro
      Currently d_move(from, to) does the following:
      	* name/parent of from <- old name/parent of to, from hashed there
      	* to is unhashed
      	* name of to is preserved
      	* if from used to be detached, to gets detached
      	* if from used to be attached, parent of to <- old parent of from.
      
      That's both user-visibly bogus and complicates reasoning a lot.
      Much saner semantics would be
      	* name/parent of from <- name/parent of to, from hashed there.
      	* to is unhashed
      	* name/parent of to is unchanged.
      
      The price, of course, is that old parent of from might lose a reference.
      However,
      	* all potentially cross-directory callers of d_move() have both
      parents pinned directly; typically, dentries themselves are grabbed
      only after we have grabbed and locked both parents.  IOW, the decrement
      of old parent's refcount in case of d_move() won't reach zero.
      	* __d_move() from d_splice_alias() is done to detached alias.
      No refcount decrements in that case
      	* __d_move() from __d_unalias() *can* get the refcount to zero.
      So let's grab a reference to alias' old parent before calling __d_unalias()
      and dput() it after we'd dropped rename_lock.
      
      That does make d_splice_alias() potentially blocking.  However, it has
      no callers in non-sleepable contexts (and the case where we'd grown
      that dget/dput pair is _very_ rare, so performance is not an issue).
      
      Another thing that needs adjustment is unlocking in the end of __d_move();
      folded it in.  And cleaned the remnants of bogus ordering from the
      "lock them in the beginning" counterpart - it's never been right and
      now (well, for 7 years now) we have that thing always serialized on
      rename_lock anyway.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
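
The two sets of d_move(from, to) semantics listed in the __d_move() commit above can be compared side by side in a toy model. This is a sketch under stated assumptions, not kernel code: `struct toy_dentry` and both `toy_d_move_*` functions are hypothetical, reference counts and locking are left out, and a detached dentry is represented as one whose parent pointer refers to itself.

```c
#include <assert.h>
#include <stddef.h>

struct toy_dentry {
    const char *name;
    struct toy_dentry *parent; /* points to itself when detached */
    int hashed;
};

/* Old semantics as described in the commit message: 'from' takes the
 * name and parent of 'to', while the parent of 'to' is replaced with the
 * old parent of 'from' (or 'to' becomes detached if 'from' was). */
static void toy_d_move_old(struct toy_dentry *from, struct toy_dentry *to)
{
    struct toy_dentry *old_from_parent = from->parent;
    int from_was_detached = (old_from_parent == from);

    assert(from != to);
    from->name = to->name;
    from->parent = to->parent;
    from->hashed = 1;
    to->hashed = 0;
    to->parent = from_was_detached ? to : old_from_parent;
}

/* New semantics: 'from' copies the name and parent of 'to'; 'to' is
 * unhashed and otherwise left unchanged. */
static void toy_d_move_new(struct toy_dentry *from, struct toy_dentry *to)
{
    assert(from != to);
    from->name = to->name;
    from->parent = to->parent;
    from->hashed = 1;
    to->hashed = 0;
}
```

With `from` living in directory `a` and `to` living in directory `b`, the old variant leaves `to` pointing at `a` afterwards, whereas the new variant leaves `to` pointing at `b`, which is the "name/parent of to is unchanged" rule.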
    • split d_path() and friends into a separate file · 7a5cf791
      Committed by Al Viro
      Those parts of fs/dcache.c are pretty much self-contained.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • dcache.c: trim includes · 43986d63
      Committed by Al Viro
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • fs/dcache: Avoid a try_lock loop in shrink_dentry_list() · 8f04da2a
      Committed by John Ogness
      shrink_dentry_list() holds dentry->d_lock and needs to acquire
      dentry->d_inode->i_lock. This cannot be done with a spin_lock()
      operation because it's the reverse of the regular lock order.
      To avoid ABBA deadlocks it is done with a trylock loop.
      
      Trylock loops are problematic in two scenarios:
      
        1) PREEMPT_RT converts spinlocks to 'sleeping' spinlocks, which are
           preemptible. As a consequence the i_lock holder can be preempted
           by a higher priority task. If that task executes the trylock loop
           it will do so forever and live lock.
      
        2) In virtual machines trylock loops are problematic as well. The
           VCPU on which the i_lock holder runs can be scheduled out and a
           task on a different VCPU can loop for a whole time slice. In the
           worst case this can lead to starvation. Commits 47be6184
           ("fs/dcache.c: avoid soft-lockup in dput()") and 046b961b
           ("shrink_dentry_list(): take parent's d_lock earlier") are
           addressing exactly those symptoms.
      
      Avoid the trylock loop by using dentry_kill(). When pruning ancestors,
      the same code applies that is used to kill a dentry in dput(). This
      also has the benefit that the locking order is now the same. First
      the inode is locked, then the parent.
      Signed-off-by: John Ogness <john.ogness@linutronix.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • get rid of trylock loop around dentry_kill() · f657a666
      Committed by Al Viro
      In case when trylock in there fails, deal with it directly in
      dentry_kill().  Note that in cases when we drop and retake
      ->d_lock, we need to recheck whether to retain the dentry.
      Another thing is that dropping/retaking ->d_lock might have
      ended up with negative dentry turning into positive; that,
      of course, can happen only once...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • handle move to LRU in retain_dentry() · 62d9956c
      Committed by Al Viro
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • split the slow part of lock_parent() off · 8b987a46
      Committed by Al Viro
      Turn the "trylock failed" part into uninlined __lock_parent().
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • now lock_parent() can't run into killed dentry · 65d8eb5a
      Committed by Al Viro
      all remaining callers hold either a reference or ->i_lock
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • get rid of trylock loop in locking dentries on shrink list · 3b3f09f4
      Committed by Al Viro
      In case of trylock failure don't re-add to the list - drop the locks
      and carefully get them in the right order.  For shrink_dentry_list(),
      somebody having grabbed a reference to dentry means that we can
      kick it off-list, so if we find dentry being modified under us we
      don't need to play silly buggers with retries anyway - off the list
      it is.
      
      The locking logic is taken out into a helper of its own; lock_parent()
      is no longer used for dentries that can be killed under us.
      
      [fix from Eric Biggers folded]
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  9. 12 Mar, 2018 4 commits
  10. 26 Feb, 2018 2 commits
  11. 24 Feb, 2018 1 commit
    • lock_parent() needs to recheck if dentry got __dentry_kill'ed under it · 3b821409
      Committed by Al Viro
      In case when dentry passed to lock_parent() is protected from freeing only
      by the fact that it's on a shrink list and trylock of parent fails, we
      could get hit by __dentry_kill() (and subsequent dentry_kill(parent))
      between unlocking dentry and locking presumed parent.  We need to recheck
      that dentry is alive once we lock both it and parent *and* postpone
      rcu_read_unlock() until after that point.  Otherwise we could return
      a pointer to struct dentry that already is rcu-scheduled for freeing, with
      ->d_lock held on it; caller's subsequent attempt to unlock it can end
      up with memory corruption.
      
      Cc: stable@vger.kernel.org # 3.12+, counting backports
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  12. 02 Feb, 2018 2 commits
  13. 26 Jan, 2018 2 commits
  14. 24 Jan, 2018 2 commits
  15. 16 Jan, 2018 2 commits
    • vfs: Define usercopy region in names_cache slab caches · 6a9b8820
      Committed by David Windsor
      VFS pathnames are stored in the names_cache slab cache, either inline
      or across an entire allocation entry (when approaching PATH_MAX). These
      are copied to/from userspace, so they must be entirely whitelisted.
      
      cache object allocation:
          include/linux/fs.h:
              #define __getname()    kmem_cache_alloc(names_cachep, GFP_KERNEL)
      
      example usage trace:
          strncpy_from_user+0x4d/0x170
          getname_flags+0x6f/0x1f0
          user_path_at_empty+0x23/0x40
          do_mount+0x69/0xda0
          SyS_mount+0x83/0xd0
      
          fs/namei.c:
              getname_flags(...):
                  ...
                  result = __getname();
                  ...
                  kname = (char *)result->iname;
                  result->name = kname;
                  len = strncpy_from_user(kname, filename, EMBEDDED_NAME_MAX);
                  ...
                  if (unlikely(len == EMBEDDED_NAME_MAX)) {
                      const size_t size = offsetof(struct filename, iname[1]);
                      kname = (char *)result;
      
                      result = kzalloc(size, GFP_KERNEL);
                      ...
                      result->name = kname;
                      len = strncpy_from_user(kname, filename, PATH_MAX);
      
      In support of usercopy hardening, this patch defines the entire cache
      object in the names_cache slab cache as whitelisted, since it may entirely
      hold name strings to be copied to/from userspace.
      
      This patch is verbatim from Brad Spengler/PaX Team's PAX_USERCOPY
      whitelisting code in the last public patch of grsecurity/PaX based on my
      understanding of the code. Changes or omissions from the original code are
      mine and don't reflect the original grsecurity/PaX code.
      Signed-off-by: David Windsor <dave@nullcore.net>
      [kees: adjust commit log, add usage trace]
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • dcache: Define usercopy region in dentry_cache slab cache · 80344266
      Committed by David Windsor
      When a dentry name is short enough, it can be stored directly in the
      dentry itself (instead of in a separate kmalloc allocation). These dentry
      short names, stored in struct dentry.d_iname and therefore contained in
      the dentry_cache slab cache, need to be copied to userspace.
      
      cache object allocation:
          fs/dcache.c:
              __d_alloc(...):
                  ...
                  dentry = kmem_cache_alloc(dentry_cache, ...);
                  ...
                  dentry->d_name.name = dentry->d_iname;
      
      example usage trace:
          filldir+0xb0/0x140
          dcache_readdir+0x82/0x170
          iterate_dir+0x142/0x1b0
          SyS_getdents+0xb5/0x160
      
          fs/readdir.c:
              (called via ctx.actor by dir_emit)
              filldir(..., const char *name, ...):
                  ...
                  copy_to_user(..., name, namlen)
      
          fs/libfs.c:
              dcache_readdir(...):
                  ...
                  next = next_positive(dentry, p, 1)
                  ...
                  dir_emit(..., next->d_name.name, ...)
      
      In support of usercopy hardening, this patch defines a region in the
      dentry_cache slab cache in which userspace copy operations are allowed.
      
      This region is known as the slab cache's usercopy region. Slab caches can
      now check that each dynamic copy operation involving cache-managed memory
      falls entirely within the slab's usercopy region.
      
      This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
      whitelisting code in the last public patch of grsecurity/PaX based on my
      understanding of the code. Changes or omissions from the original code are
      mine and don't reflect the original grsecurity/PaX code.
      Signed-off-by: David Windsor <dave@nullcore.net>
      [kees: adjust hunks for kmalloc-specific things moved later]
      [kees: adjust commit log, provide usage trace]
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
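
The usercopy region described in the two commits above is a byte range inside each slab object, and a copy to or from userspace is accepted only if it falls entirely within that range. The check can be sketched in plain C as follows. This is an illustration under assumptions, not the kernel's hardening code: `struct toy_cache`, `struct toy_dentry`, the 32-byte inline name, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-cache description of the region that may be copied to userspace. */
struct toy_cache {
    size_t object_size; /* size of one object in the cache */
    size_t useroffset;  /* start of the usercopy region in the object */
    size_t usersize;    /* length of the usercopy region */
};

/* Toy object layout: only the inline name is meant to reach userspace. */
struct toy_dentry {
    unsigned long flags;
    void *parent;
    char d_iname[32];
    void *inode;
};

/* Cache whose usercopy region covers exactly the inline name field. */
static struct toy_cache toy_dentry_cache(void)
{
    struct toy_cache c = {
        sizeof(struct toy_dentry),
        offsetof(struct toy_dentry, d_iname),
        sizeof(((struct toy_dentry *)0)->d_iname),
    };
    return c;
}

/* Returns true only if [offset, offset + len) lies entirely inside the
 * usercopy region. The comparisons are ordered so that no subtraction
 * can wrap around. */
static bool toy_usercopy_allowed(const struct toy_cache *c,
                                 size_t offset, size_t len)
{
    size_t rel;

    assert(c->useroffset <= c->object_size);
    assert(c->usersize <= c->object_size - c->useroffset);

    if (offset < c->useroffset)
        return false;
    rel = offset - c->useroffset;
    if (rel > c->usersize)
        return false;
    return len <= c->usersize - rel;
}
```

A cache like names_cache, where the whole object may hold a pathname, would correspond to `useroffset = 0` and `usersize = object_size` in this model, while the dentry case exposes only the inline name field.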
  16. 29 Dec, 2017 1 commit
    • VFS: close race between getcwd() and d_move() · 61647823
      Committed by NeilBrown
      d_move() will call __d_drop() and then __d_rehash()
      on the dentry being moved.  This creates a small window
      when the dentry appears to be unhashed.  Many tests
      of d_unhashed() are made under ->d_lock and so are safe
      from racing with this window, but some aren't.
      In particular, getcwd() calls d_unlinked() (which calls
      d_unhashed()) without d_lock protection, so it can race.
      
      This race has been seen in practice with lustre, which uses d_move() as
      part of name lookup.  See:
         https://jira.hpdd.intel.com/browse/LU-9735
      It could race with a regular rename(), and result in ENOENT instead
      of either the 'before' or 'after' name.
      
      The race can be demonstrated with a simple program which
      has two threads, one renaming a directory back and forth
      while another calls getcwd() within that directory: it should never
      fail, but does.  See:
        https://patchwork.kernel.org/patch/9455345/
      
      We could fix this race by taking d_lock and rechecking when
      d_unhashed() reports true.  Alternatively, we can remove the window,
      which is the approach this patch takes.
      
      ___d_drop() is introduced, which does *not* clear d_hash.pprev
      so the dentry still appears to be hashed.  __d_drop() calls
      ___d_drop(), then clears d_hash.pprev.
      __d_move() now uses ___d_drop() and only clears d_hash.pprev
      when not rehashing.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
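
The window closed by the getcwd()/d_move() commit above comes from the "unhashed" test looking only at the back-pointer of the hash node. A toy model in C makes the difference between the two drop variants visible. It is a sketch under assumptions: it uses a plain doubly linked hash chain rather than the kernel's bit-locked list, has no locking at all, and every `toy_` name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct toy_hnode {
    struct toy_hnode *next;
    struct toy_hnode **pprev; /* NULL means "unhashed" */
};

struct toy_hhead {
    struct toy_hnode *first;
};

/* Mirrors the idea behind d_unhashed(): only pprev is inspected. */
static int toy_unhashed(const struct toy_hnode *n)
{
    return n->pprev == NULL;
}

static void toy_hash_add(struct toy_hhead *h, struct toy_hnode *n)
{
    n->next = h->first;
    if (h->first)
        h->first->pprev = &n->next;
    h->first = n;
    n->pprev = &h->first;
}

/* Models ___d_drop(): unlink from the chain but leave pprev non-NULL,
 * so the node still appears hashed to toy_unhashed(). */
static void toy_unlink_keep_pprev(struct toy_hnode *n)
{
    assert(n->pprev != NULL);
    *n->pprev = n->next;
    if (n->next)
        n->next->pprev = n->pprev;
}

/* Models __d_drop(): unlink, then clear pprev to mark it unhashed. */
static void toy_drop(struct toy_hnode *n)
{
    if (!toy_unhashed(n)) {
        toy_unlink_keep_pprev(n);
        n->pprev = NULL;
    }
}

/* Models the rehashing move: the node never looks unhashed in between. */
static void toy_move(struct toy_hnode *n, struct toy_hhead *to)
{
    toy_unlink_keep_pprev(n);
    toy_hash_add(to, n);
}
```

Between toy_unlink_keep_pprev() and toy_hash_add() the node is on no chain, yet toy_unhashed() keeps returning 0, which is the property an unlocked reader relies on in this model.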
  17. 26 Dec, 2017 1 commit
    • VFS: don't keep disconnected dentries on d_anon · f1ee6162
      Committed by NeilBrown
      The original purpose of the per-superblock d_anon list was to
      keep disconnected dentries in the cache between consecutive
      requests to the NFS server.  Dentries can be disconnected if
      a client holds a file open and repeatedly performs IO on it,
      and if the server drops the dentry, whether due to memory
      pressure, server restart, or "echo 3 > /proc/sys/vm/drop_caches".
      
      This purpose was thwarted by commit 75a6f82a ("freeing unlinked
      file indefinitely delayed") which caused disconnected dentries
      to be freed as soon as their refcount reached zero.
      
      This means that, when a dentry being used by nfsd gets disconnected, a
      new one needs to be allocated for every request (unless requests
      overlap).  As the dentry has no name, no parent, and no children,
      there is little of value to cache.  As small memory allocations are
      typically fast (from per-cpu free lists) this likely has little cost.
      
      This means that the original purpose of s_anon is no longer relevant:
      there is no longer any need to keep disconnected dentries on a list so
      they appear to be hashed.
      
      However, s_anon now has a new use.  When you mount an NFS filesystem,
      the dentry stored in s_root is just a placebo.  The "real" root dentry
      is allocated using d_obtain_root() and so is kept on the s_anon list.
      I don't know the reason for this, but suspect it is related to NFSv4
      where a mount of "server:/some/path" requires NFS to look up the root
      filehandle on the server, then walk down "/some" and "/path" to get
      the filehandle to mount.
      
      Whatever the reason, NFS depends on the s_anon list and on
      shrink_dcache_for_umount() pruning all dentries on this list.  So we
      cannot simply remove s_anon.
      
      We could just leave the code unchanged, but apart from that being
      potentially confusing, the (unfair) bit-spin-lock which protects
      s_anon can become a bottleneck when lots of disconnected dentries are
      being created.
      
      So this patch renames s_anon to s_roots, and stops storing
      disconnected dentries on the list.  Only dentries obtained with
      d_obtain_root() are now stored on this list.  There are many fewer of
      these (only NFS and NILFS2 use the call, and only during filesystem
      mount) so contention on the bit-lock will not be a problem.
      
      Possibly an alternate solution should be found for NFS and NILFS2, but
      that would require understanding their needs first.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>