1. 15 1月, 2016 1 次提交
    • V
      kmemcg: account certain kmem allocations to memcg · 5d097056
      Vladimir Davydov 提交于
      Mark those kmem allocations that are known to be easily triggered from
      userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
      memcg.  For the list, see below:
      
       - threadinfo
       - task_struct
       - task_delay_info
       - pid
       - cred
       - mm_struct
       - vm_area_struct and vm_region (nommu)
       - anon_vma and anon_vma_chain
       - signal_struct
       - sighand_struct
       - fs_struct
       - files_struct
       - fdtable and fdtable->full_fds_bits
       - dentry and external_name
       - inode for all filesystems. This is the most tedious part, because
         most filesystems overwrite the alloc_inode method.
      
      The list is far from complete, so feel free to add more objects.
      Nevertheless, it should be close to "account everything" approach and
      keep most workloads within bounds.  Malevolent users will be able to
      breach the limit, but this was possible even with the former "account
      everything" approach (simply because it did not account everything in
      fact).
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5d097056
  2. 09 12月, 2015 1 次提交
    • A
      replace ->follow_link() with new method that could stay in RCU mode · 6b255391
      Al Viro 提交于
      new method: ->get_link(); replacement of ->follow_link().  The differences
      are:
      	* inode and dentry are passed separately
      	* might be called both in RCU and non-RCU mode;
      the former is indicated by passing it a NULL dentry.
      	* when called that way it isn't allowed to block
      and should return ERR_PTR(-ECHILD) if it needs to be called
      in non-RCU mode.
      
      It's a flagday change - the old method is gone, all in-tree instances
      converted.  Conversion isn't hard; said that, so far very few instances
      do not immediately bail out when called in RCU mode.  That'll change
      in the next commits.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      6b255391
  3. 07 12月, 2015 1 次提交
  4. 21 8月, 2015 2 次提交
    • E
      dcache: Reduce the scope of i_lock in d_splice_alias · a03e283b
      Eric W. Biederman 提交于
      i_lock is only needed until __d_find_any_alias calls dget on the alias
      dentry.  After that the reference to new ensures that dentry_kill and
      d_delete will not remove the inode from the dentry, and remove the
      dentry from the inode->d_entry list.
      
      The inode i_lock came to be held over the the __d_move calls in
      d_splice_alias through a series of introduction of locks with
      increasing smaller scope.  First it was the dcache_lock, then
      it was the dcache_inode_lock, and finally inode->i_lock.
      
      Furthermore inode->i_lock is not held over any other calls
      to d_move or __d_move so it can not provide any meaningful
      rename protection.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      a03e283b
    • E
      dcache: Handle escaped paths in prepend_path · cde93be4
      Eric W. Biederman 提交于
      A rename can result in a dentry that by walking up d_parent
      will never reach it's mnt_root.  For lack of a better term
      I call this an escaped path.
      
      prepend_path is called by four different functions __d_path,
      d_absolute_path, d_path, and getcwd.
      
      __d_path only wants to see paths are connected to the root it passes
      in.  So __d_path needs prepend_path to return an error.
      
      d_absolute_path similarly wants to see paths that are connected to
      some root.  Escaped paths are not connected to any mnt_root so
      d_absolute_path needs prepend_path to return an error greater
      than 1.  So escaped paths will be treated like paths on lazily
      unmounted mounts.
      
      getcwd needs to prepend "(unreachable)" so getcwd also needs
      prepend_path to return an error.
      
      d_path is the interesting hold out.  d_path just wants to print
      something, and does not care about the weird cases.  Which raises
      the question what should be printed?
      
      Given that <escaped_path>/<anything> should result in -ENOENT I
      believe it is desirable for escaped paths to be printed as empty
      paths.  As there are not really any meaninful path components when
      considered from the perspective of a mount tree.
      
      So tweak prepend_path to return an empty path with an new error
      code of 3 when it encounters an escaped path.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      cde93be4
  5. 07 8月, 2015 1 次提交
    • M
      fs, file table: reinit files_stat.max_files after deferred memory initialisation · 4248b0da
      Mel Gorman 提交于
      Dave Hansen reported the following;
      
      	My laptop has been behaving strangely with 4.2-rc2.  Once I log
      	in to my X session, I start getting all kinds of strange errors
      	from applications and see this in my dmesg:
      
              	VFS: file-max limit 8192 reached
      
      The problem is that the file-max is calculated before memory is fully
      initialised and miscalculates how much memory the kernel is using.  This
      patch recalculates file-max after deferred memory initialisation.  Note
      that using memory hotplug infrastructure would not have avoided this
      problem as the value is not recalculated after memory hot-add.
      
      4.1:             files_stat.max_files = 6582781
      4.2-rc2:         files_stat.max_files = 8192
      4.2-rc2 patched: files_stat.max_files = 6562467
      
      Small differences with the patch applied and 4.1 but not enough to matter.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reported-by: NDave Hansen <dave.hansen@intel.com>
      Cc: Nicolai Stange <nicstange@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Alex Ng <alexng@microsoft.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4248b0da
  6. 12 7月, 2015 1 次提交
    • A
      freeing unlinked file indefinitely delayed · 75a6f82a
      Al Viro 提交于
      	Normally opening a file, unlinking it and then closing will have
      the inode freed upon close() (provided that it's not otherwise busy and
      has no remaining links, of course).  However, there's one case where that
      does *not* happen.  Namely, if you open it by fhandle with cold dcache,
      then unlink() and close().
      
      	In normal case you get d_delete() in unlink(2) notice that dentry
      is busy and unhash it; on the final dput() it will be forcibly evicted from
      dcache, triggering iput() and inode removal.  In this case, though, we end
      up with *two* dentries - disconnected (created by open-by-fhandle) and
      regular one (used by unlink()).  The latter will have its reference to inode
      dropped just fine, but the former will not - it's considered hashed (it
      is on the ->s_anon list), so it will stay around until the memory pressure
      will finally do it in.  As the result, we have the final iput() delayed
      indefinitely.  It's trivial to reproduce -
      
      void flush_dcache(void)
      {
              system("mount -o remount,rw /");
      }
      
      static char buf[20 * 1024 * 1024];
      
      main()
      {
              int fd;
              union {
                      struct file_handle f;
                      char buf[MAX_HANDLE_SZ];
              } x;
              int m;
      
              x.f.handle_bytes = sizeof(x);
              chdir("/root");
              mkdir("foo", 0700);
              fd = open("foo/bar", O_CREAT | O_RDWR, 0600);
              close(fd);
              name_to_handle_at(AT_FDCWD, "foo/bar", &x.f, &m, 0);
              flush_dcache();
              fd = open_by_handle_at(AT_FDCWD, &x.f, O_RDWR);
              unlink("foo/bar");
              write(fd, buf, sizeof(buf));
              system("df .");			/* 20Mb eaten */
              close(fd);
              system("df .");			/* should've freed those 20Mb */
              flush_dcache();
              system("df .");			/* should be the same as #2 */
      }
      
      will spit out something like
      Filesystem     1K-blocks   Used Available Use% Mounted on
      /dev/root         322023 303843      1131 100% /
      Filesystem     1K-blocks   Used Available Use% Mounted on
      /dev/root         322023 303843      1131 100% /
      Filesystem     1K-blocks   Used Available Use% Mounted on
      /dev/root         322023 283282     21692  93% /
      - inode gets freed only when dentry is finally evicted (here we trigger
      than by remount; normally it would've happened in response to memory
      pressure hell knows when).
      
      Cc: stable@vger.kernel.org # v2.6.38+; earlier ones need s/kill_it/unhash_it/
      Acked-by: NJ. Bruce Fields <bfields@fieldses.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      75a6f82a
  7. 01 7月, 2015 1 次提交
    • E
      vfs: Remove incorrect debugging WARN in prepend_path · 93e3bce6
      Eric W. Biederman 提交于
      The warning message in prepend_path is unclear and outdated.  It was
      added as a warning that the mechanism for generating names of pseudo
      files had been removed from prepend_path and d_dname should be used
      instead.  Unfortunately the warning reads like a general warning,
      making it unclear what to do with it.
      
      Remove the warning.  The transition it was added to warn about is long
      over, and I added code several years ago which in rare cases causes
      the warning to fire on legitimate code, and the warning is now firing
      and scaring people for no good reason.
      
      Cc: stable@vger.kernel.org
      Reported-by: NIvan Delalande <colona@arista.com>
      Reported-by: NOmar Sandoval <osandov@osandov.com>
      Fixes: f48cfddc ("vfs: In d_path don't call d_dname on a mount point")
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      93e3bce6
  8. 19 6月, 2015 2 次提交
    • D
      overlayfs: Make f_path always point to the overlay and f_inode to the underlay · 4bacc9c9
      David Howells 提交于
      Make file->f_path always point to the overlay dentry so that the path in
      /proc/pid/fd is correct and to ensure that label-based LSMs have access to the
      overlay as well as the underlay (path-based LSMs probably don't need it).
      
      Using my union testsuite to set things up, before the patch I see:
      
      	[root@andromeda union-testsuite]# bash 5</mnt/a/foo107
      	[root@andromeda union-testsuite]# ls -l /proc/$$/fd/
      	...
      	lr-x------. 1 root root 64 Jun  5 14:38 5 -> /a/foo107
      	[root@andromeda union-testsuite]# stat /mnt/a/foo107
      	...
      	Device: 23h/35d Inode: 13381       Links: 1
      	...
      	[root@andromeda union-testsuite]# stat -L /proc/$$/fd/5
      	...
      	Device: 23h/35d Inode: 13381       Links: 1
      	...
      
      After the patch:
      
      	[root@andromeda union-testsuite]# bash 5</mnt/a/foo107
      	[root@andromeda union-testsuite]# ls -l /proc/$$/fd/
      	...
      	lr-x------. 1 root root 64 Jun  5 14:22 5 -> /mnt/a/foo107
      	[root@andromeda union-testsuite]# stat /mnt/a/foo107
      	...
      	Device: 23h/35d Inode: 40346       Links: 1
      	...
      	[root@andromeda union-testsuite]# stat -L /proc/$$/fd/5
      	...
      	Device: 23h/35d Inode: 40346       Links: 1
      	...
      
      Note the change in where /proc/$$/fd/5 points to in the ls command.  It was
      pointing to /a/foo107 (which doesn't exist) and now points to /mnt/a/foo107
      (which is correct).
      
      The inode accessed, however, is the lower layer.  The union layer is on device
      25h/37d and the upper layer on 24h/36d.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      4bacc9c9
    • P
      seqcount: Rename write_seqcount_barrier() · a7c6f571
      Peter Zijlstra 提交于
      I'll shortly be introducing another seqcount primitive that's useful
      to provide ordering semantics and would like to use the
      write_seqcount_barrier() name for that.
      
      Seeing how there's only one user of the current primitive, lets rename
      it to invalidate, as that appears what its doing.
      
      While there, employ lockdep_assert_held() instead of
      assert_spin_locked() to not generate debug code for regular kernels.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: ktkhai@parallels.com
      Cc: rostedt@goodmis.org
      Cc: juri.lelli@gmail.com
      Cc: pang.xunlei@linaro.org
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: wanpeng.li@linux.intel.com
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: umgwanakikbuti@gmail.com
      Link: http://lkml.kernel.org/r/20150611124743.279926217@infradead.orgSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      a7c6f571
  9. 29 5月, 2015 1 次提交
    • A
      d_walk() might skip too much · 2159184e
      Al Viro 提交于
      when we find that a child has died while we'd been trying to ascend,
      we should go into the first live sibling itself, rather than its sibling.
      
      Off-by-one in question had been introduced in "deal with deadlock in
      d_walk()" and the fix needs to be backported to all branches this one
      has been backported to.
      
      Cc: stable@vger.kernel.org # 3.2 and later
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      2159184e
  10. 16 4月, 2015 1 次提交
    • D
      VFS: Impose ordering on accesses of d_inode and d_flags · 4bf46a27
      David Howells 提交于
      Impose ordering on accesses of d_inode and d_flags to avoid the need to do
      this:
      
      	if (!dentry->d_inode || d_is_negative(dentry)) {
      
      when this:
      
      	if (d_is_negative(dentry)) {
      
      should suffice.
      
      This check is especially problematic if a dentry can have its type field set
      to something other than DENTRY_MISS_TYPE when d_inode is NULL (as in
      unionmount).
      
      What we really need to do is stick a write barrier between setting d_inode and
      setting d_flags and a read barrier between reading d_flags and reading
      d_inode.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      4bf46a27
  11. 12 4月, 2015 1 次提交
    • J
      dcache: return -ESTALE not -EBUSY on distributed fs race · 3d330dc1
      J. Bruce Fields 提交于
      On a distributed filesystem it's possible for lookup to discover that a
      directory it just found is already cached elsewhere in the directory
      heirarchy.  The dcache won't let us keep the directory in both places,
      so we have to move the dentry to the new location from the place we
      previously had it cached.
      
      If the parent has changed, then this requires all the same locks as we'd
      need to do a cross-directory rename.  But we're already in lookup
      holding one parent's i_mutex, so it's too late to acquire those locks in
      the right order.
      
      The (unreliable) solution in __d_unalias is to trylock() the required
      locks and return -EBUSY if it fails.
      
      I see no particular reason for returning -EBUSY, and -ESTALE is already
      the result of some other lookup races on NFS.  I think -ESTALE is the
      more helpful error return.  It also allows us to take advantage of the
      logic Jeff Layton added in c6a94284 "vfs: fix renameat to retry on
      ESTALE errors" and ancestors, which hopefully resolves some of these
      errors before they're returned to userspace.
      
      I can reproduce these cases using NFS with:
      
      	ssh root@$client '
      		mount -olookupcache=pos '$server':'$export' /mnt/
      		mkdir /mnt/TO
      		mkdir /mnt/DIR
      		touch /mnt/DIR/test.txt
      		while true; do
      			strace -e open cat /mnt/DIR/test.txt 2>&1 | grep EBUSY
      		done
      	'
      	ssh root@$server '
      		while true; do
      			mv $export/DIR $export/TO/DIR
      			mv $export/TO/DIR $export/DIR
      		done
      	'
      
      It also helps to add some other concurrent use of the directory on the
      client (e.g., "ls /mnt/TO").  And you can replace the server-side mv's
      by client-side mv's that are repeatedly killed.  (If the client is
      interrupted while waiting for the RENAME response then it's left with a
      dentry that has to go under one parent or the other, but it doesn't yet
      know which.)
      Acked-by: NJeff Layton <jlayton@primarydata.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      3d330dc1
  12. 23 2月, 2015 2 次提交
  13. 14 2月, 2015 1 次提交
    • A
      fs: dcache: manually unpoison dname after allocation to shut up kasan's reports · df4c0e36
      Andrey Ryabinin 提交于
      We need to manually unpoison rounded up allocation size for dname to avoid
      kasan's reports in dentry_string_cmp().  When CONFIG_DCACHE_WORD_ACCESS=y
      dentry_string_cmp may access few bytes beyound requested in kmalloc()
      size.
      
      dentry_string_cmp() relates on that fact that dentry allocated using
      kmalloc and kmalloc internally round up allocation size.  So this is not a
      bug, but this makes kasan to complain about such accesses.  To avoid such
      reports we mark rounded up allocation size in shadow as accessible.
      Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: NAndrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      df4c0e36
  14. 13 2月, 2015 2 次提交
    • V
      list_lru: add helpers to isolate items · 3f97b163
      Vladimir Davydov 提交于
      Currently, the isolate callback passed to the list_lru_walk family of
      functions is supposed to just delete an item from the list upon returning
      LRU_REMOVED or LRU_REMOVED_RETRY, while nr_items counter is fixed by
      __list_lru_walk_one after the callback returns.  Since the callback is
      allowed to drop the lock after removing an item (it has to return
      LRU_REMOVED_RETRY then), the nr_items can be less than the actual number
      of elements on the list even if we check them under the lock.  This makes
      it difficult to move items from one list_lru_one to another, which is
      required for per-memcg list_lru reparenting - we can't just splice the
      lists, we have to move entries one by one.
      
      This patch therefore introduces helpers that must be used by callback
      functions to isolate items instead of raw list_del/list_move.  These are
      list_lru_isolate and list_lru_isolate_move.  They not only remove the
      entry from the list, but also fix the nr_items counter, making sure
      nr_items always reflects the actual number of elements on the list if
      checked under the appropriate lock.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3f97b163
    • V
      list_lru: introduce list_lru_shrink_{count,walk} · 503c358c
      Vladimir Davydov 提交于
      Kmem accounting of memcg is unusable now, because it lacks slab shrinker
      support.  That means when we hit the limit we will get ENOMEM w/o any
      chance to recover.  What we should do then is to call shrink_slab, which
      would reclaim old inode/dentry caches from this cgroup.  This is what
      this patch set is intended to do.
      
      Basically, it does two things.  First, it introduces the notion of
      per-memcg slab shrinker.  A shrinker that wants to reclaim objects per
      cgroup should mark itself as SHRINKER_MEMCG_AWARE.  Then it will be
      passed the memory cgroup to scan from in shrink_control->memcg.  For
      such shrinkers shrink_slab iterates over the whole cgroup subtree under
      the target cgroup and calls the shrinker for each kmem-active memory
      cgroup.
      
      Secondly, this patch set makes the list_lru structure per-memcg.  It's
      done transparently to list_lru users - everything they have to do is to
      tell list_lru_init that they want memcg-aware list_lru.  Then the
      list_lru will automatically distribute objects among per-memcg lists
      basing on which cgroup the object is accounted to.  This way to make FS
      shrinkers (icache, dcache) memcg-aware we only need to make them use
      memcg-aware list_lru, and this is what this patch set does.
      
      As before, this patch set only enables per-memcg kmem reclaim when the
      pressure goes from memory.limit, not from memory.kmem.limit.  Handling
      memory.kmem.limit is going to be tricky due to GFP_NOFS allocations, and
      it is still unclear whether we will have this knob in the unified
      hierarchy.
      
      This patch (of 9):
      
      NUMA aware slab shrinkers use the list_lru structure to distribute
      objects coming from different NUMA nodes to different lists.  Whenever
      such a shrinker needs to count or scan objects from a particular node,
      it issues commands like this:
      
              count = list_lru_count_node(lru, sc->nid);
              freed = list_lru_walk_node(lru, sc->nid, isolate_func,
                                         isolate_arg, &sc->nr_to_scan);
      
      where sc is an instance of the shrink_control structure passed to it
      from vmscan.
      
      To simplify this, let's add special list_lru functions to be used by
      shrinkers, list_lru_shrink_count() and list_lru_shrink_walk(), which
      consolidate the nid and nr_to_scan arguments in the shrink_control
      structure.
      
      This will also allow us to avoid patching shrinkers that use list_lru
      when we make shrink_slab() per-memcg - all we will have to do is extend
      the shrink_control structure to include the target memcg and make
      list_lru_shrink_{count,walk} handle this appropriately.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Suggested-by: NDave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      503c358c
  15. 26 1月, 2015 2 次提交
  16. 20 11月, 2014 4 次提交
  17. 04 11月, 2014 2 次提交
  18. 24 10月, 2014 1 次提交
    • A
      fix inode leaks on d_splice_alias() failure exits · 51486b90
      Al Viro 提交于
      d_splice_alias() callers expect it to either stash the inode reference
      into a new alias, or drop the inode reference.  That makes it possible
      to just return d_splice_alias() result from ->lookup() instance, without
      any extra housekeeping required.
      
      Unfortunately, that should include the failure exits.  If d_splice_alias()
      returns an error, it leaves the dentry it has been given negative and
      thus it *must* drop the inode reference.  Easily fixed, but it goes way
      back and will need backporting.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      51486b90
  19. 13 10月, 2014 1 次提交
  20. 09 10月, 2014 9 次提交
    • D
      dcache: Fix no spaces at the start of a line in dcache.c · b8314f93
      Daeseok Youn 提交于
      Fixed coding style in dcache.c
      Signed-off-by: NDaeseok Youn <daeseok.youn@gmail.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b8314f93
    • A
      dcache.c: call ->d_prune() regardless of d_unhashed() · 29266201
      Al Viro 提交于
      the only in-tree instance checks d_unhashed() anyway,
      out-of-tree code can preserve the current behaviour by
      adding such check if they want it and we get an ability
      to use it in cases where we *want* to be notified of
      killing being inevitable before ->d_lock is dropped,
      whether it's unhashed or not.  In particular, autofs
      would benefit from that.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      29266201
    • A
      d_prune_alias(): just lock the parent and call __dentry_kill() · 29355c39
      Al Viro 提交于
      The only reason for games with ->d_prune() was __d_drop(), which
      was needed only to force dput() into killing the sucker off.
      
      Note that lock_parent() can be called under ->i_lock and won't
      drop it, so dentry is safe from somebody managing to kill it
      under us - it won't happen while we are holding ->i_lock.
      
      __dentry_kill() is called only with ->d_lockref.count being 0
      (here and when picked from shrink list) or 1 (dput() and dropping
      the ancestors in shrink_dentry_list()), so it will never be called
      twice - the first thing it's doing is making ->d_lockref.count
      negative and once that happens, nothing will increment it.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      29355c39
    • E
      vfs: Make d_invalidate return void · 5542aa2f
      Eric W. Biederman 提交于
      Now that d_invalidate can no longer fail, stop returning a useless
      return code.  For the few callers that checked the return code update
      remove the handling of d_invalidate failure.
      Reviewed-by: NMiklos Szeredi <miklos@szeredi.hu>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      5542aa2f
    • E
      vfs: Merge check_submounts_and_drop and d_invalidate · 1ffe46d1
      Eric W. Biederman 提交于
      Now that d_invalidate is the only caller of check_submounts_and_drop,
      expand check_submounts_and_drop inline in d_invalidate.
      Reviewed-by: NMiklos Szeredi <miklos@szeredi.hu>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      1ffe46d1
    • E
      vfs: Lazily remove mounts on unlinked files and directories. · 8ed936b5
      Eric W. Biederman 提交于
      With the introduction of mount namespaces and bind mounts it became
      possible to access files and directories that on some paths are mount
      points but are not mount points on other paths.  It is very confusing
      when rm -rf somedir returns -EBUSY simply because somedir is mounted
      somewhere else.  With the addition of user namespaces allowing
      unprivileged mounts this condition has gone from annoying to allowing
      a DOS attack on other users in the system.
      
      The possibility for mischief is removed by updating the vfs to support
      rename, unlink and rmdir on a dentry that is a mountpoint and by
      lazily unmounting mountpoints on deleted dentries.
      
      In particular this change allows rename, unlink and rmdir system calls
      on a dentry without a mountpoint in the current mount namespace to
      succeed, and it allows rename, unlink, and rmdir performed on a
      distributed filesystem to update the vfs cache even if when there is a
      mount in some namespace on the original dentry.
      
      There are two common patterns of maintaining mounts: Mounts on trusted
      paths with the parent directory of the mount point and all ancestory
      directories up to / owned by root and modifiable only by root
      (i.e. /media/xxx, /dev, /dev/pts, /proc, /sys, /sys/fs/cgroup/{cpu,
      cpuacct, ...}, /usr, /usr/local).  Mounts on unprivileged directories
      maintained by fusermount.
      
      In the case of mounts in trusted directories owned by root and
      modifiable only by root the current parent directory permissions are
      sufficient to ensure a mount point on a trusted path is not removed
      or renamed by anyone other than root, even if there is a context
      where the there are no mount points to prevent this.
      
      In the case of mounts in directories owned by less privileged users
      races with users modifying the path of a mount point are already a
      danger.  fusermount already uses a combination of chdir,
      /proc/<pid>/fd/NNN, and UMOUNT_NOFOLLOW to prevent these races.  The
      removable of global rename, unlink, and rmdir protection really adds
      nothing new to consider only a widening of the attack window, and
      fusermount is already safe against unprivileged users modifying the
      directory simultaneously.
      
      In principle for perfect userspace programs returning -EBUSY for
      unlink, rmdir, and rename of dentires that have mounts in the local
      namespace is actually unnecessary.  Unfortunately not all userspace
      programs are perfect so retaining -EBUSY for unlink, rmdir and rename
      of dentries that have mounts in the current mount namespace plays an
      important role of maintaining consistency with historical behavior and
      making imperfect userspace applications hard to exploit.
      
      v2: Remove spurious old_dentry.
      v3: Optimized shrink_submounts_and_drop
          Removed unsued afs label
      v4: Simplified the changes to check_submounts_and_drop
          Do not rename check_submounts_and_drop shrink_submounts_and_drop
          Document what why we need atomicity in check_submounts_and_drop
          Rely on the parent inode mutex to make d_revalidate and d_invalidate
          an atomic unit.
      v5: Refcount the mountpoint to detach in case of simultaneous
          renames.
      Reviewed-by: NMiklos Szeredi <miklos@szeredi.hu>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      8ed936b5
    • E
      vfs: More precise tests in d_invalidate · bafc9b75
      Eric W. Biederman 提交于
      The current comments in d_invalidate about what and why it is doing
      what it is doing are wildly off-base.  Which is not surprising as
      the comments date back to last minute bug fix of the 2.2 kernel.
      
      The big fat lie of a comment said: If it's a directory, we can't drop
      it for fear of somebody re-populating it with children (even though
      dropping it would make it unreachable from that root, we still might
      repopulate it if it was a working directory or similar).
      
      [AV] What we really need to avoid is multiple dentry aliases of the
      same directory inode; on all filesystems that have ->d_revalidate()
      we either declare all positive dentries always valid (and thus never
      fed to d_invalidate()) or use d_materialise_unique() and/or d_splice_alias(),
      which take care of alias prevention.
      
      The current rules are:
      - To prevent mount point leaks dentries that are mount points or that
        have childrent that are mount points may not be be unhashed.
      - All dentries may be unhashed.
      - Directories may be rehashed with d_materialise_unique
      
      check_submounts_and_drop implements this already for well maintained
      remote filesystems so implement the current rules in d_invalidate
      by just calling check_submounts_and_drop.
      
      The one difference between d_invalidate and check_submounts_and_drop
      is that d_invalidate must respect it when a d_revalidate method has
      earlier called d_drop so preserve the d_unhashed check in
      d_invalidate.
      Reviewed-by: NMiklos Szeredi <miklos@szeredi.hu>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      bafc9b75
    • E
      vfs: Document the effect of d_revalidate on d_find_alias · 3ccb354d
      Eric W. Biederman 提交于
      d_drop or check_submounts_and_drop called from d_revalidate can result
      in renamed directories with child dentries being unhashed.  These
      renamed and drop directory dentries can be rehashed after
      d_materialise_unique uses d_find_alias to find them.
      Reviewed-by: NMiklos Szeredi <miklos@szeredi.hu>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      3ccb354d
    • A
      Allow sharing external names after __d_move() · 8d85b484
      Al Viro 提交于
      * external dentry names get a small structure prepended to them
      (struct external_name).
      * it contains an atomic refcount, matching the number of struct dentry
      instances that have ->d_name.name pointing to that external name.  The
      first thing free_dentry() does is decrementing refcount of external name,
      so the instances that are between the call of free_dentry() and
      RCU-delayed actual freeing do not contribute.
      * __d_move(x, y, false) makes the name of x equal to the name of y,
      external or not.  If y has an external name, extra reference is grabbed
      and put into x->d_name.name.  If x used to have an external name, the
      reference to the old name is dropped and, should it reach zero, freeing
      is scheduled via kfree_rcu().
      * free_dentry() in dentry with external name decrements the refcount of
      that name and, should it reach zero, does RCU-delayed call that will
      free both the dentry and external name.  Otherwise it does what it
      used to do, except that __d_free() doesn't even look at ->d_name.name;
      it simply frees the dentry.
      
      All non-RCU accesses to dentry external name are safe wrt freeing since they
      all should happen before free_dentry() is called.  RCU accesses might run
      into a dentry seen by free_dentry() or into an old name that got already
      dropped by __d_move(); however, in both cases dentry must have been
      alive and refer to that name at some point after we'd done rcu_read_lock(),
      which means that any freeing must be still pending.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      8d85b484
  21. 30 9月, 2014 1 次提交
    • A
      missing data dependency barrier in prepend_name() · 6d13f694
      Al Viro 提交于
      AFAICS, prepend_name() is broken on SMP alpha.  Disclaimer: I don't have
      SMP alpha boxen to reproduce it on.  However, it really looks like the race
      is real.
      
      CPU1: d_path() on /mnt/ramfs/<255-character>/foo
      CPU2: mv /mnt/ramfs/<255-character> /mnt/ramfs/<63-character>
      
      CPU2 does d_alloc(), which allocates an external name, stores the name there
      including terminating NUL, does smp_wmb() and stores its address in
      dentry->d_name.name.  It proceeds to d_add(dentry, NULL) and d_move()
      old dentry over to that.  ->d_name.name value ends up in that dentry.
      
      In the meanwhile, CPU1 gets to prepend_name() for that dentry.  It fetches
      ->d_name.name and ->d_name.len; the former ends up pointing to new name
      (64-byte kmalloc'ed array), the latter - 255 (length of the old name).
      Nothing to force the ordering there, and normally that would be OK, since we'd
      run into the terminating NUL and stop.  Except that it's alpha, and we'd need
      a data dependency barrier to guarantee that we see that store of NUL
      __d_alloc() has done.  In a similar situation dentry_cmp() would survive; it
      does explicit smp_read_barrier_depends() after fetching ->d_name.name.
      prepend_name() doesn't and it risks walking past the end of kmalloc'ed object
      and possibly oops due to taking a page fault in kernel mode.
      
      Cc: stable@vger.kernel.org # 3.12+
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      6d13f694
  22. 28 9月, 2014 2 次提交
    • M
      vfs: Don't exchange "short" filenames unconditionally. · d2fa4a84
      Mikhail Efremov 提交于
      Only exchange source and destination filenames
      if flags contain RENAME_EXCHANGE.
      In case if executable file was running and replaced by
      other file /proc/PID/exe should still show correct file name,
      not the old name of the file by which it was replaced.
      
      The scenario when this bug manifests itself was like this:
      * ALT Linux uses rpm and start-stop-daemon;
      * during a package upgrade rpm creates a temporary file
        for an executable to rename it upon successful unpacking;
      * start-stop-daemon is run subsequently and it obtains
        the (nonexistant) temporary filename via /proc/PID/exe
        thus failing to identify the running process.
      
      Note that "long" filenames (> DNAiME_INLINE_LEN) are still
      exchanged without RENAME_EXCHANGE and this behaviour exists
      long enough (should be fixed too apparently).
      So this patch is just an interim workaround that restores
      behavior for "short" names as it was before changes
      introduced by commit da1ce067 ("vfs: add cross-rename").
      
      See https://lkml.org/lkml/2014/9/7/6 for details.
      
      AV: the comments about being more careful with ->d_name.hash
      than with ->d_name.name are from back in 2.3.40s; they
      became obsolete by 2.3.60s, when we started to unhash the
      target instead of swapping hash chain positions followed
      by d_delete() as we used to do when dcache was first
      introduced.
      Acked-by: NMiklos Szeredi <mszeredi@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: stable@vger.kernel.org
      Fixes: da1ce067 "vfs: add cross-rename"
      Signed-off-by: NMikhail Efremov <sem@altlinux.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      d2fa4a84
    • L
      fold swapping ->d_name.hash into switch_names() · a28ddb87
      Linus Torvalds 提交于
      and do it along with ->d_name.len there
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      a28ddb87