1. 15 Aug, 2022 1 commit
  2. 30 Jul, 2022 1 commit
  3. 18 Jul, 2022 1 commit
  4. 23 Mar, 2022 1 commit
  5. 22 Jan, 2022 1 commit
  6. 07 May, 2021 2 commits
  7. 07 Nov, 2020 1 commit
  8. 04 Sep, 2020 3 commits
  9. 13 Jun, 2020 1 commit
    • proc: Use new_inode not new_inode_pseudo · ef1548ad
      Eric W. Biederman committed
      Recently syzbot reported that unmounting proc while there is an ongoing
      inotify watch on the root directory of proc could result in a
      use-after-free when the watch is removed after the unmount of proc,
      as the watcher exits.
      
      Commit 69879c01 ("proc: Remove the now unnecessary internal mount
      of proc") made it easier to unmount proc and allowed syzbot to see the
      problem, but looking at the code it has been around for a long time.
      
      Looking at the code, the fsnotify watch should have been removed by
      fsnotify_sb_delete in generic_shutdown_super.  Unfortunately the inode
      was allocated with new_inode_pseudo instead of new_inode, so the inode
      was not on the sb->s_inodes list.  This prevented
      fsnotify_unmount_inodes from finding the inode and removing the watch,
      and it also meant that the "VFS: Busy inodes after unmount" warning
      could not find the inodes to warn about them.
      
      Make all of the inodes in proc visible to generic_shutdown_super
      and fsnotify_sb_delete by using new_inode instead of new_inode_pseudo.
      The only functional difference is that new_inode places the inodes
      on the sb->s_inodes list.
      
      I wrote a small test program and I can verify that without changes it
      can trigger this issue, and that by replacing new_inode_pseudo with
      new_inode the issue goes away.
      
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/000000000000d788c905a7dfa3f4@google.com
      Reported-by: syzbot+7d2debdcdb3cb93c1e5e@syzkaller.appspotmail.com
      Fixes: 0097875b ("proc: Implement /proc/thread-self to point at the directory of the current thread")
      Fixes: 021ada7d ("procfs: switch /proc/self away from proc_dir_entry")
      Fixes: 51f0885e ("vfs,proc: guarantee unique inodes in /proc")
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
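The effect described in this commit message can be illustrated with a small user-space toy model. This is only a sketch under stated assumptions: `toy_sb`, `toy_inode`, `toy_new_inode`, `toy_new_inode_pseudo` and `toy_unmount_sweep` are hypothetical names invented for illustration, not kernel APIs. The only property modelled is the one the message names: one allocator links the inode onto the per-superblock list and the other does not, so a sweep over that list cannot find it.

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins for struct inode and struct super_block (hypothetical). */
struct toy_inode {
	struct toy_inode *next;     /* link on the superblock's inode list */
	int has_watch;              /* models an attached fsnotify watch */
};

struct toy_sb {
	struct toy_inode *s_inodes; /* models sb->s_inodes */
};

/* Models new_inode_pseudo(): allocates, but does NOT link onto the list. */
struct toy_inode *toy_new_inode_pseudo(struct toy_sb *sb)
{
	(void)sb;
	return calloc(1, sizeof(struct toy_inode));
}

/* Models new_inode(): same allocation, plus linking onto sb->s_inodes. */
struct toy_inode *toy_new_inode(struct toy_sb *sb)
{
	struct toy_inode *inode = toy_new_inode_pseudo(sb);

	if (inode) {
		inode->next = sb->s_inodes;
		sb->s_inodes = inode;
	}
	return inode;
}

/*
 * Models the unmount-time sweep: clear the watch on every inode that is
 * reachable from the list and return how many watches were removed.
 */
int toy_unmount_sweep(struct toy_sb *sb)
{
	int removed = 0;

	for (struct toy_inode *i = sb->s_inodes; i; i = i->next) {
		if (i->has_watch) {
			i->has_watch = 0;
			removed++;
		}
	}
	return removed;
}
```

In this model an inode created with `toy_new_inode_pseudo` keeps its watch after the sweep, which corresponds to the stale watch that is later removed after the unmount.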
  10. 22 Apr, 2020 4 commits
    • proc: use human-readable values for hidepid · 1c6c4d11
      Alexey Gladkov committed
      The number of hidepid parameter values keeps growing, and it is becoming
      difficult to remember what each new magic number means.
      
      Backward compatibility is preserved since it is still possible to specify
      a numerical value for the hidepid parameter. This does not break
      fsconfig, since it is not possible to specify a numerical value through
      it: all numeric values are converted to a string, and the type
      FSCONFIG_SET_BINARY cannot be used to indicate a numerical value.
      
      Selftest has been added to verify this behavior.
      Suggested-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
      Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
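As a rough illustration of accepting both spellings, the sketch below maps a `hidepid=` argument to its numeric value. The string names (`off`, `noaccess`, `invisible`, `ptraceable`) and the values 0, 1, 2 and 4 are given as I recall them from the kernel's procfs documentation and should be checked against it. The function `parse_hidepid` and the enum are hypothetical and are not the kernel's parser.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Numeric hidepid values; enum and constant names here are illustrative. */
enum toy_hidepid {
	TOY_HIDEPID_OFF = 0,
	TOY_HIDEPID_NO_ACCESS = 1,
	TOY_HIDEPID_INVISIBLE = 2,
	TOY_HIDEPID_NOT_PTRACEABLE = 4,
};

static const struct {
	const char *name;
	int value;
} toy_hidepid_names[] = {
	{ "off",        TOY_HIDEPID_OFF },
	{ "noaccess",   TOY_HIDEPID_NO_ACCESS },
	{ "invisible",  TOY_HIDEPID_INVISIBLE },
	{ "ptraceable", TOY_HIDEPID_NOT_PTRACEABLE },
};

/*
 * Hypothetical helper: return the numeric value for a hidepid argument
 * given either as a name or as a decimal number, or -1 if it is not one
 * of the known values.
 */
int parse_hidepid(const char *arg)
{
	const size_t n = sizeof(toy_hidepid_names) / sizeof(toy_hidepid_names[0]);
	char *end;
	long num;
	size_t i;

	/* Human-readable form first. */
	for (i = 0; i < n; i++)
		if (strcmp(arg, toy_hidepid_names[i].name) == 0)
			return toy_hidepid_names[i].value;

	/* Backward-compatible numeric form. */
	if (*arg == '\0')
		return -1;
	num = strtol(arg, &end, 10);
	if (*end != '\0')
		return -1;
	for (i = 0; i < n; i++)
		if (toy_hidepid_names[i].value == num)
			return (int)num;
	return -1;
}
```

With this helper, `parse_hidepid("invisible")` and `parse_hidepid("2")` yield the same value, while an unknown name or an unassigned number such as `3` is rejected.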
    • proc: add option to mount only a pids subset · 6814ef2d
      Alexey Gladkov committed
      This allows hiding all files and directories in procfs that are not
      related to tasks.
      Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
      Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • proc: allow to mount many instances of proc in one pid namespace · fa10fed3
      Alexey Gladkov committed
      This patch allows multiple procfs instances inside the
      same pid namespace. The aim here is lightweight sandboxes, and to allow
      that we have to modernize procfs internals.
      
      1) The main aim of this work is to have one supervisor for apps on
      embedded systems. Right now we have some lightweight sandbox support;
      however, if we create pid namespaces we have to manage all the
      processes inside them too, whereas our goal is to be able to run a bunch of
      apps, each one inside its own mount namespace, without being able to
      notice each other. We only want to use mount namespaces, and we want
      procfs to behave more like a real mount point.
      
      2) Linux Security Modules have multiple ptrace paths inside some
      subsystems; however, inside procfs the implementation does not guarantee
      that the ptrace() check which triggers the security_ptrace_check() hook
      will always run. We have the 'hidepid' mount option that can be used to
      force the ptrace_may_access() check inside has_pid_permissions() to run.
      The problem is that 'hidepid' is per pid namespace and not attached to
      the mount point, so any remount or modification of 'hidepid' will propagate
      to all other procfs mounts.
      
      This also makes it hard to support the Yama LSM in desktop and user
      sessions. The Yama ptrace scope, which restricts ptrace and some other
      syscalls to be allowed only on inferiors, can be updated to have a
      per-task context, where the context will be inherited during fork() and
      clone() and preserved across execve(). If we support multiple private
      procfs instances, then we may force the ptrace_may_access() check on
      /proc/<pids>/ to always run inside those new procfs instances. This will
      make it possible to specify, for user sessions, whether procfs should be
      populated with pids that the user can ptrace or not.
      
      By using the Yama ptrace scope, some restricted users will only be able to see
      inferiors inside /proc; they won't even be able to see their other
      processes. Some software like Chromium, Firefox's crash handler, Wine
      and others already use Yama to restrict which processes can be
      ptraced. This change makes it possible to restrict
      /proc/<pids>/ but, more importantly, gives desktop users a
      generic and usable way to specify which users should see all processes
      and which users cannot.
      
      Side notes:
      * This covers a gap in seccomp, which is not able to parse
      arguments: it is easy to install a seccomp filter on direct syscalls
      that operate on pids, but /proc/<pid>/ is a Linux ABI that uses
      filesystem syscalls. With this change LSMs should be able to analyze
      open/read/write/close...
      
      In the new patch set version I removed the 'newinstance' option
      as suggested by Eric W. Biederman.
      
      Selftest has been added to verify new behavior.
      Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
      Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
  11. 08 Apr, 2020 2 commits
    • proc: faster open/read/close with "permanent" files · d919b33d
      Alexey Dobriyan committed
      Now that "struct proc_ops" exists we can start putting stuff there which
      could not fly with VFS "struct file_operations"...
      
      Most of the fs/proc/inode.c file is dedicated to making open/read/.../close
      reliable in the event of disappearing /proc entries, which usually happens
      when a module is being removed.  Files like /proc/cpuinfo which never
      disappear simply do not need such protection.
      
      Save 2 atomic ops, 1 allocation, 1 free per open/read/close sequence for such
      "permanent" files.
      
      Enable "permanent" flag for
      
      	/proc/cpuinfo
      	/proc/kmsg
      	/proc/modules
      	/proc/slabinfo
      	/proc/stat
      	/proc/sysvipc/*
      	/proc/swaps
      
      More will come once I figure out a foolproof way to prevent out-of-tree
      module authors from marking their stuff "permanent" for performance reasons
      when it is not.
      
      This should help with scalability: benchmark is "read /proc/cpuinfo R times
      by N threads scattered over the system".
      
      	N	R	t, s (before)	t, s (after)	delta
      	-------------------------------------------------------------
      	64	4096	1.582458	1.530502	-3.2%
      	256	4096	6.371926	6.125168	-3.9%
      	1024	4096	25.64888	24.47528	-4.6%
      
      Benchmark source:
      
      #include <chrono>
      #include <cstdlib>
      #include <iostream>
      #include <thread>
      #include <vector>
      
      #include <sys/types.h>
      #include <sys/stat.h>
      #include <fcntl.h>
      #include <sched.h>
      #include <unistd.h>
      
      const int NR_CPUS = sysconf(_SC_NPROCESSORS_ONLN);
      int N;
      const char *filename;
      int R;
      
      int xxx = 0;
      
      int glue(int n)
      {
      	cpu_set_t m;
      	CPU_ZERO(&m);
      	CPU_SET(n, &m);
      	return sched_setaffinity(0, sizeof(cpu_set_t), &m);
      }
      
      void f(int n)
      {
      	glue(n % NR_CPUS);
      
      	while (*(volatile int *)&xxx == 0) {
      	}
      
      	for (int i = 0; i < R; i++) {
      		int fd = open(filename, O_RDONLY);
      		char buf[4096];
      		ssize_t rv = read(fd, buf, sizeof(buf));
      		asm volatile ("" :: "g" (rv));
      		close(fd);
      	}
      }
      
      int main(int argc, char *argv[])
      {
      	if (argc < 4) {
      		std::cerr << "usage: " << argv[0] << ' ' << "N /proc/filename R\n";
      		return 1;
      	}
      
      	N = atoi(argv[1]);
      	filename = argv[2];
      	R = atoi(argv[3]);
      
      	for (int i = 0; i < NR_CPUS; i++) {
      		if (glue(i) == 0)
      			break;
      	}
      
      	std::vector<std::thread> T;
      	T.reserve(N);
      	for (int i = 0; i < N; i++) {
      		T.emplace_back(f, i);
      	}
      
      	auto t0 = std::chrono::system_clock::now();
      	{
      		*(volatile int *)&xxx = 1;
      		for (auto& t: T) {
      			t.join();
      		}
      	}
      	auto t1 = std::chrono::system_clock::now();
      	std::chrono::duration<double> dt = t1 - t0;
      	std::cout << dt.count() << '\n';
      
      	return 0;
      }
      
      P.S.:
      An explicit randomization marker is added because adding a non-function
      pointer member will silently disable structure layout randomization.
      
      [akpm@linux-foundation.org: coding style fixes]
      Reported-by: kbuild test robot <lkp@intel.com>
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Joe Perches <joe@perches.com>
      Link: http://lkml.kernel.org/r/20200222201539.GA22576@avx2
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/proc/inode.c: annotate close_pdeo() for sparse · 904f394e
      Jules Irenge committed
      Fix sparse locking imbalance warning:
      
      	warning: context imbalance in close_pdeo() - unexpected unlock
      Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200227201538.GA30462@avx2
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 25 Feb, 2020 1 commit
    • proc: Use a list of inodes to flush from proc · 7bc3e6e5
      Eric W. Biederman committed
      Rework the flushing of proc to use a list of directory inodes that
      need to be flushed.
      
      The list is kept on struct pid not on struct task_struct, as there is
      a fixed connection between proc inodes and pids but at least for the
      case of de_thread the pid of a task_struct changes.
      
      This removes the dependency on proc_mnt, which allows different
      mounts of proc to have different mount options even in the same pid
      namespace, and this allows for the removal of proc_mnt, which will
      trivially allow the first mount of proc to honor its mount options.
      
      This flushing remains an optimization.  The functions
      pid_delete_dentry and pid_revalidate ensure that ordinary dcache
      management will not attempt to use dentries past the point their
      respective task has died.  When unused the shrinker will
      eventually be able to remove these dentries.
      
      There is a case in de_thread where proc_flush_pid can be
      called early for a given pid.  This winds up being
      safe (if suboptimal) as this is just an optimization.
      
      Only pid directories are put on the list as the other
      per pid files are children of those directories and
      d_invalidate on the directory will get them as well.
      
      So that the pid can be used during flushing, its reference count is
      taken in release_task and dropped in proc_flush_pid.  Further, the call
      of proc_flush_pid is moved after the tasklist_lock is released in
      release_task so that it is certain that the pid has already been
      unhashed when flushing is taking place.  This removes a small race
      where a dentry could be recreated.
      
      As struct pid is supposed to be small and I need a per pid lock,
      I reuse the only lock that currently exists in struct pid: the
      wait_pidfd.lock.
      
      The net result is that this adds all of this functionality
      with just a little extra list management overhead and
      a single extra pointer in struct pid.
      
      v2: Initialize pid->inodes.  I somehow failed to get that
          initialization into the initial version of the patch.  A boot
          failure was reported by "kernel test robot <lkp@intel.com>", and
          failure to initialize that pid->inodes matches all of the reported
          symptoms.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
  13. 24 Feb, 2020 2 commits
    • proc: Clear the pieces of proc_inode that proc_evict_inode cares about · 71448011
      Eric W. Biederman committed
      This just keeps everything tidier, and allows for using flags like
      SLAB_TYPESAFE_BY_RCU where slabs are not always cleared before reuse.
      I don't see reuse without reinitializing happening with the proc_inode,
      but I had a false alarm while reworking flushing of proc dentries and
      inodes when a process dies that caused me to tidy this up.
      
      The code is a little easier to follow and reason about this
      way so I figured the changes might as well be kept.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • proc: Use d_invalidate in proc_prune_siblings_dcache · f90f3caf
      Eric W. Biederman committed
      The function d_prune_aliases has the problem that it will only prune
      aliases that are completely unused.  It will not remove aliases for
      the dcache or even think of removing mounts from the dcache.  For that
      behavior d_invalidate is needed.
      
      To use d_invalidate replace d_prune_aliases with d_find_alias followed
      by d_invalidate and dput.
      
      For completeness the directory and the non-directory cases are
      separated because in theory (although not currently in practice for
      proc) directories can only ever have a single dentry while
      non-directories can have hardlinks and thus multiple dentries.
      As part of this separation use d_find_any_alias for directories
      to spare d_find_alias the extra work of doing that.
      
      Plus the differences between d_find_any_alias and d_find_alias make
      it clear why the directory and non-directory code cannot share code.
      
      To make it clear these routines now invalidate dentries, rename
      proc_prune_siblings_dcache to proc_invalidate_siblings_dcache, and rename
      proc_sys_prune_dcache to proc_sys_invalidate_dcache.
      
      V2: Split the directory and non-directory cases, to make this
          code robust to future changes in proc.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  14. 22 Feb, 2020 1 commit
  15. 21 Feb, 2020 1 commit
  16. 20 Feb, 2020 1 commit
  17. 04 Feb, 2020 1 commit
  18. 17 Jul, 2019 1 commit
  19. 02 May, 2019 1 commit
  20. 28 Feb, 2019 2 commits
  21. 05 Jan, 2019 1 commit
  22. 27 Oct, 2018 1 commit
    • mm: zero-seek shrinkers · 4b85afbd
      Johannes Weiner committed
      The page cache and most shrinkable slab caches hold data that has been
      read from disk, but there are some caches that only cache CPU work, such
      as the dentry and inode caches of procfs and sysfs, as well as the subset
      of radix tree nodes that track non-resident page cache.
      
      Currently, all these are shrunk at the same rate: using DEFAULT_SEEKS for
      the shrinker's seeks setting tells the reclaim algorithm that for every
      two page cache pages scanned it should scan one slab object.
      
      This is a bogus setting.  A virtual inode that required no IO to create is
      not twice as valuable as a page cache page; shadow cache entries with
      eviction distances beyond the size of memory aren't either.
      
      In most cases, the behavior in practice is still fine.  Such virtual
      caches don't tend to grow and assert themselves aggressively, and usually
      get picked up before they cause problems.  But there are scenarios where
      that's not true.
      
      Our database workloads suffer from two of those.  For one, their file
      workingset is several times bigger than available memory, which has the
      kernel aggressively create shadow page cache entries for the non-resident
      parts of it.  The workingset code does tell the VM that most of these are
      expendable, but the VM ends up balancing them 2:1 to cache pages as per
      the seeks setting.  This is a huge waste of memory.
      
      These workloads also deal with tens of thousands of open files and use
      /proc for introspection, which ends up growing the proc_inode_cache to
      absurdly large sizes - again at the cost of valuable cache space, which
      isn't a reasonable trade-off, given that proc inodes can be re-created
      without involving the disk.
      
      This patch implements a "zero-seek" setting for shrinkers that results in
      a target ratio of 0:1 between their objects and IO-backed caches.  This
      allows such virtual caches to grow when memory is available (they do
      cache/avoid CPU work after all), but effectively disables them as soon as
      IO-backed objects are under pressure.
      
      It then switches the shrinkers for procfs and sysfs metadata, as well as
      excess page cache shadow nodes, to the new zero-seek setting.
      
      Link: http://lkml.kernel.org/r/20181009184732.762-5-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Domas Mituzas <dmituzas@fb.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
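To make the ratio argument concrete, here is a small user-space sketch of how a scan target could depend on a shrinker's seeks value. It is modelled on my reading of the slab scan calculation that this patch touches, so the exact arithmetic (the shift by priority, the factor of 4, and halving the freeable count for zero-seek caches) should be treated as an assumption and checked against mm/vmscan.c. `toy_scan_target` and `TOY_DEFAULT_SEEKS` are hypothetical names.

```c
#include <stdint.h>

/* Assumed to match the kernel's DEFAULT_SEEKS value of 2. */
#define TOY_DEFAULT_SEEKS 2

/*
 * Hypothetical helper: how many objects to scan in one pass.
 *
 * freeable: number of reclaimable objects the cache reports
 * priority: reclaim priority; a larger value means less memory pressure
 * seeks:    relative cost of recreating an object; 0 means no IO is needed
 */
uint64_t toy_scan_target(uint64_t freeable, int priority, int seeks)
{
	if (seeks == 0) {
		/* Zero-seek cache: trim aggressively, independent of priority. */
		return freeable / 2;
	}
	/* IO-backed cache: scale with pressure and inversely with seeks. */
	return ((freeable >> priority) * 4) / (uint64_t)seeks;
}
```

Under these assumptions, a cache of 4096 objects at priority 12 is asked to scan only 2 objects with the default seeks setting but 2048 objects with the zero-seek setting, which matches the described intent of effectively disabling virtual caches once IO-backed objects come under pressure.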
  23. 23 Aug, 2018 1 commit
  24. 15 Jun, 2018 1 commit
  25. 12 Apr, 2018 5 commits
  26. 07 Feb, 2018 2 commits