1. 18 1月, 2012 1 次提交
    • L
      proc: clean up and fix /proc/<pid>/mem handling · e268337d
      Linus Torvalds 提交于
      Jüri Aedla reported that the /proc/<pid>/mem handling really isn't very
      robust, and it also doesn't match the permission checking of any of the
      other related files.
      
      This changes it to do the permission checks at open time, and instead of
      tracking the process, it tracks the VM at the time of the open.  That
      simplifies the code a lot, but does mean that if you hold the file
      descriptor open over an execve(), you'll continue to read from the _old_
      VM.
      
      That is different from our previous behavior, but much simpler.  If
      somebody actually finds a load where this matters, we'll need to revert
      this commit.
      
      I suspect that nobody will ever notice - because the process mapping
      addresses will also have changed as part of the execve.  So you cannot
      actually usefully access the fd across a VM change simply because all
      the offsets for IO would have changed too.
      Reported-by: NJüri Aedla <asd@ut.ee>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e268337d
  2. 13 1月, 2012 1 次提交
  3. 11 1月, 2012 4 次提交
    • V
      procfs: add hidepid= and gid= mount options · 0499680a
      Vasiliy Kulikov 提交于
      Add support for mount options to restrict access to /proc/PID/
      directories.  The default backward-compatible "relaxed" behaviour is left
      untouched.
      
      The first mount option is called "hidepid" and its value defines how much
      info about processes we want to be available for non-owners:
      
      hidepid=0 (default) means the old behavior - anybody may read all
      world-readable /proc/PID/* files.
      
      hidepid=1 means users may not access any /proc/<pid>/ directories, but
      their own.  Sensitive files like cmdline, sched*, status are now protected
      against other users.  As permission checking done in proc_pid_permission()
      and files' permissions are left untouched, programs expecting specific
      files' modes are not confused.
      
      hidepid=2 means hidepid=1 plus all /proc/PID/ will be invisible to other
      users.  It doesn't mean that it hides whether a process exists (it can be
      learned by other means, e.g.  by kill -0 $PID), but it hides process' euid
      and egid.  It compicates intruder's task of gathering info about running
      processes, whether some daemon runs with elevated privileges, whether
      another user runs some sensitive program, whether other users run any
      program at all, etc.
      
      gid=XXX defines a group that will be able to gather all processes' info
      (as in hidepid=0 mode).  This group should be used instead of putting
      nonroot user in sudoers file or something.  However, untrusted users (like
      daemons, etc.) which are not supposed to monitor the tasks in the whole
      system should not be added to the group.
      
      hidepid=1 or higher is designed to restrict access to procfs files, which
      might reveal some sensitive private information like precise keystrokes
      timings:
      
      http://www.openwall.com/lists/oss-security/2011/11/05/3
      
      hidepid=1/2 doesn't break monitoring userspace tools.  ps, top, pgrep, and
      conky gracefully handle EPERM/ENOENT and behave as if the current user is
      the only user running processes.  pstree shows the process subtree which
      contains "pstree" process.
      
      Note: the patch doesn't deal with setuid/setgid issues of keeping
      preopened descriptors of procfs files (like
      https://lkml.org/lkml/2011/2/7/368).  We rely on that the leaked
      information like the scheduling counters of setuid apps doesn't threaten
      anybody's privacy - only the user started the setuid program may read the
      counters.
      Signed-off-by: NVasiliy Kulikov <segoon@openwall.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Randy Dunlap <rdunlap@xenotime.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Theodore Tso <tytso@MIT.EDU>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: James Morris <jmorris@namei.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0499680a
    • P
      procfs: introduce the /proc/<pid>/map_files/ directory · 640708a2
      Pavel Emelyanov 提交于
      This one behaves similarly to the /proc/<pid>/fd/ one - it contains
      symlinks one for each mapping with file, the name of a symlink is
      "vma->vm_start-vma->vm_end", the target is the file.  Opening a symlink
      results in a file that point exactly to the same inode as them vma's one.
      
      For example the ls -l of some arbitrary /proc/<pid>/map_files/
      
       | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
       | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
       | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
       | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
       | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so
      
      This *helps* checkpointing process in three ways:
      
      1. When dumping a task mappings we do know exact file that is mapped
         by particular region.  We do this by opening
         /proc/$pid/map_files/$address symlink the way we do with file
         descriptors.
      
      2. This also helps in determining which anonymous shared mappings are
         shared with each other by comparing the inodes of them.
      
      3. When restoring a set of processes in case two of them has a mapping
         shared, we map the memory by the 1st one and then open its
         /proc/$pid/map_files/$address file and map it by the 2nd task.
      
      Using /proc/$pid/maps for this is quite inconvenient since it brings
      repeatable re-reading and reparsing for this text file which slows down
      restore procedure significantly.  Also as being pointed in (3) it is a way
      easier to use top level shared mapping in children as
      /proc/$pid/map_files/$address when needed.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [gorcunov@openvz.org: make map_files depend on CHECKPOINT_RESTORE]
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Reviewed-by: NVasiliy Kulikov <segoon@openwall.com>
      Reviewed-by: N"Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      640708a2
    • C
      procfs: make proc_get_link to use dentry instead of inode · 7773fbc5
      Cyrill Gorcunov 提交于
      Prepare the ground for the next "map_files" patch which needs a name of a
      link file to analyse.
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7773fbc5
    • K
      tracepoint: add tracepoints for debugging oom_score_adj · 43d2b113
      KAMEZAWA Hiroyuki 提交于
      oom_score_adj is used for guarding processes from OOM-Killer.  One of
      problem is that it's inherited at fork().  When a daemon set oom_score_adj
      and make children, it's hard to know where the value is set.
      
      This patch adds some tracepoints useful for debugging. This patch adds
      3 trace points.
        - creating new task
        - renaming a task (exec)
        - set oom_score_adj
      
      To debug, users need to enable some trace pointer. Maybe filtering is useful as
      
      # EVENT=/sys/kernel/debug/tracing/events/task/
      # echo "oom_score_adj != 0" > $EVENT/task_newtask/filter
      # echo "oom_score_adj != 0" > $EVENT/task_rename/filter
      # echo 1 > $EVENT/enable
      # EVENT=/sys/kernel/debug/tracing/events/oom/
      # echo 1 > $EVENT/enable
      
      output will be like this.
      # grep oom /sys/kernel/debug/tracing/trace
      bash-7699  [007] d..3  5140.744510: oom_score_adj_update: pid=7699 comm=bash oom_score_adj=-1000
      bash-7699  [007] ...1  5151.818022: task_newtask: pid=7729 comm=bash clone_flags=1200011 oom_score_adj=-1000
      ls-7729  [003] ...2  5151.818504: task_rename: pid=7729 oldcomm=bash newcomm=ls oom_score_adj=-1000
      bash-7699  [002] ...1  5175.701468: task_newtask: pid=7730 comm=bash clone_flags=1200011 oom_score_adj=-1000
      grep-7730  [007] ...2  5175.701993: task_rename: pid=7730 oldcomm=bash newcomm=grep oom_score_adj=-1000
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      43d2b113
  4. 04 1月, 2012 2 次提交
  5. 10 11月, 2011 1 次提交
  6. 03 11月, 2011 1 次提交
    • V
      proc: fix races against execve() of /proc/PID/fd** · aa6afca5
      Vasiliy Kulikov 提交于
      fd* files are restricted to the task's owner, and other users may not get
      direct access to them.  But one may open any of these files and run any
      setuid program, keeping opened file descriptors.  As there are permission
      checks on open(), but not on readdir() and read(), operations on the kept
      file descriptors will not be checked.  It makes it possible to violate
      procfs permission model.
      
      Reading fdinfo/* may disclosure current fds' position and flags, reading
      directory contents of fdinfo/ and fd/ may disclosure the number of opened
      files by the target task.  This information is not sensible per se, but it
      can reveal some private information (like length of a password stored in a
      file) under certain conditions.
      
      Used existing (un)lock_trace functions to check for ptrace_may_access(),
      but instead of using EPERM return code from it use EACCES to be consistent
      with existing proc_pid_follow_link()/proc_pid_readlink() return code.  If
      they differ, attacker can guess what fds exist by analyzing stat() return
      code.  Patched handlers: stat() for fd/*, stat() and read() for fdindo/*,
      readdir() and lookup() for fd/ and fdinfo/.
      Signed-off-by: NVasiliy Kulikov <segoon@openwall.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: <stable@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      aa6afca5
  7. 02 11月, 2011 1 次提交
  8. 01 11月, 2011 1 次提交
    • D
      oom: remove oom_disable_count · c9f01245
      David Rientjes 提交于
      This removes mm->oom_disable_count entirely since it's unnecessary and
      currently buggy.  The counter was intended to be per-process but it's
      currently decremented in the exit path for each thread that exits, causing
      it to underflow.
      
      The count was originally intended to prevent oom killing threads that
      share memory with threads that cannot be killed since it doesn't lead to
      future memory freeing.  The counter could be fixed to represent all
      threads sharing the same mm, but it's better to remove the count since:
      
       - it is possible that the OOM_DISABLE thread sharing memory with the
         victim is waiting on that thread to exit and will actually cause
         future memory freeing, and
      
       - there is no guarantee that a thread is disabled from oom killing just
         because another thread sharing its mm is oom disabled.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Reported-by: NOleg Nesterov <oleg@redhat.com>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c9f01245
  9. 07 8月, 2011 2 次提交
    • L
      vfs: show O_CLOEXE bit properly in /proc/<pid>/fdinfo/<fd> files · 1117f72e
      Linus Torvalds 提交于
      The CLOEXE bit is magical, and for performance (and semantic) reasons we
      don't actually maintain it in the file descriptor itself, but in a
      separate bit array.  Which means that when we show f_flags, the CLOEXE
      status is shown incorrectly: we show the status not as it is now, but as
      it was when the file was opened.
      
      Fix that by looking up the bit properly in the 'fdt->close_on_exec' bit
      array.
      
      Uli needs this in order to re-implement the pfiles program:
      
        "For normal file descriptors (not sockets) this was the last piece of
         information which wasn't available.  This is all part of my 'give
         Solaris users no reason to not switch' effort.  I intend to offer the
         code to the util-linux-ng maintainers."
      Requested-by: NUlrich Drepper <drepper@akkadia.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1117f72e
    • L
      oom_ajd: don't use WARN_ONCE, just use printk_once · c2142704
      Linus Torvalds 提交于
      WARN_ONCE() is very annoying, in that it shows the stack trace that we
      don't care about at all, and also triggers various user-level "kernel
      oopsed" logic that we really don't care about.  And it's not like the
      user can do anything about the applications (sshd) in question, it's a
      distro issue.
      
      Requested-by: Andi Kleen <andi@firstfloor.org> (and many others)
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c2142704
  10. 27 7月, 2011 1 次提交
    • V
      proc: fix a race in do_io_accounting() · 293eb1e7
      Vasiliy Kulikov 提交于
      If an inode's mode permits opening /proc/PID/io and the resulting file
      descriptor is kept across execve() of a setuid or similar binary, the
      ptrace_may_access() check tries to prevent using this fd against the
      task with escalated privileges.
      
      Unfortunately, there is a race in the check against execve().  If
      execve() is processed after the ptrace check, but before the actual io
      information gathering, io statistics will be gathered from the
      privileged process.  At least in theory this might lead to gathering
      sensible information (like ssh/ftp password length) that wouldn't be
      available otherwise.
      
      Holding task->signal->cred_guard_mutex while gathering the io
      information should protect against the race.
      
      The order of locking is similar to the one inside of ptrace_attach():
      first goes cred_guard_mutex, then lock_task_sighand().
      Signed-off-by: NVasiliy Kulikov <segoon@openwall.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      293eb1e7
  11. 26 7月, 2011 1 次提交
    • D
      oom: make deprecated use of oom_adj more verbose · be8f684d
      David Rientjes 提交于
      /proc/pid/oom_adj is deprecated and scheduled for removal in August 2012
      according to Documentation/feature-removal-schedule.txt.
      
      This patch makes the warning more verbose by making it appear as a more
      serious problem (the presence of a stack trace and being multiline should
      attract more attention) so that applications still using the old interface
      can get fixed.
      
      Very popular users of the old interface have been converted since the oom
      killer rewrite has been introduced.  udevd switched to the
      /proc/pid/oom_score_adj interface for v162, kde switched in 4.6.1, and
      opensshd switched in 5.7p1.
      
      At the start of 2012, this should be changed into a WARN() to emit all
      such incidents and then finally remove the tunable in August 2012 as
      scheduled.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      be8f684d
  12. 21 7月, 2011 1 次提交
  13. 20 7月, 2011 3 次提交
  14. 29 6月, 2011 1 次提交
  15. 23 6月, 2011 1 次提交
  16. 20 6月, 2011 1 次提交
  17. 27 5月, 2011 4 次提交
    • C
      arch/tile: more /proc and /sys file support · f133ecca
      Chris Metcalf 提交于
      This change introduces a few of the less controversial /proc and
      /proc/sys interfaces for tile, along with sysfs attributes for
      various things that were originally proposed as /proc/tile files.
      It also adjusts the "hardwall" proc API.
      
      Arnd Bergmann reviewed the initial arch/tile submission, which
      included a complete set of all the /proc/tile and /proc/sys/tile
      knobs that we had added in a somewhat ad hoc way during initial
      development, and provided feedback on where most of them should go.
      
      One knob turned out to be similar enough to the existing
      /proc/sys/debug/exception-trace that it was re-implemented to use
      that model instead.
      
      Another knob was /proc/tile/grid, which reported the "grid" dimensions
      of a tile chip (e.g. 8x8 processors = 64-core chip).  Arnd suggested
      looking at sysfs for that, so this change moves that information
      to a pair of sysfs attributes (chip_width and chip_height) in the
      /sys/devices/system/cpu directory.  We also put the "chip_serial"
      and "chip_revision" information from our old /proc/tile/board file
      as attributes in /sys/devices/system/cpu.
      
      Other information collected via hypervisor APIs is now placed in
      /sys/hypervisor.  We create a /sys/hypervisor/type file (holding the
      constant string "tilera") to be parallel with the Xen use of
      /sys/hypervisor/type holding "xen".  We create three top-level files,
      "version" (the hypervisor's own version), "config_version" (the
      version of the configuration file), and "hvconfig" (the contents of
      the configuration file).  The remaining information from our old
      /proc/tile/board and /proc/tile/switch files becomes an attribute
      group appearing under /sys/hypervisor/board/.
      
      Finally, after some feedback from Arnd Bergmann for the previous
      version of this patch, the /proc/tile/hardwall file is split up into
      two conceptual parts.  First, a directory /proc/tile/hardwall/ which
      contains one file per active hardwall, each file named after the
      hardwall's ID and holding a cpulist that says which cpus are enclosed by
      the hardwall.  Second, a /proc/PID file "hardwall" that is either
      empty (for non-hardwall-using processes) or contains the hardwall ID.
      
      Finally, this change pushes the /proc/sys/tile/unaligned_fixup/
      directory, with knobs controlling the kernel code for handling the
      fixup of unaligned exceptions.
      Reviewed-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NChris Metcalf <cmetcalf@tilera.com>
      f133ecca
    • K
      proc: put check_mem_permission after __get_free_page in mem_write · 30cd8903
      KOSAKI Motohiro 提交于
      It whould be better if put check_mem_permission after __get_free_page in
      mem_write, to be same as function mem_read.
      
      Hugh Dickins explained the reason.
      
          check_mem_permission gets a reference to the mm.  If we __get_free_page
          after check_mem_permission, imagine what happens if the system is out
          of memory, and the mm we're looking at is selected for killing by the
          OOM killer: while we wait in __get_free_page for more memory, no memory
          is freed from the selected mm because it cannot reach exit_mmap while
          we hold that reference.
      Reported-by: NJovi Zhang <bookjovi@gmail.com>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Reviewed-by: NStephen Wilson <wilsons@start.ca>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      30cd8903
    • A
      fs/proc: convert to kstrtoX() · 0a8cb8e3
      Alexey Dobriyan 提交于
      Convert fs/proc/ from strict_strto*() to kstrto*() functions.
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0a8cb8e3
    • J
      mm: extract exe_file handling from procfs · 38646013
      Jiri Slaby 提交于
      Setup and cleanup of mm_struct->exe_file is currently done in fs/proc/.
      This was because exe_file was needed only for /proc/<pid>/exe.  Since we
      will need the exe_file functionality also for core dumps (so core name can
      contain full binary path), built this functionality always into the
      kernel.
      
      To achieve that move that out of proc FS to the kernel/ where in fact it
      should belong.  By doing that we can make dup_mm_exe_file static.  Also we
      can drop linux/proc_fs.h inclusion in fs/exec.c and kernel/fork.c.
      Signed-off-by: NJiri Slaby <jslaby@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      38646013
  18. 11 5月, 2011 1 次提交
    • E
      ns: proc files for namespace naming policy. · 6b4e306a
      Eric W. Biederman 提交于
      Create files under /proc/<pid>/ns/ to allow controlling the
      namespaces of a process.
      
      This addresses three specific problems that can make namespaces hard to
      work with.
      - Namespaces require a dedicated process to pin them in memory.
      - It is not possible to use a namespace unless you are the child
        of the original creator.
      - Namespaces don't have names that userspace can use to talk about
        them.
      
      The namespace files under /proc/<pid>/ns/ can be opened and the
      file descriptor can be used to talk about a specific namespace, and
      to keep the specified namespace alive.
      
      A namespace can be kept alive by either holding the file descriptor
      open or bind mounting the file someplace else.  aka:
      mount --bind /proc/self/ns/net /some/filesystem/path
      mount --bind /proc/self/fd/<N> /some/filesystem/path
      
      This allows namespaces to be named with userspace policy.
      
      It requires additional support to make use of these filedescriptors
      and that will be comming in the following patches.
      Acked-by: NDaniel Lezcano <daniel.lezcano@free.fr>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      6b4e306a
  19. 19 4月, 2011 1 次提交
  20. 31 3月, 2011 1 次提交
  21. 24 3月, 2011 10 次提交