1. 24 1月, 2012 1 次提交
    • W
      proc: clear_refs: do not clear reserved pages · 85e72aa5
      Will Deacon 提交于
      /proc/pid/clear_refs is used to clear the Referenced and YOUNG bits for
      pages and corresponding page table entries of the task with PID pid, which
      includes any special mappings inserted into the page tables in order to
      provide things like vDSOs and user helper functions.
      
      On ARM this causes a problem because the vectors page is mapped as a
      global mapping and since ec706dab ("ARM: add a vma entry for the user
      accessible vector page"), a VMA is also inserted into each task for this
      page to aid unwinding through signals and syscall restarts.  Since the
      vectors page is required for handling faults, clearing the YOUNG bit (and
      subsequently writing a faulting pte) means that we lose the vectors page
      *globally* and cannot fault it back in.  This results in a system deadlock
      on the next exception.
      
      To see this problem in action, just run:
      
      	$ echo 1 > /proc/self/clear_refs
      
      on an ARM platform (as any user) and watch your system hang.  I think this
      has been the case since 2.6.37
      
      This patch avoids clearing the aforementioned bits for reserved pages,
      therefore leaving the vectors page intact on ARM.  Since reserved pages
      are not candidates for swap, this change should not have any impact on the
      usefulness of clear_refs.
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Reported-by: NMoussa Ba <moussaba@micron.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Acked-by: NNicolas Pitre <nico@linaro.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: <stable@vger.kernel.org>		[2.6.37+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      85e72aa5
  2. 18 1月, 2012 3 次提交
    • L
      proc: clean up and fix /proc/<pid>/mem handling · e268337d
      Linus Torvalds 提交于
      Jüri Aedla reported that the /proc/<pid>/mem handling really isn't very
      robust, and it also doesn't match the permission checking of any of the
      other related files.
      
      This changes it to do the permission checks at open time, and instead of
      tracking the process, it tracks the VM at the time of the open.  That
      simplifies the code a lot, but does mean that if you hold the file
      descriptor open over an execve(), you'll continue to read from the _old_
      VM.
      
      That is different from our previous behavior, but much simpler.  If
      somebody actually finds a load where this matters, we'll need to revert
      this commit.
      
      I suspect that nobody will ever notice - because the process mapping
      addresses will also have changed as part of the execve.  So you cannot
      actually usefully access the fd across a VM change simply because all
      the offsets for IO would have changed too.
      Reported-by: NJüri Aedla <asd@ut.ee>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e268337d
    • E
      audit: only allow tasks to set their loginuid if it is -1 · 633b4545
      Eric Paris 提交于
      At the moment we allow tasks to set their loginuid if they have
      CAP_AUDIT_CONTROL.  In reality we want tasks to set the loginuid when they
      log in and it be impossible to ever reset.  We had to make it mutable even
      after it was once set (with the CAP) because on update and admin might have
      to restart sshd.  Now sshd would get his loginuid and the next user which
      logged in using ssh would not be able to set his loginuid.
      
      Systemd has changed how userspace works and allowed us to make the kernel
      work the way it should.  With systemd users (even admins) are not supposed
      to restart services directly.  The system will restart the service for
      them.  Thus since systemd is going to loginuid==-1, sshd would get -1, and
      sshd would be allowed to set a new loginuid without special permissions.
      
      If an admin in this system were to manually start an sshd he is inserting
      himself into the system chain of trust and thus, logically, it's his
      loginuid that should be used!  Since we have old systems I make this a
      Kconfig option.
      Signed-off-by: NEric Paris <eparis@redhat.com>
      633b4545
    • E
      audit: remove task argument to audit_set_loginuid · 0a300be6
      Eric Paris 提交于
      The function always deals with current.  Don't expose an option
      pretending one can use it for something.  You can't.
      Signed-off-by: NEric Paris <eparis@redhat.com>
      0a300be6
  3. 16 1月, 2012 1 次提交
  4. 13 1月, 2012 2 次提交
  5. 11 1月, 2012 5 次提交
    • V
      procfs: add hidepid= and gid= mount options · 0499680a
      Vasiliy Kulikov 提交于
      Add support for mount options to restrict access to /proc/PID/
      directories.  The default backward-compatible "relaxed" behaviour is left
      untouched.
      
      The first mount option is called "hidepid" and its value defines how much
      info about processes we want to be available for non-owners:
      
      hidepid=0 (default) means the old behavior - anybody may read all
      world-readable /proc/PID/* files.
      
      hidepid=1 means users may not access any /proc/<pid>/ directories, but
      their own.  Sensitive files like cmdline, sched*, status are now protected
      against other users.  As permission checking done in proc_pid_permission()
      and files' permissions are left untouched, programs expecting specific
      files' modes are not confused.
      
      hidepid=2 means hidepid=1 plus all /proc/PID/ will be invisible to other
      users.  It doesn't mean that it hides whether a process exists (it can be
      learned by other means, e.g.  by kill -0 $PID), but it hides process' euid
      and egid.  It compicates intruder's task of gathering info about running
      processes, whether some daemon runs with elevated privileges, whether
      another user runs some sensitive program, whether other users run any
      program at all, etc.
      
      gid=XXX defines a group that will be able to gather all processes' info
      (as in hidepid=0 mode).  This group should be used instead of putting
      nonroot user in sudoers file or something.  However, untrusted users (like
      daemons, etc.) which are not supposed to monitor the tasks in the whole
      system should not be added to the group.
      
      hidepid=1 or higher is designed to restrict access to procfs files, which
      might reveal some sensitive private information like precise keystrokes
      timings:
      
      http://www.openwall.com/lists/oss-security/2011/11/05/3
      
      hidepid=1/2 doesn't break monitoring userspace tools.  ps, top, pgrep, and
      conky gracefully handle EPERM/ENOENT and behave as if the current user is
      the only user running processes.  pstree shows the process subtree which
      contains "pstree" process.
      
      Note: the patch doesn't deal with setuid/setgid issues of keeping
      preopened descriptors of procfs files (like
      https://lkml.org/lkml/2011/2/7/368).  We rely on that the leaked
      information like the scheduling counters of setuid apps doesn't threaten
      anybody's privacy - only the user started the setuid program may read the
      counters.
      Signed-off-by: NVasiliy Kulikov <segoon@openwall.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Randy Dunlap <rdunlap@xenotime.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Theodore Tso <tytso@MIT.EDU>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: James Morris <jmorris@namei.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0499680a
    • V
      procfs: parse mount options · 97412950
      Vasiliy Kulikov 提交于
      Add support for procfs mount options.  Actual mount options are coming in
      the next patches.
      Signed-off-by: NVasiliy Kulikov <segoon@openwall.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Randy Dunlap <rdunlap@xenotime.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Theodore Tso <tytso@MIT.EDU>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: James Morris <jmorris@namei.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      97412950
    • P
      procfs: introduce the /proc/<pid>/map_files/ directory · 640708a2
      Pavel Emelyanov 提交于
      This one behaves similarly to the /proc/<pid>/fd/ one - it contains
      symlinks one for each mapping with file, the name of a symlink is
      "vma->vm_start-vma->vm_end", the target is the file.  Opening a symlink
      results in a file that point exactly to the same inode as them vma's one.
      
      For example the ls -l of some arbitrary /proc/<pid>/map_files/
      
       | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
       | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
       | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
       | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
       | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so
      
      This *helps* checkpointing process in three ways:
      
      1. When dumping a task mappings we do know exact file that is mapped
         by particular region.  We do this by opening
         /proc/$pid/map_files/$address symlink the way we do with file
         descriptors.
      
      2. This also helps in determining which anonymous shared mappings are
         shared with each other by comparing the inodes of them.
      
      3. When restoring a set of processes in case two of them has a mapping
         shared, we map the memory by the 1st one and then open its
         /proc/$pid/map_files/$address file and map it by the 2nd task.
      
      Using /proc/$pid/maps for this is quite inconvenient since it brings
      repeatable re-reading and reparsing for this text file which slows down
      restore procedure significantly.  Also as being pointed in (3) it is a way
      easier to use top level shared mapping in children as
      /proc/$pid/map_files/$address when needed.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [gorcunov@openvz.org: make map_files depend on CHECKPOINT_RESTORE]
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Reviewed-by: NVasiliy Kulikov <segoon@openwall.com>
      Reviewed-by: N"Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      640708a2
    • C
      procfs: make proc_get_link to use dentry instead of inode · 7773fbc5
      Cyrill Gorcunov 提交于
      Prepare the ground for the next "map_files" patch which needs a name of a
      link file to analyse.
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7773fbc5
    • K
      tracepoint: add tracepoints for debugging oom_score_adj · 43d2b113
      KAMEZAWA Hiroyuki 提交于
      oom_score_adj is used for guarding processes from OOM-Killer.  One of
      problem is that it's inherited at fork().  When a daemon set oom_score_adj
      and make children, it's hard to know where the value is set.
      
      This patch adds some tracepoints useful for debugging. This patch adds
      3 trace points.
        - creating new task
        - renaming a task (exec)
        - set oom_score_adj
      
      To debug, users need to enable some trace pointer. Maybe filtering is useful as
      
      # EVENT=/sys/kernel/debug/tracing/events/task/
      # echo "oom_score_adj != 0" > $EVENT/task_newtask/filter
      # echo "oom_score_adj != 0" > $EVENT/task_rename/filter
      # echo 1 > $EVENT/enable
      # EVENT=/sys/kernel/debug/tracing/events/oom/
      # echo 1 > $EVENT/enable
      
      output will be like this.
      # grep oom /sys/kernel/debug/tracing/trace
      bash-7699  [007] d..3  5140.744510: oom_score_adj_update: pid=7699 comm=bash oom_score_adj=-1000
      bash-7699  [007] ...1  5151.818022: task_newtask: pid=7729 comm=bash clone_flags=1200011 oom_score_adj=-1000
      ls-7729  [003] ...2  5151.818504: task_rename: pid=7729 oldcomm=bash newcomm=ls oom_score_adj=-1000
      bash-7699  [002] ...1  5175.701468: task_newtask: pid=7730 comm=bash clone_flags=1200011 oom_score_adj=-1000
      grep-7730  [007] ...2  5175.701993: task_rename: pid=7730 oldcomm=bash newcomm=grep oom_score_adj=-1000
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      43d2b113
  6. 06 1月, 2012 1 次提交
    • E
      ptrace: do not audit capability check when outputing /proc/pid/stat · 69f594a3
      Eric Paris 提交于
      Reading /proc/pid/stat of another process checks if one has ptrace permissions
      on that process.  If one does have permissions it outputs some data about the
      process which might have security and attack implications.  If the current
      task does not have ptrace permissions the read still works, but those fields
      are filled with inocuous (0) values.  Since this check and a subsequent denial
      is not a violation of the security policy we should not audit such denials.
      
      This can be quite useful to removing ptrace broadly across a system without
      flooding the logs when ps is run or something which harmlessly walks proc.
      Signed-off-by: NEric Paris <eparis@redhat.com>
      Acked-by: NSerge E. Hallyn <serge.hallyn@canonical.com>
      69f594a3
  7. 04 1月, 2012 4 次提交
  8. 30 12月, 2011 1 次提交
    • A
      procfs: do not confuse jiffies with cputime64_t · 34845636
      Andreas Schwab 提交于
      Commit 2a95ea6c ("procfs: do not overflow get_{idle,iowait}_time
      for nohz") did not take into account that one some architectures jiffies
      and cputime use different units.
      
      This causes get_idle_time() to return numbers in the wrong units, making
      the idle time fields in /proc/stat wrong.
      
      Instead of converting the usec value returned by
      get_cpu_{idle,iowait}_time_us to units of jiffies, use the new function
      usecs_to_cputime64 to convert it to the correct unit of cputime64_t.
      Signed-off-by: NAndreas Schwab <schwab@linux-m68k.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: "Artem S. Tashkinov" <t.artem@mailcity.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      34845636
  9. 15 12月, 2011 2 次提交
  10. 09 12月, 2011 3 次提交
    • M
      procfs: do not overflow get_{idle,iowait}_time for nohz · 2a95ea6c
      Michal Hocko 提交于
      Since commit a25cac51 ("proc: Consider NO_HZ when printing idle and
      iowait times") we are reporting idle/io_wait time also while a CPU is
      tickless.  We rely on get_{idle,iowait}_time functions to retrieve
      proper data.
      
      These functions, however, use usecs_to_cputime to translate micro
      seconds time to cputime64_t.  This is just an alias to usecs_to_jiffies
      which reduces the data type from u64 to unsigned int and also checks
      whether the given parameter overflows jiffies_to_usecs(MAX_JIFFY_OFFSET)
      and returns MAX_JIFFY_OFFSET in that case.
      
      When we overflow depends on CONFIG_HZ but especially for CONFIG_HZ_300
      it is quite low (1431649781) so we are getting MAX_JIFFY_OFFSET for
      >3000s! until we overflow unsigned int.  Just for reference
      CONFIG_HZ_100 has an overflow window around 20s, CONFIG_HZ_250 ~8s and
      CONFIG_HZ_1000 ~2s.
      
      This results in a bug when people saw [h]top going mad reporting 100%
      CPU usage even though there was basically no CPU load.  The reason was
      simply that /proc/stat stopped reporting idle/io_wait changes (and
      reported MAX_JIFFY_OFFSET) and so the only change happening was for user
      system time.
      
      Let's use nsecs_to_jiffies64 instead which doesn't reduce the precision
      to 32b type and it is much more appropriate for cumulative time values
      (unlike usecs_to_jiffies which intended for timeout calculations).
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Tested-by: NArtem S. Tashkinov <t.artem@mailcity.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2a95ea6c
    • C
      fs/proc/meminfo.c: fix compilation error · b53fc7c2
      Claudio Scordino 提交于
      Fix the error message "directives may not be used inside a macro argument"
      which appears when the kernel is compiled for the cris architecture.
      Signed-off-by: NClaudio Scordino <claudio@evidence.eu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b53fc7c2
    • A
      procfs: fix a vfsmount longterm reference leak · 905ad269
      Al Viro 提交于
      kern_mount() doesn't pair with plain mntput()...
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      905ad269
  11. 06 12月, 2011 1 次提交
  12. 10 11月, 2011 1 次提交
  13. 03 11月, 2011 3 次提交
  14. 02 11月, 2011 2 次提交
  15. 01 11月, 2011 4 次提交
  16. 22 9月, 2011 3 次提交
  17. 08 9月, 2011 1 次提交
    • M
      proc: Consider NO_HZ when printing idle and iowait times · a25cac51
      Michal Hocko 提交于
      show_stat handler of the /proc/stat file relies on kstat_cpu(cpu)
      statistics when priting information about idle and iowait times.
      This is OK if we are not using tickless kernel (CONFIG_NO_HZ) because
      counters are updated periodically.
      With NO_HZ things got more tricky because we are not doing idle/iowait
      accounting while we are tickless so the value might get outdated.
      Users of /proc/stat will notice that by unchanged idle/iowait values
      which is then interpreted as 0% idle/iowait time. From the user space
      POV this is an unexpected behavior and a change of the interface.
      
      Let's fix this by using get_cpu_{idle,iowait}_time_us which accounts the
      total idle/iowait time since boot and it doesn't rely on sampling or any
      other periodic activity. Fall back to the previous behavior if NO_HZ is
      disabled or not configured.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Link: http://lkml.kernel.org/r/39181366adac1b39cb6aa3cd53ff0f7c78d32676.1314172057.git.mhocko@suse.czSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      a25cac51
  18. 07 8月, 2011 2 次提交
    • L
      vfs: show O_CLOEXE bit properly in /proc/<pid>/fdinfo/<fd> files · 1117f72e
      Linus Torvalds 提交于
      The CLOEXE bit is magical, and for performance (and semantic) reasons we
      don't actually maintain it in the file descriptor itself, but in a
      separate bit array.  Which means that when we show f_flags, the CLOEXE
      status is shown incorrectly: we show the status not as it is now, but as
      it was when the file was opened.
      
      Fix that by looking up the bit properly in the 'fdt->close_on_exec' bit
      array.
      
      Uli needs this in order to re-implement the pfiles program:
      
        "For normal file descriptors (not sockets) this was the last piece of
         information which wasn't available.  This is all part of my 'give
         Solaris users no reason to not switch' effort.  I intend to offer the
         code to the util-linux-ng maintainers."
      Requested-by: NUlrich Drepper <drepper@akkadia.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1117f72e
    • L
      oom_ajd: don't use WARN_ONCE, just use printk_once · c2142704
      Linus Torvalds 提交于
      WARN_ONCE() is very annoying, in that it shows the stack trace that we
      don't care about at all, and also triggers various user-level "kernel
      oopsed" logic that we really don't care about.  And it's not like the
      user can do anything about the applications (sshd) in question, it's a
      distro issue.
      
      Requested-by: Andi Kleen <andi@firstfloor.org> (and many others)
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c2142704