1. 11 1月, 2012 3 次提交
    • P
      procfs: introduce the /proc/<pid>/map_files/ directory · 640708a2
      Pavel Emelyanov 提交于
      This one behaves similarly to the /proc/<pid>/fd/ one - it contains
      symlinks one for each mapping with file, the name of a symlink is
      "vma->vm_start-vma->vm_end", the target is the file.  Opening a symlink
      results in a file that point exactly to the same inode as them vma's one.
      
      For example the ls -l of some arbitrary /proc/<pid>/map_files/
      
       | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
       | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
       | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
       | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
       | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so
      
      This *helps* checkpointing process in three ways:
      
      1. When dumping a task mappings we do know exact file that is mapped
         by particular region.  We do this by opening
         /proc/$pid/map_files/$address symlink the way we do with file
         descriptors.
      
      2. This also helps in determining which anonymous shared mappings are
         shared with each other by comparing the inodes of them.
      
      3. When restoring a set of processes in case two of them has a mapping
         shared, we map the memory by the 1st one and then open its
         /proc/$pid/map_files/$address file and map it by the 2nd task.
      
      Using /proc/$pid/maps for this is quite inconvenient since it brings
      repeatable re-reading and reparsing for this text file which slows down
      restore procedure significantly.  Also as being pointed in (3) it is a way
      easier to use top level shared mapping in children as
      /proc/$pid/map_files/$address when needed.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [gorcunov@openvz.org: make map_files depend on CHECKPOINT_RESTORE]
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Reviewed-by: NVasiliy Kulikov <segoon@openwall.com>
      Reviewed-by: N"Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      640708a2
    • C
      procfs: make proc_get_link to use dentry instead of inode · 7773fbc5
      Cyrill Gorcunov 提交于
      Prepare the ground for the next "map_files" patch which needs a name of a
      link file to analyse.
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7773fbc5
    • K
      tracepoint: add tracepoints for debugging oom_score_adj · 43d2b113
      KAMEZAWA Hiroyuki 提交于
      oom_score_adj is used for guarding processes from OOM-Killer.  One of
      problem is that it's inherited at fork().  When a daemon set oom_score_adj
      and make children, it's hard to know where the value is set.
      
      This patch adds some tracepoints useful for debugging. This patch adds
      3 trace points.
        - creating new task
        - renaming a task (exec)
        - set oom_score_adj
      
      To debug, users need to enable some trace pointer. Maybe filtering is useful as
      
      # EVENT=/sys/kernel/debug/tracing/events/task/
      # echo "oom_score_adj != 0" > $EVENT/task_newtask/filter
      # echo "oom_score_adj != 0" > $EVENT/task_rename/filter
      # echo 1 > $EVENT/enable
      # EVENT=/sys/kernel/debug/tracing/events/oom/
      # echo 1 > $EVENT/enable
      
      output will be like this.
      # grep oom /sys/kernel/debug/tracing/trace
      bash-7699  [007] d..3  5140.744510: oom_score_adj_update: pid=7699 comm=bash oom_score_adj=-1000
      bash-7699  [007] ...1  5151.818022: task_newtask: pid=7729 comm=bash clone_flags=1200011 oom_score_adj=-1000
      ls-7729  [003] ...2  5151.818504: task_rename: pid=7729 oldcomm=bash newcomm=ls oom_score_adj=-1000
      bash-7699  [002] ...1  5175.701468: task_newtask: pid=7730 comm=bash clone_flags=1200011 oom_score_adj=-1000
      grep-7730  [007] ...2  5175.701993: task_rename: pid=7730 oldcomm=bash newcomm=grep oom_score_adj=-1000
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      43d2b113
  2. 04 1月, 2012 4 次提交
  3. 30 12月, 2011 1 次提交
    • A
      procfs: do not confuse jiffies with cputime64_t · 34845636
      Andreas Schwab 提交于
      Commit 2a95ea6c ("procfs: do not overflow get_{idle,iowait}_time
      for nohz") did not take into account that one some architectures jiffies
      and cputime use different units.
      
      This causes get_idle_time() to return numbers in the wrong units, making
      the idle time fields in /proc/stat wrong.
      
      Instead of converting the usec value returned by
      get_cpu_{idle,iowait}_time_us to units of jiffies, use the new function
      usecs_to_cputime64 to convert it to the correct unit of cputime64_t.
      Signed-off-by: NAndreas Schwab <schwab@linux-m68k.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: "Artem S. Tashkinov" <t.artem@mailcity.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      34845636
  4. 15 12月, 2011 2 次提交
  5. 09 12月, 2011 3 次提交
    • M
      procfs: do not overflow get_{idle,iowait}_time for nohz · 2a95ea6c
      Michal Hocko 提交于
      Since commit a25cac51 ("proc: Consider NO_HZ when printing idle and
      iowait times") we are reporting idle/io_wait time also while a CPU is
      tickless.  We rely on get_{idle,iowait}_time functions to retrieve
      proper data.
      
      These functions, however, use usecs_to_cputime to translate micro
      seconds time to cputime64_t.  This is just an alias to usecs_to_jiffies
      which reduces the data type from u64 to unsigned int and also checks
      whether the given parameter overflows jiffies_to_usecs(MAX_JIFFY_OFFSET)
      and returns MAX_JIFFY_OFFSET in that case.
      
      When we overflow depends on CONFIG_HZ but especially for CONFIG_HZ_300
      it is quite low (1431649781) so we are getting MAX_JIFFY_OFFSET for
      >3000s! until we overflow unsigned int.  Just for reference
      CONFIG_HZ_100 has an overflow window around 20s, CONFIG_HZ_250 ~8s and
      CONFIG_HZ_1000 ~2s.
      
      This results in a bug when people saw [h]top going mad reporting 100%
      CPU usage even though there was basically no CPU load.  The reason was
      simply that /proc/stat stopped reporting idle/io_wait changes (and
      reported MAX_JIFFY_OFFSET) and so the only change happening was for user
      system time.
      
      Let's use nsecs_to_jiffies64 instead which doesn't reduce the precision
      to 32b type and it is much more appropriate for cumulative time values
      (unlike usecs_to_jiffies which intended for timeout calculations).
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Tested-by: NArtem S. Tashkinov <t.artem@mailcity.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2a95ea6c
    • C
      fs/proc/meminfo.c: fix compilation error · b53fc7c2
      Claudio Scordino 提交于
      Fix the error message "directives may not be used inside a macro argument"
      which appears when the kernel is compiled for the cris architecture.
      Signed-off-by: NClaudio Scordino <claudio@evidence.eu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b53fc7c2
    • A
      procfs: fix a vfsmount longterm reference leak · 905ad269
      Al Viro 提交于
      kern_mount() doesn't pair with plain mntput()...
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      905ad269
  6. 06 12月, 2011 1 次提交
  7. 10 11月, 2011 1 次提交
  8. 03 11月, 2011 3 次提交
  9. 02 11月, 2011 2 次提交
  10. 01 11月, 2011 4 次提交
  11. 22 9月, 2011 3 次提交
  12. 08 9月, 2011 1 次提交
    • M
      proc: Consider NO_HZ when printing idle and iowait times · a25cac51
      Michal Hocko 提交于
      show_stat handler of the /proc/stat file relies on kstat_cpu(cpu)
      statistics when priting information about idle and iowait times.
      This is OK if we are not using tickless kernel (CONFIG_NO_HZ) because
      counters are updated periodically.
      With NO_HZ things got more tricky because we are not doing idle/iowait
      accounting while we are tickless so the value might get outdated.
      Users of /proc/stat will notice that by unchanged idle/iowait values
      which is then interpreted as 0% idle/iowait time. From the user space
      POV this is an unexpected behavior and a change of the interface.
      
      Let's fix this by using get_cpu_{idle,iowait}_time_us which accounts the
      total idle/iowait time since boot and it doesn't rely on sampling or any
      other periodic activity. Fall back to the previous behavior if NO_HZ is
      disabled or not configured.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Link: http://lkml.kernel.org/r/39181366adac1b39cb6aa3cd53ff0f7c78d32676.1314172057.git.mhocko@suse.czSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      a25cac51
  13. 07 8月, 2011 2 次提交
    • L
      vfs: show O_CLOEXE bit properly in /proc/<pid>/fdinfo/<fd> files · 1117f72e
      Linus Torvalds 提交于
      The CLOEXE bit is magical, and for performance (and semantic) reasons we
      don't actually maintain it in the file descriptor itself, but in a
      separate bit array.  Which means that when we show f_flags, the CLOEXE
      status is shown incorrectly: we show the status not as it is now, but as
      it was when the file was opened.
      
      Fix that by looking up the bit properly in the 'fdt->close_on_exec' bit
      array.
      
      Uli needs this in order to re-implement the pfiles program:
      
        "For normal file descriptors (not sockets) this was the last piece of
         information which wasn't available.  This is all part of my 'give
         Solaris users no reason to not switch' effort.  I intend to offer the
         code to the util-linux-ng maintainers."
      Requested-by: NUlrich Drepper <drepper@akkadia.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1117f72e
    • L
      oom_ajd: don't use WARN_ONCE, just use printk_once · c2142704
      Linus Torvalds 提交于
      WARN_ONCE() is very annoying, in that it shows the stack trace that we
      don't care about at all, and also triggers various user-level "kernel
      oopsed" logic that we really don't care about.  And it's not like the
      user can do anything about the applications (sshd) in question, it's a
      distro issue.
      
      Requested-by: Andi Kleen <andi@firstfloor.org> (and many others)
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c2142704
  14. 28 7月, 2011 1 次提交
    • D
      proc: make struct proc_dir_entry::name a terminal array rather than a pointer · 09570f91
      David Howells 提交于
      Since __proc_create() appends the name it is given to the end of the PDE
      structure that it allocates, there isn't a need to store a name pointer.
      Instead we can just replace the name pointer with a terminal char array of
      _unspecified_ length.  The compiler will simply append the string to statically
      defined variables of PDE type overlapping any hole at the end of the structure
      and, unlike specifying an explicitly _zero_ length array, won't give a warning
      if you try to statically initialise it with a string of more than zero length.
      
      Also, whilst we're at it:
      
       (1) Move namelen to end just prior to name and reduce it to a single byte
           (name shouldn't be longer than NAME_MAX).
      
       (2) Move pde_unload_lock two places further on so that if it's four bytes in
           size on a 64-bit machine, it won't cause an unused hole in the PDE struct.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      09570f91
  15. 27 7月, 2011 3 次提交
  16. 26 7月, 2011 1 次提交
    • D
      oom: make deprecated use of oom_adj more verbose · be8f684d
      David Rientjes 提交于
      /proc/pid/oom_adj is deprecated and scheduled for removal in August 2012
      according to Documentation/feature-removal-schedule.txt.
      
      This patch makes the warning more verbose by making it appear as a more
      serious problem (the presence of a stack trace and being multiline should
      attract more attention) so that applications still using the old interface
      can get fixed.
      
      Very popular users of the old interface have been converted since the oom
      killer rewrite has been introduced.  udevd switched to the
      /proc/pid/oom_score_adj interface for v162, kde switched in 4.6.1, and
      opensshd switched in 5.7p1.
      
      At the start of 2012, this should be changed into a WARN() to emit all
      such incidents and then finally remove the tunable in August 2012 as
      scheduled.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      be8f684d
  17. 21 7月, 2011 1 次提交
  18. 20 7月, 2011 4 次提交