1. 13 1月, 2012 31 次提交
  2. 11 1月, 2012 9 次提交
    • T
      workqueue: make alloc_workqueue() take printf fmt and args for name · b196be89
      Tejun Heo 提交于
      alloc_workqueue() currently expects the passed in @name pointer to remain
      accessible.  This is inconvenient and a bit silly given that the whole wq
      is being dynamically allocated.  This patch updates alloc_workqueue() and
      friends to take printf format string instead of opaque string and matching
      varargs at the end.  The name is allocated together with the wq and
      formatted.
      
      alloc_ordered_workqueue() is converted to a macro to unify varargs
      handling with alloc_workqueue(), and, while at it, add comment to
      alloc_workqueue().
      
      None of the current in-kernel users pass in string with '%' as constant
      name and this change shouldn't cause any problem.
      
      [akpm@linux-foundation.org: use __printf]
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Suggested-by: NChristoph Hellwig <hch@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b196be89
    • V
      procfs: add hidepid= and gid= mount options · 0499680a
      Vasiliy Kulikov 提交于
      Add support for mount options to restrict access to /proc/PID/
      directories.  The default backward-compatible "relaxed" behaviour is left
      untouched.
      
      The first mount option is called "hidepid" and its value defines how much
      info about processes we want to be available for non-owners:
      
      hidepid=0 (default) means the old behavior - anybody may read all
      world-readable /proc/PID/* files.
      
      hidepid=1 means users may not access any /proc/<pid>/ directories, but
      their own.  Sensitive files like cmdline, sched*, status are now protected
      against other users.  As permission checking done in proc_pid_permission()
      and files' permissions are left untouched, programs expecting specific
      files' modes are not confused.
      
      hidepid=2 means hidepid=1 plus all /proc/PID/ will be invisible to other
      users.  It doesn't mean that it hides whether a process exists (it can be
      learned by other means, e.g.  by kill -0 $PID), but it hides process' euid
      and egid.  It compicates intruder's task of gathering info about running
      processes, whether some daemon runs with elevated privileges, whether
      another user runs some sensitive program, whether other users run any
      program at all, etc.
      
      gid=XXX defines a group that will be able to gather all processes' info
      (as in hidepid=0 mode).  This group should be used instead of putting
      nonroot user in sudoers file or something.  However, untrusted users (like
      daemons, etc.) which are not supposed to monitor the tasks in the whole
      system should not be added to the group.
      
      hidepid=1 or higher is designed to restrict access to procfs files, which
      might reveal some sensitive private information like precise keystrokes
      timings:
      
      http://www.openwall.com/lists/oss-security/2011/11/05/3
      
      hidepid=1/2 doesn't break monitoring userspace tools.  ps, top, pgrep, and
      conky gracefully handle EPERM/ENOENT and behave as if the current user is
      the only user running processes.  pstree shows the process subtree which
      contains "pstree" process.
      
      Note: the patch doesn't deal with setuid/setgid issues of keeping
      preopened descriptors of procfs files (like
      https://lkml.org/lkml/2011/2/7/368).  We rely on that the leaked
      information like the scheduling counters of setuid apps doesn't threaten
      anybody's privacy - only the user started the setuid program may read the
      counters.
      Signed-off-by: NVasiliy Kulikov <segoon@openwall.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Randy Dunlap <rdunlap@xenotime.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Theodore Tso <tytso@MIT.EDU>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: James Morris <jmorris@namei.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0499680a
    • P
      procfs: introduce the /proc/<pid>/map_files/ directory · 640708a2
      Pavel Emelyanov 提交于
      This one behaves similarly to the /proc/<pid>/fd/ one - it contains
      symlinks one for each mapping with file, the name of a symlink is
      "vma->vm_start-vma->vm_end", the target is the file.  Opening a symlink
      results in a file that point exactly to the same inode as them vma's one.
      
      For example the ls -l of some arbitrary /proc/<pid>/map_files/
      
       | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
       | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
       | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
       | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
       | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so
      
      This *helps* checkpointing process in three ways:
      
      1. When dumping a task mappings we do know exact file that is mapped
         by particular region.  We do this by opening
         /proc/$pid/map_files/$address symlink the way we do with file
         descriptors.
      
      2. This also helps in determining which anonymous shared mappings are
         shared with each other by comparing the inodes of them.
      
      3. When restoring a set of processes in case two of them has a mapping
         shared, we map the memory by the 1st one and then open its
         /proc/$pid/map_files/$address file and map it by the 2nd task.
      
      Using /proc/$pid/maps for this is quite inconvenient since it brings
      repeatable re-reading and reparsing for this text file which slows down
      restore procedure significantly.  Also as being pointed in (3) it is a way
      easier to use top level shared mapping in children as
      /proc/$pid/map_files/$address when needed.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [gorcunov@openvz.org: make map_files depend on CHECKPOINT_RESTORE]
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Reviewed-by: NVasiliy Kulikov <segoon@openwall.com>
      Reviewed-by: N"Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      640708a2
    • C
      procfs: make proc_get_link to use dentry instead of inode · 7773fbc5
      Cyrill Gorcunov 提交于
      Prepare the ground for the next "map_files" patch which needs a name of a
      link file to analyse.
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7773fbc5
    • M
      signal: add block_sigmask() for adding sigmask to current->blocked · 5e6292c0
      Matt Fleming 提交于
      Abstract the code sequence for adding a signal handler's sa_mask to
      current->blocked because the sequence is identical for all architectures.
      Furthermore, in the past some architectures actually got this code wrong,
      so introduce a wrapper that all architectures can use.
      Signed-off-by: NMatt Fleming <matt.fleming@intel.com>
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5e6292c0
    • N
      leds: add driver for TCA6507 LED controller · a6d511e5
      NeilBrown 提交于
      TI's TCA6507 is the LED driver in the GTA04 Openmoko motherboard.  The
      driver provides full support for brightness levels and hardware blinking.
      
      This driver can drive each of 7 outputs as an LED or a GPIO output,
      and provides hardware-assist blinking.
      
      [akpm@linux-foundation.org: fix __mod_i2c_device_table alias]
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Cc: Richard Purdie <rpurdie@rpsys.net>
      Cc: Randy Dunlap <rdunlap@xenotime.net>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a6d511e5
    • K
      mm/mempolicy.c: mpol_equal(): use bool · fcfb4dcc
      KOSAKI Motohiro 提交于
      mpol_equal() logically returns a boolean.  Use a bool type to slightly
      improve readability.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Stephen Wilson <wilsons@start.ca>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fcfb4dcc
    • K
      tracepoint: add tracepoints for debugging oom_score_adj · 43d2b113
      KAMEZAWA Hiroyuki 提交于
      oom_score_adj is used for guarding processes from OOM-Killer.  One of
      problem is that it's inherited at fork().  When a daemon set oom_score_adj
      and make children, it's hard to know where the value is set.
      
      This patch adds some tracepoints useful for debugging. This patch adds
      3 trace points.
        - creating new task
        - renaming a task (exec)
        - set oom_score_adj
      
      To debug, users need to enable some trace pointer. Maybe filtering is useful as
      
      # EVENT=/sys/kernel/debug/tracing/events/task/
      # echo "oom_score_adj != 0" > $EVENT/task_newtask/filter
      # echo "oom_score_adj != 0" > $EVENT/task_rename/filter
      # echo 1 > $EVENT/enable
      # EVENT=/sys/kernel/debug/tracing/events/oom/
      # echo 1 > $EVENT/enable
      
      output will be like this.
      # grep oom /sys/kernel/debug/tracing/trace
      bash-7699  [007] d..3  5140.744510: oom_score_adj_update: pid=7699 comm=bash oom_score_adj=-1000
      bash-7699  [007] ...1  5151.818022: task_newtask: pid=7729 comm=bash clone_flags=1200011 oom_score_adj=-1000
      ls-7729  [003] ...2  5151.818504: task_rename: pid=7729 oldcomm=bash newcomm=ls oom_score_adj=-1000
      bash-7699  [002] ...1  5175.701468: task_newtask: pid=7730 comm=bash clone_flags=1200011 oom_score_adj=-1000
      grep-7730  [007] ...2  5175.701993: task_rename: pid=7730 oldcomm=bash newcomm=grep oom_score_adj=-1000
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      43d2b113
    • A
      mremap: enforce rmap src/dst vma ordering in case of vma_merge() succeeding in copy_vma() · 948f017b
      Andrea Arcangeli 提交于
      migrate was doing an rmap_walk with speculative lock-less access on
      pagetables.  That could lead it to not serializing properly against mremap
      PT locks.  But a second problem remains in the order of vmas in the
      same_anon_vma list used by the rmap_walk.
      
      If vma_merge succeeds in copy_vma, the src vma could be placed after the
      dst vma in the same_anon_vma list.  That could still lead to migrate
      missing some pte.
      
      This patch adds an anon_vma_moveto_tail() function to force the dst vma at
      the end of the list before mremap starts to solve the problem.
      
      If the mremap is very large and there are a lots of parents or childs
      sharing the anon_vma root lock, this should still scale better than taking
      the anon_vma root lock around every pte copy practically for the whole
      duration of mremap.
      
      Update: Hugh noticed special care is needed in the error path where
      move_page_tables goes in the reverse direction, a second
      anon_vma_moveto_tail() call is needed in the error path.
      
      This program exercises the anon_vma_moveto_tail:
      
      ===
      
      int main()
      {
      	static struct timeval oldstamp, newstamp;
      	long diffsec;
      	char *p, *p2, *p3, *p4;
      	if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
      		perror("memalign"), exit(1);
      	if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
      		perror("memalign"), exit(1);
      	if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
      		perror("memalign"), exit(1);
      
      	memset(p, 0xff, SIZE);
      	printf("%p\n", p);
      	memset(p2, 0xff, SIZE);
      	memset(p3, 0x77, 4096);
      	if (memcmp(p, p2, SIZE))
      		printf("error\n");
      	p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
      	if (p4 != p3)
      		perror("mremap"), exit(1);
      	p4 = mremap(p4, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
      	if (p4 != p+SIZE/2)
      		perror("mremap"), exit(1);
      	if (memcmp(p, p2, SIZE))
      		printf("error\n");
      	printf("ok\n");
      
      	return 0;
      }
      ===
      
      $ perf probe -a anon_vma_moveto_tail
      Add new event:
        probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)
      
      You can now use it on all perf tools, such as:
      
              perf record -e probe:anon_vma_moveto_tail -aR sleep 1
      
      $ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
      0x7f2ca2800000
      ok
      [ perf record: Woken up 1 times to write data ]
      [ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
      $ perf report --stdio
         100.00%  anon_vma_moveto  [kernel.kallsyms]  [k] anon_vma_moveto_tail
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: NNai Xia <nai.xia@gmail.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pawel Sikora <pluto@agmk.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      948f017b