1. 08 Apr, 2012 (1 commit)
  2. 29 Mar, 2012 (1 commit)
    • D
      pidns: add reboot_pid_ns() to handle the reboot syscall · cf3f8921
       Daniel Lezcano authored
       In the case of a child pid namespace, rebooting the system does not really
       make sense.  When the pid namespace is used in conjunction with the other
       namespaces in order to create a linux container, the reboot syscall leads
       to some problems.
      
       A container can reboot the host.  That can be fixed by dropping the
       sys_reboot capability, but then we are unable to correctly
       poweroff/halt/reboot a container, and the container stays stuck at
       shutdown time with the container's init process waiting indefinitely.
      
       After several attempts, no solution from userspace was found to reliably
       handle the shutdown of a container.
      
       This patch proposes to make the init process of the child pid namespace
       exit with a signal status set to: SIGINT if the child pid namespace
       called "halt/poweroff" and SIGHUP if the child pid namespace called
       "reboot".  When the reboot syscall is called and we are not in the initial
       pid namespace, we kill the pid namespace for "HALT", "POWEROFF",
       "RESTART", and "RESTART2".  Otherwise we return EINVAL.
      
      Returning EINVAL is also an easy way to check if this feature is supported
      by the kernel when invoking another 'reboot' option like CAD.
      
       This way the parent process of the child pid namespace knows whether the
       child rebooted or not and can take the right decision.
      
      Test case:
      ==========
      
       #define _GNU_SOURCE     /* needed for clone() from <sched.h> */
       #include <alloca.h>
      #include <stdio.h>
      #include <sched.h>
      #include <unistd.h>
      #include <signal.h>
      #include <sys/reboot.h>
      #include <sys/types.h>
      #include <sys/wait.h>
      
      #include <linux/reboot.h>
      
       static int do_reboot(void *arg)
       {
               int *cmd = arg;

               /* On success reboot() does not return here: inside a child pid
                * namespace this init process is killed instead, so reaching the
                * printf means the syscall failed. */
               if (reboot(*cmd))
                       printf("failed to reboot(%d): %m\n", *cmd);

               return -1;
       }
      
      int test_reboot(int cmd, int sig)
      {
              long stack_size = 4096;
              void *stack = alloca(stack_size) + stack_size;
              int status;
              pid_t ret;
      
              ret = clone(do_reboot, stack, CLONE_NEWPID | SIGCHLD, &cmd);
              if (ret < 0) {
                      printf("failed to clone: %m\n");
                      return -1;
              }
      
              if (wait(&status) < 0) {
                      printf("unexpected wait error: %m\n");
                      return -1;
              }
      
              if (!WIFSIGNALED(status)) {
                      printf("child process exited but was not signaled\n");
                      return -1;
              }
      
              if (WTERMSIG(status) != sig) {
                      printf("signal termination is not the one expected\n");
                      return -1;
              }
      
              return 0;
      }
      
      int main(int argc, char *argv[])
      {
              int status;
      
              status = test_reboot(LINUX_REBOOT_CMD_RESTART, SIGHUP);
              if (status < 0)
                      return 1;
              printf("reboot(LINUX_REBOOT_CMD_RESTART) succeed\n");
      
              status = test_reboot(LINUX_REBOOT_CMD_RESTART2, SIGHUP);
              if (status < 0)
                      return 1;
              printf("reboot(LINUX_REBOOT_CMD_RESTART2) succeed\n");
      
              status = test_reboot(LINUX_REBOOT_CMD_HALT, SIGINT);
              if (status < 0)
                      return 1;
              printf("reboot(LINUX_REBOOT_CMD_HALT) succeed\n");
      
              status = test_reboot(LINUX_REBOOT_CMD_POWER_OFF, SIGINT);
              if (status < 0)
                      return 1;
              printf("reboot(LINUX_REBOOT_CMD_POWERR_OFF) succeed\n");
      
              status = test_reboot(LINUX_REBOOT_CMD_CAD_ON, -1);
              if (status >= 0) {
                      printf("reboot(LINUX_REBOOT_CMD_CAD_ON) should have failed\n");
                      return 1;
              }
              printf("reboot(LINUX_REBOOT_CMD_CAD_ON) has failed as expected\n");
      
              return 0;
      }
      
      [akpm@linux-foundation.org: tweak and add comments]
      [akpm@linux-foundation.org: checkpatch fixes]
       Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
       Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
       Tested-by: Serge Hallyn <serge.hallyn@canonical.com>
       Reviewed-by: Oleg Nesterov <oleg@redhat.com>
       Cc: Michael Kerrisk <mtk.manpages@gmail.com>
       Cc: "Eric W. Biederman" <ebiederm@xmission.com>
       Cc: Tejun Heo <tj@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cf3f8921
  3. 24 Mar, 2012 (1 commit)
    • L
      prctl: add PR_{SET,GET}_CHILD_SUBREAPER to allow simple process supervision · ebec18a6
       Lennart Poettering authored
      Userspace service managers/supervisors need to track their started
      services.  Many services daemonize by double-forking and get implicitly
      re-parented to PID 1.  The service manager will no longer be able to
      receive the SIGCHLD signals for them, and is no longer in charge of
      reaping the children with wait().  All information about the children is
      lost at the moment PID 1 cleans up the re-parented processes.
      
      With this prctl, a service manager process can mark itself as a sort of
      'sub-init', able to stay as the parent for all orphaned processes
      created by the started services.  All SIGCHLD signals will be delivered
      to the service manager.
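
       A minimal userspace sketch of what this enables (a hedged illustration,
       not part of the patch; it assumes PR_SET_CHILD_SUBREAPER has the value 36
       as defined in <linux/prctl.h>, with a fallback for older headers):

       #include <stdio.h>
       #include <sys/prctl.h>
       #include <sys/wait.h>

       #ifndef PR_SET_CHILD_SUBREAPER
       #define PR_SET_CHILD_SUBREAPER 36   /* assumption: value from <linux/prctl.h> */
       #endif

       int main(void)
       {
               /* Mark this service manager as a "sub-init": orphaned descendants
                * of its services are re-parented to it instead of PID 1. */
               if (prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) < 0) {
                       perror("prctl(PR_SET_CHILD_SUBREAPER)");
                       return 1;
               }

               /* ... start services here; even if they double-fork, we keep
                * receiving SIGCHLD for their orphans and can reap them with wait(). */
               return 0;
       }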
      
       For a service manager, receiving SIGCHLD and doing wait() is much
       preferred over any possible asynchronous notification about specific
       PIDs, because the service manager has full access to the child process
       data in /proc and the PID cannot be reused until the wait(), which the
       service manager itself is in charge of, has happened.
      
      As a side effect, the relevant parent PID information does not get lost
      by a double-fork, which results in a more elaborate process tree and
      'ps' output:
      
      before:
        # ps afx
        253 ?        Ss     0:00 /bin/dbus-daemon --system --nofork
        294 ?        Sl     0:00 /usr/libexec/polkit-1/polkitd
        328 ?        S      0:00 /usr/sbin/modem-manager
        608 ?        Sl     0:00 /usr/libexec/colord
        658 ?        Sl     0:00 /usr/libexec/upowerd
        819 ?        Sl     0:00 /usr/libexec/imsettings-daemon
        916 ?        Sl     0:00 /usr/libexec/udisks-daemon
        917 ?        S      0:00  \_ udisks-daemon: not polling any devices
      
      after:
        # ps afx
        294 ?        Ss     0:00 /bin/dbus-daemon --system --nofork
        426 ?        Sl     0:00  \_ /usr/libexec/polkit-1/polkitd
        449 ?        S      0:00  \_ /usr/sbin/modem-manager
        635 ?        Sl     0:00  \_ /usr/libexec/colord
        705 ?        Sl     0:00  \_ /usr/libexec/upowerd
        959 ?        Sl     0:00  \_ /usr/libexec/udisks-daemon
        960 ?        S      0:00  |   \_ udisks-daemon: not polling any devices
        977 ?        Sl     0:00  \_ /usr/libexec/packagekitd
      
      This prctl is orthogonal to PID namespaces.  PID namespaces are isolated
      from each other, while a service management process usually requires the
      services to live in the same namespace, to be able to talk to each
      other.
      
      Users of this will be the systemd per-user instance, which provides
      init-like functionality for the user's login session and D-Bus, which
      activates bus services on-demand.  Both need init-like capabilities to
      be able to properly keep track of the services they start.
      
      Many thanks to Oleg for several rounds of review and insights.
      
      [akpm@linux-foundation.org: fix comment layout and spelling]
      [akpm@linux-foundation.org: add lengthy code comment from Oleg]
       Reviewed-by: Oleg Nesterov <oleg@redhat.com>
       Signed-off-by: Lennart Poettering <lennart@poettering.net>
       Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
       Acked-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ebec18a6
  4. 16 Mar, 2012 (1 commit)
  5. 13 Jan, 2012 (1 commit)
    • C
      c/r: prctl: add PR_SET_MM codes to set up mm_struct entries · 028ee4be
       Cyrill Gorcunov authored
       When we restore a task we need to set up text, data and data heap sizes
       from userspace to the values a task had at checkpoint time.  This patch
       adds auxiliary prctl codes for that.
      
       While most of them have a statistical nature (their values are involved
       in the calculation of /proc/<pid>/statm output), the start_brk and brk
       values are used to compute the allowed size of program data segment
       expansion, which means that arbitrary changes of these values can be a
       dangerous operation.  So to restrict access the following requirements
       apply to the prctl calls:
      
       - The process has to have CAP_SYS_ADMIN capability granted.
       - For all opcodes except start_brk/brk members an appropriate
         VMA area must exist and should fit certain VMA flags,
         such as:
         - code segment must be executable but not writable;
         - data segment must not be executable.
      
      start_brk/brk values must not intersect with data segment and must not
      exceed RLIMIT_DATA resource limit.
      
      Still the main guard is CAP_SYS_ADMIN capability check.
      
       Note that the kernel should be compiled with CONFIG_CHECKPOINT_RESTORE
       support, otherwise these prctl calls will return -EINVAL.
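
       A hedged userspace sketch of how a restore tool might drive these codes
       (the helper name is illustrative; the numeric fallbacks are assumptions
       matching <linux/prctl.h>, and the addresses would come from the
       checkpoint image):

       #include <stdio.h>
       #include <sys/prctl.h>

       #ifndef PR_SET_MM
       #define PR_SET_MM               35
       #define PR_SET_MM_START_BRK     6
       #define PR_SET_MM_BRK           7
       #endif

       /* Needs CAP_SYS_ADMIN and a kernel with CONFIG_CHECKPOINT_RESTORE=y. */
       static int restore_brk(unsigned long start_brk, unsigned long brk)
       {
               if (prctl(PR_SET_MM, PR_SET_MM_START_BRK, start_brk, 0, 0) < 0 ||
                   prctl(PR_SET_MM, PR_SET_MM_BRK, brk, 0, 0) < 0) {
                       perror("prctl(PR_SET_MM)");
                       return -1;
               }
               return 0;
       }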
      
      [akpm@linux-foundation.org: cache current->mm in a local, saving 200 bytes text]
       Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
       Reviewed-by: Kees Cook <keescook@chromium.org>
       Cc: Tejun Heo <tj@kernel.org>
       Cc: Andrew Vagin <avagin@openvz.org>
       Cc: Serge Hallyn <serge.hallyn@canonical.com>
       Cc: Pavel Emelyanov <xemul@parallels.com>
       Cc: Vasiliy Kulikov <segoon@openwall.com>
       Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
       Cc: Michael Kerrisk <mtk.manpages@gmail.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      028ee4be
  6. 15 Dec, 2011 (1 commit)
  7. 03 Nov, 2011 (1 commit)
  8. 31 Oct, 2011 (2 commits)
    • P
       kernel: fix several implicit usages of kmod.h · 74da1ff7
       Paul Gortmaker authored
      These files were implicitly relying on <linux/kmod.h> coming in via
      module.h, as without it we get things like:
      
      kernel/power/suspend.c:100: error: implicit declaration of function ‘usermodehelper_disable’
      kernel/power/suspend.c:109: error: implicit declaration of function ‘usermodehelper_enable’
      kernel/power/user.c:254: error: implicit declaration of function ‘usermodehelper_disable’
      kernel/power/user.c:261: error: implicit declaration of function ‘usermodehelper_enable’
      
      kernel/sys.c:317: error: implicit declaration of function ‘usermodehelper_disable’
      kernel/sys.c:1816: error: implicit declaration of function ‘call_usermodehelper_setup’
      kernel/sys.c:1822: error: implicit declaration of function ‘call_usermodehelper_setfns’
      kernel/sys.c:1824: error: implicit declaration of function ‘call_usermodehelper_exec’
       Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      74da1ff7
    • P
      kernel: Map most files to use export.h instead of module.h · 9984de1a
       Paul Gortmaker authored
      The changed files were only including linux/module.h for the
      EXPORT_SYMBOL infrastructure, and nothing else.  Revector them
      onto the isolated export header for faster compile times.
      
      Nothing to see here but a whole lot of instances of:
      
        -#include <linux/module.h>
        +#include <linux/export.h>
      
      This commit is only changing the kernel dir; next targets
      will probably be mm, fs, the arch dirs, etc.
       Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      9984de1a
  9. 17 Oct, 2011 (1 commit)
    • L
      Avoid using variable-length arrays in kernel/sys.c · a84a79e4
       Linus Torvalds authored
      The size is always valid, but variable-length arrays generate worse code
      for no good reason (unless the function happens to be inlined and the
      compiler sees the length for the simple constant it is).
      
      Also, there seems to be some code generation problem on POWER, where
      Henrik Bakken reports that register r28 can get corrupted under some
      subtle circumstances (interrupt happening at the wrong time?).  That all
      indicates some seriously broken compiler issues, but since variable
      length arrays are bad regardless, there's little point in trying to
      chase it down.
      
      "Just don't do that, then".
       Reported-by: Henrik Grindal Bakken <henribak@cisco.com>
       Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
       Cc: stable@kernel.org
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a84a79e4
  10. 29 Sep, 2011 (1 commit)
    • V
      connector: add comm change event report to proc connector · f786ecba
       Vladimir Zapolskiy authored
       Add an event to monitor comm value changes of tasks.  Such an event
       becomes vital if someone desires to control the threads of a process,
       each in a different manner.
      
       A natural characteristic of threads is their comm value, and helpfully
       application developers have an opportunity to change it at runtime.
       Reporting such events via the proc connector allows for finer-grained
       monitoring and control; for instance a process control daemon listening
       to the proc connector and following comm value policies can place
       specific threads into assigned cgroup partitions.
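
       For reference, a thread changes its own comm value with
       prctl(PR_SET_NAME); a minimal sketch of the kind of rename such a daemon
       would then see as a proc connector comm event (the thread name is
       illustrative):

       #include <stdio.h>
       #include <sys/prctl.h>

       int main(void)
       {
               /* comm is limited to 16 bytes including the trailing NUL */
               if (prctl(PR_SET_NAME, "worker-ingest", 0, 0, 0) < 0) {
                       perror("prctl(PR_SET_NAME)");
                       return 1;
               }
               return 0;
       }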
      
       It might be possible to achieve a pale, partial, one-shot likeness of
       this without the update: an application changes the comm value of a
       thread generator task beforehand, then a new thread is cloned, and after
       that the proc connector listener gets the fork event and reads the new
       thread's comm value from the procfs stat file.  But this change visibly
       simplifies and extends the matter.
       Signed-off-by: Vladimir Zapolskiy <vzapolskiy@gmail.com>
       Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
       Cc: David Miller <davem@davemloft.net>
       Signed-off-by: Andrew Morton <akpm@google.com>
       Signed-off-by: David S. Miller <davem@davemloft.net>
      f786ecba
  11. 26 Aug, 2011 (1 commit)
    • A
      Add a personality to report 2.6.x version numbers · be27425d
       Andi Kleen authored
      I ran into a couple of programs which broke with the new Linux 3.0
      version.  Some of those were binary only.  I tried to use LD_PRELOAD to
      work around it, but it was quite difficult and in one case impossible
      because of a mix of 32bit and 64bit executables.
      
       For example, all kinds of management software from HP don't work, unless
       we pretend to run a 2.6 kernel.
      
        $ uname -a
        Linux svivoipvnx001 3.0.0-08107-g97cd98f #1062 SMP Fri Aug 12 18:11:45 CEST 2011 i686 i686 i386 GNU/Linux
      
        $ hpacucli ctrl all show
      
        Error: No controllers detected.
      
        $ rpm -qf /usr/sbin/hpacucli
        hpacucli-8.75-12.0
      
      Another notable case is that Python now reports "linux3" from
      sys.platform(); which in turn can break things that were checking
      sys.platform() == "linux2":
      
        https://bugzilla.mozilla.org/show_bug.cgi?id=664564
      
       It seems pretty clear to me that it's a bug in the apps that are using
       '==' instead of .startswith(), but this allows us to unbreak broken
       programs.
      
      This patch adds a UNAME26 personality that makes the kernel report a
      2.6.40+x version number instead.  The x is the x in 3.x.
      
      I know this is somewhat ugly, but I didn't find a better workaround, and
      compatibility to existing programs is important.
      
       Some programs also read /proc/sys/kernel/osrelease.  This can be worked
       around in user space with mount --bind (and a mount namespace).
      
      To use:
      
        wget ftp://ftp.kernel.org/pub/linux/kernel/people/ak/uname26/uname26.c
        gcc -o uname26 uname26.c
        ./uname26 program
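
       If the helper is not at hand, a rough equivalent sketch (UNAME26 is the
       personality flag this patch adds; the 0x0020000 fallback is an assumption
       for headers that do not define it yet):

       #include <stdio.h>
       #include <unistd.h>
       #include <sys/personality.h>

       #ifndef UNAME26
       #define UNAME26 0x0020000
       #endif

       int main(int argc, char *argv[])
       {
               if (argc < 2) {
                       fprintf(stderr, "usage: %s program [args...]\n", argv[0]);
                       return 1;
               }
               /* Report a 2.6.40+x version number to the program and its children. */
               if (personality(PER_LINUX | UNAME26) < 0) {
                       perror("personality");
                       return 1;
               }
               execvp(argv[1], &argv[1]);
               perror("execvp");
               return 1;
       }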
       Signed-off-by: Andi Kleen <ak@linux.intel.com>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      be27425d
  12. 12 Aug, 2011 (1 commit)
    • V
      move RLIMIT_NPROC check from set_user() to do_execve_common() · 72fa5997
       Vasiliy Kulikov authored
      The patch http://lkml.org/lkml/2003/7/13/226 introduced an RLIMIT_NPROC
      check in set_user() to check for NPROC exceeding via setuid() and
      similar functions.
      
       Before the check there was a possibility to greatly exceed the allowed
       number of processes by an unprivileged user if the program relied on
       rlimit only.  But the check created a new security threat: many poorly
       written programs simply don't check the setuid() return code and believe
       it cannot fail if executed with root privileges.  So, the check is
       removed in this patch because of too frequent privilege escalations
       related to buggy programs.
      
       The NPROC limit can still be enforced in the common code flow of daemons
       spawning user processes.  Most daemons do fork()+setuid()+execve().
       The check introduced in execve() (1) enforces the same limit as in
       setuid() and (2) doesn't create similar security issues.
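
       A minimal sketch of that daemon-side pattern with the return codes
       actually checked (uid and program path are placeholders):

       #include <stdio.h>
       #include <stdlib.h>
       #include <sys/types.h>
       #include <unistd.h>

       static void spawn_as_user(uid_t uid, const char *program)
       {
               pid_t pid = fork();

               if (pid == 0) {                 /* child */
                       if (setuid(uid) < 0) {
                               perror("setuid");
                               _exit(1);       /* never exec with root privileges */
                       }
                       /* With this patch, execve() is where an exceeded
                        * RLIMIT_NPROC now makes the spawn fail. */
                       execl(program, program, (char *)NULL);
                       perror("execl");
                       _exit(1);
               }
               /* parent: track and wait() for the child as usual */
       }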
      
      Neil Brown suggested to track what specific process has exceeded the
      limit by setting PF_NPROC_EXCEEDED process flag.  With the change only
      this process would fail on execve(), and other processes' execve()
      behaviour is not changed.
      
       Solar Designer suggested to re-check whether the NPROC limit is still
       exceeded at the moment of execve().  If the process was sleeping for
       days between set*uid() and execve(), and the NPROC counter stepped down
       under the limit, a deferred execve() failure because the NPROC limit was
       exceeded days ago would be unexpected.  If the limit is not exceeded
       anymore, we clear the flag on successful calls to execve() and fork().
      
      The flag is also cleared on successful calls to set_user() as the limit
      was exceeded for the previous user, not the current one.
      
      Similar check was introduced in -ow patches (without the process flag).
      
      v3 - clear PF_NPROC_EXCEEDED on successful calls to set_user().
       Reviewed-by: James Morris <jmorris@namei.org>
       Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
       Acked-by: NeilBrown <neilb@suse.de>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72fa5997
  13. 26 Jul, 2011 (1 commit)
  14. 12 May, 2011 (1 commit)
  15. 07 May, 2011 (1 commit)
  16. 24 Mar, 2011 (2 commits)
  17. 15 Mar, 2011 (1 commit)
    • R
      PM / Core: Introduce struct syscore_ops for core subsystems PM · 40dc166c
       Rafael J. Wysocki authored
      Some subsystems need to carry out suspend/resume and shutdown
      operations with one CPU on-line and interrupts disabled.  The only
      way to register such operations is to define a sysdev class and
      a sysdev specifically for this purpose which is cumbersome and
      inefficient.  Moreover, the arguments taken by sysdev suspend,
      resume and shutdown callbacks are practically never necessary.
      
       For this reason, introduce a simpler interface allowing subsystems
       to register operations to be executed very late during system suspend
       and shutdown and very early during resume in the form of
       struct syscore_ops objects.
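
       A hedged kernel-side sketch of the resulting interface (the subsystem
       and callback names are illustrative):

       #include <linux/init.h>
       #include <linux/syscore_ops.h>

       static int my_subsys_suspend(void)
       {
               /* runs very late, with one CPU online and interrupts disabled */
               return 0;
       }

       static void my_subsys_resume(void)
       {
               /* runs very early during resume, under the same constraints */
       }

       static struct syscore_ops my_subsys_syscore_ops = {
               .suspend = my_subsys_suspend,
               .resume  = my_subsys_resume,
       };

       static int __init my_subsys_init(void)
       {
               register_syscore_ops(&my_subsys_syscore_ops);
               return 0;
       }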
       Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
       Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
      40dc166c
  18. 31 Jan, 2011 (1 commit)
  19. 14 Jan, 2011 (1 commit)
  20. 30 Nov, 2010 (1 commit)
    • M
      sched: Add 'autogroup' scheduling feature: automated per session task groups · 5091faa4
       Mike Galbraith authored
      A recurring complaint from CFS users is that parallel kbuild has
      a negative impact on desktop interactivity.  This patch
      implements an idea from Linus, to automatically create task
      groups.  Currently, only per session autogroups are implemented,
      but the patch leaves the way open for enhancement.
      
      Implementation: each task's signal struct contains an inherited
      pointer to a refcounted autogroup struct containing a task group
      pointer, the default for all tasks pointing to the
      init_task_group.  When a task calls setsid(), a new task group
      is created, the process is moved into the new task group, and a
       reference to the previous task group is dropped.  Child
       processes inherit this task group thereafter, and increase its
      refcount.  When the last thread of a process exits, the
      process's reference is dropped, such that when the last process
      referencing an autogroup exits, the autogroup is destroyed.
      
      At runqueue selection time, IFF a task has no cgroup assignment,
      its current autogroup is used.
      
       Autogroup bandwidth is controllable via setting its nice level
      through the proc filesystem:
      
        cat /proc/<pid>/autogroup
      
      Displays the task's group and the group's nice level.
      
        echo <nice level> > /proc/<pid>/autogroup
      
       Sets the task group's shares to the weight of a nice <level> task.
       Setting the nice level is rate limited for !admin users due to the
       abuse risk of task group locking.
      
      The feature is enabled from boot by default if
      CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via
      the boot option noautogroup, and can also be turned on/off on
      the fly via:
      
        echo [01] > /proc/sys/kernel/sched_autogroup_enabled
      
      ... which will automatically move tasks to/from the root task group.
       Signed-off-by: Mike Galbraith <efault@gmx.de>
       Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
       Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
       Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
       Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
       Cc: Paul Turner <pjt@google.com>
       Cc: Oleg Nesterov <oleg@redhat.com>
       [ Removed the task_group_path() debug code, and fixed !EVENTFD build failure. ]
       Signed-off-by: Ingo Molnar <mingo@elte.hu>
       LKML-Reference: <1290281700.28711.9.camel@maggy.simson.net>
       Signed-off-by: Ingo Molnar <mingo@elte.hu>
      5091faa4
  21. 01 Sep, 2010 (1 commit)
    • P
      pid: make setpgid() system call use RCU read-side critical section · 950eaaca
       Paul E. McKenney authored
      [   23.584719]
      [   23.584720] ===================================================
      [   23.585059] [ INFO: suspicious rcu_dereference_check() usage. ]
      [   23.585176] ---------------------------------------------------
      [   23.585176] kernel/pid.c:419 invoked rcu_dereference_check() without protection!
      [   23.585176]
      [   23.585176] other info that might help us debug this:
      [   23.585176]
      [   23.585176]
      [   23.585176] rcu_scheduler_active = 1, debug_locks = 1
      [   23.585176] 1 lock held by rc.sysinit/728:
      [   23.585176]  #0:  (tasklist_lock){.+.+..}, at: [<ffffffff8104771f>] sys_setpgid+0x5f/0x193
      [   23.585176]
      [   23.585176] stack backtrace:
      [   23.585176] Pid: 728, comm: rc.sysinit Not tainted 2.6.36-rc2 #2
      [   23.585176] Call Trace:
      [   23.585176]  [<ffffffff8105b436>] lockdep_rcu_dereference+0x99/0xa2
      [   23.585176]  [<ffffffff8104c324>] find_task_by_pid_ns+0x50/0x6a
      [   23.585176]  [<ffffffff8104c35b>] find_task_by_vpid+0x1d/0x1f
      [   23.585176]  [<ffffffff81047727>] sys_setpgid+0x67/0x193
      [   23.585176]  [<ffffffff810029eb>] system_call_fastpath+0x16/0x1b
      [   24.959669] type=1400 audit(1282938522.956:4): avc:  denied  { module_request } for  pid=766 comm="hwclock" kmod="char-major-10-135" scontext=system_u:system_r:hwclock_t:s0 tcontext=system_u:system_r:kernel_t:s0 tclas
      
      It turns out that the setpgid() system call fails to enter an RCU
      read-side critical section before doing a PID-to-task_struct translation.
      This commit therefore does rcu_read_lock() before the translation, and
      also does rcu_read_unlock() after the last use of the returned pointer.
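
       The resulting pattern looks roughly like this (a sketch of the fix as
       described above, not the verbatim kernel/sys.c code):

       #include <linux/errno.h>
       #include <linux/rcupdate.h>
       #include <linux/sched.h>

       static int lookup_pgid_target(pid_t pid)
       {
               struct task_struct *p;
               int err = -ESRCH;

               rcu_read_lock();
               p = find_task_by_vpid(pid);     /* lookup is now RCU-protected */
               if (p)
                       err = 0;                /* ... use p while still under RCU ... */
               rcu_read_unlock();              /* after the last use of p */
               return err;
       }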
       Reported-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
       Acked-by: David Howells <dhowells@redhat.com>
      950eaaca
  22. 16 Jul, 2010 (9 commits)
    • J
      rlimits: implement prlimit64 syscall · c022a0ac
       Jiri Slaby authored
      This patch adds the code to support the sys_prlimit64 syscall which
      modifies-and-returns the rlim values of a selected process atomically.
      The first parameter, pid, being 0 means current process.
      
       Unlike the current implementation, it is a generic interface,
       architecture independent, so that we needn't handle compat stuff
       anymore.  In the future, after glibc starts to use this, we can deprecate
       sys_setrlimit and sys_getrlimit in its favor and finally clean up the code.
      
       It also adds a possibility of changing limits of other processes.  We
       check the user's permissions to do that and if it succeeds, the new
       limits are propagated online.  This is good for large scale
       applications such as SAP or databases where administrators need to
       change limits from time to time (e.g. increase the core size on a crash)
       and it is unacceptable to restart the service.
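
       From userspace this later became reachable through the glibc prlimit()
       wrapper (an assumption here: glibc 2.13+); a hedged sketch of raising
       another process's core-size limit without restarting it:

       #define _GNU_SOURCE
       #include <stdio.h>
       #include <sys/types.h>
       #include <sys/resource.h>

       static int raise_core_limit(pid_t pid)
       {
               struct rlimit new_limit = { RLIM_INFINITY, RLIM_INFINITY };
               struct rlimit old_limit;

               /* pid == 0 means the calling process itself */
               if (prlimit(pid, RLIMIT_CORE, &new_limit, &old_limit) < 0) {
                       perror("prlimit");
                       return -1;
               }
               printf("previous soft=%llu hard=%llu\n",
                      (unsigned long long)old_limit.rlim_cur,
                      (unsigned long long)old_limit.rlim_max);
               return 0;
       }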
      
       For safety, all rlim users now either use accessors or don't need
       them due to
       - locking
       - the fact a process was just forked and nobody else knows about it
         yet (and thus nobody can read/write its limits)
       hence it is safe to modify limits now.
      
       The limitation is that we currently stay with the ulong internal
       representation.  So the rlim64_is_infinity check is used where the value
       is compared against ULONG_MAX on 32-bit, which is the maximum value there.
      
      And since internally the limits are held in struct rlimit, converters
      which are used before and after do_prlimit call in sys_prlimit64 are
      introduced.
       Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      c022a0ac
    • J
      rlimits: switch more rlimit syscalls to do_prlimit · b9518345
       Jiri Slaby authored
       After we added the more generic do_prlimit, switch sys_getrlimit to it.
       Also switch the compat handling, so we can get rid of ugly __user casts
       and avoid setting the process' address limit to kernel data and back.
       Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      b9518345
    • J
      rlimits: redo do_setrlimit to more generic do_prlimit · 5b41535a
       Jiri Slaby authored
       It now also allows reading of limits, i.e. all reads and writes will
       later use this function.

       It takes two parameters, the new and old limits, which can both be NULL.
       If new is non-NULL, the rlimits are set to the value in it.
       If old is non-NULL, the current rlimits are stored there.
       If both are non-NULL, the old limits are stored before the new ones are
       set, atomically.
       (Similar to sigaction.)
       Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      5b41535a
    • J
      rlimits: do security check under task_lock · 86f162f4
       Jiri Slaby authored
      Do security_task_setrlimit under task_lock. Other tasks may change
      limits under our hands while we are checking limits inside the
      function. From now on, they can't.
      
       Note that all the security work is done under a spinlock here now.
       Security hooks are prepared for that; they are called from interrupt
       context (like security_task_kill) and with spinlocks already held (e.g.
       capable->security_capable).
       Signed-off-by: Jiri Slaby <jslaby@suse.cz>
       Acked-by: James Morris <jmorris@namei.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      86f162f4
    • J
      rlimits: allow setrlimit to non-current tasks · 1c1e618d
       Jiri Slaby authored
       Add locking to allow setrlimit to accept a task parameter other than
       current.
      
       Namely, lock tasklist_lock for reading and check whether the task
       structure has a non-null sighand.  Do all the signal processing with
       that lock still held.
      
      There are some points:
      1) security_task_setrlimit is now called with that lock held. This is
         not new, many security_* functions are called with this lock held
         already so it doesn't harm (all this security_* stuff does almost
         the same).
      2) task->sighand->siglock (in update_rlimit_cpu) is nested in
         tasklist_lock. This dependence is already existing.
      3) tsk->alloc_lock is nested in tasklist_lock. This is OK too, already
         existing dependence.
       Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      1c1e618d
    • J
      rlimits: split sys_setrlimit · 7855c35d
       Jiri Slaby authored
       Create do_setrlimit from sys_setrlimit and declare do_setrlimit
       in the resource header.  This is the first phase towards a generic
       do_prlimit which can be called from the read, write and compat
       rlimits code.
      
      The new do_setrlimit also accepts a task pointer to change the limits
      of. Currently, it cannot be other than current, but this will change
      with locking later.
      
       Also pass tsk->group_leader to security_task_setrlimit to check
       whether current is allowed to change the rlimits of the process, and not
       of an arbitrary thread of it, because that makes more sense given that
       rlimits are per-process and not per-thread.
       Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      7855c35d
    • O
      rlimits: make sure ->rlim_max never grows in sys_setrlimit · 2fb9d268
       Oleg Nesterov authored
      Mostly preparation for Jiri's changes, but probably makes sense anyway.
      
      sys_setrlimit() checks new_rlim.rlim_max <= old_rlim->rlim_max, but when
      it takes task_lock() old_rlim->rlim_max can be already lowered. Move this
      check under task_lock().
      
       Currently this is not important; we can only race with our own
       sub-thread, which means the application is stupid.  But when we change
       the code to allow updating a !current task's limits, it becomes
       important to make sure ->rlim_max can be lowered "reliably" even if we
       race with the application doing sys_setrlimit().
       Signed-off-by: Oleg Nesterov <oleg@redhat.com>
       Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      2fb9d268
    • J
      rlimits: add task_struct to update_rlimit_cpu · 5ab46b34
       Jiri Slaby authored
       Add task_struct as a parameter to update_rlimit_cpu to be able to set
       rlimit_cpu of a task other than current.
       Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
       Acked-by: James Morris <jmorris@namei.org>
      5ab46b34
    • J
      rlimits: security, add task_struct to setrlimit · 8fd00b4d
       Jiri Slaby authored
       Add task_struct to task_setrlimit of security_operations to be able to
       set the rlimit of a task other than current.
       Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
       Acked-by: Eric Paris <eparis@redhat.com>
       Acked-by: James Morris <jmorris@namei.org>
      8fd00b4d
  23. 28 May, 2010 (1 commit)
    • N
      kmod: add init function to usermodehelper · a06a4dc3
       Neil Horman authored
      About 6 months ago, I made a set of changes to how the core-dump-to-a-pipe
      feature in the kernel works.  We had reports of several races, including
      some reports of apps bypassing our recursion check so that a process that
      was forked as part of a core_pattern setup could infinitely crash and
      refork until the system crashed.
      
      We fixed those by improving our recursion checks.  The new check basically
      refuses to fork a process if its core limit is zero, which works well.
      
       Unfortunately, I've been getting grief from maintainers of user space
       programs that are inserted as the forked process of core_pattern.  They
       contend that in order for their programs (such as abrt and apport) to
       work, all the running processes in a system must have their core limits
       set to a non-zero value, to which I say 'yes'.  I did this by design, and
       think that's the right way to do things.
      
      But I've been asked to ease this burden on user space enough times that I
      thought I would take a look at it.  The first suggestion was to make the
      recursion check fail on a non-zero 'special' number, like one.  That way
      the core collector process could set its core size ulimit to 1, and enable
      the kernel's recursion detection.  This isn't a bad idea on the surface,
       but I don't like it since it's opt-in, in that if a program like abrt or
      apport has a bug and fails to set such a core limit, we're left with a
      recursively crashing system again.
      
      So I've come up with this.  What I've done is modify the
      call_usermodehelper api such that an extra parameter is added, a function
      pointer which will be called by the user helper task, after it forks, but
      before it exec's the required process.  This will give the caller the
      opportunity to get a call back in the processes context, allowing it to do
      whatever it needs to to the process in the kernel prior to exec-ing the
       user space code.  In the case of do_coredump, this callback is used to set
       the core ulimit of the helper process to 1.  This eliminates the opt-in
      problem that I had above, as it allows the ulimit for core sizes to be set
      to the value of 1, which is what the recursion check looks for in
      do_coredump.
      
      This patch:
      
       Create a new function, call_usermodehelper_fns(), and allow it to assign both
       an init and a cleanup function, as well as arbitrary data.
      
      The init function is called from the context of the forked process and
      allows for customization of the helper process prior to calling exec.  Its
      return code gates the continuation of the process, or causes its exit.
       Also add an arbitrary data pointer to the subprocess_info struct, allowing
       data to be passed from the caller to the new process and the
       subsequent cleanup process.
      
       Also, use this patch to clean up the cleanup function.  It currently takes
       an argp and envp pointer for freeing, which is ugly.  Let's instead just
       make the subprocess_info structure public, and pass that to the cleanup
       and init routines.
       Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
       Reviewed-by: Oleg Nesterov <oleg@redhat.com>
       Cc: Andi Kleen <andi@firstfloor.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a06a4dc3
  24. 25 Apr, 2010 (1 commit)
  25. 12 Apr, 2010 (2 commits)
  26. 30 Mar, 2010 (1 commit)
    • T
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking... · 5a0e3ad6
       Tejun Heo authored
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
       percpu.h -> slab.h dependency is about to be removed.  Prepare for
       this change by updating users of gfp and slab facilities to include
       those headers directly instead of assuming availability.  As this
       conversion needs to touch a large number of source files, the following
       script is used as the basis of conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
       The script does the following:
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there.  ie. if only gfp is used,
        gfp.h, if slab is used, slab.h.
      
       * When the script inserts a new include, it looks at the include
         blocks and tries to put the new include such that its order conforms
         to its surroundings.  It's put in the include block which contains
         core kernel includes, in the same order that the rest are ordered -
         alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
         doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
        file.
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition while adding it to implementation .h or
         embedding .c file was more appropriate for others.  This step added
         inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         wildly available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build test were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
       Given the fact that I had only a couple of failures from the tests on
       step 6, I'm fairly confident about the coverage of this conversion
       patch.  If there is a breakage, it's likely to be something in one of
       the arch headers which should be easily discoverable on most builds of
       the specific arch.
       Signed-off-by: Tejun Heo <tj@kernel.org>
       Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      5a0e3ad6
  27. 13 Mar, 2010 (2 commits)
    • C
      Add generic sys_olduname() · 5cacdb4a
       Christoph Hellwig authored
      Add generic implementations of the old and really old uname system calls.
      Note that sh only implements sys_olduname but not sys_oldolduname, but I'm
      not going to bother with another ifdef for that special case.
      
      m32r implemented an old uname but never wired it up, so kill it, too.
       Signed-off-by: Christoph Hellwig <hch@lst.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Andreas Schwab <schwab@linux-m68k.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5cacdb4a
    • C
      improve sys_newuname() for compat architectures · e28cbf22
       Christoph Hellwig authored
      On an architecture that supports 32-bit compat we need to override the
      reported machine in uname with the 32-bit value.  Instead of doing this
      separately in every architecture introduce a COMPAT_UTS_MACHINE define in
      <asm/compat.h> and apply it directly in sys_newuname().
       Signed-off-by: Christoph Hellwig <hch@lst.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Andreas Schwab <schwab@linux-m68k.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e28cbf22
  28. 07 Mar, 2010 (1 commit)