1. 18 Dec, 2007 1 commit
    • hugetlb: introduce nr_overcommit_hugepages sysctl · d1c3fb1f
      Nishanth Aravamudan committed
      
      While examining the code to support /proc/sys/vm/hugetlb_dynamic_pool, I
      became convinced that having a boolean sysctl was insufficient:
      
      1) To support per-node control of hugepages, I have previously submitted
      patches to add a sysfs attribute related to nr_hugepages. However, with
      a boolean global value and per-mount quota enforcement constraining the
      dynamic pool, adding corresponding control of the dynamic pool on a
      per-node basis seems inconsistent to me.
      
      2) Administration of the hugetlb dynamic pool with multiple hugetlbfs
      mount points is, arguably, more arduous than it needs to be. Each quota
      would need to be set separately, and the sum would need to be monitored.
      
      To ease the administration, and to help make the way for per-node
      control of the static & dynamic hugepage pool, I added a separate
      sysctl, nr_overcommit_hugepages. This value serves as a high watermark
      for the overall hugepage pool, while nr_hugepages serves as a low
      watermark. The boolean sysctl can then be removed, as the condition
      
      	nr_overcommit_hugepages > 0
      
      indicates the same administrative setting as
      
      	hugetlb_dynamic_pool == 1
      
      Quotas still serve as local enforcement of the size of the pool on a
      per-mount basis.
      
      A few caveats:
      
      1) There is a race whereby the global surplus huge page counter is
      incremented before a hugepage has been allocated. Another process could
      then try to grow the pool, fail to convert a surplus huge page to a normal
      huge page, and instead allocate a fresh huge page. I believe this is
      benign, as no memory is leaked (the actual pages are still tracked
      correctly) and the counters won't go out of sync.
      
      2) Shrinking the static pool while a surplus is in effect will allow the
      number of surplus huge pages to exceed the overcommit value. As long as
      this condition holds, however, no more surplus huge pages will be
      allowed on the system until one of the two sysctls is increased
      sufficiently, or the surplus huge pages go out of use and are freed.
      
      Successfully tested on x86_64 with the current libhugetlbfs snapshot,
      modified to use the new sysctl.
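
A minimal sketch of the watermark relationship described above. The struct and function names are hypothetical and only model the counters; this is not the kernel's implementation. It assumes a new surplus page is permitted only while the surplus count is below nr_overcommit_hugepages.

```c
#include <stdbool.h>

/* Hypothetical model of the pool counters; names are illustrative only. */
struct hugepage_pool {
    unsigned long nr_hugepages;            /* static pool size (low watermark) */
    unsigned long nr_overcommit_hugepages; /* limit on surplus pages */
    unsigned long surplus_huge_pages;      /* surplus pages currently allocated */
};

/* Same administrative meaning as the old hugetlb_dynamic_pool == 1. */
static bool dynamic_pool_enabled(const struct hugepage_pool *p)
{
    return p->nr_overcommit_hugepages > 0;
}

/* Assumption: a new surplus page is allowed only below the overcommit value. */
static bool may_allocate_surplus(const struct hugepage_pool *p)
{
    return p->surplus_huge_pages < p->nr_overcommit_hugepages;
}
```

With nr_overcommit_hugepages set to 0 both helpers return false, which mirrors the dynamic pool being switched off. Caveat 2 corresponds to surplus_huge_pages ending up above the limit, in which case may_allocate_surplus() stays false until the surplus drains or the limit is raised.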
      Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
      Acked-by: Adam Litke <agl@us.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 06 Dec, 2007 1 commit
    • Avoid potential NULL dereference in unregister_sysctl_table · f1dad166
      Pavel Emelyanov committed
      register_sysctl_table() can return NULL sometimes, e.g.  when kmalloc()
      returns NULL or when sysctl check fails.
      
      I've also noticed that much (most?) of the code in the kernel doesn't check
      the return value from register_sysctl_table() and later simply calls
      unregister_sysctl_table() with a potentially NULL argument.
      
      This is unlikely on a common kernel configuration, but in case we're
      dealing with modules and/or fault-injection support, there's a slight
      possibility of an OOPS.
      
      Changing all the users to check the return code from registration does
      not look like a good solution - too much code does this, and a
      failure in sysctl table registration is not a good reason to abort module
      loading (in most cases).
      
      So I think, that we can just have this check in unregister_sysctl_table
      just to avoid accidental OOPS-es (actually, the unregister_sysctl_table()
      did exactly this, before the start_unregistering() appeared).
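
The idea can be sketched in plain C as follows. The types and functions here are hypothetical userspace stand-ins, not the kernel's sysctl API; the point is only that the unregister side tolerates a NULL header, so callers that ignored a failed registration stay safe.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a registered table header. */
struct table_header {
    int registered;
};

/* May return NULL, e.g. when allocation fails or a check rejects the table. */
static struct table_header *register_table(int simulate_failure)
{
    struct table_header *h;

    if (simulate_failure)
        return NULL;
    h = malloc(sizeof(*h));
    if (h)
        h->registered = 1;
    return h;
}

/* Returns 1 if a table was unregistered, 0 if there was nothing to do. */
static int unregister_table(struct table_header *h)
{
    if (h == NULL)      /* the guard this patch adds back */
        return 0;
    h->registered = 0;
    free(h);
    return 1;
}
```

A caller can then pass the result of a failed registration straight to unregister_table() without dereferencing NULL.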
      Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 15 Nov, 2007 1 commit
  4. 10 Nov, 2007 3 commits
  5. 20 Oct, 2007 2 commits
    • pid namespaces: changes to show virtual ids to user · b488893a
      Pavel Emelyanov committed
      This is the largest patch in the set. Make all (I hope) the places where
      the pid is shown to or obtained from the user operate on the virtual pids.
      
      The idea is:
       - all in-kernel data structures must store either struct pid itself
         or the pid's global nr, obtained with pid_nr() call;
       - when seeking the task from kernel code with the stored id one
         should use find_task_by_pid() call that works with global pids;
       - when showing a pid's numerical value to the user the virtual one
         should be used; however, when one shows a task's pid outside this
         task's namespace the global one is to be used;
       - when getting the pid from userspace one needs to consider this as
         the virtual one and use appropriate task/pid-searching functions.
      
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: nuther build fix]
      [akpm@linux-foundation.org: yet nuther build fix]
      [akpm@linux-foundation.org: remove unneeded casts]
      Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • pid namespaces: define is_global_init() and is_container_init() · b460cbc5
      Serge E. Hallyn committed
      is_init() is an ambiguous name for the pid==1 check.  Split it into
      is_global_init() and is_container_init().
      
      A cgroup init has its tsk->pid == 1.
      
      A global init also has its tsk->pid == 1 and its active pid namespace
      is the init_pid_ns.  But rather than check the active pid namespace,
      compare the task structure with 'init_pid_ns.child_reaper', which is
      initialized during boot to the /sbin/init process and never changes.
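
A rough userspace model of the two checks, assuming simplified stand-in types (struct task and struct pid_namespace here are hypothetical and carry only the fields needed for the comparison; pid is taken to be the value seen inside the task's own namespace):

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; illustrative only. */
struct task {
    int pid;
};

struct pid_namespace {
    struct task *child_reaper; /* set once at boot to the /sbin/init task */
};

static struct pid_namespace init_pid_ns;

/* Any task with pid 1 is an init for its container. */
static bool is_container_init(const struct task *tsk)
{
    return tsk->pid == 1;
}

/* Only the boot-time reaper of the initial namespace is the global init. */
static bool is_global_init(const struct task *tsk)
{
    return tsk == init_pid_ns.child_reaper;
}
```

Comparing pointers avoids looking up the task's active namespace, which is the performance point made in the changelog below.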
      
      Changelog:
      
      	2.6.22-rc4-mm2-pidns1:
      	- Use 'init_pid_ns.child_reaper' to determine if a given task is the
      	  global init (/sbin/init) process. This would improve performance
      	  and remove dependence on the task_pid().
      
      	2.6.21-mm2-pidns2:
      
      	- [Sukadev Bhattiprolu] Changed is_container_init() calls in {powerpc,
      	  ppc,avr32}/traps.c for the _exception() call to is_global_init().
      	  This way, we kill only the cgroup if the cgroup's init has a
      	  bug rather than force a kernel panic.
      
      [akpm@linux-foundation.org: fix comment]
      [sukadev@us.ibm.com: Use is_global_init() in arch/m32r/mm/fault.c]
      [bunk@stusta.de: kernel/pid.c: remove unused exports]
      [sukadev@us.ibm.com: Fix capability.c to work with threaded init]
      Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
      Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Acked-by: Pavel Emelianov <xemul@openvz.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Herbert Poetzel <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 19 Oct, 2007 9 commits
    • V3 file capabilities: alter behavior of cap_setpcap · 72c2d582
      Andrew Morgan committed
      The non-filesystem capability meaning of CAP_SETPCAP is that a process, p1,
      can change the capabilities of another process, p2.  This is not the
      meaning that was intended for this capability at all, and this
      implementation came about purely because, without filesystem capabilities,
      there was no way to use capabilities without one process bestowing them on
      another.
      
      Since we now have filesystem support for capabilities we can fix the
      implementation of CAP_SETPCAP.
      
      The most significant thing about this change is that, with it in effect, no
      process can set the capabilities of another process.
      
      The capabilities of a program are set via the capability convolution
      rules:
      
         pI(post-exec) = pI(pre-exec)
         pP(post-exec) = (X(aka cap_bset) & fP) | (pI(post-exec) & fI)
         pE(post-exec) = fE ? pP(post-exec) : 0
      
      at exec() time.  As such, the only influence the pre-exec() program can
      have on the post-exec() program's capabilities is through the pI
      capability set.
      
      The correct implementation for CAP_SETPCAP (and that enabled by this patch)
      is that it can be used to add extra pI capabilities to the current process
      - to be picked up by subsequent exec()s when the above convolution rules
      are applied.
      
      Here is how it works:
      
      Let's say we have a process, p. It has capability sets, pE, pP and pI.
      Generally, p, can change the value of its own pI to pI' where
      
         (pI' & ~pI) & ~pP = 0.
      
      That is, the only new things in pI' that were not present in pI need to
      be present in pP.
      
      The role of CAP_SETPCAP is basically to permit changes to pI beyond
      the above:
      
         if (pE & CAP_SETPCAP) {
            pI' = anything; /* ie., even (pI' & ~pI) & ~pP != 0  */
         }
      
      This capability is useful for things like login, which (say, via
      pam_cap) might want to raise certain inheritable capabilities for use
      by the children of the logged-in user's shell, but those capabilities
      are not useful to or needed by the login program itself.
      
      One such use might be to limit who can run ping. You set the
      capabilities of the 'ping' program to be "= cap_net_raw+i", and then
      only shells that have (pI & CAP_NET_RAW) will be able to run
      it. Without CAP_SETPCAP implemented as described above, login(pam_cap)
      would have to also have (pP & CAP_NET_RAW) in order to raise this
      capability and pass it on through the inheritable set.
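
The rules above can be modelled with plain bitmasks. This is a sketch under stated assumptions, not the kernel's code: cap_set, struct caps, caps_after_exec() and may_set_inheritable() are hypothetical names, and the bit positions assume CAP_SETPCAP = 8 and CAP_NET_RAW = 13 as in linux/capability.h.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t cap_set;

#define CAP_SETPCAP_BIT ((cap_set)1 << 8)   /* assumed value of CAP_SETPCAP */
#define CAP_NET_RAW_BIT ((cap_set)1 << 13)  /* assumed value of CAP_NET_RAW */

/* Effective, permitted and inheritable sets of one process. */
struct caps {
    cap_set e, p, i;
};

/* exec()-time convolution: X is the bounding set, fP/fI/fE come from the file. */
static struct caps caps_after_exec(struct caps pre, cap_set X,
                                   cap_set fP, cap_set fI, bool fE)
{
    struct caps post;

    post.i = pre.i;
    post.p = (X & fP) | (post.i & fI);
    post.e = fE ? post.p : 0;
    return post;
}

/* May the process change its own inheritable set to new_i? */
static bool may_set_inheritable(struct caps cur, cap_set new_i)
{
    if (cur.e & CAP_SETPCAP_BIT)
        return true;                            /* pI' = anything */
    return ((new_i & ~cur.i) & ~cur.p) == 0;    /* new bits must be in pP */
}
```

In the ping example, a login process holding only CAP_SETPCAP passes may_set_inheritable() for CAP_NET_RAW even though CAP_NET_RAW is not in its permitted set, and a later exec of a file with fI = CAP_NET_RAW then yields that capability in pP.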
      Signed-off-by: Andrew Morgan <morgan@kernel.org>
      Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: James Morris <jmorris@namei.org>
      Cc: Casey Schaufler <casey@schaufler-ca.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sysctl: deprecate sys_sysctl in a user space visible fashion. · 7058cb02
      Eric W. Biederman committed
      After adding checking to register_sysctl_table and finding a whole new set
      of bugs, missed by countless code reviews and testers, I have finally lost
      patience with the binary sysctl interface.
      
      The binary sysctl interface has been sort of deprecated for years and
      finding a user space program that uses the syscall is more difficult than
      finding a needle in a haystack.  Problems continue to crop up with the
      in-kernel implementation.  So since supporting something that no one uses is
      silly, deprecate sys_sysctl with a sufficient grace period and notice that
      the handful of user space applications that care can be fixed or replaced.
      
      The /proc/sys sysctl interface that people use will continue to be
      supported indefinitely.
      
      This patch moves the tested warning about sysctls from the path where
      sys_sysctl to a separate path called from both implementations of
      sys_sysctl, and it adds a proper entry into
      Documentation/feature-removal-schedule.
      
      This allows us to revisit this in a couple of years' time and actually kill
      sys_sysctl.
      
      [lethal@linux-sh.org: sysctl: Fix syscall disabled build]
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Paul Mundt <lethal@linux-sh.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sysctl: Error on bad sysctl tables · fc6cd25b
      Eric W. Biederman committed
      After going through the kernel's sysctl tables several times it has become
      clear that code review and testing are just not effective in preventing
      problematic sysctl tables from being used in the stable kernel.  I certainly
      can't seem to fix the problems as fast as they are introduced.
      
      Therefore this patch adds sysctl_check_table which is called when a sysctl
      table is registered and checks to see if we have a problematic sysctl table.
      
      The biggest part of the code is the table of valid binary sysctl entries, but
      since we have frozen our set of binary sysctls this table should not need to
      change, and it makes it much easier to detect when someone unintentionally
      adds a new binary sysctl value.
      
      As best as I can determine all of the several hundred errors spewed on boot up
      now are legitimate.
      
      [bunk@kernel.org: kernel/sysctl_check.c must #include <linux/string.h>]
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Alexey Dobriyan <adobriyan@sw.ru>
      Signed-off-by: Adrian Bunk <bunk@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sysctl: remove the cad_pid binary sysctl path · c65f9239
      Eric W. Biederman committed
      It looks like we inadvertently killed the cad_pid binary sysctl support when
      cad_pid was changed to be a struct pid.  Since no one has complained, just
      remove the binary path.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sysctl: simplify the pty sysctl logic · 35834ca1
      Eric W. Biederman committed
      Instead of having a bunch of ifdefs in sysctl.c move all of the pty sysctl
      logic into drivers/char/pty.c
      
      As well as cleaning up the logic this prevents sysctl_check_table from
      complaining that the root table has a NULL data pointer on something with
      generic methods.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sysctl: remove the binary interface for aio-nr, aio-max-nr, acpi_video_flags · 0d135a4a
      Eric W. Biederman committed
      aio-nr, aio-max-nr, acpi_video_flags are unsigned long values which sysctl
      does not handle properly with a 64bit kernel and a 32bit user space.
      
      Since no one is likely to be using the binary sysctl values and the ascii
      interface still works, this patch just removes support for the binary sysctl
      interface from the kernel.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Alexey Dobriyan <adobriyan@sw.ru>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Len Brown <lenb@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sysctl: remove binary sysctl support where it clearly doesn't work · f5ead5ce
      Eric W. Biederman committed
      These functions are all wrapper functions for the proc interface that are
      needed for them to work correctly.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Alexey Dobriyan <adobriyan@sw.ru>
      Acked-by: Andrew Morgan <morgan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sysctl: Factor out sysctl_data. · 49a0c458
      Eric W. Biederman committed
      There has been no easy way to wrap the default sysctl strategy routine except
      for returning 0, which is not always what we want.  The few instances I have
      seen that want different behaviour have written their own version of
      sysctl_data.  While not too hard, it is unnecessary code and has the potential
      for extra bugs.
      
      So to make these situations easier and make that part of sysctl more symmetric
      I have factored sysctl_data out of do_sysctl_strategy and exported it as a
      function everyone can use.
      
      Further having sysctl_data be an explicit function makes checking for badly
      formed sysctl tables much easier.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Alexey Dobriyan <adobriyan@sw.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sysctl core: Stop using the unnecessary ctl_table typedef · d8217f07
      Eric W. Biederman committed
      In sysctl.h the typedef struct ctl_table ctl_table violates coding style, isn't
      needed, and is a bit of a nuisance because it makes it harder to recognize
      that ctl_table is a type name.
      
      So this patch removes it from the generic sysctl code.  Hopefully I will have
      enough energy to send the rest of my patches, which will follow and remove it
      from the rest of the kernel.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Alexey Dobriyan <adobriyan@sw.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 17 Oct, 2007 4 commits
  8. 15 Oct, 2007 4 commits
  9. 14 Oct, 2007 1 commit
    • minimal build fixes for uml (fallout from x86 merge) · 2b8232ce
      Al Viro committed
       a) include/asm-um/arch can't just point to include/asm-$(SUBARCH) now
       b) arch/{i386,x86_64}/crypto are merged now
       c) subarch-obj needed changes
       d) cpufeature_64.h should pull "cpufeature_32.h", not <asm/cpufeature_32.h>
          since it can be included from asm-um/cpufeature.h
       e) in case of uml-i386 we need CONFIG_X86_32 for make and gcc, but not
          for Kconfig
       f) sysctl.c shouldn't do vdso_enabled for uml-i386 (actually, that one
          should be registered from corresponding arch/*/kernel/*, with ifdef
          going away; that's a separate patch, though).
      
      With that and with Stephen's patch ("[PATCH net-2.6] uml: hard_header fix")
      we have uml allmodconfig building both on i386 and amd64.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 12 Oct, 2007 1 commit
  11. 20 Sep, 2007 1 commit
    • sched: add /proc/sys/kernel/sched_compat_yield · 1799e35d
      Ingo Molnar committed
      add /proc/sys/kernel/sched_compat_yield to make sys_sched_yield()
      more aggressive, by moving the yielding task to the last position
      in the rbtree.
      
      with sched_compat_yield=0:
      
         PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
        2539 mingo     20   0  1576  252  204 R   50  0.0   0:02.03 loop_yield
        2541 mingo     20   0  1576  244  196 R   50  0.0   0:02.05 loop
      
      with sched_compat_yield=1:
      
         PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
        2584 mingo     20   0  1576  248  196 R   99  0.0   0:52.45 loop
        2582 mingo     20   0  1576  256  204 R    0  0.0   0:00.00 loop_yield
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
  12. 26 Aug, 2007 3 commits
  13. 20 Aug, 2007 1 commit
  14. 12 Aug, 2007 1 commit
  15. 30 Jul, 2007 1 commit
  16. 25 Jul, 2007 1 commit
  17. 23 Jul, 2007 1 commit
    • x86: i386-show-unhandled-signals-v3 · abd4f750
      Masoud Asgharifard Sharbiani committed
      This patch makes i386 behave the same way that x86_64 does when a
      segfault happens.  A line gets printed to the kernel log so that tools
      that need to check for failures can behave more uniformly between
      the two architectures.  The message can be turned off by setting the
      debug.show_unhandled_signals sysctl variable to 0 (or by doing echo 0 >
      /proc/sys/debug/exception-trace)
      
      Also, all of the lines being printed are now using printk_ratelimit() to
      deny the ability of DoS from a local user with a program like the
      following:
      
      main()
      {
             while (1)
                     if (!fork()) *(int *)0 = 0;
      }
      
      This new revision also includes the fix that Andrew did which got rid of
      the new sysctl that was added to the system in earlier versions of this.
      Also, the 'show-unhandled-signals' sysctl has been renamed back to the old
      'exception-trace' to avoid breakage of people's scripts.
      
      AK: Enabling by default for i386 will likely be controversial, but let's see what happens
      AK: Really folks, before complaining just fix your segfaults
      AK: I bet this will find a lot of silent issues
      Signed-off-by: Masoud Sharbiani <masouds@google.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      [ Personally, I've found the complaints useful on x86-64, so I'm all for
        this. That said, I wonder if we could do it more prettily..   -Linus ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. 20 Jul, 2007 4 commits
    • kernel/sysctl.c: finish off the warning comments · ed2c12f3
      Andrew Morton committed
      I've been chasing these comments around this file all week.  Hopefully we're
      straight now.
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lockstat: core infrastructure · f20786ff
      Peter Zijlstra committed
      Introduce the core lock statistics code.
      
      Lock statistics provides lock wait-time and hold-time (as well as the count
      of corresponding contention and acquisition events). Also, the first few
      call-sites that encounter contention are tracked.
      
      Lock wait-time is the time spent waiting on the lock. This provides insight
      into the locking scheme, that is, a heavily contended lock is indicative of
      a too coarse locking scheme.
      
      Lock hold-time is the duration the lock was held, this provides a reference for
      the wait-time numbers, so they can be put into perspective.
      
        1)
          lock
        2)
          ... do stuff ..
          unlock
        3)
      
      The time between 1 and 2 is the wait-time. The time between 2 and 3 is the
      hold-time.
      
      The lockdep held-lock tracking code is reused, because it already collects locks
      into meaningful groups (classes), and because it is an existing infrastructure
      for lock instrumentation.
      
      Currently lockdep tracks lock acquisition with two hooks:
      
        lock()
          lock_acquire()
          _lock()
      
       ... code protected by lock ...
      
        unlock()
          lock_release()
          _unlock()
      
      We need to extend this with two more hooks, in order to measure contention.
      
        lock_contended() - used to measure contention events
        lock_acquired()  - completion of the contention
      
      These are then placed the following way:
      
        lock()
          lock_acquire()
          if (!_try_lock())
            lock_contended()
            _lock()
            lock_acquired()
      
       ... do locked stuff ...
      
        unlock()
          lock_release()
          _unlock()
      
      (Note: the try_lock() 'trick' is used to avoid instrumenting all platform
             dependent lock primitive implementations.)
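
As an illustration of how the four hooks could accumulate wait-time and hold-time, here is a small userspace model. The structures and hook signatures are hypothetical (each hook takes an explicit timestamp so the arithmetic is easy to follow); they are not lockdep's real data structures.

```c
/* Hypothetical per-lock-class counters; field names are illustrative. */
struct lock_stats {
    unsigned long contentions;
    unsigned long acquisitions;
    unsigned long wait_time;   /* sum of (acquired - contended) */
    unsigned long hold_time;   /* sum of (release - start of hold) */
};

/* Per-task record of one lock being taken, loosely like a held lock entry. */
struct held_lock {
    struct lock_stats *stats;
    unsigned long wait_start;
    unsigned long hold_start;
};

/* Called on every lock attempt, before the trylock. */
static void lock_acquire(struct held_lock *h, unsigned long now)
{
    h->stats->acquisitions++;
    h->hold_start = now;
}

/* Called only when the trylock failed. */
static void lock_contended(struct held_lock *h, unsigned long now)
{
    h->stats->contentions++;
    h->wait_start = now;
}

/* Called once the contended lock has finally been obtained. */
static void lock_acquired(struct held_lock *h, unsigned long now)
{
    h->stats->wait_time += now - h->wait_start;
    h->hold_start = now;   /* the hold period starts after the wait */
}

/* Called just before the lock is dropped. */
static void lock_release(struct held_lock *h, unsigned long now)
{
    h->stats->hold_time += now - h->hold_start;
}
```

For example, if task A takes the lock at t=0 and releases it at t=5 while task B attempts it at t=2, gets it at t=5 and releases at t=9, the model records one contention, a wait-time of 3 and a total hold-time of 9.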
      
      It is also possible to toggle the two lockdep features at runtime using:
      
        /proc/sys/kernel/prove_locking
        /proc/sys/kernel/lock_stat
      
      (esp. turning off the O(n^2) prove_locking functionality can help)
      
      [akpm@linux-foundation.org: build fixes]
      [akpm@linux-foundation.org: nuke unneeded ifdefs]
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Jason Baron <jbaron@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • coredump masking: bound suid_dumpable sysctl · 76fdbb25
      Kawai, Hidehiro committed
      This patch series is version 5 of the core dump masking feature, which
      controls which VMAs should be dumped based on their memory types and
      per-process flags.
      
      I adopted most of Andrew's suggestions on the previous version.  He also
      suggested using a system call instead of the /proc/<pid>/ interface, but I
      decided to keep using the latter because adding a new system call with a pid
      argument would have a big impact on the kernel.
      
      You can access the per-process flags via /proc/<pid>/coredump_filter
      interface.  coredump_filter represents a bitmask of memory types, and if a bit
      is set, VMAs of corresponding memory type are written into a core file when
      the process is dumped.  The bitmask is inherited from the parent process when
      a process is created.
      
      The original purpose is to avoid longtime system slowdown when a number of
      processes which share a huge shared memory are dumped at the same time.  To
      achieve this purpose, this patch series adds an ability to suppress dumping
      anonymous shared memory for specified processes.  In this version, three other
      memory types are also supported.
      
      Here are the coredump_filter bits:
        bit 0: anonymous private memory
        bit 1: anonymous shared memory
        bit 2: file-backed private memory
        bit 3: file-backed shared memory
      
      The default value of coredump_filter is 0x3.  This means the new core dump
      routine has the same behavior as conventional behavior by default.
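
The bit layout above can be decoded with a small helper. The enum and function names are hypothetical and only illustrate how a mask value such as the default 0x3 maps to memory types:

```c
#include <stdbool.h>

/* Bit positions as listed in the commit message. */
enum vma_type {
    VMA_ANON_PRIVATE = 0,
    VMA_ANON_SHARED  = 1,
    VMA_FILE_PRIVATE = 2,
    VMA_FILE_SHARED  = 3
};

#define COREDUMP_FILTER_DEFAULT 0x3u

/* True if VMAs of the given type are written to the core file. */
static bool vma_is_dumped(unsigned int filter, enum vma_type type)
{
    return ((filter >> type) & 1u) != 0;
}
```

With the default mask, both anonymous types are dumped and both file-backed types are skipped. Clearing bit 1 (a mask of 0x1) suppresses anonymous shared memory, which is the original motivation described above.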
      
      In this version, coredump_filter bits and mm.dumpable are merged into
      mm.flags, and it is accessed by atomic bitops.
      
      The supported core file formats are ELF and ELF-FDPIC.  ELF has been tested,
      but ELF-FDPIC has not been built and tested because I don't have the test
      environment.
      
      This patch limits a value of suid_dumpable sysctl to the range of 0 to 2.
      Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • audit: rework execve audit · bdf4c48a
      Peter Zijlstra committed
      The purpose of audit_bprm() is to log the argv array to a userspace daemon at
      the end of the execve system call.  Since user-space hasn't had time to run,
      this array is still in pristine state on the process' stack; so no need to
      copy it, we can just grab it from there.
      
      In order to minimize the damage to audit_log_*() copy each string into a
      temporary kernel buffer first.
      
      Currently the audit code requires that the full argument vector fits in a
      single packet.  So currently it does clip the argv size to a (sysctl) limit,
      but only when execve auditing is enabled.
      
      If the audit protocol gets extended to allow for multiple packets this check
      can be removed.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ollie Wild <aaw@google.com>
      Cc: <linux-audit@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>