1. 01 8月, 2012 1 次提交
  2. 31 7月, 2012 2 次提交
    • S
      sysctl: suppress kmemleak messages · fd4b616b
      Steven Rostedt 提交于
      register_sysctl_table() is a strange function, as it makes internal
      allocations (a header) to register a sysctl_table.  This header is a
      handle to the table that is created, and can be used to unregister the
      table.  But if the table is permanent and never unregistered, the header
      acts the same as a static variable.
      
      Unfortunately, this allocation of memory that is never expected to be
      freed fools kmemleak in thinking that we have leaked memory.  For those
      sysctl tables that are never unregistered, and have no pointer referencing
      them, kmemleak will think that these are memory leaks:
      
      unreferenced object 0xffff880079fb9d40 (size 192):
        comm "swapper/0", pid 0, jiffies 4294667316 (age 12614.152s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff8146b590>] kmemleak_alloc+0x73/0x98
          [<ffffffff8110a935>] kmemleak_alloc_recursive.constprop.42+0x16/0x18
          [<ffffffff8110b852>] __kmalloc+0x107/0x153
          [<ffffffff8116fa72>] kzalloc.constprop.8+0xe/0x10
          [<ffffffff811703c9>] __register_sysctl_paths+0xe1/0x160
          [<ffffffff81170463>] register_sysctl_paths+0x1b/0x1d
          [<ffffffff8117047d>] register_sysctl_table+0x18/0x1a
          [<ffffffff81afb0a1>] sysctl_init+0x10/0x14
          [<ffffffff81b05a6f>] proc_sys_init+0x2f/0x31
          [<ffffffff81b0584c>] proc_root_init+0xa5/0xa7
          [<ffffffff81ae5b7e>] start_kernel+0x3d0/0x40a
          [<ffffffff81ae52a7>] x86_64_start_reservations+0xae/0xb2
          [<ffffffff81ae53ad>] x86_64_start_kernel+0x102/0x111
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      The sysctl_base_table used by sysctl itself is one such instance that
      registers the table to never be unregistered.
      
      Use kmemleak_not_leak() to suppress the kmemleak false positive.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      Acked-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fd4b616b
    • K
      coredump: warn about unsafe suid_dumpable / core_pattern combo · 54b50199
      Kees Cook 提交于
      When suid_dumpable=2, detect unsafe core_pattern settings and warn when
      they are seen.
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Suggested-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alan Cox <alan@linux.intel.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      54b50199
  3. 30 7月, 2012 1 次提交
    • K
      fs: add link restrictions · 800179c9
      Kees Cook 提交于
      This adds symlink and hardlink restrictions to the Linux VFS.
      
      Symlinks:
      
      A long-standing class of security issues is the symlink-based
      time-of-check-time-of-use race, most commonly seen in world-writable
      directories like /tmp. The common method of exploitation of this flaw
      is to cross privilege boundaries when following a given symlink (i.e. a
      root process follows a symlink belonging to another user). For a likely
      incomplete list of hundreds of examples across the years, please see:
      http://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=/tmp
      
      The solution is to permit symlinks to only be followed when outside
      a sticky world-writable directory, or when the uid of the symlink and
      follower match, or when the directory owner matches the symlink's owner.
      
      Some pointers to the history of earlier discussion that I could find:
      
       1996 Aug, Zygo Blaxell
        http://marc.info/?l=bugtraq&m=87602167419830&w=2
       1996 Oct, Andrew Tridgell
        http://lkml.indiana.edu/hypermail/linux/kernel/9610.2/0086.html
       1997 Dec, Albert D Cahalan
        http://lkml.org/lkml/1997/12/16/4
       2005 Feb, Lorenzo Hernández García-Hierro
        http://lkml.indiana.edu/hypermail/linux/kernel/0502.0/1896.html
       2010 May, Kees Cook
        https://lkml.org/lkml/2010/5/30/144
      
      Past objections and rebuttals could be summarized as:
      
       - Violates POSIX.
         - POSIX didn't consider this situation and it's not useful to follow
           a broken specification at the cost of security.
       - Might break unknown applications that use this feature.
         - Applications that break because of the change are easy to spot and
           fix. Applications that are vulnerable to symlink ToCToU by not having
           the change aren't. Additionally, no applications have yet been found
           that rely on this behavior.
       - Applications should just use mkstemp() or O_CREATE|O_EXCL.
         - True, but applications are not perfect, and new software is written
           all the time that makes these mistakes; blocking this flaw at the
           kernel is a single solution to the entire class of vulnerability.
       - This should live in the core VFS.
         - This should live in an LSM. (https://lkml.org/lkml/2010/5/31/135)
       - This should live in an LSM.
         - This should live in the core VFS. (https://lkml.org/lkml/2010/8/2/188)
      
      Hardlinks:
      
      On systems that have user-writable directories on the same partition
      as system files, a long-standing class of security issues is the
      hardlink-based time-of-check-time-of-use race, most commonly seen in
      world-writable directories like /tmp. The common method of exploitation
      of this flaw is to cross privilege boundaries when following a given
      hardlink (i.e. a root process follows a hardlink created by another
      user). Additionally, an issue exists where users can "pin" a potentially
      vulnerable setuid/setgid file so that an administrator will not actually
      upgrade a system fully.
      
      The solution is to permit hardlinks to only be created when the user is
      already the existing file's owner, or if they already have read/write
      access to the existing file.
      
      Many Linux users are surprised when they learn they can link to files
      they have no access to, so this change appears to follow the doctrine
      of "least surprise". Additionally, this change does not violate POSIX,
      which states "the implementation may require that the calling process
      has permission to access the existing file"[1].
      
      This change is known to break some implementations of the "at" daemon,
      though the version used by Fedora and Ubuntu has been fixed[2] for
      a while. Otherwise, the change has been undisruptive while in use in
      Ubuntu for the last 1.5 years.
      
      [1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/linkat.html
      [2] http://anonscm.debian.org/gitweb/?p=collab-maint/at.git;a=commitdiff;h=f4114656c3a6c6f6070e315ffdf940a49eda3279
      
      This patch is based on the patches in Openwall and grsecurity, along with
      suggestions from Al Viro. I have added a sysctl to enable the protected
      behavior, and documentation.
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      800179c9
  4. 05 4月, 2012 1 次提交
  5. 29 3月, 2012 3 次提交
  6. 14 2月, 2012 1 次提交
  7. 25 1月, 2012 3 次提交
  8. 05 12月, 2011 1 次提交
  9. 01 11月, 2011 1 次提交
    • D
      kernel/sysctl.c: add cap_last_cap to /proc/sys/kernel · 73efc039
      Dan Ballard 提交于
      Userspace needs to know the highest valid capability of the running
      kernel, which right now cannot reliably be retrieved from the header files
      only.  The fact that this value cannot be determined properly right now
      creates various problems for libraries compiled on newer header files
      which are run on older kernels.  They assume capabilities are available
      which actually aren't.  libcap-ng is one example.  And we ran into the
      same problem with systemd too.
      
      Now the capability is exported in /proc/sys/kernel/cap_last_cap.
      
      [akpm@linux-foundation.org: make cap_last_cap const, per Ulrich]
      Signed-off-by: NDan Ballard <dan@mindstab.net>
      Cc: Randy Dunlap <rdunlap@xenotime.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Cc: Ulrich Drepper <drepper@akkadia.org>
      Cc: James Morris <jmorris@namei.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      73efc039
  10. 30 10月, 2011 1 次提交
  11. 14 8月, 2011 1 次提交
  12. 21 7月, 2011 1 次提交
  13. 04 6月, 2011 1 次提交
  14. 23 5月, 2011 1 次提交
  15. 20 5月, 2011 1 次提交
    • C
      arch/tile: support signal "exception-trace" hook · 571d76ac
      Chris Metcalf 提交于
      This change adds support for /proc/sys/debug/exception-trace to tile.
      Like x86 and sparc, by default it is set to "1", generating a one-line
      printk whenever a user process crashes.  By setting it to "2", we get
      a much more complete userspace diagnostic at crash time, including
      a user-space backtrace, register dump, and memory dump around the
      address of the crash.
      
      Some vestiges of the Tilera-internal version of this support are
      removed with this patch (the show_crashinfo variable and the
      arch_coredump_signal function).  We retain a "crashinfo" boot parameter
      which allows you to set the boot-time value of exception-trace.
      Signed-off-by: NChris Metcalf <cmetcalf@tilera.com>
      571d76ac
  16. 04 4月, 2011 1 次提交
    • E
      capabilites: allow the application of capability limits to usermode helpers · 17f60a7d
      Eric Paris 提交于
      There is no way to limit the capabilities of usermodehelpers. This problem
      reared its head recently when someone complained that any user with
      cap_net_admin was able to load arbitrary kernel modules, even though the user
      didn't have cap_sys_module.  The reason is because the actual load is done by
      a usermode helper and those always have the full cap set.  This patch addes new
      sysctls which allow us to bound the permissions of usermode helpers.
      
      /proc/sys/kernel/usermodehelper/bset
      /proc/sys/kernel/usermodehelper/inheritable
      
      You must have CAP_SYS_MODULE  and CAP_SETPCAP to change these (changes are
      &= ONLY).  When the kernel launches a usermodehelper it will do so with these
      as the bset and pI.
      
      -v2:	make globals static
      	create spinlock to protect globals
      
      -v3:	require both CAP_SETPCAP and CAP_SYS_MODULE
      -v4:	fix the typo s/CAP_SET_PCAP/CAP_SETPCAP/ because I didn't commit
      Signed-off-by: NEric Paris <eparis@redhat.com>
      No-objection-from: Serge E. Hallyn <serge.hallyn@canonical.com>
      Acked-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NSerge E. Hallyn <serge.hallyn@canonical.com>
      Acked-by: NAndrew G. Morgan <morgan@kernel.org>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      17f60a7d
  17. 24 3月, 2011 2 次提交
  18. 08 3月, 2011 1 次提交
    • A
      unfuck proc_sysctl ->d_compare() · dfef6dcd
      Al Viro 提交于
      a) struct inode is not going to be freed under ->d_compare();
      however, the thing PROC_I(inode)->sysctl points to just might.
      Fortunately, it's enough to make freeing that sucker delayed,
      provided that we don't step on its ->unregistering, clear
      the pointer to it in PROC_I(inode) before dropping the reference
      and check if it's NULL in ->d_compare().
      
      b) I'm not sure that we *can* walk into NULL inode here (we recheck
      dentry->seq between verifying that it's still hashed / fetching
      dentry->d_inode and passing it to ->d_compare() and there's no
      negative hashed dentries in /proc/sys/*), but if we can walk into
      that, we really should not have ->d_compare() return 0 on it!
      Said that, I really suspect that this check can be simply killed.
      Nick?
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      dfef6dcd
  19. 23 2月, 2011 1 次提交
  20. 16 2月, 2011 1 次提交
  21. 03 2月, 2011 1 次提交
    • R
      sched: Use a buddy to implement yield_task_fair() · ac53db59
      Rik van Riel 提交于
      Use the buddy mechanism to implement yield_task_fair.  This
      allows us to skip onto the next highest priority se at every
      level in the CFS tree, unless doing so would introduce gross
      unfairness in CPU time distribution.
      
      We order the buddy selection in pick_next_entity to check
      yield first, then last, then next.  We need next to be able
      to override yield, because it is possible for the "next" and
      "yield" task to be different processen in the same sub-tree
      of the CFS tree.  When they are, we need to go into that
      sub-tree regardless of the "yield" hint, and pick the correct
      entity once we get to the right level.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110201095103.3a79e92a@annuminas.surriel.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      ac53db59
  22. 02 2月, 2011 1 次提交
  23. 25 1月, 2011 1 次提交
  24. 14 1月, 2011 3 次提交
  25. 10 12月, 2010 1 次提交
  26. 30 11月, 2010 1 次提交
    • M
      sched: Add 'autogroup' scheduling feature: automated per session task groups · 5091faa4
      Mike Galbraith 提交于
      A recurring complaint from CFS users is that parallel kbuild has
      a negative impact on desktop interactivity.  This patch
      implements an idea from Linus, to automatically create task
      groups.  Currently, only per session autogroups are implemented,
      but the patch leaves the way open for enhancement.
      
      Implementation: each task's signal struct contains an inherited
      pointer to a refcounted autogroup struct containing a task group
      pointer, the default for all tasks pointing to the
      init_task_group.  When a task calls setsid(), a new task group
      is created, the process is moved into the new task group, and a
      reference to the preveious task group is dropped.  Child
      processes inherit this task group thereafter, and increase it's
      refcount.  When the last thread of a process exits, the
      process's reference is dropped, such that when the last process
      referencing an autogroup exits, the autogroup is destroyed.
      
      At runqueue selection time, IFF a task has no cgroup assignment,
      its current autogroup is used.
      
      Autogroup bandwidth is controllable via setting it's nice level
      through the proc filesystem:
      
        cat /proc/<pid>/autogroup
      
      Displays the task's group and the group's nice level.
      
        echo <nice level> > /proc/<pid>/autogroup
      
      Sets the task group's shares to the weight of nice <level> task.
      Setting nice level is rate limited for !admin users due to the
      abuse risk of task group locking.
      
      The feature is enabled from boot by default if
      CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via
      the boot option noautogroup, and can also be turned on/off on
      the fly via:
      
        echo [01] > /proc/sys/kernel/sched_autogroup_enabled
      
      ... which will automatically move tasks to/from the root task group.
      Signed-off-by: NMike Galbraith <efault@gmx.de>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      [ Removed the task_group_path() debug code, and fixed !EVENTFD build failure. ]
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      LKML-Reference: <1290281700.28711.9.camel@maggy.simson.net>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      5091faa4
  27. 18 11月, 2010 3 次提交
    • P
      sched: Add sysctl_sched_shares_window · a7a4f8a7
      Paul Turner 提交于
      Introduce a new sysctl for the shares window and disambiguate it from
      sched_time_avg.
      
      A 10ms window appears to be a good compromise between accuracy and performance.
      Signed-off-by: NPaul Turner <pjt@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20101115234938.112173964@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      a7a4f8a7
    • P
      sched: Rewrite tg_shares_up) · 2069dd75
      Peter Zijlstra 提交于
      By tracking a per-cpu load-avg for each cfs_rq and folding it into a
      global task_group load on each tick we can rework tg_shares_up to be
      strictly per-cpu.
      
      This should improve cpu-cgroup performance for smp systems
      significantly.
      
      [ Paul: changed to use queueing cfs_rq + bug fixes ]
      Signed-off-by: NPaul Turner <pjt@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20101115234937.580480400@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      2069dd75
    • D
      x86, nmi_watchdog: Remove the old nmi_watchdog · 5f2b0ba4
      Don Zickus 提交于
      Now that we have a new nmi_watchdog that is more generic and
      sits on top of the perf subsystem, we really do not need the old
      nmi_watchdog any more.
      
      In addition, the old nmi_watchdog doesn't really work if you are
      using the default clocksource, hpet.  The old nmi_watchdog code
      relied on local apic interrupts to determine if the cpu is still
      alive.  With hpet as the clocksource, these interrupts don't
      increment any more and the old nmi_watchdog triggers false
      postives.
      
      This piece removes the old nmi_watchdog code and stubs out any
      variables and functions calls.  The stubs are the same ones used
      by the new nmi_watchdog code, so it should be well tested.
      Signed-off-by: NDon Zickus <dzickus@redhat.com>
      Cc: fweisbec@gmail.com
      Cc: gorcunov@openvz.org
      LKML-Reference: <1289578944-28564-2-git-send-email-dzickus@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      5f2b0ba4
  28. 16 11月, 2010 1 次提交
  29. 12 11月, 2010 1 次提交
  30. 27 10月, 2010 1 次提交