1. 30 4月, 2013 2 次提交
    • A
      mm: limit growth of 3% hardcoded other user reserve · c9b1d098
      Andrew Shewmaker 提交于
      Add user_reserve_kbytes knob.
      
      Limit the growth of the memory reserved for other user processes to
      min(3% current process size, user_reserve_pages).  Only about 8MB is
      necessary to enable recovery in the default mode, and only a few hundred
      MB are required even when overcommit is disabled.
      
      user_reserve_pages defaults to min(3% free pages, 128MB)
      
      I arrived at 128MB by taking the max VSZ of sshd, login, bash, and top ...
      then adding the RSS of each.
      
      This only affects OVERCOMMIT_NEVER mode.
      
      Background
      
      1. user reserve
      
      __vm_enough_memory reserves a hardcoded 3% of the current process size for
      other applications when overcommit is disabled.  This was done so that a
      user could recover if they launched a memory hogging process.  Without the
      reserve, a user would easily run into a message such as:
      
      bash: fork: Cannot allocate memory
      
      2. admin reserve
      
      Additionally, a hardcoded 3% of free memory is reserved for root in both
      overcommit 'guess' and 'never' modes.  This was intended to prevent a
      scenario where root-cant-log-in and perform recovery operations.
      
      Note that this reserve shrinks, and doesn't guarantee a useful reserve.
      
      Motivation
      
      The two hardcoded memory reserves should be updated to account for current
      memory sizes.
      
      Also, the admin reserve would be more useful if it didn't shrink too much.
      
      When the current code was originally written, 1GB was considered
      "enterprise".  Now the 3% reserve can grow to multiple GB on large memory
      systems, and it only needs to be a few hundred MB at most to enable a user
      or admin to recover a system with an unwanted memory hogging process.
      
      I've found that reducing these reserves is especially beneficial for a
      specific type of application load:
      
       * single application system
       * one or few processes (e.g. one per core)
       * allocating all available memory
       * not initializing every page immediately
       * long running
      
      I've run scientific clusters with this sort of load.  A long running job
      sometimes failed many hours (weeks of CPU time) into a calculation.  They
      weren't initializing all of their memory immediately, and they weren't
      using calloc, so I put systems into overcommit 'never' mode.  These
      clusters run diskless and have no swap.
      
      However, with the current reserves, a user wishing to allocate as much
      memory as possible to one process may be prevented from using, for
      example, almost 2GB out of 32GB.
      
      The effect is less, but still significant when a user starts a job with
      one process per core.  I have repeatedly seen a set of processes
      requesting the same amount of memory fail because one of them could not
      allocate the amount of memory a user would expect to be able to allocate.
      For example, Message Passing Interfce (MPI) processes, one per core.  And
      it is similar for other parallel programming frameworks.
      
      Changing this reserve code will make the overcommit never mode more useful
      by allowing applications to allocate nearly all of the available memory.
      
      Also, the new admin_reserve_kbytes will be safer than the current behavior
      since the hardcoded 3% of available memory reserve can shrink to something
      useless in the case where applications have grabbed all available memory.
      
      Risks
      
      * "bash: fork: Cannot allocate memory"
      
        The downside of the first patch-- which creates a tunable user reserve
        that is only used in overcommit 'never' mode--is that an admin can set
        it so low that a user may not be able to kill their process, even if
        they already have a shell prompt.
      
        Of course, a user can get in the same predicament with the current 3%
        reserve--they just have to launch processes until 3% becomes negligible.
      
      * root-cant-log-in problem
      
        The second patch, adding the tunable rootuser_reserve_pages, allows
        the admin to shoot themselves in the foot by setting it too small.  They
        can easily get the system into a state where root-can't-log-in.
      
        However, the new admin_reserve_kbytes will be safer than the current
        behavior since the hardcoded 3% of available memory reserve can shrink
        to something useless in the case where applications have grabbed all
        available memory.
      
      Alternatives
      
       * Memory cgroups provide a more flexible way to limit application memory.
      
         Not everyone wants to set up cgroups or deal with their overhead.
      
       * We could create a fourth overcommit mode which provides smaller reserves.
      
         The size of useful reserves may be drastically different depending
         on the whether the system is embedded or enterprise.
      
       * Force users to initialize all of their memory or use calloc.
      
         Some users don't want/expect the system to overcommit when they malloc.
         Overcommit 'never' mode is for this scenario, and it should work well.
      
      The new user and admin reserve tunables are simple to use, with low
      overhead compared to cgroups.  The patches preserve current behavior where
      3% of memory is less than 128MB, except that the admin reserve doesn't
      shrink to an unusable size under pressure.  The code allows admins to tune
      for embedded and enterprise usage.
      
      FAQ
      
       * How is the root-cant-login problem addressed?
         What happens if admin_reserve_pages is set to 0?
      
         Root is free to shoot themselves in the foot by setting
         admin_reserve_kbytes too low.
      
         On x86_64, the minimum useful reserve is:
           8MB for overcommit 'guess'
         128MB for overcommit 'never'
      
         admin_reserve_pages defaults to min(3% free memory, 8MB)
      
         So, anyone switching to 'never' mode needs to adjust
         admin_reserve_pages.
      
       * How do you calculate a minimum useful reserve?
      
         A user or the admin needs enough memory to login and perform
         recovery operations, which includes, at a minimum:
      
         sshd or login + bash (or some other shell) + top (or ps, kill, etc.)
      
         For overcommit 'guess', we can sum resident set sizes (RSS)
         because we only need enough memory to handle what the recovery
         programs will typically use. On x86_64 this is about 8MB.
      
         For overcommit 'never', we can take the max of their virtual sizes (VSZ)
         and add the sum of their RSS. We use VSZ instead of RSS because mode
         forces us to ensure we can fulfill all of the requested memory allocations--
         even if the programs only use a fraction of what they ask for.
         On x86_64 this is about 128MB.
      
         When swap is enabled, reserves are useful even when they are as
         small as 10MB, regardless of overcommit mode.
      
         When both swap and overcommit are disabled, then the admin should
         tune the reserves higher to be absolutley safe. Over 230MB each
         was safest in my testing.
      
       * What happens if user_reserve_pages is set to 0?
      
         Note, this only affects overcomitt 'never' mode.
      
         Then a user will be able to allocate all available memory minus
         admin_reserve_kbytes.
      
         However, they will easily see a message such as:
      
         "bash: fork: Cannot allocate memory"
      
         And they won't be able to recover/kill their application.
         The admin should be able to recover the system if
         admin_reserve_kbytes is set appropriately.
      
       * What's the difference between overcommit 'guess' and 'never'?
      
         "Guess" allows an allocation if there are enough free + reclaimable
         pages. It has a hardcoded 3% of free pages reserved for root.
      
         "Never" allows an allocation if there is enough swap + a configurable
         percentage (default is 50) of physical RAM. It has a hardcoded 3% of
         free pages reserved for root, like "Guess" mode. It also has a
         hardcoded 3% of the current process size reserved for additional
         applications.
      
       * Why is overcommit 'guess' not suitable even when an app eventually
         writes to every page? It takes free pages, file pages, available
         swap pages, reclaimable slab pages into consideration. In other words,
         these are all pages available, then why isn't overcommit suitable?
      
         Because it only looks at the present state of the system. It
         does not take into account the memory that other applications have
         malloced, but haven't initialized yet. It overcommits the system.
      
      Test Summary
      
      There was little change in behavior in the default overcommit 'guess'
      mode with swap enabled before and after the patch. This was expected.
      
      Systems run most predictably (i.e. no oom kills) in overcommit 'never'
      mode with swap enabled. This also allowed the most memory to be allocated
      to a user application.
      
      Overcommit 'guess' mode without swap is a bad idea. It is easy to
      crash the system. None of the other tested combinations crashed.
      This matches my experience on the Roadrunner supercomputer.
      
      Without the tunable user reserve, a system in overcommit 'never' mode
      and without swap does not allow the admin to recover, although the
      admin can.
      
      With the new tunable reserves, a system in overcommit 'never' mode
      and without swap can be configured to:
      
      1. maximize user-allocatable memory, running close to the edge of
      recoverability
      
      2. maximize recoverability, sacrificing allocatable memory to
      ensure that a user cannot take down a system
      
      Test Description
      
      Fedora 18 VM - 4 x86_64 cores, 5725MB RAM, 4GB Swap
      
      System is booted into multiuser console mode, with unnecessary services
      turned off. Caches were dropped before each test.
      
      Hogs are user memtester processes that attempt to allocate all free memory
      as reported by /proc/meminfo
      
      In overcommit 'never' mode, memory_ratio=100
      
      Test Results
      
      3.9.0-rc1-mm1
      
      Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
      ----------   ----   ----   -------------   ----   -------------   --------------
      guess        yes    1      5432/5432       no     yes             yes
      guess        yes    4      5444/5444       1      yes             yes
      guess        no     1      5302/5449       no     yes             yes
      guess        no     4      -               crash  no              no
      
      never        yes    1      5460/5460       1      yes             yes
      never        yes    4      5460/5460       1      yes             yes
      never        no     1      5218/5432       no     no              yes
      never        no     4      5203/5448       no     no              yes
      
      3.9.0-rc1-mm1-tunablereserves
      
      User and Admin Recovery show their respective reserves, if applicable.
      
      Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
      ----------   ----   ----   -------------   ----   -------------   --------------
      guess        yes    1      5419/5419       no     - yes           8MB yes
      guess        yes    4      5436/5436       1      - yes           8MB yes
      guess        no     1      5440/5440       *      - yes           8MB yes
      guess        no     4      -               crash  - no            8MB no
      
      * process would successfully mlock, then the oom killer would pick it
      
      never        yes    1      5446/5446       no     10MB yes        20MB yes
      never        yes    4      5456/5456       no     10MB yes        20MB yes
      never        no     1      5387/5429       no     128MB no        8MB barely
      never        no     1      5323/5428       no     226MB barely    8MB barely
      never        no     1      5323/5428       no     226MB barely    8MB barely
      
      never        no     1      5359/5448       no     10MB no         10MB barely
      
      never        no     1      5323/5428       no     0MB no          10MB barely
      never        no     1      5332/5428       no     0MB no          50MB yes
      never        no     1      5293/5429       no     0MB no          90MB yes
      
      never        no     1      5001/5427       no     230MB yes       338MB yes
      never        no     4*     4998/5424       no     230MB yes       338MB yes
      
      * more memtesters were launched, able to allocate approximately another 100MB
      
      Future Work
      
       - Test larger memory systems.
      
       - Test an embedded image.
      
       - Test other architectures.
      
       - Time malloc microbenchmarks.
      
       - Would it be useful to be able to set overcommit policy for
         each memory cgroup?
      
       - Some lines are slightly above 80 chars.
         Perhaps define a macro to convert between pages and kb?
         Other places in the kernel do this.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: make init_user_reserve() static]
      Signed-off-by: NAndrew Shewmaker <agshew@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c9b1d098
    • J
      fs: don't compile in drop_caches.c when CONFIG_SYSCTL=n · 146732ce
      Josh Triplett 提交于
      drop_caches.c provides code only invokable via sysctl, so don't compile it
      in when CONFIG_SYSCTL=n.
      Signed-off-by: NJosh Triplett <josh@joshtriplett.org>
      Acked-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      146732ce
  2. 28 2月, 2013 1 次提交
  3. 24 2月, 2013 1 次提交
  4. 16 2月, 2013 1 次提交
  5. 08 2月, 2013 2 次提交
  6. 21 1月, 2013 1 次提交
  7. 10 1月, 2013 1 次提交
  8. 11 12月, 2012 5 次提交
    • M
      mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate · b8593bfd
      Mel Gorman 提交于
      The PTE scanning rate and fault rates are two of the biggest sources of
      system CPU overhead with automatic NUMA placement.  Ideally a proper policy
      would detect if a workload was properly placed, schedule and adjust the
      PTE scanning rate accordingly. We do not track the necessary information
      to do that but we at least know if we migrated or not.
      
      This patch scans slower if a page was not migrated as the result of a
      NUMA hinting fault up to sysctl_numa_balancing_scan_period_max which is
      now higher than the previous default. Once every minute it will reset
      the scanner in case of phase changes.
      
      This is hilariously crude and the numbers are arbitrary. Workloads will
      converge quite slowly in comparison to what a proper policy should be able
      to do. On the plus side, we will chew up less CPU for workloads that have
      no need for automatic balancing.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      b8593bfd
    • P
      mm: sched: numa: Implement slow start for working set sampling · 4b96a29b
      Peter Zijlstra 提交于
      Add a 1 second delay before starting to scan the working set of
      a task and starting to balance it amongst nodes.
      
      [ note that before the constant per task WSS sampling rate patch
        the initial scan would happen much later still, in effect that
        patch caused this regression. ]
      
      The theory is that short-run tasks benefit very little from NUMA
      placement: they come and go, and they better stick to the node
      they were started on. As tasks mature and rebalance to other CPUs
      and nodes, so does their NUMA placement have to change and so
      does it start to matter more and more.
      
      In practice this change fixes an observable kbuild regression:
      
         # [ a perf stat --null --repeat 10 test of ten bzImage builds to /dev/shm ]
      
         !NUMA:
         45.291088843 seconds time elapsed                                          ( +-  0.40% )
         45.154231752 seconds time elapsed                                          ( +-  0.36% )
      
         +NUMA, no slow start:
         46.172308123 seconds time elapsed                                          ( +-  0.30% )
         46.343168745 seconds time elapsed                                          ( +-  0.25% )
      
         +NUMA, 1 sec slow start:
         45.224189155 seconds time elapsed                                          ( +-  0.25% )
         45.160866532 seconds time elapsed                                          ( +-  0.17% )
      
      and it also fixes an observable perf bench (hackbench) regression:
      
         # perf stat --null --repeat 10 perf bench sched messaging
      
         -NUMA:
      
         -NUMA:                  0.246225691 seconds time elapsed                   ( +-  1.31% )
         +NUMA no slow start:    0.252620063 seconds time elapsed                   ( +-  1.13% )
      
         +NUMA 1sec delay:       0.248076230 seconds time elapsed                   ( +-  1.35% )
      
      The implementation is simple and straightforward, most of the patch
      deals with adding the /proc/sys/kernel/numa_balancing_scan_delay_ms tunable
      knob.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      [ Wrote the changelog, ran measurements, tuned the default. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      4b96a29b
    • P
      mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate · 6e5fb223
      Peter Zijlstra 提交于
      Previously, to probe the working set of a task, we'd use
      a very simple and crude method: mark all of its address
      space PROT_NONE.
      
      That method has various (obvious) disadvantages:
      
       - it samples the working set at dissimilar rates,
         giving some tasks a sampling quality advantage
         over others.
      
       - creates performance problems for tasks with very
         large working sets
      
       - over-samples processes with large address spaces but
         which only very rarely execute
      
      Improve that method by keeping a rotating offset into the
      address space that marks the current position of the scan,
      and advance it by a constant rate (in a CPU cycles execution
      proportional manner). If the offset reaches the last mapped
      address of the mm then it then it starts over at the first
      address.
      
      The per-task nature of the working set sampling functionality in this tree
      allows such constant rate, per task, execution-weight proportional sampling
      of the working set, with an adaptive sampling interval/frequency that
      goes from once per 100ms up to just once per 8 seconds.  The current
      sampling volume is 256 MB per interval.
      
      As tasks mature and converge their working set, so does the
      sampling rate slow down to just a trickle, 256 MB per 8
      seconds of CPU time executed.
      
      This, beyond being adaptive, also rate-limits rarely
      executing systems and does not over-sample on overloaded
      systems.
      
      [ In AutoNUMA speak, this patch deals with the effective sampling
        rate of the 'hinting page fault'. AutoNUMA's scanning is
        currently rate-limited, but it is also fundamentally
        single-threaded, executing in the knuma_scand kernel thread,
        so the limit in AutoNUMA is global and does not scale up with
        the number of CPUs, nor does it scan tasks in an execution
        proportional manner.
      
        So the idea of rate-limiting the scanning was first implemented
        in the AutoNUMA tree via a global rate limit. This patch goes
        beyond that by implementing an execution rate proportional
        working set sampling rate that is not implemented via a single
        global scanning daemon. ]
      
      [ Dan Carpenter pointed out a possible NULL pointer dereference in the
        first version of this patch. ]
      Based-on-idea-by: NAndrea Arcangeli <aarcange@redhat.com>
      Bug-Found-By: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      [ Wrote changelog and fixed bug. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      6e5fb223
    • P
      mm: numa: Add fault driven placement and migration · cbee9f88
      Peter Zijlstra 提交于
      NOTE: This patch is based on "sched, numa, mm: Add fault driven
      	placement and migration policy" but as it throws away all the policy
      	to just leave a basic foundation I had to drop the signed-offs-by.
      
      This patch creates a bare-bones method for setting PTEs pte_numa in the
      context of the scheduler that when faulted later will be faulted onto the
      node the CPU is running on.  In itself this does nothing useful but any
      placement policy will fundamentally depend on receiving hints on placement
      from fault context and doing something intelligent about it.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      cbee9f88
    • I
      Revert "sched/autogroup: Fix crash on reboot when autogroup is disabled" · c1ad41f1
      Ingo Molnar 提交于
      This reverts commit 5258f386,
      because the underlying autogroups bug got fixed upstream in
      a better way, via:
      
        fd8ef117 Revert "sched, autogroup: Stop going ahead if autogroup is disabled"
      
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Yong Zhang <yong.zhang0@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      c1ad41f1
  9. 29 11月, 2012 1 次提交
  10. 30 10月, 2012 1 次提交
    • M
      sched/autogroup: Fix crash on reboot when autogroup is disabled · 5258f386
      Mike Galbraith 提交于
      Due to these two commits:
      
        8323f26c sched: Fix race in task_group()
        800d4d30 sched, autogroup: Stop going ahead if autogroup is disabled
      
      ... autogroup scheduling's dynamic knobs are wrecked.
      
      With both patches applied, all you have to do to crash a box is
      disable autogroup during boot up, then reboot.. boom, NULL pointer
      dereference due to 800d4d30 not allowing autogroup to move things,
      and 8323f26c making that the only way to switch runqueues.
      
      Remove most of the (dysfunctional) knobs and turn the remaining
      sched_autogroup_enabled knob readonly.
      
      If the user fiddles with cgroups hereafter, once tasks
      are moved, autogroup won't mess with them again unless
      they call setsid().
      
      No knobs, no glitz, nada, just a cute little thing folks can
      turn on if they don't want to muck about with cgroups and/or
      systemd.
      Signed-off-by: NMike Galbraith <efault@gmx.de>
      Cc: Xiaotian Feng <xtfeng@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Xiaotian Feng <dannyfeng@tencent.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: <stable@vger.kernel.org> # v3.6
      Link: http://lkml.kernel.org/r/1351451963.4999.8.camel@maggy.simpson.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      5258f386
  11. 09 10月, 2012 1 次提交
  12. 06 10月, 2012 1 次提交
  13. 17 9月, 2012 1 次提交
  14. 04 9月, 2012 1 次提交
  15. 01 8月, 2012 1 次提交
  16. 31 7月, 2012 2 次提交
    • S
      sysctl: suppress kmemleak messages · fd4b616b
      Steven Rostedt 提交于
      register_sysctl_table() is a strange function, as it makes internal
      allocations (a header) to register a sysctl_table.  This header is a
      handle to the table that is created, and can be used to unregister the
      table.  But if the table is permanent and never unregistered, the header
      acts the same as a static variable.
      
      Unfortunately, this allocation of memory that is never expected to be
      freed fools kmemleak in thinking that we have leaked memory.  For those
      sysctl tables that are never unregistered, and have no pointer referencing
      them, kmemleak will think that these are memory leaks:
      
      unreferenced object 0xffff880079fb9d40 (size 192):
        comm "swapper/0", pid 0, jiffies 4294667316 (age 12614.152s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff8146b590>] kmemleak_alloc+0x73/0x98
          [<ffffffff8110a935>] kmemleak_alloc_recursive.constprop.42+0x16/0x18
          [<ffffffff8110b852>] __kmalloc+0x107/0x153
          [<ffffffff8116fa72>] kzalloc.constprop.8+0xe/0x10
          [<ffffffff811703c9>] __register_sysctl_paths+0xe1/0x160
          [<ffffffff81170463>] register_sysctl_paths+0x1b/0x1d
          [<ffffffff8117047d>] register_sysctl_table+0x18/0x1a
          [<ffffffff81afb0a1>] sysctl_init+0x10/0x14
          [<ffffffff81b05a6f>] proc_sys_init+0x2f/0x31
          [<ffffffff81b0584c>] proc_root_init+0xa5/0xa7
          [<ffffffff81ae5b7e>] start_kernel+0x3d0/0x40a
          [<ffffffff81ae52a7>] x86_64_start_reservations+0xae/0xb2
          [<ffffffff81ae53ad>] x86_64_start_kernel+0x102/0x111
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      The sysctl_base_table used by sysctl itself is one such instance that
      registers the table to never be unregistered.
      
      Use kmemleak_not_leak() to suppress the kmemleak false positive.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      Acked-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fd4b616b
    • K
      coredump: warn about unsafe suid_dumpable / core_pattern combo · 54b50199
      Kees Cook 提交于
      When suid_dumpable=2, detect unsafe core_pattern settings and warn when
      they are seen.
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Suggested-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alan Cox <alan@linux.intel.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      54b50199
  17. 30 7月, 2012 1 次提交
    • K
      fs: add link restrictions · 800179c9
      Kees Cook 提交于
      This adds symlink and hardlink restrictions to the Linux VFS.
      
      Symlinks:
      
      A long-standing class of security issues is the symlink-based
      time-of-check-time-of-use race, most commonly seen in world-writable
      directories like /tmp. The common method of exploitation of this flaw
      is to cross privilege boundaries when following a given symlink (i.e. a
      root process follows a symlink belonging to another user). For a likely
      incomplete list of hundreds of examples across the years, please see:
      http://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=/tmp
      
      The solution is to permit symlinks to only be followed when outside
      a sticky world-writable directory, or when the uid of the symlink and
      follower match, or when the directory owner matches the symlink's owner.
      
      Some pointers to the history of earlier discussion that I could find:
      
       1996 Aug, Zygo Blaxell
        http://marc.info/?l=bugtraq&m=87602167419830&w=2
       1996 Oct, Andrew Tridgell
        http://lkml.indiana.edu/hypermail/linux/kernel/9610.2/0086.html
       1997 Dec, Albert D Cahalan
        http://lkml.org/lkml/1997/12/16/4
       2005 Feb, Lorenzo Hernández García-Hierro
        http://lkml.indiana.edu/hypermail/linux/kernel/0502.0/1896.html
       2010 May, Kees Cook
        https://lkml.org/lkml/2010/5/30/144
      
      Past objections and rebuttals could be summarized as:
      
       - Violates POSIX.
         - POSIX didn't consider this situation and it's not useful to follow
           a broken specification at the cost of security.
       - Might break unknown applications that use this feature.
         - Applications that break because of the change are easy to spot and
           fix. Applications that are vulnerable to symlink ToCToU by not having
           the change aren't. Additionally, no applications have yet been found
           that rely on this behavior.
       - Applications should just use mkstemp() or O_CREATE|O_EXCL.
         - True, but applications are not perfect, and new software is written
           all the time that makes these mistakes; blocking this flaw at the
           kernel is a single solution to the entire class of vulnerability.
       - This should live in the core VFS.
         - This should live in an LSM. (https://lkml.org/lkml/2010/5/31/135)
       - This should live in an LSM.
         - This should live in the core VFS. (https://lkml.org/lkml/2010/8/2/188)
      
      Hardlinks:
      
      On systems that have user-writable directories on the same partition
      as system files, a long-standing class of security issues is the
      hardlink-based time-of-check-time-of-use race, most commonly seen in
      world-writable directories like /tmp. The common method of exploitation
      of this flaw is to cross privilege boundaries when following a given
      hardlink (i.e. a root process follows a hardlink created by another
      user). Additionally, an issue exists where users can "pin" a potentially
      vulnerable setuid/setgid file so that an administrator will not actually
      upgrade a system fully.
      
      The solution is to permit hardlinks to only be created when the user is
      already the existing file's owner, or if they already have read/write
      access to the existing file.
      
      Many Linux users are surprised when they learn they can link to files
      they have no access to, so this change appears to follow the doctrine
      of "least surprise". Additionally, this change does not violate POSIX,
      which states "the implementation may require that the calling process
      has permission to access the existing file"[1].
      
      This change is known to break some implementations of the "at" daemon,
      though the version used by Fedora and Ubuntu has been fixed[2] for
      a while. Otherwise, the change has been undisruptive while in use in
      Ubuntu for the last 1.5 years.
      
      [1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/linkat.html
      [2] http://anonscm.debian.org/gitweb/?p=collab-maint/at.git;a=commitdiff;h=f4114656c3a6c6f6070e315ffdf940a49eda3279
      
      This patch is based on the patches in Openwall and grsecurity, along with
      suggestions from Al Viro. I have added a sysctl to enable the protected
      behavior, and documentation.
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      800179c9
  18. 05 4月, 2012 1 次提交
  19. 29 3月, 2012 3 次提交
  20. 14 2月, 2012 1 次提交
  21. 25 1月, 2012 3 次提交
  22. 05 12月, 2011 1 次提交
  23. 01 11月, 2011 1 次提交
    • D
      kernel/sysctl.c: add cap_last_cap to /proc/sys/kernel · 73efc039
      Dan Ballard 提交于
      Userspace needs to know the highest valid capability of the running
      kernel, which right now cannot reliably be retrieved from the header files
      only.  The fact that this value cannot be determined properly right now
      creates various problems for libraries compiled on newer header files
      which are run on older kernels.  They assume capabilities are available
      which actually aren't.  libcap-ng is one example.  And we ran into the
      same problem with systemd too.
      
      Now the capability is exported in /proc/sys/kernel/cap_last_cap.
      
      [akpm@linux-foundation.org: make cap_last_cap const, per Ulrich]
      Signed-off-by: NDan Ballard <dan@mindstab.net>
      Cc: Randy Dunlap <rdunlap@xenotime.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Cc: Ulrich Drepper <drepper@akkadia.org>
      Cc: James Morris <jmorris@namei.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      73efc039
  24. 30 10月, 2011 1 次提交
  25. 14 8月, 2011 1 次提交
  26. 21 7月, 2011 1 次提交
  27. 04 6月, 2011 1 次提交
  28. 23 5月, 2011 1 次提交
  29. 20 5月, 2011 1 次提交
    • C
      arch/tile: support signal "exception-trace" hook · 571d76ac
      Chris Metcalf 提交于
      This change adds support for /proc/sys/debug/exception-trace to tile.
      Like x86 and sparc, by default it is set to "1", generating a one-line
      printk whenever a user process crashes.  By setting it to "2", we get
      a much more complete userspace diagnostic at crash time, including
      a user-space backtrace, register dump, and memory dump around the
      address of the crash.
      
      Some vestiges of the Tilera-internal version of this support are
      removed with this patch (the show_crashinfo variable and the
      arch_coredump_signal function).  We retain a "crashinfo" boot parameter
      which allows you to set the boot-time value of exception-trace.
      Signed-off-by: NChris Metcalf <cmetcalf@tilera.com>
      571d76ac