1. 07 1月, 2009 1 次提交
    • D
      mm: add dirty_background_bytes and dirty_bytes sysctls · 2da02997
      David Rientjes 提交于
      This change introduces two new sysctls to /proc/sys/vm:
      dirty_background_bytes and dirty_bytes.
      
      dirty_background_bytes is the counterpart to dirty_background_ratio and
      dirty_bytes is the counterpart to dirty_ratio.
      
      With growing memory capacities of individual machines, it's no longer
      sufficient to specify dirty thresholds as a percentage of the amount of
      dirtyable memory over the entire system.
      
      dirty_background_bytes and dirty_bytes specify quantities of memory, in
      bytes, that represent the dirty limits for the entire system.  If either
      of these values is set, its value represents the amount of dirty memory
      that is needed to commence either background or direct writeback.
      
      When a `bytes' or `ratio' file is written, its counterpart becomes a
      function of the written value.  For example, if dirty_bytes is written to
      be 8096, 8K of memory is required to commence direct writeback.
      dirty_ratio is then functionally equivalent to 8K / the amount of
      dirtyable memory:
      
      	dirtyable_memory = free pages + mapped pages + file cache
      
      	dirty_background_bytes = dirty_background_ratio * dirtyable_memory
      		-or-
      	dirty_background_ratio = dirty_background_bytes / dirtyable_memory
      
      		AND
      
      	dirty_bytes = dirty_ratio * dirtyable_memory
      		-or-
      	dirty_ratio = dirty_bytes / dirtyable_memory
      
      Only one of dirty_background_bytes and dirty_background_ratio may be
      specified at a time, and only one of dirty_bytes and dirty_ratio may be
      specified.  When one sysctl is written, the other appears as 0 when read.
      
      The `bytes' files operate on a page size granularity since dirty limits
      are compared with ZVC values, which are in page units.
      
      Prior to this change, the minimum dirty_ratio was 5 as implemented by
      get_dirty_limits() although /proc/sys/vm/dirty_ratio would show any user
      written value between 0 and 100.  This restriction is maintained, but
      dirty_bytes has a lower limit of only one page.
      
      Also prior to this change, the dirty_background_ratio could not equal or
      exceed dirty_ratio.  This restriction is maintained in addition to
      restricting dirty_background_bytes.  If either background threshold equals
      or exceeds that of the dirty threshold, it is implicitly set to half the
      dirty threshold.
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Andrea Righi <righi.andrea@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2da02997
  2. 05 1月, 2009 1 次提交
    • K
      proc: add /proc/*/stack · 2ec220e2
      Ken Chen 提交于
      /proc/*/stack adds the ability to query a task's stack trace. It is more
      useful than /proc/*/wchan as it provides full stack trace instead of single
      depth. Example output:
      
      	$ cat /proc/self/stack
      	[<c010a271>] save_stack_trace_tsk+0x17/0x35
      	[<c01827b4>] proc_pid_stack+0x4a/0x76
      	[<c018312d>] proc_single_show+0x4a/0x5e
      	[<c016bdec>] seq_read+0xf3/0x29f
      	[<c015a004>] vfs_read+0x6d/0x91
      	[<c015a0c1>] sys_read+0x3b/0x60
      	[<c0102eda>] syscall_call+0x7/0xb
      	[<ffffffff>] 0xffffffff
      
      [add save_stack_trace_tsk() on mips, ACK Ralf --adobriyan]
      Signed-off-by: NKen Chen <kenchen@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      2ec220e2
  3. 02 12月, 2008 1 次提交
    • D
      epoll: introduce resource usage limits · 7ef9964e
      Davide Libenzi 提交于
      It has been thought that the per-user file descriptors limit would also
      limit the resources that a normal user can request via the epoll
      interface.  Vegard Nossum reported a very simple program (a modified
      version attached) that can make a normal user to request a pretty large
      amount of kernel memory, well within the its maximum number of fds.  To
      solve such problem, default limits are now imposed, and /proc based
      configuration has been introduced.  A new directory has been created,
      named /proc/sys/fs/epoll/ and inside there, there are two configuration
      points:
      
        max_user_instances = Maximum number of devices - per user
      
        max_user_watches   = Maximum number of "watched" fds - per user
      
      The current default for "max_user_watches" limits the memory used by epoll
      to store "watches", to 1/32 of the amount of the low RAM.  As example, a
      256MB 32bit machine, will have "max_user_watches" set to roughly 90000.
      That should be enough to not break existing heavy epoll users.  The
      default value for "max_user_instances" is set to 128, that should be
      enough too.
      
      This also changes the userspace, because a new error code can now come out
      from EPOLL_CTL_ADD (-ENOSPC).  The EMFILE from epoll_create() was already
      listed, so that should be ok.
      
      [akpm@linux-foundation.org: use get_current_user()]
      Signed-off-by: NDavide Libenzi <davidel@xmailserver.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: <stable@kernel.org>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Reported-by: NVegard Nossum <vegardno@ifi.uio.no>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7ef9964e
  4. 31 10月, 2008 1 次提交
  5. 20 10月, 2008 2 次提交
    • A
      documentation: clarify dirty_ratio and dirty_background_ratio description · 7a6560e0
      Andrea Righi 提交于
      The current documentation of dirty_ratio and dirty_background_ratio is a
      bit misleading.
      
      In the documentation we say that they are "a percentage of total system
      memory", but the current page writeback policy, intead, is to apply the
      percentages to the dirtyable memory, that means free pages + reclaimable
      pages.
      
      Better to be more explicit to clarify this concept.
      Signed-off-by: NAndrea Righi <righi.andrea@gmail.com>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7a6560e0
    • K
      coredump_filter: add hugepage dumping · e575f111
      KOSAKI Motohiro 提交于
      Presently hugepage's vma has a VM_RESERVED flag in order not to be
      swapped.  But a VM_RESERVED vma isn't core dumped because this flag is
      often used for some kernel vmas (e.g.  vmalloc, sound related).
      
      Thus hugepages are never dumped and it can't be debugged easily.  Many
      developers want hugepages to be included into core-dump.
      
      However, We can't read generic VM_RESERVED area because this area is often
      IO mapping area.  then these area reading may change device state.  it is
      definitly undesiable side-effect.
      
      So adding a hugepage specific bit to the coredump filter is better.  It
      will be able to hugepage core dumping and doesn't cause any side-effect to
      any i/o devices.
      
      In additional, libhugetlb use hugetlb private mapping pages as anonymous
      page.  Then, hugepage private mapping pages should be core dumped by
      default.
      
      Then, /proc/[pid]/core_dump_filter has two new bits.
      
       - bit 5 mean hugetlb private mapping pages are dumped or not. (default: yes)
       - bit 6 mean hugetlb shared mapping pages are dumped or not.  (default: no)
      
      I tested by following method.
      
      % ulimit -c unlimited
      % ./crash_hugepage  50
      % ./crash_hugepage  50  -p
      % ls -lh
      % gdb ./crash_hugepage core
      %
      % echo 0x43 > /proc/self/coredump_filter
      % ./crash_hugepage  50
      % ./crash_hugepage  50  -p
      % ls -lh
      % gdb ./crash_hugepage core
      
      #include <stdlib.h>
      #include <stdio.h>
      #include <unistd.h>
      #include <sys/mman.h>
      #include <string.h>
      
      #include "hugetlbfs.h"
      
      int main(int argc, char** argv){
      	char* p;
      	int ch;
      	int mmap_flags = MAP_SHARED;
      	int fd;
      	int nr_pages;
      
      	while((ch = getopt(argc, argv, "p")) != -1) {
      		switch (ch) {
      		case 'p':
      			mmap_flags &= ~MAP_SHARED;
      			mmap_flags |= MAP_PRIVATE;
      			break;
      		default:
      			/* nothing*/
      			break;
      		}
      	}
      	argc -= optind;
      	argv += optind;
      
      	if (argc == 0){
      		printf("need # of pages\n");
      		exit(1);
      	}
      
      	nr_pages = atoi(argv[0]);
      	if (nr_pages < 2) {
      		printf("nr_pages must >2\n");
      		exit(1);
      	}
      
      	fd = hugetlbfs_unlinked_fd();
      	p = mmap(NULL, nr_pages * gethugepagesize(),
      		 PROT_READ|PROT_WRITE, mmap_flags, fd, 0);
      
      	sleep(2);
      
      	*(p + gethugepagesize()) = 1; /* COW */
      	sleep(2);
      
      	/* crash! */
      	*(int*)0 = 1;
      
      	return 0;
      }
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NKawai Hidehiro <hidehiro.kawai.ez@hitachi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: William Irwin <wli@holomorphy.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e575f111
  6. 17 10月, 2008 2 次提交
  7. 10 10月, 2008 3 次提交
  8. 14 9月, 2008 1 次提交
  9. 03 9月, 2008 1 次提交
  10. 27 7月, 2008 1 次提交
  11. 25 7月, 2008 1 次提交
    • E
      vmallocinfo: add NUMA information · a47a126a
      Eric Dumazet 提交于
      Christoph recently added /proc/vmallocinfo file to get information about
      vmalloc allocations.
      
      This patch adds NUMA specific information, giving number of pages
      allocated on each memory node.
      
      This should help to check that vmalloc() is able to respect NUMA policies.
      
      Example of output on a four nodes machine (one cpu per node)
      
      1) network hash tables are evenly spreaded on four nodes (OK) (Same
         point for inodes and dentries hash tables)
      
      2) iptables tables (x_tables) are correctly allocated on each cpu node
         (OK).
      
      3) sys_swapon() allocates its memory from one node only.
      
      4) each loaded module is using memory on one node.
      
      Sysadmins could tune their setup to change points 3) and 4) if necessary.
      
      grep "pages="  /proc/vmallocinfo
      0xffffc20000000000-0xffffc20000201000 2101248 alloc_large_system_hash+0x204/0x2c0 pages=512 vmalloc N0=128 N1=128 N2=128 N3=128
      0xffffc20000201000-0xffffc20000302000 1052672 alloc_large_system_hash+0x204/0x2c0 pages=256 vmalloc N0=64 N1=64 N2=64 N3=64
      0xffffc2000031a000-0xffffc2000031d000   12288 alloc_large_system_hash+0x204/0x2c0 pages=2 vmalloc N1=1 N2=1
      0xffffc2000031f000-0xffffc2000032b000   49152 cramfs_uncompress_init+0x2e/0x80 pages=11 vmalloc N0=3 N1=3 N2=2 N3=3
      0xffffc2000033e000-0xffffc20000341000   12288 sys_swapon+0x640/0xac0 pages=2 vmalloc N0=2
      0xffffc20000341000-0xffffc20000344000   12288 xt_alloc_table_info+0xfe/0x130 [x_tables] pages=2 vmalloc N0=2
      0xffffc20000344000-0xffffc20000347000   12288 xt_alloc_table_info+0xfe/0x130 [x_tables] pages=2 vmalloc N1=2
      0xffffc20000347000-0xffffc2000034a000   12288 xt_alloc_table_info+0xfe/0x130 [x_tables] pages=2 vmalloc N2=2
      0xffffc2000034a000-0xffffc2000034d000   12288 xt_alloc_table_info+0xfe/0x130 [x_tables] pages=2 vmalloc N3=2
      0xffffc20004381000-0xffffc20004402000  528384 alloc_large_system_hash+0x204/0x2c0 pages=128 vmalloc N0=32 N1=32 N2=32 N3=32
      0xffffc20004402000-0xffffc20004803000 4198400 alloc_large_system_hash+0x204/0x2c0 pages=1024 vmalloc vpages N0=256 N1=256 N2=256 N3=256
      0xffffc20004803000-0xffffc20004904000 1052672 alloc_large_system_hash+0x204/0x2c0 pages=256 vmalloc N0=64 N1=64 N2=64 N3=64
      0xffffc20004904000-0xffffc20004bec000 3047424 sys_swapon+0x640/0xac0 pages=743 vmalloc vpages N0=743
      0xffffffffa0000000-0xffffffffa000f000   61440 sys_init_module+0xc27/0x1d00 pages=14 vmalloc N1=14
      0xffffffffa000f000-0xffffffffa0014000   20480 sys_init_module+0xc27/0x1d00 pages=4 vmalloc N0=4
      0xffffffffa0014000-0xffffffffa0017000   12288 sys_init_module+0xc27/0x1d00 pages=2 vmalloc N0=2
      0xffffffffa0017000-0xffffffffa0022000   45056 sys_init_module+0xc27/0x1d00 pages=10 vmalloc N1=10
      0xffffffffa0022000-0xffffffffa0028000   24576 sys_init_module+0xc27/0x1d00 pages=5 vmalloc N3=5
      0xffffffffa0028000-0xffffffffa0050000  163840 sys_init_module+0xc27/0x1d00 pages=39 vmalloc N1=39
      0xffffffffa0050000-0xffffffffa0052000    8192 sys_init_module+0xc27/0x1d00 pages=1 vmalloc N1=1
      0xffffffffa0052000-0xffffffffa0056000   16384 sys_init_module+0xc27/0x1d00 pages=3 vmalloc N1=3
      0xffffffffa0056000-0xffffffffa0081000  176128 sys_init_module+0xc27/0x1d00 pages=42 vmalloc N3=42
      0xffffffffa0081000-0xffffffffa00ae000  184320 sys_init_module+0xc27/0x1d00 pages=44 vmalloc N3=44
      0xffffffffa00ae000-0xffffffffa00b1000   12288 sys_init_module+0xc27/0x1d00 pages=2 vmalloc N3=2
      0xffffffffa00b1000-0xffffffffa00b9000   32768 sys_init_module+0xc27/0x1d00 pages=7 vmalloc N0=7
      0xffffffffa00b9000-0xffffffffa00c4000   45056 sys_init_module+0xc27/0x1d00 pages=10 vmalloc N3=10
      0xffffffffa00c6000-0xffffffffa00e0000  106496 sys_init_module+0xc27/0x1d00 pages=25 vmalloc N2=25
      0xffffffffa00e0000-0xffffffffa00f1000   69632 sys_init_module+0xc27/0x1d00 pages=16 vmalloc N2=16
      0xffffffffa00f1000-0xffffffffa00f4000   12288 sys_init_module+0xc27/0x1d00 pages=2 vmalloc N3=2
      0xffffffffa00f4000-0xffffffffa00f7000   12288 sys_init_module+0xc27/0x1d00 pages=2 vmalloc N3=2
      
      [akpm@linux-foundation.org: fix comment]
      Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a47a126a
  12. 05 6月, 2008 1 次提交
    • M
      genirq: Expose default irq affinity mask (take 3) · 18404756
      Max Krasnyansky 提交于
      Current IRQ affinity interface does not provide a way to set affinity
      for the IRQs that will be allocated/activated in the future.
      This patch creates /proc/irq/default_smp_affinity that lets users set
      default affinity mask for the newly allocated IRQs. Changing the default
      does not affect affinity masks for the currently active IRQs, they
      have to be changed explicitly.
      
      Updated based on Paul J's comments and added some more documentation.
      Signed-off-by: NMax Krasnyansky <maxk@qualcomm.com>
      Cc: pj@sgi.com
      Cc: a.p.zijlstra@chello.nl
      Cc: tglx@linutronix.de
      Cc: rdunlap@xenotime.net
      Cc: mingo@elte.hu
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      18404756
  13. 30 4月, 2008 1 次提交
  14. 23 4月, 2008 2 次提交
  15. 12 3月, 2008 1 次提交
  16. 07 2月, 2008 1 次提交
    • E
      get rid of NR_OPEN and introduce a sysctl_nr_open · 9cfe015a
      Eric Dumazet 提交于
      NR_OPEN (historically set to 1024*1024) actually forbids processes to open
      more than 1024*1024 handles.
      
      Unfortunatly some production servers hit the not so 'ridiculously high
      value' of 1024*1024 file descriptors per process.
      
      Changing NR_OPEN is not considered safe because of vmalloc space potential
      exhaust.
      
      This patch introduces a new sysctl (/proc/sys/fs/nr_open) wich defaults to
      1024*1024, so that admins can decide to change this limit if their workload
      needs it.
      
      [akpm@linux-foundation.org: export it for sparc64]
      Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9cfe015a
  17. 06 2月, 2008 2 次提交
  18. 03 2月, 2008 1 次提交
  19. 02 2月, 2008 1 次提交
    • E
      [AUDIT] break large execve argument logging into smaller messages · de6bbd1d
      Eric Paris 提交于
      execve arguments can be quite large.  There is no limit on the number of
      arguments and a 4G limit on the size of an argument.
      
      this patch prints those aruguments in bite sized pieces.  a userspace size
      limitation of 8k was discovered so this keeps messages around 7.5k
      
      single arguments larger than 7.5k in length are split into multiple records
      and can be identified as aX[Y]=
      Signed-off-by: NEric Paris <eparis@redhat.com>
      de6bbd1d
  20. 01 2月, 2008 1 次提交
    • E
      [IPV4] route cache: Introduce rt_genid for smooth cache invalidation · 29e75252
      Eric Dumazet 提交于
      Current ip route cache implementation is not suited to large caches.
      
      We can consume a lot of CPU when cache must be invalidated, since we
      currently need to evict all cache entries, and this eviction is
      sometimes asynchronous. min_delay & max_delay can somewhat control this
      asynchronism behavior, but whole thing is a kludge, regularly triggering
      infamous soft lockup messages. When entries are still in use, this also
      consumes a lot of ram, filling dst_garbage.list.
      
      A better scheme is to use a generation identifier on each entry,
      so that cache invalidation can be performed by changing the table
      identifier, without having to scan all entries.
      No more delayed flushing, no more stalling when secret_interval expires.
      
      Invalidated entries will then be freed at GC time (controled by
      ip_rt_gc_timeout or stress), or when an invalidated entry is found
      in a chain when an insert is done.
      Thus we keep a normal equilibrium.
      
      This patch :
      - renames rt_hash_rnd to rt_genid (and makes it an atomic_t)
      - Adds a new rt_genid field to 'struct rtable' (filling a hole on 64bit)
      - Checks entry->rt_genid at appropriate places :
      29e75252
  21. 29 1月, 2008 1 次提交
  22. 20 10月, 2007 1 次提交
  23. 18 10月, 2007 1 次提交
    • J
      x86: expand /proc/interrupts to include missing vectors, v2 · 38e760a1
      Joe Korty 提交于
      Add missing IRQs and IRQ descriptions to /proc/interrupts.
      
      /proc/interrupts is most useful when it displays every IRQ vector in use by
      the system, not just those somebody thought would be interesting.
      
      This patch inserts the following vector displays to the i386 and x86_64
      platforms, as appropriate:
      
      	rescheduling interrupts
      	TLB flush interrupts
      	function call interrupts
      	thermal event interrupts
      	threshold interrupts
      	spurious interrupts
      
      A threshold interrupt occurs when ECC memory correction is occuring at too
      high a frequency.  Thresholds are used by the ECC hardware as occasional
      ECC failures are part of normal operation, but long sequences of ECC
      failures usually indicate a memory chip that is about to fail.
      
      Thermal event interrupts occur when a temperature threshold has been
      exceeded for some CPU chip.  IIRC, a thermal interrupt is also generated
      when the temperature drops back to a normal level.
      
      A spurious interrupt is an interrupt that was raised then lowered by the
      device before it could be fully processed by the APIC.  Hence the apic sees
      the interrupt but does not know what device it came from.  For this case
      the APIC hardware will assume a vector of 0xff.
      
      Rescheduling, call, and TLB flush interrupts are sent from one CPU to
      another per the needs of the OS.  Typically, their statistics would be used
      to discover if an interrupt flood of the given type has been occuring.
      
      AK: merged v2 and v4 which had some more tweaks
      AK: replace Local interrupts with Local timer interrupts
      AK: Fixed description of interrupt types.
      
      [ tglx: arch/x86 adaptation ]
      [ mingo: small cleanup ]
      Signed-off-by: NJoe Korty <joe.korty@ccur.com>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Cc: Tim Hockin <thockin@hockin.org>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      38e760a1
  24. 20 7月, 2007 2 次提交
  25. 18 7月, 2007 1 次提交
    • M
      handle kernelcore=: generic · ed7ed365
      Mel Gorman 提交于
      This patch adds the kernelcore= parameter for x86.
      
      Once all patches are applied, a new command-line parameter exist and a new
      sysctl.  This patch adds the necessary documentation.
      
      From: Yasunori Goto <y-goto@jp.fujitsu.com>
      
        When "kernelcore" boot option is specified, kernel can't boot up on ia64
        because of an infinite loop.  In addition, the parsing code can be handled
        in an architecture-independent manner.
      
        This patch uses common code to handle the kernelcore= parameter.  It is
        only available to architectures that support arch-independent zone-sizing
        (i.e.  define CONFIG_ARCH_POPULATES_NODE_MAP).  Other architectures will
        ignore the boot parameter.
      
      [bunk@stusta.de: make cmdline_parse_kernelcore() static]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NYasunori Goto <y-goto@jp.fujitsu.com>
      Acked-by: NAndy Whitcroft <apw@shadowen.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ed7ed365
  26. 17 7月, 2007 1 次提交
  27. 09 5月, 2007 2 次提交
    • R
      Fix more "deprecated" spellos. · 8b60756a
      Randy Dunlap 提交于
      Signed-off-by: NRandy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: NAdrian Bunk <bunk@stusta.de>
      8b60756a
    • K
      proc: maps protection · 5096add8
      Kees Cook 提交于
      The /proc/pid/ "maps", "smaps", and "numa_maps" files contain sensitive
      information about the memory location and usage of processes.  Issues:
      
      - maps should not be world-readable, especially if programs expect any
        kind of ASLR protection from local attackers.
      - maps cannot just be 0400 because "-D_FORTIFY_SOURCE=2 -O2" makes glibc
        check the maps when %n is in a *printf call, and a setuid(getuid())
        process wouldn't be able to read its own maps file.  (For reference
        see http://lkml.org/lkml/2006/1/22/150)
      - a system-wide toggle is needed to allow prior behavior in the case of
        non-root applications that depend on access to the maps contents.
      
      This change implements a check using "ptrace_may_attach" before allowing
      access to read the maps contents.  To control this protection, the new knob
      /proc/sys/kernel/maps_protect has been added, with corresponding updates to
      the procfs documentation.
      
      [akpm@linux-foundation.org: build fixes]
      [akpm@linux-foundation.org: New sysctl numbers are old hat]
      Signed-off-by: NKees Cook <kees@outflux.net>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5096add8
  28. 08 5月, 2007 1 次提交
    • D
      smaps: add clear_refs file to clear reference · b813e931
      David Rientjes 提交于
      Adds /proc/pid/clear_refs.  When any non-zero number is written to this file,
      pte_mkold() and ClearPageReferenced() is called for each pte and its
      corresponding page, respectively, in that task's VMAs.  This file is only
      writable by the user who owns the task.
      
      It is now possible to measure _approximately_ how much memory a task is using
      by clearing the reference bits with
      
      	echo 1 > /proc/pid/clear_refs
      
      and checking the reference count for each VMA from the /proc/pid/smaps output
      at a measured time interval.  For example, to observe the approximate change
      in memory footprint for a task, write a script that clears the references
      (echo 1 > /proc/pid/clear_refs), sleeps, and then greps for Pgs_Referenced and
      extracts the size in kB.  Add the sizes for each VMA together for the total
      referenced footprint.  Moments later, repeat the process and observe the
      difference.
      
      For example, using an efficient Mozilla:
      
      	accumulated time		referenced memory
      	----------------		-----------------
      		 0 s				 408 kB
      		 1 s				 408 kB
      		 2 s				 556 kB
      		 3 s				1028 kB
      		 4 s				 872 kB
      		 5 s				1956 kB
      		 6 s				 416 kB
      		 7 s				1560 kB
      		 8 s				2336 kB
      		 9 s				1044 kB
      		10 s				 416 kB
      
      This is a valuable tool to get an approximate measurement of the memory
      footprint for a task.
      
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      [akpm@linux-foundation.org: build fixes]
      [mpm@selenic.com: rename for_each_pmd]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b813e931
  29. 26 4月, 2007 1 次提交
  30. 05 3月, 2007 1 次提交
  31. 30 11月, 2006 2 次提交