1. 14 11月, 2008 1 次提交
  2. 20 10月, 2008 1 次提交
    • K
      coredump_filter: add hugepage dumping · e575f111
      KOSAKI Motohiro 提交于
      Presently hugepage's vma has a VM_RESERVED flag in order not to be
      swapped.  But a VM_RESERVED vma isn't core dumped because this flag is
      often used for some kernel vmas (e.g.  vmalloc, sound related).
      
      Thus hugepages are never dumped and it can't be debugged easily.  Many
      developers want hugepages to be included into core-dump.
      
      However, We can't read generic VM_RESERVED area because this area is often
      IO mapping area.  then these area reading may change device state.  it is
      definitly undesiable side-effect.
      
      So adding a hugepage specific bit to the coredump filter is better.  It
      will be able to hugepage core dumping and doesn't cause any side-effect to
      any i/o devices.
      
      In additional, libhugetlb use hugetlb private mapping pages as anonymous
      page.  Then, hugepage private mapping pages should be core dumped by
      default.
      
      Then, /proc/[pid]/core_dump_filter has two new bits.
      
       - bit 5 mean hugetlb private mapping pages are dumped or not. (default: yes)
       - bit 6 mean hugetlb shared mapping pages are dumped or not.  (default: no)
      
      I tested by following method.
      
      % ulimit -c unlimited
      % ./crash_hugepage  50
      % ./crash_hugepage  50  -p
      % ls -lh
      % gdb ./crash_hugepage core
      %
      % echo 0x43 > /proc/self/coredump_filter
      % ./crash_hugepage  50
      % ./crash_hugepage  50  -p
      % ls -lh
      % gdb ./crash_hugepage core
      
      #include <stdlib.h>
      #include <stdio.h>
      #include <unistd.h>
      #include <sys/mman.h>
      #include <string.h>
      
      #include "hugetlbfs.h"
      
      int main(int argc, char** argv){
      	char* p;
      	int ch;
      	int mmap_flags = MAP_SHARED;
      	int fd;
      	int nr_pages;
      
      	while((ch = getopt(argc, argv, "p")) != -1) {
      		switch (ch) {
      		case 'p':
      			mmap_flags &= ~MAP_SHARED;
      			mmap_flags |= MAP_PRIVATE;
      			break;
      		default:
      			/* nothing*/
      			break;
      		}
      	}
      	argc -= optind;
      	argv += optind;
      
      	if (argc == 0){
      		printf("need # of pages\n");
      		exit(1);
      	}
      
      	nr_pages = atoi(argv[0]);
      	if (nr_pages < 2) {
      		printf("nr_pages must >2\n");
      		exit(1);
      	}
      
      	fd = hugetlbfs_unlinked_fd();
      	p = mmap(NULL, nr_pages * gethugepagesize(),
      		 PROT_READ|PROT_WRITE, mmap_flags, fd, 0);
      
      	sleep(2);
      
      	*(p + gethugepagesize()) = 1; /* COW */
      	sleep(2);
      
      	/* crash! */
      	*(int*)0 = 1;
      
      	return 0;
      }
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NKawai Hidehiro <hidehiro.kawai.ez@hitachi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: William Irwin <wli@holomorphy.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e575f111
  3. 16 10月, 2008 1 次提交
  4. 14 9月, 2008 1 次提交
    • F
      timers: fix itimer/many thread hang · f06febc9
      Frank Mayhar 提交于
      Overview
      
      This patch reworks the handling of POSIX CPU timers, including the
      ITIMER_PROF, ITIMER_VIRT timers and rlimit handling.  It was put together
      with the help of Roland McGrath, the owner and original writer of this code.
      
      The problem we ran into, and the reason for this rework, has to do with using
      a profiling timer in a process with a large number of threads.  It appears
      that the performance of the old implementation of run_posix_cpu_timers() was
      at least O(n*3) (where "n" is the number of threads in a process) or worse.
      Everything is fine with an increasing number of threads until the time taken
      for that routine to run becomes the same as or greater than the tick time, at
      which point things degrade rather quickly.
      
      This patch fixes bug 9906, "Weird hang with NPTL and SIGPROF."
      
      Code Changes
      
      This rework corrects the implementation of run_posix_cpu_timers() to make it
      run in constant time for a particular machine.  (Performance may vary between
      one machine and another depending upon whether the kernel is built as single-
      or multiprocessor and, in the latter case, depending upon the number of
      running processors.)  To do this, at each tick we now update fields in
      signal_struct as well as task_struct.  The run_posix_cpu_timers() function
      uses those fields to make its decisions.
      
      We define a new structure, "task_cputime," to contain user, system and
      scheduler times and use these in appropriate places:
      
      struct task_cputime {
      	cputime_t utime;
      	cputime_t stime;
      	unsigned long long sum_exec_runtime;
      };
      
      This is included in the structure "thread_group_cputime," which is a new
      substructure of signal_struct and which varies for uniprocessor versus
      multiprocessor kernels.  For uniprocessor kernels, it uses "task_cputime" as
      a simple substructure, while for multiprocessor kernels it is a pointer:
      
      struct thread_group_cputime {
      	struct task_cputime totals;
      };
      
      struct thread_group_cputime {
      	struct task_cputime *totals;
      };
      
      We also add a new task_cputime substructure directly to signal_struct, to
      cache the earliest expiration of process-wide timers, and task_cputime also
      replaces the it_*_expires fields of task_struct (used for earliest expiration
      of thread timers).  The "thread_group_cputime" structure contains process-wide
      timers that are updated via account_user_time() and friends.  In the non-SMP
      case the structure is a simple aggregator; unfortunately in the SMP case that
      simplicity was not achievable due to cache-line contention between CPUs (in
      one measured case performance was actually _worse_ on a 16-cpu system than
      the same test on a 4-cpu system, due to this contention).  For SMP, the
      thread_group_cputime counters are maintained as a per-cpu structure allocated
      using alloc_percpu().  The timer functions update only the timer field in
      the structure corresponding to the running CPU, obtained using per_cpu_ptr().
      
      We define a set of inline functions in sched.h that we use to maintain the
      thread_group_cputime structure and hide the differences between UP and SMP
      implementations from the rest of the kernel.  The thread_group_cputime_init()
      function initializes the thread_group_cputime structure for the given task.
      The thread_group_cputime_alloc() is a no-op for UP; for SMP it calls the
      out-of-line function thread_group_cputime_alloc_smp() to allocate and fill
      in the per-cpu structures and fields.  The thread_group_cputime_free()
      function, also a no-op for UP, in SMP frees the per-cpu structures.  The
      thread_group_cputime_clone_thread() function (also a UP no-op) for SMP calls
      thread_group_cputime_alloc() if the per-cpu structures haven't yet been
      allocated.  The thread_group_cputime() function fills the task_cputime
      structure it is passed with the contents of the thread_group_cputime fields;
      in UP it's that simple but in SMP it must also safely check that tsk->signal
      is non-NULL (if it is it just uses the appropriate fields of task_struct) and,
      if so, sums the per-cpu values for each online CPU.  Finally, the three
      functions account_group_user_time(), account_group_system_time() and
      account_group_exec_runtime() are used by timer functions to update the
      respective fields of the thread_group_cputime structure.
      
      Non-SMP operation is trivial and will not be mentioned further.
      
      The per-cpu structure is always allocated when a task creates its first new
      thread, via a call to thread_group_cputime_clone_thread() from copy_signal().
      It is freed at process exit via a call to thread_group_cputime_free() from
      cleanup_signal().
      
      All functions that formerly summed utime/stime/sum_sched_runtime values from
      from all threads in the thread group now use thread_group_cputime() to
      snapshot the values in the thread_group_cputime structure or the values in
      the task structure itself if the per-cpu structure hasn't been allocated.
      
      Finally, the code in kernel/posix-cpu-timers.c has changed quite a bit.
      The run_posix_cpu_timers() function has been split into a fast path and a
      slow path; the former safely checks whether there are any expired thread
      timers and, if not, just returns, while the slow path does the heavy lifting.
      With the dedicated thread group fields, timers are no longer "rebalanced" and
      the process_timer_rebalance() function and related code has gone away.  All
      summing loops are gone and all code that used them now uses the
      thread_group_cputime() inline.  When process-wide timers are set, the new
      task_cputime structure in signal_struct is used to cache the earliest
      expiration; this is checked in the fast path.
      
      Performance
      
      The fix appears not to add significant overhead to existing operations.  It
      generally performs the same as the current code except in two cases, one in
      which it performs slightly worse (Case 5 below) and one in which it performs
      very significantly better (Case 2 below).  Overall it's a wash except in those
      two cases.
      
      I've since done somewhat more involved testing on a dual-core Opteron system.
      
      Case 1: With no itimer running, for a test with 100,000 threads, the fixed
      	kernel took 1428.5 seconds, 513 seconds more than the unfixed system,
      	all of which was spent in the system.  There were twice as many
      	voluntary context switches with the fix as without it.
      
      Case 2: With an itimer running at .01 second ticks and 4000 threads (the most
      	an unmodified kernel can handle), the fixed kernel ran the test in
      	eight percent of the time (5.8 seconds as opposed to 70 seconds) and
      	had better tick accuracy (.012 seconds per tick as opposed to .023
      	seconds per tick).
      
      Case 3: A 4000-thread test with an initial timer tick of .01 second and an
      	interval of 10,000 seconds (i.e. a timer that ticks only once) had
      	very nearly the same performance in both cases:  6.3 seconds elapsed
      	for the fixed kernel versus 5.5 seconds for the unfixed kernel.
      
      With fewer threads (eight in these tests), the Case 1 test ran in essentially
      the same time on both the modified and unmodified kernels (5.2 seconds versus
      5.8 seconds).  The Case 2 test ran in about the same time as well, 5.9 seconds
      versus 5.4 seconds but again with much better tick accuracy, .013 seconds per
      tick versus .025 seconds per tick for the unmodified kernel.
      
      Since the fix affected the rlimit code, I also tested soft and hard CPU limits.
      
      Case 4: With a hard CPU limit of 20 seconds and eight threads (and an itimer
      	running), the modified kernel was very slightly favored in that while
      	it killed the process in 19.997 seconds of CPU time (5.002 seconds of
      	wall time), only .003 seconds of that was system time, the rest was
      	user time.  The unmodified kernel killed the process in 20.001 seconds
      	of CPU (5.014 seconds of wall time) of which .016 seconds was system
      	time.  Really, though, the results were too close to call.  The results
      	were essentially the same with no itimer running.
      
      Case 5: With a soft limit of 20 seconds and a hard limit of 2000 seconds
      	(where the hard limit would never be reached) and an itimer running,
      	the modified kernel exhibited worse tick accuracy than the unmodified
      	kernel: .050 seconds/tick versus .028 seconds/tick.  Otherwise,
      	performance was almost indistinguishable.  With no itimer running this
      	test exhibited virtually identical behavior and times in both cases.
      
      In times past I did some limited performance testing.  those results are below.
      
      On a four-cpu Opteron system without this fix, a sixteen-thread test executed
      in 3569.991 seconds, of which user was 3568.435s and system was 1.556s.  On
      the same system with the fix, user and elapsed time were about the same, but
      system time dropped to 0.007 seconds.  Performance with eight, four and one
      thread were comparable.  Interestingly, the timer ticks with the fix seemed
      more accurate:  The sixteen-thread test with the fix received 149543 ticks
      for 0.024 seconds per tick, while the same test without the fix received 58720
      for 0.061 seconds per tick.  Both cases were configured for an interval of
      0.01 seconds.  Again, the other tests were comparable.  Each thread in this
      test computed the primes up to 25,000,000.
      
      I also did a test with a large number of threads, 100,000 threads, which is
      impossible without the fix.  In this case each thread computed the primes only
      up to 10,000 (to make the runtime manageable).  System time dominated, at
      1546.968 seconds out of a total 2176.906 seconds (giving a user time of
      629.938s).  It received 147651 ticks for 0.015 seconds per tick, still quite
      accurate.  There is obviously no comparable test without the fix.
      Signed-off-by: NFrank Mayhar <fmayhar@google.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      f06febc9
  5. 27 7月, 2008 1 次提交
  6. 26 7月, 2008 2 次提交
  7. 25 7月, 2008 1 次提交
    • N
      ELF loader support for auxvec base platform string · 483fad1c
      Nathan Lynch 提交于
      Some IBM POWER-based platforms have the ability to run in a
      mode which mostly appears to the OS as a different processor from the
      actual hardware.  For example, a Power6 system may appear to be a
      Power5+, which makes the AT_PLATFORM value "power5+".  This means that
      programs are restricted to the ISA supported by Power5+;
      Power6-specific instructions are treated as illegal.
      
      However, some applications (virtual machines, optimized libraries) can
      benefit from knowledge of the underlying CPU model.  A new aux vector
      entry, AT_BASE_PLATFORM, will denote the actual hardware.  For
      example, on a Power6 system in Power5+ compatibility mode, AT_PLATFORM
      will be "power5+" and AT_BASE_PLATFORM will be "power6".  The idea is
      that AT_PLATFORM indicates the instruction set supported, while
      AT_BASE_PLATFORM indicates the underlying microarchitecture.
      
      If the architecture has defined ELF_BASE_PLATFORM, copy that value to
      the user stack in the same manner as ELF_PLATFORM.
      Signed-off-by: NNathan Lynch <ntl@pobox.com>
      Acked-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      483fad1c
  8. 23 7月, 2008 1 次提交
    • J
      execve filename: document and export via auxiliary vector · 65191087
      John Reiser 提交于
      The Linux kernel puts the filename argument of execve() into the new
      address space.  Many developers are surprised to learn this.  Those who
      know and could use it, object "But it's not documented."
      
      Those who want to use it dislike the expression
        (char *)(1+ strlen(env[-1+ n_env]) + env[-1+ n_env])
      because it requires locating the last original environment variable,
      and assumes that the filename follows the characters.
      
      This patch documents the insertion of the filename, and makes it easier
      to find by adding a new tag AT_EXECFN in the ElfXX_auxv_t; see <elf.h>.
      
      In many cases readlink("/proc/self/exe",) gives the same answer.  But if
      all the original pages get unmapped, then the kernel erases the symlink
      for /proc/self/exe.  This can happen when a program decompressor does a
      good job of cleaning up after uncompressing directly to memory, so that
      the address space of the target program looks the same as if compression
      had never happened.  One example is http://upx.sourceforge.net .
      
      One notable use of the underlying concept (what path containED the
      executable) is glibc expanding $ORIGIN in DT_RUNPATH.  In practice for
      the near term, it may be a good idea for user-mode code to use both
      /proc/self/exe and AT_EXECFN as fall-back methods for each other.
      /proc/self/exe can fail due to unmapping, AT_EXECFN can fail because it
      won't be present on non-new systems.  The auxvec or {AT_EXECFN}.d_val
      also can get overwritten, although in nearly all cases this would be the
      result of a bug.
      
      The runtime cost is one NEW_AUX_ENT using two words of stack space.  The
      underlying value is maintained already as bprm->exec; setup_arg_pages()
      in fs/exec.c slides it for stack_shift, etc.
      Signed-off-by: NJohn Reiser <jreiser@BitWagon.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Jakub Jelinek <jakub@redhat.com>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      65191087
  9. 17 6月, 2008 1 次提交
  10. 17 5月, 2008 2 次提交
  11. 29 4月, 2008 2 次提交
  12. 25 4月, 2008 1 次提交
    • A
      [PATCH] sanitize handling of shared descriptor tables in failing execve() · fd8328be
      Al Viro 提交于
      * unshare_files() can fail; doing it after irreversible actions is wrong
        and de_thread() is certainly irreversible.
      * since we do it unconditionally anyway, we might as well do it in do_execve()
        and save ourselves the PITA in binfmt handlers, etc.
      * while we are at it, binfmt_som actually leaked files_struct on failure.
      
      As a side benefit, unshare_files(), put_files_struct() and reset_files_struct()
      become unexported.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      fd8328be
  13. 05 3月, 2008 1 次提交
  14. 09 2月, 2008 2 次提交
    • A
      Remove a.out interpreter support in ELF loader · d20894a2
      Andi Kleen 提交于
      Following the deprecation schedule the a.out ELF interpreter support
      is removed now with this patch. a.out ELF interpreters were an transition
      feature for moving a.out systems to ELF, but they're unlikely to be still
      needed. Pure a.out systems will still work of course. This allows to
      simplify the hairy ELF loader.
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d20894a2
    • D
      aout: suppress A.OUT library support if !CONFIG_ARCH_SUPPORTS_AOUT · 7fa30315
      David Howells 提交于
      Suppress A.OUT library support if CONFIG_ARCH_SUPPORTS_AOUT is not set.
      
      Not all architectures support the A.OUT binfmt, so the ELF binfmt should not
      be permitted to go looking for A.OUT libraries to load in such a case.  Not
      only that, but under such conditions A.OUT core dumps are not produced either.
      
      To make this work, this patch also does the following:
      
       (1) Makes the existence of the contents of linux/a.out.h contingent on
           CONFIG_ARCH_SUPPORTS_AOUT.
      
       (2) Renames dump_thread() to aout_dump_thread() as it's only called by A.OUT
           core dumping code.
      
       (3) Moves aout_dump_thread() into asm/a.out-core.h and makes it inline.  This
           is then included only where needed.  This means that this bit of arch
           code will be stored in the appropriate A.OUT binfmt module rather than
           the core kernel.
      
       (4) Drops A.OUT support for Blackfin (according to Mike Frysinger it's not
           needed) and FRV.
      
      This patch depends on the previous patch to move STACK_TOP[_MAX] out of
      asm/a.out.h and into asm/processor.h as they're required whether or not A.OUT
      format is available.
      
      [jdike@addtoit.com: uml: re-remove accidentally restored code]
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: NJeff Dike <jdike@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7fa30315
  15. 07 2月, 2008 1 次提交
  16. 04 2月, 2008 1 次提交
  17. 30 1月, 2008 6 次提交
    • A
      x86: remove iBCS support · 612a95b4
      Andi Kleen 提交于
      ibcs2 support has never been supported on 2.6 kernels as far as I know,
      and if it has it must have been an external patch.  Anyways, if anybody
      applies an external patch they could as well readd the ibcs checking
      code to the ELF loader in the same patch.  But there is no reason to
      keep this code running in all Linux kernels.  This will save at least
      two strcmps each ELF execution.
      
      No deprecation period because it could not have been used anyway.
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      612a95b4
    • R
      elf core dump: notes user_regset · 4206d3aa
      Roland McGrath 提交于
      This modifies the ELF core dump code under #ifdef CORE_DUMP_USE_REGSET.
      It changes nothing when this macro is not defined.  When it's #define'd
      by some arch header (e.g. asm/elf.h), the arch must support the
      user_regset (linux/regset.h) interface for reading thread state.
      
      This provides an alternate version of note segment writing that is based
      purely on the user_regset interfaces.  When CORE_DUMP_USE_REGSET is set,
      the arch need not define macros such as ELF_CORE_COPY_REGS and ELF_ARCH.
      All that information is taken from the user_regset data structures.
      The core dumps come out exactly the same if arch's definitions for its
      user_regset details are correct.
      Signed-off-by: NRoland McGrath <roland@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      4206d3aa
    • R
      elf core dump: notes reorg · 3aba481f
      Roland McGrath 提交于
      This pulls out the code for writing the notes segment of an ELF core dump
      into separate functions.  This cleanly isolates into one cluster of
      functions everything that deals with the note formats and the hooks into
      arch code to fill them.  The top-level elf_core_dump function itself now
      deals purely with the generic ELF format and the memory segments.
      
      This only moves code around into functions that can be inlined away.
      It should not change any behavior at all.
      Signed-off-by: NRoland McGrath <roland@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      3aba481f
    • A
      x86: PIE executable randomization, checkpatch fixes · bb1ad820
      Andrew Morton 提交于
      #39: FILE: arch/ia64/ia32/binfmt_elf32.c:229:
      +elf32_map (struct file *filep, unsigned long addr, struct elf_phdr *eppnt, int prot, int type, unsigned long unused)
      
      WARNING: no space between function name and open parenthesis '('
      #39: FILE: arch/ia64/ia32/binfmt_elf32.c:229:
      +elf32_map (struct file *filep, unsigned long addr, struct elf_phdr *eppnt, int prot, int type, unsigned long unused)
      
      WARNING: line over 80 characters
      #67: FILE: arch/x86/kernel/sys_x86_64.c:80:
      +			new_begin = randomize_range(*begin, *begin + 0x02000000, 0);
      
      ERROR: use tabs not spaces
      #110: FILE: arch/x86/kernel/sys_x86_64.c:185:
      + ^I        mm->cached_hole_size = 0;$
      
      ERROR: use tabs not spaces
      #111: FILE: arch/x86/kernel/sys_x86_64.c:186:
      + ^I^Imm->free_area_cache = mm->mmap_base;$
      
      ERROR: use tabs not spaces
      #112: FILE: arch/x86/kernel/sys_x86_64.c:187:
      + ^I}$
      
      ERROR: use tabs not spaces
      #141: FILE: arch/x86/kernel/sys_x86_64.c:216:
      + ^I^I/* remember the largest hole we saw so far */$
      
      ERROR: use tabs not spaces
      #142: FILE: arch/x86/kernel/sys_x86_64.c:217:
      + ^I^Iif (addr + mm->cached_hole_size < vma->vm_start)$
      
      ERROR: use tabs not spaces
      #143: FILE: arch/x86/kernel/sys_x86_64.c:218:
      + ^I^I        mm->cached_hole_size = vma->vm_start - addr;$
      
      ERROR: use tabs not spaces
      #157: FILE: arch/x86/kernel/sys_x86_64.c:232:
      +  ^Imm->free_area_cache = TASK_UNMAPPED_BASE;$
      
      ERROR: need a space before the open parenthesis '('
      #291: FILE: arch/x86/mm/mmap_64.c:101:
      +	} else if(mmap_is_legacy()) {
      
      WARNING: braces {} are not necessary for single statement blocks
      #302: FILE: arch/x86/mm/mmap_64.c:112:
      +	if (current->flags & PF_RANDOMIZE) {
      +		mm->mmap_base += ((long)rnd) << PAGE_SHIFT;
      +	}
      
      WARNING: line over 80 characters
      #314: FILE: fs/binfmt_elf.c:48:
      +static unsigned long elf_map (struct file *, unsigned long, struct elf_phdr *, int, int, unsigned long);
      
      WARNING: no space between function name and open parenthesis '('
      #314: FILE: fs/binfmt_elf.c:48:
      +static unsigned long elf_map (struct file *, unsigned long, struct elf_phdr *, int, int, unsigned long);
      
      WARNING: line over 80 characters
      #429: FILE: fs/binfmt_elf.c:438:
      +					   eppnt, elf_prot, elf_type, total_size);
      
      ERROR: need space after that ',' (ctx:VxV)
      #480: FILE: fs/binfmt_elf.c:939:
      +				elf_prot, elf_flags,0);
       				                   ^
      
      total: 9 errors, 7 warnings, 461 lines checked
      Your patch has style problems, please review.  If any of these errors
      are false positives report them to the maintainer, see
      CHECKPATCH in MAINTAINERS.
      
      Please run checkpatch prior to sending patches
      
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Jakub Jelinek <jakub@redhat.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Roland McGrath <roland@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      bb1ad820
    • J
      x86: PIE executable randomization · cc503c1b
      Jiri Kosina 提交于
      main executable of (specially compiled/linked -pie/-fpie) ET_DYN binaries
      onto a random address (in cases in which mmap() is allowed to perform a
      randomization).
      
      The code has been extraced from Ingo's exec-shield patch
      http://people.redhat.com/mingo/exec-shield/
      
      [akpm@linux-foundation.org: fix used-uninitialsied warning]
      [kamezawa.hiroyu@jp.fujitsu.com: fixed ia32 ELF on x86_64 handling]
      Signed-off-by: NJiri Kosina <jkosina@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Jakub Jelinek <jakub@redhat.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      cc503c1b
    • J
      x86: randomize brk · c1d171a0
      Jiri Kosina 提交于
      Randomize the location of the heap (brk) for i386 and x86_64.  The range is
      randomized in the range starting at current brk location up to 0x02000000
      offset for both architectures.  This, together with
      pie-executable-randomization.patch and
      pie-executable-randomization-fix.patch, should make the address space
      randomization on i386 and x86_64 complete.
      
      Arjan says:
      
      This is known to break older versions of some emacs variants, whose dumper
      code assumed that the last variable declared in the program is equal to the
      start of the dynamically allocated memory region.
      
      (The dumper is the code where emacs effectively dumps core at the end of it's
      compilation stage; this coredump is then loaded as the main program during
      normal use)
      
      iirc this was 5 years or so; we found this way back when I was at RH and we
      first did the security stuff there (including this brk randomization).  It
      wasn't all variants of emacs, and it got fixed as a result (I vaguely remember
      that emacs already had code to deal with it for other archs/oses, just
      ifdeffed wrongly).
      
      It's a rare and wrong assumption as a general thing, just on x86 it mostly
      happened to be true (but to be honest, it'll break too if gcc does
      something fancy or if the linker does a non-standard order).  Still its
      something we should at least document.
      
      Note 2: afaik it only broke the emacs *build*.  I'm not 100% sure about that
      (it IS 5 years ago) though.
      
      [ akpm@linux-foundation.org: deuglification ]
      Signed-off-by: NJiri Kosina <jkosina@suse.cz>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Jakub Jelinek <jakub@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      c1d171a0
  18. 08 1月, 2008 1 次提交
  19. 20 10月, 2007 2 次提交
    • P
      pid namespaces: changes to show virtual ids to user · b488893a
      Pavel Emelyanov 提交于
      This is the largest patch in the set. Make all (I hope) the places where
      the pid is shown to or get from user operate on the virtual pids.
      
      The idea is:
       - all in-kernel data structures must store either struct pid itself
         or the pid's global nr, obtained with pid_nr() call;
       - when seeking the task from kernel code with the stored id one
         should use find_task_by_pid() call that works with global pids;
       - when showing pid's numerical value to the user the virtual one
         should be used, but however when one shows task's pid outside this
         task's namespace the global one is to be used;
       - when getting the pid from userspace one need to consider this as
         the virtual one and use appropriate task/pid-searching functions.
      
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: nuther build fix]
      [akpm@linux-foundation.org: yet nuther build fix]
      [akpm@linux-foundation.org: remove unneeded casts]
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NAlexey Dobriyan <adobriyan@openvz.org>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b488893a
    • P
      pid namespaces: round up the API · a47afb0f
      Pavel Emelianov 提交于
      The set of functions process_session, task_session, process_group and
      task_pgrp is confusing, as the names can be mixed with each other when looking
      at the code for a long time.
      
      The proposals are to
      * equip the functions that return the integer with _nr suffix to
        represent that fact,
      * and to make all functions work with task (not process) by making
        the common prefix of the same name.
      
      For monotony the routines signal_session() and set_signal_session() are
      replaced with task_session_nr() and set_task_session(), especially since they
      are only used with the explicit task->signal dereference.
      Signed-off-by: NPavel Emelianov <xemul@openvz.org>
      Acked-by: NSerge E. Hallyn <serue@us.ibm.com>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a47afb0f
  20. 17 10月, 2007 7 次提交
    • F
      Break ELF_PLATFORM and stack pointer randomization dependency · d68c9d6a
      Franck Bui-Huu 提交于
      Currently arch_align_stack() is used by fs/binfmt_elf.c to randomize
      stack pointer inside a page. But this happens only if ELF_PLATFORM
      symbol is defined.
      
      ELF_PLATFORM is normally set if the architecture wants ld.so to load
      implementation specific libraries for optimization. And currently a
      lot of architectures just yield this symbol to NULL.
      
      This is the case for MIPS architecture where ELF_PLATFORM is NULL but
      arch_align_stack() has been redefined to do stack inside page
      randomization. So in this case no randomization is actually done.
      
      This patch breaks this dependency which seems to be useless and allows
      platforms such MIPS to do the randomization.
      Signed-off-by: NFranck Bui-Huu <fbuihuu@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d68c9d6a
    • O
      increase AT_VECTOR_SIZE to terminate saved_auxv properly · 4f9a58d7
      Olaf Hering 提交于
      include/asm-powerpc/elf.h has 6 entries in ARCH_DLINFO.  fs/binfmt_elf.c
      has 14 unconditional NEW_AUX_ENT entries and 2 conditional NEW_AUX_ENT
      entries.  So in the worst case, saved_auxv does not get an AT_NULL entry at
      the end.
      
      The saved_auxv array must be terminated with an AT_NULL entry.  Make the
      size of mm_struct->saved_auxv arch dependend, based on the number of
      ARCH_DLINFO entries.
      Signed-off-by: NOlaf Hering <olh@suse.de>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Jakub Jelinek <jakub@redhat.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4f9a58d7
    • R
      Add MMF_DUMP_ELF_HEADERS · 82df3973
      Roland McGrath 提交于
      This adds the MMF_DUMP_ELF_HEADERS option to /proc/pid/coredump_filter.
      This dumps the first page (only) of a private file mapping if it appears to
      be a mapping of an ELF file.  Including these pages in the core dump may
      give sufficient identifying information to associate the original DSO and
      executable file images and their debugging information with a core file in
      a generic way just from its contents (e.g.  when those binaries were built
      with ld --build-id).  I expect this to become the default behavior
      eventually.  Existing versions of gdb can be confused by the core dumps it
      creates, so it won't enabled by default for some time to come.  Soon many
      people will have systems with a gdb that handle these dumps, so they can
      arrange to set the bit at boot and have it inherited system-wide.
      
      This also cleans up the checking of the MMF_DUMP_* flag bits, which did not
      need to be using atomic macros.
      Signed-off-by: NRoland McGrath <roland@redhat.com>
      Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      82df3973
    • A
      Deprecate a.out ELF interpreters · 8e9073ed
      Andi Kleen 提交于
      The Linux ELF loader is quite complicated and messy code (that could
      probably need a rewrite, but that's a different chapter).  One particular
      messy part in it is the support for non ELF a.out ld.sos.  This was
      originally added to make transition from a.out to ELF easier because an
      a.out ELF ld.so could be still build using an older a.out toolkit.  But by
      now that should be fully obsolete and removing it would clean up
      binfmt_elf.c up a bit.
      
      I propose to deprecate this support and remove for 2.6.25.
      
      Drawback is that someone still runs their system with a.out ld.so
      they would need to update the ld.so when updating to a new kernel.
      
      This patch just adds an entry to the deprecation file and a printk
      warning users.
      
      [akpm@linux-foundation.org: better warning message]
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8e9073ed
    • N
      core_pattern: ignore RLIMIT_CORE if core_pattern is a pipe · 7dc0b22e
      Neil Horman 提交于
      For some time /proc/sys/kernel/core_pattern has been able to set its output
      destination as a pipe, allowing a user space helper to receive and
      intellegently process a core.  This infrastructure however has some
      shortcommings which can be enhanced.  Specifically:
      
      1) The coredump code in the kernel should ignore RLIMIT_CORE limitation
         when core_pattern is a pipe, since file system resources are not being
         consumed in this case, unless the user application wishes to save the core,
         at which point the app is restricted by usual file system limits and
         restrictions.
      
      2) The core_pattern code should be able to parse and pass options to the
         user space helper as an argv array.  The real core limit of the uid of the
         crashing proces should also be passable to the user space helper (since it
         is overridden to zero when called).
      
      3) Some miscellaneous bugs need to be cleaned up (specifically the
         recognition of a recursive core dump, should the user mode helper itself
         crash.  Also, the core dump code in the kernel should not wait for the user
         mode helper to exit, since the same context is responsible for writing to
         the pipe, and a read of the pipe by the user mode helper will result in a
         deadlock.
      
      This patch:
      
      Remove the check of RLIMIT_CORE if core_pattern is a pipe.  In the event that
      core_pattern is a pipe, the entire core will be fed to the user mode helper.
      Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
      Cc: <martin.pitt@ubuntu.com>
      Cc: <wwoods@redhat.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7dc0b22e
    • M
      x86: replace NT_PRXFPREG with ELF_CORE_XFPREG_TYPE #define · 5b20cd80
      Mark Nelson 提交于
      Replace NT_PRXFPREG with ELF_CORE_XFPREG_TYPE in the coredump code which
      allows for more flexibility in the note type for the state of 'extended
      floating point' implementations in coredumps.  New note types can now be
      added with an appropriate #define.
      
      This does #define ELF_CORE_XFPREG_TYPE to be NT_PRXFPREG in all
      current users so there's are no change in behaviour.
      
      This will let us use different note types on powerpc for the Altivec/VMX
      state that some PowerPC cpus have (G4, PPC970, POWER6) and for the SPE
      (signal processing extension) state that some embedded PowerPC cpus from
      Freescale have.
      Signed-off-by: NMark Nelson <markn@au1.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Andi Kleen <ak@suse.de>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5b20cd80
    • N
      remove ZERO_PAGE · 557ed1fa
      Nick Piggin 提交于
      The commit b5810039 contains the note
      
        A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
        (and thus mapcounted and count towards shared rss).  These writes to
        the struct page could cause excessive cacheline bouncing on big
        systems.  There are a number of ways this could be addressed if it is
        an issue.
      
      And indeed this cacheline bouncing has shown up on large SGI systems.
      There was a situation where an Altix system was essentially livelocked
      tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
      This situation can be avoided in userspace, but it does highlight the
      potential scalability problem with refcounting ZERO_PAGE, and corner
      cases where it can really hurt (we don't want the system to livelock!).
      
      There are several broad ways to fix this problem:
      1. add back some special casing to avoid refcounting ZERO_PAGE
      2. per-node or per-cpu ZERO_PAGES
      3. remove the ZERO_PAGE completely
      
      I will argue for 3. The others should also fix the problem, but they
      result in more complex code than does 3, with little or no real benefit
      that I can see.
      
      Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a
      false optimisation: if an application is performance critical, it would
      not be doing many read faults of new memory, or at least it could be
      expected to write to that memory soon afterwards. If cache or memory use
      is critical, it should not be working with a significant number of
      ZERO_PAGEs anyway (a more compact representation of zeroes should be
      used).
      
      As a sanity check -- mesuring on my desktop system, there are never many
      mappings to the ZERO_PAGE (eg. 2 or 3), thus memory usage here should not
      increase much without it.
      
      When running a make -j4 kernel compile on my dual core system, there are
      about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000
      ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second
      is torn down without being COWed). So removing ZERO_PAGE will save 1,000
      page faults per second when running kbuild, while keeping it only saves
      less than 1 page clearing operation per second. 1 page clear is cheaper
      than a thousand faults, presumably, so there isn't an obvious loss.
      
      Neither the logical argument nor these basic tests give a guarantee of no
      regressions. However, this is a reasonable opportunity to try to remove
      the ZERO_PAGE from the pagefault path. If it is found to cause regressions,
      we can reintroduce it and just avoid refcounting it.
      
      The /dev/zero ZERO_PAGE usage and TLB tricks also get nuked.  I don't see
      much use to them except on benchmarks.  All other users of ZERO_PAGE are
      converted just to use ZERO_PAGE(0) for simplicity. We can look at
      replacing them all and maybe ripping out ZERO_PAGE completely when we are
      more satisfied with this solution.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus "snif" Torvalds <torvalds@linux-foundation.org>
      557ed1fa
  21. 19 9月, 2007 1 次提交
    • M
      [POWERPC] spufs: Cleanup ELF coredump extra notes logic · e5501492
      Michael Ellerman 提交于
      To start with, arch_notes_size() etc. is a little too ambiguous a name for
      my liking, so change the function names to be more explicit.
      
      Calling through macros is ugly, especially with hidden parameters, so don't
      do that, call the routines directly.
      
      Use ARCH_HAVE_EXTRA_ELF_NOTES as the only flag, and based on it decide
      whether we want the extern declarations or the empty versions.
      
      Since we have empty routines, actually use them in the coredump code to
      save a few #ifdefs.
      
      We want to change the handling of foffset so that the write routine updates
      foffset as it goes, instead of using file->f_pos (so that writing to a pipe
      works).  So pass foffset to the write routine, and for now just set it to
      file->f_pos at the end of writing.
      
      It should also be possible for the write routine to fail, so change it to
      return int and treat a non-zero return as failure.
      Signed-off-by: NMichael Ellerman <michael@ellerman.id.au>
      Signed-off-by: NJeremy Kerr <jk@ozlabs.org>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      e5501492
  22. 22 7月, 2007 1 次提交
  23. 20 7月, 2007 2 次提交