1. 07 Mar, 2010 2 commits
    • coredump: move dump_write() and dump_seek() into a header file · 088e7af7
      Committed by Daisuke HATAYAMA
      My next patch will replace the ELF_CORE_EXTRA_* macros with functions,
      putting them into other newly created *.c files.  Each of those files
      would then contain its own dump_write(), which should be the same in each
      pair of binfmt_*.c and elfcore.c.  So, this patch moves dump_write() into
      a header file together with dump_seek().  The patch also deletes the
      confusing DUMP_WRITE macros in each file.
      Signed-off-by: Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Greg Ungerer <gerg@snapgear.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • coredump: unify dump_seek() implementations for each binfmt_*.c · 05f47fda
      Committed by Daisuke HATAYAMA
      The current ELF dumper can produce broken corefiles if the number of
      program headers exceeds 65535.  In particular, programs in 64-bit
      environments often need more than 65535 mmaps.  If you google
      max_map_count, you can find many users facing this problem.
      
      Solaris has already dealt with this issue, and other OSes have also
      adopted the same method as Solaris.  Currently, Sun's documentation and
      the AMD64 ABI describe the extension, which they call Extended
      Numbering.  See References for further information.
      
      I believe that linux kernel should adopt the same way as they did, so I've
      written this patch.
      
      I am also preparing for patches of GDB and binutils.
      
      How to fix
      ==========
      
      In the new dumping process, there are two cases according to whether or
      not the number of program headers is equal to or more than 65535.
      
       - if less than 65535, the produced corefile format is exactly the same
         as the ordinary one.
      
       - if equal to or more than 65535, then e_phnum field is set to newly
         introduced constant PN_XNUM(0xffff) and the actual number of program
         headers is set to sh_info field of the section header at index 0.
      
      Compatibility Concern
      =====================
      
       * As already mentioned in Summary, Sun and AMD64 have already adopted
         this.  See References.
      
       * There are four combinations according to whether kernel and userland
         tools are respectively modified or not.  The next table summarizes
         shortly for each combination.
      
                        ---------------------------------------------
                            Original Kernel   |   Modified Kernel
                        ---------------------------------------------
                          < 65535  | >= 65535 | < 65535  | >= 65535
        -------------------------------------------------------------
         Original Tools |    OK    |  broken  |    OK    | broken (#)
        -------------------------------------------------------------
         Modified Tools |    OK    |  broken  |    OK    |    OK
        -------------------------------------------------------------
      
        Note that there is no case that `OK' changes to `broken'.
      
        (#) Although this case remains broken, O-M behaves better than
        O-O. That is, while in O-O case e_phnum field would be extremely
        small due to integer overflow, in O-M case it is guaranteed to be at
        least 65535 by being set to PN_XNUM(0xFFFF), much closer to the
        actual correct value than the O-O case.
      
      Test Program
      ============
      
      Here is a test program mkmmaps.c that is useful to produce the
      corefile with many mmaps. To use this, please take the following
      steps:
      
      $ ulimit -c unlimited
      $ sysctl vm.max_map_count=70000 # default 65530 is too small
      $ sysctl fs.file-max=70000
      $ mkmmaps 65535
      
      Then, the program will abort and a corefile will be generated.
      
      If it fails, the error message displayed tells you which limit is
      still too small:
      
       * ``out of memory'' means vm.max_map_count is still too small
      
       * ``too many open files'' means fs.file-max is still too small
      
      So, please change that setting to a larger value and retry.
      
      mkmmaps.c
      ==
      #define _GNU_SOURCE	/* for MAP_ANONYMOUS on older C libraries */
      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/mman.h>
      #include <fcntl.h>
      #include <unistd.h>
      int main(int argc, char **argv)
      {
      	int maps_num;
      	char buffer[128];

      	if (argc < 2) {
      		fprintf(stderr, "mkmmaps [number of maps to be created]\n");
      		exit(1);
      	}
      	/* sscanf() returns the number of converted items: require exactly 1 */
      	if (sscanf(argv[1], "%d", &maps_num) != 1) {
      		fprintf(stderr, "%s is not a number\n", argv[1]);
      		exit(2);
      	}
      	if (maps_num < 0) {
      		fprintf(stderr, "%d is invalid\n", maps_num);
      		exit(3);
      	}
      	/* create maps_num separate one-byte shared anonymous mappings */
      	for (; maps_num > 0; --maps_num) {
      		if (MAP_FAILED == mmap(NULL, 1, PROT_READ,
      					MAP_SHARED | MAP_ANONYMOUS, -1, 0)) {
      			perror("mmap");
      			exit(4);
      		}
      	}
      	/* report the number of mappings; this must run before abort() */
      	snprintf(buffer, sizeof(buffer), "wc -l /proc/%d/maps", (int)getpid());
      	system(buffer);
      	abort();	/* raises SIGABRT, which produces the corefile */
      }
      
      Tested on i386, ia64 and um/sys-i386.
      Built on sh4 (which covers fs/binfmt_elf_fdpic.c)
      
      References
      ==========
      
       - Sun microsystems: Linker and Libraries.
         Part No: 817-1984-17, September 2008.
         URL: http://docs.sun.com/app/docs/doc/817-1984
      
       - System V ABI AMD64 Architecture Processor Supplement
         Draft Version 0.99., May 11, 2009.
         URL: http://www.x86-64.org/
      
      This patch:
      
      There are three different definitions for dump_seek() functions in
      binfmt_aout.c, binfmt_elf.c and binfmt_elf_fdpic.c, respectively.  The
      only for binfmt_elf.c.
      
      My next patch will move dump_seek() into a header file in order to share
      the same implementations for dump_write() and dump_seek().  As the first
      step, this patch unifies these three definitions of dump_seek() by applying
      the past commits that have been applied only to binfmt_elf.c.
      
      Specifically, the modification made here is part of the following commits:
      
        * d025c9db
        * 7f14daa1
      
      This patch does not change the shape of corefiles.
      Signed-off-by: Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Greg Ungerer <gerg@snapgear.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 30 Jan, 2010 1 commit
    • Split 'flush_old_exec' into two functions · 221af7f8
      Committed by Linus Torvalds
      'flush_old_exec()' is the point of no return when doing an execve(), and
      it is pretty badly misnamed.  It doesn't just flush the old executable
      environment, it also starts up the new one.
      
      Which is very inconvenient for things like setting up the new
      personality, because we want the new personality to affect the starting
      of the new environment, but at the same time we do _not_ want the new
      personality to take effect if flushing the old one fails.
      
      As a result, the x86-64 '32-bit' personality is actually done using this
      insane "I'm going to change the ABI, but I haven't done it yet" bit
      (TIF_ABI_PENDING), with SET_PERSONALITY() not actually setting the
      personality, but just the "pending" bit, so that "flush_thread()" can do
      the actual personality magic.
      
      This patch in no way changes any of that insanity, but it does split the
      'flush_old_exec()' function up into a preparatory part that can fail
      (still called flush_old_exec()), and a new part that will actually set
      up the new exec environment (setup_new_exec()).  All callers are changed
      to trivially comply with the new world order.
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 07 Jan, 2010 1 commit
    • FDPIC: Respect PT_GNU_STACK exec protection markings when creating NOMMU stack · 04e4f2b1
      Committed by Mike Frysinger
      The current code will load the stack size and protection markings, but
      then only use the markings in the MMU code path.  The NOMMU code path
      always passes PROT_EXEC to the mmap() call.  While this doesn't matter
      to most people whilst the code is running, it will cause a pointless
      icache flush when starting every FDPIC application.  Typically this
      icache flush will be of a region on the order of 128KB in size, or may
      be the entire icache, depending on the facilities available on the CPU.
      
      In the case where the arch default behaviour seems to be desired
      (EXSTACK_DEFAULT), we probe VM_STACK_FLAGS for VM_EXEC to determine
      whether we should be setting PROT_EXEC or not.
      
      For arches that support an MPU (Memory Protection Unit - an MMU without
      the virtual mapping capability), setting PROT_EXEC or not will make an
      important difference.
      
      It should be noted that this change also affects the executability of
      the brk region, since ELF-FDPIC has that share with the stack.  However,
      this is probably irrelevant as NOMMU programs aren't likely to use the
      brk region, preferring instead allocation via mmap().
      Signed-off-by: Mike Frysinger <vapier@gentoo.org>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 04 Jan, 2010 1 commit
  5. 18 Dec, 2009 1 commit
  6. 16 Dec, 2009 2 commits
  7. 24 Sep, 2009 1 commit
  8. 22 Sep, 2009 1 commit
  9. 19 Jun, 2009 1 commit
  10. 03 May, 2009 1 commit
  11. 03 Apr, 2009 1 commit
  12. 08 Jan, 2009 2 commits
    • FDPIC: Don't attempt to expand the userspace stack to fill the space allocated · f4bbf510
      Committed by David Howells
      Stop the ELF-FDPIC binfmt from attempting to expand the userspace stack and brk
      segments to fill the space actually allocated for it.  The space allocated may
      be rounded up by mmap(), and may be wasted.
      
      However, finding out how much space we actually obtained uses the contentious
      kobjsize() function which we'd like to get rid of as it doesn't necessarily
      work for all slab allocators.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-by: Mike Frysinger <vapier.adi@gmail.com>
      Acked-by: Paul Mundt <lethal@linux-sh.org>
    • NOMMU: Make VMAs per MM as for MMU-mode linux · 8feae131
      Committed by David Howells
      Make VMAs per mm_struct as for MMU-mode linux.  This solves two problems:
      
       (1) In SYSV SHM where nattch for a segment does not reflect the number of
           shmat's (and forks) done.
      
       (2) In mmap() where the VMA's vm_mm is set to point to the parent mm by an
           exec'ing process when VM_EXECUTABLE is specified, regardless of the fact
           that a VMA might be shared and already have its vm_mm assigned to another
           process or a dead process.
      
      A new struct (vm_region) is introduced to track a mapped region and to remember
      the circumstances under which it may be shared and the vm_list_struct structure
      is discarded as it's no longer required.
      
      This patch makes the following additional changes:
      
       (1) Regions are now allocated with alloc_pages() rather than kmalloc() and
           with no recourse to __GFP_COMP, so the pages are not composite.  Instead,
           each page has a reference on it held by the region.  Anything else that is
           interested in such a page will have to get a reference on it to retain it.
           When the pages are released due to unmapping, each page is passed to
           put_page() and will be freed when the page usage count reaches zero.
      
       (2) Excess pages are trimmed after an allocation as the allocation must be
           made as a power-of-2 quantity of pages.
      
       (3) VMAs are added to the parent MM's R/B tree and mmap lists.  As an MM may
           end up with overlapping VMAs within the tree, the VMA struct address is
           appended to the sort key.
      
       (4) Non-anonymous VMAs are now added to the backing inode's prio list.
      
       (5) Holes may be punched in anonymous VMAs with munmap(), releasing parts of
           the backing region.  The VMA and region structs will be split if
           necessary.
      
       (6) sys_shmdt() only releases one attachment to a SYSV IPC shared memory
           segment instead of all the attachments at that address.  Multiple
           shmat()'s return the same address under NOMMU-mode instead of different
           virtual addresses as under MMU-mode.
      
       (7) Core dumping for ELF-FDPIC requires fewer exceptions for NOMMU-mode.
      
       (8) /proc/maps is now the global list of mapped regions, and may list bits
           that aren't actually mapped anywhere.
      
       (9) /proc/meminfo gains a line (tagged "MmapCopy") that indicates the amount
           of RAM currently allocated by mmap to hold mappable regions that can't be
           mapped directly.  These are copies of the backing device or file if not
           anonymous.
      
      These changes make NOMMU mode more similar to MMU mode.  The downside is that
      NOMMU mode requires some extra memory to track things over NOMMU without this
      patch (VMAs are no longer shared, and there are now region structs).
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-by: Mike Frysinger <vapier.adi@gmail.com>
      Acked-by: Paul Mundt <lethal@linux-sh.org>
  13. 14 Nov, 2008 5 commits
    • CRED: Make execve() take advantage of copy-on-write credentials · a6f76f23
      Committed by David Howells
      Make execve() take advantage of copy-on-write credentials, allowing it to set
      up the credentials in advance, and then commit the whole lot after the point
      of no return.
      
      This patch and the preceding patches have been tested with the LTP SELinux
      testsuite.
      
      This patch makes several logical sets of alteration:
      
       (1) execve().
      
           The credential bits from struct linux_binprm are, for the most part,
           replaced with a single credentials pointer (bprm->cred).  This means that
           all the creds can be calculated in advance and then applied at the point
           of no return with no possibility of failure.
      
           I would like to replace bprm->cap_effective with:
      
      	cap_isclear(bprm->cap_effective)
      
           but this seems impossible due to special behaviour for processes of pid 1
           (they always retain their parent's capability masks where normally they'd
           be changed - see cap_bprm_set_creds()).
      
           The following sequence of events now happens:
      
           (a) At the start of do_execve, the current task's cred_exec_mutex is
           	 locked to prevent PTRACE_ATTACH from obsoleting the calculation of
           	 creds that we make.
      
           (a) prepare_exec_creds() is then called to make a copy of the current
           	 task's credentials and prepare it.  This copy is then assigned to
           	 bprm->cred.
      
        	 This renders security_bprm_alloc() and security_bprm_free()
           	 unnecessary, and so they've been removed.
      
           (b) The determination of unsafe execution is now performed immediately
           	 after (a) rather than later on in the code.  The result is stored in
           	 bprm->unsafe for future reference.
      
           (c) prepare_binprm() is called, possibly multiple times.
      
           	 (i) This applies the result of set[ug]id binaries to the new creds
           	     attached to bprm->cred.  Personality bit clearance is recorded,
           	     but now deferred on the basis that the exec procedure may yet
           	     fail.
      
               (ii) This then calls the new security_bprm_set_creds().  This should
      	     calculate the new LSM and capability credentials into *bprm->cred.
      
      	     This folds together security_bprm_set() and parts of
      	     security_bprm_apply_creds() (these two have been removed).
      	     Anything that might fail must be done at this point.
      
               (iii) bprm->cred_prepared is set to 1.
      
      	     bprm->cred_prepared is 0 on the first pass of the security
      	     calculations, and 1 on all subsequent passes.  This allows SELinux
      	     in (ii) to base its calculations only on the initial script and
      	     not on the interpreter.
      
           (d) flush_old_exec() is called to commit the task to execution.  This
           	 performs the following steps with regard to credentials:
      
      	 (i) Clear pdeath_signal and set dumpable on certain circumstances that
      	     may not be covered by commit_creds().
      
               (ii) Clear any bits in current->personality that were deferred from
                   (c.i).
      
           (e) install_exec_creds() [compute_creds() as was] is called to install the
           	 new credentials.  This performs the following steps with regard to
           	 credentials:
      
               (i) Calls security_bprm_committing_creds() to apply any security
                   requirements, such as flushing unauthorised files in SELinux, that
                   must be done before the credentials are changed.
      
      	     This is made up of bits of security_bprm_apply_creds() and
      	     security_bprm_post_apply_creds(), both of which have been removed.
      	     This function is not allowed to fail; anything that might fail
      	     must have been done in (c.ii).
      
               (ii) Calls commit_creds() to apply the new credentials in a single
                   assignment (more or less).  Possibly pdeath_signal and dumpable
                   should be part of struct creds.
      
      	 (iii) Unlocks the task's cred_replace_mutex, thus allowing
      	     PTRACE_ATTACH to take place.
      
               (iv) Clears the bprm->cred pointer as the credentials it was holding
                   are now immutable.
      
               (v) Calls security_bprm_committed_creds() to apply any security
                   alterations that must be done after the creds have been changed.
                   SELinux uses this to flush signals and signal handlers.
      
           (f) If an error occurs before (d.i), bprm_free() will call abort_creds()
           	 to destroy the proposed new credentials and will then unlock
           	 cred_replace_mutex.  No changes to the credentials will have been
           	 made.
      
       (2) LSM interface.
      
           A number of functions have been changed, added or removed:
      
           (*) security_bprm_alloc(), ->bprm_alloc_security()
           (*) security_bprm_free(), ->bprm_free_security()
      
           	 Removed in favour of preparing new credentials and modifying those.
      
           (*) security_bprm_apply_creds(), ->bprm_apply_creds()
           (*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
      
           	 Removed; split between security_bprm_set_creds(),
           	 security_bprm_committing_creds() and security_bprm_committed_creds().
      
           (*) security_bprm_set(), ->bprm_set_security()
      
           	 Removed; folded into security_bprm_set_creds().
      
           (*) security_bprm_set_creds(), ->bprm_set_creds()
      
     	 New.  The new credentials in bprm->cred should be checked and set up
     	 as appropriate.  bprm->cred_prepared is 0 on the first call, 1 on the
     	 second and subsequent calls.
      
           (*) security_bprm_committing_creds(), ->bprm_committing_creds()
           (*) security_bprm_committed_creds(), ->bprm_committed_creds()
      
           	 New.  Apply the security effects of the new credentials.  This
           	 includes closing unauthorised files in SELinux.  This function may not
           	 fail.  When the former is called, the creds haven't yet been applied
           	 to the process; when the latter is called, they have.
      
       	 The former may access bprm->cred, the latter may not.
      
       (3) SELinux.
      
           SELinux has a number of changes, in addition to those to support the LSM
           interface changes mentioned above:
      
           (a) The bprm_security_struct struct has been removed in favour of using
           	 the credentials-under-construction approach.
      
           (c) flush_unauthorized_files() now takes a cred pointer and passes it on
           	 to inode_has_perm(), file_has_perm() and dentry_open().
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: James Morris <jmorris@namei.org>
      Acked-by: Serge Hallyn <serue@us.ibm.com>
      Signed-off-by: James Morris <jmorris@namei.org>
    • CRED: Use RCU to access another task's creds and to release a task's own creds · c69e8d9c
      Committed by David Howells
      Use RCU to access another task's creds and to release a task's own creds.
      This means that it will be possible for the credentials of a task to be
      replaced without another task (a) requiring a full lock to read them, and (b)
      seeing deallocated memory.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: James Morris <jmorris@namei.org>
      Acked-by: Serge Hallyn <serue@us.ibm.com>
      Signed-off-by: James Morris <jmorris@namei.org>
    • CRED: Wrap current->cred and a few other accessors · 86a264ab
      Committed by David Howells
      Wrap current->cred and a few other accessors to hide their actual
      implementation.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: James Morris <jmorris@namei.org>
      Acked-by: Serge Hallyn <serue@us.ibm.com>
      Signed-off-by: James Morris <jmorris@namei.org>
    • CRED: Separate task security context from task_struct · b6dff3ec
      Committed by David Howells
      Separate the task security context from task_struct.  At this point, the
      security data is temporarily embedded in the task_struct with two pointers
      pointing to it.
      
      Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in
      entry.S via asm-offsets.
      
      With comment fixes Signed-off-by: Marc Dionne <marc.c.dionne@gmail.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: James Morris <jmorris@namei.org>
      Acked-by: Serge Hallyn <serue@us.ibm.com>
      Signed-off-by: James Morris <jmorris@namei.org>
    • CRED: Wrap task credential accesses in the filesystem subsystem · da9592ed
      Committed by David Howells
      Wrap access to task credentials so that they can be separated more easily from
      the task_struct during the introduction of COW creds.
      
      Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id().
      
      Change some task->e?[ug]id to task_e?[ug]id().  In some places it makes more
      sense to use RCU directly rather than a convenient wrapper; these will be
      addressed by later patches.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Reviewed-by: James Morris <jmorris@namei.org>
      Acked-by: Serge Hallyn <serue@us.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: James Morris <jmorris@namei.org>
  14. 21 Oct, 2008 1 commit
  15. 17 Oct, 2008 3 commits
  16. 28 Jul, 2008 1 commit
    • binfmt_elf_fdpic: Magical stack pointer index, for NEW_AUX_ENT compat. · 9b14ec35
      Committed by Paul Mundt
      While implementing binfmt_elf_fdpic on SH it quickly became apparent
      that SH was the first platform to support both binfmt_elf_fdpic and
      binfmt_elf, as well as the only of the FDPIC platforms to make use of the
      auxvt.
      
      Currently binfmt_elf_fdpic uses a special version of NEW_AUX_ENT() where
      the first argument is the entry displacement after csp has been adjusted,
      being reset after each adjustment. As we have no ability to sort this out
      through the platform's ARCH_DLINFO, this index needs to be managed
      entirely in create_elf_fdpic_tables(). Presently none of the platforms
      that set their own auxvt entries are able to do so through their
      respective ARCH_DLINFOs when using binfmt_elf_fdpic.
      
      In addition to this, binfmt_elf_fdpic has been looking at
      DLINFO_ARCH_ITEMS for the number of architecture-specific entries in the
      auxvt. This is legacy cruft, and is not defined by any platforms in-tree,
      even those that make heavy use of the auxvt. AT_VECTOR_SIZE_ARCH is
      always available, and contains the number that is of interest here, so we
      switch to using that unconditionally as well.
      
      As this has direct bearing on how much stack is used, platforms that have
      configurable (or dynamically adjustable) NEW_AUX_ENT calls need to either
      make AT_VECTOR_SIZE_ARCH more fine-grained, or leave it as a worst-case
      and live with some lost stack space if those entries aren't pushed (some
      platforms may also need to purposely sacrifice some space here for
      alignment considerations, as noted in the code -- although not an issue
      for any FDPIC-capable platform today).
      Signed-off-by: Paul Mundt <lethal@linux-sh.org>
      Acked-by: David Howells <dhowells@redhat.com>
  17. 27 Jul, 2008 1 commit
  18. 26 Jul, 2008 2 commits
  19. 07 Jun, 2008 1 commit
  20. 29 Apr, 2008 1 commit
  21. 20 Oct, 2007 2 commits
    • pid namespaces: changes to show virtual ids to user · b488893a
      Committed by Pavel Emelyanov
      This is the largest patch in the set. Make all (I hope) the places where
      a pid is shown to or obtained from the user operate on virtual pids.
      
      The idea is:
       - all in-kernel data structures must store either struct pid itself
         or the pid's global nr, obtained with pid_nr() call;
       - when seeking the task from kernel code with the stored id one
         should use find_task_by_pid() call that works with global pids;
       - when showing a pid's numerical value to the user the virtual one
         should be used, but when one shows a task's pid outside this
         task's namespace the global one is to be used;
       - when getting the pid from userspace one needs to consider this as
         the virtual one and use appropriate task/pid-searching functions.
      
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: nuther build fix]
      [akpm@linux-foundation.org: yet nuther build fix]
      [akpm@linux-foundation.org: remove unneeded casts]
      Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • pid namespaces: round up the API · a47afb0f
      Committed by Pavel Emelianov
      The set of functions process_session, task_session, process_group and
      task_pgrp is confusing, as the names are easily mixed up when looking
      at the code for a long time.
      
      The proposals are to
      * equip the functions that return the integer with _nr suffix to
        represent that fact,
      * and to make all functions work with task (not process) by making
        the common prefix of the same name.
      
      For monotony the routines signal_session() and set_signal_session() are
      replaced with task_session_nr() and set_task_session(), especially since they
      are only used with the explicit task->signal dereference.
      Signed-off-by: Pavel Emelianov <xemul@openvz.org>
      Acked-by: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  22. 17 Oct, 2007 3 commits
    • core_pattern: ignore RLIMIT_CORE if core_pattern is a pipe · 7dc0b22e
      Committed by Neil Horman
      For some time /proc/sys/kernel/core_pattern has been able to set its output
      destination as a pipe, allowing a user space helper to receive and
      intelligently process a core.  This infrastructure, however, has some
      shortcomings which can be addressed.  Specifically:
      
      1) The coredump code in the kernel should ignore RLIMIT_CORE limitation
         when core_pattern is a pipe, since file system resources are not being
         consumed in this case, unless the user application wishes to save the core,
         at which point the app is restricted by usual file system limits and
         restrictions.
      
      2) The core_pattern code should be able to parse and pass options to the
         user space helper as an argv array.  The real core limit of the uid of the
         crashing process should also be passable to the user space helper (since it
         is overridden to zero when called).
      
      3) Some miscellaneous bugs need to be cleaned up, specifically the
         recognition of a recursive core dump, should the user mode helper itself
         crash.  Also, the core dump code in the kernel should not wait for the user
         mode helper to exit, since the same context is responsible for writing to
         the pipe, and a read of the pipe by the user mode helper will result in a
         deadlock.
      
      This patch:
      
      Remove the check of RLIMIT_CORE if core_pattern is a pipe.  In the event that
      core_pattern is a pipe, the entire core will be fed to the user mode helper.
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      Cc: <martin.pitt@ubuntu.com>
      Cc: <wwoods@redhat.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7dc0b22e
    • M
      x86: replace NT_PRXFPREG with ELF_CORE_XFPREG_TYPE #define · 5b20cd80
      Mark Nelson committed
      Replace NT_PRXFPREG with ELF_CORE_XFPREG_TYPE in the coredump code which
      allows for more flexibility in the note type for the state of 'extended
      floating point' implementations in coredumps.  New note types can now be
      added with an appropriate #define.
      
      This does #define ELF_CORE_XFPREG_TYPE to be NT_PRXFPREG in all
      current users so there is no change in behaviour.
      
      This will let us use different note types on powerpc for the Altivec/VMX
      state that some PowerPC cpus have (G4, PPC970, POWER6) and for the SPE
      (signal processing extension) state that some embedded PowerPC cpus from
      Freescale have.
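The indirection described above might look roughly like the following. This is a simplified sketch rather than the patch itself: the split into an "architecture" part and a "generic" part inside one file, and the helper `xfpreg_note_type()`, are assumptions made for illustration. The value of `NT_PRXFPREG` is the one used in the Linux ELF headers.

```c
#include <assert.h>

/* Note type for the x86 extended FP state, as defined in the Linux ELF headers. */
#define NT_PRXFPREG 0x46e62b7f

/*
 * Architecture part (assumed layout): each current user maps the new
 * macro onto the old note type, so existing core files do not change.
 * Another architecture could name a different note type here instead.
 */
#define ELF_CORE_XFPREG_TYPE NT_PRXFPREG

/*
 * Generic part: the dumper refers only to ELF_CORE_XFPREG_TYPE and never
 * to NT_PRXFPREG directly. xfpreg_note_type() is a hypothetical helper.
 */
unsigned int xfpreg_note_type(void)
{
    return ELF_CORE_XFPREG_TYPE;
}
```

Because the generic code names only the macro, supporting a new note type is a one-line change in the architecture header.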
      Signed-off-by: Mark Nelson <markn@au1.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Andi Kleen <ak@suse.de>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b20cd80
    • N
      remove ZERO_PAGE · 557ed1fa
      Nick Piggin committed
      The commit b5810039 contains the note
      
        A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
        (and thus mapcounted and count towards shared rss).  These writes to
        the struct page could cause excessive cacheline bouncing on big
        systems.  There are a number of ways this could be addressed if it is
        an issue.
      
      And indeed this cacheline bouncing has shown up on large SGI systems.
      There was a situation where an Altix system was essentially livelocked
      tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
      This situation can be avoided in userspace, but it does highlight the
      potential scalability problem with refcounting ZERO_PAGE, and corner
      cases where it can really hurt (we don't want the system to livelock!).
      
      There are several broad ways to fix this problem:
      1. add back some special casing to avoid refcounting ZERO_PAGE
      2. per-node or per-cpu ZERO_PAGES
      3. remove the ZERO_PAGE completely
      
      I will argue for 3. The others should also fix the problem, but they
      result in more complex code than does 3, with little or no real benefit
      that I can see.
      
      Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a
      false optimisation: if an application is performance critical, it would
      not be doing many read faults of new memory, or at least it could be
      expected to write to that memory soon afterwards. If cache or memory use
      is critical, it should not be working with a significant number of
      ZERO_PAGEs anyway (a more compact representation of zeroes should be
      used).
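The access pattern discussed above (a read fault on fresh anonymous memory followed shortly by a write) can be reproduced from user space. The sketch below only demonstrates the pattern on a POSIX system with `MAP_ANONYMOUS`; it does not measure anything, and whether the read is served by a shared zero page depends on the kernel in use. `read_then_write()` is a hypothetical name.

```c
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map `pages` anonymous pages, read one byte from each (read fault),
 * then write to the same byte (write fault or COW, depending on the
 * kernel). Returns the number of pages that read back as zero before
 * the write, or -1 if the mapping could not be created.
 */
long read_then_write(size_t pages)
{
    long page_size = sysconf(_SC_PAGESIZE);
    size_t len = pages * (size_t)page_size;
    long zero_pages = 0;
    unsigned char *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (mem == MAP_FAILED)
        return -1;

    for (size_t i = 0; i < pages; i++) {
        volatile unsigned char *p = mem + i * (size_t)page_size;

        if (*p == 0) /* new anonymous memory always reads as zero */
            zero_pages++;
        *p = 1;      /* the write that is expected soon afterwards */
    }

    munmap(mem, len);
    return zero_pages;
}
```

Calling `read_then_write(16)` touches 16 pages in this way; every page reads as zero first, whichever mechanism the kernel uses to supply it.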
      
      As a sanity check -- measuring on my desktop system, there are never many
      mappings to the ZERO_PAGE (eg. 2 or 3), thus memory usage here should not
      increase much without it.
      
      When running a make -j4 kernel compile on my dual core system, there are
      about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000
      ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second
      is torn down without being COWed). So removing ZERO_PAGE will save 1,000
      page faults per second when running kbuild, while keeping it only saves
      less than 1 page clearing operation per second. 1 page clear is cheaper
      than a thousand faults, presumably, so there isn't an obvious loss.
      
      Neither the logical argument nor these basic tests give a guarantee of no
      regressions. However, this is a reasonable opportunity to try to remove
      the ZERO_PAGE from the pagefault path. If it is found to cause regressions,
      we can reintroduce it and just avoid refcounting it.
      
      The /dev/zero ZERO_PAGE usage and TLB tricks also get nuked.  I don't see
      much use to them except on benchmarks.  All other users of ZERO_PAGE are
      converted just to use ZERO_PAGE(0) for simplicity. We can look at
      replacing them all and maybe ripping out ZERO_PAGE completely when we are
      more satisfied with this solution.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus "snif" Torvalds <torvalds@linux-foundation.org>
      557ed1fa
  23. 20 Jul, 2007 3 commits
  24. 09 May, 2007 1 commit
  25. 03 Apr, 2007 1 commit