1. 17 10月, 2007 1 次提交
    • N
      remove ZERO_PAGE · 557ed1fa
      Nick Piggin 提交于
      The commit b5810039 contains the note
      
        A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
        (and thus mapcounted and count towards shared rss).  These writes to
        the struct page could cause excessive cacheline bouncing on big
        systems.  There are a number of ways this could be addressed if it is
        an issue.
      
      And indeed this cacheline bouncing has shown up on large SGI systems.
      There was a situation where an Altix system was essentially livelocked
      tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
      This situation can be avoided in userspace, but it does highlight the
      potential scalability problem with refcounting ZERO_PAGE, and corner
      cases where it can really hurt (we don't want the system to livelock!).
      
      There are several broad ways to fix this problem:
      1. add back some special casing to avoid refcounting ZERO_PAGE
      2. per-node or per-cpu ZERO_PAGES
      3. remove the ZERO_PAGE completely
      
      I will argue for 3. The others should also fix the problem, but they
      result in more complex code than does 3, with little or no real benefit
      that I can see.
      
      Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a
      false optimisation: if an application is performance critical, it would
      not be doing many read faults of new memory, or at least it could be
      expected to write to that memory soon afterwards. If cache or memory use
      is critical, it should not be working with a significant number of
      ZERO_PAGEs anyway (a more compact representation of zeroes should be
      used).
      
      As a sanity check -- mesuring on my desktop system, there are never many
      mappings to the ZERO_PAGE (eg. 2 or 3), thus memory usage here should not
      increase much without it.
      
      When running a make -j4 kernel compile on my dual core system, there are
      about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000
      ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second
      is torn down without being COWed). So removing ZERO_PAGE will save 1,000
      page faults per second when running kbuild, while keeping it only saves
      less than 1 page clearing operation per second. 1 page clear is cheaper
      than a thousand faults, presumably, so there isn't an obvious loss.
      
      Neither the logical argument nor these basic tests give a guarantee of no
      regressions. However, this is a reasonable opportunity to try to remove
      the ZERO_PAGE from the pagefault path. If it is found to cause regressions,
      we can reintroduce it and just avoid refcounting it.
      
      The /dev/zero ZERO_PAGE usage and TLB tricks also get nuked.  I don't see
      much use to them except on benchmarks.  All other users of ZERO_PAGE are
      converted just to use ZERO_PAGE(0) for simplicity. We can look at
      replacing them all and maybe ripping out ZERO_PAGE completely when we are
      more satisfied with this solution.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus "snif" Torvalds <torvalds@linux-foundation.org>
      557ed1fa
  2. 19 9月, 2007 1 次提交
    • M
      [POWERPC] spufs: Cleanup ELF coredump extra notes logic · e5501492
      Michael Ellerman 提交于
      To start with, arch_notes_size() etc. is a little too ambiguous a name for
      my liking, so change the function names to be more explicit.
      
      Calling through macros is ugly, especially with hidden parameters, so don't
      do that, call the routines directly.
      
      Use ARCH_HAVE_EXTRA_ELF_NOTES as the only flag, and based on it decide
      whether we want the extern declarations or the empty versions.
      
      Since we have empty routines, actually use them in the coredump code to
      save a few #ifdefs.
      
      We want to change the handling of foffset so that the write routine updates
      foffset as it goes, instead of using file->f_pos (so that writing to a pipe
      works).  So pass foffset to the write routine, and for now just set it to
      file->f_pos at the end of writing.
      
      It should also be possible for the write routine to fail, so change it to
      return int and treat a non-zero return as failure.
      Signed-off-by: NMichael Ellerman <michael@ellerman.id.au>
      Signed-off-by: NJeremy Kerr <jk@ozlabs.org>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      e5501492
  3. 22 7月, 2007 1 次提交
  4. 20 7月, 2007 2 次提交
  5. 17 7月, 2007 2 次提交
  6. 07 7月, 2007 1 次提交
    • M
      Fix elf_core_dump() when writing arch specific notes (spu coredumps) · ef7320ed
      Michael Ellerman 提交于
      elf_core_dump() supports dumping arch specific ELF notes, via the #define
      ELF_CORE_WRITE_EXTRA_NOTES.  Currently the only user of this is the powerpc
      spu coredump code.
      
      There is a bug in the handling of foffset WRT the arch notes, which causes
      us to erroneously increment foffset by the size of the arch notes, leaving
      a block of zeroes in the file, and causing all subsequent data in the file
      to be at <supposed position> + <arch note size>.  eg:
      
        LOAD  0x050000 0x00100000 0x00000000 0x20000 0x20000 R E 0x10000
      
      Tells us we should have a chunk of data at 0x50000.  The truth is the data
      is at 0x90dbc = 0x50000 + 0x40dbc (the size of the arch notes).
      
      This bug prevents gdb from reading the core file correctly.
      
      The simplest fix is to simply remember the size of the arch notes, and add
      it to foffset after we've written the arch notes.  The only drawback is
      that if the arch code doesn't write as many bytes as it said it would, we
      end up with a broken core dump again.  For now I think that's a reasonable
      requirement.
      
      Tested on a Cell blade, gdb no longer complains about the core file being
      bogus.
      
      While I'm here I should point out that the spu coredump code does not work
      if we're dumping to a pipe - we'll have to wait for 23 to fix that.
      Signed-off-by: NMichael Ellerman <michael@ellerman.id.au>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Acked-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ef7320ed
  7. 09 5月, 2007 3 次提交
  8. 03 4月, 2007 1 次提交
  9. 17 3月, 2007 1 次提交
    • J
      [PATCH] fix process crash caused by randomisation and 64k pages · d1cabd63
      James Bottomley 提交于
      This bug was seen on ppc64, but it could have occurred on any
      architecture with a page size of 64k or above.  The problem is that in
      fs/binfmt_elf.c:randomize_stack_top() randomizes the stack to within
      0x7ff pages.  On 4k page machines, this is 8MB; on 64k page boxes, this
      is 128MB.
      
      The problem is that the new binary layout (selected in
      arch_pick_mmap_layout) places the mapping segment 128MB or the stack
      rlimit away from the top of the process memory, whichever is larger.  If
      you chose an rlimit of less than 128MB (most defaults are in the 8Mb
      range) then you can end up having your entire stack randomized away.
      
      The fix is to make randomize_stack_top() only steal at most 8MB, which this
      patch does.  However, I have to point out that even with this, your stack
      rlimit might not be exactly what you get if it's > 128MB, because you're
      still losing the random offset of up to 8MB.
      
      The true fix should be to leave an explicit gap for the randomization plus
      a buffer when determining mmap_base, but that would involve fixing all the
      architectures.
      
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d1cabd63
  10. 13 2月, 2007 1 次提交
  11. 27 1月, 2007 3 次提交
  12. 07 1月, 2007 1 次提交
    • L
      Revert "[PATCH] binfmt_elf: randomize PIE binaries (2nd try)" · 90cb28e8
      Linus Torvalds 提交于
      This reverts commit 59287c09.
      
      Hugh Dickins reports that it causes random failures on x86 with SuSE
      10.2, and points out
      
        "Isn't that randomization, anywhere from 0x10000 to ELF_ET_DYN_BASE,
         sure to place the ET_DYN from time to time just where the comment
         says it's trying to avoid? I assume that somehow results in the error
         reported."
      
      (where the comment in question is the existing comment in the source
      code about mmap/brk clashes).
      Suggested-by: NHugh Dickins <hugh@veritas.com>
      Acked-by: NMarcus Meissner <meissner@suse.de>
      Cc: Andrew Morton <akpm@osdl.org>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Dave Jones <davej@codemonkey.org.uk>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      90cb28e8
  13. 09 12月, 2006 2 次提交
  14. 08 12月, 2006 4 次提交
  15. 04 12月, 2006 1 次提交
    • D
      [POWERPC] coredump: Add SPU elf notes to coredump. · bf1ab978
      Dwayne Grant McConnell 提交于
      This patch adds SPU elf notes to the coredump. It creates a separate note
      for each of /regs, /fpcr, /lslr, /decr, /decr_status, /mem, /signal1,
      /signal1_type, /signal2, /signal2_type, /event_mask, /event_status,
      /mbox_info, /ibox_info, /wbox_info, /dma_info, /proxydma_info, /object-id.
      
      A new macro, ARCH_HAVE_EXTRA_NOTES, was created for architectures to
      specify they have extra elf core notes.
      
      A new macro, ELF_CORE_EXTRA_NOTES_SIZE, was created so the size of the
      additional notes could be calculated and added to the notes phdr entry.
      
      A new macro, ELF_CORE_WRITE_EXTRA_NOTES, was created so the new notes
      would be written after the existing notes.
      
      The SPU coredump code resides in spufs. Stub functions are provided in the
      kernel which are hooked into the spufs code which does the actual work via
      register_arch_coredump_calls().
      
      A new set of __spufs_<file>_read/get() functions was provided to allow the
      coredump code to read from the spufs files without having to lock the
      SPU context for each file read from.
      
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: NDwayne Grant McConnell <decimal@us.ibm.com>
      Signed-off-by: NArnd Bergmann <arnd.bergmann@de.ibm.com>
      bf1ab978
  16. 16 10月, 2006 1 次提交
  17. 13 10月, 2006 1 次提交
    • P
      [PATCH] Get core dump code to work... · 7f14daa1
      Petr Vandrovec 提交于
      The file based core dump code was broken by pipe changes - a relative
      llseek returns the absolute file position on success, not the relative
      one, so dump_seek() always failed when invoked with non-zero current
      position.
      
      Only success/failure can be tested with relative lseek, we have to trust
      kernel that on success we've got right file offset.  With this fix in
      place I have finally real core files instead of 1KB fragments...
      Signed-off-by: NPetr Vandrovec <petr@vandrovec.name>
      [ Cleaned it up a bit while here - use SEEK_CUR instead of hardcoding 1 ]
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      7f14daa1
  18. 01 10月, 2006 2 次提交
    • A
      [PATCH] Support piping into commands in /proc/sys/kernel/core_pattern · d025c9db
      Andi Kleen 提交于
      Using the infrastructure created in previous patches implement support to
      pipe core dumps into programs.
      
      This is done by overloading the existing core_pattern sysctl
      with a new syntax:
      
      |program
      
      When the first character of the pattern is a '|' the kernel will instead
      threat the rest of the pattern as a command to run.  The core dump will be
      written to the standard input of that program instead of to a file.
      
      This is useful for having automatic core dump analysis without filling up
      disks.  The program can do some simple analysis and save only a summary of
      the core dump.
      
      The core dump proces will run with the privileges and in the name space of
      the process that caused the core dump.
      
      I also increased the core pattern size to 128 bytes so that longer command
      lines fit.
      
      Most of the changes comes from allowing core dumps without seeks.  They are
      fairly straight forward though.
      
      One small incompatibility is that if someone had a core pattern previously
      that started with '|' they will get suddenly new behaviour.  I think that's
      unlikely to be a real problem though.
      
      Additional background:
      
      > Very nice, do you happen to have a program that can accept this kind of
      > input for crash dumps?  I'm guessing that the embedded people will
      > really want this functionality.
      
      I had a cheesy demo/prototype.  Basically it wrote the dump to a file again,
      ran gdb on it to get a backtrace and wrote the summary to a shared directory.
      Then there was a simple CGI script to generate a "top 10" crashes HTML
      listing.
      
      Unfortunately this still had the disadvantage to needing full disk space for a
      dump except for deleting it afterwards (in fact it was worse because over the
      pipe holes didn't work so if you have a holey address map it would require
      more space).
      
      Fortunately gdb seems to be happy to handle /proc/pid/fd/xxx input pipes as
      cores (at least it worked with zsh's =(cat core) syntax), so it would be
      likely possible to do it without temporary space with a simple wrapper that
      calls it in the right way.  I ran out of time before doing that though.
      
      The demo prototype scripts weren't very good.  If there is really interest I
      can dig them out (they are currently on a laptop disk on the desk with the
      laptop itself being in service), but I would recommend to rewrite them for any
      serious application of this and fix the disk space problem.
      
      Also to be really useful it should probably find a way to automatically fetch
      the debuginfos (I cheated and just installed them in advance).  If nobody else
      does it I can probably do the rewrite myself again at some point.
      
      My hope at some point was that desktops would support it in their builtin
      crash reporters, but at least the KDE people I talked too seemed to be happy
      with their user space only solution.
      
      Alan sayeth:
      
        I don't believe that piping as such as neccessarily the right model, but
        the ability to intercept and processes core dumps from user space is asked
        for by many enterprise users as well.  They want to know about, capture,
        analyse and process core dumps, often centrally and in automated form.
      
      [akpm@osdl.org: loff_t != unsigned long]
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      d025c9db
    • D
      [PATCH] BLOCK: Move extern declarations out of fs/*.c into header files [try #6] · 07f3f05c
      David Howells 提交于
      Create a new header file, fs/internal.h, for common definitions local to the
      sources in the fs/ directory.
      
      Move extern definitions that should be in header files from fs/*.c to
      fs/internal.h or other main header files where they span directories.
      Signed-Off-By: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      07f3f05c
  19. 30 9月, 2006 2 次提交
  20. 26 9月, 2006 2 次提交
  21. 11 7月, 2006 1 次提交
  22. 04 7月, 2006 1 次提交
    • C
      [PATCH] binfmt_elf: fix checks for bad address · ce51059b
      Chuck Ebbert 提交于
      Fix check for bad address; use macro instead of open-coding two checks.
      
      Taken from RHEL4 kernel update.
      
      From: Ernie Petrides <petrides@redhat.com>
      
        For background, the BAD_ADDR() macro should return TRUE if the address is
        TASK_SIZE, because that's the lowest address that is *not* valid for
        user-space mappings.  The macro was correct in binfmt_aout.c but was wrong
        for the "equal to" case in binfmt_elf.c.  There were two in-line validations
        of user-space addresses in binfmt_elf.c, which have been appropriately
        converted to use the corrected BAD_ADDR() macro in the patch you posted
        yesterday.  Note that the size checks against TASK_SIZE are okay as coded.
      
        The additional changes that I propose are below.  These are in the error
        paths for bad ELF entry addresses once load_elf_binary() has already
        committed to exec'ing the new image (following the tearing down of the
        task's original address space).
      
        The 1st hunk deals with the interp-side of the outer "if".  There were two
        problems here.  The printk() should be removed because this path can be
        triggered at will by a bogus interpreter image created and used by a
        malicious user.  Further, the error code should not be ENOEXEC, because that
        causes the loop in search_binary_handler() to continue trying other exec
        handlers (twice, in fact).  But it's too late for this to work correctly,
        because the user address space has already been torn down, and an exec()
        failure cannot be returned to the user code because the code no longer
        exists.  The only recovery is to force a SIGSEGV, but it's best to terminate
        the search loop immediately.  I somewhat arbitrarily chose EINVAL as a
        fallback error code, but any error returned by load_elf_interp() will
        override that (but this value will never be seen by user-space).
      
        The 2nd hunk deals with the non-interp-side of the outer "if".  There were
        two problems here as well.  The SIGSEGV needs to be forced, because a prior
        sigaction() syscall might have set the associated disposition to SIG_IGN.
        And the ENOEXEC should be changed to EINVAL as described above.
      Signed-off-by: NChuck Ebbert <76306.1226@compuserve.com>
      Signed-off-by: NErnie Petrides <petrides@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      ce51059b
  23. 23 6月, 2006 3 次提交
    • J
      [PATCH] binflt_elf: remove more casts · 785d5570
      Jesper Juhl 提交于
      Remove redundant casts from NEW_AUX_ENT() arguments in fs/binfmt_elf.c
      Signed-off-by: NJesper Juhl <jesper.juhl@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      785d5570
    • J
      [PATCH] binfmt_elf: CodingStyle cleanup and remove some pointless casts · f4e5cc2c
      Jesper Juhl 提交于
      Do a CodingStyle cleanup of fs/binfmt_elf.c and also remove some pointless
      casts of kmalloc() return values in the same file.
      Signed-off-by: NJesper Juhl <jesper.juhl@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      f4e5cc2c
    • M
      [PATCH] remove steal_locks() · c89681ed
      Miklos Szeredi 提交于
      This patch removes the steal_locks() function.
      
      steal_locks() doesn't work correctly with any filesystem that does it's own
      lock management, including NFS, CIFS, etc.
      
      In addition it has weird semantics on local filesystems in case tasks
      sharing file-descriptor tables are doing POSIX locking operations in
      parallel to execve().
      
      The steal_locks() function has an effect on applications doing:
      
      clone(CLONE_FILES)
        /* in child */
        lock
        execve
        lock
      
      POSIX locks acquired before execve (by "child", "parent" or any further
      task sharing files_struct) will after the execve be owned exclusively by
      "child".
      
      According to Chris Wright some LSB/LTP kind of suite triggers without the
      stealing behavior, but there's no known real-world application that would
      also fail.
      
      Apps using NPTL are not affected, since all other threads are killed before
      execve.
      
      Apps using LinuxThreads are only affected if they
      
        - have multiple threads during exec (LinuxThreads doesn't kill other
          threads, the app may do it with pthread_kill_other_threads_np())
        - rely on POSIX locks being inherited across exec
      
      Both conditions are documented, but not their interaction.
      
      Apps using clone() natively are affected if they
      
        - use clone(CLONE_FILES)
        - rely on POSIX locks being inherited across exec
      
      The above scenarios are unlikely, but possible.
      
      If the patch is vetoed, there's a plan B, that involves mostly keeping the
      weird stealing semantics, but changing the way lock ownership is handled so
      that network and local filesystems work consistently.
      
      That would add more complexity though, so this solution seems to be
      preferred by most people.
      Signed-off-by: NMiklos Szeredi <miklos@szeredi.hu>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: Matthew Wilcox <willy@debian.org>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Steven French <sfrench@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      c89681ed
  24. 26 3月, 2006 2 次提交