1. 07 Mar, 2010 16 commits
    • D
      elf coredump: replace ELF_CORE_EXTRA_* macros by functions · 1fcccbac
      Daisuke HATAYAMA committed
      elf_core_dump() and elf_fdpic_core_dump() use #ifdef and the corresponding
      macros to hide _multiline_ logic inside functions.  This patch removes the
      #ifdef and replaces ELF_CORE_EXTRA_* with corresponding functions.  For
      architectures not implementing ELF_CORE_EXTRA_*, we use weak functions in
      order to limit the scope of the modification.
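
      The weak-function approach can be illustrated with a minimal sketch, assuming a GNU C toolchain (GCC or Clang). The names extra_phdrs_sketch and count_phdrs_sketch are hypothetical and are not the functions added by the patch.

```c
#include <stddef.h>

/*
 * Generic fallback, marked weak: the linker uses it only when no other
 * object file supplies a strong definition of the same symbol.
 * The name extra_phdrs_sketch is hypothetical, not the kernel's.
 */
__attribute__((weak)) size_t extra_phdrs_sketch(void)
{
	return 0;	/* assume this architecture adds no extra headers */
}

/* The caller needs no #ifdef: it always calls the function. */
size_t count_phdrs_sketch(size_t segments)
{
	return segments + extra_phdrs_sketch();
}
```

      An architecture that needs extra program headers would provide a non-weak extra_phdrs_sketch() in its own file, and the linker would pick that one instead.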
      
      This cleanup is for my next patches, but I think this cleanup itself is
      worth doing regardless of my final purpose.
      Signed-off-by: Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Greg Ungerer <gerg@snapgear.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1fcccbac
    • D
      coredump: move dump_write() and dump_seek() into a header file · 088e7af7
      Daisuke HATAYAMA committed
      My next patch will replace the ELF_CORE_EXTRA_* macros with functions and put
      them into newly created *.c files.  Each of those files will then need
      dump_write(), and the copies in each pair of binfmt_*.c and elfcore.c should
      be identical.  So, this patch moves dump_write() into a header file together
      with dump_seek().  It also deletes the confusing DUMP_WRITE macros in each file.
      Signed-off-by: Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Greg Ungerer <gerg@snapgear.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      088e7af7
    • D
      coredump: unify dump_seek() implementations for each binfmt_*.c · 05f47fda
      Daisuke HATAYAMA committed
      The current ELF dumper can produce broken corefiles if the number of program
      headers exceeds 65535.  In particular, programs in 64-bit environments often
      need more than 65535 mmaps.  If you search the web for max_map_count, you can
      find many users facing this problem.
      
      Solaris has already dealt with this issue, and other OSes have also
      adopted the same method as Solaris.  Currently, Sun's documentation and the
      AMD64 ABI describe the extension, which they call Extended Numbering.
      See References for further information.
      
      I believe that the Linux kernel should adopt the same approach, so I've
      written this patch.
      
      I am also preparing for patches of GDB and binutils.
      
      How to fix
      ==========
      
      In the new dumping process, there are two cases according to whether or
      not the number of program headers is equal to or more than 65535.
      
       - if less than 65535, the produced corefile format is exactly the same
         as the ordinary one.
      
       - if equal to or more than 65535, then e_phnum field is set to newly
         introduced constant PN_XNUM(0xffff) and the actual number of program
         headers is set to sh_info field of the section header at index 0.
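
      The two cases above can be sketched as follows. This is a minimal illustration, not the patch's code: the structs are hypothetical, trimmed-down stand-ins that keep only the e_phnum and sh_info fields named in the text, and both function names are invented for the example.

```c
#include <stdint.h>

#define PN_XNUM 0xffff	/* constant named in the commit message */

/* Hypothetical stand-ins for the ELF header and section header 0. */
struct ehdr_sketch { uint16_t e_phnum; };
struct shdr0_sketch { uint32_t sh_info; };

/* Writer side: record the number of program headers. */
void set_phnum_sketch(struct ehdr_sketch *eh, struct shdr0_sketch *sh0,
		      uint32_t n)
{
	if (n < PN_XNUM) {
		/* Ordinary format: the count fits in e_phnum. */
		eh->e_phnum = (uint16_t)n;
	} else {
		/* Extended Numbering: real count goes to sh_info. */
		eh->e_phnum = PN_XNUM;
		sh0->sh_info = n;
	}
}

/* Reader side: recover the real number of program headers. */
uint32_t get_phnum_sketch(const struct ehdr_sketch *eh,
			  const struct shdr0_sketch *sh0)
{
	return eh->e_phnum == PN_XNUM ? sh0->sh_info : eh->e_phnum;
}
```

      A tool that does not know about PN_XNUM would only ever look at e_phnum, which is why the reader-side check is the part userland tools need to add.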
      
      Compatibility Concern
      =====================
      
       * As already mentioned above, Sun and AMD64 have already adopted
         this.  See References.
      
       * There are four combinations according to whether the kernel and userland
         tools are each modified or not.  The following table briefly summarizes
         each combination.
      
                        ---------------------------------------------
                           Original Kernel    |   Modified Kernel
                        ---------------------------------------------
                          < 65535  | >= 65535 | < 65535  | >= 65535
        -------------------------------------------------------------
         Original Tools |    OK    |  broken  |   OK     | broken (#)
        -------------------------------------------------------------
         Modified Tools |    OK    |  broken  |   OK     |    OK
        -------------------------------------------------------------
      
        Note that there is no case that `OK' changes to `broken'.
      
        (#) Although this case remains broken, O-M behaves better than
        O-O. That is, while in O-O case e_phnum field would be extremely
        small due to integer overflow, in O-M case it is guaranteed to be at
        least 65535 by being set to PN_XNUM(0xFFFF), much closer to the
        actual correct value than the O-O case.
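
      The integer overflow in the O-O case can be shown with a small sketch, assuming the count is simply truncated to the 16-bit e_phnum field. The function name is hypothetical.

```c
#include <stdint.h>

/* Store a 32-bit count in a 16-bit field, as an unmodified dumper would. */
uint16_t truncated_phnum_sketch(uint32_t actual)
{
	return (uint16_t)actual;	/* keeps only the low 16 bits */
}
```

      With 70000 program headers the truncated field holds 70000 - 65536 = 4464, far from the real value, whereas PN_XNUM (65535) at least signals a large count.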
      
      Test Program
      ============
      
      Here is a test program mkmmaps.c that is useful to produce the
      corefile with many mmaps. To use this, please take the following
      steps:
      
      $ ulimit -c unlimited
      $ sysctl vm.max_map_count=70000 # default 65530 is too small
      $ sysctl fs.file-max=70000
      $ mkmmaps 65535
      
      Then, the program will abort and a corefile will be generated.
      
      If this fails, there are two cases according to the error message
      displayed.
      
       * ``out of memory'' means vm.max_map_count is still too small
      
       * ``too many open files'' means fs.file-max is still too small
      
      So, please change the relevant setting to a larger value and retry.
      
      mkmmaps.c
      ==
      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/mman.h>
      #include <fcntl.h>
      #include <unistd.h>
      int main(int argc, char **argv)
      {
      	int maps_num;
      	if (argc < 2) {
      		fprintf(stderr, "mkmmaps [number of maps to be created]\n");
      		exit(1);
      	}
      	/* sscanf() returns the number of converted items, so require 1 */
      	if (sscanf(argv[1], "%d", &maps_num) != 1) {
      		fprintf(stderr, "%s is not a number\n", argv[1]);
      		exit(2);
      	}
      	if (maps_num < 0) {
      		fprintf(stderr, "%d is invalid\n", maps_num);
      		exit(3);
      	}
      	for (; maps_num > 0; --maps_num) {
      		if (MAP_FAILED == mmap(NULL, (size_t) 1, PROT_READ,
      					MAP_SHARED | MAP_ANONYMOUS, -1,
      					(off_t) 0)) {
      			perror("mmap");
      			exit(4);
      		}
      	}
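      	/* Report the number of mappings before aborting; code placed
      	 * after abort() would never run. */
      	{
      		char buffer[128];
      		snprintf(buffer, sizeof(buffer), "wc -l /proc/%ld/maps",
      			 (long) getpid());
      		system(buffer);
      	}
      	abort();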
      	return 0;
      }
      
      Tested on i386, ia64 and um/sys-i386.
      Built on sh4 (which covers fs/binfmt_elf_fdpic.c)
      
      References
      ==========
      
       - Sun microsystems: Linker and Libraries.
         Part No: 817-1984-17, September 2008.
         URL: http://docs.sun.com/app/docs/doc/817-1984
      
       - System V ABI AMD64 Architecture Processor Supplement
         Draft Version 0.99., May 11, 2009.
         URL: http://www.x86-64.org/
      
      This patch:
      
      There are three different definitions of dump_seek() in
      binfmt_aout.c, binfmt_elf.c and binfmt_elf_fdpic.c, respectively.  Past
      fixes to dump_seek() have been applied only to binfmt_elf.c.
      
      My next patch will move dump_seek() into a header file in order to share
      the same implementations of dump_write() and dump_seek().  As the first
      step, this patch unifies these three definitions of dump_seek() by applying
      the past commits that have been applied only to binfmt_elf.c.
      
      Specifically, the modification made here is part of the following commits:
      
        * d025c9db
        * 7f14daa1
      
      This patch does not change the shape of corefiles.
      Signed-off-by: Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Greg Ungerer <gerg@snapgear.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      05f47fda
    • A
      proc: warn on non-existing proc entries · 12bac0d9
      Alexey Dobriyan committed
      * warn if creation goes on to non-existent directory
      * warn if removal goes on from non-existing directory
      * warn if non-existing proc entry is removed
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      12bac0d9
    • A
      proc: do translation + unlink atomically at remove_proc_entry() · e17a5765
      Alexey Dobriyan committed
      remove_proc_entry() does
      
      	lock
      	lookup parent
      	unlock
      	lock
      	unlink proc entry from lists
      	unlock
      
      which can be made a bit more correct by doing parent translation + unlink
      without dropping the lock.
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e17a5765
    • A
      fs/compat_ioctl.c: suppress two warnings · 45bf5cd7
      Andrew Morton committed
      fs/compat_ioctl.c: In function 'do_ioctl_trans':
      fs/compat_ioctl.c:534: warning: 'karg' may be used uninitialized in this function
      fs/compat_ioctl.c:533: warning: 'kcmd' may be used uninitialized in this function
      fs/compat_ioctl.c:656: warning: 'ret' may be used uninitialized in this function
      
      Reduces text size by 44 bytes.
      
      If someone calls one of these functions with an unexpected argument, the
      code's buggy as-is.
      
      Amerigo Wang <amwang@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      45bf5cd7
    • D
      lib: build list_sort() only if needed · a069c266
      Don Mullis committed
      Build list_sort() only for configs that need it -- those that don't save
      ~581 bytes (i386).
      Signed-off-by: Don Mullis <don.mullis@gmail.com>
      Cc: Dave Airlie <airlied@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Artem Bityutskiy <dedekind@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a069c266
    • M
      exec: create initial stack independent of PAGE_SIZE · 5ef097dd
      Michael Neuling committed
      Currently we create the initial stack based on the PAGE_SIZE.  This is
      unnecessary.
      
      This patch creates the initial stack independent of the PAGE_SIZE.
      
      It also bumps up the number of 4k pages allocated from 20 to 32, to
      align with 64K page systems.
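
      The arithmetic behind "32 pages of 4k" can be checked with a minimal sketch. The constant and function names are hypothetical; the only figures taken from the message are 32 pages, 4k and 64K.

```c
/* 32 pages of 4 KiB, the figure given in the commit message. */
#define STACK_EXPAND_BYTES_SKETCH (32UL * 4096UL)

/* Pages of a given size needed to cover the fixed byte count (rounded up). */
unsigned long stack_pages_sketch(unsigned long page_size)
{
	return (STACK_EXPAND_BYTES_SKETCH + page_size - 1) / page_size;
}
```

      A fixed size of 131072 bytes is 32 pages on a 4k-page system and exactly 2 pages on a 64K-page system, which the previous figure of 20 4k pages (81920 bytes) would not divide into evenly.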
      Signed-off-by: Michael Neuling <mikey@neuling.org>
      Cc: Helge Deller <deller@gmx.de>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Americo Wang <xiyou.wangcong@gmail.com>
      Cc: Anton Blanchard <anton@samba.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5ef097dd
    • J
      fs: use rlimit helpers · d554ed89
      Jiri Slaby committed
      Make sure the compiler won't do weird things with limits.  E.g.  fetching them
      twice may return 2 different values after writable limits are implemented.
      
      I.e.  either use rlimit helpers added in commit 3e10e716 ("resource:
      add helpers for fetching rlimits") or ACCESS_ONCE if not applicable.
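
      The "fetch once" idea can be sketched in userspace as below, assuming GNU C for __typeof__. The macro mirrors the shape of ACCESS_ONCE; the struct and the function are hypothetical and are not the rlimit helpers from commit 3e10e716.

```c
/* Force a single load through a volatile lvalue (GNU C __typeof__). */
#define ACCESS_ONCE_SKETCH(x) (*(volatile __typeof__(x) *)&(x))

/* Hypothetical stand-in for a resource limit that may change at runtime. */
struct rlimit_sketch {
	unsigned long cur;
	unsigned long max;
};

/* Clamp a request to the current limit, reading the limit exactly once. */
unsigned long clamp_to_limit_sketch(struct rlimit_sketch *rl,
				    unsigned long want)
{
	/* One read into a local; the check and the result use the same value. */
	unsigned long limit = ACCESS_ONCE_SKETCH(rl->cur);

	return want > limit ? limit : want;
}
```

      Without the local copy, the compiler would be free to load rl->cur once for the comparison and again for the return value, and the two loads could disagree if another thread changed the limit in between.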
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d554ed89
    • R
      mm: change anon_vma linking to fix multi-process server scalability issue · 5beb4930
      Rik van Riel committed
      The old anon_vma code can lead to scalability issues with heavily forking
      workloads.  Specifically, each anon_vma will be shared between the parent
      process and all its child processes.
      
      In a workload with 1000 child processes and a VMA with 1000 anonymous
      pages per process that get COWed, this leads to a system with a million
      anonymous pages in the same anon_vma, each of which is mapped in just one
      of the 1000 processes.  However, the current rmap code needs to walk them
      all, leading to O(N) scanning complexity for each page.
      
      This can result in systems where one CPU is walking the page tables of
      1000 processes in page_referenced_one, while all other CPUs are stuck on
      the anon_vma lock.  This leads to catastrophic failure for a benchmark
      like AIM7, where the total number of processes can reach in the tens of
      thousands.  Real workloads are still a factor 10 less process intensive
      than AIM7, but they are catching up.
      
      This patch changes the way anon_vmas and VMAs are linked, which allows us
      to associate multiple anon_vmas with a VMA.  At fork time, each child
      process gets its own anon_vmas, in which its COWed pages will be
      instantiated.  The parents' anon_vma is also linked to the VMA, because
      non-COWed pages could be present in any of the children.
      
      This reduces rmap scanning complexity to O(1) for the pages of the 1000
      child processes, with O(N) complexity for at most 1/N pages in the system.
      This reduces the average scanning cost in heavily forking workloads from
      O(N) to 2.
      
      The only real complexity in this patch stems from the fact that linking a
      VMA to anon_vmas now involves memory allocations.  This means vma_adjust
      can fail, if it needs to attach a VMA to anon_vma structures.  This in
      turn means error handling needs to be added to the calling functions.
      
      A second source of complexity is that, because there can be multiple
      anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
      "the" anon_vma lock.  To prevent the rmap code from walking up an
      incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag.  This bit
      flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
      to make sure it is impossible to compile a kernel that needs both symbolic
      values for the same bitflag.
      
      Some test results:
      
      Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
      box with 16GB RAM and not quite enough IO), the system ends up running
      >99% in system time, with every CPU on the same anon_vma lock in the
      pageout code.
      
      With these changes, AIM7 hits the cross-over point around 29.7k users.
      This happens with ~99% IO wait time, there never seems to be any spike in
      system time.  The anon_vma lock contention appears to be resolved.
      
      [akpm@linux-foundation.org: cleanups]
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5beb4930
    • W
      vfs: take f_lock on modifying f_mode after open time · 42e49608
      Wu Fengguang committed
      We'll introduce FMODE_RANDOM which will be runtime modified.  So protect
      all runtime modification to f_mode with f_lock to avoid races.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: <stable@kernel.org>			[2.6.33.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      42e49608
    • K
      mm: count swap usage · b084d435
      KAMEZAWA Hiroyuki committed
      A frequent question from users about memory management is how many
      swap entries are used by each process.  This information will also give
      some hints to the oom-killer.
      
      Although we can count the number of swapents per process by scanning
      /proc/<pid>/smaps, this is very slow and not suitable for ordinary process
      information tools such as 'ps' or 'top'.  (ps and top are already
      slow enough.)
      
      This patch adds a counter of swapents to mm_counter and updates it at each
      swap event.  The information is exported via the /proc/<pid>/status file as
      
      [kamezawa@bluextal memory]$ cat /proc/self/status
      Name:   cat
      State:  R (running)
      Tgid:   2910
      Pid:    2910
      PPid:   2823
      TracerPid:      0
      Uid:    500     500     500     500
      Gid:    500     500     500     500
      FDSize: 256
      Groups: 500
      VmPeak:    82696 kB
      VmSize:    82696 kB
      VmLck:         0 kB
      VmHWM:       432 kB
      VmRSS:       432 kB
      VmData:      172 kB
      VmStk:        84 kB
      VmExe:        48 kB
      VmLib:      1568 kB
      VmPTE:        40 kB
      VmSwap:        0 kB <=============== this.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b084d435
    • K
      mm: avoid false sharing of mm_counter · 34e55232
      KAMEZAWA Hiroyuki committed
      Considering the nature of per mm stats, it's the shared object among
      threads and can be a cache-miss point in the page fault path.
      
      This patch adds per-thread cache for mm_counter.  RSS value will be
      counted into a struct in task_struct and synchronized with mm's one at
      events.
      
      Now, in this patch, the event is the number of calls to handle_mm_fault.
      Per-thread value is added to mm at each 64 calls.
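
      The per-thread caching scheme can be sketched in userspace with C11 atomics. All names below are hypothetical; only the threshold of 64 events and the idea of folding a per-thread delta into the shared counter come from the message.

```c
#include <stdatomic.h>

#define SYNC_THRESHOLD_SKETCH 64	/* events between synchronizations */

/* Shared per-mm counter, touched by every thread of the process. */
struct mm_sketch {
	atomic_long rss;
};

/* Per-thread state: a private delta plus an event count. */
struct task_sketch {
	struct mm_sketch *mm;
	long cached_rss;
	int events;
};

/* Fold the private delta into the shared counter. */
void sync_rss_sketch(struct task_sketch *t)
{
	if (t->cached_rss) {
		atomic_fetch_add(&t->mm->rss, t->cached_rss);
		t->cached_rss = 0;
	}
	t->events = 0;
}

/* One simulated fault that maps one page: update only thread-local data. */
void fault_sketch(struct task_sketch *t)
{
	t->cached_rss++;
	if (++t->events >= SYNC_THRESHOLD_SKETCH)
		sync_rss_sketch(t);
}
```

      The shared cache line is written once per 64 faults instead of once per fault, at the cost of the shared value lagging behind by up to 63 pages per thread until the next synchronization.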
      
      A rough estimate from a small benchmark with two parallel threads shows
       [before]
           4.5 cache-miss/faults
       [after]
           4.0 cache-miss/faults
      Anyway, the most contended object is mmap_sem if the number of threads grows.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      34e55232
    • K
      mm: clean up mm_counter · d559db08
      KAMEZAWA Hiroyuki committed
      Presently, the per-mm statistics counters are defined by macros in sched.h.
      
      This patch changes them to be
        - defined in mm.h as inline functions
        - stored in an array instead of relying on macro name creation.
      
      This patch reduces the size of a future patch that modifies the
      implementation of the per-mm counters.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d559db08
    • A
      bitops: rename for_each_bit() to for_each_set_bit() · 984b3f57
      Akinobu Mita committed
      Rename for_each_bit to for_each_set_bit in the kernel source tree.  This
      leaves room for for_each_clear_bit(), should that ever be added.
      
      The patch includes a macro to map the old for_each_bit() onto the new
      for_each_set_bit().  This is a (very) temporary thing to ease the migration.
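
      The rename plus compatibility macro can be sketched in userspace as below. This is a simplified illustration that iterates over a single unsigned long rather than a kernel bitmap with an explicit size, and every name ending in _sketch is hypothetical.

```c
#include <limits.h>

#define BITS_PER_LONG_SKETCH ((int)(sizeof(unsigned long) * CHAR_BIT))

/* Index of the first set bit at or after `from`, or the word width if none. */
static int next_set_bit_sketch(unsigned long word, int from)
{
	for (int i = from; i < BITS_PER_LONG_SKETCH; i++)
		if (word & (1UL << i))
			return i;
	return BITS_PER_LONG_SKETCH;
}

/* New, explicit name: visit every bit that is set. */
#define for_each_set_bit_sketch(bit, word)			\
	for ((bit) = next_set_bit_sketch((word), 0);		\
	     (bit) < BITS_PER_LONG_SKETCH;			\
	     (bit) = next_set_bit_sketch((word), (bit) + 1))

/* Old name kept as a temporary alias to ease migration. */
#define for_each_bit_sketch(bit, word) for_each_set_bit_sketch(bit, word)

/* Example user: count the set bits in a word. */
int count_set_bits_sketch(unsigned long word)
{
	int bit, n = 0;

	for_each_set_bit_sketch(bit, word)
		n++;
	return n;
}
```

      Callers still using the old name keep compiling through the alias, which can be deleted once the last of them has been converted.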
      
      [akpm@linux-foundation.org: add temporary for_each_bit()]
      Suggested-by: Alexey Dobriyan <adobriyan@gmail.com>
      Suggested-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Artem Bityutskiy <dedekind@infradead.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      984b3f57
    • A
      Fix a dumb typo - use of & instead of && · 781b1677
      Al Viro committed
      We managed to lose O_DIRECTORY testing due to a stupid typo in commit
      1f36f774 ("Switch !O_CREAT case to use of do_last()")
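
      How this class of typo loses a test can be shown with a generic sketch. It is not the code from the commit: the flag value and both function names are hypothetical stand-ins.

```c
#include <stdbool.h>

/* Hypothetical flag bit standing in for O_DIRECTORY. */
#define WANT_DIR_SKETCH 0x2

/* Intended check: reject when a directory was requested but not found. */
bool reject_logical_sketch(int flags, int not_a_dir)
{
	return (flags & WANT_DIR_SKETCH) && not_a_dir;
}

/* Typo version: bitwise AND of 0x2 and 1 is 0, so nothing is rejected. */
bool reject_bitwise_sketch(int flags, int not_a_dir)
{
	return (flags & WANT_DIR_SKETCH) & not_a_dir;
}
```

      With flags = 0x2 and not_a_dir = 1, the logical version rejects the request while the bitwise version silently accepts it, because the two operands share no set bits.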
      Reported-by: Walter Sheets <w41ter@gmail.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      781b1677
  2. 06 Mar, 2010 15 commits
  3. 05 Mar, 2010 9 commits