1. 23 9月, 2009 23 次提交
    • K
      kcore: add kclist types · c30bb2a2
      KAMEZAWA Hiroyuki 提交于
      Presently, kclist_add() only eats start address and size as its arguments.
      Considering to make kclist dynamically reconfigulable, it's necessary to
      know which kclists are for System RAM and which are not.
      
      This patch add kclist types as
        KCORE_RAM
        KCORE_VMALLOC
        KCORE_TEXT
        KCORE_OTHER
      
      This "type" is used in a patch following this for detecting KCORE_RAM.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: WANG Cong <xiyou.wangcong@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c30bb2a2
    • K
      kcore: use usual list for kclist · 2ef43ec7
      KAMEZAWA Hiroyuki 提交于
      This patchset is for /proc/kcore.  With this,
      
       - many per-arch hooks are removed.
      
       - /proc/kcore will know really valid physical memory area.
      
       - /proc/kcore will be aware of memory hotplug.
      
       - /proc/kcore will be architecture independent i.e.
         if an arch supports CONFIG_MMU, it can use /proc/kcore.
         (if the arch uses usual memory layout.)
      
      This patch:
      
      /proc/kcore uses its own list handling codes. It's better to use
      generic list codes.
      
      No changes in logic. just clean up.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: WANG Cong <xiyou.wangcong@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2ef43ec7
    • S
      procfs: provide stack information for threads · d899bf7b
      Stefani Seibold 提交于
      A patch to give a better overview of the userland application stack usage,
      especially for embedded linux.
      
      Currently you are only able to dump the main process/thread stack usage
      which is showed in /proc/pid/status by the "VmStk" Value.  But you get no
      information about the consumed stack memory of the the threads.
      
      There is an enhancement in the /proc/<pid>/{task/*,}/*maps and which marks
      the vm mapping where the thread stack pointer reside with "[thread stack
      xxxxxxxx]".  xxxxxxxx is the maximum size of stack.  This is a value
      information, because libpthread doesn't set the start of the stack to the
      top of the mapped area, depending of the pthread usage.
      
      A sample output of /proc/<pid>/task/<tid>/maps looks like:
      
      08048000-08049000 r-xp 00000000 03:00 8312       /opt/z
      08049000-0804a000 rw-p 00001000 03:00 8312       /opt/z
      0804a000-0806b000 rw-p 00000000 00:00 0          [heap]
      a7d12000-a7d13000 ---p 00000000 00:00 0
      a7d13000-a7f13000 rw-p 00000000 00:00 0          [thread stack: 001ff4b4]
      a7f13000-a7f14000 ---p 00000000 00:00 0
      a7f14000-a7f36000 rw-p 00000000 00:00 0
      a7f36000-a8069000 r-xp 00000000 03:00 4222       /lib/libc.so.6
      a8069000-a806b000 r--p 00133000 03:00 4222       /lib/libc.so.6
      a806b000-a806c000 rw-p 00135000 03:00 4222       /lib/libc.so.6
      a806c000-a806f000 rw-p 00000000 00:00 0
      a806f000-a8083000 r-xp 00000000 03:00 14462      /lib/libpthread.so.0
      a8083000-a8084000 r--p 00013000 03:00 14462      /lib/libpthread.so.0
      a8084000-a8085000 rw-p 00014000 03:00 14462      /lib/libpthread.so.0
      a8085000-a8088000 rw-p 00000000 00:00 0
      a8088000-a80a4000 r-xp 00000000 03:00 8317       /lib/ld-linux.so.2
      a80a4000-a80a5000 r--p 0001b000 03:00 8317       /lib/ld-linux.so.2
      a80a5000-a80a6000 rw-p 0001c000 03:00 8317       /lib/ld-linux.so.2
      afaf5000-afb0a000 rw-p 00000000 00:00 0          [stack]
      ffffe000-fffff000 r-xp 00000000 00:00 0          [vdso]
      
      Also there is a new entry "stack usage" in /proc/<pid>/{task/*,}/status
      which will you give the current stack usage in kb.
      
      A sample output of /proc/self/status looks like:
      
      Name:	cat
      State:	R (running)
      Tgid:	507
      Pid:	507
      .
      .
      .
      CapBnd:	fffffffffffffeff
      voluntary_ctxt_switches:	0
      nonvoluntary_ctxt_switches:	0
      Stack usage:	12 kB
      
      I also fixed stack base address in /proc/<pid>/{task/*,}/stat to the base
      address of the associated thread stack and not the one of the main
      process.  This makes more sense.
      
      [akpm@linux-foundation.org: fs/proc/array.c now needs walk_page_range()]
      Signed-off-by: NStefani Seibold <stefani@seibold.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d899bf7b
    • N
      mmc: make SDIO device/driver struct accessors public · 996ad568
      Nicolas Pitre 提交于
      Especially with the PM framework, those are quite handy to have in driver
      code too.
      Signed-off-by: NNicolas Pitre <nico@marvell.com>
      Cc: <linux-mmc@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      996ad568
    • O
      sdio: add MMC_QUIRK_LENIENT_FN0 · 7c979ec7
      Ohad Ben-Cohen 提交于
      Normally writes to SDIO function 0 outside the vendor specific CCCR
      registers are prohibited.
      
      To support embedded devices that require writes to SDIO function 0 outside
      this range (e.g.  TI WL127x embedded sdio wifi device),
      MMC_QUIRK_LENIENT_FN0 is introduced.
      
      A card quirks field is added to `struct mmc_card' to support non-standard
      devices (e.g.  embedded sdio devices).
      
      [akpm@linux-foundation.org: code in C, not cpp!]
      Signed-off-by: NOhad Ben-Cohen <ohad@wizery.com>
      Cc: <linux-mmc@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7c979ec7
    • O
      sdio: add CD disable support · 006ebd5d
      Ohad Ben-Cohen 提交于
      Add support to disconnect the pull-up resistor on CD/DAT[3] (pin 1)
      of the card. This may be desired on certain setups of boards,
      controllers and embedded sdio devices which do not need the card's
      pull-up. As a result, card detection is disabled and power is saved.
      
      [akpm@linux-foundation.org: simplify sdio_disable_cd() a bit]
      Signed-off-by: NOhad Ben-Cohen <ohad@wizery.com>
      Acked-by: NMatt Fleming <matt@console-pimps.org>
      Cc: Ian Molton <ian@mnementh.co.uk>
      Cc: "Roberto A. Foglietta" <roberto.foglietta@gmail.com>
      Cc: Philip Langdale <philipl@overt.org>
      Cc: Pierre Ossman <pierre@ossman.eu>
      Cc: David Vrabel <david.vrabel@csr.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      006ebd5d
    • A
      mmc: check status after MMC SWITCH command · ef0b27d4
      Adrian Hunter 提交于
      According to the standard, the SWITCH command should be followed by a
      SEND_STATUS command to check for errors.
      Signed-off-by: NAdrian Hunter <adrian.hunter@nokia.com>
      Acked-by: NMatt Fleming <matt@console-pimps.org>
      Cc: Ian Molton <ian@mnementh.co.uk>
      Cc: "Roberto A. Foglietta" <roberto.foglietta@gmail.com>
      Cc: Jarkko Lavinen <jarkko.lavinen@nokia.com>
      Cc: Denis Karpov <ext-denis.2.karpov@nokia.com>
      Cc: Pierre Ossman <pierre@ossman.eu>
      Cc: Philip Langdale <philipl@overt.org>
      Cc: "Madhusudhan" <madhu.cr@ti.com>
      Cc: <linux-mmc@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ef0b27d4
    • J
      mmc: add mmc card sleep and awake support · b1ebe384
      Jarkko Lavinen 提交于
      Add support for the new MMC command SLEEP_AWAKE.
      Signed-off-by: NJarkko Lavinen <jarkko.lavinen@nokia.com>
      Signed-off-by: NAdrian Hunter <adrian.hunter@nokia.com>
      Acked-by: NMatt Fleming <matt@console-pimps.org>
      Cc: Ian Molton <ian@mnementh.co.uk>
      Cc: "Roberto A. Foglietta" <roberto.foglietta@gmail.com>
      Cc: Jarkko Lavinen <jarkko.lavinen@nokia.com>
      Cc: Denis Karpov <ext-denis.2.karpov@nokia.com>
      Cc: Pierre Ossman <pierre@ossman.eu>
      Cc: Philip Langdale <philipl@overt.org>
      Cc: "Madhusudhan" <madhu.cr@ti.com>
      Cc: <linux-mmc@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b1ebe384
    • A
      mmc: add ability to save power by powering off cards · eae1aeee
      Adrian Hunter 提交于
      Power can be saved by powering off cards that are not in use.  This is
      similar to suspend / resume except it is under the control of the driver,
      and does not require any power management support.  It can only be used
      when the driver can monitor whether the card is removed, otherwise it is
      unsafe.  This is possible because, unlike suspend, the driver still
      receives card detect and / or cover switch interrupts.
      Signed-off-by: NAdrian Hunter <adrian.hunter@nokia.com>
      Acked-by: NMatt Fleming <matt@console-pimps.org>
      Cc: Ian Molton <ian@mnementh.co.uk>
      Cc: "Roberto A. Foglietta" <roberto.foglietta@gmail.com>
      Cc: Jarkko Lavinen <jarkko.lavinen@nokia.com>
      Cc: Denis Karpov <ext-denis.2.karpov@nokia.com>
      Cc: Pierre Ossman <pierre@ossman.eu>
      Cc: Philip Langdale <philipl@overt.org>
      Cc: "Madhusudhan" <madhu.cr@ti.com>
      Cc: <linux-mmc@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      eae1aeee
    • A
      mmc: add MMC_CAP_NONREMOVABLE host capability · 9feae246
      Adrian Hunter 提交于
      eMMC's are not removable, so unsafe resume is OK always.
      
      To permit this a new host capability MMC_CAP_NONREMOVABLE has been added
      and suspend / resume updated accordingly.
      Signed-off-by: NAdrian Hunter <adrian.hunter@nokia.com>
      Acked-by: NMatt Fleming <matt@console-pimps.org>
      Cc: Ian Molton <ian@mnementh.co.uk>
      Cc: "Roberto A. Foglietta" <roberto.foglietta@gmail.com>
      Cc: Jarkko Lavinen <jarkko.lavinen@nokia.com>
      Cc: Denis Karpov <ext-denis.2.karpov@nokia.com>
      Cc: Pierre Ossman <pierre@ossman.eu>
      Cc: Philip Langdale <philipl@overt.org>
      Cc: "Madhusudhan" <madhu.cr@ti.com>
      Cc: <linux-mmc@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9feae246
    • A
      mmc: allow host claim / release nesting · 319a3f14
      Adrian Hunter 提交于
      This change allows the MMC host to be claimed in situations where the host
      may or may not have already been claimed.  Also 'mmc_try_claim_host()' is
      now exported.
      Signed-off-by: NAdrian Hunter <adrian.hunter@nokia.com>
      Acked-by: NMatt Fleming <matt@console-pimps.org>
      Cc: Ian Molton <ian@mnementh.co.uk>
      Cc: "Roberto A. Foglietta" <roberto.foglietta@gmail.com>
      Cc: Jarkko Lavinen <jarkko.lavinen@nokia.com>
      Cc: Denis Karpov <ext-denis.2.karpov@nokia.com>
      Cc: Pierre Ossman <pierre@ossman.eu>
      Cc: Philip Langdale <philipl@overt.org>
      Cc: "Madhusudhan" <madhu.cr@ti.com>
      Cc: <linux-mmc@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      319a3f14
    • A
      mmc: add 'enable' and 'disable' methods to mmc host · 8ea926b2
      Adrian Hunter 提交于
      MMC hosts that support power saving can use the 'enable' and 'disable'
      methods to exit and enter power saving states.  An explanation of their
      use is provided in the comments added to include/linux/mmc/host.h.
      Signed-off-by: NAdrian Hunter <adrian.hunter@nokia.com>
      Acked-by: NMatt Fleming <matt@console-pimps.org>
      Cc: Ian Molton <ian@mnementh.co.uk>
      Cc: "Roberto A. Foglietta" <roberto.foglietta@gmail.com>
      Cc: Jarkko Lavinen <jarkko.lavinen@nokia.com>
      Cc: Denis Karpov <ext-denis.2.karpov@nokia.com>
      Cc: Pierre Ossman <pierre@ossman.eu>
      Cc: Philip Langdale <philipl@overt.org>
      Cc: "Madhusudhan" <madhu.cr@ti.com>
      Cc: <linux-mmc@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8ea926b2
    • J
      getrusage: fill ru_maxrss value · 1f10206c
      Jiri Pirko 提交于
      Make ->ru_maxrss value in struct rusage filled accordingly to rss hiwater
      mark.  This struct is filled as a parameter to getrusage syscall.
      ->ru_maxrss value is set to KBs which is the way it is done in BSD
      systems.  /usr/bin/time (gnu time) application converts ->ru_maxrss to KBs
      which seems to be incorrect behavior.  Maintainer of this util was
      notified by me with the patch which corrects it and cc'ed.
      
      To make this happen we extend struct signal_struct by two fields.  The
      first one is ->maxrss which we use to store rss hiwater of the task.  The
      second one is ->cmaxrss which we use to store highest rss hiwater of all
      task childs.  These values are used in k_getrusage() to actually fill
      ->ru_maxrss.  k_getrusage() uses current rss hiwater value directly if mm
      struct exists.
      
      Note:
      exec() clear mm->hiwater_rss, but doesn't clear sig->maxrss.
      it is intetionally behavior. *BSD getrusage have exec() inheriting.
      
      test programs
      ========================================================
      
      getrusage.c
      ===========
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <sys/types.h>
       #include <sys/time.h>
       #include <sys/resource.h>
       #include <sys/types.h>
       #include <sys/wait.h>
       #include <unistd.h>
       #include <signal.h>
       #include <sys/mman.h>
      
       #include "common.h"
      
       #define err(str) perror(str), exit(1)
      
      int main(int argc, char** argv)
      {
      	int status;
      
      	printf("allocate 100MB\n");
      	consume(100);
      
      	printf("testcase1: fork inherit? \n");
      	printf("  expect: initial.self ~= child.self\n");
      	show_rusage("initial");
      	if (__fork()) {
      		wait(&status);
      	} else {
      		show_rusage("fork child");
      		_exit(0);
      	}
      	printf("\n");
      
      	printf("testcase2: fork inherit? (cont.) \n");
      	printf("  expect: initial.children ~= 100MB, but child.children = 0\n");
      	show_rusage("initial");
      	if (__fork()) {
      		wait(&status);
      	} else {
      		show_rusage("child");
      		_exit(0);
      	}
      	printf("\n");
      
      	printf("testcase3: fork + malloc \n");
      	printf("  expect: child.self ~= initial.self + 50MB\n");
      	show_rusage("initial");
      	if (__fork()) {
      		wait(&status);
      	} else {
      		printf("allocate +50MB\n");
      		consume(50);
      		show_rusage("fork child");
      		_exit(0);
      	}
      	printf("\n");
      
      	printf("testcase4: grandchild maxrss\n");
      	printf("  expect: post_wait.children ~= 300MB\n");
      	show_rusage("initial");
      	if (__fork()) {
      		wait(&status);
      		show_rusage("post_wait");
      	} else {
      		system("./child -n 0 -g 300");
      		_exit(0);
      	}
      	printf("\n");
      
      	printf("testcase5: zombie\n");
      	printf("  expect: pre_wait ~= initial, IOW the zombie process is not accounted.\n");
      	printf("          post_wait ~= 400MB, IOW wait() collect child's max_rss. \n");
      	show_rusage("initial");
      	if (__fork()) {
      		sleep(1); /* children become zombie */
      		show_rusage("pre_wait");
      		wait(&status);
      		show_rusage("post_wait");
      	} else {
      		system("./child -n 400");
      		_exit(0);
      	}
      	printf("\n");
      
      	printf("testcase6: SIG_IGN\n");
      	printf("  expect: initial ~= after_zombie (child's 500MB alloc should be ignored).\n");
      	show_rusage("initial");
      	signal(SIGCHLD, SIG_IGN);
      	if (__fork()) {
      		sleep(1); /* children become zombie */
      		show_rusage("after_zombie");
      	} else {
      		system("./child -n 500");
      		_exit(0);
      	}
      	printf("\n");
      	signal(SIGCHLD, SIG_DFL);
      
      	printf("testcase7: exec (without fork) \n");
      	printf("  expect: initial ~= exec \n");
      	show_rusage("initial");
      	execl("./child", "child", "-v", NULL);
      
      	return 0;
      }
      
      child.c
      =======
       #include <sys/types.h>
       #include <unistd.h>
       #include <sys/types.h>
       #include <sys/wait.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <sys/types.h>
       #include <sys/time.h>
       #include <sys/resource.h>
      
       #include "common.h"
      
      int main(int argc, char** argv)
      {
      	int status;
      	int c;
      	long consume_size = 0;
      	long grandchild_consume_size = 0;
      	int show = 0;
      
      	while ((c = getopt(argc, argv, "n:g:v")) != -1) {
      		switch (c) {
      		case 'n':
      			consume_size = atol(optarg);
      			break;
      		case 'v':
      			show = 1;
      			break;
      		case 'g':
      
      			grandchild_consume_size = atol(optarg);
      			break;
      		default:
      			break;
      		}
      	}
      
      	if (show)
      		show_rusage("exec");
      
      	if (consume_size) {
      		printf("child alloc %ldMB\n", consume_size);
      		consume(consume_size);
      	}
      
      	if (grandchild_consume_size) {
      		if (fork()) {
      			wait(&status);
      		} else {
      			printf("grandchild alloc %ldMB\n", grandchild_consume_size);
      			consume(grandchild_consume_size);
      
      			exit(0);
      		}
      	}
      
      	return 0;
      }
      
      common.c
      ========
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <sys/types.h>
       #include <sys/time.h>
       #include <sys/resource.h>
       #include <sys/types.h>
       #include <sys/wait.h>
       #include <unistd.h>
       #include <signal.h>
       #include <sys/mman.h>
      
       #include "common.h"
       #define err(str) perror(str), exit(1)
      
      void show_rusage(char *prefix)
      {
          	int err, err2;
          	struct rusage rusage_self;
          	struct rusage rusage_children;
      
          	printf("%s: ", prefix);
          	err = getrusage(RUSAGE_SELF, &rusage_self);
          	if (!err)
          		printf("self %ld ", rusage_self.ru_maxrss);
          	err2 = getrusage(RUSAGE_CHILDREN, &rusage_children);
          	if (!err2)
          		printf("children %ld ", rusage_children.ru_maxrss);
      
          	printf("\n");
      }
      
      /* Some buggy OS need this worthless CPU waste. */
      void make_pagefault(void)
      {
      	void *addr;
      	int size = getpagesize();
      	int i;
      
      	for (i=0; i<1000; i++) {
      		addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
      		if (addr == MAP_FAILED)
      			err("make_pagefault");
      		memset(addr, 0, size);
      		munmap(addr, size);
      	}
      }
      
      void consume(int mega)
      {
          	size_t sz = mega * 1024 * 1024;
          	void *ptr;
      
          	ptr = malloc(sz);
          	memset(ptr, 0, sz);
      	make_pagefault();
      }
      
      pid_t __fork(void)
      {
      	pid_t pid;
      
      	pid = fork();
      	make_pagefault();
      
      	return pid;
      }
      
      common.h
      ========
      void show_rusage(char *prefix);
      void make_pagefault(void);
      void consume(int mega);
      pid_t __fork(void);
      
      FreeBSD result (expected result)
      ========================================================
      allocate 100MB
      testcase1: fork inherit?
        expect: initial.self ~= child.self
      initial: self 103492 children 0
      fork child: self 103540 children 0
      
      testcase2: fork inherit? (cont.)
        expect: initial.children ~= 100MB, but child.children = 0
      initial: self 103540 children 103540
      child: self 103564 children 0
      
      testcase3: fork + malloc
        expect: child.self ~= initial.self + 50MB
      initial: self 103564 children 103564
      allocate +50MB
      fork child: self 154860 children 0
      
      testcase4: grandchild maxrss
        expect: post_wait.children ~= 300MB
      initial: self 103564 children 154860
      grandchild alloc 300MB
      post_wait: self 103564 children 308720
      
      testcase5: zombie
        expect: pre_wait ~= initial, IOW the zombie process is not accounted.
                post_wait ~= 400MB, IOW wait() collect child's max_rss.
      initial: self 103564 children 308720
      child alloc 400MB
      pre_wait: self 103564 children 308720
      post_wait: self 103564 children 411312
      
      testcase6: SIG_IGN
        expect: initial ~= after_zombie (child's 500MB alloc should be ignored).
      initial: self 103564 children 411312
      child alloc 500MB
      after_zombie: self 103624 children 411312
      
      testcase7: exec (without fork)
        expect: initial ~= exec
      initial: self 103624 children 411312
      exec: self 103624 children 411312
      
      Linux result (actual test result)
      ========================================================
      allocate 100MB
      testcase1: fork inherit?
        expect: initial.self ~= child.self
      initial: self 102848 children 0
      fork child: self 102572 children 0
      
      testcase2: fork inherit? (cont.)
        expect: initial.children ~= 100MB, but child.children = 0
      initial: self 102876 children 102644
      child: self 102572 children 0
      
      testcase3: fork + malloc
        expect: child.self ~= initial.self + 50MB
      initial: self 102876 children 102644
      allocate +50MB
      fork child: self 153804 children 0
      
      testcase4: grandchild maxrss
        expect: post_wait.children ~= 300MB
      initial: self 102876 children 153864
      grandchild alloc 300MB
      post_wait: self 102876 children 307536
      
      testcase5: zombie
        expect: pre_wait ~= initial, IOW the zombie process is not accounted.
                post_wait ~= 400MB, IOW wait() collect child's max_rss.
      initial: self 102876 children 307536
      child alloc 400MB
      pre_wait: self 102876 children 307536
      post_wait: self 102876 children 410076
      
      testcase6: SIG_IGN
        expect: initial ~= after_zombie (child's 500MB alloc should be ignored).
      initial: self 102876 children 410076
      child alloc 500MB
      after_zombie: self 102880 children 410076
      
      testcase7: exec (without fork)
        expect: initial ~= exec
      initial: self 102880 children 410076
      exec: self 102880 children 410076
      Signed-off-by: NJiri Pirko <jpirko@redhat.com>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1f10206c
    • R
      Make sure the value in abs() does not get truncated if it is greater than 2^32 · a49c59c0
      Rolf Eike Beer 提交于
      abs() will truncate the input if is it outside the 2^32 range.  Fix that
      by assuming `long' input.
      
      This might generate worse code in the common case.
      Signed-off-by: NRolf Eike Beer <eike-kernel@sf-tec.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a49c59c0
    • D
      anonfd: split interface into file creation and install · 562787a5
      Davide Libenzi 提交于
      Split the anonfd interface into a bare file pointer creation one, and a
      file pointer creation plus install one.
      
      There are cases, like the usage of eventfds inside other kernel
      interfaces, where the file pointer created by anonfd needs to be used
      inside the initialization of other structures.
      
      As it is right now, as soon as anon_inode_getfd() returns, the kenrle can
      race with userspace closing the newly installed file descriptor.
      
      This patch, while keeping the old anon_inode_getfd(), introduces a new
      anon_inode_getfile() (whose services are reused in anon_inode_getfd())
      that allows to split the file creation phase and the fd install one.
      
      Once all the kernel structures are initialized, the code can call the
      proper fd_install().
      
      Gregory manifested the need for something like this inside KVM.
      Signed-off-by: NDavide Libenzi <davidel@xmailserver.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: James Morris <jmorris@namei.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Gregory Haskins <ghaskins@novell.com>
      Acked-by: NSerge Hallyn <serue@us.ibm.com>
      Acked-by: NRoland Dreier <rolandd@cisco.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      562787a5
    • J
      BUILD_BUG_ON(): fix it and a couple of bogus uses of it · 8c87df45
      Jan Beulich 提交于
      gcc permitting variable length arrays makes the current construct used for
      BUILD_BUG_ON() useless, as that doesn't produce any diagnostic if the
      controlling expression isn't really constant.  Instead, this patch makes
      it so that a bit field gets used here.  Consequently, those uses where the
      condition isn't really constant now also need fixing.
      
      Note that in the gfp.h, kmemcheck.h, and virtio_config.h cases
      MAYBE_BUILD_BUG_ON() really just serves documentation purposes - even if
      the expression is compile time constant (__builtin_constant_p() yields
      true), the array is still deemed of variable length by gcc, and hence the
      whole expression doesn't have the intended effect.
      
      [akpm@linux-foundation.org: make arch/sparc/include/asm/vio.h compile]
      [akpm@linux-foundation.org: more nonsensical assertions in tpm.c..]
      Signed-off-by: NJan Beulich <jbeulich@novell.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Rajiv Andrade <srajiv@linux.vnet.ibm.com>
      Cc: Mimi Zohar <zohar@us.ibm.com>
      Cc: James Morris <jmorris@namei.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8c87df45
    • R
      printk_once(): use bool for boolean flag · 70867453
      Roland Dreier 提交于
      Using the type bool (instead of int) for the __print_once flag in the
      printk_once() macro matches the intent of the code better, and allows the
      compiler to generate smaller code; eg a typical callsite with gcc 4.3.3 on
      i386:
      
      add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-6 (-6)
      function                                     old     new   delta
      static.__print_once                            4       1      -3
      get_cpu_vendor                               146     143      -3
      
      Saving 6 bytes of object size per callsite by slightly improving the
      readability of the source seems like a win to me.
      Signed-off-by: NRoland Dreier <rolandd@cisco.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      70867453
    • S
      proc connector: add event for process becoming session leader · 02b51df1
      Scott James Remnant 提交于
      The act of a process becoming a session leader is a useful signal to a
      supervising init daemon such as Upstart.
      
      While a daemon will normally do this as part of the process of becoming a
      daemon, it is rare for its children to do so.  When the children do, it is
      nearly always a sign that the child should be considered detached from the
      parent and not supervised along with it.
      
      The poster-child example is OpenSSH; the per-login children call setsid()
      so that they may control the pty connected to them.  If the primary daemon
      dies or is restarted, we do not want to consider the per-login children
      and want to respawn the primary daemon without killing the children.
      
      This patch adds a new PROC_SID_EVENT and associated structure to the
      proc_event event_data union, it arranges for this to be emitted when the
      special PIDTYPE_SID pid is set.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NScott James Remnant <scott@ubuntu.com>
      Acked-by: NMatt Helsley <matthltc@us.ibm.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
      Acked-by: N"David S. Miller" <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      02b51df1
    • J
      seq_file: constify seq_operations · 88e9d34c
      James Morris 提交于
      Make all seq_operations structs const, to help mitigate against
      revectoring user-triggerable function pointers.
      
      This is derived from the grsecurity patch, although generated from scratch
      because it's simpler than extracting the changes from there.
      Signed-off-by: NJames Morris <jmorris@namei.org>
      Acked-by: NSerge Hallyn <serue@us.ibm.com>
      Acked-by: NCasey Schaufler <casey@schaufler-ca.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      88e9d34c
    • X
      generic-ipi: make struct call_function_data lockless · 54fdade1
      Xiao Guangrong 提交于
      This patch can remove spinlock from struct call_function_data, the
      reasons are below:
      
      1: add a new interface for cpumask named cpumask_test_and_clear_cpu(),
         it can atomically test and clear specific cpu, we can use it instead
         of cpumask_test_cpu() and cpumask_clear_cpu() and no need data->lock
         to protect those in generic_smp_call_function_interrupt().
      
      2: in smp_call_function_many(), after csd_lock() return, the current's
         cfd_data is deleted from call_function list, so it not have race
         between other cpus, then cfs_data is only used in
         smp_call_function_many() that must disable preemption and not from
         a hardware interrupthandler or from a bottom half handler to call,
         only the correspond cpu can use it, so it not have race in current
         cpu, no need cfs_data->lock to protect it.
      
      3: after 1 and 2, cfs_data->lock is only use to protect cfs_data->refs in
         generic_smp_call_function_interrupt(), so we can define cfs_data->refs
         to atomic_t, and no need cfs_data->lock any more.
      Signed-off-by: NXiao Guangrong <xiaoguangrong@cn.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Acked-by: NRusty Russell <rusty@rustcorp.com.au>
      [akpm@linux-foundation.org: use atomic_dec_return()]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      54fdade1
    • N
      Move magic numbers into magic.h · 1fd7317d
      Nick Black 提交于
      Move various magic-number definitions into magic.h.
      Signed-off-by: NNick Black <dank@qemfd.net>
      Acked-by: NPekka Enberg <penberg@cs.helsinki.fi>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Casey Schaufler <casey@schaufler-ca.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1fd7317d
    • D
      printk: add printk_delay to make messages readable for some scenarios · af91322e
      Dave Young 提交于
      When syslog is not possible, at the same time there's no serial/net
      console available, it will be hard to read the printk messages.  For
      example oops/panic/warning messages in shutdown phase.
      
      Add a printk delay feature, we can make each printk message delay some
      milliseconds.
      
      Setting the delay by proc/sysctl interface: /proc/sys/kernel/printk_delay
      
      The value range from 0 - 10000, default value is 0
      
      [akpm@linux-foundation.org: fix a few things]
      Signed-off-by: NDave Young <hidave.darkstar@gmail.com>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      af91322e
    • A
      include/linux/kmemcheck.h: fix a trillion warnings · fa081b00
      Andrew Morton 提交于
      of the form
      
      include/net/inet_sock.h:208: warning: ISO C90 forbids mixed declarations and code
      
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Acked-by: NVegard Nossum <vegard.nossum@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fa081b00
  2. 22 9月, 2009 17 次提交
    • D
      pnp: add a shutdown method to pnp drivers · abd6633c
      David Härdeman 提交于
      The shutdown method is used by the winbond cir driver to setup the
      hardware for wake-from-S5.
      Signed-off-by: NBjorn Helgaas <bjorn.helgaas@hp.com>
      Signed-off-by: NDavid Härdeman <david@hardeman.nu>
      Cc: Dmitry Torokhov <dtor@mail.ru>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      abd6633c
    • D
      lis3: add free-fall/wakeup function via platform_data · 8873c334
      Daniel Mack 提交于
      This offers a way for platforms to define flags and thresholds for the
      free-fall/wakeup functions of the lis302d chips.
      
      More registers needed to be seperated as they are specific to the
      Signed-off-by: NDaniel Mack <daniel@caiaq.de>
      Acked-by: NPavel Machek <pavel@ucw.cz>
      Cc: Eric Piel <eric.piel@tremplin-utc.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8873c334
    • D
      lis3: fix typo · 0ec48915
      Daniel Mack 提交于
      Bit 0x80 in CTRL_REG3 is an ACTIVE_LOW rather than an ACTIVE_HIGH
      function, I got that wrong during my last change.
      Signed-off-by: NDaniel Mack <daniel@caiaq.de>
      Acked-by: NPavel Machek <pavel@ucw.cz>
      Cc: Eric Piel <eric.piel@tremplin-utc.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0ec48915
    • D
      flex_array: introduce DEFINE_FLEX_ARRAY · 45b588d6
      David Rientjes 提交于
      FLEX_ARRAY_INIT(element_size, total_nr_elements) cannot determine if
      either parameter is valid, so flex arrays which are statically allocated
      with this interface can easily become corrupted or reference beyond its
      allocated memory.
      
      This removes FLEX_ARRAY_INIT() as a struct flex_array initializer since no
      initializer may perform the required checking.  Instead, the array is now
      defined with a new interface:
      
      	DEFINE_FLEX_ARRAY(name, element_size, total_nr_elements)
      
      This may be prefixed with `static' for file scope.
      
      This interface includes compile-time checking of the parameters to ensure
      they are valid.  Since the validity of both element_size and
      total_nr_elements depend on FLEX_ARRAY_BASE_SIZE and FLEX_ARRAY_PART_SIZE,
      the kernel build will fail if either of these predefined values changes
      such that the array parameters are no longer valid.
      
      Since BUILD_BUG_ON() requires compile time constants, several of the
      static inline functions that were once local to lib/flex_array.c had to be
      moved to include/linux/flex_array.h.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NDave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      45b588d6
    • D
      flex_array: add flex_array_shrink function · 4af5a2f7
      David Rientjes 提交于
      Add a new function to the flex_array API:
      
      	int flex_array_shrink(struct flex_array *fa)
      
      This function will free all unused second-level pages.  Since elements are
      now poisoned if they are not allocated with __GFP_ZERO, it's possible to
      identify parts that consist solely of unused elements.
      
      flex_array_shrink() returns the number of pages freed.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4af5a2f7
    • D
      flex_array: poison free elements · 19da3dd1
      David Rientjes 提交于
      Newly initialized flex_array's and/or flex_array_part's are now poisoned
      with a new poison value, FLEX_ARRAY_FREE.  It's value is similar to
      POISON_FREE used in the various slab allocators, but is different to
      distinguish between flex array's poisoned kmem and slab allocator poisoned
      kmem.
      
      This will allow us to identify flex_array_part's that only contain free
      elements (and free them with an addition to the flex_array API).  This
      could also be extended in the future to identify `get' uses on elements
      that have not been `put'.
      
      If __GFP_ZERO is passed for a part's gfp mask, the poisoning is avoided.
      These elements are considered to be in-use since they have been
      initialized.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      19da3dd1
    • D
      flex_array: add flex_array_clear function · e6de3988
      David Rientjes 提交于
      Add a new function to the flex_array API:
      
      	int flex_array_clear(struct flex_array *fa,
      				unsigned int element_nr)
      
      This function will zero the element at element_nr in the flex_array.
      
      Although this is equivalent to using flex_array_put() and passing a
      pointer to zero'd memory, flex_array_clear() does not require such a
      pointer to memory that would most likely need to be allocated on the
      caller's stack which could be significantly large depending on
      element_size.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e6de3988
    • A
      cpuidle: fix the menu governor to boost IO performance · 69d25870
      Arjan van de Ven 提交于
      Fix the menu idle governor which balances power savings, energy efficiency
      and performance impact.
      
      The reason for a reworked governor is that there have been serious
      performance issues reported with the existing code on Nehalem server
      systems.
      
      To show this I'm sure Andrew wants to see benchmark results:
      (benchmark is "fio", "no cstates" is using "idle=poll")
      
      		no cstates	current linux	new algorithm
      1 disk		107 Mb/s	85 Mb/s		105 Mb/s
      2 disks		215 Mb/s	123 Mb/s	209 Mb/s
      12 disks	590 Mb/s	320 Mb/s	585 Mb/s
      
      In various power benchmark measurements, no degredation was found by our
      measurement&diagnostics team.  Obviously a small percentage more power was
      used in the "fio" benchmark, due to the much higher performance.
      
      While it would be a novel idea to describe the new algorithm in this
      commit message, I cheaped out and described it in comments in the code
      instead.
      
      [changes since first post: spelling fixes from akpm, review feedback,
      folded menu-tng into menu.c]
      Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
      Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Yanmin Zhang <yanmin_zhang@linux.intel.com>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      69d25870
    • M
      mm: move use_mm/unuse_mm from aio.c to mm/ · 3d2d827f
      Michael S. Tsirkin 提交于
      Anyone who wants to do copy to/from user from a kernel thread, needs
      use_mm (like what fs/aio has).  Move that into mm/, to make reusing and
      exporting easier down the line, and make aio use it.  Next intended user,
      besides aio, will be vhost-net.
      Acked-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3d2d827f
    • E
      hugetlb: add MAP_HUGETLB for mmaping pseudo-anonymous huge page regions · 4e52780d
      Eric B Munson 提交于
      Add a flag for mmap that will be used to request a huge page region that
      will look like anonymous memory to userspace.  This is accomplished by
      using a file on the internal vfsmount.  MAP_HUGETLB is a modifier of
      MAP_ANONYMOUS and so must be specified with it.  The region will behave
      the same as a MAP_ANONYMOUS region using small pages.
      
      [akpm@linux-foundation.org: fix arch definitions of MAP_HUGETLB]
      Signed-off-by: NEric B Munson <ebmunson@us.ibm.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4e52780d
    • E
      hugetlbfs: allow the creation of files suitable for MAP_PRIVATE on the vfs internal mount · 6bfde05b
      Eric B Munson 提交于
      This patchset adds a flag to mmap that allows the user to request that an
      anonymous mapping be backed with huge pages.  This mapping will borrow
      functionality from the huge page shm code to create a file on the kernel
      internal mount and use it to approximate an anonymous mapping.  The
      MAP_HUGETLB flag is a modifier to MAP_ANONYMOUS and will not work without
      both flags being preset.
      
      A new flag is necessary because there is no other way to hook into huge
      pages without creating a file on a hugetlbfs mount which wouldn't be
      MAP_ANONYMOUS.
      
      To userspace, this mapping will behave just like an anonymous mapping
      because the file is not accessible outside of the kernel.
      
      This patchset is meant to simplify the programming model.  Presently there
      is a large chunk of boiler platecode, contained in libhugetlbfs, required
      to create private, hugepage backed mappings.  This patch set would allow
      use of hugepages without linking to libhugetlbfs or having hugetblfs
      mounted.
      
      Unification of the VM code would provide these same benefits, but it has
      been resisted each time that it has been suggested for several reasons: it
      would break PAGE_SIZE assumptions across the kernel, it makes page-table
      abstractions really expensive, and it does not provide any benefit on
      architectures that do not support huge pages, incurring fast path
      penalties without providing any benefit on these architectures.
      
      This patch:
      
      There are two means of creating mappings backed by huge pages:
      
              1. mmap() a file created on hugetlbfs
              2. Use shm which creates a file on an internal mount which essentially
                 maps it MAP_SHARED
      
      The internal mount is only used for shared mappings but there is very
      little that stops it being used for private mappings. This patch extends
      hugetlbfs_file_setup() to deal with the creation of files that will be
      mapped MAP_PRIVATE on the internal hugetlbfs mount. This extended API is
      used in a subsequent patch to implement the MAP_HUGETLB mmap() flag.
      Signed-off-by: NEric Munson <ebmunson@us.ibm.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6bfde05b
    • H
      tmpfs: depend on shmem · 3f96b79a
      Hugh Dickins 提交于
      CONFIG_SHMEM off gives you (ramfs masquerading as) tmpfs, even when
      CONFIG_TMPFS is off: that's a little anomalous, and I'd intended to make
      more sense of it by removing CONFIG_TMPFS altogether, always enabling its
      code when CONFIG_SHMEM; but so many defconfigs have CONFIG_SHMEM on
      CONFIG_TMPFS off that we'd better leave that as is.
      
      But there is no point in asking for CONFIG_TMPFS if CONFIG_SHMEM is off:
      make TMPFS depend on SHMEM, which also prevents TMPFS_POSIX_ACL
      shmem_acl.o being pointlessly built into the kernel when SHMEM is off.
      
      And a selfish change, to prevent the world from being rebuilt when I
      switch between CONFIG_SHMEM on and off: the only CONFIG_SHMEM in the
      header files is mm.h shmem_lock() - give that a shmem.c stub instead.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: NMatt Mackall <mpm@selenic.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3f96b79a
    • H
      mm: FOLL flags for GUP flags · 58fa879e
      Hugh Dickins 提交于
      __get_user_pages() has been taking its own GUP flags, then processing
      them into FOLL flags for follow_page().  Though oddly named, the FOLL
      flags are more widely used, so pass them to __get_user_pages() now.
      Sorry, VM flags, VM_FAULT flags and FAULT_FLAGs are still distinct.
      
      (The patch to __get_user_pages() looks peculiar, with both gup_flags
      and foll_flags: the gup_flags remain constant; but as before there's
      an exceptional case, out of scope of the patch, in which foll_flags
      per page have FOLL_WRITE masked off.)
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      58fa879e
    • H
      mm: follow_hugetlb_page flags · 2a15efc9
      Hugh Dickins 提交于
      follow_hugetlb_page() shouldn't be guessing about the coredump case
      either: pass the foll_flags down to it, instead of just the write bit.
      
      Remove that obscure huge_zeropage_ok() test.  The decision is easy,
      though unlike the non-huge case - here vm_ops->fault is always set.
      But we know that a fault would serve up zeroes, unless there's
      already a hugetlbfs pagecache page to back the range.
      
      (Alternatively, since hugetlb pages aren't swapped out under pressure,
      you could save more dump space by arguing that a page not yet faulted
      into this process cannot be relevant to the dump; but that would be
      more surprising.)
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2a15efc9
    • H
      mm: FOLL_DUMP replace FOLL_ANON · 8e4b9a60
      Hugh Dickins 提交于
      The "FOLL_ANON optimization" and its use_zero_page() test have caused
      confusion and bugs: why does it test VM_SHARED? for the very good but
      unsatisfying reason that VMware crashed without.  As we look to maybe
      reinstating anonymous use of the ZERO_PAGE, we need to sort this out.
      
      Easily done: it's silly for __get_user_pages() and follow_page() to
      be guessing whether it's safe to assume that they're being used for
      a coredump (which can take a shortcut snapshot where other uses must
      handle a fault) - just tell them with GUP_FLAGS_DUMP and FOLL_DUMP.
      
      get_dump_page() doesn't even want a ZERO_PAGE: an error suits fine.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8e4b9a60
    • H
      mm: add get_dump_page · f3e8fccd
      Hugh Dickins 提交于
      In preparation for the next patch, add a simple get_dump_page(addr)
      interface for the CONFIG_ELF_CORE dumpers to use, instead of calling
      get_user_pages() directly.  They're not interested in errors: they
      just want to use holes as much as possible, to save space and make
      sure that the data is aligned where the headers said it would be.
      
      Oh, and don't use that horrid DUMP_SEEK(off) macro!
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f3e8fccd
    • M
      page-allocator: split per-cpu list into one-list-per-migrate-type · 5f8dcc21
      Mel Gorman 提交于
      The following two patches remove searching in the page allocator fast-path
      by maintaining multiple free-lists in the per-cpu structure.  At the time
      the search was introduced, increasing the per-cpu structures would waste a
      lot of memory as per-cpu structures were statically allocated at
      compile-time.  This is no longer the case.
      
      The patches are as follows. They are based on mmotm-2009-08-27.
      
      Patch 1 adds multiple lists to struct per_cpu_pages, one per
      	migratetype that can be stored on the PCP lists.
      
      Patch 2 notes that the pcpu drain path check empty lists multiple times. The
      	patch reduces the number of checks by maintaining a count of free
      	lists encountered. Lists containing pages will then free multiple
      	pages in batch
      
      The patches were tested with kernbench, netperf udp/tcp, hackbench and
      sysbench.  The netperf tests were not bound to any CPU in particular and
      were run such that the results should be 99% confidence that the reported
      results are within 1% of the estimated mean.  sysbench was run with a
      postgres background and read-only tests.  Similar to netperf, it was run
      multiple times so that it's 99% confidence results are within 1%.  The
      patches were tested on x86, x86-64 and ppc64 as
      
      x86:	Intel Pentium D 3GHz with 8G RAM (no-brand machine)
      	kernbench	- No significant difference, variance well within noise
      	netperf-udp	- 1.34% to 2.28% gain
      	netperf-tcp	- 0.45% to 1.22% gain
      	hackbench	- Small variances, very close to noise
      	sysbench	- Very small gains
      
      x86-64:	AMD Phenom 9950 1.3GHz with 8G RAM (no-brand machine)
      	kernbench	- No significant difference, variance well within noise
      	netperf-udp	- 1.83% to 10.42% gains
      	netperf-tcp	- No conclusive until buffer >= PAGE_SIZE
      				4096	+15.83%
      				8192	+ 0.34% (not significant)
      				16384	+ 1%
      	hackbench	- Small gains, very close to noise
      	sysbench	- 0.79% to 1.6% gain
      
      ppc64:	PPC970MP 2.5GHz with 10GB RAM (it's a terrasoft powerstation)
      	kernbench	- No significant difference, variance well within noise
      	netperf-udp	- 2-3% gain for almost all buffer sizes tested
      	netperf-tcp	- losses on small buffers, gains on larger buffers
      			  possibly indicates some bad caching effect.
      	hackbench	- No significant difference
      	sysbench	- 2-4% gain
      
      This patch:
      
      Currently the per-cpu page allocator searches the PCP list for pages of
      the correct migrate-type to reduce the possibility of pages being
      inappropriate placed from a fragmentation perspective.  This search is
      potentially expensive in a fast-path and undesirable.  Splitting the
      per-cpu list into multiple lists increases the size of a per-cpu structure
      and this was potentially a major problem at the time the search was
      introduced.  These problem has been mitigated as now only the necessary
      number of structures is allocated for the running system.
      
      This patch replaces a list search in the per-cpu allocator with one list
      per migrate type.  The potential snag with this approach is when bulk
      freeing pages.  We round-robin free pages based on migrate type which has
      little bearing on the cache hotness of the page and potentially checks
      empty lists repeatedly in the event the majority of PCP pages are of one
      type.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NNick Piggin <npiggin@suse.de>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5f8dcc21