1. 24 September 2009, 12 commits
    • memcg: show swap usage in stat file · 1dd3a273
      Committed by Daisuke Nishimura
      We now count MEM_CGROUP_STAT_SWAPOUT, so we can show swap usage.  Exposing
      swap usage in the memory.stat file is useful to users, because they no
      longer need to calculate memsw.usage - res.usage to find out how much swap
      a group uses (see the small reader sketch after this entry).
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1dd3a273
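
      A minimal userspace sketch of reading the new field: it prints the "swap"
      line from memory.stat.  The /cgroup/memory mount point is an assumption
      for the example; adjust it to wherever the memory controller is mounted.

      /* Example: print the "swap" line added to memory.stat by this patch.
       * The mount point below is an assumption, not mandated by the kernel. */
      #include <stdio.h>
      #include <string.h>

      int main(void)
      {
          const char *path = "/cgroup/memory/memory.stat";  /* assumed mount point */
          char line[256];
          FILE *f = fopen(path, "r");

          if (!f) {
              perror(path);
              return 1;
          }
          while (fgets(line, sizeof(line), f))
              if (strncmp(line, "swap ", 5) == 0)
                  fputs(line, stdout);            /* value is reported in bytes */
          fclose(f);
          return 0;
      }
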
    • memcg: improve resource counter scalability · 0c3e73e8
      Committed by Balbir Singh
      Reduce the resource counter overhead (mostly spinlock contention)
      associated with the root cgroup.  This is part of a series of patches to
      reduce mem cgroup overhead.  I had posted other approaches earlier
      (including using percpu counters); those patches are a natural addition
      and will be added iteratively on top of these.
      
      The patch stops resource counter accounting for the root cgroup.  The data
      for display is derived from the statistics we maintain via
      mem_cgroup_charge_statistics() (which is more scalable).  Today we do
      double accounting: once using res_counter_charge() and once using
      mem_cgroup_charge_statistics().  For the root, since we no longer
      implement limits, we don't need to track every charge via
      res_counter_charge(), check whether the limit is exceeded, and reclaim.
      (A standalone sketch of the derivation follows this entry.)
      
      The main mem->res usage_in_bytes can be derived by summing the cache and
      rss usage data from memory statistics (MEM_CGROUP_STAT_RSS and
      MEM_CGROUP_STAT_CACHE).  However, for memsw->res usage_in_bytes, we need
      additional data about swapped out memory.  This patch adds a
      MEM_CGROUP_STAT_SWAPOUT and uses that along with MEM_CGROUP_STAT_RSS and
      MEM_CGROUP_STAT_CACHE to derive the memsw data.  This data is computed
      recursively when hierarchy is enabled.
      
      The test results I see on a 24-way system show that:
      
      1. The lock contention disappears from /proc/lock_stats
      2. The results of the test are comparable to running with
         cgroup_disable=memory.
      
      Here is a sample of my program runs
      
      Without Patch
      
       Performance counter stats for '/home/balbir/parallel_pagefault':
      
       7192804.124144  task-clock-msecs         #     23.937 CPUs
               424691  context-switches         #      0.000 M/sec
                  267  CPU-migrations           #      0.000 M/sec
             28498113  page-faults              #      0.004 M/sec
        5826093739340  cycles                   #    809.989 M/sec
         408883496292  instructions             #      0.070 IPC
           7057079452  cache-references         #      0.981 M/sec
           3036086243  cache-misses             #      0.422 M/sec
      
        300.485365680  seconds time elapsed
      
      With cgroup_disable=memory
      
       Performance counter stats for '/home/balbir/parallel_pagefault':
      
       7182183.546587  task-clock-msecs         #     23.915 CPUs
               425458  context-switches         #      0.000 M/sec
                  203  CPU-migrations           #      0.000 M/sec
             92545093  page-faults              #      0.013 M/sec
        6034363609986  cycles                   #    840.185 M/sec
         437204346785  instructions             #      0.072 IPC
           6636073192  cache-references         #      0.924 M/sec
           2358117732  cache-misses             #      0.328 M/sec
      
        300.320905827  seconds time elapsed
      
      With this patch applied
      
       Performance counter stats for '/home/balbir/parallel_pagefault':
      
       7191619.223977  task-clock-msecs         #     23.955 CPUs
               422579  context-switches         #      0.000 M/sec
                   88  CPU-migrations           #      0.000 M/sec
             91946060  page-faults              #      0.013 M/sec
        5957054385619  cycles                   #    828.333 M/sec
        1058117350365  instructions             #      0.178 IPC
           9161776218  cache-references         #      1.274 M/sec
           1920494280  cache-misses             #      0.267 M/sec
      
        300.218764862  seconds time elapsed
      
      Data from Prarit (kernel compile with make -j64 on a 64
      CPU/32G machine)
      
      For a single run
      
      Without patch
      
      real 27m8.988s
      user 87m24.916s
      sys 382m6.037s
      
      With patch
      
      real    4m18.607s
      user    84m58.943s
      sys     50m52.682s
      
      With config turned off
      
      real    4m54.972s
      user    90m13.456s
      sys     50m19.711s
      
      NOTE: The data looks counterintuitive due to the increased performance
      with the patch, even over the config being turned off. We probably need
      more runs, but so far all testing has shown that the patches definitely
      help.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0c3e73e8
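
      A standalone sketch (not kernel code) of the idea described above: each
      CPU updates only its own statistics, and usage is derived on demand by
      summing the RSS, cache and swap-out counts.  The names, CPU count and
      page size are illustrative assumptions.

      /* Illustrative only: derive usage from per-CPU statistics instead of a
       * shared, lock-protected resource counter. */
      #include <stdio.h>

      #define NR_CPUS    4
      #define PAGE_BYTES 4096

      enum stat { STAT_CACHE, STAT_RSS, STAT_SWAPOUT, NR_STATS };

      static long long percpu_stat[NR_CPUS][NR_STATS];  /* each CPU touches its own row */

      static void charge(int cpu, enum stat idx, long long pages)
      {
          percpu_stat[cpu][idx] += pages;                /* no global lock taken */
      }

      /* usage = cache + rss; the memsw variant also adds swapped-out pages */
      static long long usage_in_bytes(int memsw)
      {
          long long sum = 0;

          for (int cpu = 0; cpu < NR_CPUS; cpu++) {
              sum += percpu_stat[cpu][STAT_CACHE] + percpu_stat[cpu][STAT_RSS];
              if (memsw)
                  sum += percpu_stat[cpu][STAT_SWAPOUT];
          }
          return sum * PAGE_BYTES;
      }

      int main(void)
      {
          charge(0, STAT_RSS, 100);
          charge(1, STAT_CACHE, 50);
          charge(2, STAT_SWAPOUT, 10);
          printf("mem usage:   %lld bytes\n", usage_in_bytes(0));
          printf("memsw usage: %lld bytes\n", usage_in_bytes(1));
          return 0;
      }
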
    • memory controller: soft limit reclaim on contention · 4e416953
      Committed by Balbir Singh
      Implement reclaim from groups over their soft limit
      
      Permit reclaim from memory cgroups on contention (via the direct reclaim
      path).
      
      memory cgroup soft limit reclaim finds the group that exceeds its soft
      limit by the largest number of pages and reclaims pages from it and then
      reinserts the cgroup into its correct place in the rbtree.
      
      Add additional checks to mem_cgroup_hierarchical_reclaim() to detect long
      loops in case all swap is turned off.  The code has been refactored and
      the loop check (loop < 2) has been enhanced for soft limits.  For soft
      limits, we try to do more targeted reclaim: instead of bailing out after
      two loops, the routine now reclaims memory proportional to the amount by
      which the soft limit is exceeded.  The proportion has been determined
      empirically.  (A small simulation of this selection policy follows this
      entry.)
      
      [akpm@linux-foundation.org: build fix]
      [kamezawa.hiroyu@jp.fujitsu.com: fix softlimit css refcnt handling]
      [nishimura@mxp.nes.nec.co.jp: refcount of the "victim" should be decremented before exiting the loop]
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4e416953
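
      A small userspace simulation of the selection policy described above: pick
      the group with the largest excess over its soft limit, reclaim an amount
      proportional to that excess, and re-evaluate.  The groups, numbers and the
      one-half proportion are made up for illustration.

      /* Illustrative only: "reclaim from the group exceeding its soft limit by
       * the largest amount, proportionally to the excess". */
      #include <stdio.h>

      struct group { const char *name; long usage; long soft_limit; };

      static long excess(const struct group *g)
      {
          return g->usage > g->soft_limit ? g->usage - g->soft_limit : 0;
      }

      int main(void)
      {
          struct group groups[] = {
              { "A", 900, 500 }, { "B", 300, 400 }, { "C", 700, 100 },
          };
          int n = sizeof(groups) / sizeof(groups[0]);

          for (int round = 0; round < 4; round++) {
              struct group *worst = &groups[0];

              for (int i = 1; i < n; i++)          /* the kernel uses an rbtree here */
                  if (excess(&groups[i]) > excess(worst))
                      worst = &groups[i];
              if (!excess(worst))
                  break;
              worst->usage -= excess(worst) / 2;   /* reclaim proportional to excess */
              printf("round %d: reclaimed from %s, usage now %ld\n",
                     round, worst->name, worst->usage);
          }
          return 0;
      }
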
    • memory controller: soft limit refactor reclaim flags · 75822b44
      Committed by Balbir Singh
      Refactor mem_cgroup_hierarchical_reclaim()
      
      Refactor the arguments passed to mem_cgroup_hierarchical_reclaim() into a
      flags word, so that new parameters don't have to be added to the signature
      as we make the reclaim routine more flexible (a generic sketch of the
      pattern follows this entry).
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      75822b44
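
      The pattern here is simply folding independent boolean parameters into one
      bit-mask argument so that new options do not change the function
      signature.  A generic sketch; the flag names below are hypothetical, not
      the ones used by mem_cgroup_hierarchical_reclaim().

      /* Hypothetical flag names, shown only to illustrate the refactoring. */
      #include <stdio.h>

      #define RECLAIM_NOSWAP  (1UL << 0)
      #define RECLAIM_SHRINK  (1UL << 1)
      /* a later patch can add e.g. RECLAIM_SOFT (1UL << 2) without touching callers */

      static void hierarchical_reclaim(unsigned long flags)
      {
          int noswap = !!(flags & RECLAIM_NOSWAP);
          int shrink = !!(flags & RECLAIM_SHRINK);

          printf("reclaim: noswap=%d shrink=%d\n", noswap, shrink);
      }

      int main(void)
      {
          hierarchical_reclaim(0);
          hierarchical_reclaim(RECLAIM_NOSWAP | RECLAIM_SHRINK);
          return 0;
      }
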
    • memory controller: soft limit organize cgroups · f64c3f54
      Committed by Balbir Singh
      Organize cgroups over soft limit in a RB-Tree
      
      Introduce an RB-Tree for storing memory cgroups that are over their soft
      limit.  The overall goal is to
      
      1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded.
         We are careful about updates: they take place only after a particular
         time interval has passed (see the throttled-update sketch after this
         entry).
      2. We remove the node from the RB-Tree when the usage goes below the soft
         limit
      
      The next set of patches will exploit the RB-Tree to get the group that is
      over its soft limit by the largest amount and reclaim from it, when we
      face memory contention.
      
      [hugh.dickins@tiscali.co.uk: CONFIG_CGROUP_MEM_RES_CTLR=y CONFIG_PREEMPT=y fails to boot]
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Jiri Slaby <jirislaby@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f64c3f54
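
      A standalone sketch of the throttled-update idea from point 1 above: a
      group's position in the soft-limit tree is refreshed only if enough time
      has passed since its last update.  Wall-clock seconds stand in for
      jiffies, and the structure, names and interval are illustrative
      assumptions.

      /* Illustrative only: rate-limit how often a charge re-sorts the group. */
      #include <stdio.h>
      #include <time.h>

      #define UPDATE_INTERVAL 2        /* seconds between tree updates per group */

      struct group {
          long usage, soft_limit;
          time_t next_update;          /* earliest time we bother touching the tree */
      };

      static void charge(struct group *g, long pages)
      {
          time_t now = time(NULL);

          g->usage += pages;
          if (now < g->next_update)
              return;                  /* too soon: skip the tree manipulation */
          g->next_update = now + UPDATE_INTERVAL;
          if (g->usage > g->soft_limit)
              printf("(re)insert group, excess = %ld pages\n",
                     g->usage - g->soft_limit);
          else
              printf("remove group from the tree\n");
      }

      int main(void)
      {
          struct group g = { .usage = 0, .soft_limit = 100, .next_update = 0 };

          for (int i = 0; i < 5; i++)
              charge(&g, 60);          /* only the first call re-sorts the group */
          return 0;
      }
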
    • memory controller: soft limit interface · 296c81d8
      Committed by Balbir Singh
      Add an interface to allow get/set of soft limits.  Soft limits for the
      memory plus swap controller (memsw) are currently not supported.  Resource
      counters have been enhanced to support soft limits, and a new type,
      RES_SOFT_LIMIT, has been added.  Unlike hard limits, soft limits can be
      set directly and do not require any reclaim or checks before being set to
      a new value (a minimal usage example follows this entry).
      
      Kamezawa-San raised a question as to whether soft limit should belong to
      res_counter.  Since all resources understand the basic concepts of hard
      and soft limits, it is justified to add soft limits here.  Soft limits are
      a generic resource usage feature, even file system quotas support soft
      limits.
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      296c81d8
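
      A minimal example of using the new interface from userspace; the mount
      point /cgroup/memory and the child group "foo" are assumptions.  Unlike a
      hard limit, the write succeeds immediately without triggering reclaim.

      /* Set a 256MB soft limit on an (assumed) existing memory cgroup. */
      #include <stdio.h>

      int main(void)
      {
          const char *path = "/cgroup/memory/foo/memory.soft_limit_in_bytes";
          FILE *f = fopen(path, "w");

          if (!f) {
              perror(path);
              return 1;
          }
          fprintf(f, "%llu\n", 256ULL << 20);     /* value in bytes */
          fclose(f);
          return 0;
      }
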
    • memcg: add comments explaining memory barriers · 261fb61a
      Committed by KAMEZAWA Hiroyuki
      Add comments explaining the reason for the smp_wmb() in
      mem_cgroup_commit_charge().
      
      [akpm@linux-foundation.org: coding-style fixes]
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      261fb61a
    • memcg: remove the overhead associated with the root cgroup · 4b3bde4c
      Committed by Balbir Singh
      Change the memory cgroup to remove the overhead associated with accounting
      all pages in the root cgroup.  As a side-effect, we can no longer set a
      memory hard limit in the root cgroup.
      
      A new flag to track whether the page has been accounted or not has been
      added as well.  Flags are now set atomically for page_cgroup;
      pcg_default_flags is now obsolete and has been removed.
      
      [akpm@linux-foundation.org: fix a few documentation glitches]
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4b3bde4c
    • cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time · be367d09
      Committed by Ben Blum
      Alter the ss->can_attach and ss->attach functions to be able to deal with
      a whole threadgroup at a time, for use in cgroup_attach_proc.  (This is a
      pre-patch to cgroup-procs-writable.patch.)
      
      Currently, the new mode of the attach function can only tell the subsystem
      about the old cgroup of the threadgroup leader.  No subsystem currently
      needs that information for each thread that's being moved, but if one were
      added (for example, one that counts tasks within a group), this part would
      need to be reworked to tell the subsystem the right information for each
      thread.
      
      [hidave.darkstar@gmail.com: fix build]
      Signed-off-by: Ben Blum <bblum@google.com>
      Signed-off-by: Paul Menage <menage@google.com>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Reviewed-by: Matt Helsley <matthltc@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Dave Young <hidave.darkstar@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      be367d09
    • ksm: change default values to better fit into mainline kernel · 2c6854fd
      Committed by Izik Eidus
      Now that ksm is in mainline it is better to change the default values to
      better fit most users.

      This patch changes the ksm default values to be:
      
      	ksm_thread_pages_to_scan = 100 (instead of 200)
      	ksm_thread_sleep_millisecs = 20 (like before)
      	ksm_run = KSM_RUN_STOP (instead of KSM_RUN_MERGE - meaning ksm is
      	                        disabled by default)
      	ksm_max_kernel_pages = nr_free_buffer_pages / 4 (instead of 2046)
      
      The important aspects of this patch are that it disables ksm by default
      and sets the maximum number of kernel pages that can be allocated to a
      reasonable value.  (A small sketch of flipping these knobs at run time
      through sysfs follows this entry.)
      Signed-off-by: Izik Eidus <ieidus@redhat.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2c6854fd
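
      Because KSM is now disabled by default, it has to be enabled explicitly.
      A small sketch of flipping the sysfs knobs from C (run as root); the paths
      are the standard /sys/kernel/mm/ksm ones, and the values chosen here
      simply restore the old scan rate.

      /* Enable KSM and raise pages_to_scan back to the previous default of 200. */
      #include <stdio.h>

      static int write_knob(const char *path, const char *val)
      {
          FILE *f = fopen(path, "w");

          if (!f) {
              perror(path);
              return -1;
          }
          fputs(val, f);
          fclose(f);
          return 0;
      }

      int main(void)
      {
          write_knob("/sys/kernel/mm/ksm/run", "1");             /* KSM_RUN_MERGE */
          write_knob("/sys/kernel/mm/ksm/pages_to_scan", "200"); /* per wakeup */
          return 0;
      }
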
    • cpumask: use new-style cpumask ops in mm/quicklist. · db790786
      Committed by Rusty Russell
      This slipped past the previous sweeps.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Acked-by: Christoph Lameter <cl@linux-foundation.org>
      db790786
    • nommu: fix two build breakages · 4266c97a
      Committed by Hugh Dickins
      My 58fa879e "mm: FOLL flags for GUP flags"
      broke CONFIG_NOMMU build by forgetting to update nommu.c foll_flags type:
      
        mm/nommu.c:171: error: conflicting types for `__get_user_pages'
        mm/internal.h:254: error: previous declaration of `__get_user_pages' was here
        make[1]: *** [mm/nommu.o] Error 1
      
      My 03f6462a "mm: move highest_memmap_pfn"
      broke CONFIG_NOMMU build by forgetting to add a nommu.c highest_memmap_pfn:
      
        mm/built-in.o: In function `memmap_init_zone':
        (.meminit.text+0x326): undefined reference to `highest_memmap_pfn'
        mm/built-in.o: In function `memmap_init_zone':
        (.meminit.text+0x32d): undefined reference to `highest_memmap_pfn'
      
      Fix both breakages, and give myself 30 lashes (ouch!)
      Reported-by: Michal Simek <michal.simek@petalogix.com>
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4266c97a
  2. 23 September 2009, 3 commits
    • kcore: register module area in generic way · 81ac3ad9
      Committed by KAMEZAWA Hiroyuki
      Some archs define MODULES_VADDR/MODULES_END outside of the VMALLOC area.
      This was handled only on x86-64.  This patch makes the handling generic,
      so we can use vread/vwrite to access the area.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jirislaby@gmail.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: WANG Cong <xiyou.wangcong@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      81ac3ad9
    • walk system ram range · 908eedc6
      Committed by KAMEZAWA Hiroyuki
      Originally, walk_memory_resource() was introduced to traverse all memory
      of "System RAM" for detecting memory hotplug/unplug ranges.  For doing so,
      the flags IORESOURCE_MEM|IORESOURCE_BUSY were used, and this was enough
      for memory hotplug.

      But for other purposes, such as /proc/kcore, this may include some
      firmware areas marked as IORESOURCE_BUSY | IORESOURCE_MEM.  This patch
      makes the check strict, to find only busy "System RAM".
      
      Note: PPC64 keeps its own walk_memory_resource(), which walks through
      ppc64's lmb information.  Because the old kclist_add() is called per lmb,
      this patch makes no difference in behavior in the end.

      This patch also removes the CONFIG_MEMORY_HOTPLUG check from this
      function.  Because pfn_valid() only shows "there is a memmap or not" and
      cannot be used to tell "there is physical memory or not", this function is
      useful as a generic way to scan a physical memory range.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: WANG Cong <xiyou.wangcong@gmail.com>
      Cc: Américo Wang <xiyou.wangcong@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Roland Dreier <rolandd@cisco.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      908eedc6
    • procfs: provide stack information for threads · d899bf7b
      Committed by Stefani Seibold
      A patch to give a better overview of userland application stack usage,
      especially for embedded Linux.

      Currently you are only able to dump the main process/thread stack usage,
      which is shown in /proc/pid/status by the "VmStk" value.  But you get no
      information about the stack memory consumed by the other threads.

      This patch enhances /proc/<pid>/{task/*,}/maps to mark the VM mapping
      where the thread stack pointer resides with "[thread stack xxxxxxxx]",
      where xxxxxxxx is the maximum size of the stack.  This is valuable
      information, because libpthread doesn't set the start of the stack to the
      top of the mapped area; it depends on how pthreads are used.
      
      A sample output of /proc/<pid>/task/<tid>/maps looks like:
      
      08048000-08049000 r-xp 00000000 03:00 8312       /opt/z
      08049000-0804a000 rw-p 00001000 03:00 8312       /opt/z
      0804a000-0806b000 rw-p 00000000 00:00 0          [heap]
      a7d12000-a7d13000 ---p 00000000 00:00 0
      a7d13000-a7f13000 rw-p 00000000 00:00 0          [thread stack: 001ff4b4]
      a7f13000-a7f14000 ---p 00000000 00:00 0
      a7f14000-a7f36000 rw-p 00000000 00:00 0
      a7f36000-a8069000 r-xp 00000000 03:00 4222       /lib/libc.so.6
      a8069000-a806b000 r--p 00133000 03:00 4222       /lib/libc.so.6
      a806b000-a806c000 rw-p 00135000 03:00 4222       /lib/libc.so.6
      a806c000-a806f000 rw-p 00000000 00:00 0
      a806f000-a8083000 r-xp 00000000 03:00 14462      /lib/libpthread.so.0
      a8083000-a8084000 r--p 00013000 03:00 14462      /lib/libpthread.so.0
      a8084000-a8085000 rw-p 00014000 03:00 14462      /lib/libpthread.so.0
      a8085000-a8088000 rw-p 00000000 00:00 0
      a8088000-a80a4000 r-xp 00000000 03:00 8317       /lib/ld-linux.so.2
      a80a4000-a80a5000 r--p 0001b000 03:00 8317       /lib/ld-linux.so.2
      a80a5000-a80a6000 rw-p 0001c000 03:00 8317       /lib/ld-linux.so.2
      afaf5000-afb0a000 rw-p 00000000 00:00 0          [stack]
      ffffe000-fffff000 r-xp 00000000 00:00 0          [vdso]
      
      Also there is a new entry, "Stack usage", in /proc/<pid>/{task/*,}/status,
      which gives you the current stack usage in kB (a small reader sketch
      follows this entry).
      
      A sample output of /proc/self/status looks like:
      
      Name:	cat
      State:	R (running)
      Tgid:	507
      Pid:	507
      .
      .
      .
      CapBnd:	fffffffffffffeff
      voluntary_ctxt_switches:	0
      nonvoluntary_ctxt_switches:	0
      Stack usage:	12 kB
      
      I also fixed the stack base address in /proc/<pid>/{task/*,}/stat to be
      the base address of the associated thread stack rather than that of the
      main process.  This makes more sense.
      
      [akpm@linux-foundation.org: fs/proc/array.c now needs walk_page_range()]
      Signed-off-by: Stefani Seibold <stefani@seibold.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d899bf7b
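
      A quick way to look at the new field: a sketch that prints the
      "Stack usage:" line from the calling thread's status file.

      /* Print the new "Stack usage:" line from /proc/self/status. */
      #include <stdio.h>
      #include <string.h>

      int main(void)
      {
          char line[256];
          FILE *f = fopen("/proc/self/status", "r");

          if (!f) {
              perror("/proc/self/status");
              return 1;
          }
          while (fgets(line, sizeof(line), f))
              if (strncmp(line, "Stack usage:", 12) == 0)
                  fputs(line, stdout);        /* e.g. "Stack usage:   12 kB" */
          fclose(f);
          return 0;
      }
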
  3. 22 September 2009, 25 commits
    • nommu: add support for Memory Protection Units (MPU) · eb8cdec4
      Committed by Bernd Schmidt
      Some architectures (like the Blackfin arch) implement some of the
      "simpler" features that one would expect out of an MMU, such as memory
      protection.
      
      In our case, we actually get read/write/exec protection down to the page
      boundary so processes can't stomp on each other let alone the kernel.
      
      There is a performance decrease (which depends greatly on the workload)
      however as the hardware/software interaction was not optimized at design
      time.
      Signed-off-by: Bernd Schmidt <bernds_cb1@t-online.de>
      Signed-off-by: Bryan Wu <cooloney@kernel.org>
      Signed-off-by: Mike Frysinger <vapier@gentoo.org>
      Acked-by: David Howells <dhowells@redhat.com>
      Acked-by: Greg Ungerer <gerg@snapgear.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eb8cdec4
    • mm: reduce atomic use on use_mm fast path · f68e1480
      Committed by Michael S. Tsirkin
      When the mm being switched to matches the active mm, we don't need to
      increment and then drop the mm count.  In a simple benchmark this happens
      about 50% of the time.  Making that work conditional reduces contention on
      that cacheline on SMP systems (a stand-in sketch of the pattern follows
      this entry).
      Acked-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f68e1480
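
      The shape of the optimization, as a standalone sketch: the reference count
      is only touched when the mm actually changes.  The types and helpers below
      are stand-ins, not the kernel's mm_struct or mmdrop().

      /* Stand-in sketch of "skip the refcount churn when the mm is unchanged". */
      #include <stdatomic.h>
      #include <stdio.h>

      struct mm { atomic_int count; };

      static void grab(struct mm *mm) { atomic_fetch_add(&mm->count, 1); }
      static void drop(struct mm *mm) { atomic_fetch_sub(&mm->count, 1); }

      struct task { struct mm *mm, *active_mm; };

      static void use_mm(struct task *tsk, struct mm *mm)
      {
          struct mm *active = tsk->active_mm;

          if (active != mm) {          /* when equal (~50% of calls) both atomics are skipped */
              grab(mm);
              tsk->active_mm = mm;
              drop(active);
          }
          tsk->mm = mm;                /* the real code also calls switch_mm() */
      }

      int main(void)
      {
          struct mm a = { 1 }, b = { 1 };
          struct task t = { .mm = NULL, .active_mm = &a };

          use_mm(&t, &a);              /* no refcount traffic: already active */
          use_mm(&t, &b);              /* refcounts adjusted once */
          printf("a.count=%d b.count=%d\n", (int)a.count, (int)b.count);
          return 0;
      }
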
    • mm: move use_mm/unuse_mm from aio.c to mm/ · 3d2d827f
      Committed by Michael S. Tsirkin
      Anyone who wants to copy to/from user space from a kernel thread needs
      use_mm() (like what fs/aio has).  Move that into mm/, to make reusing and
      exporting it easier down the line, and make aio use it.  The next intended
      user, besides aio, will be vhost-net.
      Acked-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3d2d827f
    • shmem: initialize struct shmem_sb_info to zero · 425fbf04
      Committed by Pekka Enberg
      Fixes the following kmemcheck false positive (the compiler is using
      a 32-bit mov to load the 16-bit sbinfo->mode in shmem_fill_super):
      
      [    0.337000] Total of 1 processors activated (3088.38 BogoMIPS).
      [    0.352000] CPU0 attaching NULL sched-domain.
      [    0.360000] WARNING: kmemcheck: Caught 32-bit read from uninitialized
      memory (9f8020fc)
      [    0.361000]
      a44240820000000041f6998100000000000000000000000000000000ff030000
      [    0.368000]  i i i i i i i i i i i i i i i i u u u u i i i i i i i i i i u
      u
      [    0.375000]                                                          ^
      [    0.376000]
      [    0.377000] Pid: 9, comm: khelper Not tainted (2.6.31-tip #206) P4DC6
      [    0.378000] EIP: 0060:[<810a3a95>] EFLAGS: 00010246 CPU: 0
      [    0.379000] EIP is at shmem_fill_super+0xb5/0x120
      [    0.380000] EAX: 00000000 EBX: 9f845400 ECX: 824042a4 EDX: 8199f641
      [    0.381000] ESI: 9f8020c0 EDI: 9f845400 EBP: 9f81af68 ESP: 81cd6eec
      [    0.382000]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
      [    0.383000] CR0: 8005003b CR2: 9f806200 CR3: 01ccd000 CR4: 000006d0
      [    0.384000] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
      [    0.385000] DR6: ffff4ff0 DR7: 00000400
      [    0.386000]  [<810c25fc>] get_sb_nodev+0x3c/0x80
      [    0.388000]  [<810a3514>] shmem_get_sb+0x14/0x20
      [    0.390000]  [<810c207f>] vfs_kern_mount+0x4f/0x120
      [    0.392000]  [<81b2849e>] init_tmpfs+0x7e/0xb0
      [    0.394000]  [<81b11597>] do_basic_setup+0x17/0x30
      [    0.396000]  [<81b11907>] kernel_init+0x57/0xa0
      [    0.398000]  [<810039b7>] kernel_thread_helper+0x7/0x10
      [    0.400000]  [<ffffffff>] 0xffffffff
      [    0.402000] khelper used greatest stack depth: 2820 bytes left
      [    0.407000] calling  init_mmap_min_addr+0x0/0x10 @ 1
      [    0.408000] initcall init_mmap_min_addr+0x0/0x10 returned 0 after 0 usecs
      Reported-by: Ingo Molnar <mingo@elte.hu>
      Analysed-by: Vegard Nossum <vegard.nossum@gmail.com>
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      425fbf04
    • hugetlb: add MAP_HUGETLB for mmaping pseudo-anonymous huge page regions · 4e52780d
      Committed by Eric B Munson
      Add a flag for mmap that will be used to request a huge page region that
      will look like anonymous memory to userspace.  This is accomplished by
      using a file on the internal vfsmount.  MAP_HUGETLB is a modifier of
      MAP_ANONYMOUS and so must be specified together with it.  The region will
      behave the same as a MAP_ANONYMOUS region using small pages (a minimal
      usage example follows this entry).
      
      [akpm@linux-foundation.org: fix arch definitions of MAP_HUGETLB]
      Signed-off-by: Eric B Munson <ebmunson@us.ibm.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4e52780d
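
      A minimal usage example.  MAP_HUGETLB is defined here as a fallback in
      case the libc headers do not provide it yet (the value is the x86 one),
      and the mmap() will fail unless huge pages have been reserved, e.g. via
      /proc/sys/vm/nr_hugepages.

      /* Map a pseudo-anonymous huge page region with MAP_ANONYMOUS | MAP_HUGETLB. */
      #define _GNU_SOURCE
      #include <stdio.h>
      #include <string.h>
      #include <sys/mman.h>

      #ifndef MAP_HUGETLB
      #define MAP_HUGETLB 0x40000          /* arch-specific; this is the x86 value */
      #endif

      #define LENGTH (2UL * 1024 * 1024)   /* one 2MB huge page on x86 */

      int main(void)
      {
          void *addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

          if (addr == MAP_FAILED) {
              perror("mmap(MAP_HUGETLB)");
              return 1;
          }
          memset(addr, 0, LENGTH);         /* behaves like ordinary anonymous memory */
          munmap(addr, LENGTH);
          return 0;
      }
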
    • mmap: save some cycles for the shared anonymous mapping · f8dbf0a7
      Committed by Huang Shijie
      shmem_zero_setup() does not change vm_start, pgoff or vm_flags; only some
      drivers change them (such as drivers/video/bfin-t350mcqb-fb.c).

      Move this code to a more appropriate place to save cycles for the shared
      anonymous mapping case.
      Signed-off-by: Huang Shijie <shijie8@gmail.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f8dbf0a7
    • mmap: avoid unnecessary anon_vma lock acquisition in vma_adjust() · 252c5f94
      Committed by Lee Schermerhorn
      We noticed very erratic behavior [throughput] with the AIM7 shared
      workload running on recent distro [SLES11] and mainline kernels on an
      8-socket, 32-core, 256GB x86_64 platform.  On the SLES11 kernel
      [2.6.27.19+] with Barcelona processors, as we increased the load [10s of
      thousands of tasks], the throughput would vary between two "plateaus"--one
      at ~65K jobs per minute and one at ~130K jpm.  The simple patch below
      causes the results to smooth out at the ~130k plateau.
      
      But wait, there's more:
      
      We do not see this behavior on smaller platforms--e.g., 4 socket/8 core.
      This could be the result of the larger number of cpus on the larger
      platform--a scalability issue--or it could be the result of the larger
      number of interconnect "hops" between some nodes in this platform and how
      the tasks for a given load end up distributed over the nodes' cpus and
      memories--a stochastic NUMA effect.
      
      The variability in the results is less pronounced [on the same platform]
      with Shanghai processors and with mainline kernels.  With 31-rc6 on
      Shanghai processors and 288 file systems on 288 fibre attached storage
      volumes, the curves [jpm vs load] are both quite flat with the patched
      kernel consistently producing ~3.9% better throughput [~80K jpm vs ~77K
      jpm] than the unpatched kernel.
      
      Profiling indicated that the "slow" runs were incurring high[er]
      contention on an anon_vma lock in vma_adjust(), apparently called from the
      sbrk() system call.
      
      The patch:
      
      A comment in mm/mmap.c:vma_adjust() suggests that we don't really need the
      anon_vma lock when we're only adjusting the end of a vma, as is the case
      for brk().  The comment questions whether it's worth while to optimize for
      this case.  Apparently, on the newer, larger x86_64 platforms, with
      interesting NUMA topologies, it is worth while--especially considering
      that the patch [if correct!] is quite simple.
      
      We can detect this condition--no overlap with next vma--by noting a NULL
      "importer".  The anon_vma pointer will also be NULL in this case, so
      simply avoid loading vma->anon_vma to avoid the lock.
      
      However, we DO need to take the anon_vma lock when we're inserting a vma
      ['insert' non-NULL] even when we have no overlap [NULL "importer"], so we
      need to check for 'insert', as well.  And Hugh points out that we should
      also take it when adjusting vm_start (so that rmap.c can rely upon
      vma_address() while it holds the anon_vma lock).
      
      akpm: Zhang Yanmin reports a 150% throughput improvement with aim7, so it
      might be -stable material even though this isn't a regression: "this
      issue is not clear on dual socket Nehalem machine (2*4*2 cpu), but is
      severe on large machine (4*8*2 cpu)"
      
      [hugh.dickins@tiscali.co.uk: test vma start too]
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Tested-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      252c5f94
    • tmpfs: depend on shmem · 3f96b79a
      Committed by Hugh Dickins
      CONFIG_SHMEM off gives you (ramfs masquerading as) tmpfs, even when
      CONFIG_TMPFS is off: that's a little anomalous, and I'd intended to make
      more sense of it by removing CONFIG_TMPFS altogether, always enabling its
      code when CONFIG_SHMEM; but so many defconfigs have CONFIG_SHMEM on
      CONFIG_TMPFS off that we'd better leave that as is.
      
      But there is no point in asking for CONFIG_TMPFS if CONFIG_SHMEM is off:
      make TMPFS depend on SHMEM, which also prevents TMPFS_POSIX_ACL
      shmem_acl.o being pointlessly built into the kernel when SHMEM is off.
      
      And a selfish change, to prevent the world from being rebuilt when I
      switch between CONFIG_SHMEM on and off: the only CONFIG_SHMEM in the
      header files is mm.h shmem_lock() - give that a shmem.c stub instead.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Matt Mackall <mpm@selenic.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3f96b79a
    • mmap: remove unnecessary code · cdf7b341
      Committed by Huang Shijie
      If (flags & MAP_LOCKED) is true, it means vm_flags has already contained
      the bit VM_LOCKED which is set by calc_vm_flag_bits().
      
      So there is no need to reset it again, just remove it.
      Signed-off-by: Huang Shijie <shijie8@gmail.com>
      Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cdf7b341
    • mm: move highest_memmap_pfn · 03f6462a
      Committed by Hugh Dickins
      Move highest_memmap_pfn __read_mostly from page_alloc.c next to zero_pfn
      __read_mostly in memory.c: to help them share a cacheline, since they're
      very often tested together in vm_normal_page().
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      03f6462a
    • mm: ZERO_PAGE without PTE_SPECIAL · 62eede62
      Committed by Hugh Dickins
      Reinstate anonymous use of ZERO_PAGE to all architectures, not just to
      those which __HAVE_ARCH_PTE_SPECIAL: as suggested by Nick Piggin.
      
      Contrary to how I'd imagined it, there's nothing ugly about this, just a
      zero_pfn test built into one or another block of vm_normal_page().
      
      But the MIPS ZERO_PAGE-of-many-colours case demands is_zero_pfn() and
      my_zero_pfn() inlines.  Reinstate its mremap move_pte() shuffling of
      ZERO_PAGEs we did from 2.6.17 to 2.6.19?  Not unless someone shouts for
      that: it would have to take vm_flags to weed out some cases.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Rik van Riel <riel@redhat.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      62eede62
    • mm: hugetlbfs_pagecache_present · 3ae77f43
      Committed by Hugh Dickins
      Rename hugetlbfs_backed() to hugetlbfs_pagecache_present()
      and add more comments, as suggested by Mel Gorman.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3ae77f43
    • mm: m(un)lock avoid ZERO_PAGE · 6e919717
      Committed by Hugh Dickins
      I'm still reluctant to clutter __get_user_pages() with another flag, just
      to avoid touching ZERO_PAGE count in mlock(); though we can add that later
      if it shows up as an issue in practice.
      
      But when mlocking, we can test page->mapping slightly earlier, to avoid
      the potentially bouncy rescheduling of lock_page on ZERO_PAGE - mlock
      didn't lock_page in olden ZERO_PAGE days, so we might have regressed.
      
      And when munlocking, it turns out that FOLL_DUMP coincidentally does
      what's needed to avoid all updates to ZERO_PAGE, so use that here also.
      Plus add comment suggested by KAMEZAWA Hiroyuki.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6e919717
    • mm: FOLL flags for GUP flags · 58fa879e
      Committed by Hugh Dickins
      __get_user_pages() has been taking its own GUP flags, then processing
      them into FOLL flags for follow_page().  Though oddly named, the FOLL
      flags are more widely used, so pass them to __get_user_pages() now.
      Sorry, VM flags, VM_FAULT flags and FAULT_FLAGs are still distinct.
      
      (The patch to __get_user_pages() looks peculiar, with both gup_flags
      and foll_flags: the gup_flags remain constant; but as before there's
      an exceptional case, out of scope of the patch, in which foll_flags
      per page have FOLL_WRITE masked off.)
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      58fa879e
    • mm: reinstate ZERO_PAGE · a13ea5b7
      Committed by Hugh Dickins
      KAMEZAWA Hiroyuki has observed customers of earlier kernels taking
      advantage of the ZERO_PAGE: which we stopped do_anonymous_page() from
      using in 2.6.24.  And there were a couple of regression reports on LKML.
      
      Following suggestions from Linus, reinstate do_anonymous_page() use of
      the ZERO_PAGE; but this time avoid dirtying its struct page cacheline
      with (map)count updates - let vm_normal_page() regard it as abnormal.
      
      Use it only on arches which __HAVE_ARCH_PTE_SPECIAL (x86, s390, sh32,
      most powerpc): that's not essential, but minimizes additional branches
      (keeping them in the unlikely pte_special case); and incidentally
      excludes mips (some models of which needed eight colours of ZERO_PAGE
      to avoid costly exceptions).
      
      Don't be fanatical about avoiding ZERO_PAGE updates: get_user_pages()
      callers won't want to make exceptions for it, so increment its count
      there.  Changes to mlock and migration? happily seems not needed.
      
      In most places it's quicker to check the pfn than the struct page address:
      prepare a __read_mostly zero_pfn for that.  Does get_dump_page() still
      need its ZERO_PAGE check?  Probably not, but keep it anyway.  (A userspace
      demonstration of the zero-fill semantics follows this entry.)
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a13ea5b7
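
      The userspace-visible semantics stay the same; a small demonstration that
      an untouched anonymous mapping reads back as zeroes without any write
      faults.  Whether the kernel backs those read faults with the shared
      ZERO_PAGE (as this patch allows) is invisible from here.

      /* Read an untouched anonymous mapping: it is zero-filled by definition. */
      #define _GNU_SOURCE
      #include <stdio.h>
      #include <sys/mman.h>

      int main(void)
      {
          size_t len = 16 * 4096;
          unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          unsigned long sum = 0;

          if (p == MAP_FAILED) {
              perror("mmap");
              return 1;
          }
          for (size_t i = 0; i < len; i++)
              sum += p[i];                 /* read-only faults, nothing is dirtied */
          printf("sum over untouched mapping = %lu (always 0)\n", sum);
          munmap(p, len);
          return 0;
      }
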
    • mm: fix anonymous dirtying · 1ac0cb5d
      Committed by Hugh Dickins
      do_anonymous_page() has been wrong to dirty the pte regardless.
      If it's not going to mark the pte writable, then it won't help
      to mark it dirty here, and clogs up memory with pages which will
      need swap instead of being thrown away.  Especially wrong if no
      overcommit is chosen, and this vma is not yet VM_ACCOUNTed -
      we could exceed the limit and OOM despite no overcommit.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: <stable@kernel.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1ac0cb5d
    • mm: follow_hugetlb_page flags · 2a15efc9
      Committed by Hugh Dickins
      follow_hugetlb_page() shouldn't be guessing about the coredump case
      either: pass the foll_flags down to it, instead of just the write bit.
      
      Remove that obscure huge_zeropage_ok() test.  The decision is easy,
      though unlike the non-huge case - here vm_ops->fault is always set.
      But we know that a fault would serve up zeroes, unless there's
      already a hugetlbfs pagecache page to back the range.
      
      (Alternatively, since hugetlb pages aren't swapped out under pressure,
      you could save more dump space by arguing that a page not yet faulted
      into this process cannot be relevant to the dump; but that would be
      more surprising.)
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2a15efc9
    • mm: FOLL_DUMP replace FOLL_ANON · 8e4b9a60
      Committed by Hugh Dickins
      The "FOLL_ANON optimization" and its use_zero_page() test have caused
      confusion and bugs: why does it test VM_SHARED? for the very good but
      unsatisfying reason that VMware crashed without.  As we look to maybe
      reinstating anonymous use of the ZERO_PAGE, we need to sort this out.
      
      Easily done: it's silly for __get_user_pages() and follow_page() to
      be guessing whether it's safe to assume that they're being used for
      a coredump (which can take a shortcut snapshot where other uses must
      handle a fault) - just tell them with GUP_FLAGS_DUMP and FOLL_DUMP.
      
      get_dump_page() doesn't even want a ZERO_PAGE: an error suits fine.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8e4b9a60
    • mm: add get_dump_page · f3e8fccd
      Committed by Hugh Dickins
      In preparation for the next patch, add a simple get_dump_page(addr)
      interface for the CONFIG_ELF_CORE dumpers to use, instead of calling
      get_user_pages() directly.  They're not interested in errors: they
      just want to use holes as much as possible, to save space and make
      sure that the data is aligned where the headers said it would be.
      
      Oh, and don't use that horrid DUMP_SEEK(off) macro!
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f3e8fccd
    • mm: remove unused GUP flags · 1c3aff1c
      Committed by Hugh Dickins
      GUP_FLAGS_IGNORE_VMA_PERMISSIONS and GUP_FLAGS_IGNORE_SIGKILL were
      flags added solely to prevent __get_user_pages() from doing some of
      what it usually does, in the munlock case: we can now remove them.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1c3aff1c
    • mm: munlock use follow_page · 408e82b7
      Committed by Hugh Dickins
      Hiroaki Wakabayashi points out that when mlock() has been interrupted
      by SIGKILL, the subsequent munlock() takes unnecessarily long because
      its use of __get_user_pages() insists on faulting in all the pages
      which mlock() never reached.
      
      It's worse than slowness if mlock() is terminated by Out Of Memory kill:
      the munlock_vma_pages_all() in exit_mmap() insists on faulting in all the
      pages which mlock() could not find memory for; so innocent bystanders are
      killed too, and perhaps the system hangs.
      
      __get_user_pages() does a lot that's silly for munlock(): so remove the
      munlock option from __mlock_vma_pages_range(), and use a simple loop of
      follow_page()s in munlock_vma_pages_range() instead; ignoring absent
      pages, and not marking present pages as accessed or dirty.
      
      (Change munlock() to only go so far as mlock() reached?  That does not
      work out, given the convention that mlock() claims complete success even
      when it has to give up early - in part so that an underlying file can be
      extended later, and those pages locked which earlier would give SIGBUS.)
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: <stable@kernel.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Hiroaki Wakabayashi <primulaelatior@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      408e82b7
    • page-allocator: maintain rolling count of pages to free from the PCP · a6f9edd6
      Committed by Mel Gorman
      When round-robin freeing pages from the PCP lists, empty lists may be
      encountered.  In the event one of the lists has more pages than another,
      there may be numerous checks for list_empty() which is undesirable.  This
      patch maintains a count of pages to free which is incremented when empty
      lists are encountered.  The intention is that more pages will then be
      freed from fuller lists than the empty ones reducing the number of empty
      list checks in the free path.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a6f9edd6
    • page-allocator: split per-cpu list into one-list-per-migrate-type · 5f8dcc21
      Committed by Mel Gorman
      The following two patches remove searching in the page allocator fast-path
      by maintaining multiple free-lists in the per-cpu structure.  At the time
      the search was introduced, increasing the per-cpu structures would waste a
      lot of memory as per-cpu structures were statically allocated at
      compile-time.  This is no longer the case.
      
      The patches are as follows. They are based on mmotm-2009-08-27.
      
      Patch 1 adds multiple lists to struct per_cpu_pages, one per
      	migratetype that can be stored on the PCP lists.
      
      Patch 2 notes that the pcpu drain path checks empty lists multiple times.
      	The patch reduces the number of checks by maintaining a count of free
      	lists encountered.  Lists containing pages will then free multiple
      	pages in batch.
      
      The patches were tested with kernbench, netperf udp/tcp, hackbench and
      sysbench.  The netperf tests were not bound to any CPU in particular and
      were run such that there is 99% confidence that the reported results are
      within 1% of the estimated mean.  sysbench was run with a postgres
      background and read-only tests.  Similar to netperf, it was run multiple
      times so that its results are within 1% at 99% confidence.  The patches
      were tested on x86, x86-64 and ppc64 as follows:
      
      x86:	Intel Pentium D 3GHz with 8G RAM (no-brand machine)
      	kernbench	- No significant difference, variance well within noise
      	netperf-udp	- 1.34% to 2.28% gain
      	netperf-tcp	- 0.45% to 1.22% gain
      	hackbench	- Small variances, very close to noise
      	sysbench	- Very small gains
      
      x86-64:	AMD Phenom 9950 1.3GHz with 8G RAM (no-brand machine)
      	kernbench	- No significant difference, variance well within noise
      	netperf-udp	- 1.83% to 10.42% gains
      	netperf-tcp	- No conclusive until buffer >= PAGE_SIZE
      				4096	+15.83%
      				8192	+ 0.34% (not significant)
      				16384	+ 1%
      	hackbench	- Small gains, very close to noise
      	sysbench	- 0.79% to 1.6% gain
      
      ppc64:	PPC970MP 2.5GHz with 10GB RAM (it's a terrasoft powerstation)
      	kernbench	- No significant difference, variance well within noise
      	netperf-udp	- 2-3% gain for almost all buffer sizes tested
      	netperf-tcp	- losses on small buffers, gains on larger buffers
      			  possibly indicates some bad caching effect.
      	hackbench	- No significant difference
      	sysbench	- 2-4% gain
      
      This patch:
      
      Currently the per-cpu page allocator searches the PCP list for pages of
      the correct migrate-type to reduce the possibility of pages being
      inappropriate placed from a fragmentation perspective.  This search is
      potentially expensive in a fast-path and undesirable.  Splitting the
      per-cpu list into multiple lists increases the size of a per-cpu structure
      and this was potentially a major problem at the time the search was
      introduced.  This problem has since been mitigated, as only the necessary
      number of structures is allocated for the running system.
      
      This patch replaces the list search in the per-cpu allocator with one list
      per migrate type (a simplified sketch of the new structure follows this
      entry).  The potential snag with this approach is when bulk freeing pages:
      we round-robin free pages based on migrate type, which has little bearing
      on the cache hotness of the page and potentially checks empty lists
      repeatedly in the event the majority of PCP pages are of one type.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Nick Piggin <npiggin@suse.de>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f8dcc21
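
      A simplified stand-in for the data-structure change: one free list per
      migrate type, so allocation indexes directly by type instead of searching
      a single mixed list.  The structure and helpers are illustrative, not the
      kernel's struct per_cpu_pages.

      /* Illustrative only: per-migratetype PCP lists replace a searched list. */
      #include <stdio.h>

      enum migratetype { MIGRATE_UNMOVABLE, MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE,
                         MIGRATE_PCPTYPES };

      struct page { int pfn; struct page *next; };

      struct per_cpu_pages {
          int count;                               /* pages across all lists */
          struct page *lists[MIGRATE_PCPTYPES];    /* one list per migrate type */
      };

      static void pcp_free(struct per_cpu_pages *pcp, struct page *page, int mt)
      {
          page->next = pcp->lists[mt];
          pcp->lists[mt] = page;
          pcp->count++;
      }

      static struct page *pcp_alloc(struct per_cpu_pages *pcp, int mt)
      {
          struct page *page = pcp->lists[mt];      /* direct index, no searching */

          if (!page)
              return NULL;                         /* the kernel would refill here */
          pcp->lists[mt] = page->next;
          pcp->count--;
          return page;
      }

      int main(void)
      {
          struct per_cpu_pages pcp = { 0, { NULL } };
          struct page p1 = { 1, NULL }, p2 = { 2, NULL };

          pcp_free(&pcp, &p1, MIGRATE_MOVABLE);
          pcp_free(&pcp, &p2, MIGRATE_UNMOVABLE);
          struct page *page = pcp_alloc(&pcp, MIGRATE_MOVABLE);
          printf("movable alloc -> pfn %d, %d page(s) left\n", page->pfn, pcp.count);
          return 0;
      }
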
    • oom: oom_kill doesn't kill vfork parent (or child) · 8c5cd6f3
      Committed by KOSAKI Motohiro
      Currently oom_kill doesn't only kill the victim process; it also kills all
      tasks that share the same mm, which means a vfork parent will be killed.

      This is definitely incorrect.  Other processes have their own oom_adj, and
      we shouldn't ignore it (it might be set to OOM_DISABLE).

      The following caller hits the minefield:
      
      ===============================
              switch (constraint) {
              case CONSTRAINT_MEMORY_POLICY:
                      oom_kill_process(current, gfp_mask, order, 0, NULL,
                                      "No available memory (MPOL_BIND)");
                      break;
      
      Note: force_sig(SIGKILL) sends SIGKILL to all threads in the process, so
      we don't need to care about multiple threads here.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8c5cd6f3
    • oom: make oom_score to per-process value · 495789a5
      Committed by KOSAKI Motohiro
      The oom-killer kills a process, not a task, so oom_score should be
      calculated per-process too.  This improves consistency and speeds up
      select_bad_process().
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      495789a5