1. 17 Jan 2010 (2 commits)
  2. 14 Jan 2010 (1 commit)
  3. 12 Jan 2010 (2 commits)
  4. 08 Jan 2010 (1 commit)
  5. 07 Jan 2010 (2 commits)
    • NOMMU: Use copy_*_user_page() in access_process_vm() · 7959722b
      Authored by Jie Zhang
      The MMU code uses the copy_*_user_page() variants in access_process_vm()
      rather than copy_*_user() as the former includes an icache flush.  This
      is important when doing things like setting software breakpoints with
      gdb.  So switch the NOMMU code over to do the same.
      
      This patch makes the reasonable assumption that copy_from_user_page()
      won't fail - which is probably fine, as we've checked the VMA from which
      we're copying is usable, and the copy is not allowed to cross VMAs.  The
      one case where it might go wrong is if the VMA is a device rather than
      RAM, and that device returns an error - in which case rubbish will be
      returned rather than EIO.
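      
      As a rough illustration, a minimal sketch of the switched copy step (the
      surrounding VMA lookup and length clamping are assumed to stay as they
      are; passing NULL for the struct page is part of the sketch, since NOMMU
      accesses the address directly):
      
          /* sketch: inside the NOMMU access_process_vm() copy loop */
          if (write && vma->vm_flags & VM_MAYWRITE)
                  copy_to_user_page(vma, NULL, addr,
                                    (void *) addr, buf, len);  /* flushes icache */
          else if (!write && vma->vm_flags & VM_MAYREAD)
                  copy_from_user_page(vma, NULL, addr,
                                      buf, (void *) addr, len);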
      Signed-off-by: Jie Zhang <jie.zhang@analog.com>
      Signed-off-by: Mike Frysinger <vapier@gentoo.org>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: David McCullough <david_mccullough@mcafee.com>
      Acked-by: Paul Mundt <lethal@linux-sh.org>
      Acked-by: Greg Ungerer <gerg@uclinux.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7959722b
    • NOMMU: Avoiding duplicate icache flushes of shared maps · cfe79c00
      Authored by Mike Frysinger
      When working with FDPIC, there are many shared mappings of read-only
      code regions between applications (the C library, applet packages like
      busybox, etc.), but the current do_mmap_pgoff() function will issue an
      icache flush whenever a VMA is added to an MM instead of only doing it
      when the map is initially created.
      
      The flush can instead be done when a region is first mmapped PROT_EXEC.
      Note that we may not rely on the first mapping of a region being
      executable - it's possible for it to be PROT_READ only, so we have to
      remember whether we've flushed the region or not, and then flush the
      entire region when a bit of it is made executable.
      
      However, this also affects the brk area.  That will no longer be
      executable.  We can mprotect() it to PROT_EXEC on MPU-mode kernels, but
      for NOMMU mode kernels, when it increases the brk allocation, making
      sys_brk() flush the extra from the icache should suffice.  The brk area
      probably isn't used by NOMMU programs anyway, since it can only use the
      leavings from the stack allocation when the stack allocation is larger
      than requested.
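      
      A hedged sketch of the flush-once idea: remember on the shared region
      whether its icache has been flushed, and flush the whole region the first
      time any mapping of it is made executable (the field and helper names
      here are illustrative, not the exact patch):
      
          /* sketch: called from the NOMMU mmap path when a mapping is added */
          static void flush_region_icache_once(struct vm_region *region,
                                               unsigned long prot)
          {
                  if (!(prot & PROT_EXEC) || region->vm_icache_flushed)
                          return;          /* not executable, or done already */
          
                  flush_icache_range(region->vm_start, region->vm_end);
                  region->vm_icache_flushed = true;
          }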
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Mike Frysinger <vapier@gentoo.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cfe79c00
  6. 31 Dec 2009 (1 commit)
  7. 29 Dec 2009 (1 commit)
  8. 24 Dec 2009 (1 commit)
  9. 22 Dec 2009 (1 commit)
  10. 18 Dec 2009 (2 commits)
    • mm: Add notifier in pageblock isolation for balloon drivers · 925cc71e
      Authored by Robert Jennings
      Memory balloon drivers can allocate a large amount of memory which is not
      movable but could be freed to accommodate memory hotplug remove.
      
      Prior to calling the memory hotplug notifier chain the memory in the
      pageblock is isolated.  Currently, if the migrate type is not
      MIGRATE_MOVABLE the isolation will not proceed, causing the memory removal
      for that page range to fail.
      
      Rather than failing pageblock isolation if the migratetype is not
      MIGRATE_MOVABLE, this patch uses a notifier chain to check whether all of
      the pages in the pageblock that are not on the LRU are owned by a
      registered balloon driver (or other entity).  If all of the non-movable
      pages are owned by a balloon, they can be freed later through the memory
      notifier chain and the range can still be isolated in
      set_migratetype_isolate().
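      
      A hedged sketch of how a balloon driver might answer such a notifier; the
      action name, payload fields and balloon_owns_page() helper are
      assumptions based on the description above:
      
          /* sketch: claim the non-LRU pages we own so isolation can proceed */
          static int balloon_isolate_notify(struct notifier_block *nb,
                                            unsigned long action, void *arg)
          {
                  struct memory_isolate_notify *info = arg;
                  unsigned long pfn;
          
                  if (action != MEM_ISOLATE_COUNT)
                          return NOTIFY_OK;
          
                  for (pfn = info->start_pfn;
                       pfn < info->start_pfn + info->nr_pages; pfn++)
                          if (balloon_owns_page(pfn_to_page(pfn)))
                                  info->pages_found++;    /* freeable by us later */
          
                  return NOTIFY_OK;
          }
          
          static struct notifier_block balloon_isolate_nb = {
                  .notifier_call = balloon_isolate_notify,
          };
          /* registered at driver init via the memory-isolate notifier chain */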
      Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Brian King <brking@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Gerald Schaefer <geralds@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      925cc71e
    • readahead: add blk_run_backing_dev · 65a80b4c
      Authored by Hisashi Hifumi
      I added blk_run_backing_dev() to page_cache_async_readahead() so that
      readahead I/O is unplugged, improving throughput especially in RAID
      environments.
      
      The normal case is: if page N becomes uptodate at time T(N), then T(N) <=
      T(N+1) holds.  With RAID (and NFS to some degree) there is no strict
      ordering: the data arrival time depends on the runtime status of
      individual disks, which breaks that formula.  So in
      do_generic_file_read(), just after submitting the async readahead IO
      request, the current page may well be uptodate, so the page won't be
      locked, and the block device won't be implicitly unplugged:
      
                     if (PageReadahead(page))
                              page_cache_async_readahead()
                      if (!PageUptodate(page))
                                      goto page_not_up_to_date;
                      //...
      page_not_up_to_date:
                      lock_page_killable(page);
      
      Therefore explicit unplugging can help.
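      
      A minimal sketch of the explicit unplug at the tail of the async
      readahead path (the surrounding function body is abridged):
      
          /* sketch: end of page_cache_async_readahead() after this patch */
          ondemand_readahead(mapping, ra, filp, true, offset, req_size);
          
          /*
           * The caller may not lock the page (it can already be uptodate), so
           * the device would never be unplugged implicitly - kick it here.
           */
          blk_run_backing_dev(mapping->backing_dev_info, NULL);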
      
      Following is the test result with dd.
      
      #dd if=testdir/testfile of=/dev/null bs=16384
      
      -2.6.30-rc6
      1048576+0 records in
      1048576+0 records out
      17179869184 bytes (17 GB) copied, 224.182 seconds, 76.6 MB/s
      
      -2.6.30-rc6-patched
      1048576+0 records in
      1048576+0 records out
      17179869184 bytes (17 GB) copied, 206.465 seconds, 83.2 MB/s
      
      (7Disks RAID-0 Array)
      
      -2.6.30-rc6
      1054976+0 records in
      1054976+0 records out
      17284726784 bytes (17 GB) copied, 212.233 seconds, 81.4 MB/s
      
      -2.6.30-rc6-patched
      1054976+0 records out
      17284726784 bytes (17 GB) copied, 198.878 seconds, 86.9 MB/s
      
      (7Disks RAID-5 Array)
      
      The patch was found to improve performance with the SCST scsi target
      driver.  See
      http://sourceforge.net/mailarchive/forum.php?thread_name=a0272b440906030714g67eabc5k8f847fb1e538cc62%40mail.gmail.com&forum_name=scst-devel
      
      [akpm@linux-foundation.org: unbust comment layout]
      [akpm@linux-foundation.org: "fix" CONFIG_BLOCK=n]
      Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
      Acked-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Tested-by: Ronald <intercommit@gmail.com>
      Cc: Bart Van Assche <bart.vanassche@gmail.com>
      Cc: Vladislav Bolkhovitin <vst@vlnb.net>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      65a80b4c
  11. 17 Dec 2009 (10 commits)
    • cpumask: avoid deprecated function in mm/slab.c · 58463c1f
      Authored by Rusty Russell
      These days we use cpumask_empty(), which takes a pointer.
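      
      For illustration only, the old and new forms side by side (do_work() is a
      stand-in for the guarded code in mm/slab.c):
      
          /* deprecated: passes the mask by value */
          if (!cpus_empty(*mask))
                  do_work();
          
          /* preferred: cpumask_empty() takes a const struct cpumask pointer */
          if (!cpumask_empty(mask))
                  do_work();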
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Acked-by: Christoph Lameter <cl@linux-foundation.org>
      58463c1f
    • Fix breakage in shmem.c · 718deb6b
      Authored by Al Viro
      Replacing
      	error = 0;
      	if (error)
      		op
      with nothing is not quite an equivalent transformation ;-)
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      718deb6b
    • x86: Fix checking of SRAT when node 0 ram is not from 0 · 32996250
      Authored by Yinghai Lu
      Found one system that boots from socket1 instead of socket0; the SRAT gets rejected...
      
      [    0.000000] SRAT: Node 1 PXM 0 0-a0000
      [    0.000000] SRAT: Node 1 PXM 0 100000-80000000
      [    0.000000] SRAT: Node 1 PXM 0 100000000-2080000000
      [    0.000000] SRAT: Node 0 PXM 1 2080000000-4080000000
      [    0.000000] SRAT: Node 2 PXM 2 4080000000-6080000000
      [    0.000000] SRAT: Node 3 PXM 3 6080000000-8080000000
      [    0.000000] SRAT: Node 4 PXM 4 8080000000-a080000000
      [    0.000000] SRAT: Node 5 PXM 5 a080000000-c080000000
      [    0.000000] SRAT: Node 6 PXM 6 c080000000-e080000000
      [    0.000000] SRAT: Node 7 PXM 7 e080000000-10080000000
      ...
      [    0.000000] NUMA: Allocated memnodemap from 500000 - 701040
      [    0.000000] NUMA: Using 20 for the hash shift.
      [    0.000000] Adding active range (0, 0x2080000, 0x4080000) 0 entries of 3200 used
      [    0.000000] Adding active range (1, 0x0, 0x96) 1 entries of 3200 used
      [    0.000000] Adding active range (1, 0x100, 0x7f750) 2 entries of 3200 used
      [    0.000000] Adding active range (1, 0x100000, 0x2080000) 3 entries of 3200 used
      [    0.000000] Adding active range (2, 0x4080000, 0x6080000) 4 entries of 3200 used
      [    0.000000] Adding active range (3, 0x6080000, 0x8080000) 5 entries of 3200 used
      [    0.000000] Adding active range (4, 0x8080000, 0xa080000) 6 entries of 3200 used
      [    0.000000] Adding active range (5, 0xa080000, 0xc080000) 7 entries of 3200 used
      [    0.000000] Adding active range (6, 0xc080000, 0xe080000) 8 entries of 3200 used
      [    0.000000] Adding active range (7, 0xe080000, 0x10080000) 9 entries of 3200 used
      [    0.000000] SRAT: PXMs only cover 917504MB of your 1048566MB e820 RAM. Not used.
      [    0.000000] SRAT: SRAT not used.
      
      The early_node_map is not sorted, because node0, with a non-zero start,
      comes first.
      
      So sort it right away, after all regions are registered.
      
      This also fixes the regression introduced by 8716273c (x86: Export srat physical topology).
      
      -v2: handle the cross-node case more robustly, e.g. node0 [0,4g), [8,12g) and node1 [4g,8g), [12g,16g)
      -v3: update comments.
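      
      A hedged sketch of the sort step: once every node range has been
      registered, order early_node_map[] by start PFN so later coverage checks
      see sorted entries (the comparison helper here is a sketch):
      
          /* sketch: sort early_node_map[] by start_pfn after registration */
          static int __init cmp_node_active_region(const void *a, const void *b)
          {
                  const struct node_active_region *ra = a, *rb = b;
          
                  return ra->start_pfn - rb->start_pfn;
          }
          
          void __init sort_node_map(void)
          {
                  sort(early_node_map, (size_t)nr_nodemap_entries,
                       sizeof(struct node_active_region),
                       cmp_node_active_region, NULL);
          }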
      Reported-and-tested-by: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      LKML-Reference: <4B2579D2.3010201@kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      32996250
    • NOMMU: Optimise away the {dac_,}mmap_min_addr tests · 6e141546
      Authored by David Howells
      In NOMMU mode clamp dac_mmap_min_addr to zero to cause the tests on it to be
      skipped by the compiler.  We do this as the minimum mmap address doesn't make
      any sense in NOMMU mode.
      
      mmap_min_addr and round_hint_to_min() can be discarded entirely in NOMMU mode.
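      
      A minimal sketch of the clamp (header placement and config symbols are
      assumptions; the point is that a constant zero lets the compiler drop
      every "addr < dac_mmap_min_addr" test):
      
          #ifdef CONFIG_MMU
          extern unsigned long mmap_min_addr;
          extern unsigned long dac_mmap_min_addr;
          #else
          /* NOMMU: a minimum mmap address is meaningless, so make it constant 0 */
          #define mmap_min_addr      0UL
          #define dac_mmap_min_addr  0UL
          #endif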
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Eric Paris <eparis@redhat.com>
      Signed-off-by: James Morris <jmorris@namei.org>
      6e141546
    • direct I/O fallback sync simplification · c05c4edd
      Authored by Christoph Hellwig
      In the case of direct I/O falling back to buffered I/O we sync data
      twice currently: once at the end of generic_file_buffered_write using
      filemap_write_and_wait_range and once a little later in
      __generic_file_aio_write using do_sync_mapping_range with all flags set.
      
      The wait-before-write part of the do_sync_mapping_range call does not
      make any sense, so just keep the filemap_write_and_wait_range call and
      move it to the right spot.
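      
      A hedged sketch of the fallback tail after this change (variable names
      are illustrative; the single, correctly placed flush-and-wait over just
      the buffered range is the point):
      
          /* sketch: direct I/O fell back to buffered writes for [pos, endbyte] */
          err = filemap_write_and_wait_range(file->f_mapping, pos, endbyte);
          if (err == 0) {
                  written = written_buffered;
                  /* drop the now-clean cached pages so later DIO isn't shadowed */
                  invalidate_mapping_pages(file->f_mapping,
                                           pos >> PAGE_CACHE_SHIFT,
                                           endbyte >> PAGE_CACHE_SHIFT);
          }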
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      c05c4edd
    • make generic_acl slightly more generic · 1c7c474c
      Authored by Christoph Hellwig
      Now that we cache the ACL pointers in the generic inode all the generic_acl
      cruft can go away and generic_acl.c can directly implement xattr handlers
      dealing with the full Posix ACL semantics for in-memory filesystems.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      1c7c474c
    • sanitize xattr handler prototypes · 431547b3
      Authored by Christoph Hellwig
      Add a flags argument to struct xattr_handler and pass it to all xattr
      handler methods.  This allows using the same methods for multiple
      handlers, e.g. for the ACL methods which perform exactly the same action
      for the access and default ACLs, just using a different underlying
      attribute.  With a little more groundwork it'll also allow sharing the
      methods for the regular user/trusted/secure handlers in extN, ocfs2 and
      jffs2 like it's already done for xfs in this patch.
      
      Also change the inode argument to the handlers to a dentry, to allow
      using the handler mechanism later for filesystems that require it,
      e.g. cifs.
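      
      A hedged sketch of the resulting shape: one handler implementation shared
      by two struct xattr_handler instances that differ only in .flags (the
      acl_get_by_type() helper is illustrative):
      
          /* sketch: one get() shared by the access and default ACL handlers */
          static int shared_acl_get(struct dentry *dentry, const char *name,
                                    void *buffer, size_t size, int type)
          {
                  /* `type` is handler->flags, so one function serves both ACLs */
                  return acl_get_by_type(dentry->d_inode, type, buffer, size);
          }
          
          struct xattr_handler acl_access_handler = {
                  .prefix = POSIX_ACL_XATTR_ACCESS,
                  .flags  = ACL_TYPE_ACCESS,
                  .get    = shared_acl_get,
          };
          
          struct xattr_handler acl_default_handler = {
                  .prefix = POSIX_ACL_XATTR_DEFAULT,
                  .flags  = ACL_TYPE_DEFAULT,
                  .get    = shared_acl_get,
          };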
      
      [with GFS2 bits updated by Steven Whitehouse <swhiteho@redhat.com>]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jmorris@namei.org>
      Acked-by: Joel Becker <joel.becker@oracle.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      431547b3
    • Untangling ima mess, part 1: alloc_file() · 0552f879
      Authored by Al Viro
      There are 2 groups of alloc_file() callers:
      	* ones that are followed by ima_counts_get
      	* ones giving non-regular files
      So let's pull that ima_counts_get() into alloc_file();
      it's a no-op in case of non-regular files.
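      
      A hedged sketch of the consolidation (only the placement of
      ima_counts_get() is the point; the rest of alloc_file() is elided):
      
          /* sketch: tail of alloc_file(), after the struct file is set up */
          if (file) {
                  /* no-op for non-regular files, so safe for every caller */
                  ima_counts_get(file);
          }
          return file;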
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      0552f879
    • switch alloc_file() to passing struct path · 2c48b9c4
      Authored by Al Viro
      ... and have the caller grab both mnt and dentry; kill
      leak in infiniband, while we are at it.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      2c48b9c4
    • switch shmem_file_setup() to alloc_file() · 4b42af81
      Authored by Al Viro
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      4b42af81
  12. 16 Dec 2009 (16 commits)
    • memcg: code clean, remove unused variable in mem_cgroup_resize_limit() · aa20d489
      Authored by Bob Liu
      Variable `progress' isn't used in mem_cgroup_resize_limit() any more.
      Remove it.
      
      [akpm@linux-foundation.org: cleanup]
      Signed-off-by: Bob Liu <lliubbo@gmail.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aa20d489
    • memcg: remove memcg_tasklist · 9ab322ca
      Authored by Daisuke Nishimura
      memcg_tasklist was introduced by commit 7f4d454d (memcg: avoid deadlock
      caused by race between oom and cpuset_attach) in place of cgroup_mutex to
      fix a deadlock problem.  The cgroup_mutex, which was removed by that
      commit, had originally been introduced in mem_cgroup_out_of_memory() by
      commit c7ba5c9e (Memory controller: OOM handling).
      
      IIUC, the intention of this cgroup_mutex was to prevent tasks from moving
      during select_bad_process() so that situations like the following can be
      avoided.
      
        Assume cgroup "foo" has exceeded its limit and is about to trigger oom.
        1. Process A, which has been in cgroup "baa" and uses large memory, is just
           moved to cgroup "foo". Process A can be the candidates for being killed.
        2. Process B, which has been in cgroup "foo" and uses large memory, is just
           moved from cgroup "foo". Process B can be excluded from the candidates for
           being killed.
      
      But these race windows exist anyway, even if we hold a lock, because
      __mem_cgroup_try_charge() decides whether it should trigger oom or not
      outside of the lock.  So the original cgroup_mutex in
      mem_cgroup_out_of_memory(), and thus the current memcg_tasklist, has no
      use.  And IMHO those races are not so critical for users.
      
      This patch removes it and makes the code simpler.
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9ab322ca
    • memcg: avoid oom-killing innocent task in case of use_hierarchy · d31f56db
      Authored by Daisuke Nishimura
      task_in_mem_cgroup(), which is called by select_bad_process() to check
      whether a task can be a candidate for being oom-killed from a memcg's
      limit, checks "curr->use_hierarchy" ("curr" is the mem_cgroup the task
      belongs to).
      
      But this check returns true (a false positive) when:
      
      	<some path>/aa		use_hierarchy == 0	<- hitting limit
      	  <some path>/aa/00	use_hierarchy == 1	<- the task belongs to
      
      This leads to killing an innocent task in aa/00.  This patch fixes that
      bug.  It also fixes the argument to mem_cgroup_print_oom_info(): we
      should print the information of the mem_cgroup to which the task being
      killed, not current, belongs.
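      
      For illustration, a hedged sketch of a hierarchy-aware membership test of
      the kind described (a sketch of the idea only, not the exact fix):
      
          /* sketch: does the task's memcg `curr` fall under the limit-hitting
           * memcg `mem`?  Honour mem->use_hierarchy, not curr's. */
          static bool task_memcg_under(struct mem_cgroup *mem,
                                       struct mem_cgroup *curr)
          {
                  if (mem->use_hierarchy)
                          return css_is_ancestor(&curr->css, &mem->css);
                  return curr == mem;
          }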
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d31f56db
    • memcg: cleanup mem_cgroup_move_parent() · 57f9fd7d
      Authored by Daisuke Nishimura
      mem_cgroup_move_parent() calls try_charge first and cancel_charge on
      failure.  IMHO, charge/uncharge (especially charge) is a high-cost
      operation, so we should avoid it as far as possible.
      
      This patch tries to delay try_charge in mem_cgroup_move_parent() by
      re-ordering the checks it does.
      
      And this patch renames mem_cgroup_move_account() to
      __mem_cgroup_move_account(), changes the return value of
      __mem_cgroup_move_account() from int to void, and adds a new
      wrapper(mem_cgroup_move_account()), which checks whether a @pc is valid
      for moving account and calls __mem_cgroup_move_account().
      
      This patch removes the last caller of trylock_page_cgroup(), so removes
      its definition too.
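      
      A hedged sketch of the resulting wrapper shape (the validity checks are
      abridged):
      
          /* sketch: thin wrapper that validates the page_cgroup, then delegates */
          static int mem_cgroup_move_account(struct page_cgroup *pc,
                                             struct mem_cgroup *from,
                                             struct mem_cgroup *to)
          {
                  int ret = -EINVAL;
          
                  lock_page_cgroup(pc);
                  if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
                          __mem_cgroup_move_account(pc, from, to);  /* returns void */
                          ret = 0;
                  }
                  unlock_page_cgroup(pc);
                  return ret;
          }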
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      57f9fd7d
    • memcg: add mem_cgroup_cancel_charge() · a3032a2c
      Authored by Daisuke Nishimura
      There are some places that call both res_counter_uncharge() and css_put()
      to cancel the charge and the refcnt we have got from
      mem_cgroup_try_charge().
      
      This patch introduces mem_cgroup_cancel_charge() and calls it in those
      places.
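      
      A minimal sketch of the new helper, following the description above (the
      root-cgroup and swap-accounting details mirror the memcontrol.c
      conventions of the time and should be treated as assumptions):
      
          /* sketch: undo a successful try_charge in one place */
          static void mem_cgroup_cancel_charge(struct mem_cgroup *mem)
          {
                  if (!mem_cgroup_is_root(mem)) {
                          res_counter_uncharge(&mem->res, PAGE_SIZE);
                          if (do_swap_account)
                                  res_counter_uncharge(&mem->memsw, PAGE_SIZE);
                  }
                  css_put(&mem->css);     /* drop the ref taken by try_charge */
          }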
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a3032a2c
    • memcg: make memcg's file mapped consistent with global VM · d8046582
      Authored by KAMEZAWA Hiroyuki
      The global VM uses FILE_MAPPED, but memcg uses MAPPED_FILE.  This makes
      grepping difficult, so replace memcg's MAPPED_FILE with FILE_MAPPED.
      
      Also, the global VM accounts mapped shared memory into FILE_MAPPED, but
      memcg doesn't; fix it.
      Note:
        page_is_file_cache() just checks SwapBacked or not,
        so we need to check PageAnon as well.
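      
      For illustration, a hedged sketch of the accounting check implied by the
      note (the helper names are illustrative):
      
          /* sketch: account a mapped page as FILE_MAPPED unless it is anonymous;
           * mapped shmem is SwapBacked but !PageAnon, so it is counted here too */
          static void memcg_account_mapped(struct page *page, int val)
          {
                  if (PageAnon(page))
                          return;
                  mem_cgroup_update_file_mapped(page, val);  /* val = +1 / -1 */
          }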
      
      Cc: Balbir Singh <balbir@in.ibm.com>
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d8046582
    • memcg: coalesce charging via percpu storage · cdec2e42
      Authored by KAMEZAWA Hiroyuki
      This is a patch for coalescing access to the res_counter at charge time
      via percpu caching.  At charge, memcg charges 64 pages and remembers the
      surplus in a percpu cache.  Because it is a cache, it is drained/flushed
      when necessary.
      
      This version uses the public percpu area.
      There are 2 benefits to using the public percpu area:
       1. The sum of stocked charge in the system is limited to the number of
          cpus, not to the number of memcgs.  This gives better synchronization.
       2. The drain code for flush/cpuhotplug is very easy (and quick).
      
      The most important point of this patch is that we never touch the
      res_counter in the fast path.  The res_counter is a system-wide shared
      counter which is modified very frequently, so we should avoid touching it
      wherever we can to prevent false sharing.
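      
      A hedged sketch of the percpu "stock" idea (names and sizes follow the
      description; locking and the drain side are omitted):
      
          #define CHARGE_SIZE  (64 * PAGE_SIZE)   /* charge 64 pages at a time */
          
          struct memcg_stock_pcp {
                  struct mem_cgroup *cached;      /* memcg the stock belongs to     */
                  int charge;                     /* pre-charged bytes not yet used */
          };
          static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
          
          /* fast path: take one page's worth from the local stock
           * without touching the shared res_counter at all */
          static bool consume_stock(struct mem_cgroup *mem)
          {
                  struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);
                  bool ret = true;
          
                  if (mem == stock->cached && stock->charge >= PAGE_SIZE)
                          stock->charge -= PAGE_SIZE;
                  else
                          ret = false;            /* miss: fall back to res_counter */
                  put_cpu_var(memcg_stock);
                  return ret;
          }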
      
      On an x86-64 8-cpu server, I tested the overhead of memcg at page fault
      by running a program which does map/fault/unmap in a loop, running one
      task per cpu via taskset, and summing the number of page faults over 60
      seconds.
      
      [without memcg config]
        40156968  page-faults              #      0.085 M/sec   ( +-   0.046% )
        27.67 cache-miss/faults
      
      [root cgroup]
        36659599  page-faults              #      0.077 M/sec   ( +-   0.247% )
        31.58 cache miss/faults
      
      [in a child cgroup]
        18444157  page-faults              #      0.039 M/sec   ( +-   0.133% )
        69.96 cache miss/faults
      
      [ + coalescing uncharge patch]
        27133719  page-faults              #      0.057 M/sec   ( +-   0.155% )
        47.16 cache miss/faults
      
      [ + coalescing uncharge patch + this patch ]
        34224709  page-faults              #      0.072 M/sec   ( +-   0.173% )
        34.69 cache miss/faults
      
      Changelog (since Oct/2):
        - updated comments
        - replaced get_cpu_var() with __get_cpu_var() if possible.
        - removed the mutex for system-wide drain; added a counter instead.
        - removed CONFIG_HOTPLUG_CPU
      
      Changelog (old):
        - rebased onto the latest mmotm
        - moved charge size check before __GFP_WAIT check for avoiding unnecessary
        - added asynchronous flush routine.
        - fixed bugs pointed out by Nishimura-san.
      
      [akpm@linux-foundation.org: tweak comments]
      [nishimura@mxp.nes.nec.co.jp: don't do INIT_WORK() repeatedly against the same work_struct]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cdec2e42
    • memcg: coalesce uncharge during unmap/truncate · 569b846d
      Authored by KAMEZAWA Hiroyuki
      In a massively parallel environment, res_counter can be a performance
      bottleneck.  One strong technique for reducing lock contention is to
      reduce the number of calls by coalescing several of them into one.
      
      Considering charge/uncharge characteristics:
      	- charge is done one by one via demand-paging.
      	- uncharge is done
      		- in chunks at munmap, truncate, exit, execve...
      		- one by one via vmscan/paging.
      
      It seems we have a chance to coalesce uncharges to improve scalability at
      unmap/truncation.
      
      This patch coalesces uncharge.  To avoid scattering memcg's structures
      into functions under /mm, it adds memcg batch-uncharge information to the
      task.  The reason for per-task batching is to make use of the caller's
      context information: we do batched (delayed) uncharge when
      truncation/unmap occurs, but direct uncharge when uncharge is called by
      memory reclaim (vmscan.c).
      
      The degree of coalescing depends on the caller:
        - at invalidate/truncate ... pagevec size
        - at unmap ... ZAP_BLOCK_SIZE
      (memory itself will be freed at this granularity.)
      So we will not coalesce too much.
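      
      A hedged sketch of the per-task batching described above (field names are
      illustrative; the direct-uncharge path used by reclaim is omitted):
      
          /* sketch: carried in task_struct while a truncate/unmap is in progress */
          struct memcg_batch_info {
                  int do_batch;               /* nesting depth of batched sections   */
                  struct mem_cgroup *memcg;   /* memcg being uncharged in this batch */
                  unsigned long bytes;        /* accumulated charge to give back     */
                  unsigned long memsw_bytes;  /* accumulated mem+swap charge         */
          };
          
          void mem_cgroup_uncharge_start(void)
          {
                  current->memcg_batch.do_batch++;
          }
          
          void mem_cgroup_uncharge_end(void)
          {
                  struct memcg_batch_info *batch = &current->memcg_batch;
          
                  if (--batch->do_batch)      /* still inside a nested section */
                          return;
                  if (!batch->memcg)
                          return;
                  /* give the accumulated charge back to res_counter in one shot */
                  res_counter_uncharge(&batch->memcg->res, batch->bytes);
                  if (batch->memsw_bytes)
                          res_counter_uncharge(&batch->memcg->memsw, batch->memsw_bytes);
                  batch->memcg = NULL;
                  batch->bytes = batch->memsw_bytes = 0;
          }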
      
      On an x86-64 8-cpu server, I tested the overhead of memcg at page fault
      by running a program which does map/fault/unmap in a loop, running one
      task per cpu via taskset, and summing the number of page faults over 60
      seconds.
      
      [without memcg config]
        40156968  page-faults              #      0.085 M/sec   ( +-   0.046% )
        27.67 cache-miss/faults
      [root cgroup]
        36659599  page-faults              #      0.077 M/sec   ( +-   0.247% )
        31.58 miss/faults
      [in a child cgroup]
        18444157  page-faults              #      0.039 M/sec   ( +-   0.133% )
        69.96 miss/faults
      [child with this patch]
        27133719  page-faults              #      0.057 M/sec   ( +-   0.155% )
        47.16 miss/faults
      
      We can see some amount of improvement.
      (The root cgroup is not affected by this patch.)
      Another patch, for "charge", will follow this one and improve the numbers above further.
      
      Changelog (since 2009/10/02):
       - renamed fields of memcg_batch (pages to bytes, memsw to memsw_bytes)
       - some cleanup and commentary/description updates.
       - added initialization code to copy_process() (possible bug fix).
      
      Changelog (old):
       - fixed the !CONFIG_MEM_CGROUP case.
       - rebased onto the latest mmotm + softlimit fix patches.
       - unified patch for callers.
       - added comments.
       - made ->do_batch a bool.
       - removed css_get() et al.  We don't need it.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      569b846d
    • memcg: fix memory.memsw.usage_in_bytes for root cgroup · cd9b45b7
      Authored by Kirill A. Shutemov
      A memory cgroup has a memory.memsw.usage_in_bytes file.  It shows the sum
      of the usage of pages and swapents in the cgroup.  Presently, the root
      cgroup's memsw.usage_in_bytes shows the wrong value - the number of
      swapents is not added in.
      
      So take MEM_CGROUP_STAT_SWAPOUT into account.
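      
      For illustration, a hedged sketch of the fix's shape (the recursive stat
      helper is illustrative, not the exact function used):
      
          /* sketch: when reporting root usage for the memsw file,
           * fold the swapped-out pages back in */
          val = memcg_stat_recursive(mem, MEM_CGROUP_STAT_CACHE) +
                memcg_stat_recursive(mem, MEM_CGROUP_STAT_RSS);
          if (is_memsw_file)
                  val += memcg_stat_recursive(mem, MEM_CGROUP_STAT_SWAPOUT);
          val <<= PAGE_SHIFT;             /* pages -> bytes */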
      Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cd9b45b7
    • oom-kill: fix NUMA constraint check with nodemask · 4365a567
      Authored by KAMEZAWA Hiroyuki
      Fix node-oriented allocation handling in oom-kill.c.  I think of this as
      a bugfix, not an enhancement.
      
      These days, things have changed:
        - alloc_pages() takes a nodemask as its argument, via __alloc_pages_nodemask().
        - mempolicy doesn't maintain its own private zonelists.
          (And cpuset doesn't use a nodemask for __alloc_pages_nodemask().)
      
      So the current oom-killer's check function is wrong.
      
      This patch does the following:
        - Check the nodemask: if a nodemask is given and it doesn't cover all of
          node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY.
        - Scan all zones in the zonelist under the nodemask; if it hits a
          cpuset's wall, this failure is from the cpuset.
      And
        - Modify the caller of out_of_memory not to call oom if __GFP_THISNODE.
          This doesn't change the "current" behavior.  If callers use
          __GFP_THISNODE they should handle "page allocation failure" by
          themselves.
      
        - Handle the __GFP_NOFAIL+__GFP_THISNODE path.
          This is something like a FIXME, but this gfpmask is not used now.
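      
      A hedged sketch of the resulting constraint check (constant and helper
      names follow the mainline OOM code of the time; treat the whole thing as
      a sketch):
      
          /* sketch: classify why the allocation failed before picking a victim */
          static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
                                                       gfp_t gfp_mask,
                                                       nodemask_t *nodemask)
          {
          #ifdef CONFIG_NUMA
                  struct zone *zone;
                  struct zoneref *z;
                  enum zone_type high_zoneidx = gfp_zone(gfp_mask);
          
                  /* nodemask not covering every memory node => mempolicy OOM */
                  if (nodemask &&
                      !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
                          return CONSTRAINT_MEMORY_POLICY;
          
                  /* otherwise, a zone we may not use => this is a cpuset OOM */
                  for_each_zone_zonelist_nodemask(zone, z, zonelist,
                                                  high_zoneidx, nodemask)
                          if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
                                  return CONSTRAINT_CPUSET;
          #endif
                  return CONSTRAINT_NONE;
          }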
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4365a567
    • oom-kill: show virtual size and rss information of the killed process · 3b4798cb
      Authored by KOSAKI Motohiro
      In a typical oom analysis scenario, the first thing we want to know is
      whether the killed process had a memory leak or not.  This patch adds vsz
      and rss information to the oom log to help that analysis and to save
      debugging time.
      
      example:
      ===================================================================
      rsyslogd invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
      Pid: 1308, comm: rsyslogd Not tainted 2.6.32-rc6 #24
      Call Trace:
      [<ffffffff8132e35b>] ?_spin_unlock+0x2b/0x40
      [<ffffffff810f186e>] oom_kill_process+0xbe/0x2b0
      
      (snip)
      
      492283 pages non-shared
      Out of memory: kill process 2341 (memhog) score 527276 or a child
      Killed process 2341 (memhog) vsz:1054552kB, anon-rss:970588kB, file-rss:4kB
      ===========================================================================
                                   ^
                                   |
                                  here
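      
      A hedged sketch of the added reporting line (counter names match the
      example output above; locking and the surrounding function are abridged):
      
          #define K(x) ((x) << (PAGE_SHIFT - 10))         /* pages -> kB */
          
          /* sketch: report the victim's size so leak vs. legitimate use is obvious */
          task_lock(p);
          pr_err("Killed process %d (%s) vsz:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
                 task_pid_nr(p), p->comm,
                 K(p->mm->total_vm),
                 K(get_mm_counter(p->mm, anon_rss)),
                 K(get_mm_counter(p->mm, file_rss)));
          task_unlock(p);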
      
      [rientjes@google.com: fix race, add pid & comm to message]
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3b4798cb
    • HWPOISON: Remove stray phrase in a comment · f2c03deb
      Authored by Andi Kleen
      Better to have complete sentences.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      f2c03deb
    • 12686d15
    • 0d57eb8d
    • HWPOISON: Add a madvise() injector for soft page offlining · afcf938e
      Authored by Andi Kleen
      Process-based injection is much easier to handle for test programs, which
      can first bring a page into a specific state and then test.  So add a new
      MADV_SOFT_OFFLINE madvise operation to soft offline a page, similar to
      the existing hard offline injector.
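      
      A hedged sketch of the madvise hook (surrounding checks, flags and the
      exact constant value are assumptions):
      
          /* sketch: the MADV_SOFT_OFFLINE branch of the hwpoison injector,
           * reachable only with CAP_SYS_ADMIN */
          if (behavior == MADV_SOFT_OFFLINE) {
                  if (get_user_pages_fast(start, 1, 0, &page) != 1)
                          return -EIO;
                  pr_info("Soft offlining page %#lx at %#lx\n",
                          page_to_pfn(page), start);
                  ret = soft_offline_page(page, 0);
                  if (ret)
                          return ret;
          }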
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      afcf938e
    • HWPOISON: Add soft page offline support · facb6011
      Authored by Andi Kleen
      This is a simpler, gentler variant of memory_failure() for soft page
      offlining, controlled from user space.  It doesn't kill anything; it just
      tries to invalidate the page and, if that doesn't work, migrate it away.
      
      This is useful for predictive failure analysis, where a page has
      a high rate of corrected errors, but hasn't gone bad yet. Instead
      it can be offlined early and avoided.
      
      The offlining is controlled from sysfs, including a new generic
      entry point for hard page offlining for symmetry too.
      
      We use the page isolation facility to prevent re-allocation races.
      Normally this is only used by memory hotplug.  To avoid races with memory
      allocation I am using lock_system_sleep().  This avoids the situation
      where memory hotplug is about to isolate a page range and then hwpoison
      undoes that work.  This is a big hammer, but it is the simplest solution
      for now.
      
      When the page is not free or on the LRU we try to free pages from slab
      and other caches.  The slab freeing is currently quite dumb and does not
      try to focus on the specific slab cache which might own the page.  This
      could potentially be improved later.
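      
      A hedged sketch of the overall flow (error handling and the reference
      counting are simplified; get_any_page() and migrate_away() stand in for
      the real helpers):
      
          /* sketch: gentle offline - never kills, only invalidates or migrates */
          int soft_offline_page(struct page *page, int flags)
          {
                  int ret;
          
                  lock_system_sleep();            /* keep memory hotplug away */
                  ret = get_any_page(page, page_to_pfn(page), flags);
                  if (ret <= 0)
                          goto out;
          
                  if (invalidate_inode_page(page)) {
                          ret = 0;                /* clean pagecache page: just dropped */
                  } else if (isolate_lru_page(page) == 0) {
                          ret = migrate_away(page);  /* in use: move contents first */
                  } else {
                          ret = -EBUSY;           /* could not isolate; give up */
                  }
                  if (ret == 0)
                          SetPageHWPoison(page);  /* the frame is never handed out again */
          out:
                  unlock_system_sleep();
                  return ret;
          }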
      
      Thanks to Fengguang Wu and Haicheng Li for some fixes.
      
      [Added fix from Andrew Morton to adapt to new migrate_pages prototype]
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      facb6011