1. 22 Dec 2009, 1 commit
  2. 18 Dec 2009, 1 commit
    • readahead: add blk_run_backing_dev · 65a80b4c
      Authored by Hisashi Hifumi
      I added blk_run_backing_dev to page_cache_async_readahead so that
      readahead I/O is unplugged, to improve throughput especially in RAID
      environments.
      
      In the normal case, if page N becomes uptodate at time T(N), then T(N) <=
      T(N+1) holds.  With RAID (and NFS to some degree) there is no strict
      ordering: the data arrival time depends on the runtime status of the
      individual disks, which breaks that formula.  So in do_generic_file_read(), just
      after submitting the async readahead IO request, the current page may well
      be uptodate, so the page won't be locked, and the block device won't be
      implicitly unplugged:
      
                     if (PageReadahead(page))
                             page_cache_async_readahead();
                     if (!PageUptodate(page))
                             goto page_not_up_to_date;
                     /* ... */
             page_not_up_to_date:
                     lock_page_killable(page);
      
      Therefore explicit unplugging can help.
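      A sketch of the shape of the change (not the verbatim patch; the exact
      call site inside mm/readahead.c may differ):

              void page_cache_async_readahead(struct address_space *mapping,
                                              struct file_ra_state *ra,
                                              struct file *filp,
                                              struct page *page, pgoff_t offset,
                                              unsigned long req_size)
              {
                      /* ... existing checks and ondemand_readahead() ... */
              #ifdef CONFIG_BLOCK
                      /*
                       * Kick the queue explicitly: the reader may never block
                       * in lock_page(), so the implicit unplug may never
                       * happen.
                       */
                      blk_run_backing_dev(mapping->backing_dev_info, NULL);
              #endif
              }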
      
      Following are the test results with dd.
      
      #dd if=testdir/testfile of=/dev/null bs=16384
      
      -2.6.30-rc6
      1048576+0 records in
      1048576+0 records out
      17179869184 bytes (17 GB) copied, 224.182 seconds, 76.6 MB/s
      
      -2.6.30-rc6-patched
      1048576+0 records in
      1048576+0 records out
      17179869184 bytes (17 GB) copied, 206.465 seconds, 83.2 MB/s
      
      (7Disks RAID-0 Array)
      
      -2.6.30-rc6
      1054976+0 records in
      1054976+0 records out
      17284726784 bytes (17 GB) copied, 212.233 seconds, 81.4 MB/s
      
      -2.6.30-rc6-patched
      1054976+0 records out
      17284726784 bytes (17 GB) copied, 198.878 seconds, 86.9 MB/s
      
      (7Disks RAID-5 Array)
      
      The patch was found to improve performance with the SCST scsi target
      driver.  See
      http://sourceforge.net/mailarchive/forum.php?thread_name=a0272b440906030714g67eabc5k8f847fb1e538cc62%40mail.gmail.com&forum_name=scst-devel
      
      [akpm@linux-foundation.org: unbust comment layout]
      [akpm@linux-foundation.org: "fix" CONFIG_BLOCK=n]
      Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
      Acked-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Tested-by: Ronald <intercommit@gmail.com>
      Cc: Bart Van Assche <bart.vanassche@gmail.com>
      Cc: Vladislav Bolkhovitin <vst@vlnb.net>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      65a80b4c
  3. 17 Dec 2009, 10 commits
    • cpumask: avoid deprecated function in mm/slab.c · 58463c1f
      Authored by Rusty Russell
      These days we use cpumask_empty() which takes a pointer.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Acked-by: Christoph Lameter <cl@linux-foundation.org>
      58463c1f
    • Fix breakage in shmem.c · 718deb6b
      Authored by Al Viro
      Replacing
      	error = 0;
      	if (error)
      		op
      with nothing is not quite an equivalent transformation ;-)
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      718deb6b
    • x86: Fix checking of SRAT when node 0 ram is not from 0 · 32996250
      Authored by Yinghai Lu
      Found one system that boots from socket1 instead of socket0; its SRAT
      gets rejected:
      
      [    0.000000] SRAT: Node 1 PXM 0 0-a0000
      [    0.000000] SRAT: Node 1 PXM 0 100000-80000000
      [    0.000000] SRAT: Node 1 PXM 0 100000000-2080000000
      [    0.000000] SRAT: Node 0 PXM 1 2080000000-4080000000
      [    0.000000] SRAT: Node 2 PXM 2 4080000000-6080000000
      [    0.000000] SRAT: Node 3 PXM 3 6080000000-8080000000
      [    0.000000] SRAT: Node 4 PXM 4 8080000000-a080000000
      [    0.000000] SRAT: Node 5 PXM 5 a080000000-c080000000
      [    0.000000] SRAT: Node 6 PXM 6 c080000000-e080000000
      [    0.000000] SRAT: Node 7 PXM 7 e080000000-10080000000
      ...
      [    0.000000] NUMA: Allocated memnodemap from 500000 - 701040
      [    0.000000] NUMA: Using 20 for the hash shift.
      [    0.000000] Adding active range (0, 0x2080000, 0x4080000) 0 entries of 3200 used
      [    0.000000] Adding active range (1, 0x0, 0x96) 1 entries of 3200 used
      [    0.000000] Adding active range (1, 0x100, 0x7f750) 2 entries of 3200 used
      [    0.000000] Adding active range (1, 0x100000, 0x2080000) 3 entries of 3200 used
      [    0.000000] Adding active range (2, 0x4080000, 0x6080000) 4 entries of 3200 used
      [    0.000000] Adding active range (3, 0x6080000, 0x8080000) 5 entries of 3200 used
      [    0.000000] Adding active range (4, 0x8080000, 0xa080000) 6 entries of 3200 used
      [    0.000000] Adding active range (5, 0xa080000, 0xc080000) 7 entries of 3200 used
      [    0.000000] Adding active range (6, 0xc080000, 0xe080000) 8 entries of 3200 used
      [    0.000000] Adding active range (7, 0xe080000, 0x10080000) 9 entries of 3200 used
      [    0.000000] SRAT: PXMs only cover 917504MB of your 1048566MB e820 RAM. Not used.
      [    0.000000] SRAT: SRAT not used.
      
      The early_node_map is not sorted, because node 0, with a non-zero start,
      comes first.
      
      So sort it right away, after all regions are registered.
      
      This also fixes a regression introduced by commit 8716273c (x86: Export
      srat physical topology).
      
      -v2: handle the cross-node case, e.g. node0 [0,4g), [8g,12g) and
           node1 [4g,8g), [12g,16g), more robustly
      -v3: update comments.
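      A hedged sketch of the sort, using the kernel's sort() helper
      (linux/sort.h); field and variable names follow mm/page_alloc.c of that
      era:

              static int __init cmp_node_active_region(const void *a, const void *b)
              {
                      const struct node_active_region *r1 = a, *r2 = b;

                      /* order early_node_map entries by their first pfn */
                      if (r1->start_pfn > r2->start_pfn)
                              return 1;
                      if (r1->start_pfn < r2->start_pfn)
                              return -1;
                      return 0;
              }

              static void __init sort_node_map(void)
              {
                      sort(early_node_map, (size_t)nr_nodemap_entries,
                           sizeof(struct node_active_region),
                           cmp_node_active_region, NULL);
              }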
      Reported-and-tested-by: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      LKML-Reference: <4B2579D2.3010201@kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      32996250
    • NOMMU: Optimise away the {dac_,}mmap_min_addr tests · 6e141546
      Authored by David Howells
      In NOMMU mode clamp dac_mmap_min_addr to zero to cause the tests on it to be
      skipped by the compiler.  We do this as the minimum mmap address doesn't make
      any sense in NOMMU mode.
      
      mmap_min_addr and round_hint_to_min() can be discarded entirely in NOMMU mode.
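      Roughly, the clamp amounts to this (a sketch, not the verbatim patch):

              #ifdef CONFIG_MMU
              extern unsigned long dac_mmap_min_addr;
              #else
              /* constant 0 lets the compiler drop every test against it */
              #define dac_mmap_min_addr       0UL
              #endif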
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Eric Paris <eparis@redhat.com>
      Signed-off-by: James Morris <jmorris@namei.org>
      6e141546
    • direct I/O fallback sync simplification · c05c4edd
      Authored by Christoph Hellwig
      In the case of direct I/O falling back to buffered I/O we sync data
      twice currently: once at the end of generic_file_buffered_write using
      filemap_write_and_wait_range and once a little later in
      __generic_file_aio_write using do_sync_mapping_range with all flags set.
      
      The wait before write of the do_sync_mapping_range call does not make
      any sense, so just keep the filemap_write_and_wait_range call and move
      it to the right spot.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      c05c4edd
    • make generic_acl slightly more generic · 1c7c474c
      Authored by Christoph Hellwig
      Now that we cache the ACL pointers in the generic inode all the generic_acl
      cruft can go away and generic_acl.c can directly implement xattr handlers
      dealing with the full Posix ACL semantics for in-memory filesystems.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      1c7c474c
    • sanitize xattr handler prototypes · 431547b3
      Authored by Christoph Hellwig
      Add a flags argument to struct xattr_handler and pass it to all xattr
      handler methods.  This allows using the same methods for multiple
      handlers, e.g. for the ACL methods which perform exactly the same action
      for the access and default ACLs, just using a different underlying
      attribute.  With a little more groundwork it'll also allow sharing the
      methods for the regular user/trusted/secure handlers in extN, ocfs2 and
      jffs2 like it's already done for xfs in this patch.
      
      Also change the inode argument to the handlers to a dentry, to allow
      using the handler mechanism later for filesystems that require it,
      e.g. cifs.
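      The resulting prototypes look roughly like this (a sketch based on the
      description above; see include/linux/xattr.h for the authoritative
      version):

              struct xattr_handler {
                      const char *prefix;
                      int flags;      /* private flags passed back to the handler */
                      size_t (*list)(struct dentry *dentry, char *list,
                                     size_t list_size, const char *name,
                                     size_t name_len, int handler_flags);
                      int (*get)(struct dentry *dentry, const char *name,
                                 void *buffer, size_t size, int handler_flags);
                      int (*set)(struct dentry *dentry, const char *name,
                                 const void *buffer, size_t size, int flags,
                                 int handler_flags);
              };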
      
      [with GFS2 bits updated by Steven Whitehouse <swhiteho@redhat.com>]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jmorris@namei.org>
      Acked-by: Joel Becker <joel.becker@oracle.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      431547b3
    • Untangling ima mess, part 1: alloc_file() · 0552f879
      Authored by Al Viro
      There are two groups of alloc_file() callers:
      	* ones that are followed by ima_counts_get()
      	* ones creating non-regular files
      So let's pull that ima_counts_get() into alloc_file();
      it's a no-op in the case of non-regular files.
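      In sketch form (simplified, not the verbatim patch):

              struct file *alloc_file(...)
              {
                      struct file *file;

                      /* ... allocate and initialise the file ... */

                      /* no-op unless the file is a regular file */
                      ima_counts_get(file);
                      return file;
              }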
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      0552f879
    • switch alloc_file() to passing struct path · 2c48b9c4
      Authored by Al Viro
      ... and have the caller grab both mnt and dentry; kill
      leak in infiniband, while we are at it.
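      The new signature and caller pattern, roughly (some_fops stands in for
      whatever file_operations the caller uses):

              struct file *alloc_file(struct path *path, fmode_t mode,
                                      const struct file_operations *fop);

              /* callers now assemble the path themselves */
              struct path path;
              path.dentry = dentry;
              path.mnt = mntget(mnt);
              file = alloc_file(&path, FMODE_WRITE, &some_fops);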
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      2c48b9c4
    • switch shmem_file_setup() to alloc_file() · 4b42af81
      Authored by Al Viro
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      4b42af81
  4. 16 Dec 2009, 28 commits
    • memcg: code clean, remove unused variable in mem_cgroup_resize_limit() · aa20d489
      Authored by Bob Liu
      Variable `progress' isn't used in mem_cgroup_resize_limit() any more.
      Remove it.
      
      [akpm@linux-foundation.org: cleanup]
      Signed-off-by: Bob Liu <lliubbo@gmail.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aa20d489
    • memcg: remove memcg_tasklist · 9ab322ca
      Authored by Daisuke Nishimura
      memcg_tasklist was introduced by commit 7f4d454d (memcg: avoid deadlock
      caused by race between oom and cpuset_attach) as a replacement for
      cgroup_mutex, to fix a deadlock problem.  The cgroup_mutex in
      mem_cgroup_out_of_memory(), which that commit removed, was originally
      introduced by commit c7ba5c9e (Memory controller: OOM handling).
      
      IIUC, the intention of this cgroup_mutex was to prevent task moves during
      select_bad_process(), so that situations like the one below can be avoided.
      
        Assume cgroup "foo" has exceeded its limit and is about to trigger oom.
        1. Process A, which has been in cgroup "baa" and uses large memory, is just
           moved to cgroup "foo". Process A can be the candidates for being killed.
        2. Process B, which has been in cgroup "foo" and uses large memory, is just
           moved from cgroup "foo". Process B can be excluded from the candidates for
           being killed.
      
      But these race windows exist anyway, even with the lock held, because
      __mem_cgroup_try_charge() decides whether it should trigger oom outside
      of the lock.  So the original cgroup_mutex in mem_cgroup_out_of_memory(),
      and thus the current memcg_tasklist, has no use.  And IMHO those races
      are not so critical for users.
      
      This patch removes it and makes the code simpler.
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9ab322ca
    • memcg: avoid oom-killing innocent task in case of use_hierarchy · d31f56db
      Authored by Daisuke Nishimura
      task_in_mem_cgroup(), which is called by select_bad_process() to check
      whether a task can be a candidate for being oom-killed from a memcg's
      limit, checks "curr->use_hierarchy" ("curr" is the mem_cgroup the task
      belongs to).
      
      But this check returns true (a false positive) when:
      
      	<some path>/aa		use_hierarchy == 0	<- hitting limit
      	  <some path>/aa/00	use_hierarchy == 1	<- the task belongs to
      
      This leads to killing an innocent task in aa/00.  This patch fixes that
      bug.  It also fixes the argument to mem_cgroup_print_oom_info(): we
      should print information about the mem_cgroup that the task being
      killed, not current, belongs to.
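      A hedged sketch of the corrected check in task_in_mem_cgroup(): decide
      based on the hierarchy setting of the memcg that hit its limit ("mem"),
      not of the cgroup the task happens to live in ("curr"):

              if (mem->use_hierarchy)
                      ret = css_is_ancestor(&curr->css, &mem->css);
              else
                      ret = (curr == mem);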
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d31f56db
    • memcg: cleanup mem_cgroup_move_parent() · 57f9fd7d
      Authored by Daisuke Nishimura
      mem_cgroup_move_parent() calls try_charge first and cancel_charge on
      failure.  IMHO, charge/uncharge (especially charge) is a high-cost
      operation, so we should avoid it as much as possible.
      
      This patch tries to delay the try_charge in mem_cgroup_move_parent() by
      reordering the checks it does.
      
      It also renames mem_cgroup_move_account() to
      __mem_cgroup_move_account(), changes the return value of
      __mem_cgroup_move_account() from int to void, and adds a new wrapper
      (mem_cgroup_move_account()) which checks whether @pc is valid for moving
      the account, and then calls __mem_cgroup_move_account().
      
      This patch removes the last caller of trylock_page_cgroup(), so removes
      its definition too.
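      A sketch of the wrapper described above (simplified, not the verbatim
      patch):

              static int mem_cgroup_move_account(struct page_cgroup *pc,
                              struct mem_cgroup *from, struct mem_cgroup *to)
              {
                      int ret = -EINVAL;

                      lock_page_cgroup(pc);
                      if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
                              __mem_cgroup_move_account(pc, from, to);
                              ret = 0;
                      }
                      unlock_page_cgroup(pc);
                      return ret;
              }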
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      57f9fd7d
    • memcg: add mem_cgroup_cancel_charge() · a3032a2c
      Authored by Daisuke Nishimura
      There are several places that call both res_counter_uncharge() and
      css_put() to cancel the charge and the reference count we have taken
      via mem_cgroup_try_charge().
      
      This patch introduces mem_cgroup_cancel_charge() and calls it in those
      places.
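      A hedged paraphrase of the new helper (a one-page charge is assumed):

              static void mem_cgroup_cancel_charge(struct mem_cgroup *mem)
              {
                      /* undo the charge and the css reference from try_charge */
                      res_counter_uncharge(&mem->res, PAGE_SIZE);
                      if (do_swap_account)
                              res_counter_uncharge(&mem->memsw, PAGE_SIZE);
                      css_put(&mem->css);
              }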
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a3032a2c
    • memcg: make memcg's file mapped consistent with global VM · d8046582
      Authored by KAMEZAWA Hiroyuki
      In the global VM, FILE_MAPPED is used, but memcg uses MAPPED_FILE.  This
      makes grep difficult.  Replace memcg's MAPPED_FILE with FILE_MAPPED.
      
      Also, in the global VM, mapped shared memory is accounted into
      FILE_MAPPED, but memcg doesn't do that; fix it.
      Note:
        page_is_file_cache() just checks SwapBacked or not,
        so we need to check PageAnon as well.
      
      Cc: Balbir Singh <balbir@in.ibm.com>
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d8046582
    • memcg: coalesce charging via percpu storage · cdec2e42
      Authored by KAMEZAWA Hiroyuki
      This patch coalesces access to the res_counter at charge time via percpu
      caching.  At charge, memcg charges 64 pages at once and remembers the
      surplus in a percpu cache; because it's a cache, it is drained/flushed
      whenever necessary.
      
      This version uses a public percpu area, which has two benefits:
       1. The sum of stocked charges in the system is limited to the number of
          cpus, not the number of memcgs.  This gives better synchronization.
       2. The drain code for flush/cpu-hotplug is very easy (and quick).
      
      The most important point of this patch is that we never touch the
      res_counter in the fast path.  The res_counter is a system-wide shared
      counter which is modified very frequently, so we should avoid touching
      it wherever we can, to avoid false sharing.
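      The fast path, as a hedged sketch (the 64-page stock follows the
      description above; the in-tree constant and details may differ):

              #define CHARGE_SIZE     (64 * PAGE_SIZE)

              struct memcg_stock_pcp {
                      struct mem_cgroup *cached;  /* which memcg the stock is for */
                      int charge;                 /* stocked bytes */
              };
              static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);

              /* try to satisfy a one-page charge from the percpu stock,
               * without ever touching the shared res_counter */
              static bool consume_stock(struct mem_cgroup *mem)
              {
                      struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);
                      bool ret = true;

                      if (mem == stock->cached && stock->charge)
                              stock->charge -= PAGE_SIZE;
                      else    /* miss: fall back to res_counter_charge() */
                              ret = false;
                      put_cpu_var(memcg_stock);
                      return ret;
              }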
      
      On an x86-64 8-cpu server, I measured the overhead of memcg at page
      fault by running a program which does map/fault/unmap in a loop, one
      task per cpu (pinned with taskset), and summing the number of page
      faults over 60 seconds.
      
      [without memcg config]
        40156968  page-faults              #      0.085 M/sec   ( +-   0.046% )
        27.67 cache-miss/faults
      
      [root cgroup]
        36659599  page-faults              #      0.077 M/sec   ( +-   0.247% )
        31.58 cache miss/faults
      
      [in a child cgroup]
        18444157  page-faults              #      0.039 M/sec   ( +-   0.133% )
        69.96 cache miss/faults
      
      [ + coalescing uncharge patch]
        27133719  page-faults              #      0.057 M/sec   ( +-   0.155% )
        47.16 cache miss/faults
      
      [ + coalescing uncharge patch + this patch ]
        34224709  page-faults              #      0.072 M/sec   ( +-   0.173% )
        34.69 cache miss/faults
      
      Changelog (since Oct/2):
        - updated comments
        - replaced get_cpu_var() with __get_cpu_var() where possible.
        - removed the mutex for system-wide drain; added a counter instead.
        - removed CONFIG_HOTPLUG_CPU
      
      Changelog (old):
        - rebased onto the latest mmotm
        - moved the charge size check before the __GFP_WAIT check, to avoid
          unnecessary charges
        - added an asynchronous flush routine.
        - fixed bugs pointed out by Nishimura-san.
      
      [akpm@linux-foundation.org: tweak comments]
      [nishimura@mxp.nes.nec.co.jp: don't do INIT_WORK() repeatedly against the same work_struct]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cdec2e42
    • memcg: coalesce uncharge during unmap/truncate · 569b846d
      Authored by KAMEZAWA Hiroyuki
      In a massively parallel environment, res_counter can be a performance
      bottleneck.  One strong technique for reducing lock contention is to
      coalesce several calls into one.
      
      Considering the charge/uncharge characteristics:
      	- charge is done one by one via demand paging.
      	- uncharge is done
      		- in chunks at munmap, truncate, exit, execve...
      		- one by one via vmscan/paging.
      
      It seems we have a chance to coalesce uncharges to improve scalability
      at unmap/truncation.
      
      This patch coalesces uncharges.  To avoid scattering memcg's structures
      into functions under mm/, it adds memcg batch-uncharge information to
      the task (see the sketch below).  The reason for per-task batching is to
      make use of the caller's context information: we do batched (delayed)
      uncharge when truncation/unmap occurs, but direct uncharge when uncharge
      is called by memory reclaim (vmscan.c).
      
      The degree of coalescing depends on the caller:
        - at invalidate/truncate: the pagevec size
        - at unmap: ZAP_BLOCK_SIZE
      (memory itself is freed at this granularity), so we won't coalesce too
      much.
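      A sketch of the per-task batch information (field names follow this
      changelog; the exact struct lives in the task_struct):

              struct memcg_batch_info {
                      int do_batch;                /* batching is active */
                      struct mem_cgroup *memcg;    /* target of the uncharges */
                      unsigned long bytes;         /* uncharged usage */
                      unsigned long memsw_bytes;   /* uncharged mem+swap usage */
              } memcg_batch;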
      
      On an x86-64 8-cpu server, I measured the overhead of memcg at page
      fault by running a program which does map/fault/unmap in a loop, one
      task per cpu (pinned with taskset), and summing the number of page
      faults over 60 seconds.
      
      [without memcg config]
        40156968  page-faults              #      0.085 M/sec   ( +-   0.046% )
        27.67 cache-miss/faults
      [root cgroup]
        36659599  page-faults              #      0.077 M/sec   ( +-   0.247% )
        31.58 miss/faults
      [in a child cgroup]
        18444157  page-faults              #      0.039 M/sec   ( +-   0.133% )
        69.96 miss/faults
      [child with this patch]
        27133719  page-faults              #      0.057 M/sec   ( +-   0.155% )
        47.16 miss/faults
      
      We can see some improvement.  (The root cgroup isn't affected by this
      patch.)  Another patch, for "charge", will follow this one and improve
      the numbers above further.
      
      Changelog (since 2009/10/02):
       - renamed fields of memcg_batch (pages to bytes, memsw to memsw_bytes)
       - some cleanups and commentary/description updates.
       - added initialization code to copy_process(). (possible bug fix)
      
      Changelog (old):
       - fixed the !CONFIG_MEM_CGROUP case.
       - rebased onto the latest mmotm + softlimit fix patches.
       - unified the patch for callers.
       - added comments.
       - made ->do_batch a bool.
       - removed css_get() et al.  We don't need it.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      569b846d
    • memcg: fix memory.memsw.usage_in_bytes for root cgroup · cd9b45b7
      Authored by Kirill A. Shutemov
      A memory cgroup has a memory.memsw.usage_in_bytes file.  It shows the
      sum of the usage of pages and swap entries in the cgroup.  Presently
      the root cgroup's memsw.usage_in_bytes shows the wrong value: the swap
      entries are not added in.
      
      So take MEM_CGROUP_STAT_SWAPOUT into account.
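      In sketch form (the helper name stat_recursive() is a stand-in, not the
      exact kernel function):

              /* root usage is summed from statistics, not read from res_counter */
              val  = stat_recursive(mem, MEM_CGROUP_STAT_CACHE);
              val += stat_recursive(mem, MEM_CGROUP_STAT_RSS);
              if (memsw)      /* memory.memsw.usage_in_bytes */
                      val += stat_recursive(mem, MEM_CGROUP_STAT_SWAPOUT);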
      Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cd9b45b7
    • oom-kill: fix NUMA constraint check with nodemask · 4365a567
      Authored by KAMEZAWA Hiroyuki
      Fix node-oriented allocation handling in oom_kill.c.  I myself think of
      this as a bugfix, not an enhancement.
      
      These days, things have changed:
        - alloc_pages() takes a nodemask as its argument, via
          __alloc_pages_nodemask().
        - mempolicy no longer maintains its own private zonelists.
          (And cpuset doesn't use a nodemask for __alloc_pages_nodemask().)
      
      So the current oom-killer's check function is wrong.
      
      This patch:
        - checks the nodemask: if a nodemask is given and it doesn't cover all
          of node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY.
        - scans all zonelists under the nodemask: if the scan hits cpuset's
          wall, the failure came from cpuset.
      And:
        - modifies the caller of out_of_memory not to call oom if
          __GFP_THISNODE is set.  This doesn't change the current behavior: if
          callers use __GFP_THISNODE they should handle "page allocation
          failure" by themselves.
        - handles the __GFP_NOFAIL+__GFP_THISNODE path.  This is something
          like a FIXME, but this gfpmask is not used now.
      A condensed sketch of the resulting check follows.
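      Hedged sketch, condensed from the description above:

              static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
                              gfp_t gfp_mask, nodemask_t *nodemask)
              {
              #ifdef CONFIG_NUMA
                      struct zone *zone;
                      struct zoneref *z;
                      enum zone_type high_zoneidx = gfp_zone(gfp_mask);

                      /* a nodemask not covering all memory nodes => mempolicy oom */
                      if (nodemask &&
                          !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
                              return CONSTRAINT_MEMORY_POLICY;

                      /* otherwise, see whether we ran into cpuset's wall */
                      for_each_zone_zonelist_nodemask(zone, z, zonelist,
                                                      high_zoneidx, nodemask)
                              if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
                                      return CONSTRAINT_CPUSET;
              #endif
                      return CONSTRAINT_NONE;
              }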
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4365a567
    • oom-kill: show virtual size and rss information of the killed process · 3b4798cb
      Authored by KOSAKI Motohiro
      In a typical oom analysis scenario, the first thing we want to know is
      whether the killed process had a memory leak.  This patch adds vsz and
      rss information to the oom log to help that analysis and save
      debugging time.
      
      example:
      ===================================================================
      rsyslogd invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
      Pid: 1308, comm: rsyslogd Not tainted 2.6.32-rc6 #24
      Call Trace:
      [<ffffffff8132e35b>] ?_spin_unlock+0x2b/0x40
      [<ffffffff810f186e>] oom_kill_process+0xbe/0x2b0
      
      (snip)
      
      492283 pages non-shared
      Out of memory: kill process 2341 (memhog) score 527276 or a child
      Killed process 2341 (memhog) vsz:1054552kB, anon-rss:970588kB, file-rss:4kB
      ===========================================================================
                                   ^
                                   |
                                  here
      
      [rientjes@google.com: fix race, add pid & comm to message]
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3b4798cb
    • HWPOISON: Remove stray phrase in a comment · f2c03deb
      Authored by Andi Kleen
      Better to have complete sentences.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      f2c03deb
    • 12686d15
    • 0d57eb8d
    • HWPOISON: Add a madvise() injector for soft page offlining · afcf938e
      Authored by Andi Kleen
      Process-based injection is much easier to handle for test programs,
      which can first bring a page into a specific state and then test.  So
      add a new MADV_SOFT_OFFLINE advice to soft-offline a page, similar to
      the existing hard-offline injector.
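      A minimal userspace usage sketch (the MADV_SOFT_OFFLINE value is taken
      from later mainline headers; the call typically needs CAP_SYS_ADMIN and
      a kernel built with hwpoison support):

              #define _GNU_SOURCE
              #include <stdio.h>
              #include <string.h>
              #include <sys/mman.h>
              #include <unistd.h>

              #ifndef MADV_SOFT_OFFLINE
              #define MADV_SOFT_OFFLINE 101   /* MADV_HWPOISON is 100 */
              #endif

              int main(void)
              {
                      long psz = sysconf(_SC_PAGESIZE);
                      char *p = mmap(NULL, psz, PROT_READ | PROT_WRITE,
                                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                      if (p == MAP_FAILED) { perror("mmap"); return 1; }
                      memset(p, 0xaa, psz);   /* bring the page to a known state */
                      if (madvise(p, psz, MADV_SOFT_OFFLINE))
                              perror("madvise(MADV_SOFT_OFFLINE)");
                      return 0;
              }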
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      afcf938e
    • HWPOISON: Add soft page offline support · facb6011
      Authored by Andi Kleen
      This is a simpler, gentler variant of memory_failure() for soft page
      offlining, controlled from user space.  It doesn't kill anything; it
      just tries to invalidate the page and, if that doesn't work, migrate
      it away.
      
      This is useful for predictive failure analysis, where a page has a high
      rate of corrected errors but hasn't gone bad yet.  Such a page can be
      offlined early and avoided from then on.
      
      The offlining is controlled from sysfs, which for symmetry also gains a
      new generic entry point for hard page offlining.
      
      We use the page isolation facility to prevent a re-allocation race.
      Normally this is only used by memory hotplug.  To avoid races with
      memory allocation I am using lock_system_sleep().  This avoids the
      situation where memory hotplug is about to isolate a page range and
      then hwpoison undoes that work.  This is a big hammer, but it is the
      simplest solution for now.
      
      When the page is not free or on the LRU, we try to free pages from slab
      and other caches.  The slab freeing is currently quite dumb and does
      not try to focus on the specific slab cache which might own the page.
      This could be improved later.
      
      Thanks to Fengguang Wu and Haicheng Li for some fixes.
      
      [Added fix from Andrew Morton to adapt to new migrate_pages prototype]
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      facb6011
    • HWPOISON: Use new shake_page in memory_failure · 0474a60e
      Authored by Andi Kleen
      shake_page handles more types of page caches than
      the much simpler lru_add_drain_all:
      
      - slab (quite inefficiently for now)
      - any other caches with a shrinker callback
      - per cpu page allocator pages
      - per CPU LRU
      
      Use this call to try to turn pages into free or LRU pages, then handle
      the case of the page becoming free after draining everything.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      0474a60e
    • HWPOISON: mention HWPoison in Kconfig entry · 413f9efb
      Authored by Andi Kleen
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      413f9efb
    • HWPOISON: Use get_user_page_fast in hwpoison madvise · d15f107d
      Authored by Andi Kleen
      The previous version didn't take the mmap_sem before calling gup(),
      which is racy.
      
      Use get_user_pages_fast() instead, which doesn't need any locks.  It is
      also faster, of course, though that doesn't really matter here because
      this is just a testing path.
      
      Based on a report from Nick Piggin.
      Cc: npiggin@suse.de
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      d15f107d
    • HWPOISON: add an interface to switch off/on all the page filters · 1bfe5feb
      Authored by Haicheng Li
      In some use cases the user doesn't need extra filtering.  E.g. a user
      program can inject errors through the madvise syscall into its own
      pages, but it might not know exactly what state a page is in or which
      inode it belongs to.
      
      So introduce an on/off interface, "corrupt-filter-enable".
      
      Echo 0 to switch off the page filters, and echo 1 to switch them on.
      [AK: changed default to 0]
      Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      1bfe5feb
    • HWPOISON: add memory cgroup filter · 4fd466eb
      Authored by Andi Kleen
      The hwpoison test suite needs to inject hwpoison into a collection of
      selected task pages, and must not touch pages not owned by those tasks,
      which would kill important system processes such as init.  (But it's
      OK to mis-hwpoison free/unowned pages as well as shared clean pages.
      Mis-hwpoisoning shared dirty pages would kill all tasks, so the test
      suite will target all or none of such tasks in the first place.)
      
      The memory cgroup serves this purpose well. We can put the target
      processes under the control of a memory cgroup, and tell the hwpoison
      injection code to only kill pages associated with some active memory
      cgroup.
      
      The prerequisite for doing hwpoison stress tests with mem_cgroup is that
      the mem_cgroup code tracks task pages _accurately_ (unless the page is
      locked), which we believe is, or should be, true.
      
      The benefits are simplification of hwpoison injector code. Also the
      mem_cgroup code will automatically be tested by hwpoison test cases.
      
      The alternative pin-pfn/unpin-pfn interfaces could also delegate the
      (process and page flags) filtering functions reliably to user space.
      However, a prototype implementation showed that this scheme adds more
      complexity than we wanted.
      
      Example test case:
      
      	mkdir /cgroup/hwpoison
      
      	usemem -m 100 -s 1000 &
      	echo `jobs -p` > /cgroup/hwpoison/tasks
      
      	memcg_ino=$(ls -id /cgroup/hwpoison | cut -f1 -d' ')
      	echo $memcg_ino > /debug/hwpoison/corrupt-filter-memcg
      
      	page-types -p `pidof init`   --hwpoison  # shall do nothing
      	page-types -p `pidof usemem` --hwpoison  # poison its pages
      
      [AK: Fix documentation]
      [Add fix for problem noticed by Li Zefan <lizf@cn.fujitsu.com>;
      dentry in the css could be NULL]
      
      CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      CC: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      CC: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      CC: Balbir Singh <balbir@linux.vnet.ibm.com>
      CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      CC: Li Zefan <lizf@cn.fujitsu.com>
      CC: Paul Menage <menage@google.com>
      CC: Nick Piggin <npiggin@suse.de>
      CC: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      4fd466eb
    • memcg: add accessor to mem_cgroup.css · d324236b
      Authored by Wu Fengguang
      So that an outside user can free the reference count grabbed by
      try_get_mem_cgroup_from_page().
      
      CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      CC: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      CC: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      CC: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      d324236b
    • memcg: rename and export try_get_mem_cgroup_from_page() · e42d9d5d
      Authored by Wu Fengguang
      So that the hwpoison injector can get the mem_cgroup for an arbitrary
      page, and thus know whether it is owned by some mem_cgroup task(s).
      
      [AK: Merged with latest git tree]
      
      CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      CC: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      CC: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      CC: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      e42d9d5d
    • HWPOISON: add page flags filter · 478c5ffc
      Authored by Wu Fengguang
      When specified, only poison pages if ((page_flags & mask) == value).
      
      -       corrupt-filter-flags-mask
      -       corrupt-filter-flags-value
      
      This allows stress testing of many kinds of pages.
      
      Strictly speaking, poisoning buddy pages requires taking the zone lock,
      to avoid setting PG_hwpoison on a "was buddy but now allocated to
      someone" page.  However we can simply do nothing, because we set
      PG_locked at the beginning; this prevents the page allocator from
      allocating the page to someone.  (It will BUG() on the unexpected
      PG_locked, which is fine for hwpoison testing.)
      
      [AK: Add select PROC_PAGE_MONITOR to satisfy dependency]
      
      CC: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      478c5ffc
    • HWPOISON: limit hwpoison injector to known page types · 31d3d348
      Authored by Wu Fengguang
      __memory_failure()'s workflow is
      
      	set PG_hwpoison
      	//...
      	unset PG_hwpoison if didn't pass hwpoison filter
      
      That could kill an unrelated process if it happens to page-fault on the
      page with the (temporary) PG_hwpoison.  The race window should be big
      enough to show up in stress tests.
      
      Fix it by grabbing the page and checking the filter at inject time.  This
      also avoids the very noisy "Injecting memory failure..." messages.
      
      - we don't touch madvise()-based injection, because the filters are
        generally not necessary for it.
      - if we want to apply the filters to h/w-aided injection, we'd better
        rearrange the logic in __memory_failure() instead of doing it in this
        patch.
      
      AK: fix documentation, use drain all, cleanups
      
      CC: Haicheng Li <haicheng.li@intel.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      31d3d348
    • HWPOISON: add fs/device filters · 7c116f2b
      Authored by Wu Fengguang
      Filesystem data/metadata are the trickiest pages to isolate; getting
      them right requires careful code review and stress testing.
      
      The fs/device filter helps target the stress tests at the pages of a
      specific filesystem.  The filter condition is the block device's
      major/minor numbers:
              - corrupt-filter-dev-major
              - corrupt-filter-dev-minor
      When specified (non -1), only page cache pages that belong to that
      device will be poisoned.
      
      The filters are checked reliably on the locked and refcounted page.
      
      Haicheng: clear PG_hwpoison and drop bad page count if filter not OK
      AK: Add documentation
      
      CC: Haicheng Li <haicheng.li@intel.com>
      CC: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      7c116f2b
    • HWPOISON: return 0 to indicate success reliably · 138ce286
      Authored by Wu Fengguang
      Return 0 to indicate success when:
      - the action result is RECOVERED or DELAYED
      - there is no extra page reference
      
      Note that dirty swapcache pages are kept in swapcache, so they can have
      one extra reference count.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      138ce286