1. 10 8月, 2010 1 次提交
    • C
      check ATTR_SIZE contraints in inode_change_ok · 2c27c65e
      Christoph Hellwig 提交于
      Make sure we check the truncate constraints early on in ->setattr by adding
      those checks to inode_change_ok.  Also clean up and document inode_change_ok
      to make this obvious.
      
      As a fallout we don't have to call inode_newsize_ok from simple_setsize and
      simplify it down to a truncate_setsize which doesn't return an error.  This
      simplifies a lot of setattr implementations and means we use truncate_setsize
      almost everywhere.  Get rid of fat_setsize now that it's trivial and mark
      ext2_setsize static to make the calling convention obvious.
      
      Keep the inode_newsize_ok in vmtruncate for now as all callers need an
      audit for its removal anyway.
      
      Note: setattr code in ecryptfs doesn't call inode_change_ok at all and
      needs a deeper audit, but that is left for later.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      2c27c65e
  2. 19 7月, 2010 1 次提交
    • D
      mm: add context argument to shrinker callback · 7f8275d0
      Dave Chinner 提交于
      The current shrinker implementation requires the registered callback
      to have global state to work from. This makes it difficult to shrink
      caches that are not global (e.g. per-filesystem caches). Pass the shrinker
      structure to the callback so that users can embed the shrinker structure
      in the context the shrinker needs to operate on and get back to it in the
      callback via container_of().
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      7f8275d0
  3. 25 5月, 2010 3 次提交
    • C
      mm: make lowmem_page_address() use PFN_PHYS() for improved portability · c6f6b596
      Chris Metcalf 提交于
      This ensures that platforms with lowmem PAs above 32 bits work correctly
      by avoiding truncating the PA during a left shift.
      Signed-off-by: NChris Metcalf <cmetcalf@tilera.com>
      Cc: Barry Song <21cnbao@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c6f6b596
    • M
      mm: compaction: memory compaction core · 748446bb
      Mel Gorman 提交于
      This patch is the core of a mechanism which compacts memory in a zone by
      relocating movable pages towards the end of the zone.
      
      A single compaction run involves a migration scanner and a free scanner.
      Both scanners operate on pageblock-sized areas in the zone.  The migration
      scanner starts at the bottom of the zone and searches for all movable
      pages within each area, isolating them onto a private list called
      migratelist.  The free scanner starts at the top of the zone and searches
      for suitable areas and consumes the free pages within making them
      available for the migration scanner.  The pages isolated for migration are
      then migrated to the newly isolated free pages.
      
      [aarcange@redhat.com: Fix unsafe optimisation]
      [mel@csn.ul.ie: do not schedule work on other CPUs for compaction]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      748446bb
    • M
      mm: migration: avoid race between shift_arg_pages() and rmap_walk() during... · a8bef8ff
      Mel Gorman 提交于
      mm: migration: avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks
      
      Page migration requires rmap to be able to find all ptes mapping a page
      at all times, otherwise the migration entry can be instantiated, but it
      is possible to leave one behind if the second rmap_walk fails to find
      the page.  If this page is later faulted, migration_entry_to_page() will
      call BUG because the page is locked indicating the page was migrated by
      the migration PTE not cleaned up. For example
      
        kernel BUG at include/linux/swapops.h:105!
        invalid opcode: 0000 [#1] PREEMPT SMP
        ...
        Call Trace:
         [<ffffffff810e951a>] handle_mm_fault+0x3f8/0x76a
         [<ffffffff8130c7a2>] do_page_fault+0x44a/0x46e
         [<ffffffff813099b5>] page_fault+0x25/0x30
         [<ffffffff8114de33>] load_elf_binary+0x152a/0x192b
         [<ffffffff8111329b>] search_binary_handler+0x173/0x313
         [<ffffffff81114896>] do_execve+0x219/0x30a
         [<ffffffff8100a5c6>] sys_execve+0x43/0x5e
         [<ffffffff8100320a>] stub_execve+0x6a/0xc0
        RIP  [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129
      
      There is a race between shift_arg_pages and migration that triggers this
      bug.  A temporary stack is setup during exec and later moved.  If
      migration moves a page in the temporary stack and the VMA is then removed
      before migration completes, the migration PTE may not be found leading to
      a BUG when the stack is faulted.
      
      This patch causes pages within the temporary stack during exec to be
      skipped by migration.  It does this by marking the VMA covering the
      temporary stack with an otherwise impossible combination of VMA flags.
      These flags are cleared when the temporary stack is moved to its final
      location.
      
      [kamezawa.hiroyu@jp.fujitsu.com: idea for having migration skip temporary stacks]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a8bef8ff
  4. 07 4月, 2010 1 次提交
    • N
      pagemap: fix pfn calculation for hugepage · 116354d1
      Naoya Horiguchi 提交于
      When we look into pagemap using page-types with option -p, the value of
      pfn for hugepages looks wrong (see below.) This is because pte was
      evaluated only once for one vma although it should be updated for each
      hugepage.  This patch fixes it.
      
        $ page-types -p 3277 -Nl -b huge
        voffset   offset  len     flags
        7f21e8a00 11e400  1       ___U___________H_G________________
        7f21e8a01 11e401  1ff     ________________TG________________
                     ^^^
        7f21e8c00 11e400  1       ___U___________H_G________________
        7f21e8c01 11e401  1ff     ________________TG________________
                     ^^^
      
      One hugepage contains 1 head page and 511 tail pages in x86_64 and each
      two lines represent each hugepage.  Voffset and offset mean virtual
      address and physical address in the page unit, respectively.  The
      different hugepages should not have the same offset value.
      
      With this patch applied:
      
        $ page-types -p 3386 -Nl -b huge
        voffset   offset   len    flags
        7fec7a600 112c00   1      ___UD__________H_G________________
        7fec7a601 112c01   1ff    ________________TG________________
                     ^^^
        7fec7a800 113200   1      ___UD__________H_G________________
        7fec7a801 113201   1ff    ________________TG________________
                     ^^^
                     OK
      
      More info:
      
      - This patch modifies walk_page_range()'s hugepage walker.  But the
        change only affects pagemap_read(), which is the only caller of hugepage
        callback.
      
      - Without this patch, hugetlb_entry() callback is called per vma, that
        doesn't match the natural expectation from its name.
      
      - With this patch, hugetlb_entry() is called per hugepte entry and the
        callback can become much simpler.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NMatt Mackall <mpm@selenic.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      116354d1
  5. 26 3月, 2010 1 次提交
    • P
      x86, perf, bts, mm: Delete the never used BTS-ptrace code · faa4602e
      Peter Zijlstra 提交于
      Support for the PMU's BTS features has been upstreamed in
      v2.6.32, but we still have the old and disabled ptrace-BTS,
      as Linus noticed it not so long ago.
      
      It's buggy: TIF_DEBUGCTLMSR is trampling all over that MSR without
      regard for other uses (perf) and doesn't provide the flexibility
      needed for perf either.
      
      Its users are ptrace-block-step and ptrace-bts, since ptrace-bts
      was never used and ptrace-block-step can be implemented using a
      much simpler approach.
      
      So axe all 3000 lines of it. That includes the *locked_memory*()
      APIs in mm/mlock.c as well.
      Reported-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Markus Metzger <markus.t.metzger@intel.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      LKML-Reference: <20100325135413.938004390@chello.nl>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      faa4602e
  6. 13 3月, 2010 2 次提交
  7. 07 3月, 2010 4 次提交
    • R
      mm: remove VM_LOCK_RMAP code · fc148a5f
      Rik van Riel 提交于
      When a VMA is in an inconsistent state during setup or teardown, the worst
      that can happen is that the rmap code will not be able to find the page.
      
      The mapping is in the process of being torn down (PTEs just got
      invalidated by munmap), or set up (no PTEs have been instantiated yet).
      
      It is also impossible for the rmap code to follow a pointer to an already
      freed VMA, because the rmap code holds the anon_vma->lock, which the VMA
      teardown code needs to take before the VMA is removed from the anon_vma
      chain.
      
      Hence, we should not need the VM_LOCK_RMAP locking at all.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fc148a5f
    • R
      mm: change anon_vma linking to fix multi-process server scalability issue · 5beb4930
      Rik van Riel 提交于
      The old anon_vma code can lead to scalability issues with heavily forking
      workloads.  Specifically, each anon_vma will be shared between the parent
      process and all its child processes.
      
      In a workload with 1000 child processes and a VMA with 1000 anonymous
      pages per process that get COWed, this leads to a system with a million
      anonymous pages in the same anon_vma, each of which is mapped in just one
      of the 1000 processes.  However, the current rmap code needs to walk them
      all, leading to O(N) scanning complexity for each page.
      
      This can result in systems where one CPU is walking the page tables of
      1000 processes in page_referenced_one, while all other CPUs are stuck on
      the anon_vma lock.  This leads to catastrophic failure for a benchmark
      like AIM7, where the total number of processes can reach in the tens of
      thousands.  Real workloads are still a factor 10 less process intensive
      than AIM7, but they are catching up.
      
      This patch changes the way anon_vmas and VMAs are linked, which allows us
      to associate multiple anon_vmas with a VMA.  At fork time, each child
      process gets its own anon_vmas, in which its COWed pages will be
      instantiated.  The parents' anon_vma is also linked to the VMA, because
      non-COWed pages could be present in any of the children.
      
      This reduces rmap scanning complexity to O(1) for the pages of the 1000
      child processes, with O(N) complexity for at most 1/N pages in the system.
       This reduces the average scanning cost in heavily forking workloads from
      O(N) to 2.
      
      The only real complexity in this patch stems from the fact that linking a
      VMA to anon_vmas now involves memory allocations.  This means vma_adjust
      can fail, if it needs to attach a VMA to anon_vma structures.  This in
      turn means error handling needs to be added to the calling functions.
      
      A second source of complexity is that, because there can be multiple
      anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
      "the" anon_vma lock.  To prevent the rmap code from walking up an
      incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag.  This bit
      flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
      to make sure it is impossible to compile a kernel that needs both symbolic
      values for the same bitflag.
      
      Some test results:
      
      Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
      box with 16GB RAM and not quite enough IO), the system ends up running
      >99% in system time, with every CPU on the same anon_vma lock in the
      pageout code.
      
      With these changes, AIM7 hits the cross-over point around 29.7k users.
      This happens with ~99% IO wait time, there never seems to be any spike in
      system time.  The anon_vma lock contention appears to be resolved.
      
      [akpm@linux-foundation.org: cleanups]
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5beb4930
    • K
      mm: avoid false sharing of mm_counter · 34e55232
      KAMEZAWA Hiroyuki 提交于
      Considering the nature of per mm stats, it's the shared object among
      threads and can be a cache-miss point in the page fault path.
      
      This patch adds per-thread cache for mm_counter.  RSS value will be
      counted into a struct in task_struct and synchronized with mm's one at
      events.
      
      Now, in this patch, the event is the number of calls to handle_mm_fault.
      Per-thread value is added to mm at each 64 calls.
      
       rough estimation with small benchmark on parallel thread (2threads) shows
       [before]
           4.5 cache-miss/faults
       [after]
           4.0 cache-miss/faults
       Anyway, the most contended object is mmap_sem if the number of threads grows.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      34e55232
    • K
      mm: clean up mm_counter · d559db08
      KAMEZAWA Hiroyuki 提交于
      Presently, per-mm statistics counter is defined by macro in sched.h
      
      This patch modifies it to
        - defined in mm.h as inlinf functions
        - use array instead of macro's name creation.
      
      This patch is for reducing patch size in future patch to modify
      implementation of per-mm counter.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d559db08
  8. 13 2月, 2010 2 次提交
    • Y
      sparsemem: Put mem map for one node together. · 9bdac914
      Yinghai Lu 提交于
      Add vmemmap_alloc_block_buf for mem map only.
      
      It will fallback to the old way if it cannot get a block that big.
      
      Before this patch, when a node have 128g ram installed, memmap are
      split into two parts or more.
      [    0.000000]  [ffffea0000000000-ffffea003fffffff] PMD -> [ffff880100600000-ffff88013e9fffff] on node 1
      [    0.000000]  [ffffea0040000000-ffffea006fffffff] PMD -> [ffff88013ec00000-ffff88016ebfffff] on node 1
      [    0.000000]  [ffffea0070000000-ffffea007fffffff] PMD -> [ffff882000600000-ffff8820105fffff] on node 0
      [    0.000000]  [ffffea0080000000-ffffea00bfffffff] PMD -> [ffff882010800000-ffff8820507fffff] on node 0
      [    0.000000]  [ffffea00c0000000-ffffea00dfffffff] PMD -> [ffff882050a00000-ffff8820709fffff] on node 0
      [    0.000000]  [ffffea00e0000000-ffffea00ffffffff] PMD -> [ffff884000600000-ffff8840205fffff] on node 2
      [    0.000000]  [ffffea0100000000-ffffea013fffffff] PMD -> [ffff884020800000-ffff8840607fffff] on node 2
      [    0.000000]  [ffffea0140000000-ffffea014fffffff] PMD -> [ffff884060a00000-ffff8840709fffff] on node 2
      [    0.000000]  [ffffea0150000000-ffffea017fffffff] PMD -> [ffff886000600000-ffff8860305fffff] on node 3
      [    0.000000]  [ffffea0180000000-ffffea01bfffffff] PMD -> [ffff886030800000-ffff8860707fffff] on node 3
      [    0.000000]  [ffffea01c0000000-ffffea01ffffffff] PMD -> [ffff888000600000-ffff8880405fffff] on node 4
      [    0.000000]  [ffffea0200000000-ffffea022fffffff] PMD -> [ffff888040800000-ffff8880707fffff] on node 4
      [    0.000000]  [ffffea0230000000-ffffea023fffffff] PMD -> [ffff88a000600000-ffff88a0105fffff] on node 5
      [    0.000000]  [ffffea0240000000-ffffea027fffffff] PMD -> [ffff88a010800000-ffff88a0507fffff] on node 5
      [    0.000000]  [ffffea0280000000-ffffea029fffffff] PMD -> [ffff88a050a00000-ffff88a0709fffff] on node 5
      [    0.000000]  [ffffea02a0000000-ffffea02bfffffff] PMD -> [ffff88c000600000-ffff88c0205fffff] on node 6
      [    0.000000]  [ffffea02c0000000-ffffea02ffffffff] PMD -> [ffff88c020800000-ffff88c0607fffff] on node 6
      [    0.000000]  [ffffea0300000000-ffffea030fffffff] PMD -> [ffff88c060a00000-ffff88c0709fffff] on node 6
      [    0.000000]  [ffffea0310000000-ffffea033fffffff] PMD -> [ffff88e000600000-ffff88e0305fffff] on node 7
      [    0.000000]  [ffffea0340000000-ffffea037fffffff] PMD -> [ffff88e030800000-ffff88e0707fffff] on node 7
      
      after patch will get
      [    0.000000]  [ffffea0000000000-ffffea006fffffff] PMD -> [ffff880100200000-ffff88016e5fffff] on node 0
      [    0.000000]  [ffffea0070000000-ffffea00dfffffff] PMD -> [ffff882000200000-ffff8820701fffff] on node 1
      [    0.000000]  [ffffea00e0000000-ffffea014fffffff] PMD -> [ffff884000200000-ffff8840701fffff] on node 2
      [    0.000000]  [ffffea0150000000-ffffea01bfffffff] PMD -> [ffff886000200000-ffff8860701fffff] on node 3
      [    0.000000]  [ffffea01c0000000-ffffea022fffffff] PMD -> [ffff888000200000-ffff8880701fffff] on node 4
      [    0.000000]  [ffffea0230000000-ffffea029fffffff] PMD -> [ffff88a000200000-ffff88a0701fffff] on node 5
      [    0.000000]  [ffffea02a0000000-ffffea030fffffff] PMD -> [ffff88c000200000-ffff88c0701fffff] on node 6
      [    0.000000]  [ffffea0310000000-ffffea037fffffff] PMD -> [ffff88e000200000-ffff88e0701fffff] on node 7
      
      -v2: change buf to vmemmap_buf instead according to Ingo
           also add CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER according to Ingo
      -v3: according to Andrew, use sizeof(name) instead of hard coded 15
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      LKML-Reference: <1265793639-15071-19-git-send-email-yinghai@kernel.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Acked-by: NChristoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NH. Peter Anvin <hpa@zytor.com>
      9bdac914
    • Y
      x86: Make 64 bit use early_res instead of bootmem before slab · 08677214
      Yinghai Lu 提交于
      Finally we can use early_res to replace bootmem for x86_64 now.
      
      Still can use CONFIG_NO_BOOTMEM to enable it or not.
      
      -v2: fix 32bit compiling about MAX_DMA32_PFN
      -v3: folded bug fix from LKML message below
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      LKML-Reference: <4B747239.4070907@kernel.org>
      Signed-off-by: NH. Peter Anvin <hpa@zytor.com>
      08677214
  9. 02 2月, 2010 1 次提交
  10. 17 1月, 2010 1 次提交
    • D
      nommu: fix shared mmap after truncate shrinkage problems · 7e660872
      David Howells 提交于
      Fix a problem in NOMMU mmap with ramfs whereby a shared mmap can happen
      over the end of a truncation.  The problem is that
      ramfs_nommu_check_mappings() checks that the reduced file size against the
      VMA tree, but not the vm_region tree.
      
      The following sequence of events can cause the problem:
      
      	fd = open("/tmp/x", O_RDWR|O_TRUNC|O_CREAT, 0600);
      	ftruncate(fd, 32 * 1024);
      	a = mmap(NULL, 32 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
      	b = mmap(NULL, 16 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
      	munmap(a, 32 * 1024);
      	ftruncate(fd, 16 * 1024);
      	c = mmap(NULL, 32 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
      
      Mapping 'a' creates a vm_region covering 32KB of the file.  Mapping 'b'
      sees that the vm_region from 'a' is covering the region it wants and so
      shares it, pinning it in memory.
      
      Mapping 'a' then goes away and the file is truncated to the end of VMA
      'b'.  However, the region allocated by 'a' is still in effect, and has
      _not_ been reduced.
      
      Mapping 'c' is then created, and because there's a vm_region covering the
      desired region, get_unmapped_area() is _not_ called to repeat the check,
      and the mapping is granted, even though the pages from the latter half of
      the mapping have been discarded.
      
      However:
      
      	d = mmap(NULL, 16 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
      
      Mapping 'd' should work, and should end up sharing the region allocated by
      'a'.
      
      To deal with this, we shrink the vm_region struct during the truncation,
      lest do_mmap_pgoff() take it as licence to share the full region
      automatically without calling the get_unmapped_area() file op again.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NAl Viro <viro@zeniv.linux.org.uk>
      Cc: Greg Ungerer <gerg@snapgear.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7e660872
  11. 05 1月, 2010 1 次提交
    • C
      this_cpu: Page allocator conversion · 99dcc3e5
      Christoph Lameter 提交于
      Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.
      
      This drastically reduces the size of struct zone for systems with large
      amounts of processors and allows placement of critical variables of struct
      zone in one cacheline even on very large systems.
      
      Another effect is that the pagesets of one processor are placed near one
      another. If multiple pagesets from different zones fit into one cacheline
      then additional cacheline fetches can be avoided on the hot paths when
      allocating memory from multiple zones.
      
      Bootstrap becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs
      are reduced and we can drop the zone_pcp macro.
      
      Hotplug handling is also simplified since cpu alloc can bring up and
      shut down cpu areas for a specific cpu as a whole. So there is no need to
      allocate or free individual pagesets.
      
      V7-V8:
      - Explain chicken egg dilemmna with percpu allocator.
      
      V4-V5:
      - Fix up cases where per_cpu_ptr is called before irq disable
      - Integrate the bootstrap logic that was separate before.
      
      tj: Build failure in pageset_cpuup_callback() due to missing ret
          variable fixed.
      Reviewed-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NChristoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      99dcc3e5
  12. 17 12月, 2009 1 次提交
    • Y
      x86: Fix checking of SRAT when node 0 ram is not from 0 · 32996250
      Yinghai Lu 提交于
      Found one system that boot from socket1 instead of socket0, SRAT get rejected...
      
      [    0.000000] SRAT: Node 1 PXM 0 0-a0000
      [    0.000000] SRAT: Node 1 PXM 0 100000-80000000
      [    0.000000] SRAT: Node 1 PXM 0 100000000-2080000000
      [    0.000000] SRAT: Node 0 PXM 1 2080000000-4080000000
      [    0.000000] SRAT: Node 2 PXM 2 4080000000-6080000000
      [    0.000000] SRAT: Node 3 PXM 3 6080000000-8080000000
      [    0.000000] SRAT: Node 4 PXM 4 8080000000-a080000000
      [    0.000000] SRAT: Node 5 PXM 5 a080000000-c080000000
      [    0.000000] SRAT: Node 6 PXM 6 c080000000-e080000000
      [    0.000000] SRAT: Node 7 PXM 7 e080000000-10080000000
      ...
      [    0.000000] NUMA: Allocated memnodemap from 500000 - 701040
      [    0.000000] NUMA: Using 20 for the hash shift.
      [    0.000000] Adding active range (0, 0x2080000, 0x4080000) 0 entries of 3200 used
      [    0.000000] Adding active range (1, 0x0, 0x96) 1 entries of 3200 used
      [    0.000000] Adding active range (1, 0x100, 0x7f750) 2 entries of 3200 used
      [    0.000000] Adding active range (1, 0x100000, 0x2080000) 3 entries of 3200 used
      [    0.000000] Adding active range (2, 0x4080000, 0x6080000) 4 entries of 3200 used
      [    0.000000] Adding active range (3, 0x6080000, 0x8080000) 5 entries of 3200 used
      [    0.000000] Adding active range (4, 0x8080000, 0xa080000) 6 entries of 3200 used
      [    0.000000] Adding active range (5, 0xa080000, 0xc080000) 7 entries of 3200 used
      [    0.000000] Adding active range (6, 0xc080000, 0xe080000) 8 entries of 3200 used
      [    0.000000] Adding active range (7, 0xe080000, 0x10080000) 9 entries of 3200 used
      [    0.000000] SRAT: PXMs only cover 917504MB of your 1048566MB e820 RAM. Not used.
      [    0.000000] SRAT: SRAT not used.
      
      the early_node_map is not sorted because node0 with non zero start come first.
      
      so try to sort it right away after all regions are registered.
      
      also fixs refression by 8716273c (x86: Export srat physical topology)
      
      -v2: make it more solid to handle cross node case like node0 [0,4g), [8,12g) and node1 [4g, 8g), [12g, 16g)
      -v3: update comments.
      Reported-and-tested-by: NJens Axboe <jens.axboe@oracle.com>
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      LKML-Reference: <4B2579D2.3010201@kernel.org>
      Signed-off-by: NH. Peter Anvin <hpa@zytor.com>
      32996250
  13. 16 12月, 2009 7 次提交
    • A
      HWPOISON: Add soft page offline support · facb6011
      Andi Kleen 提交于
      This is a simpler, gentler variant of memory_failure() for soft page
      offlining controlled from user space.  It doesn't kill anything, just
      tries to invalidate and if that doesn't work migrate the
      page away.
      
      This is useful for predictive failure analysis, where a page has
      a high rate of corrected errors, but hasn't gone bad yet. Instead
      it can be offlined early and avoided.
      
      The offlining is controlled from sysfs, including a new generic
      entry point for hard page offlining for symmetry too.
      
      We use the page isolate facility to prevent re-allocation
      race. Normally this is only used by memory hotplug. To avoid
      races with memory allocation I am using lock_system_sleep().
      This avoids the situation where memory hotplug is about
      to isolate a page range and then hwpoison undoes that work.
      This is a big hammer currently, but the simplest solution
      currently.
      
      When the page is not free or LRU we try to free pages
      from slab and other caches. The slab freeing is currently
      quite dumb and does not try to focus on the specific slab
      cache which might own the page. This could be potentially
      improved later.
      
      Thanks to Fengguang Wu and Haicheng Li for some fixes.
      
      [Added fix from Andrew Morton to adapt to new migrate_pages prototype]
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      facb6011
    • W
      HWPOISON: Add unpoisoning support · 847ce401
      Wu Fengguang 提交于
      The unpoisoning interface is useful for stress testing tools to
      reclaim poisoned pages (to prevent OOM)
      
      There is no hardware level unpoisioning, so this
      cannot be used for real memory errors, only for software injected errors.
      
      Note that it may leak pages silently - those who have been removed from
      LRU cache, but not isolated from page cache/swap cache at hwpoison time.
      Especially the stress test of dirty swap cache pages shall reboot system
      before exhausting memory.
      
      AK: Fix comments, add documentation, add printks, rename symbol
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      847ce401
    • A
      HWPOISON: Turn ref argument into flags argument · 82ba011b
      Andi Kleen 提交于
      Now that "ref" is just a boolean turn it into
      a flags argument. First step is only a single flag
      that makes the code's intention more clear, but more
      may follow.
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      82ba011b
    • A
      HWPOISON: Be more aggressive at freeing non LRU caches · 588f9ce6
      Andi Kleen 提交于
      shake_page handles more types of page caches than lru_drain_all()
      
      - per cpu page allocator pages
      - per CPU LRU
      
      Stops early when the page became free.
      
      Used in followon patches.
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      588f9ce6
    • N
      mm hugetlb: add hugepage support to pagemap · 5dc37642
      Naoya Horiguchi 提交于
      This patch enables extraction of the pfn of a hugepage from
      /proc/pid/pagemap in an architecture independent manner.
      
      Details
      -------
      My test program (leak_pagemap) works as follows:
       - creat() and mmap() a file on hugetlbfs (file size is 200MB == 100 hugepages,)
       - read()/write() something on it,
       - call page-types with option -p,
       - munmap() and unlink() the file on hugetlbfs
      
      Without my patches
      ------------------
      $ ./leak_pagemap
                   flags page-count       MB  symbolic-flags                     long-symbolic-flags
      0x0000000000000000          1        0  __________________________________
      0x0000000000000804          1        0  __R________M______________________ referenced,mmap
      0x000000000000086c         81        0  __RU_lA____M______________________ referenced,uptodate,lru,active,mmap
      0x0000000000005808          5        0  ___U_______Ma_b___________________ uptodate,mmap,anonymous,swapbacked
      0x0000000000005868         12        0  ___U_lA____Ma_b___________________ uptodate,lru,active,mmap,anonymous,swapbacked
      0x000000000000586c          1        0  __RU_lA____Ma_b___________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
                   total        101        0
      
      The output of page-types don't show any hugepage.
      
      With my patches
      ---------------
      $ ./leak_pagemap
                   flags page-count       MB  symbolic-flags                     long-symbolic-flags
      0x0000000000000000          1        0  __________________________________
      0x0000000000030000      51100      199  ________________TG________________ compound_tail,huge
      0x0000000000028018        100        0  ___UD__________H_G________________ uptodate,dirty,compound_head,huge
      0x0000000000000804          1        0  __R________M______________________ referenced,mmap
      0x000000000000080c          1        0  __RU_______M______________________ referenced,uptodate,mmap
      0x000000000000086c         80        0  __RU_lA____M______________________ referenced,uptodate,lru,active,mmap
      0x0000000000005808          4        0  ___U_______Ma_b___________________ uptodate,mmap,anonymous,swapbacked
      0x0000000000005868         12        0  ___U_lA____Ma_b___________________ uptodate,lru,active,mmap,anonymous,swapbacked
      0x000000000000586c          1        0  __RU_lA____Ma_b___________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
                   total      51300      200
      
      The output of page-types shows 51200 pages contributing to hugepages,
      containing 100 head pages and 51100 tail pages as expected.
      
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5dc37642
    • H
      include/linux/mm.h: remove unneeded ifdef · f096e59e
      Huang Shijie 提交于
      The check code for CONFIG_SWAP is redundant, because there is a
      non-CONFIG_SWAP version for PageSwapCache() which just returns 0.
      Signed-off-by: NHuang Shijie <shijie8@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f096e59e
    • H
      mm: define PAGE_MAPPING_FLAGS · 3ca7b3c5
      Hugh Dickins 提交于
      At present we define PageAnon(page) by the low PAGE_MAPPING_ANON bit set
      in page->mapping, with the higher bits a pointer to the anon_vma; and have
      defined PageKsm(page) as that with NULL anon_vma.
      
      But KSM swapping will need to store a pointer there: so in preparation for
      that, now define PAGE_MAPPING_FLAGS as the low two bits, including
      PAGE_MAPPING_KSM (always set along with PAGE_MAPPING_ANON, until some
      other use for the bit emerges).
      
      Declare page_rmapping(page) to return the pointer part of page->mapping,
      and page_anon_vma(page) to return the anon_vma pointer when that's what it
      is.  Use these in a few appropriate places: notably, unuse_vma() has been
      testing page->mapping, but is better to be testing page_anon_vma() (cases
      may be added in which flag bits are set without any pointer).
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3ca7b3c5
  14. 25 9月, 2009 1 次提交
  15. 24 9月, 2009 2 次提交
  16. 23 9月, 2009 1 次提交
  17. 22 9月, 2009 7 次提交
    • H
      tmpfs: depend on shmem · 3f96b79a
      Hugh Dickins 提交于
      CONFIG_SHMEM off gives you (ramfs masquerading as) tmpfs, even when
      CONFIG_TMPFS is off: that's a little anomalous, and I'd intended to make
      more sense of it by removing CONFIG_TMPFS altogether, always enabling its
      code when CONFIG_SHMEM; but so many defconfigs have CONFIG_SHMEM on
      CONFIG_TMPFS off that we'd better leave that as is.
      
      But there is no point in asking for CONFIG_TMPFS if CONFIG_SHMEM is off:
      make TMPFS depend on SHMEM, which also prevents TMPFS_POSIX_ACL
      shmem_acl.o being pointlessly built into the kernel when SHMEM is off.
      
      And a selfish change, to prevent the world from being rebuilt when I
      switch between CONFIG_SHMEM on and off: the only CONFIG_SHMEM in the
      header files is mm.h shmem_lock() - give that a shmem.c stub instead.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: NMatt Mackall <mpm@selenic.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3f96b79a
    • H
      mm: FOLL flags for GUP flags · 58fa879e
      Hugh Dickins 提交于
      __get_user_pages() has been taking its own GUP flags, then processing
      them into FOLL flags for follow_page().  Though oddly named, the FOLL
      flags are more widely used, so pass them to __get_user_pages() now.
      Sorry, VM flags, VM_FAULT flags and FAULT_FLAGs are still distinct.
      
      (The patch to __get_user_pages() looks peculiar, with both gup_flags
      and foll_flags: the gup_flags remain constant; but as before there's
      an exceptional case, out of scope of the patch, in which foll_flags
      per page have FOLL_WRITE masked off.)
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      58fa879e
    • H
      mm: FOLL_DUMP replace FOLL_ANON · 8e4b9a60
      Hugh Dickins 提交于
      The "FOLL_ANON optimization" and its use_zero_page() test have caused
      confusion and bugs: why does it test VM_SHARED? for the very good but
      unsatisfying reason that VMware crashed without.  As we look to maybe
      reinstating anonymous use of the ZERO_PAGE, we need to sort this out.
      
      Easily done: it's silly for __get_user_pages() and follow_page() to
      be guessing whether it's safe to assume that they're being used for
      a coredump (which can take a shortcut snapshot where other uses must
      handle a fault) - just tell them with GUP_FLAGS_DUMP and FOLL_DUMP.
      
      get_dump_page() doesn't even want a ZERO_PAGE: an error suits fine.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8e4b9a60
    • H
      mm: add get_dump_page · f3e8fccd
      Hugh Dickins 提交于
      In preparation for the next patch, add a simple get_dump_page(addr)
      interface for the CONFIG_ELF_CORE dumpers to use, instead of calling
      get_user_pages() directly.  They're not interested in errors: they
      just want to use holes as much as possible, to save space and make
      sure that the data is aligned where the headers said it would be.
      
      Oh, and don't use that horrid DUMP_SEEK(off) macro!
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f3e8fccd
    • J
      mm: replace various uses of num_physpages by totalram_pages · 4481374c
      Jan Beulich 提交于
      Sizing of memory allocations shouldn't depend on the number of physical
      pages found in a system, as that generally includes (perhaps a huge amount
      of) non-RAM pages.  The amount of what actually is usable as storage
      should instead be used as a basis here.
      
      Some of the calculations (i.e.  those not intending to use high memory)
      should likely even use (totalram_pages - totalhigh_pages).
      Signed-off-by: NJan Beulich <jbeulich@novell.com>
      Acked-by: NRusty Russell <rusty@rustcorp.com.au>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Cc: Dave Airlie <airlied@linux.ie>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Patrick McHardy <kaber@trash.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4481374c
    • H
      ksm: the mm interface to ksm · f8af4da3
      Hugh Dickins 提交于
      This patch presents the mm interface to a dummy version of ksm.c, for
      better scrutiny of that interface: the real ksm.c follows later.
      
      When CONFIG_KSM is not set, madvise(2) reject MADV_MERGEABLE and
      MADV_UNMERGEABLE with EINVAL, since that seems more helpful than
      pretending that they can be serviced.  But when CONFIG_KSM=y, accept them
      even if KSM is not currently running, and even on areas which KSM will not
      touch (e.g.  hugetlb or shared file or special driver mappings).
      
      Like other madvices, report ENOMEM despite success if any area in the
      range is unmapped, and use EAGAIN to report out of memory.
      
      Define vma flag VM_MERGEABLE to identify an area on which KSM may try
      merging pages: leave it to ksm_madvise() to decide whether to set it.
      Define mm flag MMF_VM_MERGEABLE to identify an mm which might contain
      VM_MERGEABLE areas, to minimize callouts when forking or exiting.
      
      Based upon earlier patches by Chris Wright and Izik Eidus.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: NChris Wright <chrisw@redhat.com>
      Signed-off-by: NIzik Eidus <ieidus@redhat.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f8af4da3
    • S
      memory hotplug: update zone pcp at memory online · 112067f0
      Shaohua Li 提交于
      In my test, 128M memory is hot added, but zone's pcp batch is 0, which is
      an obvious error.  When pages are onlined, zone pcp should be updated
      accordingly.
      
      [akpm@linux-foundation.org: fix warnings]
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Yakui Zhao <yakui.zhao@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      112067f0
  18. 16 9月, 2009 3 次提交
    • A
      HWPOISON: The high level memory error handler in the VM v7 · 6a46079c
      Andi Kleen 提交于
      Add the high level memory handler that poisons pages
      that got corrupted by hardware (typically by a two bit flip in a DIMM
      or a cache) on the Linux level. The goal is to prevent everyone
      from accessing these pages in the future.
      
      This done at the VM level by marking a page hwpoisoned
      and doing the appropriate action based on the type of page
      it is.
      
      The code that does this is portable and lives in mm/memory-failure.c
      
      To quote the overview comment:
      
      High level machine check handler. Handles pages reported by the
      hardware as being corrupted usually due to a 2bit ECC memory or cache
      failure.
      
      This focuses on pages detected as corrupted in the background.
      When the current CPU tries to consume corruption the currently
      running process can just be killed directly instead. This implies
      that if the error cannot be handled for some reason it's safe to
      just ignore it because no corruption has been consumed yet. Instead
      when that happens another machine check will happen.
      
      Handles page cache pages in various states. The tricky part
      here is that we can access any page asynchronous to other VM
      users, because memory failures could happen anytime and anywhere,
      possibly violating some of their assumptions. This is why this code
      has to be extremely careful. Generally it tries to use normal locking
      rules, as in get the standard locks, even if that means the
      error handling takes potentially a long time.
      
      Some of the operations here are somewhat inefficient and have non
      linear algorithmic complexity, because the data structures have not
      been optimized for this case. This is in particular the case
      for the mapping from a vma to a process. Since this case is expected
      to be rare we hope we can get away with this.
      
      There are in principle two strategies to kill processes on poison:
      - just unmap the data and wait for an actual reference before
      killing
      - kill as soon as corruption is detected.
      Both have advantages and disadvantages and should be used
      in different situations. Right now both are implemented and can
      be switched with a new sysctl vm.memory_failure_early_kill
      The default is early kill.
      
      The patch does some rmap data structure walking on its own to collect
      processes to kill. This is unusual because normally all rmap data structure
      knowledge is in rmap.c only. I put it here for now to keep
      everything together and rmap knowledge has been seeping out anyways
      
      Includes contributions from Johannes Weiner, Chris Mason, Fengguang Wu,
      Nick Piggin (who did a lot of great work) and others.
      
      Cc: npiggin@suse.de
      Cc: riel@redhat.com
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NHidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      6a46079c
    • A
      HWPOISON: Define a new error_remove_page address space op for async truncation · 25718736
      Andi Kleen 提交于
      Truncating metadata pages is not safe right now before
      we haven't audited all file systems.
      
      To enable truncation only for data address space define
      a new address_space callback error_remove_page.
      
      This is used for memory_failure.c memory error handling.
      
      This can be then set to truncate_inode_page()
      
      This patch just defines the new operation and adds documentation.
      
      Callers and users come in followon patches.
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      25718736
    • W
      HWPOISON: Add invalidate_inode_page · 83f78668
      Wu Fengguang 提交于
      Add a simple way to invalidate a single page
      This is just a refactoring of the truncate.c code.
      Originally from Fengguang, modified by Andi Kleen.
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      83f78668