1. 14 January 2011, 1 commit
    • do_wp_page: remove the 'reuse' flag · b009c024
      Michel Lespinasse committed
      mlocking a shared, writable vma currently causes the corresponding pages
      to be marked as dirty and queued for writeback.  This seems rather
      unnecessary given that the pages are not being actually modified during
      mlock.  It is understood that for non-shared mappings (file or anon) we
      want to use a write fault in order to break COW, but there is just no such
      need for shared mappings.
      
      The first two patches in this series do not introduce any behavior change.
       The intent there is to make it obvious that dirtying file pages is only
      done in the (writable, shared) case.  I think this clarifies the code, but
      I wouldn't mind dropping these two patches if there is no consensus about
      them.
      
      The last patch is where we actually avoid dirtying shared mappings during
      mlock.  Note that as a side effect of this, we won't call page_mkwrite()
      for the mappings that define it, and won't be pre-allocating data blocks
      at the FS level if the mapped file was sparsely allocated.  My
      understanding is that mlock does not need to provide such guarantee, as
      evidenced by the fact that it never did for the filesystems that don't
      define page_mkwrite() - including some common ones like ext3.  However, I
      would like to gather feedback on this from filesystem people as a
      precaution.  If this turns out to be a showstopper, maybe block
      preallocation can be added back on using a different interface.
      
      Large shared mlocks are getting significantly (>2x) faster in my tests, as
      the disk can be fully used for reading the file instead of having to share
      between this and writeback.
      
      This patch:
      
      Reorganize the code to remove the 'reuse' flag.  No behavior changes.
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Theodore Tso <tytso@google.com>
      Cc: Michael Rubin <mrubin@google.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b009c024
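      A minimal, standalone C sketch of the kind of reorganization this commit describes, assuming a
      goto-style reuse path; the names and control flow below are illustrative stand-ins, not the
      kernel's actual do_wp_page() code:

        #include <stdbool.h>
        #include <stdio.h>

        /* Illustrative model only: shows a 'reuse' flag being replaced by direct
         * jumps to the reuse path; it is not the real do_wp_page(). */
        static void wp_page_reuse(void) { puts("reuse page: write-enable the pte"); }
        static void wp_page_copy(void)  { puts("break COW: copy to a new page"); }

        /* Before: several branches set a flag, which is tested at the end. */
        static void wp_fault_with_flag(bool anon_exclusive, bool shared_writable)
        {
            bool reuse = false;

            if (anon_exclusive)
                reuse = true;
            else if (shared_writable)
                reuse = true;        /* dirtying/page_mkwrite handling lives here */

            if (reuse)
                wp_page_reuse();
            else
                wp_page_copy();
        }

        /* After: each branch that can reuse jumps straight to the reuse path,
         * keeping the (writable, shared) file-page handling in one obvious place. */
        static void wp_fault_restructured(bool anon_exclusive, bool shared_writable)
        {
            if (anon_exclusive)
                goto reuse;
            if (shared_writable)
                goto reuse;

            wp_page_copy();
            return;
        reuse:
            wp_page_reuse();
        }

        int main(void)
        {
            wp_fault_with_flag(false, true);
            wp_fault_restructured(false, true);
            return 0;
        }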
  2. 27 October 2010, 6 commits
  3. 08 October 2010, 1 commit
    • Encode huge page size for VM_FAULT_HWPOISON errors · aa50d3a7
      Andi Kleen committed
      This fixes a problem introduced with the hugetlb hwpoison handling.
      
      The user space SIGBUS signalling wants to know the size of the hugepage
      that caused a HWPOISON fault.
      
      Unfortunately the architecture page fault handlers do not have easy
      access to the struct page.
      
      Pass the information out in the fault error code instead.
      
      I added a separate VM_FAULT_HWPOISON_LARGE bit for this case and encoded
      the hpage index in some free upper bits of the fault code.  The small
      page hwpoison case stays with the VM_FAULT_HWPOISON name to minimize
      changes.
      
      Also add code to hugetlb.h to convert that index into a page shift.
      
      Will be used in a further patch.
      
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: fengguang.wu@intel.com
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      aa50d3a7
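      A hedged, standalone C sketch of the encoding idea described above: stash a small huge-page
      index in otherwise unused upper bits of the fault code and convert it back to a page shift.
      The bit positions, the index-to-shift table, and the helper names are assumptions made for
      illustration, not the kernel's actual definitions:

        #include <assert.h>
        #include <stdio.h>

        /* Illustrative only: bit layout and names are hypothetical, not the kernel's. */
        #define VM_FAULT_HWPOISON        0x0010u
        #define VM_FAULT_HWPOISON_LARGE  0x0020u
        #define VM_FAULT_HINDEX_SHIFT    12            /* assume bits 12..15 are free */
        #define VM_FAULT_HINDEX_MASK     (0xfu << VM_FAULT_HINDEX_SHIFT)

        /* Encode a huge-page size index into the fault code. */
        static unsigned int fault_hwpoison_large(unsigned int hpage_index)
        {
            return VM_FAULT_HWPOISON_LARGE |
                   ((hpage_index << VM_FAULT_HINDEX_SHIFT) & VM_FAULT_HINDEX_MASK);
        }

        /* Decode it again, and turn the index into a page shift. */
        static unsigned int fault_to_hindex(unsigned int fault)
        {
            return (fault & VM_FAULT_HINDEX_MASK) >> VM_FAULT_HINDEX_SHIFT;
        }

        static unsigned int hindex_to_shift(unsigned int hindex)
        {
            /* e.g. index 0 = 2MB (shift 21), index 1 = 1GB (shift 30): assumed table */
            static const unsigned int shifts[] = { 21, 30 };
            return shifts[hindex];
        }

        int main(void)
        {
            unsigned int fault = fault_hwpoison_large(1);

            assert(fault & VM_FAULT_HWPOISON_LARGE);
            printf("hpage index %u -> page shift %u\n",
                   fault_to_hindex(fault), hindex_to_shift(fault_to_hindex(fault)));
            return 0;
        }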
  4. 21 September 2010, 1 commit
    • mm: further fix swapin race condition · 31c4a3d3
      Hugh Dickins committed
      Commit 4969c119 ("mm: fix swapin race condition") is now agreed to
      be incomplete.  There's a race, not very much less likely than the
      original race envisaged, in which it is further necessary to check that
      the swapcache page's swap has not changed.
      
      Here's the reasoning: cast in terms of reuse_swap_page(), but probably
      could be reformulated to rely on try_to_free_swap() instead, or on
      swapoff+swapon.
      
      A, faults into do_swap_page(): does page1 = lookup_swap_cache(swap1) and
      comes through the lock_page(page1).
      
      B, a racing thread of the same process, faults on the same address: does
      page1 = lookup_swap_cache(swap1) and now waits in lock_page(page1), but
      for whatever reason is unlucky not to get the lock any time soon.
      
      A carries on through do_swap_page(), a write fault, but cannot reuse the
      swap page1 (another reference to swap1).  Unlocks the page1 (but B
      doesn't get it yet), does COW in do_wp_page(), page2 now in that pte.
      
      C, perhaps the parent of A+B, comes in and write faults the same swap
      page1 into its mm, reuse_swap_page() succeeds this time, swap1 is freed.
      
      kswapd comes in after some time (B still unlucky) and swaps out some
      pages from A+B and C: it allocates the original swap1 to page2 in A+B,
      and some other swap2 to the original page1 now in C.  But does not
      immediately free page1 (actually it couldn't: B holds a reference),
      leaving it in swap cache for now.
      
      B at last gets the lock on page1, hooray! Is PageSwapCache(page1)? Yes.
      Is pte_same(*page_table, orig_pte)? Yes, because page2 has now been
      given the swap1 which page1 used to have.  So B proceeds to insert page1
      into A+B's page_table, though its content now belongs to C, quite
      different from what A wrote there.
      
      B ought to have checked that page1's swap was still swap1.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      31c4a3d3
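      In essence, the missing check is: after finally getting the page lock, verify not only
      PageSwapCache and pte_same but also that the page still belongs to the original swap entry.
      A toy, standalone C model of that third check (page->swap_entry stands in for the real
      page_private() value; this is not the kernel code):

        #include <stdbool.h>
        #include <stdio.h>

        /* Toy model of the race check, not the kernel code: swap_entry stands in
         * for the swap entry a swapcache page currently belongs to. */
        struct page { bool in_swapcache; unsigned long swap_entry; };

        static bool swapin_still_valid(struct page *page, unsigned long orig_entry,
                                       bool pte_unchanged)
        {
            if (!page->in_swapcache)
                return false;             /* page left the swap cache: retry */
            if (!pte_unchanged)
                return false;             /* someone else handled the fault  */
            /* The extra check this commit adds: is it still *our* swap entry? */
            if (page->swap_entry != orig_entry)
                return false;
            return true;
        }

        int main(void)
        {
            /* page1 got recycled: it now belongs to swap2 while the pte got swap1 back */
            struct page page1 = { .in_swapcache = true, .swap_entry = 2 };

            printf("proceed with swapin? %s\n",
                   swapin_still_valid(&page1, /* orig swap1 */ 1, true) ? "yes" : "no");
            return 0;
        }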
  5. 10 September 2010, 1 commit
  6. 25 August 2010, 1 commit
  7. 24 August 2010, 1 commit
    • x86, mm: Avoid unnecessary TLB flush · 61c77326
      Shaohua Li committed
      On x86, the access and dirty bits are set automatically by the CPU when it
      accesses memory.  By the time we reach the code path below
      flush_tlb_fix_spurious_fault(), the dirty bit has already been set in the pte,
      so there is no need to flush the TLB.  This might mean that the TLB entry on
      some CPUs does not have the dirty bit set, but that does not matter: when those
      CPUs write to the page, they check the bit in the page table themselves, with
      no software involvement.
      
      On the other hand, flushing the TLB at this point is harmful.  A test creates
      one thread per CPU; each thread writes to the same (but randomly chosen)
      address in the same vma range, and we measure the total time.  On a 4-socket
      system the original time is 1.96s, while with the patch it is 0.8s.  On a
      2-socket system there is a 20% time cut as well.  perf shows a lot of time
      spent sending and handling IPIs for the TLB flush.
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      LKML-Reference: <20100816011655.GA362@sli10-desk.sh.intel.com>
      Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      61c77326
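      A hedged sketch of the arch-override pattern this implies: the generic fallback keeps
      flushing on a spurious fault, while x86 can define the hook away because the hardware sets
      the dirty bit and re-walks the page tables on write.  The macro name follows the changelog;
      the scaffolding and the CONFIG_X86_SKETCH switch are illustrative assumptions:

        #include <stdio.h>

        /* Sketch of the arch-override pattern; surrounding scaffolding is illustrative. */
        struct vm_area_struct { int dummy; };

        static void flush_tlb_page(struct vm_area_struct *vma, unsigned long addr)
        {
            (void)vma;
            printf("IPI + TLB flush for %#lx\n", addr);   /* the expensive path */
        }

        #ifdef CONFIG_X86_SKETCH
        /* x86: hardware sets the dirty bit itself, so a spurious fault needs nothing. */
        #define flush_tlb_fix_spurious_fault(vma, address) do { } while (0)
        #else
        /* generic fallback: keep the old behaviour and flush. */
        #define flush_tlb_fix_spurious_fault(vma, address) flush_tlb_page(vma, address)
        #endif

        int main(void)
        {
            struct vm_area_struct vma;

            /* Fault-handler path taken when the pte was already marked dirty/writable. */
            flush_tlb_fix_spurious_fault(&vma, 0x7f0000001000ul);
            return 0;
        }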
  8. 21 August 2010, 1 commit
  9. 15 August 2010, 1 commit
  10. 14 August 2010, 1 commit
  11. 13 August 2010, 1 commit
    • mm: keep a guard page below a grow-down stack segment · 320b2b8d
      Linus Torvalds committed
      This is a rather minimally invasive patch to solve the problem of the
      user stack growing into a memory mapped area below it.  Whenever we fill
      the first page of the stack segment, expand the segment down by one
      page.
      
      Now, admittedly some odd application might _want_ the stack to grow down
      into the preceding memory mapping, and so we may at some point need to
      make this a process tunable (some people might also want to have more
      than a single page of guarding), but let's try the minimal approach
      first.
      
      Tested with trivial application that maps a single page just below the
      stack, and then starts recursing.  Without this, we will get a SIGSEGV
      _after_ the stack has smashed the mapping.  With this patch, we'll get a
      nice SIGBUS just as the stack touches the page just above the mapping.
      Requested-by: Keith Packard <keithp@keithp.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      320b2b8d
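      A hedged, standalone C sketch of the check described above: when an anonymous fault lands in
      the lowest page of a grow-down VMA, expand the VMA down by one page first, so the mapping
      below the stack is never silently overrun.  The structures and helpers are simplified
      stand-ins, not the exact kernel code:

        #include <stdio.h>

        #define PAGE_SIZE 4096ul
        #define VM_GROWSDOWN 0x1u

        /* Simplified stand-ins for the real structures and helpers. */
        struct vm_area_struct { unsigned long vm_start, vm_end, vm_flags; };

        static int expand_stack(struct vm_area_struct *vma, unsigned long addr)
        {
            vma->vm_start = addr & ~(PAGE_SIZE - 1);   /* grow down by one page */
            return 0;
        }

        /* If we touch the first page of a grow-down segment, expand it downwards. */
        static int check_stack_guard_page(struct vm_area_struct *vma, unsigned long addr)
        {
            if ((vma->vm_flags & VM_GROWSDOWN) &&
                (addr & ~(PAGE_SIZE - 1)) == vma->vm_start)
                return expand_stack(vma, vma->vm_start - PAGE_SIZE);
            return 0;
        }

        int main(void)
        {
            struct vm_area_struct stack = {
                .vm_start = 0x7ffff0000000ul, .vm_end = 0x7ffff0100000ul,
                .vm_flags = VM_GROWSDOWN,
            };

            check_stack_guard_page(&stack, 0x7ffff0000010ul);  /* fault in first page */
            printf("stack now starts at %#lx\n", stack.vm_start);
            return 0;
        }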
  12. 10 August 2010, 4 commits
  13. 31 July 2010, 1 commit
    • mm: fix ia64 crash when gcore reads gate area · de51257a
      Hugh Dickins committed
      Debian's ia64 autobuilders have been seeing kernel freeze or reboot
      when running the gdb testsuite (Debian bug 588574): dannf bisected to
      2.6.32 62eede62 "mm: ZERO_PAGE without
      PTE_SPECIAL"; and reproduced it with gdb's gcore on a simple target.
      
      I'd missed updating the gate_vma handling in __get_user_pages(): that
      happens to use vm_normal_page() (nowadays failing on the zero page),
      yet reported success even when it failed to get a page - boom when
      access_process_vm() tried to copy that to its intermediate buffer.
      
      Fix this, resisting cleanups: in particular, leave it for now reporting
      success when not asked to get any pages - very probably safe to change,
      but let's not risk it without testing exposure.
      
      Why did ia64 crash with 16kB pages, but succeed with 64kB pages?
      Because setup_gate() pads each 64kB of its gate area with zero pages.
      Reported-by: Andreas Barth <aba@not.so.argh.org>
      Bisected-by: dann frazier <dannf@debian.org>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Tested-by: dann frazier <dannf@dannf.org>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      de51257a
  14. 25 May 2010, 1 commit
  15. 07 April 2010, 1 commit
  16. 30 March 2010, 1 commit
    • include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h · 5a0e3ad6
      Tejun Heo committed
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
      The percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities to include those
      headers directly instead of assuming availability.  As this conversion
      needs to touch a large number of source files, the following script is
      used as the basis of the conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the following:
      
      * Scans files for gfp and slab usages and updates includes such that
        only the necessary includes are there, i.e. if only gfp is used,
        gfp.h; if slab is used, slab.h.
      
      * When the script inserts a new include, it looks at the include
        blocks and tries to put the new include such that its order conforms
        to its surroundings.  It is put in the include block which contains
        core kernel includes, in the same order that the rest are ordered:
        alphabetical, Christmas tree, rev-Xmas-tree, or at the end if there
        doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have a fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
        file.
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition while adding it to implementation .h or
         embedding .c file was more appropriate for others.  This step added
         inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files, but without automatically
         editing them, as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored, as stuff from gfp.h was usually
         widely available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build tests were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers which should be easily discoverable on most builds of
      the specific arch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      5a0e3ad6
  17. 25 March 2010, 1 commit
    • exit: fix oops in sync_mm_rss · 298359c5
      Michael S. Tsirkin committed
      In 2.6.34-rc1, removing the vhost_net module causes an oops in sync_mm_rss
      (called from do_exit) when the workqueue is destroyed.  This does not happen
      on net-next, or with vhost on top of 2.6.33.
      
      The issue seems to have been introduced by
      34e55232 ("mm: avoid false sharing of
      mm_counter"), which added sync_mm_rss(); it is passed task->mm and
      dereferences it without checking.  If the task is a kernel thread, mm might be
      NULL.  I think this might also happen e.g. with aio.
      
      This patch fixes the oops by calling sync_mm_rss when task->mm is set to
      NULL.  I also added BUG_ON to detect any other cases where counters get
      incremented while mm is NULL.
      
      The oops I observed looks like this:
      
      BUG: unable to handle kernel NULL pointer dereference at 00000000000002a8
      IP: [<ffffffff810b436d>] sync_mm_rss+0x33/0x6f
      PGD 0
      Oops: 0002 [#1] SMP
      last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
      CPU 2
      Modules linked in: vhost_net(-) tun bridge stp sunrpc ipv6 cpufreq_ondemand acpi_cpufreq freq_table kvm_intel kvm i5000_edac edac_core rtc_cmos bnx2 button i2c_i801 i2c_core rtc_core e1000e sg joydev ide_cd_mod serio_raw pcspkr rtc_lib cdrom virtio_net virtio_blk virtio_pci virtio_ring virtio af_packet e1000 shpchp aacraid uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
      
      Pid: 2046, comm: vhost Not tainted 2.6.34-rc1-vhost #25 System Planar/IBM System x3550 -[7978B3G]-
      RIP: 0010:[<ffffffff810b436d>]  [<ffffffff810b436d>] sync_mm_rss+0x33/0x6f
      RSP: 0018:ffff8802379b7e60  EFLAGS: 00010202
      RAX: 0000000000000008 RBX: ffff88023f2390c0 RCX: 0000000000000000
      RDX: ffff88023f2396b0 RSI: 0000000000000000 RDI: ffff88023f2390c0
      RBP: ffff8802379b7e60 R08: 0000000000000000 R09: 0000000000000000
      R10: ffff88023aecfbc0 R11: 0000000000013240 R12: 0000000000000000
      R13: ffffffff81051a6c R14: ffffe8ffffc0f540 R15: 0000000000000000
      FS:  0000000000000000(0000) GS:ffff880001e80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 00000000000002a8 CR3: 000000023af23000 CR4: 00000000000406e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process vhost (pid: 2046, threadinfo ffff8802379b6000, task ffff88023f2390c0)
      Stack:
       ffff8802379b7ee0 ffffffff81040687 ffffe8ffffc0f558 ffffffffa00a3e2d
      <0> 0000000000000000 ffff88023f2390c0 ffffffff81055817 ffff8802379b7e98
      <0> ffff8802379b7e98 0000000100000286 ffff8802379b7ee0 ffff88023ad47d78
      Call Trace:
       [<ffffffff81040687>] do_exit+0x147/0x6c4
       [<ffffffffa00a3e2d>] ? handle_rx_net+0x0/0x17 [vhost_net]
       [<ffffffff81055817>] ? autoremove_wake_function+0x0/0x39
       [<ffffffff81051a6c>] ? worker_thread+0x0/0x229
       [<ffffffff810553c9>] kthreadd+0x0/0xf2
       [<ffffffff810038d4>] kernel_thread_helper+0x4/0x10
       [<ffffffff81055342>] ? kthread+0x0/0x87
       [<ffffffff810038d0>] ? kernel_thread_helper+0x0/0x10
      Code: 00 8b 87 6c 02 00 00 85 c0 74 14 48 98 f0 48 01 86 a0 02 00 00 c7 87 6c 02 00 00 00 00 00 00 8b 87 70 02 00 00 85 c0 74 14 48 98 <f0> 48 01 86 a8 02 00 00 c7 87 70 02 00 00 00 00 00 00 8b 87 74
      RIP  [<ffffffff810b436d>] sync_mm_rss+0x33/0x6f
       RSP <ffff8802379b7e60>
      CR2: 00000000000002a8
      ---[ end trace 41603ba922beddd2 ]---
      Fixing recursive fault but reboot is needed!
      
      (Note: handle_rx_net is a work item using the workqueue in question.)
      sync_mm_rss+0x33/0x6f gave me a hint.  I also tried reverting
      34e55232, and the oops goes away.
      
      The module in question calls use_mm and later unuse_mm from a kernel
      thread.  It is when this kernel thread is destroyed that the crash
      happens.
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      298359c5
  18. 13 March 2010, 2 commits
  19. 07 March 2010, 5 commits
    • rmap: move exclusively owned pages to own anon_vma in do_wp_page() · c44b6743
      Rik van Riel committed
      When the parent process breaks the COW on a page, both the original page
      (which stays mapped in the child) and the new page (which is mapped in the
      parent) end up in that same anon_vma.  Generally this won't be a problem,
      but for some workloads it could preserve the O(N) rmap scanning complexity.
      
      A simple fix is to ensure that, when a page mapped only in the child gets
      reused in do_wp_page, it is moved to the child's own exclusive anon_vma,
      since the child is already the exclusive owner.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c44b6743
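      A hedged, standalone C model of the fix: when do_wp_page() reuses an anonymous page that the
      faulting process owns exclusively, re-point the page's rmap at that process's own anon_vma so
      later rmap walks stay short.  The structures and the helper shown here are simplified
      assumptions, not the kernel's definitions:

        #include <stdio.h>

        /* Simplified stand-ins; not the kernel's struct layouts. */
        struct anon_vma { const char *owner; };
        struct page { struct anon_vma *mapping; int mapcount; };
        struct vm_area_struct { struct anon_vma *anon_vma; };

        /* On exclusive reuse, move the page into the reusing vma's own anon_vma. */
        static void page_move_anon_rmap(struct page *page, struct vm_area_struct *vma)
        {
            page->mapping = vma->anon_vma;
        }

        static void wp_page_reuse(struct page *page, struct vm_area_struct *vma)
        {
            if (page->mapcount == 1)        /* we are the only mapper: exclusive */
                page_move_anon_rmap(page, vma);
            /* ...then write-enable the pte as before... */
        }

        int main(void)
        {
            struct anon_vma parent = { "parent" }, child = { "child" };
            struct vm_area_struct child_vma = { &child };
            struct page page = { &parent, 1 };   /* left in parent's anon_vma by COW */

            wp_page_reuse(&page, &child_vma);
            printf("page now belongs to the %s anon_vma\n", page.mapping->owner);
            return 0;
        }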
    • mm: change anon_vma linking to fix multi-process server scalability issue · 5beb4930
      Rik van Riel committed
      The old anon_vma code can lead to scalability issues with heavily forking
      workloads.  Specifically, each anon_vma will be shared between the parent
      process and all its child processes.
      
      In a workload with 1000 child processes and a VMA with 1000 anonymous
      pages per process that get COWed, this leads to a system with a million
      anonymous pages in the same anon_vma, each of which is mapped in just one
      of the 1000 processes.  However, the current rmap code needs to walk them
      all, leading to O(N) scanning complexity for each page.
      
      This can result in systems where one CPU is walking the page tables of
      1000 processes in page_referenced_one, while all other CPUs are stuck on
      the anon_vma lock.  This leads to catastrophic failure for a benchmark
      like AIM7, where the total number of processes can reach in the tens of
      thousands.  Real workloads are still a factor 10 less process intensive
      than AIM7, but they are catching up.
      
      This patch changes the way anon_vmas and VMAs are linked, which allows us
      to associate multiple anon_vmas with a VMA.  At fork time, each child
      process gets its own anon_vmas, in which its COWed pages will be
      instantiated.  The parents' anon_vma is also linked to the VMA, because
      non-COWed pages could be present in any of the children.
      
      This reduces rmap scanning complexity to O(1) for the pages of the 1000
      child processes, with O(N) complexity for at most 1/N pages in the system.
       This reduces the average scanning cost in heavily forking workloads from
      O(N) to 2.
      
      The only real complexity in this patch stems from the fact that linking a
      VMA to anon_vmas now involves memory allocations.  This means vma_adjust
      can fail, if it needs to attach a VMA to anon_vma structures.  This in
      turn means error handling needs to be added to the calling functions.
      
      A second source of complexity is that, because there can be multiple
      anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
      "the" anon_vma lock.  To prevent the rmap code from walking up an
      incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag.  This bit
      flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
      to make sure it is impossible to compile a kernel that needs both symbolic
      values for the same bitflag.
      
      Some test results:
      
      Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
      box with 16GB RAM and not quite enough IO), the system ends up running
      >99% in system time, with every CPU on the same anon_vma lock in the
      pageout code.
      
      With these changes, AIM7 hits the cross-over point around 29.7k users.
      This happens with ~99% IO wait time, there never seems to be any spike in
      system time.  The anon_vma lock contention appears to be resolved.
      
      [akpm@linux-foundation.org: cleanups]
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5beb4930
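      A hedged, standalone C model of the new linking scheme: a vma keeps a chain of links, one per
      anon_vma it is associated with, so a child can have its own anon_vma for COWed pages while
      remaining linked to the parent's for pages that were never COWed.  Field names are simplified;
      the real structures keep list heads in both directions, and the allocation failure would be
      returned as -ENOMEM:

        #include <stdio.h>
        #include <stdlib.h>

        /* Simplified model of the anon_vma_chain idea; not the kernel definitions. */
        struct anon_vma { const char *name; };

        struct anon_vma_chain {
            struct anon_vma *anon_vma;    /* one anon_vma this vma is linked to   */
            struct anon_vma_chain *next;  /* next link on this vma's list         */
        };

        struct vm_area_struct {
            struct anon_vma *anon_vma;    /* the vma's "own" anon_vma             */
            struct anon_vma_chain *chain; /* all anon_vmas reachable from the vma */
        };

        /* Linking now allocates, so callers must be able to handle failure. */
        static int anon_vma_link(struct vm_area_struct *vma, struct anon_vma *av)
        {
            struct anon_vma_chain *avc = malloc(sizeof(*avc));

            if (!avc)
                return -1;
            avc->anon_vma = av;
            avc->next = vma->chain;
            vma->chain = avc;
            return 0;
        }

        int main(void)
        {
            struct anon_vma parent_av = { "parent" }, child_av = { "child" };
            struct vm_area_struct child_vma = { .anon_vma = &child_av };

            /* Child keeps its own anon_vma for COWed pages... */
            if (anon_vma_link(&child_vma, &child_av) ||
                /* ...but stays linked to the parent's for not-yet-COWed pages. */
                anon_vma_link(&child_vma, &parent_av))
                return 1;

            for (struct anon_vma_chain *avc = child_vma.chain; avc; avc = avc->next)
                printf("vma linked to the %s anon_vma\n", avc->anon_vma->name);
            return 0;
        }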
    • mm: count swap usage · b084d435
      KAMEZAWA Hiroyuki committed
      A frequent question from users about memory management is how many swap
      entries are used by processes.  This information can also give some
      hints to the oom-killer.
      
      Although we can count the number of swap entries per process by scanning
      /proc/<pid>/smaps, this is very slow and not suitable for the usual
      process-information tools such as 'ps' or 'top' (which are already slow
      enough as it is).
      
      This patch adds a swap-entry counter to mm_counter and updates it at each
      swap event.  The information is exported via the /proc/<pid>/status file as
      
      [kamezawa@bluextal memory]$ cat /proc/self/status
      Name:   cat
      State:  R (running)
      Tgid:   2910
      Pid:    2910
      PPid:   2823
      TracerPid:      0
      Uid:    500     500     500     500
      Gid:    500     500     500     500
      FDSize: 256
      Groups: 500
      VmPeak:    82696 kB
      VmSize:    82696 kB
      VmLck:         0 kB
      VmHWM:       432 kB
      VmRSS:       432 kB
      VmData:      172 kB
      VmStk:        84 kB
      VmExe:        48 kB
      VmLib:      1568 kB
      VmPTE:        40 kB
      VmSwap:        0 kB <=============== this.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b084d435
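      A hedged, standalone C sketch of the bookkeeping described above: an MM_SWAPENTS counter is
      incremented where a pte is replaced by a swap entry and decremented when the entry is swapped
      back in or freed, and its value is what VmSwap reports.  The update sites and helpers below
      are simplified stand-ins for the kernel's:

        #include <stdio.h>

        /* Simplified model; the kernel keeps these in mm->*_rss style counters. */
        enum { MM_FILEPAGES, MM_ANONPAGES, MM_SWAPENTS, NR_MM_COUNTERS };

        struct mm_struct { long counters[NR_MM_COUNTERS]; };

        static void inc_mm_counter(struct mm_struct *mm, int member) { mm->counters[member]++; }
        static void dec_mm_counter(struct mm_struct *mm, int member) { mm->counters[member]--; }

        /* Reclaim unmaps an anon pte and installs a swap entry in its place. */
        static void swap_out_one_pte(struct mm_struct *mm)
        {
            dec_mm_counter(mm, MM_ANONPAGES);
            inc_mm_counter(mm, MM_SWAPENTS);
        }

        /* A swapin fault brings the page back and frees the swap reference. */
        static void swap_in_one_pte(struct mm_struct *mm)
        {
            inc_mm_counter(mm, MM_ANONPAGES);
            dec_mm_counter(mm, MM_SWAPENTS);
        }

        int main(void)
        {
            struct mm_struct mm = { { 0, 100, 0 } };

            swap_out_one_pte(&mm);
            printf("VmSwap: %8ld kB\n", mm.counters[MM_SWAPENTS] * 4); /* 4 kB pages */
            swap_in_one_pte(&mm);
            printf("VmSwap: %8ld kB\n", mm.counters[MM_SWAPENTS] * 4);
            return 0;
        }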
    • mm: avoid false sharing of mm_counter · 34e55232
      KAMEZAWA Hiroyuki committed
      By their nature, the per-mm stats are an object shared among threads, and
      they can be a cache-miss point in the page fault path.
      
      This patch adds a per-thread cache for mm_counter.  RSS values are
      counted into a struct in task_struct and synchronized with the mm's
      counters at certain events.
      
      In this patch, the event is the number of calls to handle_mm_fault: the
      per-thread value is added to the mm every 64 calls.
      
       A rough estimation with a small benchmark on parallel threads (2 threads) shows
       [before]
           4.5 cache-misses/fault
       [after]
           4.0 cache-misses/fault
       Anyway, the most contended object is mmap_sem once the number of threads grows.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      34e55232
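      A hedged, standalone C model of the batching scheme described above: each thread accumulates
      RSS deltas privately and folds them into the shared mm counters every 64 faults, trading a
      small window of staleness for fewer bounced cache lines.  Names and the exact fold points are
      simplified assumptions:

        #include <stdio.h>

        #define TASK_RSS_EVENTS_THRESH 64   /* the "every 64 calls" from the changelog */

        enum { MM_FILEPAGES, MM_ANONPAGES, NR_MM_COUNTERS };

        struct mm_struct   { long rss_stat[NR_MM_COUNTERS]; };           /* shared     */
        struct task_struct {                                             /* per-thread */
            struct mm_struct *mm;
            long rss_cache[NR_MM_COUNTERS];
            int rss_events;
        };

        /* Fold the private cache into the shared counters (sync_mm_rss-like). */
        static void sync_mm_rss(struct task_struct *tsk)
        {
            for (int i = 0; i < NR_MM_COUNTERS; i++) {
                tsk->mm->rss_stat[i] += tsk->rss_cache[i];
                tsk->rss_cache[i] = 0;
            }
        }

        /* Called from the fault path instead of touching mm->rss_stat directly. */
        static void add_mm_counter_fast(struct task_struct *tsk, int member, long delta)
        {
            tsk->rss_cache[member] += delta;
            if (++tsk->rss_events >= TASK_RSS_EVENTS_THRESH) {
                sync_mm_rss(tsk);
                tsk->rss_events = 0;
            }
        }

        int main(void)
        {
            struct mm_struct mm = { { 0, 0 } };
            struct task_struct tsk = { .mm = &mm };

            for (int i = 0; i < 100; i++)           /* 100 anon faults in one thread */
                add_mm_counter_fast(&tsk, MM_ANONPAGES, 1);

            printf("shared: %ld, still cached: %ld\n",
                   mm.rss_stat[MM_ANONPAGES], tsk.rss_cache[MM_ANONPAGES]);
            return 0;
        }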
    • mm: clean up mm_counter · d559db08
      KAMEZAWA Hiroyuki committed
      Presently, the per-mm statistics counters are defined by macros in sched.h.
      
      This patch modifies them to be
        - defined in mm.h as inline functions
        - backed by an array instead of macro-generated names.
      
      This patch is for reducing the size of a future patch that modifies the
      implementation of the per-mm counters.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d559db08
  20. 21 February 2010, 1 commit
    • MM: Pass a PTE pointer to update_mmu_cache() rather than the PTE itself · 4b3073e1
      Russell King committed
      On VIVT ARM, when we have multiple shared mappings of the same file
      in the same MM, we need to ensure that we have coherency across all
      copies.  We do this via make_coherent() by making the pages
      uncacheable.
      
      This used to work fine, until we allowed highmem with highpte - we
      now have a page table which is mapped as required, and is not available
      for modification via update_mmu_cache().
      
      Ralf Baechle suggested getting rid of the PTE value passed to
      update_mmu_cache():
      
        On MIPS update_mmu_cache() calls __update_tlb() which walks pagetables
        to construct a pointer to the pte again.  Passing a pte_t * is much
        more elegant.  Maybe we might even replace the pte argument with the
        pte_t?
      
      Ben Herrenschmidt would also like the pte pointer for PowerPC:
      
        Passing the ptep in there is exactly what I want.  I want that
        -instead- of the PTE value, because I have issue on some ppc cases,
        for I$/D$ coherency, where set_pte_at() may decide to mask out the
        _PAGE_EXEC.
      
      So, pass in the mapped page table pointer into update_mmu_cache(), and
      remove the PTE value, updating all implementations and call sites to
      suit.
      
      Includes a fix from Stephen Rothwell:
      
        sparc: fix fallout from update_mmu_cache API change
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
      4b3073e1
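      A hedged before/after sketch of the interface change, with simplified types: passing only the
      pte value leaves the architecture unable to adjust the entry, while passing the mapped pte
      pointer lets it both read and rewrite it (the in-place tweak below is purely illustrative):

        #include <stdio.h>

        /* Simplified stand-ins for the real types. */
        typedef unsigned long pte_t;
        struct vm_area_struct { int dummy; };

        /* Old interface: only the pte *value*, so the arch cannot modify the entry. */
        static void update_mmu_cache_old(struct vm_area_struct *vma,
                                         unsigned long addr, pte_t pte)
        {
            (void)vma; (void)addr;
            printf("old: saw pte value %#lx (read-only copy)\n", pte);
        }

        /* New interface: the pointer to the mapped pte, so the arch can adjust it,
         * e.g. make a mapping uncacheable for VIVT coherency or mask bits on ppc. */
        static void update_mmu_cache_new(struct vm_area_struct *vma,
                                         unsigned long addr, pte_t *ptep)
        {
            (void)vma; (void)addr;
            *ptep |= 0x1;                   /* illustrative in-place tweak */
            printf("new: pte is now %#lx\n", *ptep);
        }

        int main(void)
        {
            struct vm_area_struct vma;
            pte_t pte = 0x1000;

            update_mmu_cache_old(&vma, 0x400000, pte);
            update_mmu_cache_new(&vma, 0x400000, &pte);
            return 0;
        }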
  21. 16 December 2009, 5 commits
    • memcg: coalesce uncharge during unmap/truncate · 569b846d
      KAMEZAWA Hiroyuki committed
      In a massively parallel environment, res_counter can be a performance
      bottleneck.  One strong technique to reduce lock contention is to reduce
      the number of calls by coalescing some amount of calls into one.
      
      Considering charge/uncharge characteristics,
      	- charge is done one by one via demand-paging.
      	- uncharge is done by
      		- in chunk at munmap, truncate, exit, execve...
      		- one by one via vmscan/paging.
      
      It seems we have a chance to coalesce uncharges for improving scalability
      at unmap/truncation.
      
      This patch implements coalescing of uncharges.  To avoid scattering memcg's
      structures into functions under mm/, this patch adds memcg batch uncharge
      information to the task.  A reason for per-task batching is to make use
      of the caller's context information.  We do batched (delayed) uncharge
      when truncation/unmap occurs, but do direct uncharge when uncharge is
      called by memory reclaim (vmscan.c).
      
      The degree of coalescing depends on the caller:
        - at invalidate/truncate... pagevec size
        - at unmap... ZAP_BLOCK_SIZE
      (memory itself is freed at this granularity), so we will not coalesce
      too much.
      
      On an 8-cpu x86-64 server, I tested the overheads of memcg at page fault by
      running a program which does map/fault/unmap in a loop, running one
      task per cpu (pinned with taskset) and summing the number of page faults
      over 60 seconds.
      
      [without memcg config]
        40156968  page-faults              #      0.085 M/sec   ( +-   0.046% )
        27.67 cache-miss/faults
      [root cgroup]
        36659599  page-faults              #      0.077 M/sec   ( +-   0.247% )
        31.58 miss/faults
      [in a child cgroup]
        18444157  page-faults              #      0.039 M/sec   ( +-   0.133% )
        69.96 miss/faults
      [child with this patch]
        27133719  page-faults              #      0.057 M/sec   ( +-   0.155% )
        47.16 miss/faults
      
      We can see some amount of improvement.
      (The root cgroup is not affected by this patch.)
      Another patch, for "charge", will follow this one, and the numbers above will
      improve further.
      
      Changelog (since 2009/10/02):
       - renamed fields of memcg_batch (pages to bytes, memsw to memsw_bytes)
       - some cleanup and commentary/description updates.
       - added initialization code to copy_process(). (possible bug fix)
      
      Changelog (old):
       - fixed the !CONFIG_MEM_CGROUP case.
       - rebased onto the latest mmotm + softlimit fix patches.
       - unified patch for callers.
       - added comments.
       - made ->do_batch a bool.
       - removed css_get() et al.; we don't need it.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      569b846d
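      A hedged, standalone C model of the per-task batching described above: a start/end pair
      brackets unmap or truncate, uncharges accumulate in the task, and one res_counter update is
      issued at the end, while reclaim keeps uncharging directly.  The structures and counters are
      simplified stand-ins for the kernel's:

        #include <stdbool.h>
        #include <stdio.h>

        /* Simplified model of per-task uncharge batching; not the kernel structures. */
        struct res_counter { long usage; int updates; };

        struct memcg_batch_info { bool do_batch; long bytes; };
        struct task_struct { struct memcg_batch_info memcg_batch; };

        static void res_counter_uncharge(struct res_counter *rc, long bytes)
        {
            rc->usage -= bytes;
            rc->updates++;                   /* each call is a contended update */
        }

        static void mem_cgroup_uncharge_start(struct task_struct *tsk)
        {
            tsk->memcg_batch.do_batch = true;
            tsk->memcg_batch.bytes = 0;
        }

        static void uncharge_page(struct task_struct *tsk, struct res_counter *rc, long bytes)
        {
            if (tsk->memcg_batch.do_batch)
                tsk->memcg_batch.bytes += bytes; /* defer: unmap/truncate context */
            else
                res_counter_uncharge(rc, bytes); /* direct: e.g. vmscan reclaim    */
        }

        static void mem_cgroup_uncharge_end(struct task_struct *tsk, struct res_counter *rc)
        {
            if (tsk->memcg_batch.bytes)
                res_counter_uncharge(rc, tsk->memcg_batch.bytes);
            tsk->memcg_batch.do_batch = false;
        }

        int main(void)
        {
            struct res_counter rc = { 512 * 4096, 0 };
            struct task_struct tsk = { { false, 0 } };

            mem_cgroup_uncharge_start(&tsk);
            for (int i = 0; i < 512; i++)        /* e.g. one ZAP_BLOCK_SIZE worth */
                uncharge_page(&tsk, &rc, 4096);
            mem_cgroup_uncharge_end(&tsk, &rc);

            printf("usage %ld bytes after %d counter updates\n", rc.usage, rc.updates);
            return 0;
        }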
    • HWPOISON: comment dirty swapcache pages · 71f72525
      Wu Fengguang committed
      AK: Improve comment
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      71f72525
    • ksm: let shared pages be swappable · 5ad64688
      Hugh Dickins committed
      Initial implementation for swapping out KSM's shared pages: add
      page_referenced_ksm() and try_to_unmap_ksm(), which rmap.c calls when
      faced with a PageKsm page.
      
      Most of what's needed can be got from the rmap_items listed from the
      stable_node of the ksm page, without discovering the actual vma: so in
      this patch just fake up a struct vma for page_referenced_one() or
      try_to_unmap_one(), then refine that in the next patch.
      
      Add VM_NONLINEAR to ksm_madvise()'s list of exclusions: it has always been
      implicit there (being only set with VM_SHARED, already excluded), but
      let's make it explicit, to help justify the lack of nonlinear unmap.
      
      Rely on the page lock to protect against concurrent modifications to that
      page's node of the stable tree.
      
      The awkward part is not swapout but swapin: do_swap_page() and
      page_add_anon_rmap() now have to allow for new possibilities - perhaps a
      ksm page still in swapcache, perhaps a swapcache page associated with one
      location in one anon_vma now needed for another location or anon_vma.
      (And the vma might even be no longer VM_MERGEABLE when that happens.)
      
      ksm_might_need_to_copy() checks for that case, and supplies a duplicate
      page when necessary, simply leaving it to a subsequent pass of ksmd to
      rediscover the identity and merge them back into one ksm page.
      Disappointingly primitive: but the alternative would have to accumulate
      unswappable info about the swapped out ksm pages, limiting swappability.
      
      Remove page_add_ksm_rmap(): page_add_anon_rmap() now has to allow for the
      particular case it was handling, so just use it instead.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5ad64688
    • mm: sigbus instead of abusing oom · d99be1a8
      Hugh Dickins committed
      When do_nonlinear_fault() realizes that the page table must have been
      corrupted for it to have been called, it does print_bad_pte() and returns
      ...  VM_FAULT_OOM, which is hard to understand.
      
      It made some sense when I did it for 2.6.15, when do_page_fault() just
      killed the current process; but nowadays it lets the OOM killer decide who
      to kill - so page table corruption in one process would be liable to kill
      another.
      
      Change it to return VM_FAULT_SIGBUS instead: that doesn't guarantee that
      the process will be killed, but is good enough for such a rare
      abnormality, accompanied as it is by the "BUG: Bad page map" message.
      
      And recent HWPOISON work has copied that code into do_swap_page(), when it
      finds an impossible swap entry: fix that to VM_FAULT_SIGBUS too.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d99be1a8
    • swap_info: swap count continuations · 570a335b
      Hugh Dickins committed
      Swap is duplicated (reference count incremented by one) whenever the same
      swap page is inserted into another mm (when forking finds a swap entry in
      place of a pte, or when reclaim unmaps a pte to insert the swap entry).
      
      swap_info_struct's vmalloc'ed swap_map is the array of these reference
      counts: but what happens when the unsigned short (or unsigned char since
      the preceding patch) is full? (and its high bit is kept for a cache flag)
      
      We then lose track of it, never freeing, leaving it in use until swapoff:
      at which point we _hope_ that a single pass will have found all instances,
      assume there are no more, and will lose user data if we're wrong.
      
      Swapping of KSM pages has not yet been enabled; but it is implemented,
      and makes it very easy for a user to overflow the maximum swap count:
      possible with ordinary process pages, but unlikely, even when pid_max
      has been raised from PID_MAX_DEFAULT.
      
      This patch implements swap count continuations: when the count overflows,
      a continuation page is allocated and linked to the original vmalloc'ed
      map page, and this is used to hold the continuation counts for that entry
      and its neighbours.  These continuation pages are seldom referenced:
      the common paths all work on the original swap_map, only referring to
      a continuation page when the low "digit" of a count is incremented or
      decremented through SWAP_MAP_MAX.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      570a335b
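      A hedged, standalone C model of the digit-carry idea described above: each entry keeps a small
      primary count, and overflow carries into a continuation "digit" that is only touched when the
      primary count crosses its maximum (the kernel chains whole continuation pages; one extra digit
      is enough to show the arithmetic):

        #include <stdio.h>

        #define SWAP_MAP_MAX  0x3f      /* illustrative low-"digit" maximum */

        /* Toy model: one primary count per entry plus one continuation digit. */
        struct swap_counts {
            unsigned char map;          /* the ordinary swap_map count            */
            unsigned char cont;         /* continuation digit, rarely referenced  */
        };

        static void swap_duplicate(struct swap_counts *sc)
        {
            if (sc->map < SWAP_MAP_MAX) {
                sc->map++;              /* common path: primary count only        */
                return;
            }
            sc->cont++;                 /* overflow: carry into the continuation  */
            sc->map = 0;                /* real code keeps a "counted in cont" marker */
        }

        static unsigned long swap_count(const struct swap_counts *sc)
        {
            return (unsigned long)sc->cont * (SWAP_MAP_MAX + 1) + sc->map;
        }

        int main(void)
        {
            struct swap_counts sc = { 0, 0 };

            for (int i = 0; i < 1000; i++)   /* e.g. 1000 forks sharing one entry */
                swap_duplicate(&sc);
            printf("effective count: %lu\n", swap_count(&sc));
            return 0;
        }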
  22. 29 October 2009, 1 commit
  23. 19 October 2009, 1 commit