1. 28 1月, 2011 4 次提交
  2. 27 1月, 2011 2 次提交
  3. 26 1月, 2011 3 次提交
    • E
      percpu, x86: Fix percpu_xchg_op() · 889a7a6a
      Eric Dumazet 提交于
      These recent percpu commits:
      
        2485b646: x86,percpu: Move out of place 64 bit ops into X86_64 section
        8270137a: cpuops: Use cmpxchg for xchg to avoid lock semantics
      
      Caused this 'perf top' crash:
      
       Kernel panic - not syncing: Fatal exception in interrupt
       Pid: 0, comm: swapper Tainted: G     D
       2.6.38-rc2-00181-gef71723 #413 Call Trace: <IRQ> [<ffffffff810465b5>]
          ? panic
          ? kmsg_dump
          ? kmsg_dump
          ? oops_end
          ? no_context
          ? __bad_area_nosemaphore
          ? perf_output_begin
          ? bad_area_nosemaphore
          ? do_page_fault
          ? __task_pid_nr_ns
          ? perf_event_tid
          ? __perf_event_header__init_id
          ? validate_chain
          ? perf_output_sample
          ? trace_hardirqs_off
          ? page_fault
          ? irq_work_run
          ? update_process_times
          ? tick_sched_timer
          ? tick_sched_timer
          ? __run_hrtimer
          ? hrtimer_interrupt
          ? account_system_vtime
          ? smp_apic_timer_interrupt
          ? apic_timer_interrupt
       ...
      
      Looking at assembly code, I found:
      
      list = this_cpu_xchg(irq_work_list, NULL);
      
      gives this wrong code : (gcc-4.1.2 cross compiler)
      
      ffffffff810bc45e:
      	mov    %gs:0xead0,%rax
      	cmpxchg %rax,%gs:0xead0
      	jne    ffffffff810bc45e <irq_work_run+0x3e>
      	test   %rax,%rax
      	je     ffffffff810bc4aa <irq_work_run+0x8a>
      
      Tell gcc we dirty eax/rax register in percpu_xchg_op()
      
      Compiler must use another register to store pxo_new__
      
      We also dont need to reload percpu value after a jump,
      since a 'failed' cmpxchg already updated eax/rax
      
      Wrong generated code was :
      	xor     %rax,%rax   /* load 0 into %rax */
      1:	mov     %gs:0xead0,%rax
      	cmpxchg %rax,%gs:0xead0
      	jne     1b
      	test    %rax,%rax
      
      After patch :
      
      	xor     %rdx,%rdx   /* load 0 into %rdx */
      	mov     %gs:0xead0,%rax
      1:	cmpxchg %rdx,%gs:0xead0
      	jne     1b:
      	test    %rax,%rax
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Tejun Heo <tj@kernel.org>
      LKML-Reference: <1295973114.3588.312.camel@edumazet-laptop>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      889a7a6a
    • Y
      x86: Remove left over system_64.h · 9a57c3e4
      Yinghai Lu 提交于
      Left-over from the x86 merge ...
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      LKML-Reference: <4D3E23D1.7010405@kernel.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      9a57c3e4
    • A
      thp: fix PARAVIRT x86 32bit noPAE · cacf061c
      Andrea Arcangeli 提交于
      This fixes TRANSPARENT_HUGEPAGE=y with PARAVIRT=y and HIGHMEM64=n.
      
      The #ifdef that this patch removes was erratically introduced to fix a
      build error for noPAE (where pmd.pmd doesn't exist).  So then the kernel
      built but it failed at runtime because set_pmd_at was a noop.  This will
      correct it by enabling set_pmd_at for noPAE mode too.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: Nwerner <w.landgraf@ru.ru>
      Reported-by: NMinchan Kim <minchan.kim@gmail.com>
      Tested-by: NMinchan Kim <minchan.kim@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cacf061c
  4. 25 1月, 2011 1 次提交
    • J
      x86-64: Don't use pointer to out-of-scope variable in dump_trace() · 2e5aa682
      Jesper Juhl 提交于
      In arch/x86/kernel/dumpstack_64.c::dump_trace() we have this code:
      
      ...
        		if (!stack) {
        			unsigned long dummy;
        			stack = &dummy;
        			if (task && task != current)
        				stack = (unsigned long *)task->thread.sp;
        		}
      
        		bp = stack_frame(task, regs);
        		/*
        		 * Print function call entries in all stacks, starting at the
        		 * current stack address. If the stacks consist of nested
        		 * exceptions
        		 */
        		tinfo = task_thread_info(task);
      
        		for (;;) {
        			char *id;
        			unsigned long *estack_end;
        			estack_end = in_exception_stack(cpu, (unsigned long)stack,
        							&used, &id);
      ...
      
      You'll notice that we assign to 'stack' the address of the variable
      'dummy' which is only in-scope inside the 'if (!stack)'. So when we later
      access stack (at the end of the above, and assuming we did not take the
      'if (task && task != current)' branch) we'll be using the address of a
      variable that is no longer in scope. I believe this patch is the proper
      fix, but I freely admit that I'm not 100% certain.
      Signed-off-by: NJesper Juhl <jj@chaosbits.net>
      LKML-Reference: <alpine.LNX.2.00.1101242232590.10252@swampdragon.chaosbits.net>
      Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
      2e5aa682
  5. 23 1月, 2011 1 次提交
    • M
      x86: Fix jump label with RO/NX module protection crash · 89696913
      matthieu castet 提交于
      If we use jump table in module init, there are marked
      as removed in __jump_table section after init is done.
      
      But we already applied ro permissions on the module, so
      we can't modify a read only section (crash in
      remove_jump_label_module_init).
      
      Make the __jump_table section rw.
      Signed-off-by: NMatthieu CASTET <castet.matthieu@free.fr>
      Cc: Xiaotian Feng <xtfeng@gmail.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Siarhei Liakh <sliakh.lkml@gmail.com>
      Cc: Xuxian Jiang <jiang@cs.ncsu.edu>
      Cc: James Morris <jmorris@namei.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Kees Cook <kees.cook@canonical.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      LKML-Reference: <4D3C3F20.7030203@free.fr>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      89696913
  6. 22 1月, 2011 2 次提交
    • B
      x86, hotplug: Fix powersavings with offlined cores on AMD · 93789b32
      Borislav Petkov 提交于
      ea530692 made a CPU use monitor/mwait
      when offline. This is not the optimal choice for AMD wrt to powersavings
      and we'd prefer our cores to halt (i.e. enter C1) instead. For this, the
      same selection whether to use monitor/mwait has to be used as when we
      select the idle routine for the machine.
      
      With this patch, offlining cores 1-5 on a X6 machine allows core0 to
      boost again.
      
      [ hpa: putting this in urgent since it is a (power) regression fix ]
      Reported-by: NAndreas Herrmann <andreas.herrmann3@amd.com>
      Cc: stable@kernel.org # 37.x
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.hl>
      Signed-off-by: NBorislav Petkov <borislav.petkov@amd.com>
      LKML-Reference: <1295534572-10730-1-git-send-email-bp@amd64.org>
      Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
      93789b32
    • S
      xen: p2m: correctly initialize partial p2m leaf · 8e1b4cf2
      Stefan Bader 提交于
      After changing the p2m mapping to a tree by
      
        commit 58e05027
          xen: convert p2m to a 3 level tree
      
      and trying to boot a DomU with 615MB of memory, the following crash was
      observed in the dump:
      
      kernel direct mapping tables up to 26f00000 @ 1ec4000-1fff000
      BUG: unable to handle kernel NULL pointer dereference at (null)
      IP: [<c0107397>] xen_set_pte+0x27/0x60
      *pdpt = 0000000000000000 *pde = 0000000000000000
      
      Adding further debug statements showed that when trying to set up
      pfn=0x26700 the returned mapping was invalid.
      
      pfn=0x266ff calling set_pte(0xc1fe77f8, 0x6b3003)
      pfn=0x26700 calling set_pte(0xc1fe7800, 0x3)
      
      Although the last_pfn obtained from the startup info is 0x26700, which
      should in turn not be hit, the additional 8MB which are added as extra
      memory normally seem to be ok. This lead to looking into the initial
      p2m tree construction, which uses the smaller value and assuming that
      there is other code handling the extra memory.
      
      When the p2m tree is set up, the leaves are directly pointed to the
      array which the domain builder set up. But if the mapping is not on a
      boundary that fits into one p2m page, this will result in the last leaf
      being only partially valid. And as the invalid entries are not
      initialized in that case, things go badly wrong.
      
      I am trying to fix that by checking whether the current leaf is a
      complete map and if not, allocate a completely new page and copy only
      the valid pointers there. This may not be the most efficient or elegant
      solution, but at least it seems to allow me booting DomUs with memory
      assignments all over the range.
      
      BugLink: http://bugs.launchpad.net/bugs/686692
      [v2: Redid a bit of commit wording and fixed a compile warning]
      Signed-off-by: NStefan Bader <stefan.bader@canonical.com>
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      8e1b4cf2
  7. 21 1月, 2011 4 次提交
  8. 20 1月, 2011 5 次提交
  9. 19 1月, 2011 2 次提交
  10. 18 1月, 2011 2 次提交
    • B
      x86: Clear irqstack thread_info · 7b698ea3
      Brian Gerst 提交于
      Mathias Merz reported that v2.6.37 failed to boot on his
      system.
      
      Make sure that the thread_info part of the irqstack is
      initialized to zeroes.
      Reported-and-Tested-by: NMatthias Merz <linux@merz-ka.de>
      Signed-off-by: NBrian Gerst <brgerst@gmail.com>
      Acked-by: NPekka Enberg <penberg@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <AANLkTimyKXfJ1x8tgwrr1hYnNLrPfgE1NTe4z7L6tUDm@mail.gmail.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      7b698ea3
    • S
      x86: Make relocatable kernel work with new binutils · 86b1e8dd
      Shaohua Li 提交于
      The CONFIG_RELOCATABLE=y option is broken with new binutils, which will make
      boot panic.
      
      According to Lu Hongjiu, the affected binutils are from 2.20.51.0.12 to
      2.21.51.0.3, which are release since Oct 22 this year. At least ubuntu 10.10 is
      using such binutils. See:
      
          http://sourceware.org/bugzilla/show_bug.cgi?id=12327
      
      The reason of the boot panic is that we have 'jiffies = jiffies_64;' in
      vmlinux.lds.S. The jiffies isn't in any section. In kernel build, there is
      warning saying jiffies is an absolute address and can't be relocatable. At
      runtime, jiffies will have virtual address 0.
      
      Signed-off-by: Shaohua Li<shaohua.li@intel.com>
      Cc: Lu Hongjiu<hongjiu.lu@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      LKML-Reference: <1295312269.1949.725.camel@sli10-conroe>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      86b1e8dd
  11. 15 1月, 2011 7 次提交
  12. 14 1月, 2011 7 次提交
    • A
      x86: OLPC: convert olpc-xo1 driver from pci device to platform device · 419cdc54
      Andres Salomon 提交于
      The cs5535-mfd driver now takes care of the PCI BAR handling; this
      means the olpc-xo1 driver shouldn't be touching the PCI device at all.
      
      This patch uses both cs5535-acpi and cs5535-pms platform devices rather
      than a single platform device because the cs5535-mfd driver may be used
      by other CS5535 platform-specific drivers; OLPC doesn't get to dictate
      that ACPI and PMS will always be used together.
      Signed-off-by: NAndres Salomon <dilinger@queued.net>
      Acked-by: NH. Peter Anvin <hpa@zytor.com>
      Signed-off-by: NSamuel Ortiz <sameo@linux.intel.com>
      419cdc54
    • A
      thp: mmu_notifier_test_young · 8ee53820
      Andrea Arcangeli 提交于
      For GRU and EPT, we need gup-fast to set referenced bit too (this is why
      it's correct to return 0 when shadow_access_mask is zero, it requires
      gup-fast to set the referenced bit).  qemu-kvm access already sets the
      young bit in the pte if it isn't zero-copy, if it's zero copy or a shadow
      paging EPT minor fault we relay on gup-fast to signal the page is in
      use...
      
      We also need to check the young bits on the secondary pagetables for NPT
      and not nested shadow mmu as the data may never get accessed again by the
      primary pte.
      
      Without this closer accuracy, we'd have to remove the heuristic that
      avoids collapsing hugepages in hugepage virtual regions that have not even
      a single subpage in use.
      
      ->test_young is full backwards compatible with GRU and other usages that
      don't have young bits in pagetables set by the hardware and that should
      nuke the secondary mmu mappings when ->clear_flush_young runs just like
      EPT does.
      
      Removing the heuristic that checks the young bit in
      khugepaged/collapse_huge_page completely isn't so bad either probably but
      I thought it was worth it and this makes it reliable.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8ee53820
    • A
      thp: don't allow transparent hugepage support without PSE · 4b7167b9
      Andrea Arcangeli 提交于
      Archs implementing Transparent Hugepage Support must implement a function
      called has_transparent_hugepage to be sure the virtual or physical CPU
      supports Transparent Hugepages.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4b7167b9
    • J
      thp: add pmd_modify · c489f125
      Johannes Weiner 提交于
      Add pmd_modify() for use with mprotect() on huge pmds.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c489f125
    • J
      thp: add x86 32bit support · f2d6bfe9
      Johannes Weiner 提交于
      Add support for transparent hugepages to x86 32bit.
      
      Share the same VM_ bitflag for VM_MAPPED_COPY.  mm/nommu.c will never
      support transparent hugepages.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f2d6bfe9
    • A
      thp: transparent hugepage core · 71e3aac0
      Andrea Arcangeli 提交于
      Lately I've been working to make KVM use hugepages transparently without
      the usual restrictions of hugetlbfs.  Some of the restrictions I'd like to
      see removed:
      
      1) hugepages have to be swappable or the guest physical memory remains
         locked in RAM and can't be paged out to swap
      
      2) if a hugepage allocation fails, regular pages should be allocated
         instead and mixed in the same vma without any failure and without
         userland noticing
      
      3) if some task quits and more hugepages become available in the
         buddy, guest physical memory backed by regular pages should be
         relocated on hugepages automatically in regions under
         madvise(MADV_HUGEPAGE) (ideally event driven by waking up the
         kernel deamon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes
         not null)
      
      4) avoidance of reservation and maximization of use of hugepages whenever
         possible. Reservation (needed to avoid runtime fatal faliures) may be ok for
         1 machine with 1 database with 1 database cache with 1 database cache size
         known at boot time. It's definitely not feasible with a virtualization
         hypervisor usage like RHEV-H that runs an unknown number of virtual machines
         with an unknown size of each virtual machine with an unknown amount of
         pagecache that could be potentially useful in the host for guest not using
         O_DIRECT (aka cache=off).
      
      hugepages in the virtualization hypervisor (and also in the guest!) are
      much more important than in a regular host not using virtualization,
      becasue with NPT/EPT they decrease the tlb-miss cacheline accesses from 24
      to 19 in case only the hypervisor uses transparent hugepages, and they
      decrease the tlb-miss cacheline accesses from 19 to 15 in case both the
      linux hypervisor and the linux guest both uses this patch (though the
      guest will limit the addition speedup to anonymous regions only for
      now...).  Even more important is that the tlb miss handler is much slower
      on a NPT/EPT guest than for a regular shadow paging or no-virtualization
      scenario.  So maximizing the amount of virtual memory cached by the TLB
      pays off significantly more with NPT/EPT than without (even if there would
      be no significant speedup in the tlb-miss runtime).
      
      The first (and more tedious) part of this work requires allowing the VM to
      handle anonymous hugepages mixed with regular pages transparently on
      regular anonymous vmas.  This is what this patch tries to achieve in the
      least intrusive possible way.  We want hugepages and hugetlb to be used in
      a way so that all applications can benefit without changes (as usual we
      leverage the KVM virtualization design: by improving the Linux VM at
      large, KVM gets the performance boost too).
      
      The most important design choice is: always fallback to 4k allocation if
      the hugepage allocation fails!  This is the _very_ opposite of some large
      pagecache patches that failed with -EIO back then if a 64k (or similar)
      allocation failed...
      
      Second important decision (to reduce the impact of the feature on the
      existing pagetable handling code) is that at any time we can split an
      hugepage into 512 regular pages and it has to be done with an operation
      that can't fail.  This way the reliability of the swapping isn't decreased
      (no need to allocate memory when we are short on memory to swap) and it's
      trivial to plug a split_huge_page* one-liner where needed without
      polluting the VM.  Over time we can teach mprotect, mremap and friends to
      handle pmd_trans_huge natively without calling split_huge_page*.  The fact
      it can't fail isn't just for swap: if split_huge_page would return -ENOMEM
      (instead of the current void) we'd need to rollback the mprotect from the
      middle of it (ideally including undoing the split_vma) which would be a
      big change and in the very wrong direction (it'd likely be simpler not to
      call split_huge_page at all and to teach mprotect and friends to handle
      hugepages instead of rolling them back from the middle).  In short the
      very value of split_huge_page is that it can't fail.
      
      The collapsing and madvise(MADV_HUGEPAGE) part will remain separated and
      incremental and it'll just be an "harmless" addition later if this initial
      part is agreed upon.  It also should be noted that locking-wise replacing
      regular pages with hugepages is going to be very easy if compared to what
      I'm doing below in split_huge_page, as it will only happen when
      page_count(page) matches page_mapcount(page) if we can take the PG_lock
      and mmap_sem in write mode.  collapse_huge_page will be a "best effort"
      that (unlike split_huge_page) can fail at the minimal sign of trouble and
      we can try again later.  collapse_huge_page will be similar to how KSM
      works and the madvise(MADV_HUGEPAGE) will work similar to
      madvise(MADV_MERGEABLE).
      
      The default I like is that transparent hugepages are used at page fault
      time.  This can be changed with
      /sys/kernel/mm/transparent_hugepage/enabled.  The control knob can be set
      to three values "always", "madvise", "never" which mean respectively that
      hugepages are always used, or only inside madvise(MADV_HUGEPAGE) regions,
      or never used.  /sys/kernel/mm/transparent_hugepage/defrag instead
      controls if the hugepage allocation should defrag memory aggressively
      "always", only inside "madvise" regions, or "never".
      
      The pmd_trans_splitting/pmd_trans_huge locking is very solid.  The
      put_page (from get_user_page users that can't use mmu notifier like
      O_DIRECT) that runs against a __split_huge_page_refcount instead was a
      pain to serialize in a way that would result always in a coherent page
      count for both tail and head.  I think my locking solution with a
      compound_lock taken only after the page_first is valid and is still a
      PageHead should be safe but it surely needs review from SMP race point of
      view.  In short there is no current existing way to serialize the O_DIRECT
      final put_page against split_huge_page_refcount so I had to invent a new
      one (O_DIRECT loses knowledge on the mapping status by the time gup_fast
      returns so...).  And I didn't want to impact all gup/gup_fast users for
      now, maybe if we change the gup interface substantially we can avoid this
      locking, I admit I didn't think too much about it because changing the gup
      unpinning interface would be invasive.
      
      If we ignored O_DIRECT we could stick to the existing compound refcounting
      code, by simply adding a get_user_pages_fast_flags(foll_flags) where KVM
      (and any other mmu notifier user) would call it without FOLL_GET (and if
      FOLL_GET isn't set we'd just BUG_ON if nobody registered itself in the
      current task mmu notifier list yet).  But O_DIRECT is fundamental for
      decent performance of virtualized I/O on fast storage so we can't avoid it
      to solve the race of put_page against split_huge_page_refcount to achieve
      a complete hugepage feature for KVM.
      
      Swap and oom works fine (well just like with regular pages ;).  MMU
      notifier is handled transparently too, with the exception of the young bit
      on the pmd, that didn't have a range check but I think KVM will be fine
      because the whole point of hugepages is that EPT/NPT will also use a huge
      pmd when they notice gup returns pages with PageCompound set, so they
      won't care of a range and there's just the pmd young bit to check in that
      case.
      
      NOTE: in some cases if the L2 cache is small, this may slowdown and waste
      memory during COWs because 4M of memory are accessed in a single fault
      instead of 8k (the payoff is that after COW the program can run faster).
      So we might want to switch the copy_huge_page (and clear_huge_page too) to
      not temporal stores.  I also extensively researched ways to avoid this
      cache trashing with a full prefault logic that would cow in 8k/16k/32k/64k
      up to 1M (I can send those patches that fully implemented prefault) but I
      concluded they're not worth it and they add an huge additional complexity
      and they remove all tlb benefits until the full hugepage has been faulted
      in, to save a little bit of memory and some cache during app startup, but
      they still don't improve substantially the cache-trashing during startup
      if the prefault happens in >4k chunks.  One reason is that those 4k pte
      entries copied are still mapped on a perfectly cache-colored hugepage, so
      the trashing is the worst one can generate in those copies (cow of 4k page
      copies aren't so well colored so they trashes less, but again this results
      in software running faster after the page fault).  Those prefault patches
      allowed things like a pte where post-cow pages were local 4k regular anon
      pages and the not-yet-cowed pte entries were pointing in the middle of
      some hugepage mapped read-only.  If it doesn't payoff substantially with
      todays hardware it will payoff even less in the future with larger l2
      caches, and the prefault logic would blot the VM a lot.  If one is
      emebdded transparent_hugepage can be disabled during boot with sysfs or
      with the boot commandline parameter transparent_hugepage=0 (or
      transparent_hugepage=2 to restrict hugepages inside madvise regions) that
      will ensure not a single hugepage is allocated at boot time.  It is simple
      enough to just disable transparent hugepage globally and let transparent
      hugepages be allocated selectively by applications in the MADV_HUGEPAGE
      region (both at page fault time, and if enabled with the
      collapse_huge_page too through the kernel daemon).
      
      This patch supports only hugepages mapped in the pmd, archs that have
      smaller hugepages will not fit in this patch alone.  Also some archs like
      power have certain tlb limits that prevents mixing different page size in
      the same regions so they will not fit in this framework that requires
      "graceful fallback" to basic PAGE_SIZE in case of physical memory
      fragmentation.  hugetlbfs remains a perfect fit for those because its
      software limits happen to match the hardware limits.  hugetlbfs also
      remains a perfect fit for hugepage sizes like 1GByte that cannot be hoped
      to be found not fragmented after a certain system uptime and that would be
      very expensive to defragment with relocation, so requiring reservation.
      hugetlbfs is the "reservation way", the point of transparent hugepages is
      not to have any reservation at all and maximizing the use of cache and
      hugepages at all times automatically.
      
      Some performance result:
      
      vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largep
      ages3
      memset page fault 1566023
      memset tlb miss 453854
      memset second tlb miss 453321
      random access tlb miss 41635
      random access second tlb miss 41658
      vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
      memset page fault 1566471
      memset tlb miss 453375
      memset second tlb miss 453320
      random access tlb miss 41636
      random access second tlb miss 41637
      vmx andrea # ./largepages3
      memset page fault 1566642
      memset tlb miss 453417
      memset second tlb miss 453313
      random access tlb miss 41630
      random access second tlb miss 41647
      vmx andrea # ./largepages3
      memset page fault 1566872
      memset tlb miss 453418
      memset second tlb miss 453315
      random access tlb miss 41618
      random access second tlb miss 41659
      vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
      vmx andrea # ./largepages3
      memset page fault 2182476
      memset tlb miss 460305
      memset second tlb miss 460179
      random access tlb miss 44483
      random access second tlb miss 44186
      vmx andrea # ./largepages3
      memset page fault 2182791
      memset tlb miss 460742
      memset second tlb miss 459962
      random access tlb miss 43981
      random access second tlb miss 43988
      
      ============
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/time.h>
      
      #define SIZE (3UL*1024*1024*1024)
      
      int main()
      {
      	char *p = malloc(SIZE), *p2;
      	struct timeval before, after;
      
      	gettimeofday(&before, NULL);
      	memset(p, 0, SIZE);
      	gettimeofday(&after, NULL);
      	printf("memset page fault %Lu\n",
      	       (after.tv_sec-before.tv_sec)*1000000UL +
      	       after.tv_usec-before.tv_usec);
      
      	gettimeofday(&before, NULL);
      	memset(p, 0, SIZE);
      	gettimeofday(&after, NULL);
      	printf("memset tlb miss %Lu\n",
      	       (after.tv_sec-before.tv_sec)*1000000UL +
      	       after.tv_usec-before.tv_usec);
      
      	gettimeofday(&before, NULL);
      	memset(p, 0, SIZE);
      	gettimeofday(&after, NULL);
      	printf("memset second tlb miss %Lu\n",
      	       (after.tv_sec-before.tv_sec)*1000000UL +
      	       after.tv_usec-before.tv_usec);
      
      	gettimeofday(&before, NULL);
      	for (p2 = p; p2 < p+SIZE; p2 += 4096)
      		*p2 = 0;
      	gettimeofday(&after, NULL);
      	printf("random access tlb miss %Lu\n",
      	       (after.tv_sec-before.tv_sec)*1000000UL +
      	       after.tv_usec-before.tv_usec);
      
      	gettimeofday(&before, NULL);
      	for (p2 = p; p2 < p+SIZE; p2 += 4096)
      		*p2 = 0;
      	gettimeofday(&after, NULL);
      	printf("random access second tlb miss %Lu\n",
      	       (after.tv_sec-before.tv_sec)*1000000UL +
      	       after.tv_usec-before.tv_usec);
      
      	return 0;
      }
      ============
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      71e3aac0
    • A
      thp: kvm mmu transparent hugepage support · 936a5fe6
      Andrea Arcangeli 提交于
      This should work for both hugetlbfs and transparent hugepages.
      
      [akpm@linux-foundation.org: bring forward PageTransCompound() addition for bisectability]
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      936a5fe6