1. 02 11月, 2012 1 次提交
    • O
      xen PVonHVM: use E820_Reserved area for shared_info · 9d02b43d
      Olaf Hering 提交于
      This is a respin of 00e37bdb
      ("xen PVonHVM: move shared_info to MMIO before kexec").
      
      Currently kexec in a PVonHVM guest fails with a triple fault because the
      new kernel overwrites the shared info page. The exact failure depends on
      the size of the kernel image. This patch moves the pfn from RAM into an
      E820 reserved memory area.
      
      The pfn containing the shared_info is located somewhere in RAM. This will
      cause trouble if the current kernel is doing a kexec boot into a new
      kernel. The new kernel (and its startup code) can not know where the pfn
      is, so it can not reserve the page. The hypervisor will continue to update
      the pfn, and as a result memory corruption occours in the new kernel.
      
      The toolstack marks the memory area FC000000-FFFFFFFF as reserved in the
      E820 map. Within that range newer toolstacks (4.3+) will keep 1MB
      starting from FE700000 as reserved for guest use. Older Xen4 toolstacks
      will usually not allocate areas up to FE700000, so FE700000 is expected
      to work also with older toolstacks.
      
      In Xen3 there is no reserved area at a fixed location. If the guest is
      started on such old hosts the shared_info page will be placed in RAM. As
      a result kexec can not be used.
      Signed-off-by: NOlaf Hering <olaf@aepfle.de>
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      9d02b43d
  2. 23 8月, 2012 2 次提交
    • K
      xen/p2m: Add logic to revector a P2M tree to use __va leafs. · 357a3cfb
      Konrad Rzeszutek Wilk 提交于
      During bootup Xen supplies us with a P2M array. It sticks
      it right after the ramdisk, as can be seen with a 128GB PV guest:
      
      (certain parts removed for clarity):
      xc_dom_build_image: called
      xc_dom_alloc_segment:   kernel       : 0xffffffff81000000 -> 0xffffffff81e43000  (pfn 0x1000 + 0xe43 pages)
      xc_dom_pfn_to_ptr: domU mapping: pfn 0x1000+0xe43 at 0x7f097d8bf000
      xc_dom_alloc_segment:   ramdisk      : 0xffffffff81e43000 -> 0xffffffff925c7000  (pfn 0x1e43 + 0x10784 pages)
      xc_dom_pfn_to_ptr: domU mapping: pfn 0x1e43+0x10784 at 0x7f0952dd2000
      xc_dom_alloc_segment:   phys2mach    : 0xffffffff925c7000 -> 0xffffffffa25c7000  (pfn 0x125c7 + 0x10000 pages)
      xc_dom_pfn_to_ptr: domU mapping: pfn 0x125c7+0x10000 at 0x7f0942dd2000
      xc_dom_alloc_page   :   start info   : 0xffffffffa25c7000 (pfn 0x225c7)
      xc_dom_alloc_page   :   xenstore     : 0xffffffffa25c8000 (pfn 0x225c8)
      xc_dom_alloc_page   :   console      : 0xffffffffa25c9000 (pfn 0x225c9)
      nr_page_tables: 0x0000ffffffffffff/48: 0xffff000000000000 -> 0xffffffffffffffff, 1 table(s)
      nr_page_tables: 0x0000007fffffffff/39: 0xffffff8000000000 -> 0xffffffffffffffff, 1 table(s)
      nr_page_tables: 0x000000003fffffff/30: 0xffffffff80000000 -> 0xffffffffbfffffff, 1 table(s)
      nr_page_tables: 0x00000000001fffff/21: 0xffffffff80000000 -> 0xffffffffa27fffff, 276 table(s)
      xc_dom_alloc_segment:   page tables  : 0xffffffffa25ca000 -> 0xffffffffa26e1000  (pfn 0x225ca + 0x117 pages)
      xc_dom_pfn_to_ptr: domU mapping: pfn 0x225ca+0x117 at 0x7f097d7a8000
      xc_dom_alloc_page   :   boot stack   : 0xffffffffa26e1000 (pfn 0x226e1)
      xc_dom_build_image  : virt_alloc_end : 0xffffffffa26e2000
      xc_dom_build_image  : virt_pgtab_end : 0xffffffffa2800000
      
      So the physical memory and virtual (using __START_KERNEL_map addresses)
      layout looks as so:
      
        phys                             __ka
      /------------\                   /-------------------\
      | 0          | empty             | 0xffffffff80000000|
      | ..         |                   | ..                |
      | 16MB       | <= kernel starts  | 0xffffffff81000000|
      | ..         |                   |                   |
      | 30MB       | <= kernel ends => | 0xffffffff81e43000|
      | ..         |  & ramdisk starts | ..                |
      | 293MB      | <= ramdisk ends=> | 0xffffffff925c7000|
      | ..         |  & P2M starts     | ..                |
      | ..         |                   | ..                |
      | 549MB      | <= P2M ends    => | 0xffffffffa25c7000|
      | ..         | start_info        | 0xffffffffa25c7000|
      | ..         | xenstore          | 0xffffffffa25c8000|
      | ..         | cosole            | 0xffffffffa25c9000|
      | 549MB      | <= page tables => | 0xffffffffa25ca000|
      | ..         |                   |                   |
      | 550MB      | <= PGT end     => | 0xffffffffa26e1000|
      | ..         | boot stack        |                   |
      \------------/                   \-------------------/
      
      As can be seen, the ramdisk, P2M and pagetables are taking
      a bit of __ka addresses space. Which is a problem since the
      MODULES_VADDR starts at 0xffffffffa0000000 - and P2M sits
      right in there! This results during bootup with the inability to
      load modules, with this error:
      
      ------------[ cut here ]------------
      WARNING: at /home/konrad/ssd/linux/mm/vmalloc.c:106 vmap_page_range_noflush+0x2d9/0x370()
      Call Trace:
       [<ffffffff810719fa>] warn_slowpath_common+0x7a/0xb0
       [<ffffffff81030279>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
       [<ffffffff81071a45>] warn_slowpath_null+0x15/0x20
       [<ffffffff81130b89>] vmap_page_range_noflush+0x2d9/0x370
       [<ffffffff81130c4d>] map_vm_area+0x2d/0x50
       [<ffffffff811326d0>] __vmalloc_node_range+0x160/0x250
       [<ffffffff810c5369>] ? module_alloc_update_bounds+0x19/0x80
       [<ffffffff810c6186>] ? load_module+0x66/0x19c0
       [<ffffffff8105cadc>] module_alloc+0x5c/0x60
       [<ffffffff810c5369>] ? module_alloc_update_bounds+0x19/0x80
       [<ffffffff810c5369>] module_alloc_update_bounds+0x19/0x80
       [<ffffffff810c70c3>] load_module+0xfa3/0x19c0
       [<ffffffff812491f6>] ? security_file_permission+0x86/0x90
       [<ffffffff810c7b3a>] sys_init_module+0x5a/0x220
       [<ffffffff815ce339>] system_call_fastpath+0x16/0x1b
      ---[ end trace fd8f7704fdea0291 ]---
      vmalloc: allocation failure, allocated 16384 of 20480 bytes
      modprobe: page allocation failure: order:0, mode:0xd2
      
      Since the __va and __ka are 1:1 up to MODULES_VADDR and
      cleanup_highmap rids __ka of the ramdisk mapping, what
      we want to do is similar - get rid of the P2M in the __ka
      address space. There are two ways of fixing this:
      
       1) All P2M lookups instead of using the __ka address would
          use the __va address. This means we can safely erase from
          __ka space the PMD pointers that point to the PFNs for
          P2M array and be OK.
       2). Allocate a new array, copy the existing P2M into it,
          revector the P2M tree to use that, and return the old
          P2M to the memory allocate. This has the advantage that
          it sets the stage for using XEN_ELF_NOTE_INIT_P2M
          feature. That feature allows us to set the exact virtual
          address space we want for the P2M - and allows us to
          boot as initial domain on large machines.
      
      So we pick option 2).
      
      This patch only lays the groundwork in the P2M code. The patch
      that modifies the MMU is called "xen/mmu: Copy and revector the P2M tree."
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      357a3cfb
    • K
      xen/mmu: The xen_setup_kernel_pagetable doesn't need to return anything. · 3699aad0
      Konrad Rzeszutek Wilk 提交于
      We don't need to return the new PGD - as we do not use it.
      Acked-by: NStefano Stabellini <stefano.stabellini@eu.citrix.com>
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      3699aad0
  3. 17 8月, 2012 1 次提交
  4. 14 9月, 2012 1 次提交
  5. 20 7月, 2012 1 次提交
    • O
      xen PVonHVM: move shared_info to MMIO before kexec · 00e37bdb
      Olaf Hering 提交于
      Currently kexec in a PVonHVM guest fails with a triple fault because the
      new kernel overwrites the shared info page. The exact failure depends on
      the size of the kernel image. This patch moves the pfn from RAM into
      MMIO space before the kexec boot.
      
      The pfn containing the shared_info is located somewhere in RAM. This
      will cause trouble if the current kernel is doing a kexec boot into a
      new kernel. The new kernel (and its startup code) can not know where the
      pfn is, so it can not reserve the page. The hypervisor will continue to
      update the pfn, and as a result memory corruption occours in the new
      kernel.
      
      One way to work around this issue is to allocate a page in the
      xen-platform pci device's BAR memory range. But pci init is done very
      late and the shared_info page is already in use very early to read the
      pvclock. So moving the pfn from RAM to MMIO is racy because some code
      paths on other vcpus could access the pfn during the small   window when
      the old pfn is moved to the new pfn. There is even a  small window were
      the old pfn is not backed by a mfn, and during that time all reads
      return -1.
      
      Because it is not known upfront where the MMIO region is located it can
      not be used right from the start in xen_hvm_init_shared_info.
      
      To minimise trouble the move of the pfn is done shortly before kexec.
      This does not eliminate the race because all vcpus are still online when
      the syscore_ops will be called. But hopefully there is no work pending
      at this point in time. Also the syscore_op is run last which reduces the
      risk further.
      Signed-off-by: NOlaf Hering <olaf@aepfle.de>
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      00e37bdb
  6. 08 5月, 2012 1 次提交
    • D
      xen/setup: update VA mapping when releasing memory during setup · 83d51ab4
      David Vrabel 提交于
      In xen_memory_setup(), if a page that is being released has a VA
      mapping this must also be updated.  Otherwise, the page will be not
      released completely -- it will still be referenced in Xen and won't be
      freed util the mapping is removed and this prevents it from being
      reallocated at a different PFN.
      
      This was already being done for the ISA memory region in
      xen_ident_map_ISA() but on many systems this was omitting a few pages
      as many systems marked a few pages below the ISA memory region as
      reserved in the e820 map.
      
      This fixes errors such as:
      
      (XEN) page_alloc.c:1148:d0 Over-allocation for domain 0: 2097153 > 2097152
      (XEN) memory.c:133:d0 Could not allocate order=0 extent: id=0 memflags=0 (0 of 17)
      Signed-off-by: NDavid Vrabel <david.vrabel@citrix.com>
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      83d51ab4
  7. 02 5月, 2012 1 次提交
  8. 06 6月, 2011 1 次提交
  9. 19 5月, 2011 1 次提交
  10. 26 2月, 2011 1 次提交
  11. 02 12月, 2010 1 次提交
  12. 23 10月, 2010 2 次提交
    • J
      xen: add support for PAT · 41f2e477
      Jeremy Fitzhardinge 提交于
      Convert Linux PAT entries into Xen ones when constructing ptes.  Linux
      doesn't use _PAGE_PAT for ptes, so the only difference in the first 4
      entries is that Linux uses _PAGE_PWT for WC, whereas Xen (and default)
      use it for WT.
      
      xen_pte_val does the inverse conversion.
      
      We hard-code assumptions about Linux's current PAT layout, but a
      warning on the wrmsr to MSR_IA32_CR_PAT should point out any problems.
      If necessary we could go to a more general table-based conversion between
      Linux and Xen PAT entries.
      
      hugetlbfs poses a problem at the moment, the x86 architecture uses the
      same flag for _PAGE_PAT and _PAGE_PSE, which changes meaning depending
      on which pagetable level we're using.  At the moment this should be OK
      so long as nobody tries to do a pte_val on a hugetlbfs pte.
      Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      41f2e477
    • J
      xen: make sure xen_max_p2m_pfn is up to date · 2f7acb20
      Jeremy Fitzhardinge 提交于
      Keep xen_max_p2m_pfn up to date with the end of the extra memory
      we're adding.  It is possible that it will be too high since memory
      may be truncated by a "mem=" option on the kernel command line, but
      that won't matter.
      Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      2f7acb20
  13. 05 8月, 2010 1 次提交
  14. 27 7月, 2010 2 次提交
  15. 23 7月, 2010 2 次提交
  16. 04 12月, 2009 2 次提交
    • I
      xen: correctly restore pfn_to_mfn_list_list after resume · fa24ba62
      Ian Campbell 提交于
      pvops kernels >= 2.6.30 can currently only be saved and restored once. The
      second attempt to save results in:
      
          ERROR Internal error: Frame# in pfn-to-mfn frame list is not in pseudophys
          ERROR Internal error: entry 0: p2m_frame_list[0] is 0xf2c2c2c2, max 0x120000
          ERROR Internal error: Failed to map/save the p2m frame list
      
      I finally narrowed it down to:
      
          commit cdaead6b
              Author: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
              Date:   Fri Feb 27 15:34:59 2009 -0800
      
                  xen: split construction of p2m mfn tables from registration
      
                  Build the p2m_mfn_list_list early with the rest of the p2m table, but
                  register it later when the real shared_info structure is in place.
      Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      
      The unforeseen side-effect of this change was to cause the mfn list list to not
      be rebuilt on resume. Prior to this change it would have been rebuilt via
      xen_post_suspend() -> xen_setup_shared_info() -> xen_setup_mfn_list_list().
      
      Fix by explicitly calling xen_build_mfn_list_list() from xen_post_suspend().
      Signed-off-by: NIan Campbell <ian.campbell@citrix.com>
      Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Cc: Stable Kernel <stable@kernel.org>
      fa24ba62
    • I
      xen: re-register runstate area earlier on resume. · be012920
      Ian Campbell 提交于
      This is necessary to ensure the runstate area is available to
      xen_sched_clock before any calls to printk which will require it in
      order to provide a timestamp.
      
      I chose to pull the xen_setup_runstate_info out of xen_time_init into
      the caller in order to maintain parity with calling
      xen_setup_runstate_info separately from calling xen_time_resume.
      Signed-off-by: NIan Campbell <ian.campbell@citrix.com>
      Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Cc: Stable Kernel <stable@kernel.org>
      be012920
  17. 31 8月, 2009 1 次提交
  18. 16 5月, 2009 1 次提交
    • J
      x86: Fix performance regression caused by paravirt_ops on native kernels · b4ecc126
      Jeremy Fitzhardinge 提交于
      Xiaohui Xin and some other folks at Intel have been looking into what's
      behind the performance hit of paravirt_ops when running native.
      
      It appears that the hit is entirely due to the paravirtualized
      spinlocks introduced by:
      
       | commit 8efcbab6
       | Date:   Mon Jul 7 12:07:51 2008 -0700
       |
       |     paravirt: introduce a "lock-byte" spinlock implementation
      
      The extra call/return in the spinlock path is somehow
      causing an increase in the cycles/instruction of somewhere around 2-7%
      (seems to vary quite a lot from test to test).  The working theory is
      that the CPU's pipeline is getting upset about the
      call->call->locked-op->return->return, and seems to be failing to
      speculate (though I haven't seen anything definitive about the precise
      reasons).  This doesn't entirely make sense, because the performance
      hit is also visible on unlock and other operations which don't involve
      locked instructions.  But spinlock operations clearly swamp all the
      other pvops operations, even though I can't imagine that they're
      nearly as common (there's only a .05% increase in instructions
      executed).
      
      If I disable just the pv-spinlock calls, my tests show that pvops is
      identical to non-pvops performance on native (my measurements show that
      it is actually about .1% faster, but Xiaohui shows a .05% slowdown).
      
      Summary of results, averaging 10 runs of the "mmperf" test, using a
      no-pvops build as baseline:
      
      		nopv		Pv-nospin	Pv-spin
      CPU cycles	100.00%		99.89%		102.18%
      instructions	100.00%		100.10%		100.15%
      CPI		100.00%		99.79%		102.03%
      cache ref	100.00%		100.84%		100.28%
      cache miss	100.00%		90.47%		88.56%
      cache miss rate	100.00%		89.72%		88.31%
      branches	100.00%		99.93%		100.04%
      branch miss	100.00%		103.66%		107.72%
      branch miss rt	100.00%		103.73%		107.67%
      wallclock	100.00%		99.90%		102.20%
      
      The clear effect here is that the 2% increase in CPI is
      directly reflected in the final wallclock time.
      
      (The other interesting effect is that the more ops are
      out of line calls via pvops, the lower the cache access
      and miss rates.  Not too surprising, but it suggests that
      the non-pvops kernel is over-inlined.  On the flipside,
      the branch misses go up correspondingly...)
      
      So, what's the fix?
      
      Paravirt patching turns all the pvops calls into direct calls, so
      _spin_lock etc do end up having direct calls.  For example, the compiler
      generated code for paravirtualized _spin_lock is:
      
      <_spin_lock+0>:		mov    %gs:0xb4c8,%rax
      <_spin_lock+9>:		incl   0xffffffffffffe044(%rax)
      <_spin_lock+15>:	callq  *0xffffffff805a5b30
      <_spin_lock+22>:	retq
      
      The indirect call will get patched to:
      <_spin_lock+0>:		mov    %gs:0xb4c8,%rax
      <_spin_lock+9>:		incl   0xffffffffffffe044(%rax)
      <_spin_lock+15>:	callq <__ticket_spin_lock>
      <_spin_lock+20>:	nop; nop		/* or whatever 2-byte nop */
      <_spin_lock+22>:	retq
      
      One possibility is to inline _spin_lock, etc, when building an
      optimised kernel (ie, when there's no spinlock/preempt
      instrumentation/debugging enabled).  That will remove the outer
      call/return pair, returning the instruction stream to a single
      call/return, which will presumably execute the same as the non-pvops
      case.  The downsides arel 1) it will replicate the
      preempt_disable/enable code at eack lock/unlock callsite; this code is
      fairly small, but not nothing; and 2) the spinlock definitions are
      already a very heavily tangled mass of #ifdefs and other preprocessor
      magic, and making any changes will be non-trivial.
      
      The other obvious answer is to disable pv-spinlocks.  Making them a
      separate config option is fairly easy, and it would be trivial to
      enable them only when Xen is enabled (as the only non-default user).
      But it doesn't really address the common case of a distro build which
      is going to have Xen support enabled, and leaves the open question of
      whether the native performance cost of pv-spinlocks is worth the
      performance improvement on a loaded Xen system (10% saving of overall
      system CPU when guests block rather than spin).  Still it is a
      reasonable short-term workaround.
      
      [ Impact: fix pvops performance regression when running native ]
      Analysed-by: N"Xin Xiaohui" <xiaohui.xin@intel.com>
      Analysed-by: N"Li Xin" <xin.li@intel.com>
      Analysed-by: N"Nakajima Jun" <jun.nakajima@intel.com>
      Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Acked-by: NH. Peter Anvin <hpa@zytor.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Xen-devel <xen-devel@lists.xensource.com>
      LKML-Reference: <4A0B62F7.5030802@goop.org>
      [ fixed the help text ]
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      b4ecc126
  19. 09 4月, 2009 1 次提交
  20. 31 3月, 2009 1 次提交
  21. 30 3月, 2009 1 次提交
  22. 05 2月, 2009 1 次提交
  23. 31 1月, 2009 1 次提交
  24. 17 12月, 2008 1 次提交
  25. 01 12月, 2008 1 次提交
  26. 09 9月, 2008 1 次提交
  27. 25 8月, 2008 1 次提交
    • A
      xen: implement CPU hotplugging · d68d82af
      Alex Nixon 提交于
      Note the changes from 2.6.18-xen CPU hotplugging:
      
      A vcpu_down request from the remote admin via Xenbus both hotunplugs the
      CPU, and disables it by removing it from the cpu_present map, and removing
      its entry in /sys.
      
      A vcpu_up request from the remote admin only re-enables the CPU, and does
      not immediately bring the CPU up. A udev event is emitted, which can be
      caught by the user if he wishes to automatically re-up CPUs when available,
      or implement a more complex policy.
      Signed-off-by: NAlex Nixon <alex.nixon@citrix.com>
      Acked-by: NJeremy Fitzhardinge <jeremy@goop.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      d68d82af
  28. 22 8月, 2008 1 次提交
  29. 31 7月, 2008 1 次提交
  30. 24 7月, 2008 1 次提交
  31. 16 7月, 2008 5 次提交