1. 04 12月, 2014 4 次提交
    • J
      xen: switch to linear virtual mapped sparse p2m list · 054954eb
      Juergen Gross 提交于
      At start of the day the Xen hypervisor presents a contiguous mfn list
      to a pv-domain. In order to support sparse memory this mfn list is
      accessed via a three level p2m tree built early in the boot process.
      Whenever the system needs the mfn associated with a pfn this tree is
      used to find the mfn.
      
      Instead of using a software walked tree for accessing a specific mfn
      list entry this patch is creating a virtual address area for the
      entire possible mfn list including memory holes. The holes are
      covered by mapping a pre-defined  page consisting only of "invalid
      mfn" entries. Access to a mfn entry is possible by just using the
      virtual base address of the mfn list and the pfn as index into that
      list. This speeds up the (hot) path of determining the mfn of a
      pfn.
      
      Kernel build on a Dell Latitude E6440 (2 cores, HT) in 64 bit Dom0
      showed following improvements:
      
      Elapsed time: 32:50 ->  32:35
      System:       18:07 ->  17:47
      User:        104:00 -> 103:30
      
      Tested with following configurations:
      - 64 bit dom0, 8GB RAM
      - 64 bit dom0, 128 GB RAM, PCI-area above 4 GB
      - 32 bit domU, 512 MB, 8 GB, 43 GB (more wouldn't work even without
                                          the patch)
      - 32 bit domU, ballooning up and down
      - 32 bit domU, save and restore
      - 32 bit domU with PCI passthrough
      - 64 bit domU, 8 GB, 2049 MB, 5000 MB
      - 64 bit domU, ballooning up and down
      - 64 bit domU, save and restore
      - 64 bit domU with PCI passthrough
      Signed-off-by: NJuergen Gross <jgross@suse.com>
      Signed-off-by: NDavid Vrabel <david.vrabel@citrix.com>
      054954eb
    • J
      xen: Hide get_phys_to_machine() to be able to tune common path · 0aad5689
      Juergen Gross 提交于
      Today get_phys_to_machine() is always called when the mfn for a pfn
      is to be obtained. Add a wrapper __pfn_to_mfn() as inline function
      to be able to avoid calling get_phys_to_machine() when possible as
      soon as the switch to a linear mapped p2m list has been done.
      Signed-off-by: NJuergen Gross <jgross@suse.com>
      Reviewed-by: NDavid Vrabel <david.vrabel@citrix.com>
      Signed-off-by: NDavid Vrabel <david.vrabel@citrix.com>
      0aad5689
    • J
      xen: Delay remapping memory of pv-domain · 1f3ac86b
      Juergen Gross 提交于
      Early in the boot process the memory layout of a pv-domain is changed
      to match the E820 map (either the host one for Dom0 or the Xen one)
      regarding placement of RAM and PCI holes. This requires removing memory
      pages initially located at positions not suitable for RAM and adding
      them later at higher addresses where no restrictions apply.
      
      To be able to operate on the hypervisor supported p2m list until a
      virtual mapped linear p2m list can be constructed, remapping must
      be delayed until virtual memory management is initialized, as the
      initial p2m list can't be extended unlimited at physical memory
      initialization time due to it's fixed structure.
      
      A further advantage is the reduction in complexity and code volume as
      we don't have to be careful regarding memory restrictions during p2m
      updates.
      Signed-off-by: NJuergen Gross <jgross@suse.com>
      Reviewed-by: NDavid Vrabel <david.vrabel@citrix.com>
      Signed-off-by: NDavid Vrabel <david.vrabel@citrix.com>
      1f3ac86b
    • J
      xen: use common page allocation function in p2m.c · 7108c9ce
      Juergen Gross 提交于
      In arch/x86/xen/p2m.c three different allocation functions for
      obtaining a memory page are used: extend_brk(), alloc_bootmem_align()
      or __get_free_page().  Which of those functions is used depends on the
      progress of the boot process of the system.
      
      Introduce a common allocation routine selecting the to be called
      allocation routine dynamically based on the boot progress. This allows
      moving initialization steps without having to care about changing
      allocation calls.
      Signed-off-by: NJuergen Gross <jgross@suse.com>
      Signed-off-by: NDavid Vrabel <david.vrabel@citrix.com>
      7108c9ce
  2. 23 10月, 2014 1 次提交
    • J
      x86/xen: delay construction of mfn_list_list · 2c185687
      Juergen Gross 提交于
      The 3 level p2m tree for the Xen tools is constructed very early at
      boot by calling xen_build_mfn_list_list(). Memory needed for this tree
      is allocated via extend_brk().
      
      As this tree (other than the kernel internal p2m tree) is only needed
      for domain save/restore, live migration and crash dump analysis it
      doesn't matter whether it is constructed very early or just some
      milliseconds later when memory allocation is possible by other means.
      
      This patch moves the call of xen_build_mfn_list_list() just after
      calling xen_pagetable_p2m_copy() simplifying this function, too, as it
      doesn't have to bother with two parallel trees now. The same applies
      for some other internal functions.
      
      While simplifying code, make early_can_reuse_p2m_middle() static and
      drop the unused second parameter. p2m_mid_identity_mfn can be removed
      as well, it isn't used either.
      Signed-off-by: NJuergen Gross <jgross@suse.com>
      Signed-off-by: NDavid Vrabel <david.vrabel@citrix.com>
      2c185687
  3. 23 9月, 2014 1 次提交
  4. 10 9月, 2014 1 次提交
    • S
      x86/xen: don't copy bogus duplicate entries into kernel page tables · 0b5a5063
      Stefan Bader 提交于
      When RANDOMIZE_BASE (KASLR) is enabled; or the sum of all loaded
      modules exceeds 512 MiB, then loading modules fails with a warning
      (and hence a vmalloc allocation failure) because the PTEs for the
      newly-allocated vmalloc address space are not zero.
      
        WARNING: CPU: 0 PID: 494 at linux/mm/vmalloc.c:128
                 vmap_page_range_noflush+0x2a1/0x360()
      
      This is caused by xen_setup_kernel_pagetables() copying
      level2_kernel_pgt into level2_fixmap_pgt, overwriting many non-present
      entries.
      
      Without KASLR, the normal kernel image size only covers the first half
      of level2_kernel_pgt and module space starts after that.
      
      L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[  0..255]->kernel
                                                        [256..511]->module
                                [511]->level2_fixmap_pgt[  0..505]->module
      
      This allows 512 MiB of of module vmalloc space to be used before
      having to use the corrupted level2_fixmap_pgt entries.
      
      With KASLR enabled, the kernel image uses the full PUD range of 1G and
      module space starts in the level2_fixmap_pgt. So basically:
      
      L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[0..511]->kernel
                                [511]->level2_fixmap_pgt[0..505]->module
      
      And now no module vmalloc space can be used without using the corrupt
      level2_fixmap_pgt entries.
      
      Fix this by properly converting the level2_fixmap_pgt entries to MFNs,
      and setting level1_fixmap_pgt as read-only.
      
      A number of comments were also using the the wrong L3 offset for
      level2_kernel_pgt.  These have been corrected.
      Signed-off-by: NStefan Bader <stefan.bader@canonical.com>
      Signed-off-by: NDavid Vrabel <david.vrabel@citrix.com>
      Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: stable@vger.kernel.org
      0b5a5063
  5. 27 5月, 2014 1 次提交
  6. 15 5月, 2014 1 次提交
  7. 06 5月, 2014 1 次提交
  8. 25 3月, 2014 1 次提交
    • D
      Revert "xen: properly account for _PAGE_NUMA during xen pte translations" · 5926f87f
      David Vrabel 提交于
      This reverts commit a9c8e4be.
      
      PTEs in Xen PV guests must contain machine addresses if _PAGE_PRESENT
      is set and pseudo-physical addresses is _PAGE_PRESENT is clear.
      
      This is because during a domain save/restore (migration) the page
      table entries are "canonicalised" and uncanonicalised". i.e., MFNs are
      converted to PFNs during domain save so that on a restore the page
      table entries may be rewritten with the new MFNs on the destination.
      This canonicalisation is only done for PTEs that are present.
      
      This change resulted in writing PTEs with MFNs if _PAGE_PROTNONE (or
      _PAGE_NUMA) was set but _PAGE_PRESENT was clear.  These PTEs would be
      migrated as-is which would result in unexpected behaviour in the
      destination domain.  Either a) the MFN would be translated to the
      wrong PFN/page; b) setting the _PAGE_PRESENT bit would clear the PTE
      because the MFN is no longer owned by the domain; or c) the present
      bit would not get set.
      
      Symptoms include "Bad page" reports when munmapping after migrating a
      domain.
      Signed-off-by: NDavid Vrabel <david.vrabel@citrix.com>
      Acked-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: <stable@vger.kernel.org>        [3.12+]
      5926f87f
  9. 14 3月, 2014 1 次提交
  10. 11 2月, 2014 1 次提交
    • M
      xen: properly account for _PAGE_NUMA during xen pte translations · a9c8e4be
      Mel Gorman 提交于
      Steven Noonan forwarded a users report where they had a problem starting
      vsftpd on a Xen paravirtualized guest, with this in dmesg:
      
        BUG: Bad page map in process vsftpd  pte:8000000493b88165 pmd:e9cc01067
        page:ffffea00124ee200 count:0 mapcount:-1 mapping:     (null) index:0x0
        page flags: 0x2ffc0000000014(referenced|dirty)
        addr:00007f97eea74000 vm_flags:00100071 anon_vma:ffff880e98f80380 mapping:          (null) index:7f97eea74
        CPU: 4 PID: 587 Comm: vsftpd Not tainted 3.12.7-1-ec2 #1
        Call Trace:
          dump_stack+0x45/0x56
          print_bad_pte+0x22e/0x250
          unmap_single_vma+0x583/0x890
          unmap_vmas+0x65/0x90
          exit_mmap+0xc5/0x170
          mmput+0x65/0x100
          do_exit+0x393/0x9e0
          do_group_exit+0xcc/0x140
          SyS_exit_group+0x14/0x20
          system_call_fastpath+0x1a/0x1f
        Disabling lock debugging due to kernel taint
        BUG: Bad rss-counter state mm:ffff880e9ca60580 idx:0 val:-1
        BUG: Bad rss-counter state mm:ffff880e9ca60580 idx:1 val:1
      
      The issue could not be reproduced under an HVM instance with the same
      kernel, so it appears to be exclusive to paravirtual Xen guests.  He
      bisected the problem to commit 1667918b ("mm: numa: clear numa
      hinting information on mprotect") that was also included in 3.12-stable.
      
      The problem was related to how xen translates ptes because it was not
      accounting for the _PAGE_NUMA bit.  This patch splits pte_present to add
      a pteval_present helper for use by xen so both bare metal and xen use
      the same code when checking if a PTE is present.
      
      [mgorman@suse.de: wrote changelog, proposed minor modifications]
      [akpm@linux-foundation.org: fix typo in comment]
      Reported-by: NSteven Noonan <steven@uplinklabs.net>
      Tested-by: NSteven Noonan <steven@uplinklabs.net>
      Signed-off-by: NElena Ufimtseva <ufimtseva@gmail.com>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NDavid Vrabel <david.vrabel@citrix.com>
      Acked-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a9c8e4be
  11. 30 1月, 2014 1 次提交
  12. 06 1月, 2014 4 次提交
  13. 15 11月, 2013 2 次提交
  14. 10 10月, 2013 2 次提交
  15. 27 9月, 2013 1 次提交
  16. 12 4月, 2013 1 次提交
    • K
      x86: Use a read-only IDT alias on all CPUs · 4eefbe79
      Kees Cook 提交于
      Make a copy of the IDT (as seen via the "sidt" instruction) read-only.
      This primarily removes the IDT from being a target for arbitrary memory
      write attacks, and has the added benefit of also not leaking the kernel
      base offset, if it has been relocated.
      
      We already did this on vendor == Intel and family == 5 because of the
      F0 0F bug -- regardless of if a particular CPU had the F0 0F bug or
      not.  Since the workaround was so cheap, there simply was no reason to
      be very specific.  This patch extends the readonly alias to all CPUs,
      but does not activate the #PF to #UD conversion code needed to deliver
      the proper exception in the F0 0F case except on Intel family 5
      processors.
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Link: http://lkml.kernel.org/r/20130410192422.GA17344@www.outflux.net
      Cc: Eric Northup <digitaleric@google.com>
      Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
      4eefbe79
  17. 11 4月, 2013 1 次提交
  18. 03 4月, 2013 1 次提交
    • K
      xen/mmu: On early bootup, flush the TLB when changing RO->RW bits Xen provided pagetables. · b2222794
      Konrad Rzeszutek Wilk 提交于
      Occassionaly on a DL380 G4 the guest would crash quite early with this:
      
      (XEN) d244:v0: unhandled page fault (ec=0003)
      (XEN) Pagetable walk from ffffffff84dc7000:
      (XEN)  L4[0x1ff] = 00000000c3f18067 0000000000001789
      (XEN)  L3[0x1fe] = 00000000c3f14067 000000000000178d
      (XEN)  L2[0x026] = 00000000dc8b2067 0000000000004def
      (XEN)  L1[0x1c7] = 00100000dc8da067 0000000000004dc7
      (XEN) domain_crash_sync called from entry.S
      (XEN) Domain 244 (vcpu#0) crashed on cpu#3:
      (XEN) ----[ Xen-4.1.3OVM  x86_64  debug=n  Not tainted ]----
      (XEN) CPU:    3
      (XEN) RIP:    e033:[<ffffffff81263f22>]
      (XEN) RFLAGS: 0000000000000216   EM: 1   CONTEXT: pv guest
      (XEN) rax: 0000000000000000   rbx: ffffffff81785f88   rcx: 000000000000003f
      (XEN) rdx: 0000000000000000   rsi: 00000000dc8da063   rdi: ffffffff84dc7000
      
      The offending code shows it to be a loop writting the value zero
      (%rax) in the %rdi (the L4 provided by Xen) register:
      
         0: 44 00 00             add    %r8b,(%rax)
         3: 31 c0                 xor    %eax,%eax
         5: b9 40 00 00 00       mov    $0x40,%ecx
         a: 66 0f 1f 84 00 00 00 nopw   0x0(%rax,%rax,1)
        11: 00 00
        13: ff c9                 dec    %ecx
        15:* 48 89 07             mov    %rax,(%rdi)     <-- trapping instruction
        18: 48 89 47 08           mov    %rax,0x8(%rdi)
        1c: 48 89 47 10           mov    %rax,0x10(%rdi)
      
      which fails. xen_setup_kernel_pagetable recycles some of the Xen's
      page-table entries when it has switched over to its Linux page-tables.
      
      Right before try to clear the page, we  make a hypercall to change
      it from _RO to  _RW and that works (otherwise we would hit an BUG()).
      And the _RW flag is set for that page:
      (XEN)  L1[0x1c7] = 001000004885f067 0000000000004dc7
      
      The error code is 3, so PFEC_page_present and PFEC_write_access, so page is
      present (correct), and we tried to write to the page, but a violation
      occurred. The one theory is that the the page entries in hardware
      (which are cached) are not up to date with what we just set. Especially
      as we have just done an CR3 write and flushed the multicalls.
      
      This patch does solve the problem by flusing out the TLB page
      entry after changing it from _RO to _RW and we don't hit this
      issue anymore.
      
      Fixed-Oracle-Bug: 16243091 [ON OCCASIONS VM START GOES INTO
      'CRASH' STATE: CLEAR_PAGE+0X12 ON HP DL380 G4]
      Reported-and-Tested-by: NSaar Maoz <Saar.Maoz@oracle.com>
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      b2222794
  19. 28 3月, 2013 1 次提交
  20. 23 2月, 2013 1 次提交
    • K
      x86-64, xen, mmu: Provide an early version of write_cr3. · 0cc9129d
      Konrad Rzeszutek Wilk 提交于
      With commit 8170e6be ("x86, 64bit: Use a #PF handler to materialize
      early mappings on demand") we started hitting an early bootup crash
      where the Xen hypervisor would inform us that:
      
          (XEN) d7:v0: unhandled page fault (ec=0000)
          (XEN) Pagetable walk from ffffea000005b2d0:
          (XEN)  L4[0x1d4] = 0000000000000000 ffffffffffffffff
          (XEN) domain_crash_sync called from entry.S
          (XEN) Domain 7 (vcpu#0) crashed on cpu#3:
          (XEN) ----[ Xen-4.2.0  x86_64  debug=n  Not tainted ]----
      
      .. that Xen was unable to context switch back to dom0.
      
      Looking at the calling stack we find:
      
          [<ffffffff8103feba>] xen_get_user_pgd+0x5a  <--
          [<ffffffff8103feba>] xen_get_user_pgd+0x5a
          [<ffffffff81042d27>] xen_write_cr3+0x77
          [<ffffffff81ad2d21>] init_mem_mapping+0x1f9
          [<ffffffff81ac293f>] setup_arch+0x742
          [<ffffffff81666d71>] printk+0x48
      
      We are trying to figure out whether we need to up-date the user PGD as
      well.  Please keep in mind that under 64-bit PV guests we have a limited
      amount of rings: 0 for the Hypervisor, and 1 for both the Linux kernel
      and user-space.  As such the Linux pvops'fied version of write_cr3
      checks if it has to update the user-space cr3 as well.
      
      That clearly is not needed during early bootup.  The recent changes (see
      above git commit) streamline the x86 page table allocation to be much
      simpler (And also incidentally the #PF handler ends up in spirit being
      similar to how the Xen toolstack sets up the initial page-tables).
      
      The fix is to have an early-bootup version of cr3 that just loads the
      kernel %cr3.  The later version - which also handles user-page
      modifications will be used after the initial page tables have been
      setup.
      
      [ hpa: removed a redundant #ifdef and made the new function __init.
        Also note that x86-32 already has such an early xen_write_cr3. ]
      Tested-by: N"H. Peter Anvin" <hpa@zytor.com>
      Reported-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Link: http://lkml.kernel.org/r/1361579812-23709-1-git-send-email-konrad.wilk@oracle.comSigned-off-by: NH. Peter Anvin <hpa@zytor.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0cc9129d
  21. 29 11月, 2012 2 次提交
  22. 18 11月, 2012 1 次提交
  23. 01 11月, 2012 1 次提交
  24. 09 10月, 2012 1 次提交
    • K
      mm: kill vma flag VM_RESERVED and mm->reserved_vm counter · 314e51b9
      Konstantin Khlebnikov 提交于
      A long time ago, in v2.4, VM_RESERVED kept swapout process off VMA,
      currently it lost original meaning but still has some effects:
      
       | effect                 | alternative flags
      -+------------------------+---------------------------------------------
      1| account as reserved_vm | VM_IO
      2| skip in core dump      | VM_IO, VM_DONTDUMP
      3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
      4| do not mlock           | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
      
      This patch removes reserved_vm counter from mm_struct.  Seems like nobody
      cares about it, it does not exported into userspace directly, it only
      reduces total_vm showed in proc.
      
      Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP.
      
      remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
      remap_vmalloc_range() set VM_DONTEXPAND | VM_DONTDUMP.
      
      [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      314e51b9
  25. 04 10月, 2012 1 次提交
    • O
      xen pv-on-hvm: add pfn_is_ram helper for kdump · 34b6f01a
      Olaf Hering 提交于
      Register pfn_is_ram helper speed up reading /proc/vmcore in the kdump
      kernel. See commit message of 997c136f ("fs/proc/vmcore.c: add hook
      to read_from_oldmem() to check for non-ram pages") for details.
      
      It makes use of a new hvmop HVMOP_get_mem_type which was introduced in
      xen 4.2 (23298:26413986e6e0) and backported to 4.1.1.
      
      The new function is currently only enabled for reading /proc/vmcore.
      Later it will be used also for the kexec kernel. Since that requires
      more changes in the generic kernel make it static for the time being.
      Signed-off-by: NOlaf Hering <olaf@aepfle.de>
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      34b6f01a
  26. 12 9月, 2012 4 次提交
  27. 06 9月, 2012 1 次提交
  28. 05 9月, 2012 1 次提交