1. 14 March 2011 (3 commits)
    • xen/mmu: WARN_ON when racing to swap middle leaf. · c7617798
      Committed by Konrad Rzeszutek Wilk
      The initial bootup code uses set_phys_to_machine quite a lot, and after
      bootup it would be used by the balloon driver. The balloon driver does hold
      a mutex, so this should not be necessary - but just in case, add
      a WARN_ON if we do hit this scenario. If we do fail here, it is OK
      to continue, as there is a backup mechanism (VM_IO) that can bypass
      the P2M and still set the _PAGE_IOMAP flag.
      
      [v2: Change from WARN to BUG_ON]
      [v3: Rebased on top of xen->p2m code split]
      [v4: Change from BUG_ON to WARN]
      Reviewed-by: Ian Campbell <Ian.Campbell@eu.citrix.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    • xen/mmu: Set _PAGE_IOMAP if PFN is an identity PFN. · fb38923e
      Committed by Konrad Rzeszutek Wilk
      If we find that the PFN is within the P2M as an identity PFN, make sure
      to tack on the _PAGE_IOMAP flag.
      Reviewed-by: Ian Campbell <ian.campbell@citrix.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    • xen/mmu: Add the notion of identity (1-1) mapping. · f4cec35b
      Committed by Konrad Rzeszutek Wilk
      Our P2M tree structure is three-level. In the leaf nodes
      we store the Machine Frame Number (MFN) of the PFN. What this means
      is that when one does pfn_to_mfn(pfn), which is used when creating
      PTE entries, you get the real MFN of the hardware. When Xen sets
      up a guest it initially populates an array which has descending
      (or ascending) MFN values, like so:
      
       idx: 0,  1,       2
       [0x290F, 0x290E, 0x290D, ..]
      
      so pfn_to_mfn(2)==0x290D. If you start and restart many guests, that list
      starts looking quite random.
      
      We graft this structure onto our P2M tree structure and stick those
      MFNs in the leaves. But for all other leaf entries, or for the top
      root or middle ones, for which there is a void entry, we assume they
      are "missing". So
       pfn_to_mfn(0xc0000)=INVALID_P2M_ENTRY.
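
      For orientation, here is a minimal user-space sketch of that three-level
      lookup (the constants assume 64-bit entries and 4K pages; the flat tables
      and the sketch_lookup name are illustrative, not the kernel's):

      #include <stdio.h>

      #define P2M_PER_PAGE      512UL   /* 4096 / sizeof(unsigned long)       */
      #define P2M_MID_PER_PAGE  512UL   /* pointers per middle page           */
      #define INVALID_P2M_ENTRY (~0UL)

      /* Illustrative stand-ins for the tree the kernel builds at boot. */
      static unsigned long leaf_missing[512];      /* all INVALID              */
      static unsigned long *mid_missing[512];      /* all -> leaf_missing      */
      static unsigned long **top[512];             /* all -> mid_missing       */

      static unsigned long sketch_lookup(unsigned long pfn)
      {
              unsigned long topidx = pfn / (P2M_MID_PER_PAGE * P2M_PER_PAGE);
              unsigned long mididx = (pfn / P2M_PER_PAGE) % P2M_MID_PER_PAGE;
              unsigned long idx    = pfn % P2M_PER_PAGE;

              /* A leaf entry holds the MFN, or ~0 when the page is "missing". */
              return top[topidx][mididx][idx];
      }

      int main(void)
      {
              for (int i = 0; i < 512; i++) {
                      leaf_missing[i] = INVALID_P2M_ENTRY;
                      mid_missing[i]  = leaf_missing;
                      top[i]          = mid_missing;
              }
              printf("%#lx\n", sketch_lookup(0xc0000));   /* prints ~0: missing */
              return 0;
      }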
      
      We add the possibility of setting 1-1 mappings on certain regions, so
      that:
       pfn_to_mfn(0xc0000)=0xc0000
      
      The benefit of this is that for non-RAM regions (think
      PCI BARs, or ACPI spaces) we can create mappings easily, because we
      get the PFN value to match the MFN.
      
      For this to work efficiently we introduce one new page, p2m_identity, and
      allocate (via reserve_brk) any other pages we need to cover the edges
      (1GB or 4MB boundary violations). All entries in p2m_identity are set to
      INVALID_P2M_ENTRY (the Xen toolstack only recognizes that and real MFNs,
      no other fancy values).
      
      On lookup we spot that the entry points to p2m_identity and return the
      identity value instead of dereferencing and returning INVALID_P2M_ENTRY.
      If the entry points to an allocated page, we just proceed as before and
      return the PFN. If the PFN has IDENTITY_FRAME_BIT set, we unmask it in
      the appropriate functions (pfn_to_mfn).
      
      The reason for having the IDENTITY_FRAME_BIT instead of just returning the
      PFN is that we could find ourselves in a situation where pfn_to_mfn(pfn)==pfn
      for a non-identity PFN. To protect ourselves against this, we elect to set
      (and get) the IDENTITY_FRAME_BIT on all identity-mapped PFNs.
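
      The bit layout, roughly as described here and in the [v9] note below (a
      compile-only sketch; treat the exact macro names as assumptions):

      #include <limits.h>

      #define XEN_BITS_PER_LONG   (sizeof(unsigned long) * CHAR_BIT)
      #define FOREIGN_FRAME_BIT   (1UL << (XEN_BITS_PER_LONG - 1))
      #define IDENTITY_FRAME_BIT  (1UL << (XEN_BITS_PER_LONG - 2))
      #define IDENTITY_FRAME(pfn) ((pfn) | IDENTITY_FRAME_BIT)

      /* Sketch of the unmasking step in pfn_to_mfn(): an identity entry stores
       * IDENTITY_FRAME(pfn), so a stored value can never be confused with a
       * coincidental pfn_to_mfn(pfn) == pfn of a non-identity page. */
      static unsigned long sketch_unmask(unsigned long entry)
      {
              if (entry == ~0UL)        /* INVALID_P2M_ENTRY stays as-is */
                      return entry;
              return entry & ~(FOREIGN_FRAME_BIT | IDENTITY_FRAME_BIT);
      }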
      
      This simplistic diagram is used to explain the more subtle pieces of code.
      There is also a diagram of the P2M at the end that can help.
      Imagine your E820 looking like this:
      
                         1GB                                           2GB
      /-------------------+---------\/----\         /----------\    /---+-----\
      | System RAM        | Sys RAM ||ACPI|         | reserved |    | Sys RAM |
      \-------------------+---------/\----/         \----------/    \---+-----/
                                    ^- 1029MB                       ^- 2001MB
      
      [1029MB = 263424 (0x40500), 2001MB = 512256 (0x7D100), 2048MB = 524288 (0x80000)]
      
      And dom0_mem=max:3GB,1GB is passed to the guest, meaning memory past 1GB
      is actually not present (you would have to kick the balloon driver to put
      it in).
      
      When we are told to set the PFNs for identity mapping (see patch: "xen/setup:
      Set identity mapping for non-RAM E820 and E820 gaps.") we pass in the start
      PFN and the end PFN (263424 and 512256 respectively). The first step is
      to reserve_brk a top leaf page if p2m[1] is missing. A top leaf page
      covers 512^2 of page estate (1GB), and in case the start or end PFN is not
      aligned on 512^2*PAGE_SIZE (1GB) we loop over 1GB-aligned PFNs from the
      start PFN to the end PFN. We reserve_brk top leaf pages if they are missing
      (meaning they point to p2m_mid_missing).
      
      With the E820 example above, 263424 is not 1GB aligned, so we allocate a
      reserve_brk page which will cover the PFN estate from 0x40000 to 0x80000.
      Each entry in the allocated page is "missing" (points to p2m_missing).
      
      The next stage is to determine if we need to do a more granular boundary
      check on the 4MB (or 2MB, depending on architecture) chunks at the start
      and end PFNs. We check if the start PFN and end PFN violate that boundary
      check, and if so reserve_brk a middle (p2m[x][y]) leaf page. This way we
      have a much finer granularity for marking which PFNs are missing and which
      ones are identity. In our example 263424 and 512256 both fail the check,
      so we reserve_brk two pages, populate them with INVALID_P2M_ENTRY (so they
      both have "missing" values) and assign them to p2m[1][2] and p2m[1][488]
      respectively.
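
      As a sanity check on those indices, this tiny program (assuming the 64-bit
      layout of 512 entries per leaf and 512 leaves per top entry) reproduces the
      p2m[1][2][256] and p2m[1][488][256] decomposition of the two edge PFNs:

      #include <stdio.h>

      #define P2M_PER_PAGE      512UL
      #define P2M_MID_PER_PAGE  512UL

      static void decompose(unsigned long pfn)
      {
              printf("pfn %lu -> p2m[%lu][%lu][%lu]\n", pfn,
                     pfn / (P2M_MID_PER_PAGE * P2M_PER_PAGE),
                     (pfn / P2M_PER_PAGE) % P2M_MID_PER_PAGE,
                     pfn % P2M_PER_PAGE);
      }

      int main(void)
      {
              decompose(263424);   /* 1029MB: p2m[1][2][256]   - unaligned start */
              decompose(512256);   /* 2001MB: p2m[1][488][256] - unaligned end   */
              return 0;
      }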
      
      At this point we would at minimum reserve_brk one page, but it could be up
      to three. Each call to set_phys_range_identity has at most a three-page
      cost. If we were to query the P2M at this stage, all those entries from the
      start PFN through the end PFN (so 1029MB -> 2001MB) would return
      INVALID_P2M_ENTRY ("missing").
      
      The next step is to walk from the start PFN to the end PFN, setting
      the IDENTITY_FRAME_BIT on each PFN. This is done in set_phys_range_identity.
      If we find that a middle leaf is pointing to p2m_missing, we can swap it over
      to p2m_identity - this way covering 4MB (or 2MB) of PFN space. At this point
      we do not need to worry about boundary alignment (so no need to reserve_brk
      a middle page, or figure out which PFNs are "missing" and which ones are
      identity), as that has been done earlier. If we find that the middle leaf
      is not occupied by p2m_identity or p2m_missing, we dereference that page
      (which covers 512 PFNs) and set the appropriate PFN with IDENTITY_FRAME_BIT.
      In our example 263424 and 512256 end up there, and we set p2m[1][2][256->511]
      and p2m[1][488][0->256] with IDENTITY_FRAME_BIT set.
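
      A compact user-space model of that walk (a sketch, not the kernel code:
      LEAF, leaf_of[] and the two edge pages stand in for the real p2m tree, and
      IDENTITY_BIT for IDENTITY_FRAME_BIT):

      #include <stdio.h>

      #define LEAF          512UL
      #define INVALID       (~0UL)
      #define IDENTITY_BIT  (1UL << 62)

      static unsigned long p2m_missing[LEAF];   /* shared, all INVALID          */
      static unsigned long p2m_identity[LEAF];  /* shared marker, all INVALID   */
      static unsigned long *leaf_of[1024];      /* flat stand-in for p2m[x][y]  */

      static void set_range_identity(unsigned long s, unsigned long e)
      {
              for (unsigned long pfn = s; pfn < e; pfn++) {
                      unsigned long li = pfn / LEAF, idx = pfn % LEAF;

                      if (leaf_of[li] == p2m_missing)
                              leaf_of[li] = p2m_identity;   /* swap whole leaf  */
                      else if (leaf_of[li] != p2m_identity)
                              leaf_of[li][idx] = pfn | IDENTITY_BIT;  /* edges  */
              }
      }

      static unsigned long lookup(unsigned long pfn)
      {
              unsigned long li = pfn / LEAF, idx = pfn % LEAF;

              if (leaf_of[li] == p2m_identity)
                      return pfn | IDENTITY_BIT;  /* synthesized, page untouched */
              return leaf_of[li][idx];
      }

      int main(void)
      {
              static unsigned long edge_a[LEAF], edge_b[LEAF]; /* the brk pages  */

              for (unsigned long i = 0; i < LEAF; i++) {
                      p2m_missing[i] = p2m_identity[i] = INVALID;
                      edge_a[i] = edge_b[i] = INVALID;
              }
              for (int i = 0; i < 1024; i++)
                      leaf_of[i] = p2m_missing;
              leaf_of[263424 / LEAF] = edge_a;   /* p2m[1][2] in the text above  */
              leaf_of[512256 / LEAF] = edge_b;   /* p2m[1][488]                  */

              set_range_identity(263424, 512256);

              printf("%#lx\n", lookup(263423));                  /* INVALID (~0) */
              printf("%lu\n",  lookup(263424) & ~IDENTITY_BIT);  /* 263424       */
              printf("%lu\n",  lookup(400000) & ~IDENTITY_BIT);  /* 400000       */
              return 0;
      }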
      
      All other regions that are void (or not filled) either point to p2m_missing
      (considered missing) or have the default value of INVALID_P2M_ENTRY (also
      considered missing). In our case, p2m[1][2][0->255] and p2m[1][488][257->511]
      contain the INVALID_P2M_ENTRY value and are considered "missing."
      
      This is what the p2m ends up looking like (for the E820 above), with this
      fabulous drawing:
      
         p2m         /--------------\
       /-----\       | &mfn_list[0],|                           /-----------------\
       |  0  |------>| &mfn_list[1],|    /---------------\      | ~0, ~0, ..      |
       |-----|       |  ..., ~0, ~0 |    | ~0, ~0, [x]---+----->| IDENTITY [@256] |
       |  1  |---\   \--------------/    | [p2m_identity]+\     | IDENTITY [@257] |
       |-----|    \                      | [p2m_identity]+\\    | ....            |
       |  2  |--\  \-------------------->|  ...          | \\   \----------------/
       |-----|   \                       \---------------/  \\
       |  3  |\   \                                          \\  p2m_identity
       |-----| \   \-------------------->/---------------\   /-----------------\
       | ..  +->+                        | [p2m_identity]+-->| ~0, ~0, ~0, ... |
       \-----/ /                         | [p2m_identity]+-->| ..., ~0         |
              / /---------------\        | ....          |   \-----------------/
             /  | IDENTITY[@0]  |      /-+-[x], ~0, ~0.. |
            /   | IDENTITY[@256]|<----/  \---------------/
           /    | ~0, ~0, ....  |
          |     \---------------/
          |
          p2m_missing             p2m_missing
      /------------------\     /------------\
      | [p2m_mid_missing]+---->| ~0, ~0, ~0 |
      | [p2m_mid_missing]+---->| ..., ~0    |
      \------------------/     \------------/
      
      where ~0 is INVALID_P2M_ENTRY and IDENTITY is (PFN | IDENTITY_BIT).
      Reviewed-by: Ian Campbell <ian.campbell@citrix.com>
      [v5: Changed code to use ranges, added ASCII art]
      [v6: Rebased on top of xen->p2m code split]
      [v4: Squished patches in just this one]
      [v7: Added RESERVE_BRK for potentially allocated pages]
      [v8: Fixed alignment problem]
      [v9: Changed 1<<3X to 1<<BITS_PER_LONG-X]
      [v10: Copied git commit description in the p2m code + Add Review tag]
      [v11: Title had '2-1' - should be '1-1' mapping]
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
  2. 04 March 2011 (1 commit)
    • xen: Mark all initial reserved pages for the balloon as INVALID_P2M_ENTRY. · 6eaa412f
      Committed by Konrad Rzeszutek Wilk
      With this patch, we diligently set regions that will be used by the
      balloon driver to INVALID_P2M_ENTRY and put them under the ownership
      of the balloon driver. We are OK using __set_phys_to_machine,
      as we do not expect to be allocating any P2M middle or entry pages.
      set_phys_to_machine has the side effect of potentially allocating
      new pages, and we do not want that at this stage.
      
      We can do this because xen_build_mfn_list_list will have already
      allocated all such pages up to xen_max_p2m_pfn.
      
      We also move the check for the auto-translated physmap down the
      stack so it is present in __set_phys_to_machine.
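
      The shape of the change, as a compile-only sketch (the range bounds and the
      helper name here are illustrative; only __set_phys_to_machine and
      INVALID_P2M_ENTRY come from the text above):

      #include <stdbool.h>

      #define INVALID_P2M_ENTRY  (~0UL)

      /* Provided by the Xen p2m code; declared here only for the sketch. */
      extern bool __set_phys_to_machine(unsigned long pfn, unsigned long mfn);

      /* Mark a balloon-owned PFN range invalid without allocating P2M pages -
       * safe because xen_build_mfn_list_list() already covered the tree up to
       * xen_max_p2m_pfn. The bounds are placeholders. */
      static void mark_balloon_range_invalid(unsigned long start_pfn,
                                             unsigned long end_pfn)
      {
              for (unsigned long pfn = start_pfn; pfn < end_pfn; pfn++)
                      __set_phys_to_machine(pfn, INVALID_P2M_ENTRY);
      }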
      
      [v2: Rebased with mmu->p2m code split]
      Reviewed-by: Ian Campbell <ian.campbell@citrix.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
  3. 22 January 2011 (1 commit)
    • xen: p2m: correctly initialize partial p2m leaf · 8e1b4cf2
      Committed by Stefan Bader
      After changing the p2m mapping to a tree by
      
        commit 58e05027
          xen: convert p2m to a 3 level tree
      
      and trying to boot a DomU with 615MB of memory, the following crash was
      observed in the dump:
      
      kernel direct mapping tables up to 26f00000 @ 1ec4000-1fff000
      BUG: unable to handle kernel NULL pointer dereference at (null)
      IP: [<c0107397>] xen_set_pte+0x27/0x60
      *pdpt = 0000000000000000 *pde = 0000000000000000
      
      Adding further debug statements showed that when trying to set up
      pfn=0x26700 the returned mapping was invalid.
      
      pfn=0x266ff calling set_pte(0xc1fe77f8, 0x6b3003)
      pfn=0x26700 calling set_pte(0xc1fe7800, 0x3)
      
      Although the last_pfn obtained from the startup info is 0x26700, which
      should in turn not be hit, the additional 8MB which are added as extra
      memory normally seem to be ok. This led to looking into the initial
      p2m tree construction, which uses the smaller value, assuming that
      other code handles the extra memory.
      
      When the p2m tree is set up, the leaves point directly into the
      array which the domain builder set up. But if the mapping does not end on
      a boundary that fits into one p2m page, this will result in the last leaf
      being only partially valid. And as the invalid entries are not
      initialized in that case, things go badly wrong.
      
      I am trying to fix that by checking whether the current leaf is a
      complete map and, if not, allocating a completely new page and copying
      only the valid pointers there. This may not be the most efficient or
      elegant solution, but at least it seems to allow me to boot DomUs with
      memory assignments all over the range.
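
      A compile-only sketch of that fix (helper names such as alloc_brk_page()
      and p2m_init_invalid() are placeholders; the real code lives in
      xen_build_dynamic_phys_to_machine()):

      #define P2M_PER_PAGE  512UL

      extern unsigned long *alloc_brk_page(void);          /* placeholder      */
      extern void p2m_init_invalid(unsigned long *leaf);   /* fill with ~0UL   */

      /* If the mfn_list chunk backing this leaf is shorter than a full leaf
       * page, copy the valid part into a fresh, INVALID-initialized page
       * instead of pointing into the (partially undefined) array. */
      static unsigned long *build_leaf(unsigned long *mfn_list,
                                       unsigned long pfn, unsigned long max_pfn)
      {
              if (pfn + P2M_PER_PAGE > max_pfn) {          /* partial leaf      */
                      unsigned long *leaf = alloc_brk_page();

                      p2m_init_invalid(leaf);
                      for (unsigned long i = 0; pfn + i < max_pfn; i++)
                              leaf[i] = mfn_list[pfn + i];
                      return leaf;
              }
              return &mfn_list[pfn];                       /* full leaf         */
      }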
      
      BugLink: http://bugs.launchpad.net/bugs/686692
      [v2: Redid a bit of commit wording and fixed a compile warning]
      Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
  4. 21 January 2011 (1 commit)
  5. 20 January 2011 (1 commit)
    • lockdep: Move early boot local IRQ enable/disable status to init/main.c · 2ce802f6
      Committed by Tejun Heo
      During early boot, local IRQs are disabled until the IRQ subsystem is
      properly initialized. During this time, no one should enable
      local IRQs, and some operations which usually are not allowed with
      IRQs disabled, e.g. operations which might sleep or require
      communication with other processors, are allowed.
      
      lockdep tracked this with the early_boot_irqs_off/on() callbacks.
      As other subsystems need this information too, move it to
      init/main.c and make it generally available. While at it,
      invert the boolean to early_boot_irqs_disabled instead of
      enabled, so that it can be initialized with %false and %true
      indicates the exceptional condition.
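
      A minimal sketch of what the flag looks like after the move (kernel
      decorations such as __read_mostly are omitted and the placement inside
      start_kernel() is simplified):

      #include <stdbool.h>

      /* Generic flag, now in init/main.c: false in BSS by default, true only
       * during the exceptional early-boot window. */
      bool early_boot_irqs_disabled;

      static void sketch_start_kernel(void)
      {
              early_boot_irqs_disabled = true;   /* was early_boot_irqs_off() */
              /* ... early setup runs with local IRQs off ... */
              early_boot_irqs_disabled = false;  /* was early_boot_irqs_on()  */
              /* local_irq_enable() follows once the IRQ subsystem is ready   */
      }
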
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Pekka Enberg <penberg@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <20110120110635.GB6036@htj.dyndns.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  6. 15 January 2011 (1 commit)
  7. 12 January 2011 (5 commits)
  8. 10 January 2011 (1 commit)
  9. 07 January 2011 (1 commit)
  10. 17 December 2010 (1 commit)
  11. 03 December 2010 (1 commit)
    • vmalloc: eagerly clear ptes on vunmap · 64141da5
      Committed by Jeremy Fitzhardinge
      On stock 2.6.37-rc4, running:
      
        # mount lilith:/export /mnt/lilith
        # find  /mnt/lilith/ -type f -print0 | xargs -0 file
      
      crashes the machine fairly quickly under Xen.  Often it results in oops
      messages, but the couple of times I tried just now, it just hung quietly
      and made Xen print some rude messages:
      
          (XEN) mm.c:2389:d80 Bad type (saw 7400000000000001 != exp
          3000000000000000) for mfn 1d7058 (pfn 18fa7)
          (XEN) mm.c:964:d80 Attempt to create linear p.t. with write perms
          (XEN) mm.c:2389:d80 Bad type (saw 7400000000000010 != exp
          1000000000000000) for mfn 1d2e04 (pfn 1d1fb)
          (XEN) mm.c:2965:d80 Error while pinning mfn 1d2e04
      
      Which means the domain tried to map a pagetable page RW, which would
      allow it to map arbitrary memory, so Xen stopped it.  This is because
      vm_unmap_ram() left some pages mapped in the vmalloc area after NFS had
      finished with them, and those pages got recycled as pagetable pages
      while still having these RW aliases.
      
      Removing those mappings immediately removes the Xen-visible aliases, and
      so it has no problem with those pages being reused as pagetable pages.
      Deferring the TLB flush doesn't upset Xen because it can flush the TLB
      itself as needed to maintain its invariants.
      
      When unmapping a region in the vmalloc space, clear the ptes
      immediately.  There's no point in deferring this because there's no
      amortization benefit.
      
      The TLBs are left dirty, and they are flushed lazily to amortize the
      cost of the IPIs.
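
      In other words (a conceptual sketch only; the helper names below are
      placeholders, not the mm/vmalloc.c API): clear the PTEs at unmap time and
      keep the TLB flush lazy.

      #include <stddef.h>

      /* Placeholders for illustration; not real kernel functions. */
      static void clear_ptes_now(unsigned long addr, size_t size)
      { (void)addr; (void)size; }
      static void queue_lazy_tlb_flush(unsigned long addr, size_t size)
      { (void)addr; (void)size; }

      static void sketch_vm_unmap(unsigned long addr, size_t size)
      {
              clear_ptes_now(addr, size);        /* no stale RW aliases remain */
              queue_lazy_tlb_flush(addr, size);  /* IPIs are still amortized   */
      }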
      
      The specific motivation for this patch is an oops-causing regression
      since 2.6.36 when using NFS under Xen, triggered by the NFS client's use
      of vm_map_ram() introduced in 56e4ebf8 ("NFS: readdir with vmapped
      pages"). XFS also uses vm_map_ram() and could cause similar problems.
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Bryan Schumaker <bjschuma@netapp.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 02 December 2010 (1 commit)
  13. 30 November 2010 (2 commits)
    • xen: x86/32: perform initial startup on initial_page_table · 805e3f49
      Committed by Ian Campbell
      Only make swapper_pg_dir read-only and pinned when generic x86 architecture
      code (which also starts on initial_page_table) switches to it. This helps
      ensure that the generic setup paths work on Xen unmodified. In particular,
      clone_pgd_range writes directly to the destination pgd entries and is used
      to initialise swapper_pg_dir, so we need to ensure that it remains writeable
      until the last possible moment during bring-up.
      
      This is complicated slightly by the need to avoid sharing kernel PMD entries
      when running under Xen; therefore the Xen implementation must make a copy of
      the kernel PMD (which is otherwise referred to by both initial_page_table
      and swapper_pg_dir) before switching to swapper_pg_dir.
      Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
    • xen: don't bother to stop other cpus on shutdown/reboot · 31e323cc
      Committed by Jeremy Fitzhardinge
      Xen will shoot all the VCPUs when we do a shutdown hypercall, so there's
      no need to do it manually.
      
      In any case it will fail because all the IPI irqs have been pulled
      down by this point, so the cross-CPU calls will simply hang forever.
      
      Until change 76fac077 the function calls
      were not synchronously waited for, so this wasn't apparent. However, after
      that change the calls became synchronous, leading to a hang on shutdown
      on multi-VCPU guests.
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Cc: Stable Kernel <stable@kernel.org>
      Cc: Alok Kataria <akataria@vmware.com>
  14. 28 November 2010 (1 commit)
  15. 25 November 2010 (2 commits)
  16. 23 November 2010 (3 commits)
    • xen: use default_idle · bc15fde7
      Committed by Jeremy Fitzhardinge
      We just need the idle loop to drop into safe_halt, which default_idle()
      is perfectly capable of doing.  There's no need to duplicate it.
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
    • xen: clean up "extra" memory handling some more · c2d08791
      Committed by Jeremy Fitzhardinge
      Make sure that extra_pages is added for all E820_RAM regions beyond
      mem_end - completely excluded regions as well as the remains of partially
      included regions.
      
      Also make sure the extra region is not unnecessarily high, and simplify
      the logic that decides which regions should be added.
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
    • xen: set IO permission early (before early_cpu_init()) · ec35a69c
      Committed by Konrad Rzeszutek Wilk
      This patch is based off "xen dom0: Set up basic IO permissions for dom0."
      by Juan Quintela <quintela@redhat.com>.
      
      On AMD machines, when we boot the kernel as Domain 0, we get this nasty crash:
      
      mapping kernel into physical memory
      Xen: setup ISA identity maps
      about to get started...
      (XEN) traps.c:475:d0 Unhandled general protection fault fault/trap [#13] on VCPU 0 [ec=0000]
      (XEN) domain_crash_sync called from entry.S
      (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
      (XEN) ----[ Xen-4.1-101116  x86_64  debug=y  Not tainted ]----
      (XEN) CPU:    0
      (XEN) RIP:    e033:[<ffffffff8130271b>]
      (XEN) RFLAGS: 0000000000000282   EM: 1   CONTEXT: pv guest
      (XEN) rax: 000000008000c068   rbx: ffffffff8186c680   rcx: 0000000000000068
      (XEN) rdx: 0000000000000cf8   rsi: 000000000000c000   rdi: 0000000000000000
      (XEN) rbp: ffffffff81801e98   rsp: ffffffff81801e50   r8:  ffffffff81801eac
      (XEN) r9:  ffffffff81801ea8   r10: ffffffff81801eb4   r11: 00000000ffffffff
      (XEN) r12: ffffffff8186c694   r13: ffffffff81801f90   r14: ffffffffffffffff
      (XEN) r15: 0000000000000000   cr0: 000000008005003b   cr4: 00000000000006f0
      (XEN) cr3: 0000000221803000   cr2: 0000000000000000
      (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e02b   cs: e033
      (XEN) Guest stack trace from rsp=ffffffff81801e50:
      
      RIP points to read_pci_config() function.
      
      The issue is that we don't set IO permissions for the Linux kernel early enough.
      
      The call sequence used to be:
      
          xen_start_kernel()
      	x86_init.oem.arch_setup = xen_arch_setup;
              setup_arch:
                 - early_cpu_init
                     - early_init_amd
                        - read_pci_config
                 - x86_init.oem.arch_setup [ xen_arch_setup ]
                     - set IO permissions.
      
      We need to set the IO permissions earlier on, which this patch does.
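
      A hedged sketch of the mechanism (the PHYSDEVOP_set_iopl hypercall; the
      guards and exact placement in xen_start_kernel() are simplified here):

      /* Kernel-context sketch (not standalone): raise the I/O privilege level
       * for dom0 before early_cpu_init() reaches read_pci_config(), instead of
       * waiting for xen_arch_setup(). */
      static void __init sketch_set_iopl_early(void)
      {
              struct physdev_set_iopl set_iopl = { .iopl = 1 };
              int rc = HYPERVISOR_physdev_op(PHYSDEVOP_set_iopl, &set_iopl);

              WARN(rc != 0, "PHYSDEVOP_set_iopl failed: %d\n", rc);
      }
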
      Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
  17. 20 November 2010 (1 commit)
  18. 13 November 2010 (1 commit)
  19. 12 November 2010 (1 commit)
  20. 11 November 2010 (1 commit)
  21. 30 October 2010 (1 commit)
  22. 28 October 2010 (1 commit)
  23. 27 October 2010 (1 commit)
    • xen: initialize cpu masks for pv guests in xen_smp_init · ea5b8f73
      Committed by Stefano Stabellini
      PV guests don't have ACPI and need the cpu masks to be set
      correctly as early as possible, so we call xen_fill_possible_map from
      xen_smp_init.
      On the other hand, the initial domain supports ACPI, so in this case we
      skip xen_fill_possible_map and rely on ACPI. However, Xen might limit the
      number of cpus usable by the domain, so we filter those masks during smp
      initialization using the VCPUOP_is_up hypercall, as sketched below.
      It is important that the filtering is done before
      xen_setup_vcpu_info_placement.
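
      A sketch of that filtering step (kernel context, simplified from the
      description above; the helper name is illustrative):

      /* Trim the possible/present masks to the VCPUs Xen actually gave us,
       * before xen_setup_vcpu_info_placement() runs. */
      static void __init sketch_filter_cpu_maps(void)
      {
              int i, rc;

              if (!xen_initial_domain())
                      return;

              for_each_possible_cpu(i) {
                      rc = HYPERVISOR_vcpu_op(VCPUOP_is_up, i, NULL);
                      if (rc >= 0) {
                              set_cpu_possible(i, true);
                      } else {
                              set_cpu_possible(i, false);
                              set_cpu_present(i, false);
                      }
              }
      }
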
      Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
  24. 26 October 2010 (1 commit)
  25. 23 October 2010 (6 commits)