1. 23 Nov, 2008 1 commit
    • xen: pin correct PGD on suspend · 86bbc2c2
      Ian Campbell committed
      Impact: fix Xen guest boot failure
      
      commit eefb47f6 ("xen: use
      spin_lock_nest_lock when pinning a pagetable") changed xen_pgd_walk to
      walk over mm->pgd rather than taking pgd as an argument.
      
      This breaks xen_mm_(un)pin_all() because it makes init_mm.pgd readonly
      instead of the pgd we are interested in and therefore the pin subsequently
      fails.
      
      (XEN) mm.c:2280:d15 Bad type (saw 00000000e8000001 != exp 0000000060000000) for mfn bc464 (pfn 21ca7)
      (XEN) mm.c:2665:d15 Error while pinning mfn bc464
      
      [   14.586913] 1 multicall(s) failed: cpu 0
      [   14.586926] Pid: 14, comm: kstop/0 Not tainted 2.6.28-rc5-x86_32p-xenU-00172-gee2f6cc7 #200
      [   14.586940] Call Trace:
      [   14.586955]  [<c030c17a>] ? printk+0x18/0x1e
      [   14.586972]  [<c0103df3>] xen_mc_flush+0x163/0x1d0
      [   14.586986]  [<c0104bc1>] __xen_pgd_pin+0xa1/0x110
      [   14.587000]  [<c015a330>] ? stop_cpu+0x0/0xf0
      [   14.587015]  [<c0104d7b>] xen_mm_pin_all+0x4b/0x70
      [   14.587029]  [<c022bcb9>] xen_suspend+0x39/0xe0
      [   14.587042]  [<c015a330>] ? stop_cpu+0x0/0xf0
      [   14.587054]  [<c015a3cd>] stop_cpu+0x9d/0xf0
      [   14.587067]  [<c01417cd>] run_workqueue+0x8d/0x150
      [   14.587080]  [<c030e4b3>] ? _spin_unlock_irqrestore+0x23/0x40
      [   14.587094]  [<c014558a>] ? prepare_to_wait+0x3a/0x70
      [   14.587107]  [<c0141918>] worker_thread+0x88/0xf0
      [   14.587120]  [<c01453c0>] ? autoremove_wake_function+0x0/0x50
      [   14.587133]  [<c0141890>] ? worker_thread+0x0/0xf0
      [   14.587146]  [<c014509c>] kthread+0x3c/0x70
      [   14.587157]  [<c0145060>] ? kthread+0x0/0x70
      [   14.587170]  [<c0109d1b>] kernel_thread_helper+0x7/0x10
      [   14.587181]   call  1/3: op=14 arg=[c0415000] result=0
      [   14.587192]   call  2/3: op=14 arg=[e1ca2000] result=0
      [   14.587204]   call  3/3: op=26 arg=[c1808860] result=-22
      Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
      Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  2. 07 Nov, 2008 2 commits
    • xen: make sure stray alias mappings are gone before pinning · d05fdf31
      Jeremy Fitzhardinge committed
      Xen requires that all mappings of pagetable pages are read-only, so
      that they can't be updated illegally.  As a result, if a page is being
      turned into a pagetable page, we need to make sure all its mappings
      are RO.
      
      If the page had been used for ioremap or vmalloc, it may still have
      left over mappings as a result of not having been lazily unmapped.
      This change makes sure we explicitly mop them all up before pinning
      the page.
      
      Unlike aliases created by kmap, there can be vmalloc aliases even
      for non-high pages, so we must do the flush unconditionally.
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Cc: Linux Memory Management List <linux-mm@kvack.org>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86, xen: fix use of pgd_page now that it really does return a page · 47cb2ed9
      Jeremy Fitzhardinge committed
      Impact: fix 32-bit Xen guest boot crash
      
      On 32-bit PAE, pud_page, for no good reason, didn't really return a
      struct page *.  Since Jan Beulich's fix "i386/PAE: fix pud_page()",
      pud_page does return a struct page *.
      
      Because PAE has 3 pagetable levels, the pud level is folded into the
      pgd level, so pgd_page() is the same as pud_page(), and now returns
      a struct page *.  Update the xen/mmu.c code which uses pgd_page()
      accordingly.
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  3. 27 Oct, 2008 1 commit
    • xen: fix Xen domU boot with batched mprotect · 9f32d21c
      Chris Lalancette committed
      Impact: fix guest kernel boot crash on certain configs
      
      Recent i686 2.6.27 kernels with a certain amount of memory (between
      736 and 855MB) have a problem booting under a hypervisor that supports
      batched mprotect (this includes the RHEL-5 Xen hypervisor as well as
      any 3.3 or later Xen hypervisor).
      
      The problem ends up being that xen_ptep_modify_prot_commit() is using
      virt_to_machine to calculate which pfn to update.  However, this only
      works for pages that are in the p2m list, and the pages coming from
      change_pte_range() in mm/mprotect.c are kmap_atomic pages.  Because of
      this, we can run into the situation where the lookup in the p2m table
      returns an INVALID_MFN, which we then try to pass to the hypervisor,
      which then (correctly) denies the request to a totally bogus pfn.
      
      The right thing to do is to use arbitrary_virt_to_machine, so that we
      can be sure we are modifying the right pfn.  This unfortunately
      introduces a performance penalty because of a full page-table-walk,
      but we can avoid that penalty for pages in the p2m list by checking if
      virt_addr_valid is true, and if so, just doing the lookup in the p2m
      table.
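
The fast-path/slow-path split described above can be sketched in user space as follows. This is a toy model only: the table contents, the address range, and every `toy_` name are assumptions made for illustration, and `toy_walk()` merely stands in for the page-table walk that arbitrary_virt_to_machine performs.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model; every name and constant below is hypothetical. */
#define PAGE_SHIFT   12
#define NR_DIRECT    4                     /* pages covered by the toy p2m table */
#define DIRECT_BASE  UINT64_C(0xc0000000)  /* start of the toy direct mapping */

static const uint64_t p2m[NR_DIRECT] = { 0x500, 0x501, 0x7a0, 0x7a1 };
static unsigned slow_walks;                /* counts slow-path lookups */

/* Fast-path test: is the address inside the toy direct mapping? */
static int toy_virt_addr_valid(uint64_t va)
{
    return va >= DIRECT_BASE &&
           va < DIRECT_BASE + ((uint64_t)NR_DIRECT << PAGE_SHIFT);
}

/* Slow-path stand-in: pretend to walk the page tables. */
static uint64_t toy_walk(uint64_t va)
{
    slow_walks++;
    return UINT64_C(0x9000) + (va >> PAGE_SHIFT) % 16; /* arbitrary toy result */
}

/* Use the cheap table lookup when it is valid, otherwise fall back. */
static uint64_t toy_virt_to_mfn(uint64_t va)
{
    if (toy_virt_addr_valid(va)) {
        uint64_t idx = (va - DIRECT_BASE) >> PAGE_SHIFT;
        assert(idx < NR_DIRECT);
        return p2m[idx];
    }
    return toy_walk(va);
}
```

Only addresses outside the direct mapping pay for the walk, which is the trade-off the commit message describes.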
      
      The attached patch implements this, and allows my 2.6.27 i686 based
      guest with 768MB of memory to boot on a RHEL-5 hypervisor again.
      Thanks to Jeremy for the suggestions about how to fix this particular
      issue.
      Signed-off-by: Chris Lalancette <clalance@redhat.com>
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Cc: Chris Lalancette <clalance@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  4. 20 Oct, 2008 1 commit
    • mm: rewrite vmap layer · db64fe02
      Nick Piggin committed
      Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and
      provide a fast, scalable percpu frontend for small vmaps (requires a
      slightly different API, though).
      
      The biggest problem with vmap is actually vunmap.  Presently this requires
      a global kernel TLB flush, which on most architectures is a broadcast IPI
      to all CPUs to flush the cache.  This is all done under a global lock.  As
      the number of CPUs increases, so will the number of vunmaps a scaled
      workload will want to perform, and so will the cost of a global TLB flush.
      This gives terrible quadratic scalability characteristics.
      
      Another problem is that the entire vmap subsystem works under a single
      lock.  It is a rwlock, but it is actually taken for write in all the fast
      paths, and the read locking would likely never be run concurrently anyway,
      so it's just pointless.
      
      This is a rewrite of vmap subsystem to solve those problems.  The existing
      vmalloc API is implemented on top of the rewritten subsystem.
      
      The TLB flushing problem is solved by using lazy TLB unmapping.  vmap
      addresses do not have to be flushed immediately when they are vunmapped,
      because the kernel will not reuse them again (would be a use-after-free)
      until they are reallocated.  So the addresses aren't allocated again until
      a subsequent TLB flush.  A single TLB flush then can flush multiple
      vunmaps from each CPU.
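
The batching idea in the paragraph above can be illustrated with a small user-space model. It is a sketch under stated assumptions, not the vmalloc implementation: the threshold `LAZY_MAX`, the struct, and the function names are all hypothetical, and a counter stands in for the global TLB flush.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: all names here are hypothetical, not kernel symbols. */
#define LAZY_MAX 4   /* assumed number of pending areas that triggers a flush */

struct lazy_vmap {
    size_t pending;     /* areas unmapped but not yet flushed */
    size_t tlb_flushes; /* number of (expensive) global flushes issued */
    size_t reusable;    /* areas whose addresses may be handed out again */
};

/* One global flush makes every pending area reusable at once. */
static void lazy_flush(struct lazy_vmap *v)
{
    assert(v != NULL);
    if (v->pending == 0)
        return;
    v->tlb_flushes++;
    v->reusable += v->pending;
    v->pending = 0;
}

/* Unmapping defers the flush until enough areas have accumulated. */
static void lazy_unmap(struct lazy_vmap *v)
{
    assert(v != NULL);
    v->pending++;
    if (v->pending >= LAZY_MAX)
        lazy_flush(v);
}
```

In this model ten unmaps cost two flushes instead of ten, and an explicit `lazy_flush()` call plays the role that vm_unmap_aliases() plays for callers that cannot tolerate stale aliases.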
      
      XEN and PAT and such do not like deferred TLB flushing because they can't
      always handle multiple aliasing virtual addresses to a physical address.
      They now call vm_unmap_aliases() in order to flush any deferred mappings.
      That call is very expensive (well, actually not a lot more expensive than
      a single vunmap under the old scheme), however it should be OK if not
      called too often.
      
      The virtual memory extent information is stored in an rbtree rather than a
      linked list to improve the algorithmic scalability.
      
      There is a per-CPU allocator for small vmaps, which amortizes or avoids
      global locking.
      
      To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces
      must be used in place of vmap and vunmap.  Vmalloc does not use these
      interfaces at the moment, so it will not be quite so scalable (although it
      will use lazy TLB flushing).
      
      As a quick test of performance, I ran a test that loops in the kernel,
      linearly mapping then touching then unmapping 4 pages.  Different numbers
      of tests were run in parallel on a 4 core, 2 socket Opteron.  Results are
      in nanoseconds per map+touch+unmap.
      
      threads           vanilla         vmap rewrite
      1                 14700           2900
      2                 33600           3000
      4                 49500           2800
      8                 70631           2900
      
      So with 8 cores, the rewritten version is already 25x faster.
      
      In a slightly more realistic test (although with an older and less
      scalable version of the patch), I ripped the not-very-good vunmap batching
      code out of XFS, and implemented the large buffer mapping with vm_map_ram
      and vm_unmap_ram...  along with a couple of other tricks, I was able to
      speed up a large directory workload by 20x on a 64 CPU system.  I believe
      vmap/vunmap is actually sped up a lot more than 20x on such a system, but
      I'm running into other locks now.  vmap is pretty well blown off the
      profiles.
      
      Before:
      1352059 total                                      0.1401
      798784 _write_lock                              8320.6667 <- vmlist_lock
      529313 default_idle                             1181.5022
       15242 smp_call_function                         15.8771  <- vmap tlb flushing
        2472 __get_vm_area_node                         1.9312  <- vmap
        1762 remove_vm_area                             4.5885  <- vunmap
         316 map_vm_area                                0.2297  <- vmap
         312 kfree                                      0.1950
         300 _spin_lock                                 3.1250
         252 sn_send_IPI_phys                           0.4375  <- tlb flushing
         238 vmap                                       0.8264  <- vmap
         216 find_lock_page                             0.5192
         196 find_next_bit                              0.3603
         136 sn2_send_IPI                               0.2024
         130 pio_phys_write_mmr                         2.0312
         118 unmap_kernel_range                         0.1229
      
      After:
       78406 total                                      0.0081
       40053 default_idle                              89.4040
       33576 ia64_spinlock_contention                 349.7500
        1650 _spin_lock                                17.1875
         319 __reg_op                                   0.5538
         281 _atomic_dec_and_lock                       1.0977
         153 mutex_unlock                               1.5938
         123 iget_locked                                0.1671
         117 xfs_dir_lookup                             0.1662
         117 dput                                       0.1406
         114 xfs_iget_core                              0.0268
          92 xfs_da_hashname                            0.1917
          75 d_alloc                                    0.0670
          68 vmap_page_range                            0.0462 <- vmap
          58 kmem_cache_alloc                           0.0604
          57 memset                                     0.0540
          52 rb_next                                    0.1625
          50 __copy_user                                0.0208
          49 bitmap_find_free_region                    0.2188 <- vmap
          46 ia64_sn_udelay                             0.1106
          45 find_inode_fast                            0.1406
          42 memcmp                                     0.2188
          42 finish_task_switch                         0.1094
          42 __d_lookup                                 0.0410
          40 radix_tree_lookup_slot                     0.1250
          37 _spin_unlock_irqrestore                    0.3854
          36 xfs_bmapi                                  0.0050
          36 kmem_cache_free                            0.0256
          35 xfs_vn_getattr                             0.0322
          34 radix_tree_lookup                          0.1062
          33 __link_path_walk                           0.0035
          31 xfs_da_do_buf                              0.0091
          30 _xfs_buf_find                              0.0204
          28 find_get_page                              0.0875
          27 xfs_iread                                  0.0241
          27 __strncpy_from_user                        0.2812
          26 _xfs_buf_initialize                        0.0406
          24 _xfs_buf_lookup_pages                      0.0179
          24 vunmap_page_range                          0.0250 <- vunmap
          23 find_lock_page                             0.0799
          22 vm_map_ram                                 0.0087 <- vmap
          20 kfree                                      0.0125
          19 put_page                                   0.0330
          18 __kmalloc                                  0.0176
          17 xfs_da_node_lookup_int                     0.0086
          17 _read_lock                                 0.0885
          17 page_waitqueue                             0.0664
      
      vmap has gone from being in the top 5 on the profiles and flushing the crap
      out of all TLBs, to using less than 1% of kernel time.
      
      [akpm@linux-foundation.org: cleanups, section fix]
      [akpm@linux-foundation.org: fix build on alpha]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Krzysztof Helt <krzysztof.h1@poczta.fm>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 09 Oct, 2008 1 commit
    • xen: use spin_lock_nest_lock when pinning a pagetable · eefb47f6
      Jeremy Fitzhardinge committed
      When pinning/unpinning a pagetable with split pte locks, we can end up
      holding multiple pte locks at once (we need to hold the locks while
      there's a pending batched hypercall affecting the pte page).  Because
      all the pte locks are in the same lock class, lockdep thinks that
      we're potentially taking a lock recursively.
      
      This warning is spurious because we always take the pte locks while
      holding mm->page_table_lock.  lockdep now has spin_lock_nest_lock to
      express this kind of dominant lock use, so use it here so that lockdep
      knows what's going on.
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  6. 10 Sep, 2008 1 commit
  7. 21 Aug, 2008 1 commit
  8. 20 Aug, 2008 2 commits
  9. 22 Jul, 2008 2 commits
  10. 16 Jul, 2008 9 commits
    • xen64: allocate and manage user pagetables · d6182fbf
      Jeremy Fitzhardinge committed
      Because the x86_64 architecture does not enforce segment limits, Xen
      cannot protect itself with them as it does in 32-bit mode.  Therefore,
      to protect itself, it runs the guest kernel in ring 3.  Since it also
      runs the guest userspace in ring 3, the guest kernel must maintain a
      second pagetable for its userspace, which does not map kernel space.
      Naturally, the guest kernel pagetables map both kernel and userspace.
      
      The userspace pagetable is attached to the corresponding kernel
      pagetable via the pgd's page->private field.  It is allocated and
      freed at the same time as the kernel pgd via the
      paravirt_pgd_alloc/free hooks.
      
      Fortunately, the user pagetable is almost entirely shared with the
      kernel pagetable; the only difference is the pgd page itself.  set_pgd
      will populate all entries in the kernel pagetable, and also set the
      corresponding user pgd entry if the address is less than
      STACK_TOP_MAX.
      
      The user pagetable must be pinned and unpinned with the kernel one,
      but because the pagetables are aliased, pgd_walk() only needs to be
      called on the kernel pagetable.  The user pgd page is then
      pinned/unpinned along with the kernel pgd page.
      
      xen_write_cr3 must write both the kernel and user cr3s.
      
      The init_mm.pgd pagetable never has a user pagetable allocated for it,
      because it can never be used while running usermode.
      
      One awkward area is that early in boot the page structures are not
      available.  No user pagetable can exist at that point, but it
      complicates the logic to avoid looking at the page structure.
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Cc: Stephen Tweedie <sct@redhat.com>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Mark McLoughlin <markmc@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • xen: rework pgd_walk to deal with 32/64 bit · 5deb30d1
      Jeremy Fitzhardinge committed
      Rewrite pgd_walk to deal with 64-bit address spaces.  There are two
      notable features of 64-bit address spaces:
      
       1. The virtual address is only 48 bits wide, with the upper 16 bits
          being sign extension; kernel addresses are negative, and userspace is
          positive.
      
       2. The Xen hypervisor mapping is at the negative-most address, just above
          the sign-extension hole.
      
      1. means that we can't easily use addresses when traversing the space,
      since we must deal with sign extension.  This rewrite expresses
      everything in terms of pgd/pud/pmd indices, which means we don't need
      to worry about the exact configuration of the virtual memory space.
      This approach works equally well in 32-bit.
      
      To deal with 2, assume the hole is between the uppermost userspace
      address and PAGE_OFFSET.  For 64-bit this skips the Xen mapping hole.
      For 32-bit, the hole is zero-sized.
      
      In all cases, the uppermost kernel address is FIXADDR_TOP.
      
      A side-effect of this patch is that the upper boundary is actually
      handled properly, exposing a long-standing bug in 32-bit, which failed
      to pin the kernel pmd page.  The kernel pmd is not shared, and so must be
      explicitly pinned, even though the kernel ptes are shared and don't
      need pinning.
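
To see why working in pgd indices sidesteps the sign-extension problem, here is a minimal user-space sketch. It assumes the standard 4-level x86-64 layout (48 implemented address bits, 512 pgd entries each covering 2^39 bytes); the function names are made up for this example and are not the ones used in xen/mmu.c.

```c
#include <assert.h>
#include <stdint.h>

#define VA_BITS      48   /* implemented virtual-address bits (4-level paging) */
#define PGDIR_SHIFT  39   /* each pgd entry covers 2^39 bytes */
#define PTRS_PER_PGD 512

/* Sign-extend bit 47 into bits 48..63 to form a canonical address. */
static uint64_t canonical(uint64_t addr)
{
    uint64_t mask = (UINT64_C(1) << VA_BITS) - 1;
    uint64_t low = addr & mask;

    if (low & (UINT64_C(1) << (VA_BITS - 1)))
        low |= ~mask;
    return low;
}

/* The pgd index ignores the sign-extension bits entirely. */
static unsigned pgd_index_of(uint64_t addr)
{
    return (unsigned)((addr >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1));
}

/* Rebuild the lowest address covered by a pgd slot. */
static uint64_t pgd_slot_base(unsigned idx)
{
    assert(idx < PTRS_PER_PGD);
    return canonical((uint64_t)idx << PGDIR_SHIFT);
}
```

Indices 0 to 255 cover the positive (userspace) half and 256 to 511 the negative (kernel) half, so a walker can iterate over plain integers and never compare addresses across the hole.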
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Cc: Stephen Tweedie <sct@redhat.com>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Mark McLoughlin <markmc@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • xen: use set_pte_vaddr · 836fe2f2
      Jeremy Fitzhardinge committed
      Make Xen's set_pte_mfn() use set_pte_vaddr rather than copying it.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Juan Quintela <quintela@redhat.com>
      Signed-off-by: Mark McLoughlin <markmc@redhat.com>
      Cc: Stephen Tweedie <sct@redhat.com>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Mark McLoughlin <markmc@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • xen64: use arbitrary_virt_to_machine for xen_set_pmd · ce803e70
      Jeremy Fitzhardinge committed
      When building initial pagetables in 64-bit kernel the pud/pmd pointer may
      be in ioremap/fixmap space, so we need to walk the pagetable to look up the
      physical address.
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Cc: Stephen Tweedie <sct@redhat.com>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Mark McLoughlin <markmc@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • xen: fix truncation of machine address · ebd879e3
      Jeremy Fitzhardinge committed
      arbitrary_virt_to_machine can truncate a machine address if it's above
      4G.  Cast the problem away.
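
The class of bug is easy to reproduce outside the kernel. The sketch below is an illustration rather than the patched code: `ulong32` is an assumed stand-in for a 32-bit kernel's unsigned long, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* Stand-in for a 32-bit kernel's unsigned long (an assumption of this sketch). */
typedef uint32_t ulong32;

/* Buggy form: the shift happens in 32 bits, so bits above 4G are lost. */
static uint64_t maddr_truncated(ulong32 mfn, ulong32 offset)
{
    assert(offset < (1u << PAGE_SHIFT));
    return (ulong32)(mfn << PAGE_SHIFT) | offset;
}

/* Fixed form: widen the frame number first, then shift. */
static uint64_t maddr_widened(ulong32 mfn, ulong32 offset)
{
    assert(offset < (1u << PAGE_SHIFT));
    return ((uint64_t)mfn << PAGE_SHIFT) | offset;
}
```

For a frame number such as 0x100000 (the first frame at 4G) the truncated form yields only the page offset, while the widened form yields the full machine address; for frames below 4G the two agree.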
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Cc: Stephen Tweedie <sct@redhat.com>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Mark McLoughlin <markmc@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • xen64: get active_mm from the pda · ce87b3d3
      Jeremy Fitzhardinge committed
      x86_64 stores the active_mm in the pda, so fetch it from there.
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Cc: Stephen Tweedie <sct@redhat.com>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Mark McLoughlin <markmc@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • xen64: add extra pv_mmu_ops · f6e58732
      Jeremy Fitzhardinge committed
      We need extra pv_mmu_ops for 64-bit, to deal with the extra level of
      pagetable.
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Cc: Stephen Tweedie <sct@redhat.com>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Mark McLoughlin <markmc@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86: use __page_aligned_data/bss · cbcd79c2
      Jeremy Fitzhardinge committed
      Update arch/x86's use of page-aligned variables.  The change to
      arch/x86/xen/mmu.c fixes an actual bug, but the rest are cleanups
      and to set a precedent.
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Cc: Stephen Tweedie <sct@redhat.com>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Mark McLoughlin <markmc@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • pvops-64: call paravirt_post_allocator_init() on setup_arch() · c1f2f09e
      Eduardo Habkost committed
      Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Cc: Stephen Tweedie <sct@redhat.com>
      Cc: Mark McLoughlin <markmc@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  11. 04 Jul, 2008 1 commit
  12. 26 Jun, 2008 1 commit
  13. 25 Jun, 2008 2 commits
    • xen: add mechanism to extend existing multicalls · 400d3494
      Jeremy Fitzhardinge committed
      Some Xen hypercalls accept an array of operations to work on.  In
      general this is because it's more efficient for the hypercall to do the
      work all at once rather than as separate hypercalls (even batched as a
      multicall).
      
      This patch adds a mechanism (xen_mc_extend_args()) to allocate more
      argument space to the last-issued multicall, in order to extend its
      argument list.
      
      The user of this mechanism is xen/mmu.c, which uses it to extend the
      args array of mmu_update.  This is particularly valuable when doing
      the update for a large mprotect, which goes via
      ptep_modify_prot_commit(), but it also manages to batch updates to
      pgd/pmds as well.
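
The idea of growing the last queued call instead of queueing a new one can be modelled in a few lines. This is a toy sketch, not the xen_mc_extend_args() implementation: the struct layout, the capacities, and the `mc_` function names are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model; capacities and names are hypothetical. */
#define MC_ARGS    8   /* assumed capacity of the shared argument buffer */
#define MC_ENTRIES 4   /* assumed capacity of the call list */

struct mc_entry { int op; size_t first_arg; size_t nargs; };

struct mc_batch {
    struct mc_entry entry[MC_ENTRIES];
    size_t nentries;
    unsigned long args[MC_ARGS];
    size_t nargs;
};

/* Try to append one argument to the most recent call; returns 1 on success. */
static int mc_extend_last(struct mc_batch *b, int op, unsigned long arg)
{
    struct mc_entry *last;

    assert(b != NULL);
    if (b->nentries == 0 || b->nargs == MC_ARGS)
        return 0;
    last = &b->entry[b->nentries - 1];
    if (last->op != op)
        return 0;               /* only the same kind of call can be extended */
    b->args[b->nargs++] = arg;  /* the last call's args end the buffer, so this is contiguous */
    last->nargs++;
    return 1;
}

/* Queue an update: extend the last call when possible, else add a new one. */
static int mc_queue(struct mc_batch *b, int op, unsigned long arg)
{
    if (mc_extend_last(b, op, arg))
        return 1;
    if (b->nentries == MC_ENTRIES || b->nargs == MC_ARGS)
        return 0;               /* full: the caller would have to flush first */
    b->entry[b->nentries++] = (struct mc_entry){ op, b->nargs, 1 };
    b->args[b->nargs++] = arg;
    return 1;
}
```

Consecutive updates of the same kind collapse into one entry with a longer argument list, which is the effect the commit message describes for a large mprotect.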
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • xen: implement ptep_modify_prot_start/commit · e57778a1
      Jeremy Fitzhardinge committed
      Xen has a pte update function which will update a pte while preserving
      its accessed and dirty bits.  This means that ptep_modify_prot_start() can be
      implemented as a simple read of the pte value.  The hardware may
      update the pte in the meantime, but ptep_modify_prot_commit() updates it while
      preserving any changes that may have happened in the meantime.
      
      The updates in ptep_modify_prot_commit() are batched if we're currently in lazy
      mmu mode.
      
      The mmu_update hypercall can take a batch of updates to perform, but
      this code doesn't make particular use of that feature, in favour of
      using generic multicall batching to get them all into the hypervisor.
      
      The net effect of this is that each mprotect pte update turns from two
      expensive trap-and-emulate faults into the hypervisor into a single
      hypercall whose cost is amortized in a batched multicall.
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  14. 24 Jun, 2008 1 commit
  15. 20 Jun, 2008 4 commits
  16. 02 Jun, 2008 2 commits
  17. 28 May, 2008 1 commit
    • xen: fix early bootup crash on native hardware · b20aeccd
      Ingo Molnar committed
      -tip tree auto-testing found the following early bootup hang:
      
      -------------->
      get_memcfg_from_srat: assigning address to rsdp
      RSD PTR  v0 [Nvidia]
      BUG: Int 14: CR2 ffd00040
           EDI 8092fbfe  ESI ffd00040  EBP 80b0aee8  ESP 80b0aed0
           EBX 000f76f0  EDX 0000000e  ECX 00000003  EAX ffd00040
           err 00000000  EIP 802c055a   CS 00000060  flg 00010006
      Stack: ffd00040 80bc78d0 80b0af6c 80b1dbfe 8093d8ba 00000008 80b42810 80b4ddb4
             80b42842 00000000 80b0af1c 801079c8 808e724e 00000000 80b42871 802c0531
             00000100 00000000 0003fff0 80b0af40 80129999 00040100 00040100 00000000
      Pid: 0, comm: swapper Not tainted 2.6.26-rc4-sched-devel.git #570
       [<802c055a>] ? strncmp+0x11/0x25
       [<80b1dbfe>] ? get_memcfg_from_srat+0xb4/0x568
       [<801079c8>] ? mcount_call+0x5/0x9
       [<802c0531>] ? strcmp+0xa/0x22
       [<80129999>] ? printk+0x38/0x3a
       [<80129999>] ? printk+0x38/0x3a
       [<8011b122>] ? memory_present+0x66/0x6f
       [<80b216b4>] ? setup_memory+0x13/0x40c
       [<80b16b47>] ? propagate_e820_map+0x80/0x97
       [<80b1622a>] ? setup_arch+0x248/0x477
       [<80129999>] ? printk+0x38/0x3a
       [<80b11759>] ? start_kernel+0x6e/0x2eb
       [<80b110fc>] ? i386_start_kernel+0xeb/0xf2
       =======================
      <------
      
      with this config:
      
         http://redhat.com/~mingo/misc/config-Wed_May_28_01_33_33_CEST_2008.bad
      
      The thing is, the crash makes little sense at first sight. We crash on a
      benign-looking printk. The code around it got changed in -tip but
      checking those topic branches individually did not reproduce the bug.
      
      Bisection led to this commit:
      
      |   d5edbc1f is first bad commit
      |   commit d5edbc1f
      |   Author: Jeremy Fitzhardinge <jeremy@goop.org>
      |   Date:   Mon May 26 23:31:22 2008 +0100
      |
      |   xen: add p2m mfn_list_list
      
      Which is somewhat surprising, as on native hardware Xen client side
      should have little to no side-effects.
      
      After some head scratching, it turns out the following happened:
      randconfig enabled the following Xen options:
      
        CONFIG_XEN=y
        CONFIG_XEN_MAX_DOMAIN_MEMORY=8
        # CONFIG_XEN_BLKDEV_FRONTEND is not set
        # CONFIG_XEN_NETDEV_FRONTEND is not set
        CONFIG_HVC_XEN=y
        # CONFIG_XEN_BALLOON is not set
      
      which activated this piece of code in arch/x86/xen/mmu.c:
      
      > @@ -69,6 +69,13 @@
      >  	__attribute__((section(".data.page_aligned"))) =
      >  		{ [ 0 ... TOP_ENTRIES - 1] = &p2m_missing[0] };
      >
      > +/* Arrays of p2m arrays expressed in mfns used for save/restore */
      > +static unsigned long p2m_top_mfn[TOP_ENTRIES]
      > +	__attribute__((section(".bss.page_aligned")));
      > +
      > +static unsigned long p2m_top_mfn_list[TOP_ENTRIES / P2M_ENTRIES_PER_PAGE]
      > +	__attribute__((section(".bss.page_aligned")));
      
      The problem is, you must only put variables into .bss.page_aligned that
      have a _size_ that is _exactly_ page aligned. In this case the size of
      p2m_top_mfn_list is not page aligned:
      
       80b8d000 b p2m_top_mfn
       80b8f000 b p2m_top_mfn_list
       80b8f008 b softirq_stack
       80b97008 b hardirq_stack
       80b9f008 b bm_pte
      
      So all subsequent variables get unaligned which, depending on luck,
      breaks the kernel in various funny ways. In this case what killed the
      kernel first was the misaligned bootmap pte page, resulting in that
      creative crash above.
      
      Anyway, this was a fun bug to track down :-)
      
      I think the moral is that .bss.page_aligned is a dangerous construct in
      its current form, and the symptoms of breakage are very non-trivial, so
      I think we need build-time checks to make sure all symbols in
      .bss.page_aligned are truly page aligned.
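
A build-time check of the kind suggested above could look roughly like the following C11 sketch. It is an assumption about one possible shape, not the check the kernel adopted: the macro and helper names are hypothetical, and the two arrays only mirror the sizes visible in the symbol listing (0x2000 bytes and 8 bytes).

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096u

/* True when an object of this size fills whole pages exactly. */
static int size_is_page_multiple(size_t size)
{
    return size != 0 && size % PAGE_SIZE == 0;
}

/* Compile-time form: the build fails if the object is not whole pages. */
#define ASSERT_PAGE_SIZED(obj) \
    static_assert(sizeof(obj) % PAGE_SIZE == 0, \
                  #obj " is not a whole number of pages")

/* Hypothetical arrays mirroring the sizes seen in the symbol listing. */
unsigned char top_mfn[0x2000];   /* 8192 bytes: two full pages */
unsigned char top_mfn_list[8];   /* 8 bytes: would fail the check */

ASSERT_PAGE_SIZED(top_mfn);
/* ASSERT_PAGE_SIZED(top_mfn_list);  uncommenting this line stops the build */
```

With such a check in place, an 8-byte object placed in a page-aligned section would be rejected at compile time rather than silently shifting every later symbol by 8 bytes.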
      
      The Xen fix below gets the kernel booting again.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  18. 27 May, 2008 5 commits
  19. 23 May, 2008 2 commits