1. 08 Apr, 2013 1 commit
  2. 28 Mar, 2013 1 commit
  3. 28 Feb, 2013 1 commit
    • xen/pat: Disable PAT using pat_enabled value. · c79c4982
      Committed by Konrad Rzeszutek Wilk
      Git commit 8eaffa67
      (xen/pat: Disable PAT support for now) explains in detail why
      we want to disable PAT for now. However, that
      change was not enough: we should also have cleared
      the pat_enabled value. Otherwise we end up with:
      
      mmap-example:3481 map pfn expected mapping type write-back for
      [mem 0x00010000-0x00010fff], got uncached-minus
       ------------[ cut here ]------------
      WARNING: at /build/buildd/linux-3.8.0/arch/x86/mm/pat.c:774 untrack_pfn+0xb8/0xd0()
      ...
      Pid: 3481, comm: mmap-example Tainted: GF 3.8.0-6-generic #13-Ubuntu
      Call Trace:
       [<ffffffff8105879f>] warn_slowpath_common+0x7f/0xc0
       [<ffffffff810587fa>] warn_slowpath_null+0x1a/0x20
       [<ffffffff8104bcc8>] untrack_pfn+0xb8/0xd0
       [<ffffffff81156c1c>] unmap_single_vma+0xac/0x100
       [<ffffffff81157459>] unmap_vmas+0x49/0x90
       [<ffffffff8115f808>] exit_mmap+0x98/0x170
       [<ffffffff810559a4>] mmput+0x64/0x100
       [<ffffffff810560f5>] dup_mm+0x445/0x660
       [<ffffffff81056d9f>] copy_process.part.22+0xa5f/0x1510
       [<ffffffff81057931>] do_fork+0x91/0x350
       [<ffffffff81057c76>] sys_clone+0x16/0x20
       [<ffffffff816ccbf9>] stub_clone+0x69/0x90
       [<ffffffff816cc89d>] ? system_call_fastpath+0x1a/0x1f
      ---[ end trace 4918cdd0a4c9fea4 ]---
      
      (a similar message shows up if you end up launching 'mcelog')
      
      The call chain is (as analyzed by Liu, Jinsong):
      do_fork
        --> copy_process
          --> dup_mm
            --> dup_mmap
              --> copy_page_range
                --> track_pfn_copy
                  --> reserve_pfn_range
                    --> line 624: flags != want_flags
      The mismatch comes from the differing memory types of the page table
      (_PAGE_CACHE_WB) and the MTRR (_PAGE_CACHE_UC_MINUS).
      
      Stefan Bader dug deep into this and found out that:
      "That makes it clearer as this will do
      
      reserve_memtype(...)
      --> pat_x_mtrr_type
        --> mtrr_type_lookup
          --> __mtrr_type_lookup
      
      And that can return -1/0xff in case of MTRR not being enabled/initialized. Which
      is not the case (given there are no messages for it in dmesg). This is not equal
      to MTRR_TYPE_WRBACK and thus becomes _PAGE_CACHE_UC_MINUS.
      
      It looks like the problem starts early in reserve_memtype:
      
              if (!pat_enabled) {
                      /* This is identical to page table setting without PAT */
                      if (new_type) {
                              if (req_type == _PAGE_CACHE_WC)
                                      *new_type = _PAGE_CACHE_UC_MINUS;
                              else
                                      *new_type = req_type & _PAGE_CACHE_MASK;
                      }
                      return 0;
              }
      
      This would be what we want, that is clearing the PWT and PCD flags from the
      supported flags - if pat_enabled is disabled."
      
      This patch does that - disabling PAT.
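      
      A minimal sketch of the kind of change described above (assuming the
      Xen startup path in arch/x86/xen/enlighten.c and the pat_enabled
      variable from arch/x86/mm/pat.c; exact placement is illustrative):
      
      #ifdef CONFIG_X86_PAT
              /*
               * Sketch: clear pat_enabled as well, so that reserve_memtype()
               * takes the !pat_enabled path quoted above instead of
               * consulting the MTRRs.
               */
              pat_enabled = 0;
      #endif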
      
      CC: stable@vger.kernel.org # 3.3 and further
      Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
      Reported-and-Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Reported-and-Tested-by: Stefan Bader <stefan.bader@canonical.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      c79c4982
  4. 23 Feb, 2013 1 commit
    • x86-64, xen, mmu: Provide an early version of write_cr3. · 0cc9129d
      Committed by Konrad Rzeszutek Wilk
      With commit 8170e6be ("x86, 64bit: Use a #PF handler to materialize
      early mappings on demand") we started hitting an early bootup crash
      where the Xen hypervisor would inform us that:
      
          (XEN) d7:v0: unhandled page fault (ec=0000)
          (XEN) Pagetable walk from ffffea000005b2d0:
          (XEN)  L4[0x1d4] = 0000000000000000 ffffffffffffffff
          (XEN) domain_crash_sync called from entry.S
          (XEN) Domain 7 (vcpu#0) crashed on cpu#3:
          (XEN) ----[ Xen-4.2.0  x86_64  debug=n  Not tainted ]----
      
      .. that Xen was unable to context switch back to dom0.
      
      Looking at the calling stack we find:
      
          [<ffffffff8103feba>] xen_get_user_pgd+0x5a  <--
          [<ffffffff8103feba>] xen_get_user_pgd+0x5a
          [<ffffffff81042d27>] xen_write_cr3+0x77
          [<ffffffff81ad2d21>] init_mem_mapping+0x1f9
          [<ffffffff81ac293f>] setup_arch+0x742
          [<ffffffff81666d71>] printk+0x48
      
      We are trying to figure out whether we need to update the user PGD as
      well.  Please keep in mind that under 64-bit PV guests we have a limited
      number of rings: 0 for the hypervisor, and 1 for both the Linux kernel
      and user-space.  As such the Linux pvops'ified version of write_cr3
      checks whether it has to update the user-space cr3 as well.
      
      That is clearly not needed during early bootup.  The recent changes (see
      the git commit above) streamline the x86 page table allocation to be much
      simpler (and, incidentally, the #PF handler ends up in spirit being
      similar to how the Xen toolstack sets up the initial page tables).
      
      The fix is to have an early-bootup version of write_cr3 that just loads
      the kernel %cr3.  The later version - which also handles user-page
      modifications - will be used after the initial page tables have been
      set up.
      
      [ hpa: removed a redundant #ifdef and made the new function __init.
        Also note that x86-32 already has such an early xen_write_cr3. ]
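      
      A condensed sketch of such an early variant (a sketch only, assuming the
      xen_mc_batch/__xen_write_cr3/xen_mc_issue helpers and the per-cpu xen_cr3
      variable in arch/x86/xen/mmu.c; presumably the full xen_write_cr3 is
      re-installed once the initial page tables are set up):
      
      static void __init xen_write_cr3_init(unsigned long cr3)
      {
              BUG_ON(preemptible());
      
              xen_mc_batch();                  /* disables interrupts */
      
              /* Only the kernel cr3 is loaded; no user PGD handling here. */
              this_cpu_write(xen_cr3, cr3);
              __xen_write_cr3(true, cr3);
      
              xen_mc_issue(PARAVIRT_LAZY_CPU); /* interrupts restored */
      }
      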
      Tested-by: "H. Peter Anvin" <hpa@zytor.com>
      Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Link: http://lkml.kernel.org/r/1361579812-23709-1-git-send-email-konrad.wilk@oracle.com
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0cc9129d
  5. 20 Feb, 2013 2 commits
    • xen: Send spinlock IPI to all waiters · 76eaca03
      Committed by Stefan Bader
      There is a loophole between Xen's current implementation of
      pv-spinlocks and the scheduler. This was triggerable through
      a testcase until v3.6 changed the TLB flushing code. The
      problem is potentially still there, just not observable in the
      same way.
      
      What could happen was (is):
      
      1. CPU n tries to schedule task x away and goes into a slow
         wait for the runq lock of CPU n-# (must be one with a lower
         number).
      2. CPU n-#, while processing softirqs, tries to balance domains
         and goes into a slow wait for its own runq lock (for updating
         some records). Since this is a spin_lock_irqsave in softirq
         context, interrupts will be re-enabled for the duration of
         the poll_irq hypercall used by Xen.
      3. Before the runq lock of CPU n-# is unlocked, CPU n-1 receives
         an interrupt (e.g. endio) and when processing the interrupt,
         tries to wake up task x. But that is in schedule and still
         on_cpu, so try_to_wake_up goes into a tight loop.
      4. The runq lock of CPU n-# gets unlocked, but the message only
         gets sent to the first waiter, which is CPU n-# and that is
         busily stuck.
      5. CPU n-# never returns from the nested interruption to take and
         release the lock because the scheduler uses a busy wait.
         And CPU n never finishes the task migration because the unlock
         notification only went to CPU n-#.
      
      To avoid this, and since the unlocking code has no real sense of
      which waiter is best suited to grab the lock, just send the IPI
      to all of them. This causes the waiters to return from the
      hypercall (those not interrupted at least) and do active spinlocking.
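      
      A sketch of the slow unlock path with the change applied (based on the
      pre-ticketlock pv-spinlock code in arch/x86/xen/spinlock.c; the
      lock_spinners per-cpu variable and IPI helper names are taken from that
      file and may differ in detail):
      
      static noinline void xen_spin_unlock_slow(struct xen_spinlock *xl)
      {
              int cpu;
      
              for_each_online_cpu(cpu) {
                      if (per_cpu(lock_spinners, cpu) == xl) {
                              xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
                              /* no break: kick every waiter, not just the first */
                      }
              }
      }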
      
      BugLink: http://bugs.launchpad.net/bugs/1011792
      Acked-by: Jan Beulich <JBeulich@suse.com>
      Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      76eaca03
    • xen/smp: Move the common CPU init code a bit to prep for PVH patch. · dacd45f4
      Committed by Konrad Rzeszutek Wilk
      The PV and PVH CPU init code share some functionality. The
      PVH patch ("xen/pvh: Extend vcpu_guest_context, p2m, event, and XenBus")
      sets some of this up, but not all. To make the code easier to read, this
      patch moves the PV-specific parts out of the generic path.
      
      No functional change - just code movement.
      Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      [v2: Fixed compile errors noticed by Fengguang Wu build system]
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      dacd45f4
  6. 15 Feb, 2013 2 commits
  7. 14 Feb, 2013 1 commit
    • x86/xen: don't assume %ds is usable in xen_iret for 32-bit PVOPS. · 13d2b4d1
      Committed by Jan Beulich
      This fixes CVE-2013-0228 / XSA-42
      
      Drew Jones, while working on CVE-2013-0190, found that an unprivileged
      user in a 32-bit PV guest can crash the guest with a panic like this:
      
      -------------
      general protection fault: 0000 [#1] SMP
      last sysfs file: /sys/devices/vbd-51712/block/xvda/dev
      Modules linked in: sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4
      iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6
      xt_state nf_conntrack ip6table_filter ip6_tables ipv6 xen_netfront ext4
      mbcache jbd2 xen_blkfront dm_mirror dm_region_hash dm_log dm_mod [last
      unloaded: scsi_wait_scan]
      
      Pid: 1250, comm: r Not tainted 2.6.32-356.el6.i686 #1
      EIP: 0061:[<c0407462>] EFLAGS: 00010086 CPU: 0
      EIP is at xen_iret+0x12/0x2b
      EAX: eb8d0000 EBX: 00000001 ECX: 08049860 EDX: 00000010
      ESI: 00000000 EDI: 003d0f00 EBP: b77f8388 ESP: eb8d1fe0
       DS: 0000 ES: 007b FS: 0000 GS: 00e0 SS: 0069
      Process r (pid: 1250, ti=eb8d0000 task=c2953550 task.ti=eb8d0000)
      Stack:
       00000000 0027f416 00000073 00000206 b77f8364 0000007b 00000000 00000000
      Call Trace:
      Code: c3 8b 44 24 18 81 4c 24 38 00 02 00 00 8d 64 24 30 e9 03 00 00 00
      8d 76 00 f7 44 24 08 00 00 02 80 75 33 50 b8 00 e0 ff ff 21 e0 <8b> 40
      10 8b 04 85 a0 f6 ab c0 8b 80 0c b0 b3 c0 f6 44 24 0d 02
      EIP: [<c0407462>] xen_iret+0x12/0x2b SS:ESP 0069:eb8d1fe0
      general protection fault: 0000 [#2]
      ---[ end trace ab0d29a492dcd330 ]---
      Kernel panic - not syncing: Fatal exception
      Pid: 1250, comm: r Tainted: G      D    ---------------
      2.6.32-356.el6.i686 #1
      Call Trace:
       [<c08476df>] ? panic+0x6e/0x122
       [<c084b63c>] ? oops_end+0xbc/0xd0
       [<c084b260>] ? do_general_protection+0x0/0x210
       [<c084a9b7>] ? error_code+0x73/
      -------------
      
      Petr says: "
       I've analysed the bug and I think that xen_iret() cannot cope with
       mangled DS, in this case zeroed out (null selector/descriptor) by either
       xen_failsafe_callback() or RESTORE_REGS because the corresponding LDT
       entry was invalidated by the reproducer. "
      
      Jan took a look at the preliminary patch and came up with a fix that
      solves this problem:
      
      "This code gets called after all registers other than those handled by
      IRET got already restored, hence a null selector in %ds or a non-null
      one that got loaded from a code or read-only data descriptor would
      cause a kernel mode fault (with the potential of crashing the kernel
      as a whole, if panic_on_oops is set)."
      
      The way to fix this is to realize that we can only rely on the
      registers that IRET restores. The two that are guaranteed are
      %cs and %ss, as they are always fixed GDT selectors. Also, they are
      inaccessible from user mode - so they cannot be altered. This is
      the approach taken in this patch.
      
      Another alternative suggested by Jan would be to rely on
      the subtle fact that %ebp- or %esp-relative references use
      the %ss segment.  In that case we could switch from using %eax to %ebp and
      would not need the %ss overrides. That would also require one extra
      instruction to compensate for the one place where the register is used
      as a scaled index. However, Andrew pointed out that this is too subtle, and
      if further work were done in this code path it could escape folks'
      attention and lead to accidents.
      Reviewed-by: Petr Matousek <pmatouse@redhat.com>
      Reported-by: Petr Matousek <pmatouse@redhat.com>
      Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
      Signed-off-by: Jan Beulich <jbeulich@suse.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      13d2b4d1
  8. 10 Feb, 2013 2 commits
    • x86 idle: remove 32-bit-only "no-hlt" parameter, hlt_works_ok flag · 27be4570
      Committed by Len Brown
      Remove the 32-bit x86 cmdline parameter "no-hlt"
      and the cpuinfo_x86.hlt_works_ok flag that it sets.
      
      If a user wants to avoid HLT, then "idle=poll"
      is much more useful, as it avoids invocation of HLT
      in idle, while "no-hlt" failed to do so.
      
      Indeed, hlt_works_ok was consulted in only 3 places.
      
      First, in /proc/cpuinfo where "hlt_bug yes"
      would be printed if and only if the user booted
      the system with "no-hlt" -- as there was no other code
      to set that flag.
      
      Second, check_hlt() would not invoke halt() if "no-hlt"
      were on the cmdline.
      
      Third, it was consulted in stop_this_cpu(), which is invoked
      by native_machine_halt()/reboot_interrupt()/smp_stop_nmi_callback() --
      all cases where the machine is being shutdown/reset.
      The flag was not consulted in the more frequently invoked
      play_dead()/hlt_play_dead() used in processor offline and suspend.
      
      Since Linux-3.0 there has been a run-time notice upon "no-hlt" invocations
      indicating that it would be removed in 2012.
      Signed-off-by: Len Brown <len.brown@intel.com>
      Cc: x86@kernel.org
      27be4570
    • xen idle: make xen-specific macro xen-specific · 6a377ddc
      Committed by Len Brown
      This macro is only invoked by Xen,
      so make its definition specific to Xen.
      
      > set_pm_idle_to_default()
      < xen_set_default_idle()
      Signed-off-by: Len Brown <len.brown@intel.com>
      Cc: xen-devel@lists.xensource.com
      6a377ddc
  9. 24 Jan, 2013 1 commit
  10. 16 Jan, 2013 1 commit
  11. 18 Dec, 2012 2 commits
  12. 01 Dec, 2012 1 commit
  13. 30 Nov, 2012 1 commit
  14. 29 Nov, 2012 4 commits
  15. 27 Nov, 2012 1 commit
  16. 18 Nov, 2012 1 commit
  17. 02 Nov, 2012 1 commit
    • xen PVonHVM: use E820_Reserved area for shared_info · 9d02b43d
      Committed by Olaf Hering
      This is a respin of 00e37bdb
      ("xen PVonHVM: move shared_info to MMIO before kexec").
      
      Currently kexec in a PVonHVM guest fails with a triple fault because the
      new kernel overwrites the shared info page. The exact failure depends on
      the size of the kernel image. This patch moves the pfn from RAM into an
      E820 reserved memory area.
      
      The pfn containing the shared_info is located somewhere in RAM. This will
      cause trouble if the current kernel is doing a kexec boot into a new
      kernel. The new kernel (and its startup code) cannot know where the pfn
      is, so it cannot reserve the page. The hypervisor will continue to update
      the pfn, and as a result memory corruption occurs in the new kernel.
      
      The toolstack marks the memory area FC000000-FFFFFFFF as reserved in the
      E820 map. Within that range newer toolstacks (4.3+) will keep 1MB
      starting from FE700000 as reserved for guest use. Older Xen4 toolstacks
      will usually not allocate areas up to FE700000, so FE700000 is expected
      to work also with older toolstacks.
      
      In Xen3 there is no reserved area at a fixed location. If the guest is
      started on such old hosts the shared_info page will be placed in RAM. As
      a result kexec cannot be used.
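      
      A sketch of how the shared_info pfn gets placed into the reserved area
      (using the standard XENMEM_add_to_physmap interface; the
      XEN_HVM_SHARED_INFO_ADDR constant standing in for 0xFE700000 is
      hypothetical):
      
              struct xen_add_to_physmap xatp;
      
              xatp.domid = DOMID_SELF;
              xatp.idx = 0;
              xatp.space = XENMAPSPACE_shared_info;
              /* gpfn now points into the E820-reserved area, not into RAM */
              xatp.gpfn = XEN_HVM_SHARED_INFO_ADDR >> PAGE_SHIFT;
              if (HYPERVISOR_memory_op(XENMEM_add_to_physmap, &xatp))
                      BUG();
      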
      Signed-off-by: Olaf Hering <olaf@aepfle.de>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      9d02b43d
  18. 01 Nov, 2012 1 commit
  19. 20 Oct, 2012 1 commit
  20. 12 Oct, 2012 2 commits
    • xen/bootup: allow {read|write}_cr8 pvops call. · 1a7bbda5
      Committed by Konrad Rzeszutek Wilk
      We actually do not do anything about it. Just return a default
      value of zero and if the kernel tries to write anything but 0
      we BUG_ON.
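      
      A sketch of what that amounts to (illustrative; hooked into the Xen
      pv_cpu_ops structure in arch/x86/xen/enlighten.c):
      
      static unsigned long xen_read_cr8(void)
      {
              return 0;
      }
      
      static void xen_write_cr8(unsigned long val)
      {
              BUG_ON(val);
      }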
      
      This fixes the case when a user tries to suspend the machine
      and it blows up in save_processor_state because 'read_cr8' is set
      to NULL and we get:
      
      kernel BUG at /home/konrad/ssd/linux/arch/x86/include/asm/paravirt.h:100!
      invalid opcode: 0000 [#1] SMP
      Pid: 2687, comm: init.late Tainted: G           O 3.6.0upstream-00002-gac264ac-dirty #4 Bochs Bochs
      RIP: e030:[<ffffffff814d5f42>]  [<ffffffff814d5f42>] save_processor_state+0x212/0x270
      
      .. snip..
      Call Trace:
       [<ffffffff810733bf>] do_suspend_lowlevel+0xf/0xac
       [<ffffffff8107330c>] ? x86_acpi_suspend_lowlevel+0x10c/0x150
       [<ffffffff81342ee2>] acpi_suspend_enter+0x57/0xd5
      
      CC: stable@vger.kernel.org
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      1a7bbda5
    • xen/bootup: allow read_tscp call for Xen PV guests. · cd0608e7
      Committed by Konrad Rzeszutek Wilk
      The hypervisor will trap it. However, without this patch
      we would crash, as .read_tscp is set to NULL. This patch
      fixes that by setting it to the native_read_tscp call.
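      
      The change is essentially one pvops hook (a sketch; the hook lives in
      the Xen pv_cpu_ops structure in arch/x86/xen/enlighten.c):
      
              .read_tscp = native_read_tscp,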
      
      CC: stable@vger.kernel.org
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      cd0608e7
  21. 09 Oct, 2012 1 commit
    • mm: kill vma flag VM_RESERVED and mm->reserved_vm counter · 314e51b9
      Committed by Konstantin Khlebnikov
      A long time ago, in v2.4, VM_RESERVED kept the swapout process off a VMA;
      it has since lost its original meaning but still has some effects:
      
       | effect                 | alternative flags
      -+------------------------+---------------------------------------------
      1| account as reserved_vm | VM_IO
      2| skip in core dump      | VM_IO, VM_DONTDUMP
      3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
      4| do not mlock           | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
      
      This patch removes the reserved_vm counter from mm_struct.  Seems like
      nobody cares about it; it is not exported to userspace directly, it only
      reduces the total_vm shown in /proc.
      
      Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP.
      
      remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
      remap_vmalloc_range() set VM_DONTEXPAND | VM_DONTDUMP.
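      
      For a driver's mmap handler the conversion looks roughly like this
      (an illustrative sketch, not taken from a specific file in the patch):
      
              /* before */
              vma->vm_flags |= VM_RESERVED;
      
              /* after: pick the flags matching the behaviour relied upon */
              vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;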
      
      [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      314e51b9
  22. 04 Oct, 2012 1 commit
    • xen pv-on-hvm: add pfn_is_ram helper for kdump · 34b6f01a
      Committed by Olaf Hering
      Register a pfn_is_ram helper to speed up reading /proc/vmcore in the kdump
      kernel. See the commit message of 997c136f ("fs/proc/vmcore.c: add hook
      to read_from_oldmem() to check for non-ram pages") for details.
      
      It makes use of a new hvmop HVMOP_get_mem_type which was introduced in
      xen 4.2 (23298:26413986e6e0) and backported to 4.1.1.
      
      The new function is currently only enabled for reading /proc/vmcore.
      Later it will be used also for the kexec kernel. Since that requires
      more changes in the generic kernel make it static for the time being.
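      
      A sketch of the helper and its registration (following the interfaces
      named above - HVMOP_get_mem_type and the register_oldmem_pfn_is_ram()
      hook from fs/proc/vmcore.c; error handling abbreviated):
      
      static int xen_oldmem_pfn_is_ram(unsigned long pfn)
      {
              struct xen_hvm_get_mem_type a = {
                      .domid = DOMID_SELF,
                      .pfn = pfn,
              };
      
              if (HYPERVISOR_hvm_op(HVMOP_get_mem_type, &a))
                      return -ENXIO;
      
              /* Only pages emulated by the device model are not RAM. */
              return a.mem_type != HVMMEM_mmio_dm;
      }
      
      /* during PV-on-HVM init */
      register_oldmem_pfn_is_ram(&xen_oldmem_pfn_is_ram);
      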
      Signed-off-by: Olaf Hering <olaf@aepfle.de>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      34b6f01a
  23. 24 Sep, 2012 3 commits
    • xen/vga: add the xen EFI video mode support · aa387d63
      Committed by Jan Beulich
      In order to add Xen EFI framebuffer video support, we need to handle
      xen-efi's new video type (XEN_VGATYPE_EFI_LFB) as a new case in
      xen_init_vga and set the video type to VIDEO_TYPE_EFI to enable
      EFI video mode.
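      
      A sketch of the xen_init_vga change (assuming the switch over
      info->video_type in arch/x86/xen/vga.c; the surrounding VESA LFB
      handling is omitted):
      
              case XEN_VGATYPE_EFI_LFB:
                      screen_info->orig_video_isVGA = VIDEO_TYPE_EFI;
                      break;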
      
      The original patch from which this was broken out:
       http://marc.info/?i=4E099AA6020000780004A4C6@nat28.tlf.novell.com
      Signed-off-by: Jan Beulich <JBeulich@novell.com>
      Signed-off-by: Tang Liang <liang.tang@oracle.com>
      [v2: The original author is Jan Beulich and Liang Tang ported it to upstream]
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      aa387d63
    • xen/x86: retrieve keyboard shift status flags from hypervisor. · ffb8b233
      Committed by Konrad Rzeszutek Wilk
      The xen c/s 25873 allows the hypervisor to retrieve the NUMLOCK flag.
      With this patch, the Linux kernel can get the state according to the
      data in the BIOS.
      Acked-by: Jan Beulich <jbeulich@suse.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      ffb8b233
    • xen/boot: Disable NUMA for PV guests. · 8d54db79
      Committed by Konrad Rzeszutek Wilk
      The hypervisor is in charge of allocating the proper "NUMA" memory
      and dealing with the CPU scheduler to keep them bound to the proper
      NUMA node. The PV guests (and PVHVM) have no inkling of where they
      run and do not need to know that right now. In the future we will
      need to inject NUMA configuration data (if a guest spans two or more
      NUMA nodes) so that the kernel can make the right choices. But those
      patches are not yet present.
      
      In the meantime, disable the NUMA capability in the PV guest, which
      also fixes a bootup issue. Andre says:
      
      "we see Dom0 crashes due to the kernel detecting the NUMA topology not
      by ACPI, but directly from the northbridge (CONFIG_AMD_NUMA).
      
      This will detect the actual NUMA config of the physical machine, but
      will crash about the mismatch with Dom0's virtual memory. Variation of
      the theme: Dom0 sees what it's not supposed to see.
      
      This happens with the said config option enabled and on a machine where
      this scanning is still enabled (K8 and Fam10h, not Bulldozer class)
      
      We have this dump then:
      NUMA: Warning: node ids are out of bound, from=-1 to=-1 distance=10
      Scanning NUMA topology in Northbridge 24
      Number of physical nodes 4
      Node 0 MemBase 0000000000000000 Limit 0000000040000000
      Node 1 MemBase 0000000040000000 Limit 0000000138000000
      Node 2 MemBase 0000000138000000 Limit 00000001f8000000
      Node 3 MemBase 00000001f8000000 Limit 0000000238000000
      Initmem setup node 0 0000000000000000-0000000040000000
        NODE_DATA [000000003ffd9000 - 000000003fffffff]
      Initmem setup node 1 0000000040000000-0000000138000000
        NODE_DATA [0000000137fd9000 - 0000000137ffffff]
      Initmem setup node 2 0000000138000000-00000001f8000000
        NODE_DATA [00000001f095e000 - 00000001f0984fff]
      Initmem setup node 3 00000001f8000000-0000000238000000
      Cannot find 159744 bytes in node 3
      BUG: unable to handle kernel NULL pointer dereference at (null)
      IP: [<ffffffff81d220e6>] __alloc_bootmem_node+0x43/0x96
      Pid: 0, comm: swapper Not tainted 3.3.6 #1 AMD Dinar/Dinar
      RIP: e030:[<ffffffff81d220e6>]  [<ffffffff81d220e6>] __alloc_bootmem_node+0x43/0x96
      .. snip..
        [<ffffffff81d23024>] sparse_early_usemaps_alloc_node+0x64/0x178
        [<ffffffff81d23348>] sparse_init+0xe4/0x25a
        [<ffffffff81d16840>] paging_init+0x13/0x22
        [<ffffffff81d07fbb>] setup_arch+0x9c6/0xa9b
        [<ffffffff81683954>] ? printk+0x3c/0x3e
        [<ffffffff81d01a38>] start_kernel+0xe5/0x468
        [<ffffffff81d012cf>] x86_64_start_reservations+0xba/0xc1
        [<ffffffff81007153>] ? xen_setup_runstate_info+0x2c/0x36
        [<ffffffff81d050ee>] xen_start_kernel+0x565/0x56c
      "
      
      so we just disable NUMA scanning by setting numa_off=1.
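      
      A minimal sketch of the change, based only on the description above
      (placed in the Xen PV startup path; the exact placement and guards
      are illustrative):
      
      #ifdef CONFIG_NUMA
              /* Keep the PV guest from scanning the host's NUMA topology. */
              numa_off = 1;
      #endif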
      
      CC: stable@vger.kernel.org
      Reported-and-Tested-by: Andre Przywara <andre.przywara@amd.com>
      Acked-by: Andre Przywara <andre.przywara@amd.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      8d54db79
  24. 20 Sep, 2012 1 commit
    • xen/boot: Disable BIOS SMP MP table search. · bd49940a
      Committed by Konrad Rzeszutek Wilk
      As the initial domain we are able to search/map certain regions
      of memory to harvest configuration data. For all low-level data we
      use ACPI tables - for interrupts we use exclusively the ACPI _PRT
      (so DSDT) and the MADT for INT_SRC_OVR.
      
      The SMP MP table is not used at all. As a matter of fact we do
      not even support machines that only have SMP MP but no ACPI tables.
      
      Let's follow how Moorestown does it and just disable searching
      for BIOS SMP tables.
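      
      Moorestown-style disabling of the MP table scan amounts to stubbing
      out the mpparse hooks (a sketch, assuming the x86_init_noop /
      x86_init_uint_noop stubs used elsewhere in arch/x86):
      
              /* Avoid searching for BIOS MP tables */
              x86_init.mpparse.find_smp_config = x86_init_noop;
              x86_init.mpparse.get_smp_config = x86_init_uint_noop;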
      
      This also fixes an issue on HP Proliant BL680c G5 and DL380 G6:
      
      9f->100 for 1:1 PTE
      Freeing 9f-100 pfn range: 97 pages freed
      1-1 mapping on 9f->100
      .. snip..
      e820: BIOS-provided physical RAM map:
      Xen: [mem 0x0000000000000000-0x000000000009efff] usable
      Xen: [mem 0x000000000009f400-0x00000000000fffff] reserved
      Xen: [mem 0x0000000000100000-0x00000000cfd1dfff] usable
      .. snip..
      Scan for SMP in [mem 0x00000000-0x000003ff]
      Scan for SMP in [mem 0x0009fc00-0x0009ffff]
      Scan for SMP in [mem 0x000f0000-0x000fffff]
      found SMP MP-table at [mem 0x000f4fa0-0x000f4faf] mapped at [ffff8800000f4fa0]
      (XEN) mm.c:908:d0 Error getting mfn 100 (pfn 5555555555555555) from L1 entry 0000000000100461 for l1e_owner=0, pg_owner=0
      (XEN) mm.c:4995:d0 ptwr_emulate: could not get_page_from_l1e()
      BUG: unable to handle kernel NULL pointer dereference at           (null)
      IP: [<ffffffff81ac07e2>] xen_set_pte_init+0x66/0x71
      . snip..
      Pid: 0, comm: swapper Not tainted 3.6.0-rc6upstream-00188-gb6fb969-dirty #2 HP ProLiant BL680c G5
      .. snip..
      Call Trace:
       [<ffffffff81ad31c6>] __early_ioremap+0x18a/0x248
       [<ffffffff81624731>] ? printk+0x48/0x4a
       [<ffffffff81ad32ac>] early_ioremap+0x13/0x15
       [<ffffffff81acc140>] get_mpc_size+0x2f/0x67
       [<ffffffff81acc284>] smp_scan_config+0x10c/0x136
       [<ffffffff81acc2e4>] default_find_smp_config+0x36/0x5a
       [<ffffffff81ac3085>] setup_arch+0x5b3/0xb5b
       [<ffffffff81624731>] ? printk+0x48/0x4a
       [<ffffffff81abca7f>] start_kernel+0x90/0x390
       [<ffffffff81abc356>] x86_64_start_reservations+0x131/0x136
       [<ffffffff81abfa83>] xen_start_kernel+0x65f/0x661
      (XEN) Domain 0 crashed: 'noreboot' set - not rebooting.
      
      which is that ioremap would end up mapping 0xff using _PAGE_IOMAP
      (which is what early_ioremap sticks as a flag) - which meant
      we would get MFN 0xFF (pte ff461, which is OK), and then it would
      also map 0x100 (because ioremap tries to get a page-aligned request, and
      it was trying to map 0xf4fa0 + PAGE_SIZE - so it mapped the next page)
      as _PAGE_IOMAP. Since 0x100 is actually a RAM page, and the _PAGE_IOMAP
      bypasses the P2M lookup, we would happily set the PTE to 1000461.
      Xen would deny the request since we do not have access to the
      Machine Frame Number (MFN) of 0x100. The P2M[0x100] is for example
      0x80140.
      
      CC: stable@vger.kernel.org
      Fixes-Oracle-Bugzilla: https://bugzilla.oracle.com/bugzilla/show_bug.cgi?id=13665
      Acked-by: Jan Beulich <jbeulich@suse.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      bd49940a
  25. 18 Sep, 2012 2 commits
  26. 12 Sep, 2012 4 commits