1. 25 Mar, 2008 5 commits
    • KVM: VMX: convert init_rmode_tss() to slots_lock · 707a18a5
      Authored by Marcelo Tosatti
      init_rmode_tss was forgotten during the conversion from mmap_sem to
      slots_lock.
      
      INFO: task qemu-system-x86:3748 blocked for more than 120 seconds.
      Call Trace:
       [<ffffffff8053d100>] __down_read+0x86/0x9e
       [<ffffffff8053fb43>] do_page_fault+0x346/0x78e
       [<ffffffff8053d235>] trace_hardirqs_on_thunk+0x35/0x3a
       [<ffffffff8053dcad>] error_exit+0x0/0xa9
       [<ffffffff8035a7a7>] copy_user_generic_string+0x17/0x40
       [<ffffffff88099a8a>] :kvm:kvm_write_guest_page+0x3e/0x5f
       [<ffffffff880b661a>] :kvm_intel:init_rmode_tss+0xa7/0xf9
       [<ffffffff880b7d7e>] :kvm_intel:vmx_vcpu_reset+0x10/0x38a
       [<ffffffff8809b9a5>] :kvm:kvm_arch_vcpu_setup+0x20/0x53
       [<ffffffff8809a1e4>] :kvm:kvm_vm_ioctl+0xad/0x1cf
       [<ffffffff80249dea>] __lock_acquire+0x4f7/0xc28
       [<ffffffff8028fad9>] vfs_ioctl+0x21/0x6b
       [<ffffffff8028fd75>] do_vfs_ioctl+0x252/0x26b
       [<ffffffff8028fdca>] sys_ioctl+0x3c/0x5e
       [<ffffffff8020b01b>] system_call_after_swapgs+0x7b/0x80
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Avi Kivity <avi@qumranet.com>
    • KVM: MMU: handle page removal with shadow mapping · 15aaa819
      Authored by Marcelo Tosatti
      Do not assume that a shadow mapping will always point to the same host
      frame number.  Fixes crash with madvise(MADV_DONTNEED).
      
      [avi: move after first printk(), add another printk()]
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Avi Kivity <avi@qumranet.com>
    • KVM: MMU: Fix is_rmap_pte() with io ptes · 4b1a80fa
      Authored by Avi Kivity
      is_rmap_pte() doesn't take into account io ptes, which have the avail bit set.
      Signed-off-by: Avi Kivity <avi@qumranet.com>
    • KVM: VMX: Restore tss even on x86_64 · 5dc83262
      Authored by Avi Kivity
      The vmx hardware state restore restores the tss selector and base address, but
      not its length.  Usually, this does not matter since most of the tss contents
      is within the default length of 0x67.  However, if a process is using ioperm()
      to grant itself I/O port permissions, an additional bitmap within the tss,
      but outside the default length is consulted.  The effect is that the process
      will receive a SIGSEGV instead of transparently accessing the port.
      
      Fix by restoring the tss length.  Note that i386 had this working already.
      
      Closes bugzilla 10246.
      Signed-off-by: Avi Kivity <avi@qumranet.com>
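The TSS-limit effect described in the commit above can be illustrated with a small sketch. It is not taken from the patch: the `demo_` names and the bitmap placement at offset 0x68 are assumptions made for the example. It only models the rule that the I/O permission bitmap bytes consulted for a port must lie within the TSS segment limit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values, not taken from the patch. */
#define DEMO_TSS_DEFAULT_LIMIT 0x67u  /* 104-byte hardware TSS, limit = size - 1 */
#define DEMO_IO_BITMAP_BASE    0x68u  /* assumed: bitmap placed right after the TSS */

/*
 * The CPU reads two consecutive bytes of the I/O permission bitmap for a
 * port access, so both bytes must lie at or below the segment limit.
 */
static bool demo_io_bitmap_reachable(uint32_t tss_limit, uint16_t port)
{
    uint32_t last_byte = DEMO_IO_BITMAP_BASE + port / 8u + 1u;

    return last_byte <= tss_limit;
}
```

With the default limit the bitmap bytes for any port are out of reach, which corresponds to the fault (and SIGSEGV) the commit message describes; with a limit that covers the bitmap, the lookup succeeds.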
    • x86-32: Pass the full resource data to ioremap() · b9e76a00
      Authored by Linus Torvalds
      It appears that 64-bit PCI resources cannot possibly ever have worked on
      x86-32 even when the RESOURCES_64BIT config option was set, because any
      driver that tried to [pci_]ioremap() the resource would have been unable
      to do so because the high 32 bits would have been silently dropped on
      the floor by the ioremap() routines that only used "unsigned long".
      
      Change them to use "resource_size_t" instead, which properly encodes the
      whole 64-bit resource data if RESOURCES_64BIT is enabled.
      Acked-by: H. Peter Anvin <hpa@kernel.org>
      Acked-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
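The truncation described in the commit above can be reproduced in isolation. This is a minimal sketch with hypothetical `demo_` names: fixed-width types stand in for "unsigned long" on x86-32 and for `resource_size_t` with RESOURCES_64BIT enabled, so the behaviour is the same on any build host.

```c
#include <stdint.h>

typedef uint32_t demo_ulong32_t;        /* stand-in for "unsigned long" on x86-32 */
typedef uint64_t demo_resource_size_t;  /* stand-in for a 64-bit resource_size_t */

/* Old-style parameter: the high 32 bits are silently dropped. */
static demo_ulong32_t demo_ioremap_arg_old(demo_resource_size_t phys)
{
    return (demo_ulong32_t)phys;
}

/* New-style parameter: the full 64-bit address survives the call. */
static demo_resource_size_t demo_ioremap_arg_new(demo_resource_size_t phys)
{
    return phys;
}
```

A resource at 0x1fe000000 would reach the old-style routine as 0xfe000000, which is a different physical address.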
  2. 23 Mar, 2008 1 commit
    • x86: revert: reserve dma32 early for gart · 9e963048
      Authored by Thomas Gleixner
      Revert
      
      commit f62f1fc9
      Author: Yinghai Lu <yhlu.kernel@gmail.com>
      Date:   Fri Mar 7 15:02:50 2008 -0800
      
          x86: reserve dma32 early for gart
      
      The patch has a dependency on bootmem modifications which are not .25
      material that late in the -rc cycle. The problem which is addressed by
      the patch is limited to machines with 256G and more memory booted with
      NUMA disabled. This is not a .25 regression and the audience which is
      affected by this problem is very limited, so it's safer to do the
      revert than pulling in intrusive bootmem changes right now.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  3. 22 Mar, 2008 11 commits
  4. 13 Mar, 2008 1 commit
  5. 12 Mar, 2008 3 commits
    • x86: remove quicklists · 985a34bd
      Authored by Thomas Gleixner
      quicklists cause a serious memory leak on 32-bit x86,
      as documented at:
      
        http://bugzilla.kernel.org/show_bug.cgi?id=9991
      
      the reason is that the quicklist pool is a special-purpose
      cache that grows out of proportion. It is not accounted for
      anywhere and users have no way to even realize that it's
      the quicklists that are causing RAM usage spikes. It was
      supposed to be a relatively small pool, but as demonstrated
      by KOSAKI Motohiro, they can grow as large as:
      
        Quicklists:    1194304 kB
      
      given how much trouble this code has caused historically,
      and given that Andrew objected to its introduction on x86
      (years ago), the best option at this point is to remove them.
      
      [ any performance benefits of caching constructed pgds should
        be implemented in a more generic way (possibly within the page
        allocator), while still allowing constructed pages to be
        allocated by other workloads. ]
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86: ia32 syscall restart fix · 40f0933d
      Authored by Roland McGrath
      The code to restart syscalls after signals depends on checking for a
      negative orig_ax, and for particular negative -ERESTART* values in ax.
      These fields are 64 bits and for a 32-bit task they get zero-extended.
      The syscall restart behavior is lost, a regression from a native 32-bit
      kernel and from 64-bit tasks' behavior.
      
      This patch fixes the problem by doing sign-extension where it matters.
      
      For orig_ax, the only time the value should be -1 but winds up as
      0x0ffffffff is via a 32-bit ptrace call. So the patch changes ptrace to
      sign-extend the 32-bit orig_eax value when it's stored; it doesn't
      change the checks on orig_ax, though it uses the new current_syscall()
      inline to better document the subtle importance of the use of
      signedness there.
      
      The ax value is stored a lot of ways and it seems hard to get them all
      sign-extended at their origins. So for that, we use the
      current_syscall_ret() to sign-extend it only for 32-bit tasks at the
      time of the -ERESTART* comparisons.
      Signed-off-by: Roland McGrath <roland@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
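The difference between zero-extension and sign-extension in the commit above can be shown with a few lines of C. This is only a sketch: the `demo_` helpers are hypothetical and are not the kernel's `current_syscall_ret()`. The constant 512 matches the kernel's ERESTARTSYS value.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_ERESTARTSYS 512  /* same numeric value as the kernel's ERESTARTSYS */

/* What a 64-bit comparison sees when the 32-bit ax was zero-extended. */
static bool demo_restart_zero_extended(uint32_t eax)
{
    int64_t ax = (int64_t)(uint64_t)eax;

    return ax == -DEMO_ERESTARTSYS;
}

/*
 * Sign-extend the low 32 bits before comparing. The uint32_t to int32_t
 * conversion wraps to the two's-complement value on common compilers.
 */
static bool demo_restart_sign_extended(uint32_t eax)
{
    int64_t ax = (int64_t)(int32_t)eax;

    return ax == -DEMO_ERESTARTSYS;
}
```

For a 32-bit task returning -ERESTARTSYS, only the sign-extended comparison matches, which is why the restart logic was lost before the fix.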
    • x86: ioremap, remove WARN_ON() · 9a46d7e5
      Authored by Ingo Molnar
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  6. 11 Mar, 2008 3 commits
    • fix BIOS PCI config cycle buglet causing ACPI boot regression · f5dbb55b
      Authored by Ingo Molnar
      I figured out another ACPI related regression today.
      
      randconfig testing triggered an early boot-time hang on a laptop of mine
      (32-bit x86, config attached) - the screen was scrolling ACPI AML
      exceptions [with no serial port and no early debugging available].
      
      v2.6.24 works fine on that laptop with the same .config, so after a few
      hours of bisection (had to restart it 3 times - other regressions
      interacted), it honed in on this commit:
      
      | 10270d48 is first bad commit
      |
      | Author: Linus Torvalds <torvalds@woody.linux-foundation.org>
      | Date:   Wed Feb 13 09:56:14 2008 -0800
      |
      |     acpi: fix acpi_os_read_pci_configuration() misuse of raw_pci_read()
      
      reverting this commit on top of -rc5 gave a correctly booting kernel.
      
      But this commit fixes a real bug so the real question is, why did it
      break the bootup?
      
      After quite some head-scratching, the following change stood out:
      
      -                               pci_id->bus = tu8;
      +                               pci_id->bus = val;
      
      pci_id->bus is defined as u16:
      
         struct acpi_pci_id {
                 u16 segment;
                 u16 bus;
         ...
      
      and 'tu8' changed from u8 to u32. So previously we'd unconditionally
      mask the return value of acpi_os_read_pci_configuration()
      (raw_pci_read()) to 8 bits, but now we just trust whatever comes back
      from the PCI access routines and only crop it to 16 bits.
      
      But if the high 8 bits of that result contains any noise then we'll
      write that into ACPI's PCI ID descriptor and confuse the heck out of the
      rest of ACPI.
      
      So let's check the PCI-BIOS code on that theory. We have this codepath
      for 8-bit accesses (arch/x86/pci/pcbios.c:pci_bios_read()):
      
              switch (len) {
              case 1:
                      __asm__("lcall *(%%esi); cld\n\t"
                              "jc 1f\n\t"
                              "xor %%ah, %%ah\n"
                              "1:"
                              : "=c" (*value),
                                "=a" (result)
                              : "1" (PCIBIOS_READ_CONFIG_BYTE),
                                "b" (bx),
                                "D" ((long)reg),
                                "S" (&pci_indirect));
      
      Aha! The "=c" output constraint puts the full 32 bits of ECX into
      *value. But if the BIOS's routines set any of the high bits to nonzero,
      we'll return a value with more set in it than intended.
      
      The other, more common PCI access methods (v1 and v2 PCI reads) clear
      out the high bits already, for example pci_conf1_read() does:
      
              switch (len) {
              case 1:
                      *value = inb(0xCFC + (reg & 3));
      
      which explicitly converts the return byte up to 32 bits and zero-extends
      it.
      
      So zero-extending the result in the PCI-BIOS read routine fixes the
      regression on my laptop. ( It might fix some other long-standing issues
      we had with PCI-BIOS during the past decade ... ) Both 8-bit and 16-bit
      accesses were buggy.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
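The effect of the missing mask in the commit above can be sketched as follows. The `demo_` functions are hypothetical and only model the data flow: a 32-bit value returned from an 8-bit config read is stored into a u16 bus field, with and without zero-extending the byte first.

```c
#include <stdint.h>

/* Without masking, only the u16 assignment crops the value: bits 8..15 leak. */
static uint16_t demo_bus_unmasked(uint32_t raw)
{
    return (uint16_t)raw;
}

/* Zero-extend the 8-bit result first, so BIOS noise cannot reach the field. */
static uint16_t demo_bus_masked(uint32_t raw)
{
    return (uint16_t)(raw & 0xffu);
}
```

If the BIOS leaves noise such as 0xabcd1203 in the register, the unmasked path stores bus 0x1203 while the masked path stores the intended 0x03.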
    • lguest: Revert 1ce70c4f, fix real problem. · 4357bd94
      Authored by Rusty Russell
      Ahmed managed to crash the Host in release_pgd(), which cannot be a Guest
      bug, and indeed it wasn't.
      
      The bug was that handing a 0 as the address of the toplevel page table
      being manipulated can cause the lookup code in find_pgdir() to return
      an uninitialized cache entry (we shadow up to 4 top level page tables
      for each Guest).
      
      Commit 37cc8d7f introduced this
      behaviour in the Guest, uncovering the bug.
      
      The patch which he submitted (which removed the /4 from the index
      calculation) simply ensured that these high-indexed entries hit the
      early exit path of guest_set_pmd().  But you get lots of segfaults in
      guest userspace as the PMDs aren't being updated.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
    • lguest: Sanitize the lguest clock. · 3fabc55f
      Authored by Rusty Russell
      Now the TSC code handles a zero return from calculate_cpu_khz(),
      lguest can simply pass through the value it gets from the Host: if
      non-zero, all the normal TSC code applies.
      
      Otherwise (or if the Host really doesn't support TSC), the clocksource
      code will fall back to the slower but reasonable lguest clock.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
  7. 08 Mar, 2008 1 commit
    • x86_64: make ptrace always sign-extend orig_ax to 64 bits · 84c6f604
      Authored by Roland McGrath
      This makes 64-bit ptrace calls setting the 64-bit orig_ax field for a
      32-bit task sign-extend the low 32 bits up to 64.  This matches what a
      64-bit debugger expects when tracing a 32-bit task.
      
      This follows on my "x86_64 ia32 syscall restart fix".  This didn't
      matter until that was fixed.
      
      The debugger ignores or zeros the high half of every register slot it
      sets (including the orig_rax pseudo-register) uniformly.  It expects
      that the setting of the low 32 bits always has the same meaning as a
      32-bit debugger setting those same 32 bits with native 32-bit
      facilities.
      
      This never arose before because the syscall restart check never
      matched any -ERESTART* values due to lack of sign extension.  Before
      that fix, even 32-bit ptrace setting orig_eax to -1 failed to trigger
      the restart check anyway.  So this was never noticed as a regression
      of 64-bit debuggers vs 32-bit debuggers on the same 64-bit kernel.
      Signed-off-by: Roland McGrath <roland@redhat.com>
      [ Changed to just do the sign-extension unconditionally on x86-64,
        since orig_ax is always just a small integer and doesn't need
        the full 64-bit range ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
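The unconditional sign-extension mentioned in the bracketed note above could look roughly like this. The function name is hypothetical and this is not the kernel's ptrace code; it only shows that a debugger writing 0xffffffff with a zeroed high half ends up storing -1.

```c
#include <stdint.h>

/*
 * Keep only the low 32 bits of the value written by the debugger and
 * sign-extend them to 64 bits. The uint32_t to int32_t conversion wraps
 * to the two's-complement value on common compilers.
 */
static int64_t demo_put_orig_ax(uint64_t value)
{
    return (int64_t)(int32_t)(uint32_t)value;
}
```

Small non-negative syscall numbers pass through unchanged, which is why doing this unconditionally is harmless for 64-bit tasks.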
  8. 07 Mar, 2008 5 commits
    • x86-boot: don't request VBE2 information · 1722770f
      Authored by Peter Korsgaard
      The new x86 setup code (4fd06960) broke booting on an old P3/500MHz
      with an onboard Voodoo3 of mine. After debugging it, it turned out
      to be caused by the fact that the vesa probing now asks for VBE2 data.
      
      Disassembling the video BIOS shows that it overflows the vesa_general_info
      structure when VBE2 data is requested because the source addresses for the
      information strings which get strcpy'ed to the buffer lie outside the 32K
      BIOS code (and hence contain long sequences of 0xff's).
      
      E.g.:
      
      get_vbe_controller_info:
      00002A9C  60                pushaw
      00002A9D  1E                push ds
      00002A9E  0E                push cs
      00002A9F  1F                pop ds
      00002AA0  2BC9              sub cx,cx
      00002AA2  6626813D56424532  cmp dword [es:di],0x32454256 ; "VBE2"
      00002AAA  7501              jnz .1
      00002AAC  41                inc cx
      .1:
      00002AAD  51                push cx
      00002AAE  B91400            mov cx,0x14
      00002AB1  BED47F            mov si, controller_header
      00002AB4  57                push di
      00002AB5  F3A4              rep movsb ; copy vbe1.2 header
      
      00002AB7  B9EC00            mov cx,0xec
      00002ABA  2AC0              sub al,al
      00002ABC  F3AA              rep stosb ; zero pad remainder
      
      00002ABE  5F                pop di
      00002ABF  E8EB0D            call word get_memory
      00002AC2  C1E002            shl ax,0x2
      00002AC5  26894512          mov [es:di+0x12],ax ; total memory
      00002AC9  26C745040003      mov word [es:di+0x4],0x300 ; VBE version
      00002ACF  268C4D08          mov [es:di+0x8],cs
      00002AD3  268C4D10          mov [es:di+0x10],cs
      00002AD7  59                pop cx
      00002AD8  E361              jcxz .done ; VBE2 requested?
      00002ADA  8D9D0001          lea bx,[di+0x100]
      00002ADE  53                push bx
      00002ADF  87DF              xchg bx,di ; di now points to 2nd half
      00002AE1  26C747140001      mov word [es:bx+0x14],0x100 ; sw rev
      
      00002AE7  26897F06          mov [es:bx+0x6],di		; oem string
      00002AEB  268C4708          mov [es:bx+0x8],es
      00002AEF  BE5280            mov si,0x8052 ; oem string
      00002AF2  E87A1B            call word strcpy
      
      00002AF5  26897F0E          mov [es:bx+0xe],di ; video mode list
      00002AF9  268C4710          mov [es:bx+0x10],es
      00002AFD  B91E00            mov cx,0x1e
      00002B00  BEE87F            mov si,vidmodes
      00002B03  F3A5              rep movsw
      
      00002B05  26897F16          mov [es:bx+0x16],di ; oem vendor
      00002B09  268C4718          mov [es:bx+0x18],es
      00002B0D  BE2480            mov si,0x8024 ; oem vendor
      00002B10  E85C1B            call word strcpy
      
      00002B13  26897F1A          mov [es:bx+0x1a],di ; oem product
      00002B17  268C471C          mov [es:bx+0x1c],es
      00002B1B  BE3880            mov si,0x8038 ; oem product
      00002B1E  E84E1B            call word strcpy
      
      00002B21  26897F1E          mov [es:bx+0x1e],di ; oem product rev
      00002B25  268C4720          mov [es:bx+0x20],es
      00002B29  BE4580            mov si,0x8045 ; oem product rev
      00002B2C  E8401B            call word strcpy
      
      00002B2F  58                pop ax
      00002B30  B90001            mov cx,0x100
      00002B33  2BCF              sub cx,di
      00002B35  03C8              add cx,ax
      00002B37  2AC0              sub al,al
      00002B39  F3AA              rep stosb ; zero pad
      .done:
      00002B3B  1F                pop ds
      00002B3C  61                popaw
      00002B3D  B84F00            mov ax,0x4f
      00002B40  C3                ret
      
      (The full BIOS can be found at http://peter.korsgaard.com/vgabios.bin
      if interested).
      
      The old setup code didn't ask for VBE2 info, and the new code doesn't
      actually do anything with the extra information, so the fix is to simply
      not request it. Other BIOSes might have the same problem.
      Signed-off-by: Peter Korsgaard <jacmet@sunsite.dk>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
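The signature check at address 2AA2 in the listing above can be mirrored in C. This is a sketch with hypothetical `demo_` names: it only shows that the BIOS takes the extended (VBE2) path when the caller has pre-filled the first four bytes of the buffer with "VBE2", so leaving that signature out keeps the BIOS on the short path.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_VBE2_MAGIC 0x32454256u  /* 'V' 'B' 'E' '2' read as a little-endian dword */

/* Assemble a little-endian 32-bit value from four bytes. */
static uint32_t demo_le32(const unsigned char b[4])
{
    return (uint32_t)b[0] | (uint32_t)b[1] << 8 |
           (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24;
}

/* True when the caller's buffer asks for VBE2 data, as the cmp/jnz pair tests. */
static bool demo_bios_takes_vbe2_path(const unsigned char *info_buf)
{
    return demo_le32(info_buf) == DEMO_VBE2_MAGIC;
}
```

A zeroed buffer does not match, so the string copies in the second half of the routine are skipped.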
    • x86: re-add reboot fixups · 7432d149
      Authored by Ingo Molnar
      Jan Beulich noticed that the reboot fixups went missing during
      reboot.c unification.
      
      (commit 4d022e35)
      
      Geode and a few other rare boards with special reboot quirks are
      affected.
      Reported-by: Jan Beulich <jbeulich@novell.com>
      Signed-off-by: Jan Beulich <jbeulich@novell.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86: fix typo in step.c · d032b31a
      Authored by Jan Beulich
      TIF_DEBUGCTLMSR has no meaning in the actual MSR...
      Signed-off-by: Jan Beulich <jbeulich@novell.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86: fix merge mistake in i387.c · 609b5297
      Authored by Jan Beulich
      convert_fxsr_to_user() in 2.6.24's i387_32.c did this, and
      convert_to_fxsr() also does the inverse, so I assume it's an oversight
      that it is no longer being done.
      
      [ mingo@elte.hu:
      
        we encode it this way because there's no space for the 'FPU Last
        Instruction Opcode' (->fop) field in the legacy user_i387_ia32_struct
        that PTRACE_GETFPREGS/PTRACE_SETFPREGS uses.
      
        it's probably pure legacy - i'd be surprised if any user-space relied on
        the FPU Last Opcode in any way. But indeed we used to do it previously
        so the most conservative thing is to preserve that piece of information.
      ]
      Signed-off-by: Jan Beulich <jbeulich@novell.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86: clear DF before calling signal handler · e40cd10c
      Authored by Aurelien Jarno
      The Linux kernel currently does not clear the direction flag before
      calling a signal handler, whereas the x86/x86-64 ABI requires that.
      
      Linux had this behavior/bug forever, but this becomes a real problem
      with gcc version 4.3, which assumes that the direction flag is
      correctly cleared at the entry of a function.
      
      This patch changes the setup_frame() functions to clear the
      direction flag before entering the signal handler.
      Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: H. Peter Anvin <hpa@zytor.com>
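The flag manipulation described in the commit above amounts to clearing one bit of the saved flags before the handler runs. A minimal sketch, with hypothetical names (DF is bit 10 of EFLAGS, value 0x400):

```c
#define DEMO_EFLAGS_DF 0x400ul  /* direction flag, bit 10 of EFLAGS */

/* Flags value the signal handler should start with: DF cleared, rest untouched. */
static unsigned long demo_flags_for_handler(unsigned long flags)
{
    return flags & ~DEMO_EFLAGS_DF;
}
```

If the interrupted code was running with DF set (for example in the middle of a backwards string copy), the handler still starts with DF clear, as the ABI expects.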
  9. 06 Mar, 2008 1 commit
  10. 05 Mar, 2008 5 commits
  11. 04 Mar, 2008 4 commits
    • x86: disable KVM for Voyager and friends · 1a4e3f89
      Authored by Randy Dunlap
      Most classic Pentiums don't have hardware virtualization extensions,
      and building kvm with Voyager, Visual Workstation, or NUMAQ
      generates spurious failures.
      Signed-off-by: Avi Kivity <avi@qumranet.com>
      Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
    • KVM: VMX: Avoid rearranging switched guest msrs while they are loaded · 33f9c505
      Authored by Avi Kivity
      KVM tries to run as much as possible with the guest msrs loaded instead of
      host msrs, since switching msrs is very expensive.  It also tries to minimize
      the number of msrs switched according to the guest mode; for example,
      MSR_LSTAR is needed only by long mode guests.  This optimization is done by
      setup_msrs().
      
      However, we must not change which msrs are switched while we are running with
      guest msr state:
      
       - switch to guest msr state
       - call setup_msrs(), removing some msrs from the list
       - switch to host msr state, leaving a few guest msrs loaded
      
      An easy way to trigger this is to kexec an x86_64 linux guest.  Early during
      setup, the guest will switch EFER to not include SCE.  KVM will stop saving
      MSR_LSTAR, and on the next msr switch it will leave the guest LSTAR loaded.
      The next host syscall will end up in a random location in the kernel.
      
      Fix by reloading the host msrs before changing the msr list.
      Signed-off-by: Avi Kivity <avi@qumranet.com>
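The three-step failure sequence in the commit above can be simulated with a toy model. Everything here is hypothetical (`demo_vcpu`, the two-slot MSR array, the values); it is not KVM's data structure, only a sketch of why the switch list must not change while guest values are loaded, and of the fix of reloading host values first.

```c
#include <stdbool.h>

enum { DEMO_NR_MSRS = 2 };  /* slot 0 plays the role of MSR_LSTAR */

struct demo_vcpu {
    long hw[DEMO_NR_MSRS];        /* values currently loaded in the simulated CPU */
    long host[DEMO_NR_MSRS];
    long guest[DEMO_NR_MSRS];
    bool switched[DEMO_NR_MSRS];  /* which slots are on the switch list */
    bool guest_loaded;
};

static void demo_init(struct demo_vcpu *v)
{
    for (int i = 0; i < DEMO_NR_MSRS; i++) {
        v->host[i] = 100 + i;
        v->guest[i] = 1 + i;
        v->hw[i] = v->host[i];
        v->switched[i] = true;
    }
    v->guest_loaded = false;
}

static void demo_load_guest_msrs(struct demo_vcpu *v)
{
    for (int i = 0; i < DEMO_NR_MSRS; i++)
        if (v->switched[i])
            v->hw[i] = v->guest[i];
    v->guest_loaded = true;
}

static void demo_load_host_msrs(struct demo_vcpu *v)
{
    if (!v->guest_loaded)
        return;
    for (int i = 0; i < DEMO_NR_MSRS; i++)
        if (v->switched[i])
            v->hw[i] = v->host[i];
    v->guest_loaded = false;
}

/* Buggy variant: edits the list while guest values may still be loaded. */
static void demo_setup_msrs_buggy(struct demo_vcpu *v, int slot, bool keep)
{
    v->switched[slot] = keep;
}

/* Fixed variant: restore host values before touching the list. */
static void demo_setup_msrs_fixed(struct demo_vcpu *v, int slot, bool keep)
{
    demo_load_host_msrs(v);
    v->switched[slot] = keep;
}
```

In the buggy variant, dropping slot 0 from the list while guest values are loaded leaves the guest value in `hw[0]` after the switch back to the host, which is the stale LSTAR case from the commit message.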
    • KVM: MMU: Fix race when instantiating a shadow pte · f7d9c7b7
      Authored by Avi Kivity
      For improved concurrency, the guest walk is performed concurrently with other
      vcpus.  This means that we need to revalidate the guest ptes once we have
      write-protected the guest page tables, at which point they can no longer be
      modified.
      
      The current code attempts to avoid this check if the shadow page table is not
      new, on the assumption that if it has existed before, the guest could not have
      modified the pte without the shadow lock.  However the assumption is incorrect,
      as the racing vcpu could have modified the pte, then instantiated the shadow
      page, before our vcpu regains control:
      
        vcpu0        vcpu1
      
        fault
        walk pte
      
                     modify pte
                     fault in same pagetable
                     instantiate shadow page
      
        lookup shadow page
        conclude it is old
        instantiate spte based on stale guest pte
      
      We could do something clever with generation counters, but a test run by
      Marcelo suggests this is unnecessary and we can just do the revalidation
      unconditionally.  The pte will be in the processor cache and the check can
      be quite fast.
      Signed-off-by: Avi Kivity <avi@qumranet.com>
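The unconditional revalidation described in the commit above can be reduced to a compare-before-use step. This is a sketch under simplifying assumptions: the names are hypothetical, the guest pte is a plain 64-bit value, and the locking and write-protection that the real code relies on are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Install a shadow pte only if the guest pte still holds the value that
 * was seen during the (unlocked) walk. On mismatch the caller is expected
 * to redo the walk.
 */
static bool demo_set_spte_revalidated(const uint64_t *gpte, uint64_t walked,
                                      uint64_t *spte)
{
    if (*gpte != walked)
        return false;  /* another vcpu changed the guest pte in the meantime */

    *spte = walked;
    return true;
}
```

Applied to the vcpu0/vcpu1 timeline, vcpu0's stale walked value no longer matches after vcpu1's modification, so no spte is created from it.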
    • KVM: Avoid infinite-frequency local apic timer · 0b975a3c
      Authored by Avi Kivity
      If the local apic initial count is zero, don't start an hrtimer with infinite
      frequency, which would lock up the host.
      Signed-off-by: Avi Kivity <avi@qumranet.com>
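The guard described in the commit above can be sketched as a check on the computed period. The function and parameter names are hypothetical, and the period calculation (initial count times nanoseconds per count) is a simplifying assumption, not the emulation code itself.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Compute the timer period and refuse to arm a timer whose period is zero,
 * since a zero-period periodic timer would fire continuously.
 */
static bool demo_apic_timer_period(uint32_t initial_count, uint64_t ns_per_count,
                                   uint64_t *period_ns)
{
    uint64_t period = (uint64_t)initial_count * ns_per_count;

    if (period == 0)
        return false;  /* do not start the timer at all */

    *period_ns = period;
    return true;
}
```

A caller would arm its timer only when the function returns true.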