1. 03 May 2007 (40 commits)
    • Z
      [PATCH] i386: pte simplify ops · 9e5e3162
      Zachary Amsden authored
      Add comment and condense code to make use of native_local_ptep_get_and_clear
      function.  Also, it turns out the 2-level and 3-level paging definitions were
      identical, so move the common definition into pgtable.h
      Signed-off-by: Zachary Amsden <zach@vmware.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andi Kleen <ak@suse.de>
      9e5e3162
    • Z
      [PATCH] i386: pte xchg optimization · 142dd975
      Zachary Amsden authored
      In situations where page table updates need only be made locally, and there
      are no cross-processor A/D bit races involved, we need not use the heavyweight
      xchg instruction to atomically fetch and clear page table entries.  Instead,
      we can just read and clear them directly.
      
      This introduces a neat optimization for non-SMP kernels; drop the atomic xchg
      operations from page table updates.
      
      Thanks to Michel Lespinasse for noting this potential optimization.
      Signed-off-by: Zachary Amsden <zach@vmware.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andi Kleen <ak@suse.de>
      142dd975
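      A rough sketch of the resulting 2-level definitions (simplified; names
      follow the surrounding commits, exact placement in the headers may differ):

      #ifdef CONFIG_SMP
      static inline pte_t native_ptep_get_and_clear(pte_t *ptep)
      {
              /* Atomic fetch-and-clear guards against cross-CPU A/D bit races. */
              return __pte(xchg(&ptep->pte_low, 0));
      }
      #else
      /* UP: no other CPU can race on the A/D bits, so read and clear directly. */
      static inline pte_t native_local_ptep_get_and_clear(pte_t *ptep)
      {
              pte_t res = *ptep;
              native_pte_clear(NULL, 0, ptep);
              return res;
      }
      #define native_ptep_get_and_clear(xp) native_local_ptep_get_and_clear(xp)
      #endif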
    • Z
      [PATCH] i386: pte clear optimization · c2c1accd
      Zachary Amsden authored
      When exiting from an address space, no special hypervisor notification of page
      table updates needs to occur; direct page table hypervisors, such as Xen,
      switch to another address space first (init_mm) and unprotect the page tables
      to avoid the cost of trapping to the hypervisor for each pte_clear.  Shadow
      mode hypervisors, such as VMI and lhype, don't need to do the extra work of
      calling through paravirt-ops, and can just directly clear the page table
      entries without notifying the hypervisor, since all the page tables are about
      to be freed.
      
      So introduce native_pte_clear functions which bypass any paravirt-ops
      notification.  This results in a significant performance win for VMI and
      removes some indirect calls from zap_pte_range.
      
      Note the 3-level paging already had a native_pte_clear function, thus
      demanding argument conformance and extra args for the 2-level definition.
      Signed-off-by: Zachary Amsden <zach@vmware.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andi Kleen <ak@suse.de>
      c2c1accd
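      For reference, a minimal sketch of the 2-level native clear (the mm and
      addr arguments exist only to match the 3-level signature):

      static inline void native_pte_clear(struct mm_struct *mm,
                                          unsigned long addr, pte_t *xp)
      {
              /* Direct clear, no paravirt notification: the page tables are
                 about to be freed anyway. */
              *xp = __pte(0);
      }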
    • A
      [PATCH] x86-64: Auto compute __NR_syscall_max at compile time · 57a4f91a
      Andi Kleen authored
      No need to maintain it anymore
      Signed-off-by: Andi Kleen <ak@suse.de>
      57a4f91a
    • F
      [PATCH] x86-64: Use safe_apic_wait_icr_idle in __send_IPI_dest_field - x86_64 · 70ae77f4
      Use safe_apic_wait_icr_idle to check the ICR idle bit if the vector is
      NMI_VECTOR, to avoid potential hangups in the event of a crash when kdump
      tries to stop the other CPUs.
      Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
      Signed-off-by: Andi Kleen <ak@suse.de>
      70ae77f4
    • F
      [PATCH] x86-64: __send_IPI_dest_field - x86_64 · 9062d888
      Implement __send_IPI_dest_field which can be used to send IPIs when the
      "destination shorthand" field of the ICR is set to 00 (destination
      field). Use it whenever possible.
      Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
      Signed-off-by: Andi Kleen <ak@suse.de>
      9062d888
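      A sketch of how the two IPI changes above fit together on x86-64 (helper
      names as used in the kernel's ipi code; details simplified):

      static inline void __send_IPI_dest_field(unsigned int mask, int vector,
                                               unsigned int dest)
      {
              unsigned long cfg;

              /* For NMIs (e.g. kdump stopping CPUs) use the timed-out wait. */
              if (unlikely(vector == NMI_VECTOR))
                      safe_apic_wait_icr_idle();
              else
                      apic_wait_icr_idle();

              /* Program the destination field, then send the IPI with the
                 "destination field" shorthand (00). */
              cfg = __prepare_ICR2(mask);
              apic_write(APIC_ICR2, cfg);
              cfg = __prepare_ICR(0, vector, dest);
              apic_write(APIC_ICR, cfg);
      }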
    • F
      [PATCH] x86-64: safe_apic_wait_icr_idle - x86_64 · 8339e9fb
      Fernando Luis Vazquez Cao authored
      apic_wait_icr_idle looks like this:
      
      static __inline__ void apic_wait_icr_idle(void)
      {
        while (apic_read(APIC_ICR) & APIC_ICR_BUSY)
          cpu_relax();
      }
      
      The busy loop in this function would not be problematic if the
      corresponding status bit in the ICR were always updated, but that does
      not seem to be the case under certain crash scenarios. Kdump uses an IPI
      to stop the other CPUs in the event of a crash, but when any of the
      other CPUs are locked-up inside the NMI handler the CPU that sends the
      IPI will end up looping forever in the ICR check, effectively
      hard-locking the whole system.
      
      Quoting from Intel's "MultiProcessor Specification" (Version 1.4), B-3:
      
      "A local APIC unit indicates successful dispatch of an IPI by
      resetting the Delivery Status bit in the Interrupt Command
      Register (ICR). The operating system polls the delivery status
      bit after sending an INIT or STARTUP IPI until the command has
      been dispatched.
      
      A period of 20 microseconds should be sufficient for IPI dispatch
      to complete under normal operating conditions. If the IPI is not
      successfully dispatched, the operating system can abort the
      command. Alternatively, the operating system can retry the IPI by
      writing the lower 32-bit double word of the ICR. This “time-out”
      mechanism can be implemented through an external interrupt, if
      interrupts are enabled on the processor, or through execution of
      an instruction or time-stamp counter spin loop."
      
      Intel's documentation suggests the implementation of a time-out
      mechanism, which, by the way, is already being open-coded in some parts
      of the kernel that tinker with ICR.
      
      Create an apic_wait_icr_idle replacement that implements the time-out
      mechanism and that can be used to solve the aforementioned problem.
      
      AK: moved both functions out of line
      AK: Added improved loop from Keith Owens
      Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
      Signed-off-by: Andi Kleen <ak@suse.de>
      8339e9fb
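      A sketch of the resulting timed-out wait (simplified; the loop limits
      follow the improved version mentioned above):

      unsigned int safe_apic_wait_icr_idle(void)
      {
              unsigned int send_status;
              int timeout;

              timeout = 0;
              do {
                      send_status = apic_read(APIC_ICR) & APIC_ICR_BUSY;
                      if (!send_status)
                              break;
                      udelay(100);
              } while (timeout++ < 1000);

              /* Non-zero means the IPI was never dispatched; the caller decides. */
              return send_status;
      }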
    • F
      [PATCH] i386: safe_apic_wait_icr_idle - i386 · f2b218dd
      Fernando Luis Vazquez Cao authored
      apic_wait_icr_idle looks like this:
      
      static __inline__ void apic_wait_icr_idle(void)
      {
        while (apic_read(APIC_ICR) & APIC_ICR_BUSY)
          cpu_relax();
      }
      
      The busy loop in this function would not be problematic if the
      corresponding status bit in the ICR were always updated, but that does
      not seem to be the case under certain crash scenarios. Kdump uses an IPI
      to stop the other CPUs in the event of a crash, but when any of the
      other CPUs are locked-up inside the NMI handler the CPU that sends the
      IPI will end up looping forever in the ICR check, effectively
      hard-locking the whole system.
      
      Quoting from Intel's "MultiProcessor Specification" (Version 1.4), B-3:
      
      "A local APIC unit indicates successful dispatch of an IPI by
      resetting the Delivery Status bit in the Interrupt Command
      Register (ICR). The operating system polls the delivery status
      bit after sending an INIT or STARTUP IPI until the command has
      been dispatched.
      
      A period of 20 microseconds should be sufficient for IPI dispatch
      to complete under normal operating conditions. If the IPI is not
      successfully dispatched, the operating system can abort the
      command. Alternatively, the operating system can retry the IPI by
      writing the lower 32-bit double word of the ICR. This “time-out”
      mechanism can be implemented through an external interrupt, if
      interrupts are enabled on the processor, or through execution of
      an instruction or time-stamp counter spin loop."
      
      Intel's documentation suggests the implementation of a time-out
      mechanism, which, by the way, is already being open-coded in some parts
      of the kernel that tinker with ICR.
      
      Create an apic_wait_icr_idle replacement that implements the time-out
      mechanism and that can be used to solve the aforementioned problem.
      
      AK: moved both functions out of line
      AK: added improved loop from Keith Owens
      Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
      Signed-off-by: Andi Kleen <ak@suse.de>
      f2b218dd
    • B
      [PATCH] i386: Enable support for fixed-range IORRs to keep RdMem & WrMem in sync · de938c51
      Bernhard Kaindl authored
      If our copy of the MTRRs of the BSP has RdMem or WrMem set, and
      we are running on an AMD64/K8 system, the boot CPU must have had
      MtrrFixDramEn and MtrrFixDramModEn set (otherwise our RDMSR would
      have copied these bits cleared), so we set them on this CPU as well.
      
      This allows us to keep the AMD64/K8 RdMem and WrMem bits in sync
      across the CPUs of SMP systems in order to fulfill the duty of
      system software to "initialize and maintain MTRR consistency
      across all processors", as written in the AMD and Intel manuals.
      
      If a WRMSR instruction fails because MtrrFixDramModEn is not
      set, I expect that the Intel-style MTRR bits are not updated either.
      
      AK: minor cleanup, moved MSR defines around
      Signed-off-by: Bernhard Kaindl <bk@suse.de>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Dave Jones <davej@codemonkey.org.uk>
      de938c51
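      A sketch of the AMD64/K8 side of this (MSR number and bit positions are
      taken from the AMD manuals; the helper name is illustrative):

      #define MSR_K8_SYSCFG                 0xc0010010
      #define K8_MTRRFIXRANGE_DRAM_ENABLE   0x00040000  /* MtrrFixDramEn    */
      #define K8_MTRRFIXRANGE_DRAM_MODIFY   0x00080000  /* MtrrFixDramModEn */

      /* Enable RdMem/WrMem handling before writing back fixed-range MTRR
         values that carry those bits. */
      static void k8_enable_fixed_iorrs(void)
      {
              unsigned lo, hi;

              rdmsr(MSR_K8_SYSCFG, lo, hi);
              wrmsr(MSR_K8_SYSCFG,
                    lo | K8_MTRRFIXRANGE_DRAM_ENABLE | K8_MTRRFIXRANGE_DRAM_MODIFY,
                    hi);
      }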
    • B
      [PATCH] x86: Save the MTRRs of the BSP before booting an AP · 2b1f6278
      Bernhard Kaindl authored
      Applied fix by Andrew Morton:
      http://lkml.org/lkml/2007/4/8/88 - Fix `make headers_check'.
      
      AMD and Intel x86 CPU manuals state that it is the responsibility of
      system software to initialize and maintain MTRR consistency across
      all processors in Multi-Processing Environments.
      
      Quote from page 188 of the AMD64 System Programming manual (Volume 2):
      
      7.6.5 MTRRs in Multi-Processing Environments
      
      "In multi-processing environments, the MTRRs located in all processors must
      characterize memory in the same way. Generally, this means that identical
      values are written to the MTRRs used by the processors." (short omission here)
      "Failure to do so may result in coherency violations or loss of atomicity.
      Processor implementations do not check the MTRR settings in other processors
      to ensure consistency. It is the responsibility of system software to
      initialize and maintain MTRR consistency across all processors."
      
      Current Linux MTRR code already implements the above in the case that the
      BIOS does not properly initialize MTRRs on the secondary processors,
      but the case where the fixed-range MTRRs of the boot processor are changed
      after Linux started to boot, before the initialisation of a secondary
      processor, is not handled yet.
      
      In this case, secondary processors are currently initialized by Linux
      with MTRRs which the boot processor had very early, when mtrr_bp_init()
      ran, but not with the MTRRs which the boot processor uses at the
      time when that secondary processor is actually booted,
      causing differing MTRR contents on the secondary processors.
      
      Such a situation happens on Acer Ferrari 1000 and 5000 notebooks, where the
      BIOS enables and sets AMD-specific IORR bits in the fixed-range MTRRs
      of the boot processor when it transitions the system into ACPI mode.
      The SMI handler of the BIOS does this in SMM, entered while Linux ACPI
      code runs acpi_enable().
      
      Other occasions where the SMI handler of the BIOS may change bits in
      the MTRRs could occur as well. To initialize newly booted secondary
      processors with the fixed-range MTRRs which the boot processor uses
      at that time, this patch saves the fixed-range MTRRs of the boot
      processor before new secondary processors are started. When the
      secondary processors run their Linux initialisation code, their
      fixed-range MTRRs will be updated with the saved fixed-range MTRRs.
      
      If CONFIG_MTRR is not set, we define mtrr_save_state
      as an empty statement because there is nothing to do.
      
      Possible TODOs:
      
      *) CPU-hotplugging outside of SMP suspend/resume is not yet tested
         with this patch.
      
      *) If, even in this case, an AP never runs i386/do_boot_cpu or x86_64/cpu_up,
         then the calls to mtrr_save_state() could be replaced by calls to
         mtrr_save_fixed_ranges(NULL) and  mtrr_save_state() would not be
         needed.
      
         That would need either verification of the CPU-hotplug code or
         at least a test on a >2 CPU machine.
      
      *) The MTRRs of other running processors are not yet checked at this
         time, but it might be interesting to synchronize the MTRRs of all
         processors before booting. That would be an incremental patch,
         but of rather low priority since there is no machine known so
         far which would require this.
      
      AK: moved prototypes on x86-64 around to fix warnings
      Signed-off-by: Bernhard Kaindl <bk@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Dave Jones <davej@codemonkey.org.uk>
      2b1f6278
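      A sketch of the save step described above (function names follow the
      patch description; the smp_call_function_single() signature of this
      kernel era is assumed):

      /* Called before a secondary CPU is started: capture the boot CPU's
         current fixed-range MTRRs so the AP init code can apply them. */
      void mtrr_save_state(void)
      {
              int cpu = get_cpu();

              if (cpu == 0)
                      mtrr_save_fixed_ranges(NULL);
              else
                      smp_call_function_single(0, mtrr_save_fixed_ranges,
                                               NULL, 1, 1);
              put_cpu();
      }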
    • B
      [PATCH] x86: Adds mtrr_save_fixed_ranges() for use in two later patches. · 2b3b4835
      Bernhard Kaindl authored
      In the current implementation, which is used by other patches in this
      series, mtrr_save_fixed_ranges() accepts a dummy void pointer because
      it may be called from smp_call_function_single(), which requires its
      callback to take a void pointer argument.
      
      This function calls get_fixed_ranges(), passing mtrr_state.fixed_ranges
      which is the element of the static struct which stores our current
      backup of the fixed-range MTRR values which all CPUs shall be
      using.
      
      Because  mtrr_save_fixed_ranges calls get_fixed_ranges after
      kernel initialisation time, __init needs to be removed from
      the declaration of get_fixed_ranges().
      
      If CONFIG_MTRR is not set, we define mtrr_save_fixed_ranges
      as an empty statement because there is nothing to do.
      
      AK: Moved prototypes for x86-64 around to fix warnings
      Signed-off-by: Bernhard Kaindl <bk@suse.de>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Dave Jones <davej@codemonkey.org.uk>
      2b3b4835
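      In sketch form, the dummy argument exists only so the function can double
      as an smp_call_function_single() callback:

      void mtrr_save_fixed_ranges(void *info)
      {
              /* 'info' is unused; required by the IPI callback prototype. */
              get_fixed_ranges(mtrr_state.fixed_ranges);
      }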
    • A
      [PATCH] x86-64: Move mtrr prototypes from proto.h to mtrr.h · 856f44ff
      Andi Kleen authored
      Signed-off-by: Andi Kleen <ak@suse.de>
      856f44ff
    • J
      [PATCH] i386: Clean up ELF note generation · 03df4f6e
      Jeremy Fitzhardinge authored
      Three cleanups:
      
      1: ELF notes are never mapped, so there's no need to have any access
      flags in their phdr.
      
      2: When generating them from asm, tell the assembler to use a SHT_NOTE
      section type.  There doesn't seem to be a way to do this from C.
      
      3: Use ANSI rather than traditional cpp behaviour to stringify the
      macro argument.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      03df4f6e
    • J
      [PATCH] x86: PARAVIRT: Jeremy Fitzhardinge <jeremy@goop.org> · 441d40dc
      Jeremy Fitzhardinge authored
      The other symbols used to delineate the alt-instructions sections have the
      form __foo/__foo_end.  Rename parainstructions to match.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      441d40dc
    • Z
      [PATCH] i386: Convert VMI timer to use clock events · e0bb8643
      Zachary Amsden authored
      Convert VMI timer to use clock events, making it properly able to use the NO_HZ
      infrastructure.  On UP systems, with no local APIC, we just continue to route
      these events through the PIT.  On systems with a local APIC, or SMP, we provide
      a single source interrupt chip which creates the local timer IRQ.  It actually
      gets delivered by the APIC hardware, but we don't want to use the same local
      APIC clocksource processing, so we create our own handler here.
      Signed-off-by: Zachary Amsden <zach@vmware.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      CC: Dan Hecht <dhecht@vmware.com>
      CC: Ingo Molnar <mingo@elte.hu>
      CC: Thomas Gleixner <tglx@linutronix.de>
      e0bb8643
    • J
      [PATCH] x86: update for i386 and x86-64 check_bugs · 57decbda
      Jeremy Fitzhardinge authored
      Remove spurious comments, headers and keywords from x86-64 bugs.[ch].
      
      Use identify_boot_cpu()
      
      AK: merged with other patch
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      57decbda
    • J
      [PATCH] i386: Fix UP gdt bugs · c5413fbe
      Jeremy Fitzhardinge authored
      Fixes two problems with the GDT when compiling for uniprocessor:
       - There's no percpu segment, so trying to load its selector into %fs fails.
         Use a null selector instead.
       - The real gdt needs to be loaded at some point.  Do it in cpu_init().
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      c5413fbe
    • J
      [PATCH] i386: Define per_cpu_offset · 1956c73b
      Jeremy Fitzhardinge authored
      Define per_cpu_offset in asm-i386/percpu.h when SMP defined, like
      asm-generic/percpu.h does for UP.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Andi Kleen <ak@suse.de>
      1956c73b
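      The added definition is essentially a one-liner mirroring the
      asm-generic UP version (sketch):

      #ifdef CONFIG_SMP
      /* Per-cpu offsets live in __per_cpu_offset[], as on other architectures. */
      #define per_cpu_offset(x) (__per_cpu_offset[x])
      #endif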
    • J
      [PATCH] i386: cleanups to help using per-cpu variables from asm · 978c038e
      Jeremy Fitzhardinge authored
      This patch does a few small cleanups:
       - use PER_CPU_NAME to generate the names of per-cpu variables
       - use lea to add the per_cpu offset in PER_CPU(), because it doesn't
         affect condition flags
       - add PER_CPU_VAR which allows direct access to per-cpu variables
         with the %fs: prefix on SMP.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Andi Kleen <ak@suse.de>
      978c038e
    • J
      [PATCH] i386: Convert PDA into the percpu section · 7c3576d2
      Jeremy Fitzhardinge authored
      Currently x86 (similar to x86-64) has a special per-cpu structure
      called "i386_pda" which can be easily and efficiently referenced via
      the %fs register.  An ELF section is more flexible than a structure,
      allowing any piece of code to use this area.  Indeed, such a section
      already exists: the per-cpu area.
      
      So this patch:
      (1) Removes the PDA and uses per-cpu variables for each current member.
      (2) Replaces the __KERNEL_PDA segment with __KERNEL_PERCPU.
      (3) Creates a per-cpu mirror of __per_cpu_offset called this_cpu_off, which
          can be used to calculate addresses for this CPU's variables.
      (4) Simplifies startup, because %fs doesn't need to be loaded with a
          special segment at early boot; it can be deferred until the first
          percpu area is allocated (or never for UP).
      
      The result is less code and one less x86-specific concept.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Andi Kleen <ak@suse.de>
      7c3576d2
    • J
      [PATCH] i386: Page-align the GDT · 7a61d35d
      Jeremy Fitzhardinge authored
      Xen wants a dedicated page for the GDT.  I believe VMI likes it too.
      lguest, KVM and native don't care.
      
      Simple transformation to page-aligned "struct gdt_page".
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Acked-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      7a61d35d
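      A sketch of the transformation (container plus accessor; names follow
      the commit title, field layout assumed):

      struct gdt_page {
              struct desc_struct gdt[GDT_ENTRIES];
      } __attribute__((aligned(PAGE_SIZE)));

      DECLARE_PER_CPU(struct gdt_page, gdt_page);

      static inline struct desc_struct *get_cpu_gdt_table(unsigned int cpu)
      {
              return per_cpu(gdt_page, cpu).gdt;
      }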
    • J
      [PATCH] i386: PARAVIRT: drop unused ptep_get_and_clear · 4cdd9c89
      Jeremy Fitzhardinge authored
      In shadow mode hypervisors, ptep_get_and_clear achieves the desired
      purpose of keeping the shadows in sync by issuing a native_get_and_clear,
      followed by a call to pte_update, which indicates the PTE has been
      modified.
      
      Direct mode hypervisors (Xen) have no need for this anyway, and will trap
      the update using writable pagetables.
      
      This means no hypervisor makes use of ptep_get_and_clear; there is no
      reason to have it in the paravirt-ops structure.  Change confusing
      terminology about raw vs. native functions into consistent use of
      native_pte_xxx for operations which do not invoke paravirt-ops.
      Signed-off-by: Zachary Amsden <zach@vmware.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      4cdd9c89
    • J
      [PATCH] i386: PARAVIRT: Clean up paravirt patchable wrappers · 1a45b7aa
      Jeremy Fitzhardinge authored
      Replace all the open-coded macros for generating calls with a pair of
      more general macros (__PVOP_CALL/VCALL), and redefine all the
      PVOP_V?CALL[0-4] in terms of them.
      
      [ Andrew, Andi: this should slot in immediately after "Document asm-i386/paravirt.h"
        (paravirt_ops-document-asm-i386-paravirth.patch) ]
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      1a45b7aa
    • J
      [PATCH] i386: PARAVIRT: Use enums for paravirt lazy flush modi · 4e0fa856
      Jeremy Fitzhardinge authored
      Remove #defines, add enum for PARAVIRT_LAZY_FLUSH.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      4e0fa856
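      Sketch of the resulting enum (values assumed to preserve the existing
      numbering):

      enum paravirt_lazy_mode {
              PARAVIRT_LAZY_NONE = 0,
              PARAVIRT_LAZY_MMU = 1,
              PARAVIRT_LAZY_CPU = 2,
              PARAVIRT_LAZY_FLUSH = 3,
      };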
    • J
      [PATCH] i386: PARAVIRT: add kmap_atomic_pte for mapping highpte pages · ce6234b5
      Jeremy Fitzhardinge authored
      Xen and VMI both have special requirements when mapping a highmem pte
      page into the kernel address space.  These can be dealt with by adding
      a new kmap_atomic_pte() function for mapping highptes, and hooking it
      into the paravirt_ops infrastructure.
      
      Xen specifically wants to map the pte page RO, so this patch exposes a
      helper function, kmap_atomic_prot, which maps the page with the
      specified page protections.
      
      This also adds a kmap_flush_unused() function to clear out the cached
      kmap mappings.  Xen needs this to clear out any potential stray RW
      mappings of pages which will become part of a pagetable.
      
      [ Zach - vmi.c will need some attention after this patch.  It wasn't
        immediately obvious to me what needs to be done. ]
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Zachary Amsden <zach@vmware.com>
      ce6234b5
    • J
      [PATCH] i386: PARAVIRT: revert map_pt_hook. · a27fe809
      Jeremy Fitzhardinge authored
      Back out the map_pt_hook to clear the way for kmap_atomic_pte.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Zachary Amsden <zach@vmware.com>
      a27fe809
    • J
      [PATCH] i386: PARAVIRT: add flush_tlb_others paravirt_op · d4c10477
      Jeremy Fitzhardinge authored
      This patch adds a pv_op for flush_tlb_others.  Linux running on native
      hardware uses cross-CPU IPIs to flush the TLB on any CPU which may
      have a particular mm's pagetable entries cached in its TLB.  This is
      inefficient in a paravirtualized environment, since the hypervisor
      knows which real CPUs actually contain cached mappings, which may be a
      small subset of a guest's VCPUs.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      d4c10477
    • J
      [PATCH] i386: PARAVIRT: add common patching machinery · 63f70270
      Jeremy Fitzhardinge authored
      Implement the actual patching machinery.  paravirt_patch_default()
      contains the logic to automatically patch a callsite based on a few
      simple rules:
      
       - if the paravirt_op function is paravirt_nop, then patch nops
       - if the paravirt_op function is a jmp target, then jmp to it
       - if the paravirt_op function is callable and doesn't clobber too much
          for the callsite, call it directly
      
      paravirt_patch_default is suitable as a default implementation of
      paravirt_ops.patch; it will remove most of the expensive indirect calls
      in favour of either a direct call or a pile of nops.
      
      Backends may implement their own patcher, however.  There are several
      helper functions to help with this:
      
      paravirt_patch_nop	nop out a callsite
      paravirt_patch_ignore	leave the callsite as-is
      paravirt_patch_call	patch a call if the caller and callee
      			have compatible clobbers
      paravirt_patch_jmp	patch in a jmp
      paravirt_patch_insns	patch some literal instructions over
      			the callsite, if they fit
      
      This patch also implements more direct patches for the native case, so
      that when running on native hardware many common operations are
      implemented inline.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Zachary Amsden <zach@vmware.com>
      Cc: Anthony Liguori <anthony@codemonkey.ws>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      63f70270
    • J
      [PATCH] i386: PARAVIRT: Document asm-i386/paravirt.h · 294688c0
      Jeremy Fitzhardinge authored
      Clean things up, and broadly document:
       - the paravirt_ops functions themselves
       - the patching mechanism
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      294688c0
    • J
      [PATCH] i386: PARAVIRT: Consistently wrap paravirt ops callsites to make them patchable · f8822f42
      Jeremy Fitzhardinge authored
      Wrap a set of interesting paravirt_ops calls in a wrapper which makes
      the callsites available for patching.  Unfortunately this is pretty
      ugly because there's no way to get gcc to generate a function call,
      but also wrap just the callsite itself with the necessary labels.
      
      This patch supports functions with 0-4 arguments, and either void or
      returning a value.  64-bit arguments must be split into a pair of
      32-bit arguments (lower word first).  Small structures are returned in
      registers.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Zachary Amsden <zach@vmware.com>
      Cc: Anthony Liguori <anthony@codemonkey.ws>
      f8822f42
    • J
      [PATCH] i386: PARAVIRT: Fix patch site clobbers to include return register · 42c24fa2
      Jeremy Fitzhardinge authored
      Fix a few clobbers to include the return register.  The clobbers set
      is the set of all registers modified (or may be modified) by the code
      snippet, regardless of whether it was deliberate or accidental.
      
      Also, make sure that callsites which are used in contexts which don't
      allow clobbers actually save and restore all clobberable registers.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Zachary Amsden <zach@vmware.com>
      42c24fa2
    • J
      [PATCH] i386: PARAVIRT: Use patch site IDs computed from offset in paravirt_ops structure · d5822035
      Jeremy Fitzhardinge authored
      Use patch type identifiers derived from the offset of the operation in
      the paravirt_ops structure.  This avoids having to maintain a separate
      enum for patch site types.
      
      Also, since the identifier is derived from the offset into
      paravirt_ops, the offset can be derived from the identifier.  This is
      used to remove replicated information in the various callsite macros,
      which has been a source of bugs in the past.
      
      This patch also drops the fused save_fl+cli operation, which doesn't
      really add much and makes things more complex - specifically because
      it breaks the 1:1 relationship between identifiers and offsets.  If
      this operation turns out to be particularly beneficial, then the right
      answer is to define a new entrypoint for it.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Zachary Amsden <zach@vmware.com>
      d5822035
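      The key trick fits in one macro: a patch-site type is just the slot
      index of the operation, so identifier and offset cannot diverge (sketch):

      /* Patch-site type = index of the op within struct paravirt_ops. */
      #define PARAVIRT_PATCH(x) \
              (offsetof(struct paravirt_ops, x) / sizeof(void *))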
    • J
      [PATCH] i386: PARAVIRT: rename struct paravirt_patch to paravirt_patch_site for clarity · 98de032b
      Jeremy Fitzhardinge authored
      Rename struct paravirt_patch to paravirt_patch_site, so that it
      clearly refers to a callsite, and not the patch which may be applied
      to that callsite.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Zachary Amsden <zach@vmware.com>
      98de032b
    • J
      [PATCH] x86: PARAVIRT: add hooks to intercept mm creation and destruction · d6dd61c8
      Jeremy Fitzhardinge authored
      Add hooks to allow a paravirt implementation to track the lifetime of
      an mm.  Paravirtualization requires three hooks, but only two are
      needed in common code.  They are:
      
      arch_dup_mmap, which is called when a new mmap is created at fork
      
      arch_exit_mmap, which is called when the last process reference to an
        mm is dropped, which typically happens on exit and exec.
      
      The third hook is activate_mm, which is called from the arch-specific
      activate_mm() macro/function, and so doesn't need stub versions for
      other architectures.  It's called when an mm is first used.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: linux-arch@vger.kernel.org
      Cc: James Bottomley <James.Bottomley@SteelEye.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      d6dd61c8
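      For architectures that don't care, the common-code side reduces to empty
      stubs, roughly:

      /* Generic fallbacks: no-ops unless an architecture (here, i386
         paravirt) provides real hooks. */
      static inline void arch_dup_mmap(struct mm_struct *oldmm,
                                       struct mm_struct *mm)
      {
      }

      static inline void arch_exit_mmap(struct mm_struct *mm)
      {
      }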
    • J
      [PATCH] i386: PARAVIRT: Allow paravirt backend to choose kernel PMD sharing · 5311ab62
      Jeremy Fitzhardinge authored
      Normally when running in PAE mode, the 4th PMD maps the kernel address space,
      which can be shared among all processes (since they all need the same kernel
      mappings).
      
      Xen, however, does not allow guests to have the kernel pmd shared between page
      tables, so parameterize pgtable.c to allow both modes of operation.
      
      There are several side-effects of this.  One is that vmalloc will update the
      kernel address space mappings, and those updates need to be propagated into
      all processes if the kernel mappings are not intrinsically shared.  In the
      non-PAE case, this is done by maintaining a pgd_list of all processes; this
      list is used when all process pagetables must be updated.  pgd_list is
      threaded via otherwise unused entries in the page structure for the pgd, which
      means that the pgd must be page-sized for this to work.
      
      Normally the PAE pgd is only 4x64 byte entries large, but Xen requires the PAE
      pgd to be page aligned anyway, so this patch forces the pgd to be page
      aligned+sized when the kernel pmd is unshared, to accommodate both these
      requirements.
      
      Also, since there may be several distinct kernel pmds (if the user/kernel
      split is below 3G), there's no point in allocating them from a slab cache;
      they're just allocated with get_free_page and initialized appropriately.  (Of
      course they could be cached if there is just a single kernel pmd - which is the
      default with a 3G user/kernel split - but it doesn't seem worthwhile to add
      yet another case into this code).
      
      [ Many thanks to wli for review comments. ]
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Zachary Amsden <zach@vmware.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5311ab62
    • J
      [PATCH] i386: PARAVIRT: Allocate a fixmap slot · 90caccb9
      Jeremy Fitzhardinge authored
      Allocate a fixmap slot for use by a paravirt_ops implementation.  This
      is intended for early-boot bootstrap mappings.  Once the zones and
      allocator have been set up, it would be better to use get_vm_area() to
      allocate some virtual space.
      
      Xen uses this to map the hypervisor's shared info page, which doesn't
      have a pseudo-physical page number, and therefore can't be mapped
      ordinarily.  It is needed early because it contains the vcpu state,
      including the interrupt mask.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      90caccb9
    • J
      [PATCH] i386: PARAVIRT: Hooks to set up initial pagetable · b239fb25
      Jeremy Fitzhardinge authored
      This patch introduces paravirt_ops hooks to control how the kernel's
      initial pagetable is set up.
      
      In the case of a native boot, the very early bootstrap code creates a
      simple non-PAE pagetable to map the kernel and physical memory.  When
      the VM subsystem is initialized, it creates a proper pagetable which
      respects the PAE mode, large pages, etc.
      
      When booting under a hypervisor, there are many possibilities for what
      paging environment the hypervisor establishes for the guest kernel, so
      the construction of the kernel's pagetable depends on the hypervisor.
      
      In the case of Xen, the hypervisor boots the kernel with a fully
      constructed pagetable, which is already using PAE if necessary.  Also,
      Xen requires particular care when constructing pagetables to make sure
      all pagetables are always mapped read-only.
      
      In order to make this easier, the kernel's initial pagetable construction
      has been changed to only allocate and initialize a pagetable page if
      there's no page already present in the pagetable.  This allows the Xen
      paravirt backend to make a copy of the hypervisor-provided pagetable,
      allowing the kernel to establish any more mappings it needs while
      keeping the existing ones.
      
      A slightly subtle point which is worth highlighting here is that Xen
      requires all kernel mappings to share the same pte_t pages between all
      pagetables, so that updating a kernel page's mapping in one pagetable
      is reflected in all other pagetables.  This makes it possible to
      allocate a page and attach it to a pagetable without having to
      explicitly enumerate that page's mapping in all pagetables.
      
      And:
      
      From: "Eric W. Biederman" <ebiederm@xmission.com>
      
      If we don't set the leaf page table entries it is quite possible that
      we will inherit an incorrect page table entry from the initial boot
      page table setup in head.S.  So we need to redo the effort here,
      so we pick up PSE, PGE and the like.
      
      Hypervisors like Xen require that their page tables be read-only,
      which is slightly incompatible with our low identity mappings; however,
      I discussed this with Jeremy and he has modified the Xen early set_pte
      function to avoid problems in this area.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Acked-by: William Irwin <bill.irwin@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      b239fb25
    • J
      [PATCH] i386: PARAVIRT: Add pagetable accessors to pack and unpack pagetable entries · 3dc494e8
      Jeremy Fitzhardinge authored
      Add a set of accessors to pack, unpack and modify page table entries
      (at all levels).  This allows a paravirt implementation to control the
      contents of pgd/pmd/pte entries.  For example, Xen uses this to
      convert the (pseudo-)physical address into a machine address when
      populating a pagetable entry, and to convert back to a pseudo-physical
      address when an entry is read.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      3dc494e8
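      In the non-PAE case the native accessors are trivial wrappers, roughly:

      /* 2-level: a pte is one word, so packing/unpacking is a plain copy.
         A backend such as Xen overrides these to translate pseudo-physical
         addresses to machine addresses and back. */
      static inline pte_t native_make_pte(unsigned long val)
      {
              return (pte_t) { .pte_low = val };
      }

      static inline unsigned long native_pte_val(pte_t pte)
      {
              return pte.pte_low;
      }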
    • J
      [PATCH] i386: PARAVIRT: use paravirt_nop to consistently mark no-op operations · 45876233
      Jeremy Fitzhardinge authored
      Add a _paravirt_nop function for use as a stub for no-op operations,
      and a paravirt_nop #define for a void * version of it, to make it
      easier to use (since all its uses are as a void *).
      
      This is useful to allow the patcher to automatically identify noop
      operations so it can simply nop out the callsite.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      [mingo] but only as a cleanup of the current open-coded (void *) casts.
      My problem with this is that it loses the types. Not that there is much
      to check for, but still, this adds some assumptions about what function
      calls look like.
      45876233
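      The stub itself is tiny; a sketch:

      /* Canonical no-op: the patcher recognises this pointer and can nop out
         the callsite entirely. */
      void _paravirt_nop(void)
      {
      }

      #define paravirt_nop ((void *)_paravirt_nop)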
    • R
      [PATCH] i386: i386 separate hardware-defined TSS from Linux additions · a75c54f9
      Rusty Russell authored
      On Thu, 2007-03-29 at 13:16 +0200, Andi Kleen wrote:
      > Please clean it up properly with two structs.
      
      Not sure about this, now I've done it.  Running it here.
      
      If you like it, I can do x86-64 as well.
      
      ==
      lguest defines its own TSS struct because the "struct tss_struct"
      contains linux-specific additions.  Andi asked me to split the struct
      in processor.h.
      
      Unfortunately it makes usage a little awkward.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andi Kleen <ak@suse.de>
      a75c54f9
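      A sketch of the resulting split (only the first few hardware-defined
      fields shown; the Linux-only members stay in the outer struct):

      struct i386_hw_tss {
              unsigned short  back_link, __blh;
              unsigned long   esp0;
              unsigned short  ss0, __ss0h;
              /* ... remaining architecturally-defined fields ... */
      };

      struct tss_struct {
              struct i386_hw_tss x86_tss;     /* hardware-defined part */
              unsigned long io_bitmap[IO_BITMAP_LONGS + 1];  /* Linux additions */
              /* ... */
      };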