1. 12 7月, 2011 18 次提交
    • D
      KVM: PPC: Accelerate H_PUT_TCE by implementing it in real mode · 54738c09
      David Gibson 提交于
      This improves I/O performance for guests using the PAPR
      paravirtualization interface by making the H_PUT_TCE hcall faster, by
      implementing it in real mode.  H_PUT_TCE is used for updating virtual
      IOMMU tables, and is used both for virtual I/O and for real I/O in the
      PAPR interface.
      
      Since this moves the IOMMU tables into the kernel, we define a new
      KVM_CREATE_SPAPR_TCE ioctl to allow qemu to create the tables.  The
      ioctl returns a file descriptor which can be used to mmap the newly
      created table.  The qemu driver models use them in the same way as
      userspace managed tables, but they can be updated directly by the
      guest with a real-mode H_PUT_TCE implementation, reducing the number
      of host/guest context switches during guest IO.
      
      There are certain circumstances where it is useful for userland qemu
      to write to the TCE table even if the kernel H_PUT_TCE path is used
      most of the time.  Specifically, allowing this will avoid awkwardness
      when we need to reset the table.  More importantly, we will in the
      future need to write the table in order to restore its state after a
      checkpoint resume or migration.
      Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      54738c09
    • P
      KVM: PPC: Handle some PAPR hcalls in the kernel · a8606e20
      Paul Mackerras 提交于
      This adds the infrastructure for handling PAPR hcalls in the kernel,
      either early in the guest exit path while we are still in real mode,
      or later once the MMU has been turned back on and we are in the full
      kernel context.  The advantage of handling hcalls in real mode if
      possible is that we avoid two partition switches -- and this will
      become more important when we support SMT4 guests, since a partition
      switch means we have to pull all of the threads in the core out of
      the guest.  The disadvantage is that we can only access the kernel
      linear mapping, not anything vmalloced or ioremapped, since the MMU
      is off.
      
      This also adds code to handle the following hcalls in real mode:
      
      H_ENTER       Add an HPTE to the hashed page table
      H_REMOVE      Remove an HPTE from the hashed page table
      H_READ        Read HPTEs from the hashed page table
      H_PROTECT     Change the protection bits in an HPTE
      H_BULK_REMOVE Remove up to 4 HPTEs from the hashed page table
      H_SET_DABR    Set the data address breakpoint register
      
      Plus code to handle the following hcalls in the kernel:
      
      H_CEDE        Idle the vcpu until an interrupt or H_PROD hcall arrives
      H_PROD        Wake up a ceded vcpu
      H_REGISTER_VPA Register a virtual processor area (VPA)
      
      The code that runs in real mode has to be in the base kernel, not in
      the module, if KVM is compiled as a module.  The real-mode code can
      only access the kernel linear mapping, not vmalloc or ioremap space.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      a8606e20
    • P
      KVM: PPC: Add support for Book3S processors in hypervisor mode · de56a948
      Paul Mackerras 提交于
      This adds support for KVM running on 64-bit Book 3S processors,
      specifically POWER7, in hypervisor mode.  Using hypervisor mode means
      that the guest can use the processor's supervisor mode.  That means
      that the guest can execute privileged instructions and access privileged
      registers itself without trapping to the host.  This gives excellent
      performance, but does mean that KVM cannot emulate a processor
      architecture other than the one that the hardware implements.
      
      This code assumes that the guest is running paravirtualized using the
      PAPR (Power Architecture Platform Requirements) interface, which is the
      interface that IBM's PowerVM hypervisor uses.  That means that existing
      Linux distributions that run on IBM pSeries machines will also run
      under KVM without modification.  In order to communicate the PAPR
      hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
      to include/linux/kvm.h.
      
      Currently the choice between book3s_hv support and book3s_pr support
      (i.e. the existing code, which runs the guest in user mode) has to be
      made at kernel configuration time, so a given kernel binary can only
      do one or the other.
      
      This new book3s_hv code doesn't support MMIO emulation at present.
      Since we are running paravirtualized guests, this isn't a serious
      restriction.
      
      With the guest running in supervisor mode, most exceptions go straight
      to the guest.  We will never get data or instruction storage or segment
      interrupts, alignment interrupts, decrementer interrupts, program
      interrupts, single-step interrupts, etc., coming to the hypervisor from
      the guest.  Therefore this introduces a new KVMTEST_NONHV macro for the
      exception entry path so that we don't have to do the KVM test on entry
      to those exception handlers.
      
      We do however get hypervisor decrementer, hypervisor data storage,
      hypervisor instruction storage, and hypervisor emulation assist
      interrupts, so we have to handle those.
      
      In hypervisor mode, real-mode accesses can access all of RAM, not just
      a limited amount.  Therefore we put all the guest state in the vcpu.arch
      and use the shadow_vcpu in the PACA only for temporary scratch space.
      We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
      anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
      We don't have a shared page with the guest, but we still need a
      kvm_vcpu_arch_shared struct to store the values of various registers,
      so we include one in the vcpu_arch struct.
      
      The POWER7 processor has a restriction that all threads in a core have
      to be in the same partition.  MMU-on kernel code counts as a partition
      (partition 0), so we have to do a partition switch on every entry to and
      exit from the guest.  At present we require the host and guest to run
      in single-thread mode because of this hardware restriction.
      
      This code allocates a hashed page table for the guest and initializes
      it with HPTEs for the guest's Virtual Real Memory Area (VRMA).  We
      require that the guest memory is allocated using 16MB huge pages, in
      order to simplify the low-level memory management.  This also means that
      we can get away without tracking paging activity in the host for now,
      since huge pages can't be paged or swapped.
      
      This also adds a few new exports needed by the book3s_hv code.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      de56a948
    • P
      KVM: PPC: Split host-state fields out of kvmppc_book3s_shadow_vcpu · 3c42bf8a
      Paul Mackerras 提交于
      There are several fields in struct kvmppc_book3s_shadow_vcpu that
      temporarily store bits of host state while a guest is running,
      rather than anything relating to the particular guest or vcpu.
      This splits them out into a new kvmppc_host_state structure and
      modifies the definitions in asm-offsets.c to suit.
      
      On 32-bit, we have a kvmppc_host_state structure inside the
      kvmppc_book3s_shadow_vcpu since the assembly code needs to be able
      to get to them both with one pointer.  On 64-bit they are separate
      fields in the PACA.  This means that on 64-bit we don't need to
      copy the kvmppc_host_state in and out on vcpu load/unload, and
      in future will mean that the book3s_hv code doesn't need a
      shadow_vcpu struct in the PACA at all.  That does mean that we
      have to be careful not to rely on any values persisting in the
      hstate field of the paca across any point where we could block
      or get preempted.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      3c42bf8a
    • P
      powerpc: Set up LPCR for running guest partitions · 923c53ca
      Paul Mackerras 提交于
      In hypervisor mode, the LPCR controls several aspects of guest
      partitions, including virtual partition memory mode, and also controls
      whether the hypervisor decrementer interrupts are enabled.  This sets
      up LPCR at boot time so that guest partitions will use a virtual real
      memory area (VRMA) composed of 16MB large pages, and hypervisor
      decrementer interrupts are disabled.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      923c53ca
    • P
      KVM: PPC: Move guest enter/exit down into subarch-specific code · df6909e5
      Paul Mackerras 提交于
      Instead of doing the kvm_guest_enter/exit() and local_irq_dis/enable()
      calls in powerpc.c, this moves them down into the subarch-specific
      book3s_pr.c and booke.c.  This eliminates an extra local_irq_enable()
      call in book3s_pr.c, and will be needed for when we do SMT4 guest
      support in the book3s hypervisor mode code.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      df6909e5
    • P
      KVM: PPC: Pass init/destroy vm and prepare/commit memory region ops down · f9e0554d
      Paul Mackerras 提交于
      This arranges for the top-level arch/powerpc/kvm/powerpc.c file to
      pass down some of the calls it gets to the lower-level subarchitecture
      specific code.  The lower-level implementations (in booke.c and book3s.c)
      are no-ops.  The coming book3s_hv.c will need this.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      f9e0554d
    • P
      powerpc, KVM: Rework KVM checks in first-level interrupt handlers · b01c8b54
      Paul Mackerras 提交于
      Instead of branching out-of-line with the DO_KVM macro to check if we
      are in a KVM guest at the time of an interrupt, this moves the KVM
      check inline in the first-level interrupt handlers.  This speeds up
      the non-KVM case and makes sure that none of the interrupt handlers
      are missing the check.
      
      Because the first-level interrupt handlers are now larger, some things
      had to be move out of line in exceptions-64s.S.
      
      This all necessitated some minor changes to the interrupt entry code
      in KVM.  This also streamlines the book3s_32 KVM test.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      b01c8b54
    • P
      KVM: PPC: Split out code from book3s.c into book3s_pr.c · f05ed4d5
      Paul Mackerras 提交于
      In preparation for adding code to enable KVM to use hypervisor mode
      on 64-bit Book 3S processors, this splits book3s.c into two files,
      book3s.c and book3s_pr.c, where book3s_pr.c contains the code that is
      specific to running the guest in problem state (user mode) and book3s.c
      contains code which should apply to all Book 3S processors.
      
      In doing this, we abstract some details, namely the interrupt offset,
      updating the interrupt pending flag, and detecting if the guest is
      in a critical section.  These are all things that will be different
      when we use hypervisor mode.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      f05ed4d5
    • P
      KVM: PPC: Move fields between struct kvm_vcpu_arch and kvmppc_vcpu_book3s · c4befc58
      Paul Mackerras 提交于
      This moves the slb field, which represents the state of the emulated
      SLB, from the kvmppc_vcpu_book3s struct to the kvm_vcpu_arch, and the
      hpte_hash_[v]pte[_long] fields from kvm_vcpu_arch to kvmppc_vcpu_book3s.
      This is in accord with the principle that the kvm_vcpu_arch struct
      represents the state of the emulated CPU, and the kvmppc_vcpu_book3s
      struct holds the auxiliary data structures used in the emulation.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      c4befc58
    • L
      KVM: PPC: e500: Add shadow PID support · dd9ebf1f
      Liu Yu 提交于
      Dynamically assign host PIDs to guest PIDs, splitting each guest PID into
      multiple host (shadow) PIDs based on kernel/user and MSR[IS/DS].  Use
      both PID0 and PID1 so that the shadow PIDs for the right mode can be
      selected, that correspond both to guest TID = zero and guest TID = guest
      PID.
      
      This allows us to significantly reduce the frequency of needing to
      invalidate the entire TLB.  When the guest mode or PID changes, we just
      update the host PID0/PID1.  And since the allocation of shadow PIDs is
      global, multiple guests can share the TLB without conflict.
      
      Note that KVM does not yet support the guest setting PID1 or PID2 to
      a value other than zero.  This will need to be fixed for nested KVM
      to work.  Until then, we enforce the requirement for guest PID1/PID2
      to stay zero by failing the emulation if the guest tries to set them
      to something else.
      Signed-off-by: NLiu Yu <yu.liu@freescale.com>
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      dd9ebf1f
    • L
      KVM: PPC: e500: Stop keeping shadow TLB · 08b7fa92
      Liu Yu 提交于
      Instead of a fully separate set of TLB entries, keep just the
      pfn and dirty status.
      Signed-off-by: NLiu Yu <yu.liu@freescale.com>
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      08b7fa92
    • S
      KVM: PPC: e500: enable magic page · a4cd8b23
      Scott Wood 提交于
      This is a shared page used for paravirtualization.  It is always present
      in the guest kernel's effective address space at the address indicated
      by the hypercall that enables it.
      
      The physical address specified by the hypercall is not used, as
      e500 does not have real mode.
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      a4cd8b23
    • S
      KVM: PPC: e500: Eliminate shadow_pages[], and use pfns instead. · 59c1f4e3
      Scott Wood 提交于
      This is in line with what other architectures do, and will allow us to
      map things other than ordinary, unreserved kernel pages -- such as
      dedicated devices, or large contiguous reserved regions.
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      59c1f4e3
    • S
      KVM: PPC: e500: Save/restore SPE state · 4cd35f67
      Scott Wood 提交于
      This is done lazily.  The SPE save will be done only if the guest has
      used SPE since the last preemption or heavyweight exit.  Restore will be
      done only on demand, when enabling MSR_SPE in the shadow MSR, in response
      to an SPE fault or mtmsr emulation.
      
      For SPEFSCR, Linux already switches it on context switch (non-lazily), so
      the only remaining bit is to save it between qemu and the guest.
      Signed-off-by: NLiu Yu <yu.liu@freescale.com>
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      4cd35f67
    • S
      KVM: PPC: booke: use shadow_msr · ecee273f
      Scott Wood 提交于
      Keep the guest MSR and the guest-mode true MSR separate, rather than
      modifying the guest MSR on each guest entry to produce a true MSR.
      
      Any bits which should be modified based on guest MSR must be explicitly
      propagated from vcpu->arch.shared->msr to vcpu->arch.shadow_msr in
      kvmppc_set_msr().
      
      While we're modifying the guest entry code, reorder a few instructions
      to bury some load latencies.
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      ecee273f
    • S
      powerpc/e500: SPE register saving: take arbitrary struct offset · c51584d5
      Scott Wood 提交于
      Previously, these macros hardcoded THREAD_EVR0 as the base of the save
      area, relative to the base register passed.  This base offset is now
      passed as a separate macro parameter, allowing reuse with other SPE
      save areas, such as used by KVM.
      Acked-by: NKumar Gala <galak@kernel.crashing.org>
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      c51584d5
    • A
      KVM: PPC: Resolve real-mode handlers through function exports · a22a2dac
      Alexander Graf 提交于
      Up until now, Book3S KVM had variables stored in the kernel that a kernel module
      or the kvm code in the kernel could read from to figure out where some real mode
      helper functions are located.
      
      This is all unnecessary. The high bits of the EA get ignore in real mode, so we
      can just use the pointer as is. Also, it's a lot easier on relocations when we
      use the normal way of resolving the address to a function, instead of jumping
      through hoops.
      
      This patch fixes compilation with CONFIG_RELOCATABLE=y.
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      a22a2dac
  2. 28 6月, 2011 1 次提交
    • K
      Fix node_start/end_pfn() definition for mm/page_cgroup.c · c6830c22
      KAMEZAWA Hiroyuki 提交于
      commit 21a3c964 uses node_start/end_pfn(nid) for detection start/end
      of nodes. But, it's not defined in linux/mmzone.h but defined in
      /arch/???/include/mmzone.h which is included only under
      CONFIG_NEED_MULTIPLE_NODES=y.
      
      Then, we see
        mm/page_cgroup.c: In function 'page_cgroup_init':
        mm/page_cgroup.c:308: error: implicit declaration of function 'node_start_pfn'
        mm/page_cgroup.c:309: error: implicit declaration of function 'node_end_pfn'
      
      So, fixiing page_cgroup.c is an idea...
      
      But node_start_pfn()/node_end_pfn() is a very generic macro and
      should be implemented in the same manner for all archs.
      (m32r has different implementation...)
      
      This patch removes definitions of node_start/end_pfn() in each archs
      and defines a unified one in linux/mmzone.h. It's not under
      CONFIG_NEED_MULTIPLE_NODES, now.
      
      A result of macro expansion is here (mm/page_cgroup.c)
      
      for !NUMA
       start_pfn = ((&contig_page_data)->node_start_pfn);
        end_pfn = ({ pg_data_t *__pgdat = (&contig_page_data); __pgdat->node_start_pfn + __pgdat->node_spanned_pages;});
      
      for NUMA (x86-64)
        start_pfn = ((node_data[nid])->node_start_pfn);
        end_pfn = ({ pg_data_t *__pgdat = (node_data[nid]); __pgdat->node_start_pfn + __pgdat->node_spanned_pages;});
      
      Changelog:
       - fixed to avoid using "nid" twice in node_end_pfn() macro.
      Reported-and-acked-by: NRandy Dunlap <randy.dunlap@oracle.com>
      Reported-and-tested-by: NIngo Molnar <mingo@elte.hu>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c6830c22
  3. 03 6月, 2011 1 次提交
  4. 29 5月, 2011 1 次提交
    • E
      ns: Wire up the setns system call · 7b21fddd
      Eric W. Biederman 提交于
      32bit and 64bit on x86 are tested and working.  The rest I have looked
      at closely and I can't find any problems.
      
      setns is an easy system call to wire up.  It just takes two ints so I
      don't expect any weird architecture porting problems.
      
      While doing this I have noticed that we have some architectures that are
      very slow to get new system calls.  cris seems to be the slowest where
      the last system calls wired up were preadv and pwritev.  avr32 is weird
      in that recvmmsg was wired up but never declared in unistd.h.  frv is
      behind with perf_event_open being the last syscall wired up.  On h8300
      the last system call wired up was epoll_wait.  On m32r the last system
      call wired up was fallocate.  mn10300 has recvmmsg as the last system
      call wired up.  The rest seem to at least have syncfs wired up which was
      new in the 2.6.39.
      
      v2: Most of the architecture support added by Daniel Lezcano <dlezcano@fr.ibm.com>
      v3: ported to v2.6.36-rc4 by: Eric W. Biederman <ebiederm@xmission.com>
      v4: Moved wiring up of the system call to another patch
      v5: ported to v2.6.39-rc6
      v6: rebased onto parisc-next and net-next to avoid syscall  conflicts.
      v7: ported to Linus's latest post 2.6.39 tree.
      
      >  arch/blackfin/include/asm/unistd.h     |    3 ++-
      >  arch/blackfin/mach-common/entry.S      |    1 +
      Acked-by: NMike Frysinger <vapier@gentoo.org>
      
      Oh - ia64 wiring looks good.
      Acked-by: NTony Luck <tony.luck@intel.com>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7b21fddd
  5. 26 5月, 2011 3 次提交
    • M
      powerpc/cell: Use common smp ipi actions · 7ef71d75
      Milton Miller 提交于
      The cell iic interrupt controller has enough software caused interrupts
      to use a unique interrupt for each of the 4 messages powerpc uses.
      This means each interrupt gets its own irq action/data combination.
      
      Use the seperate, optimized, arch common ipi action functions
      registered via the helper smp_request_message_ipi instead passing the
      message as action data to a single action that then demultipexes to
      the required acton via a switch statement.
      
      smp_request_message_ipi will register the action as IRQF_PER_CPU
      and IRQF_DISABLED, and WARN if the allocation fails for some reason,
      so no need to print on that failure.  It will return positive if
      the message will not be used by the kernel, in which case we can
      free the virq.
      
      In addition to elimiating inefficient code, this also corrects the
      error that a kernel built with kexec but without a debugger would
      not register the ipi for kdump to notify the other cpus of a crash.
      
      This also restores the debugger action to be static to kernel/smp.c.
      Signed-off-by: NMilton Miller <miltonm@bga.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      7ef71d75
    • B
      powerpc/pseries: Update MAX_HCALL_OPCODE to reflect page coalescing · ca193150
      Brian King 提交于
      When page coalescing support was added recently, the MAX_HCALL_OPCODE
      define was not updated for the newly added H_GET_MPP_X hcall.
      Signed-off-by: NBrian King <brking@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      ca193150
    • I
      powerpc/ftrace: Implement raw syscall tracepoints on PowerPC · 02424d89
      Ian Munsie 提交于
      This patch implements the raw syscall tracepoints on PowerPC and exports
      them for ftrace syscalls to use.
      
      To minimise reworking existing code, I slightly re-ordered the thread
      info flags such that the new TIF_SYSCALL_TRACEPOINT bit would still fit
      within the 16 bits of the andi. instruction's UI field. The instructions
      in question are in /arch/powerpc/kernel/entry_{32,64}.S to and the
      _TIF_SYSCALL_T_OR_A with the thread flags to see if system call tracing
      is enabled.
      
      In the case of 64bit PowerPC, arch_syscall_addr and
      arch_syscall_match_sym_name are overridden to allow ftrace syscalls to
      work given the unusual system call table structure and symbol names that
      start with a period.
      Signed-off-by: NIan Munsie <imunsie@au1.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      02424d89
  6. 25 5月, 2011 3 次提交
  7. 22 5月, 2011 3 次提交
  8. 20 5月, 2011 3 次提交
  9. 19 5月, 2011 7 次提交