1. 05 3月, 2012 19 次提交
    • P
      KVM: PPC: Allow for read-only pages backing a Book3S HV guest · 4cf302bc
      Paul Mackerras 提交于
      With this, if a guest does an H_ENTER with a read/write HPTE on a page
      which is currently read-only, we make the actual HPTE inserted be a
      read-only version of the HPTE.  We now intercept protection faults as
      well as HPTE not found faults, and for a protection fault we work out
      whether it should be reflected to the guest (e.g. because the guest HPTE
      didn't allow write access to usermode) or handled by switching to
      kernel context and calling kvmppc_book3s_hv_page_fault, which will then
      request write access to the page and update the actual HPTE.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      4cf302bc
    • P
      KVM: PPC: Implement MMU notifiers for Book3S HV guests · 342d3db7
      Paul Mackerras 提交于
      This adds the infrastructure to enable us to page out pages underneath
      a Book3S HV guest, on processors that support virtualized partition
      memory, that is, POWER7.  Instead of pinning all the guest's pages,
      we now look in the host userspace Linux page tables to find the
      mapping for a given guest page.  Then, if the userspace Linux PTE
      gets invalidated, kvm_unmap_hva() gets called for that address, and
      we replace all the guest HPTEs that refer to that page with absent
      HPTEs, i.e. ones with the valid bit clear and the HPTE_V_ABSENT bit
      set, which will cause an HDSI when the guest tries to access them.
      Finally, the page fault handler is extended to reinstantiate the
      guest HPTE when the guest tries to access a page which has been paged
      out.
      
      Since we can't intercept the guest DSI and ISI interrupts on PPC970,
      we still have to pin all the guest pages on PPC970.  We have a new flag,
      kvm->arch.using_mmu_notifiers, that indicates whether we can page
      guest pages out.  If it is not set, the MMU notifier callbacks do
      nothing and everything operates as before.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      342d3db7
    • P
      KVM: PPC: Implement MMIO emulation support for Book3S HV guests · 697d3899
      Paul Mackerras 提交于
      This provides the low-level support for MMIO emulation in Book3S HV
      guests.  When the guest tries to map a page which is not covered by
      any memslot, that page is taken to be an MMIO emulation page.  Instead
      of inserting a valid HPTE, we insert an HPTE that has the valid bit
      clear but another hypervisor software-use bit set, which we call
      HPTE_V_ABSENT, to indicate that this is an absent page.  An
      absent page is treated much like a valid page as far as guest hcalls
      (H_ENTER, H_REMOVE, H_READ etc.) are concerned, except of course that
      an absent HPTE doesn't need to be invalidated with tlbie since it
      was never valid as far as the hardware is concerned.
      
      When the guest accesses a page for which there is an absent HPTE, it
      will take a hypervisor data storage interrupt (HDSI) since we now set
      the VPM1 bit in the LPCR.  Our HDSI handler for HPTE-not-present faults
      looks up the hash table and if it finds an absent HPTE mapping the
      requested virtual address, will switch to kernel mode and handle the
      fault in kvmppc_book3s_hv_page_fault(), which at present just calls
      kvmppc_hv_emulate_mmio() to set up the MMIO emulation.
      
      This is based on an earlier patch by Benjamin Herrenschmidt, but since
      heavily reworked.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      697d3899
    • P
      KVM: PPC: Maintain a doubly-linked list of guest HPTEs for each gfn · 06ce2c63
      Paul Mackerras 提交于
      This expands the reverse mapping array to contain two links for each
      HPTE which are used to link together HPTEs that correspond to the
      same guest logical page.  Each circular list of HPTEs is pointed to
      by the rmap array entry for the guest logical page, pointed to by
      the relevant memslot.  Links are 32-bit HPT entry indexes rather than
      full 64-bit pointers, to save space.  We use 3 of the remaining 32
      bits in the rmap array entries as a lock bit, a referenced bit and
      a present bit (the present bit is needed since HPTE index 0 is valid).
      The bit lock for the rmap chain nests inside the HPTE lock bit.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      06ce2c63
    • P
      KVM: PPC: Allow I/O mappings in memory slots · 9d0ef5ea
      Paul Mackerras 提交于
      This provides for the case where userspace maps an I/O device into the
      address range of a memory slot using a VM_PFNMAP mapping.  In that
      case, we work out the pfn from vma->vm_pgoff, and record the cache
      enable bits from vma->vm_page_prot in two low-order bits in the
      slot_phys array entries.  Then, in kvmppc_h_enter() we check that the
      cache bits in the HPTE that the guest wants to insert match the cache
      bits in the slot_phys array entry.  However, we do allow the guest to
      create what it thinks is a non-cacheable or write-through mapping to
      memory that is actually cacheable, so that we can use normal system
      memory as part of an emulated device later on.  In that case the actual
      HPTE we insert is a cacheable HPTE.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      9d0ef5ea
    • P
      KVM: PPC: Allow use of small pages to back Book3S HV guests · da9d1d7f
      Paul Mackerras 提交于
      This relaxes the requirement that the guest memory be provided as
      16MB huge pages, allowing it to be provided as normal memory, i.e.
      in pages of PAGE_SIZE bytes (4k or 64k).  To allow this, we index
      the kvm->arch.slot_phys[] arrays with a small page index, even if
      huge pages are being used, and use the low-order 5 bits of each
      entry to store the order of the enclosing page with respect to
      normal pages, i.e. log_2(enclosing_page_size / PAGE_SIZE).
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      da9d1d7f
    • P
      KVM: PPC: Only get pages when actually needed, not in prepare_memory_region() · c77162de
      Paul Mackerras 提交于
      This removes the code from kvmppc_core_prepare_memory_region() that
      looked up the VMA for the region being added and called hva_to_page
      to get the pfns for the memory.  We have no guarantee that there will
      be anything mapped there at the time of the KVM_SET_USER_MEMORY_REGION
      ioctl call; userspace can do that ioctl and then map memory into the
      region later.
      
      Instead we defer looking up the pfn for each memory page until it is
      needed, which generally means when the guest does an H_ENTER hcall on
      the page.  Since we can't call get_user_pages in real mode, if we don't
      already have the pfn for the page, kvmppc_h_enter() will return
      H_TOO_HARD and we then call kvmppc_virtmode_h_enter() once we get back
      to kernel context.  That calls kvmppc_get_guest_page() to get the pfn
      for the page, and then calls back to kvmppc_h_enter() to redo the HPTE
      insertion.
      
      When the first vcpu starts executing, we need to have the RMO or VRMA
      region mapped so that the guest's real mode accesses will work.  Thus
      we now have a check in kvmppc_vcpu_run() to see if the RMO/VRMA is set
      up and if not, call kvmppc_hv_setup_rma().  It checks if the memslot
      starting at guest physical 0 now has RMO memory mapped there; if so it
      sets it up for the guest, otherwise on POWER7 it sets up the VRMA.
      The function that does that, kvmppc_map_vrma, is now a bit simpler,
      as it calls kvmppc_virtmode_h_enter instead of creating the HPTE itself.
      
      Since we are now potentially updating entries in the slot_phys[]
      arrays from multiple vcpu threads, we now have a spinlock protecting
      those updates to ensure that we don't lose track of any references
      to pages.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      c77162de
    • P
      KVM: PPC: Make the H_ENTER hcall more reliable · 075295dd
      Paul Mackerras 提交于
      At present, our implementation of H_ENTER only makes one try at locking
      each slot that it looks at, and doesn't even retry the ldarx/stdcx.
      atomic update sequence that it uses to attempt to lock the slot.  Thus
      it can return the H_PTEG_FULL error unnecessarily, particularly when
      the H_EXACT flag is set, meaning that the caller wants a specific PTEG
      slot.
      
      This improves the situation by making a second pass when no free HPTE
      slot is found, where we spin until we succeed in locking each slot in
      turn and then check whether it is full while we hold the lock.  If the
      second pass fails, then we return H_PTEG_FULL.
      
      This also moves lock_hpte to a header file (since later commits in this
      series will need to use it from other source files) and renames it to
      try_lock_hpte, which is a somewhat less misleading name.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      075295dd
    • P
      KVM: PPC: Add an interface for pinning guest pages in Book3s HV guests · 93e60249
      Paul Mackerras 提交于
      This adds two new functions, kvmppc_pin_guest_page() and
      kvmppc_unpin_guest_page(), and uses them to pin the guest pages where
      the guest has registered areas of memory for the hypervisor to update,
      (i.e. the per-cpu virtual processor areas, SLB shadow buffers and
      dispatch trace logs) and then unpin them when they are no longer
      required.
      
      Although it is not strictly necessary to pin the pages at this point,
      since all guest pages are already pinned, later commits in this series
      will mean that guest pages aren't all pinned.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      93e60249
    • P
      KVM: PPC: Keep page physical addresses in per-slot arrays · b2b2f165
      Paul Mackerras 提交于
      This allocates an array for each memory slot that is added to store
      the physical addresses of the pages in the slot.  This array is
      vmalloc'd and accessed in kvmppc_h_enter using real_vmalloc_addr().
      This allows us to remove the ram_pginfo field from the kvm_arch
      struct, and removes the 64GB guest RAM limit that we had.
      
      We use the low-order bits of the array entries to store a flag
      indicating that we have done get_page on the corresponding page,
      and therefore need to call put_page when we are finished with the
      page.  Currently this is set for all pages except those in our
      special RMO regions.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      b2b2f165
    • P
      KVM: PPC: Keep a record of HV guest view of hashed page table entries · 8936dda4
      Paul Mackerras 提交于
      This adds an array that parallels the guest hashed page table (HPT),
      that is, it has one entry per HPTE, used to store the guest's view
      of the second doubleword of the corresponding HPTE.  The first
      doubleword in the HPTE is the same as the guest's idea of it, so we
      don't need to store a copy, but the second doubleword in the HPTE has
      the real page number rather than the guest's logical page number.
      This allows us to remove the back_translate() and reverse_xlate()
      functions.
      
      This "reverse mapping" array is vmalloc'd, meaning that to access it
      in real mode we have to walk the kernel's page tables explicitly.
      That is done by the new real_vmalloc_addr() function.  (In fact this
      returns an address in the linear mapping, so the result is usable
      both in real mode and in virtual mode.)
      
      There are also some minor cleanups here: moving the definitions of
      HPT_ORDER etc. to a header file and defining HPT_NPTE for HPT_NPTEG << 3.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      8936dda4
    • S
      KVM: PPC: e500: use hardware hint when loading TLB0 entries · 57013524
      Scott Wood 提交于
      The hardware maintains a per-set next victim hint.  Using this
      reduces conflicts, especially on e500v2 where a single guest
      TLB entry is mapped to two shadow TLB entries (user and kernel).
      We want those two entries to go to different TLB ways.
      
      sesel is now only used for TLB1.
      Reported-by: NLiu Yu <yu.liu@freescale.com>
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      57013524
    • A
      KVM: PPC: Use get/set for to_svcpu to help preemption · 468a12c2
      Alexander Graf 提交于
      When running the 64-bit Book3s PR code without CONFIG_PREEMPT_NONE, we were
      doing a few things wrong, most notably access to PACA fields without making
      sure that the pointers stay stable accross the access (preempt_disable()).
      
      This patch moves to_svcpu towards a get/put model which allows us to disable
      preemption while accessing the shadow vcpu fields in the PACA. That way we
      can run preemptible and everyone's happy!
      Reported-by: NJörg Sommer <joerg@alea.gnuu.de>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      468a12c2
    • S
      KVM: PPC: booke: Improve timer register emulation · dfd4d47e
      Scott Wood 提交于
      Decrementers are now properly driven by TCR/TSR, and the guest
      has full read/write access to these registers.
      
      The decrementer keeps ticking (and setting the TSR bit) regardless of
      whether the interrupts are enabled with TCR.
      
      The decrementer stops at zero, rather than going negative.
      
      Decrementers (and FITs, once implemented) are delivered as
      level-triggered interrupts -- dequeued when the TSR bit is cleared, not
      on delivery.
      Signed-off-by: NLiu Yu <yu.liu@freescale.com>
      [scottwood@freescale.com: significant changes]
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      dfd4d47e
    • S
      KVM: PPC: Paravirtualize SPRG4-7, ESR, PIR, MASn · b5904972
      Scott Wood 提交于
      This allows additional registers to be accessed by the guest
      in PR-mode KVM without trapping.
      
      SPRG4-7 are readable from userspace.  On booke, KVM will sync
      these registers when it enters the guest, so that accesses from
      guest userspace will work.  The guest kernel, OTOH, must consistently
      use either the real registers or the shared area between exits.  This
      also applies to the already-paravirted SPRG3.
      
      On non-booke, it's not clear to what extent SPRG4-7 are supported
      (they're not architected for book3s, but exist on at least some classic
      chips).  They are copied in the get/set regs ioctls, but I do not see any
      non-booke emulation.  I also do not see any syncing with real registers
      (in PR-mode) including the user-readable SPRG3.  This patch should not
      make that situation any worse.
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      b5904972
    • S
      KVM: PPC: Rename deliver_interrupts to prepare_to_enter · 7e28e60e
      Scott Wood 提交于
      This function also updates paravirt int_pending, so rename it
      to be more obvious that this is a collection of checks run prior
      to (re)entering a guest.
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      7e28e60e
    • S
      KVM: PPC: e500: MMU API · dc83b8bc
      Scott Wood 提交于
      This implements a shared-memory API for giving host userspace access to
      the guest's TLB.
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      dc83b8bc
    • S
      KVM: PPC: e500: clear up confusion between host and guest entries · 0164c0f0
      Scott Wood 提交于
      Split out the portions of tlbe_priv that should be associated with host
      entries into tlbe_ref.  Base victim selection on the number of hardware
      entries, not guest entries.
      
      For TLB1, where one guest entry can be mapped by multiple host entries,
      we use the host tlbe_ref for tracking page references.  For the guest
      TLB0 entries, we still track it with gtlb_priv, to avoid having to
      retranslate if the entry is evicted from the host TLB but not the
      guest TLB.
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      0164c0f0
    • C
      KVM: provide synchronous registers in kvm_run · b9e5dc8d
      Christian Borntraeger 提交于
      On some cpus the overhead for virtualization instructions is in the same
      range as a system call. Having to call multiple ioctls to get set registers
      will make certain userspace handled exits more expensive than necessary.
      Lets provide a section in kvm_run that works as a shared save area
      for guest registers.
      We also provide two 64bit flags fields (architecture specific), that will
      specify
      1. which parts of these fields are valid.
      2. which registers were modified by userspace
      
      Each bit for these flag fields will define a group of registers (like
      general purpose) or a single register.
      Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      b9e5dc8d
  2. 14 2月, 2012 2 次提交
  3. 18 1月, 2012 1 次提交
    • E
      Audit: push audit success and retcode into arch ptrace.h · d7e7528b
      Eric Paris 提交于
      The audit system previously expected arches calling to audit_syscall_exit to
      supply as arguments if the syscall was a success and what the return code was.
      Audit also provides a helper AUDITSC_RESULT which was supposed to simplify things
      by converting from negative retcodes to an audit internal magic value stating
      success or failure.  This helper was wrong and could indicate that a valid
      pointer returned to userspace was a failed syscall.  The fix is to fix the
      layering foolishness.  We now pass audit_syscall_exit a struct pt_reg and it
      in turns calls back into arch code to collect the return value and to
      determine if the syscall was a success or failure.  We also define a generic
      is_syscall_success() macro which determines success/failure based on if the
      value is < -MAX_ERRNO.  This works for arches like x86 which do not use a
      separate mechanism to indicate syscall failure.
      
      We make both the is_syscall_success() and regs_return_value() static inlines
      instead of macros.  The reason is because the audit function must take a void*
      for the regs.  (uml calls theirs struct uml_pt_regs instead of just struct
      pt_regs so audit_syscall_exit can't take a struct pt_regs).  Since the audit
      function takes a void* we need to use static inlines to cast it back to the
      arch correct structure to dereference it.
      
      The other major change is that on some arches, like ia64, MIPS and ppc, we
      change regs_return_value() to give us the negative value on syscall failure.
      THE only other user of this macro, kretprobe_example.c, won't notice and it
      makes the value signed consistently for the audit functions across all archs.
      
      In arch/sh/kernel/ptrace_64.c I see that we were using regs[9] in the old
      audit code as the return value.  But the ptrace_64.h code defined the macro
      regs_return_value() as regs[3].  I have no idea which one is correct, but this
      patch now uses the regs_return_value() function, so it now uses regs[3].
      
      For powerpc we previously used regs->result but now use the
      regs_return_value() function which uses regs->gprs[3].  regs->gprs[3] is
      always positive so the regs_return_value(), much like ia64 makes it negative
      before calling the audit code when appropriate.
      Signed-off-by: NEric Paris <eparis@redhat.com>
      Acked-by: H. Peter Anvin <hpa@zytor.com> [for x86 portion]
      Acked-by: Tony Luck <tony.luck@intel.com> [for ia64]
      Acked-by: Richard Weinberger <richard@nod.at> [for uml]
      Acked-by: David S. Miller <davem@davemloft.net> [for sparc]
      Acked-by: Ralf Baechle <ralf@linux-mips.org> [for mips]
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [for ppc]
      d7e7528b
  4. 07 1月, 2012 2 次提交
  5. 05 1月, 2012 1 次提交
  6. 04 1月, 2012 2 次提交
  7. 30 12月, 2011 1 次提交
    • A
      procfs: do not confuse jiffies with cputime64_t · 34845636
      Andreas Schwab 提交于
      Commit 2a95ea6c ("procfs: do not overflow get_{idle,iowait}_time
      for nohz") did not take into account that one some architectures jiffies
      and cputime use different units.
      
      This causes get_idle_time() to return numbers in the wrong units, making
      the idle time fields in /proc/stat wrong.
      
      Instead of converting the usec value returned by
      get_cpu_{idle,iowait}_time_us to units of jiffies, use the new function
      usecs_to_cputime64 to convert it to the correct unit of cputime64_t.
      Signed-off-by: NAndreas Schwab <schwab@linux-m68k.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: "Artem S. Tashkinov" <t.artem@mailcity.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      34845636
  8. 27 12月, 2011 1 次提交
  9. 26 12月, 2011 1 次提交
  10. 22 12月, 2011 1 次提交
    • K
      cpu: convert 'cpu' and 'machinecheck' sysdev_class to a regular subsystem · 8a25a2fd
      Kay Sievers 提交于
      This moves the 'cpu sysdev_class' over to a regular 'cpu' subsystem
      and converts the devices to regular devices. The sysdev drivers are
      implemented as subsystem interfaces now.
      
      After all sysdev classes are ported to regular driver core entities, the
      sysdev implementation will be entirely removed from the kernel.
      
      Userspace relies on events and generic sysfs subsystem infrastructure
      from sysdev devices, which are made available with this conversion.
      
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Borislav Petkov <bp@amd64.org>
      Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Zhang Rui <rui.zhang@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: NKay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>
      8a25a2fd
  11. 20 12月, 2011 2 次提交
    • S
      powerpc: Define virtual-physical translations for RELOCATABLE · 368ff8f1
      Suzuki Poulose 提交于
      We find the runtime address of _stext and relocate ourselves based
      on the following calculation.
      
      	virtual_base = ALIGN(KERNELBASE,KERNEL_TLB_PIN_SIZE) +
      			MODULO(_stext.run,KERNEL_TLB_PIN_SIZE)
      
      relocate() is called with the Effective Virtual Base Address (as
      shown below)
      
                  | Phys. Addr| Virt. Addr |
      Page        |------------------------|
      Boundary    |           |            |
                  |           |            |
                  |           |            |
      Kernel Load |___________|_ __ _ _ _ _|<- Effective
      Addr(_stext)|           |      ^     |Virt. Base Addr
                  |           |      |     |
                  |           |      |     |
                  |           |reloc_offset|
                  |           |      |     |
                  |           |      |     |
                  |           |______v_____|<-(KERNELBASE)%TLB_SIZE
                  |           |            |
                  |           |            |
                  |           |            |
      Page        |-----------|------------|
      Boundary    |           |            |
      
      On BookE, we need __va() & __pa() early in the boot process to access
      the device tree.
      
      Currently this has been defined as :
      
      #define __va(x) ((void *)(unsigned long)((phys_addr_t)(x) -
      						PHYSICAL_START + KERNELBASE)
      where:
       PHYSICAL_START is kernstart_addr - a variable updated at runtime.
       KERNELBASE	is the compile time Virtual base address of kernel.
      
      This won't work for us, as kernstart_addr is dynamic and will yield different
      results for __va()/__pa() for same mapping.
      
      e.g.,
      
      Let the kernel be loaded at 64MB and KERNELBASE be 0xc0000000 (same as
      PAGE_OFFSET).
      
      In this case, we would be mapping 0 to 0xc0000000, and kernstart_addr = 64M
      
      Now __va(1MB) = (0x100000) - (0x4000000) + 0xc0000000
      		= 0xbc100000 , which is wrong.
      
      it should be : 0xc0000000 + 0x100000 = 0xc0100000
      
      On platforms which support AMP, like PPC_47x (based on 44x), the kernel
      could be loaded at highmem. Hence we cannot always depend on the compile
      time constants for mapping.
      
      Here are the possible solutions:
      
      1) Update kernstart_addr(PHSYICAL_START) to match the Physical address of
      compile time KERNELBASE value, instead of the actual Physical_Address(_stext).
      
      The disadvantage is that we may break other users of PHYSICAL_START. They
      could be replaced with __pa(_stext).
      
      2) Redefine __va() & __pa() with relocation offset
      
      #ifdef	CONFIG_RELOCATABLE_PPC32
      #define __va(x) ((void *)(unsigned long)((phys_addr_t)(x) - PHYSICAL_START + (KERNELBASE + RELOC_OFFSET)))
      #define __pa(x) ((unsigned long)(x) + PHYSICAL_START - (KERNELBASE + RELOC_OFFSET))
      #endif
      
      where, RELOC_OFFSET could be
      
        a) A variable, say relocation_offset (like kernstart_addr), updated
           at boot time. This impacts performance, as we have to load an additional
           variable from memory.
      
      		OR
      
        b) #define RELOC_OFFSET ((PHYSICAL_START & PPC_PIN_SIZE_OFFSET_MASK) - \
                            (KERNELBASE & PPC_PIN_SIZE_OFFSET_MASK))
      
         This introduces more calculations for doing the translation.
      
      3) Redefine __va() & __pa() with a new variable
      
      i.e,
      
      #define __va(x) ((void *)(unsigned long)((phys_addr_t)(x) + VIRT_PHYS_OFFSET))
      
      where VIRT_PHYS_OFFSET :
      
      #ifdef CONFIG_RELOCATABLE_PPC32
      #define VIRT_PHYS_OFFSET virt_phys_offset
      #else
      #define VIRT_PHYS_OFFSET (KERNELBASE - PHYSICAL_START)
      #endif /* CONFIG_RELOCATABLE_PPC32 */
      
      where virt_phy_offset is updated at runtime to :
      
      	Effective KERNELBASE - kernstart_addr.
      
      Taking our example, above:
      
      virt_phys_offset = effective_kernelstart_vaddr - kernstart_addr
      		 = 0xc0400000 - 0x400000
      		 = 0xc0000000
      	and
      
      	__va(0x100000) = 0xc0000000 + 0x100000 = 0xc0100000
      	 which is what we want.
      
      I have implemented (3) in the following patch which has same cost of
      operation as the existing one.
      
      I have tested the patches on 440x platforms only. However this should
      work fine for PPC_47x also, as we only depend on the runtime address
      and the current TLB XLAT entry for the startup code, which is available
      in r25. I don't have access to a 47x board yet. So, it would be great if
      somebody could test this on 47x.
      Signed-off-by: NSuzuki K. Poulose <suzuki@in.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Kumar Gala <galak@kernel.crashing.org>
      Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
      Signed-off-by: NJosh Boyer <jwboyer@gmail.com>
      368ff8f1
    • S
      powerpc: Rename mapping based RELOCATABLE to DYNAMIC_MEMSTART for BookE · 0f890c8d
      Suzuki Poulose 提交于
      The current implementation of CONFIG_RELOCATABLE in BookE is based
      on mapping the page aligned kernel load address to KERNELBASE. This
      approach however is not enough for platforms, where the TLB page size
      is large (e.g, 256M on 44x). So we are renaming the RELOCATABLE used
      currently in BookE to DYNAMIC_MEMSTART to reflect the actual method.
      
      The CONFIG_RELOCATABLE for PPC32(BookE) based on processing of the
      dynamic relocations will be introduced in the later in the patch series.
      
      This change would allow the use of the old method of RELOCATABLE for
      platforms which can afford to enforce the page alignment (platforms with
      smaller TLB size).
      
      Changes since v3:
      
      * Introduced a new config, NONSTATIC_KERNEL, to denote a kernel which is
        either a RELOCATABLE or DYNAMIC_MEMSTART(Suggested by: Josh Boyer)
      Suggested-by: NScott Wood <scottwood@freescale.com>
      Tested-by: NScott Wood <scottwood@freescale.com>
      Signed-off-by: NSuzuki K. Poulose <suzuki@in.ibm.com>
      Cc: Scott Wood <scottwood@freescale.com>
      Cc: Kumar Gala <galak@kernel.crashing.org>
      Cc: Josh Boyer <jwboyer@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: linux ppc dev <linuxppc-dev@lists.ozlabs.org>
      Signed-off-by: NJosh Boyer <jwboyer@gmail.com>
      0f890c8d
  12. 19 12月, 2011 4 次提交
    • A
      powerpc: Fix comment explaining our VSID layout · b206590c
      Anton Blanchard 提交于
      We support 16TB of user address space and half a million contexts
      so update the comment to reflect this.
      Signed-off-by: NAnton Blanchard <anton@samba.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      b206590c
    • A
      powerpc: Fix wrong divisor in usecs_to_cputime · 9f5072d4
      Andreas Schwab 提交于
      Commit d57af9b2 (taskstats: use real microsecond granularity for CPU times)
      renamed msecs_to_cputime to usecs_to_cputime, but failed to update all
      numbers on the way.  This causes nonsensical cpu idle/iowait values to be
      displayed in /proc/stat (the only user of usecs_to_cputime so far).
      
      This also renames __cputime_msec_factor to __cputime_usec_factor, adapting
      its value and using it directly in cputime_to_usecs instead of doing two
      multiplications.
      Signed-off-by: NAndreas Schwab <schwab@linux-m68k.org>
      Acked-by: NAnton Blanchard <anton@samba.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      9f5072d4
    • M
      powerpc: Add __SANE_USERSPACE_TYPES__ to asm/types.h for LL64 · 2c9c6ce0
      Matt Evans 提交于
      PPC64 uses long long for u64 in the kernel, but powerpc's asm/types.h
      prevents 64-bit userland from seeing this definition, instead defaulting
      to u64 == long in userspace.  Some user programs (e.g. kvmtool) may actually
      want LL64, so this patch adds a check for __SANE_USERSPACE_TYPES__ so that,
      if defined, int-ll64.h is included instead.
      Signed-off-by: NMatt Evans <matt@ozlabs.org>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      2c9c6ce0
    • A
      powerpc: POWER7 optimised copy_to_user/copy_from_user using VMX · a66086b8
      Anton Blanchard 提交于
      Implement a POWER7 optimised copy_to_user/copy_from_user using VMX.
      For large aligned copies this new loop is over 10% faster, and for
      large unaligned copies it is over 200% faster.
      
      If we take a fault we fall back to the old version, this keeps
      things relatively simple and easy to verify.
      
      On POWER7 unaligned stores rarely slow down - they only flush when
      a store crosses a 4KB page boundary. Furthermore this flush is
      handled completely in hardware and should be 20-30 cycles.
      
      Unaligned loads on the other hand flush much more often - whenever
      crossing a 128 byte cache line, or a 32 byte sector if either sector
      is an L1 miss.
      
      Considering this information we really want to get the loads aligned
      and not worry about the alignment of the stores. Microbenchmarks
      confirm that this approach is much faster than the current unaligned
      copy loop that uses shifts and rotates to ensure both loads and
      stores are aligned.
      
      We also want to try and do the stores in cacheline aligned, cacheline
      sized chunks. If the store queue is unable to merge an entire
      cacheline of stores then the L2 cache will have to do a
      read/modify/write. Even worse, we will serialise this with the stores
      in the next iteration of the copy loop since both iterations hit
      the same cacheline.
      
      Based on this, the new loop does the following things:
      
      1 - 127 bytes
      Get the source 8 byte aligned and use 8 byte loads and stores. Pretty
      boring and similar to how the current loop works.
      
      128 - 4095 bytes
      Get the source 8 byte aligned and use 8 byte loads and stores,
      1 cacheline at a time. We aren't doing the stores in cacheline
      aligned chunks so we will potentially serialise once per cacheline.
      Even so it is much better than the loop we have today.
      
      4096 - bytes
      If both source and destination have the same alignment get them both
      16 byte aligned, then get the destination cacheline aligned. Do
      cacheline sized loads and stores using VMX.
      
      If source and destination do not have the same alignment, we get the
      destination cacheline aligned, and use permute to do aligned loads.
      
      In both cases the VMX loop should be optimal - we always do aligned
      loads and stores and are always doing stores in cacheline aligned,
      cacheline sized chunks.
      
      To be able to use VMX we must be careful about interrupts and
      sleeping. We don't use the VMX loop when in an interrupt (which should
      be rare anyway) and we wrap the VMX loop in disable/enable_pagefault
      and fall back to the existing copy_tofrom_user loop if we do need to
      sleep.
      
      The VMX breakpoint of 4096 bytes was chosen using this microbenchmark:
      
      http://ozlabs.org/~anton/junkcode/copy_to_user.c
      
      Since we are using VMX and there is a cost to saving and restoring
      the user VMX state there are two broad cases we need to benchmark:
      
      - Best case - userspace never uses VMX
      
      - Worst case - userspace always uses VMX
      
      In reality a userspace process will sit somewhere between these two
      extremes. Since we need to test both aligned and unaligned copies we
      end up with 4 combinations. The point at which the VMX loop begins to
      win is:
      
      0% VMX
      aligned		2048 bytes
      unaligned	2048 bytes
      
      100% VMX
      aligned		16384 bytes
      unaligned	8192 bytes
      
      Considering this is a microbenchmark, the data is hot in cache and
      the VMX loop has better store queue merging properties we set the
      breakpoint to 4096 bytes, a little below the unaligned breakpoints.
      
      Some future optimisations we can look at:
      
      - Looking at the perf data, a significant part of the cost when a
        task is always using VMX is the extra exception we take to restore
        the VMX state. As such we should do something similar to the x86
        optimisation that restores FPU state for heavy users. ie:
      
              /*
               * If the task has used fpu the last 5 timeslices, just do a full
               * restore of the math state immediately to avoid the trap; the
               * chances of needing FPU soon are obviously high now
               */
              preload_fpu = tsk_used_math(next_p) && next_p->fpu_counter > 5;
      
        and
      
              /*
               * fpu_counter contains the number of consecutive context switches
               * that the FPU is used. If this is over a threshold, the lazy fpu
               * saving becomes unlazy to save the trap. This is an unsigned char
               * so that after 256 times the counter wraps and the behavior turns
               * lazy again; this to deal with bursty apps that only use FPU for
               * a short time
               */
      
      - We could create a paca bit to mirror the VMX enabled MSR bit and check
        that first, avoiding multiple calls to calling enable_kernel_altivec.
        That should help with iovec based system calls like readv.
      
      - We could have two VMX breakpoints, one for when we know the user VMX
        state is loaded into the registers and one when it isn't. This could
        be a second bit in the paca so we can calculate the break points quickly.
      
      - One suggestion from Ben was to save and restore the VSX registers
        we use inline instead of using enable_kernel_altivec.
      
      [BenH: Fixed a problem with preempt and fixed build without CONFIG_ALTIVEC]
      Signed-off-by: NAnton Blanchard <anton@samba.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      a66086b8
  13. 16 12月, 2011 1 次提交
  14. 15 12月, 2011 1 次提交
  15. 09 12月, 2011 1 次提交