1. 05 3月, 2012 26 次提交
    • S
      KVM: PPC: refer to paravirt docs in header file · 54f65795
      Scott Wood 提交于
      Instead of keeping separate copies of struct kvm_vcpu_arch_shared (one in
      the code, one in the docs) that inevitably fail to be kept in sync
      (already sr[] is missing from the doc version), just point to the header
      file as the source of documentation on the contents of the magic page.
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      Acked-by: NAvi Kivity <avi@redhat.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      54f65795
    • A
      KVM: PPC: Rename MMIO register identifiers · b3c5d3c2
      Alexander Graf 提交于
      We need the KVM_REG namespace for generic register settings now, so
      let's rename the existing users to something different, enabling
      us to reuse the namespace for more visible interfaces.
      
      While at it, also move these private constants to a private header.
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      b3c5d3c2
    • P
      KVM: PPC: Move kvm_vcpu_ioctl_[gs]et_one_reg down to platform-specific code · 31f3438e
      Paul Mackerras 提交于
      This moves the get/set_one_reg implementation down from powerpc.c into
      booke.c, book3s_pr.c and book3s_hv.c.  This avoids #ifdefs in C code,
      but more importantly, it fixes a bug on Book3s HV where we were
      accessing beyond the end of the kvm_vcpu struct (via the to_book3s()
      macro) and corrupting memory, causing random crashes and file corruption.
      
      On Book3s HV we only accept setting the HIOR to zero, since the guest
      runs in supervisor mode and its vectors are never offset from zero.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      [agraf update to apply on top of changed ONE_REG patches]
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      31f3438e
    • A
      KVM: PPC: Add support for explicit HIOR setting · 1022fc3d
      Alexander Graf 提交于
      Until now, we always set HIOR based on the PVR, but this is just wrong.
      Instead, we should be setting HIOR explicitly, so user space can decide
      what the initial HIOR value is - just like on real hardware.
      
      We keep the old PVR based way around for backwards compatibility, but
      once user space uses the SET_ONE_REG based method, we drop the PVR logic.
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      1022fc3d
    • P
      KVM: PPC: Book3s HV: Implement get_dirty_log using hardware changed bit · 82ed3616
      Paul Mackerras 提交于
      This changes the implementation of kvm_vm_ioctl_get_dirty_log() for
      Book3s HV guests to use the hardware C (changed) bits in the guest
      hashed page table.  Since this makes the implementation quite different
      from the Book3s PR case, this moves the existing implementation from
      book3s.c to book3s_pr.c and creates a new implementation in book3s_hv.c.
      That implementation calls kvmppc_hv_get_dirty_log() to do the actual
      work by calling kvm_test_clear_dirty on each page.  It iterates over
      the HPTEs, clearing the C bit if set, and returns 1 if any C bit was
      set (including the saved C bit in the rmap entry).
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      82ed3616
    • P
      KVM: PPC: Book3S HV: Use the hardware referenced bit for kvm_age_hva · 55514893
      Paul Mackerras 提交于
      This uses the host view of the hardware R (referenced) bit to speed
      up kvm_age_hva() and kvm_test_age_hva().  Instead of removing all
      the relevant HPTEs in kvm_age_hva(), we now just reset their R bits
      if set.  Also, kvm_test_age_hva() now scans the relevant HPTEs to
      see if any of them have R set.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      55514893
    • P
      KVM: PPC: Book3s HV: Maintain separate guest and host views of R and C bits · bad3b507
      Paul Mackerras 提交于
      This allows both the guest and the host to use the referenced (R) and
      changed (C) bits in the guest hashed page table.  The guest has a view
      of R and C that is maintained in the guest_rpte field of the revmap
      entry for the HPTE, and the host has a view that is maintained in the
      rmap entry for the associated gfn.
      
      Both view are updated from the guest HPT.  If a bit (R or C) is zero
      in either view, it will be initially set to zero in the HPTE (or HPTEs),
      until set to 1 by hardware.  When an HPTE is removed for any reason,
      the R and C bits from the HPTE are ORed into both views.  We have to
      be careful to read the R and C bits from the HPTE after invalidating
      it, but before unlocking it, in case of any late updates by the hardware.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      bad3b507
    • P
      KVM: PPC: Allow for read-only pages backing a Book3S HV guest · 4cf302bc
      Paul Mackerras 提交于
      With this, if a guest does an H_ENTER with a read/write HPTE on a page
      which is currently read-only, we make the actual HPTE inserted be a
      read-only version of the HPTE.  We now intercept protection faults as
      well as HPTE not found faults, and for a protection fault we work out
      whether it should be reflected to the guest (e.g. because the guest HPTE
      didn't allow write access to usermode) or handled by switching to
      kernel context and calling kvmppc_book3s_hv_page_fault, which will then
      request write access to the page and update the actual HPTE.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      4cf302bc
    • P
      KVM: PPC: Implement MMU notifiers for Book3S HV guests · 342d3db7
      Paul Mackerras 提交于
      This adds the infrastructure to enable us to page out pages underneath
      a Book3S HV guest, on processors that support virtualized partition
      memory, that is, POWER7.  Instead of pinning all the guest's pages,
      we now look in the host userspace Linux page tables to find the
      mapping for a given guest page.  Then, if the userspace Linux PTE
      gets invalidated, kvm_unmap_hva() gets called for that address, and
      we replace all the guest HPTEs that refer to that page with absent
      HPTEs, i.e. ones with the valid bit clear and the HPTE_V_ABSENT bit
      set, which will cause an HDSI when the guest tries to access them.
      Finally, the page fault handler is extended to reinstantiate the
      guest HPTE when the guest tries to access a page which has been paged
      out.
      
      Since we can't intercept the guest DSI and ISI interrupts on PPC970,
      we still have to pin all the guest pages on PPC970.  We have a new flag,
      kvm->arch.using_mmu_notifiers, that indicates whether we can page
      guest pages out.  If it is not set, the MMU notifier callbacks do
      nothing and everything operates as before.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      342d3db7
    • P
      KVM: PPC: Implement MMIO emulation support for Book3S HV guests · 697d3899
      Paul Mackerras 提交于
      This provides the low-level support for MMIO emulation in Book3S HV
      guests.  When the guest tries to map a page which is not covered by
      any memslot, that page is taken to be an MMIO emulation page.  Instead
      of inserting a valid HPTE, we insert an HPTE that has the valid bit
      clear but another hypervisor software-use bit set, which we call
      HPTE_V_ABSENT, to indicate that this is an absent page.  An
      absent page is treated much like a valid page as far as guest hcalls
      (H_ENTER, H_REMOVE, H_READ etc.) are concerned, except of course that
      an absent HPTE doesn't need to be invalidated with tlbie since it
      was never valid as far as the hardware is concerned.
      
      When the guest accesses a page for which there is an absent HPTE, it
      will take a hypervisor data storage interrupt (HDSI) since we now set
      the VPM1 bit in the LPCR.  Our HDSI handler for HPTE-not-present faults
      looks up the hash table and if it finds an absent HPTE mapping the
      requested virtual address, will switch to kernel mode and handle the
      fault in kvmppc_book3s_hv_page_fault(), which at present just calls
      kvmppc_hv_emulate_mmio() to set up the MMIO emulation.
      
      This is based on an earlier patch by Benjamin Herrenschmidt, but since
      heavily reworked.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      697d3899
    • P
      KVM: PPC: Maintain a doubly-linked list of guest HPTEs for each gfn · 06ce2c63
      Paul Mackerras 提交于
      This expands the reverse mapping array to contain two links for each
      HPTE which are used to link together HPTEs that correspond to the
      same guest logical page.  Each circular list of HPTEs is pointed to
      by the rmap array entry for the guest logical page, pointed to by
      the relevant memslot.  Links are 32-bit HPT entry indexes rather than
      full 64-bit pointers, to save space.  We use 3 of the remaining 32
      bits in the rmap array entries as a lock bit, a referenced bit and
      a present bit (the present bit is needed since HPTE index 0 is valid).
      The bit lock for the rmap chain nests inside the HPTE lock bit.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      06ce2c63
    • P
      KVM: PPC: Allow I/O mappings in memory slots · 9d0ef5ea
      Paul Mackerras 提交于
      This provides for the case where userspace maps an I/O device into the
      address range of a memory slot using a VM_PFNMAP mapping.  In that
      case, we work out the pfn from vma->vm_pgoff, and record the cache
      enable bits from vma->vm_page_prot in two low-order bits in the
      slot_phys array entries.  Then, in kvmppc_h_enter() we check that the
      cache bits in the HPTE that the guest wants to insert match the cache
      bits in the slot_phys array entry.  However, we do allow the guest to
      create what it thinks is a non-cacheable or write-through mapping to
      memory that is actually cacheable, so that we can use normal system
      memory as part of an emulated device later on.  In that case the actual
      HPTE we insert is a cacheable HPTE.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      9d0ef5ea
    • P
      KVM: PPC: Allow use of small pages to back Book3S HV guests · da9d1d7f
      Paul Mackerras 提交于
      This relaxes the requirement that the guest memory be provided as
      16MB huge pages, allowing it to be provided as normal memory, i.e.
      in pages of PAGE_SIZE bytes (4k or 64k).  To allow this, we index
      the kvm->arch.slot_phys[] arrays with a small page index, even if
      huge pages are being used, and use the low-order 5 bits of each
      entry to store the order of the enclosing page with respect to
      normal pages, i.e. log_2(enclosing_page_size / PAGE_SIZE).
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      da9d1d7f
    • P
      KVM: PPC: Only get pages when actually needed, not in prepare_memory_region() · c77162de
      Paul Mackerras 提交于
      This removes the code from kvmppc_core_prepare_memory_region() that
      looked up the VMA for the region being added and called hva_to_page
      to get the pfns for the memory.  We have no guarantee that there will
      be anything mapped there at the time of the KVM_SET_USER_MEMORY_REGION
      ioctl call; userspace can do that ioctl and then map memory into the
      region later.
      
      Instead we defer looking up the pfn for each memory page until it is
      needed, which generally means when the guest does an H_ENTER hcall on
      the page.  Since we can't call get_user_pages in real mode, if we don't
      already have the pfn for the page, kvmppc_h_enter() will return
      H_TOO_HARD and we then call kvmppc_virtmode_h_enter() once we get back
      to kernel context.  That calls kvmppc_get_guest_page() to get the pfn
      for the page, and then calls back to kvmppc_h_enter() to redo the HPTE
      insertion.
      
      When the first vcpu starts executing, we need to have the RMO or VRMA
      region mapped so that the guest's real mode accesses will work.  Thus
      we now have a check in kvmppc_vcpu_run() to see if the RMO/VRMA is set
      up and if not, call kvmppc_hv_setup_rma().  It checks if the memslot
      starting at guest physical 0 now has RMO memory mapped there; if so it
      sets it up for the guest, otherwise on POWER7 it sets up the VRMA.
      The function that does that, kvmppc_map_vrma, is now a bit simpler,
      as it calls kvmppc_virtmode_h_enter instead of creating the HPTE itself.
      
      Since we are now potentially updating entries in the slot_phys[]
      arrays from multiple vcpu threads, we now have a spinlock protecting
      those updates to ensure that we don't lose track of any references
      to pages.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      c77162de
    • P
      KVM: PPC: Make the H_ENTER hcall more reliable · 075295dd
      Paul Mackerras 提交于
      At present, our implementation of H_ENTER only makes one try at locking
      each slot that it looks at, and doesn't even retry the ldarx/stdcx.
      atomic update sequence that it uses to attempt to lock the slot.  Thus
      it can return the H_PTEG_FULL error unnecessarily, particularly when
      the H_EXACT flag is set, meaning that the caller wants a specific PTEG
      slot.
      
      This improves the situation by making a second pass when no free HPTE
      slot is found, where we spin until we succeed in locking each slot in
      turn and then check whether it is full while we hold the lock.  If the
      second pass fails, then we return H_PTEG_FULL.
      
      This also moves lock_hpte to a header file (since later commits in this
      series will need to use it from other source files) and renames it to
      try_lock_hpte, which is a somewhat less misleading name.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      075295dd
    • P
      KVM: PPC: Add an interface for pinning guest pages in Book3s HV guests · 93e60249
      Paul Mackerras 提交于
      This adds two new functions, kvmppc_pin_guest_page() and
      kvmppc_unpin_guest_page(), and uses them to pin the guest pages where
      the guest has registered areas of memory for the hypervisor to update,
      (i.e. the per-cpu virtual processor areas, SLB shadow buffers and
      dispatch trace logs) and then unpin them when they are no longer
      required.
      
      Although it is not strictly necessary to pin the pages at this point,
      since all guest pages are already pinned, later commits in this series
      will mean that guest pages aren't all pinned.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      93e60249
    • P
      KVM: PPC: Keep page physical addresses in per-slot arrays · b2b2f165
      Paul Mackerras 提交于
      This allocates an array for each memory slot that is added to store
      the physical addresses of the pages in the slot.  This array is
      vmalloc'd and accessed in kvmppc_h_enter using real_vmalloc_addr().
      This allows us to remove the ram_pginfo field from the kvm_arch
      struct, and removes the 64GB guest RAM limit that we had.
      
      We use the low-order bits of the array entries to store a flag
      indicating that we have done get_page on the corresponding page,
      and therefore need to call put_page when we are finished with the
      page.  Currently this is set for all pages except those in our
      special RMO regions.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      b2b2f165
    • P
      KVM: PPC: Keep a record of HV guest view of hashed page table entries · 8936dda4
      Paul Mackerras 提交于
      This adds an array that parallels the guest hashed page table (HPT),
      that is, it has one entry per HPTE, used to store the guest's view
      of the second doubleword of the corresponding HPTE.  The first
      doubleword in the HPTE is the same as the guest's idea of it, so we
      don't need to store a copy, but the second doubleword in the HPTE has
      the real page number rather than the guest's logical page number.
      This allows us to remove the back_translate() and reverse_xlate()
      functions.
      
      This "reverse mapping" array is vmalloc'd, meaning that to access it
      in real mode we have to walk the kernel's page tables explicitly.
      That is done by the new real_vmalloc_addr() function.  (In fact this
      returns an address in the linear mapping, so the result is usable
      both in real mode and in virtual mode.)
      
      There are also some minor cleanups here: moving the definitions of
      HPT_ORDER etc. to a header file and defining HPT_NPTE for HPT_NPTEG << 3.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      8936dda4
    • S
      KVM: PPC: e500: use hardware hint when loading TLB0 entries · 57013524
      Scott Wood 提交于
      The hardware maintains a per-set next victim hint.  Using this
      reduces conflicts, especially on e500v2 where a single guest
      TLB entry is mapped to two shadow TLB entries (user and kernel).
      We want those two entries to go to different TLB ways.
      
      sesel is now only used for TLB1.
      Reported-by: NLiu Yu <yu.liu@freescale.com>
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      57013524
    • A
      KVM: PPC: Use get/set for to_svcpu to help preemption · 468a12c2
      Alexander Graf 提交于
      When running the 64-bit Book3s PR code without CONFIG_PREEMPT_NONE, we were
      doing a few things wrong, most notably access to PACA fields without making
      sure that the pointers stay stable accross the access (preempt_disable()).
      
      This patch moves to_svcpu towards a get/put model which allows us to disable
      preemption while accessing the shadow vcpu fields in the PACA. That way we
      can run preemptible and everyone's happy!
      Reported-by: NJörg Sommer <joerg@alea.gnuu.de>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      468a12c2
    • S
      KVM: PPC: booke: Improve timer register emulation · dfd4d47e
      Scott Wood 提交于
      Decrementers are now properly driven by TCR/TSR, and the guest
      has full read/write access to these registers.
      
      The decrementer keeps ticking (and setting the TSR bit) regardless of
      whether the interrupts are enabled with TCR.
      
      The decrementer stops at zero, rather than going negative.
      
      Decrementers (and FITs, once implemented) are delivered as
      level-triggered interrupts -- dequeued when the TSR bit is cleared, not
      on delivery.
      Signed-off-by: NLiu Yu <yu.liu@freescale.com>
      [scottwood@freescale.com: significant changes]
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      dfd4d47e
    • S
      KVM: PPC: Paravirtualize SPRG4-7, ESR, PIR, MASn · b5904972
      Scott Wood 提交于
      This allows additional registers to be accessed by the guest
      in PR-mode KVM without trapping.
      
      SPRG4-7 are readable from userspace.  On booke, KVM will sync
      these registers when it enters the guest, so that accesses from
      guest userspace will work.  The guest kernel, OTOH, must consistently
      use either the real registers or the shared area between exits.  This
      also applies to the already-paravirted SPRG3.
      
      On non-booke, it's not clear to what extent SPRG4-7 are supported
      (they're not architected for book3s, but exist on at least some classic
      chips).  They are copied in the get/set regs ioctls, but I do not see any
      non-booke emulation.  I also do not see any syncing with real registers
      (in PR-mode) including the user-readable SPRG3.  This patch should not
      make that situation any worse.
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      b5904972
    • S
      KVM: PPC: Rename deliver_interrupts to prepare_to_enter · 7e28e60e
      Scott Wood 提交于
      This function also updates paravirt int_pending, so rename it
      to be more obvious that this is a collection of checks run prior
      to (re)entering a guest.
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      7e28e60e
    • S
      KVM: PPC: e500: MMU API · dc83b8bc
      Scott Wood 提交于
      This implements a shared-memory API for giving host userspace access to
      the guest's TLB.
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      dc83b8bc
    • S
      KVM: PPC: e500: clear up confusion between host and guest entries · 0164c0f0
      Scott Wood 提交于
      Split out the portions of tlbe_priv that should be associated with host
      entries into tlbe_ref.  Base victim selection on the number of hardware
      entries, not guest entries.
      
      For TLB1, where one guest entry can be mapped by multiple host entries,
      we use the host tlbe_ref for tracking page references.  For the guest
      TLB0 entries, we still track it with gtlb_priv, to avoid having to
      retranslate if the entry is evicted from the host TLB but not the
      guest TLB.
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      0164c0f0
    • C
      KVM: provide synchronous registers in kvm_run · b9e5dc8d
      Christian Borntraeger 提交于
      On some cpus the overhead for virtualization instructions is in the same
      range as a system call. Having to call multiple ioctls to get set registers
      will make certain userspace handled exits more expensive than necessary.
      Lets provide a section in kvm_run that works as a shared save area
      for guest registers.
      We also provide two 64bit flags fields (architecture specific), that will
      specify
      1. which parts of these fields are valid.
      2. which registers were modified by userspace
      
      Each bit for these flag fields will define a group of registers (like
      general purpose) or a single register.
      Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      b9e5dc8d
  2. 14 2月, 2012 2 次提交
  3. 18 1月, 2012 1 次提交
    • E
      Audit: push audit success and retcode into arch ptrace.h · d7e7528b
      Eric Paris 提交于
      The audit system previously expected arches calling to audit_syscall_exit to
      supply as arguments if the syscall was a success and what the return code was.
      Audit also provides a helper AUDITSC_RESULT which was supposed to simplify things
      by converting from negative retcodes to an audit internal magic value stating
      success or failure.  This helper was wrong and could indicate that a valid
      pointer returned to userspace was a failed syscall.  The fix is to fix the
      layering foolishness.  We now pass audit_syscall_exit a struct pt_reg and it
      in turns calls back into arch code to collect the return value and to
      determine if the syscall was a success or failure.  We also define a generic
      is_syscall_success() macro which determines success/failure based on if the
      value is < -MAX_ERRNO.  This works for arches like x86 which do not use a
      separate mechanism to indicate syscall failure.
      
      We make both the is_syscall_success() and regs_return_value() static inlines
      instead of macros.  The reason is because the audit function must take a void*
      for the regs.  (uml calls theirs struct uml_pt_regs instead of just struct
      pt_regs so audit_syscall_exit can't take a struct pt_regs).  Since the audit
      function takes a void* we need to use static inlines to cast it back to the
      arch correct structure to dereference it.
      
      The other major change is that on some arches, like ia64, MIPS and ppc, we
      change regs_return_value() to give us the negative value on syscall failure.
      THE only other user of this macro, kretprobe_example.c, won't notice and it
      makes the value signed consistently for the audit functions across all archs.
      
      In arch/sh/kernel/ptrace_64.c I see that we were using regs[9] in the old
      audit code as the return value.  But the ptrace_64.h code defined the macro
      regs_return_value() as regs[3].  I have no idea which one is correct, but this
      patch now uses the regs_return_value() function, so it now uses regs[3].
      
      For powerpc we previously used regs->result but now use the
      regs_return_value() function which uses regs->gprs[3].  regs->gprs[3] is
      always positive so the regs_return_value(), much like ia64 makes it negative
      before calling the audit code when appropriate.
      Signed-off-by: NEric Paris <eparis@redhat.com>
      Acked-by: H. Peter Anvin <hpa@zytor.com> [for x86 portion]
      Acked-by: Tony Luck <tony.luck@intel.com> [for ia64]
      Acked-by: Richard Weinberger <richard@nod.at> [for uml]
      Acked-by: David S. Miller <davem@davemloft.net> [for sparc]
      Acked-by: Ralf Baechle <ralf@linux-mips.org> [for mips]
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [for ppc]
      d7e7528b
  4. 07 1月, 2012 2 次提交
  5. 05 1月, 2012 1 次提交
  6. 04 1月, 2012 2 次提交
  7. 30 12月, 2011 1 次提交
    • A
      procfs: do not confuse jiffies with cputime64_t · 34845636
      Andreas Schwab 提交于
      Commit 2a95ea6c ("procfs: do not overflow get_{idle,iowait}_time
      for nohz") did not take into account that one some architectures jiffies
      and cputime use different units.
      
      This causes get_idle_time() to return numbers in the wrong units, making
      the idle time fields in /proc/stat wrong.
      
      Instead of converting the usec value returned by
      get_cpu_{idle,iowait}_time_us to units of jiffies, use the new function
      usecs_to_cputime64 to convert it to the correct unit of cputime64_t.
      Signed-off-by: NAndreas Schwab <schwab@linux-m68k.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: "Artem S. Tashkinov" <t.artem@mailcity.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      34845636
  8. 27 12月, 2011 1 次提交
  9. 26 12月, 2011 1 次提交
  10. 22 12月, 2011 1 次提交
    • K
      cpu: convert 'cpu' and 'machinecheck' sysdev_class to a regular subsystem · 8a25a2fd
      Kay Sievers 提交于
      This moves the 'cpu sysdev_class' over to a regular 'cpu' subsystem
      and converts the devices to regular devices. The sysdev drivers are
      implemented as subsystem interfaces now.
      
      After all sysdev classes are ported to regular driver core entities, the
      sysdev implementation will be entirely removed from the kernel.
      
      Userspace relies on events and generic sysfs subsystem infrastructure
      from sysdev devices, which are made available with this conversion.
      
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Borislav Petkov <bp@amd64.org>
      Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Zhang Rui <rui.zhang@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: NKay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>
      8a25a2fd
  11. 20 12月, 2011 2 次提交
    • S
      powerpc: Define virtual-physical translations for RELOCATABLE · 368ff8f1
      Suzuki Poulose 提交于
      We find the runtime address of _stext and relocate ourselves based
      on the following calculation.
      
      	virtual_base = ALIGN(KERNELBASE,KERNEL_TLB_PIN_SIZE) +
      			MODULO(_stext.run,KERNEL_TLB_PIN_SIZE)
      
      relocate() is called with the Effective Virtual Base Address (as
      shown below)
      
                  | Phys. Addr| Virt. Addr |
      Page        |------------------------|
      Boundary    |           |            |
                  |           |            |
                  |           |            |
      Kernel Load |___________|_ __ _ _ _ _|<- Effective
      Addr(_stext)|           |      ^     |Virt. Base Addr
                  |           |      |     |
                  |           |      |     |
                  |           |reloc_offset|
                  |           |      |     |
                  |           |      |     |
                  |           |______v_____|<-(KERNELBASE)%TLB_SIZE
                  |           |            |
                  |           |            |
                  |           |            |
      Page        |-----------|------------|
      Boundary    |           |            |
      
      On BookE, we need __va() & __pa() early in the boot process to access
      the device tree.
      
      Currently this has been defined as :
      
      #define __va(x) ((void *)(unsigned long)((phys_addr_t)(x) -
      						PHYSICAL_START + KERNELBASE)
      where:
       PHYSICAL_START is kernstart_addr - a variable updated at runtime.
       KERNELBASE	is the compile time Virtual base address of kernel.
      
      This won't work for us, as kernstart_addr is dynamic and will yield different
      results for __va()/__pa() for same mapping.
      
      e.g.,
      
      Let the kernel be loaded at 64MB and KERNELBASE be 0xc0000000 (same as
      PAGE_OFFSET).
      
      In this case, we would be mapping 0 to 0xc0000000, and kernstart_addr = 64M
      
      Now __va(1MB) = (0x100000) - (0x4000000) + 0xc0000000
      		= 0xbc100000 , which is wrong.
      
      it should be : 0xc0000000 + 0x100000 = 0xc0100000
      
      On platforms which support AMP, like PPC_47x (based on 44x), the kernel
      could be loaded at highmem. Hence we cannot always depend on the compile
      time constants for mapping.
      
      Here are the possible solutions:
      
      1) Update kernstart_addr(PHSYICAL_START) to match the Physical address of
      compile time KERNELBASE value, instead of the actual Physical_Address(_stext).
      
      The disadvantage is that we may break other users of PHYSICAL_START. They
      could be replaced with __pa(_stext).
      
      2) Redefine __va() & __pa() with relocation offset
      
      #ifdef	CONFIG_RELOCATABLE_PPC32
      #define __va(x) ((void *)(unsigned long)((phys_addr_t)(x) - PHYSICAL_START + (KERNELBASE + RELOC_OFFSET)))
      #define __pa(x) ((unsigned long)(x) + PHYSICAL_START - (KERNELBASE + RELOC_OFFSET))
      #endif
      
      where, RELOC_OFFSET could be
      
        a) A variable, say relocation_offset (like kernstart_addr), updated
           at boot time. This impacts performance, as we have to load an additional
           variable from memory.
      
      		OR
      
        b) #define RELOC_OFFSET ((PHYSICAL_START & PPC_PIN_SIZE_OFFSET_MASK) - \
                            (KERNELBASE & PPC_PIN_SIZE_OFFSET_MASK))
      
         This introduces more calculations for doing the translation.
      
      3) Redefine __va() & __pa() with a new variable
      
      i.e,
      
      #define __va(x) ((void *)(unsigned long)((phys_addr_t)(x) + VIRT_PHYS_OFFSET))
      
      where VIRT_PHYS_OFFSET :
      
      #ifdef CONFIG_RELOCATABLE_PPC32
      #define VIRT_PHYS_OFFSET virt_phys_offset
      #else
      #define VIRT_PHYS_OFFSET (KERNELBASE - PHYSICAL_START)
      #endif /* CONFIG_RELOCATABLE_PPC32 */
      
      where virt_phy_offset is updated at runtime to :
      
      	Effective KERNELBASE - kernstart_addr.
      
      Taking our example, above:
      
      virt_phys_offset = effective_kernelstart_vaddr - kernstart_addr
      		 = 0xc0400000 - 0x400000
      		 = 0xc0000000
      	and
      
      	__va(0x100000) = 0xc0000000 + 0x100000 = 0xc0100000
      	 which is what we want.
      
      I have implemented (3) in the following patch which has same cost of
      operation as the existing one.
      
      I have tested the patches on 440x platforms only. However this should
      work fine for PPC_47x also, as we only depend on the runtime address
      and the current TLB XLAT entry for the startup code, which is available
      in r25. I don't have access to a 47x board yet. So, it would be great if
      somebody could test this on 47x.
      Signed-off-by: NSuzuki K. Poulose <suzuki@in.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Kumar Gala <galak@kernel.crashing.org>
      Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
      Signed-off-by: NJosh Boyer <jwboyer@gmail.com>
      368ff8f1
    • S
      powerpc: Rename mapping based RELOCATABLE to DYNAMIC_MEMSTART for BookE · 0f890c8d
      Suzuki Poulose 提交于
      The current implementation of CONFIG_RELOCATABLE in BookE is based
      on mapping the page aligned kernel load address to KERNELBASE. This
      approach however is not enough for platforms, where the TLB page size
      is large (e.g, 256M on 44x). So we are renaming the RELOCATABLE used
      currently in BookE to DYNAMIC_MEMSTART to reflect the actual method.
      
      The CONFIG_RELOCATABLE for PPC32(BookE) based on processing of the
      dynamic relocations will be introduced in the later in the patch series.
      
      This change would allow the use of the old method of RELOCATABLE for
      platforms which can afford to enforce the page alignment (platforms with
      smaller TLB size).
      
      Changes since v3:
      
      * Introduced a new config, NONSTATIC_KERNEL, to denote a kernel which is
        either a RELOCATABLE or DYNAMIC_MEMSTART(Suggested by: Josh Boyer)
      Suggested-by: NScott Wood <scottwood@freescale.com>
      Tested-by: NScott Wood <scottwood@freescale.com>
      Signed-off-by: NSuzuki K. Poulose <suzuki@in.ibm.com>
      Cc: Scott Wood <scottwood@freescale.com>
      Cc: Kumar Gala <galak@kernel.crashing.org>
      Cc: Josh Boyer <jwboyer@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: linux ppc dev <linuxppc-dev@lists.ozlabs.org>
      Signed-off-by: NJosh Boyer <jwboyer@gmail.com>
      0f890c8d