1. 22 May 2015 (3 commits)
    • powerpc/powernv: Introduce sysfs control for fastsleep workaround behavior · 5703d2f4
      Authored by Shreyas B. Prabhu
      Fastsleep is one of the idle states that the cpuidle subsystem currently
      uses on power8 machines. In this state the L2 cache is brought down to a
      threshold voltage, so while the core is in fastsleep the communication
      between L2 and L3 needs to be fenced. However, there is a bug in the
      current power8 chips surrounding this fencing.
      
      OPAL provides a workaround which precludes the possibility of hitting
      this bug. But running with this workaround applied causes a checkstop
      if any correctable error in the L2 cache directory is detected. Hence
      OPAL also provides a way to undo the workaround.
      
      In the existing implementation, the workaround is applied by the last
      thread of the core entering fastsleep and undone by the first thread
      waking up. But this has a performance cost: these OPAL calls account for
      roughly 4000 cycles every time the core enters or wakes up from
      fastsleep.
      
      This patch introduces a sysfs attribute (fastsleep_workaround_applyonce)
      to choose the behavior of this workaround.
      
      By default, fastsleep_workaround_applyonce = 0: the workaround is
      applied/undone every time the core enters/exits fastsleep.
      
      With fastsleep_workaround_applyonce = 1, the workaround is applied once
      on all the cores and never undone. This can be triggered by
      echo 1 > /sys/devices/system/cpu/fastsleep_workaround_applyonce
      
      For simplicity this attribute can be modified only once: once
      fastsleep_workaround_applyonce is changed to 1, it cannot be reverted
      to the default state.
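      
      A minimal userspace sketch of using this interface (the sysfs path is
      from the changelog; everything else here is illustrative):
      
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>
        
        int main(void)
        {
            const char *path =
                "/sys/devices/system/cpu/fastsleep_workaround_applyonce";
            int fd = open(path, O_WRONLY);
        
            if (fd < 0) {
                perror("open");
                return 1;
            }
            /* apply the workaround once on all cores; per the write-once
             * semantics above, this cannot be written back to 0 */
            if (write(fd, "1", 1) != 1)
                perror("write");
            close(fd);
            return 0;
        }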
      Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      5703d2f4
    • powerpc/powernv: Move cpuidle related code from setup.c to new file · d405a98c
      Authored by Shreyas B. Prabhu
      This is a cleanup patch that doesn't change any functionality; it moves
      all cpuidle-related code from setup.c to a new file.
      Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
      [mpe: Fix the SMP=n build by including asm/smp.h in idle.c]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      d405a98c
    • powerpc: Fix cpu_online_cores_map to return only online threads mask · e602ffb2
      Authored by Shreyas B. Prabhu
      Currently, cpu_online_cores_map returns a mask which, for every core
      with at least one online thread, has the bit for thread 0 of the core
      set to 1 and the bits for all other threads of the core set to 0. But
      thread 0 of the core may not itself always be online. In such cases, if
      the returned mask is used for IPIs, IPIs will be skipped on cores where
      the first thread is offline, because the IPI code refuses to send IPIs
      to offline threads.
      
      Fix this by setting the bit of the first online thread in the core
      instead. This is done by fixing the underlying function
      cpu_thread_mask_to_cores.
      
      The result has the property that for every core with online threads
      there is exactly one bit set in the returned map, and every bit that is
      set in the returned map corresponds to an online thread.
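      
      A simplified model of the fixed behaviour (plain C over a 64-bit mask,
      not the kernel's cpumask code; THREADS_PER_CORE and the helper name are
      illustrative):
      
        #include <stdint.h>
        #include <stdio.h>
        
        #define THREADS_PER_CORE 8
        
        /* For each core with any online thread, set the bit of the first
         * online thread (not unconditionally thread 0). */
        static uint64_t online_cores_map(uint64_t online_threads)
        {
            uint64_t map = 0;
        
            for (int base = 0; base < 64; base += THREADS_PER_CORE) {
                for (int t = 0; t < THREADS_PER_CORE; t++) {
                    if (online_threads & (1ULL << (base + t))) {
                        map |= 1ULL << (base + t);
                        break;
                    }
                }
            }
            return map;
        }
        
        int main(void)
        {
            /* core 0 has thread 0 offline, thread 1 online: bit 1 is set,
             * so an IPI sent using this map reaches an online thread */
            printf("%#llx\n", (unsigned long long)online_cores_map(0x2ULL));
            return 0;
        }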
      Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
      [ Changelog from Michael Ellerman <mpe@ellerman.id.au> ]
      Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      e602ffb2
  2. 20 May 2015 (1 commit)
    • powerpc: Enable sys_kcmp() for CRIU · 7978f76c
      Authored by Laurent Dufour
      Commit 8170a83f ("powerpc: Wireup the kcmp syscall to sys_ni") disabled
      the kcmp syscall for powerpc. This was done because of the use of
      unsigned long parameters, which may require a dedicated wrapper to
      handle 32-bit processes on top of a 64-bit kernel. However, in the
      kcmp() case the two unsigned long parameters are currently only used to
      carry file descriptors from user space to the kernel. Since such a
      parameter is passed in a register, and a file descriptor doesn't need
      to be extended, there is, today, no need for a wrapper.
      
      If there is ever a need to pass an address in or out of this system
      call, a wrapper may be required; it can be added at that point.
      
      As that is not the case today, it is safe to enable kcmp() on powerpc.
      
      Tested (by Laurent) on 64-bit, 32-bit, and 32-bit userspace on 64-bit
      kernel using tools/testing/selftests/kcmp [mpe].
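      
      A hedged userspace sketch of calling kcmp() once it is wired up
      (SYS_kcmp and KCMP_FILE are the standard uapi names; the descriptor
      choice is illustrative):
      
        #include <stdio.h>
        #include <unistd.h>
        #include <sys/syscall.h>
        #include <linux/kcmp.h>
        
        int main(void)
        {
            pid_t pid = getpid();
            int fd1 = 1;            /* stdout */
            int fd2 = dup(fd1);     /* shares the same open file description */
        
            /* both fds travel as unsigned long syscall arguments */
            long ret = syscall(SYS_kcmp, pid, pid, KCMP_FILE, fd1, fd2);
            printf("kcmp(KCMP_FILE, %d, %d) = %ld (0 means equal)\n",
                   fd1, fd2, ret);
            return 0;
        }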
      Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      7978f76c
  3. 18 May 2015 (1 commit)
  4. 13 May 2015 (4 commits)
  5. 12 May 2015 (2 commits)
  6. 11 May 2015 (11 commits)
  7. 01 May 2015 (4 commits)
    • powerpc/powernv: Restore non-volatile CRs after nap · 0aab3747
      Authored by Sam Bobroff
      Patches 7cba160a ("powernv/cpuidle: Redesign idle states management")
      and 77b54e9f ("powernv/powerpc: Add winkle support for offline cpus")
      use non-volatile condition registers (cr2, cr3 and cr4) early in the
      system reset interrupt handler (system_reset_pSeries()), before it has
      been determined whether state loss has occurred. If state loss has not
      occurred, control returns via the power7_wakeup_noloss() path, which
      does not restore those condition registers, leaving them corrupted.
      
      Fix this by restoring the condition registers in the power7_wakeup_noloss()
      case.
      
      This is apparent when running a KVM guest on hardware that does not
      support winkle or sleep and the guest makes use of secondary threads. In
      practice this means Power7 machines, though some early unreleased Power8
      machines may also be susceptible.
      
      The secondary CPUs are taken offline before the guest is started and
      they call pnv_smp_cpu_kill_self(). This checks support for sleep
      states (in this case there is no support) and power7_nap() is called.
      
      When the CPU is woken, power7_nap() returns and, because the CPU is
      still offline, the main while loop executes again. The sleep-states
      support test is executed again, but because the tested values cannot
      have changed, the compiler has optimized the test away and instead we
      rely on the result of the first test, which was left in cr3 and/or cr4.
      With that result overwritten, the wrong branch is taken and
      power7_winkle() is called on a CPU that does not support it, leading to
      a stall.
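      
      A hedged C illustration of the hazard (not the kernel source; all names
      are stand-ins): because the tested flags are loop-invariant, the
      compiler may evaluate the test once and keep the result in a
      non-volatile condition register such as cr3/cr4 across the nap.
      
        extern int cpu_restarted(int cpu);   /* stand-in loop condition */
        extern void do_sleep(void);
        extern void do_nap(void);
        
        void offline_idle_loop(int cpu, unsigned int idle_states)
        {
            while (!cpu_restarted(cpu)) {
                /* idle_states is local and never changes, so the test may
                 * be emitted once; after wakeup the branch below relies on
                 * whatever the wakeup path left in the condition register */
                if (idle_states & 0x2)
                    do_sleep();
                else
                    do_nap();
            }
        }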
      
      Fixes: 7cba160a ("powernv/cpuidle: Redesign idle states management")
      Fixes: 77b54e9f ("powernv/powerpc: Add winkle support for offline cpus")
      [mpe: Massage change log a bit more]
      Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      0aab3747
    • powerpc/eeh: Delay probing EEH device during hotplug · d91dafc0
      Authored by Gavin Shan
      Commit 1c509148b ("powerpc/eeh: Do probe on pci_dn") probes EEH
      devices at an early stage, which is reasonable for the pSeries
      platform. However, it's wrong for the PowerNV platform, because in the
      hotplug case the PE# isn't determined until the resources (IO and MMIO)
      are assigned to the PE. So we have to delay probing EEH devices on the
      PowerNV platform until the PE# is assigned.
      
      Fixes: ff57b454 ("powerpc/eeh: Do probe on pci_dn")
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      d91dafc0
    • powerpc/eeh: Fix race condition in pcibios_set_pcie_reset_state() · 1ae79b78
      Authored by Gavin Shan
      When asserting reset in pcibios_set_pcie_reset_state(), the PE is
      forced into the (hardware) frozen state so that unexpected PCI
      transactions (except PCI config reads/writes) are automatically dropped
      by hardware during reset, which would otherwise cause recursive EEH
      errors. However, the (software) frozen state EEH_PE_ISOLATED is missed.
      When users get 0xFF from a PCI config or MMIO read, EEH_PE_ISOLATED is
      set in the PE state retrieval backend. Unfortunately, nobody (neither
      the reset handler nor the EEH recovery functionality in the host) will
      clear EEH_PE_ISOLATED when the PE has been passed through to a guest.
      
      The patch sets and clears EEH_PE_ISOLATED appropriately during reset in
      pcibios_set_pcie_reset_state() to fix the issue.
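      
      A hedged sketch of the shape of the fix (eeh_pe_state_mark/clear and
      the pcie_*_reset states are existing kernel names; the exact patch
      differs):
      
        switch (state) {
        case pcie_hot_reset:
            /* software-isolate the PE while reset is asserted */
            eeh_pe_state_mark(pe, EEH_PE_ISOLATED);
            eeh_ops->reset(pe, EEH_RESET_HOT);
            break;
        case pcie_deassert_reset:
            eeh_ops->reset(pe, EEH_RESET_DEACTIVATE);
            /* reset complete: clear the software frozen state again */
            eeh_pe_state_clear(pe, EEH_PE_ISOLATED);
            break;
        default:
            break;
        }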
      
      Fixes: 28158cd1 ("Enhance pcibios_set_pcie_reset_state()")
      Reported-by: Carol L. Soto <clsoto@us.ibm.com>
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Tested-by: Carol L. Soto <clsoto@us.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      1ae79b78
    • powerpc/pseries: Correct cpu affinity for dlpar added cpus · f32393c9
      Authored by Nathan Fontenot
      The incorrect ordering of operations during cpu dlpar add results in
      invalid affinity for the cpu being added: the ibm,associativity
      property in the device tree is populated with all zeroes for the added
      cpu, which results in invalid affinity mappings, and all cpus appear to
      belong to node 0.
      
      This occurs because rtas configure-connector is called prior to making
      the rtas set-indicator calls. Phyp does not assign affinity information
      for a cpu until the rtas set-indicator calls are made to set the
      isolation and allocation state.
      
      Correct the order of operations to make the rtas set-indicator
      calls (done in dlpar_acquire_drc) before calling rtas configure-connector.
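      
      A hedged sketch of the corrected ordering in the dlpar add path
      (dlpar_acquire_drc is named in the changelog; the other names and
      signatures are approximations of the pseries DLPAR code):
      
        /* set-indicator calls first, so Phyp assigns affinity ... */
        rc = dlpar_acquire_drc(drc_index);
        if (rc)
            return -EINVAL;
        
        /* ... before configure-connector reads the device tree fragment */
        dn = dlpar_configure_connector(cpu_to_be32(drc_index), parent);
        if (!dn) {
            dlpar_release_drc(drc_index);
            return -EINVAL;
        }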
      
      Fixes: 1a8061c4 ("powerpc/pseries: Add kernel based CPU DLPAR handling")
      Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      f32393c9
  8. 30 April 2015 (1 commit)
    • Revert "powerpc/tm: Abort syscalls in active transactions" · 68fc378c
      Authored by Michael Ellerman
      This reverts commit feba4036.
      
      Although the principle of this change is good, the implementation has a
      few issues.
      
      Firstly we can sometimes fail to abort a syscall because r12 may have
      been clobbered by C code if we went down the virtual CPU accounting
      path, or if syscall tracing was enabled.
      
      Secondly we have decided that it is safer to abort the syscall even
      earlier in the syscall entry path, so that we avoid the syscall tracing
      path when we are transactional.
      
      So that we have time to thoroughly test those changes we have decided to
      revert this for this merge window and will merge the fixed version in
      the next window.
      
      NB. Rather than reverting the selftest we just drop tm-syscall from
      TEST_PROGS so that it's not run by default.
      
      Fixes: feba4036 ("powerpc/tm: Abort syscalls in active transactions")
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      68fc378c
  9. 29 April 2015 (2 commits)
  10. 23 April 2015 (1 commit)
  11. 21 April 2015 (10 commits)
    • KVM: PPC: Book3S HV: Use msgsnd for signalling threads on POWER8 · 66feed61
      Authored by Paul Mackerras
      This uses msgsnd where possible for signalling other threads within
      the same core on POWER8 systems, rather than IPIs through the XICS
      interrupt controller.  This includes waking secondary threads to run
      the guest, the interrupts generated by the virtual XICS, and the
      interrupts to bring the other threads out of the guest when exiting.
      
      Aggregated statistics from debugfs across vcpus for a guest with 32
      vcpus, 8 threads/vcore, running on a POWER8, show this before the
      change:
      
       rm_entry:     3387.6ns (228 - 86600, 1008969 samples)
        rm_exit:     4561.5ns (12 - 3477452, 1009402 samples)
        rm_intr:     1660.0ns (12 - 553050, 3600051 samples)
      
      and this after the change:
      
       rm_entry:     3060.1ns (212 - 65138, 953873 samples)
        rm_exit:     4244.1ns (12 - 9693408, 954331 samples)
        rm_intr:     1342.3ns (12 - 1104718, 3405326 samples)
      
      for a test of booting Fedora 20 big-endian to the login prompt.
      
      The time taken for a H_PROD hcall (which is handled in the host
      kernel) went down from about 35 microseconds to about 16 microseconds
      with this change.
      
      The noinline added to kvmppc_run_core turned out to be necessary for
      good performance, at least with gcc 4.9.2 as packaged with Fedora 21
      and a little-endian POWER8 host.
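      
      A hedged sketch of the dispatch logic (modelled on the changelog's
      description; the feature and topology helpers are existing powerpc
      names, while the function itself and the XICS fallback are
      illustrative placeholders):
      
        void rm_send_ipi(int cpu)
        {
            /* same-core threads on POWER8 can be woken with a doorbell */
            if (cpu_has_feature(CPU_FTR_ARCH_207S) &&
                cpu_first_thread_sibling(cpu) ==
                cpu_first_thread_sibling(raw_smp_processor_id())) {
                unsigned long msg = PPC_DBELL_TYPE(PPC_DBELL_SERVER) |
                                    cpu_thread_in_core(cpu);
                smp_mb();
                __asm__ __volatile__ (PPC_MSGSND(%0) : : "r" (msg));
                return;
            }
            /* otherwise fall back to an IPI via the XICS controller */
            xics_cause_ipi(cpu);    /* placeholder name */
        }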
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      66feed61
    • KVM: PPC: Book3S HV: Translate kvmhv_commence_exit to C · eddb60fb
      Authored by Paul Mackerras
      This replaces the assembler code for kvmhv_commence_exit() with C code
      in book3s_hv_builtin.c.  It also moves the IPI sending code that was
      in book3s_hv_rm_xics.c into a new kvmhv_rm_send_ipi() function so it
      can be used by kvmhv_commence_exit() as well as icp_rm_set_vcpu_irq().
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      eddb60fb
    • KVM: PPC: Book3S HV: Streamline guest entry and exit · 6af27c84
      Authored by Paul Mackerras
      On entry to the guest, secondary threads now wait for the primary to
      switch the MMU after loading up most of their state, rather than before.
      This means that the secondary threads get into the guest sooner, in the
      common case where the secondary threads get to kvmppc_hv_entry before
      the primary thread.
      
      On exit, the first thread out increments the exit count and interrupts
      the other threads (to get them out of the guest) before saving most
      of its state, rather than after.  That means that the other threads
      exit sooner and means that the first thread doesn't spend so much
      time waiting for the other threads at the point where the MMU gets
      switched back to the host.
      
      This pulls out the code that increments the exit count and interrupts
      other threads into a separate function, kvmhv_commence_exit().
      This also makes sure that r12 and vcpu->arch.trap are set correctly
      in some corner cases.
      
      Statistics from /sys/kernel/debug/kvm/vm*/vcpu*/timings show the
      improvement.  Aggregating across vcpus for a guest with 32 vcpus,
      8 threads/vcore, running on a POWER8, gives this before the change:
      
       rm_entry:     avg 4537.3ns (222 - 48444, 1068878 samples)
        rm_exit:     avg 4787.6ns (152 - 165490, 1010717 samples)
        rm_intr:     avg 1673.6ns (12 - 341304, 3818691 samples)
      
      and this after the change:
      
       rm_entry:     avg 3427.7ns (232 - 68150, 1118921 samples)
        rm_exit:     avg 4716.0ns (12 - 150720, 1119477 samples)
        rm_intr:     avg 1614.8ns (12 - 522436, 3850432 samples)
      
      showing a substantial reduction in the time spent per guest entry in
      the real-mode guest entry code, and smaller reductions in the real
      mode guest exit and interrupt handling times.  (The test was to start
      the guest and boot Fedora 20 big-endian to the login prompt.)
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      6af27c84
    • KVM: PPC: Book3S HV: Use bitmap of active threads rather than count · 7d6c40da
      Authored by Paul Mackerras
      Currently, the entry_exit_count field in the kvmppc_vcore struct
      contains two 8-bit counts, one of the threads that have started entering
      the guest, and one of the threads that have started exiting the guest.
      This changes it to an entry_exit_map field which contains two bitmaps
      of 8 bits each.  The advantage of doing this is that it gives us a
      bitmap of which threads need to be signalled when exiting the guest.
      That means that we no longer need to use the trick of setting the
      HDEC to 0 to pull the other threads out of the guest, which led in
      some cases to a spurious HDEC interrupt on the next guest entry.
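      
      A small runnable model of the new layout (plain C; the packing follows
      the description above, and the macro names are illustrative):
      
        #include <stdint.h>
        #include <stdio.h>
        
        #define MAP_ENTRY(m)  ((m) & 0xffu)         /* threads that entered */
        #define MAP_EXIT(m)   (((m) >> 8) & 0xffu)  /* threads exiting */
        
        int main(void)
        {
            uint32_t map = 0;
        
            map |= 1u << 0;         /* thread 0 entered the guest */
            map |= 1u << 2;         /* thread 2 entered the guest */
            map |= 1u << (8 + 2);   /* thread 2 has started exiting */
        
            /* unlike a count, the map names the threads to signal */
            uint8_t need_signal = MAP_ENTRY(map) & ~MAP_EXIT(map);
            printf("signal threads: %#x\n", need_signal);   /* 0x1 */
            return 0;
        }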
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      7d6c40da
    • KVM: PPC: Book3S HV: Use decrementer to wake napping threads · fd6d53b1
      Authored by Paul Mackerras
      This arranges for threads that are napping due to their vcpu having
      ceded or due to not having a vcpu to wake up at the end of the guest's
      timeslice without having to be poked with an IPI.  We do that by
      arranging for the decrementer to contain a value no greater than the
      number of timebase ticks remaining until the end of the timeslice.
      In the case of a thread with no vcpu, this number is in the hypervisor
      decrementer already.  In the case of a ceded vcpu, we use the smaller
      of the HDEC value and the DEC value.
      
      Using the DEC like this when ceded means we need to save and restore
      the guest decrementer value around the nap.
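      
      A hedged sketch of the wakeup arithmetic (helper name and types are
      illustrative):
      
        #include <stdint.h>
        
        /* A napping thread programs its decrementer with at most the ticks
         * left in the timeslice; a ceded vcpu is additionally bounded by
         * its guest DEC so a guest timer can still wake it. */
        static uint64_t nap_wakeup_ticks(uint64_t hdec_ticks,
                                         uint64_t guest_dec_ticks,
                                         int has_vcpu)
        {
            if (!has_vcpu)
                return hdec_ticks;  /* HDEC already marks the timeslice end */
            return guest_dec_ticks < hdec_ticks ? guest_dec_ticks
                                                : hdec_ticks;
        }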
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      fd6d53b1
    • KVM: PPC: Book3S HV: Don't wake thread with no vcpu on guest IPI · ccc07772
      Authored by Paul Mackerras
      When running a multi-threaded guest and vcpu 0 in a virtual core
      is not running in the guest (i.e. it is busy elsewhere in the host),
      thread 0 of the physical core will switch the MMU to the guest and
      then go to nap mode in the code at kvm_do_nap.  If the guest sends
      an IPI to thread 0 using the msgsndp instruction, that will wake
      up thread 0 and cause all the threads in the guest to exit to the
      host unnecessarily.  To avoid the unnecessary exit, this arranges
      for the PECEDP bit to be cleared in this situation.  When napping
      due to a H_CEDE from the guest, we still set PECEDP so that the
      thread will wake up on an IPI sent using msgsndp.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      ccc07772
    • KVM: PPC: Book3S HV: Get rid of vcore nap_count and n_woken · 5d5b99cd
      Authored by Paul Mackerras
      We can tell when a secondary thread has finished running a guest by
      the fact that it clears its kvm_hstate.kvm_vcpu pointer, so there
      is no real need for the nap_count field in the kvmppc_vcore struct.
      This changes kvmppc_wait_for_nap to poll the kvm_hstate.kvm_vcpu
      pointers of the secondary threads rather than polling vc->nap_count.
      Besides reducing the size of the kvmppc_vcore struct by 8 bytes,
      this also means that we can tell which secondary threads have got
      stuck and thus print a more informative error message.
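      
      A hedged model of the new wait loop (types, bounds and names are
      illustrative, not the kernel code):
      
        #include <stdio.h>
        
        struct thread_state { void *kvm_vcpu; };  /* NULL once out of guest */
        
        static void wait_for_nap(struct thread_state *thr, int nthreads)
        {
            for (long spin = 0; spin < 1000000L; spin++) {
                int running = 0;
        
                for (int t = 1; t < nthreads; t++)
                    if (thr[t].kvm_vcpu)
                        running++;
                if (!running)
                    return;
            }
            /* polling pointers, not a count, tells us who is stuck */
            for (int t = 1; t < nthreads; t++)
                if (thr[t].kvm_vcpu)
                    fprintf(stderr, "secondary thread %d stuck\n", t);
        }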
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      5d5b99cd
    • KVM: PPC: Book3S HV: Move vcore preemption point up into kvmppc_run_vcpu · 25fedfca
      Authored by Paul Mackerras
      Rather than calling cond_resched() in kvmppc_run_core() before doing
      the post-processing for the vcpus that we have just run (that is,
      calling kvmppc_handle_exit_hv(), kvmppc_set_timer(), etc.), we now do
      that post-processing before calling cond_resched(), and that post-
      processing is moved out into its own function, post_guest_process().
      
      The reschedule point is now in kvmppc_run_vcpu() and we define a new
      vcore state, VCORE_PREEMPT, to indicate that the vcore's runner task is
      runnable but not running. (Doing the reschedule with the vcore in
      VCORE_INACTIVE state would be bad because there are potentially other
      vcpus waiting for the runner in kvmppc_wait_for_exec() which then
      wouldn't get woken up.)
      
      Also, we make use of the handy cond_resched_lock() function, which
      unlocks and relocks vc->lock for us around the reschedule.
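      
      The pattern, sketched (cond_resched_lock() is the standard kernel
      helper; the surrounding code is approximate):
      
        spin_lock(&vc->lock);
        vc->vcore_state = VCORE_PREEMPT;  /* runner runnable, not running */
        cond_resched_lock(&vc->lock);     /* drops, reschedules, retakes */
        vc->vcore_state = VCORE_INACTIVE;
        spin_unlock(&vc->lock);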
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      25fedfca
    • KVM: PPC: Book3S HV: Minor cleanups · 1f09c3ed
      Authored by Paul Mackerras
      * Remove unused kvmppc_vcore::n_busy field.
      * Remove setting of RMOR, since it was only used on PPC970 and the
        PPC970 KVM support has been removed.
      * Don't use r1 or r2 in setting the runlatch since they are
        conventionally reserved for other things; use r0 instead.
      * Streamline the code a little and remove the ext_interrupt_to_host
        label.
      * Add some comments about register usage.
      * hcall_try_real_mode doesn't need to be global, and can't be
        called from C code anyway.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      1f09c3ed
    • KVM: PPC: Book3S HV: Simplify handling of VCPUs that need a VPA update · d911f0be
      Authored by Paul Mackerras
      Previously, if kvmppc_run_core() was running a VCPU that needed a VPA
      update (i.e. one of its 3 virtual processor areas needed to be pinned
      in memory so the host real mode code can update it on guest entry and
      exit), we would drop the vcore lock and do the update there and then.
      Future changes will make it inconvenient to drop the lock, so instead
      we now remove it from the list of runnable VCPUs and wake up its
      VCPU task.  This will have the effect that the VCPU task will exit
      kvmppc_run_vcpu(), go around the do loop in kvmppc_vcpu_run_hv(), and
      re-enter kvmppc_run_vcpu(), whereupon it will do the necessary call
      to kvmppc_update_vpas() and then rejoin the vcore.
      
      The one complication is that the runner VCPU (whose VCPU task is the
      current task) might be one of the ones that gets removed from the
      runnable list.  In that case we just return from kvmppc_run_core()
      and let the code in kvmppc_run_vcpu() wake up another VCPU task to be
      the runner if necessary.
      
      This all means that the VCORE_STARTING state is no longer used, so we
      remove it.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      d911f0be