1. 30 April 2019, 1 commit
    • powerpc/64s: Reimplement book3s idle code in C · 10d91611
      Committed by Nicholas Piggin
      Reimplement the Book3S idle code in C, moving the POWER7/8/9
      implementation-specific HV idle code to the powernv platform code.
      
      Book3S assembly stubs are kept in common code and used only to save
      the stack frame and non-volatile GPRs before executing architected
      idle instructions, and to restore the stack, reload the GPRs and
      return to C after waking from idle.
      
      The complex logic dealing with threads and subcores, locking, SPRs,
      HMIs, timebase resync, etc., is all done in C which makes it more
      maintainable.
      
      This is not a strict translation to C code; there are some
      significant differences:
      
      - Idle wakeup no longer uses the ->cpu_restore call to reinit SPRs,
        but saves and restores them itself.
      
      - The optimisation where EC=ESL=0 idle modes did not have to save GPRs
        or change MSR is restored, because it's now simple to do. ESL=1
        sleeps that do not lose GPRs can use this optimization too.
      
      - KVM secondary entry and cede is now more of a call/return style
        rather than branchy. nap_state_lost is not required because KVM
        always returns via NVGPR restoring path.
      
      - KVM secondary wakeup from offline sequence is moved entirely into
        the offline wakeup, which avoids a hwsync in the normal idle wakeup
        path.
      
      Performance, measured with context-switch ping-pong on different
      threads or cores, is possibly improved by a small amount, 1-3%
      depending on stop state and on the core vs thread test for shallow
      states. For deep states it is in the noise compared with other
      latencies.
      
      KVM improvements:
      
      - Idle sleepers now always return to caller rather than branch out
        to KVM first.
      
      - This allows optimisations like very fast return to caller when no
        state has been lost.
      
      - KVM no longer requires nap_state_lost because it controls NVGPR
        save/restore itself on the way in and out.
      
      - The heavy idle wakeup KVM request check can be moved out of the
        normal host idle code and into the not-performance-critical offline
        code.
      
      - KVM nap code now returns from where it is called, which makes the
        flow a bit easier to follow.
      Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      [mpe: Squash the KVM changes in]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      10d91611
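      A minimal C sketch of the entry flow described in the commit above. The
      helper names (execute_stop, stop_and_save_nvgprs) and the structure of
      the code are illustrative assumptions, not the kernel's actual
      interfaces; only the EC/ESL split follows the commit message.

      #include <stdint.h>

      #define PSSCR_EC   (1ULL << 20)   /* PSSCR bit 43 (Exit Criterion) */
      #define PSSCR_ESL  (1ULL << 21)   /* PSSCR bit 42 (Enable State Loss) */

      /* assumed stand-ins for the stop instruction and the common asm stub */
      extern void execute_stop(uint64_t psscr);
      extern uint64_t stop_and_save_nvgprs(uint64_t psscr); /* saves stack + NVGPRs, restores on wake */

      static uint64_t idle_stop(uint64_t psscr)
      {
          if (!(psscr & (PSSCR_EC | PSSCR_ESL))) {
              /*
               * EC=ESL=0: wakeup resumes at the next instruction and no
               * architected state is lost, so no GPR save or MSR change is
               * needed; this is the cheap path the commit restores.
               */
              execute_stop(psscr);
              return 0;
          }

          /*
           * Deeper states: the common asm stub saves the stack frame and
           * non-volatile GPRs, executes the idle instruction, and returns
           * here after wakeup so SPR/timebase/subcore recovery is done in C.
           */
          return stop_and_save_nvgprs(psscr);
      }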
  2. 23 February 2019, 4 commits
  3. 21 February 2019, 1 commit
  4. 31 October 2018, 1 commit
  5. 14 October 2018, 2 commits
    • powerpc/64s/hash: Add a SLB preload cache · 5434ae74
      Committed by Nicholas Piggin
      When switching processes, currently all user SLBEs are cleared, and a
      few (exec_base, pc, and stack) are preloaded. In trivial testing with
      small apps, this tends to miss the heap and low 256MB segments, and it
      will also miss commonly accessed segments on large memory workloads.
      
      Add a simple round-robin preload cache that just inserts the last SLB
      miss into the head of the cache and preloads those entries at context
      switch time. Every 256 context switches, the oldest entry is removed
      from the cache to shrink it and require fewer slbmte instructions if
      entries are unused.
      
      Much more could go into this, including into the SLB entry reclaim
      side to track some LRU information etc, which would require a study of
      large memory workloads. But this is a simple thing we can do now that
      is an obvious win for common workloads.
      
      With the full series, process switching speed on the context_switch
      benchmark on POWER9/hash (with kernel speculation security measures
      disabled) increases from 140K/s to 178K/s (27%).
      
      POWER8 does not change much (within 1%); it's unclear why it does not
      see a big gain like POWER9.
      
      Booting to busybox init with 256MB segments has SLB misses go down
      from 945 to 69, and with 1T segments 900 to 21. These could almost all
      be eliminated by preloading a bit more carefully with ELF binary
      loading.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      5434ae74
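      A rough sketch of the round-robin preload cache idea from the commit
      above; the structure layout, the cache size and the helper names are
      assumptions for illustration, not the kernel's implementation.

      #include <stdint.h>

      #define SLB_PRELOAD_NR 16          /* assumed cache size */

      struct slb_preload_cache {
          uint32_t esid[SLB_PRELOAD_NR]; /* effective segment IDs that recently missed */
          unsigned int head;             /* next slot to overwrite (round-robin) */
          unsigned int nr;               /* number of valid entries */
      };

      extern void slb_insert(uint32_t esid);   /* assumed wrapper around slbmte */

      /* called from the SLB miss path: remember the segment that just faulted */
      static void preload_add(struct slb_preload_cache *c, uint32_t esid)
      {
          c->esid[c->head] = esid;
          c->head = (c->head + 1) % SLB_PRELOAD_NR;
          if (c->nr < SLB_PRELOAD_NR)
              c->nr++;
      }

      /* called at context switch: re-insert the cached segments for the new task */
      static void preload_switch(struct slb_preload_cache *c, unsigned long nr_switches)
      {
          unsigned int i;

          for (i = 0; i < c->nr; i++)
              slb_insert(c->esid[i]);

          /* every 256 switches, age out one entry so stale segments stop being preloaded */
          if (c->nr && (nr_switches % 256) == 0)
              c->nr--;
      }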
    • powerpc/64: Interrupts save PPR on stack rather than thread_struct · 4c2de74c
      Committed by Nicholas Piggin
      PPR is the odd register out when it comes to interrupt handling: it is
      saved in current->thread.ppr while all the others are saved on the
      stack.
      
      The difficulty with this is that accessing thread.ppr can cause an SLB
      fault, but the conversion of the SLB fault handler to C had assumed
      that the normal exception entry handlers would not cause an SLB fault.
      
      Fix this by allocating room in the interrupt stack to save PPR.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      4c2de74c
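      A simplified sketch of the change's shape: PPR gets a slot in the
      on-stack register frame saved at interrupt entry, so the handlers never
      need to touch current->thread (which could itself take an SLB fault).
      The layout below is illustrative, not the real struct pt_regs.

      struct pt_regs_sketch {
          unsigned long gpr[32];   /* general purpose registers */
          unsigned long nip;       /* interrupted instruction address */
          unsigned long msr;
          /* ... other architected state saved on interrupt entry ... */
          unsigned long ppr;       /* Program Priority Register, now saved here
                                      instead of in current->thread.ppr */
      };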
  6. 03 October 2018, 1 commit
  7. 19 September 2018, 1 commit
    • powerpc/64s/hash: Add a SLB preload cache · 89ca4e12
      Committed by Nicholas Piggin
      When switching processes, currently all user SLBEs are cleared, and a
      few (exec_base, pc, and stack) are preloaded. In trivial testing with
      small apps, this tends to miss the heap and low 256MB segments, and it
      will also miss commonly accessed segments on large memory workloads.
      
      Add a simple round-robin preload cache that just inserts the last SLB
      miss into the head of the cache and preloads those entries at context
      switch time. Every 256 context switches, the oldest entry is removed
      from the cache to shrink it and require fewer slbmte instructions if
      entries are unused.
      
      Much more could go into this, including into the SLB entry reclaim
      side to track some LRU information etc, which would require a study of
      large memory workloads. But this is a simple thing we can do now that
      is an obvious win for common workloads.
      
      With the full series, process switching speed on the context_switch
      benchmark on POWER9/hash (with kernel speculation security measures
      disabled) increases from 140K/s to 178K/s (27%).
      
      POWER8 does not change much (within 1%); it's unclear why it does not
      see a big gain like POWER9.
      
      Booting to busybox init with 256MB segments has SLB misses go down
      from 945 to 69, and with 1T segments 900 to 21. These could almost all
      be eliminated by preloading a bit more carefully with ELF binary
      loading.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      89ca4e12
  8. 30 July 2018, 1 commit
  9. 03 June 2018, 2 commits
  10. 31 March 2018, 1 commit
  11. 30 March 2018, 2 commits
  12. 20 January 2018, 1 commit
  13. 12 November 2017, 2 commits
  14. 02 July 2017, 1 commit
  15. 29 June 2017, 1 commit
  16. 19 June 2017, 1 commit
  17. 08 June 2017, 1 commit
  18. 05 May 2017, 1 commit
    • powerpc/64e: Don't place the stack beyond TASK_SIZE · 61baf155
      Committed by Scott Wood
      Commit f4ea6dcb ("powerpc/mm: Enable mappings above 128TB") increased
      the task size on book3s, and introduced a mechanism to dynamically
      control whether a task uses these larger addresses. While the change to
      the task size itself was ifdef-protected to only apply on book3s, the
      change to STACK_TOP_USER64 was not. On book3e, this had the effect of
      trying to use addresses up to 128TiB for the stack despite a 64TiB task
      size limit, which broke 64-bit userspace and produced the following
      errors:
      
      Starting init: /sbin/init exists but couldn't execute it (error -14)
      Starting init: /bin/sh exists but couldn't execute it (error -14)
      Kernel panic - not syncing: No working init found.  Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
      
      Fixes: f4ea6dcb ("powerpc/mm: Enable mappings above 128TB")
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Scott Wood <oss@buserror.net>
      61baf155
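      The fix amounts to keeping the larger stack top under the same Book3S
      guard as the larger task size; the sketch below only shows the shape of
      that guard, with simplified macro choices rather than the exact
      constants used in the tree.

      /*
       * Illustration only: guard the larger stack top with the same Book3S
       * condition as the larger task size, so Book3E keeps a stack top
       * within its 64TiB limit.
       */
      #ifdef CONFIG_PPC_BOOK3S_64
      #define STACK_TOP_USER64 TASK_SIZE_128TB    /* may use the larger window */
      #else
      #define STACK_TOP_USER64 TASK_SIZE_64TB     /* Book3E: unchanged limit */
      #endif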
  19. 01 April 2017, 1 commit
    • powerpc/mm: Enable mappings above 128TB · f4ea6dcb
      Committed by Aneesh Kumar K.V
      Not all user space applications are ready to handle wide addresses.
      It's known that at least some JIT compilers use the higher bits in
      pointers to encode their own information. That collides with valid
      pointers at 512TB addresses and leads to crashes.
      
      To mitigate this, we are not going to allocate virtual address space
      above 128TB by default.
      
      But userspace can ask for allocation from full address space by
      specifying hint address (with or without MAP_FIXED) above 128TB.
      
      If the hint address is set above 128TB but MAP_FIXED is not specified,
      we try to find an unmapped area at the specified address. If it's
      already occupied, we look for an unmapped area in the *full* address
      space, rather than within the 128TB window.
      
      This approach makes it easy for an application's memory allocator to
      become aware of the large address space without manually tracking the
      allocated virtual address space.
      
      This is going to be a per-mmap decision, i.e. we can have some mmaps
      with larger addresses and others that do not.
      
      A sample memory layout looks like:
      
        10000000-10010000 r-xp 00000000 fc:00 9057045          /home/max_addr_512TB
        10010000-10020000 r--p 00000000 fc:00 9057045          /home/max_addr_512TB
        10020000-10030000 rw-p 00010000 fc:00 9057045          /home/max_addr_512TB
        10029630000-10029660000 rw-p 00000000 00:00 0          [heap]
        7fff834a0000-7fff834b0000 rw-p 00000000 00:00 0
        7fff834b0000-7fff83670000 r-xp 00000000 fc:00 9177190  /lib/powerpc64le-linux-gnu/libc-2.23.so
        7fff83670000-7fff83680000 r--p 001b0000 fc:00 9177190  /lib/powerpc64le-linux-gnu/libc-2.23.so
        7fff83680000-7fff83690000 rw-p 001c0000 fc:00 9177190  /lib/powerpc64le-linux-gnu/libc-2.23.so
        7fff83690000-7fff836a0000 rw-p 00000000 00:00 0
        7fff836a0000-7fff836c0000 r-xp 00000000 00:00 0        [vdso]
        7fff836c0000-7fff83700000 r-xp 00000000 fc:00 9177193  /lib/powerpc64le-linux-gnu/ld-2.23.so
        7fff83700000-7fff83710000 r--p 00030000 fc:00 9177193  /lib/powerpc64le-linux-gnu/ld-2.23.so
        7fff83710000-7fff83720000 rw-p 00040000 fc:00 9177193  /lib/powerpc64le-linux-gnu/ld-2.23.so
        7fffdccf0000-7fffdcd20000 rw-p 00000000 00:00 0        [stack]
        1000000000000-1000000010000 rw-p 00000000 00:00 0
        1ffff83710000-1ffff83720000 rw-p 00000000 00:00 0
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      f4ea6dcb
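      A small userspace example of the opt-in behaviour described above,
      assuming a kernel with this patch: a normal mmap() stays below 128TB,
      while a hint address above 128TB (without MAP_FIXED) lets the kernel
      place that one mapping in the full address space.

      #include <stdio.h>
      #include <sys/mman.h>

      int main(void)
      {
          size_t len = 64 * 1024;

          /* default behaviour: the mapping is placed below the 128TB boundary */
          void *low = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          /* hint above 128TB (1UL << 47): opts this mapping into the full address space */
          void *high = mmap((void *)(1UL << 47), len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          printf("low  mapping at %p\nhigh mapping at %p\n", low, high);
          return 0;
      }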
  20. 31 March 2017, 1 commit
    • powerpc/mm/hash: Increase VA range to 128TB · f6eedbba
      Committed by Aneesh Kumar K.V
      We update the hash Linux page table layout so that it can support
      512TB, but we limit TASK_SIZE to 128TB. We can switch to 128TB by
      default unconditionally because that is the maximum virtual address
      supported by other architectures. We will later add a mechanism to
      increase the application's effective address range to 512TB on demand.
      
      Changing the page table layout to accommodate 512TB makes testing
      large memory configurations easier, with fewer code changes to the
      kernel.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      f6eedbba
  21. 31 January 2017, 1 commit
    • powernv: Pass PSSCR value and mask to power9_idle_stop · 09206b60
      Committed by Gautham R. Shenoy
      The power9_idle_stop method currently takes only the requested stop
      level as a parameter and picks up the rest of the PSSCR bits from a
      hand-coded macro. This is not a very flexible design, especially when
      the firmware has the capability to communicate the psscr value and the
      mask associated with a particular stop state via device tree.
      
      This patch modifies the power9_idle_stop API to take as parameters the
      PSSCR value and the PSSCR mask corresponding to the stop state that
      needs to be set. The PSSCR value and mask are obtained by parsing the
      "ibm,cpu-idle-state-psscr" and "ibm,cpu-idle-state-psscr-mask"
      properties from the device tree, respectively.
      
      In addition to this, the patch adds support for handling stop states
      for which ESL and EC bits in the PSSCR are zero. As per the
      architecture, a wakeup from these stop states resumes execution from
      the subsequent instruction as opposed to waking up at the System
      Vector.
      
      Older firmware sets only the Requested Level (RL) field in the psscr
      and psscr-mask exposed in the device tree. For older firmware where
      psscr-mask=0xf, this patch sets sane default values for the remaining
      PSSCR fields (i.e. PSLL, MTL, ESL, EC, and TR). For newer firmware,
      the patch validates that the invariants required by the ISA for the
      psscr values are maintained by the firmware.
      
      The skiboot patch that exports fully populated PSSCR values and masks
      for all the stop states can be found here:
      https://lists.ozlabs.org/pipermail/skiboot/2016-September/004869.html
      
      [Optimize the number of instructions before entering STOP with
      ESL=EC=0, and validate that the PSSCR values provided by the firmware
      maintain the invariants required by the ISA, as suggested by Balbir
      Singh]
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      09206b60
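      The value/mask composition described above can be pictured roughly as
      below; the macro name and the default bits are placeholders for this
      illustration, not the kernel's definitions.

      #include <stdint.h>

      #define PSSCR_HV_DEFAULT_VAL 0x003001e0ULL  /* assumed sane defaults for PSLL/MTL/ESL/EC/TR */

      /*
       * Combine the firmware-provided psscr value and mask with the defaults:
       * fields covered by the mask come from the device tree, the rest keep
       * the default values. Old firmware with psscr-mask = 0xf therefore only
       * contributes the Requested Level field.
       */
      static uint64_t compute_psscr(uint64_t dt_psscr, uint64_t dt_mask)
      {
          return (PSSCR_HV_DEFAULT_VAL & ~dt_mask) | (dt_psscr & dt_mask);
      }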
  22. 25 January 2017, 1 commit
  23. 17 November 2016, 1 commit
  24. 16 November 2016, 2 commits
    • locking/core, arch: Remove cpu_relax_lowlatency() · 5bd0b85b
      Committed by Christian Borntraeger
      As there are no users left, we can remove cpu_relax_lowlatency()
      implementations from every architecture.
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Noam Camus <noamc@ezchip.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: virtualization@lists.linux-foundation.org
      Cc: xen-devel@lists.xenproject.org
      Cc: <linux-arch@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1477386195-32736-6-git-send-email-borntraeger@de.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5bd0b85b
    • locking/core: Introduce cpu_relax_yield() · 79ab11cd
      Committed by Christian Borntraeger
      For spinning loops people often use barrier() or cpu_relax().
      For most architectures cpu_relax and barrier are the same, but on
      some architectures cpu_relax can add some latency.
      For example on power, sparc64 and arc, cpu_relax can shift the CPU
      towards other hardware threads in an SMT environment.
      On s390 cpu_relax does even more: it uses a hypercall to the
      hypervisor to give up the timeslice.
      In contrast to the SMT yielding this can result in larger latencies.
      In some places this latency is unwanted, so another variant
      "cpu_relax_lowlatency" was introduced. Before this is used in more
      and more places, let's revert the logic and provide a cpu_relax_yield
      that can be called in places where yielding is more important than
      latency. By default this is the same as cpu_relax on all architectures.
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Noam Camus <noamc@ezchip.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: virtualization@lists.linux-foundation.org
      Cc: xen-devel@lists.xenproject.org
      Link: http://lkml.kernel.org/r/1477386195-32736-2-git-send-email-borntraeger@de.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      79ab11cd
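      An illustrative-only fragment (assuming kernel context, where
      cpu_relax_yield() is defined) showing where the new helper fits: spin
      loops where giving up the timeslice or SMT priority matters more than
      wakeup latency, with plain cpu_relax() remaining the low-latency choice.

      /* wait for another CPU to set the flag; fairness matters more than latency here */
      static void wait_for_flag(volatile int *flag)
      {
          while (!*flag)
              cpu_relax_yield();   /* may yield the timeslice (s390) or lower SMT priority */
      }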
  25. 14 November 2016, 1 commit
  26. 04 October 2016, 3 commits
    • powerpc: tm: Enable transactional memory (TM) lazily for userspace · 5d176f75
      Committed by Cyril Bur
      Currently the MSR TM bit is always set if the hardware is TM capable.
      This adds extra overhead as it means the TM SPRs (TFHAR, TEXASR and
      TFIAR) must be swapped for each process regardless of whether they use
      TM.
      
      For processes that don't use TM, the TM MSR bit can be turned off,
      allowing the kernel to avoid the expensive swap of the TM registers.
      
      A TM unavailable exception will occur if a thread does use TM, and the
      kernel will then enable MSR_TM and leave it set for some time
      afterwards.
      Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      5d176f75
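      A self-contained sketch of the lazy-enable flow described in the commit
      above; the types, the MSR_TM bit position and the load_tm counter are
      illustrative assumptions rather than the kernel's exact handler.

      #include <stdint.h>

      #define MSR_TM (1ULL << 32)        /* illustrative bit position */

      struct regs_sketch   { uint64_t msr; };
      struct thread_sketch { unsigned int load_tm; };

      /* taken on the first TM instruction while MSR[TM] is still clear */
      static void tm_unavailable(struct regs_sketch *regs, struct thread_sketch *t)
      {
          regs->msr |= MSR_TM;   /* enable TM for this thread; its SPRs now get context switched */
          t->load_tm = 8;        /* keep TM enabled for a while before lazily turning it off again */
      }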
    • powerpc: tm: Rename transct_(*) to ck(\1)_state · 000ec280
      Committed by Cyril Bur
      Name the structures used for checkpointed state consistently with
      pt_regs/ckpt_regs.
      Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      000ec280
    • powerpc: tm: Always use fp_state and vr_state to store live registers · dc310669
      Committed by Cyril Bur
      There is currently an inconsistency as to how the entire CPU register
      state is saved and restored when a thread uses transactional memory
      (TM).
      
      Using transactional memory results in the CPU having duplicated
      (almost) all of its register state. This duplication results in a set
      of registers which can be considered 'live', those being currently
      modified by the instructions being executed and another set that is
      frozen at a point in time.
      
      On context switch, both sets of state have to be saved and (later)
      restored. These two states are often called a variety of different
      things. Common terms for the state which only exists after the CPU has
      entered a transaction (performed a TBEGIN instruction) in hardware are
      'transactional' or 'speculative'.
      
      Between a TBEGIN and a TEND or TABORT (or an event that causes the
      hardware to abort), regardless of the use of TSUSPEND the
      transactional state can be referred to as the live state.
      
      The second state is often referred to as the 'checkpointed' state
      and is a duplication of the live state when the TBEGIN instruction is
      executed. This state is kept in the hardware and will be rolled back
      to on transaction failure.
      
      Currently all the registers stored in pt_regs are ALWAYS the live
      registers, that is, when a thread has transactional registers their
      values are stored in pt_regs and the checkpointed state is in
      ckpt_regs. A strange opposite is true for fp_state/vr_state. When a
      thread is non-transactional, fp_state/vr_state holds the live
      registers. When a thread has initiated a transaction, fp_state/vr_state
      holds the checkpointed state and transact_fp/transact_vr become the
      structures which hold the live state (at this point it is a
      transactional state).
      
      This method creates confusion as to where the live state is: in some
      circumstances it requires extra work to determine where to put the
      live state, and it prevents the use of common functions designed
      (probably before TM) to save the live state.
      
      With this patch pt_regs, fp_state and vr_state all represent the
      same thing and the other structures [pending rename] are for
      checkpointed state.
      Acked-by: Simon Guo <wei.guo.simon@gmail.com>
      Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      dc310669
  27. 15 July 2016, 1 commit
    • powerpc/powernv: Add platform support for stop instruction · bcef83a0
      Committed by Shreyas B. Prabhu
      POWER ISA v3 defines a new idle processor core mechanism. In summary,
       a) a new instruction named stop is added. This instruction replaces
      	instructions like nap, sleep and rvwinkle.
       b) a new per-thread SPR named the Processor Stop Status and Control
      	Register (PSSCR) is added, which controls the behaviour of the stop
      	instruction.
      
      PSSCR layout:
      ----------------------------------------------------------
      | PLS | /// | SD | ESL | EC | PSLL | /// | TR | MTL | RL |
      ----------------------------------------------------------
      0      4     41   42    43   44     48    54   56    60
      
      PSSCR key fields:
      	Bits 0:3  - Power-Saving Level Status. This field indicates the lowest
      	power-saving state the thread entered since stop instruction was last
      	executed.
      
      	Bit 42 - Enable State Loss
      	0 - No state is lost irrespective of other fields
      	1 - Allows state loss
      
      	Bits 44:47 - Power-Saving Level Limit
      	This limits the power-saving level that can be entered into.
      
      	Bits 60:63 - Requested Level
      	Used to specify which power-saving level must be entered on executing
      	stop instruction
      
      This patch adds support for stop instruction and PSSCR handling.
      Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      bcef83a0
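      Since PowerISA numbers PSSCR bits from the most-significant end, bit n
      of the layout above corresponds to shift (63 - n) in a plain 64-bit
      value; the extraction macros below follow from that. The macro names
      are ours for illustration, not the kernel's.

      #include <stdint.h>

      #define GET_PSSCR_RL(p)    ((uint64_t)(p) & 0xfULL)           /* bits 60:63  Requested Level          */
      #define GET_PSSCR_MTL(p)   (((uint64_t)(p) >> 4)  & 0xfULL)   /* bits 56:59  Maximum Transition Level */
      #define GET_PSSCR_TR(p)    (((uint64_t)(p) >> 8)  & 0x3ULL)   /* bits 54:55  Transition Rate          */
      #define GET_PSSCR_PSLL(p)  (((uint64_t)(p) >> 16) & 0xfULL)   /* bits 44:47  Power-Saving Level Limit */
      #define GET_PSSCR_EC(p)    (((uint64_t)(p) >> 20) & 0x1ULL)   /* bit  43     Exit Criterion           */
      #define GET_PSSCR_ESL(p)   (((uint64_t)(p) >> 21) & 0x1ULL)   /* bit  42     Enable State Loss        */
      #define GET_PSSCR_SD(p)    (((uint64_t)(p) >> 22) & 0x1ULL)   /* bit  41     Status Disable           */
      #define GET_PSSCR_PLS(p)   (((uint64_t)(p) >> 60) & 0xfULL)   /* bits 0:3    Power-Saving Level Status*/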
  28. 21 June 2016, 2 commits
    • powerpc: Load Monitor Register Support · bd3ea317
      Committed by Jack Miller
      This enables new registers, LMRR and LMSER, that can trigger an EBB in
      userspace code when a monitored load (via the new ldmx instruction)
      loads memory from a monitored space. This facility is controlled by a
      new FSCR bit, LM.
      
      This patch disables the FSCR LM control bit on task init and enables
      it when a load monitor facility unavailable exception is taken on
      first use. On context switch, this bit is then used to determine
      whether the two relevant registers are saved and restored. This is
      done lazily for performance reasons.
      Signed-off-by: Jack Miller <jack@codezen.org>
      Signed-off-by: Michael Neuling <mikey@neuling.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      bd3ea317
    • powerpc: Improve FSCR init and context switching · b57bd2de
      Committed by Michael Neuling
      This fixes a few issues with FSCR init and switching.
      
      In commit 152d523e ("powerpc: Create context switch helpers
      save_sprs() and restore_sprs()") we moved the setting of the FSCR
      register from inside a CPU_FTR_ARCH_207S section to inside just a
      CPU_FTR_ARCH_DSCR section. Hence we are setting FSCR on POWER6/7,
      where the FSCR doesn't exist. This is harmless but we shouldn't do it.
      
      Also, we can simplify the FSCR context switch. We don't need to go
      through the calculation involving dscr_inherit. We can just restore
      what we saved last time.
      
      We also set an initial value in INIT_THREAD, so that pid 1, which is
      cloned from that, gets a sane value.
      
      Based on patch by Jack Miller.
      Signed-off-by: Michael Neuling <mikey@neuling.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      b57bd2de
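      A sketch of the simplified FSCR handling the commit describes (assuming
      kernel context; the helper bodies here are illustrative, not the actual
      patch): save the outgoing thread's FSCR and restore the incoming one,
      with no dscr_inherit-style recomputation.

      static void save_sprs_fscr(struct thread_struct *t)
      {
          if (cpu_has_feature(CPU_FTR_ARCH_207S))
              t->fscr = mfspr(SPRN_FSCR);            /* remember exactly what this thread was using */
      }

      static void restore_sprs_fscr(struct thread_struct *old, struct thread_struct *new)
      {
          if (cpu_has_feature(CPU_FTR_ARCH_207S) && old->fscr != new->fscr)
              mtspr(SPRN_FSCR, new->fscr);           /* just restore what we saved last time */
      }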
  29. 14 June 2016, 1 commit