1. 23 2月, 2019 2 次提交
  2. 21 2月, 2019 1 次提交
  3. 31 10月, 2018 1 次提交
  4. 14 10月, 2018 2 次提交
    • N
      powerpc/64s/hash: Add a SLB preload cache · 5434ae74
      Nicholas Piggin 提交于
      When switching processes, currently all user SLBEs are cleared, and a
      few (exec_base, pc, and stack) are preloaded. In trivial testing with
      small apps, this tends to miss the heap and low 256MB segments, and it
      will also miss commonly accessed segments on large memory workloads.
      
      Add a simple round-robin preload cache that just inserts the last SLB
      miss into the head of the cache and preloads those at context switch
      time. Every 256 context switches, the oldest entry is removed from the
      cache to shrink the cache and require fewer slbmte if they are unused.
      
      Much more could go into this, including into the SLB entry reclaim
      side to track some LRU information etc, which would require a study of
      large memory workloads. But this is a simple thing we can do now that
      is an obvious win for common workloads.
      
      With the full series, process switching speed on the context_switch
      benchmark on POWER9/hash (with kernel speculation security masures
      disabled) increases from 140K/s to 178K/s (27%).
      
      POWER8 does not change much (within 1%), it's unclear why it does not
      see a big gain like POWER9.
      
      Booting to busybox init with 256MB segments has SLB misses go down
      from 945 to 69, and with 1T segments 900 to 21. These could almost all
      be eliminated by preloading a bit more carefully with ELF binary
      loading.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      5434ae74
    • N
      powerpc/64: Interrupts save PPR on stack rather than thread_struct · 4c2de74c
      Nicholas Piggin 提交于
      PPR is the odd register out when it comes to interrupt handling, it is
      saved in current->thread.ppr while all others are saved on the stack.
      
      The difficulty with this is that accessing thread.ppr can cause a SLB
      fault, but the SLB fault handler implementation in C change had
      assumed the normal exception entry handlers would not cause an SLB
      fault.
      
      Fix this by allocating room in the interrupt stack to save PPR.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      4c2de74c
  5. 03 10月, 2018 1 次提交
  6. 19 9月, 2018 1 次提交
    • N
      powerpc/64s/hash: Add a SLB preload cache · 89ca4e12
      Nicholas Piggin 提交于
      When switching processes, currently all user SLBEs are cleared, and a
      few (exec_base, pc, and stack) are preloaded. In trivial testing with
      small apps, this tends to miss the heap and low 256MB segments, and it
      will also miss commonly accessed segments on large memory workloads.
      
      Add a simple round-robin preload cache that just inserts the last SLB
      miss into the head of the cache and preloads those at context switch
      time. Every 256 context switches, the oldest entry is removed from the
      cache to shrink the cache and require fewer slbmte if they are unused.
      
      Much more could go into this, including into the SLB entry reclaim
      side to track some LRU information etc, which would require a study of
      large memory workloads. But this is a simple thing we can do now that
      is an obvious win for common workloads.
      
      With the full series, process switching speed on the context_switch
      benchmark on POWER9/hash (with kernel speculation security masures
      disabled) increases from 140K/s to 178K/s (27%).
      
      POWER8 does not change much (within 1%), it's unclear why it does not
      see a big gain like POWER9.
      
      Booting to busybox init with 256MB segments has SLB misses go down
      from 945 to 69, and with 1T segments 900 to 21. These could almost all
      be eliminated by preloading a bit more carefully with ELF binary
      loading.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      89ca4e12
  7. 30 7月, 2018 1 次提交
  8. 03 6月, 2018 2 次提交
  9. 31 3月, 2018 1 次提交
  10. 30 3月, 2018 2 次提交
  11. 20 1月, 2018 1 次提交
  12. 12 11月, 2017 2 次提交
  13. 02 7月, 2017 1 次提交
  14. 29 6月, 2017 1 次提交
  15. 19 6月, 2017 1 次提交
  16. 08 6月, 2017 1 次提交
  17. 05 5月, 2017 1 次提交
    • S
      powerpc/64e: Don't place the stack beyond TASK_SIZE · 61baf155
      Scott Wood 提交于
      Commit f4ea6dcb ("powerpc/mm: Enable mappings above 128TB") increased
      the task size on book3s, and introduced a mechanism to dynamically
      control whether a task uses these larger addresses.  While the change to
      the task size itself was ifdef-protected to only apply on book3s, the
      change to STACK_TOP_USER64 was not.  On book3e, this had the effect of
      trying to use addresses up to 128TiB for the stack despite a 64TiB task
      size limit -- which broke 64-bit userspace producing the following errors:
      
      Starting init: /sbin/init exists but couldn't execute it (error -14)
      Starting init: /bin/sh exists but couldn't execute it (error -14)
      Kernel panic - not syncing: No working init found.  Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
      
      Fixes: f4ea6dcb ("powerpc/mm: Enable mappings above 128TB")
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NScott Wood <oss@buserror.net>
      61baf155
  18. 01 4月, 2017 1 次提交
    • A
      powerpc/mm: Enable mappings above 128TB · f4ea6dcb
      Aneesh Kumar K.V 提交于
      Not all user space application is ready to handle wide addresses. It's
      known that at least some JIT compilers use higher bits in pointers to
      encode their information. It collides with valid pointers with 512TB
      addresses and leads to crashes.
      
      To mitigate this, we are not going to allocate virtual address space
      above 128TB by default.
      
      But userspace can ask for allocation from full address space by
      specifying hint address (with or without MAP_FIXED) above 128TB.
      
      If hint address set above 128TB, but MAP_FIXED is not specified, we try
      to look for unmapped area by specified address. If it's already
      occupied, we look for unmapped area in *full* address space, rather than
      from 128TB window.
      
      This approach helps to easily make application's memory allocator aware
      about large address space without manually tracking allocated virtual
      address space.
      
      This is going to be a per mmap decision. ie, we can have some mmaps with
      larger addresses and other that do not.
      
      A sample memory layout looks like:
      
        10000000-10010000 r-xp 00000000 fc:00 9057045          /home/max_addr_512TB
        10010000-10020000 r--p 00000000 fc:00 9057045          /home/max_addr_512TB
        10020000-10030000 rw-p 00010000 fc:00 9057045          /home/max_addr_512TB
        10029630000-10029660000 rw-p 00000000 00:00 0          [heap]
        7fff834a0000-7fff834b0000 rw-p 00000000 00:00 0
        7fff834b0000-7fff83670000 r-xp 00000000 fc:00 9177190  /lib/powerpc64le-linux-gnu/libc-2.23.so
        7fff83670000-7fff83680000 r--p 001b0000 fc:00 9177190  /lib/powerpc64le-linux-gnu/libc-2.23.so
        7fff83680000-7fff83690000 rw-p 001c0000 fc:00 9177190  /lib/powerpc64le-linux-gnu/libc-2.23.so
        7fff83690000-7fff836a0000 rw-p 00000000 00:00 0
        7fff836a0000-7fff836c0000 r-xp 00000000 00:00 0        [vdso]
        7fff836c0000-7fff83700000 r-xp 00000000 fc:00 9177193  /lib/powerpc64le-linux-gnu/ld-2.23.so
        7fff83700000-7fff83710000 r--p 00030000 fc:00 9177193  /lib/powerpc64le-linux-gnu/ld-2.23.so
        7fff83710000-7fff83720000 rw-p 00040000 fc:00 9177193  /lib/powerpc64le-linux-gnu/ld-2.23.so
        7fffdccf0000-7fffdcd20000 rw-p 00000000 00:00 0        [stack]
        1000000000000-1000000010000 rw-p 00000000 00:00 0
        1ffff83710000-1ffff83720000 rw-p 00000000 00:00 0
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      f4ea6dcb
  19. 31 3月, 2017 1 次提交
    • A
      powerpc/mm/hash: Increase VA range to 128TB · f6eedbba
      Aneesh Kumar K.V 提交于
      We update the hash linux page table layout such that we can support
      512TB. But we limit the TASK_SIZE to 128TB. We can switch to 128TB by
      default without conditional because that is the max virtual address
      supported by other architectures. We will later add a mechanism to
      on-demand increase the application's effective address range to 512TB.
      
      Having the page table layout changed to accommodate 512TB makes testing
      large memory configuration easier with less code changes to kernel
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      f6eedbba
  20. 31 1月, 2017 1 次提交
    • G
      powernv: Pass PSSCR value and mask to power9_idle_stop · 09206b60
      Gautham R. Shenoy 提交于
      The power9_idle_stop method currently takes only the requested stop
      level as a parameter and picks up the rest of the PSSCR bits from a
      hand-coded macro. This is not a very flexible design, especially when
      the firmware has the capability to communicate the psscr value and the
      mask associated with a particular stop state via device tree.
      
      This patch modifies the power9_idle_stop API to take as parameters the
      PSSCR value and the PSSCR mask corresponding to the stop state that
      needs to be set. These PSSCR value and mask are respectively obtained
      by parsing the "ibm,cpu-idle-state-psscr" and
      "ibm,cpu-idle-state-psscr-mask" fields from the device tree.
      
      In addition to this, the patch adds support for handling stop states
      for which ESL and EC bits in the PSSCR are zero. As per the
      architecture, a wakeup from these stop states resumes execution from
      the subsequent instruction as opposed to waking up at the System
      Vector.
      
      The older firmware sets only the Requested Level (RL) field in the
      psscr and psscr-mask exposed in the device tree. For older firmware
      where psscr-mask=0xf, this patch will set the default sane values that
      the set for for remaining PSSCR fields (i.e PSLL, MTL, ESL, EC, and
      TR). For the new firmware, the patch will validate that the invariants
      required by the ISA for the psscr values are maintained by the
      firmware.
      
      This skiboot patch that exports fully populated PSSCR values and the
      mask for all the stop states can be found here:
      https://lists.ozlabs.org/pipermail/skiboot/2016-September/004869.html
      
      [Optimize the number of instructions before entering STOP with
      ESL=EC=0, validate the PSSCR values provided by the firimware
      maintains the invariants required as per the ISA suggested by Balbir
      Singh]
      Acked-by: NBalbir Singh <bsingharora@gmail.com>
      Signed-off-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      09206b60
  21. 25 1月, 2017 1 次提交
  22. 17 11月, 2016 1 次提交
  23. 16 11月, 2016 2 次提交
    • C
      locking/core, arch: Remove cpu_relax_lowlatency() · 5bd0b85b
      Christian Borntraeger 提交于
      As there are no users left, we can remove cpu_relax_lowlatency()
      implementations from every architecture.
      Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Noam Camus <noamc@ezchip.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: virtualization@lists.linux-foundation.org
      Cc: xen-devel@lists.xenproject.org
      Cc: <linux-arch@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1477386195-32736-6-git-send-email-borntraeger@de.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      5bd0b85b
    • C
      locking/core: Introduce cpu_relax_yield() · 79ab11cd
      Christian Borntraeger 提交于
      For spinning loops people do often use barrier() or cpu_relax().
      For most architectures cpu_relax and barrier are the same, but on
      some architectures cpu_relax can add some latency.
      For example on power,sparc64 and arc, cpu_relax can shift the CPU
      towards other hardware threads in an SMT environment.
      On s390 cpu_relax does even more, it uses an hypercall to the
      hypervisor to give up the timeslice.
      In contrast to the SMT yielding this can result in larger latencies.
      In some places this latency is unwanted, so another variant
      "cpu_relax_lowlatency" was introduced. Before this is used in more
      and more places, lets revert the logic and provide a cpu_relax_yield
      that can be called in places where yielding is more important than
      latency. By default this is the same as cpu_relax on all architectures.
      Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Noam Camus <noamc@ezchip.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: virtualization@lists.linux-foundation.org
      Cc: xen-devel@lists.xenproject.org
      Link: http://lkml.kernel.org/r/1477386195-32736-2-git-send-email-borntraeger@de.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      79ab11cd
  24. 14 11月, 2016 1 次提交
  25. 04 10月, 2016 3 次提交
    • C
      powerpc: tm: Enable transactional memory (TM) lazily for userspace · 5d176f75
      Cyril Bur 提交于
      Currently the MSR TM bit is always set if the hardware is TM capable.
      This adds extra overhead as it means the TM SPRS (TFHAR, TEXASR and
      TFAIR) must be swapped for each process regardless of if they use TM.
      
      For processes that don't use TM the TM MSR bit can be turned off
      allowing the kernel to avoid the expensive swap of the TM registers.
      
      A TM unavailable exception will occur if a thread does use TM and the
      kernel will enable MSR_TM and leave it so for some time afterwards.
      Signed-off-by: NCyril Bur <cyrilbur@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      5d176f75
    • C
      powerpc: tm: Rename transct_(*) to ck(\1)_state · 000ec280
      Cyril Bur 提交于
      Make the structures being used for checkpointed state named
      consistently with the pt_regs/ckpt_regs.
      Signed-off-by: NCyril Bur <cyrilbur@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      000ec280
    • C
      powerpc: tm: Always use fp_state and vr_state to store live registers · dc310669
      Cyril Bur 提交于
      There is currently an inconsistency as to how the entire CPU register
      state is saved and restored when a thread uses transactional memory
      (TM).
      
      Using transactional memory results in the CPU having duplicated
      (almost) all of its register state. This duplication results in a set
      of registers which can be considered 'live', those being currently
      modified by the instructions being executed and another set that is
      frozen at a point in time.
      
      On context switch, both sets of state have to be saved and (later)
      restored. These two states are often called a variety of different
      things. Common terms for the state which only exists after the CPU has
      entered a transaction (performed a TBEGIN instruction) in hardware are
      'transactional' or 'speculative'.
      
      Between a TBEGIN and a TEND or TABORT (or an event that causes the
      hardware to abort), regardless of the use of TSUSPEND the
      transactional state can be referred to as the live state.
      
      The second state is often to referred to as the 'checkpointed' state
      and is a duplication of the live state when the TBEGIN instruction is
      executed. This state is kept in the hardware and will be rolled back
      to on transaction failure.
      
      Currently all the registers stored in pt_regs are ALWAYS the live
      registers, that is, when a thread has transactional registers their
      values are stored in pt_regs and the checkpointed state is in
      ckpt_regs. A strange opposite is true for fp_state/vr_state. When a
      thread is non transactional fp_state/vr_state holds the live
      registers. When a thread has initiated a transaction fp_state/vr_state
      holds the checkpointed state and transact_fp/transact_vr become the
      structure which holds the live state (at this point it is a
      transactional state).
      
      This method creates confusion as to where the live state is, in some
      circumstances it requires extra work to determine where to put the
      live state and prevents the use of common functions designed (probably
      before TM) to save the live state.
      
      With this patch pt_regs, fp_state and vr_state all represent the
      same thing and the other structures [pending rename] are for
      checkpointed state.
      Acked-by: NSimon Guo <wei.guo.simon@gmail.com>
      Signed-off-by: NCyril Bur <cyrilbur@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      dc310669
  26. 15 7月, 2016 1 次提交
    • S
      powerpc/powernv: Add platform support for stop instruction · bcef83a0
      Shreyas B. Prabhu 提交于
      POWER ISA v3 defines a new idle processor core mechanism. In summary,
       a) new instruction named stop is added. This instruction replaces
      	instructions like nap, sleep, rvwinkle.
       b) new per thread SPR named Processor Stop Status and Control Register
      	(PSSCR) is added which controls the behavior of stop instruction.
      
      PSSCR layout:
      ----------------------------------------------------------
      | PLS | /// | SD | ESL | EC | PSLL | /// | TR | MTL | RL |
      ----------------------------------------------------------
      0      4     41   42    43   44     48    54   56    60
      
      PSSCR key fields:
      	Bits 0:3  - Power-Saving Level Status. This field indicates the lowest
      	power-saving state the thread entered since stop instruction was last
      	executed.
      
      	Bit 42 - Enable State Loss
      	0 - No state is lost irrespective of other fields
      	1 - Allows state loss
      
      	Bits 44:47 - Power-Saving Level Limit
      	This limits the power-saving level that can be entered into.
      
      	Bits 60:63 - Requested Level
      	Used to specify which power-saving level must be entered on executing
      	stop instruction
      
      This patch adds support for stop instruction and PSSCR handling.
      Reviewed-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: NShreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      bcef83a0
  27. 21 6月, 2016 2 次提交
    • J
      powerpc: Load Monitor Register Support · bd3ea317
      Jack Miller 提交于
      This enables new registers, LMRR and LMSER, that can trigger an EBB in
      userspace code when a monitored load (via the new ldmx instruction)
      loads memory from a monitored space. This facility is controlled by a
      new FSCR bit, LM.
      
      This patch disables the FSCR LM control bit on task init and enables
      that bit when a load monitor facility unavailable exception is taken
      for using it. On context switch, this bit is then used to determine
      whether the two relevant registers are saved and restored. This is
      done lazily for performance reasons.
      Signed-off-by: NJack Miller <jack@codezen.org>
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      bd3ea317
    • M
      powerpc: Improve FSCR init and context switching · b57bd2de
      Michael Neuling 提交于
      This fixes a few issues with FSCR init and switching.
      
      In commit 152d523e ("powerpc: Create context switch helpers
      save_sprs() and restore_sprs()") we moved the setting of the FSCR
      register from inside an CPU_FTR_ARCH_207S section to inside just a
      CPU_FTR_ARCH_DSCR section. Hence we are setting FSCR on POWER6/7 where
      the FSCR doesn't exist. This is harmless but we shouldn't do it.
      
      Also, we can simplify the FSCR context switch. We don't need to go
      through the calculation involving dscr_inherit. We can just restore
      what we saved last time.
      
      We also set an initial value in INIT_THREAD, so that pid 1 which is
      cloned from that gets a sane value.
      
      Based on patch by Jack Miller.
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      b57bd2de
  28. 14 6月, 2016 1 次提交
  29. 29 3月, 2016 1 次提交
  30. 02 3月, 2016 1 次提交
    • C
      powerpc: Restore FPU/VEC/VSX if previously used · 70fe3d98
      Cyril Bur 提交于
      Currently the FPU, VEC and VSX facilities are lazily loaded. This is not
      a problem unless a process is using these facilities.
      
      Modern versions of GCC are very good at automatically vectorising code,
      new and modernised workloads make use of floating point and vector
      facilities, even the kernel makes use of vectorised memcpy.
      
      All this combined greatly increases the cost of a syscall since the
      kernel uses the facilities sometimes even in syscall fast-path making it
      increasingly common for a thread to take an *_unavailable exception soon
      after a syscall, not to mention potentially taking all three.
      
      The obvious overcompensation to this problem is to simply always load
      all the facilities on every exit to userspace. Loading up all FPU, VEC
      and VSX registers every time can be expensive and if a workload does
      avoid using them, it should not be forced to incur this penalty.
      
      An 8bit counter is used to detect if the registers have been used in the
      past and the registers are always loaded until the value wraps to back
      to zero.
      
      Several versions of the assembly in entry_64.S were tested:
      
        1. Always calling C.
        2. Performing a common case check and then calling C.
        3. A complex check in asm.
      
      After some benchmarking it was determined that avoiding C in the common
      case is a performance benefit (option 2). The full check in asm (option
      3) greatly complicated that codepath for a negligible performance gain
      and the trade-off was deemed not worth it.
      Signed-off-by: NCyril Bur <cyrilbur@gmail.com>
      [mpe: Move load_vec in the struct to fill an existing hole, reword change log]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      
      fixup
      70fe3d98
  31. 01 12月, 2015 1 次提交