1. 27 Mar 2018, 4 commits
  2. 23 Mar 2018, 6 commits
    • KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9 · 4bb3c7a0
      Committed by Paul Mackerras
      POWER9 has hardware bugs relating to transactional memory and thread
      reconfiguration (changes to hardware SMT mode).  Specifically, the core
      does not have enough storage to store a complete checkpoint of all the
      architected state for all four threads.  The DD2.2 version of POWER9
      includes hardware modifications designed to allow hypervisor software
      to implement workarounds for these problems.  This patch implements
      those workarounds in KVM code so that KVM guests see a full, working
      transactional memory implementation.
      
      The problems center around the use of TM suspended state, where the
      CPU has a checkpointed state but execution is not transactional.  The
      workaround is to implement a "fake suspend" state, which looks to the
      guest like suspended state but the CPU does not store a checkpoint.
      In this state, any instruction that would cause a transition to
      transactional state (rfid, rfebb, mtmsrd, tresume) or would use the
      checkpointed state (treclaim) causes a "soft patch" interrupt (vector
      0x1500) to the hypervisor so that it can be emulated.  The trechkpt
      instruction also causes a soft patch interrupt.
      
      On POWER9 DD2.2, we avoid returning to the guest in any state which
      would require a checkpoint to be present.  The trechkpt in the guest
      entry path which would normally create that checkpoint is replaced by
      either a transition to fake suspend state, if the guest is in suspend
      state, or a rollback to the pre-transactional state if the guest is in
      transactional state.  Fake suspend state is indicated by a flag in the
      PACA plus a new bit in the PSSCR.  The new PSSCR bit is write-only and
      reads back as 0.
      
      On exit from the guest, if the guest is in fake suspend state, we still
      do the treclaim instruction as we would in real suspend state, in order
      to get into non-transactional state, but we do not save the resulting
      register state since there was no checkpoint.
      
      Emulation of the instructions that cause a soft patch interrupt is
      handled in two paths.  If the guest is in real suspend mode, we call
      kvmhv_p9_tm_emulation_early() to handle the cases where the guest is
      transitioning to transactional state.  This is called before we do the
      treclaim in the guest exit path; because we haven't done treclaim, we
      can get back to the guest with the transaction still active.  If the
      instruction is a case that kvmhv_p9_tm_emulation_early() doesn't
      handle, or if the guest is in fake suspend state, then we proceed to
      do the complete guest exit path and subsequently call
      kvmhv_p9_tm_emulation() in host context with the MMU on.  This handles
      all the cases including the cases that generate program interrupts
      (illegal instruction or TM Bad Thing) and facility unavailable
      interrupts.
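      
      As a rough sketch of that dispatch, in a simplified standalone model
      (the instruction list and the two function roles come from the
      description above; every type and name here is illustrative, not the
      real KVM code):
      
        #include <stdbool.h>
        #include <stdio.h>
        
        enum tm_insn { RFID, RFEBB, MTMSRD, TRESUME, TRECLAIM, TRECHKPT };
        
        /* Early path: runs in the guest exit path before treclaim, so a
         * handled instruction can return to the guest with the transaction
         * still active.  Only used when the guest is in real suspend mode. */
        static bool tm_emulation_early(enum tm_insn insn)
        {
            switch (insn) {
            case RFID:
            case RFEBB:
            case MTMSRD:
            case TRESUME:
                return true;    /* transition to transactional state */
            default:
                return false;   /* e.g. treclaim: needs the full exit */
            }
        }
        
        int main(void)
        {
            bool fake_suspend = false;      /* modelled paca flag */
            enum tm_insn insn = TRECLAIM;
        
            if (!fake_suspend && tm_emulation_early(insn))
                printf("early path: back to guest, transaction alive\n");
            else
                printf("full exit: emulate in host context, MMU on\n");
            return 0;
        }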
      
      The emulation is reasonably straightforward and is mostly concerned
      with checking for exception conditions and updating the state of
      registers such as MSR and CR0.  The treclaim emulation takes care to
      ensure that the TEXASR register gets updated as if it were the guest
      treclaim instruction that had done failure recording, not the treclaim
      done in hypervisor state in the guest exit path.
      
      With this, the KVM_CAP_PPC_HTM capability returns true (1) even if
      transactional memory is not available to host userspace.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv: Provide a way to force a core into SMT4 mode · 7672691a
      Committed by Paul Mackerras
      POWER9 processors up to and including "Nimbus" v2.2 have hardware
      bugs relating to transactional memory and thread reconfiguration.
      One of these bugs has a workaround which is to get the core into
      SMT4 state temporarily.  This workaround is only needed when
      running bare-metal.
      
      This patch provides a function which gets the core into SMT4 mode
      by preventing threads from going to a stop state, and waking up
      those which are already in a stop state.  Once at least 3 threads
      are not in a stop state, the core will be in SMT4 and we can
      continue.
      
      To do this, we add a "dont_stop" flag to the paca to tell the
      thread not to go into a stop state.  If this flag is set,
      power9_idle_stop() just returns immediately with a return value
      of 0.  The pnv_power9_force_smt4_catch() function does the following:
      
      1. Set the dont_stop flag for each thread in the core, except
         ourselves (in fact we use an atomic_inc() in case more than
         one thread is calling this function concurrently).
       2. See how many threads are awake, indicated by their
          requested_psscr field in the paca being 0.  If at least
          3 threads are already awake, we are done.
      3. Send a doorbell interrupt to each thread that was seen as
         being in a stop state in step 2.
      4. Until at least 3 threads are awake, scan the threads to which
         we sent a doorbell interrupt and check if they are awake now.
      
      This relies on the following properties:
      
       - Once dont_stop is non-zero, requested_psscr can't go from zero to
         non-zero, except transiently (and without the thread doing stop).
      - requested_psscr being zero guarantees that the thread isn't in
        a state-losing stop state where thread reconfiguration could occur.
      - Doing stop with a PSSCR value of 0 won't be a state-losing stop
        and thus won't allow thread reconfiguration.
      - Once threads_per_core/2 + 1 (i.e. 3) threads are awake, the core
        must be in SMT4 mode, since SMT modes are powers of 2.
      
      This does add a sync to power9_idle_stop(), which is necessary to
      provide the correct ordering between setting requested_psscr and
      checking dont_stop.  The overhead of the sync should be unnoticeable
      compared to the latency of going into and out of a stop state.
      
      Because some objected to incurring this extra latency on systems where
      the XER[SO] bug is not relevant, I have put the test in
      power9_idle_stop inside a feature section.  This means that
      pnv_power9_force_smt4_catch() WILL NOT WORK correctly on systems
      without the CPU_FTR_P9_TM_XER_SO_BUG feature bit set, and will
      probably hang the system.
      
      In order to cater for uses where the caller has an operation that
      has to be done while the core is in SMT4, the core continues to be
      kept in SMT4 after pnv_power9_force_smt4_catch() function returns,
      until the pnv_power9_force_smt4_release() function is called.
      It undoes the effect of step 1 above and allows the other threads
      to go into a stop state.
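      
      A compilable sketch of steps 1-4 above and the matching release,
      modelling the pacas with a plain array and the doorbell with a stub
      (all of this is illustrative, not the real powernv idle code):
      
        #include <stdatomic.h>
        
        #define THREADS_PER_CORE 4
        
        struct paca_model {
            atomic_int  dont_stop;        /* non-zero: don't enter stop  */
            atomic_long requested_psscr;  /* 0 means the thread is awake */
        };
        
        static struct paca_model paca[THREADS_PER_CORE];
        
        static void doorbell(int t) { (void)t; /* stub: would wake thread t */ }
        
        static void force_smt4_catch(int self)
        {
            int t, awake = 0;
        
            for (t = 0; t < THREADS_PER_CORE; t++)          /* step 1 */
                if (t != self)
                    atomic_fetch_add(&paca[t].dont_stop, 1);
            for (t = 0; t < THREADS_PER_CORE; t++)          /* step 2 */
                if (atomic_load(&paca[t].requested_psscr) == 0)
                    awake++;
            if (awake >= 3)
                return;                                     /* already SMT4 */
            for (t = 0; t < THREADS_PER_CORE; t++)          /* step 3 */
                if (atomic_load(&paca[t].requested_psscr) != 0)
                    doorbell(t);
            do {                                            /* step 4 */
                awake = 0;
                for (t = 0; t < THREADS_PER_CORE; t++)
                    if (atomic_load(&paca[t].requested_psscr) == 0)
                        awake++;
            } while (awake < 3);
        }
        
        static void force_smt4_release(int self)
        {
            int t;
        
            for (t = 0; t < THREADS_PER_CORE; t++)          /* undo step 1 */
                if (t != self)
                    atomic_fetch_sub(&paca[t].dont_stop, 1);
        }
      
      The caller brackets its SMT4-dependent operation between the catch
      and release calls, as the paragraph above describes.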
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc: Add CPU feature bits for TM bug workarounds on POWER9 v2.2 · b5af4f27
      Committed by Paul Mackerras
      This adds a CPU feature bit which is set for POWER9 "Nimbus" DD2.2
      processors which will be used to enable the hypervisor to assist
      hardware with the handling of checkpointed register values while the
      CPU is in suspend state, in order to work around hardware bugs.  The
      hardware assistance for these workarounds introduced a new hardware
      bug relating to the XER[SO] bit.  We add a separate feature bit for
      this bug in case future chips fix it while still requiring the
      hypervisor assistance with suspend state.
      
      When the dt_cpu_ftrs subsystem is in use, the software assistance can
      be enabled using a "tm-suspend-hypervisor-assist" node in the device
      tree, and a "tm-suspend-xer-so-bug" node enables the workarounds for
      the XER[SO] bug.  In the absence of such nodes, a quirk enables both
      for POWER9 "Nimbus" DD2.2 processors.
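      
      Schematically, the two bits might be declared and applied like this;
      the bit positions and the quirk helper are invented for illustration
      (only the feature names follow the descriptions above):
      
        /* Two separate bits, so a future chip can keep the hypervisor
         * assist for suspend state while dropping the XER[SO] erratum.
         * Positions are placeholders, not the real cputable.h values. */
        #define CPU_FTR_P9_TM_HV_ASSIST    (1UL << 58)
        #define CPU_FTR_P9_TM_XER_SO_BUG   (1UL << 59)
        
        /* Quirk: with no "tm-suspend-hypervisor-assist" /
         * "tm-suspend-xer-so-bug" nodes in the device tree, enable both
         * workarounds on POWER9 "Nimbus" DD2.2. */
        static void p9_dd22_quirk(unsigned long *features)
        {
            *features |= CPU_FTR_P9_TM_HV_ASSIST | CPU_FTR_P9_TM_XER_SO_BUG;
        }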
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc: Free up CPU feature bits on 64-bit machines · 9bbf0b57
      Committed by Paul Mackerras
      This moves all the CPU feature bits that are only used on 32-bit
      machines to the top 20 bits of the CPU feature word and arranges
      for them to be defined only in 32-bit builds.  The features that
      are common to 32-bit and 64-bit machines are moved to bits 0-11
      of the CPU feature word.  This means that for 64-bit platforms,
      bits 44-63 can now be used for new features that only exist on
      64-bit machines.  (These bit numbers are counting from the right,
      i.e. the LSB is bit 0.)
      
      Because CPU_FTR_L3_DISABLE_NAP moved from the low 16 bits to the high
      16 bits, we have to adjust some assembly code.  Also, CPU_FTR_EMB_HV
      moved from the high 16 bits to the low 16 bits.
      
      Note that CPU_FTR_REAL_LE only applies to 64-bit chips, because only
      64-bit chips (POWER6, 7, 8, 9) have a true little-endian mode that is
      a CPU execution mode as opposed to being a page attribute.
      
      With this we now have 20 free CPU feature bits on 64-bit machines.
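      
      The resulting split, sketched with invented feature names and
      positions (the real assignments live in cputable.h):
      
        /* Bits 0-11: features common to 32-bit and 64-bit machines. */
        #define CPU_FTR_EXAMPLE_COMMON   (1UL << 3)     /* illustrative */
        
        #ifdef CONFIG_PPC32
        /* 32-bit-only features sit in the top 20 bits of the (32-bit)
         * feature word and are not defined at all in 64-bit builds. */
        #define CPU_FTR_EXAMPLE_32ONLY   (1UL << 14)    /* illustrative */
        #else
        /* 64-bit builds: bits 44-63 are now free for 64-bit-only features. */
        #define CPU_FTR_EXAMPLE_64ONLY   (1UL << 44)    /* illustrative */
        #endif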
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc: Book E: Remove unused CPU_FTR_L2CSR bit · dd0efb3f
      Committed by Paul Mackerras
      The CPU_FTR_L2CSR bit is never tested anywhere, so let's reclaim the
      bit.
      
      The last usage was removed in 86d63363 ("powerpc/e500mc: Remove
      dead L2 flushing code in idle_e500.S") (Jun 2015).
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc: Use feature bit for RTC presence rather than timebase presence · c0d64cf9
      Committed by Paul Mackerras
      All PowerPC CPUs other than the original PPC601 have a timebase
      register rather than the "real-time clock" (RTC) register that the
      PPC601 (and the original POWER and POWER2 CPUs) had.  Currently
      we have a CPU feature bit to indicate the presence of the timebase,
      but it makes more sense to use a bit to indicate the unusual
      situation rather than the common situation.  This therefore defines
      a CPU_FTR_USE_RTC bit in place of the CPU_FTR_USE_TB bit, and
      arranges for it to be set on PPC601 systems.
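      
      The effect on a timekeeping path can be pictured like this (a hedged
      model; cpu_has_feature(), the bit value, and the read helpers are
      stand-ins, not actual kernel interfaces):
      
        #include <stdbool.h>
        
        #define CPU_FTR_USE_RTC  (1u << 4)              /* illustrative bit */
        static unsigned int cur_cpu_features;           /* set at boot      */
        
        static bool cpu_has_feature(unsigned int f) { return cur_cpu_features & f; }
        static unsigned long read_rtc(void)      { return 0; } /* PPC601 RTC */
        static unsigned long read_timebase(void) { return 0; } /* all others */
        
        static unsigned long get_ticks(void)
        {
            /* Flag the unusual case (PPC601) instead of making every
             * other CPU carry a CPU_FTR_USE_TB bit. */
            return cpu_has_feature(CPU_FTR_USE_RTC) ? read_rtc()
                                                    : read_timebase();
        }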
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  3. 20 Mar 2018, 2 commits
    • powerpc: Remove unused flush_dcache_phys_range() · 31513207
      Committed by Matt Brown
      The flush_dcache_phys_range() function is no longer used in the
      kernel. The last usage was removed in c40785ad ("powerpc/dart: Use
      a cachable DART").
      
      This patch removes the function and declaration.
      Signed-off-by: Matt Brown <matthew.brown.dev@gmail.com>
      [mpe: Munge change log, include commit that removed last user]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • lib/raid6/altivec: Add vpermxor implementation for raid6 Q syndrome · 751ba79c
      Committed by Matt Brown
      This patch uses the vpermxor instruction to optimise the raid6 Q
      syndrome. This instruction was made available with POWER8, ISA version
      2.07. It allows for both vperm and vxor instructions to be done in a
      single instruction. This has been tested for correctness on a ppc64le
      vm with a basic RAID6 setup containing 5 drives.
      
      The performance benchmarks are from the raid6test in the
      /lib/raid6/test directory. These results are from an IBM Firestone
      machine with ppc64le architecture. The benchmark results show a 35%
      speed increase over the best existing algorithm for powerpc (altivec).
      The raid6test has also been run on a big-endian ppc64 vm to ensure it
      also works for big-endian architectures.
      
      Performance benchmarks:
        raid6: altivecx4 gen() 18773 MB/s
        raid6: altivecx8 gen() 19438 MB/s
      
        raid6: vpermxor4 gen() 25112 MB/s
        raid6: vpermxor8 gen() 26279 MB/s
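      
      For reference, a scalar C model of what one vpermxor does per byte;
      the hardware applies this to 16 bytes at once, and the RAID6 code
      loads hi/lo with GF(2^8) nibble-lookup tables (table contents
      omitted here):
      
        #include <stdint.h>
        
        /* vpermxor vD,vA,vB,vC: for each byte of vC, the high nibble
         * selects a byte from vA and the low nibble selects a byte from
         * vB, and the two are XORed.  That folds the vperm lookups and
         * the vxor into a single operation. */
        static void vpermxor_model(uint8_t *dst, const uint8_t hi[16],
                                   const uint8_t lo[16],
                                   const uint8_t *src, int n)
        {
            for (int i = 0; i < n; i++)
                dst[i] = hi[src[i] >> 4] ^ lo[src[i] & 0x0f];
        }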
      Signed-off-by: Matt Brown <matthew.brown.dev@gmail.com>
      Reviewed-by: Daniel Axtens <dja@axtens.net>
      [mpe: Add VPERMXOR macro so we can build with old binutils]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  4. 13 Mar 2018, 22 commits
  5. 06 Mar 2018, 5 commits
    • powerpc/8xx: Increase number of slices to 64 · 4bd13772
      Committed by Christophe Leroy
      On the 8xx, the minimum slice size is the size of the area
      covered by a single PMD entry, i.e. 4M in 4K pages mode and 64M in
      16K pages mode.
      
      This patch increases the number of slices from 16 to 64 on the 8xx.
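      
      Those sizes follow from the PMD coverage; as a quick sanity check,
      assuming one page of 4-byte PTEs per PMD entry (an assumption of
      this sketch, not taken from the commit):
      
        #include <stdio.h>
        
        int main(void)
        {
            unsigned long p4k = 4096, p16k = 16384;
            /* a PMD entry covers (page_size / sizeof(pte)) * page_size */
            printf("4K mode:  %luM per PMD entry\n", (p4k / 4) * p4k >> 20);
            printf("16K mode: %luM per PMD entry\n", (p16k / 4) * p16k >> 20);
            return 0;
        }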
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/mm/slice: Allow up to 64 low slices · 15472423
      Committed by Christophe Leroy
      While the implementation of the "slices" address space allows
      a significant number of high slices, it limits the number of
      low slices to 16 due to the use of a single u64 low_slices_psize
      element in struct mm_context_t.
      
      On the 8xx, the minimum slice size is the size of the area
      covered by a single PMD entry, i.e. 4M in 4K pages mode and 64M in
      16K pages mode. This means we could have at least 64 slices.
      
      In order to remove this limitation, this patch switches the
      handling of low_slices_psize to a char array, as already done for
      high_slices_psize.
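      
      The shape of the change, sketched with 4 bits of page-size index per
      slice and two slices per byte (struct and macro names here are
      illustrative, not the real mm_context_t):
      
        /* Before: a single u64 caps the low slices at 16 (16 x 4 bits). */
        struct mm_context_before {
            unsigned long long low_slices_psize;
        };
        
        /* After: a byte array like high_slices_psize, allowing 64 slices. */
        #define LOW_SLICE_ARRAY_SZ  (64 * 4 / 8)        /* 32 bytes */
        struct mm_context_after {
            unsigned char low_slices_psize[LOW_SLICE_ARRAY_SZ];
        };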
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/mm/slice: Fix hugepage allocation at hint address on 8xx · aa0ab02b
      Committed by Christophe Leroy
      On the 8xx, the page size is set in the PMD entry and applies to
      all pages of the page table pointed by the said PMD entry.
      
      When an app has some regular pages allocated (e.g. see below) and tries
      to mmap() a huge page at a hint address covered by the same PMD entry,
      the kernel accepts the hint although the 8xx cannot handle different
      page sizes in the same PMD entry.
      
      10000000-10001000 r-xp 00000000 00:0f 2597 /root/malloc
      10010000-10011000 rwxp 00000000 00:0f 2597 /root/malloc
      
      mmap(0x10080000, 524288, PROT_READ|PROT_WRITE,
           MAP_PRIVATE|MAP_ANONYMOUS|0x40000, -1, 0) = 0x10080000
      
      This results in the app remaining forever in do_page_fault()/hugetlb_fault(),
      and when interrupting that app we get the following warning:
      
      [162980.035629] WARNING: CPU: 0 PID: 2777 at arch/powerpc/mm/hugetlbpage.c:354 hugetlb_free_pgd_range+0xc8/0x1e4
      [162980.035699] CPU: 0 PID: 2777 Comm: malloc Tainted: G W       4.14.6 #85
      [162980.035744] task: c67e2c00 task.stack: c668e000
      [162980.035783] NIP:  c000fe18 LR: c00e1eec CTR: c00f90c0
      [162980.035830] REGS: c668fc20 TRAP: 0700   Tainted: G W        (4.14.6)
      [162980.035854] MSR:  00029032 <EE,ME,IR,DR,RI>  CR: 24044224 XER: 20000000
      [162980.036003]
      [162980.036003] GPR00: c00e1eec c668fcd0 c67e2c00 00000010 c6869410 10080000 00000000 77fb4000
      [162980.036003] GPR08: ffff0001 0683c001 00000000 ffffff80 44028228 10018a34 00004008 418004fc
      [162980.036003] GPR16: c668e000 00040100 c668e000 c06c0000 c668fe78 c668e000 c6835ba0 c668fd48
      [162980.036003] GPR24: 00000000 73ffffff 74000000 00000001 77fb4000 100fffff 10100000 10100000
      [162980.036743] NIP [c000fe18] hugetlb_free_pgd_range+0xc8/0x1e4
      [162980.036839] LR [c00e1eec] free_pgtables+0x12c/0x150
      [162980.036861] Call Trace:
      [162980.036939] [c668fcd0] [c00f0774] unlink_anon_vmas+0x1c4/0x214 (unreliable)
      [162980.037040] [c668fd10] [c00e1eec] free_pgtables+0x12c/0x150
      [162980.037118] [c668fd40] [c00eabac] exit_mmap+0xe8/0x1b4
      [162980.037210] [c668fda0] [c0019710] mmput.part.9+0x20/0xd8
      [162980.037301] [c668fdb0] [c001ecb0] do_exit+0x1f0/0x93c
      [162980.037386] [c668fe00] [c001f478] do_group_exit+0x40/0xcc
      [162980.037479] [c668fe10] [c002a76c] get_signal+0x47c/0x614
      [162980.037570] [c668fe70] [c0007840] do_signal+0x54/0x244
      [162980.037654] [c668ff30] [c0007ae8] do_notify_resume+0x34/0x88
      [162980.037744] [c668ff40] [c000dae8] do_user_signal+0x74/0xc4
      [162980.037781] Instruction dump:
      [162980.037821] 7fdff378 81370000 54a3463a 80890020 7d24182e 7c841a14 712a0004 4082ff94
      [162980.038014] 2f890000 419e0010 712a0ff0 408200e0 <0fe00000> 54a9000a 7f984840 419d0094
      [162980.038216] ---[ end trace c0ceeca8e7a5800a ]---
      [162980.038754] BUG: non-zero nr_ptes on freeing mm: 1
      [162985.363322] BUG: non-zero nr_ptes on freeing mm: -1
      
      In order to fix this, this patch uses the address space "slices"
      implemented for BOOK3S/64 and enhanced to support PPC32 by the
      preceding patch.
      
      This patch modifies context.id on the 8xx to be in the range
      [1:16] instead of [0:15], so that context.id == 0 can identify
      an uninitialised context, as done on BOOK3S.
      
      This patch activates CONFIG_PPC_MM_SLICES when CONFIG_HUGETLB_PAGE is
      selected for the 8xx.
      
      Although we could in theory have as many slices as PMD entries, the
      current slices implementation limits the number of low slices to 16.
      This limitation does not prevent us from fixing the initial issue,
      although it is suboptimal.  It will be cured in a subsequent patch.
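      
      Conceptually, the slice-based fix makes the hint path refuse
      addresses whose slices are already in use with a different page
      size; a hedged sketch of that check (not the actual slice code,
      and 0 meaning "slice unused" is an assumption of this model):
      
        #include <stdbool.h>
        
        /* Accept a hinted address only if every slice it touches is
         * unused or already has the requested psize.  Assumes a
         * power-of-two slice_size. */
        static bool hint_fits(unsigned long addr, unsigned long len,
                              unsigned int psize,
                              const unsigned char *slice_psize,
                              unsigned long slice_size)
        {
            unsigned long a;
        
            for (a = addr & ~(slice_size - 1); a < addr + len; a += slice_size) {
                unsigned int cur = slice_psize[a / slice_size];
                if (cur != 0 && cur != psize)
                    return false;
            }
            return true;
        }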
      
      Fixes: 4b914286 ("powerpc/8xx: Implement support of hugepages")
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/mm/slice: Enhance for supporting PPC32 · db3a528d
      Committed by Christophe Leroy
      In preparation for the following patch, which will fix an issue on
      the 8xx by re-using the 'slices', this patch enhances the
      'slices' implementation to support 32-bit CPUs.
      
      On PPC32, the address space is limited to 4Gbytes, hence only the low
      slices will be used.
      
      The high slices use bitmaps. As bitmap functions are not prepared to
      handle bitmaps of size 0, this patch ensures that bitmap functions
      are called only when SLICE_NUM_HIGH is non-zero.
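      
      The guard boils down to making every bitmap call conditional, so a
      PPC32 build with SLICE_NUM_HIGH == 0 never touches a zero-sized
      bitmap.  A minimal standalone sketch (memset stands in for the
      kernel bitmap helpers; the value 512 is just an example):
      
        #include <string.h>
        
        #define SLICE_NUM_HIGH 0    /* PPC32 here; e.g. 512 on Book3S/64 */
        
        static unsigned long high_slices[SLICE_NUM_HIGH / 64 + 1];
        
        static void mark_all_high_slices(void)
        {
            /* bitmap helpers can't take a size-0 bitmap, so guard them;
             * the compiler drops the call entirely when the size is 0 */
            if (SLICE_NUM_HIGH)
                memset(high_slices, 0xff, SLICE_NUM_HIGH / 8);
        }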
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/mm/slice: create header files dedicated to slices · a3286f05
      Committed by Christophe Leroy
      In preparation for the following patch, which will enhance 'slices'
      to support PPC32 in order to fix an issue with hugepages on the 8xx,
      this patch takes all bits related to 'slices' out of page*.h and puts
      them into newly created slice.h header files.  While the common parts
      go into asm/slice.h, the subarch-specific parts go into the respective
      book3s/64/slice.h and nohash/64/slice.h.
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  6. 22 Feb 2018, 1 commit