1. 27 Mar, 2018 4 commits
  2. 23 Mar, 2018 5 commits
    • KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9 · 4bb3c7a0
      Authored by Paul Mackerras
      POWER9 has hardware bugs relating to transactional memory and thread
      reconfiguration (changes to hardware SMT mode).  Specifically, the core
      does not have enough storage to store a complete checkpoint of all the
      architected state for all four threads.  The DD2.2 version of POWER9
      includes hardware modifications designed to allow hypervisor software
      to implement workarounds for these problems.  This patch implements
      those workarounds in KVM code so that KVM guests see a full, working
      transactional memory implementation.
      
      The problems center around the use of TM suspended state, where the
      CPU has a checkpointed state but execution is not transactional.  The
      workaround is to implement a "fake suspend" state, which looks to the
      guest like suspended state but the CPU does not store a checkpoint.
      In this state, any instruction that would cause a transition to
      transactional state (rfid, rfebb, mtmsrd, tresume) or would use the
      checkpointed state (treclaim) causes a "soft patch" interrupt (vector
      0x1500) to the hypervisor so that it can be emulated.  The trechkpt
      instruction also causes a soft patch interrupt.
      
      On POWER9 DD2.2, we avoid returning to the guest in any state which
      would require a checkpoint to be present.  The trechkpt in the guest
      entry path which would normally create that checkpoint is replaced by
      either a transition to fake suspend state, if the guest is in suspend
      state, or a rollback to the pre-transactional state if the guest is in
      transactional state.  Fake suspend state is indicated by a flag in the
      PACA plus a new bit in the PSSCR.  The new PSSCR bit is write-only and
      reads back as 0.
      
      On exit from the guest, if the guest is in fake suspend state, we still
      do the treclaim instruction as we would in real suspend state, in order
      to get into non-transactional state, but we do not save the resulting
      register state since there was no checkpoint.
      
      Emulation of the instructions that cause a softpatch interrupt is
      handled in two paths.  If the guest is in real suspend mode, we call
      kvmhv_p9_tm_emulation_early() to handle the cases where the guest is
      transitioning to transactional state.  This is called before we do the
      treclaim in the guest exit path; because we haven't done treclaim, we
      can get back to the guest with the transaction still active.  If the
      instruction is a case that kvmhv_p9_tm_emulation_early() doesn't
      handle, or if the guest is in fake suspend state, then we proceed to
      do the complete guest exit path and subsequently call
      kvmhv_p9_tm_emulation() in host context with the MMU on.  This handles
      all the cases including the cases that generate program interrupts
      (illegal instruction or TM Bad Thing) and facility unavailable
      interrupts.
      
      The emulation is reasonably straightforward and is mostly concerned
      with checking for exception conditions and updating the state of
      registers such as MSR and CR0.  The treclaim emulation takes care to
      ensure that the TEXASR register gets updated as if it were the guest
      treclaim instruction that had done failure recording, not the treclaim
      done in hypervisor state in the guest exit path.
      
      With this, the KVM_CAP_PPC_HTM capability returns true (1) even if
      transactional memory is not available to host userspace.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
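The decisions described in the commit above can be modeled outside the kernel. The following is a minimal Python sketch, not the KVM code: the `TMState` enum, the returned strings, and the `handled_early` flag (whether the instruction is a case the early handler covers) are hypothetical names introduced only for illustration.

```python
from enum import Enum


class TMState(Enum):
    NON_TRANSACTIONAL = "non-transactional"
    SUSPENDED = "suspended"
    TRANSACTIONAL = "transactional"


def guest_entry_action(state: TMState) -> str:
    """What replaces trechkpt on guest entry on DD2.2, per the message."""
    if state is TMState.SUSPENDED:
        return "enter fake suspend"
    if state is TMState.TRANSACTIONAL:
        return "roll back to pre-transactional state"
    return "no checkpoint needed"


def softpatch_path(fake_suspend: bool, handled_early: bool) -> str:
    """Pick the emulation path for a soft patch interrupt.

    Real suspend plus a case the early handler covers stays on the early
    path; everything else takes the full guest exit path.
    """
    if not fake_suspend and handled_early:
        return "kvmhv_p9_tm_emulation_early"
    return "kvmhv_p9_tm_emulation"
```

For example, `softpatch_path(fake_suspend=True, handled_early=True)` still selects the full path, because the message states that fake suspend always goes through the complete guest exit.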
    • powerpc/powernv: Provide a way to force a core into SMT4 mode · 7672691a
      Authored by Paul Mackerras
      POWER9 processors up to and including "Nimbus" v2.2 have hardware
      bugs relating to transactional memory and thread reconfiguration.
      One of these bugs has a workaround which is to get the core into
      SMT4 state temporarily.  This workaround is only needed when
      running bare-metal.
      
      This patch provides a function which gets the core into SMT4 mode
      by preventing threads from going to a stop state, and waking up
      those which are already in a stop state.  Once at least 3 threads
      are not in a stop state, the core will be in SMT4 and we can
      continue.
      
      To do this, we add a "dont_stop" flag to the paca to tell the
      thread not to go into a stop state.  If this flag is set,
      power9_idle_stop() just returns immediately with a return value
      of 0.  The pnv_power9_force_smt4_catch() function does the following:
      
      1. Set the dont_stop flag for each thread in the core, except
         ourselves (in fact we use an atomic_inc() in case more than
         one thread is calling this function concurrently).
      2. See how many threads are awake, indicated by their
         requested_psscr field in the paca being 0.  If this is at
         least 3, we are done and the remaining steps are skipped.
      3. Send a doorbell interrupt to each thread that was seen as
         being in a stop state in step 2.
      4. Until at least 3 threads are awake, scan the threads to which
         we sent a doorbell interrupt and check if they are awake now.
      
      This relies on the following properties:
      
      - Once dont_stop is non-zero, requested_psscr can't go from zero to
        non-zero, except transiently (and without the thread doing stop).
      - requested_psscr being zero guarantees that the thread isn't in
        a state-losing stop state where thread reconfiguration could occur.
      - Doing stop with a PSSCR value of 0 won't be a state-losing stop
        and thus won't allow thread reconfiguration.
      - Once threads_per_core/2 + 1 (i.e. 3) threads are awake, the core
        must be in SMT4 mode, since SMT modes are powers of 2.
      
      This does add a sync to power9_idle_stop(), which is necessary to
      provide the correct ordering between setting requested_psscr and
      checking dont_stop.  The overhead of the sync should be unnoticeable
      compared to the latency of going into and out of a stop state.
      
      Because some objected to incurring this extra latency on systems where
      the XER[SO] bug is not relevant, I have put the test in
      power9_idle_stop inside a feature section.  This means that
      pnv_power9_force_smt4_catch() WILL NOT WORK correctly on systems
      without the CPU_FTR_P9_TM_XER_SO_BUG feature bit set, and will
      probably hang the system.
      
      In order to cater for uses where the caller has an operation that
      has to be done while the core is in SMT4, the core continues to be
      kept in SMT4 after pnv_power9_force_smt4_catch() function returns,
      until the pnv_power9_force_smt4_release() function is called.
      It undoes the effect of step 1 above and allows the other threads
      to go into a stop state.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
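The four steps in the commit above can be traced with a small userspace model. This is a hedged Python sketch, not the kernel implementation: the `Thread` class and both function names are stand-ins, the doorbell is assumed to wake a thread immediately, and the calling thread is assumed to count as awake.

```python
THREADS_PER_CORE = 4
AWAKE_NEEDED = THREADS_PER_CORE // 2 + 1  # 3 threads awake implies SMT4


class Thread:
    """Toy stand-in for the per-thread paca fields named in the message."""

    def __init__(self, requested_psscr=0):
        self.requested_psscr = requested_psscr  # 0 means "awake"
        self.dont_stop = 0

    def doorbell(self):
        # Model assumption: the doorbell wakes the thread at once, and a
        # non-zero dont_stop keeps it from stopping again.
        self.requested_psscr = 0


def force_smt4_catch(core, me):
    others = [t for i, t in enumerate(core) if i != me]
    # Step 1: tell every other thread not to stop (a counter, so that
    # concurrent callers nest).
    for t in others:
        t.dont_stop += 1
    # Step 2: count awake threads; the caller is running, so it counts.
    awake = 1 + sum(1 for t in others if t.requested_psscr == 0)
    if awake >= AWAKE_NEEDED:
        return awake
    # Step 3: poke the threads that were seen in a stop state.
    poked = [t for t in others if t.requested_psscr != 0]
    for t in poked:
        t.doorbell()
    # Step 4: rescan the poked threads until enough are awake.
    while awake < AWAKE_NEEDED:
        for t in list(poked):
            if t.requested_psscr == 0:
                poked.remove(t)
                awake += 1
    return awake


def force_smt4_release(core, me):
    # Undo step 1 so the other threads may stop again.
    for i, t in enumerate(core):
        if i != me:
            t.dont_stop -= 1
```

With three of four threads already awake, the model returns after step 2 and leaves the stopped thread alone, which matches the early exit described in the message.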
    • powerpc: Add CPU feature bits for TM bug workarounds on POWER9 v2.2 · b5af4f27
      Authored by Paul Mackerras
      This adds a CPU feature bit which is set for POWER9 "Nimbus" DD2.2
      processors which will be used to enable the hypervisor to assist
      hardware with the handling of checkpointed register values while the
      CPU is in suspend state, in order to work around hardware bugs.  The
      hardware assistance for these workarounds introduced a new hardware
      bug relating to the XER[SO] bit.  We add a separate feature bit for
      this bug in case future chips fix it while still requiring the
      hypervisor assistance with suspend state.
      
      When the dt_cpu_ftrs subsystem is in use, the software assistance can
      be enabled using a "tm-suspend-hypervisor-assist" node in the device
      tree, and a "tm-suspend-xer-so-bug" node enables the workarounds for
      the XER[SO] bug.  In the absence of such nodes, a quirk enables both
      for POWER9 "Nimbus" DD2.2 processors.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc: Free up CPU feature bits on 64-bit machines · 9bbf0b57
      Authored by Paul Mackerras
      This moves all the CPU feature bits that are only used on 32-bit
      machines to the top 20 bits of the CPU feature word and arranges
      for them to be defined only in 32-bit builds.  The features that
      are common to 32-bit and 64-bit machines are moved to bits 0-11
      of the CPU feature word.  This means that for 64-bit platforms,
      bits 44-63 can now be used for new features that only exist on
      64-bit machines.  (These bit numbers are counting from the right,
      i.e. the LSB is bit 0.)
      
      Because CPU_FTR_L3_DISABLE_NAP moved from the low 16 bits to the high
      16 bits, we have to adjust some assembly code.  Also, CPU_FTR_EMB_HV
      moved from the high 16 bits to the low 16 bits.
      
      Note that CPU_FTR_REAL_LE only applies to 64-bit chips, because only
      64-bit chips (POWER6, 7, 8, 9) have a true little-endian mode that is
      a CPU execution mode as opposed to being a page attribute.
      
      With this we now have 20 free CPU feature bits on 64-bit machines.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
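The bit ranges quoted in the commit above are easy to check with mask arithmetic. The Python sketch below only restates those ranges; the helper and constant names are made up for illustration, and it assumes the feature word is 32 bits wide on 32-bit builds, so that its "top 20 bits" are bits 12-31.

```python
def bit_range_mask(lo: int, hi: int) -> int:
    """Mask with bits lo..hi set, counting from the LSB as bit 0."""
    return ((1 << (hi - lo + 1)) - 1) << lo


COMMON_MASK = bit_range_mask(0, 11)        # shared by 32-bit and 64-bit
ONLY_32BIT_MASK = bit_range_mask(12, 31)   # top 20 bits of a 32-bit word
FREE_64BIT_MASK = bit_range_mask(44, 63)   # newly free on 64-bit machines

print(hex(COMMON_MASK), hex(ONLY_32BIT_MASK), hex(FREE_64BIT_MASK))
print(bin(FREE_64BIT_MASK).count("1"), "free bits on 64-bit")
```

The count of set bits in `FREE_64BIT_MASK` is 20, which agrees with the closing sentence of the message.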
    • powerpc: Use feature bit for RTC presence rather than timebase presence · c0d64cf9
      Authored by Paul Mackerras
      All PowerPC CPUs other than the original PPC601 have a timebase
      register rather than the "real-time clock" (RTC) register that the
      PPC601 (and the original POWER and POWER2 CPUs) had.  Currently
      we have a CPU feature bit to indicate the presence of the timebase,
      but it makes more sense to use a bit to indicate the unusual
      situation rather than the common situation.  This therefore defines
      a CPU_FTR_USE_RTC bit in place of the CPU_FTR_USE_TB bit, and
      arranges for it to be set on PPC601 systems.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  3. 20 Mar, 2018 2 commits
  4. 14 Mar, 2018 1 commit
  5. 13 Mar, 2018 6 commits
  6. 06 Mar, 2018 2 commits
    • powerpc/mm/slice: Allow up to 64 low slices · 15472423
      Authored by Christophe Leroy
      While the implementation of the "slices" address space allows
      a significant number of high slices, it limits the number of
      low slices to 16 due to the use of a single u64 low_slices_psize
      element in struct mm_context_t.
      
      On the 8xx, the minimum slice size is the size of the area
      covered by a single PMD entry, i.e. 4M in 4K pages mode and 64M in
      16K pages mode. This means we could have at least 64 slices.
      
      In order to overcome this limitation, this patch switches the
      handling of low_slices_psize to a char array, as is already done
      for high_slices_psize.
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
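To see why a single u64 caps the count at 16 and a byte array does not, here is a rough Python sketch. It assumes each slice stores a 4-bit page-size index (64 bits / 16 slices), packed two per byte in the array form; the function names are hypothetical and the kernel's actual layout may differ in detail.

```python
BITS_PER_SLICE = 4  # assumption: 64-bit word / 16 slices


def set_psize(arr: bytearray, slice_index: int, psize: int) -> None:
    """Store a 4-bit page-size index for one slice, two slices per byte."""
    assert 0 <= psize <= 0xF
    byte, high = divmod(slice_index, 2)
    if high:
        arr[byte] = (arr[byte] & 0x0F) | (psize << 4)
    else:
        arr[byte] = (arr[byte] & 0xF0) | psize


def get_psize(arr: bytearray, slice_index: int) -> int:
    byte, high = divmod(slice_index, 2)
    return (arr[byte] >> 4) if high else (arr[byte] & 0x0F)


# A u64 holds 64 // 4 = 16 entries; 64 slices need 64 * 4 // 8 = 32 bytes.
low_slices_psize = bytearray(64 * BITS_PER_SLICE // 8)
set_psize(low_slices_psize, 63, 5)
print(get_psize(low_slices_psize, 63), len(low_slices_psize))
```

The array can grow with the slice count, whereas the single word cannot, which is the limitation this commit lifts.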
    • powerpc/mm/slice: Fix hugepage allocation at hint address on 8xx · aa0ab02b
      Authored by Christophe Leroy
      On the 8xx, the page size is set in the PMD entry and applies to
      all pages of the page table pointed to by that PMD entry.
      
      When an app has some regular pages allocated (e.g. see below) and tries
      to mmap() a huge page at a hint address covered by the same PMD entry,
      the kernel accepts the hint although the 8xx cannot handle different
      page sizes in the same PMD entry.
      
      10000000-10001000 r-xp 00000000 00:0f 2597 /root/malloc
      10010000-10011000 rwxp 00000000 00:0f 2597 /root/malloc
      
      mmap(0x10080000, 524288, PROT_READ|PROT_WRITE,
           MAP_PRIVATE|MAP_ANONYMOUS|0x40000, -1, 0) = 0x10080000
      
      This results in the app remaining forever in do_page_fault()/hugetlb_fault()
      and, when interrupting that app, we get the following warning:
      
      [162980.035629] WARNING: CPU: 0 PID: 2777 at arch/powerpc/mm/hugetlbpage.c:354 hugetlb_free_pgd_range+0xc8/0x1e4
      [162980.035699] CPU: 0 PID: 2777 Comm: malloc Tainted: G W       4.14.6 #85
      [162980.035744] task: c67e2c00 task.stack: c668e000
      [162980.035783] NIP:  c000fe18 LR: c00e1eec CTR: c00f90c0
      [162980.035830] REGS: c668fc20 TRAP: 0700   Tainted: G W        (4.14.6)
      [162980.035854] MSR:  00029032 <EE,ME,IR,DR,RI>  CR: 24044224 XER: 20000000
      [162980.036003]
      [162980.036003] GPR00: c00e1eec c668fcd0 c67e2c00 00000010 c6869410 10080000 00000000 77fb4000
      [162980.036003] GPR08: ffff0001 0683c001 00000000 ffffff80 44028228 10018a34 00004008 418004fc
      [162980.036003] GPR16: c668e000 00040100 c668e000 c06c0000 c668fe78 c668e000 c6835ba0 c668fd48
      [162980.036003] GPR24: 00000000 73ffffff 74000000 00000001 77fb4000 100fffff 10100000 10100000
      [162980.036743] NIP [c000fe18] hugetlb_free_pgd_range+0xc8/0x1e4
      [162980.036839] LR [c00e1eec] free_pgtables+0x12c/0x150
      [162980.036861] Call Trace:
      [162980.036939] [c668fcd0] [c00f0774] unlink_anon_vmas+0x1c4/0x214 (unreliable)
      [162980.037040] [c668fd10] [c00e1eec] free_pgtables+0x12c/0x150
      [162980.037118] [c668fd40] [c00eabac] exit_mmap+0xe8/0x1b4
      [162980.037210] [c668fda0] [c0019710] mmput.part.9+0x20/0xd8
      [162980.037301] [c668fdb0] [c001ecb0] do_exit+0x1f0/0x93c
      [162980.037386] [c668fe00] [c001f478] do_group_exit+0x40/0xcc
      [162980.037479] [c668fe10] [c002a76c] get_signal+0x47c/0x614
      [162980.037570] [c668fe70] [c0007840] do_signal+0x54/0x244
      [162980.037654] [c668ff30] [c0007ae8] do_notify_resume+0x34/0x88
      [162980.037744] [c668ff40] [c000dae8] do_user_signal+0x74/0xc4
      [162980.037781] Instruction dump:
      [162980.037821] 7fdff378 81370000 54a3463a 80890020 7d24182e 7c841a14 712a0004 4082ff94
      [162980.038014] 2f890000 419e0010 712a0ff0 408200e0 <0fe00000> 54a9000a 7f984840 419d0094
      [162980.038216] ---[ end trace c0ceeca8e7a5800a ]---
      [162980.038754] BUG: non-zero nr_ptes on freeing mm: 1
      [162985.363322] BUG: non-zero nr_ptes on freeing mm: -1
      
      In order to fix this, this patch uses the address space "slices"
      implemented for BOOK3S/64 and enhanced to support PPC32 by the
      preceding patch.
      
      This patch modifies the context.id on the 8xx to be in the range
      [1:16] instead of [0:15] in order to identify context.id == 0 as
      an uninitialised context, as done on BOOK3S.
      
      This patch activates CONFIG_PPC_MM_SLICES when CONFIG_HUGETLB_PAGE is
      selected for the 8xx.
      
      Although we could in theory have as many slices as PMD entries, the
      current slices implementation limits the number of low slices to 16.
      This limitation does not prevent us from fixing the initial issue,
      although it is suboptimal. It will be cured in a subsequent patch.
      
      Fixes: 4b914286 ("powerpc/8xx: Implement support of hugepages")
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
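The conflict in the example above can be checked by hand: the regular mappings and the huge page hint fall under the same PMD entry. The Python sketch below does that arithmetic, assuming 4K pages mode where one PMD entry covers 4M (as stated in commit 15472423). It is only an illustration of the overlap; the actual fix relies on the slices mechanism, and the function names here are hypothetical.

```python
PMD_SHIFT = 22  # assumption: 4K pages mode, one PMD entry covers 4M


def pmd_indexes(start: int, length: int) -> set:
    """All PMD entry indexes touched by [start, start + length)."""
    first = start >> PMD_SHIFT
    last = (start + length - 1) >> PMD_SHIFT
    return set(range(first, last + 1))


def hint_shares_pmd(regular_mappings, hint: int, length: int) -> bool:
    """True if the hinted range shares a PMD entry with a regular mapping."""
    wanted = pmd_indexes(hint, length)
    return any(pmd_indexes(start, end - start) & wanted
               for start, end in regular_mappings)


# Mappings and mmap() arguments taken from the commit message.
regular = [(0x10000000, 0x10001000), (0x10010000, 0x10011000)]
print(hint_shares_pmd(regular, 0x10080000, 524288))
```

All three ranges map to PMD index 64, so a huge page at that hint would have to coexist with 4K pages under one PMD entry, which the 8xx cannot do.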
  7. 22 Feb, 2018 1 commit
    • powerpc/pseries: Revert support for ibm,drc-info devtree property · c7a3275e
      Authored by Michael Bringmann
      This reverts commit 02ef6dd8.
      
      The earlier patch tried to enable support for a new property
      "ibm,drc-info" on powerpc systems.
      
      Unfortunately, some errors in the associated patch set break things
      in some of the DLPAR operations.  In particular when attempting to
      hot-add a new CPU or set of CPUs, the original patch failed to
      properly calculate the available resources, and aborted the operation.
      In addition, the original set missed several opportunities to compress
      and reuse common code.
      
      As the associated patch set was meant to provide an optimization of
      storage and performance of a set of device-tree properties for future
      systems with large amounts of resources, reverting just restores
      the previous behavior for existing systems.  It seems unnecessary
      to enable this feature and introduce the consequent problems in the
      field that it will cause at this time, so please revert it for now
      until testing of the corrections is finished properly.
      
      Fixes: 02ef6dd8 ("powerpc: Enable support for ibm,drc-info devtree property")
      Signed-off-by: Michael W. Bringmann <mwb@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  8. 21 Feb, 2018 1 commit
  9. 15 Feb, 2018 1 commit
  10. 12 Feb, 2018 1 commit
    • vfs: do bulk POLL* -> EPOLL* replacement · a9a08845
      Authored by Linus Torvalds
      This is the mindless scripted replacement of kernel use of POLL*
      variables as described by Al, done by this script:
      
          for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
              L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
              for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
          done
      
      with de-mangling cleanups yet to come.
      
      NOTE! On almost all architectures, the EPOLL* constants have the same
      values as the POLL* constants do.  But the keyword here is "almost".
      For various bad reasons they aren't the same, and epoll() doesn't
      actually work quite correctly in some cases due to this on Sparc et al.
      
      The next patch from Al will sort out the final differences, and we
      should be all done.
      Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
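The per-line substitution done by the sed expression in the commit above can be mirrored in Python to see what it does and does not touch. This is a sketch of the regular expression only, with `\b` standing in for the `\<` and `\>` word anchors of GNU sed; the file selection via `git grep` is not reproduced, and `convert_line` is a hypothetical name.

```python
import re

POLL_SUFFIXES = ("IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND "
                 "HUP RDHUP NVAL MSG").split()


def convert_line(line: str) -> str:
    """Prefix E to POLL* names that appear before the first double quote."""
    for v in POLL_SUFFIXES:
        # Anchored at line start, so at most one match per constant per line.
        pattern = r'^([^"]*)(\bPOLL' + v + r'\b)'
        line = re.sub(pattern, r'\g<1>E\g<2>', line)
    return line


print(convert_line("mask |= POLLIN | POLLRDNORM;"))
print(convert_line('pr_debug("POLLIN set");'))
```

Because the prefix group excludes double quotes, names inside string literals are left alone. The anchored, non-global match also means that a line naming the same constant twice gets only one of them converted in a single pass of this sketch.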
  11. 11 Feb, 2018 1 commit
  12. 08 Feb, 2018 1 commit
    • powerpc/64s: Fix may_hard_irq_enable() for PMI soft masking · 6cc3f91b
      Authored by Nicholas Piggin
      The soft IRQ masking code has to hard-disable interrupts in cases
      where the exception is not cleared by the masked handler. External
      interrupts used this approach for soft masking. PMU interrupts now
      do the same thing.
      
      The soft IRQ masking code additionally allowed for interrupt handlers
      to hard-enable interrupts after soft-disabling them. The idea is to
      allow PMU interrupts through to profile interrupt handlers.
      
      So when interrupts are being replayed while there is a pending
      interrupt that requires hard-disabling, there is a test to prevent
      those handlers from hard-enabling them if there is a pending external
      interrupt. may_hard_irq_enable() handles this.
      
      After f442d004 ("powerpc/64s: Add support to mask perf interrupts
      and replay them"), may_hard_irq_enable() could prematurely enable
      MSR[EE] when a PMU exception exists, which would result in the
      interrupt firing again while masked, and MSR[EE] being disabled again.
      
      I haven't seen that this could cause a serious problem, but it's
      more consistent to handle these soft-masked interrupts in the same
      way. So introduce a define for all types of interrupts that require
      MSR[EE] masking in their soft-disable handlers, and use that in
      may_hard_irq_enable().
      
      Fixes: f442d004 ("powerpc/64s: Add support to mask perf interrupts and replay them")
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Reviewed-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
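The change in the commit above amounts to testing a combined mask instead of a single bit. The Python sketch below illustrates the idea; all names and bit values are hypothetical placeholders chosen for this example and are not taken from the kernel headers.

```python
# Hypothetical pending-interrupt bits, for illustration only.
IRQ_EE = 0x04    # external interrupt pending
IRQ_DEC = 0x08   # decrementer pending
IRQ_PMI = 0x40   # performance monitor interrupt pending

# One define for every interrupt type whose soft-disable handler
# has to clear MSR[EE].
IRQ_MUST_HARD_MASK = IRQ_EE | IRQ_PMI


def old_may_hard_irq_enable(irq_happened: int) -> bool:
    """Earlier behaviour as described: only external interrupts block."""
    return (irq_happened & IRQ_EE) == 0


def may_hard_irq_enable(irq_happened: int) -> bool:
    """After the fix: any must-hard-mask interrupt blocks hard-enabling."""
    return (irq_happened & IRQ_MUST_HARD_MASK) == 0
```

With a PMU interrupt pending, `old_may_hard_irq_enable(IRQ_PMI)` allows hard-enabling while `may_hard_irq_enable(IRQ_PMI)` does not, which is the premature MSR[EE] enable the message describes.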
  13. 28 Jan, 2018 3 commits
  14. 27 Jan, 2018 5 commits
  15. 23 Jan, 2018 3 commits
  16. 22 Jan, 2018 3 commits
    • powerpc/pseries, ps3: panic flush kernel messages before halting system · 35adacd6
      Authored by Nicholas Piggin
      Platforms with a panic handler that halts the system can have problems
      getting kernel messages out, because the panic notifiers are called
      before kernel/panic.c does its flushing of printk buffers and console
      etc.
      
      Commit a3b2cb30 ("powerpc: Do not call ppc_md.panic in fadump panic
      notifier") attempted to solve this, but that wasn't the right
      approach and caused other problems, and it was reverted by commit
      ab9dbf77.
      
      Instead, the powernv shutdown paths have already had a similar
      problem, fixed by taking the message flushing sequence from
      kernel/panic.c. That's a little bit ugly, but while we have the code
      duplicated, it will work for this case as well. So have ppc panic
      handlers do the same flushing before they terminate.
      
      Without this patch, a qemu pseries_le_defconfig guest stops silently
      when issued the nmi command when xmon is off and no crash dumpers are
      enabled. With this patch, an oops is printed by each CPU as expected.
      
      Fixes: ab9dbf77 ("Revert "powerpc: Do not call ppc_md.panic in fadump panic notifier"")
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/tm: Fix endianness flip on trap · 1c200e63
      Authored by Gustavo Romero
      Currently it's possible that a thread on PPC64 LE has its endianness
      flipped inadvertently to Big-Endian resulting in a crash once the process
      is back from the signal handler.
      
      If giveup_all() is called when regs->msr has the bits MSR.FP and MSR.VEC
      disabled (and hence MSR.VSX disabled too) it returns without calling
      check_if_tm_restore_required() which copies regs->msr to ckpt_regs->msr if
      the process caught a signal whilst in transactional mode. Then once in
      setup_tm_sigcontexts() MSR from ckpt_regs.msr is used, but since
      check_if_tm_restore_required() was not called previously, gp_regs[PT_MSR]
      gets a copy of invalid MSR bits as MSR in ckpt_regs was not updated from
      regs->msr and so is zeroed. Later when leaving the signal handler once in
      sys_rt_sigreturn() the TS bits of gp_regs[PT_MSR] are checked to determine
      if restore_tm_sigcontexts() must be called to pull in the correct MSR state
      into the user context. Because the TS bits are zeroed,
      restore_tm_sigcontexts() is never called and MSR restored from the user
      context on returning from the signal handler has the MSR.LE (the endianness
      bit) forced to zero (Big-Endian). That leads, for instance, to 'nop' being
      treated as an illegal instruction in the following sequence:
      
      	tbegin.
      	beq	1f
      	trap
      	tend.
      1:	nop
      
      on PPC64 LE machines and the process dies just after returning from the
      signal handler.
      
      PPC64 BE is also affected but in a subtle way since forcing Big-Endian on
      a BE machine does not change the endianness.
      
      This commit fixes the issue described above by ensuring that once in
      setup_tm_sigcontexts() the MSR used is from regs->msr instead of from
      ckpt_regs->msr and by ensuring that we pull in only the MSR.FP, MSR.VEC,
      and MSR.VSX bits from ckpt_regs->msr.
      
      The fix was tested both on LE and BE machines and no regression regarding
      the powerpc/tm selftests was observed.
      Signed-off-by: Gustavo Romero <gromero@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
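The effect of the fix in the commit above can be shown with plain bit arithmetic. In this Python sketch the function names are hypothetical, and the MSR bit positions (LE at bit 0, FP at 13, VSX at 23, VEC at 25, TS at bits 33-34) are stated as assumptions about the 64-bit MSR layout rather than copied from the patch. Combining the two sources with a bitwise OR is likewise one reading of "pull in only" those bits.

```python
# Assumed MSR bit positions, counting from the LSB.
MSR_LE = 1 << 0      # little-endian mode
MSR_FP = 1 << 13
MSR_VSX = 1 << 23
MSR_VEC = 1 << 25
MSR_TS_S = 1 << 33   # transaction state: suspended
MSR_TS_T = 1 << 34   # transaction state: transactional


def old_sigcontext_msr(regs_msr: int, ckpt_msr: int) -> int:
    """Before the fix: the MSR came from ckpt_regs, which may be stale."""
    return ckpt_msr


def sigcontext_msr(regs_msr: int, ckpt_msr: int) -> int:
    """After the fix: start from regs->msr and take only FP/VEC/VSX
    from ckpt_regs->msr."""
    return regs_msr | (ckpt_msr & (MSR_FP | MSR_VEC | MSR_VSX))


regs = MSR_LE | MSR_TS_T   # LE thread that trapped inside a transaction
ckpt = 0                   # ckpt_regs->msr never updated, hence zero
print(hex(old_sigcontext_msr(regs, ckpt)), hex(sigcontext_msr(regs, ckpt)))
```

In the failing scenario the old value is zero, so both the LE bit and the TS bits are lost, while the fixed value keeps them.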
    • powerpc: Expose TSCR via sysfs · b6d34eb4
      Authored by Anton Blanchard
      The thread switch control register (TSCR) is a per-core register
      that configures how the CPU shares resources between SMT threads.
      
      Exposing it via sysfs allows us to tune it at run time.
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>