1. 18 July 2017, 1 commit
    • powerpc/mm: Mark __init memory no-execute when STRICT_KERNEL_RWX=y · 029d9252
      Authored by Michael Ellerman
      Currently even with STRICT_KERNEL_RWX we leave the __init text marked
      executable after init, which is bad.
      
      Add a hook to mark it NX (no-execute) before we free it, and implement
      it for radix and hash.
      
      Note that we use __init_end as the end address, not _einittext, to
      match overlaps_kernel_text(), which also uses __init_end because there
      are additional executable sections other than .init.text between
      __init_begin and __init_end.
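
      In outline the hook ends up looking something like the sketch below
      (a minimal sketch: the exact function names and file placement are
      assumptions based on the description above, not a quote of the patch):

        void free_initmem(void)
        {
                mark_initmem_nx();      /* arch hook: clear execute permission
                                         * on __init_begin..__init_end */
                free_initmem_default(POISON_FREE_INITMEM);
        }

        /* Book3S 64 dispatcher, picking the radix or hash implementation: */
        void mark_initmem_nx(void)
        {
                if (radix_enabled())
                        radix__mark_initmem_nx();
                else
                        hash__mark_initmem_nx();
        }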
      
      Tested on radix and hash with:
      
        0:mon> p $__init_begin
        *** 400 exception occurred
      
      Fixes: 1e0fc9d1 ("powerpc/Kconfig: Enable STRICT_KERNEL_RWX for some configs")
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  2. 12 July 2017, 1 commit
    • powerpc/64: Fix atomic64_inc_not_zero() to return an int · 01e6a61a
      Authored by Michael Ellerman
      Although it's not documented anywhere, there is an expectation that
      atomic64_inc_not_zero() returns a result which fits in an int. This is
      the behaviour implemented on all arches except powerpc.
      
      This has caused at least one bug in practice, in the percpu-refcount
      code, where the long result from our atomic64_inc_not_zero() was
      truncated to an int leading to lost references and stuck systems. That
      was worked around in that code in commit 966d2b04 ("percpu-refcount:
      fix reference leak during percpu-atomic transition").
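
      As a standalone illustration of the truncation hazard (this is not the
      percpu-refcount code itself, just a sketch of the failure mode):

        #include <stdio.h>

        /* Stand-in for the old powerpc atomic64_inc_not_zero(): it returned
         * the new 64-bit value (non-zero meaning success), not a 0/1 int. */
        static long long inc_not_zero(long long *v)
        {
                if (*v == 0)
                        return 0;
                return ++*v;
        }

        int main(void)
        {
                long long refcount = 0xffffffffLL;  /* low 32 bits all ones */
                int ok = inc_not_zero(&refcount);   /* new value is 0x100000000 */

                /* Truncated to an int, the non-zero result becomes 0, so a
                 * caller testing "if (!ok)" wrongly concludes the increment
                 * failed and drops a reference. */
                printf("ok = %d, refcount = %#llx\n", ok, refcount);
                return 0;
        }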
      
      To the best of my grepping abilities there are no other callers
      in-tree which truncate the value, but we should fix it anyway. Because
      the breakage is subtle and potentially very harmful I'm also tagging
      it for stable.
      
      Code generation is largely unaffected because in most cases the
      callers are just using the result for a test anyway. In particular the
      case of fget() that was mentioned in commit a6cf7ed5
      ("powerpc/atomic: Implement atomic*_inc_not_zero") generates exactly
      the same code.
      
      Fixes: a6cf7ed5 ("powerpc/atomic: Implement atomic*_inc_not_zero")
      Cc: stable@vger.kernel.org # v3.4
      Noticed-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  3. 11 July 2017, 1 commit
  4. 10 July 2017, 1 commit
  5. 07 July 2017, 1 commit
  6. 04 July 2017, 1 commit
  7. 03 July 2017, 2 commits
    • powerpc64/elfv1: Only dereference function descriptor for non-text symbols · 83e840c7
      Authored by Naveen N. Rao
      Currently, we assume that the function pointer we receive in
      ppc_function_entry() points to a function descriptor. However, this is
      not always the case. In particular, assembly symbols without the right
      annotation do not have an associated function descriptor. Some of these
      symbols are added to the kprobe blacklist using _ASM_NOKPROBE_SYMBOL().
      
      When such addresses are subsequently processed through
      arch_deref_entry_point() in populate_kprobe_blacklist(), we see the
      below errors during bootup:
          [    0.663963] Failed to find blacklist at 7d9b02a648029b6c
          [    0.663970] Failed to find blacklist at a14d03d0394a0001
          [    0.663972] Failed to find blacklist at 7d5302a6f94d0388
          [    0.663973] Failed to find blacklist at 48027d11e8610178
          [    0.663974] Failed to find blacklist at f8010070f8410080
          [    0.663976] Failed to find blacklist at 386100704801f89d
          [    0.663977] Failed to find blacklist at 7d5302a6f94d00b0
      
      Fix this by checking if the function pointer we receive in
      ppc_function_entry() already points to kernel text. If so, we just
      return it as is. If not, we assume that this is a function descriptor
      and proceed to dereference it.
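
      In outline (a hedged sketch based on the description above, not the
      literal patch; kernel_text_address() and func_descr_t are existing
      kernel/powerpc facilities):

        static inline unsigned long ppc_function_entry(void *func)
        {
                /* Addresses that already point into kernel text (e.g. asm
                 * symbols without a descriptor) are returned unchanged. */
                if (kernel_text_address((unsigned long)func))
                        return (unsigned long)func;

                /* Otherwise assume an ELFv1 function descriptor and return
                 * the entry point it describes. */
                return ((func_descr_t *)func)->entry;
        }
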
      Suggested-by: Nicholas Piggin <npiggin@gmail.com>
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • cxl: Export library to support IBM XSL · 3ced8d73
      Authored by Christophe Lombard
      This patch exports an in-kernel 'library' API which can be called by
      other drivers to help them interact with an IBM XSL on a POWER9 system.
      
      The XSL (Translation Service Layer) is a stripped down version of the
      PSL (Power Service Layer) used in some cards such as the Mellanox CX5.
      Like the PSL, it implements the CAIA architecture, but has a number
      of differences, mostly in its implementation-dependent registers.
      
      The XSL also uses a special DMA cxl mode, which uses a slightly
      different init sequence for the CAPP and PHB.
      Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
      Signed-off-by: Christophe Lombard <clombard@linux.vnet.ibm.com>
      Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  8. 02 July 2017, 3 commits
  9. 01 July 2017, 1 commit
    • KVM: PPC: Book3S HV: Simplify dynamic micro-threading code · 898b25b2
      Authored by Paul Mackerras
      Since commit b009031f ("KVM: PPC: Book3S HV: Take out virtual
      core piggybacking code", 2016-09-15), we only have at most one
      vcore per subcore.  Previously, the fact that there might be more
      than one vcore per subcore meant that we had the notion of a
      "master vcore", which was the vcore that controlled thread 0 of
      the subcore.  We also needed a list per subcore in the core_info
      struct to record which vcores belonged to each subcore.  Now that
      there can only be one vcore in the subcore, we can replace the
      list with a simple pointer and get rid of the notion of the
      master vcore (and in fact treat every vcore as a master vcore).
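
      The resulting data-structure change is roughly as follows (a sketch
      only; field names are approximate, not quoted from the patch):

        struct core_info {
                int             n_subcores;
                int             max_subcore_threads;
                int             total_threads;
                int             subcore_threads[MAX_SUBCORES];
                /* was: struct list_head vcs[MAX_SUBCORES]; with at most one
                 * vcore per subcore a plain pointer suffices (the unused
                 * subcore_vm[] field mentioned below is dropped as well): */
                struct kvmppc_vcore *vc[MAX_SUBCORES];
        };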
      
      We can also get rid of the subcore_vm[] field in the core_info
      struct since it is never read.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  10. 29 June 2017, 1 commit
  11. 28 June 2017, 3 commits
  12. 27 June 2017, 1 commit
  13. 26 June 2017, 1 commit
    • powerpc/32: Avoid miscompilation w/GCC 4.6.3 - don't inline copy_to/from_user() · d6bd8194
      Authored by Michael Ellerman
      Larry Finger reported that his Powerbook G4 was no longer booting with v4.12-rc:
      userspace came up but gave weird errors such as:
      
        udevd[64]: starting version 175
        udevd[64]: Unable to receive ctrl message: Bad address.
        modprobe: chdir(4.12-rc1): No such file or directory
      
      He bisected the problem to commit 3448890c ("powerpc: get rid of zeroing,
      switch to RAW_COPY_USER").
      
      Al identified that the problem is actually a miscompilation by GCC 4.6.3, which
      is exposed by the above commit.
      
      Al also pointed out that inlining copy_to/from_user() is probably of little or
      no benefit, which is correct. Using Anton's copy_to_user benchmark, with a
      pathological single byte copy, we see a small increase in performance
      by *removing* inlining:
      
        Before (inlined):
        # time ./copy_to_user -w -l 1 -i 10000000	( x 3 )
        real	0m22.063s
        real	0m22.059s
        real	0m22.076s
      
        After:
        # time ./copy_to_user -w -l 1 -i 10000000	( x 3 )
        real	0m21.325s
        real	0m21.299s
        real	0m21.364s
      
      So as a small performance improvement and to avoid the miscompilation, drop
      inlining copy_to/from_user() on 32-bit.
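
      The shape of the change is roughly the following (a sketch only; the
      exact header/file split is an assumption, not a quote of the patch):

        /* Before: an inline wrapper in uaccess.h, which GCC 4.6.3 could
         * miscompile at some call sites on 32-bit. */
        static inline unsigned long
        raw_copy_to_user(void __user *to, const void *from, unsigned long n)
        {
                return __copy_tofrom_user(to, from, n);
        }

        /* After: only a declaration in the header; the definition moves out
         * of line into a .c file, so 32-bit callers just branch to it. */
        unsigned long raw_copy_to_user(void __user *to, const void *from,
                                       unsigned long n);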
      
      Fixes: 3448890c ("powerpc: get rid of zeroing, switch to RAW_COPY_USER")
      Reported-by: Larry Finger <Larry.Finger@lwfinger.net>
      Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  14. 23 June 2017, 2 commits
  15. 22 June 2017, 1 commit
    • KVM: PPC: Book3S HV: Exit guest upon MCE when FWNMI capability is enabled · e20bbd3d
      Authored by Aravinda Prasad
      Enhance KVM to cause a guest exit with KVM_EXIT_NMI
      exit reason upon a machine check exception (MCE) in
      the guest address space if the KVM_CAP_PPC_FWNMI
      capability is enabled (instead of delivering a 0x200
      interrupt to the guest). This enables QEMU to build an error
      log and deliver the machine check exception to the guest via
      the guest's registered machine check handler.
      
      This approach simplifies the delivery of the machine
      check exception to the guest OS compared to the earlier
      approach of KVM directly invoking the guest's 0x200
      interrupt vector.
      
      This design/approach is based on the feedback on the
      QEMU patches for handling machine check exceptions. Details
      of the earlier approach to handling machine check exceptions
      in QEMU, and the related discussions, can be found at:
      
      https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg00813.html
      
      Note:
      
      This patch now directly invokes machine_check_print_event_info()
      from kvmppc_handle_exit_hv() to print the event to host console
      at the time of guest exit before the exception is passed on to the
      guest. Hence, the host-side handling which was performed earlier
      via machine_check_fwnmi is removed.
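
      Schematically, the exit path described above looks like this (a hedged
      sketch; variable and helper names follow the description rather than
      the literal patch):

        case BOOK3S_INTERRUPT_MACHINE_CHECK:
                /* Print the MCE event to the host console before handing
                 * the exception on. */
                machine_check_print_event_info(&vcpu->arch.mce_evt, false);

                if (vcpu->kvm->arch.fwnmi_enabled) {
                        /* FWNMI capability enabled: exit to userspace with
                         * an NMI exit reason so QEMU can build the error log
                         * and inject the machine check into the guest. */
                        run->exit_reason = KVM_EXIT_NMI;
                        r = RESUME_HOST;
                } else {
                        /* Legacy behaviour: deliver a 0x200 machine check
                         * interrupt directly to the guest. */
                        kvmppc_book3s_queue_irqprio(vcpu,
                                        BOOK3S_INTERRUPT_MACHINE_CHECK);
                        r = RESUME_GUEST;
                }
                break;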
      
      The reasons for this approach are: (i) it is not possible
      to distinguish whether the exception occurred in the
      guest or the host from the pt_regs passed to
      machine_check_exception(); hence machine_check_exception()
      calls panic(), instead of passing the exception on to
      the guest, if the machine check exception is not
      recoverable. (ii) The approach introduced in this
      patch gives the host kernel an opportunity to perform
      actions in virtual mode before passing the exception
      on to the guest, and does not require complex tweaks
      to machine_check_fwnmi and friends.
      Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  16. 21 June 2017, 1 commit
    • KVM: PPC: Book3S HV: Add new capability to control MCE behaviour · 134764ed
      Authored by Aravinda Prasad
      This introduces a new KVM capability to control how KVM behaves
      on machine check exception (MCE) in HV KVM guests.
      
      If this capability has not been enabled, KVM redirects machine check
      exceptions to the guest's 0x200 vector if the address in error belongs to
      the guest. With this capability enabled, KVM will cause a guest exit
      with the exit reason indicating an NMI.
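
      From userspace the opt-in is an ordinary KVM_ENABLE_CAP call, along the
      lines of the sketch below (whether the capability is enabled on the VM
      or per-vCPU file descriptor is not stated here, so the target fd is an
      assumption):

        struct kvm_enable_cap cap = {
                .cap = KVM_CAP_PPC_FWNMI,
        };

        /* With the capability enabled, an MCE in the guest produces a
         * KVM_EXIT_NMI exit instead of an in-kernel 0x200 redirect. */
        if (ioctl(kvm_fd, KVM_ENABLE_CAP, &cap) < 0)
                perror("KVM_ENABLE_CAP(KVM_CAP_PPC_FWNMI)");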
      
      The new capability is required to avoid problems if a new kernel/KVM
      is used with an old QEMU, running a guest that doesn't issue
      "ibm,nmi-register".  As old QEMU does not understand the NMI exit
      type, it treats it as a fatal error.  However, the guest could have
      handled the machine check error if the exception was delivered to
      guest's 0x200 interrupt vector instead of NMI exit in case of old
      QEMU.
      
      [paulus@ozlabs.org - Reworded the commit message to be clearer,
       enable only on HV KVM.]
      Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  17. 20 June 2017, 5 commits
  18. 19 June 2017, 8 commits
    • powerpc/64s/idle: Branch to handler with virtual mode offset · b51351e2
      Authored by Nicholas Piggin
      Have the system reset idle wakeup handlers branched to, while still in
      real mode, with the 0xc... kernel address offset applied. This simplifies
      the wakeup handler by letting it avoid an rfid when switching to virtual
      mode.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/64s: msgclr when handling doorbell exceptions from system reset · a9af97aa
      Authored by Nicholas Piggin
      msgsnd doorbell exceptions are cleared when the doorbell interrupt is
      taken. However, if a doorbell exception causes a wake from a power
      saving state via the system reset interrupt, the message is not cleared.
      Processing
      the doorbell from the system reset interrupt requires msgclr to avoid
      taking the exception again.
      
      Testing this plus the previous wakeup direct patch gives:
      
                                      original         wakeup direct     msgclr
      Different threads, same core:   315k/s           264k/s            345k/s
      Different cores:                235k/s           242k/s            242k/s
      
      Net speedup is +10% for same core, and +3% for different core.
      Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/64s/idle: Process interrupts from system reset wakeup · 771d4304
      Authored by Nicholas Piggin
      When the CPU wakes from low power state, it begins at the system reset
      interrupt with the exception that caused the wakeup encoded in SRR1.
      
      Today, powernv idle wakeup ignores the wakeup reason (except a special
      case for HMI), and the regular interrupt corresponding to the
      exception will fire after the idle wakeup exits.
      
      Change this to replay the interrupt from the idle wakeup before
      interrupts are hard-enabled.
      
      Running the context_switch selftests benchmark on POWER8 with polling idle
      disabled (e.g., always nap, giving cross-CPU IPIs) gives the following
      results:
      
                                      original         wakeup direct
      Different threads, same core:   315k/s           264k/s
      Different cores:                235k/s           242k/s
      
      There is a slowdown for doorbell IPI (same core) case because system
      reset wakeup does not clear the message and the doorbell interrupt
      fires again needlessly.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/64s/idle: Move soft interrupt mask logic into C code · 2201f994
      Authored by Nicholas Piggin
      This simplifies the asm and fixes irq-off tracing over sleep
      instructions.
      
      Also move powersave_nap check for POWER8 into C code, and move
      PSSCR register value calculation for POWER9 into C.
      Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • KVM: PPC: Book3S HV: Virtualize doorbell facility on POWER9 · 57900694
      Authored by Paul Mackerras
      On POWER9, we no longer have the restriction that we had on POWER8
      where all threads in a core have to be in the same partition, so
      the CPU threads are now independent.  However, we still want to be
      able to run guests with a virtual SMT topology, if only to allow
      migration of guests from POWER8 systems to POWER9.
      
      A guest that has a virtual SMT mode greater than 1 will expect to
      be able to use the doorbell facility; it will expect the msgsndp
      and msgclrp instructions to work appropriately and to be able to read
      sensible values from the TIR (thread identification register) and
      DPDES (directed privileged doorbell exception status) special-purpose
      registers.  However, since each CPU thread is a separate sub-processor
      in POWER9, these instructions and registers can only be used within
      a single CPU thread.
      
      In order for these instructions to appear to act correctly according
      to the guest's virtual SMT mode, we have to trap and emulate them.
      We cause them to trap by clearing the HFSCR_MSGP bit in the HFSCR
      register.  The emulation is triggered by the hypervisor facility
      unavailable interrupt that occurs when the guest uses them.
      
      To cause a doorbell interrupt to occur within the guest, we set the
      DPDES register to 1.  If the guest has interrupts enabled, the CPU
      will generate a doorbell interrupt and clear the DPDES register in
      hardware.  The DPDES hardware register for the guest is saved in the
      vcpu->arch.vcore->dpdes field.  Since this gets written by the guest
      exit code, other VCPUs wishing to cause a doorbell interrupt don't
      write that field directly, but instead set a vcpu->arch.doorbell_request
      flag.  This is consumed and set to 0 by the guest entry code, which
      then sets DPDES to 1.
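
      The handshake described above, expressed as a sketch in C (the real
      guest entry code is assembly; names are approximate):

        /* Sender: another VCPU wanting to raise a doorbell in the target
         * does not write vcore->dpdes directly. */
        target->arch.doorbell_request = 1;
        smp_mb();       /* publish the flag before waking the target */
        kvmppc_fast_vcpu_kick_hv(target);       /* force it through guest entry */

        /* Guest entry (per target VCPU): consume the flag and raise a
         * doorbell by setting DPDES to 1 for this thread. */
        if (vcpu->arch.doorbell_request) {
                vcpu->arch.doorbell_request = 0;
                mtspr(SPRN_DPDES, 1);   /* cleared by hardware when the guest
                                         * takes the doorbell interrupt */
        }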
      
      Emulating reads of the DPDES register is somewhat involved, because
      it requires reading the doorbell pending interrupt status of all of the
      VCPU threads in the virtual core, and if any of those VCPUs are
      running, their doorbell status is only up-to-date in the hardware
      DPDES registers of the CPUs where they are running.  In order to get
      a reasonable approximation of the current doorbell status, we send
      those CPUs an IPI, causing an exit from the guest which will update
      the vcpu->arch.vcore->dpdes field.  We then use that value in
      constructing the emulated DPDES register value.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Allow userspace to set the desired SMT mode · 3c313524
      Authored by Paul Mackerras
      This allows userspace to set the desired virtual SMT (simultaneous
      multithreading) mode for a VM, that is, the number of VCPUs that
      get assigned to each virtual core.  Previously, the virtual SMT mode
      was fixed to the number of threads per subcore, and if userspace
      wanted to have fewer vcpus per vcore, then it would achieve that by
      using a sparse CPU numbering.  This had the disadvantage that the
      vcpu numbers can get quite large, particularly for SMT1 guests on
      a POWER8 with 8 threads per core.  With this patch, userspace can
      set its desired virtual SMT mode and then use contiguous vcpu
      numbering.
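
      From userspace this would look something like the sketch below (the
      argument layout is an assumption based on the description above, not
      quoted from the patch):

        struct kvm_enable_cap cap = {
                .cap  = KVM_CAP_PPC_SMT,
                .args = { 4 },          /* e.g. virtual SMT4: 4 VCPUs per vcore */
        };

        if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap) < 0)
                perror("KVM_ENABLE_CAP(KVM_CAP_PPC_SMT)");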
      
      On POWER8, where the threading mode is "strict", the virtual SMT mode
      must be less than or equal to the number of threads per subcore.  On
      POWER9, which implements a "loose" threading mode, the virtual SMT
      mode can be any power of 2 between 1 and 8, even though there is
      effectively one thread per subcore, since the threads are independent
      and can all be in different partitions.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Context-switch HFSCR between host and guest on POWER9 · 769377f7
      Authored by Paul Mackerras
      This adds code to allow us to use a different value for the HFSCR
      (Hypervisor Facilities Status and Control Register) when running the
      guest from that which applies in the host.  The reason for doing this
      is to allow us to trap the msgsndp instruction and related operations
      in future so that they can be virtualized.  We also save the value of
      HFSCR when a hypervisor facility unavailable interrupt occurs, because
      the high byte of HFSCR indicates which facility the guest attempted to
      access.
      
      We save and restore the host value on guest entry/exit because some
      bits of it affect host userspace execution.
      
      We only do all this on POWER9, not on POWER8, because we are not
      intending to virtualize any of the facilities controlled by HFSCR on
      POWER8.  In particular, the HFSCR bit that controls execution of
      msgsndp and related operations does not exist on POWER8.  The HFSCR
      doesn't exist at all on POWER7.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Enable guests to use large decrementer mode on POWER9 · 1bc3fe81
      Authored by Paul Mackerras
      This allows userspace (e.g. QEMU) to enable large decrementer mode for
      the guest when running on a POWER9 host, by setting the LPCR_LD bit in
      the guest LPCR value.  With this, the guest exit code saves 64 bits of
      the guest DEC value on exit.  Other places that use the guest DEC
      value check the LPCR_LD bit in the guest LPCR value, and if it is set,
      omit the 32-bit sign extension that would otherwise be done.
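
      The conditional sign extension reads roughly as follows (a sketch; the
      helper and field names are illustrative, while LPCR_LD is the real LPCR
      bit):

        u64 dec = read_guest_dec(vcpu);         /* hypothetical helper */

        /* In large decrementer mode all 64 bits are significant; otherwise
         * the architected DEC is 32 bits and must be sign-extended. */
        if (!(vcpu->arch.vcore->lpcr & LPCR_LD))
                dec = (u64)(s64)(s32)dec;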
      
      This doesn't change the DEC emulation used by PR KVM because PR KVM
      is not supported on POWER9 yet.
      
      This is partly based on an earlier patch by Oliver O'Halloran.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  19. 16 June 2017, 2 commits
  20. 15 June 2017, 3 commits