1. 31 Aug, 2017 (1 commit)
  2. 28 Jul, 2017 (1 commit)
  3. 26 Jul, 2017 (1 commit)
    • powerpc/mm/radix: Workaround prefetch issue with KVM · a25bd72b
      Benjamin Herrenschmidt committed
      There's a somewhat architectural issue with Radix MMU and KVM.
      
      When coming out of a guest with AIL (Alternate Interrupt Location, ie,
      MMU enabled), we start executing hypervisor code with the PID register
      still containing whatever the guest has been using.
      
      The problem is that the CPU can (and will) then start prefetching or
      speculatively load from whatever host context has that same PID (if
      any), thus bringing translations for that context into the TLB, which
      Linux doesn't know about.
      
      This can cause stale translations and subsequent crashes.
      
      Fixing this in a way that is neither racy nor a huge performance
      hit is difficult. We could just make the host invalidations always
      use broadcast forms, but that would hurt single-threaded programs,
      for example.
      
      We chose to fix it instead by partitioning the PID space between guest
      and host. This is possible because today Linux only uses 19 out of the
      20 bits of PID space, so existing guests will work if we make the host
      use the top half of the 20 bits space.
      
      We additionally add support for a property to indicate to Linux the
      size of the PID register, which will be useful if we eventually have
      processors with a larger PID space available.
      
      There is still an issue with malicious guests purposefully setting the
      PID register to a value in the host's PID range. Hopefully future HW
      can prevent that, but in the meantime, we handle it with a pair of
      kludges:
      
       - On the way out of a guest, before we clear the current VCPU in the
         PACA, we check the PID and if it's outside of the permitted range
         we flush the TLB for that PID.
      
       - When context switching, if the mm is "new" on that CPU (the
         corresponding bit was set for the first time in the mm cpumask), we
         check if any sibling thread is in KVM (has a non-NULL VCPU pointer
         in the PACA). If that is the case, we also flush the PID for that
         CPU (core).
      
      This second part is needed to handle the case where a process is
      migrated (or starts a new pthread) on a sibling thread of the CPU
      coming out of KVM, as there's a window where stale translations can
      exist before we detect it and flush them out.
      
      A future optimization could be added by keeping track of whether the
      PID has ever been used and avoiding the flush for completely fresh
      PIDs. We could similarly mark PIDs that have been the subject of a
      global invalidation as "fresh". But for now this will do.
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      [mpe: Rework the asm to build with CONFIG_PPC_RADIX_MMU=n, drop
            unneeded include of kvm_book3s_asm.h]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      a25bd72b
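      For illustration, a rough C rendering of the first kludge above. The
      real check runs in assembly on the guest-exit path; mmu_base_pid
      follows the naming of the radix PID-partitioning code, and the flush
      helper is an assumption, not an exact kernel symbol:

        #include <asm/mmu.h>
        #include <asm/reg.h>

        /*
         * Sketch: on the way out of the guest, if the PID register holds
         * a value in the host's (top) half of the PID space, flush every
         * translation tagged with that PID before host code runs.
         */
        static void flush_guest_pid_if_out_of_range(void)
        {
            unsigned long pid = mfspr(SPRN_PID);

            if (pid >= mmu_base_pid)        /* host-range PID: not for guests */
                radix__flush_all_pid(pid);  /* hypothetical flush helper */
        }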
  4. 20 Jul, 2017 (1 commit)
  5. 18 Jul, 2017 (1 commit)
    • powerpc/mm: Mark __init memory no-execute when STRICT_KERNEL_RWX=y · 029d9252
      Michael Ellerman committed
      Currently even with STRICT_KERNEL_RWX we leave the __init text marked
      executable after init, which is bad.
      
      Add a hook to mark it NX (no-execute) before we free it, and implement
      it for radix and hash.
      
      Note that we use __init_end as the end address, not _einittext, to
      match overlaps_kernel_text(), which also uses __init_end because
      there are additional executable sections other than .init.text
      between __init_begin and __init_end.
      
      Tested on radix and hash with:
      
        0:mon> p $__init_begin
        *** 400 exception occurred
      
      Fixes: 1e0fc9d1 ("powerpc/Kconfig: Enable STRICT_KERNEL_RWX for some configs")
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      029d9252
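      A sketch of what the radix flavour of such a hook can look like;
      radix__change_memory_range() is modelled on the helper radix uses
      for marking rodata read-only, and its exact signature here is an
      assumption:

        #include <asm/pgtable.h>
        #include <asm/sections.h>

        /* Called just before free_initmem() releases the init region. */
        void mark_initmem_nx(void)
        {
            unsigned long start = (unsigned long)__init_begin;
            unsigned long end = (unsigned long)__init_end; /* not _einittext */

            /* Clear _PAGE_EXEC so nothing in [__init_begin, __init_end)
             * remains executable once the memory is freed. */
            radix__change_memory_range(start, end, _PAGE_EXEC);
        }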
  6. 17 Jul, 2017 (1 commit)
  7. 13 Jul, 2017 (2 commits)
    • mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic · dcda9b04
      Michal Hocko committed
      __GFP_REPEAT was designed to allow retry-but-eventually-fail
      semantics in the page allocator.  This has been true, but only for
      allocation requests larger than PAGE_ALLOC_COSTLY_ORDER.  It has
      always been ignored for smaller sizes.  This is a bit unfortunate
      because there is no way to express the same semantics for those
      requests, and they are considered too important to fail, so they
      might end up looping in the page allocator forever, similarly to
      GFP_NOFAIL requests.
      
      Now that the whole tree has been cleaned up and accidental or misled
      usage of the __GFP_REPEAT flag has been removed for !costly
      requests, we can give the original flag a better name and, more
      importantly, a more useful semantic.  Let's rename it to
      __GFP_RETRY_MAYFAIL, which tells the user that the allocator will
      try really hard but there is no promise of success.  This works
      independently of the order and overrides the default allocator
      behavior.  Page allocator users have several levels of guarantee
      vs. cost options (take GFP_KERNEL as an example):
      
       - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
         attempt to free memory at all. The most lightweight mode, which
         doesn't even kick background reclaim. Should be used carefully
         because it might deplete the memory and the next user might hit
         the more aggressive reclaim.
      
       - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT) - optimistic
         allocation without any attempt to free memory from the current
         context, but it can wake kswapd to reclaim memory if the zone is
         below the low watermark. Can be used from either atomic contexts
         or when the request is a performance optimization and there is
         another fallback for a slow path.
      
       - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
         non-sleeping allocation with an expensive fallback so it can
         access some portion of memory reserves. Usually used from
         interrupt/bh context with an expensive slow path fallback.
      
       - GFP_KERNEL - both background and direct reclaim are allowed and the
         _default_ page allocator behavior is used. That means that !costly
         allocation requests are basically nofail but there is no guarantee of
         that behavior so failures have to be checked properly by callers
         (e.g. OOM killer victim is allowed to fail currently).
      
       - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
         and all allocation requests fail early rather than cause disruptive
         reclaim (one round of reclaim in this implementation). The OOM killer
         is not invoked.
      
       - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
         behavior and all allocation requests try really hard. The request
         will fail if the reclaim cannot make any progress. The OOM killer
         won't be triggered.
      
       - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
         and all allocation requests will loop endlessly until they succeed.
         This might be really dangerous especially for larger orders.
      
      Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
      because this is the semantic they already relied on.  No new users
      are added.  __alloc_pages_slowpath is changed to bail out for
      __GFP_RETRY_MAYFAIL if there is no progress and we have already
      passed the OOM point.
      
      This means that all the reclaim opportunities have been exhausted
      except the most disruptive one (the OOM killer), and a user-defined
      fallback behavior is more sensible than retrying forever in the
      page allocator.
      
      [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
      [mhocko@suse.com: semantic fix]
        Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
      [mhocko@kernel.org: address other thing spotted by Vlastimil]
        Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alex Belits <alex.belits@cavium.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David Daney <david.daney@cavium.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dcda9b04
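      A short usage sketch of the new flag, assuming the common "large
      contiguous buffer with a virtually-mapped fallback" pattern the
      changelog alludes to (alloc_big_buffer() is a made-up example):

        #include <linux/gfp.h>
        #include <linux/slab.h>
        #include <linux/vmalloc.h>

        /*
         * Try hard for a physically contiguous buffer, but let the
         * allocator fail rather than trigger the OOM killer, and fall
         * back to vmalloc(). Free the result with kvfree().
         */
        static void *alloc_big_buffer(size_t size)
        {
            void *buf = kmalloc(size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);

            return buf ? buf : vmalloc(size);
        }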
    • powerpc/64s: implement arch-specific hardlockup watchdog · 2104180a
      Nicholas Piggin committed
      Implement an arch-specific watchdog rather than use the perf-based
      hardlockup detector.
      
      The new watchdog takes the soft-NMI directly, rather than going through
      perf.  Perf interrupts are to be made maskable in future, so that would
      prevent the perf detector from working in those regions.
      
      Additionally, implement an SMP-based detector where all CPUs watch
      one another by pinging a shared cpumask.  This is because powerpc
      Book3S does not have a true periodic local NMI, but some platforms
      do implement a true NMI IPI.
      
      If a CPU is stuck with interrupts hard disabled, the soft-NMI watchdog
      does not work, but the SMP watchdog will.  Even on platforms without
      a true NMI IPI (so no good trace can be pulled from the stuck CPU),
      other CPUs will still notice the lockup well enough to report it and
      panic.
      
      [npiggin@gmail.com: honor watchdog disable at boot/hotplug]
        Link: http://lkml.kernel.org/r/20170621001346.5bb337c9@roar.ozlabs.ibm.com
      [npiggin@gmail.com: fix false positive warning at CPU unplug]
        Link: http://lkml.kernel.org/r/20170630080740.20766-1-npiggin@gmail.com
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/20170616065715.18390-6-npiggin@gmail.com
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Reviewed-by: Don Zickus <dzickus@redhat.com>
      Tested-by: Babu Moger <babu.moger@oracle.com>	[sparc]
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2104180a
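      A much-simplified sketch of the cross-CPU pinging scheme; the real
      arch/powerpc/kernel/watchdog.c is considerably more careful about
      locking and epochs, and all names below are illustrative:

        #include <linux/cpumask.h>
        #include <linux/jiffies.h>
        #include <linux/printk.h>

        static cpumask_t wd_cpus_pending;   /* CPUs yet to check in */
        static unsigned long wd_deadline;   /* end of the current window */

        /* Per-CPU heartbeat, driven by a timer (soft-NMI on powerpc). */
        static void wd_heartbeat(int cpu)
        {
            cpumask_clear_cpu(cpu, &wd_cpus_pending);

            /* Last CPU in resets the window for the next round. */
            if (cpumask_empty(&wd_cpus_pending)) {
                cpumask_copy(&wd_cpus_pending, cpu_online_mask);
                wd_deadline = jiffies + 10 * HZ;
            }
        }

        /* Any healthy CPU can notice that another never checked in. */
        static void wd_check_others(void)
        {
            if (time_after(jiffies, wd_deadline) &&
                !cpumask_empty(&wd_cpus_pending))
                pr_emerg("Watchdog: CPU(s) %*pbl stuck\n",
                         cpumask_pr_args(&wd_cpus_pending));
        }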
  8. 12 Jul, 2017 (1 commit)
    • powerpc/64: Fix atomic64_inc_not_zero() to return an int · 01e6a61a
      Michael Ellerman committed
      Although it's not documented anywhere, there is an expectation that
      atomic64_inc_not_zero() returns a result which fits in an int. This is
      the behaviour implemented on all arches except powerpc.
      
      This has caused at least one bug in practice, in the percpu-refcount
      code, where the long result from our atomic64_inc_not_zero() was
      truncated to an int leading to lost references and stuck systems. That
      was worked around in that code in commit 966d2b04 ("percpu-refcount:
      fix reference leak during percpu-atomic transition").
      
      To the best of my grepping abilities there are no other callers
      in-tree which truncate the value, but we should fix it anyway. Because
      the breakage is subtle and potentially very harmful I'm also tagging
      it for stable.
      
      Code generation is largely unaffected because in most cases the
      callers are just using the result for a test anyway. In particular the
      case of fget() that was mentioned in commit a6cf7ed5
      ("powerpc/atomic: Implement atomic*_inc_not_zero") generates exactly
      the same code.
      
      Fixes: a6cf7ed5 ("powerpc/atomic: Implement atomic*_inc_not_zero")
      Cc: stable@vger.kernel.org # v3.4
      Noticed-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      01e6a61a
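      The hazard in miniature, as a hypothetical caller of the sort the
      percpu-refcount code contained:

        #include <linux/atomic.h>

        static bool try_get_ref(atomic64_t *count)
        {
            /* Implicit long -> int truncation at this assignment. */
            int ret = atomic64_inc_not_zero(count);

            /*
             * If the arch returns the incremented 64-bit value and that
             * value is, say, 0x100000000, its low 32 bits are zero:
             * ret == 0 and the reference is silently leaked. Returning
             * a result that fits in an int removes the trap.
             */
            return ret != 0;
        }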
  9. 11 Jul, 2017 (2 commits)
  10. 10 Jul, 2017 (1 commit)
  11. 07 Jul, 2017 (1 commit)
  12. 04 Jul, 2017 (2 commits)
  13. 03 Jul, 2017 (2 commits)
    • powerpc64/elfv1: Only dereference function descriptor for non-text symbols · 83e840c7
      Naveen N. Rao committed
      Currently, we assume that the function pointer we receive in
      ppc_function_entry() points to a function descriptor. However, this is
      not always the case. In particular, assembly symbols without the right
      annotation do not have an associated function descriptor. Some of these
      symbols are added to the kprobe blacklist using _ASM_NOKPROBE_SYMBOL().
      
      When such addresses are subsequently processed through
      arch_deref_entry_point() in populate_kprobe_blacklist(), we see the
      below errors during bootup:
          [    0.663963] Failed to find blacklist at 7d9b02a648029b6c
          [    0.663970] Failed to find blacklist at a14d03d0394a0001
          [    0.663972] Failed to find blacklist at 7d5302a6f94d0388
          [    0.663973] Failed to find blacklist at 48027d11e8610178
          [    0.663974] Failed to find blacklist at f8010070f8410080
          [    0.663976] Failed to find blacklist at 386100704801f89d
          [    0.663977] Failed to find blacklist at 7d5302a6f94d00b0
      
      Fix this by checking if the function pointer we receive in
      ppc_function_entry() already points to kernel text. If so, we just
      return it as is. If not, we assume that this is a function descriptor
      and proceed to dereference it.
      Suggested-by: Nicholas Piggin <npiggin@gmail.com>
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      83e840c7
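      A condensed rendering of the fix for the ELFv1 case (the real
      ppc_function_entry() also handles the ELFv2 local entry point, which
      is omitted here):

        #include <linux/kernel.h>
        #include <asm/types.h>

        static unsigned long ppc_function_entry_sketch(void *func)
        {
            /* Addresses already inside kernel text are entry points in
             * their own right, e.g. asm symbols with no descriptor. */
            if (kernel_text_address((unsigned long)func))
                return (unsigned long)func;

            /* Otherwise assume an ELFv1 function descriptor. */
            return ((func_descr_t *)func)->entry;
        }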
    • cxl: Export library to support IBM XSL · 3ced8d73
      Christophe Lombard committed
      This patch exports an in-kernel 'library' API which other drivers
      can call to help interact with an IBM XSL on a POWER9 system.
      
      The XSL (Translation Service Layer) is a stripped-down version of
      the PSL (Power Service Layer) and is used in some cards such as the
      Mellanox CX5. Like the PSL, it implements the CAIA architecture, but
      it has a number of differences, mostly in its
      implementation-dependent registers.
      
      The XSL also uses a special DMA cxl mode, which uses a slightly
      different init sequence for the CAPP and PHB.
      Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
      Signed-off-by: Christophe Lombard <clombard@linux.vnet.ibm.com>
      Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      3ced8d73
  14. 02 Jul, 2017 (3 commits)
  15. 01 Jul, 2017 (1 commit)
    • KVM: PPC: Book3S HV: Simplify dynamic micro-threading code · 898b25b2
      Paul Mackerras committed
      Since commit b009031f ("KVM: PPC: Book3S HV: Take out virtual
      core piggybacking code", 2016-09-15), we only have at most one
      vcore per subcore.  Previously, the fact that there might be more
      than one vcore per subcore meant that we had the notion of a
      "master vcore", which was the vcore that controlled thread 0 of
      the subcore.  We also needed a list per subcore in the core_info
      struct to record which vcores belonged to each subcore.  Now that
      there can only be one vcore in the subcore, we can replace the
      list with a simple pointer and get rid of the notion of the
      master vcore (and in fact treat every vcore as a master vcore).
      
      We can also get rid of the subcore_vm[] field in the core_info
      struct since it is never read.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      898b25b2
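      Schematically, the data-structure simplification looks like this
      (field names approximate, not the literal kernel structs):

        #include <linux/list.h>

        struct kvmppc_vcore;

        /* Before: a subcore could hold several vcores, one of them
         * designated the "master". */
        struct sub_core_before {
            struct list_head vcores;            /* all vcores in subcore */
            struct kvmppc_vcore *master_vcore;  /* controls thread 0 */
        };

        /* After: at most one vcore per subcore, so a plain pointer
         * suffices and every vcore is its own master. */
        struct sub_core_after {
            struct kvmppc_vcore *vc;            /* NULL or the one vcore */
        };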
  16. 29 Jun, 2017 (1 commit)
  17. 28 Jun, 2017 (3 commits)
  18. 27 Jun, 2017 (1 commit)
  19. 26 Jun, 2017 (1 commit)
    • powerpc/32: Avoid miscompilation w/GCC 4.6.3 - don't inline copy_to/from_user() · d6bd8194
      Michael Ellerman committed
      Larry Finger reported that his Powerbook G4 was no longer booting
      with v4.12-rc; userspace came up but gave weird errors such as:
      
        udevd[64]: starting version 175
        udevd[64]: Unable to receive ctrl message: Bad address.
        modprobe: chdir(4.12-rc1): No such file or directory
      
      He bisected the problem to commit 3448890c ("powerpc: get rid of zeroing,
      switch to RAW_COPY_USER").
      
      Al identified that the problem is actually a miscompilation by GCC 4.6.3, which
      is exposed by the above commit.
      
      Al also pointed out that inlining copy_to/from_user() is probably of little or
      no benefit, which is correct. Using Anton's copy_to_user benchmark, with a
      pathological single byte copy, we see a small increase in performance
      by *removing* inlining:
      
        Before (inlined):
        # time ./copy_to_user -w -l 1 -i 10000000	( x 3 )
        real	0m22.063s
        real	0m22.059s
        real	0m22.076s
      
        After:
        # time ./copy_to_user -w -l 1 -i 10000000	( x 3 )
        real	0m21.325s
        real	0m21.299s
        real	0m21.364s
      
      So as a small performance improvement and to avoid the miscompilation, drop
      inlining copy_to/from_user() on 32-bit.
      
      Fixes: 3448890c ("powerpc: get rid of zeroing, switch to RAW_COPY_USER")
      Reported-by: Larry Finger <Larry.Finger@lwfinger.net>
      Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      d6bd8194
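      The shape of the change, schematically rather than as the literal
      diff: the 32-bit definitions stop being inline functions in the
      header and become ordinary out-of-line functions, so every caller
      goes through a single well-compiled copy:

        /* Before (header): expanded into every call site, and
         * miscompiled at some of them by GCC 4.6.3. */
        static inline unsigned long
        raw_copy_to_user(void __user *to, const void *from, unsigned long n)
        {
            return __copy_tofrom_user(to, (__force const void __user *)from, n);
        }

        /* After (header): just a declaration ... */
        unsigned long raw_copy_to_user(void __user *to, const void *from,
                                       unsigned long n);

        /* ... with the single definition moved out of line into a .c file. */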
  20. 23 Jun, 2017 (2 commits)
  21. 22 Jun, 2017 (1 commit)
    • KVM: PPC: Book3S HV: Exit guest upon MCE when FWNMI capability is enabled · e20bbd3d
      Aravinda Prasad committed
      Enhance KVM to cause a guest exit with the KVM_EXIT_NMI exit reason
      upon a machine check exception (MCE) in the guest address space if
      the KVM_CAP_PPC_FWNMI capability is enabled (instead of delivering
      a 0x200 interrupt to the guest). This enables QEMU to build an
      error log and deliver the machine check exception to the guest via
      the guest-registered machine check handler.
      
      This approach simplifies the delivery of the machine check
      exception to the guest OS compared to the earlier approach of KVM
      directly invoking the guest's 0x200 interrupt vector.
      
      This design is based on feedback on the QEMU patches for handling
      machine check exceptions. Details of the earlier approach of
      handling machine check exceptions in QEMU and the related
      discussion can be found at:
      
      https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg00813.html
      
      Note:
      
      This patch now directly invokes machine_check_print_event_info()
      from kvmppc_handle_exit_hv() to print the event to the host console
      at the time of guest exit, before the exception is passed on to the
      guest. Hence, the host-side handling which was performed earlier
      via machine_check_fwnmi is removed.
      
      The reasons for this approach are: (i) it is not possible to
      distinguish whether the exception occurred in the guest or the host
      from the pt_regs passed to machine_check_exception(). Hence
      machine_check_exception() calls panic, instead of passing the
      exception on to the guest, if the machine check exception is not
      recoverable. (ii) the approach introduced in this patch gives the
      host kernel an opportunity to perform actions in virtual mode
      before passing the exception on to the guest. This approach does
      not require complex tweaks to machine_check_fwnmi and friends.
      Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      e20bbd3d
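      A sketch of the userspace side of the new exit; the error-log
      construction is QEMU's business and elided here:

        #include <linux/kvm.h>
        #include <stdio.h>
        #include <sys/ioctl.h>

        static void run_vcpu_once(int vcpu_fd, struct kvm_run *run)
        {
            if (ioctl(vcpu_fd, KVM_RUN, 0) < 0)
                return;

            switch (run->exit_reason) {
            case KVM_EXIT_NMI:
                /* Build the error log and deliver the machine check
                 * through the handler the guest registered via
                 * "ibm,nmi-register". */
                fprintf(stderr, "guest MCE surfaced as NMI exit\n");
                break;
            default:
                break;
            }
        }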
  22. 21 Jun, 2017 (1 commit)
    • KVM: PPC: Book3S HV: Add new capability to control MCE behaviour · 134764ed
      Aravinda Prasad committed
      This introduces a new KVM capability to control how KVM behaves
      on machine check exception (MCE) in HV KVM guests.
      
      If this capability has not been enabled, KVM redirects machine check
      exceptions to guest's 0x200 vector, if the address in error belongs to
      the guest. With this capability enabled, KVM will cause a guest exit
      with the exit reason indicating an NMI.
      
      The new capability is required to avoid problems if a new
      kernel/KVM is used with an old QEMU, running a guest that doesn't
      issue "ibm,nmi-register".  As old QEMU does not understand the NMI
      exit type, it treats it as a fatal error.  However, with old QEMU
      the guest could have handled the machine check error itself had the
      exception been delivered to its 0x200 interrupt vector instead of
      triggering an NMI exit.
      
      [paulus@ozlabs.org - Reworded the commit message to be clearer,
       enable only on HV KVM.]
      Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      134764ed
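      What opting in looks like from userspace, as a sketch: the VMM
      enables the capability on the VM file descriptor, and only then
      does KVM switch from 0x200 delivery to NMI exits:

        #include <linux/kvm.h>
        #include <string.h>
        #include <sys/ioctl.h>

        static int enable_fwnmi(int vm_fd)
        {
            struct kvm_enable_cap cap;

            memset(&cap, 0, sizeof(cap));
            cap.cap = KVM_CAP_PPC_FWNMI;

            /* Old QEMU never issues this, so KVM keeps redirecting
             * MCEs to the guest's 0x200 vector. */
            return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
        }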
  23. 20 Jun, 2017 (5 commits)
  24. 19 Jun, 2017 (4 commits)