1. 10 November 2019, 1 commit
  2. 06 November 2019, 2 commits
  3. 12 October 2019, 11 commits
    • A
      powerpc/book3s64/radix: Rename CPU_FTR_P9_TLBIE_BUG feature flag · d1e4b4cc
      Committed by Aneesh Kumar K.V
      commit 09ce98cacd51fcd0fa0af2f79d1e1d3192f4cbb0 upstream.
      
      Rename the #define to indicate this is related to the store vs. tlbie
      ordering issue. In the next patch, we will be adding another feature
      flag that is used to handle the ERAT flush vs. tlbie ordering issue.
      
      Fixes: a5d4b589 ("powerpc/mm: Fixup tlbie vs store ordering issue on POWER9")
      Cc: stable@vger.kernel.org # v4.16+
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190924035254.24612-2-aneesh.kumar@linux.ibm.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d1e4b4cc
    • G
      powerpc/pseries: Fix cpu_hotplug_lock acquisition in resize_hpt() · f5f31a6e
      Committed by Gautham R. Shenoy
      [ Upstream commit c784be435d5dae28d3b03db31753dd7a18733f0c ]
      
      The calls to arch_add_memory()/arch_remove_memory() are always made
      with the read-side cpu_hotplug_lock acquired via memory_hotplug_begin().
      On pSeries, arch_add_memory()/arch_remove_memory() eventually call
      resize_hpt() which in turn calls stop_machine() which acquires the
      read-side cpu_hotplug_lock again, thereby resulting in the recursive
      acquisition of this lock.
      
      In the absence of CONFIG_PROVE_LOCKING, we hadn't observed a system
      lockup during a memory hotplug operation because cpus_read_lock() is a
      per-cpu rwsem read, which, in the fast-path (in the absence of the
      writer, which in our case is a CPU-hotplug operation) simply
      increments the read_count on the semaphore. Thus a recursive read in
      the fast-path doesn't cause any problems.
      
      However, we can hit this problem in practice if there is a concurrent
      CPU-Hotplug operation in progress which is waiting to acquire the
      write-side of the lock. This will cause the second recursive read to
      block until the writer finishes, while the writer remains blocked
      because the first read still holds the lock. Thus both the reader and
      the writer fail to make any progress, blocking both CPU-hotplug and
      memory-hotplug operations.
      
      Memory-Hotplug				CPU-Hotplug
      CPU 0					CPU 1
      ------                                  ------
      
      1. down_read(cpu_hotplug_lock.rw_sem)
         [memory_hotplug_begin]
      					2. down_write(cpu_hotplug_lock.rw_sem)
      					[cpu_up/cpu_down]
      3. down_read(cpu_hotplug_lock.rw_sem)
         [stop_machine()]
      
      Lockdep complains as follows in these code-paths.
      
       swapper/0/1 is trying to acquire lock:
       (____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: stop_machine+0x2c/0x60
      
      but task is already holding lock:
      (____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: mem_hotplug_begin+0x20/0x50
      
       other info that might help us debug this:
        Possible unsafe locking scenario:
      
              CPU0
              ----
         lock(cpu_hotplug_lock.rw_sem);
         lock(cpu_hotplug_lock.rw_sem);
      
        *** DEADLOCK ***
      
        May be due to missing lock nesting notation
      
       3 locks held by swapper/0/1:
        #0: (____ptrval____) (&dev->mutex){....}, at: __driver_attach+0x12c/0x1b0
        #1: (____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: mem_hotplug_begin+0x20/0x50
        #2: (____ptrval____) (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x54/0x1a0
      
      stack backtrace:
       CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc5-58373-gbc99402235f3-dirty #166
       Call Trace:
         dump_stack+0xe8/0x164 (unreliable)
         __lock_acquire+0x1110/0x1c70
         lock_acquire+0x240/0x290
         cpus_read_lock+0x64/0xf0
         stop_machine+0x2c/0x60
         pseries_lpar_resize_hpt+0x19c/0x2c0
         resize_hpt_for_hotplug+0x70/0xd0
         arch_add_memory+0x58/0xfc
         devm_memremap_pages+0x5e8/0x8f0
         pmem_attach_disk+0x764/0x830
         nvdimm_bus_probe+0x118/0x240
         really_probe+0x230/0x4b0
         driver_probe_device+0x16c/0x1e0
         __driver_attach+0x148/0x1b0
         bus_for_each_dev+0x90/0x130
         driver_attach+0x34/0x50
         bus_add_driver+0x1a8/0x360
         driver_register+0x108/0x170
         __nd_driver_register+0xd0/0xf0
         nd_pmem_driver_init+0x34/0x48
         do_one_initcall+0x1e0/0x45c
         kernel_init_freeable+0x540/0x64c
         kernel_init+0x2c/0x160
         ret_from_kernel_thread+0x5c/0x68
      
      Fix this issue by
        1) Requiring all the calls to pseries_lpar_resize_hpt() be made
           with cpu_hotplug_lock held.
      
        2) In pseries_lpar_resize_hpt(), invoke stop_machine_cpuslocked()
           as a consequence of 1).
      
        3) To satisfy 1), in hpt_order_set(), call mmu_hash_ops.resize_hpt()
           with cpu_hotplug_lock held.
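The resulting locking pattern can be sketched as a small userspace C model (the read counter and all function names are illustrative stand-ins for the kernel APIs, not the real implementation):

```c
#include <assert.h>

/* Illustrative model: resize_hpt() must now be entered with the
 * CPU-hotplug read lock held, so it can use the _cpuslocked variant
 * of stop_machine() instead of re-acquiring the lock. */
int hotplug_read_count;                    /* models cpu_hotplug_lock readers */

void cpus_read_lock_model(void)   { hotplug_read_count++; }
void cpus_read_unlock_model(void) { hotplug_read_count--; }

/* stop_machine_cpuslocked(): the caller must already hold the lock */
int stop_machine_cpuslocked_model(void)
{
    assert(hotplug_read_count > 0);        /* held, not re-taken */
    return 0;
}

/* pseries_lpar_resize_hpt() as fixed: requires the lock from callers */
int resize_hpt_model(void)
{
    return stop_machine_cpuslocked_model();
}

/* hpt_order_set(): now takes the lock around the resize call */
int hpt_order_set_model(void)
{
    cpus_read_lock_model();
    int rc = resize_hpt_model();
    cpus_read_unlock_model();
    return rc;
}
```

With this shape there is only ever one read-side acquisition per call chain, so a concurrent writer can no longer wedge between two recursive reads.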
      
      Fixes: dbcf929c ("powerpc/pseries: Add support for hash table resizing")
      Cc: stable@vger.kernel.org # v4.11+
      Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/1557906352-29048-1-git-send-email-ego@linux.vnet.ibm.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      f5f31a6e
    • C
      KVM: PPC: Book3S HV: XIVE: Free escalation interrupts before disabling the VP · 34b13ff6
      Committed by Cédric Le Goater
      [ Upstream commit 237aed48c642328ff0ab19b63423634340224a06 ]
      
      When a vCPU is brought down, the XIVE VP (Virtual Processor) is first
      disabled and then the event notification queues are freed. When freeing
      the queues, we check for possible escalation interrupts and free them
      also.
      
      But when a XIVE VP is disabled, the underlying XIVE ENDs also are
      disabled in OPAL. When an END (Event Notification Descriptor) is
      disabled, its ESB pages (ESn and ESe) are disabled and loads return
      all 1s, which means that any access to the ESB page of the escalation
      interrupt will return invalid values.
      
      When an interrupt is freed, the shutdown handler computes a 'saved_p'
      field from the value returned by a load in xive_do_source_set_mask().
      This value is incorrect for escalation interrupts for the reason
      described above.
      
      This has no impact on Linux/KVM today because we don't make use of it,
      but future changes will introduce a xive_get_irqchip_state()
      handler. That handler will use the 'saved_p' field to return the state
      of an interrupt, and with 'saved_p' being incorrect, a softlockup will
      occur.
      
      Fix the vCPU cleanup sequence by first freeing the escalation
      interrupts, if any, then disabling the XIVE VP, and lastly freeing the
      queues.
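The corrected teardown order can be sketched as a userspace model (step names are illustrative; the real code lives in the XIVE vCPU cleanup path):

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the fixed order: free escalation IRQs while their ESB
 * pages are still valid, then disable the VP, then free the queues. */
enum step { FREE_ESC = 1, DISABLE_VP = 2, FREE_QUEUES = 3 };

int trace[3], trace_len;

void xive_cleanup_vcpu_model(void)
{
    trace[trace_len++] = FREE_ESC;     /* escalation IRQs first        */
    trace[trace_len++] = DISABLE_VP;   /* VP (and its ENDs) may now go */
    trace[trace_len++] = FREE_QUEUES;  /* queues last                  */
}

bool cleanup_order_ok(void)
{
    trace_len = 0;
    xive_cleanup_vcpu_model();
    return trace_len == 3 && trace[0] == FREE_ESC &&
           trace[1] == DISABLE_VP && trace[2] == FREE_QUEUES;
}
```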
      
      Fixes: 90c73795afa2 ("KVM: PPC: Book3S HV: Add a new KVM device for the XIVE native exploitation mode")
      Fixes: 5af50993 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: Cédric Le Goater <clg@kaod.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190806172538.5087-1-clg@kaod.org
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      34b13ff6
    • A
      powerpc/book3s64/mm: Don't do tlbie fixup for some hardware revisions · 9124eac4
      Committed by Aneesh Kumar K.V
      commit 677733e296b5c7a37c47da391fc70a43dc40bd67 upstream.
      
      The store ordering vs tlbie issue mentioned in commit
      a5d4b589 ("powerpc/mm: Fixup tlbie vs store ordering issue on
      POWER9") is fixed for the Nimbus 2.3 and Cumulus 1.3 revisions. We
      don't need to apply the fixup if we are running on them.
      
      We can only do this on PowerNV. On a pseries guest with KVM we still
      don't support redoing the feature fixup after migration, so we should
      enable all the workarounds needed, because we can possibly migrate
      between DD 2.3 and DD 2.2.
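The revision gating can be sketched like this (the chip/revision encoding below is made up for illustration; the real code decodes the POWER9 PVR):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: the workaround is needed below Nimbus DD2.3 and
 * Cumulus DD1.3, and is always kept on guests, which may migrate
 * between fixed and unfixed revisions. */
struct chip { bool is_nimbus; int major, minor; };

bool tlbie_workaround_needed(struct chip c, bool powernv)
{
    int fixed_major = c.is_nimbus ? 2 : 1;   /* DD2.3 / DD1.3 are fixed */

    if (!powernv)
        return true;       /* guest: keep all workarounds, may migrate */
    if (c.major > fixed_major)
        return false;
    if (c.major == fixed_major && c.minor >= 3)
        return false;
    return true;
}
```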
      
      Fixes: a5d4b589 ("powerpc/mm: Fixup tlbie vs store ordering issue on POWER9")
      Cc: stable@vger.kernel.org # v4.16+
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190924035254.24612-1-aneesh.kumar@linux.ibm.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      9124eac4
    • A
      powerpc/powernv/ioda: Fix race in TCE level allocation · 19c12f12
      Committed by Alexey Kardashevskiy
      commit 56090a3902c80c296e822d11acdb6a101b322c52 upstream.
      
      pnv_tce() returns a pointer to a TCE entry and originally a TCE table
      would be pre-allocated. For the default case of 2GB window the table
      needs only a single level and that is fine. However if more levels are
      requested, it is possible to get a race when 2 threads want a pointer
      to a TCE entry from the same page of TCEs.
      
      This adds a cmpxchg to handle the race. Note that once a TCE entry is
      non-zero, it cannot become zero again.
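The cmpxchg pattern can be sketched in userspace with a GCC atomic builtin (names and types are illustrative; the kernel change is in the TCE lookup path):

```c
#include <assert.h>
#include <stdint.h>

uint64_t demo_slot;   /* one TCE slot, initially empty (zero) */

/* Install a next-level table pointer with compare-and-swap: of two
 * racing threads exactly one wins, and the slot, once non-zero,
 * never changes back to zero. */
uint64_t tce_slot_install(uint64_t *slot, uint64_t fresh)
{
    uint64_t old = __sync_val_compare_and_swap(slot, 0, fresh);

    if (old)
        return old;    /* lost the race: caller frees 'fresh' */
    return fresh;      /* won: our table is now published     */
}
```

The loser of the race simply discards its freshly allocated page and uses the winner's pointer, so both threads end up referencing the same level.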
      
      Fixes: a68bd126 ("powerpc/powernv/ioda: Allocate indirect TCE levels on demand")
      CC: stable@vger.kernel.org # v4.19+
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190718051139.74787-2-aik@ozlabs.ru
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      19c12f12
    • A
      powerpc/powernv: Restrict OPAL symbol map to only be readable by root · 032ce7d7
      Committed by Andrew Donnellan
      commit e7de4f7b64c23e503a8c42af98d56f2a7462bd6d upstream.
      
      Currently the OPAL symbol map is globally readable, which seems bad as
      it contains physical addresses.
      
      Restrict it to root.
      
      Fixes: c8742f85 ("powerpc/powernv: Expose OPAL firmware symbol map")
      Cc: stable@vger.kernel.org # v3.19+
      Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190503075253.22798-1-ajd@linux.ibm.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      032ce7d7
    • S
      powerpc/mce: Schedule work from irq_work · ba3ca9fc
      Committed by Santosh Sivaraj
      commit b5bda6263cad9a927e1a4edb7493d542da0c1410 upstream.
      
      schedule_work() cannot be called from MCE exception context, as an MCE
      can interrupt even in interrupt-disabled context.
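A userspace sketch of the resulting two-stage deferral (stub flags stand in for irq_work_queue() and schedule_work(); names are illustrative):

```c
#include <assert.h>
#include <stdbool.h>

bool work_scheduled, irq_work_pending_flag;

void schedule_work_stub(void) { work_scheduled = true; }

void mce_irq_work_cb(void)    { schedule_work_stub(); }

/* MCE handler: may run with interrupts disabled, so it only queues
 * an irq_work (which is safe from any context). */
void machine_check_handler(void)
{
    irq_work_pending_flag = true;
}

/* irq_work runs later, in a context where scheduling work is legal. */
void irq_work_run_stub(void)
{
    if (irq_work_pending_flag) {
        irq_work_pending_flag = false;
        mce_irq_work_cb();
    }
}

bool mce_demo(void)
{
    machine_check_handler();            /* MCE context: only queue   */
    bool premature = work_scheduled;    /* must still be false here  */
    irq_work_run_stub();                /* later, safe context       */
    return !premature && work_scheduled;
}
```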
      
      Fixes: 733e4a4c ("powerpc/mce: hookup memory_failure for UE errors")
      Cc: stable@vger.kernel.org # v4.15+
      Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: Santosh Sivaraj <santosh@fossix.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190820081352.8641-2-santosh@fossix.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ba3ca9fc
    • B
      powerpc/mce: Fix MCE handling for huge pages · ee6eeeb8
      Committed by Balbir Singh
      commit 99ead78afd1128bfcebe7f88f3b102fb2da09aee upstream.
      
      The current code would fail on huge page addresses, since the shift would
      be incorrect. Use the correct page shift value returned by
      __find_linux_pte() to get the correct physical address. The code is more
      generic and can handle both regular and compound pages.
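The corrected address computation can be sketched as plain arithmetic (PAGE_SHIFT_MODEL and the example values are assumptions for this sketch; the kernel uses the page shift that __find_linux_pte() reports for the mapping):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT_MODEL 16   /* assume 64K base pages for this sketch */

/* Base physical address from the PTE's pfn, plus the offset within
 * the (possibly huge) page given by the mapping's true page shift,
 * so compound pages resolve to the right physical address. */
uint64_t mce_phys_addr(uint64_t pte_pfn, unsigned int shift, uint64_t ea)
{
    return (pte_pfn << PAGE_SHIFT_MODEL) | (ea & ((1ULL << shift) - 1));
}
```

For a regular page the shift equals the base-page shift and the result is unchanged; for a 16MB huge page (shift 24) the offset mask now covers the whole huge page instead of only the base page.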
      
      Fixes: ba41e1e1 ("powerpc/mce: Hookup derror (load/store) UE errors")
      Signed-off-by: Balbir Singh <bsingharora@gmail.com>
      [arbab@linux.ibm.com: Fixup pseries_do_memory_failure()]
      Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
      Tested-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Santosh Sivaraj <santosh@fossix.org>
      Cc: stable@vger.kernel.org # v4.15+
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190820081352.8641-3-santosh@fossix.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ee6eeeb8
    • P
      KVM: PPC: Book3S HV: Don't lose pending doorbell request on migration on P9 · 30fbe0d3
      Committed by Paul Mackerras
      commit ff42df49e75f053a8a6b4c2533100cdcc23afe69 upstream.
      
      On POWER9, when userspace reads the value of the DPDES register on a
      vCPU, it is possible for 0 to be returned although there is a doorbell
      interrupt pending for the vCPU.  This can lead to a doorbell interrupt
      being lost across migration.  If the guest kernel uses doorbell
      interrupts for IPIs, then it could malfunction because of the lost
      interrupt.
      
      This happens because a newly-generated doorbell interrupt is signalled
      by setting vcpu->arch.doorbell_request to 1; the DPDES value in
      vcpu->arch.vcore->dpdes is not updated, because it can only be updated
      when holding the vcpu mutex, in order to avoid races.
      
      To fix this, we OR in vcpu->arch.doorbell_request when reading the
      DPDES value.
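The fix amounts to OR-ing the pending flag into the value returned to userspace; a minimal sketch (the thread parameter and names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* When userspace reads DPDES, fold in the per-vCPU doorbell_request
 * flag so a freshly signalled doorbell is visible even though
 * vcore->dpdes has not been updated yet. */
uint64_t read_dpdes_model(uint64_t vcore_dpdes, int doorbell_request,
                          int thread)
{
    if (doorbell_request)
        vcore_dpdes |= 1ULL << thread;
    return vcore_dpdes;
}
```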
      
      Cc: stable@vger.kernel.org # v4.13+
      Fixes: 57900694 ("KVM: PPC: Book3S HV: Virtualize doorbell facility on POWER9")
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      30fbe0d3
    • P
      KVM: PPC: Book3S HV: Check for MMU ready on piggybacked virtual cores · 4faa7f05
      Committed by Paul Mackerras
      commit d28eafc5a64045c78136162af9d4ba42f8230080 upstream.
      
      When we are running multiple vcores on the same physical core, they
      could be from different VMs and so it is possible that one of the
      VMs could have its arch.mmu_ready flag cleared (for example by a
      concurrent HPT resize) when we go to run it on a physical core.
      We currently check the arch.mmu_ready flag for the primary vcore
      but not the flags for the other vcores that will be run alongside
      it.  This adds that check, and also a check when we select the
      secondary vcores from the preempted vcores list.
      
      Cc: stable@vger.kernel.org # v4.14+
      Fixes: 38c53af8 ("KVM: PPC: Book3S HV: Fix exclusion between HPT resizing and other HPT updates")
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4faa7f05
    • P
      KVM: PPC: Book3S HV: Fix race in re-enabling XIVE escalation interrupts · 577a5119
      Committed by Paul Mackerras
      commit 959c5d5134786b4988b6fdd08e444aa67d1667ed upstream.
      
      Escalation interrupts are interrupts sent to the host by the XIVE
      hardware when it has an interrupt to deliver to a guest VCPU but that
      VCPU is not running anywhere in the system.  Hence we disable the
      escalation interrupt for the VCPU being run when we enter the guest
      and re-enable it when the guest does an H_CEDE hypercall indicating
      it is idle.
      
      It is possible that an escalation interrupt gets generated just as we
      are entering the guest.  In that case the escalation interrupt may be
      using a queue entry in one of the interrupt queues, and that queue
      entry may not have been processed when the guest exits with an H_CEDE.
      The existing entry code detects this situation and does not clear the
      vcpu->arch.xive_esc_on flag as an indication that there is a pending
      queue entry (if the queue entry gets processed, xive_esc_irq() will
      clear the flag).  There is a comment in the code saying that if the
      flag is still set on H_CEDE, we have to abort the cede rather than
      re-enabling the escalation interrupt, lest we end up with two
      occurrences of the escalation interrupt in the interrupt queue.
      
      However, the exit code doesn't do that; it aborts the cede in the sense
      that vcpu->arch.ceded gets cleared, but it still enables the escalation
      interrupt by setting the source's PQ bits to 00.  Instead we need to
      set the PQ bits to 10, indicating that an interrupt has been triggered.
      We also need to avoid setting vcpu->arch.xive_esc_on in this case
      (i.e. vcpu->arch.xive_esc_on seen to be set on H_CEDE) because
      xive_esc_irq() will run at some point and clear it, and if we race with
      that we may end up with an incorrect result (i.e. xive_esc_on set when
      the escalation interrupt has just been handled).
      
      It is extremely unlikely that having two queue entries would cause
      observable problems; theoretically it could cause queue overflow, but
      the CPU would have to have thousands of interrupts targeted at it for
      that to be possible.  However, this fix will also make it possible to
      determine accurately whether there is an unhandled escalation
      interrupt in the queue, which will be needed by the following patch.
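The corrected decision on the cede path can be modeled as follows (PQ_* here are the 2-bit ESB states themselves, not real MMIO offsets, and esc_on_out stands in for vcpu->arch.xive_esc_on; all names are illustrative):

```c
#include <assert.h>
#include <stdbool.h>

enum pq { PQ_00 = 0, PQ_01 = 1, PQ_10 = 2, PQ_11 = 3 };

bool esc_on_out;   /* models vcpu->arch.xive_esc_on after the cede */

/* If an escalation queue entry is still outstanding when the guest
 * cedes, set PQ to 10 (triggered) instead of 00 (enabled), and do
 * not set xive_esc_on; xive_esc_irq() will clear the old flag. */
enum pq cede_escalation_pq(bool pending_entry)
{
    if (pending_entry) {
        esc_on_out = false;   /* avoid racing with xive_esc_irq() */
        return PQ_10;         /* don't re-enable: entry is queued */
    }
    esc_on_out = true;        /* normal path: re-arm escalation   */
    return PQ_00;
}
```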
      
      Fixes: 9b9b13a6 ("KVM: PPC: Book3S HV: Keep XIVE escalation interrupt masked unless ceded")
      Cc: stable@vger.kernel.org # v4.16+
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190813100349.GD9567@blackberry
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      577a5119
  4. 08 October 2019, 9 commits
    • G
      powerpc: dump kernel log before carrying out fadump or kdump · 324b0c9e
      Committed by Ganesh Goudar
      [ Upstream commit e7ca44ed3ba77fc26cf32650bb71584896662474 ]
      
      Since commit 4388c9b3 ("powerpc: Do not send system reset request
      through the oops path"), pstore dmesg file is not updated when dump is
      triggered from HMC. This commit modified system reset (sreset) handler
      to invoke fadump or kdump (if configured), without pushing dmesg to
      pstore. This leaves pstore holding old dmesg data, which won't be much
      help if kdump fails to capture the dump. This patch fixes that by
      calling kmsg_dump() before heading to fadump or kdump.
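The reordered system-reset path can be sketched with stubs (all names here are illustrative stand-ins):

```c
#include <assert.h>
#include <stdbool.h>

int dump_order[2], dump_steps;

void kmsg_dump_stub(void)    { dump_order[dump_steps++] = 1; }
void crash_fadump_stub(void) { dump_order[dump_steps++] = 2; }

/* Fixed sreset handler: push dmesg to pstore first, then head into
 * fadump/kdump, so pstore is current even if the dump fails. */
void system_reset_handler_model(void)
{
    kmsg_dump_stub();
    crash_fadump_stub();
}

bool sreset_order_ok(void)
{
    dump_steps = 0;
    system_reset_handler_model();
    return dump_steps == 2 && dump_order[0] == 1 && dump_order[1] == 2;
}
```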
      
      Fixes: 4388c9b3 ("powerpc: Do not send system reset request through the oops path")
      Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190904075949.15607-1-ganeshgr@linux.ibm.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      324b0c9e
    • N
      powerpc/pseries: correctly track irq state in default idle · b717a47d
      Committed by Nathan Lynch
      [ Upstream commit 92c94dfb69e350471473fd3075c74bc68150879e ]
      
      prep_irq_for_idle() is intended to be called before entering
      H_CEDE (and it is used by the pseries cpuidle driver). However, the
      default pseries idle routine does not call it, leading to mismanaged
      lazy irq state when the cpuidle driver isn't in use. Manifestations of
      this include:
      
      * Dropped IPIs in the time immediately after a cpu comes
        online (before it has installed the cpuidle handler), making the
        online operation block indefinitely waiting for the new cpu to
        respond.
      
      * Hitting this WARN_ON in arch_local_irq_restore():
      	/*
      	 * We should already be hard disabled here. We had bugs
      	 * where that wasn't the case so let's dbl check it and
      	 * warn if we are wrong. Only do that when IRQ tracing
      	 * is enabled as mfmsr() can be costly.
      	 */
      	if (WARN_ON_ONCE(mfmsr() & MSR_EE))
      		__hard_irq_disable();
      
      Call prep_irq_for_idle() from pseries_lpar_idle() and honor its
      result.
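A sketch of the fixed idle routine, honoring the result of prep_irq_for_idle() (stub logic only, not the kernel's actual lazy-irq state machine):

```c
#include <assert.h>
#include <stdbool.h>

/* Stub: preparing for idle fails if an interrupt is already pending,
 * in which case the cpu must not cede. */
bool prep_irq_for_idle_stub(bool irq_pending)
{
    return !irq_pending;
}

/* Fixed pseries_lpar_idle(): cede only when preparation succeeded,
 * so lazy irq state stays consistent. Returns whether we ceded. */
bool pseries_lpar_idle_model(bool irq_pending)
{
    if (!prep_irq_for_idle_stub(irq_pending))
        return false;          /* bail out, handle the pending irq */
    return true;               /* H_CEDE would be entered here     */
}
```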
      
      Fixes: 363edbe2 ("powerpc: Default arch idle could cede processor on pseries")
      Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190910225244.25056-1-nathanl@linux.ibm.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      b717a47d
    • N
      powerpc/64s/exception: machine check use correct cfar for late handler · 0c09b028
      Committed by Nicholas Piggin
      [ Upstream commit 0b66370c61fcf5fcc1d6901013e110284da6e2bb ]
      
      Bare metal machine checks run an "early" handler in real mode before
      running the main handler which reports the event.
      
      The main handler runs exactly as a normal interrupt handler, after the
      "windup" which sets registers back as they were at interrupt entry.
      CFAR does not get restored by the windup code, so that will be wrong
      when the handler is run.
      
      Restore the CFAR to the saved value before running the late handler.
      
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190802105709.27696-8-npiggin@gmail.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      0c09b028
    • S
      powerpc/eeh: Clear stale EEH_DEV_NO_HANDLER flag · c1f7b3fb
      Committed by Sam Bobroff
      [ Upstream commit aa06e3d60e245284d1e55497eb3108828092818d ]
      
      The EEH_DEV_NO_HANDLER flag is used by the EEH system to prevent the
      use of driver callbacks in drivers that have been bound part way
      through the recovery process. This is necessary to prevent later stage
      handlers from being called when the earlier stage handlers haven't,
      which can be confusing for drivers.
      
      However, the flag is set for all devices that are added after boot
      time and only cleared at the end of the EEH recovery process. This
      results in hot plugged devices erroneously having the flag set during
      the first recovery after they are added (causing their driver's
      handlers to be incorrectly ignored).
      
      To remedy this, clear the flag at the beginning of recovery
      processing. The flag is still cleared at the end of recovery
      processing, although it is no longer really necessary.
      
      Also clear the flag during eeh_handle_special_event(), for the same
      reasons.
      
      Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/b8ca5629d27de74c957d4f4b250177d1b6fc4bbd.1565930772.git.sbobroff@linux.ibm.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      c1f7b3fb
    • N
      powerpc/pseries/mobility: use cond_resched when updating device tree · 4c91e678
      Committed by Nathan Lynch
      [ Upstream commit ccfb5bd71d3d1228090a8633800ae7cdf42a94ac ]
      
      After a partition migration, pseries_devicetree_update() processes
      changes to the device tree communicated from the platform to
      Linux. This is a relatively heavyweight operation, with multiple
      device tree searches, memory allocations, and conversations with
      partition firmware.
      
      There are a few levels of nested loops which are bounded only by
      decisions made by the platform, outside of Linux's control, and indeed
      we have seen RCU stalls on large systems while executing this call
      graph. Use cond_resched() in these loops so that the cpu is yielded
      when needed.
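The shape of the fix, sketched in userspace (loop bounds and stub names are illustrative):

```c
#include <assert.h>

int resched_calls;

void cond_resched_stub(void) { resched_calls++; }

/* Model of pseries_devicetree_update(): the outer loop is bounded by
 * the platform, so yield the cpu between iterations instead of
 * running long enough to trigger RCU stalls. */
int devicetree_update_model(int nodes, int props_per_node)
{
    int work = 0;

    for (int n = 0; n < nodes; n++) {
        cond_resched_stub();              /* yield between nodes     */
        for (int p = 0; p < props_per_node; p++)
            work++;                       /* process one property    */
    }
    return work;
}
```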
      Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190802192926.19277-4-nathanl@linux.ibm.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      4c91e678
    • C
      powerpc/futex: Fix warning: 'oldval' may be used uninitialized in this function · 6d728a17
      Committed by Christophe Leroy
      [ Upstream commit 38a0d0cdb46d3f91534e5b9839ec2d67be14c59d ]
      
      We see warnings such as:
        kernel/futex.c: In function 'do_futex':
        kernel/futex.c:1676:17: warning: 'oldval' may be used uninitialized in this function [-Wmaybe-uninitialized]
           return oldval == cmparg;
                         ^
        kernel/futex.c:1651:6: note: 'oldval' was declared here
          int oldval, ret;
              ^
      
      This is because arch_futex_atomic_op_inuser() only sets *oval if ret
      is 0 and GCC doesn't see that it will only use it when ret is 0.
      
      Anyway, the non-zero ret path is an error path that won't suffer from
      setting *oval, and as *oval is a local var in futex_atomic_op_inuser()
      it will have no impact.
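A simplified userspace stand-in for the pattern (this sketch models only the FUTEX_OP_ADD case, and -14 models -EFAULT; none of it is the real arch implementation):

```c
#include <assert.h>

int demo_uaddr = 5, demo_oval = -1;

/* Initialize oldval and store it to *oval unconditionally, so the
 * compiler can see that *oval is always written; on the error path
 * the stored value is harmless because callers check ret first. */
int arch_futex_atomic_op_model(int *uaddr, int oparg, int *oval, int fail)
{
    int oldval = 0, ret = 0;

    if (fail) {
        ret = -14;               /* simulated access failure (-EFAULT) */
    } else {
        oldval = *uaddr;         /* fetch old value                    */
        *uaddr = oldval + oparg; /* FUTEX_OP_ADD, as an example        */
    }

    *oval = oldval;              /* unconditional store silences GCC   */
    return ret;
}
```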
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      [mpe: reword change log slightly]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/86b72f0c134367b214910b27b9a6dd3321af93bb.1565774657.git.christophe.leroy@c-s.fr
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      6d728a17
    • N
      powerpc/rtas: use device model APIs and serialization during LPM · 6aa455b0
      Committed by Nathan Lynch
      [ Upstream commit a6717c01ddc259f6f73364779df058e2c67309f8 ]
      
      The LPAR migration implementation and userspace-initiated cpu hotplug
      can interleave their executions like so:
      
      1. Set cpu 7 offline via sysfs.
      
      2. Begin a partition migration, whose implementation requires the OS
         to ensure all present cpus are online; cpu 7 is onlined:
      
           rtas_ibm_suspend_me -> rtas_online_cpus_mask -> cpu_up
      
         This sets cpu 7 online in all respects except for the cpu's
         corresponding struct device; dev->offline remains true.
      
      3. Set cpu 7 online via sysfs. _cpu_up() determines that cpu 7 is
         already online and returns success. The driver core (device_online)
         sets dev->offline = false.
      
      4. The migration completes and restores cpu 7 to offline state:
      
           rtas_ibm_suspend_me -> rtas_offline_cpus_mask -> cpu_down
      
      This leaves cpu7 in a state where the driver core considers the cpu
      device online, but in all other respects it is offline and
      unused. Attempts to online the cpu via sysfs appear to succeed but the
      driver core actually does not pass the request to the lower-level
      cpuhp support code. This makes the cpu unusable until the cpu device
      is manually set offline and then online again via sysfs.
      
      Instead of directly calling cpu_up/cpu_down, the migration code should
      use the higher-level device core APIs to maintain consistent state and
      serialize operations.
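A toy model of why the device-core APIs keep the state consistent (struct and function names are illustrative):

```c
#include <assert.h>
#include <stdbool.h>

/* The device core tracks its own dev->offline flag; going through
 * device_online()/device_offline() keeps it in sync with the real
 * cpu state, which raw cpu_up()/cpu_down() calls do not. */
struct cpu_model { bool online; bool dev_offline; };

void device_online_model(struct cpu_model *c)
{
    c->online = true;
    c->dev_offline = false;
}

void device_offline_model(struct cpu_model *c)
{
    c->online = false;
    c->dev_offline = true;
}

/* Replay the 4-step sequence from the changelog via the device APIs. */
bool migration_sequence_consistent(void)
{
    struct cpu_model cpu7 = { true, false };

    device_offline_model(&cpu7);   /* 1. sysfs sets cpu 7 offline    */
    device_online_model(&cpu7);    /* 2. migration onlines all cpus  */
    device_online_model(&cpu7);    /* 3. sysfs online (harmless now) */
    device_offline_model(&cpu7);   /* 4. migration restores offline  */
    return cpu7.online == !cpu7.dev_offline;
}
```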
      
      Fixes: 120496ac ("powerpc: Bring all threads online prior to migration/hibernation")
      Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
      Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190802192926.19277-2-nathanl@linux.ibm.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      6aa455b0
    • C
      powerpc/xmon: Check for HV mode when dumping XIVE info from OPAL · 25c501f0
      Committed by Cédric Le Goater
      [ Upstream commit c3e0dbd7f780a58c4695f1cd8fc8afde80376737 ]
      
      Currently, the xmon 'dx' command calls OPAL to dump the XIVE state in
      the OPAL logs and also outputs some of the fields of the internal XIVE
      structures in Linux. The OPAL calls can only be done on baremetal
      (PowerNV) and they crash a pseries machine. Fix by checking the
      hypervisor feature of the CPU.
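The guard can be sketched as follows (a stub flag stands in for the CPU hypervisor-mode feature test and the OPAL call; names are illustrative):

```c
#include <assert.h>
#include <stdbool.h>

bool opal_called;

/* xmon 'dx' model: only call into OPAL when running in hypervisor
 * mode (bare metal); on a pseries guest, bail out instead of
 * crashing the machine. */
int xmon_dump_xive_model(bool cpu_has_hv_feature)
{
    opal_called = false;
    if (!cpu_has_hv_feature)
        return -1;             /* not baremetal: skip the OPAL dump */
    opal_called = true;        /* the OPAL XIVE dump would go here  */
    return 0;
}
```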
      Signed-off-by: Cédric Le Goater <clg@kaod.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190814154754.23682-2-clg@kaod.org
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      25c501f0
    • A
      powerpc/powernv/ioda2: Allocate TCE table levels on demand for default DMA window · 437399ed
      Committed by Alexey Kardashevskiy
      [ Upstream commit c37c792dec0929dbb6360a609fb00fa20bb16fc2 ]
      
      We already allocate only the first level of multilevel TCE tables for
      KVM (alloc_userspace_copy==true), and the rest is allocated on demand.
      This is not enabled for bare metal, though.
      
      This removes the KVM limitation (implicit, via the alloc_userspace_copy
      parameter) and always allocates just the first level. The on-demand
      allocation of missing levels is already implemented.
      
      Since from now on a DMA map might happen with interrupts disabled, this
      allocates TCEs with GFP_ATOMIC; otherwise lockdep reports errors [1].
      In practice just a single page is allocated there, so the chances of
      failure are quite low.
      
      To save time when creating a new clean table, this skips non-allocated
      indirect TCE entries in pnv_tce_free just like we already do in
      the VFIO IOMMU TCE driver.
      
      This changes the default level number from 1 to 2 to reduce the amount
      of memory required for the default 32bit DMA window at the boot time.
      The default window size is up to 2GB, which requires 4MB of TCEs that
      are unlikely to be used entirely or at all, as most devices these days
      are 64bit capable, so by switching to 2 levels by default we save
      4032KB of RAM per device.
      
      While at it, add __GFP_NOWARN to alloc_pages_node(), as userspace can
      trigger this path via VFIO, see the failure, and try creating a table
      again with different parameters which might succeed.
      
      [1]:
      ===
      BUG: sleeping function called from invalid context at mm/page_alloc.c:4596
      in_atomic(): 1, irqs_disabled(): 1, pid: 1038, name: scsi_eh_1
      2 locks held by scsi_eh_1/1038:
       #0: 000000005efd659a (&host->eh_mutex){+.+.}, at: ata_eh_acquire+0x34/0x80
       #1: 0000000006cf56a6 (&(&host->lock)->rlock){....}, at: ata_exec_internal_sg+0xb0/0x5c0
      irq event stamp: 500
      hardirqs last  enabled at (499): [<c000000000cb8a74>] _raw_spin_unlock_irqrestore+0x94/0xd0
      hardirqs last disabled at (500): [<c000000000cb85c4>] _raw_spin_lock_irqsave+0x44/0x120
      softirqs last  enabled at (0): [<c000000000101120>] copy_process.isra.4.part.5+0x640/0x1a80
      softirqs last disabled at (0): [<0000000000000000>] 0x0
      CPU: 73 PID: 1038 Comm: scsi_eh_1 Not tainted 5.2.0-rc6-le_nv2_aikATfstn1-p1 #634
      Call Trace:
      [c000003d064cef50] [c000000000c8e6c4] dump_stack+0xe8/0x164 (unreliable)
      [c000003d064cefa0] [c00000000014ed78] ___might_sleep+0x2f8/0x310
      [c000003d064cf020] [c0000000003ca084] __alloc_pages_nodemask+0x2a4/0x1560
      [c000003d064cf220] [c0000000000c2530] pnv_alloc_tce_level.isra.0+0x90/0x130
      [c000003d064cf290] [c0000000000c2888] pnv_tce+0x128/0x3b0
      [c000003d064cf360] [c0000000000c2c00] pnv_tce_build+0xb0/0xf0
      [c000003d064cf3c0] [c0000000000bbd9c] pnv_ioda2_tce_build+0x3c/0xb0
      [c000003d064cf400] [c00000000004cfe0] ppc_iommu_map_sg+0x210/0x550
      [c000003d064cf510] [c00000000004b7a4] dma_iommu_map_sg+0x74/0xb0
      [c000003d064cf530] [c000000000863944] ata_qc_issue+0x134/0x470
      [c000003d064cf5b0] [c000000000863ec4] ata_exec_internal_sg+0x244/0x5c0
      [c000003d064cf700] [c0000000008642d0] ata_exec_internal+0x90/0xe0
      [c000003d064cf780] [c0000000008650ac] ata_dev_read_id+0x2ec/0x640
      [c000003d064cf8d0] [c000000000878e28] ata_eh_recover+0x948/0x16d0
      [c000003d064cfa10] [c00000000087d760] sata_pmp_error_handler+0x480/0xbf0
      [c000003d064cfbc0] [c000000000884624] ahci_error_handler+0x74/0xe0
      [c000003d064cfbf0] [c000000000879fa8] ata_scsi_port_error_handler+0x2d8/0x7c0
      [c000003d064cfca0] [c00000000087a544] ata_scsi_error+0xb4/0x100
      [c000003d064cfd00] [c000000000802450] scsi_error_handler+0x120/0x510
      [c000003d064cfdb0] [c000000000140c48] kthread+0x1b8/0x1c0
      [c000003d064cfe20] [c00000000000bd8c] ret_from_kernel_thread+0x5c/0x70
      ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
      irq event stamp: 2305
      
      ========================================================
      hardirqs last  enabled at (2305): [<c00000000000e4c8>] fast_exc_return_irq+0x28/0x34
      hardirqs last disabled at (2303): [<c000000000cb9fd0>] __do_softirq+0x4a0/0x654
      WARNING: possible irq lock inversion dependency detected
      5.2.0-rc6-le_nv2_aikATfstn1-p1 #634 Tainted: G        W
      softirqs last  enabled at (2304): [<c000000000cba054>] __do_softirq+0x524/0x654
      softirqs last disabled at (2297): [<c00000000010f278>] irq_exit+0x128/0x180
      --------------------------------------------------------
      swapper/0/0 just changed the state of lock:
      0000000006cf56a6 (&(&host->lock)->rlock){-...}, at: ahci_single_level_irq_intr+0xac/0x120
      but this lock took another, HARDIRQ-unsafe lock in the past:
       (fs_reclaim){+.+.}
      
      and interrupts could create inverse lock ordering between them.
      
      other info that might help us debug this:
       Possible interrupt unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(fs_reclaim);
                                     local_irq_disable();
                                     lock(&(&host->lock)->rlock);
                                     lock(fs_reclaim);
        <Interrupt>
          lock(&(&host->lock)->rlock);
      
       *** DEADLOCK ***
      
      no locks held by swapper/0/0.
      
      the shortest dependencies between 2nd lock and 1st lock:
       -> (fs_reclaim){+.+.} ops: 167579 {
          HARDIRQ-ON-W at:
                            lock_acquire+0xf8/0x2a0
                            fs_reclaim_acquire.part.23+0x44/0x60
                            kmem_cache_alloc_node_trace+0x80/0x590
                            alloc_desc+0x64/0x270
                            __irq_alloc_descs+0x2e4/0x3a0
                            irq_domain_alloc_descs+0xb0/0x150
                            irq_create_mapping+0x168/0x2c0
                            xics_smp_probe+0x2c/0x98
                            pnv_smp_probe+0x40/0x9c
                            smp_prepare_cpus+0x524/0x6c4
                            kernel_init_freeable+0x1b4/0x650
                            kernel_init+0x2c/0x148
                            ret_from_kernel_thread+0x5c/0x70
          SOFTIRQ-ON-W at:
                            lock_acquire+0xf8/0x2a0
                            fs_reclaim_acquire.part.23+0x44/0x60
                            kmem_cache_alloc_node_trace+0x80/0x590
                            alloc_desc+0x64/0x270
                            __irq_alloc_descs+0x2e4/0x3a0
                            irq_domain_alloc_descs+0xb0/0x150
                            irq_create_mapping+0x168/0x2c0
                            xics_smp_probe+0x2c/0x98
                            pnv_smp_probe+0x40/0x9c
                            smp_prepare_cpus+0x524/0x6c4
                            kernel_init_freeable+0x1b4/0x650
                            kernel_init+0x2c/0x148
                            ret_from_kernel_thread+0x5c/0x70
          INITIAL USE at:
                           lock_acquire+0xf8/0x2a0
                           fs_reclaim_acquire.part.23+0x44/0x60
                           kmem_cache_alloc_node_trace+0x80/0x590
                           alloc_desc+0x64/0x270
                           __irq_alloc_descs+0x2e4/0x3a0
                           irq_domain_alloc_descs+0xb0/0x150
                           irq_create_mapping+0x168/0x2c0
                           xics_smp_probe+0x2c/0x98
                           pnv_smp_probe+0x40/0x9c
                           smp_prepare_cpus+0x524/0x6c4
                           kernel_init_freeable+0x1b4/0x650
                           kernel_init+0x2c/0x148
                           ret_from_kernel_thread+0x5c/0x70
        }
      ===
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: NAlistair Popple <alistair@popple.id.au>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190718051139.74787-4-aik@ozlabs.ru
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      437399ed
  5. 05 Oct 2019, 1 commit
    • M
      powerpc/imc: Dont create debugfs files for cpu-less nodes · ecfe4b5f
      Authored by Madhavan Srinivasan
      commit 41ba17f20ea835c489e77bd54e2da73184e22060 upstream.
      
      Commit <684d9840> ('powerpc/powernv: Add debugfs interface for
      imc-mode and imc') added a debugfs interface for the nest imc pmu
      devices to support changing of different ucode modes, primarily for
      debugging. But when doing so, the code did not consider the case of
      cpu-less nodes, so reading the _cmd_ or _mode_ file of a cpu-less
      node triggers this crash.
      
        Faulting instruction address: 0xc0000000000d0d58
        Oops: Kernel access of bad area, sig: 11 [#1]
        ...
        CPU: 67 PID: 5301 Comm: cat Not tainted 5.2.0-rc6-next-20190627+ #19
        NIP:  c0000000000d0d58 LR: c00000000049aa18 CTR:c0000000000d0d50
        REGS: c00020194548f9e0 TRAP: 0300   Not tainted  (5.2.0-rc6-next-20190627+)
        MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR:28022822  XER: 00000000
        CFAR: c00000000049aa14 DAR: 000000000003fc08 DSISR:40000000 IRQMASK: 0
        ...
        NIP imc_mem_get+0x8/0x20
        LR  simple_attr_read+0x118/0x170
        Call Trace:
          simple_attr_read+0x70/0x170 (unreliable)
          debugfs_attr_read+0x6c/0xb0
          __vfs_read+0x3c/0x70
           vfs_read+0xbc/0x1a0
          ksys_read+0x7c/0x140
          system_call+0x5c/0x70
      
      The patch fixes the issue with a more robust check for vbase being NULL.
      
      Before patch, ls output for the debugfs imc directory
      
        # ls /sys/kernel/debug/powerpc/imc/
        imc_cmd_0    imc_cmd_251  imc_cmd_253  imc_cmd_255  imc_mode_0    imc_mode_251  imc_mode_253  imc_mode_255
        imc_cmd_250  imc_cmd_252  imc_cmd_254  imc_cmd_8    imc_mode_250  imc_mode_252  imc_mode_254  imc_mode_8
      
      After patch, ls output for the debugfs imc directory
      
        # ls /sys/kernel/debug/powerpc/imc/
        imc_cmd_0  imc_cmd_8  imc_mode_0  imc_mode_8
      
      The actual bug here is that we have two loops with potentially
      different loop counts. That is, in imc_get_mem_addr_nest() the loop
      count is obtained from the dt entries, but in
      export_imc_mode_and_cmd() the loop was based on the for_each_nid()
      count. The patch fixes the loop count in the latter based on the
      struct mem_info. Ideally it would be better to have the array size
      in struct imc_pmu.
      
      Fixes: 684d9840 ('powerpc/powernv: Add debugfs interface for imc-mode and imc')
      Reported-by: NQian Cai <cai@lca.pw>
      Suggested-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NMadhavan Srinivasan <maddy@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190827101635.6942-1-maddy@linux.vnet.ibm.com
      Cc: Jan Stancek <jstancek@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ecfe4b5f
  6. 01 Oct 2019, 1 commit
  7. 21 Sep 2019, 1 commit
  8. 19 Sep 2019, 1 commit
  9. 16 Sep 2019, 10 commits
    • G
      powerpc/tm: Fix restoring FP/VMX facility incorrectly on interrupts · 569775bd
      Authored by Gustavo Romero
      [ Upstream commit a8318c13e79badb92bc6640704a64cc022a6eb97 ]
      
      When in userspace and MSR FP=0 the hardware FP state is unrelated to
      the current process. This is extended for transactions where if tbegin
      is run with FP=0, the hardware checkpoint FP state will also be
      unrelated to the current process. Due to this, we need to ensure this
      hardware checkpoint is updated with the correct state before we enable
      FP for this process.
      
      Unfortunately we get this wrong when returning to a process from a
      hardware interrupt. A process that starts a transaction with FP=0 can
      take an interrupt. When the kernel returns back to that process, we
      change to FP=1 but with hardware checkpoint FP state not updated. If
      this transaction is then rolled back, the FP registers now contain the
      wrong state.
      
      The process looks like this:
         Userspace:                      Kernel
      
                     Start userspace
                      with MSR FP=0 TM=1
                        < -----
         ...
         tbegin
         bne
                     Hardware interrupt
                         ---- >
                                          <do_IRQ...>
                                          ....
                                          ret_from_except
                                            restore_math()
      				        /* sees FP=0 */
                                              restore_fp()
                                                tm_active_with_fp()
      					    /* sees FP=1 (Incorrect) */
                                                load_fp_state()
                                              FP = 0 -> 1
                        < -----
                     Return to userspace
                       with MSR TM=1 FP=1
                       with junk in the FP TM checkpoint
         TM rollback
         reads FP junk
      
      When returning from the hardware exception, tm_active_with_fp() is
      incorrectly making restore_fp() call load_fp_state() which is setting
      FP=1.
      
      The fix is to remove tm_active_with_fp().
      
      tm_active_with_fp() is attempting to handle the case where FP state
      has been changed inside a transaction. In this case the checkpointed
      and transactional FP state is different and hence we must restore the
      FP state (ie. we can't do lazy FP restore inside a transaction that's
      used FP). It's safe to remove tm_active_with_fp() as this case is
      handled by restore_tm_state(). restore_tm_state() detects if FP has
      been used inside a transaction and will set load_fp and call
      restore_math() to ensure the FP state (checkpoint and transaction) is
      restored.
      
      This is a data integrity problem for the current process as the FP
      registers are corrupted. It's also a security problem as the FP
      registers from one process may be leaked to another.
      
      Similarly for VMX.
      
      A simple testcase to replicate this will be posted to
      tools/testing/selftests/powerpc/tm/tm-poison.c
      
      This fixes CVE-2019-15031.
      
      Fixes: a7771176 ("powerpc: Don't enable FP/Altivec if not checkpointed")
      Cc: stable@vger.kernel.org # 4.15+
      Signed-off-by: NGustavo Romero <gromero@linux.ibm.com>
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190904045529.23002-2-gromero@linux.vnet.ibm.com
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      569775bd
    • B
      powerpc/tm: Remove msr_tm_active() · 052bc385
      Authored by Breno Leitao
      [ Upstream commit 5c784c8414fba11b62e12439f11e109fb5751f38 ]
      
      Currently msr_tm_active() is a wrapper around MSR_TM_ACTIVE() if
      CONFIG_PPC_TRANSACTIONAL_MEM is set, or it is just a function that
      returns false if CONFIG_PPC_TRANSACTIONAL_MEM is not set.
      
      This function is not necessary, since MSR_TM_ACTIVE() does the same
      thing and can be used directly, removing the duplication and
      simplifying the code.
      
      This patch removes every instance of msr_tm_active() and replaces it
      with MSR_TM_ACTIVE().
      Signed-off-by: NBreno Leitao <leitao@debian.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      052bc385
    • S
      powerpc/mm: Limit rma_size to 1TB when running without HV mode · c4fc7cb9
      Authored by Suraj Jitindar Singh
      [ Upstream commit da0ef93310e67ae6902efded60b6724dab27a5d1 ]
      
      The virtual real mode addressing (VRMA) mechanism is used when a
      partition is using HPT (Hash Page Table) translation and performs real
      mode accesses (MSR[IR|DR] = 0) in non-hypervisor mode. In this mode
      effective address bits 0:23 are treated as zero (i.e. the access is
      aliased to 0) and the access is performed using an implicit 1TB SLB
      entry.
      
      The size of the RMA (Real Memory Area) is communicated to the guest
      as the size of the first memory region in the device tree and,
      because of the mechanism described above, can be expected not to
      exceed 1TB. In the event that the host erroneously represents the
      RMA as being larger than 1TB, guest accesses in real mode to memory
      addresses above 1TB will be aliased down to below 1TB. This means
      that a memory access performed in real mode may differ from one
      performed in virtual mode for the same memory address, which would
      likely have unintended consequences.
      
      To avoid this outcome have the guest explicitly limit the size of the
      RMA to the current maximum, which is 1TB. This means that even if the
      first memory block is larger than 1TB, only the first 1TB should be
      accessed in real mode.
      
      Fixes: c610d65c ("powerpc/pseries: lift RTAS limit for hash")
      Cc: stable@vger.kernel.org # v4.16+
      Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Tested-by: NSatheesh Rajendran <sathnaga@linux.vnet.ibm.com>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190710052018.14628-1-sjitindarsingh@gmail.com
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      c4fc7cb9
    • M
      KVM: PPC: Book3S HV: Fix CR0 setting in TM emulation · 3a1b79ad
      Authored by Michael Neuling
      [ Upstream commit 3fefd1cd95df04da67c83c1cb93b663f04b3324f ]
      
      When emulating tsr, treclaim and trechkpt, we incorrectly set CR0. The
      code currently sets:
          CR0 <- 00 || MSR[TS]
      but according to the ISA it should be:
          CR0 <-  0 || MSR[TS] || 0
      
      This fixes the bit shift to put the bits in the correct location.
      
      This is a data integrity issue as CR0 is corrupted.
      
      Fixes: 4bb3c7a0 ("KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9")
      Cc: stable@vger.kernel.org # v4.17+
      Tested-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      3a1b79ad
    • P
      KVM: PPC: Use ccr field in pt_regs struct embedded in vcpu struct · 3ac71806
      Authored by Paul Mackerras
      [ Upstream commit fd0944baad806dfb4c777124ec712c55b714ff51 ]
      
      When the 'regs' field was added to struct kvm_vcpu_arch, the code
      was changed to use several of the fields inside regs (e.g., gpr, lr,
      etc.) but not the ccr field, because the ccr field in struct pt_regs
      is 64 bits on 64-bit platforms, but the cr field in kvm_vcpu_arch is
      only 32 bits.  This changes the code to use the regs.ccr field
      instead of cr, and changes the assembly code on 64-bit platforms to
      use 64-bit loads and stores instead of 32-bit ones.
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      3ac71806
    • M
      powerpc/kvm: Save and restore host AMR/IAMR/UAMOR · 915c9d0a
      Authored by Michael Ellerman
      [ Upstream commit c3c7470c75566a077c8dc71dcf8f1948b8ddfab4 ]
      
      When the hash MMU is active the AMR, IAMR and UAMOR are used for
      pkeys. The AMR is directly writable by user space, and the UAMOR masks
      those writes, meaning both registers are effectively user register
      state. The IAMR is used to create an execute only key.
      
      Also we must maintain the value of at least the AMR when running in
      process context, so that any memory accesses done by the kernel on
      behalf of the process are correctly controlled by the AMR.
      
      Although we are correctly switching all registers when going into a
      guest, on returning to the host we just write 0 into all regs, except
      on Power9 where we restore the IAMR correctly.
      
      This could be observed by a user process if it writes the AMR, then
      runs a guest and we then return immediately to it without
      rescheduling. Because we have written 0 to the AMR that would have the
      effect of granting read/write permission to pages that the process was
      trying to protect.
      
      In addition, when using the Radix MMU, the AMR can prevent inadvertent
      kernel access to userspace data, writing 0 to the AMR disables that
      protection.
      
      So save and restore AMR, IAMR and UAMOR.
      
      Fixes: cf43d3b2 ("powerpc: Enable pkey subsystem")
      Cc: stable@vger.kernel.org # v4.16+
      Signed-off-by: NRussell Currey <ruscur@russell.cc>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Acked-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      915c9d0a
    • R
      powerpc/pkeys: Fix handling of pkey state across fork() · cfbf227e
      Authored by Ram Pai
      [ Upstream commit 2cd4bd192ee94848695c1c052d87913260e10f36 ]
      
      Protection key tracking information is not copied over to the
      mm_struct of the child during fork(). This can cause the child to
      erroneously allocate keys that were already allocated. Any allocated
      execute-only key is lost as well.
      
      Add code, called by dup_mmap(), to copy the pkey state from parent to
      child explicitly.
      
      This problem was originally found by Dave Hansen on x86, which turns
      out to be a problem on powerpc as well.
      
      Fixes: cf43d3b2 ("powerpc: Enable pkey subsystem")
      Cc: stable@vger.kernel.org # v4.16+
      Reviewed-by: NThiago Jung Bauermann <bauerman@linux.ibm.com>
      Signed-off-by: NRam Pai <linuxram@us.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      cfbf227e
    • P
      KVM: PPC: Book3S HV: Fix race between kvm_unmap_hva_range and MMU mode switch · d3984e80
      Authored by Paul Mackerras
      [ Upstream commit 234ff0b729ad882d20f7996591a964965647addf ]
      
      Testing has revealed an occasional crash which appears to be caused
      by a race between kvmppc_switch_mmu_to_hpt and kvm_unmap_hva_range_hv.
      The symptom is a NULL pointer dereference in __find_linux_pte() called
      from kvm_unmap_radix() with kvm->arch.pgtable == NULL.
      
      Looking at kvmppc_switch_mmu_to_hpt(), it does indeed clear
      kvm->arch.pgtable (via kvmppc_free_radix()) before setting
      kvm->arch.radix to NULL, and there is nothing to prevent
      kvm_unmap_hva_range_hv() or the other MMU callback functions from
      being called concurrently with kvmppc_switch_mmu_to_hpt() or
      kvmppc_switch_mmu_to_radix().
      
      This patch therefore adds calls to spin_lock/unlock on the kvm->mmu_lock
      around the assignments to kvm->arch.radix, and makes sure that the
      partition-scoped radix tree or HPT is only freed after changing
      kvm->arch.radix.
      
      This also takes the kvm->mmu_lock in kvmppc_rmap_reset() to make sure
      that the clearing of each rmap array (one per memslot) doesn't happen
      concurrently with use of the array in the kvm_unmap_hva_range_hv()
      or the other MMU callbacks.
      
      Fixes: 18c3640c ("KVM: PPC: Book3S HV: Add infrastructure for running HPT guests on radix host")
      Cc: stable@vger.kernel.org # v4.15+
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      d3984e80
    • C
      powerpc/64: mark start_here_multiplatform as __ref · 7f8b2360
      Authored by Christophe Leroy
      [ Upstream commit 9c4e4c90ec24652921e31e9551fcaedc26eec86d ]
      
      Otherwise, the following warning is encountered:
      
      WARNING: vmlinux.o(.text+0x3dc6): Section mismatch in reference from the variable start_here_multiplatform to the function .init.text:.early_setup()
      The function start_here_multiplatform() references
      the function __init .early_setup().
      This is often because start_here_multiplatform lacks a __init
      annotation or the annotation of .early_setup is wrong.
      
      Fixes: 56c46bba9bbf ("powerpc/64: Fix booting large kernels with STRICT_KERNEL_RWX")
      Cc: Russell Currey <ruscur@russell.cc>
      Signed-off-by: NChristophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      7f8b2360
    • G
      powerpc/tm: Fix FP/VMX unavailable exceptions inside a transaction · 47a0f70d
      Authored by Gustavo Romero
      commit 8205d5d98ef7f155de211f5e2eb6ca03d95a5a60 upstream.
      
      When we take an FP unavailable exception in a transaction we have to
      account for the hardware FP TM checkpointed registers being
      incorrect. In this case for this process we know the current and
      checkpointed FP registers must be the same (since FP wasn't used
      inside the transaction) hence in the thread_struct we copy the current
      FP registers to the checkpointed ones.
      
      This copy is done in tm_reclaim_thread(). We use thread->ckpt_regs.msr
      to determine if FP was on when in userspace. thread->ckpt_regs.msr
      represents the state of the MSR when exiting userspace. This is setup
      by check_if_tm_restore_required().
      
      Unfortunately there is an optimisation in giveup_all() which returns
      early if tsk->thread.regs->msr (via local variable `usermsr`) has
      FP=VEC=VSX=SPE=0. This optimisation means that
      check_if_tm_restore_required() is not called and hence
      thread->ckpt_regs.msr is not updated and will contain an old value.
      
      This can happen if, due to load_fp=255, we start a userspace process
      with MSR FP=1 and then we are context switched out. In this case
      thread->ckpt_regs.msr will contain FP=1. If that same process is then
      context switched in and load_fp overflows, MSR will have FP=0. If that
      process now enters a transaction and does an FP instruction, the FP
      unavailable will not update thread->ckpt_regs.msr (the bug) and MSR
      FP=1 will be retained in thread->ckpt_regs.msr.  tm_reclaim_thread()
      will then not perform the required memcpy and the checkpointed FP regs
      in the thread struct will contain the wrong values.
      
      The code path for this happening is:
      
             Userspace:                      Kernel
                         Start userspace
                          with MSR FP/VEC/VSX/SPE=0 TM=1
                            < -----
             ...
             tbegin
             bne
             fp instruction
                         FP unavailable
                             ---- >
                                              fp_unavailable_tm()
      					  tm_reclaim_current()
      					    tm_reclaim_thread()
      					      giveup_all()
      					        return early since FP/VMX/VSX=0
      						/* ckpt MSR not updated (Incorrect) */
      					      tm_reclaim()
      					        /* thread_struct ckpt FP regs contain junk (OK) */
                                                    /* Sees ckpt MSR FP=1 (Incorrect) */
      					      no memcpy() performed
      					        /* thread_struct ckpt FP regs not fixed (Incorrect) */
      					  tm_recheckpoint()
      					     /* Put junk in hardware checkpoint FP regs */
                                               ....
                            < -----
                         Return to userspace
                           with MSR TM=1 FP=1
                           with junk in the FP TM checkpoint
             TM rollback
             reads FP junk
      
      This is a data integrity problem for the current process as the FP
      registers are corrupted. It's also a security problem as the FP
      registers from one process may be leaked to another.
      
      This patch moves up check_if_tm_restore_required() in giveup_all() to
      ensure thread->ckpt_regs.msr is updated correctly.
      
      A simple testcase to replicate this will be posted to
      tools/testing/selftests/powerpc/tm/tm-poison.c
      
      Similarly for VMX.
      
      This fixes CVE-2019-15030.
      
      Fixes: f48e91e8 ("powerpc/tm: Fix FP and VMX register corruption")
      Cc: stable@vger.kernel.org # 4.12+
      Signed-off-by: NGustavo Romero <gromero@linux.vnet.ibm.com>
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190904045529.23002-1-gromero@linux.vnet.ibm.com
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      47a0f70d
  10. 06 Sep 2019, 1 commit
  11. 29 Aug 2019, 1 commit
  12. 16 Aug 2019, 1 commit
    • W
      KVM: Fix leak vCPU's VMCS value into other pCPU · 2bc73d91
      Authored by Wanpeng Li
      commit 17e433b54393a6269acbcb792da97791fe1592d8 upstream.
      
      After commit d73eb57b80b (KVM: Boost vCPUs that are delivering interrupts), a
      five-year-old bug is exposed. When running the ebizzy benchmark in three
      80-vCPU VMs on one 80-pCPU Skylake server, a lot of rcu_sched stall warnings
      splat in the VMs after stress testing:
      
       INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073)
       Call Trace:
         flush_tlb_mm_range+0x68/0x140
         tlb_flush_mmu.part.75+0x37/0xe0
         tlb_finish_mmu+0x55/0x60
         zap_page_range+0x142/0x190
         SyS_madvise+0x3cd/0x9c0
         system_call_fastpath+0x1c/0x21
      
      swait_active() remains true before finish_swait() is called in
      kvm_vcpu_block(), and voluntarily preempted vCPUs are taken into
      account by the kvm_vcpu_on_spin() loop, which greatly increases the
      probability that the condition kvm_arch_vcpu_runnable(vcpu) is
      checked and can be true. When APICv is enabled, the yield-candidate
      vCPU's VMCS RVI field leaks (via vmx_sync_pir_to_irr()) into the
      spinning-on-a-taken-lock vCPU's current VMCS.
      
      This patch fixes it by checking conservatively a subset of events.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Marc Zyngier <Marc.Zyngier@arm.com>
      Cc: stable@vger.kernel.org
      Fixes: 98f4a146 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop)
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2bc73d91