1. 17 5月, 2018 5 次提交
    • P
      KVM: PPC: Book 3S HV: Do ptesync in radix guest exit path · df158189
      Paul Mackerras 提交于
      A radix guest can execute tlbie instructions to invalidate TLB entries.
      After a tlbie or a group of tlbies, it must then do the architected
      sequence eieio; tlbsync; ptesync to ensure that the TLB invalidation
      has been processed by all CPUs in the system before it can rely on
      no CPU using any translation that it just invalidated.
      
      In fact it is the ptesync which does the actual synchronization in
      this sequence, and hardware has a requirement that the ptesync must
      be executed on the same CPU thread as the tlbies which it is expected
      to order.  Thus, if a vCPU gets moved from one physical CPU to
      another after it has done some tlbies but before it can get to do the
      ptesync, the ptesync will not have the desired effect when it is
      executed on the second physical CPU.
      
      To fix this, we do a ptesync in the exit path for radix guests.  If
      there are any pending tlbies, this will wait for them to complete.
      If there aren't, then ptesync will just do the same as sync.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      df158189
    • B
      KVM: PPC: Book3S HV: XIVE: Resend re-routed interrupts on CPU priority change · 9dc81d6b
      Benjamin Herrenschmidt 提交于
      When a vcpu priority (CPPR) is set to a lower value (masking more
      interrupts), we stop processing interrupts already in the queue
      for the priorities that have now been masked.
      
      If those interrupts were previously re-routed to a different
      CPU, they might still be stuck until the older one that has
      them in its queue processes them. In the case of guest CPU
      unplug, that can be never.
      
      To address that without creating additional overhead for
      the normal interrupt processing path, this changes H_CPPR
      handling so that when such a priority change occurs, we
      scan the interrupt queue for that vCPU, and for any
      interrupt in there that has been re-routed, we replace it
      with a dummy and force a re-trigger.
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Tested-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      9dc81d6b
    • N
      KVM: PPC: Book3S HV: Make radix clear pte when unmapping · 7e3d9a1d
      Nicholas Piggin 提交于
      The current partition table unmap code clears the _PAGE_PRESENT bit
      out of the pte, which leaves pud_huge/pmd_huge true and does not
      clear pud_present/pmd_present.  This can confuse subsequent page
      faults and possibly lead to the guest looping doing continual
      hypervisor page faults.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      7e3d9a1d
    • N
      KVM: PPC: Book3S HV: Make radix use correct tlbie sequence in kvmppc_radix_tlbie_page · e2560b10
      Nicholas Piggin 提交于
      The standard eieio ; tlbsync ; ptesync must follow tlbie to ensure it
      is ordered with respect to subsequent operations.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      e2560b10
    • P
      KVM: PPC: Book3S HV: Snapshot timebase offset on guest entry · 57b8daa7
      Paul Mackerras 提交于
      Currently, the HV KVM guest entry/exit code adds the timebase offset
      from the vcore struct to the timebase on guest entry, and subtracts
      it on guest exit.  Which is fine, except that it is possible for
      userspace to change the offset using the SET_ONE_REG interface while
      the vcore is running, as there is only one timebase offset per vcore
      but potentially multiple VCPUs in the vcore.  If that were to happen,
      KVM would subtract a different offset on guest exit from that which
      it had added on guest entry, leading to the timebase being out of sync
      between cores in the host, which then leads to bad things happening
      such as hangs and spurious watchdog timeouts.
      
      To fix this, we add a new field 'tb_offset_applied' to the vcore struct
      which stores the offset that is currently applied to the timebase.
      This value is set from the vcore tb_offset field on guest entry, and
      is what is subtracted from the timebase on guest exit.  Since it is
      zero when the timebase offset is not applied, we can simplify the
      logic in kvmhv_start_timing and kvmhv_accumulate_time.
      
      In addition, we had secondary threads reading the timebase while
      running concurrently with code on the primary thread which would
      eventually add or subtract the timebase offset from the timebase.
      This occurred while saving or restoring the DEC register value on
      the secondary threads.  Although no specific incorrect behaviour has
      been observed, this is a race which should be fixed.  To fix it, we
      move the DEC saving code to just before we call kvmhv_commence_exit,
      and the DEC restoring code to after the point where we have waited
      for the primary thread to switch the MMU context and add the timebase
      offset.  That way we are sure that the timebase contains the guest
      timebase value in both cases.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      57b8daa7
  2. 27 4月, 2018 2 次提交
  3. 25 4月, 2018 2 次提交
    • N
      powerpc: Fix smp_send_stop NMI IPI handling · ac61c115
      Nicholas Piggin 提交于
      The NMI IPI handler for a receiving CPU increments nmi_ipi_busy_count
      over the handler function call, which causes later smp_send_nmi_ipi()
      callers to spin until the call is finished.
      
      The stop_this_cpu() function never returns, so the busy count is never
      decremeted, which can cause the system to hang in some cases. For
      example panic() will call smp_send_stop() early on which calls
      stop_this_cpu() on other CPUs, then later in the reboot path,
      pnv_restart() will call smp_send_stop() again, which hangs.
      
      Fix this by adding a special case to the stop_this_cpu() handler to
      decrement the busy count, because it will never return.
      
      Now that the NMI/non-NMI versions of stop_this_cpu() are different,
      split them out into separate functions rather than doing #ifdef tricks
      to share the body between the two functions.
      
      Fixes: 6bed3237 ("powerpc: use NMI IPI for smp_send_stop")
      Reported-by: NAbdul Haleem <abdhalee@linux.vnet.ibm.com>
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      [mpe: Split out the functions, tweak change log a bit]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      ac61c115
    • N
      rtc: opal: Fix OPAL RTC driver OPAL_BUSY loops · 682e6b4d
      Nicholas Piggin 提交于
      The OPAL RTC driver does not sleep in case it gets OPAL_BUSY or
      OPAL_BUSY_EVENT from firmware, which causes large scheduling
      latencies, up to 50 seconds have been observed here when RTC stops
      responding (BMC reboot can do it).
      
      Fix this by converting it to the standard form OPAL_BUSY loop that
      sleeps.
      
      Fixes: 628daa8d ("powerpc/powernv: Add RTC and NVRAM support plus RTAS fallbacks")
      Cc: stable@vger.kernel.org # v3.2+
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Acked-by: NAlexandre Belloni <alexandre.belloni@bootlin.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      682e6b4d
  4. 24 4月, 2018 6 次提交
    • M
      powerpc/mce: Fix a bug where mce loops on memory UE. · 75ecfb49
      Mahesh Salgaonkar 提交于
      The current code extracts the physical address for UE errors and then
      hooks it up into memory failure infrastructure. On successful
      extraction of physical address it wrongly sets "handled = 1" which
      means this UE error has been recovered. Since MCE handler gets return
      value as handled = 1, it assumes that error has been recovered and
      goes back to same NIP. This causes MCE interrupt again and again in a
      loop leading to hard lockup.
      
      Also, initialize phys_addr to ULONG_MAX so that we don't end up
      queuing undesired page to hwpoison.
      
      Without this patch we see:
        Severe Machine check interrupt [Recovered]
          NIP: [000000001002588c] PID: 7109 Comm: find
          Initiator: CPU
          Error type: UE [Load/Store]
            Effective address: 00007fffd2755940
            Physical address:  000020181a080000
        ...
        Severe Machine check interrupt [Recovered]
          NIP: [000000001002588c] PID: 7109 Comm: find
          Initiator: CPU
          Error type: UE [Load/Store]
            Effective address: 00007fffd2755940
            Physical address:  000020181a080000
        Severe Machine check interrupt [Recovered]
          NIP: [000000001002588c] PID: 7109 Comm: find
          Initiator: CPU
          Error type: UE [Load/Store]
            Effective address: 00007fffd2755940
            Physical address:  000020181a080000
        Memory failure: 0x20181a08: recovery action for dirty LRU page: Recovered
        Memory failure: 0x20181a08: already hardware poisoned
        Memory failure: 0x20181a08: already hardware poisoned
        Memory failure: 0x20181a08: already hardware poisoned
        Memory failure: 0x20181a08: already hardware poisoned
        Memory failure: 0x20181a08: already hardware poisoned
        Memory failure: 0x20181a08: already hardware poisoned
        ...
        Watchdog CPU:38 Hard LOCKUP
      
      After this patch we see:
      
        Severe Machine check interrupt [Not recovered]
          NIP: [00007fffaae585f4] PID: 7168 Comm: find
          Initiator: CPU
          Error type: UE [Load/Store]
            Effective address: 00007fffaafe28ac
            Physical address:  00002017c0bd0000
        find[7168]: unhandled signal 7 at 00007fffaae585f4 nip 00007fffaae585f4 lr 00007fffaae585e0 code 4
        Memory failure: 0x2017c0bd: recovery action for dirty LRU page: Recovered
      
      Fixes: 01eaac2b ("powerpc/mce: Hookup ierror (instruction) UE errors")
      Fixes: ba41e1e1 ("powerpc/mce: Hookup derror (load/store) UE errors")
      Cc: stable@vger.kernel.org # v4.15+
      Signed-off-by: NMahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: NBalbir Singh <bsingharora@gmail.com>
      Reviewed-by: NBalbir Singh <bsingharora@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      75ecfb49
    • A
      powerpc/powernv/npu: Do a PID GPU TLB flush when invalidating a large address range · d0cf9b56
      Alistair Popple 提交于
      The NPU has a limited number of address translation shootdown (ATSD)
      registers and the GPU has limited bandwidth to process ATSDs. This can
      result in contention of ATSD registers leading to soft lockups on some
      threads, particularly when invalidating a large address range in
      pnv_npu2_mn_invalidate_range().
      
      At some threshold it becomes more efficient to flush the entire GPU
      TLB for the given MM context (PID) than individually flushing each
      address in the range. This patch will result in ranges greater than
      2MB being converted from 32+ ATSDs into a single ATSD which will flush
      the TLB for the given PID on each GPU.
      
      Fixes: 1ab66d1f ("powerpc/powernv: Introduce address translation services for Nvlink2")
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: NAlistair Popple <alistair@popple.id.au>
      Acked-by: NBalbir Singh <bsingharora@gmail.com>
      Tested-by: NBalbir Singh <bsingharora@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      d0cf9b56
    • A
      powerpc/powernv/npu: Prevent overwriting of pnv_npu2_init_contex() callback parameters · a1409ada
      Alistair Popple 提交于
      There is a single npu context per set of callback parameters. Callers
      should be prevented from overwriting existing callback values so
      instead return an error if different parameters are passed.
      
      Fixes: 1ab66d1f ("powerpc/powernv: Introduce address translation services for Nvlink2")
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: NAlistair Popple <alistair@popple.id.au>
      Reviewed-by: NMark Hairgrove <mhairgrove@nvidia.com>
      Tested-by: NMark Hairgrove <mhairgrove@nvidia.com>
      Reviewed-by: NBalbir Singh <bsingharora@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      a1409ada
    • A
      powerpc/powernv/npu: Add lock to prevent race in concurrent context init/destroy · 28a5933e
      Alistair Popple 提交于
      The pnv_npu2_init_context() and pnv_npu2_destroy_context() functions
      are used to allocate/free contexts to allow address translation and
      shootdown by the NPU on a particular GPU. Context initialisation is
      implicitly safe as it is protected by the requirement mmap_sem be held
      in write mode, however pnv_npu2_destroy_context() does not require
      mmap_sem to be held and it is not safe to call with a concurrent
      initialisation for a different GPU.
      
      It was assumed the driver would ensure destruction was not called
      concurrently with initialisation. However the driver may be simplified
      by allowing concurrent initialisation and destruction for different
      GPUs. As npu context creation/destruction is not a performance
      critical path and the critical section is not large a single spinlock
      is used for simplicity.
      
      Fixes: 1ab66d1f ("powerpc/powernv: Introduce address translation services for Nvlink2")
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: NAlistair Popple <alistair@popple.id.au>
      Reviewed-by: NMark Hairgrove <mhairgrove@nvidia.com>
      Tested-by: NMark Hairgrove <mhairgrove@nvidia.com>
      Reviewed-by: NBalbir Singh <bsingharora@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      28a5933e
    • B
      powerpc/powernv/memtrace: Let the arch hotunplug code flush cache · 7fd6641d
      Balbir Singh 提交于
      Don't do this via custom code, instead now that we have support in the
      arch hotplug/hotunplug code, rely on those routines to do the right
      thing.
      
      The existing flush doesn't work because it uses ppc64_caches.l1d.size
      instead of ppc64_caches.l1d.line_size.
      
      Fixes: 9d5171a8 ("powerpc/powernv: Enable removal of memory for in memory tracing")
      Signed-off-by: NBalbir Singh <bsingharora@gmail.com>
      Reviewed-by: NRashmica Gupta <rashmica.g@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      7fd6641d
    • B
      powerpc/mm: Flush cache on memory hot(un)plug · fb5924fd
      Balbir Singh 提交于
      This patch adds support for flushing potentially dirty cache lines
      when memory is hot-plugged/hot-un-plugged. The support is currently
      limited to 64 bit systems.
      
      The bug was exposed when mappings for a device were actually
      hot-unplugged and plugged in back later. A similar issue was observed
      during the development of memtrace, but memtrace does it's own
      flushing of region via a custom routine.
      
      These patches do a flush both on hotplug/unplug to clear any stale
      data in the cache w.r.t mappings, there is a small race window where a
      clean cache line may be created again just prior to tearing down the
      mapping.
      
      The patches were tested by disabling the flush routines in memtrace
      and doing I/O on the trace file. The system immediately
      checkstops (quite reliablly if prior to the hot-unplug of the memtrace
      region, we memset the regions we are about to hot unplug). After these
      patches no custom flushing is needed in the memtrace code.
      
      Fixes: 9d5171a8 ("powerpc/powernv: Enable removal of memory for in memory tracing")
      Cc: stable@vger.kernel.org # v4.14+
      Signed-off-by: NBalbir Singh <bsingharora@gmail.com>
      Acked-by: NReza Arbab <arbab@linux.ibm.com>
      Reviewed-by: NRashmica Gupta <rashmica.g@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      fb5924fd
  5. 21 4月, 2018 1 次提交
  6. 19 4月, 2018 2 次提交
    • M
      powerpc/kvm: Fix lockups when running KVM guests on Power8 · 56376c58
      Michael Ellerman 提交于
      When running KVM guests on Power8 we can see a lockup where one CPU
      stops responding. This often leads to a message such as:
      
        watchdog: CPU 136 detected hard LOCKUP on other CPUs 72
        Task dump for CPU 72:
        qemu-system-ppc R  running task    10560 20917  20908 0x00040004
      
      And then backtraces on other CPUs, such as:
      
        Task dump for CPU 48:
        ksmd            R  running task    10032  1519      2 0x00000804
        Call Trace:
          ...
          --- interrupt: 901 at smp_call_function_many+0x3c8/0x460
              LR = smp_call_function_many+0x37c/0x460
          pmdp_invalidate+0x100/0x1b0
          __split_huge_pmd+0x52c/0xdb0
          try_to_unmap_one+0x764/0x8b0
          rmap_walk_anon+0x15c/0x370
          try_to_unmap+0xb4/0x170
          split_huge_page_to_list+0x148/0xa30
          try_to_merge_one_page+0xc8/0x990
          try_to_merge_with_ksm_page+0x74/0xf0
          ksm_scan_thread+0x10ec/0x1ac0
          kthread+0x160/0x1a0
          ret_from_kernel_thread+0x5c/0x78
      
      This is caused by commit 8c1c7fb0 ("powerpc/64s/idle: avoid sync
      for KVM state when waking from idle"), which added a check in
      pnv_powersave_wakeup() to see if the kvm_hstate.hwthread_state is
      already set to KVM_HWTHREAD_IN_KERNEL, and if so to skip the store and
      test of kvm_hstate.hwthread_req.
      
      The problem is that the primary does not set KVM_HWTHREAD_IN_KVM when
      entering the guest, so it can then come out to cede with
      KVM_HWTHREAD_IN_KERNEL set. It can then go idle in kvm_do_nap after
      setting hwthread_req to 1, but because hwthread_state is still
      KVM_HWTHREAD_IN_KERNEL we will skip the test of hwthread_req when we
      wake up from idle and won't go to kvm_start_guest. From there the
      thread will return somewhere garbage and crash.
      
      Fix it by skipping the store of hwthread_state, but not the test of
      hwthread_req, when coming out of idle. It's OK to skip the sync in
      that case because hwthread_req will have been set on the same thread,
      so there is no synchronisation required.
      
      Fixes: 8c1c7fb0 ("powerpc/64s/idle: avoid sync for KVM state when waking from idle")
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      56376c58
    • M
      powerpc/eeh: Fix enabling bridge MMIO windows · 13a83eac
      Michael Neuling 提交于
      On boot we save the configuration space of PCIe bridges. We do this so
      when we get an EEH event and everything gets reset that we can restore
      them.
      
      Unfortunately we save this state before we've enabled the MMIO space
      on the bridges. Hence if we have to reset the bridge when we come back
      MMIO is not enabled and we end up taking an PE freeze when the driver
      starts accessing again.
      
      This patch forces the memory/MMIO and bus mastering on when restoring
      bridges on EEH. Ideally we'd do this correctly by saving the
      configuration space writes later, but that will have to come later in
      a larger EEH rewrite. For now we have this simple fix.
      
      The original bug can be triggered on a boston machine by doing:
        echo 0x8000000000000000 > /sys/kernel/debug/powerpc/PCI0001/err_injct_outbound
      On boston, this PHB has a PCIe switch on it.  Without this patch,
      you'll see two EEH events, 1 expected and 1 the failure we are fixing
      here. The second EEH event causes the anything under the PHB to
      disappear (i.e. the i40e eth).
      
      With this patch, only 1 EEH event occurs and devices properly recover.
      
      Fixes: 652defed ("powerpc/eeh: Check PCIe link after reset")
      Cc: stable@vger.kernel.org # v3.11+
      Reported-by: NPridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      Acked-by: NRussell Currey <ruscur@russell.cc>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      13a83eac
  7. 18 4月, 2018 1 次提交
  8. 17 4月, 2018 1 次提交
  9. 16 4月, 2018 1 次提交
    • M
      powerpc/lib: Fix off-by-one in alternate feature patching · b8858581
      Michael Ellerman 提交于
      When we patch an alternate feature section, we have to adjust any
      relative branches that branch out of the alternate section.
      
      But currently we have a bug if we have a branch that points to past
      the last instruction of the alternate section, eg:
      
        FTR_SECTION_ELSE
        1:     b       2f
               or      6,6,6
        2:
        ALT_FTR_SECTION_END(...)
               nop
      
      This will result in a relative branch at 1 with a target that equals
      the end of the alternate section.
      
      That branch does not need adjusting when it's moved to the non-else
      location. Currently we do adjust it, resulting in a branch that goes
      off into the link-time location of the else section, which is junk.
      
      The fix is to not patch branches that have a target == end of the
      alternate section.
      
      Fixes: d20fe50a ("KVM: PPC: Book3S HV: Branch inside feature section")
      Fixes: 9b1a735d ("powerpc: Add logic to patch alternative feature sections")
      Cc: stable@vger.kernel.org # v2.6.27+
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      b8858581
  10. 14 4月, 2018 3 次提交
  11. 13 4月, 2018 1 次提交
    • M
      powerpc/64s: Fix CPU_FTRS_ALWAYS vs DT CPU features · 81b654c2
      Michael Ellerman 提交于
      The cpu_has_feature() mechanism has an optimisation where at build
      time we construct a mask of the CPU feature bits that will always be
      true for the given .config, based on the platform/bitness/etc. that we
      are building for.
      
      That is incompatible with DT CPU features, where the set of CPU
      features is dependent on feature flags that are given to us by
      firmware.
      
      The result is that some feature bits can not be *disabled* by DT CPU
      features. Or more accurately, they can be disabled but they will still
      appear in the ALWAYS mask, meaning cpu_has_feature() will always
      return true for them.
      
      In the past this hasn't really been a problem because on Book3S
      64 (where we support DT CPU features), the set of ALWAYS bits has been
      very small. That was because we always built for POWER4 and later,
      meaning the set of common bits was small.
      
      The only bit that could be cleared by DT CPU features that was also in
      the ALWAYS mask was CPU_FTR_NODSISRALIGN, and that was only used in
      the alignment handler to create a fake DSISR. That code was itself
      deleted in 31bfdb03 ("powerpc: Use instruction emulation
      infrastructure to handle alignment faults") (Sep 2017).
      
      However the set of ALWAYS features changed with the recent commit
      db5ae1c1 ("powerpc/64s: Refine feature sets for little endian
      builds") which restricted the set of feature flags when building
      little endian to Power7 or later. That caused the ALWAYS mask to
      become much larger for little endian builds.
      
      The result is that the following feature bits can currently not
      be *disabled* by DT CPU features:
      
        CPU_FTR_REAL_LE, CPU_FTR_MMCRA, CPU_FTR_CTRL, CPU_FTR_SMT,
        CPU_FTR_PURR, CPU_FTR_SPURR, CPU_FTR_DSCR, CPU_FTR_PKEY,
        CPU_FTR_VMX_COPY, CPU_FTR_CFAR, CPU_FTR_HAS_PPR.
      
      To fix it we need to mask the set of ALWAYS features with the base set
      of DT CPU features, ie. the features that are always enabled by DT CPU
      features. That way there are no bits in the ALWAYS mask that are not
      also always set by DT CPU features.
      
      Fixes: db5ae1c1 ("powerpc/64s: Refine feature sets for little endian builds")
      Reviewed-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      81b654c2
  12. 12 4月, 2018 3 次提交
    • M
      powerpc/mm/radix: Fix checkstops caused by invalid tlbiel · 2675c13b
      Michael Ellerman 提交于
      In tlbiel_radix_set_isa300() we use the PPC_TLBIEL() macro to
      construct tlbiel instructions. The instruction takes 5 fields, two of
      which are registers, and the others are constants. But because it's
      constructed with inline asm the compiler doesn't know that.
      
      We got the constraint wrong on the 'r' field, using "r" tells the
      compiler to put the value in a register. The value we then get in the
      macro is the *register number*, not the value of the field.
      
      That means when we mask the register number with 0x1 we get 0 or 1
      depending on which register the compiler happens to put the constant
      in, eg:
      
        li      r10,1
        tlbiel  r8,r9,2,0,0
      
        li      r7,1
        tlbiel  r10,r6,0,0,1
      
      If we're unlucky we might generate an invalid instruction form, for
      example RIC=0, PRS=1 and R=0, tlbiel r8,r7,0,1,0, this has been
      observed to cause machine checks:
      
        Oops: Machine check, sig: 7 [#1]
        CPU: 24 PID: 0 Comm: swapper
        NIP:  00000000000385f4 LR: 000000000100ed00 CTR: 000000000000007f
        REGS: c00000000110bb40 TRAP: 0200
        MSR:  9000000000201003 <SF,HV,ME,RI,LE>  CR: 48002222  XER: 20040000
        CFAR: 00000000000385d0 DAR: 0000000000001c00 DSISR: 00000200 SOFTE: 1
      
      If the machine check happens early in boot while we have MSR_ME=0 it
      will escalate into a checkstop and kill the box entirely.
      
      To fix it we could change the inline asm constraint to "i" which
      tells the compiler the value is a constant. But a better fix is to just
      pass a literal 1 into the macro, which bypasses any problems with inline
      asm constraints.
      
      Fixes: d4748276 ("powerpc/64s: Improve local TLB flush for boot and MCE on POWER9")
      Cc: stable@vger.kernel.org # v4.16+
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: NNicholas Piggin <npiggin@gmail.com>
      2675c13b
    • K
      exec: pass stack rlimit into mm layout functions · 8f2af155
      Kees Cook 提交于
      Patch series "exec: Pin stack limit during exec".
      
      Attempts to solve problems with the stack limit changing during exec
      continue to be frustrated[1][2].  In addition to the specific issues
      around the Stack Clash family of flaws, Andy Lutomirski pointed out[3]
      other places during exec where the stack limit is used and is assumed to
      be unchanging.  Given the many places it gets used and the fact that it
      can be manipulated/raced via setrlimit() and prlimit(), I think the only
      way to handle this is to move away from the "current" view of the stack
      limit and instead attach it to the bprm, and plumb this down into the
      functions that need to know the stack limits.  This series implements
      the approach.
      
      [1] 04e35f44 ("exec: avoid RLIMIT_STACK races with prlimit()")
      [2] 779f4e1c ("Revert "exec: avoid RLIMIT_STACK races with prlimit()"")
      [3] to security@kernel.org, "Subject: existing rlimit races?"
      
      This patch (of 3):
      
      Since it is possible that the stack rlimit can change externally during
      exec (either via another thread calling setrlimit() or another process
      calling prlimit()), provide a way to pass the rlimit down into the
      per-architecture mm layout functions so that the rlimit can stay in the
      bprm structure instead of sitting in the signal structure until exec is
      finalized.
      
      Link: http://lkml.kernel.org/r/1518638796-20819-2-git-send-email-keescook@chromium.orgSigned-off-by: NKees Cook <keescook@chromium.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
      Cc: Brad Spengler <spender@grsecurity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8f2af155
    • M
      mm, migrate: remove reason argument from new_page_t · 666feb21
      Michal Hocko 提交于
      No allocation callback is using this argument anymore.  new_page_node
      used to use this parameter to convey node_id resp.  migration error up
      to move_pages code (do_move_page_to_node_array).  The error status never
      made it into the final status field and we have a better way to
      communicate node id to the status field now.  All other allocation
      callbacks simply ignored the argument so we can drop it finally.
      
      [mhocko@suse.com: fix migration callback]
        Link: http://lkml.kernel.org/r/20180105085259.GH2801@dhcp22.suse.cz
      [akpm@linux-foundation.org: fix alloc_misplaced_dst_page()]
      [mhocko@kernel.org: fix build]
        Link: http://lkml.kernel.org/r/20180103091134.GB11319@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20180103082555.14592-3-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NZi Yan <zi.yan@cs.rutgers.edu>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      666feb21
  13. 11 4月, 2018 3 次提交
    • N
      KVM: PPC: Book3S HV: trace_tlbie must not be called in realmode · 19ce7909
      Nicholas Piggin 提交于
      This crashes with a "Bad real address for load" attempting to load
      from the vmalloc region in realmode (faulting address is in DAR).
      
        Oops: Bad interrupt in KVM entry/exit code, sig: 6 [#1]
        LE SMP NR_CPUS=2048 NUMA PowerNV
        CPU: 53 PID: 6582 Comm: qemu-system-ppc Not tainted 4.16.0-01530-g43d1859f0994
        NIP:  c0000000000155ac LR: c0000000000c2430 CTR: c000000000015580
        REGS: c000000fff76dd80 TRAP: 0200   Not tainted  (4.16.0-01530-g43d1859f0994)
        MSR:  9000000000201003 <SF,HV,ME,RI,LE>  CR: 48082222  XER: 00000000
        CFAR: 0000000102900ef0 DAR: d00017fffd941a28 DSISR: 00000040 SOFTE: 3
        NIP [c0000000000155ac] perf_trace_tlbie+0x2c/0x1a0
        LR [c0000000000c2430] do_tlbies+0x230/0x2f0
      
      I suspect the reason is the per-cpu data is not in the linear chunk.
      This could be restored if that was able to be fixed, but for now,
      just remove the tracepoints.
      
      Fixes: 0428491c ("powerpc/mm: Trace tlbie(l) instructions")
      Cc: stable@vger.kernel.org # v4.13+
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      19ce7909
    • A
      powerpc/8xx: Fix build with hugetlbfs enabled · 032900e6
      Aneesh Kumar K.V 提交于
      8xx uses the slice code when hugetlbfs is enabled. We missed a header
      include on 8xx which resulted in the below build failure:
      
        config: mpc885_ads_defconfig + CONFIG_HUGETLBFS
      
        arch/powerpc/mm/slice.c: In function 'slice_get_unmapped_area':
        arch/powerpc/mm/slice.c:655:2: error: implicit declaration of function 'need_extra_context'
        arch/powerpc/mm/slice.c:656:3: error: implicit declaration of function 'alloc_extended_context'
      
      on PPC64 the mmu_context.h was included via linux/pkeys.h
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      032900e6
    • N
      powerpc/powernv: Fix OPAL NVRAM driver OPAL_BUSY loops · 3b807033
      Nicholas Piggin 提交于
      The OPAL NVRAM driver does not sleep in case it gets OPAL_BUSY or
      OPAL_BUSY_EVENT from firmware, which causes large scheduling
      latencies, and various lockup errors to trigger (again, BMC reboot
      can cause it).
      
      Fix this by converting it to the standard form OPAL_BUSY loop that
      sleeps.
      
      Fixes: 628daa8d ("powerpc/powernv: Add RTC and NVRAM support plus RTAS fallbacks")
      Depends-on: 34dd25de ("powerpc/powernv: define a standard delay for OPAL_BUSY type retry loops")
      Cc: stable@vger.kernel.org # v3.2+
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      3b807033
  14. 10 4月, 2018 3 次提交
    • N
      powerpc/powernv: define a standard delay for OPAL_BUSY type retry loops · 34dd25de
      Nicholas Piggin 提交于
      This is the start of an effort to tidy up and standardise all the
      delays. Existing loops have a range of delay/sleep periods from 1ms
      to 20ms, and some have no delay. They all loop forever except rtc,
      which times out after 10 retries, and that uses 10ms delays. So use
      10ms as our standard delay. The OPAL maintainer agrees 10ms is a
      reasonable starting point.
      
      The idea is to use the same recipe everywhere, once this is proven to
      work then it will be documented as an OPAL API standard. Then both
      firmware and OS can agree, and if a particular call needs something
      else, then that can be documented with reasoning.
      
      This is not the end-all of this effort, it's just a relatively easy
      change that fixes some existing high latency delays. There should be
      provision for standardising timeouts and/or interruptible loops where
      possible, so non-fatal firmware errors don't cause hangs.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      34dd25de
    • A
      powerpc/fscr: Enable interrupts earlier before calling get_user() · 709b973c
      Anshuman Khandual 提交于
      The function get_user() can sleep while trying to fetch instruction
      from user address space and causes the following warning from the
      scheduler.
      
      BUG: sleeping function called from invalid context
      
      Though interrupts get enabled back but it happens bit later after
      get_user() is called. This change moves enabling these interrupts
      earlier covering the function get_user(). While at this, lets check
      for kernel mode and crash as this interrupt should not have been
      triggered from the kernel context.
      Signed-off-by: NAnshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      709b973c
    • M
      powerpc/64s: Fix section mismatch warnings from setup_rfi_flush() · 501a78cb
      Michael Ellerman 提交于
      The recent LPM changes to setup_rfi_flush() are causing some section
      mismatch warnings because we removed the __init annotation on
      setup_rfi_flush():
      
        The function setup_rfi_flush() references
        the function __init ppc64_bolted_size().
        the function __init memblock_alloc_base().
      
      The references are actually in init_fallback_flush(), but that is
      inlined into setup_rfi_flush().
      
      These references are safe because:
       - only pseries calls setup_rfi_flush() at runtime
       - pseries always passes L1D_FLUSH_FALLBACK at boot
       - so the fallback flush area will always be allocated
       - so the check in init_fallback_flush() will always return early:
         /* Only allocate the fallback flush area once (at boot time). */
         if (l1d_flush_fallback_area)
         	return;
      
       - and therefore we won't actually call the freed init routines.
      
      We should rework the code to make it safer by default rather than
      relying on the above, but for now as a quick-fix just add a __ref
      annotation to squash the warning.
      
      Fixes: abf110f3 ("powerpc/rfi-flush: Make it possible to call setup_rfi_flush() again")
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      501a78cb
  15. 09 4月, 2018 1 次提交
    • M
      powerpc/modules: Fix crashes by adding CONFIG_RELOCATABLE to vermagic · 73aca179
      Michael Ellerman 提交于
      If you build the kernel with CONFIG_RELOCATABLE=n, then install the
      modules, rebuild the kernel with CONFIG_RELOCATABLE=y and leave the
      old modules installed, we crash something like:
      
        Unable to handle kernel paging request for data at address 0xd000000018d66cef
        Faulting instruction address: 0xc0000000021ddd08
        Oops: Kernel access of bad area, sig: 11 [#1]
        Modules linked in: x_tables autofs4
        CPU: 2 PID: 1 Comm: systemd Not tainted 4.16.0-rc6-gcc_ubuntu_le-g99fec39e #1
        ...
        NIP check_version.isra.22+0x118/0x170
        Call Trace:
          __ksymtab_xt_unregister_table+0x58/0xfffffffffffffcb8 [x_tables] (unreliable)
          resolve_symbol+0xb4/0x150
          load_module+0x10e8/0x29a0
          SyS_finit_module+0x110/0x140
          system_call+0x58/0x6c
      
      This happens because since commit 71810db2 ("modversions: treat
      symbol CRCs as 32 bit quantities"), a relocatable kernel encodes and
      handles symbol CRCs differently from a non-relocatable kernel.
      
      Although it's possible we could try and detect this situation and
      handle it, it's much more robust to simply make the state of
      CONFIG_RELOCATABLE part of the module vermagic.
      
      Fixes: 71810db2 ("modversions: treat symbol CRCs as 32 bit quantities")
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      73aca179
  16. 07 4月, 2018 1 次提交
  17. 06 4月, 2018 3 次提交
  18. 05 4月, 2018 1 次提交