1. 11 Sep, 2020 1 commit
    • arm64/mm: Change THP helpers to comply with generic MM semantics · b65399f6
      Anshuman Khandual committed
      pmd_present() and pmd_trans_huge() are expected to behave in the following
      manner during various phases of a given PMD. It is derived from a previous
      detailed discussion on this topic [1] and present THP documentation [2].
      
      pmd_present(pmd):
      
      - Returns true if pmd refers to system RAM with a valid pmd_page(pmd)
      - Returns false if pmd refers to a migration or swap entry
      
      pmd_trans_huge(pmd):
      
      - Returns true if pmd refers to system RAM and is a trans huge mapping
      
      -------------------------------------------------------------------------
      |	PMD states	|	pmd_present	|	pmd_trans_huge	|
      -------------------------------------------------------------------------
      |	Mapped		|	Yes		|	Yes		|
      -------------------------------------------------------------------------
      |	Splitting	|	Yes		|	Yes		|
      -------------------------------------------------------------------------
      |	Migration/Swap	|	No		|	No		|
      -------------------------------------------------------------------------
      
      The problem:
      
      A PMD is first invalidated with pmdp_invalidate() before it is split. This
      invalidation clears PMD_SECT_VALID as shown below.
      
      PMD Split -> pmdp_invalidate() -> pmd_mkinvalid -> Clears PMD_SECT_VALID
      
      Once PMD_SECT_VALID is cleared, pmd_present() returns false for the PMD
      entry. Another bit apart from PMD_SECT_VALID is needed to reaffirm
      pmd_present() as true during the THP split process. To comply with the
      above-mentioned semantics, pmd_trans_huge() should also check pmd_present()
      first before testing for the presence of an actual transparent huge mapping.
      
      The solution:
      
      Ideally PMD_TYPE_SECT should have been used here instead. But it shares the
      bit position with PMD_SECT_VALID which is used for THP invalidation. Hence
      it will not be there for pmd_present() check after pmdp_invalidate().
      
      A new software-defined bit, PMD_PRESENT_INVALID (bit 59), can be set on the
      PMD entry during invalidation. It lets pmd_present() return true and
      signals that the entry still points to memory.
      
      This bit is transient. During the split process it will be overridden by a
      page table page representing normal pages in place of the erstwhile huge page.
      Other pmdp_invalidate() callers always write a fresh PMD value on the entry
      overriding this transient PMD_PRESENT_INVALID bit, which makes it safe.
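
The bit logic described above can be sketched as a small userspace model. This is only an illustration, not the kernel's code: the `model_` function names are hypothetical, bit 59 comes from the commit text, and the positions of PMD_SECT_VALID (bit 0) and PMD_TABLE_BIT (bit 1) are assumed from the arm64 descriptor format.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pmdval_t;

/* Assumed bit positions; only bit 59 is stated in the commit message. */
#define PMD_SECT_VALID      (1ULL << 0)
#define PMD_TABLE_BIT       (1ULL << 1)
#define PMD_PRESENT_INVALID (1ULL << 59)

/* Present if the entry is valid, or was invalidated while still pointing to memory. */
bool model_pmd_present(pmdval_t pmd)
{
    return (pmd & (PMD_SECT_VALID | PMD_PRESENT_INVALID)) != 0;
}

/* A huge mapping must be present first, and must not be a table entry. */
bool model_pmd_trans_huge(pmdval_t pmd)
{
    return model_pmd_present(pmd) && !(pmd & PMD_TABLE_BIT);
}

/* Invalidation clears the valid bit but leaves a marker behind. */
pmdval_t model_pmd_mkinvalid(pmdval_t pmd)
{
    pmd |= PMD_PRESENT_INVALID;
    pmd &= ~PMD_SECT_VALID;
    return pmd;
}
```

In this model a mapped huge PMD stays present and huge across model_pmd_mkinvalid(), while a migration or swap entry (neither bit set) reports false for both helpers, matching the table above.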
      
      [1]: https://lkml.org/lkml/2018/10/17/231
      [2]: https://www.kernel.org/doc/Documentation/vm/transhuge.txt
      
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Suzuki Poulose <suzuki.poulose@arm.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-kernel@vger.kernel.org
      Link: https://lore.kernel.org/r/1599627183-14453-2-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Will Deacon <will@kernel.org>
  2. 08 Sep, 2020 1 commit
  3. 28 Aug, 2020 7 commits
  4. 27 Aug, 2020 9 commits
    • Revert "powerpc/powernv/idle: Replace CPU feature check with PVR check" · 16d83a54
      Pratik Rajesh Sampat committed
      The cpuidle stop state implementation has minor optimizations for P10,
      where hardware preserves more SPR registers compared to P9. The
      current P9 driver works for P10, although it does a few extra
      save-restores. The P9 driver can provide the required power management
      features like SMT thread folding and core level power savings on a P10
      platform.
      
      Until the P10 stop driver is available, revert the commit which allows
      for only P9 systems to utilize cpuidle and blocks all idle stop states
      for P10. CPU idle states are enabled and tested on the P10 platform
      with this fix.
      
      This reverts commit 8747bf36.
      
      Fixes: 8747bf36 ("powerpc/powernv/idle: Replace CPU feature check with PVR check")
      Signed-off-by: Pratik Rajesh Sampat <psampat@linux.ibm.com>
      Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200826082918.89306-1-psampat@linux.ibm.com
    • powerpc/perf: Fix reading of MSR[HV/PR] bits in trace-imc · 82715a0f
      Athira Rajeev committed
      IMC trace-mode uses MSR[HV/PR] bits to set the cpumode for the
      instruction pointer captured in each sample. The bits are fetched from
      the third double word of the trace record. Reading the third double word
      from the IMC trace record should use be64_to_cpu() along with READ_ONCE()
      in order to fetch the correct MSR[HV/PR] bits. This patch makes that change.
      
      Currently we are using PERF_RECORD_MISC_HYPERVISOR as cpumode if MSR
      HV is 1 and PR is 0 which means the address is from host counter. But
      using PERF_RECORD_MISC_HYPERVISOR for host counter data will fail to
      resolve the address -> symbol during "perf report" because perf tools
      side uses PERF_RECORD_MISC_KERNEL to represent the host counter data.
      Therefore, fix the trace imc sample data to use
      PERF_RECORD_MISC_KERNEL as cpumode for host kernel information.
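
A rough userspace sketch of the two points above: decode the big-endian double word before extracting the bits, then map HV=1/PR=0 to the kernel cpumode. The function names, the `HVPR_SHIFT` field position, and the mapping of the other three HV/PR combinations are assumptions made for illustration; only the HV=1/PR=0 case is taken from the commit text. The enum values are intended to mirror the PERF_RECORD_MISC_* cpumode constants.

```c
#include <stdint.h>

/* Values intended to mirror PERF_RECORD_MISC_{KERNEL,USER,GUEST_KERNEL,GUEST_USER}. */
enum cpumode {
    MODE_KERNEL = 1,
    MODE_USER = 2,
    MODE_GUEST_KERNEL = 4,
    MODE_GUEST_USER = 5
};

/* Hypothetical position of the two-bit HV/PR field in the decoded word. */
#define HVPR_SHIFT 11

/* Portable stand-in for be64_to_cpu(): assemble the value byte by byte. */
uint64_t load_be64(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

enum cpumode record_cpumode(const unsigned char *third_dword)
{
    unsigned int hvpr =
        (unsigned int)((load_be64(third_dword) >> HVPR_SHIFT) & 0x3);

    switch (hvpr) {
    case 2:  /* HV=1, PR=0: host kernel, reported as KERNEL (from the commit) */
        return MODE_KERNEL;
    case 3:  /* HV=1, PR=1: assumed host user */
        return MODE_USER;
    case 1:  /* HV=0, PR=1: assumed guest user */
        return MODE_GUEST_USER;
    default: /* HV=0, PR=0: assumed guest kernel */
        return MODE_GUEST_KERNEL;
    }
}
```

Skipping the byte-order conversion on a little-endian host would read the field from the wrong bytes, which is the first problem the commit describes.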
      
      Fixes: 77ca3951 ("powerpc/perf: Add kernel support for new MSR[HV PR] bits in trace-imc")
      Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/1598424029-1662-1-git-send-email-atrajeev@linux.vnet.ibm.com
    • powerpc/perf: Fix crashes with generic_compat_pmu & BHRB · b460b512
      Alexey Kardashevskiy committed
      The bhrb_filter_map ("The Branch History Rolling Buffer") callback is
      only defined in raw CPUs' power_pmu structs. The "architected" CPUs
      use generic_compat_pmu, which does not have this callback, and crashes
      occur if a user tries to enable branch stack for an event.
      
      This adds a NULL pointer check for bhrb_filter_map() which behaves as
      if the callback returned an error.
      
      This does not add the same check for config_bhrb() as the only caller
      checks for cpuhw->bhrb_users which remains zero if bhrb_filter_map==0.
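
The guarded callback pattern described above might look like the following sketch. The struct, the function names, and the sample callback are hypothetical stand-ins, and the return value -1 stands in for whatever error code the real driver returns.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a PMU description with an optional callback. */
struct pmu_model {
    uint64_t (*bhrb_filter_map)(uint64_t branch_sample_type);
};

/* Example callback for a PMU that does support branch filtering. */
uint64_t sample_filter_map(uint64_t branch_sample_type)
{
    return branch_sample_type & 0xff;
}

/*
 * Returns 0 and stores the filter on success, or -1 on failure.
 * A missing callback is treated exactly like a callback that
 * returned the error value (all bits set).
 */
int setup_branch_filter(const struct pmu_model *pmu,
                        uint64_t branch_sample_type, uint64_t *filter)
{
    uint64_t f = (uint64_t)-1;

    if (pmu->bhrb_filter_map)
        f = pmu->bhrb_filter_map(branch_sample_type);

    if (f == (uint64_t)-1)
        return -1;

    *filter = f;
    return 0;
}
```

Initializing the filter to the error value means a PMU without the callback takes the existing error path, so no separate branch is needed.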
      
      Fixes: be80e758 ("powerpc/perf: Add generic compat mode pmu driver")
      Cc: stable@vger.kernel.org # v5.2+
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: Madhavan Srinivasan <maddy@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200602025612.62707-1-aik@ozlabs.ru
    • powerpc/64s: Fix crash in load_fp_state() due to fpexc_mode · b91eb518
      Michael Ellerman committed
      The recent commit 01eb0187 ("powerpc/64s: Fix restore_math
      unnecessarily changing MSR") changed some of the handling of floating
      point/vector restore.
      
      In particular it caused current->thread.fpexc_mode to be copied into
      the current MSR (via msr_check_and_set()), rather than just into
      regs->msr (which is moved into MSR on return to userspace).
      
      This can lead to a crash in the kernel if we take a floating point
      exception when restoring FPSCR:
      
        Oops: Exception in kernel mode, sig: 8 [#1]
        LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
        Modules linked in:
        CPU: 3 PID: 101213 Comm: ld64.so.2 Not tainted 5.9.0-rc1-00098-g18445bf4-dirty #9
        NIP:  c00000000000fbb4 LR: c00000000001a7ac CTR: c000000000183570
        REGS: c0000016b7cfb3b0 TRAP: 0700   Not tainted  (5.9.0-rc1-00098-g18445bf4-dirty)
        MSR:  900000000290b933 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 44002444  XER: 00000000
        CFAR: c00000000001a7a8 IRQMASK: 1
        GPR00: c00000000001ae40 c0000016b7cfb640 c0000000011b7f00 c000001542a0f740
        GPR04: c000001542a0f720 c000001542a0eb00 0000000000000900 c000001542a0eb00
        GPR08: 000000000000000a 0000000000002000 9000000000009033 0000000000000000
        GPR12: 0000000000004000 c0000017ffffd900 0000000000000001 c000000000df5a58
        GPR16: c000000000e19c18 c0000000010e1123 0000000000000001 c000000000e1a638
        GPR20: 0000000000000000 c0000000044b1d00 0000000000000000 c000001542a0f2a0
        GPR24: 00000016c7fe0000 c000001542a0f720 c000000001c93da0 c000000000fe5f28
        GPR28: c000001542a0f720 0000000000800000 c0000016b7cfbe90 0000000002802900
        NIP load_fp_state+0x4/0x214
        LR  restore_math+0x17c/0x1f0
        Call Trace:
          0xc0000016b7cfb680 (unreliable)
          __switch_to+0x330/0x460
          __schedule+0x318/0x920
          schedule+0x74/0x140
          schedule_timeout+0x318/0x3f0
          wait_for_completion+0xc8/0x210
          call_usermodehelper_exec+0x234/0x280
          do_coredump+0xedc/0x13c0
          get_signal+0x1d4/0xbe0
          do_notify_resume+0x1a0/0x490
          interrupt_exit_user_prepare+0x1c4/0x230
          interrupt_return+0x14/0x1c0
        Instruction dump:
        ebe10168 e88101a0 7c8ff120 382101e0 e8010010 7c0803a6 4e800020 790605c4
        782905c4 7c0008a8 7c0008a8 c8030200 <fffe058e> 48000088 c8030000 c8230010
      
      Fix it by only loading the fpexc_mode value into regs->msr.
      
      Also add a comment to explain that although VSX is subject to the
      value of fpexc_mode, we don't have to handle that separately because
      we only allow VSX to be enabled if FP is also enabled.
      
      Fixes: 01eb0187 ("powerpc/64s: Fix restore_math unnecessarily changing MSR")
      Reported-by: Milton Miller <miltonm@us.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      Link: https://lore.kernel.org/r/20200825093424.3967813-1-mpe@ellerman.id.au
    • powerpc/64s: scv entry should set PPR · e5fe5609
      Nicholas Piggin committed
      Kernel entry sets PPR to HMT_MEDIUM by convention. The scv entry
      path missed this.
      
      Fixes: 7fa95f9a ("powerpc/64s: system call support for scv/rfscv instructions")
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200825075309.224184-1-npiggin@gmail.com
    • x86/irq: Unbreak interrupt affinity setting · e027ffff
      Thomas Gleixner committed
      Several people reported that 5.8 broke the interrupt affinity setting
      mechanism.
      
      The consolidation of the entry code reused the regular exception entry code
      for device interrupts and changed how the vector number is conveyed, moving
      it from ptregs->orig_ax to a function argument.
      
      The low level entry uses the hardware error code slot to push the vector
      number onto the stack which is retrieved from there into a function
      argument and the slot on stack is set to -1.
      
      The reason for setting it to -1 is that the error code slot is at the
      position where pt_regs::orig_ax is. A positive value in pt_regs::orig_ax
      indicates that the entry came via a syscall. If it's not set to a negative
      value then a signal delivery on return to userspace would try to restart a
      syscall. But there are other places which rely on pt_regs::orig_ax being a
      valid indicator for syscall entry.
      
      But setting pt_regs::orig_ax to -1 has a nasty side effect vs. the
      interrupt affinity setting mechanism, which was overlooked when this change
      was made.
      
      Moving interrupts on x86 happens in several steps. A new vector on a
      different CPU is allocated and the relevant interrupt source is
      reprogrammed to that. But that's racy and there might be an interrupt
      already in flight to the old vector. So the old vector is preserved until
      the first interrupt arrives on the new vector and the new target CPU. Once
      that happens the old vector is cleaned up, but this cleanup still depends
      on the vector number being stored in pt_regs::orig_ax, which is now -1.
      
      That -1 makes the check for cleanup: pt_regs::orig_ax == new_vector
      always false. As a consequence the interrupt is moved once, but then it
      cannot be moved anymore because the cleanup of the old vector never
      happens.
      
      There would be several ways to convey the vector information to that place
      in the guts of the interrupt handling, but on deeper inspection it turned
      out that this check is pointless and a leftover from the old affinity model
      of X86 which supported multi-CPU affinities. Under this model it was
      possible that an interrupt had an old and a new vector on the same CPU, so
      the vector match was required.
      
      Under the new model the effective affinity of an interrupt is always a
      single CPU from the requested affinity mask. If the affinity mask changes
      then either the interrupt stays on the CPU and on the same vector when that
      CPU is still in the new affinity mask or it is moved to a different CPU, but
      it is never moved to a different vector on the same CPU.
      
      Ergo the cleanup check for the matching vector number is not required and
      can be removed, which makes the dependency on pt_regs::orig_ax go away.
      
      The remaining check for new_cpu == smp_processor_id() is completely
      sufficient. If it matches then the interrupt was successfully migrated and
      the cleanup can proceed.
      
      For paranoia's sake, add a warning into the vector assignment code to
      validate that the assumption of never moving to a different vector on
      the same CPU holds.
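
The difference between the old and the new cleanup condition can be illustrated with a simplified model. The struct layout and function names are hypothetical, and the real code tracks more state; this only shows why a vector value of -1 defeats the old check while the CPU-only check still works.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of an interrupt that is being moved. */
struct vector_move {
    bool move_in_progress;
    unsigned int new_cpu;
    unsigned int new_vector;
};

/* Old condition: also requires the vector taken from the register frame to match. */
bool cleanup_due_old(const struct vector_move *m, unsigned int this_cpu,
                     long vector_from_regs)
{
    return m->move_in_progress &&
           vector_from_regs == (long)m->new_vector &&
           this_cpu == m->new_cpu;
}

/* New condition: arriving on the new target CPU is sufficient. */
bool cleanup_due_new(const struct vector_move *m, unsigned int this_cpu)
{
    return m->move_in_progress && this_cpu == m->new_cpu;
}
```

With the frame slot overwritten by -1, cleanup_due_old() can never be true, so the old vector is never released; cleanup_due_new() depends only on the CPU the interrupt arrived on.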
      
      Fixes: 633260fa ("x86/irq: Convey vector as argument and not in ptregs")
      Reported-by: Alex bykov <alex.bykov@scylladb.com>
      Reported-by: Avi Kivity <avi@scylladb.com>
      Reported-by: Alexander Graf <graf@amazon.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Alexander Graf <graf@amazon.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/87wo1ltaxz.fsf@nanos.tec.linutronix.de
    • x86/hotplug: Silence APIC only after all interrupts are migrated · 52d6b926
      Ashok Raj committed
      There is a race when taking a CPU offline. Current code looks like this:
      
      native_cpu_disable()
      {
      	...
      	apic_soft_disable();
      	/*
      	 * Any existing set bits for pending interrupt to
      	 * this CPU are preserved and will be sent via IPI
      	 * to another CPU by fixup_irqs().
      	 */
      	cpu_disable_common();
      	{
      		....
      		/*
      		 * Race window happens here. Once local APIC has been
      		 * disabled any new interrupts from the device to
      		 * the old CPU are lost
      		 */
      		fixup_irqs(); // Too late to capture anything in IRR.
      		...
      	}
      }
      
      The fix is to disable the APIC *after* cpu_disable_common().
      
      Testing was done with a USB NIC that provided a source of frequent
      interrupts. A script migrated interrupts to a specific CPU and
      then took that CPU offline.
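
The ordering problem can be shown with a toy simulation. Everything here is a hypothetical model: the functions reuse names from the pseudocode above for readability, but they only count interrupts and do not reflect the real implementations. One device interrupt is assumed to arrive just before fixup_irqs() runs.

```c
#include <stdbool.h>

/* Toy model of the outgoing CPU's local APIC state. */
struct cpu_model {
    bool apic_enabled;
    unsigned int irr_pending; /* interrupts latched while the APIC was enabled */
    unsigned int forwarded;   /* interrupts re-sent to another CPU */
    unsigned int lost;        /* interrupts that arrived at a disabled APIC */
};

void device_raises_irq(struct cpu_model *c)
{
    if (c->apic_enabled)
        c->irr_pending++;
    else
        c->lost++;
}

void apic_soft_disable(struct cpu_model *c)
{
    c->apic_enabled = false;
}

/* Forward whatever is still pending to another CPU. */
void fixup_irqs(struct cpu_model *c)
{
    c->forwarded += c->irr_pending;
    c->irr_pending = 0;
}

/* Old order: the APIC is disabled before pending interrupts are fixed up. */
struct cpu_model offline_old_order(void)
{
    struct cpu_model c = { .apic_enabled = true };
    apic_soft_disable(&c);
    device_raises_irq(&c); /* lands in the race window */
    fixup_irqs(&c);
    return c;
}

/* New order: fix up interrupts first, disable the APIC afterwards. */
struct cpu_model offline_new_order(void)
{
    struct cpu_model c = { .apic_enabled = true };
    device_raises_irq(&c);
    fixup_irqs(&c);
    apic_soft_disable(&c);
    return c;
}
```

In the old order the interrupt is counted as lost; in the new order it is latched and then forwarded.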
      
      Fixes: 60dcaad5 ("x86/hotplug: Silence APIC and NMI when CPU is dead")
      Reported-by: Evan Green <evgreen@chromium.org>
      Signed-off-by: Ashok Raj <ashok.raj@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Mathias Nyman <mathias.nyman@linux.intel.com>
      Tested-by: Evan Green <evgreen@chromium.org>
      Reviewed-by: Evan Green <evgreen@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/lkml/875zdarr4h.fsf@nanos.tec.linutronix.de/
      Link: https://lore.kernel.org/r/1598501530-45821-1-git-send-email-ashok.raj@intel.com
    • s390/vmem: fix vmem_add_range for 4-level paging · bffc2f7a
      Vasily Gorbik committed
      The kernel currently crashes if 4-level paging is used. Add the missing
      p4d_populate() for the just-allocated pud entry.
      
      Fixes: 3e0d3e40 ("s390/vmem: consolidate vmem_add_range() and vmem_remove_range()")
      Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
    • s390: don't trace preemption in percpu macros · 1196f12a
      Sven Schnelle committed
      Since commit a21ee605 ("lockdep: Change hardirq{s_enabled,_context}
      to per-cpu variables") the lockdep code itself uses percpu variables. This
      leads to recursions because the percpu macros are calling preempt_enable()
      which might call trace_preempt_on().
      
      Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
      Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
  5. 26 Aug, 2020 7 commits
  6. 24 Aug, 2020 3 commits
  7. 22 Aug, 2020 3 commits
    • KVM: arm64: Only reschedule if MMU_NOTIFIER_RANGE_BLOCKABLE is not set · b5331379
      Will Deacon committed
      When an MMU notifier call results in unmapping a range that spans multiple
      PGDs, we end up calling into cond_resched_lock() when crossing a PGD boundary,
      since this avoids running into RCU stalls during VM teardown. Unfortunately,
      if the VM is destroyed as a result of OOM, then blocking is not permitted
      and the call to the scheduler triggers the following BUG():
      
       | BUG: sleeping function called from invalid context at arch/arm64/kvm/mmu.c:394
       | in_atomic(): 1, irqs_disabled(): 0, non_block: 1, pid: 36, name: oom_reaper
       | INFO: lockdep is turned off.
       | CPU: 3 PID: 36 Comm: oom_reaper Not tainted 5.8.0 #1
       | Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
       | Call trace:
       |  dump_backtrace+0x0/0x284
       |  show_stack+0x1c/0x28
       |  dump_stack+0xf0/0x1a4
       |  ___might_sleep+0x2bc/0x2cc
       |  unmap_stage2_range+0x160/0x1ac
       |  kvm_unmap_hva_range+0x1a0/0x1c8
       |  kvm_mmu_notifier_invalidate_range_start+0x8c/0xf8
       |  __mmu_notifier_invalidate_range_start+0x218/0x31c
       |  mmu_notifier_invalidate_range_start_nonblock+0x78/0xb0
       |  __oom_reap_task_mm+0x128/0x268
       |  oom_reap_task+0xac/0x298
       |  oom_reaper+0x178/0x17c
       |  kthread+0x1e4/0x1fc
       |  ret_from_fork+0x10/0x30
      
      Use the new 'flags' argument to kvm_unmap_hva_range() to ensure that we
      only reschedule if MMU_NOTIFIER_RANGE_BLOCKABLE is set in the notifier
      flags.
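
A minimal sketch of the guarded reschedule, written as a userspace model. The function name and the counter are hypothetical: the counter stands in for the point where the real code would call cond_resched_lock(), and the flag value is assumed for the model.

```c
#include <stdbool.h>

/* Assumed flag value for this model. */
#define MMU_NOTIFIER_RANGE_BLOCKABLE (1u << 0)

/*
 * Walk [start, end) in pgd_size steps (pgd_size must be a power of two)
 * and return how many times the walker would have offered to reschedule.
 */
unsigned int unmap_range_model(unsigned long start, unsigned long end,
                               unsigned long pgd_size, unsigned int flags)
{
    bool may_block = (flags & MMU_NOTIFIER_RANGE_BLOCKABLE) != 0;
    unsigned int resched_points = 0;
    unsigned long addr = start;

    while (addr < end) {
        /* Next PGD boundary, clamped to the end of the range. */
        unsigned long boundary = (addr + pgd_size) & ~(pgd_size - 1);
        unsigned long next = boundary < end ? boundary : end;

        /* ... unmap [addr, next) here ... */

        /* Only yield between PGDs, and only when blocking is allowed. */
        if (may_block && next != end)
            resched_points++;

        addr = next;
    }
    return resched_points;
}
```

For a range spanning three PGDs the model yields twice when the flag is set and never when it is clear, which corresponds to the non-blocking OOM reaper path in the trace above.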
      
      Cc: <stable@vger.kernel.org>
      Fixes: 8b3405e3 ("kvm: arm/arm64: Fix locking for kvm_free_stage2_pgd")
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
      Cc: James Morse <james.morse@arm.com>
      Signed-off-by: Will Deacon <will@kernel.org>
      Message-Id: <20200811102725.7121-3-will@kernel.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Pass MMU notifier range flags to kvm_unmap_hva_range() · fdfe7cbd
      Will Deacon committed
      The 'flags' field of 'struct mmu_notifier_range' is used to indicate
      whether invalidate_range_{start,end}() are permitted to block. In the
      case of kvm_mmu_notifier_invalidate_range_start(), this field is not
      forwarded on to the architecture-specific implementation of
      kvm_unmap_hva_range() and therefore the backend cannot sensibly decide
      whether or not to block.
      
      Add an extra 'flags' parameter to kvm_unmap_hva_range() so that
      architectures are aware as to whether or not they are permitted to block.
      
      Cc: <stable@vger.kernel.org>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
      Cc: James Morse <james.morse@arm.com>
      Signed-off-by: Will Deacon <will@kernel.org>
      Message-Id: <20200811102725.7121-2-will@kernel.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • ARM64: vdso32: Install vdso32 from vdso_install · 8d75785a
      Stephen Boyd committed
      Add the 32-bit vdso Makefile to the vdso_install rule so that 'make
      vdso_install' installs the 32-bit compat vdso when it is compiled.
      
      Fixes: a7f71a2c ("arm64: compat: Add vDSO")
      Signed-off-by: Stephen Boyd <swboyd@chromium.org>
      Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Acked-by: Will Deacon <will@kernel.org>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Link: https://lore.kernel.org/r/20200818014950.42492-1-swboyd@chromium.org
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
  8. 21 Aug, 2020 9 commits