  1. 03 December 2020, 2 commits
    • s390: fix irq state tracing · b1cae1f8
      Committed by Heiko Carstens
      With commit 58c644ba ("sched/idle: Fix arch_cpu_idle() vs
      tracing"), common code calls arch_cpu_idle() with a lockdep state
      that says irqs are on.
      
      This doesn't work very well for s390: psw_idle() will enable interrupts
      to wait for an interrupt. As soon as an interrupt occurs the interrupt
      handler will verify if the old context was psw_idle(). If that is the
      case the interrupt enablement bits in the old program status word will
      be cleared.
      
      A subsequent test in both the external and the I/O interrupt
      handlers checks whether interrupts were enabled in the old context.
      Due to the above patching of the old program status word it is
      assumed the old context had interrupts disabled, and therefore a
      call to TRACE_IRQS_OFF (aka trace_hardirqs_off_caller) is skipped,
      which in turn makes lockdep incorrectly "think" that interrupts are
      enabled within the interrupt handler.
      
      Fix this by unconditionally calling TRACE_IRQS_OFF when entering
      interrupt handlers, and by unconditionally calling TRACE_IRQS_ON
      when leaving interrupt handlers.
      
      This leaves the special psw_idle() case, which now returns with
      interrupts disabled, but has an "irqs on" lockdep state. So callers of
      psw_idle() must adjust the state on their own, if required. This is
      currently only __udelay_disabled().
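
      A rough sketch of the caller-side adjustment described above; the
      function names below are purely illustrative, not the actual s390
      code:

      	/*
      	 * Hypothetical sketch: psw_idle() now returns with hardware
      	 * interrupts disabled while lockdep still has an "irqs on"
      	 * state, so the caller brings lockdep back in sync.
      	 */
      	static void udelay_idle_sketch(u64 end)
      	{
      		psw_idle_sketch(end);	/* returns with irqs disabled */
      		trace_hardirqs_off();	/* re-sync lockdep/tracing */
      	}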
      
      Fixes: 58c644ba ("sched/idle: Fix arch_cpu_idle() vs tracing")
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
      b1cae1f8
    • s390/pci: fix CPU address in MSI for directed IRQ · a2bd4097
      Committed by Alexander Gordeev
      The directed MSIs are delivered to CPUs whose address is
      written to the MSI message address. The current code assumes
      that a CPU logical number (as it is seen by the kernel)
      is also the CPU address.
      
      The above assumption is not correct, as the CPU address
      is rather the value returned by STAP instruction. That
      value does not necessarily match the kernel logical CPU
      number.
      
      Fixes: e979ce7b ("s390/pci: provide support for CPU directed interrupts")
      Cc: <stable@vger.kernel.org> # v5.2+
      Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
      Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
      Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
      Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
      Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
      a2bd4097
  2. 02 December 2020, 1 commit
  3. 01 December 2020, 2 commits
    • KVM: PPC: Book3S HV: XIVE: Fix vCPU id sanity check · f54db39f
      Committed by Greg Kurz
      Commit 062cfab7 ("KVM: PPC: Book3S HV: XIVE: Make VP block size
      configurable") updated kvmppc_xive_vcpu_id_valid() in a way that
      allows userspace to trigger an assertion in skiboot and crash the host:
      
      [  696.186248988,3] XIVE[ IC 08  ] eq_blk != vp_blk (0 vs. 1) for target 0x4300008c/0
      [  696.186314757,0] Assert fail: hw/xive.c:2370:0
      [  696.186342458,0] Aborting!
      xive-kvCPU 0043 Backtrace:
       S: 0000000031e2b8f0 R: 0000000030013840   .backtrace+0x48
       S: 0000000031e2b990 R: 000000003001b2d0   ._abort+0x4c
       S: 0000000031e2ba10 R: 000000003001b34c   .assert_fail+0x34
       S: 0000000031e2ba90 R: 0000000030058984   .xive_eq_for_target.part.20+0xb0
       S: 0000000031e2bb40 R: 0000000030059fdc   .xive_setup_silent_gather+0x2c
       S: 0000000031e2bc20 R: 000000003005a334   .opal_xive_set_vp_info+0x124
       S: 0000000031e2bd20 R: 00000000300051a4   opal_entry+0x134
       --- OPAL call token: 0x8a caller R1: 0xc000001f28563850 ---
      
      XIVE maintains the interrupt context state of non-dispatched vCPUs in
      an internal VP structure. We allocate a bunch of those on startup to
      accommodate all possible vCPUs. Each VP has an id, that we derive from
      the vCPU id for efficiency:
      
      static inline u32 kvmppc_xive_vp(struct kvmppc_xive *xive, u32 server)
      {
      	return xive->vp_base + kvmppc_pack_vcpu_id(xive->kvm, server);
      }
      
      The KVM XIVE device used to allocate KVM_MAX_VCPUS VPs. This was
      limiting the number of concurrent VMs because the VP space is
      limited on the HW. Since most of the time VMs run with far fewer
      vCPUs, commit 062cfab7 ("KVM: PPC: Book3S HV: XIVE: Make VP
      block size configurable") gave the possibility for userspace to
      tune the size of the VP block through the KVM_DEV_XIVE_NR_SERVERS
      attribute.
      
      The check in kvmppc_pack_vcpu_id() was changed from
      
      	cpu < KVM_MAX_VCPUS * xive->kvm->arch.emul_smt_mode
      
      to
      
      	cpu < xive->nr_servers * xive->kvm->arch.emul_smt_mode
      
      The previous check was based on the fact that the VP block had
      KVM_MAX_VCPUS entries and that kvmppc_pack_vcpu_id() guarantees
      that packed vCPU ids are below KVM_MAX_VCPUS. We've changed the
      size of the VP block, but kvmppc_pack_vcpu_id() has nothing to
      do with it and it certainly doesn't ensure that the packed vCPU
      ids are below xive->nr_servers. kvmppc_xive_vcpu_id_valid() might
      thus return true when the VM was configured with a non-standard
      VSMT mode, even if the packed vCPU id is higher than what we
      expect. We end up using an unallocated VP id, which confuses
      OPAL. The assert in OPAL is probably abusive and should be
      converted to a regular error that the kernel can handle, but
      we shouldn't really use broken VP ids in the first place.
      
      Fix kvmppc_xive_vcpu_id_valid() so that it checks the packed
      vCPU id is below xive->nr_servers, which is explicitly what we
      want.
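
      A minimal sketch of what the corrected check boils down to, based on
      the description above (not necessarily the literal patch):

      	bool kvmppc_xive_vcpu_id_valid(struct kvmppc_xive *xive, u32 cpu)
      	{
      		/* validate the packed id against the allocated VP block */
      		return kvmppc_pack_vcpu_id(xive->kvm, cpu) < xive->nr_servers;
      	}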
      
      Fixes: 062cfab7 ("KVM: PPC: Book3S HV: XIVE: Make VP block size configurable")
      Cc: stable@vger.kernel.org # v5.5+
      Signed-off-by: Greg Kurz <groug@kaod.org>
      Reviewed-by: Cédric Le Goater <clg@kaod.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/160673876747.695514.1809676603724514920.stgit@bahia.lan
      f54db39f
    • arm64: mte: Fix typo in macro definition · 9e5344e0
      Committed by Vincenzo Frascino
      UL in the definition of SYS_TFSR_EL1_TF1 was misspelled, causing
      compilation issues when trying to implement in-kernel MTE async
      mode.

      Fix the macro by correcting the typo.
      
      Note: MTE async mode will be introduced with a future series.
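
      For illustration, the corrected definition has roughly the shape
      below (the exact original misspelling is not reproduced here, and the
      shift macro name is an assumption based on the TF0 counterpart):

      	#define SYS_TFSR_EL1_TF1	(UL(1) << SYS_TFSR_EL1_TF1_SHIFT)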
      
      Fixes: c058b1c4 ("arm64: mte: system register definitions")
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Link: https://lore.kernel.org/r/20201130170709.22309-1-vincenzo.frascino@arm.com
      Signed-off-by: Will Deacon <will@kernel.org>
      9e5344e0
  4. 30 November 2020, 11 commits
    • arm64: entry: fix EL1 debug transitions · 2a9b3e6a
      Committed by Mark Rutland
      In debug_exception_enter() and debug_exception_exit() we trace hardirqs
      on/off while RCU isn't guaranteed to be watching, and we don't save and
      restore the hardirq state, and so may return with this having changed.
      
      Handle this appropriately with new entry/exit helpers which do the bare
      minimum to ensure this is appropriately maintained, without marking
      debug exceptions as NMIs. These are placed in entry-common.c with the
      other entry/exit helpers.
      
      In future we'll want to reconsider whether some debug exceptions should
      be NMIs, but this will require a significant refactoring, and for now
      this should prevent issues with lockdep and RCU.
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Link: https://lore.kernel.org/r/20201130115950.22492-12-mark.rutland@arm.com
      Signed-off-by: Will Deacon <will@kernel.org>
      2a9b3e6a
    • arm64: entry: fix NMI {user, kernel}->kernel transitions · f0cd5ac1
      Committed by Mark Rutland
      Exceptions which can be taken at (almost) any time are considered to be
      NMIs. On arm64 that includes:
      
      * SDEI events
      * GICv3 Pseudo-NMIs
      * Kernel stack overflows
      * Unexpected/unhandled exceptions
      
      ... but currently debug exceptions (BRKs, breakpoints, watchpoints,
      single-step) are not considered NMIs.
      
      As these can be taken at any time, kernel features (lockdep, RCU,
      ftrace) may not be in a consistent kernel state. For example, we may
      take an NMI from the idle code or partway through an entry/exit path.
      
      While nmi_enter() and nmi_exit() handle most of this state, notably they
      don't save/restore the lockdep state across an NMI being taken and
      handled. When interrupts are enabled and an NMI is taken, lockdep may
      see interrupts become disabled within the NMI code, but not see
      interrupts become enabled when returning from the NMI, leaving lockdep
      believing interrupts are disabled when they are actually enabled.
      
      The x86 code handles this in idtentry_{enter,exit}_nmi(), which will
      shortly be moved to the generic entry code. As we can't use either yet,
      we copy the x86 approach in arm64-specific helpers. All the NMI
      entrypoints are marked as noinstr to prevent any instrumentation
      handling code being invoked before the state has been corrected.
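
      A hedged sketch of such helpers, along the lines of the x86 code
      (the pt_regs field and helper names here are illustrative):

      	static void noinstr arm64_enter_nmi_sketch(struct pt_regs *regs)
      	{
      		/* remember what lockdep believed before the NMI hit */
      		regs->lockdep_hardirqs = lockdep_hardirqs_enabled();

      		__nmi_enter();
      		lockdep_hardirqs_off(CALLER_ADDR0);
      		lockdep_hardirq_enter();
      		rcu_nmi_enter();

      		trace_hardirqs_off_finish();
      		ftrace_nmi_enter();
      	}

      	static void noinstr arm64_exit_nmi_sketch(struct pt_regs *regs)
      	{
      		bool restore = regs->lockdep_hardirqs;

      		ftrace_nmi_exit();
      		if (restore) {
      			trace_hardirqs_on_prepare();
      			lockdep_hardirqs_on_prepare(CALLER_ADDR0);
      		}

      		rcu_nmi_exit();
      		lockdep_hardirq_exit();
      		if (restore)
      			lockdep_hardirqs_on(CALLER_ADDR0);
      		__nmi_exit();
      	}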
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Link: https://lore.kernel.org/r/20201130115950.22492-11-mark.rutland@arm.com
      Signed-off-by: Will Deacon <will@kernel.org>
      f0cd5ac1
    • arm64: entry: fix non-NMI kernel<->kernel transitions · 7cd1ea10
      Committed by Mark Rutland
      There are periods in kernel mode when RCU is not watching and/or the
      scheduler tick is disabled, but we can still take exceptions such as
      interrupts. The arm64 exception handlers do not account for this, and
      it's possible that RCU is not watching while an exception handler runs.
      
      The x86/generic entry code handles this by ensuring that all (non-NMI)
      kernel exception handlers call irqentry_enter() and irqentry_exit(),
      which handle RCU, lockdep, and IRQ flag tracing. We can't yet move to
      the generic entry code, and already handle the user<->kernel transitions
      elsewhere, so we add new kernel<->kernel transition helpers along the
      lines of the generic entry code.
      
      Since we now track interrupts becoming masked when an exception is
      taken, local_daif_inherit() is modified to track interrupts becoming
      re-enabled when the original context is inherited. To balance the
      entry/exit paths, each handler masks all DAIF exceptions before
      exit_to_kernel_mode().
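
      A hedged sketch of the kernel-mode entry helper, along the lines of
      the generic irqentry_enter() (names and pt_regs fields are assumed):

      	static void noinstr enter_from_kernel_mode_sketch(struct pt_regs *regs)
      	{
      		regs->exit_rcu = false;

      		if (!IS_ENABLED(CONFIG_TINY_RCU) && is_idle_task(current)) {
      			/* RCU may not be watching (idle): make it watch */
      			lockdep_hardirqs_off(CALLER_ADDR0);
      			rcu_irq_enter();
      			trace_hardirqs_off_finish();

      			regs->exit_rcu = true;
      			return;
      		}

      		lockdep_hardirqs_off(CALLER_ADDR0);
      		rcu_irq_enter_check_tick();
      		trace_hardirqs_off_finish();
      	}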
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Link: https://lore.kernel.org/r/20201130115950.22492-10-mark.rutland@arm.com
      Signed-off-by: Will Deacon <will@kernel.org>
      7cd1ea10
    • arm64: ptrace: prepare for EL1 irq/rcu tracking · 1ec2f2c0
      Committed by Mark Rutland
      Exceptions from EL1 may be taken when RCU isn't watching (e.g. in idle
      sequences), or when the lockdep hardirqs state is transiently out-of-sync with
      the hardware state (e.g. in the middle of local_irq_enable()). To
      correctly handle these cases, we'll need to save/restore this state
      across some exceptions taken from EL1.
      
      A series of subsequent patches will update EL1 exception handlers to
      handle this. In preparation for this, and to avoid dependencies between
      those patches, this patch adds two new fields to struct pt_regs so that
      exception handlers can track this state.
      
      Note that this is placed in pt_regs as some entry/exit sequences such as
      el1_irq are invoked from assembly, which makes it very difficult to add
      a separate structure as with the irqentry_state used by x86. We can
      separate this once more of the exception logic is moved to C. While the
      fields only need to be bool, they are both made u64 to keep pt_regs
      16-byte aligned.
      
      There should be no functional change as a result of this patch.
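
      For illustration, the additions amount to something like the
      following (field names are assumed from the series description):

      	struct pt_regs {
      		/* ... existing fields ... */
      		u64 lockdep_hardirqs;	/* lockdep hardirq state at entry */
      		u64 exit_rcu;		/* whether exit must also exit RCU */
      	};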
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Link: https://lore.kernel.org/r/20201130115950.22492-9-mark.rutland@arm.com
      Signed-off-by: Will Deacon <will@kernel.org>
      1ec2f2c0
    • arm64: entry: fix non-NMI user<->kernel transitions · 23529049
      Committed by Mark Rutland
      When built with PROVE_LOCKING, NO_HZ_FULL, and CONTEXT_TRACKING_FORCE,
      the kernel will WARN() at boot time that interrupts are enabled when we
      call context_tracking_user_enter(), despite the DAIF flags indicating
      that IRQs are masked.
      
      The problem is that we're not tracking IRQ flag changes accurately, and
      so lockdep believes interrupts are enabled when they are not (and
      vice-versa). We can shuffle things around to make this more accurate. For
      kernel->user transitions there are a number of constraints we need to
      consider:
      
      1) When we call __context_tracking_user_enter() HW IRQs must be disabled
         and lockdep must be up-to-date with this.
      
      2) Userspace should be treated as having IRQs enabled from the PoV of
         both lockdep and tracing.
      
      3) As context_tracking_user_enter() stops RCU from watching, we cannot
         use RCU after calling it.
      
      4) IRQ flag tracing and lockdep have state that must be manipulated
         before RCU is disabled.
      
      ... with similar constraints applying for user->kernel transitions, with
      the ordering reversed.
      
      The generic entry code has enter_from_user_mode() and
      exit_to_user_mode() helpers to handle this. We can't use those directly,
      so we add arm64 copies for now (without the instrumentation markers
      which aren't used on arm64). These replace the existing user_exit() and
      user_exit_irqoff() calls spread throughout handlers, and the exception
      unmasking is left as-is.
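
      A hedged sketch of the ordering these helpers enforce (simplified,
      not the literal arm64 code):

      	static __always_inline void enter_from_user_mode_sketch(void)
      	{
      		lockdep_hardirqs_off(CALLER_ADDR0);
      		CT_WARN_ON(ct_state() != CONTEXT_USER);
      		user_exit_irqoff();		/* RCU starts watching here */
      		trace_hardirqs_off_finish();	/* safe: RCU is watching */
      	}

      	static __always_inline void exit_to_user_mode_sketch(void)
      	{
      		trace_hardirqs_on_prepare();	/* while RCU is still usable */
      		lockdep_hardirqs_on_prepare(CALLER_ADDR0);
      		user_enter_irqoff();		/* RCU stops watching here */
      		lockdep_hardirqs_on(CALLER_ADDR0);
      	}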
      
      Note that:
      
      * The accounting for debug exceptions from userspace now happens in
        el0_dbg() and ret_to_user(), so this is removed from
        debug_exception_enter() and debug_exception_exit(). As
        user_exit_irqoff() wakes RCU, the userspace-specific check is removed.
      
      * The accounting for syscalls now happens in el0_svc(),
        el0_svc_compat(), and ret_to_user(), so this is removed from
        el0_svc_common(). This does not adversely affect the workaround for
        erratum 1463225, as this does not depend on any of the state tracking.
      
      * In ret_to_user() we mask interrupts with local_daif_mask(), and so we
        need to inform lockdep and tracing. Here a trace_hardirqs_off() is
        sufficient and safe as we have not yet exited kernel context and RCU
        is usable.
      
      * As PROVE_LOCKING selects TRACE_IRQFLAGS, the ifdeffery in entry.S only
        needs to check for the latter.
      
      * EL0 SError handling will be dealt with in a subsequent patch, as this
        needs to be treated as an NMI.
      
      Prior to this patch, booting an appropriately-configured kernel would
      result in spats as below:
      
      | DEBUG_LOCKS_WARN_ON(lockdep_hardirqs_enabled())
      | WARNING: CPU: 2 PID: 1 at kernel/locking/lockdep.c:5280 check_flags.part.54+0x1dc/0x1f0
      | Modules linked in:
      | CPU: 2 PID: 1 Comm: init Not tainted 5.10.0-rc3 #3
      | Hardware name: linux,dummy-virt (DT)
      | pstate: 804003c5 (Nzcv DAIF +PAN -UAO -TCO BTYPE=--)
      | pc : check_flags.part.54+0x1dc/0x1f0
      | lr : check_flags.part.54+0x1dc/0x1f0
      | sp : ffff80001003bd80
      | x29: ffff80001003bd80 x28: ffff66ce801e0000
      | x27: 00000000ffffffff x26: 00000000000003c0
      | x25: 0000000000000000 x24: ffffc31842527258
      | x23: ffffc31842491368 x22: ffffc3184282d000
      | x21: 0000000000000000 x20: 0000000000000001
      | x19: ffffc318432ce000 x18: 0080000000000000
      | x17: 0000000000000000 x16: ffffc31840f18a78
      | x15: 0000000000000001 x14: ffffc3184285c810
      | x13: 0000000000000001 x12: 0000000000000000
      | x11: ffffc318415857a0 x10: ffffc318406614c0
      | x9 : ffffc318415857a0 x8 : ffffc31841f1d000
      | x7 : 647261685f706564 x6 : ffffc3183ff7c66c
      | x5 : ffff66ce801e0000 x4 : 0000000000000000
      | x3 : ffffc3183fe00000 x2 : ffffc31841500000
      | x1 : e956dc24146b3500 x0 : 0000000000000000
      | Call trace:
      |  check_flags.part.54+0x1dc/0x1f0
      |  lock_is_held_type+0x10c/0x188
      |  rcu_read_lock_sched_held+0x70/0x98
      |  __context_tracking_enter+0x310/0x350
      |  context_tracking_enter.part.3+0x5c/0xc8
      |  context_tracking_user_enter+0x6c/0x80
      |  finish_ret_to_user+0x2c/0x13c
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Link: https://lore.kernel.org/r/20201130115950.22492-8-mark.rutland@arm.com
      Signed-off-by: Will Deacon <will@kernel.org>
      23529049
    • arm64: entry: move el1 irq/nmi logic to C · 105fc335
      Committed by Mark Rutland
      In preparation for reworking the EL1 irq/nmi entry code, move the
      existing logic to C. We no longer need the asm_nmi_enter() and
      asm_nmi_exit() wrappers, so these are removed. The new C functions are
      marked noinstr, which prevents compiler instrumentation and runtime
      probing.
      
      In subsequent patches we'll want the new C helpers to be called in all
      cases, so we don't bother wrapping the calls with ifdeffery. Even when
      the new C functions are stubs the trivial calls are unlikely to have a
      measurable impact on the IRQ or NMI paths anyway.
      
      Prototypes are added to <asm/exception.h> as otherwise (in some
      configurations) GCC will complain about the lack of a forward
      declaration. We already do this for existing functions, e.g.
      enter_from_user_mode().
      
      The new helpers are marked as noinstr (which prevents all
      instrumentation, tracing, and kprobes). Otherwise, there should be no
      functional change as a result of this patch.
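
      For illustration, the C replacements have roughly this shape (a
      sketch of the idea, not necessarily the literal patch):

      	asmlinkage void noinstr enter_el1_irq_or_nmi(struct pt_regs *regs)
      	{
      		if (IS_ENABLED(CONFIG_ARM64_PSEUDO_NMI) && !interrupts_enabled(regs))
      			nmi_enter();
      	}

      	asmlinkage void noinstr exit_el1_irq_or_nmi(struct pt_regs *regs)
      	{
      		if (IS_ENABLED(CONFIG_ARM64_PSEUDO_NMI) && !interrupts_enabled(regs))
      			nmi_exit();
      	}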
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Link: https://lore.kernel.org/r/20201130115950.22492-7-mark.rutland@arm.com
      Signed-off-by: Will Deacon <will@kernel.org>
      105fc335
    • arm64: entry: prepare ret_to_user for function call · 3cb5ed4d
      Committed by Mark Rutland
      In a subsequent patch ret_to_user will need to make a C function call
      (in some configurations) which may clobber x0-x18 at the start of the
      finish_ret_to_user block, before enable_step_tsk consumes the flags
      loaded into x1.
      
      In preparation for this, let's load the flags into x19, which is
      preserved across C function calls. This avoids a redundant reload of the
      flags and ensures we operate on a consistent snapshot regardless.
      
      There should be no functional change as a result of this patch. At this
      point of the entry/exit paths we only need to preserve x28 (tsk) and the
      sp, and x19 is free for this use.
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Link: https://lore.kernel.org/r/20201130115950.22492-6-mark.rutland@arm.com
      Signed-off-by: Will Deacon <will@kernel.org>
      3cb5ed4d
    • arm64: entry: move enter_from_user_mode to entry-common.c · 2f911d49
      Committed by Mark Rutland
      In later patches we'll want to extend enter_from_user_mode() and add a
      corresponding exit_to_user_mode(). As these will be common for all
      entries/exits from userspace, it'd be better for these to live in
      entry-common.c with the rest of the entry logic.
      
      This patch moves enter_from_user_mode() into entry-common.c. As with
      other functions in entry-common.c it is marked as noinstr (which
      prevents all instrumentation, tracing, and kprobes) but there are no
      other functional changes.
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Link: https://lore.kernel.org/r/20201130115950.22492-5-mark.rutland@arm.com
      Signed-off-by: Will Deacon <will@kernel.org>
      2f911d49
    • arm64: entry: mark entry code as noinstr · da192676
      Committed by Mark Rutland
      Functions in entry-common.c are marked as notrace and NOKPROBE_SYMBOL(),
      but they're still subject to other instrumentation which may rely on
      lockdep/rcu/context-tracking being up-to-date, and may cause nested
      exceptions (e.g. for WARN/BUG or KASAN's use of BRK) which will corrupt
      exception registers which have not yet been read.
      
      Prevent this by marking all functions in entry-common.c as noinstr to
      prevent compiler instrumentation. This also blacklists the functions for
      tracing and kprobes, so we don't need to handle that separately.
      Functions elsewhere will be dealt with in subsequent patches.
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Link: https://lore.kernel.org/r/20201130115950.22492-4-mark.rutland@arm.com
      Signed-off-by: Will Deacon <will@kernel.org>
      da192676
    • arm64: mark idle code as noinstr · 114e0a68
      Committed by Mark Rutland
      Core code disables RCU when calling arch_cpu_idle(), so it's not safe
      for arch_cpu_idle() or its callees to be instrumented, as the
      instrumentation callbacks may attempt to use RCU or other features which
      are unsafe to use in this context.
      
      Mark them noinstr to prevent issues.
      
      The use of local_irq_enable() in arch_cpu_idle() is similarly
      problematic, and the "sched/idle: Fix arch_cpu_idle() vs tracing" patch
      queued in the tip tree addresses that case.
      Reported-by: Marco Elver <elver@google.com>
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Link: https://lore.kernel.org/r/20201130115950.22492-3-mark.rutland@arm.com
      Signed-off-by: Will Deacon <will@kernel.org>
      114e0a68
    • arm64: syscall: exit userspace before unmasking exceptions · ca1314d7
      Committed by Mark Rutland
      In el0_svc_common() we unmask exceptions before we call user_exit(), and
      so there's a window where an IRQ or debug exception can be taken while
      RCU is not watching. In do_debug_exception() we account for this in via
      debug_exception_{enter,exit}(), but in the el1_irq asm we do not and we
      call trace functions which rely on RCU before we have a guarantee that
      RCU is watching.
      
      Let's avoid this by having el0_svc_common() exit userspace before
      unmasking exceptions, matching what we do for all other EL0 entry paths.
      We can use user_exit_irqoff() to avoid the pointless save/restore of IRQ
      flags while we're sure exceptions are masked in DAIF.
      
      The workaround for Cortex-A76 erratum 1463225 may trigger a debug
      exception before this point, but the debug code invoked in this case is
      safe even when RCU is not watching.
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Link: https://lore.kernel.org/r/20201130115950.22492-2-mark.rutland@arm.com
      Signed-off-by: Will Deacon <will@kernel.org>
      ca1314d7
  5. 28 November 2020, 2 commits
    • x86/mce: Do not overwrite no_way_out if mce_end() fails · 25bc65d8
      Committed by Gabriele Paoloni
      Currently, if mce_end() fails, no_way_out - the variable denoting
      whether the machine can recover from this MCE - is determined by
      whether the worst severity found across the MCA banks associated with
      the current CPU is of panic severity.
      
      However, at this point no_way_out could have been already set by
      mce_start() after looking at all severities of all CPUs that entered the
      MCE handler. If mce_end() fails, check first if no_way_out is already
      set and, if so, stick to it, otherwise use the local worst value.
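
      In code, the intended logic is roughly the following sketch, using
      the local variable names from do_machine_check():

      	if (mce_end(order) < 0) {
      		/* fall back to the local worst severity only if the
      		 * global decision was not already made by mce_start() */
      		if (!no_way_out)
      			no_way_out = worst >= MCE_PANIC_SEVERITY;
      	}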
      
       [ bp: Massage. ]
      Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Tony Luck <tony.luck@intel.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20201127161819.3106432-2-gabriele.paoloni@intel.com
      25bc65d8
    • kvm: x86/mmu: Fix get_mmio_spte() on CPUs supporting 5-level PT · 9a2a0d3c
      Committed by Vitaly Kuznetsov
      Commit 95fb5b02 ("kvm: x86/mmu: Support MMIO in the TDP MMU") caused
      the following WARNING on an Intel Ice Lake CPU:
      
       get_mmio_spte: detect reserved bits on spte, addr 0xb80a0, dump hierarchy:
       ------ spte 0xb80a0 level 5.
       ------ spte 0xfcd210107 level 4.
       ------ spte 0x1004c40107 level 3.
       ------ spte 0x1004c41107 level 2.
       ------ spte 0x1db00000000b83b6 level 1.
       WARNING: CPU: 109 PID: 10254 at arch/x86/kvm/mmu/mmu.c:3569 kvm_mmu_page_fault.cold.150+0x54/0x22f [kvm]
      ...
       Call Trace:
        ? kvm_io_bus_get_first_dev+0x55/0x110 [kvm]
        vcpu_enter_guest+0xaa1/0x16a0 [kvm]
        ? vmx_get_cs_db_l_bits+0x17/0x30 [kvm_intel]
        ? skip_emulated_instruction+0xaa/0x150 [kvm_intel]
        kvm_arch_vcpu_ioctl_run+0xca/0x520 [kvm]
      
      The guest triggering this crashes. Note, this happens with the traditional
      MMU and EPT enabled, not with the newly introduced TDP MMU. Turns out,
      there was a subtle change in the above mentioned commit. Previously,
      walk_shadow_page_get_mmio_spte() was setting 'root' to 'iterator.level'
      which is returned by shadow_walk_init() and this equals to
      'vcpu->arch.mmu->shadow_root_level'. Now, get_mmio_spte() sets it to
      'int root = vcpu->arch.mmu->root_level'.
      
      The difference between 'root_level' and 'shadow_root_level' on CPUs
      supporting 5-level page tables is that in some cases we don't want to
      use 5-level paging; in particular, when 'cpuid_maxphyaddr(vcpu) <= 48',
      kvm_mmu_get_tdp_level() returns '4'. In case the upper layer is not used,
      the corresponding SPTE will fail the '__is_rsvd_bits_set()' check.
      
      Revert to using 'shadow_root_level'.
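
      The fix presumably centres on switching the walk root, along these
      lines:

      	-	int root = vcpu->arch.mmu->root_level;
      	+	int root = vcpu->arch.mmu->shadow_root_level;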
      
      Fixes: 95fb5b02 ("kvm: x86/mmu: Support MMIO in the TDP MMU")
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20201126110206.2118959-1-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      9a2a0d3c
  6. 27 November 2020, 3 commits
    • KVM: x86: Fix split-irqchip vs interrupt injection window request · 71cc849b
      Committed by Paolo Bonzini
      kvm_cpu_accept_dm_intr and kvm_vcpu_ready_for_interrupt_injection are
      a hodge-podge of conditions, hacked together to get something that
      more or less works.  But what is actually needed is much simpler;
      in both cases the fundamental question is, do we have a place to stash
      an interrupt if userspace does KVM_INTERRUPT?
      
      In userspace irqchip mode, that is !vcpu->arch.interrupt.injected.
      Currently kvm_event_needs_reinjection(vcpu) covers it, but it is
      unnecessarily restrictive.
      
      In split irqchip mode it's a bit more complicated, we need to check
      kvm_apic_accept_pic_intr(vcpu) (the IRQ window exit is basically an INTACK
      cycle and thus requires ExtINTs not to be masked) as well as
      !pending_userspace_extint(vcpu).  However, there is no need to
      check kvm_event_needs_reinjection(vcpu), since split irqchip keeps
      pending ExtINT state separate from event injection state, and checking
      kvm_cpu_has_interrupt(vcpu) is wrong too since ExtINT has higher
      priority than APIC interrupts.  In fact the latter fixes a bug:
      when userspace requests an IRQ window vmexit, an interrupt in the
      local APIC can cause kvm_cpu_has_interrupt() to be true and thus
      kvm_vcpu_ready_for_interrupt_injection() to return false.  When this
      happens, vcpu_run does not exit to userspace but the interrupt window
      vmexits keep occurring.  The VM loops without any hope of making progress.
      
      Once we try to fix these with something like
      
           return kvm_arch_interrupt_allowed(vcpu) &&
      -        !kvm_cpu_has_interrupt(vcpu) &&
      -        !kvm_event_needs_reinjection(vcpu) &&
      -        kvm_cpu_accept_dm_intr(vcpu);
      +        (!lapic_in_kernel(vcpu)
      +         ? !vcpu->arch.interrupt.injected
      +         : (kvm_apic_accept_pic_intr(vcpu)
      +            && !pending_userspace_extint(v)));
      
      we realize two things.  First, thanks to the previous patch the complex
      conditional can reuse !kvm_cpu_has_extint(vcpu).  Second, the interrupt
      window request in vcpu_enter_guest()
      
              bool req_int_win =
                      dm_request_for_irq_injection(vcpu) &&
                      kvm_cpu_accept_dm_intr(vcpu);
      
      should be kept in sync with kvm_vcpu_ready_for_interrupt_injection():
      it is unnecessary to ask the processor for an interrupt window
      if we would not be able to return to userspace.  Therefore,
      kvm_cpu_accept_dm_intr(vcpu) is basically !kvm_cpu_has_extint(vcpu)
      ANDed with the existing check for masked ExtINT.  It all makes sense:
      
      - we can accept an interrupt from userspace if there is a place
        to stash it (and, for irqchip split, ExtINTs are not masked).
        Interrupts from userspace _can_ be accepted even if right now
        EFLAGS.IF=0.
      
      - in order to tell userspace we will inject its interrupt ("IRQ
        window open" i.e. kvm_vcpu_ready_for_interrupt_injection), both
        KVM and the vCPU need to be ready to accept the interrupt.
      
      ... and this is what the patch implements.
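
      Put together, the resulting checks look roughly like this simplified
      sketch (not the literal patch):

      	static int kvm_cpu_accept_dm_intr(struct kvm_vcpu *vcpu)
      	{
      		/* no place to stash a userspace interrupt if an ExtINT
      		 * (or, for userspace irqchip, an injected interrupt per
      		 * the previous patch) is already pending */
      		if (kvm_cpu_has_extint(vcpu))
      			return false;

      		/* for split irqchip the IRQ window is an INTACK cycle,
      		 * so ExtINTs must not be masked */
      		return !lapic_in_kernel(vcpu) || kvm_apic_accept_pic_intr(vcpu);
      	}

      	static int kvm_vcpu_ready_for_interrupt_injection(struct kvm_vcpu *vcpu)
      	{
      		return kvm_arch_interrupt_allowed(vcpu) &&
      		       kvm_cpu_accept_dm_intr(vcpu);
      	}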
      Reported-by: David Woodhouse <dwmw@amazon.co.uk>
      Analyzed-by: David Woodhouse <dwmw@amazon.co.uk>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: Nikos Tsironis <ntsironis@arrikto.com>
      Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
      Tested-by: David Woodhouse <dwmw@amazon.co.uk>
      71cc849b
    • KVM: x86: handle !lapic_in_kernel case in kvm_cpu_*_extint · 72c3bcdc
      Committed by Paolo Bonzini
      Centralize handling of interrupts from the userspace APIC
      in kvm_cpu_has_extint and kvm_cpu_get_extint, since
      userspace APIC interrupts are handled more or less the
      same as ExtINTs are with split irqchip.  This removes
      duplicated code from kvm_cpu_has_injectable_intr and
      kvm_cpu_has_interrupt, and makes the code more similar
      between kvm_cpu_has_{extint,interrupt} on one side
      and kvm_cpu_get_{extint,interrupt} on the other.
      
      Cc: stable@vger.kernel.org
      Reviewed-by: Filippo Sironi <sironi@amazon.de>
      Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
      Tested-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      72c3bcdc
    • powerpc/numa: Fix a regression on memoryless node 0 · 10f78fd0
      Committed by Srikar Dronamraju
      Commit e75130f2 ("powerpc/numa: Offline memoryless cpuless node 0")
      offlines node 0 and expects nodes to be subsequently onlined when CPUs
      or nodes are detected.
      
      Commit 6398eaa2 ("powerpc/numa: Prefer node id queried from vphn")
      skips onlining node 0 when CPUs are associated with node 0.
      
      On systems where node 0 has CPUs but no memory, this causes node 0 to be
      marked offline. This causes issues at boot time when trying to set the
      memory node for online CPUs while building the zonelist.
      
      0:mon> t
      [link register   ] c000000000400354 __build_all_zonelists+0x164/0x280
      [c00000000161bda0] c0000000016533c8 node_states+0x20/0xa0 (unreliable)
      [c00000000161bdc0] c000000000400384 __build_all_zonelists+0x194/0x280
      [c00000000161be30] c000000001041800 build_all_zonelists_init+0x4c/0x118
      [c00000000161be80] c0000000004020d0 build_all_zonelists+0x190/0x1b0
      [c00000000161bef0] c000000001003cf8 start_kernel+0x18c/0x6a8
      [c00000000161bf90] c00000000000adb4 start_here_common+0x1c/0x3e8
      0:mon> r
      R00 = c000000000400354   R16 = 000000000b57a0e8
      R01 = c00000000161bda0   R17 = 000000000b57a6b0
      R02 = c00000000161ce00   R18 = 000000000b5afee8
      R03 = 0000000000000000   R19 = 000000000b6448a0
      R04 = 0000000000000000   R20 = fffffffffffffffd
      R05 = 0000000000000000   R21 = 0000000001400000
      R06 = 0000000000000000   R22 = 000000001ec00000
      R07 = 0000000000000001   R23 = c000000001175580
      R08 = 0000000000000000   R24 = c000000001651ed8
      R09 = c0000000017e84d8   R25 = c000000001652480
      R10 = 0000000000000000   R26 = c000000001175584
      R11 = c000000c7fac0d10   R27 = c0000000019568d0
      R12 = c000000000400180   R28 = 0000000000000000
      R13 = c000000002200000   R29 = c00000000164dd78
      R14 = 000000000b579f78   R30 = 0000000000000000
      R15 = 000000000b57a2b8   R31 = c000000001175584
      pc  = c000000000400194 local_memory_node+0x24/0x80
      cfar= c000000000074334 mcount+0xc/0x10
      lr  = c000000000400354 __build_all_zonelists+0x164/0x280
      msr = 8000000002001033   cr  = 44002284
      ctr = c000000000400180   xer = 0000000000000001   trap =  380
      dar = 0000000000001388   dsisr = c00000000161bc90
      0:mon>
      
      Fix this by setting node to be online while onlining CPUs that belong to
      node 0.
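
      A hedged sketch of the fix in the CPU onlining path (the node-id
      lookup helper is an assumed name):

      	static int numa_setup_cpu_sketch(unsigned long lcpu)
      	{
      		int nid = vphn_node_id_sketch(lcpu);	/* assumed helper */

      		/* node 0 may have been offlined at boot; bring the node
      		 * back online before mapping a CPU to it */
      		if (nid != NUMA_NO_NODE && !node_online(nid))
      			node_set_online(nid);

      		map_cpu_to_node(lcpu, nid);
      		return nid;
      	}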
      
      Fixes: e75130f2 ("powerpc/numa: Offline memoryless cpuless node 0")
      Fixes: 6398eaa2 ("powerpc/numa: Prefer node id queried from vphn")
      Reported-by: Milan Mohanty <milmohan@in.ibm.com>
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20201127053738.10085-1-srikar@linux.vnet.ibm.com
      10f78fd0
  7. 26 November 2020, 7 commits
  8. 25 November 2020, 6 commits
  9. 24 November 2020, 5 commits
    • sched/idle: Fix arch_cpu_idle() vs tracing · 58c644ba
      Committed by Peter Zijlstra
      We call arch_cpu_idle() with RCU disabled, but then use
      local_irq_{en,dis}able(), which invokes tracing, which relies on RCU.
      
      Switch all arch_cpu_idle() implementations to use
      raw_local_irq_{en,dis}able() and carefully manage the
      lockdep,rcu,tracing state like we do in entry.
      
      (XXX: we really should change arch_cpu_idle() to not return with
      interrupts enabled)
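
      For illustration, an arch_cpu_idle() implementation after this
      change looks roughly like the arm64 one (a sketch, not verbatim):

      	void noinstr arch_cpu_idle(void)
      	{
      		/* enter a low-power state and wake on interrupt */
      		cpu_do_idle();
      		/* raw variant: no tracing, as RCU is not watching here */
      		raw_local_irq_enable();
      	}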
      Reported-by: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Mark Rutland <mark.rutland@arm.com>
      Tested-by: Mark Rutland <mark.rutland@arm.com>
      Link: https://lkml.kernel.org/r/20201120114925.594122626@infradead.org
      58c644ba
    • x86/resctrl: Add necessary kernfs_put() calls to prevent refcount leak · 75899924
      Committed by Xiaochen Shen
      On resource group creation via a mkdir an extra kernfs_node reference is
      obtained by kernfs_get() to ensure that the rdtgroup structure remains
      accessible for the rdtgroup_kn_unlock() calls where it is removed on
      deletion. Currently the extra kernfs_node reference count is only
      dropped by kernfs_put() in rdtgroup_kn_unlock() while the rdtgroup
      structure is removed in a few other locations that lack the matching
      reference drop.
      
      In call paths of rmdir and umount, when a control group is removed,
      kernfs_remove() is called to remove the whole kernfs nodes tree of the
      control group (including the kernfs nodes trees of all child monitoring
      groups), and then rdtgroup structure is freed by kfree(). The rdtgroup
      structures of all child monitoring groups under the control group are
      freed by kfree() in free_all_child_rdtgrp().
      
      Before calling kfree() to free the rdtgroup structures, the kernfs node
      of the control group itself as well as the kernfs nodes of all child
      monitoring groups still take the extra references which will never be
      dropped to 0 and the kernfs nodes will never be freed. It leads to
      reference count leak and kernfs_node_cache memory leak.
      
      For example, reference count leak is observed in these two cases:
        (1) mount -t resctrl resctrl /sys/fs/resctrl
            mkdir /sys/fs/resctrl/c1
            mkdir /sys/fs/resctrl/c1/mon_groups/m1
            umount /sys/fs/resctrl
      
        (2) mkdir /sys/fs/resctrl/c1
            mkdir /sys/fs/resctrl/c1/mon_groups/m1
            rmdir /sys/fs/resctrl/c1
      
      The same reference count leak issue also exists in the error exit paths
      of mkdir in mkdir_rdt_prepare() and rdtgroup_mkdir_ctrl_mon().
      
      Fix this issue by following changes to make sure the extra kernfs_node
      reference on rdtgroup is dropped before freeing the rdtgroup structure.
        (1) Introduce rdtgroup removal helper rdtgroup_remove() to wrap up
        kernfs_put() and kfree().
      
        (2) Call rdtgroup_remove() in rdtgroup removal path where the rdtgroup
        structure is about to be freed by kfree().
      
        (3) Call rdtgroup_remove() or kernfs_put() as appropriate in the error
        exit paths of mkdir where an extra reference is taken by kernfs_get().
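
      A sketch of the removal helper described in (1), essentially a
      kernfs_put() paired with the kfree():

      	static void rdtgroup_remove(struct rdtgroup *rdtgrp)
      	{
      		/* drop the extra reference taken at creation time */
      		kernfs_put(rdtgrp->kn);
      		kfree(rdtgrp);
      	}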
      
      Fixes: f3cbeaca ("x86/intel_rdt/cqm: Add rmdir support")
      Fixes: e02737d5 ("x86/intel_rdt: Add tasks files")
      Fixes: 60cf5e10 ("x86/intel_rdt: Add mkdir to resctrl file system")
      Reported-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/1604085088-31707-1-git-send-email-xiaochen.shen@intel.com
      75899924
    • x86/resctrl: Remove superfluous kernfs_get() calls to prevent refcount leak · fd8d9db3
      Committed by Xiaochen Shen
      Willem reported growing of kernfs_node_cache entries in slabtop when
      repeatedly creating and removing resctrl subdirectories as well as when
      repeatedly mounting and unmounting the resctrl filesystem.
      
      On resource group (control as well as monitoring) creation via a mkdir
      an extra kernfs_node reference is obtained to ensure that the rdtgroup
      structure remains accessible for the rdtgroup_kn_unlock() calls where it
      is removed on deletion. The kernfs_node reference count is dropped by
      kernfs_put() in rdtgroup_kn_unlock().
      
      With the above explaining the need for one kernfs_get()/kernfs_put()
      pair in resctrl there are more places where a kernfs_node reference is
      obtained without a corresponding release. The excessive amount of
      reference count on kernfs nodes will never be dropped to 0 and the
      kernfs nodes will never be freed in the call paths of rmdir and umount.
      It leads to reference count leak and kernfs_node_cache memory leak.
      
      Remove the superfluous kernfs_get() calls and expand the existing
      comments surrounding the remaining kernfs_get()/kernfs_put() pair that
      remains in use.
      
      Superfluous kernfs_get() calls are removed from two areas:
      
        (1) In call paths of mount and mkdir, when kernfs nodes for "info",
        "mon_groups" and "mon_data" directories and sub-directories are
        created, the reference count of newly created kernfs node is set to 1.
        But after kernfs_create_dir() returns, superfluous kernfs_get() are
        called to take an additional reference.
      
        (2) kernfs_get() calls in rmdir call paths.
      
      Fixes: 17eafd07 ("x86/intel_rdt: Split resource group removal in two")
      Fixes: 4af4a88e ("x86/intel_rdt/cqm: Add mount,umount support")
      Fixes: f3cbeaca ("x86/intel_rdt/cqm: Add rmdir support")
      Fixes: d89b7379 ("x86/intel_rdt/cqm: Add mon_data")
      Fixes: c7d9aac6 ("x86/intel_rdt/cqm: Add mkdir support for RDT monitoring")
      Fixes: 5dc1d5c6 ("x86/intel_rdt: Simplify info and base file lists")
      Fixes: 60cf5e10 ("x86/intel_rdt: Add mkdir to resctrl file system")
      Fixes: 4e978d06 ("x86/intel_rdt: Add "info" files to resctrl file system")
      Reported-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
      Tested-by: Willem de Bruijn <willemb@google.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/1604085053-31639-1-git-send-email-xiaochen.shen@intel.com
      fd8d9db3
    • arm64: pgtable: Ensure dirty bit is preserved across pte_wrprotect() · ff1712f9
      Committed by Will Deacon
      With hardware dirty bit management, calling pte_wrprotect() on a writable,
      dirty PTE will lose the dirty state and return a read-only, clean entry.
      
      Move the logic from ptep_set_wrprotect() into pte_wrprotect() to ensure that
      the dirty bit is preserved for writable entries, as this is required for
      soft-dirty bit management if we enable it in the future.
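
      A hedged sketch of the resulting pte_wrprotect(), simplified from
      the arm64 helpers (not necessarily the literal patch):

      	static inline pte_t pte_wrprotect(pte_t pte)
      	{
      		/*
      		 * An entry marked dirty by hardware (DBM: writable and not
      		 * read-only) must be made software-dirty before write
      		 * permission is cleared, or the dirty state is lost.
      		 */
      		if (pte_hw_dirty(pte))
      			pte = pte_mkdirty(pte);

      		pte = clear_pte_bit(pte, __pgprot(PTE_WRITE));
      		pte = set_pte_bit(pte, __pgprot(PTE_RDONLY));
      		return pte;
      	}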
      
      Cc: <stable@vger.kernel.org>
      Fixes: 2f4b829c ("arm64: Add support for hardware updates of the access and dirty pte bits")
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Link: https://lore.kernel.org/r/20201120143557.6715-3-will@kernel.org
      Signed-off-by: Will Deacon <will@kernel.org>
      ff1712f9
    • arm64: pgtable: Fix pte_accessible() · 07509e10
      Committed by Will Deacon
      pte_accessible() is used by ptep_clear_flush() to figure out whether TLB
      invalidation is necessary when unmapping pages for reclaim. Although our
      implementation is correct according to the architecture, returning true
      only for valid, young ptes in the absence of racing page-table
      modifications, this is in fact flawed due to lazy invalidation of old
      ptes in ptep_clear_flush_young() where we elide the expensive DSB
      instruction for completing the TLB invalidation.
      
      Rather than penalise the aging path, adjust pte_accessible() to return
      true for any valid pte, even if the access flag is cleared.
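
      A hedged sketch of the adjusted definition, based on the description
      above:

      	#define pte_accessible(mm, pte)	\
      		(mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid(pte))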
      
      Cc: <stable@vger.kernel.org>
      Fixes: 76c714be ("arm64: pgtable: implement pte_accessible()")
      Reported-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Link: https://lore.kernel.org/r/20201120143557.6715-2-will@kernel.org
      Signed-off-by: Will Deacon <will@kernel.org>
      07509e10
  10. 23 November 2020, 1 commit