1. 19 Aug, 2019 2 commits
  2. 16 Aug, 2019 4 commits
    • P
      powerpc/xive: Implement get_irqchip_state method for XIVE to fix shutdown race · da15c03b
      Paul Mackerras committed
      Testing has revealed the existence of a race condition where a XIVE
      interrupt being shut down can be in one of the XIVE interrupt queues
      (of which there are up to 8 per CPU, one for each priority) at the
      point where free_irq() is called.  If this happens, the code that
      scans the queues can return an interrupt number which has been shut
      down.  This can lead to various symptoms:
      
      - irq_to_desc(irq) can be NULL.  In this case, no end-of-interrupt
        function gets called, so the CPU's elevated interrupt priority
        (numerically lowered CPPR) never gets reset.  That then
        means that the CPU stops processing interrupts, causing device
        timeouts and other errors in various device drivers.
      
      - The irq descriptor or related data structures can be in the process
        of being freed as the interrupt code is using them.  This typically
        leads to crashes due to bad pointer dereferences.
      
      This race is basically what commit 62e04686 ("genirq: Add optional
      hardware synchronization for shutdown", 2019-06-28) is intended to
      fix, given a get_irqchip_state() method for the interrupt controller
      being used.  It works by polling the interrupt controller when an
      interrupt is being freed until the controller says it is not pending.
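
      The polling idea can be sketched in plain C as follows.  This is a
      user-space model, not the genirq code: wait_until_idle(), the
      irq_active_fn callback, struct toy_chip and the max_polls bound are
      all hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an interrupt controller's state query. */
typedef bool (*irq_active_fn)(void *chip);

/*
 * Poll until the controller reports that the interrupt is no longer
 * pending, or until max_polls attempts have been made.  Returns true if
 * the interrupt was seen idle.  The bound exists only in this sketch.
 */
static bool wait_until_idle(irq_active_fn is_active, void *chip,
                            unsigned int max_polls)
{
    assert(is_active != NULL);
    for (unsigned int i = 0; i < max_polls; i++) {
        if (!is_active(chip))
            return true;
    }
    return false;
}

/* Toy controller: reports "active" for a fixed number of polls. */
struct toy_chip {
    unsigned int polls_remaining;
};

static bool toy_is_active(void *chip)
{
    struct toy_chip *c = chip;

    if (c->polls_remaining == 0)
        return false;
    c->polls_remaining--;
    return true;
}
```

      A caller would invoke wait_until_idle(toy_is_active, &chip, n) before
      releasing anything the interrupt handler might still touch.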
      
      With XIVE, the PQ bits of the interrupt source indicate the state of
      the interrupt source, and in particular the P bit goes from 0 to 1 at
      the point where the hardware writes an entry into the interrupt queue
      that this interrupt is directed towards.  Normally, the code will then
      process the interrupt and do an end-of-interrupt (EOI) operation which
      will reset PQ to 00 (assuming another interrupt hasn't been generated
      in the meantime).  However, there are situations where the code resets
      P even though a queue entry exists (for example, by setting PQ to 01,
      which disables the interrupt source), and also situations where the
      code leaves P at 1 after removing the queue entry (for example, this
      is done for escalation interrupts so they cannot fire again until
      they are explicitly re-enabled).
      
      The code already has a 'saved_p' flag for the interrupt source which
      indicates that a queue entry exists, although it isn't maintained
      consistently.  This patch adds a 'stale_p' flag to indicate that
      P has been left at 1 after processing a queue entry, and adds code
      to set and clear saved_p and stale_p as necessary to maintain a
      consistent indication of whether a queue entry may or may not exist.
      
      With this, we can implement xive_get_irqchip_state() by looking at
      stale_p, saved_p and the ESB PQ bits for the interrupt.
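
      One plausible way to combine the three inputs is sketched below as a
      stand-alone C model.  The combination rule, struct src_state and
      source_may_be_queued() are assumptions made for illustration; the
      actual xive_get_irqchip_state() may differ in detail.

```c
#include <assert.h>
#include <stdbool.h>

#define ESB_P 0x2u /* P bit: set when a queue entry is written */
#define ESB_Q 0x1u /* Q bit: PQ = 01 disables the source */

/* Hypothetical per-source bookkeeping for this sketch. */
struct src_state {
    bool saved_p;    /* a queue entry exists although P was reset */
    bool stale_p;    /* P left at 1 after the queue entry was processed */
    unsigned int pq; /* current ESB PQ bits, 0..3 */
};

/*
 * Assumed rule: a queue entry may exist if P is not known to be stale
 * and either the saved flag or the hardware P bit says so.
 */
static bool source_may_be_queued(const struct src_state *s)
{
    assert((s->pq & ~(ESB_P | ESB_Q)) == 0);
    return !s->stale_p && (s->saved_p || (s->pq & ESB_P) != 0);
}
```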
      
      There is some additional code to handle escalation interrupts
      properly, because they are enabled and disabled in KVM assembly code,
      which does not have access to the xive_irq_data struct for the
      escalation interrupt.  Hence, stale_p may be incorrect when the
      escalation interrupt is freed in kvmppc_xive_{,native_}cleanup_vcpu().
      Fortunately, we can fix it up by looking at vcpu->arch.xive_esc_on,
      with some careful attention to barriers in order to ensure the correct
      result if xive_esc_irq() races with kvmppc_xive_cleanup_vcpu().
      
      Finally, this adds code to make noise on the console (pr_crit and
      WARN_ON(1)) if we find an interrupt queue entry for an interrupt
      which does not have a descriptor.  While this won't catch the race
      reliably, if it does get triggered it will be an indication that
      the race is occurring and needs to be debugged.
      
      Fixes: 243e2511 ("powerpc/xive: Native exploitation of the XIVE interrupt controller")
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190813100648.GE9567@blackberry
    • P
      KVM: PPC: Book3S HV: Don't push XIVE context when not using XIVE device · 8d4ba9c9
      Paul Mackerras committed
      At present, when running a guest on POWER9 using HV KVM but not using
      an in-kernel interrupt controller (XICS or XIVE), for example if QEMU
      is run with the kernel_irqchip=off option, the guest entry code goes
      ahead and tries to load the guest context into the XIVE hardware, even
      though no context has been set up.
      
      To fix this, we check that the "CAM word" is non-zero before pushing
      it to the hardware.  The CAM word is initialized to a non-zero value
      in kvmppc_xive_connect_vcpu() and kvmppc_xive_native_connect_vcpu(),
      and is now cleared in kvmppc_xive_{,native_}cleanup_vcpu.
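
      The guard amounts to the following, shown here as a small C model.
      The real check lives in the guest entry path; struct vcpu_model,
      struct hw_model and push_guest_context() are hypothetical names used
      only for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the per-vCPU state: zero means "no context". */
struct vcpu_model {
    uint32_t cam_word;
};

/* Hypothetical model of the hardware thread context. */
struct hw_model {
    uint32_t pushed_cam;
    bool context_valid;
};

/* Push the guest context only if a CAM word has been set up. */
static bool push_guest_context(const struct vcpu_model *v, struct hw_model *hw)
{
    assert(v != NULL && hw != NULL);
    if (v->cam_word == 0)
        return false; /* nothing was connected, leave the hardware alone */
    hw->pushed_cam = v->cam_word;
    hw->context_valid = true;
    return true;
}
```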
      
      Fixes: 5af50993 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
      Cc: stable@vger.kernel.org # v4.12+
      Reported-by: Cédric Le Goater <clg@kaod.org>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Reviewed-by: Cédric Le Goater <clg@kaod.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190813100100.GC9567@blackberry
    • P
      KVM: PPC: Book3S HV: Fix race in re-enabling XIVE escalation interrupts · 959c5d51
      Paul Mackerras committed
      Escalation interrupts are interrupts sent to the host by the XIVE
      hardware when it has an interrupt to deliver to a guest VCPU but that
      VCPU is not running anywhere in the system.  Hence we disable the
      escalation interrupt for the VCPU being run when we enter the guest
      and re-enable it when the guest does an H_CEDE hypercall indicating
      it is idle.
      
      It is possible that an escalation interrupt gets generated just as we
      are entering the guest.  In that case the escalation interrupt may be
      using a queue entry in one of the interrupt queues, and that queue
      entry may not have been processed when the guest exits with an H_CEDE.
      The existing entry code detects this situation and does not clear the
      vcpu->arch.xive_esc_on flag as an indication that there is a pending
      queue entry (if the queue entry gets processed, xive_esc_irq() will
      clear the flag).  There is a comment in the code saying that if the
      flag is still set on H_CEDE, we have to abort the cede rather than
      re-enabling the escalation interrupt, lest we end up with two
      occurrences of the escalation interrupt in the interrupt queue.
      
      However, the exit code doesn't do that; it aborts the cede in the sense
      that vcpu->arch.ceded gets cleared, but it still enables the escalation
      interrupt by setting the source's PQ bits to 00.  Instead we need to
      set the PQ bits to 10, indicating that an interrupt has been triggered.
      We also need to avoid setting vcpu->arch.xive_esc_on in this case
      (i.e. vcpu->arch.xive_esc_on seen to be set on H_CEDE) because
      xive_esc_irq() will run at some point and clear it, and if we race with
      that we may end up with an incorrect result (i.e. xive_esc_on set when
      the escalation interrupt has just been handled).
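
      The intended H_CEDE behaviour can be modelled in C roughly as below.
      The real logic is in KVM assembly code; struct esc_model and
      cede_reenable_escalation() are hypothetical, and setting esc_on in
      the normal path is an assumption inferred from the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PQ_RESET   0x0u /* 00: source enabled, nothing pending */
#define PQ_PENDING 0x2u /* 10: P set, an interrupt has been triggered */

/* Hypothetical model of the per-vCPU escalation state. */
struct esc_model {
    bool esc_on;     /* models vcpu->arch.xive_esc_on */
    bool ceded;      /* models vcpu->arch.ceded */
    unsigned int pq; /* PQ bits of the escalation source */
};

static void cede_reenable_escalation(struct esc_model *m)
{
    assert(m != NULL);
    if (m->esc_on) {
        /*
         * A queue entry may still be pending: abort the cede, mark the
         * source as triggered, and leave esc_on for the handler to clear.
         */
        m->ceded = false;
        m->pq = PQ_PENDING;
        return;
    }
    /* Normal case: re-enable the source and record that it is armed. */
    m->pq = PQ_RESET;
    m->esc_on = true;
}
```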
      
      It is extremely unlikely that having two queue entries would cause
      observable problems; theoretically it could cause queue overflow, but
      the CPU would have to have thousands of interrupts targeted to it for
      that to be possible.  However, this fix will also make it possible to
      determine accurately whether there is an unhandled escalation
      interrupt in the queue, which will be needed by the following patch.
      
      Fixes: 9b9b13a6 ("KVM: PPC: Book3S HV: Keep XIVE escalation interrupt masked unless ceded")
      Cc: stable@vger.kernel.org # v4.16+
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190813100349.GD9567@blackberry
    • C
      KVM: PPC: Book3S HV: XIVE: Free escalation interrupts before disabling the VP · 237aed48
      Cédric Le Goater committed
      When a vCPU is brought down, the XIVE VP (Virtual Processor) is first
      disabled and then the event notification queues are freed. When freeing
      the queues, we check for possible escalation interrupts and free them
      also.
      
      But when a XIVE VP is disabled, the underlying XIVE ENDs also are
      disabled in OPAL. When an END (Event Notification Descriptor) is
      disabled, its ESB pages (ESn and ESe) are disabled and loads return all
      1s, which means that any access to the ESB page of the escalation
      interrupt will return invalid values.
      
      When an interrupt is freed, the shutdown handler computes a 'saved_p'
      field from the value returned by a load in xive_do_source_set_mask().
      This value is incorrect for escalation interrupts for the reason
      described above.
      
      This has no impact on Linux/KVM today because we don't make use of it,
      but future changes will introduce a xive_get_irqchip_state() handler.
      This handler will use the 'saved_p' field to return the state of an
      interrupt, and with 'saved_p' being incorrect, a softlockup will occur.
      
      Fix the vCPU cleanup sequence by first freeing the escalation interrupts,
      if any, then disabling the XIVE VP, and finally freeing the queues.
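
      Why the ordering matters can be seen in a small C model.  It assumes
      only what is described above (loads from a disabled END's ESB page
      return all 1s, and 'saved_p' is derived from such a load); struct
      end_model and the functions below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ESB_P 0x2u /* P bit in the value returned by an ESB load */

/* Hypothetical model of an END and the PQ bits of its escalation source. */
struct end_model {
    bool enabled;
    unsigned int pq;
};

/* Loads from a disabled END return all 1s. */
static uint64_t esb_load(const struct end_model *e)
{
    assert(e != NULL);
    return e->enabled ? (uint64_t)e->pq : UINT64_MAX;
}

/* Models the shutdown path: derive saved_p from an ESB load. */
static bool free_escalation_irq(const struct end_model *e)
{
    return (esb_load(e) & ESB_P) != 0;
}

/* Returns the saved_p value computed under the chosen ordering. */
static bool cleanup_vcpu(struct end_model *e, bool free_irq_first)
{
    bool saved_p;

    if (free_irq_first) {
        saved_p = free_escalation_irq(e); /* END still enabled */
        e->enabled = false;               /* then disable the VP */
    } else {
        e->enabled = false;               /* old order: disable first */
        saved_p = free_escalation_irq(e); /* load now returns all 1s */
    }
    return saved_p;
}
```

      With PQ at 00, the fixed order yields saved_p == false, while the old
      order yields a spurious true.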
      
      Fixes: 90c73795 ("KVM: PPC: Book3S HV: Add a new KVM device for the XIVE native exploitation mode")
      Fixes: 5af50993 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: Cédric Le Goater <clg@kaod.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190806172538.5087-1-clg@kaod.org
  3. 15 Aug, 2019 3 commits
  4. 12 Aug, 2019 1 commit
  5. 05 Aug, 2019 11 commits
  6. 31 Jul, 2019 3 commits
  7. 30 Jul, 2019 2 commits
    • M
      powerpc/spe: Mark expected switch fall-throughs · 7db57e77
      Michael Ellerman committed
      Mark switch cases where we are expecting to fall through.
      
      Fixes errors such as below, seen with mpc85xx_defconfig:
      
        arch/powerpc/kernel/align.c: In function 'emulate_spe':
        arch/powerpc/kernel/align.c:178:8: error: this statement may fall through
          ret |= __get_user_inatomic(temp.v[3], p++);
              ^~
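
      For illustration, a marked fall-through looks like the sketch below.
      This is not the emulate_spe() code; sum_first() is a hypothetical
      function that mimics the same unrolled switch style.  GCC accepts a
      "fall through" comment as an annotation for -Wimplicit-fallthrough.

```c
#include <assert.h>
#include <stddef.h>

/* Sum the first n (0..4) elements of v using deliberate fall-through. */
static int sum_first(const int *v, size_t n)
{
    int sum = 0;

    assert(v != NULL && n <= 4);
    switch (n) {
    case 4:
        sum += v[3];
        /* fall through */
    case 3:
        sum += v[2];
        /* fall through */
    case 2:
        sum += v[1];
        /* fall through */
    case 1:
        sum += v[0];
        break;
    default:
        break;
    }
    return sum;
}
```
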
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190730141917.21817-1-mpe@ellerman.id.au
    • A
      powerpc/nvdimm: Pick nearby online node if the device node is not online · da1115fd
      Aneesh Kumar K.V committed
      Currently, the nvdimm subsystem expects the device NUMA node for an SCM
      device to be an online node. It also doesn't try to bring the device NUMA
      node online. Hence, if we use a non-online NUMA node as the device node, we
      hit crashes like the one below. This is because we try to access
      uninitialized NODE_DATA in different code paths.
      
      cpu 0x0: Vector: 300 (Data Access) at [c0000000fac53170]
          pc: c0000000004bbc50: ___slab_alloc+0x120/0xca0
          lr: c0000000004bc834: __slab_alloc+0x64/0xc0
          sp: c0000000fac53400
         msr: 8000000002009033
         dar: 73e8
       dsisr: 80000
        current = 0xc0000000fabb6d80
        paca    = 0xc000000003870000   irqmask: 0x03   irq_happened: 0x01
          pid   = 7, comm = kworker/u16:0
      Linux version 5.2.0-06234-g76bd729b2644 (kvaneesh@ltc-boston123) (gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #135 SMP Thu Jul 11 05:36:30 CDT 2019
      enter ? for help
      [link register   ] c0000000004bc834 __slab_alloc+0x64/0xc0
      [c0000000fac53400] c0000000fac53480 (unreliable)
      [c0000000fac53500] c0000000004bc818 __slab_alloc+0x48/0xc0
      [c0000000fac53560] c0000000004c30a0 __kmalloc_node_track_caller+0x3c0/0x6b0
      [c0000000fac535d0] c000000000cfafe4 devm_kmalloc+0x74/0xc0
      [c0000000fac53600] c000000000d69434 nd_region_activate+0x144/0x560
      [c0000000fac536d0] c000000000d6b19c nd_region_probe+0x17c/0x370
      [c0000000fac537b0] c000000000d6349c nvdimm_bus_probe+0x10c/0x230
      [c0000000fac53840] c000000000cf3cc4 really_probe+0x254/0x4e0
      [c0000000fac538d0] c000000000cf429c driver_probe_device+0x16c/0x1e0
      [c0000000fac53950] c000000000cf0b44 bus_for_each_drv+0x94/0x130
      [c0000000fac539b0] c000000000cf392c __device_attach+0xdc/0x200
      [c0000000fac53a50] c000000000cf231c bus_probe_device+0x4c/0xf0
      [c0000000fac53a90] c000000000ced268 device_add+0x528/0x810
      [c0000000fac53b60] c000000000d62a58 nd_async_device_register+0x28/0xa0
      [c0000000fac53bd0] c0000000001ccb8c async_run_entry_fn+0xcc/0x1f0
      [c0000000fac53c50] c0000000001bcd9c process_one_work+0x46c/0x860
      [c0000000fac53d20] c0000000001bd4f4 worker_thread+0x364/0x5f0
      [c0000000fac53db0] c0000000001c7260 kthread+0x1b0/0x1c0
      [c0000000fac53e20] c00000000000b954 ret_from_kernel_thread+0x5c/0x68
      
      The patch tries to fix this by picking the nearest online node as the SCM node.
      This does have the problem that we lose the information that the SCM node is
      equidistant from two other online nodes. If applications need to understand these
      fine-grained details, we should express them as x86 does, via
      /sys/devices/system/node/nodeX/accessY/initiators/
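
      The selection rule can be sketched as a stand-alone C function.  This
      is a user-space model, not the papr_scm code: MAX_NODES, NO_NODE and
      nearest_online_node() are hypothetical, and the distance table is
      passed in rather than read from the kernel.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define MAX_NODES 4
#define NO_NODE   (-1)

/*
 * Return node itself if it is online (or unset); otherwise return the
 * online node with the smallest distance to it.  Ties go to the
 * lowest-numbered node, which is where the "equidistant" information
 * mentioned above is lost.
 */
static int nearest_online_node(int node, const bool online[MAX_NODES],
                               int dist[MAX_NODES][MAX_NODES])
{
    int best = NO_NODE;
    int best_dist = INT_MAX;

    if (node == NO_NODE)
        return node;
    assert(node >= 0 && node < MAX_NODES);
    if (online[node])
        return node;

    for (int nid = 0; nid < MAX_NODES; nid++) {
        if (online[nid] && dist[node][nid] < best_dist) {
            best_dist = dist[node][nid];
            best = nid;
        }
    }
    return best;
}
```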
      
      With the patch we get
      
       # numactl -H
      available: 2 nodes (0-1)
      node 0 cpus:
      node 0 size: 0 MB
      node 0 free: 0 MB
      node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
      node 1 size: 130865 MB
      node 1 free: 129130 MB
      node distances:
      node   0   1
        0:  10  20
        1:  20  10
       # cat /sys/bus/nd/devices/region0/numa_node
      0
       # dmesg | grep papr_scm
      [   91.332305] papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Region registered with target node 2 and online node 0
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190729095128.23707-1-aneesh.kumar@linux.ibm.com
  8. 29 Jul, 2019 10 commits
  9. 28 Jul, 2019 4 commits
    • L
      Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · a9815a4f
      Linus Torvalds committed
      Pull x86 fixes from Thomas Gleixner:
       "A set of x86 fixes and functional updates:
      
         - Prevent stale huge I/O TLB mappings on 32bit. A long standing bug
           which got exposed by KPTI support for 32bit
      
         - Prevent bogus access_ok() warnings in arch_stack_walk_user()
      
         - Add display quirks for Lenovo devices which have height and width
           swapped
      
         - Add the missing CR2 fixup for 32 bit async pagefaults. Fallout of
           the CR2 bug fix series.
      
         - Unbreak handling of force enabled HPET by moving the 'is HPET
           counting' check back to the original place.
      
         - A more accurate check for running on a hypervisor platform in the
           MDS mitigation code. Not perfect, but more accurate than the
           previous one.
      
         - Update a stale and confusing comment vs. IRQ stacks"
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/speculation/mds: Apply more accurate check on hypervisor platform
        x86/hpet: Undo the early counter is counting check
        x86/entry/32: Pass cr2 to do_async_page_fault()
        x86/irq/64: Update stale comment
        x86/sysfb_efi: Add quirks for some devices with swapped width and height
        x86/stacktrace: Prevent access_ok() warnings in arch_stack_walk_user()
        mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()
        x86/mm: Sync also unmappings in vmalloc_sync_all()
        x86/mm: Check for pfn instead of page in vmalloc_sync_one()
    • L
      Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · e24ce84e
      Linus Torvalds committed
      Pull scheduler fixes from Thomas Gleixner:
       "Two fixes for the fair scheduling class:
      
         - Prevent freeing memory which is accessible by concurrent readers
      
         - Make the RCU annotations for numa groups consistent"
      
      * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        sched/fair: Use RCU accessors consistently for ->numa_group
        sched/fair: Don't free p->numa_faults with concurrent readers
    • L
      Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 750991f9
      Linus Torvalds committed
      Pull perf fixes from Thomas Gleixner:
       "A pile of perf related fixes:
      
        Kernel:
         - Fix SLOTS PEBS event constraints for Icelake CPUs
      
         - Add the missing mask bit to allow counting hardware generated
           prefetches on L3 for Icelake CPUs
      
         - Make the test for hypervisor platforms more accurate (as far as
           possible)
      
         - Handle PMUs correctly which override event->cpu
      
         - Yet another missing fallthrough annotation
      
        Tools:
           perf.data:
              - Fix loading of compressed data split across adjacent records
              - Fix buffer size setting for processing CPU topology perf.data
                header.
      
           perf stat:
              - Fix segfault for event group in repeat mode
              - Always separate "stalled cycles per insn" line, it was being
                appended to the "instructions" line.
      
           perf script:
              - Fix --max-blocks man page description.
              - Improve man page description of metrics.
              - Fix off by one in brstackinsn IPC computation.
      
           perf probe:
              - Avoid calling freeing routine multiple times for same pointer.
      
           perf build:
              - Do not use -Wshadow on gcc < 4.8, avoiding too strict warnings
                treated as errors, breaking the build"
      
      * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf/x86/intel: Mark expected switch fall-throughs
        perf/core: Fix creating kernel counters for PMUs that override event->cpu
        perf/x86: Apply more accurate check on hypervisor platform
        perf/x86/intel: Fix invalid Bit 13 for Icelake MSR_OFFCORE_RSP_x register
        perf/x86/intel: Fix SLOTS PEBS event constraint
        perf build: Do not use -Wshadow on gcc < 4.8
        perf probe: Avoid calling freeing routine multiple times for same pointer
        perf probe: Set pev->nargs to zero after freeing pev->args entries
        perf session: Fix loading of compressed data split across adjacent records
        perf stat: Always separate stalled cycles per insn
        perf stat: Fix segfault for event group in repeat mode
        perf tools: Fix proper buffer size for feature processing
        perf script: Fix off by one in brstackinsn IPC computation
        perf script: Improve man page description of metrics
        perf script: Fix --max-blocks man page description
    • L
      Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 431f288e
      Linus Torvalds committed
      Pull locking fixes from Thomas Gleixner:
       "A set of locking fixes:
      
         - Address the fallout of the rwsem rework. Missing ACQUIREs and a
           sanity check to prevent a use-after-free
      
         - Add missing checks for uninitialized mutexes when mutex debugging is
           enabled.
      
         - Remove the bogus code in the generic SMP variant of
           arch_futex_atomic_op_inuser()
      
         - Fixup the #ifdeffery in lockdep to prevent compile warnings"
      
      * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        locking/mutex: Test for initialized mutex
        locking/lockdep: Clean up #ifdef checks
        locking/lockdep: Hide unused 'class' variable
        locking/rwsem: Add ACQUIRE comments
        tty/ldsem, locking/rwsem: Add missing ACQUIRE to read_failed sleep loop
        lcoking/rwsem: Add missing ACQUIRE to read_slowpath sleep loop
        locking/rwsem: Add missing ACQUIRE to read_slowpath exit when queue is empty
        locking/rwsem: Don't call owner_on_cpu() on read-owner
        futex: Cleanup generic SMP variant of arch_futex_atomic_op_inuser()