1. 13 10月, 2017 1 次提交
    • A
      powerpc/perf: Fix IMC initialization crash · 0d8ba162
      Anju T Sudhakar 提交于
      Panic observed with latest firmware, and upstream kernel:
      
       NIP init_imc_pmu+0x8c/0xcf0
       LR  init_imc_pmu+0x2f8/0xcf0
       Call Trace:
         init_imc_pmu+0x2c8/0xcf0 (unreliable)
         opal_imc_counters_probe+0x300/0x400
         platform_drv_probe+0x64/0x110
         driver_probe_device+0x3d8/0x580
         __driver_attach+0x14c/0x1a0
         bus_for_each_dev+0x8c/0xf0
         driver_attach+0x34/0x50
         bus_add_driver+0x298/0x350
         driver_register+0x9c/0x180
         __platform_driver_register+0x5c/0x70
         opal_imc_driver_init+0x2c/0x40
         do_one_initcall+0x64/0x1d0
         kernel_init_freeable+0x280/0x374
         kernel_init+0x24/0x160
         ret_from_kernel_thread+0x5c/0x74
      
      While registering nest imc at init, cpu-hotplug callback
      nest_pmu_cpumask_init() makes an OPAL call to stop the engine. And if
      the OPAL call fails, imc_common_cpuhp_mem_free() is invoked to cleanup
      memory and cpuhotplug setup.
      
      But when cleaning up the attribute group, we are dereferencing the
      attribute element array without checking whether the backing element
      is not NULL. This causes the kernel panic.
      
      Add a check for the backing element prior to dereferencing the
      attribute element, to handle the failing case gracefully.
      Signed-off-by: NAnju T Sudhakar <anju@linux.vnet.ibm.com>
      Reported-by: NPridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
      [mpe: Trim change log]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      0d8ba162
  2. 12 10月, 2017 2 次提交
    • A
      powerpc/perf: Add ___GFP_NOWARN flag to alloc_pages_node() · cd4f2b30
      Anju T Sudhakar 提交于
      Stack trace output during a stress test:
       [    4.310049] Freeing initrd memory: 22592K
      [    4.310646] rtas_flash: no firmware flash support
      [    4.313341] cpuhp/64: page allocation failure: order:0, mode:0x14480c0(GFP_KERNEL|__GFP_ZERO|__GFP_THISNODE), nodemask=(null)
      [    4.313465] cpuhp/64 cpuset=/ mems_allowed=0
      [    4.313521] CPU: 64 PID: 392 Comm: cpuhp/64 Not tainted 4.11.0-39.el7a.ppc64le #1
      [    4.313588] Call Trace:
      [    4.313622] [c000000f1fb1b8e0] [c000000000c09388] dump_stack+0xb0/0xf0 (unreliable)
      [    4.313694] [c000000f1fb1b920] [c00000000030ef6c] warn_alloc+0x12c/0x1c0
      [    4.313753] [c000000f1fb1b9c0] [c00000000030ff68] __alloc_pages_nodemask+0xea8/0x1000
      [    4.313823] [c000000f1fb1bbb0] [c000000000113a8c] core_imc_mem_init+0xbc/0x1c0
      [    4.313892] [c000000f1fb1bc00] [c000000000113cdc] ppc_core_imc_cpu_online+0x14c/0x170
      [    4.313962] [c000000f1fb1bc90] [c000000000125758] cpuhp_invoke_callback+0x198/0x5d0
      [    4.314031] [c000000f1fb1bd00] [c00000000012782c] cpuhp_thread_fun+0x8c/0x3d0
      [    4.314101] [c000000f1fb1bd60] [c0000000001678d0] smpboot_thread_fn+0x290/0x2a0
      [    4.314169] [c000000f1fb1bdc0] [c00000000015ee78] kthread+0x168/0x1b0
      [    4.314229] [c000000f1fb1be30] [c00000000000b368] ret_from_kernel_thread+0x5c/0x74
      [    4.314313] Mem-Info:
      [    4.314356] active_anon:0 inactive_anon:0 isolated_anon:0
      
      core_imc_mem_init() at system boot use alloc_pages_node() to get memory
      and alloc_pages_node() throws this stack dump when tried to allocate
      memory from a node which has no memory behind it. Add a ___GFP_NOWARN
      flag in allocation request as a fix.
      Signed-off-by: NAnju T Sudhakar <anju@linux.vnet.ibm.com>
      Reported-by: NMichael Ellerman <mpe@ellerman.id.au>
      Reported-by: NVenkat R.B <venkatb3@in.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      cd4f2b30
    • A
      powerpc/perf: Fix for core/nest imc call trace on cpuhotplug · 0d923820
      Anju T Sudhakar 提交于
      Nest/core pmu units are enabled only when it is used. A reference count is
      maintained for the events which uses the nest/core pmu units. Currently in
      *_imc_counters_release function a WARN() is used for notification of any
      underflow of ref count.
      
      The case where event ref count hit a negative value is, when perf session is
      started, followed by offlining of all cpus in a given core.
      i.e. in cpuhotplug offline path ppc_core_imc_cpu_offline() function set the
      ref->count to zero, if the current cpu which is about to offline is the last
      cpu in a given core and make an OPAL call to disable the engine in that core.
      And on perf session termination, perf->destroy (core_imc_counters_release) will
      first decrement the ref->count for this core and based on the ref->count value
      an opal call is made to disable the core-imc engine.
      Now, since cpuhotplug path already clears the ref->count for core and disabled
      the engine, perf->destroy() decrementing again at event termination make it
      negative which in turn fires the WARN_ON. The same happens for nest units.
      
      Add a check to see if the reference count is alreday zero, before decrementing
      the count, so that the ref count will not hit a negative value.
      Signed-off-by: NAnju T Sudhakar <anju@linux.vnet.ibm.com>
      Reviewed-by: NSantosh Sivaraj <santosh@fossix.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      0d923820
  3. 10 10月, 2017 3 次提交
    • T
      powerpc: Don't call lockdep_assert_cpus_held() from arch_update_cpu_topology() · 6b2c08f9
      Thiago Jung Bauermann 提交于
      It turns out that not all paths calling arch_update_cpu_topology() hold
      cpu_hotplug_lock, but that's OK because those paths can't race with
      any concurrent hotplug events.
      
      Warnings were reported with the following trace:
      
        lockdep_assert_cpus_held
        arch_update_cpu_topology
        sched_init_domains
        sched_init_smp
        kernel_init_freeable
        kernel_init
        ret_from_kernel_thread
      
      Which is safe because it's called early in boot when hotplug is not
      live yet.
      
      And also this trace:
      
        lockdep_assert_cpus_held
        arch_update_cpu_topology
        partition_sched_domains
        cpuset_update_active_cpus
        sched_cpu_deactivate
        cpuhp_invoke_callback
        cpuhp_down_callbacks
        cpuhp_thread_fun
        smpboot_thread_fn
        kthread
        ret_from_kernel_thread
      
      Which is safe because it's called as part of CPU hotplug, so although
      we don't hold the CPU hotplug lock, there is another thread driving
      the CPU hotplug operation which does hold the lock, and there is no
      race.
      
      Thanks to tglx for deciphering it for us.
      
      Fixes: 3e401f7a ("powerpc: Only obtain cpu_hotplug_lock if called by rtasd")
      Signed-off-by: NThiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      6b2c08f9
    • S
      powerpc/lib/sstep: Fix count leading zeros instructions · b0490a04
      Sandipan Das 提交于
      According to the GCC documentation, the behaviour of __builtin_clz()
      and __builtin_clzl() is undefined if the value of the input argument
      is zero. Without handling this special case, these builtins have been
      used for emulating the following instructions:
        * Count Leading Zeros Word (cntlzw[.])
        * Count Leading Zeros Doubleword (cntlzd[.])
      
      This fixes the emulated behaviour of these instructions by adding an
      additional check for this special case.
      
      Fixes: 3cdfcbfd ("powerpc: Change analyse_instr so it doesn't modify *regs")
      Signed-off-by: NSandipan Das <sandipan@linux.vnet.ibm.com>
      Reviewed-by: NNaveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      b0490a04
    • K
      powerpc/livepatch: Fix livepatch stack access · e36a82ee
      Kamalesh Babulal 提交于
      While running stress test with livepatch module loaded, kernel bug was
      triggered.
      
        cpu 0x5: Vector: 400 (Instruction Access) at [c0000000eb9d3b60]
        5:mon> t
        [c0000000eb9d3de0] c0000000eb9d3e30 (unreliable)
        [c0000000eb9d3e30] c000000000008ab4 hardware_interrupt_common+0x114/0x120
         --- Exception: 501 (Hardware Interrupt) at c000000000053040 livepatch_handler+0x4c/0x74
        [c0000000eb9d4120] 0000000057ac6e9d (unreliable)
        [d0000000089d9f78] 2e0965747962382e
        SP (965747962342e09) is in userspace
      
      When an interrupt occurs during the livepatch_handler execution, it's
      possible for the livepatch_stack and/or thread_info to be corrupted.
      eg:
      
        Task A                        Interrupt Handler
        =========                     =================
        livepatch_handler:
        mr r0, r1
        ld r1, TI_livepatch_sp(r12)
                                      hardware_interrupt_common:
                                        do_IRQ+0x8:
                                          mflr    r0          <- saved stack pointer is overwritten
                                          bl      _mcount
                                          ...
                                          std     r27,-40(r1) <- overwrite of thread_info()
      
        lis r2, STACK_END_MAGIC@h
        ori r2, r2, STACK_END_MAGIC@l
        ld  r12, -8(r1)
      
      Fix the corruption by using r11 register for livepatch stack
      manipulation, instead of shuffling task stack and livepatch stack into
      r1 register. Using r11 register also avoids disabling/enabling irq's
      while setting up the livepatch stack.
      Signed-off-by: NKamalesh Babulal <kamalesh@linux.vnet.ibm.com>
      Reviewed-by: NNaveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Reviewed-by: NBalbir Singh <bsingharora@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      e36a82ee
  4. 06 10月, 2017 3 次提交
    • G
      powerpc/tm: Fix illegal TM state in signal handler · 044215d1
      Gustavo Romero 提交于
      Currently it's possible that on returning from the signal handler
      through the restore_tm_sigcontexts() code path (e.g. from a signal
      caught due to a `trap` instruction executed in the middle of an HTM
      block, or a deliberately constructed sigframe) an illegal TM state
      (like TS=10 TM=0, i.e. "T0") is set in SRR1 and when `rfid` sets
      implicitly the MSR register from SRR1 register on return to userspace
      it causes a TM Bad Thing exception.
      
      That illegal state can be set (a) by a malicious user that disables
      the TM bit by tweaking the bits in uc_mcontext before returning from
      the signal handler or (b) by a sufficient number of context switches
      occurring such that the load_tm counter overflows and TM is disabled
      whilst in the signal handler.
      
      This commit fixes the illegal TM state by ensuring that TM bit is
      always enabled before we return from restore_tm_sigcontexts(). A small
      comment correction is made as well.
      
      Fixes: 5d176f75 ("powerpc: tm: Enable transactional memory (TM) lazily for userspace")
      Cc: stable@vger.kernel.org # v4.9+
      Signed-off-by: NGustavo Romero <gromero@linux.vnet.ibm.com>
      Signed-off-by: NBreno Leitao <leitao@debian.org>
      Signed-off-by: NCyril Bur <cyrilbur@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      044215d1
    • C
      powerpc/64s: Use emergency stack for kernel TM Bad Thing program checks · 265e60a1
      Cyril Bur 提交于
      When using transactional memory (TM), the CPU can be in one of six
      states as far as TM is concerned, encoded in the Machine State
      Register (MSR). Certain state transitions are illegal and if attempted
      trigger a "TM Bad Thing" type program check exception.
      
      If we ever hit one of these exceptions it's treated as a bug, ie. we
      oops, and kill the process and/or panic, depending on configuration.
      
      One case where we can trigger a TM Bad Thing, is when returning to
      userspace after a system call or interrupt, using RFID. When this
      happens the CPU first restores the user register state, in particular
      r1 (the stack pointer) and then attempts to update the MSR. However
      the MSR update is not allowed and so we take the program check with
      the user register state, but the kernel MSR.
      
      This tricks the exception entry code into thinking we have a bad
      kernel stack pointer, because the MSR says we're coming from the
      kernel, but r1 is pointing to userspace.
      
      To avoid this we instead always switch to the emergency stack if we
      take a TM Bad Thing from the kernel. That way none of the user
      register values are used, other than for printing in the oops message.
      
      This is the fix for CVE-2017-1000255.
      
      Fixes: 5d176f75 ("powerpc: tm: Enable transactional memory (TM) lazily for userspace")
      Cc: stable@vger.kernel.org # v4.9+
      Signed-off-by: NCyril Bur <cyrilbur@gmail.com>
      [mpe: Rewrite change log & comments, tweak asm slightly]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      265e60a1
    • A
      powerpc/powernv: Increase memory block size to 1GB on radix · 53ecde0b
      Anton Blanchard 提交于
      Memory hot unplug on PowerNV radix hosts is broken. Our memory block
      size is 256MB but since we map the linear region with very large
      pages, each pte we tear down maps 1GB.
      
      A hot unplug of one 256MB memory block results in 768MB of memory
      getting unintentionally unmapped. At this point we are likely to oops.
      
      Fix this by increasing our memory block size to 1GB on PowerNV radix
      hosts.
      
      Fixes: 4b5d62ca ("powerpc/mm: add radix__remove_section_mapping()")
      Cc: stable@vger.kernel.org # v4.11+
      Signed-off-by: NAnton Blanchard <anton@samba.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      53ecde0b
  5. 04 10月, 2017 7 次提交
    • G
      powerpc/mm: Call flush_tlb_kernel_range with interrupts enabled · 7c6a4f3b
      Guenter Roeck 提交于
      flush_tlb_kernel_range() may call smp_call_function_many() which expects
      interrupts to be enabled. This results in a traceback.
      
      WARNING: CPU: 0 PID: 1 at kernel/smp.c:416 smp_call_function_many+0xcc/0x2fc
      CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.14.0-rc1-00009-g0666f560 #1
      task: cf830000 task.stack: cf82e000
      NIP:  c00a93c8 LR: c00a9634 CTR: 00000001
      REGS: cf82fde0 TRAP: 0700   Not tainted  (4.14.0-rc1-00009-g0666f560)
      MSR:  00021000 <CE,ME>  CR: 24000082  XER: 00000000
      
      GPR00: c00a9634 cf82fe90 cf830000 c050ad3c c0015a54 00000000 00000001 00000001
      GPR08: 00000001 00000000 00000000 cf82e000 24000084 00000000 c0003150 00000000
      GPR16: 00000000 00000000 00000000 00000000 00000000 00000001 00000000 c0510000
      GPR24: 00000000 c0015a54 00000000 c050ad3c c051823c c050ad3c 00000025 00000000
      NIP [c00a93c8] smp_call_function_many+0xcc/0x2fc
      LR [c00a9634] smp_call_function+0x3c/0x50
      Call Trace:
      [cf82fe90] [00000010] 0x10 (unreliable)
      [cf82fed0] [c00a9634] smp_call_function+0x3c/0x50
      [cf82fee0] [c0015d2c] flush_tlb_kernel_range+0x20/0x38
      [cf82fef0] [c001524c] mark_initmem_nx+0x154/0x16c
      [cf82ff20] [c001484c] free_initmem+0x20/0x4c
      [cf82ff30] [c000316c] kernel_init+0x1c/0x108
      [cf82ff40] [c000f3a8] ret_from_kernel_thread+0x5c/0x64
      Instruction dump:
      7c0803a6 7d808120 38210040 4e800020 3d20c052 812981a0 2f890000 40beffac
      3d20c051 8929ac64 2f890000 40beff9c <0fe00000> 4bffff94 7fc3f378 7f64db78
      
      Fixes: 3184cc4b ("powerpc/mm: Fix kernel RAM protection after freeing ...")
      Fixes: e611939f ("powerpc/mm: Ensure change_page_attr() doesn't ...")
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: NGuenter Roeck <linux@roeck-us.net>
      Reviewed-by: NChristophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      7c6a4f3b
    • C
      powerpc/xive: Clear XIVE internal structures when a CPU is removed · cc569398
      Cédric Le Goater 提交于
      Commit eac1e731 ("powerpc/xive: guest exploitation of the XIVE
      interrupt controller") introduced support for the XIVE exploitation
      mode of the P9 interrupt controller on the pseries platform.
      
      At that time, support for CPU removal was not complete on PowerVM and
      CPU hot unplug remained untested. It appears that some cleanups of the
      XIVE internal structures are required before releasing the CPU,
      without which the kernel crashes in a RTAS call doing the CPU
      isolation.
      
      These changes fix the crash by deconfiguring the IPI interrupt source
      and clearing the event queues of the CPU when it is removed.
      
      Fixes: eac1e731 ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
      Signed-off-by: NCédric Le Goater <clg@kaod.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      cc569398
    • C
      powerpc/xive: Fix IPI reset · 74f12821
      Cédric Le Goater 提交于
      When resetting an IPI, hw_ipi should also be set to zero.
      
      Fixes: eac1e731 ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
      Signed-off-by: NCédric Le Goater <clg@kaod.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      74f12821
    • T
      powerpc/watchdog: Make use of watchdog_nmi_probe() · 34ddaa3e
      Thomas Gleixner 提交于
      The rework of the core hotplug code triggers the WARN_ON in start_wd_cpu()
      on powerpc because it is called multiple times for the boot CPU.
      
      The first call is via:
      
        start_wd_on_cpu+0x80/0x2f0
        watchdog_nmi_reconfigure+0x124/0x170
        softlockup_reconfigure_threads+0x110/0x130
        lockup_detector_init+0xbc/0xe0
        kernel_init_freeable+0x18c/0x37c
        kernel_init+0x2c/0x160
        ret_from_kernel_thread+0x5c/0xbc
      
      And then again via the CPU hotplug registration:
      
        start_wd_on_cpu+0x80/0x2f0
        cpuhp_invoke_callback+0x194/0x620
        cpuhp_thread_fun+0x7c/0x1b0
        smpboot_thread_fn+0x290/0x2a0
        kthread+0x168/0x1b0
        ret_from_kernel_thread+0x5c/0xbc
      
      This can be avoided by setting up the cpu hotplug state with nocalls and
      move the initialization to the watchdog_nmi_probe() function. That
      initializes the hotplug callbacks without invoking the callback and the
      following core initialization function then configures the watchdog for the
      online CPUs (in this case CPU0) via softlockup_reconfigure_threads().
      Reported-and-tested-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NMichael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: linuxppc-dev@lists.ozlabs.org
      34ddaa3e
    • T
      watchdog/core, powerpc: Lock cpus across reconfiguration · e31d6883
      Thomas Gleixner 提交于
      Instead of dropping the cpu hotplug lock after stopping NMI watchdog and
      threads and reaquiring for restart, the code and the protection rules
      become more obvious when holding cpu hotplug lock across the full
      reconfiguration.
      Suggested-by: NLinus Torvalds <torvalds@linuxfoundation.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NMichael Ellerman <mpe@ellerman.id.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: linuxppc-dev@lists.ozlabs.org
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1710022105570.2114@nanos
      e31d6883
    • T
      watchdog/core, powerpc: Replace watchdog_nmi_reconfigure() · 6b9dc480
      Thomas Gleixner 提交于
      The recent cleanup of the watchdog code split watchdog_nmi_reconfigure()
      into two stages. One to stop the NMI and one to restart it after
      reconfiguration. That was done by adding a boolean 'run' argument to the
      code, which is functionally correct but not necessarily a piece of art.
      
      Replace it by two explicit functions: watchdog_nmi_stop() and
      watchdog_nmi_start().
      
      Fixes: 6592ad2f ("watchdog/core, powerpc: Make watchdog_nmi_reconfigure() two stage")
      Requested-by: NLinus 'Nursing his pet-peeve' Torvalds <torvalds@linuxfoundation.org>
      Signed-off-by: NThomas 'Mopping up garbage' Gleixner <tglx@linutronix.de>
      Acked-by: NMichael Ellerman <mpe@ellerman.id.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: linuxppc-dev@lists.ozlabs.org
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1710021957480.2114@nanos
      6b9dc480
    • I
      rapidio: remove global irq spinlocks from the subsystem · 31d1e130
      Ioan Nicu 提交于
      Locking of config and doorbell operations should be done only if the
      underlying hardware requires it.
      
      This patch removes the global spinlocks from the rapidio subsystem and
      moves them to the mport drivers (fsl_rio and tsi721), only to the
      necessary places.  For example, local config space read and write
      operations (lcread/lcwrite) are atomic in all existing drivers, so there
      should be no need for locking, while the cread/cwrite operations which
      generate maintenance transactions need to be synchronized with a lock.
      
      Later, each driver could chose to use a per-port lock instead of a
      global one, or even more granular locking.
      
      Link: http://lkml.kernel.org/r/20170824113023.GD50104@nokia.comSigned-off-by: NIoan Nicu <ioan.nicu.ext@nokia.com>
      Signed-off-by: NFrank Kunz <frank.kunz@nokia.com>
      Acked-by: NAlexandre Bounine <alexandre.bounine@idt.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      31d1e130
  6. 03 10月, 2017 3 次提交
    • S
      KVM: PPC: Book3S: Fix server always zero from kvmppc_xive_get_xive() · 2fb1e946
      Sam Bobroff 提交于
      In KVM's XICS-on-XIVE emulation, kvmppc_xive_get_xive() returns the
      value of state->guest_server as "server". However, this value is not
      set by it's counterpart kvmppc_xive_set_xive(). When the guest uses
      this interface to migrate interrupts away from a CPU that is going
      offline, it sees all interrupts as belonging to CPU 0, so they are
      left assigned to (now) offline CPUs.
      
      This patch removes the guest_server field from the state, and returns
      act_server in it's place (that is, the CPU actually handling the
      interrupt, which may differ from the one requested).
      
      Fixes: 5af50993 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
      Cc: stable@vger.kernel.org
      Signed-off-by: NSam Bobroff <sam.bobroff@au1.ibm.com>
      Acked-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      2fb1e946
    • C
      powerpc/4xx: Fix compile error with 64K pages on 40x, 44x · 070e0049
      Christian Lamparter 提交于
      The mmu context on the 40x, 44x does not define pte_frag entry. This
      causes gcc abort the compilation due to:
      
        setup-common.c: In function ‘setup_arch’:
        setup-common.c:908: error: ‘mm_context_t’ has no ‘pte_frag’
      
      This patch fixes the issue by removing the pte_frag initialization in
      setup-common.c.
      
      This is possible, because the compiler will do the initialization,
      since the mm_context is a sub struct of init_mm. init_mm is declared
      in mm_types.h as external linkage.
      
      According to C99 6.2.4.3:
        An object whose identifier is declared with external linkage
        [...] has static storage duration.
      
      C99 defines in 6.7.8.10 that:
        If an object that has static storage duration is not
        initialized explicitly, then:
        - if it has pointer type, it is initialized to a null pointer
      
      Fixes: b1923caa ("powerpc: Merge 32-bit and 64-bit setup_arch()")
      Signed-off-by: NChristian Lamparter <chunkeey@gmail.com>
      Reviewed-by: NChristophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      070e0049
    • J
      powerpc: Fix action argument for cpufeatures-based TLB flush · 3b7af5c0
      Jeremy Kerr 提交于
      Commit 41d0c2ec ("powerpc/powernv: Fix local TLB flush for boot
      and MCE on POWER9") introduced calls to __flush_tlb_power[89] from the
      cpufeatures code, specifying the number of sets to flush.
      
      However, these functions take an action argument, not a number of
      sets. This means we hit the BUG() in __flush_tlb_{206,300} when using
      cpufeatures-style configuration.
      
      This change passes TLB_INVAL_SCOPE_GLOBAL instead.
      
      Fixes: 41d0c2ec ("powerpc/powernv: Fix local TLB flush for boot and MCE on POWER9")
      Cc: stable@vger.kernel.org # v4.13+
      Signed-off-by: NJeremy Kerr <jk@ozlabs.org>
      Reviewed-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      3b7af5c0
  7. 29 9月, 2017 1 次提交
  8. 26 9月, 2017 1 次提交
    • M
      powerpc: Handle MCE on POWER9 with only DSISR bit 30 set · d8bd9f3f
      Michael Neuling 提交于
      On POWER9 DD2.1 and below, it's possible for a paste instruction to
      cause a Machine Check Exception (MCE) where only DSISR bit 30 (IBM 33)
      is set. This will result in the MCE handler seeing an unknown event,
      which triggers linux to crash.
      
      We change this by detecting unknown events caused by load/stores in
      the MCE handler and marking them as handled so that we no longer
      crash.
      
      An MCE that occurs like this is spurious, so we don't need to do
      anything in terms of servicing it. If there is something that needs to
      be serviced, the CPU will raise the MCE again with the correct DSISR
      so that it can be serviced properly.
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com
      Acked-by: NBalbir Singh <bsingharora@gmail.com>
      [mpe: Expand comment with details from change log, use normal bit #s]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      d8bd9f3f
  9. 22 9月, 2017 1 次提交
    • M
      KVM: PPC: Book3S HV: Check for updated HDSISR on P9 HDSI exception · e001fa78
      Michael Neuling 提交于
      On POWER9 DD2.1 and below, sometimes on a Hypervisor Data Storage
      Interrupt (HDSI) the HDSISR is not be updated at all.
      
      To work around this we put a canary value into the HDSISR before
      returning to a guest and then check for this canary when we take a
      HDSI. If we find the canary on a HDSI, we know the hardware didn't
      update the HDSISR. In this case we return to the guest to retake the
      HDSI which should correctly update the HDSISR the second time HDSI
      entry.
      
      After talking to Paulus we've applied this workaround to all POWER9
      CPUs. The workaround of returning to the guest shouldn't ever be
      triggered on well behaving CPU. The extra instructions should have
      negligible performance impact.
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      e001fa78
  10. 21 9月, 2017 3 次提交
    • T
      powerpc/pseries: Fix parent_dn reference leak in add_dt_node() · b537ca6f
      Tyrel Datwyler 提交于
      A reference to the parent device node is held by add_dt_node() for the
      node to be added. If the call to dlpar_configure_connector() fails
      add_dt_node() returns ENOENT and that reference is not freed.
      
      Add a call to of_node_put(parent_dn) prior to bailing out after a
      failed dlpar_configure_connector() call.
      
      Fixes: 8d5ff320 ("powerpc/pseries: Make dlpar_configure_connector parent node aware")
      Cc: stable@vger.kernel.org # v3.12+
      Signed-off-by: NTyrel Datwyler <tyreld@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      b537ca6f
    • T
      powerpc/pseries: Fix "OF: ERROR: Bad of_node_put() on /cpus" during DLPAR · 087ff6a5
      Tyrel Datwyler 提交于
      Commit 215ee763 ("powerpc: pseries: remove dlpar_attach_node
      dependency on full path") reworked dlpar_attach_node() to no longer
      look up the parent node "/cpus", but instead to have the parent node
      passed by the caller in the function parameter list.
      
      As a result dlpar_attach_node() is no longer responsible for freeing
      the reference to the parent node. However, commit 215ee763 failed
      to remove the of_node_put(parent) call in dlpar_attach_node(), or to
      take into account that the reference to the parent in the caller
      dlpar_cpu_add() needs to be held until after dlpar_attach_node()
      returns.
      
      As a result doing repeated cpu add/remove dlpar operations will
      eventually result in the following error:
      
        OF: ERROR: Bad of_node_put() on /cpus
        CPU: 0 PID: 10896 Comm: drmgr Not tainted 4.13.0-autotest #1
        Call Trace:
         dump_stack+0x15c/0x1f8 (unreliable)
         of_node_release+0x1a4/0x1c0
         kobject_put+0x1a8/0x310
         kobject_del+0xbc/0xf0
         __of_detach_node_sysfs+0x144/0x210
         of_detach_node+0xf0/0x180
         dlpar_detach_node+0xc4/0x120
         dlpar_cpu_remove+0x280/0x560
         dlpar_cpu_release+0xbc/0x1b0
         arch_cpu_release+0x6c/0xb0
         cpu_release_store+0xa0/0x100
         dev_attr_store+0x68/0xa0
         sysfs_kf_write+0xa8/0xf0
         kernfs_fop_write+0x2cc/0x400
         __vfs_write+0x5c/0x340
         vfs_write+0x1a8/0x3d0
         SyS_write+0xa8/0x1a0
         system_call+0x58/0x6c
      
      Fix the issue by removing the of_node_put(parent) call from
      dlpar_attach_node(), and ensuring that the reference to the parent
      node is properly held and released by the caller dlpar_cpu_add().
      
      Fixes: 215ee763 ("powerpc: pseries: remove dlpar_attach_node dependency on full path")
      Signed-off-by: NTyrel Datwyler <tyreld@linux.vnet.ibm.com>
      Reported-by: NAbdul Haleem <abdhalee@linux.vnet.ibm.com>
      [mpe: Add a comment in the code and frob the change log slightly]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      087ff6a5
    • B
      powerpc/eeh: Create PHB PEs after EEH is initialized · 3e77adee
      Benjamin Herrenschmidt 提交于
      Otherwise we end up not yet having computed the right diag data size
      on powernv where EEH initialization is delayed, thus causing memory
      corruption later on when calling OPAL.
      
      Fixes: 5cb1f8fd ("powerpc/powernv/pci: Dynamically allocate PHB diag data")
      Cc: stable@vger.kernel.org # v4.13+
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Acked-by: NRussell Currey <ruscur@russell.cc>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      3e77adee
  11. 20 9月, 2017 8 次提交
  12. 15 9月, 2017 2 次提交
  13. 14 9月, 2017 4 次提交
    • T
      watchdog/hardlockup: Clean up hotplug locking mess · ab5fe3ff
      Thomas Gleixner 提交于
      All watchdog thread related functions are delegated to the smpboot thread
      infrastructure, which handles serialization against CPU hotplug correctly.
      
      The sysctl interface is completely decoupled from anything which requires
      CPU hotplug protection.
      
      No need to protect the sysctl writes against cpu hotplug anymore. Remove it
      and add the now required protection to the powerpc arch_nmi_watchdog
      implementation.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NDon Zickus <dzickus@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sebastian Siewior <bigeasy@linutronix.de>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: linuxppc-dev@lists.ozlabs.org
      Link: http://lkml.kernel.org/r/20170912194148.418497420@linutronix.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      ab5fe3ff
    • T
      watchdog/core, powerpc: Make watchdog_nmi_reconfigure() two stage · 6592ad2f
      Thomas Gleixner 提交于
      Both the perf reconfiguration and the powerpc watchdog_nmi_reconfigure()
      need to be done in two steps.
      
           1) Stop all NMIs
           2) Read the new parameters and start NMIs
      
      Right now watchdog_nmi_reconfigure() is a combination of both. To allow a
      clean reconfiguration add a 'run' argument and split the functionality in
      powerpc.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NDon Zickus <dzickus@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sebastian Siewior <bigeasy@linutronix.de>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: linuxppc-dev@lists.ozlabs.org
      Link: http://lkml.kernel.org/r/20170912194147.862865570@linutronix.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6592ad2f
    • T
      watchdog/core: Remove broken suspend/resume interfaces · 5490125d
      Thomas Gleixner 提交于
      This interface has several issues:
      
       - It's causing recursive locking of the hotplug lock.
      
       - It's complete overkill to teardown all threads and then recreate them
      
      The same can be achieved with the simple hardlockup_detector_perf_stop /
      restart() interfaces. The abuse from the busy looping poweroff() loop of
      PARISC has been solved as well.
      
      Remove the cruft.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NDon Zickus <dzickus@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sebastian Siewior <bigeasy@linutronix.de>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Link: http://lkml.kernel.org/r/20170912194146.487537732@linutronix.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      5490125d
    • M
      mm: treewide: remove GFP_TEMPORARY allocation flag · 0ee931c4
      Michal Hocko 提交于
      GFP_TEMPORARY was introduced by commit e12ba74d ("Group short-lived
      and reclaimable kernel allocations") along with __GFP_RECLAIMABLE.  It's
      primary motivation was to allow users to tell that an allocation is
      short lived and so the allocator can try to place such allocations close
      together and prevent long term fragmentation.  As much as this sounds
      like a reasonable semantic it becomes much less clear when to use the
      highlevel GFP_TEMPORARY allocation flag.  How long is temporary? Can the
      context holding that memory sleep? Can it take locks? It seems there is
      no good answer for those questions.
      
      The current implementation of GFP_TEMPORARY is basically GFP_KERNEL |
      __GFP_RECLAIMABLE which in itself is tricky because basically none of
      the existing caller provide a way to reclaim the allocated memory.  So
      this is rather misleading and hard to evaluate for any benefits.
      
      I have checked some random users and none of them has added the flag
      with a specific justification.  I suspect most of them just copied from
      other existing users and others just thought it might be a good idea to
      use without any measuring.  This suggests that GFP_TEMPORARY just
      motivates for cargo cult usage without any reasoning.
      
      I believe that our gfp flags are quite complex already and especially
      those with highlevel semantic should be clearly defined to prevent from
      confusion and abuse.  Therefore I propose dropping GFP_TEMPORARY and
      replace all existing users to simply use GFP_KERNEL.  Please note that
      SLAB users with shrinkers will still get __GFP_RECLAIMABLE heuristic and
      so they will be placed properly for memory fragmentation prevention.
      
      I can see reasons we might want some gfp flag to reflect shorterm
      allocations but I propose starting from a clear semantic definition and
      only then add users with proper justification.
      
      This was been brought up before LSF this year by Matthew [1] and it
      turned out that GFP_TEMPORARY really doesn't have a clear semantic.  It
      seems to be a heuristic without any measured advantage for most (if not
      all) its current users.  The follow up discussion has revealed that
      opinions on what might be temporary allocation differ a lot between
      developers.  So rather than trying to tweak existing users into a
      semantic which they haven't expected I propose to simply remove the flag
      and start from scratch if we really need a semantic for short term
      allocations.
      
      [1] http://lkml.kernel.org/r/20170118054945.GD18349@bombadil.infradead.org
      
      [akpm@linux-foundation.org: fix typo]
      [akpm@linux-foundation.org: coding-style fixes]
      [sfr@canb.auug.org.au: drm/i915: fix up]
        Link: http://lkml.kernel.org/r/20170816144703.378d4f4d@canb.auug.org.au
      Link: http://lkml.kernel.org/r/20170728091904.14627-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0ee931c4
  14. 12 9月, 2017 1 次提交
    • P
      KVM: PPC: Book3S HV: Fix bug causing host SLB to be restored incorrectly · 67f8a8c1
      Paul Mackerras 提交于
      Aneesh Kumar reported seeing host crashes when running recent kernels
      on POWER8.  The symptom was an oops like this:
      
      Unable to handle kernel paging request for data at address 0xf00000000786c620
      Faulting instruction address: 0xc00000000030e1e4
      Oops: Kernel access of bad area, sig: 11 [#1]
      LE SMP NR_CPUS=2048 NUMA PowerNV
      Modules linked in: powernv_op_panel
      CPU: 24 PID: 6663 Comm: qemu-system-ppc Tainted: G        W 4.13.0-rc7-43932-gfc36c59 #2
      task: c000000fdeadfe80 task.stack: c000000fdeb68000
      NIP:  c00000000030e1e4 LR: c00000000030de6c CTR: c000000000103620
      REGS: c000000fdeb6b450 TRAP: 0300   Tainted: G        W        (4.13.0-rc7-43932-gfc36c59)
      MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24044428  XER: 20000000
      CFAR: c00000000030e134 DAR: f00000000786c620 DSISR: 40000000 SOFTE: 0
      GPR00: 0000000000000000 c000000fdeb6b6d0 c0000000010bd000 000000000000e1b0
      GPR04: c00000000115e168 c000001fffa6e4b0 c00000000115d000 c000001e1b180386
      GPR08: f000000000000000 c000000f9a8913e0 f00000000786c600 00007fff587d0000
      GPR12: c000000fdeb68000 c00000000fb0f000 0000000000000001 00007fff587cffff
      GPR16: 0000000000000000 c000000000000000 00000000003fffff c000000fdebfe1f8
      GPR20: 0000000000000004 c000000fdeb6b8a8 0000000000000001 0008000000000040
      GPR24: 07000000000000c0 00007fff587cffff c000000fdec20bf8 00007fff587d0000
      GPR28: c000000fdeca9ac0 00007fff587d0000 00007fff587c0000 00007fff587d0000
      NIP [c00000000030e1e4] __get_user_pages_fast+0x434/0x1070
      LR [c00000000030de6c] __get_user_pages_fast+0xbc/0x1070
      Call Trace:
      [c000000fdeb6b6d0] [c00000000139dab8] lock_classes+0x0/0x35fe50 (unreliable)
      [c000000fdeb6b7e0] [c00000000030ef38] get_user_pages_fast+0xf8/0x120
      [c000000fdeb6b830] [c000000000112318] kvmppc_book3s_hv_page_fault+0x308/0xf30
      [c000000fdeb6b960] [c00000000010e10c] kvmppc_vcpu_run_hv+0xfdc/0x1f00
      [c000000fdeb6bb20] [c0000000000e915c] kvmppc_vcpu_run+0x2c/0x40
      [c000000fdeb6bb40] [c0000000000e5650] kvm_arch_vcpu_ioctl_run+0x110/0x300
      [c000000fdeb6bbe0] [c0000000000d6468] kvm_vcpu_ioctl+0x528/0x900
      [c000000fdeb6bd40] [c0000000003bc04c] do_vfs_ioctl+0xcc/0x950
      [c000000fdeb6bde0] [c0000000003bc930] SyS_ioctl+0x60/0x100
      [c000000fdeb6be30] [c00000000000b96c] system_call+0x58/0x6c
      Instruction dump:
      7ca81a14 2fa50000 41de0010 7cc8182a 68c60002 78c6ffe2 0b060000 3cc2000a
      794a3664 390610d8 e9080000 7d485214 <e90a0020> 7d435378 790507e1 408202f0
      ---[ end trace fad4a342d0414aa2 ]---
      
      It turns out that what has happened is that the SLB entry for the
      vmmemap region hasn't been reloaded on exit from a guest, and it has
      the wrong page size.  Then, when the host next accesses the vmemmap
      region, it gets a page fault.
      
      Commit a25bd72b ("powerpc/mm/radix: Workaround prefetch issue with
      KVM", 2017-07-24) modified the guest exit code so that it now only clears
      out the SLB for hash guest.  The code tests the radix flag and puts the
      result in a non-volatile CR field, CR2, and later branches based on CR2.
      
      Unfortunately, the kvmppc_save_tm function, which gets called between
      those two points, modifies all the user-visible registers in the case
      where the guest was in transactional or suspended state, except for a
      few which it restores (namely r1, r2, r9 and r13).  Thus the hash/radix indication in CR2 gets corrupted.
      
      This fixes the problem by re-doing the comparison just before the
      result is needed.  For good measure, this also adds comments next to
      the call sites of kvmppc_save_tm and kvmppc_restore_tm pointing out
      that non-volatile register state will be lost.
      
      Cc: stable@vger.kernel.org # v4.13
      Fixes: a25bd72b ("powerpc/mm/radix: Workaround prefetch issue with KVM")
      Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      67f8a8c1