1. 02 Dec, 2020 1 commit
  2. 01 Dec, 2020 1 commit
    • KVM: PPC: Book3S HV: XIVE: Fix vCPU id sanity check · f54db39f
      Authored by Greg Kurz
      Commit 062cfab7 ("KVM: PPC: Book3S HV: XIVE: Make VP block size
      configurable") updated kvmppc_xive_vcpu_id_valid() in a way that
      allows userspace to trigger an assertion in skiboot and crash the host:
      
      [  696.186248988,3] XIVE[ IC 08  ] eq_blk != vp_blk (0 vs. 1) for target 0x4300008c/0
      [  696.186314757,0] Assert fail: hw/xive.c:2370:0
      [  696.186342458,0] Aborting!
      xive-kvCPU 0043 Backtrace:
       S: 0000000031e2b8f0 R: 0000000030013840   .backtrace+0x48
       S: 0000000031e2b990 R: 000000003001b2d0   ._abort+0x4c
       S: 0000000031e2ba10 R: 000000003001b34c   .assert_fail+0x34
       S: 0000000031e2ba90 R: 0000000030058984   .xive_eq_for_target.part.20+0xb0
       S: 0000000031e2bb40 R: 0000000030059fdc   .xive_setup_silent_gather+0x2c
       S: 0000000031e2bc20 R: 000000003005a334   .opal_xive_set_vp_info+0x124
       S: 0000000031e2bd20 R: 00000000300051a4   opal_entry+0x134
       --- OPAL call token: 0x8a caller R1: 0xc000001f28563850 ---
      
      XIVE maintains the interrupt context state of non-dispatched vCPUs in
      an internal VP structure. We allocate a bunch of those on startup to
      accommodate all possible vCPUs. Each VP has an id, which we derive
      from the vCPU id for efficiency:
      
      static inline u32 kvmppc_xive_vp(struct kvmppc_xive *xive, u32 server)
      {
      	return xive->vp_base + kvmppc_pack_vcpu_id(xive->kvm, server);
      }
      
      The KVM XIVE device used to allocate KVM_MAX_VCPUS VPs. This was
      limiting the number of concurrent VMs because the VP space is
      limited on the HW. Since most of the time VMs run with far fewer
      vCPUs, commit 062cfab7 ("KVM: PPC: Book3S HV: XIVE: Make VP
      block size configurable") gave userspace the possibility to
      tune the size of the VP block through the KVM_DEV_XIVE_NR_SERVERS
      attribute.
      
      The check in kvmppc_pack_vcpu_id() was changed from
      
      	cpu < KVM_MAX_VCPUS * xive->kvm->arch.emul_smt_mode
      
      to
      
      	cpu < xive->nr_servers * xive->kvm->arch.emul_smt_mode
      
      The previous check was based on the fact that the VP block had
      KVM_MAX_VCPUS entries and that kvmppc_pack_vcpu_id() guarantees
      that packed vCPU ids are below KVM_MAX_VCPUS. We've changed the
      size of the VP block, but kvmppc_pack_vcpu_id() has nothing to
      do with it and it certainly doesn't ensure that the packed vCPU
      ids are below xive->nr_servers. kvmppc_xive_vcpu_id_valid() might
      thus return true when the VM was configured with a non-standard
      VSMT mode, even if the packed vCPU id is higher than what we
      expect. We end up using an unallocated VP id, which confuses
      OPAL. The assert in OPAL is probably abusive and should be
      converted to a regular error that the kernel can handle, but
      we shouldn't really use broken VP ids in the first place.
      
      Fix kvmppc_xive_vcpu_id_valid() so that it checks the packed
      vCPU id is below xive->nr_servers, which is explicitly what we
      want.
      
      Fixes: 062cfab7 ("KVM: PPC: Book3S HV: XIVE: Make VP block size configurable")
      Cc: stable@vger.kernel.org # v5.5+
      Signed-off-by: Greg Kurz <groug@kaod.org>
      Reviewed-by: Cédric Le Goater <clg@kaod.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/160673876747.695514.1809676603724514920.stgit@bahia.lan
      f54db39f
  3. 27 Nov, 2020 1 commit
    • powerpc/numa: Fix a regression on memoryless node 0 · 10f78fd0
      Authored by Srikar Dronamraju
      Commit e75130f2 ("powerpc/numa: Offline memoryless cpuless node 0")
      offlines node 0 and expects nodes to be subsequently onlined when CPUs
      or nodes are detected.
      
      Commit 6398eaa2 ("powerpc/numa: Prefer node id queried from vphn")
      skips onlining node 0 when CPUs are associated with node 0.
      
      On systems where node 0 has CPUs but no memory, this causes node 0 to
      be marked offline. This causes issues at boot time when trying to set
      the memory node for online CPUs while building the zonelist.
      
      0:mon> t
      [link register   ] c000000000400354 __build_all_zonelists+0x164/0x280
      [c00000000161bda0] c0000000016533c8 node_states+0x20/0xa0 (unreliable)
      [c00000000161bdc0] c000000000400384 __build_all_zonelists+0x194/0x280
      [c00000000161be30] c000000001041800 build_all_zonelists_init+0x4c/0x118
      [c00000000161be80] c0000000004020d0 build_all_zonelists+0x190/0x1b0
      [c00000000161bef0] c000000001003cf8 start_kernel+0x18c/0x6a8
      [c00000000161bf90] c00000000000adb4 start_here_common+0x1c/0x3e8
      0:mon> r
      R00 = c000000000400354   R16 = 000000000b57a0e8
      R01 = c00000000161bda0   R17 = 000000000b57a6b0
      R02 = c00000000161ce00   R18 = 000000000b5afee8
      R03 = 0000000000000000   R19 = 000000000b6448a0
      R04 = 0000000000000000   R20 = fffffffffffffffd
      R05 = 0000000000000000   R21 = 0000000001400000
      R06 = 0000000000000000   R22 = 000000001ec00000
      R07 = 0000000000000001   R23 = c000000001175580
      R08 = 0000000000000000   R24 = c000000001651ed8
      R09 = c0000000017e84d8   R25 = c000000001652480
      R10 = 0000000000000000   R26 = c000000001175584
      R11 = c000000c7fac0d10   R27 = c0000000019568d0
      R12 = c000000000400180   R28 = 0000000000000000
      R13 = c000000002200000   R29 = c00000000164dd78
      R14 = 000000000b579f78   R30 = 0000000000000000
      R15 = 000000000b57a2b8   R31 = c000000001175584
      pc  = c000000000400194 local_memory_node+0x24/0x80
      cfar= c000000000074334 mcount+0xc/0x10
      lr  = c000000000400354 __build_all_zonelists+0x164/0x280
      msr = 8000000002001033   cr  = 44002284
      ctr = c000000000400180   xer = 0000000000000001   trap =  380
      dar = 0000000000001388   dsisr = c00000000161bc90
      0:mon>
      
      Fix this by setting node to be online while onlining CPUs that belong to
      node 0.
      
      Fixes: e75130f2 ("powerpc/numa: Offline memoryless cpuless node 0")
      Fixes: 6398eaa2 ("powerpc/numa: Prefer node id queried from vphn")
      Reported-by: Milan Mohanty <milmohan@in.ibm.com>
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20201127053738.10085-1-srikar@linux.vnet.ibm.com
      10f78fd0
  4. 26 Nov, 2020 3 commits
  5. 24 Nov, 2020 1 commit
  6. 23 Nov, 2020 2 commits
  7. 19 Nov, 2020 4 commits
    • powerpc/64s: rename pnv|pseries_setup_rfi_flush to _setup_security_mitigations · da631f7f
      Authored by Daniel Axtens
      pseries|pnv_setup_rfi_flush already does the count cache flush setup,
      and we just added entry and uaccess flushes, so the name is no longer
      very accurate. On both platforms we then also immediately set up the
      STF flush.
      
      Rename them to _setup_security_mitigations and fold the STF flush in.
      Signed-off-by: Daniel Axtens <dja@axtens.net>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      da631f7f
    • powerpc: Only include kup-radix.h for 64-bit Book3S · 178d52c6
      Authored by Michael Ellerman
      In kup.h we currently include kup-radix.h for all 64-bit builds, which
      includes Book3S and Book3E. The latter doesn't make sense: Book3E
      never uses the Radix MMU.
      
      This has worked up until now, but almost by accident, and the recent
      uaccess flush changes introduced a build breakage on Book3E because of
      the bad structure of the code.
      
      So disentangle things so that we only use kup-radix.h for Book3S. This
      requires some more stubs in kup.h and fixing an include in
      syscall_64.c.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      178d52c6
    • powerpc/64s: flush L1D after user accesses · 9a32a7e7
      Authored by Nicholas Piggin
      IBM Power9 processors can speculatively operate on data in the L1 cache
      before it has been completely validated, via a way-prediction mechanism. It
      is not possible for an attacker to determine the contents of impermissible
      memory using this method, since these systems implement a combination of
      hardware and software security measures to prevent scenarios where
      protected data could be leaked.
      
      However these measures don't address the scenario where an attacker induces
      the operating system to speculatively execute instructions using data that
      the attacker controls. This can be used for example to speculatively bypass
      "kernel user access prevention" techniques, as discovered by Anthony
      Steinhauser of Google's Safeside Project. This is not an attack by itself,
      but there is a possibility it could be used in conjunction with
      side-channels or other weaknesses in the privileged code to construct an
      attack.
      
      This issue can be mitigated by flushing the L1 cache between privilege
      boundaries of concern. This patch flushes the L1 cache after user accesses.
      
      This is part of the fix for CVE-2020-4788.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Daniel Axtens <dja@axtens.net>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      9a32a7e7
    • powerpc/64s: flush L1D on kernel entry · f7964378
      Authored by Nicholas Piggin
      IBM Power9 processors can speculatively operate on data in the L1 cache
      before it has been completely validated, via a way-prediction mechanism. It
      is not possible for an attacker to determine the contents of impermissible
      memory using this method, since these systems implement a combination of
      hardware and software security measures to prevent scenarios where
      protected data could be leaked.
      
      However these measures don't address the scenario where an attacker induces
      the operating system to speculatively execute instructions using data that
      the attacker controls. This can be used for example to speculatively bypass
      "kernel user access prevention" techniques, as discovered by Anthony
      Steinhauser of Google's Safeside Project. This is not an attack by itself,
      but there is a possibility it could be used in conjunction with
      side-channels or other weaknesses in the privileged code to construct an
      attack.
      
      This issue can be mitigated by flushing the L1 cache between privilege
      boundaries of concern. This patch flushes the L1 cache on kernel entry.
      
      This is part of the fix for CVE-2020-4788.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Daniel Axtens <dja@axtens.net>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      f7964378
  8. 18 Nov, 2020 1 commit
  9. 17 Nov, 2020 1 commit
  10. 16 Nov, 2020 3 commits
    • arch: pgtable: define MAX_POSSIBLE_PHYSMEM_BITS where needed · cef39703
      Authored by Arnd Bergmann
      Stefan Agner reported a bug when using zram on 32-bit Arm machines
      with RAM above the 4GB address boundary:
      
        Unable to handle kernel NULL pointer dereference at virtual address 00000000
        pgd = a27bd01c
        [00000000] *pgd=236a0003, *pmd=1ffa64003
        Internal error: Oops: 207 [#1] SMP ARM
        Modules linked in: mdio_bcm_unimac(+) brcmfmac cfg80211 brcmutil raspberrypi_hwmon hci_uart crc32_arm_ce bcm2711_thermal phy_generic genet
        CPU: 0 PID: 123 Comm: mkfs.ext4 Not tainted 5.9.6 #1
        Hardware name: BCM2711
        PC is at zs_map_object+0x94/0x338
        LR is at zram_bvec_rw.constprop.0+0x330/0xa64
        pc : [<c0602b38>]    lr : [<c0bda6a0>]    psr: 60000013
        sp : e376bbe0  ip : 00000000  fp : c1e2921c
        r10: 00000002  r9 : c1dda730  r8 : 00000000
        r7 : e8ff7a00  r6 : 00000000  r5 : 02f9ffa0  r4 : e3710000
        r3 : 000fdffe  r2 : c1e0ce80  r1 : ebf979a0  r0 : 00000000
        Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
        Control: 30c5383d  Table: 235c2a80  DAC: fffffffd
        Process mkfs.ext4 (pid: 123, stack limit = 0x495a22e6)
        Stack: (0xe376bbe0 to 0xe376c000)
      
      As it turns out, zram needs to know the maximum memory size, which
      is defined in MAX_PHYSMEM_BITS when CONFIG_SPARSEMEM is set, or in
      MAX_POSSIBLE_PHYSMEM_BITS on the x86 architecture.
      
      The same problem will be hit on all 32-bit architectures that have a
      physical address space larger than 4GB and happen to not enable sparsemem
      and include asm/sparsemem.h from asm/pgtable.h.
      
      After the initial discussion, I suggested just always defining
      MAX_POSSIBLE_PHYSMEM_BITS whenever CONFIG_PHYS_ADDR_T_64BIT is
      set, or provoking a build error otherwise. This addresses all
      configurations that can currently have this runtime bug, but
      leaves all other configurations unchanged.
      
      I looked up the possible number of bits in source code and
      datasheets, here is what I found:
      
       - on ARC, CONFIG_ARC_HAS_PAE40 controls whether 32 or 40 bits are used
       - on ARM, CONFIG_LPAE enables 40 bit addressing, without it we never
         support more than 32 bits, even though supersections in theory allow
         up to 40 bits as well.
       - on MIPS, some MIPS32r1 or later chips support 36 bits, and MIPS32r5
         XPA supports up to 60 bits in theory, but 40 bits are more than
         anyone will ever ship
       - On PowerPC, there are three different implementations of 36 bit
         addressing, but 32-bit is used without CONFIG_PTE_64BIT
       - On RISC-V, the normal page table format can support 34 bit
         addressing. There is no highmem support on RISC-V, so anything
         above 2GB is unused, but it might be useful to eventually support
         CONFIG_ZRAM for high pages.
      
      Fixes: 61989a80 ("staging: zsmalloc: zsmalloc memory allocation library")
      Fixes: 02390b87 ("mm/zsmalloc: Prepare to variable MAX_PHYSMEM_BITS")
      Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Reviewed-by: Stefan Agner <stefan@agner.ch>
      Tested-by: Stefan Agner <stefan@agner.ch>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Link: https://lore.kernel.org/linux-mm/bdfa44bf1c570b05d6c70898e2bbb0acf234ecdf.1604762181.git.stefan@agner.ch/
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      cef39703
    • KVM: PPC: Book3S HV: XIVE: Fix possible oops when accessing ESB page · 75b49620
      Authored by Cédric Le Goater
      When accessing the ESB page of a source interrupt, the fault handler
      retrieves the page address from the XIVE interrupt 'xive_irq_data'
      structure. If the associated KVM XIVE interrupt is not valid, i.e.
      not allocated at the HW level for some reason, the fault handler will
      dereference a NULL pointer, leading to the oops below:
      
        WARNING: CPU: 40 PID: 59101 at arch/powerpc/kvm/book3s_xive_native.c:259 xive_native_esb_fault+0xe4/0x240 [kvm]
        CPU: 40 PID: 59101 Comm: qemu-system-ppc Kdump: loaded Tainted: G        W        --------- -  - 4.18.0-240.el8.ppc64le #1
        NIP:  c00800000e949fac LR: c00000000044b164 CTR: c00800000e949ec8
        REGS: c000001f69617840 TRAP: 0700   Tainted: G        W        --------- -  -  (4.18.0-240.el8.ppc64le)
        MSR:  9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 44044282  XER: 00000000
        CFAR: c00000000044b160 IRQMASK: 0
        GPR00: c00000000044b164 c000001f69617ac0 c00800000e96e000 c000001f69617c10
        GPR04: 05faa2b21e000080 0000000000000000 0000000000000005 ffffffffffffffff
        GPR08: 0000000000000000 0000000000000001 0000000000000000 0000000000000001
        GPR12: c00800000e949ec8 c000001ffffd3400 0000000000000000 0000000000000000
        GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
        GPR20: 0000000000000000 0000000000000000 c000001f5c065160 c000000001c76f90
        GPR24: c000001f06f20000 c000001f5c065100 0000000000000008 c000001f0eb98c78
        GPR28: c000001dcab40000 c000001dcab403d8 c000001f69617c10 0000000000000011
        NIP [c00800000e949fac] xive_native_esb_fault+0xe4/0x240 [kvm]
        LR [c00000000044b164] __do_fault+0x64/0x220
        Call Trace:
        [c000001f69617ac0] [0000000137a5dc20] 0x137a5dc20 (unreliable)
        [c000001f69617b50] [c00000000044b164] __do_fault+0x64/0x220
        [c000001f69617b90] [c000000000453838] do_fault+0x218/0x930
        [c000001f69617bf0] [c000000000456f50] __handle_mm_fault+0x350/0xdf0
        [c000001f69617cd0] [c000000000457b1c] handle_mm_fault+0x12c/0x310
        [c000001f69617d10] [c00000000007ef44] __do_page_fault+0x264/0xbb0
        [c000001f69617df0] [c00000000007f8c8] do_page_fault+0x38/0xd0
        [c000001f69617e30] [c00000000000a714] handle_page_fault+0x18/0x38
        Instruction dump:
        40c2fff0 7c2004ac 2fa90000 409e0118 73e90001 41820080 e8bd0008 7c2004ac
        7ca90074 39400000 915c0000 7929d182 <0b090000> 2fa50000 419e0080 e89e0018
        ---[ end trace 66c6ff034c53f64f ]---
        xive-kvm: xive_native_esb_fault: accessing invalid ESB page for source 8 !
      
      Fix that by checking the validity of the KVM XIVE interrupt structure.
      
      Fixes: 6520ca64 ("KVM: PPC: Book3S HV: XIVE: Add a mapping for the source ESB pages")
      Cc: stable@vger.kernel.org # v5.2+
      Reported-by: Greg Kurz <groug@kaod.org>
      Signed-off-by: Cédric Le Goater <clg@kaod.org>
      Tested-by: Greg Kurz <groug@kaod.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20201105134713.656160-1-clg@kaod.org
      75b49620
    • powerpc/64s: Fix KVM system reset handling when CONFIG_PPC_PSERIES=y · 575cba20
      Authored by Nicholas Piggin
      pseries guest kernels have a FWNMI handler for SRESET and MCE NMIs,
      which is basically the same as the regular handlers for those
      interrupts.
      
      The system reset FWNMI handler did not have a KVM guest test in it,
      although it probably should have because the guest can itself run
      guests.
      
      Commit 4f50541f ("powerpc/64s/exception: Move all interrupt
      handlers to new style code gen macros") converted the handler
      faithfully, avoiding a KVM test with a "clever" trick that modifies
      the IKVM_REAL setting to 0 when the fwnmi handler is to be generated
      (PPC_PSERIES=y). This worked when the KVM test was generated in the
      interrupt entry handlers, but a later patch moved the KVM test to
      the common handler, and the common handler macro is expanded below
      the fwnmi entry. This prevents the KVM test from being generated
      even for the 0x100 entry point.
      
      The result is that NMI IPIs in the host kernel will use guest
      registers when a guest is running. This goes particularly badly when
      an HPT guest is running and the MMU is set to guest mode.
      
      Remove this trickery and just generate the test always.
      
      Fixes: 9600f261 ("powerpc/64s/exception: Move KVM test to common code")
      Cc: stable@vger.kernel.org # v5.7+
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20201114114743.3306283-1-npiggin@gmail.com
      575cba20
  11. 10 Nov, 2020 2 commits
  12. 08 Nov, 2020 1 commit
  13. 06 Nov, 2020 1 commit
  14. 05 Nov, 2020 5 commits
  15. 02 Nov, 2020 2 commits
    • powerpc/smp: Call rcu_cpu_starting() earlier · 99f070b6
      Authored by Qian Cai
      The call to rcu_cpu_starting() in start_secondary() is not early
      enough in the CPU-hotplug onlining process, which results in lockdep
      splats as follows (with CONFIG_PROVE_RCU_LIST=y):
      
        WARNING: suspicious RCU usage
        -----------------------------
        kernel/locking/lockdep.c:3497 RCU-list traversed in non-reader section!!
      
        other info that might help us debug this:
      
        RCU used illegally from offline CPU!
        rcu_scheduler_active = 1, debug_locks = 1
        no locks held by swapper/1/0.
      
        Call Trace:
        dump_stack+0xec/0x144 (unreliable)
        lockdep_rcu_suspicious+0x128/0x14c
        __lock_acquire+0x1060/0x1c60
        lock_acquire+0x140/0x5f0
        _raw_spin_lock_irqsave+0x64/0xb0
        clockevents_register_device+0x74/0x270
        register_decrementer_clockevent+0x94/0x110
        start_secondary+0x134/0x800
        start_secondary_prolog+0x10/0x14
      
      This is avoided by adding a call to rcu_cpu_starting() near the
      beginning of the start_secondary() function. Note that
      raw_smp_processor_id() is required in order to avoid calling into
      lockdep before RCU has declared the CPU to be watched for readers.
      
      It's safe to call rcu_cpu_starting() in the arch code as well as later
      in generic code, as explained by Paul:
      
        It uses a per-CPU variable so that RCU pays attention only to the
        first call to rcu_cpu_starting() if there is more than one of them.
        This is even intentional, due to there being a generic
        arch-independent call to rcu_cpu_starting() in
        notify_cpu_starting().
      
        So multiple calls to rcu_cpu_starting() are fine by design.
      
      Fixes: 4d004099 ("lockdep: Fix lockdep recursion")
      Signed-off-by: Qian Cai <cai@redhat.com>
      Acked-by: Paul E. McKenney <paulmck@kernel.org>
      [mpe: Add Fixes tag, reword slightly & expand change log]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20201028182334.13466-1-cai@redhat.com
      99f070b6
    • powerpc/eeh_cache: Fix a possible debugfs deadlock · fd552e05
      Authored by Qian Cai
      Lockdep complains about the possible deadlock below in
      eeh_addr_cache_show(): it acquires a lock with IRQs enabled, but
      eeh_addr_cache_insert_dev() needs to acquire the same lock with IRQs
      disabled. Let's just make eeh_addr_cache_show() acquire the lock with
      IRQs disabled as well.
      
              CPU0                    CPU1
              ----                    ----
         lock(&pci_io_addr_cache_root.piar_lock);
                                      local_irq_disable();
                                      lock(&tp->lock);
                                      lock(&pci_io_addr_cache_root.piar_lock);
         <Interrupt>
           lock(&tp->lock);
      
        *** DEADLOCK ***
      
        lock_acquire+0x140/0x5f0
        _raw_spin_lock_irqsave+0x64/0xb0
        eeh_addr_cache_insert_dev+0x48/0x390
        eeh_probe_device+0xb8/0x1a0
        pnv_pcibios_bus_add_device+0x3c/0x80
        pcibios_bus_add_device+0x118/0x290
        pci_bus_add_device+0x28/0xe0
        pci_bus_add_devices+0x54/0xb0
        pcibios_init+0xc4/0x124
        do_one_initcall+0xac/0x528
        kernel_init_freeable+0x35c/0x3fc
        kernel_init+0x24/0x148
        ret_from_kernel_thread+0x5c/0x80
      
        lock_acquire+0x140/0x5f0
        _raw_spin_lock+0x4c/0x70
        eeh_addr_cache_show+0x38/0x110
        seq_read+0x1a0/0x660
        vfs_read+0xc8/0x1f0
        ksys_read+0x74/0x130
        system_call_exception+0xf8/0x1d0
        system_call_common+0xe8/0x218
      
      Fixes: 5ca85ae6 ("powerpc/eeh_cache: Add a way to dump the EEH address cache")
      Signed-off-by: Qian Cai <cai@redhat.com>
      Reviewed-by: Oliver O'Halloran <oohall@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20201028152717.8967-1-cai@redhat.com
      fd552e05
  16. 26 Oct, 2020 1 commit
  17. 22 Oct, 2020 4 commits
    • powerpc/pseries: Avoid using addr_to_pfn in real mode · 4ff753fe
      Authored by Ganesh Goudar
      When a UE or memory error exception is encountered, the MCE handler
      tries to find the pfn using addr_to_pfn(), which takes an effective
      address as an argument; the pfn is later used to poison the page
      where the memory error occurred. A recent rework in this area made
      addr_to_pfn() run in real mode, which can be fatal as it may try to
      access memory outside the RMO region.
      
      Add two helper functions to separate the things to be done in real
      mode and virtual mode without changing any functionality. This also
      fixes the following error, as the use of addr_to_pfn() is now moved
      to virtual mode.
      
      Without this change, the following kernel crash is seen on hitting a
      UE.
      
      [  485.128036] Oops: Kernel access of bad area, sig: 11 [#1]
      [  485.128040] LE SMP NR_CPUS=2048 NUMA pSeries
      [  485.128047] Modules linked in:
      [  485.128067] CPU: 15 PID: 6536 Comm: insmod Kdump: loaded Tainted: G OE 5.7.0 #22
      [  485.128074] NIP:  c00000000009b24c LR: c0000000000398d8 CTR: c000000000cd57c0
      [  485.128078] REGS: c000000003f1f970 TRAP: 0300   Tainted: G OE (5.7.0)
      [  485.128082] MSR:  8000000000001003 <SF,ME,RI,LE>  CR: 28008284  XER: 00000001
      [  485.128088] CFAR: c00000000009b190 DAR: c0000001fab00000 DSISR: 40000000 IRQMASK: 1
      [  485.128088] GPR00: 0000000000000001 c000000003f1fbf0 c000000001634300 0000b0fa01000000
      [  485.128088] GPR04: d000000002220000 0000000000000000 00000000fab00000 0000000000000022
      [  485.128088] GPR08: c0000001fab00000 0000000000000000 c0000001fab00000 c000000003f1fc14
      [  485.128088] GPR12: 0000000000000008 c000000003ff5880 d000000002100008 0000000000000000
      [  485.128088] GPR16: 000000000000ff20 000000000000fff1 000000000000fff2 d0000000021a1100
      [  485.128088] GPR20: d000000002200000 c00000015c893c50 c000000000d49b28 c00000015c893c50
      [  485.128088] GPR24: d0000000021a0d08 c0000000014e5da8 d0000000021a0818 000000000000000a
      [  485.128088] GPR28: 0000000000000008 000000000000000a c0000000017e2970 000000000000000a
      [  485.128125] NIP [c00000000009b24c] __find_linux_pte+0x11c/0x310
      [  485.128130] LR [c0000000000398d8] addr_to_pfn+0x138/0x170
      [  485.128133] Call Trace:
      [  485.128135] Instruction dump:
      [  485.128138] 3929ffff 7d4a3378 7c883c36 7d2907b4 794a1564 7d294038 794af082 3900ffff
      [  485.128144] 79291f24 790af00e 78e70020 7d095214 <7c69502a> 2fa30000 419e011c 70690040
      [  485.128152] ---[ end trace d34b27e29ae0e340 ]---
      
      Fixes: 9ca766f9 ("powerpc/64s/pseries: machine check convert to use common event code")
      Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200724063946.21378-1-ganeshgr@linux.ibm.com
      4ff753fe
    • powerpc/uaccess: Don't use "m<>" constraint with GCC 4.9 · 592bbe9c
      Authored by Christophe Leroy
      GCC 4.9 sometimes fails to build with "m<>" constraint in
      inline assembly.
      
        CC      lib/iov_iter.o
      In file included from ./arch/powerpc/include/asm/cmpxchg.h:6:0,
                       from ./arch/powerpc/include/asm/atomic.h:11,
                       from ./include/linux/atomic.h:7,
                       from ./include/linux/crypto.h:15,
                       from ./include/crypto/hash.h:11,
                       from lib/iov_iter.c:2:
      lib/iov_iter.c: In function 'iovec_from_user.part.30':
      ./arch/powerpc/include/asm/uaccess.h:287:2: error: 'asm' operand has impossible constraints
        __asm__ __volatile__(    \
        ^
      ./include/linux/compiler.h:78:42: note: in definition of macro 'unlikely'
       # define unlikely(x) __builtin_expect(!!(x), 0)
                                                ^
      ./arch/powerpc/include/asm/uaccess.h:583:34: note: in expansion of macro 'unsafe_op_wrap'
       #define unsafe_get_user(x, p, e) unsafe_op_wrap(__get_user_allowed(x, p), e)
                                        ^
      ./arch/powerpc/include/asm/uaccess.h:329:10: note: in expansion of macro '__get_user_asm'
        case 4: __get_user_asm(x, (u32 __user *)ptr, retval, "lwz"); break; \
                ^
      ./arch/powerpc/include/asm/uaccess.h:363:3: note: in expansion of macro '__get_user_size_allowed'
         __get_user_size_allowed(__gu_val, __gu_addr, __gu_size, __gu_err); \
         ^
      ./arch/powerpc/include/asm/uaccess.h:100:2: note: in expansion of macro '__get_user_nocheck'
        __get_user_nocheck((x), (ptr), sizeof(*(ptr)), false)
        ^
      ./arch/powerpc/include/asm/uaccess.h:583:49: note: in expansion of macro '__get_user_allowed'
       #define unsafe_get_user(x, p, e) unsafe_op_wrap(__get_user_allowed(x, p), e)
                                                       ^
      lib/iov_iter.c:1663:3: note: in expansion of macro 'unsafe_get_user'
         unsafe_get_user(len, &uiov[i].iov_len, uaccess_end);
         ^
      make[1]: *** [scripts/Makefile.build:283: lib/iov_iter.o] Error 1
      
      Define a UPD_CONSTR macro that is "<>" by default, and empty ("")
      with GCC versions prior to GCC 5.
      
      Fixes: fcf1f268 ("powerpc/uaccess: Add pre-update addressing to __put_user_asm_goto()")
      Fixes: 2f279eeb ("powerpc/uaccess: Add pre-update addressing to __get_user_asm() and __put_user_asm()")
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Acked-by: Segher Boessenkool <segher@kernel.crashing.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/212d3bc4a52ca71523759517bb9c61f7e477c46a.1603179582.git.christophe.leroy@csgroup.eu
      592bbe9c
    • powerpc/eeh: Fix eeh_dev_check_failure() for PE#0 · 99f6e979
      Authored by Oliver O'Halloran
      In commit 269e5833 ("powerpc/eeh: Delete eeh_pe->config_addr") the
      following simplification was made:
      
      -       if (!pe->addr && !pe->config_addr) {
      +       if (!pe->addr) {
                      eeh_stats.no_cfg_addr++;
                      return 0;
              }
      
      This introduced a bug which causes EEH checking to be skipped for
      devices in PE#0.
      
      Before the change above the check would always pass since at least one
      of the two PE addresses would be non-zero in all circumstances. On
      PowerNV pe->config_addr would be the BDFN of the first device added to
      the PE. The zero BDFN is reserved for the PHB's root port, but this is
      fine since for obscure platform reasons the root port is never
      assigned to PE#0.
      
      Similarly, on pseries pe->addr has always been non-zero for the
      reasons outlined in commit 42de19d5 ("powerpc/pseries/eeh: Allow
      zero to be a valid PE configuration address").
      
      We can fix the problem by deleting the block entirely. The original
      purpose of this test was to avoid performing EEH checks on devices
      that were not on an EEH capable bus. In modern Linux the edev->pe
      pointer will be NULL for devices that are not on an EEH capable bus.
      The code block immediately above this one already checks for the
      edev->pe == NULL case, so this test (new and old) is entirely
      redundant.
      
      Ideally we'd delete eeh_stats.no_cfg_addr too since nothing increments
      it any more. Unfortunately, that information is exposed via
      /proc/powerpc/eeh which means it's technically ABI. We could make it
      hard-coded, but that's a change for another patch.
      
      Fixes: 269e5833 ("powerpc/eeh: Delete eeh_pe->config_addr")
      Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20201021232554.1434687-1-oohall@gmail.com
      99f6e979
    • J
      KVM: PPC: Book3S HV: Make struct kernel_param_ops definition const · a4f1d94e
Submitted by Joe Perches
      This should be const, so make it so.
Signed-off-by: Joe Perches <joe@perches.com>
      Message-Id: <d130e88dd4c82a12d979da747cc0365c72c3ba15.1601770305.git.joe@perches.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a4f1d94e
  18. 20 Oct 2020, 2 commits
  19. 19 Oct 2020, 4 commits
    • V
      powerpc/powernv/dump: Handle multiple writes to ack attribute · 358ab796
Submitted by Vasant Hegde
Even though we use the self-removing sysfs helper, we still need
to make sure we do the final kobject delete conditionally.
sysfs_remove_file_self() handles parallel calls to remove the
sysfs attribute file and returns true only in the caller that
actually removed the attribute file; the other parallel callers
are returned false. Do the final kobject delete only when
sysfs_remove_file_self() returns true.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20201017164236.264713-1-hegdevasant@linux.vnet.ibm.com
      358ab796
    • V
      powerpc/powernv/dump: Fix race while processing OPAL dump · 0a43ae3e
Submitted by Vasant Hegde
      Every dump reported by OPAL is exported to userspace through a sysfs
      interface and notified using kobject_uevent(). The userspace daemon
      (opal_errd) then reads the dump and acknowledges that the dump is
saved safely to disk. Once acknowledged, the kernel removes the
respective sysfs file entry, causing the associated resources,
including the kobject, to be released.
      
However, it's possible the userspace daemon is already scanning
dump entries when a new sysfs dump entry is created by the kernel.
The daemon may read this new entry and ack it even before the kernel
can notify userspace about it through the kobject_uevent() call. If
that happens, we have a potential race between
dump_ack_store->kobject_put() and kobject_uevent() which can lead to
use-after-free of a kernfs object, resulting in a kernel crash.
      
      This patch fixes this race by protecting the sysfs file
      creation/notification by holding a reference count on kobject until we
      safely send kobject_uevent().
      
The function create_dump_obj() returns the dump object, which, if
used by the caller, would run into the same use-after-free problem.
However, the return value of create_dump_obj() isn't used today and
isn't needed, so change it to return void to make this fix complete.
      
      Fixes: c7e64b9c ("powerpc/powernv Platform dump interface")
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20201017164210.264619-1-hegdevasant@linux.vnet.ibm.com
      0a43ae3e
    • S
      powerpc/smp: Use GFP_ATOMIC while allocating tmp mask · 84dbf66c
Submitted by Srikar Dronamraju
      Qian Cai reported a regression where CPU Hotplug fails with the latest
      powerpc/next
      
      BUG: sleeping function called from invalid context at mm/slab.h:494
      in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/88
      no locks held by swapper/88/0.
      irq event stamp: 18074448
      hardirqs last  enabled at (18074447): [<c0000000001a2a7c>] tick_nohz_idle_enter+0x9c/0x110
      hardirqs last disabled at (18074448): [<c000000000106798>] do_idle+0x138/0x3b0
      do_idle at kernel/sched/idle.c:253 (discriminator 1)
      softirqs last  enabled at (18074440): [<c0000000000bbec4>] irq_enter_rcu+0x94/0xa0
      softirqs last disabled at (18074439): [<c0000000000bbea0>] irq_enter_rcu+0x70/0xa0
      CPU: 88 PID: 0 Comm: swapper/88 Tainted: G        W         5.9.0-rc8-next-20201007 #1
      Call Trace:
      [c00020000a4bfcf0] [c000000000649e98] dump_stack+0xec/0x144 (unreliable)
      [c00020000a4bfd30] [c0000000000f6c34] ___might_sleep+0x2f4/0x310
      [c00020000a4bfdb0] [c000000000354f94] slab_pre_alloc_hook.constprop.82+0x124/0x190
      [c00020000a4bfe00] [c00000000035e9e8] __kmalloc_node+0x88/0x3a0
      slab_alloc_node at mm/slub.c:2817
      (inlined by) __kmalloc_node at mm/slub.c:4013
      [c00020000a4bfe80] [c0000000006494d8] alloc_cpumask_var_node+0x38/0x80
      kmalloc_node at include/linux/slab.h:577
      (inlined by) alloc_cpumask_var_node at lib/cpumask.c:116
      [c00020000a4bfef0] [c00000000003eedc] start_secondary+0x27c/0x800
      update_mask_by_l2 at arch/powerpc/kernel/smp.c:1267
      (inlined by) add_cpu_to_masks at arch/powerpc/kernel/smp.c:1387
      (inlined by) start_secondary at arch/powerpc/kernel/smp.c:1420
      [c00020000a4bff90] [c00000000000c468] start_secondary_resume+0x10/0x14
      
Allocating a temporary mask while performing a CPU Hotplug operation
with CONFIG_CPUMASK_OFFSTACK enabled leads to calling a sleepable
function from an atomic context. Fix this by allocating the temporary
mask with the GFP_ATOMIC flag. Also, instead of having to allocate
twice, allocate the mask in the caller so that we only have to
allocate once. If the allocation fails, assume the mask to be the
same as the sibling mask, which will make the scheduler drop this
domain for this CPU.
      
      Fixes: 70a94089 ("powerpc/smp: Optimize update_coregroup_mask")
      Fixes: 3ab33d6d ("powerpc/smp: Optimize update_mask_by_l2")
Reported-by: Qian Cai <cai@redhat.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20201019042716.106234-3-srikar@linux.vnet.ibm.com
      84dbf66c
    • S
      powerpc/smp: Remove unnecessary variable · 966730a6
Submitted by Srikar Dronamraju
      Commit 3ab33d6d ("powerpc/smp: Optimize update_mask_by_l2")
      introduced submask_fn in update_mask_by_l2 to track the right submask.
However, commit f6606cfd ("powerpc/smp: Dont assume l2-cache to be
      superset of sibling") introduced sibling_mask in update_mask_by_l2 to
      track the same submask. Remove sibling_mask in favour of submask_fn.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20201019042716.106234-2-srikar@linux.vnet.ibm.com
      966730a6