1. 15 5月, 2019 1 次提交
    • J
      powerpc/speculation: Support 'mitigations=' cmdline option · 74857f69
      Josh Poimboeuf 提交于
      commit 782e69efb3dfed6e8360bc612e8c7827a901a8f9 upstream
      
      Configure powerpc CPU runtime speculation bug mitigations in accordance
      with the 'mitigations=' cmdline option.  This affects Meltdown, Spectre
      v1, Spectre v2, and Speculative Store Bypass.
      
      The default behavior is unchanged.
      Signed-off-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Tested-by: Jiri Kosina <jkosina@suse.cz> (on x86)
      Reviewed-by: NJiri Kosina <jkosina@suse.cz>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: linux-s390@vger.kernel.org
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-arch@vger.kernel.org
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Tyler Hicks <tyhicks@canonical.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Phil Auld <pauld@redhat.com>
      Link: https://lkml.kernel.org/r/245a606e1a42a558a310220312d9b6adb9159df6.1555085500.git.jpoimboe@redhat.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      74857f69
  2. 08 5月, 2019 2 次提交
  3. 02 5月, 2019 2 次提交
    • M
      powerpc/mm/radix: Make Radix require HUGETLB_PAGE · 087341c0
      Michael Ellerman 提交于
      commit 8adddf349fda0d3de2f6bb41ddf838cbf36a8ad2 upstream.
      
      Joel reported weird crashes using skiroot_defconfig, in his case we
      jumped into an NX page:
      
        kernel tried to execute exec-protected page (c000000002bff4f0) - exploit attempt? (uid: 0)
        BUG: Unable to handle kernel instruction fetch
        Faulting instruction address: 0xc000000002bff4f0
      
      Looking at the disassembly, we had simply branched to that address:
      
        c000000000c001bc  49fff335    bl     c000000002bff4f0
      
      But that didn't match the original kernel image:
      
        c000000000c001bc  4bfff335    bl     c000000000bff4f0 <kobject_get+0x8>
      
      When STRICT_KERNEL_RWX is enabled, and we're using the radix MMU, we
      call radix__change_memory_range() late in boot to change page
      protections. We do that both to mark rodata read only and also to mark
      init text no-execute. That involves walking the kernel page tables,
      and clearing _PAGE_WRITE or _PAGE_EXEC respectively.
      
      With radix we may use hugepages for the linear mapping, so the code in
      radix__change_memory_range() uses eg. pmd_huge() to test if it has
      found a huge mapping, and if so it stops the page table walk and
      changes the PMD permissions.
      
      However if the kernel is built without HUGETLBFS support, pmd_huge()
      is just a #define that always returns 0. That causes the code in
      radix__change_memory_range() to incorrectly interpret the PMD value as
      a pointer to a PTE page rather than as a PTE at the PMD level.
      
      We can see this using `dv` in xmon which also uses pmd_huge():
      
        0:mon> dv c000000000000000
        pgd  @ 0xc000000001740000
        pgdp @ 0xc000000001740000 = 0x80000000ffffb009
        pudp @ 0xc0000000ffffb000 = 0x80000000ffffa009
        pmdp @ 0xc0000000ffffa000 = 0xc00000000000018f   <- this is a PTE
        ptep @ 0xc000000000000100 = 0xa64bb17da64ab07d   <- kernel text
      
      The end result is we treat the value at 0xc000000000000100 as a PTE
      and clear _PAGE_WRITE or _PAGE_EXEC, potentially corrupting the code
      at that address.
      
      In Joel's specific case we cleared the sign bit in the offset of the
      branch, causing a backward branch to turn into a forward branch which
      caused us to branch into a non-executable page. However the exact
      nature of the crash depends on kernel version, compiler version, and
      other factors.
      
      We need to fix radix__change_memory_range() to not use accessors that
      depend on HUGETLBFS, but we also have radix memory hotplug code that
      uses pmd_huge() etc that will also need fixing. So for now just
      disallow the broken combination of Radix with HUGETLBFS disabled.
      
      The only defconfig we have that is affected is skiroot_defconfig, so
      turn on HUGETLBFS there so that it still gets Radix.
      
      Fixes: 566ca99a ("powerpc/mm/radix: Add dummy radix_enabled()")
      Cc: stable@vger.kernel.org # v4.7+
      Reported-by: NJoel Stanley <joel@jms.id.au>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      087341c0
    • C
      powerpc/vdso32: fix CLOCK_MONOTONIC on PPC64 · 1e0cab1b
      Christophe Leroy 提交于
      [ Upstream commit dd9a994fc68d196a052b73747e3366c57d14a09e ]
      
      Commit b5b4453e7912 ("powerpc/vdso64: Fix CLOCK_MONOTONIC
      inconsistencies across Y2038") changed the type of wtom_clock_sec
      to s64 on PPC64. Therefore, VDSO32 needs to read it with a 4 bytes
      shift in order to retrieve the lower part of it.
      
      Fixes: b5b4453e7912 ("powerpc/vdso64: Fix CLOCK_MONOTONIC inconsistencies across Y2038")
      Reported-by: NChristian Zigotzky <chzigotzky@xenosoft.de>
      Signed-off-by: NChristophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      1e0cab1b
  4. 20 4月, 2019 1 次提交
    • N
      powerpc/pseries: Remove prrn_work workqueue · 6947d853
      Nathan Fontenot 提交于
      [ Upstream commit cd24e457fd8b2d087d9236700c8d2957054598bf ]
      
      When a PRRN event is received we are already running in a worker
      thread. Instead of spawning off another worker thread on the prrn_work
      workqueue to handle the PRRN event we can just call the PRRN handler
      routine directly.
      
      With this update we can also pass the scope variable for the PRRN
      event directly to the handler instead of it being a global variable.
      
      This patch fixes the following oops mnessage we are seeing in PRRN testing:
      
        Oops: Bad kernel stack pointer, sig: 6 [#1]
        SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache binfmt_misc reiserfs vfat fat rpadlpar_io(X) rpaphp(X) tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag af_packet xfs libcrc32c dm_service_time ibmveth(X) ses enclosure scsi_transport_sas rtc_generic btrfs xor raid6_pq sd_mod ibmvscsi(X) scsi_transport_srp ipr(X) libata sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
        Supported: Yes, External                                                     54
        CPU: 7 PID: 18967 Comm: kworker/u96:0 Tainted: G                 X 4.4.126-94.22-default #1
        Workqueue: pseries hotplug workque pseries_hp_work_fn
        task: c000000775367790 ti: c00000001ebd4000 task.ti: c00000070d140000
        NIP: 0000000000000000 LR: 000000001fb3d050 CTR: 0000000000000000
        REGS: c00000001ebd7d40 TRAP: 0700   Tainted: G                 X  (4.4.126-94.22-default)
        MSR: 8000000102081000 <41,VEC,ME5  CR: 28000002  XER: 20040018   4
        CFAR: 000000001fb3d084 40 419   1                                3
        GPR00: 000000000000000040000000000010007 000000001ffff400 000000041fffe200
        GPR04: 000000000000008050000000000000000 000000001fb15fa8 0000000500000500
        GPR08: 000000000001f40040000000000000001 0000000000000000 000005:5200040002
        GPR12: 00000000000000005c000000007a05400 c0000000000e89f8 000000001ed9f668
        GPR16: 000000001fbeff944000000001fbeff94 000000001fb545e4 0000006000000060
        GPR20: ffffffffffffffff4ffffffffffffffff 0000000000000000 0000000000000000
        GPR24: 00000000000000005400000001fb3c000 0000000000000000 000000001fb1b040
        GPR28: 000000001fb240004000000001fb440d8 0000000000000008 0000000000000000
        NIP [0000000000000000] 5         (null)
        LR [000000001fb3d050] 031fb3d050
        Call Trace:            4
        Instruction dump:      4                                       5:47 12    2
        XXXXXXXX XXXXXXXX XXXXX4XX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
        XXXXXXXX XXXXXXXX XXXXX5XX XXXXXXXX 60000000 60000000 60000000 60000000
        ---[ end trace aa5627b04a7d9d6b ]---                                       3NMI watchdog: BUG: soft lockup - CPU#27 stuck for 23s! [kworker/27:0:13903]
        Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache binfmt_misc reiserfs vfat fat rpadlpar_io(X) rpaphp(X) tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag af_packet xfs libcrc32c dm_service_time ibmveth(X) ses enclosure scsi_transport_sas rtc_generic btrfs xor raid6_pq sd_mod ibmvscsi(X) scsi_transport_srp ipr(X) libata sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
        Supported: Yes, External
        CPU: 27 PID: 13903 Comm: kworker/27:0 Tainted: G      D          X 4.4.126-94.22-default #1
        Workqueue: events prrn_work_fn
        task: c000000747cfa390 ti: c00000074712c000 task.ti: c00000074712c000
        NIP: c0000000008002a8 LR: c000000000090770 CTR: 000000000032e088
        REGS: c00000074712f7b0 TRAP: 0901   Tainted: G      D          X  (4.4.126-94.22-default)
        MSR: 8000000100009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 22482044  XER: 20040000
        CFAR: c0000000008002c4 SOFTE: 1
        GPR00: c000000000090770 c00000074712fa30 c000000000f09800 c000000000fa1928 6:02
        GPR04: c000000775f5e000 fffffffffffffffe 0000000000000001 c000000000f42db8
        GPR08: 0000000000000001 0000000080000007 0000000000000000 0000000000000000
        GPR12: 8006210083180000 c000000007a14400
        NIP [c0000000008002a8] _raw_spin_lock+0x68/0xd0
        LR [c000000000090770] mobility_rtas_call+0x50/0x100
        Call Trace:            59                                        5
        [c00000074712fa60] [c000000000090770] mobility_rtas_call+0x50/0x100
        [c00000074712faf0] [c000000000090b08] pseries_devicetree_update+0xf8/0x530
        [c00000074712fc20] [c000000000031ba4] prrn_work_fn+0x34/0x50
        [c00000074712fc40] [c0000000000e0390] process_one_work+0x1a0/0x4e0
        [c00000074712fcd0] [c0000000000e0870] worker_thread+0x1a0/0x6105:57       2
        [c00000074712fd80] [c0000000000e8b18] kthread+0x128/0x150
        [c00000074712fe30] [c0000000000096f8] ret_from_kernel_thread+0x5c/0x64
        Instruction dump:
        2c090000 40c20010 7d40192d 40c2fff0 7c2004ac 2fa90000 40de0018 5:540030   3
        e8010010 ebe1fff8 7c0803a6 4e800020 <7c210b78> e92d0000 89290009 792affe3
      Signed-off-by: NJohn Allen <jallen@linux.ibm.com>
      Signed-off-by: NHaren Myneni <haren@us.ibm.com>
      Signed-off-by: NNathan Fontenot <nfont@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      6947d853
  5. 17 4月, 2019 1 次提交
    • B
      powerpc/tm: Limit TM code inside PPC_TRANSACTIONAL_MEM · 902eca1a
      Breno Leitao 提交于
      commit 897bc3df8c5aebb54c32d831f917592e873d0559 upstream.
      
      Commit e1c3743e1a20 ("powerpc/tm: Set MSR[TS] just prior to recheckpoint")
      moved a code block around and this block uses a 'msr' variable outside of
      the CONFIG_PPC_TRANSACTIONAL_MEM, however the 'msr' variable is declared
      inside a CONFIG_PPC_TRANSACTIONAL_MEM block, causing a possible error when
      CONFIG_PPC_TRANSACTION_MEM is not defined.
      
      	error: 'msr' undeclared (first use in this function)
      
      This is not causing a compilation error in the mainline kernel, because
      'msr' is being used as an argument of MSR_TM_ACTIVE(), which is defined as
      the following when CONFIG_PPC_TRANSACTIONAL_MEM is *not* set:
      
      	#define MSR_TM_ACTIVE(x) 0
      
      This patch just fixes this issue avoiding the 'msr' variable usage outside
      the CONFIG_PPC_TRANSACTIONAL_MEM block, avoiding trusting in the
      MSR_TM_ACTIVE() definition.
      
      Cc: stable@vger.kernel.org
      Reported-by: NChristoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de>
      Fixes: e1c3743e1a20 ("powerpc/tm: Set MSR[TS] just prior to recheckpoint")
      Signed-off-by: NBreno Leitao <leitao@debian.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      902eca1a
  6. 06 4月, 2019 5 次提交
    • N
      powerpc/pseries: Perform full re-add of CPU for topology update post-migration · 06af7dda
      Nathan Fontenot 提交于
      [ Upstream commit 81b61324922c67f73813d8a9c175f3c153f6a1c6 ]
      
      On pseries systems, performing a partition migration can result in
      altering the nodes a CPU is assigned to on the destination system. For
      exampl, pre-migration on the source system CPUs are in node 1 and 3,
      post-migration on the destination system CPUs are in nodes 2 and 3.
      
      Handling the node change for a CPU can cause corruption in the slab
      cache if we hit a timing where a CPUs node is changed while cache_reap()
      is invoked. The corruption occurs because the slab cache code appears
      to rely on the CPU and slab cache pages being on the same node.
      
      The current dynamic updating of a CPUs node done in arch/powerpc/mm/numa.c
      does not prevent us from hitting this scenario.
      
      Changing the device tree property update notification handler that
      recognizes an affinity change for a CPU to do a full DLPAR remove and
      add of the CPU instead of dynamically changing its node resolves this
      issue.
      Signed-off-by: NNathan Fontenot <nfont@linux.vnet.ibm.com>
      Signed-off-by: NMichael W. Bringmann <mwb@linux.vnet.ibm.com>
      Tested-by: NMichael W. Bringmann <mwb@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      06af7dda
    • N
      powerpc/64s: Clear on-stack exception marker upon exception return · b52681e6
      Nicolai Stange 提交于
      [ Upstream commit eddd0b332304d554ad6243942f87c2fcea98c56b ]
      
      The ppc64 specific implementation of the reliable stacktracer,
      save_stack_trace_tsk_reliable(), bails out and reports an "unreliable
      trace" whenever it finds an exception frame on the stack. Stack frames
      are classified as exception frames if the STACK_FRAME_REGS_MARKER
      magic, as written by exception prologues, is found at a particular
      location.
      
      However, as observed by Joe Lawrence, it is possible in practice that
      non-exception stack frames can alias with prior exception frames and
      thus, that the reliable stacktracer can find a stale
      STACK_FRAME_REGS_MARKER on the stack. It in turn falsely reports an
      unreliable stacktrace and blocks any live patching transition to
      finish. Said condition lasts until the stack frame is
      overwritten/initialized by function call or other means.
      
      In principle, we could mitigate this by making the exception frame
      classification condition in save_stack_trace_tsk_reliable() stronger:
      in addition to testing for STACK_FRAME_REGS_MARKER, we could also take
      into account that for all exceptions executing on the kernel stack
        - their stack frames's backlink pointers always match what is saved
          in their pt_regs instance's ->gpr[1] slot and that
        - their exception frame size equals STACK_INT_FRAME_SIZE, a value
          uncommonly large for non-exception frames.
      
      However, while these are currently true, relying on them would make
      the reliable stacktrace implementation more sensitive towards future
      changes in the exception entry code. Note that false negatives, i.e.
      not detecting exception frames, would silently break the live patching
      consistency model.
      
      Furthermore, certain other places (diagnostic stacktraces, perf, xmon)
      rely on STACK_FRAME_REGS_MARKER as well.
      
      Make the exception exit code clear the on-stack
      STACK_FRAME_REGS_MARKER for those exceptions running on the "normal"
      kernel stack and returning to kernelspace: because the topmost frame
      is ignored by the reliable stack tracer anyway, returns to userspace
      don't need to take care of clearing the marker.
      
      Furthermore, as I don't have the ability to test this on Book 3E or 32
      bits, limit the change to Book 3S and 64 bits.
      
      Fixes: df78d3f6 ("powerpc/livepatch: Implement reliable stack tracing for the consistency model")
      Reported-by: NJoe Lawrence <joe.lawrence@redhat.com>
      Signed-off-by: NNicolai Stange <nstange@suse.de>
      Signed-off-by: NJoe Lawrence <joe.lawrence@redhat.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      b52681e6
    • A
      powerpc/hugetlb: Handle mmap_min_addr correctly in get_unmapped_area callback · 6cf5f631
      Aneesh Kumar K.V 提交于
      [ Upstream commit 5330367fa300742a97e20e953b1f77f48392faae ]
      
      After we ALIGN up the address we need to make sure we didn't overflow
      and resulted in zero address. In that case, we need to make sure that
      the returned address is greater than mmap_min_addr.
      
      This fixes selftest va_128TBswitch --run-hugetlb reporting failures when
      run as non root user for
      
      mmap(-1, MAP_HUGETLB)
      
      The bug is that a non-root user requesting address -1 will be given address 0
      which will then fail, whereas they should have been given something else that
      would have succeeded.
      
      We also avoid the first mmap(-1, MAP_HUGETLB) returning NULL address as mmap address
      with this change. So we think this is not a security issue, because it only affects
      whether we choose an address below mmap_min_addr, not whether we
      actually allow that address to be mapped. ie. there are existing capability
      checks to prevent a user mapping below mmap_min_addr and those will still be
      honoured even without this fix.
      
      Fixes: 48483760 ("powerpc/mm: Add radix support for hugetlb")
      Reviewed-by: NLaurent Dufour <ldufour@linux.vnet.ibm.com>
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      6cf5f631
    • N
      powerpc/xmon: Fix opcode being uninitialized in print_insn_powerpc · c70214d5
      Nathan Chancellor 提交于
      [ Upstream commit e7140639b1de65bba435a6bd772d134901141f86 ]
      
      When building with -Wsometimes-uninitialized, Clang warns:
      
        arch/powerpc/xmon/ppc-dis.c:157:7: warning: variable 'opcode' is used
        uninitialized whenever 'if' condition is false
        [-Wsometimes-uninitialized]
          if (cpu_has_feature(CPU_FTRS_POWER9))
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        arch/powerpc/xmon/ppc-dis.c:167:7: note: uninitialized use occurs here
          if (opcode == NULL)
              ^~~~~~
        arch/powerpc/xmon/ppc-dis.c:157:3: note: remove the 'if' if its
        condition is always true
          if (cpu_has_feature(CPU_FTRS_POWER9))
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        arch/powerpc/xmon/ppc-dis.c:132:38: note: initialize the variable
        'opcode' to silence this warning
          const struct powerpc_opcode *opcode;
                                             ^
                                              = NULL
        1 warning generated.
      
      This warning seems to make no sense on the surface because opcode is set
      to NULL right below this statement. However, there is a comma instead of
      semicolon to end the dialect assignment, meaning that the opcode
      assignment only happens in the if statement. Properly terminate that
      line so that Clang no longer warns.
      
      Fixes: 5b102782 ("powerpc/xmon: Enable disassembly files (compilation changes)")
      Signed-off-by: NNathan Chancellor <natechancellor@gmail.com>
      Reviewed-by: NNick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      c70214d5
    • A
      powerpc/powernv/ioda: Fix locked_vm counting for memory used by IOMMU tables · 4acf7974
      Alexey Kardashevskiy 提交于
      [ Upstream commit 11f5acce2fa43b015a8120fa7620fa4efd0a2952 ]
      
      We store 2 multilevel tables in iommu_table - one for the hardware and
      one with the corresponding userspace addresses. Before allocating
      the tables, the iommu_table_group_ops::get_table_size() hook returns
      the combined size of the two and VFIO SPAPR TCE IOMMU driver adjusts
      the locked_vm counter correctly. When the table is actually allocated,
      the amount of allocated memory is stored in iommu_table::it_allocated_size
      and used to decrement the locked_vm counter when we release the memory
      used by the table; .get_table_size() and .create_table() calculate it
      independently but the result is expected to be the same.
      
      However the allocator does not add the userspace table size to
      .it_allocated_size so when we destroy the table because of VFIO PCI
      unplug (i.e. VFIO container is gone but the userspace keeps running),
      we decrement locked_vm by just a half of size of memory we are
      releasing.
      
      To make things worse, since we enabled on-demand allocation of
      indirect levels, it_allocated_size contains only the amount of memory
      actually allocated at the table creation time which can just be a
      fraction. It is not a problem with incrementing locked_vm (as
      get_table_size() value is used) but it is with decrementing.
      
      As the result, we leak locked_vm and may not be able to allocate more
      IOMMU tables after few iterations of hotplug/unplug.
      
      This sets it_allocated_size in the pnv_pci_ioda2_ops::create_table()
      hook to what pnv_pci_ioda2_get_table_size() returns so from now on we
      have a single place which calculates the maximum memory a table can
      occupy. The original meaning of it_allocated_size is somewhat lost now
      though.
      
      We do not ditch it_allocated_size whatsoever here and we do not call
      get_table_size() from vfio_iommu_spapr_tce.c when decrementing
      locked_vm as we may have multiple IOMMU groups per container and even
      though they all are supposed to have the same get_table_size()
      implementation, there is a small chance for failure or confusion.
      
      Fixes: 090bad39 ("powerpc/powernv: Add indirect levels to it_userspace")
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      4acf7974
  7. 03 4月, 2019 15 次提交
  8. 27 3月, 2019 1 次提交
    • M
      powerpc/vdso64: Fix CLOCK_MONOTONIC inconsistencies across Y2038 · b8ea151a
      Michael Ellerman 提交于
      commit b5b4453e7912f056da1ca7572574cada32ecb60c upstream.
      
      Jakub Drnec reported:
        Setting the realtime clock can sometimes make the monotonic clock go
        back by over a hundred years. Decreasing the realtime clock across
        the y2k38 threshold is one reliable way to reproduce. Allegedly this
        can also happen just by running ntpd, I have not managed to
        reproduce that other than booting with rtc at >2038 and then running
        ntp. When this happens, anything with timers (e.g. openjdk) breaks
        rather badly.
      
      And included a test case (slightly edited for brevity):
        #define _POSIX_C_SOURCE 199309L
        #include <stdio.h>
        #include <time.h>
        #include <stdlib.h>
        #include <unistd.h>
      
        long get_time(void) {
          struct timespec tp;
          clock_gettime(CLOCK_MONOTONIC, &tp);
          return tp.tv_sec + tp.tv_nsec / 1000000000;
        }
      
        int main(void) {
          long last = get_time();
          while(1) {
            long now = get_time();
            if (now < last) {
              printf("clock went backwards by %ld seconds!\n", last - now);
            }
            last = now;
            sleep(1);
          }
          return 0;
        }
      
      Which when run concurrently with:
       # date -s 2040-1-1
       # date -s 2037-1-1
      
      Will detect the clock going backward.
      
      The root cause is that wtom_clock_sec in struct vdso_data is only a
      32-bit signed value, even though we set its value to be equal to
      tk->wall_to_monotonic.tv_sec which is 64-bits.
      
      Because the monotonic clock starts at zero when the system boots the
      wall_to_montonic.tv_sec offset is negative for current and future
      dates. Currently on a freshly booted system the offset will be in the
      vicinity of negative 1.5 billion seconds.
      
      However if the wall clock is set past the Y2038 boundary, the offset
      from wall to monotonic becomes less than negative 2^31, and no longer
      fits in 32-bits. When that value is assigned to wtom_clock_sec it is
      truncated and becomes positive, causing the VDSO assembly code to
      calculate CLOCK_MONOTONIC incorrectly.
      
      That causes CLOCK_MONOTONIC to jump ahead by ~4 billion seconds which
      it is not meant to do. Worse, if the time is then set back before the
      Y2038 boundary CLOCK_MONOTONIC will jump backward.
      
      We can fix it simply by storing the full 64-bit offset in the
      vdso_data, and using that in the VDSO assembly code. We also shuffle
      some of the fields in vdso_data to avoid creating a hole.
      
      The original commit that added the CLOCK_MONOTONIC support to the VDSO
      did actually use a 64-bit value for wtom_clock_sec, see commit
      a7f290da ("[PATCH] powerpc: Merge vdso's and add vdso support to
      32 bits kernel") (Nov 2005). However just 3 days later it was
      converted to 32-bits in commit 0c37ec2a ("[PATCH] powerpc: vdso
      fixes (take #2)"), and the bug has existed since then AFAICS.
      
      Fixes: 0c37ec2a ("[PATCH] powerpc: vdso fixes (take #2)")
      Cc: stable@vger.kernel.org # v2.6.15+
      Link: http://lkml.kernel.org/r/HaC.ZfES.62bwlnvAvMP.1STMMj@seznam.czReported-by: NJakub Drnec <jaydee@email.cz>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b8ea151a
  9. 24 3月, 2019 11 次提交
    • S
      KVM: Call kvm_arch_memslots_updated() before updating memslots · 23ad135a
      Sean Christopherson 提交于
      commit 152482580a1b0accb60676063a1ac57b2d12daf6 upstream.
      
      kvm_arch_memslots_updated() is at this point in time an x86-specific
      hook for handling MMIO generation wraparound.  x86 stashes 19 bits of
      the memslots generation number in its MMIO sptes in order to avoid
      full page fault walks for repeat faults on emulated MMIO addresses.
      Because only 19 bits are used, wrapping the MMIO generation number is
      possible, if unlikely.  kvm_arch_memslots_updated() alerts x86 that
      the generation has changed so that it can invalidate all MMIO sptes in
      case the effective MMIO generation has wrapped so as to avoid using a
      stale spte, e.g. a (very) old spte that was created with generation==0.
      
      Given that the purpose of kvm_arch_memslots_updated() is to prevent
      consuming stale entries, it needs to be called before the new generation
      is propagated to memslots.  Invalidating the MMIO sptes after updating
      memslots means that there is a window where a vCPU could dereference
      the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO
      spte that was created with (pre-wrap) generation==0.
      
      Fixes: e59dbe09 ("KVM: Introduce kvm_arch_memslots_updated()")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      23ad135a
    • C
      powerpc/traps: Fix the message printed when stack overflows · d6d004b3
      Christophe Leroy 提交于
      commit 9bf3d3c4e4fd82c7174f4856df372ab2a71005b9 upstream.
      
      Today's message is useless:
      
        [   42.253267] Kernel stack overflow in process (ptrval), r1=c65500b0
      
      This patch fixes it:
      
        [   66.905235] Kernel stack overflow in process sh[356], r1=c65560b0
      
      Fixes: ad67b74d ("printk: hash addresses printed with %p")
      Cc: stable@vger.kernel.org # v4.15+
      Signed-off-by: NChristophe Leroy <christophe.leroy@c-s.fr>
      [mpe: Use task_pid_nr()]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d6d004b3
    • C
      powerpc/traps: fix recoverability of machine check handling on book3s/32 · 461a52a4
      Christophe Leroy 提交于
      commit 0bbea75c476b77fa7d7811d6be911cc7583e640f upstream.
      
      Looks like book3s/32 doesn't set RI on machine check, so
      checking RI before calling die() will always be fatal
      allthought this is not an issue in most cases.
      
      Fixes: b96672dd ("powerpc: Machine check interrupt is a non-maskable interrupt")
      Fixes: daf00ae71dad ("powerpc/traps: restore recoverability of machine_check interrupts")
      Signed-off-by: NChristophe Leroy <christophe.leroy@c-s.fr>
      Cc: stable@vger.kernel.org
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      461a52a4
    • A
      powerpc/hugetlb: Don't do runtime allocation of 16G pages in LPAR configuration · baed68a9
      Aneesh Kumar K.V 提交于
      commit 35f2806b481f5b9207f25e1886cba5d1c4d12cc7 upstream.
      
      We added runtime allocation of 16G pages in commit 4ae279c2
      ("powerpc/mm/hugetlb: Allow runtime allocation of 16G.") That was done
      to enable 16G allocation on PowerNV and KVM config. In case of KVM
      config, we mostly would have the entire guest RAM backed by 16G
      hugetlb pages for this to work. PAPR do support partial backing of
      guest RAM with hugepages via ibm,expected#pages node of memory node in
      the device tree. This means rest of the guest RAM won't be backed by
      16G contiguous pages in the host and hence a hash page table insertion
      can fail in such case.
      
      An example error message will look like
      
        hash-mmu: mm: Hashing failure ! EA=0x7efc00000000 access=0x8000000000000006 current=readback
        hash-mmu:     trap=0x300 vsid=0x67af789 ssize=1 base psize=14 psize 14 pte=0xc000000400000386
        readback[12260]: unhandled signal 7 at 00007efc00000000 nip 00000000100012d0 lr 000000001000127c code 2
      
      This patch address that by preventing runtime allocation of 16G
      hugepages in LPAR config. To allocate 16G hugetlb one need to kernel
      command line hugepagesz=16G hugepages=<number of 16G pages>
      
      With radix translation mode we don't run into this issue.
      
      This change will prevent runtime allocation of 16G hugetlb pages on
      kvm with hash translation mode. However, with the current upstream it
      was observed that 16G hugetlbfs backed guest doesn't boot at all.
      
      We observe boot failure with the below message:
        [131354.647546] KVM: map_vrma at 0 failed, ret=-4
      
      That means this patch is not resulting in an observable regression.
      Once we fix the boot issue with 16G hugetlb backed memory, we need to
      use ibm,expected#pages memory node attribute to indicate 16G page
      reservation to the guest. This will also enable partial backing of
      guest RAM with 16G pages.
      
      Fixes: 4ae279c2 ("powerpc/mm/hugetlb: Allow runtime allocation of 16G.")
      Cc: stable@vger.kernel.org # v4.14+
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      baed68a9
    • M
      powerpc/ptrace: Simplify vr_get/set() to avoid GCC warning · 9d2e929c
      Michael Ellerman 提交于
      commit ca6d5149d2ad0a8d2f9c28cbe379802260a0a5e0 upstream.
      
      GCC 8 warns about the logic in vr_get/set(), which with -Werror breaks
      the build:
      
        In function ‘user_regset_copyin’,
            inlined from ‘vr_set’ at arch/powerpc/kernel/ptrace.c:628:9:
        include/linux/regset.h:295:4: error: ‘memcpy’ offset [-527, -529] is
        out of the bounds [0, 16] of object ‘vrsave’ with type ‘union
        <anonymous>’ [-Werror=array-bounds]
        arch/powerpc/kernel/ptrace.c: In function ‘vr_set’:
        arch/powerpc/kernel/ptrace.c:623:5: note: ‘vrsave’ declared here
           } vrsave;
      
      This has been identified as a regression in GCC, see GCC bug 88273.
      
      However we can avoid the warning and also simplify the logic and make
      it more robust.
      
      Currently we pass -1 as end_pos to user_regset_copyout(). This says
      "copy up to the end of the regset".
      
      The definition of the regset is:
      	[REGSET_VMX] = {
      		.core_note_type = NT_PPC_VMX, .n = 34,
      		.size = sizeof(vector128), .align = sizeof(vector128),
      		.active = vr_active, .get = vr_get, .set = vr_set
      	},
      
      The end is calculated as (n * size), ie. 34 * sizeof(vector128).
      
      In vr_get/set() we pass start_pos as 33 * sizeof(vector128), meaning
      we can copy up to sizeof(vector128) into/out-of vrsave.
      
      The on-stack vrsave is defined as:
        union {
      	  elf_vrreg_t reg;
      	  u32 word;
        } vrsave;
      
      And elf_vrreg_t is:
        typedef __vector128 elf_vrreg_t;
      
      So there is no bug, but we rely on all those sizes lining up,
      otherwise we would have a kernel stack exposure/overwrite on our
      hands.
      
      Rather than relying on that we can pass an explict end_pos based on
      the sizeof(vrsave). The result should be exactly the same but it's
      more obviously not over-reading/writing the stack and it avoids the
      compiler warning.
      Reported-by: NMeelis Roos <mroos@linux.ee>
      Reported-by: NMathieu Malaterre <malat@debian.org>
      Cc: stable@vger.kernel.org
      Tested-by: NMathieu Malaterre <malat@debian.org>
      Tested-by: NMeelis Roos <mroos@linux.ee>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9d2e929c
    • M
      powerpc: Fix 32-bit KVM-PR lockup and host crash with MacOS guest · 344996a8
      Mark Cave-Ayland 提交于
      commit fe1ef6bcdb4fca33434256a802a3ed6aacf0bd2f upstream.
      
      Commit 8792468d "powerpc: Add the ability to save FPU without
      giving it up" unexpectedly removed the MSR_FE0 and MSR_FE1 bits from
      the bitmask used to update the MSR of the previous thread in
      __giveup_fpu() causing a KVM-PR MacOS guest to lockup and panic the
      host kernel.
      
      Leaving FE0/1 enabled means unrelated processes might receive FPEs
      when they're not expecting them and crash. In particular if this
      happens to init the host will then panic.
      
      eg (transcribed):
        qemu-system-ppc[837]: unhandled signal 8 at 12cc9ce4 nip 12cc9ce4 lr 12cc9ca4 code 0
        systemd[1]: unhandled signal 8 at 202f02e0 nip 202f02e0 lr 001003d4 code 0
        Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
      
      Reinstate these bits to the MSR bitmask to enable MacOS guests to run
      under 32-bit KVM-PR once again without issue.
      
      Fixes: 8792468d ("powerpc: Add the ability to save FPU without giving it up")
      Cc: stable@vger.kernel.org # v4.6+
      Signed-off-by: NMark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      344996a8
    • P
      powerpc/powernv: Don't reprogram SLW image on every KVM guest entry/exit · 3bf8ff7b
      Paul Mackerras 提交于
      commit 19f8a5b5be2898573a5e1dc1db93e8d40117606a upstream.
      
      Commit 24be85a2 ("powerpc/powernv: Clear PECE1 in LPCR via stop-api
      only on Hotplug", 2017-07-21) added two calls to opal_slw_set_reg()
      inside pnv_cpu_offline(), with the aim of changing the LPCR value in
      the SLW image to disable wakeups from the decrementer while a CPU is
      offline.  However, pnv_cpu_offline() gets called each time a secondary
      CPU thread is woken up to participate in running a KVM guest, that is,
      not just when a CPU is offlined.
      
      Since opal_slw_set_reg() is a very slow operation (with observed
      execution times around 20 milliseconds), this means that an offline
      secondary CPU can often be busy doing the opal_slw_set_reg() call
      when the primary CPU wants to grab all the secondary threads so that
      it can run a KVM guest.  This leads to messages like "KVM: couldn't
      grab CPU n" being printed and guest execution failing.
      
      There is no need to reprogram the SLW image on every KVM guest entry
      and exit.  So that we do it only when a CPU is really transitioning
      between online and offline, this moves the calls to
      pnv_program_cpu_hotplug_lpcr() into pnv_smp_cpu_kill_self().
      
      Fixes: 24be85a2 ("powerpc/powernv: Clear PECE1 in LPCR via stop-api only on Hotplug")
      Cc: stable@vger.kernel.org # v4.14+
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3bf8ff7b
    • C
      powerpc/83xx: Also save/restore SPRG4-7 during suspend · f6f03d60
      Christophe Leroy 提交于
      commit 36da5ff0bea2dc67298150ead8d8471575c54c7d upstream.
      
      The 83xx has 8 SPRG registers and uses at least SPRG4
      for DTLB handling LRU.
      
      Fixes: 2319f123 ("powerpc/mm: e300c2/c3/c4 TLB errata workaround")
      Cc: stable@vger.kernel.org
      Signed-off-by: NChristophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f6f03d60
    • J
      powerpc/powernv: Make opal log only readable by root · b0934990
      Jordan Niethe 提交于
      commit 7b62f9bd2246b7d3d086e571397c14ba52645ef1 upstream.
      
      Currently the opal log is globally readable. It is kernel policy to
      limit the visibility of physical addresses / kernel pointers to root.
      Given this and the fact the opal log may contain this information it
      would be better to limit the readability to root.
      
      Fixes: bfc36894 ("powerpc/powernv: Add OPAL message log interface")
      Cc: stable@vger.kernel.org # v3.15+
      Signed-off-by: NJordan Niethe <jniethe5@gmail.com>
      Reviewed-by: NStewart Smith <stewart@linux.ibm.com>
      Reviewed-by: NAndrew Donnellan <andrew.donnellan@au1.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b0934990
    • C
      powerpc/wii: properly disable use of BATs when requested. · 9b530550
      Christophe Leroy 提交于
      commit 6d183ca8baec983dc4208ca45ece3c36763df912 upstream.
      
      'nobats' kernel parameter or some options like CONFIG_DEBUG_PAGEALLOC
      deny the use of BATS for mapping memory.
      
      This patch makes sure that the specific wii RAM mapping function
      takes it into account as well.
      
      Fixes: de32400d ("wii: use both mem1 and mem2 as ram")
      Cc: stable@vger.kernel.org
      Reviewed-by: NJonathan Neuschafer <j.neuschaefer@gmx.net>
      Signed-off-by: NChristophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9b530550
    • C
      powerpc/32: Clear on-stack exception marker upon exception return · 40b97853
      Christophe Leroy 提交于
      commit 9580b71b5a7863c24a9bd18bcd2ad759b86b1eff upstream.
      
      Clear the on-stack STACK_FRAME_REGS_MARKER on exception exit in order
      to avoid confusing stacktrace like the one below.
      
        Call Trace:
        [c0e9dca0] [c01c42a0] print_address_description+0x64/0x2bc (unreliable)
        [c0e9dcd0] [c01c4684] kasan_report+0xfc/0x180
        [c0e9dd10] [c0895130] memchr+0x24/0x74
        [c0e9dd30] [c00a9e38] msg_print_text+0x124/0x574
        [c0e9dde0] [c00ab710] console_unlock+0x114/0x4f8
        [c0e9de40] [c00adc60] vprintk_emit+0x188/0x1c4
        --- interrupt: c0e9df00 at 0x400f330
            LR = init_stack+0x1f00/0x2000
        [c0e9de80] [c00ae3c4] printk+0xa8/0xcc (unreliable)
        [c0e9df20] [c0c27e44] early_irq_init+0x38/0x108
        [c0e9df50] [c0c15434] start_kernel+0x310/0x488
        [c0e9dff0] [00003484] 0x3484
      
      With this patch the trace becomes:
      
        Call Trace:
        [c0e9dca0] [c01c42c0] print_address_description+0x64/0x2bc (unreliable)
        [c0e9dcd0] [c01c46a4] kasan_report+0xfc/0x180
        [c0e9dd10] [c0895150] memchr+0x24/0x74
        [c0e9dd30] [c00a9e58] msg_print_text+0x124/0x574
        [c0e9dde0] [c00ab730] console_unlock+0x114/0x4f8
        [c0e9de40] [c00adc80] vprintk_emit+0x188/0x1c4
        [c0e9de80] [c00ae3e4] printk+0xa8/0xcc
        [c0e9df20] [c0c27e44] early_irq_init+0x38/0x108
        [c0e9df50] [c0c15434] start_kernel+0x310/0x488
        [c0e9dff0] [00003484] 0x3484
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NChristophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      40b97853
  10. 27 2月, 2019 1 次提交