1. 13 10月, 2018 1 次提交
    • M
      powerpc/pseries/mobility: Extend start/stop topology update scope · 65b9fdad
      Michael Bringmann 提交于
      The powerpc mobility code may receive RTAS requests to perform PRRN
      (Platform Resource Reassignment Notification) topology changes at any
      time, including during LPAR migration operations.
      
      In some configurations where the affinity of CPUs or memory is being
      changed on that platform, the PRRN requests may apply or refer to
      outdated information prior to the complete update of the device-tree.
      
      This patch changes the duration for which topology updates are
      suppressed during LPAR migrations from just the rtas_ibm_suspend_me()
      / 'ibm,suspend-me' call(s) to cover the entire migration_store()
      operation to allow all changes to the device-tree to be applied prior
      to accepting and applying any PRRN requests.
      
      For tracking purposes, pr_info notices are added to the functions
      start_topology_update() and stop_topology_update() of 'numa.c'.
      Signed-off-by: NMichael Bringmann <mwb@linux.vnet.ibm.com>
      Reviewed-by: NNathan Fontenot <nfont@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      65b9fdad
  2. 05 10月, 2018 1 次提交
    • S
      powerpc/numa: Skip onlining a offline node in kdump path · ac1788cc
      Srikar Dronamraju 提交于
      With commit 2ea62630 ("powerpc/topology: Get topology for shared
      processors at boot"), kdump kernel on shared LPAR may crash.
      
      The necessary conditions are
      - Shared LPAR with at least 2 nodes having memory and CPUs.
      - Memory requirement for kdump kernel must be met by the first N-1
        nodes where there are at least N nodes with memory and CPUs.
      
      Example numactl of such a machine.
        $ numactl -H
        available: 5 nodes (0,2,5-7)
        node 0 cpus:
        node 0 size: 0 MB
        node 0 free: 0 MB
        node 2 cpus:
        node 2 size: 255 MB
        node 2 free: 189 MB
        node 5 cpus: 24 25 26 27 28 29 30 31
        node 5 size: 4095 MB
        node 5 free: 4024 MB
        node 6 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
        node 6 size: 6353 MB
        node 6 free: 5998 MB
        node 7 cpus: 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39
        node 7 size: 7640 MB
        node 7 free: 7164 MB
        node distances:
        node   0   2   5   6   7
          0:  10  40  40  40  40
          2:  40  10  40  40  40
          5:  40  40  10  40  40
          6:  40  40  40  10  20
          7:  40  40  40  20  10
      
      Steps to reproduce.
      1. Load / start kdump service.
      2. Trigger a kdump (for example : echo c > /proc/sysrq-trigger)
      
      When booting a kdump kernel with 2048M:
      
        kexec: Starting switchover sequence.
        I'm in purgatory
        Using 1TB segments
        hash-mmu: Initializing hash mmu with SLB
        Linux version 4.19.0-rc5-master+ (srikar@linux-xxu6) (gcc version 4.8.5 (SUSE Linux)) #1 SMP Thu Sep 27 19:45:00 IST 2018
        Found initrd at 0xc000000009e70000:0xc00000000ae554b4
        Using pSeries machine description
        -----------------------------------------------------
        ppc64_pft_size    = 0x1e
        phys_mem_size     = 0x88000000
        dcache_bsize      = 0x80
        icache_bsize      = 0x80
        cpu_features      = 0x000000ff8f5d91a7
          possible        = 0x0000fbffcf5fb1a7
          always          = 0x0000006f8b5c91a1
        cpu_user_features = 0xdc0065c2 0xef000000
        mmu_features      = 0x7c006001
        firmware_features = 0x00000007c45bfc57
        htab_hash_mask    = 0x7fffff
        physical_start    = 0x8000000
        -----------------------------------------------------
        numa:   NODE_DATA [mem 0x87d5e300-0x87d67fff]
        numa:     NODE_DATA(0) on node 6
        numa:   NODE_DATA [mem 0x87d54600-0x87d5e2ff]
        Top of RAM: 0x88000000, Total RAM: 0x88000000
        Memory hole size: 0MB
        Zone ranges:
          DMA      [mem 0x0000000000000000-0x0000000087ffffff]
          DMA32    empty
          Normal   empty
        Movable zone start for each node
        Early memory node ranges
          node   6: [mem 0x0000000000000000-0x0000000087ffffff]
        Could not find start_pfn for node 0
        Initmem setup node 0 [mem 0x0000000000000000-0x0000000000000000]
        On node 0 totalpages: 0
        Initmem setup node 6 [mem 0x0000000000000000-0x0000000087ffffff]
        On node 6 totalpages: 34816
      
        Unable to handle kernel paging request for data at address 0x00000060
        Faulting instruction address: 0xc000000008703a54
        Oops: Kernel access of bad area, sig: 11 [#1]
        LE SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in:
        CPU: 11 PID: 1 Comm: swapper/11 Not tainted 4.19.0-rc5-master+ #1
        NIP:  c000000008703a54 LR: c000000008703a38 CTR: 0000000000000000
        REGS: c00000000b673440 TRAP: 0380   Not tainted  (4.19.0-rc5-master+)
        MSR:  8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 24022022  XER: 20000002
        CFAR: c0000000086fc238 IRQMASK: 0
        GPR00: c000000008703a38 c00000000b6736c0 c000000009281900 0000000000000000
        GPR04: 0000000000000000 0000000000000000 fffffffffffff001 c00000000b660080
        GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000220
        GPR12: 0000000000002200 c000000009e51400 0000000000000000 0000000000000008
        GPR16: 0000000000000000 c000000008c152e8 c000000008c152a8 0000000000000000
        GPR20: c000000009422fd8 c000000009412fd8 c000000009426040 0000000000000008
        GPR24: 0000000000000000 0000000000000000 c000000009168bc8 c000000009168c78
        GPR28: c00000000b126410 0000000000000000 c00000000916a0b8 c00000000b126400
        NIP [c000000008703a54] bus_add_device+0x84/0x1e0
        LR [c000000008703a38] bus_add_device+0x68/0x1e0
        Call Trace:
        [c00000000b6736c0] [c000000008703a38] bus_add_device+0x68/0x1e0 (unreliable)
        [c00000000b673740] [c000000008700194] device_add+0x454/0x7c0
        [c00000000b673800] [c00000000872e660] __register_one_node+0xb0/0x240
        [c00000000b673860] [c00000000839a6bc] __try_online_node+0x12c/0x180
        [c00000000b673900] [c00000000839b978] try_online_node+0x58/0x90
        [c00000000b673930] [c0000000080846d8] find_and_online_cpu_nid+0x158/0x190
        [c00000000b673a10] [c0000000080848a0] numa_update_cpu_topology+0x190/0x580
        [c00000000b673c00] [c000000008d3f2e4] smp_cpus_done+0x94/0x108
        [c00000000b673c70] [c000000008d5c00c] smp_init+0x174/0x19c
        [c00000000b673d00] [c000000008d346b8] kernel_init_freeable+0x1e0/0x450
        [c00000000b673dc0] [c0000000080102e8] kernel_init+0x28/0x160
        [c00000000b673e30] [c00000000800b65c] ret_from_kernel_thread+0x5c/0x80
        Instruction dump:
        60000000 60000000 e89e0020 7fe3fb78 4bff87d5 60000000 7c7d1b79 4082008c
        e8bf0050 e93e0098 3b9f0010 2fa50000 <e8690060> 38630018 419e0114 7f84e378
        ---[ end trace 593577668c2daa65 ]---
      
      However a regular kernel with 4096M (2048 gets reserved for crash
      kernel) boots properly.
      
      Unlike regular kernels, which mark all available nodes as online,
      kdump kernel only marks just enough nodes as online and marks the rest
      as offline at boot. However kdump kernel boots with all available
      CPUs. With Commit 2ea62630 ("powerpc/topology: Get topology for
      shared processors at boot"), all CPUs are onlined on their respective
      nodes at boot time. try_online_node() tries to online the offline
      nodes but fails as all needed subsystems are not yet initialized.
      
      As part of fix, detect and skip early onlining of a offline node.
      
      Fixes: 2ea62630 ("powerpc/topology: Get topology for shared processors at boot")
      Reported-by: NPavithra Prakash <pavrampu@in.ibm.com>
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Tested-by: NHari Bathini <hbathini@linux.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      ac1788cc
  3. 04 10月, 2018 3 次提交
    • N
      powerpc/64s/radix: Explicitly flush ERAT with local LPID invalidation · 053c5a75
      Nicholas Piggin 提交于
      Local radix TLB flush operations that operate on congruence classes
      have explicit ERAT flushes for POWER9. The process scoped LPID flush
      did not have a flush, so add it.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      053c5a75
    • N
      powerpc/64s/hash: Do not use PPC_INVALIDATE_ERAT on CPUs before POWER9 · bc276ecb
      Nicholas Piggin 提交于
      PPC_INVALIDATE_ERAT is slbia IH=7 which is a new variant introduced
      with POWER9, and the result is undefined on earlier CPUs.
      
      Commits 7b9f71f9 ("powerpc/64s: POWER9 machine check handler") and
      d4748276 ("powerpc/64s: Improve local TLB flush for boot and MCE on
      POWER9") caused POWER7/8 code to use this instruction. Remove it. An
      ERAT flush can be made by invalidatig the SLB, but before POWER9 that
      requires a flush and rebolt.
      
      Fixes: 7b9f71f9 ("powerpc/64s: POWER9 machine check handler")
      Fixes: d4748276 ("powerpc/64s: Improve local TLB flush for boot and MCE on POWER9")
      Cc: stable@vger.kernel.org # v4.11+
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      bc276ecb
    • D
      powerpc/nohash: fix undefined behaviour when testing page size support · f5e28480
      Daniel Axtens 提交于
      When enumerating page size definitions to check hardware support,
      we construct a constant which is (1U << (def->shift - 10)).
      
      However, the array of page size definitions is only initalised for
      various MMU_PAGE_* constants, so it contains a number of 0-initialised
      elements with def->shift == 0. This means we end up shifting by a
      very large number, which gives the following UBSan splat:
      
      ================================================================================
      UBSAN: Undefined behaviour in /home/dja/dev/linux/linux/arch/powerpc/mm/tlb_nohash.c:506:21
      shift exponent 4294967286 is too large for 32-bit type 'unsigned int'
      CPU: 0 PID: 0 Comm: swapper Not tainted 4.19.0-rc3-00045-ga604f927b012-dirty #6
      Call Trace:
      [c00000000101bc20] [c000000000a13d54] .dump_stack+0xa8/0xec (unreliable)
      [c00000000101bcb0] [c0000000004f20a8] .ubsan_epilogue+0x18/0x64
      [c00000000101bd30] [c0000000004f2b10] .__ubsan_handle_shift_out_of_bounds+0x110/0x1a4
      [c00000000101be20] [c000000000d21760] .early_init_mmu+0x1b4/0x5a0
      [c00000000101bf10] [c000000000d1ba28] .early_setup+0x100/0x130
      [c00000000101bf90] [c000000000000528] start_here_multiplatform+0x68/0x80
      ================================================================================
      
      Fix this by first checking if the element exists (shift != 0) before
      constructing the constant.
      Signed-off-by: NDaniel Axtens <dja@axtens.net>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      f5e28480
  4. 03 10月, 2018 6 次提交
  5. 25 9月, 2018 1 次提交
    • S
      powerpc/numa: Use associativity if VPHN hcall is successful · 2483ef05
      Srikar Dronamraju 提交于
      Currently associativity is used to lookup node-id even if the
      preceding VPHN hcall failed. However this can cause CPU to be made
      part of the wrong node, (most likely to be node 0). This is because
      VPHN is not enabled on KVM guests.
      
      With 2ea62630 ("powerpc/topology: Get topology for shared processors at
      boot"), associativity is used to set to the wrong node. Hence KVM
      guest topology is broken.
      
      For example : A 4 node KVM guest before would have reported.
      
        [root@localhost ~]#  numactl -H
        available: 4 nodes (0-3)
        node 0 cpus: 0 1 2 3
        node 0 size: 1746 MB
        node 0 free: 1604 MB
        node 1 cpus: 4 5 6 7
        node 1 size: 2044 MB
        node 1 free: 1765 MB
        node 2 cpus: 8 9 10 11
        node 2 size: 2044 MB
        node 2 free: 1837 MB
        node 3 cpus: 12 13 14 15
        node 3 size: 2044 MB
        node 3 free: 1903 MB
        node distances:
        node   0   1   2   3
          0:  10  40  40  40
          1:  40  10  40  40
          2:  40  40  10  40
          3:  40  40  40  10
      
      Would now report:
      
        [root@localhost ~]# numactl -H
        available: 4 nodes (0-3)
        node 0 cpus: 0 2 3 4 5 6 7 8 9 10 11 12 13 14 15
        node 0 size: 1746 MB
        node 0 free: 1244 MB
        node 1 cpus:
        node 1 size: 2044 MB
        node 1 free: 2032 MB
        node 2 cpus: 1
        node 2 size: 2044 MB
        node 2 free: 2028 MB
        node 3 cpus:
        node 3 size: 2044 MB
        node 3 free: 2032 MB
        node distances:
        node   0   1   2   3
          0:  10  40  40  40
          1:  40  10  40  40
          2:  40  40  10  40
          3:  40  40  40  10
      
      Fix this by skipping associativity lookup if the VPHN hcall failed.
      
      Fixes: 2ea62630 ("powerpc/topology: Get topology for shared processors at boot")
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      2483ef05
  6. 24 9月, 2018 1 次提交
    • M
      powerpc/pseries: Fix unitialized timer reset on migration · 8604895a
      Michael Bringmann 提交于
      After migration of a powerpc LPAR, the kernel executes code to
      update the system state to reflect new platform characteristics.
      
      Such changes include modifications to device tree properties provided
      to the system by PHYP. Property notifications received by the
      post_mobility_fixup() code are passed along to the kernel in general
      through a call to of_update_property() which in turn passes such
      events back to all modules through entries like the '.notifier_call'
      function within the NUMA module.
      
      When the NUMA module updates its state, it resets its event timer. If
      this occurs after a previous call to stop_topology_update() or on a
      system without VPHN enabled, the code runs into an unitialized timer
      structure and crashes. This patch adds a safety check along this path
      toward the problem code.
      
      An example crash log is as follows.
      
        ibmvscsi 30000081: Re-enabling adapter!
        ------------[ cut here ]------------
        kernel BUG at kernel/time/timer.c:958!
        Oops: Exception in kernel mode, sig: 5 [#1]
        LE SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in: nfsv3 nfs_acl nfs tcp_diag udp_diag inet_diag lockd unix_diag af_packet_diag netlink_diag grace fscache sunrpc xts vmx_crypto pseries_rng sg binfmt_misc ip_tables xfs libcrc32c sd_mod ibmvscsi ibmveth scsi_transport_srp dm_mirror dm_region_hash dm_log dm_mod
        CPU: 11 PID: 3067 Comm: drmgr Not tainted 4.17.0+ #179
        ...
        NIP mod_timer+0x4c/0x400
        LR  reset_topology_timer+0x40/0x60
        Call Trace:
          0xc0000003f9407830 (unreliable)
          reset_topology_timer+0x40/0x60
          dt_update_callback+0x100/0x120
          notifier_call_chain+0x90/0x100
          __blocking_notifier_call_chain+0x60/0x90
          of_property_notify+0x90/0xd0
          of_update_property+0x104/0x150
          update_dt_property+0xdc/0x1f0
          pseries_devicetree_update+0x2d0/0x510
          post_mobility_fixup+0x7c/0xf0
          migration_store+0xa4/0xc0
          kobj_attr_store+0x30/0x60
          sysfs_kf_write+0x64/0xa0
          kernfs_fop_write+0x16c/0x240
          __vfs_write+0x40/0x200
          vfs_write+0xc8/0x240
          ksys_write+0x5c/0x100
          system_call+0x58/0x6c
      
      Fixes: 5d88aa85 ("powerpc/pseries: Update CPU maps when device tree is updated")
      Cc: stable@vger.kernel.org # v3.10+
      Signed-off-by: NMichael Bringmann <mwb@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      8604895a
  7. 20 9月, 2018 1 次提交
    • T
      powerpc/pkeys: Fix reading of ibm, processor-storage-keys property · c716a25b
      Thiago Jung Bauermann 提交于
      scan_pkey_feature() uses of_property_read_u32_array() to read the
      ibm,processor-storage-keys property and calls be32_to_cpu() on the
      value it gets. The problem is that of_property_read_u32_array() already
      returns the value converted to the CPU byte order.
      
      The value of pkeys_total ends up more or less sane because there's a min()
      call in pkey_initialize() which reduces pkeys_total to 32. So in practice
      the kernel ignores the fact that the hypervisor reserved one key for
      itself (the device tree advertises 31 keys in my test VM).
      
      This is wrong, but the effect in practice is that when a process tries to
      allocate the 32nd key, it gets an -EINVAL error instead of -ENOSPC which
      would indicate that there aren't any keys available
      
      Fixes: cf43d3b2 ("powerpc: Enable pkey subsystem")
      Cc: stable@vger.kernel.org # v4.16+
      Signed-off-by: NThiago Jung Bauermann <bauerman@linux.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      c716a25b
  8. 19 9月, 2018 12 次提交
    • N
      powerpc/64s/hash: Add a SLB preload cache · 89ca4e12
      Nicholas Piggin 提交于
      When switching processes, currently all user SLBEs are cleared, and a
      few (exec_base, pc, and stack) are preloaded. In trivial testing with
      small apps, this tends to miss the heap and low 256MB segments, and it
      will also miss commonly accessed segments on large memory workloads.
      
      Add a simple round-robin preload cache that just inserts the last SLB
      miss into the head of the cache and preloads those at context switch
      time. Every 256 context switches, the oldest entry is removed from the
      cache to shrink the cache and require fewer slbmte if they are unused.
      
      Much more could go into this, including into the SLB entry reclaim
      side to track some LRU information etc, which would require a study of
      large memory workloads. But this is a simple thing we can do now that
      is an obvious win for common workloads.
      
      With the full series, process switching speed on the context_switch
      benchmark on POWER9/hash (with kernel speculation security masures
      disabled) increases from 140K/s to 178K/s (27%).
      
      POWER8 does not change much (within 1%), it's unclear why it does not
      see a big gain like POWER9.
      
      Booting to busybox init with 256MB segments has SLB misses go down
      from 945 to 69, and with 1T segments 900 to 21. These could almost all
      be eliminated by preloading a bit more carefully with ELF binary
      loading.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      89ca4e12
    • N
      powerpc/64s/hash: provide arch_setup_exec hooks for hash slice setup · 2e162674
      Nicholas Piggin 提交于
      This will be used by the SLB code in the next patch, but for now this
      sets the slb_addr_limit to the correct size for 32-bit tasks.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      2e162674
    • N
      powerpc/64s/hash: SLB allocation status bitmaps · 655deecf
      Nicholas Piggin 提交于
      Add 32-entry bitmaps to track the allocation status of the first 32
      SLB entries, and whether they are user or kernel entries. These are
      used to allocate free SLB entries first, before resorting to the round
      robin allocator.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      655deecf
    • N
      powerpc/64s/hash: remove user SLB data from the paca · 8fed04d0
      Nicholas Piggin 提交于
      User SLB mappig data is copied into the PACA from the mm->context so
      it can be accessed by the SLB miss handlers.
      
      After the C conversion, SLB miss handlers now run with relocation on,
      and user SLB misses are able to take recursive kernel SLB misses, so
      the user SLB mapping data can be removed from the paca and accessed
      directly.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      8fed04d0
    • N
      powerpc/64s/hash: convert SLB miss handlers to C · 5e46e29e
      Nicholas Piggin 提交于
      This patch moves SLB miss handlers completely to C, using the standard
      exception handler macros to set up the stack and branch to C.
      
      This can be done because the segment containing the kernel stack is
      always bolted, so accessing it with relocation on will not cause an
      SLB exception.
      
      Arbitrary kernel memory may not be accessed when handling kernel space
      SLB misses, so care should be taken there. However user SLB misses can
      access any kernel memory, which can be used to move some fields out of
      the paca (in later patches).
      
      User SLB misses could quite easily reconcile IRQs and set up a first
      class kernel environment and exit via ret_from_except, however that
      doesn't seem to be necessary at the moment, so we only do that if a
      bad fault is encountered.
      
      [ Credit to Aneesh for bug fixes, error checks, and improvements to bad
        address handling, etc ]
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      
      Since RFC:
      - Added MSR[RI] handling
      - Fixed up a register loss bug exposed by irq tracing (Aneesh)
      - Reject misses outside the defined kernel regions (Aneesh)
      - Added several more sanity checks and error handling (Aneesh), we may
        look at consolidating these tests and tightenig up the code but for
        a first pass we decided it's better to check carefully.
      
      Since v1:
      - Fixed SLB cache corruption (Aneesh)
      - Fixed untidy SLBE allocation "leak" in get_vsid error case
      - Now survives some stress testing on real hardware
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      5e46e29e
    • N
      powerpc/64s/hash: Use POWER9 SLBIA IH=3 variant in switch_slb · 82d8f4c2
      Nicholas Piggin 提交于
      POWER9 introduces SLBIA IH=3, which invalidates all SLB entries and
      associated lookaside information that have a class value of 1, which
      Linux assigns to user addresses. This matches what switch_slb wants,
      and allows a simple fast implementation that avoids the slb_cache
      complexity.
      
      As a side-effect, the POWER5 < DD2.1 SLB invalidation workaround is
      also avoided on POWER9.
      
      Process context switching rate is improved about 2.2% for a small
      process that hits the slb cache which is the best case for the current
      code.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      82d8f4c2
    • N
      powerpc/64s/hash: Use POWER6 SLBIA IH=1 variant in switch_slb · 5141c182
      Nicholas Piggin 提交于
      The SLBIA IH=1 hint will remove all non-zero SLBEs, but only
      invalidate ERAT entries associated with a class value of 1, for
      processors that support the hint (e.g., POWER6 and newer), which
      Linux assigns to user addresses.
      
      This prevents kernel ERAT entries from being invalidated when
      context switchig (if the thread faulted in more than 8 user SLBEs).
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      5141c182
    • N
      powerpc/64s/hash: remove the vmalloc segment from the bolted SLB · 85376e2a
      Nicholas Piggin 提交于
      Remove the vmalloc segment from bolted SLBEs. This is not required to
      be bolted, and seems like it was added to help pre-load the SLB on
      context switch. However there are now other segments like the vmemmap
      segment and non-zero node memory that often take misses after a context
      switch, so it is better to solve this in a more general way.
      
      A subsequent change will track free SLB entries and uses those rather
      than round-robin overwrite valid entries, which makes it far less
      likely for kernel SLBEs to be evicted after they are installed.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      85376e2a
    • N
      powerpc/64s/hash: move POWER5 < DD2.1 slbie workaround where it is needed · 8b92887c
      Nicholas Piggin 提交于
      The POWER5 < DD2.1 issue is that slbie needs to be issued more than
      once. It came in with this change:
      
      ChangeSet@1.1608, 2004-04-29 07:12:31-07:00, david@gibson.dropbear.id.au
        [PATCH] POWER5 erratum workaround
      
        Early POWER5 revisions (<DD2.1) have a problem requiring slbie
        instructions to be repeated under some circumstances.  The patch below
        adds a workaround (patch made by Anton Blanchard).
      
      (aka. 3e4520f7605243abf66a7ccd3d2e49e48e8c0483 in the full history tree)
      
      The extra slbie in switch_slb is done even for the case where slbia is
      called (slb_flush_and_rebolt). I don't believe that is required
      because there are other slb_flush_and_rebolt callers which do not
      issue the workaround slbie, which would be broken if it was required.
      
      It also seems to be fine inside the isync with the first slbie, as it
      is in the kernel stack switch code.
      
      So move this workaround to where it is required. This is not much of
      an optimisation because this is the fast path, but it makes the code
      more understandable and neater.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      [mpe: Retain slbie_data initialisation to avoid compiler warning]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      8b92887c
    • N
      powerpc/64s/hash: avoid the POWER5 < DD2.1 slb invalidate workaround on POWER8/9 · 505ea82e
      Nicholas Piggin 提交于
      I only have POWER8/9 to test, so just remove it for those.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      505ea82e
    • N
      powerpc/64s/hash: Fix stab_rr off by one initialization · 09b4438d
      Nicholas Piggin 提交于
      This causes SLB alloation to start 1 beyond the start of the SLB.
      There is no real problem because after it wraps it stats behaving
      properly, it's just surprisig to see when looking at SLB traces.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      09b4438d
    • M
      powerpc/pseries: Dump the SLB contents on SLB MCE errors. · c6d15258
      Mahesh Salgaonkar 提交于
      If we get a machine check exceptions due to SLB errors then dump the
      current SLB contents which will be very much helpful in debugging the
      root cause of SLB errors. Introduce an exclusive buffer per cpu to hold
      faulty SLB entries. In real mode mce handler saves the old SLB contents
      into this buffer accessible through paca and print it out later in virtual
      mode.
      
      With this patch the console will log SLB contents like below on SLB MCE
      errors:
      
      [  507.297236] SLB contents of cpu 0x1
      [  507.297237] Last SLB entry inserted at slot 16
      [  507.297238] 00 c000000008000000 400ea1b217000500
      [  507.297239]   1T  ESID=   c00000  VSID=      ea1b217 LLP:100
      [  507.297240] 01 d000000008000000 400d43642f000510
      [  507.297242]   1T  ESID=   d00000  VSID=      d43642f LLP:110
      [  507.297243] 11 f000000008000000 400a86c85f000500
      [  507.297244]   1T  ESID=   f00000  VSID=      a86c85f LLP:100
      [  507.297245] 12 00007f0008000000 4008119624000d90
      [  507.297246]   1T  ESID=       7f  VSID=      8119624 LLP:110
      [  507.297247] 13 0000000018000000 00092885f5150d90
      [  507.297247]  256M ESID=        1  VSID=   92885f5150 LLP:110
      [  507.297248] 14 0000010008000000 4009e7cb50000d90
      [  507.297249]   1T  ESID=        1  VSID=      9e7cb50 LLP:110
      [  507.297250] 15 d000000008000000 400d43642f000510
      [  507.297251]   1T  ESID=   d00000  VSID=      d43642f LLP:110
      [  507.297252] 16 d000000008000000 400d43642f000510
      [  507.297253]   1T  ESID=   d00000  VSID=      d43642f LLP:110
      [  507.297253] ----------------------------------
      [  507.297254] SLB cache ptr value = 3
      [  507.297254] Valid SLB cache entries:
      [  507.297255] 00 EA[0-35]=    7f000
      [  507.297256] 01 EA[0-35]=        1
      [  507.297257] 02 EA[0-35]=     1000
      [  507.297257] Rest of SLB cache entries:
      [  507.297258] 03 EA[0-35]=    7f000
      [  507.297258] 04 EA[0-35]=        1
      [  507.297259] 05 EA[0-35]=     1000
      [  507.297260] 06 EA[0-35]=       12
      [  507.297260] 07 EA[0-35]=    7f000
      Suggested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Suggested-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NMahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Reviewed-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      c6d15258
  9. 18 9月, 2018 1 次提交
    • M
      powerpc: Avoid code patching freed init sections · 51c3c62b
      Michael Neuling 提交于
      This stops us from doing code patching in init sections after they've
      been freed.
      
      In this chain:
        kvm_guest_init() ->
          kvm_use_magic_page() ->
            fault_in_pages_readable() ->
      	 __get_user() ->
      	   __get_user_nocheck() ->
      	     barrier_nospec();
      
      We have a code patching location at barrier_nospec() and
      kvm_guest_init() is an init function. This whole chain gets inlined,
      so when we free the init section (hence kvm_guest_init()), this code
      goes away and hence should no longer be patched.
      
      We seen this as userspace memory corruption when using a memory
      checker while doing partition migration testing on powervm (this
      starts the code patching post migration via
      /sys/kernel/mobility/migration). In theory, it could also happen when
      using /sys/kernel/debug/powerpc/barrier_nospec.
      
      Cc: stable@vger.kernel.org # 4.13+
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      Reviewed-by: NNicholas Piggin <npiggin@gmail.com>
      Reviewed-by: NChristophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      51c3c62b
  10. 23 8月, 2018 3 次提交
    • M
      powerpc/mce: Fix SLB rebolting during MCE recovery path. · 0f52b3a0
      Mahesh Salgaonkar 提交于
      The commit e7e81847 ("powerpc/64s: move machine check SLB flushing
      to mm/slb.c") introduced a bug in reloading bolted SLB entries. Unused
      bolted entries are stored with .esid=0 in the slb_shadow area, and
      that value is now used directly as the RB input to slbmte, which means
      the RB[52:63] index field is set to 0, which causes SLB entry 0 to be
      cleared.
      
      Fix this by storing the index bits in the unused bolted entries, which
      directs the slbmte to the right place.
      
      The SLB shadow area is also used by the hypervisor, but PAPR is okay
      with that, from LoPAPR v1.1, 14.11.1.3 SLB Shadow Buffer:
      
        Note: SLB is filled sequentially starting at index 0
        from the shadow buffer ignoring the contents of
        RB field bits 52-63
      
      Fixes: e7e81847 ("powerpc/64s: move machine check SLB flushing to mm/slb.c")
      Signed-off-by: NMahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Reviewed-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      0f52b3a0
    • P
      KVM: PPC: Book3S: Fix guest DMA when guest partially backed by THP pages · 8cfbdbdc
      Paul Mackerras 提交于
      Commit 76fa4975 ("KVM: PPC: Check if IOMMU page is contained in
      the pinned physical page", 2018-07-17) added some checks to ensure
      that guest DMA mappings don't attempt to map more than the guest is
      entitled to access. However, errors in the logic mean that legitimate
      guest requests to map pages for DMA are being denied in some
      situations. Specifically, if the first page of the range passed to
      mm_iommu_get() is mapped with a normal page, and subsequent pages are
      mapped with transparent huge pages, we end up with mem->pageshift ==
      0. That means that the page size checks in mm_iommu_ua_to_hpa() and
      mm_iommu_up_to_hpa_rm() will always fail for every page in that
      region, and thus the guest can never map any memory in that region for
      DMA, typically leading to a flood of error messages like this:
      
        qemu-system-ppc64: VFIO_MAP_DMA: -22
        qemu-system-ppc64: vfio_dma_map(0x10005f47780, 0x800000000000000, 0x10000, 0x7fff63ff0000) = -22 (Invalid argument)
      
      The logic errors in mm_iommu_get() are:
      
        (a) use of 'ua' not 'ua + (i << PAGE_SHIFT)' in the find_linux_pte()
            call (meaning that find_linux_pte() returns the pte for the
            first address in the range, not the address we are currently up
            to);
        (b) use of 'pageshift' as the variable to receive the hugepage shift
            returned by find_linux_pte() - for a normal page this gets set
            to 0, leading to us setting mem->pageshift to 0 when we conclude
            that the pte returned by find_linux_pte() didn't match the page
            we were looking at;
        (c) comparing 'compshift', which is a page order, i.e. log base 2 of
            the number of pages, with 'pageshift', which is a log base 2 of
            the number of bytes.
      
      To fix these problems, this patch introduces 'cur_ua' to hold the
      current user address and uses that in the find_linux_pte() call;
      introduces 'pteshift' to hold the hugepage shift found by
      find_linux_pte(); and compares 'pteshift' with 'compshift +
      PAGE_SHIFT' rather than 'compshift'.
      
      The patch also moves the local_irq_restore to the point after the PTE
      pointer returned by find_linux_pte() has been dereferenced because
      otherwise the PTE could change underneath us, and adds a check to
      avoid doing the find_linux_pte() call once mem->pageshift has been
      reduced to PAGE_SHIFT, as an optimization.
      
      Fixes: 76fa4975 ("KVM: PPC: Check if IOMMU page is contained in the pinned physical page")
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      8cfbdbdc
    • A
      powerpc/mm/radix: Only need the Nest MMU workaround for R -> RW transition · f08d08f3
      Aneesh Kumar K.V 提交于
      The Nest MMU workaround is only needed for RW upgrades. Avoid doing
      that for other PTE updates.
      
      We also avoid clearing the PTE while marking it invalid. This is
      because other page table walkers will find this PTE none and can
      result in unexpected behaviour due to that. Instead we clear
      _PAGE_PRESENT and set the software PTE bit _PAGE_INVALID.
      pte_present() is already updated to check for both bits. This makes
      sure page table walkers will find the PTE present and things like
      pte_pfn(pte) returns the right value.
      
      Based on an original patch from Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      f08d08f3
  11. 22 8月, 2018 1 次提交
  12. 21 8月, 2018 1 次提交
    • S
      powerpc/topology: Get topology for shared processors at boot · 2ea62630
      Srikar Dronamraju 提交于
      On a shared LPAR, Phyp will not update the CPU associativity at boot
      time. Just after the boot system does recognize itself as a shared
      LPAR and trigger a request for correct CPU associativity. But by then
      the scheduler would have already created/destroyed its sched domains.
      
      This causes
        - Broken load balance across Nodes causing islands of cores.
        - Performance degradation esp if the system is lightly loaded
        - dmesg to wrongly report all CPUs to be in Node 0.
        - Messages in dmesg saying borken topology.
        - With commit 051f3ca0 ("sched/topology: Introduce NUMA identity
          node sched domain"), can cause rcu stalls at boot up.
      
      The sched_domains_numa_masks table which is used to generate cpumasks
      is only created at boot time just before creating sched domains and
      never updated. Hence, its better to get the topology correct before
      the sched domains are created.
      
      For example on 64 core Power 8 shared LPAR, dmesg reports
      
        Brought up 512 CPUs
        Node 0 CPUs: 0-511
        Node 1 CPUs:
        Node 2 CPUs:
        Node 3 CPUs:
        Node 4 CPUs:
        Node 5 CPUs:
        Node 6 CPUs:
        Node 7 CPUs:
        Node 8 CPUs:
        Node 9 CPUs:
        Node 10 CPUs:
        Node 11 CPUs:
        ...
        BUG: arch topology borken
             the DIE domain not a subset of the NUMA domain
        BUG: arch topology borken
             the DIE domain not a subset of the NUMA domain
      
      numactl/lscpu output will still be correct with cores spreading across
      all nodes:
      
        Socket(s):             64
        NUMA node(s):          12
        Model:                 2.0 (pvr 004d 0200)
        Model name:            POWER8 (architected), altivec supported
        Hypervisor vendor:     pHyp
        Virtualization type:   para
        L1d cache:             64K
        L1i cache:             32K
        NUMA node0 CPU(s): 0-7,32-39,64-71,96-103,176-183,272-279,368-375,464-471
        NUMA node1 CPU(s): 8-15,40-47,72-79,104-111,184-191,280-287,376-383,472-479
        NUMA node2 CPU(s): 16-23,48-55,80-87,112-119,192-199,288-295,384-391,480-487
        NUMA node3 CPU(s): 24-31,56-63,88-95,120-127,200-207,296-303,392-399,488-495
        NUMA node4 CPU(s):     208-215,304-311,400-407,496-503
        NUMA node5 CPU(s):     168-175,264-271,360-367,456-463
        NUMA node6 CPU(s):     128-135,224-231,320-327,416-423
        NUMA node7 CPU(s):     136-143,232-239,328-335,424-431
        NUMA node8 CPU(s):     216-223,312-319,408-415,504-511
        NUMA node9 CPU(s):     144-151,240-247,336-343,432-439
        NUMA node10 CPU(s):    152-159,248-255,344-351,440-447
        NUMA node11 CPU(s):    160-167,256-263,352-359,448-455
      
      Currently on this LPAR, the scheduler detects 2 levels of Numa and
      created numa sched domains for all CPUs, but it finds a single DIE
      domain consisting of all CPUs. Hence it deletes all numa sched
      domains.
      
      To address this, detect the shared processor and update topology soon
      after CPUs are setup so that correct topology is updated just before
      scheduler creates sched domain.
      
      With the fix, dmesg reports:
      
        numa: Node 0 CPUs: 0-7 32-39 64-71 96-103 176-183 272-279 368-375 464-471
        numa: Node 1 CPUs: 8-15 40-47 72-79 104-111 184-191 280-287 376-383 472-479
        numa: Node 2 CPUs: 16-23 48-55 80-87 112-119 192-199 288-295 384-391 480-487
        numa: Node 3 CPUs: 24-31 56-63 88-95 120-127 200-207 296-303 392-399 488-495
        numa: Node 4 CPUs: 208-215 304-311 400-407 496-503
        numa: Node 5 CPUs: 168-175 264-271 360-367 456-463
        numa: Node 6 CPUs: 128-135 224-231 320-327 416-423
        numa: Node 7 CPUs: 136-143 232-239 328-335 424-431
        numa: Node 8 CPUs: 216-223 312-319 408-415 504-511
        numa: Node 9 CPUs: 144-151 240-247 336-343 432-439
        numa: Node 10 CPUs: 152-159 248-255 344-351 440-447
        numa: Node 11 CPUs: 160-167 256-263 352-359 448-455
      
      and lscpu also reports:
      
        Socket(s):             64
        NUMA node(s):          12
        Model:                 2.0 (pvr 004d 0200)
        Model name:            POWER8 (architected), altivec supported
        Hypervisor vendor:     pHyp
        Virtualization type:   para
        L1d cache:             64K
        L1i cache:             32K
        NUMA node0 CPU(s): 0-7,32-39,64-71,96-103,176-183,272-279,368-375,464-471
        NUMA node1 CPU(s): 8-15,40-47,72-79,104-111,184-191,280-287,376-383,472-479
        NUMA node2 CPU(s): 16-23,48-55,80-87,112-119,192-199,288-295,384-391,480-487
        NUMA node3 CPU(s): 24-31,56-63,88-95,120-127,200-207,296-303,392-399,488-495
        NUMA node4 CPU(s):     208-215,304-311,400-407,496-503
        NUMA node5 CPU(s):     168-175,264-271,360-367,456-463
        NUMA node6 CPU(s):     128-135,224-231,320-327,416-423
        NUMA node7 CPU(s):     136-143,232-239,328-335,424-431
        NUMA node8 CPU(s):     216-223,312-319,408-415,504-511
        NUMA node9 CPU(s):     144-151,240-247,336-343,432-439
        NUMA node10 CPU(s):    152-159,248-255,344-351,440-447
        NUMA node11 CPU(s):    160-167,256-263,352-359,448-455
      Reported-by: NManjunatha H R <manjuhr1@in.ibm.com>
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      [mpe: Trim / format change log]
      Tested-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      2ea62630
  13. 18 8月, 2018 1 次提交
    • S
      mm: convert return type of handle_mm_fault() caller to vm_fault_t · 50a7ca3c
      Souptick Joarder 提交于
      Use new return type vm_fault_t for fault handler.  For now, this is just
      documenting that the function returns a VM_FAULT value rather than an
      errno.  Once all instances are converted, vm_fault_t will become a
      distinct type.
      
      Ref-> commit 1c8f4220 ("mm: change return type to vm_fault_t")
      
      In this patch all the caller of handle_mm_fault() are changed to return
      vm_fault_t type.
      
      Link: http://lkml.kernel.org/r/20180617084810.GA6730@jordon-HP-15-Notebook-PCSigned-off-by: NSouptick Joarder <jrdr.linux@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: James E.J. Bottomley <jejb@parisc-linux.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Palmer Dabbelt <palmer@sifive.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Levin, Alexander (Sasha Levin)" <alexander.levin@verizon.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      50a7ca3c
  14. 13 8月, 2018 1 次提交
  15. 10 8月, 2018 2 次提交
  16. 07 8月, 2018 3 次提交
  17. 30 7月, 2018 1 次提交