1. 11 Apr, 2017 12 commits
    • G
      powerpc/powernv/idle: Don't override default/deepest directly in kernel · f3b3f284
      Gautham R. Shenoy authored
      Currently during idle-init on power9, if we don't find suitable stop
      states in the device tree that can be used as the
      default_stop/deepest_stop, we set stop0 (ESL=1,EC=1) as the default
      stop state psscr to be used by power9_idle and deepest stop state
      which is used by CPU-Hotplug.
      
      However, if the platform firmware has not configured or enabled a stop
      state, the kernel should not make any assumptions and fall back to a
      default choice.
      
      If the kernel uses a stop state that is not configured by the platform
      firmware, it may lead to further failures which should be avoided.
      
      In this patch, we modify the init code to ensure that the kernel uses
      only the stop states exposed by the firmware through the device
      tree. When a suitable default stop state isn't found, we disable
      ppc_md.power_save for power9. Similarly, when a suitable
      deepest_stop_state is not found in the device tree exported by the
      firmware, fall back to the default busy-wait loop in the CPU-Hotplug
      code.
      
      [Changelog written with inputs from svaidy@linux.vnet.ibm.com]
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
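The selection rule described in this commit can be sketched roughly as follows. This is a simplified illustration, not the kernel's code: `struct fw_stop_state`, its `depth` field, and `pick_deepest_stop_state()` are hypothetical names, and the PSSCR values a caller passes in would come from the device tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one stop state parsed from the device tree. */
struct fw_stop_state {
    uint64_t psscr_val;  /* PSSCR value used to request this state */
    unsigned int depth;  /* illustrative ordering key: larger means deeper */
};

/*
 * Pick the deepest state the firmware exposed. Returns false when the
 * list is empty, so the caller can fall back (for example to a
 * busy-wait loop) instead of inventing a state of its own.
 */
static bool pick_deepest_stop_state(const struct fw_stop_state *states,
                                    size_t count, uint64_t *psscr_out)
{
    const struct fw_stop_state *best = NULL;

    assert(psscr_out != NULL);
    for (size_t i = 0; i < count; i++) {
        if (best == NULL || states[i].depth > best->depth)
            best = &states[i];
    }
    if (best == NULL)
        return false;  /* nothing exposed: no kernel-side default */

    *psscr_out = best->psscr_val;
    return true;
}
```

The point of returning a boolean rather than a default value is that "no state found" stays visible to the caller, which matches the intent of not assuming stop0 when the firmware exposes nothing.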
    • G
      powerpc/powernv/smp: Add busy-wait loop as fall back for CPU-Hotplug · 90061231
      Gautham R. Shenoy authored
      Currently, the powernv cpu-offline function assumes that platform idle
      states such as stop on POWER9, winkle/sleep/nap on POWER8 are always
      available. On POWER8, it picks nap as the default state if other deep
      idle states like sleep/winkle are not available and enabled in the
      platform.
      
      On POWER9, nap is not available and all idle states are managed by
      the STOP instruction. The parameters to the idle state are passed
      through the processor stop status control register (PSSCR), so
      executing STOP takes its parameters from the current PSSCR. We do
      not want to make any assumptions in the kernel about which STOP
      states and PSSCR features are configured by the platform.
      
      Ideally the platform will configure a good set of stop states that
      can be used in the kernel. We would like to start with a clean slate
      if the platform chooses not to configure any state, or if an error in
      the platform firmware leads to no stop states being configured or
      allowed to be requested.
      
      This patch adds a fallback method for CPU-Hotplug that is similar to
      the snooze loop at idle, where the threads are left to spin at low
      priority and hence reduce the cycles consumed.
      
      This is a safe fallback for the case where no stop state can be
      requested because the platform firmware did not configure any, most
      likely due to an error condition.
      
      Requesting a stop state when the platform has not configured them or
      enabled them would lead to further error conditions which could be
      difficult to debug.
      
      [Changelog written with inputs from svaidy@linux.vnet.ibm.com]
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
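The shape of such a busy-wait fallback can be illustrated in portable C. This is a userspace analogy under stated assumptions, not the kernel implementation: `relax_at_low_priority()` is a hypothetical no-op standing in for lowering the SMT thread priority, and the restart flag is a plain C11 atomic.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for lowering SMT thread priority; a no-op in this sketch. */
static inline void relax_at_low_priority(void)
{
}

/*
 * Spin until another thread sets the restart flag. Returns the number
 * of loop iterations spent waiting.
 */
static unsigned long offline_busy_wait(atomic_bool *restart_requested)
{
    unsigned long spins = 0;

    assert(restart_requested != NULL);
    while (!atomic_load_explicit(restart_requested, memory_order_acquire)) {
        relax_at_low_priority();
        spins++;
    }
    return spins;
}
```

Unlike a stop state, the loop needs nothing from the firmware, which is what makes it usable when no stop states were configured.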
    • G
      powerpc/powernv: Move CPU-Offline idle state invocation from smp.c to idle.c · a7cd88da
      Gautham R. Shenoy authored
      Move the piece of code in powernv/smp.c::pnv_smp_cpu_kill_self() which
      transitions the CPU to the deepest available platform idle state to a
      new function named pnv_cpu_offline() in powernv/idle.c. The rationale
      behind this code movement is that the data required to determine the
      deepest available platform state resides in powernv/idle.c.
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • A
      powerpc/hugetlb: Add ABI defines for supported HugeTLB page sizes · 2c9faa76
      Anshuman Khandual authored
      Add user space exported API definitions for the 512KB, 1MB, 2MB, 8MB,
      16MB, 1GB and 16GB non-default huge page sizes, to be used with the
      mmap() system call.
      Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      [mpe: Reword the comment to emphasise that these are only needed to use
       the non-default huge page size, and updated the change log.]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
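For context, Linux encodes a non-default huge page size in the mmap() flags as log2 of the size shifted left by MAP_HUGE_SHIFT (26 in the uapi headers). The sketch below computes that encoding by hand; `huge_size_flag()` and `SKETCH_MAP_HUGE_SHIFT` are hypothetical names used only for illustration, and real code would normally use the exported `MAP_HUGE_*` constants.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors MAP_HUGE_SHIFT from the Linux uapi headers. */
#define SKETCH_MAP_HUGE_SHIFT 26

/*
 * Encode a power-of-two huge page size as mmap() flag bits:
 * log2(page_size) << SKETCH_MAP_HUGE_SHIFT.
 */
static unsigned int huge_size_flag(uint64_t page_size)
{
    unsigned int shift_bits = 0;

    /* The encoding only makes sense for non-zero powers of two. */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    while ((page_size >> shift_bits) != 1)
        shift_bits++;

    return shift_bits << SKETCH_MAP_HUGE_SHIFT;
}
```

A caller would OR the result with `MAP_HUGETLB` in the flags argument of mmap(). Whether the mapping succeeds still depends on the platform supporting that size and on huge pages of that size being available.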
    • A
      powerpc/mm: Remove reduntant initmem information from log · ea614555
      Anshuman Khandual authored
      The generic core VM already prints this information in the log
      buffer, hence there is no need for a second print. This just
      removes the second print from the arch powerpc NUMA init path.
      
      Before the patch:
      
        $ dmesg | grep "Initmem"
      
        numa: Initmem setup node 0 [mem 0x00000000-0xffffffff]
        numa: Initmem setup node 1 [mem 0x100000000-0x1ffffffff]
        numa: Initmem setup node 2 [mem 0x200000000-0x2ffffffff]
        numa: Initmem setup node 3 [mem 0x300000000-0x3ffffffff]
        numa: Initmem setup node 4 [mem 0x400000000-0x4ffffffff]
        numa: Initmem setup node 5 [mem 0x500000000-0x5ffffffff]
        numa: Initmem setup node 6 [mem 0x600000000-0x6ffffffff]
        numa: Initmem setup node 7 [mem 0x700000000-0x7ffffffff]
        Initmem setup node 0 [mem 0x0000000000000000-0x00000000ffffffff]
        Initmem setup node 1 [mem 0x0000000100000000-0x00000001ffffffff]
        Initmem setup node 2 [mem 0x0000000200000000-0x00000002ffffffff]
        Initmem setup node 3 [mem 0x0000000300000000-0x00000003ffffffff]
        Initmem setup node 4 [mem 0x0000000400000000-0x00000004ffffffff]
        Initmem setup node 5 [mem 0x0000000500000000-0x00000005ffffffff]
        Initmem setup node 6 [mem 0x0000000600000000-0x00000006ffffffff]
        Initmem setup node 7 [mem 0x0000000700000000-0x00000007ffffffff]
      
      After the patch just the latter set is printed.
      Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • M
      powerpc: Make sparsemem the default on 64-bit Book3S · 7b3912f4
      Michael Ellerman authored
      Make sparsemem the default on all 64-bit Book3S platforms. It already is
      for pseries and ps3, and we need to enable it for powernv because on
      POWER9 memory between chips is discontiguous.
      
      For the other platforms sparsemem should work fine, though it might add
      a small amount of overhead. We can always force FLATMEM in the
      defconfigs if necessary.
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • M
      powerpc/nohash: Fix use of mmu_has_feature() in setup_initial_memory_limit() · 4868e350
      Michael Ellerman authored
      setup_initial_memory_limit() is called from early_init_devtree(), which
      runs prior to feature patching. If the kernel is built with CONFIG_JUMP_LABEL=y
      and CONFIG_JUMP_LABEL_FEATURE_CHECKS=y then we will potentially get the
      wrong value.
      
      If we also have CONFIG_JUMP_LABEL_FEATURE_CHECK_DEBUG=y we get a warning
      and backtrace:
      
        Warning! mmu_has_feature() used prior to jump label init!
        CPU: 0 PID: 0 Comm: swapper Not tainted 4.11.0-rc4-gccN-next-20170331-g6af2434 #1
        Call Trace:
        [c000000000fc3d50] [c000000000a26c30] .dump_stack+0xa8/0xe8 (unreliable)
        [c000000000fc3de0] [c00000000002e6b8] .setup_initial_memory_limit+0xa4/0x104
        [c000000000fc3e60] [c000000000d5c23c] .early_init_devtree+0xd0/0x2f8
        [c000000000fc3f00] [c000000000d5d3b0] .early_setup+0x90/0x11c
        [c000000000fc3f90] [c000000000000520] start_here_multiplatform+0x68/0x80
      
      Fix it by using early_mmu_has_feature().
      
      Fixes: c12e6f24 ("powerpc: Add option to use jump label for mmu_has_feature()")
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • M
      powerpc: Remove unnecessary includes of asm/debug.h · 3ae05fb3
      Michael Ellerman authored
      These files don't seem to have any need for asm/debug.h, now that all it
      includes are the debugger hooks and breakpoint definitions.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • M
      powerpc: Create asm/debugfs.h and move powerpc_debugfs_root there · 7644d581
      Michael Ellerman authored
      powerpc_debugfs_root is the dentry representing the root of the
      "powerpc" directory tree in debugfs.
      
      Currently it sits in asm/debug.h, along with some other things that
      have "debug" in the name, but are otherwise unrelated.
      
      Pull it out into a separate header, which also includes linux/debugfs.h,
      and convert all the users to include debugfs.h instead of debug.h.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • A
      powerpc/powernv: Require MMU_NOTIFIER to fix NPU build · abfe8026
      Alistair Popple authored
      In the recent commit 1ab66d1f ("powerpc/powernv: Introduce address
      translation services for Nvlink2") the NPU code gained a dependency on MMU
      notifiers.
      
      All our defconfigs have KVM enabled, which selects MMU_NOTIFIER, but if KVM is
      not enabled then the build breaks.
      
      Fix it by always selecting MMU_NOTIFIER when we're building powernv.
      
      Fixes: 1ab66d1f ("powerpc/powernv: Introduce address translation services for Nvlink2")
      Signed-off-by: Alistair Popple <alistair@popple.id.au>
      [mpe: Reword change log]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • A
      powerpc/mm/radix: Remove unnecessary ptesync · f7327e0b
      Aneesh Kumar K.V authored
      For a tlbiel with pid, we need to issue tlbiel with set number encoded. We
      don't need to do ptesync for each of those. Instead we need one for the entire
      tlbiel pid operation.
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • A
      powerpc/mm/radix: Don't do page walk cache flush when doing full mm flush · f6b0df55
      Aneesh Kumar K.V authored
      For fullmm tlb flush, we do a flush with RIC_FLUSH_ALL which will invalidate all
      related caches (radix__tlb_flush()). Hence the pwc flush is not needed.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  2. 04 Apr, 2017 3 commits
    • M
      powerpc/powernv: Add OPAL exports attributes to sysfs · 11fe909d
      Matt Brown authored
      New versions of OPAL have a device node /ibm,opal/firmware/exports, each
      property of which describes a range of memory in OPAL that Linux might
      want to export to userspace for debugging.
      
      This patch adds a sysfs file under 'opal/exports' for each property
      found there, and makes it read-only by root.
      Signed-off-by: Matt Brown <matthew.brown.dev@gmail.com>
      [mpe: Drop counting of props, rename to attr, free on sysfs error, c'log]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • S
      powerpc/prom: Increase minimum RMA size to 512MB · 687da8fc
      Sukadev Bhattiprolu authored
      When booting very large systems with a large initrd, we run out of
      space early in boot for either RTAS or the flattened device tree (FDT).
      Boot fails with messages like:
      
      	Could not allocate memory for RTAS
      or
      	No memory for flatten_device_tree (no room)
      
      Increasing the minimum RMA size to 512MB fixes the problem. This
      should not have an impact on smaller LPARs (with 256MB memory),
      as the firmware will cap the RMA to the memory assigned to the LPAR.
      
      Fix is based on input/discussions with Michael Ellerman. Thanks to
      Praveen K. Pandey for testing on a large system.
      Reported-by: Praveen K. Pandey <preveen.pandey@in.ibm.com>
      Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
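The capping argument in this commit amounts to a simple minimum. The sketch below is only an arithmetic illustration of that reasoning; `rma_lower_bound_mb()` is a hypothetical helper, not a firmware or kernel interface, and it models only the lower bound on the RMA, not the size the firmware finally assigns.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Lower bound on the RMA size (in MB) that an LPAR can expect: the
 * requested minimum, capped at the memory assigned to the LPAR.
 */
static uint64_t rma_lower_bound_mb(uint64_t requested_min_mb,
                                   uint64_t lpar_mem_mb)
{
    return requested_min_mb < lpar_mem_mb ? requested_min_mb : lpar_mem_mb;
}
```

With a requested minimum of 512MB, a 256MB LPAR still ends up with a 256MB bound, which is why smaller LPARs are not expected to be affected.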
    • A
      powerpc/powernv: Introduce address translation services for Nvlink2 · 1ab66d1f
      Alistair Popple authored
      Nvlink2 supports address translation services (ATS) allowing devices
      to request address translations from an MMU known as the nest MMU,
      which is set up to walk the CPU page tables.
      
      To access this functionality certain firmware calls are required to
      set up and manage hardware context tables in the nvlink processing unit
      (NPU). The NPU also manages forwarding of TLB invalidates (known as
      address translation shootdowns/ATSDs) to attached devices.
      
      This patch exports several methods to allow device drivers to register
      a process id (PASID/PID) in the hardware tables and to receive
      notification of when a device should stop issuing address translation
      requests (ATRs). It also adds a fault handler to allow device drivers
      to demand fault pages in.
      Signed-off-by: Alistair Popple <alistair@popple.id.au>
      [mpe: Fix up comment formatting, use flush_tlb_mm()]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  3. 03 Apr, 2017 6 commits
  4. 01 Apr, 2017 5 commits
    • A
      powerpc/mm: Enable mappings above 128TB · f4ea6dcb
      Aneesh Kumar K.V authored
      Not all user space applications are ready to handle wide addresses.
      It's known that at least some JIT compilers use the higher bits in
      pointers to encode their own information. This collides with valid
      pointers in a 512TB address space and leads to crashes.
      
      To mitigate this, we are not going to allocate virtual address space
      above 128TB by default.
      
      But userspace can ask for allocation from the full address space by
      specifying a hint address (with or without MAP_FIXED) above 128TB.
      
      If the hint address is above 128TB but MAP_FIXED is not specified, we
      try to find an unmapped area at the specified address. If it's already
      occupied, we look for an unmapped area in the *full* address space,
      rather than in the 128TB window.
      
      This approach makes it easy to make an application's memory allocator
      aware of the large address space without manually tracking allocated
      virtual address space.
      
      This is going to be a per-mmap decision, i.e. we can have some mmaps
      with larger addresses and others that do not.
      
      A sample memory layout looks like:
      
        10000000-10010000 r-xp 00000000 fc:00 9057045          /home/max_addr_512TB
        10010000-10020000 r--p 00000000 fc:00 9057045          /home/max_addr_512TB
        10020000-10030000 rw-p 00010000 fc:00 9057045          /home/max_addr_512TB
        10029630000-10029660000 rw-p 00000000 00:00 0          [heap]
        7fff834a0000-7fff834b0000 rw-p 00000000 00:00 0
        7fff834b0000-7fff83670000 r-xp 00000000 fc:00 9177190  /lib/powerpc64le-linux-gnu/libc-2.23.so
        7fff83670000-7fff83680000 r--p 001b0000 fc:00 9177190  /lib/powerpc64le-linux-gnu/libc-2.23.so
        7fff83680000-7fff83690000 rw-p 001c0000 fc:00 9177190  /lib/powerpc64le-linux-gnu/libc-2.23.so
        7fff83690000-7fff836a0000 rw-p 00000000 00:00 0
        7fff836a0000-7fff836c0000 r-xp 00000000 00:00 0        [vdso]
        7fff836c0000-7fff83700000 r-xp 00000000 fc:00 9177193  /lib/powerpc64le-linux-gnu/ld-2.23.so
        7fff83700000-7fff83710000 r--p 00030000 fc:00 9177193  /lib/powerpc64le-linux-gnu/ld-2.23.so
        7fff83710000-7fff83720000 rw-p 00040000 fc:00 9177193  /lib/powerpc64le-linux-gnu/ld-2.23.so
        7fffdccf0000-7fffdcd20000 rw-p 00000000 00:00 0        [stack]
        1000000000000-1000000010000 rw-p 00000000 00:00 0
        1ffff83710000-1ffff83720000 rw-p 00000000 00:00 0
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
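From user space, opting in as described above comes down to passing a hint address above 128TB to mmap(). The following is a minimal sketch assuming a 64-bit Linux target; `DEFAULT_WINDOW_END`, `HIGH_HINT`, `above_default_window()` and `map_with_high_hint()` are hypothetical names, and the 256TB hint is chosen only because it matches the first high mapping in the sample layout.

```c
#define _DEFAULT_SOURCE  /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* End of the default allocation window described above: 128TB. */
#define DEFAULT_WINDOW_END (UINT64_C(128) << 40)
/* Illustrative hint address: 256TB, i.e. 0x1000000000000. */
#define HIGH_HINT (UINT64_C(256) << 40)

/* True if addr lies at or above the 128TB boundary. */
static int above_default_window(const void *addr)
{
    return (uint64_t)(uintptr_t)addr >= DEFAULT_WINDOW_END;
}

/*
 * Ask for an anonymous mapping with a hint above 128TB and without
 * MAP_FIXED. The hint is only a request: check the result against
 * MAP_FAILED, then use above_default_window() to see where it landed.
 */
static void *map_with_high_hint(size_t len)
{
    void *hint = (void *)(uintptr_t)HIGH_HINT;

    assert(len > 0);
    return mmap(hint, len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}
```

On a kernel or architecture without a larger address space the call can still succeed and simply return a lower address, so the result should be checked rather than assumed to be above the boundary.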
    • A
    • A
      powerpc/pseries: Skip using reserved virtual address range · 82228e36
      Aneesh Kumar K.V authored
      Now that we use all of the available virtual address range, we need to
      make sure we don't generate a VSID that overlaps with the reserved VSID
      range. The reserved VSID range includes the virtual address range used
      by the adjunct partition and also the VRMA virtual segment. We find the
      context value that can result in generating such a VSID and reserve it
      early in boot.
      
      We don't look at the adjunct range, because for now we disable the
      adjunct usage in a Linux LPAR via CAS interface.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      [mpe: Rewrite hash__reserve_context_id(), move the rest into pseries]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • A
      powerpc/mm/hash: Store addr_limit in PACA · bb183221
      Aneesh Kumar K.V authored
      We optimize the slice page size array copy to the PACA by copying only
      the range covered by addr_limit. This requires us not to look at the
      page size array beyond addr_limit in the PACA on an SLB fault. To
      enable that, copy the task size to the PACA, where it will be used
      during an SLB fault.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      [mpe: Rename from task_size to addr_limit, consolidate #ifdefs]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • A
      powerpc/mm: Add addr_limit to mm_context and use it to derive max slice index · 957b778a
      Aneesh Kumar K.V authored
      In a follow-up patch, we will increase the slice array size to handle
      a 512TB range, but will limit the max address to 128TB. Avoid doing
      unnecessary computation and avoid doing slice mask related operations
      above the address limit.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  5. 31 Mar, 2017 14 commits