1. 02 6月, 2017 5 次提交
  2. 17 5月, 2017 1 次提交
    • M
      powerpc/mm: Fix crash in page table dump with huge pages · bfb9956a
      Michael Ellerman 提交于
      The page table dump code doesn't know about huge pages, so currently
      it crashes (or walks random memory, usually leading to a crash), if it
      finds a huge page. On Book3S we only see huge pages in the Linux page
      tables when we're using the P9 Radix MMU.
      
      Teaching the code to properly handle huge pages is a bit more involved,
      so for now just prevent the crash.
      
      Cc: stable@vger.kernel.org # v4.10+
      Fixes: 8eb07b18 ("powerpc/mm: Dump linux pagetables")
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      bfb9956a
  3. 09 5月, 2017 1 次提交
  4. 03 5月, 2017 1 次提交
    • M
      powerpc/mm/radix: Drop support for CPUs without lockless tlbie · 3c9ac2bc
      Michael Ellerman 提交于
      Currently the radix TLB code includes support for CPUs that do *not*
      have MMU_FTR_LOCKLESS_TLBIE. On those CPUs we are required to take a
      global spinlock before issuing a tlbie.
      
      Radix can only be built for 64-bit Book3s CPUs, and of those, only
      POWER4, 970, Cell and PA6T do not have MMU_FTR_LOCKLESS_TLBIE. Although
      it's possible to build a kernel with Radix support that can also boot on
      those CPUs, we happen to know that in reality none of those CPUs support
      the Radix MMU, so the code can never actually run on those CPUs.
      
      So remove the native_tlbie_lock in the Radix TLB code.
      
      Note that there is another lock of the same name in the hash code, which
      is unaffected by this patch.
      Reviewed-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      3c9ac2bc
  5. 27 4月, 2017 6 次提交
  6. 26 4月, 2017 1 次提交
  7. 21 4月, 2017 1 次提交
    • M
      powerpc/mm: Add support for runtime configuration of ASLR limits · 9fea59bd
      Michael Ellerman 提交于
      Add powerpc support for mmap_rnd_bits and mmap_rnd_compat_bits, which are two
      sysctls that allow a user to configure the number of bits of randomness used for
      ASLR.
      
      Because of the way the Kconfig for ARCH_MMAP_RND_BITS is defined, we have to
      construct at least the MIN value in Kconfig, vs in a header which would be more
      natural. Given that we just go ahead and do it all in Kconfig.
      
      At least according to the code (the documentation makes no mention of it), the
      value is defined as the number of bits of randomisation *of the page*, not the
      address. This makes some sense, with larger page sizes more of the low bits are
      forced to zero, which would reduce the randomisation if we didn't take the
      PAGE_SIZE into account. However it does mean the min/max values have to change
      depending on the PAGE_SIZE in order to actually limit the amount of address
      space consumed by the randomisation.
      
      The result of that is that we have to define the default values based on both
      32-bit vs 64-bit, but also the configured PAGE_SIZE. Furthermore now that we
      have 128TB address space support on Book3S, we also have to take that into
      account.
      
      Finally we can wire up the value in arch_mmap_rnd().
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NBhupesh Sharma <bhsharma@redhat.com>
      Tested-by: NBhupesh Sharma <bhsharma@redhat.com>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      9fea59bd
  8. 19 4月, 2017 3 次提交
    • A
      powerpc/iommu: Do not call PageTransHuge() on tail pages · e889e96e
      Alexey Kardashevskiy 提交于
      The CMA pages migration code does not support compound pages at
      the moment so it performs few tests before proceeding to actual page
      migration.
      
      One of the tests - PageTransHuge() - has VM_BUG_ON_PAGE(PageTail()) as
      it is designed to be called on head pages only. Since we also test for
      PageCompound(), and it contains PageTail() and PageHead(), we can
      simplify the check by leaving just PageCompound() and therefore avoid
      possible VM_BUG_ON_PAGE.
      
      Fixes: 2e5bbb54 ("KVM: PPC: Book3S HV: Migrate pinned pages out of CMA")
      Cc: stable@vger.kernel.org # v4.9+
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: NBalbir Singh <bsingharora@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      e889e96e
    • A
      powerpc/mmap: Any hint > 128TB searches the full VA space · 321f7d29
      Aneesh Kumar K.V 提交于
      As part of the new large address space support, processes start out life with a
      128TB virtual address space. However when calling mmap() a process can pass a
      hint address, and if that hint is > 128TB the kernel will use the full 512TB
      address space to try and satisfy the mmap() request.
      
      Currently we have a check that the hint is > 128TB and < 512TB (TASK_SIZE),
      which was added as an optimisation to avoid updating addr_limit unnecessarily
      and also to avoid calling slice_flush_segments() on all CPUs more than
      necessary.
      
      However this has the user-visible side effect that an mmap() hint above 512TB
      does not search the full address space unless a preceding mmap() used a hint
      value > 128TB && < 512TB.
      
      So fix it to treat any hint above 128TB as a hint to search the full address
      space, instead of checking the hint against TASK_SIZE, we instead check if the
      addr_limit is already == TASK_SIZE.
      
      This also brings the ABI in-line with what is proposed on x86. ie, that a hint
      address above 128TB up to and including (2^64)-1 is an indication to search the
      full address space.
      
      Fixes: f4ea6dcb (powerpc/mm: Enable mappings above 128TB)
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      321f7d29
    • A
      powerpc/mm/radix: Use mm->task_size for boundary checking instead of addr_limit · be77e999
      Aneesh Kumar K.V 提交于
      We don't init addr_limit correctly for 32 bit applications. So default to using
      mm->task_size for boundary condition checking. We use addr_limit to only control
      free space search. This makes sure that we do the right thing with 32 bit
      applications.
      
      We should consolidate the usage of TASK_SIZE/mm->task_size and
      mm->context.addr_limit later.
      
      This partially reverts commit fbfef902 (powerpc/mm: Switch some
      TASK_SIZE checks to use mm_context addr_limit).
      
      Fixes: fbfef902 ("powerpc/mm: Switch some TASK_SIZE checks to use mm_context addr_limit")
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      be77e999
  9. 13 4月, 2017 1 次提交
  10. 12 4月, 2017 3 次提交
  11. 11 4月, 2017 5 次提交
    • A
      powerpc/mm: Remove reduntant initmem information from log · ea614555
      Anshuman Khandual 提交于
      Generic core VM already prints these information in the log
      buffer, hence there is no need for a second print. This just
      removes the second print from arch powerpc NUMA init path.
      
      Before the patch:
      
        $ dmesg | grep "Initmem"
      
        numa: Initmem setup node 0 [mem 0x00000000-0xffffffff]
        numa: Initmem setup node 1 [mem 0x100000000-0x1ffffffff]
        numa: Initmem setup node 2 [mem 0x200000000-0x2ffffffff]
        numa: Initmem setup node 3 [mem 0x300000000-0x3ffffffff]
        numa: Initmem setup node 4 [mem 0x400000000-0x4ffffffff]
        numa: Initmem setup node 5 [mem 0x500000000-0x5ffffffff]
        numa: Initmem setup node 6 [mem 0x600000000-0x6ffffffff]
        numa: Initmem setup node 7 [mem 0x700000000-0x7ffffffff]
        Initmem setup node 0 [mem 0x0000000000000000-0x00000000ffffffff]
        Initmem setup node 1 [mem 0x0000000100000000-0x00000001ffffffff]
        Initmem setup node 2 [mem 0x0000000200000000-0x00000002ffffffff]
        Initmem setup node 3 [mem 0x0000000300000000-0x00000003ffffffff]
        Initmem setup node 4 [mem 0x0000000400000000-0x00000004ffffffff]
        Initmem setup node 5 [mem 0x0000000500000000-0x00000005ffffffff]
        Initmem setup node 6 [mem 0x0000000600000000-0x00000006ffffffff]
        Initmem setup node 7 [mem 0x0000000700000000-0x00000007ffffffff]
      
      After the patch just the latter set is printed.
      Signed-off-by: NAnshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      ea614555
    • M
      powerpc/nohash: Fix use of mmu_has_feature() in setup_initial_memory_limit() · 4868e350
      Michael Ellerman 提交于
      setup_initial_memory_limit() is called from early_init_devtree(), which
      runs prior to feature patching. If the kernel is built with CONFIG_JUMP_LABEL=y
      and CONFIG_JUMP_LABEL_FEATURE_CHECKS=y then we will potentially get the
      wrong value.
      
      If we also have CONFIG_JUMP_LABEL_FEATURE_CHECK_DEBUG=y we get a warning
      and backtrace:
      
        Warning! mmu_has_feature() used prior to jump label init!
        CPU: 0 PID: 0 Comm: swapper Not tainted 4.11.0-rc4-gccN-next-20170331-g6af2434 #1
        Call Trace:
        [c000000000fc3d50] [c000000000a26c30] .dump_stack+0xa8/0xe8 (unreliable)
        [c000000000fc3de0] [c00000000002e6b8] .setup_initial_memory_limit+0xa4/0x104
        [c000000000fc3e60] [c000000000d5c23c] .early_init_devtree+0xd0/0x2f8
        [c000000000fc3f00] [c000000000d5d3b0] .early_setup+0x90/0x11c
        [c000000000fc3f90] [c000000000000520] start_here_multiplatform+0x68/0x80
      
      Fix it by using early_mmu_has_feature().
      
      Fixes: c12e6f24 ("powerpc: Add option to use jump label for mmu_has_feature()")
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      4868e350
    • M
      powerpc: Create asm/debugfs.h and move powerpc_debugfs_root there · 7644d581
      Michael Ellerman 提交于
      powerpc_debugfs_root is the dentry representing the root of the
      "powerpc" directory tree in debugfs.
      
      Currently it sits in asm/debug.h, a long with some other things that
      have "debug" in the name, but are otherwise unrelated.
      
      Pull it out into a separate header, which also includes linux/debugfs.h,
      and convert all the users to include debugfs.h instead of debug.h.
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      7644d581
    • A
      powerpc/mm/radix: Remove unnecessary ptesync · f7327e0b
      Aneesh Kumar K.V 提交于
      For a tlbiel with pid, we need to issue tlbiel with set number encoded. We
      don't need to do ptesync for each of those. Instead we need one for the entire
      tlbiel pid operation.
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NAnton Blanchard <anton@samba.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      f7327e0b
    • A
      powerpc/mm/radix: Don't do page walk cache flush when doing full mm flush · f6b0df55
      Aneesh Kumar K.V 提交于
      For fullmm tlb flush, we do a flush with RIC_FLUSH_ALL which will invalidate all
      related caches (radix__tlb_flush()). Hence the pwc flush is not needed.
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NAnton Blanchard <anton@samba.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      f6b0df55
  12. 05 4月, 2017 1 次提交
  13. 04 4月, 2017 1 次提交
    • A
      powerpc/powernv: Introduce address translation services for Nvlink2 · 1ab66d1f
      Alistair Popple 提交于
      Nvlink2 supports address translation services (ATS) allowing devices
      to request address translations from an mmu known as the nest MMU
      which is setup to walk the CPU page tables.
      
      To access this functionality certain firmware calls are required to
      setup and manage hardware context tables in the nvlink processing unit
      (NPU). The NPU also manages forwarding of TLB invalidates (known as
      address translation shootdowns/ATSDs) to attached devices.
      
      This patch exports several methods to allow device drivers to register
      a process id (PASID/PID) in the hardware tables and to receive
      notification of when a device should stop issuing address translation
      requests (ATRs). It also adds a fault handler to allow device drivers
      to demand fault pages in.
      Signed-off-by: NAlistair Popple <alistair@popple.id.au>
      [mpe: Fix up comment formatting, use flush_tlb_mm()]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      1ab66d1f
  14. 03 4月, 2017 2 次提交
    • O
      powerpc/mm: Remove stale comment about the DART hole · f6f9195b
      Oliver O'Halloran 提交于
      The code to fix the problem it describes was removed in commit
      c40785ad ("powerpc/dart: Use a cachable DART"), and it uses the
      stupid comment style. Away it goooooooooooooes!
      Signed-off-by: NOliver O'Halloran <oohall@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      f6f9195b
    • A
      powerpc: Avoid taking a data miss on every userspace instruction miss · a7a9dcd8
      Anton Blanchard 提交于
      Early on in do_page_fault() we call store_updates_sp(), regardless of
      the type of exception. For an instruction miss this doesn't make
      sense, because we only use this information to detect if a data miss
      is the result of a stack expansion instruction or not.
      
      Worse still, it results in a data miss within every userspace
      instruction miss handler, because we try and load the very instruction
      we are about to install a pte for!
      
      A simple exec microbenchmark runs 6% faster on POWER8 with this fix:
      
       #include <stdlib.h>
       #include <stdio.h>
       #include <unistd.h>
      
      int main(int argc, char *argv[])
      {
      	unsigned long left = atol(argv[1]);
      	char leftstr[16];
      
      	if (left-- == 0)
      		return 0;
      
      	sprintf(leftstr, "%ld", left);
      	execlp(argv[0], argv[0], leftstr, NULL);
      	perror("exec failed\n");
      
      	return 0;
      }
      
      Pass the number of iterations on the command line (eg 10000) and time
      how long it takes to execute.
      Signed-off-by: NAnton Blanchard <anton@samba.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      a7a9dcd8
  15. 01 4月, 2017 5 次提交
    • A
      powerpc/mm: Enable mappings above 128TB · f4ea6dcb
      Aneesh Kumar K.V 提交于
      Not all user space application is ready to handle wide addresses. It's
      known that at least some JIT compilers use higher bits in pointers to
      encode their information. It collides with valid pointers with 512TB
      addresses and leads to crashes.
      
      To mitigate this, we are not going to allocate virtual address space
      above 128TB by default.
      
      But userspace can ask for allocation from full address space by
      specifying hint address (with or without MAP_FIXED) above 128TB.
      
      If hint address set above 128TB, but MAP_FIXED is not specified, we try
      to look for unmapped area by specified address. If it's already
      occupied, we look for unmapped area in *full* address space, rather than
      from 128TB window.
      
      This approach helps to easily make application's memory allocator aware
      about large address space without manually tracking allocated virtual
      address space.
      
      This is going to be a per mmap decision. ie, we can have some mmaps with
      larger addresses and other that do not.
      
      A sample memory layout looks like:
      
        10000000-10010000 r-xp 00000000 fc:00 9057045          /home/max_addr_512TB
        10010000-10020000 r--p 00000000 fc:00 9057045          /home/max_addr_512TB
        10020000-10030000 rw-p 00010000 fc:00 9057045          /home/max_addr_512TB
        10029630000-10029660000 rw-p 00000000 00:00 0          [heap]
        7fff834a0000-7fff834b0000 rw-p 00000000 00:00 0
        7fff834b0000-7fff83670000 r-xp 00000000 fc:00 9177190  /lib/powerpc64le-linux-gnu/libc-2.23.so
        7fff83670000-7fff83680000 r--p 001b0000 fc:00 9177190  /lib/powerpc64le-linux-gnu/libc-2.23.so
        7fff83680000-7fff83690000 rw-p 001c0000 fc:00 9177190  /lib/powerpc64le-linux-gnu/libc-2.23.so
        7fff83690000-7fff836a0000 rw-p 00000000 00:00 0
        7fff836a0000-7fff836c0000 r-xp 00000000 00:00 0        [vdso]
        7fff836c0000-7fff83700000 r-xp 00000000 fc:00 9177193  /lib/powerpc64le-linux-gnu/ld-2.23.so
        7fff83700000-7fff83710000 r--p 00030000 fc:00 9177193  /lib/powerpc64le-linux-gnu/ld-2.23.so
        7fff83710000-7fff83720000 rw-p 00040000 fc:00 9177193  /lib/powerpc64le-linux-gnu/ld-2.23.so
        7fffdccf0000-7fffdcd20000 rw-p 00000000 00:00 0        [stack]
        1000000000000-1000000010000 rw-p 00000000 00:00 0
        1ffff83710000-1ffff83720000 rw-p 00000000 00:00 0
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      f4ea6dcb
    • A
    • A
      powerpc/pseries: Skip using reserved virtual address range · 82228e36
      Aneesh Kumar K.V 提交于
      Now that we use all the available virtual address range, we need to make
      sure we don't generate VSID such that it overlaps with the reserved vsid
      range. Reserved vsid range include the virtual address range used by the
      adjunct partition and also the VRMA virtual segment. We find the context
      value that can result in generating such a VSID and reserve it early in
      boot.
      
      We don't look at the adjunct range, because for now we disable the
      adjunct usage in a Linux LPAR via CAS interface.
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      [mpe: Rewrite hash__reserve_context_id(), move the rest into pseries]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      82228e36
    • A
      powerpc/mm/hash: Store addr_limit in PACA · bb183221
      Aneesh Kumar K.V 提交于
      We optmize the slice page size array copy to paca by copying only the
      range based on addr_limit. This will require us to not look at page size
      array beyond addr_limit in PACA on slb fault. To enable that copy task
      size to paca which will be used during slb fault.
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      [mpe: Rename from task_size to addr_limit, consolidate #ifdefs]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      bb183221
    • A
      powerpc/mm: Add addr_limit to mm_context and use it to derive max slice index · 957b778a
      Aneesh Kumar K.V 提交于
      In the followup patch, we will increase the slice array size to handle
      512TB range, but will limit the max addr to 128TB. Avoid doing
      unnecessary computation and avoid doing slice mask related operation
      above address limit.
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      957b778a
  16. 31 3月, 2017 3 次提交
    • A
      powerpc/mm/hash: Support 68 bit VA · e6f81a92
      Aneesh Kumar K.V 提交于
      Inorder to support large effective address range (512TB), we want to
      increase the virtual address bits to 68. But we do have platforms like
      p4 and p5 that can only do 65 bit VA. We support those platforms by
      limiting context bits on them to 16.
      
      The protovsid -> vsid conversion is verified to work with both 65 and 68
      bit va values. I also documented the restrictions in a table format as
      part of code comments.
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      e6f81a92
    • A
      powerpc/mm/hash: Use context ids 1-4 for the kernel · 941711a3
      Aneesh Kumar K.V 提交于
      Currently we use the top 4 context ids (0x7fffc-0x7ffff) for the kernel.
      Kernel VSIDs are built using these top context values and effective the
      segement ID. In subsequent patches we want to increase the max effective
      address to 512TB. We will achieve that by increasing the effective
      segment IDs there by increasing virtual address range.
      
      We will be switching to a 68bit virtual address in the following patch.
      But platforms like Power4 and Power5 only support a 65 bit virtual
      address. We will handle that by limiting the context bits to 16 instead
      of 19 on those platforms. That means the max context id will have a
      different value on different platforms.
      
      So that we don't have to deal with the kernel context ids changing
      between different platforms, move the kernel context ids down to use
      context ids 1-4.
      
      We can't use segment 0 of context-id 0, because that maps to VSID 0,
      which we want to keep as invalid, so we avoid context-id 0 entirely.
      Similarly we can't use the last segment of the maximum context, so we
      avoid it too.
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      [mpe: Switch from 0-3 to 1-4 so VSID=0 remains invalid]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      941711a3
    • M
      powerpc/mm: Split radix vs hash mm context initialisation · 760573c1
      Michael Ellerman 提交于
      Complete the split of the radix vs hash mm context initialisation.
      
      This is mostly code movement, with the exception that we now limit the
      context allocation to PRTB_ENTRIES - 1 on radix.
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      760573c1