1. 31 August 2017 (1 commit)
  2. 28 July 2017 (1 commit)
  3. 26 July 2017 (1 commit)
    • powerpc/mm/radix: Workaround prefetch issue with KVM · a25bd72b
      Committed by Benjamin Herrenschmidt
      There's a somewhat architectural issue with Radix MMU and KVM.
      
      When coming out of a guest with AIL (Alternate Interrupt Location, ie,
      MMU enabled), we start executing hypervisor code with the PID register
      still containing whatever the guest has been using.
      
      The problem is that the CPU can (and will) then start prefetching or
      speculatively loading from whatever host context has that same PID (if
      any), thus bringing translations for that context into the TLB, which
      Linux doesn't know about.
      
      This can cause stale translations and subsequent crashes.
      
      Fixing this in a way that avoids both races and a huge performance
      impact is difficult. We could just make the host invalidations always
      use broadcast forms, but that would hurt single threaded programs, for
      example.
      
      We chose to fix it instead by partitioning the PID space between guest
      and host. This is possible because today Linux only uses 19 of the
      20 bits of PID space, so existing guests will keep working if we make
      the host use the top half of the 20-bit space.
      
      We additionally add support for a property to indicate to Linux the
      size of the PID register, which will be useful if we eventually have
      processors with a larger PID space available.
      
      There is still an issue with malicious guests purposefully setting the
      PID register to a value in the host's PID range. Hopefully future HW
      can prevent that, but in the meantime we handle it with a pair of
      kludges (sketched after this entry):
      
       - On the way out of a guest, before we clear the current VCPU in the
         PACA, we check the PID and if it's outside of the permitted range
         we flush the TLB for that PID.
      
       - When context switching, if the mm is "new" on that CPU (the
         corresponding bit was set for the first time in the mm cpumask), we
         check if any sibling thread is in KVM (has a non-NULL VCPU pointer
         in the PACA). If that is the case, we also flush the PID for that
         CPU (core).
      
      This second part is needed to handle the case where a process is
      migrated (or starts a new pthread) on a sibling thread of the CPU
      coming out of KVM, as there's a window where stale translations can
      exist before we detect it and flush them out.
      
      A future optimization could be added by keeping track of whether the
      PID has ever been used and avoiding the flush for completely fresh
      PIDs. We could similarly mark PIDs that have been the subject of a
      global invalidation as "fresh". But for now this will do.
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      [mpe: Rework the asm to build with CONFIG_PPC_RADIX_MMU=n, drop
            unneeded include of kvm_book3s_asm.h]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      a25bd72b
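      A minimal sketch of the two kludges in C, assuming hypothetical helper
      names (flush_tlb_pid(), sibling_thread_in_kvm()); the real code lives
      in the KVM guest-exit path and the powerpc context switch code:

        /* Linux uses 19 of the 20 PID bits; host PIDs take the top half. */
        #define HOST_PID_MIN    (1u << 19)

        /* 1. On the way out of the guest, before clearing the VCPU pointer
         *    in the PACA: flush any PID that strays into the host range. */
        static void kvm_exit_check_pid(unsigned int guest_pid)
        {
                if (guest_pid >= HOST_PID_MIN)
                        flush_tlb_pid(guest_pid);     /* hypothetical helper */
        }

        /* 2. On context switch, when this mm runs on this CPU for the first
         *    time (its bit in the mm cpumask was just set), flush the PID if
         *    a sibling thread is in KVM (non-NULL VCPU pointer in the PACA). */
        static void ctx_switch_check(struct mm_struct *mm, unsigned int pid)
        {
                if (!cpumask_test_and_set_cpu(smp_processor_id(),
                                              mm_cpumask(mm)) &&
                    sibling_thread_in_kvm())          /* hypothetical helper */
                        flush_tlb_pid(pid);
        }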
  4. 18 July 2017 (1 commit)
    • powerpc/mm: Mark __init memory no-execute when STRICT_KERNEL_RWX=y · 029d9252
      Committed by Michael Ellerman
      Currently even with STRICT_KERNEL_RWX we leave the __init text marked
      executable after init, which is bad.
      
      Add a hook to mark it NX (no-execute) before we free it, and implement
      it for radix and hash.
      
      Note that we use __init_end as the end address, not _einittext,
      because overlaps_kernel_text() uses __init_end, and because there are
      additional executable sections other than .init.text between
      __init_begin and __init_end.
      
      Tested on radix and hash with:
      
        0:mon> p $__init_begin
        *** 400 exception occurred
      
      Fixes: 1e0fc9d1 ("powerpc/Kconfig: Enable STRICT_KERNEL_RWX for some configs")
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      029d9252
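      A sketch of the shape of such a hook; the hook name and the NX helper
      below are assumptions (the patch implements the marking separately for
      radix and hash):

        /* Mark [__init_begin, __init_end) no-execute before freeing it. */
        void mark_initmem_nx(void)
        {
                unsigned long start = (unsigned long)__init_begin;
                /* __init_end, not _einittext: overlaps_kernel_text() checks
                 * against __init_end, and other executable sections sit
                 * between __init_begin and __init_end. */
                unsigned long end = (unsigned long)__init_end;

                change_memory_nx(start, end);   /* hypothetical NX helper */
        }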
  5. 13 July 2017 (1 commit)
    • mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic · dcda9b04
      Committed by Michal Hocko
      __GFP_REPEAT was designed to allow a retry-but-eventually-fail
      semantic in the page allocator.  This has been true, but only for
      allocation requests larger than PAGE_ALLOC_COSTLY_ORDER.  It has
      always been ignored for smaller sizes.  This is a bit unfortunate
      because there is no way to express the same semantic for those
      requests, and they are considered too important to fail, so they might
      end up looping in the page allocator forever, similarly to GFP_NOFAIL
      requests.
      
      Now that the whole tree has been cleaned up and accidental or misled
      usage of the __GFP_REPEAT flag has been removed for !costly requests,
      we can give the original flag a better name and, more importantly, a
      more useful semantic.  Let's rename it to __GFP_RETRY_MAYFAIL, which
      tells the user that the allocator will try really hard but there is no
      promise of success.  This works independently of the order and
      overrides the default allocator behavior.  Page allocator users have
      several levels of guarantee vs. cost options (take GFP_KERNEL as an
      example):
      
       - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
         attempt to free memory at all. The most lightweight mode, which
         doesn't even kick background reclaim. Should be used carefully
         because it might deplete the memory and the next user might hit the
         more aggressive reclaim.
      
       - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT) - optimistic
         allocation without any attempt to free memory from the current
         context, but can wake kswapd to reclaim memory if the zone is below
         the low watermark. Can be used from either atomic contexts or when
         the request is a performance optimization and there is another
         fallback for a slow path.
      
       - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
         non-sleeping allocation with an expensive fallback so it can access
         some portion of the memory reserves. Usually used from interrupt/bh
         context with an expensive slow path fallback.
      
       - GFP_KERNEL - both background and direct reclaim are allowed and the
         _default_ page allocator behavior is used. That means that !costly
         allocation requests are basically nofail but there is no guarantee of
         that behavior so failures have to be checked properly by callers
         (e.g. OOM killer victim is allowed to fail currently).
      
       - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
         and all allocation requests fail early rather than cause disruptive
         reclaim (one round of reclaim in this implementation). The OOM killer
         is not invoked.
      
       - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
         behavior and all allocation requests try really hard. The request
         will fail if the reclaim cannot make any progress. The OOM killer
         won't be triggered.
      
       - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
         and all allocation requests will loop endlessly until they succeed.
         This might be really dangerous especially for larger orders.
      
      Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
      because that was already their semantic.  No new users are added.
      __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL
      if there is no progress and we have already passed the OOM point.
      
      This means that all the reclaim opportunities have been exhausted
      except the most disruptive one (the OOM killer), and a user-defined
      fallback behavior is more sensible than retrying forever in the page
      allocator.
      
      [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
      [mhocko@suse.com: semantic fix]
        Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
      [mhocko@kernel.org: address other thing spotted by Vlastimil]
        Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alex Belits <alex.belits@cavium.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David Daney <david.daney@cavium.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dcda9b04
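      A short usage example of the new semantic (essentially the kvmalloc()
      pattern; all calls here are standard kernel API):

        #include <linux/gfp.h>
        #include <linux/slab.h>
        #include <linux/vmalloc.h>

        /* Try hard for physically contiguous memory, but accept failure
         * instead of triggering the OOM killer, then fall back to
         * vmalloc(). Free the result with kvfree() so either path works. */
        static void *alloc_big_buffer(size_t size)
        {
                void *buf = kmalloc(size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);

                if (!buf)
                        buf = vmalloc(size); /* virtually contiguous fallback */
                return buf;
        }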
  6. 07 July 2017 (1 commit)
  7. 04 July 2017 (1 commit)
  8. 02 July 2017 (1 commit)
  9. 08 June 2017 (1 commit)
  10. 05 June 2017 (1 commit)
    • powerpc/mm/book(e)(3s)/64: Add page table accounting · de3b8761
      Committed by Balbir Singh
      Introduce a helper pgtable_gfp_flags() which just returns the current
      gfp flags plus __GFP_ACCOUNT, to account for page table allocation.
      The generic helper is added to include/asm/pgalloc.h and has two
      variants - WARNING ugly bits ahead:

      1. If the header is included from a module, no check for
      mm == &init_mm is done, since init_mm is not exported.
      2. For kernel includes, the check is done, and it is required; see
      commit 3e79ec7d ("arch: x86: charge page tables to kmemcg").

      The fundamental assumption is that no module should be doing
      pgd/pud/pmd and pte allocs on behalf of init_mm directly.

      NOTE: This adds an overhead to pmd/pud/pgd allocations similar to
      x86.  The other alternative was to implement pmd_alloc_kernel,
      pud_alloc_kernel and pgd_alloc_kernel with their offset variants.

      For 4k page size, pte_alloc_one no longer calls
      pte_alloc_one_kernel.
      Signed-off-by: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      de3b8761
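      The helper boils down to a one-line decision; a sketch of its likely
      shape, combining the two variants described above into one listing:

        /* Charge page-table pages to kmemcg via __GFP_ACCOUNT, unless the
         * allocation is on behalf of init_mm, which is never accounted. */
        static inline gfp_t pgtable_gfp_flags(struct mm_struct *mm, gfp_t gfp)
        {
        #ifndef MODULE  /* init_mm isn't exported, so modules skip the check */
                if (unlikely(mm == &init_mm))
                        return gfp;
        #endif
                return gfp | __GFP_ACCOUNT;
        }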
  11. 09 May 2017 (1 commit)
    • powerpc/mm/book3s/64: Rework page table geometry for lower memory usage · ba95b5d0
      Committed by Michael Ellerman
      Recently in commit f6eedbba ("powerpc/mm/hash: Increase VA range to 128TB")
      we increased the virtual address space for user processes to 128TB by default,
      and up to 512TB if user space opts in.
      
      This obviously required expanding the range of the Linux page tables. For Book3s
      64-bit using hash and with PAGE_SIZE=64K, we increased the PGD to 2^15 entries.
      This meant we could cover the full address range, while still being able to
      insert a 16G hugepage at the PGD level and a 16M hugepage in the PMD.
      
      The downside of that geometry is that it uses a lot of memory for the PGD, and
      in particular makes the PGD a 4-page allocation, which means it's much more
      likely to fail under memory pressure.
      
      Instead we can make the PMD larger, so that a single PUD entry maps
      16G, allowing the 16G hugepages to sit at that level in the tree.
      We're then able to split the remaining bits between the PUD and PGD.
      We make the PGD slightly larger, as that results in lower memory usage
      for typical programs.
      
      When THP is enabled the PMD actually doubles in size, to 2^11 entries, or 2^14
      bytes, which is large but still < PAGE_SIZE.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: Balbir Singh <bsingharora@gmail.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      ba95b5d0
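      The arithmetic behind the new geometry can be checked in a few lines
      of plain C; the index sizes below are assumptions chosen to be
      consistent with the numbers quoted above (64K pages, 16G at the PUD
      level, a PMD that doubles to 2^11 entries under THP, and a 49-bit /
      512TB address space):

        #include <stdio.h>

        int main(void)
        {
                int page_shift = 16;  /* 64K pages */
                int pte_index  = 8,  pmd_index = 10;  /* 2^11 with THP */
                int pud_index  = 7,  pgd_index = 8;

                /* One PUD entry covers page + PTE + PMD bits: 34 -> 16G. */
                int pud_entry_bits = page_shift + pte_index + pmd_index;
                printf("PUD entry maps 2^%d bytes\n", pud_entry_bits);

                /* Full virtual address budget: 49 bits -> 512TB. */
                printf("VA space: 2^%d bytes\n",
                       pud_entry_bits + pud_index + pgd_index);

                /* THP PMD: 2^11 entries x 8 bytes = 16K, still < 64K. */
                printf("THP PMD size: %d bytes\n", (1 << (pmd_index + 1)) * 8);
                return 0;
        }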
  12. 28 April 2017 (1 commit)
  13. 27 April 2017 (1 commit)
  14. 12 April 2017 (1 commit)
    • powerpc/mm: Fix swapper_pg_dir size on 64-bit hash w/64K pages · 03dfee6d
      Committed by Michael Ellerman
      Recently in commit f6eedbba ("powerpc/mm/hash: Increase VA range to 128TB"),
      we increased H_PGD_INDEX_SIZE to 15 when we're building with 64K pages. This
      makes it larger than RADIX_PGD_INDEX_SIZE (13), which means the logic to
      calculate MAX_PGD_INDEX_SIZE in book3s/64/pgtable.h is wrong.
      
      The end result is that the PGD (Page Global Directory, ie top level page table)
      of the kernel (aka. swapper_pg_dir), is too small.
      
      This generally doesn't lead to a crash, as we don't use the full range
      in normal operation. However, if we try to dump the kernel pagetables
      we can trigger a crash because we walk off the end of the pgd into
      other memory and eventually try to dereference something bogus:
      
        $ cat /sys/kernel/debug/kernel_pagetables
        Unable to handle kernel paging request for data at address 0xe8fece0000000000
        Faulting instruction address: 0xc000000000072314
        cpu 0xc: Vector: 380 (Data SLB Access) at [c0000000daa13890]
            pc: c000000000072314: ptdump_show+0x164/0x430
            lr: c000000000072550: ptdump_show+0x3a0/0x430
           dar: e802cf0000000000
        seq_read+0xf8/0x560
        full_proxy_read+0x84/0xc0
        __vfs_read+0x6c/0x1d0
        vfs_read+0xbc/0x1b0
        SyS_read+0x6c/0x110
        system_call+0x38/0xfc
      
      The root cause is that MAX_PGD_INDEX_SIZE isn't actually computed to
      be the max of H_PGD_INDEX_SIZE or RADIX_PGD_INDEX_SIZE. To fix that,
      move the calculation into asm-offsets.c, where we can do it easily
      using max(), as sketched after this entry.
      
      Fixes: f6eedbba ("powerpc/mm/hash: Increase VA range to 128TB")
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      03dfee6d
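      A sketch of the fix; DEFINE() is the standard asm-offsets macro, and
      the exact expression here is an assumption based on the description
      above:

        /* arch/powerpc/kernel/asm-offsets.c: compute the constant with C's
         * max() at build time instead of preprocessor logic in pgtable.h,
         * so swapper_pg_dir is sized for the larger of the two geometries. */
        DEFINE(MAX_PGD_TABLE_SIZE,
               sizeof(pgd_t) << max(RADIX_PGD_INDEX_SIZE, H_PGD_INDEX_SIZE));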
  15. 04 April 2017 (1 commit)
    • powerpc/powernv: Introduce address translation services for Nvlink2 · 1ab66d1f
      Committed by Alistair Popple
      Nvlink2 supports address translation services (ATS), allowing devices
      to request address translations from an MMU known as the nest MMU,
      which is set up to walk the CPU page tables.
      
      To access this functionality, certain firmware calls are required to
      set up and manage hardware context tables in the nvlink processing
      unit (NPU). The NPU also manages forwarding of TLB invalidates (known
      as address translation shootdowns/ATSDs) to attached devices.
      
      This patch exports several methods to allow device drivers to register
      a process id (PASID/PID) in the hardware tables and to receive
      notification of when a device should stop issuing address translation
      requests (ATRs). It also adds a fault handler to allow device drivers
      to fault pages in on demand.
      Signed-off-by: Alistair Popple <alistair@popple.id.au>
      [mpe: Fix up comment formatting, use flush_tlb_mm()]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      1ab66d1f
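      A sketch of driver-side usage of the exported interface; the function
      names and signatures here are assumptions based on the description
      above, and my_device_quiesce() is a hypothetical driver hook:

        /* Callback: the NPU says this context is going away, so the device
         * must stop issuing address translation requests (ATRs). */
        static struct npu_context *stop_atr_cb(struct npu_context *ctx,
                                               void *priv)
        {
                my_device_quiesce(priv);
                return ctx;
        }

        /* Register the current process's PASID/PID in the NPU hardware
         * tables so the nest MMU can translate on the device's behalf. */
        static int bind_current_process(struct pci_dev *gpdev, void *priv)
        {
                struct npu_context *ctx =
                        pnv_npu2_init_context(gpdev, 0, stop_atr_cb, priv);

                return IS_ERR(ctx) ? PTR_ERR(ctx) : 0;
        }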
  16. 01 April 2017 (2 commits)
  17. 31 March 2017 (12 commits)
  18. 10 March 2017 (3 commits)
  19. 01 March 2017 (1 commit)
    • KVM: PPC: Book3S HV: Fix software walk of guest process page tables · 70cd4c10
      Committed by Paul Mackerras
      This fixes some bugs in the code that walks the guest's page tables.
      These bugs cause MMIO emulation to fail whenever the guest is in
      virtual mode (MMU on), leading to the guest hanging if it tries to
      access a virtio device.
      
      The first bug was that when reading the guest's process table, we were
      using the whole of arch->process_table, not just the field that contains
      the process table base address.  The second bug was that the mask used
      when reading the process table entry to get the radix tree base address,
      RPDB_MASK, had the wrong value.
      
      Fixes: 9e04ba69 ("KVM: PPC: Book3S HV: Add basic infrastructure for radix guests")
      Fixes: e9983344 ("powerpc/mm/radix: Add partition table format & callback")
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      70cd4c10
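      Schematically, the corrected walk masks out just the base-address
      fields; PRTB_MASK below is hypothetical, while RPDB_MASK is the mask
      named above (its corrected bit value is not reproduced here):

        /* Bug 1: use only the base-address field of arch->process_table,
         * not the whole register value. */
        prtb_base = vcpu->kvm->arch.process_table & PRTB_MASK;

        /* Bug 2: RPDB_MASK must cover exactly the radix tree (Root Page
         * Directory Base) field of the process table entry. */
        radix_root = be64_to_cpu(prte.prtb0) & RPDB_MASK;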
  20. 28 February 2017 (1 commit)
  21. 25 February 2017 (1 commit)
  22. 15 February 2017 (3 commits)
  23. 14 February 2017 (1 commit)
  24. 10 February 2017 (1 commit)
    • powerpc/pseries: Add support for hash table resizing · dbcf929c
      Committed by David Gibson
      This adds support for using two hypercalls to change the size of the
      main hash page table while running as a PAPR guest. For now these
      hypercalls are only available in experimental qemu versions.
      
      The interface has two parts: first, H_RESIZE_HPT_PREPARE is used to
      allocate and prepare the new hash table. This may be slow, but can be
      done asynchronously. Then H_RESIZE_HPT_COMMIT is used to switch to the
      new hash table. This requires that no CPUs be concurrently updating
      the HPT, and so must be run under stop_machine() (the call pattern is
      sketched after this entry).
      
      This also adds a debugfs file which can be used to manually control
      HPT resizing, for testing purposes.
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Paul Mackerras <paulus@samba.org>
      [mpe: Rename the debugfs file to "hpt_order"]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      dbcf929c
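      A sketch of the two-phase pattern from the guest side, assuming
      plpar_resize_hpt_prepare()/plpar_resize_hpt_commit() as the hypercall
      wrapper names:

        /* Runs under stop_machine(): no other CPU can touch the HPT. */
        static int hpt_commit_fn(void *data)
        {
                return plpar_resize_hpt_commit(0, *(unsigned long *)data);
        }

        /* Resize the hash page table to 2^shift bytes. */
        static int resize_hpt(unsigned long shift)
        {
                long rc;

                /* Phase 1: the hypervisor allocates and prepares the new
                 * HPT. May be slow, so poll while it reports busy. */
                do {
                        rc = plpar_resize_hpt_prepare(0, shift);
                        if (rc == H_BUSY)
                                cond_resched();
                } while (rc == H_BUSY);
                if (rc != H_SUCCESS)
                        return -EIO;

                /* Phase 2: atomically switch to the new HPT. */
                return stop_machine(hpt_commit_fn, &shift, NULL);
        }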