1. 10 2月, 2017 1 次提交
    • D
      powerpc/pseries: Add support for hash table resizing · dbcf929c
      David Gibson 提交于
      This adds support for using two hypercalls to change the size of the
      main hash page table while running as a PAPR guest. For now these
      hypercalls are only in experimental qemu versions.
      
      The interface is two part: first H_RESIZE_HPT_PREPARE is used to
      allocate and prepare the new hash table. This may be slow, but can be
      done asynchronously. Then, H_RESIZE_HPT_COMMIT is used to switch to the
      new hash table. This requires that no CPUs be concurrently updating the
      HPT, and so must be run under stop_machine().
      
      This also adds a debugfs file which can be used to manually control
      HPT resizing or testing purposes.
      Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: NPaul Mackerras <paulus@samba.org>
      [mpe: Rename the debugfs file to "hpt_order"]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      dbcf929c
  2. 30 1月, 2017 1 次提交
    • A
      powerpc/mm/hash: Properly mask the ESID bits when building proto VSID · 79270e0a
      Aneesh Kumar K.V 提交于
      The proto VSID is built using both the MMU context id and effective
      segment ID (ESID). We should not have overlapping bits between those.
      That could result in us having a VSID collision. With the current code
      we missed masking the top bits of the ESID. This implies for kernel
      address we ended up using the top 4 bits of the ESID as part of the
      proto VSID, which is wrong.
      
      The current code use the top 4 context values (0x7fffc - 0x7ffff) for
      the kernel. With those context IDs used for the kernel, we don't run
      into VSID collisions because we get the same proto VSID irrespective of
      whether we mask the ESID bits or not. eg:
      
        ea         = 0xf000000000000000
        context    = 0x7ffff
      
        w/out masking:
        proto_vsid = (0x7ffff << 6 | 0xf000000000000000 >> 40)
      	     = (0x1ffffc0 | 0xf00000)
      	     =  0x1ffffc0
      
        with masking:
        proto_vsid = (0x7ffff << 6 | ((0xf000000000000000 >> 40) & 0x3f))
      	     = (0x1ffffc0 | (0xf00000 & 0x3f))
      	     =  0x1ffffc0 | 0)
      	     =  0x1ffffc0
      
      So although there is no bug, the code is still overly subtle, so fix it
      to save ourselves pain in future.
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      79270e0a
  3. 16 11月, 2016 1 次提交
    • P
      powerpc/64: Simplify adaptation to new ISA v3.00 HPTE format · 6b243fcf
      Paul Mackerras 提交于
      This changes the way that we support the new ISA v3.00 HPTE format.
      Instead of adapting everything that uses HPTE values to handle either
      the old format or the new format, depending on which CPU we are on,
      we now convert explicitly between old and new formats if necessary
      in the low-level routines that actually access HPTEs in memory.
      This limits the amount of code that needs to know about the new
      format and makes the conversions explicit.  This is OK because the
      old format contains all the information that is in the new format.
      
      This also fixes operation under a hypervisor, because the H_ENTER
      hypercall (and other hypercalls that deal with HPTEs) will continue
      to require the HPTE value to be supplied in the old format.  At
      present the kernel will not boot in HPT mode on POWER9 under a
      hypervisor.
      
      This fixes and partially reverts commit 50de596d
      ("powerpc/mm/hash: Add support for Power9 Hash", 2016-04-29).
      
      Fixes: 50de596d ("powerpc/mm/hash: Add support for Power9 Hash")
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      6b243fcf
  4. 09 9月, 2016 1 次提交
    • P
      powerpc/mm: Speed up computation of base and actual page size for a HPTE · 0eeede0c
      Paul Mackerras 提交于
      This replaces a 2-D search through an array with a simple 8-bit table
      lookup for determining the actual and/or base page size for a HPT entry.
      
      The encoding in the second doubleword of the HPTE is designed to encode
      the actual and base page sizes without using any more bits than would be
      needed for a 4k page number, by using between 1 and 8 low-order bits of
      the RPN (real page number) field to encode the page sizes.  A single
      "large page" bit in the first doubleword indicates that these low-order
      bits are to be interpreted like this.
      
      We can determine the page sizes by using the low-order 8 bits of the RPN
      to look up a 256-entry table.  For actual page sizes less than 1MB, some
      of the upper bits of these 8 bits are going to be real address bits, but
      we can cope with that by replicating the entries for those smaller page
      sizes.
      
      While we're at it, let's move the hpte_page_size() and hpte_base_page_size()
      functions from a KVM-specific header to a header for 64-bit HPT systems,
      since this computation doesn't have anything specifically to do with KVM.
      Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      0eeede0c
  5. 01 8月, 2016 2 次提交
  6. 26 7月, 2016 2 次提交
  7. 21 7月, 2016 1 次提交
  8. 14 6月, 2016 2 次提交
  9. 01 5月, 2016 4 次提交
  10. 03 3月, 2016 1 次提交
  11. 02 3月, 2016 1 次提交
  12. 22 2月, 2016 1 次提交
    • M
      powerpc: Add POWER9 cputable entry · c3ab300e
      Michael Neuling 提交于
      Add a cputable entry for POWER9.  More code is required to actually
      boot and run on a POWER9 but this gets the base piece in which we can
      start building on.
      
      Copies over from POWER8 except for:
      - Adds a new CPU_FTR_ARCH_300 bit to start hanging new architecture
         features from (in subsequent patches).
      - Advertises new user features bits PPC_FEATURE2_ARCH_3_00 &
        HAS_IEEE128 when on POWER9.
      - Drops CPU_FTR_SUBCORE.
      - Drops PMU code and machine check.
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      c3ab300e
  13. 14 12月, 2015 1 次提交
  14. 12 10月, 2015 1 次提交
    • A
      powerpc/mm: Differentiate between hugetlb and THP during page walk · 891121e6
      Aneesh Kumar K.V 提交于
      We need to properly identify whether a hugepage is an explicit or
      a transparent hugepage in follow_huge_addr(). We used to depend
      on hugepage shift argument to do that. But in some case that can
      result in wrong results. For ex:
      
      On finding a transparent hugepage we set hugepage shift to PMD_SHIFT.
      But we can end up clearing the thp pte, via pmdp_huge_get_and_clear.
      We do prevent reusing the pfn page via the usage of
      kick_all_cpus_sync(). But that happens after we updated the pte to 0.
      Hence in follow_huge_addr() we can find hugepage shift set, but transparent
      huge page check fail for a thp pte.
      
      NOTE: We fixed a variant of this race against thp split in commit
      691e95fd
      ("powerpc/mm/thp: Make page table walk safe against thp split/collapse")
      
      Without this patch, we may hit the BUG_ON(flags & FOLL_GET) in
      follow_page_mask occasionally.
      
      In the long term, we may want to switch ppc64 64k page size config to
      enable CONFIG_ARCH_WANT_GENERAL_HUGETLB
      Reported-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      891121e6
  15. 11 6月, 2015 1 次提交
    • A
      powerpc/mmu: Add userspace-to-physical addresses translation cache · 15b244a8
      Alexey Kardashevskiy 提交于
      We are adding support for DMA memory pre-registration to be used in
      conjunction with VFIO. The idea is that the userspace which is going to
      run a guest may want to pre-register a user space memory region so
      it all gets pinned once and never goes away. Having this done,
      a hypervisor will not have to pin/unpin pages on every DMA map/unmap
      request. This is going to help with multiple pinning of the same memory.
      
      Another use of it is in-kernel real mode (mmu off) acceleration of
      DMA requests where real time translation of guest physical to host
      physical addresses is non-trivial and may fail as linux ptes may be
      temporarily invalid. Also, having cached host physical addresses
      (compared to just pinning at the start and then walking the page table
      again on every H_PUT_TCE), we can be sure that the addresses which we put
      into TCE table are the ones we already pinned.
      
      This adds a list of memory regions to mm_context_t. Each region consists
      of a header and a list of physical addresses. This adds API to:
      1. register/unregister memory regions;
      2. do final cleanup (which puts all pre-registered pages);
      3. do userspace to physical address translation;
      4. manage usage counters; multiple registration of the same memory
      is allowed (once per container).
      
      This implements 2 counters per registered memory region:
      - @mapped: incremented on every DMA mapping; decremented on unmapping;
      initialized to 1 when a region is just registered; once it becomes zero,
      no more mappings allowe;
      - @used: incremented on every "register" ioctl; decremented on
      "unregister"; unregistration is allowed for DMA mapped regions unless
      it is the very last reference. For the very last reference this checks
      that the region is still mapped and returns -EBUSY so the userspace
      gets to know that memory is still pinned and unregistration needs to
      be retried; @used remains 1.
      
      Host physical addresses are stored in vmalloc'ed array. In order to
      access these in the real mode (mmu off), there is a real_vmalloc_addr()
      helper. In-kernel acceleration patchset will move it from KVM to MMU code.
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      15b244a8
  16. 17 3月, 2015 1 次提交
  17. 05 12月, 2014 1 次提交
    • A
      powerpc/mm: don't do tlbie for updatepp request with NO HPTE fault · aefa5688
      Aneesh Kumar K.V 提交于
      upatepp can get called for a nohpte fault when we find from the linux
      page table that the translation was hashed before. In that case
      we are sure that there is no existing translation, hence we could
      avoid doing tlbie.
      
      We could possibly race with a parallel fault filling the TLB. But
      that should be ok because updatepp is only ever relaxing permissions.
      We also look at linux pte permission bits when filling hash pte
      permission bits. We also hold the linux pte busy bits while
      inserting/updating a hashpte entry, hence a paralle update of
      linux pte is not possible. On the other hand mprotect involves
      ptep_modify_prot_start which cause a hpte invalidate and not updatepp.
      
      Performance number:
      We use randbox_access_bench written by Anton.
      
      Kernel with THP disabled and smaller hash page table size.
      
          86.60%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_updatepp
           2.10%  random_access_b  random_access_bench              [.] doit
           1.99%  random_access_b  [kernel.kallsyms]                [k] .do_raw_spin_lock
           1.85%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_insert
           1.26%  random_access_b  [kernel.kallsyms]                [k] .native_flush_hash_range
           1.18%  random_access_b  [kernel.kallsyms]                [k] .__delay
           0.69%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_remove
           0.37%  random_access_b  [kernel.kallsyms]                [k] .clear_user_page
           0.34%  random_access_b  [kernel.kallsyms]                [k] .__hash_page_64K
           0.32%  random_access_b  [kernel.kallsyms]                [k] fast_exception_return
           0.30%  random_access_b  [kernel.kallsyms]                [k] .hash_page_mm
      
      With Fix:
      
          27.54%  random_access_b  random_access_bench              [.] doit
          22.90%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_insert
           5.76%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_remove
           5.20%  random_access_b  [kernel.kallsyms]                [k] fast_exception_return
           5.12%  random_access_b  [kernel.kallsyms]                [k] .__hash_page_64K
           4.80%  random_access_b  [kernel.kallsyms]                [k] .hash_page_mm
           3.31%  random_access_b  [kernel.kallsyms]                [k] data_access_common
           1.84%  random_access_b  [kernel.kallsyms]                [k] .trace_hardirqs_on_caller
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      aefa5688
  18. 08 10月, 2014 2 次提交
  19. 25 9月, 2014 1 次提交
  20. 28 7月, 2014 1 次提交
  21. 22 7月, 2014 1 次提交
    • A
      powerpc: subpage_protect: Increase the array size to take care of 64TB · dad6f37c
      Aneesh Kumar K.V 提交于
      We now support TASK_SIZE of 16TB, hence the array should be 8.
      
      Fixes the below crash:
      
      Unable to handle kernel paging request for data at address 0x000100bd
      Faulting instruction address: 0xc00000000004f914
      cpu 0x13: Vector: 300 (Data Access) at [c000000fea75fa90]
          pc: c00000000004f914: .sys_subpage_prot+0x2d4/0x5c0
          lr: c00000000004fb5c: .sys_subpage_prot+0x51c/0x5c0
          sp: c000000fea75fd10
         msr: 9000000000009032
         dar: 100bd
       dsisr: 40000000
        current = 0xc000000fea6ae490
        paca    = 0xc00000000fb8ab00   softe: 0        irq_happened: 0x00
          pid   = 8237, comm = a.out
      enter ? for help
      [c000000fea75fe30] c00000000000a164 syscall_exit+0x0/0x98
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      dad6f37c
  22. 11 10月, 2013 1 次提交
  23. 25 6月, 2013 1 次提交
  24. 21 6月, 2013 1 次提交
  25. 30 4月, 2013 4 次提交
  26. 17 3月, 2013 3 次提交
  27. 06 12月, 2012 1 次提交
    • P
      KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking · b4072df4
      Paul Mackerras 提交于
      Currently, if a machine check interrupt happens while we are in the
      guest, we exit the guest and call the host's machine check handler,
      which tends to cause the host to panic.  Some machine checks can be
      triggered by the guest; for example, if the guest creates two entries
      in the SLB that map the same effective address, and then accesses that
      effective address, the CPU will take a machine check interrupt.
      
      To handle this better, when a machine check happens inside the guest,
      we call a new function, kvmppc_realmode_machine_check(), while still in
      real mode before exiting the guest.  On POWER7, it handles the cases
      that the guest can trigger, either by flushing and reloading the SLB,
      or by flushing the TLB, and then it delivers the machine check interrupt
      directly to the guest without going back to the host.  On POWER7, the
      OPAL firmware patches the machine check interrupt vector so that it
      gets control first, and it leaves behind its analysis of the situation
      in a structure pointed to by the opal_mc_evt field of the paca.  The
      kvmppc_realmode_machine_check() function looks at this, and if OPAL
      reports that there was no error, or that it has handled the error, we
      also go straight back to the guest with a machine check.  We have to
      deliver a machine check to the guest since the machine check interrupt
      might have trashed valid values in SRR0/1.
      
      If the machine check is one we can't handle in real mode, and one that
      OPAL hasn't already handled, or on PPC970, we exit the guest and call
      the host's machine check handler.  We do this by jumping to the
      machine_check_fwnmi label, rather than absolute address 0x200, because
      we don't want to re-execute OPAL's handler on POWER7.  On PPC970, the
      two are equivalent because address 0x200 just contains a branch.
      
      Then, if the host machine check handler decides that the system can
      continue executing, kvmppc_handle_exit() delivers a machine check
      interrupt to the guest -- once again to let the guest know that SRR0/1
      have been modified.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      [agraf: fix checkpatch warnings]
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      b4072df4
  28. 17 9月, 2012 1 次提交