1. 12 Oct 2015, 1 commit
    • powerpc/mm: Differentiate between hugetlb and THP during page walk · 891121e6
      Authored by Aneesh Kumar K.V
      We need to properly identify whether a hugepage is an explicit or
      a transparent hugepage in follow_huge_addr(). We used to depend
      on the hugepage shift argument to do that, but in some cases that
      gives the wrong result. For example:
      
      On finding a transparent hugepage we set the hugepage shift to PMD_SHIFT.
      But we can end up clearing the THP pte via pmdp_huge_get_and_clear().
      We do prevent reuse of the pfn page with kick_all_cpus_sync(), but that
      happens after the pte has already been updated to 0. Hence in
      follow_huge_addr() we can find the hugepage shift set while the
      transparent hugepage check fails for a THP pte.
      
      NOTE: We fixed a variant of this race against thp split in commit
      691e95fd
      ("powerpc/mm/thp: Make page table walk safe against thp split/collapse")
      
      Without this patch, we may hit the BUG_ON(flags & FOLL_GET) in
      follow_page_mask occasionally.
      
      In the long term, we may want to switch the ppc64 64k page size config to
      enable CONFIG_ARCH_WANT_GENERAL_HUGETLB.
      Reported-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      891121e6
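      As a hedged illustration of the idea in this commit: the page-table walker reports
      explicitly whether the mapping it found is a transparent hugepage, so callers such
      as follow_huge_addr() no longer infer THP from the shift value, which can be stale
      once pmdp_huge_get_and_clear() has zeroed the pmd. The sketch below shows a
      caller-side pattern; the is_thp out-parameter mirrors the description above, but
      exact signatures may differ from the upstream code.

          /* Illustrative sketch, not the upstream implementation. */
          static struct page *lookup_huge_page(struct mm_struct *mm, unsigned long ea)
          {
              struct page *page = NULL;
              unsigned int shift = 0;
              unsigned long flags;
              bool is_thp = false;
              pte_t *ptep;

              /* IRQs off keeps the walk safe against THP split/collapse,
               * as in commit 691e95fd referenced above. */
              local_irq_save(flags);
              ptep = find_linux_pte_or_hugepte(mm->pgd, ea, &is_thp, &shift);
              if (ptep && pte_present(*ptep)) {
                  if (is_thp)
                      page = pmd_page(*(pmd_t *)ptep); /* transparent hugepage */
                  else if (shift)
                      page = pte_page(*ptep);          /* explicit hugetlb page */
              }
              local_irq_restore(flags);
              return page;
          }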
  2. 11 Jun 2015, 1 commit
    • powerpc/mmu: Add userspace-to-physical addresses translation cache · 15b244a8
      Authored by Alexey Kardashevskiy
      We are adding support for DMA memory pre-registration to be used in
      conjunction with VFIO. The idea is that userspace which is going to
      run a guest may want to pre-register a user-space memory region so
      it all gets pinned once and never goes away. With that done, a
      hypervisor will not have to pin/unpin pages on every DMA map/unmap
      request. This also helps when the same memory is pinned multiple times.
      
      Another use of it is in-kernel real mode (MMU off) acceleration of
      DMA requests, where real-time translation of guest physical to host
      physical addresses is non-trivial and may fail because Linux ptes may
      be temporarily invalid. Also, by caching host physical addresses
      (rather than just pinning at the start and then walking the page table
      again on every H_PUT_TCE), we can be sure that the addresses we put
      into the TCE table are the ones we already pinned.
      
      This adds a list of memory regions to mm_context_t. Each region consists
      of a header and a list of physical addresses. This adds an API to:
      1. register/unregister memory regions;
      2. do final cleanup (which puts all pre-registered pages);
      3. do userspace-to-physical address translation;
      4. manage usage counters; multiple registration of the same memory
      is allowed (once per container).
      
      This implements two counters per registered memory region:
      - @mapped: incremented on every DMA mapping and decremented on unmapping;
      initialized to 1 when a region is first registered; once it reaches zero,
      no more mappings are allowed;
      - @used: incremented on every "register" ioctl and decremented on
      "unregister"; unregistration is allowed for DMA-mapped regions unless
      it is the very last reference. For the very last reference this checks
      that the region is still mapped and returns -EBUSY so userspace knows
      that memory is still pinned and unregistration needs to be retried;
      @used remains 1.
      
      Host physical addresses are stored in a vmalloc'ed array. In order to
      access these in real mode (MMU off), there is a real_vmalloc_addr()
      helper. The in-kernel acceleration patchset will move it from KVM to MMU code.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      15b244a8
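      The counter rules above are the heart of the API. Below is a hedged sketch of
      the @mapped accounting only; the structure layout and helper names are
      illustrative, not the exact identifiers added by this patch.

          /* Illustrative sketch of a pre-registered region and its @mapped rule. */
          struct iommu_prereg_region_sketch {
              struct list_head next;
              unsigned long ua;          /* userspace address of the region */
              unsigned long entries;     /* size in system pages */
              atomic64_t mapped;         /* 1 at registration; 0 = being torn down */
              unsigned long used;        /* one count per "register" ioctl */
              u64 *hpas;                 /* vmalloc'ed host physical addresses */
          };

          /* DMA map: refuse once teardown has dropped @mapped to zero. */
          static bool region_mapped_inc(struct iommu_prereg_region_sketch *mem)
          {
              return atomic64_inc_not_zero(&mem->mapped);
          }

          /* DMA unmap: drop the reference taken at map time. */
          static void region_mapped_dec(struct iommu_prereg_region_sketch *mem)
          {
              atomic64_dec(&mem->mapped);
          }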
  3. 17 Mar 2015, 1 commit
  4. 05 Dec 2014, 1 commit
    • powerpc/mm: don't do tlbie for updatepp request with NO HPTE fault · aefa5688
      Authored by Aneesh Kumar K.V
      updatepp can get called for a no-HPTE fault when we find from the Linux
      page table that the translation was hashed before. In that case
      we are sure that there is no existing translation, hence we can
      avoid doing tlbie.
      
      We could possibly race with a parallel fault filling the TLB, but
      that should be ok because updatepp only ever relaxes permissions.
      We also look at the Linux pte permission bits when filling hash pte
      permission bits, and we hold the Linux pte busy bits while
      inserting/updating a hash pte entry, so a parallel update of the
      Linux pte is not possible. On the other hand, mprotect involves
      ptep_modify_prot_start, which causes an hpte invalidate rather than updatepp.
      
      Performance numbers:
      We use random_access_bench written by Anton.
      
      Kernel with THP disabled and a smaller hash page table size.
      
          86.60%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_updatepp
           2.10%  random_access_b  random_access_bench              [.] doit
           1.99%  random_access_b  [kernel.kallsyms]                [k] .do_raw_spin_lock
           1.85%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_insert
           1.26%  random_access_b  [kernel.kallsyms]                [k] .native_flush_hash_range
           1.18%  random_access_b  [kernel.kallsyms]                [k] .__delay
           0.69%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_remove
           0.37%  random_access_b  [kernel.kallsyms]                [k] .clear_user_page
           0.34%  random_access_b  [kernel.kallsyms]                [k] .__hash_page_64K
           0.32%  random_access_b  [kernel.kallsyms]                [k] fast_exception_return
           0.30%  random_access_b  [kernel.kallsyms]                [k] .hash_page_mm
      
      With Fix:
      
          27.54%  random_access_b  random_access_bench              [.] doit
          22.90%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_insert
           5.76%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_remove
           5.20%  random_access_b  [kernel.kallsyms]                [k] fast_exception_return
           5.12%  random_access_b  [kernel.kallsyms]                [k] .__hash_page_64K
           4.80%  random_access_b  [kernel.kallsyms]                [k] .hash_page_mm
           3.31%  random_access_b  [kernel.kallsyms]                [k] data_access_common
           1.84%  random_access_b  [kernel.kallsyms]                [k] .trace_hardirqs_on_caller
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      aefa5688
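      The optimisation amounts to passing a hint down to the updatepp path: if the
      fault is a no-HPTE fault, nothing can be cached in the TLB, so the tlbie
      broadcast can be skipped. A hedged sketch of that decision follows;
      HPTE_NOHPTE_UPDATE and the helper names are made up for illustration and are
      not the upstream identifiers.

          #define HPTE_NOHPTE_UPDATE  0x1   /* hypothetical "no prior HPTE" hint */

          static void hpte_updatepp_sketch(unsigned long slot, unsigned long newpp,
                                           unsigned long vpn, unsigned long flags)
          {
              update_hpte_permission_bits(slot, newpp);   /* hypothetical helper */

              /*
               * Only invalidate when a previous translation may still be cached
               * in the TLB; this is the slow path the patch avoids for no-HPTE
               * faults. Racing faults that filled the TLB in parallel are
               * harmless because updatepp only ever relaxes permissions.
               */
              if (!(flags & HPTE_NOHPTE_UPDATE))
                  tlbie_broadcast_sketch(vpn);            /* hypothetical helper */
          }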
  5. 08 Oct 2014, 2 commits
  6. 25 Sep 2014, 1 commit
  7. 28 Jul 2014, 1 commit
  8. 22 Jul 2014, 1 commit
    • powerpc: subpage_protect: Increase the array size to take care of 64TB · dad6f37c
      Authored by Aneesh Kumar K.V
      We now support a TASK_SIZE of 64TB, hence the array should be 8.
      
      Fixes the below crash:
      
      Unable to handle kernel paging request for data at address 0x000100bd
      Faulting instruction address: 0xc00000000004f914
      cpu 0x13: Vector: 300 (Data Access) at [c000000fea75fa90]
          pc: c00000000004f914: .sys_subpage_prot+0x2d4/0x5c0
          lr: c00000000004fb5c: .sys_subpage_prot+0x51c/0x5c0
          sp: c000000fea75fd10
         msr: 9000000000009032
         dar: 100bd
       dsisr: 40000000
        current = 0xc000000fea6ae490
        paca    = 0xc00000000fb8ab00   softe: 0        irq_happened: 0x00
          pid   = 8237, comm = a.out
      enter ? for help
      [c000000fea75fe30] c00000000000a164 syscall_exit+0x0/0x98
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      dad6f37c
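      The sizing itself is simple arithmetic, assuming each slot of the protection
      pointer array covers an 8TB chunk (1UL << 43 bytes): a 64TB TASK_SIZE then
      needs eight slots where the old layout had two. The sketch below illustrates
      that arithmetic only; field names and the per-slot coverage are assumptions,
      not a copy of the upstream header.

          #define SKETCH_TASK_SIZE    (1UL << 46)   /* 64TB */
          #define SKETCH_CHUNK_SHIFT  43            /* assumed 8TB covered per slot */

          struct subpage_prot_sketch {
              unsigned long maxaddr;  /* only addresses below this are protected */
              /* 64TB >> 43 = 8 entries; a 2-entry array is indexed out of bounds */
              unsigned int **protptrs[SKETCH_TASK_SIZE >> SKETCH_CHUNK_SHIFT];
              unsigned int *low_prot[4];
          };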
  9. 11 Oct 2013, 1 commit
  10. 25 Jun 2013, 1 commit
  11. 21 Jun 2013, 1 commit
  12. 30 Apr 2013, 4 commits
  13. 17 Mar 2013, 3 commits
  14. 06 Dec 2012, 1 commit
    • KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking · b4072df4
      Authored by Paul Mackerras
      Currently, if a machine check interrupt happens while we are in the
      guest, we exit the guest and call the host's machine check handler,
      which tends to cause the host to panic.  Some machine checks can be
      triggered by the guest; for example, if the guest creates two entries
      in the SLB that map the same effective address, and then accesses that
      effective address, the CPU will take a machine check interrupt.
      
      To handle this better, when a machine check happens inside the guest,
      we call a new function, kvmppc_realmode_machine_check(), while still in
      real mode before exiting the guest.  On POWER7, it handles the cases
      that the guest can trigger, either by flushing and reloading the SLB,
      or by flushing the TLB, and then it delivers the machine check interrupt
      directly to the guest without going back to the host.  On POWER7, the
      OPAL firmware patches the machine check interrupt vector so that it
      gets control first, and it leaves behind its analysis of the situation
      in a structure pointed to by the opal_mc_evt field of the paca.  The
      kvmppc_realmode_machine_check() function looks at this, and if OPAL
      reports that there was no error, or that it has handled the error, we
      also go straight back to the guest with a machine check.  We have to
      deliver a machine check to the guest since the machine check interrupt
      might have trashed valid values in SRR0/1.
      
      If the machine check is one we can't handle in real mode, and one that
      OPAL hasn't already handled, or on PPC970, we exit the guest and call
      the host's machine check handler.  We do this by jumping to the
      machine_check_fwnmi label, rather than absolute address 0x200, because
      we don't want to re-execute OPAL's handler on POWER7.  On PPC970, the
      two are equivalent because address 0x200 just contains a branch.
      
      Then, if the host machine check handler decides that the system can
      continue executing, kvmppc_handle_exit() delivers a machine check
      interrupt to the guest -- once again to let the guest know that SRR0/1
      have been modified.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      [agraf: fix checkpatch warnings]
      Signed-off-by: Alexander Graf <agraf@suse.de>
      b4072df4
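      The decision flow in kvmppc_realmode_machine_check() can be summed up as "fix
      it or trust OPAL, else bail to the host". The sketch below is a hedged
      restatement of that logic as plain C; the function and enum are illustrative,
      not the upstream code.

          #include <stdbool.h>

          enum mc_action_sketch { DELIVER_MC_TO_GUEST, EXIT_TO_HOST_MC_HANDLER };

          static enum mc_action_sketch realmode_mc_sketch(bool on_power7,
                                                          bool guest_case_fixed,
                                                          bool opal_recovered)
          {
              /*
               * guest_case_fixed: a guest-triggerable error (e.g. SLB multi-hit)
               * was repaired in real mode by flushing and reloading the SLB, or
               * by flushing the TLB.
               * opal_recovered: OPAL's analysis (reached via opal_mc_evt in the
               * paca) says there was no error or firmware already handled it.
               */
              if (on_power7 && (guest_case_fixed || opal_recovered)) {
                  /* SRR0/1 may be trashed, so the guest still gets a machine check */
                  return DELIVER_MC_TO_GUEST;
              }

              /*
               * Otherwise exit the guest and jump to machine_check_fwnmi (not
               * address 0x200) so OPAL's patched vector is not re-run on POWER7;
               * on PPC970 the two are equivalent.
               */
              return EXIT_TO_HOST_MC_HANDLER;
          }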
  15. 17 Sep 2012, 5 commits
  16. 28 Mar 2012, 1 commit
  17. 05 Mar 2012, 1 commit
    • KVM: PPC: Implement MMIO emulation support for Book3S HV guests · 697d3899
      Authored by Paul Mackerras
      This provides the low-level support for MMIO emulation in Book3S HV
      guests.  When the guest tries to map a page which is not covered by
      any memslot, that page is taken to be an MMIO emulation page.  Instead
      of inserting a valid HPTE, we insert an HPTE that has the valid bit
      clear but another hypervisor software-use bit set, which we call
      HPTE_V_ABSENT, to indicate that this is an absent page.  An
      absent page is treated much like a valid page as far as guest hcalls
      (H_ENTER, H_REMOVE, H_READ etc.) are concerned, except of course that
      an absent HPTE doesn't need to be invalidated with tlbie since it
      was never valid as far as the hardware is concerned.
      
      When the guest accesses a page for which there is an absent HPTE, it
      will take a hypervisor data storage interrupt (HDSI) since we now set
      the VPM1 bit in the LPCR.  Our HDSI handler for HPTE-not-present faults
      looks up the hash table and if it finds an absent HPTE mapping the
      requested virtual address, will switch to kernel mode and handle the
      fault in kvmppc_book3s_hv_page_fault(), which at present just calls
      kvmppc_hv_emulate_mmio() to set up the MMIO emulation.
      
      This is based on an earlier patch by Benjamin Herrenschmidt, but has
      since been heavily reworked.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      Signed-off-by: Avi Kivity <avi@redhat.com>
      697d3899
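      A hedged sketch of the HPTE-building decision described above follows.
      HPTE_V_VALID and HPTE_V_ABSENT are the bit names used in the text; the helper
      itself is illustrative and simplified from the real H_ENTER path.

          /* Illustrative only: choose how to install a guest mapping. */
          static unsigned long build_hpte_v_sketch(unsigned long hpte_v,
                                                   bool backed_by_memslot)
          {
              if (backed_by_memslot)
                  return hpte_v | HPTE_V_VALID;   /* ordinary guest RAM */

              /*
               * MMIO emulation page: valid bit clear, software "absent" bit set.
               * H_ENTER/H_REMOVE/H_READ treat it like a valid entry, but no tlbie
               * is needed because the hardware never saw it as valid. A later
               * access faults as an HDSI (VPM1 is set in the LPCR) and is routed
               * to kvmppc_hv_emulate_mmio() via kvmppc_book3s_hv_page_fault().
               */
              return (hpte_v & ~HPTE_V_VALID) | HPTE_V_ABSENT;
          }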
  18. 19 Dec 2011, 1 commit
  19. 20 Sep 2011, 1 commit
    • powerpc: Hugetlb for BookE · 41151e77
      Authored by Becky Bruce
      Enable hugepages on Freescale BookE processors.  This allows the kernel to
      use huge TLB entries to map pages, which can greatly reduce the number of
      TLB misses and the amount of TLB thrashing experienced by applications with
      large memory footprints.  Care should be taken when using this on FSL
      processors, as the number of large TLB entries supported by the core is low
      (16-64) on current processors.
      
      The supported set of hugepage sizes includes 4m, 16m, 64m, 256m, and 1g.
      Page sizes larger than the max zone size are called "gigantic" pages and
      must be allocated on the command line (and cannot be deallocated).
      
      This is currently only fully implemented for Freescale 32-bit BookE
      processors, but there is some infrastructure in the code for
      64-bit BookE.
      Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org>
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      41151e77
  20. 12 Jul 2011, 1 commit
    • KVM: PPC: Add support for Book3S processors in hypervisor mode · de56a948
      Authored by Paul Mackerras
      This adds support for KVM running on 64-bit Book 3S processors,
      specifically POWER7, in hypervisor mode.  Using hypervisor mode means
      that the guest can use the processor's supervisor mode.  That means
      that the guest can execute privileged instructions and access privileged
      registers itself without trapping to the host.  This gives excellent
      performance, but does mean that KVM cannot emulate a processor
      architecture other than the one that the hardware implements.
      
      This code assumes that the guest is running paravirtualized using the
      PAPR (Power Architecture Platform Requirements) interface, which is the
      interface that IBM's PowerVM hypervisor uses.  That means that existing
      Linux distributions that run on IBM pSeries machines will also run
      under KVM without modification.  In order to communicate the PAPR
      hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
      to include/linux/kvm.h.
      
      Currently the choice between book3s_hv support and book3s_pr support
      (i.e. the existing code, which runs the guest in user mode) has to be
      made at kernel configuration time, so a given kernel binary can only
      do one or the other.
      
      This new book3s_hv code doesn't support MMIO emulation at present.
      Since we are running paravirtualized guests, this isn't a serious
      restriction.
      
      With the guest running in supervisor mode, most exceptions go straight
      to the guest.  We will never get data or instruction storage or segment
      interrupts, alignment interrupts, decrementer interrupts, program
      interrupts, single-step interrupts, etc., coming to the hypervisor from
      the guest.  Therefore this introduces a new KVMTEST_NONHV macro for the
      exception entry path so that we don't have to do the KVM test on entry
      to those exception handlers.
      
      We do however get hypervisor decrementer, hypervisor data storage,
      hypervisor instruction storage, and hypervisor emulation assist
      interrupts, so we have to handle those.
      
      In hypervisor mode, real-mode accesses can access all of RAM, not just
      a limited amount.  Therefore we put all the guest state in the vcpu.arch
      and use the shadow_vcpu in the PACA only for temporary scratch space.
      We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
      anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
      We don't have a shared page with the guest, but we still need a
      kvm_vcpu_arch_shared struct to store the values of various registers,
      so we include one in the vcpu_arch struct.
      
      The POWER7 processor has a restriction that all threads in a core have
      to be in the same partition.  MMU-on kernel code counts as a partition
      (partition 0), so we have to do a partition switch on every entry to and
      exit from the guest.  At present we require the host and guest to run
      in single-thread mode because of this hardware restriction.
      
      This code allocates a hashed page table for the guest and initializes
      it with HPTEs for the guest's Virtual Real Memory Area (VRMA).  We
      require that the guest memory is allocated using 16MB huge pages, in
      order to simplify the low-level memory management.  This also means that
      we can get away without tracking paging activity in the host for now,
      since huge pages can't be paged or swapped.
      
      This also adds a few new exports needed by the book3s_hv code.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      de56a948
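      On the userspace side, the new exit code surfaces through the usual kvm_run
      loop. Below is a minimal hedged sketch of how a VMM might field it; the
      "unsupported hcall" status value is a placeholder, and a real VMM (e.g. QEMU)
      dispatches on the hypercall number rather than rejecting everything.

          #include <linux/kvm.h>
          #include <stdio.h>

          #define H_UNSUPPORTED_SKETCH  (-2L)   /* placeholder PAPR status */

          static void handle_papr_hcall_sketch(struct kvm_run *run)
          {
              if (run->exit_reason != KVM_EXIT_PAPR_HCALL)
                  return;

              /* hypercall number and arguments are provided by the kernel */
              fprintf(stderr, "guest hcall 0x%llx, arg0=0x%llx\n",
                      (unsigned long long)run->papr_hcall.nr,
                      (unsigned long long)run->papr_hcall.args[0]);

              /* a VMM would dispatch on .nr here (H_CEDE, H_PUT_TCE, ...);
               * the status written to .ret is returned to the guest on the
               * next KVM_RUN */
              run->papr_hcall.ret = H_UNSUPPORTED_SKETCH;
          }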
  21. 04 May 2011, 1 commit
  22. 30 Mar 2011, 1 commit
  23. 24 Aug 2010, 1 commit
  24. 23 Jul 2010, 1 commit
  25. 08 Dec 2009, 1 commit
  26. 02 Dec 2009, 1 commit
  27. 27 Nov 2009, 1 commit
  28. 30 Oct 2009, 2 commits
    • powerpc/mm: Bring hugepage PTE accessor functions back into sync with normal accessors · 0895ecda
      Authored by David Gibson
      The hugepage arch code provides a number of hook functions/macros
      which mirror the functionality of various normal page pte access
      functions.  Various changes in the normal page accessors (in
      particular BenH's recent changes to the handling of lazy icache
      flushing and PAGE_EXEC) have caused the hugepage versions to get out
      of sync with the originals.  In some cases, this is a bug, at least on
      some MMU types.
      
      One of the reasons that some hooks were not identical to the normal
      page versions is that the fact we're dealing with a hugepage needed
      to be passed down in order to use the correct dcache-icache flush function.
      This patch makes the main flush_dcache_icache_page() function hugepage
      aware (by checking for the PageCompound flag). That in turn means we
      can make set_huge_pte_at() just a call to set_pte_at(), bringing it
      back into sync. As a bonus, this lets us remove the
      hash_huge_page_do_lazy_icache() function, replacing it with a call to
      the hash_page_do_lazy_icache() function it was based on.
      
      Some other hugepage pte access hooks - huge_ptep_get_and_clear() and
      huge_ptep_clear_flush() - are not so easily unified, but this patch at
      least brings them back into sync with the current versions of the
      corresponding normal page functions.
      Signed-off-by: David Gibson <dwg@au1.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      0895ecda
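      The unification hinges on one change: the dcache/icache flush helper learns
      about compound pages, after which the hugetlb setter can simply defer to
      set_pte_at(). The sketch below is simplified from the real powerpc code;
      __flush_one_page_sketch stands in for the per-page flush primitive.

          /* Illustrative: flush every sub-page of a compound (huge) page. */
          static void flush_dcache_icache_page_sketch(struct page *page)
          {
              unsigned int i;
              unsigned int nr = PageCompound(page) ? (1u << compound_order(page)) : 1;

              for (i = 0; i < nr; i++)
                  __flush_one_page_sketch(page + i);  /* hypothetical helper */
          }

          /* With the flush hugepage-aware, the hugetlb setter becomes trivial. */
          static void set_huge_pte_at_sketch(struct mm_struct *mm, unsigned long addr,
                                             pte_t *ptep, pte_t pte)
          {
              set_pte_at(mm, addr, ptep, pte);    /* same path as normal pages */
          }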
    • powerpc/mm: Allow more flexible layouts for hugepage pagetables · a4fe3ce7
      Authored by David Gibson
      Currently each available hugepage size uses a slightly different
      pagetable layout: that is, the bottom-level table of pointers to
      hugepages is a different size, and may branch off from the normal page
      tables at a different level. Every hugepage-aware path that needs to
      walk the pagetables must therefore look up the hugepage size from the
      slice info first, and work out the correct way to walk the pagetables
      accordingly. Future hardware is likely to add more possible hugepage
      sizes, more layout options and more mess.
      
      This patch therefore reworks the handling of hugepage pagetables to
      reduce this complexity. In the new scheme, instead of having to
      consult the slice mask, pagetable walking code can check a flag in the
      PGD/PUD/PMD entries to see where to branch off to hugepage pagetables,
      and the entry also contains the information (essentially the hugepage
      shift) necessary to then interpret that table without recourse to the
      slice mask. This scheme can be extended neatly to handle multiple
      levels of self-describing "special" hugepage pagetables, although for
      now we assume only one level exists.
      
      This approach means that only the pagetable allocation path needs to
      know how the pagetables should be set out.  All other (hugepage)
      pagetable walking paths can just interpret the structure as they go.
      
      There already was a flag bit in PGD/PUD/PMD entries for hugepage
      directory pointers, but it was only used for debug.  We alter that
      flag bit to instead be a 0 in the MSB to indicate a hugepage pagetable
      pointer (normally it would be 1 since the pointer lies in the linear
      mapping).  This means that asm pagetable walking can test for (and
      punt on) hugepage pointers with the same test that checks for
      unpopulated page directory entries (beq becomes bge), since hugepage
      pointers will always be positive, and normal pointers always negative.
      
      While we're at it, we get rid of the confusing (and grep defeating)
      #defining of hugepte_shift to be the same thing as mmu_huge_psizes.
      Signed-off-by: David Gibson <dwg@au1.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      a4fe3ce7
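      The trick that keeps the asm walker cheap is encoding "this entry points at a
      hugepage table" in the pointer's sign: kernel linear-mapping pointers are
      negative (MSB set), hugepage-directory pointers are stored with the MSB clear,
      so a single signed comparison distinguishes empty, normal and hugepage entries.
      A hedged C-side sketch of that test follows; the real check is is_hugepd() on
      a hugepd_t in the powerpc headers.

          /*
           * Illustrative: a 64-bit page-directory entry is
           *   0                   -> unpopulated
           *   negative (MSB set)  -> normal lower-level table in the linear map
           *   positive, non-zero  -> hugepage directory; low bits carry the
           *                          hugepage shift needed to interpret it
           */
          static inline bool pde_is_hugepd_sketch(unsigned long pde)
          {
              return pde != 0 && (pde & (1UL << 63)) == 0;
          }

          /* In assembly the same property turns "branch if empty" (beq) into
           * "branch if empty or hugepage" (bge). */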
  29. 02 Sep 2009, 1 commit
    • powerpc/pseries: Fix to handle slb resize across migration · 46db2f86
      Authored by Brian King
      The SLB can change sizes across a live migration, which was not
      being handled, resulting in possible machine crashes during
      migration if migrating to a machine which has a smaller max SLB
      size than the source machine. Fix this by first reducing the
      SLB size to the minimum possible value, which is 32, prior to
      migration. Then during the device tree update which occurs after
      migration, we make the call to ensure the SLB gets updated. Also
      add the slb_size to the lparcfg output so that the migration
      tools can check to make sure the kernel has this capability
      before allowing migration in scenarios where the SLB size will change.
      
      BenH: Fixed #include <asm/mmu-hash64.h> -> <asm/mmu.h> to avoid
            breaking ppc32 build
      Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      46db2f86