1. 05 6月, 2017 1 次提交
    • B
      powerpc/mm/book(e)(3s)/64: Add page table accounting · de3b8761
      Balbir Singh 提交于
      Introduce a helper pgtable_gfp_flags() which
      just returns the current gfp flags and adds
      __GFP_ACCOUNT to account for page table allocation.
      The generic helper is added to include/asm/pgalloc.h
      and has two variants - WARNING ugly bits ahead
      
      1. If the header is included from a module, no check
      for mm == &init_mm is done, since init_mm is not
      exported
      2. For kernel includes, the check is done and required
      see (3e79ec7d arch: x86: charge page tables to kmemcg)
      
      The fundamental assumption is that no module should be
      doing pgd/pud/pmd and pte alloc's on behalf of init_mm
      directly.
      
      NOTE: This adds an overhead to pmd/pud/pgd allocations
      similar to x86.  The other alternative was to implement
      pmd_alloc_kernel/pud_alloc_kernel and pgd_alloc_kernel
      with their offset variants.
      
      For 4k page size, pte_alloc_one no longer calls
      pte_alloc_one_kernel.
      Signed-off-by: NBalbir Singh <bsingharora@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      de3b8761
  2. 25 6月, 2016 2 次提交
    • M
      powerpc: get rid of superfluous __GFP_REPEAT · 2379a23e
      Michal Hocko 提交于
      __GFP_REPEAT has a rather weak semantic but since it has been introduced
      around 2.6.12 it has been ignored for low order allocations.
      
      {pud,pmd}_alloc_one are allocating from {PGT,PUD}_CACHE initialized in
      pgtable_cache_init which doesn't have larger than sizeof(void *) << 12
      size and that fits into !costly allocation request size.
      
      PGALLOC_GFP is used only in radix__pgd_alloc which uses either order-0
      or order-4 requests.  The first one doesn't need the flag while the
      second does.  Drop __GFP_REPEAT from PGALLOC_GFP and add it for the
      order-4 one.
      
      This means that this flag has never been actually useful here because it
      has always been used only for PAGE_ALLOC_COSTLY requests.
      
      Link: http://lkml.kernel.org/r/1464599699-30131-12-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2379a23e
    • M
      tree wide: get rid of __GFP_REPEAT for order-0 allocations part I · 32d6bd90
      Michal Hocko 提交于
      This is the third version of the patchset previously sent [1].  I have
      basically only rebased it on top of 4.7-rc1 tree and dropped "dm: get
      rid of superfluous gfp flags" which went through dm tree.  I am sending
      it now because it is tree wide and chances for conflicts are reduced
      considerably when we want to target rc2.  I plan to send the next step
      and rename the flag and move to a better semantic later during this
      release cycle so we will have a new semantic ready for 4.8 merge window
      hopefully.
      
      Motivation:
      
      While working on something unrelated I've checked the current usage of
      __GFP_REPEAT in the tree.  It seems that a majority of the usage is and
      always has been bogus because __GFP_REPEAT has always been about costly
      high order allocations while we are using it for order-0 or very small
      orders very often.  It seems that a big pile of them is just a
      copy&paste when a code has been adopted from one arch to another.
      
      I think it makes some sense to get rid of them because they are just
      making the semantic more unclear.  Please note that GFP_REPEAT is
      documented as
      
      * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
      
      * _might_ fail.  This depends upon the particular VM implementation.
        while !costly requests have basically nofail semantic.  So one could
        reasonably expect that order-0 request with __GFP_REPEAT will not loop
        for ever.  This is not implemented right now though.
      
      I would like to move on with __GFP_REPEAT and define a better semantic
      for it.
      
        $ git grep __GFP_REPEAT origin/master | wc -l
        111
        $ git grep __GFP_REPEAT | wc -l
        36
      
      So we are down to the third after this patch series.  The remaining
      places really seem to be relying on __GFP_REPEAT due to large allocation
      requests.  This still needs some double checking which I will do later
      after all the simple ones are sorted out.
      
      I am touching a lot of arch specific code here and I hope I got it right
      but as a matter of fact I even didn't compile test for some archs as I
      do not have cross compiler for them.  Patches should be quite trivial to
      review for stupid compile mistakes though.  The tricky parts are usually
      hidden by macro definitions and thats where I would appreciate help from
      arch maintainers.
      
      [1] http://lkml.kernel.org/r/1461849846-27209-1-git-send-email-mhocko@kernel.org
      
      This patch (of 19):
      
      __GFP_REPEAT has a rather weak semantic but since it has been introduced
      around 2.6.12 it has been ignored for low order allocations.  Yet we
      have the full kernel tree with its usage for apparently order-0
      allocations.  This is really confusing because __GFP_REPEAT is
      explicitly documented to allow allocation failures which is a weaker
      semantic than the current order-0 has (basically nofail).
      
      Let's simply drop __GFP_REPEAT from those places.  This would allow to
      identify place which really need allocator to retry harder and formulate
      a more specific semantic for what the flag is supposed to do actually.
      
      Link: http://lkml.kernel.org/r/1464599699-30131-2-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com> [for tile]
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: John Crispin <blogic@openwrt.org>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      32d6bd90
  3. 10 6月, 2016 1 次提交
    • A
      powerpc/mm/radix: Flush page walk cache when freeing page table · a145abf1
      Aneesh Kumar K.V 提交于
      Even though a tlb_flush() does a flush with invalidate all cache,
      we can end up doing an RCU page table free before calling tlb_flush().
      That means we can have page walk cache entries even after we free the
      page table pages. This can result in us doing wrong page table walk.
      
      Avoid this by doing pwc flush on every page table free. We can't batch
      the pwc flush, because the rcu call back function where we free the
      page table pages doesn't have information of the mmu gather. Thus we
      have to do a pwc on every page table page freed.
      
      Note: I also removed the dummy tlb_flush_pgtable call functions for
      hash 32.
      
      Fixes: 1a472c9d ("powerpc/mm/radix: Add tlbflush routines")
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      a145abf1
  4. 11 5月, 2016 6 次提交
  5. 03 3月, 2016 1 次提交
  6. 29 2月, 2016 1 次提交
  7. 14 12月, 2015 2 次提交
  8. 10 12月, 2013 1 次提交
    • H
      powerpc: Fix PTE page address mismatch in pgtable ctor/dtor · cf77ee54
      Hong H. Pham 提交于
      In pte_alloc_one(), pgtable_page_ctor() is passed an address that has
      not been converted by page_address() to the newly allocated PTE page.
      
      When the PTE is freed, __pte_free_tlb() calls pgtable_page_dtor()
      with an address to the PTE page that has been converted by page_address().
      The mismatch in the PTE's page address causes pgtable_page_dtor() to access
      invalid memory, so resources for that PTE (such as the page lock) is not
      properly cleaned up.
      
      On PPC32, only SMP kernels are affected.
      
      On PPC64, only SMP kernels with 4K page size are affected.
      
      This bug was introduced by commit d614bb04
      "powerpc: Move the pte free routines from common header".
      
      On a preempt-rt kernel, a spinlock is dynamically allocated for each
      PTE in pgtable_page_ctor().  When the PTE is freed, calling
      pgtable_page_dtor() with a mismatched page address causes a memory leak,
      as the pointer to the PTE's spinlock is bogus.
      
      On mainline, there isn't any immediately obvious symptoms, but the
      problem still exists here.
      
      Fixes: d614bb04 "powerpc: Move the pte free routes from common header"
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: linux-stable <stable@vger.kernel.org> # v3.10+
      Signed-off-by: NHong H. Pham <hong.pham@windriver.com>
      Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      cf77ee54
  9. 25 11月, 2013 1 次提交
    • H
      powerpc/kdump: Adding symbols in vmcoreinfo to facilitate dump filtering · 8ff81271
      Hari Bathini 提交于
      When CONFIG_SPARSEMEM_VMEMMAP option is used in kernel, makedumpfile fails
      to filter vmcore dump as it fails to do vmemmap translations. So far
      dump filtering on ppc64 never had to deal with vmemmap addresses seperately
      as vmemmap regions where mapped in zone normal. But with the inclusion of
      CONFIG_SPARSEMEM_VMEMMAP config option in kernel, this vmemmap address
      translation support becomes necessary for dump filtering. For vmemmap adress
      translation, few kernel symbols are needed by dump filtering tool. This patch
      adds those symbols to vmcoreinfo, which a dump filtering tool can use for
      filtering the kernel dump. Tested this changes successfully with makedumpfile
      tool that supports vmemmap to physical address translation outside zone normal.
      
      [ Removed unneeded #ifdef as suggested by Michael Ellerman --BenH ]
      Signed-off-by: NHari Bathini <hbathini@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      8ff81271
  10. 15 11月, 2013 1 次提交
  11. 21 6月, 2013 1 次提交
  12. 14 5月, 2013 1 次提交
  13. 30 4月, 2013 3 次提交
  14. 06 5月, 2010 1 次提交
    • M
      powerpc/mm: Track backing pages allocated by vmemmap_populate() · 91eea67c
      Mark Nelson 提交于
      We need to keep track of the backing pages that get allocated by
      vmemmap_populate() so that when we use kdump, the dump-capture kernel knows
      where these pages are.
      
      We use a simple linked list of structures that contain the physical address
      of the backing page and corresponding virtual address to track the backing
      pages.
      To save space, we just use a pointer to the next struct vmemmap_backing. We
      can also do this because we never remove nodes.  We call the pointer "list"
      to be compatible with changes made to the crash utility.
      
      vmemmap_populate() is called either at boot-time or on a memory hotplug
      operation. We don't have to worry about the boot-time calls because they
      will be inherently single-threaded, and for a memory hotplug operation
      vmemmap_populate() is called through:
      sparse_add_one_section()
                  |
                  V
      kmalloc_section_memmap()
                  |
                  V
      sparse_mem_map_populate()
                  |
                  V
      vmemmap_populate()
      and in sparse_add_one_section() we're protected by pgdat_resize_lock().
      So, we don't need a spinlock to protect the vmemmap_list.
      
      We allocate space for the vmemmap_backing structs by allocating whole pages
      in vmemmap_list_alloc() and then handing out chunks of this to
      vmemmap_list_populate().
      
      This means that we waste at most just under one page, but this keeps the code
      is simple.
      Signed-off-by: NMark Nelson <markn@au1.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      91eea67c
  15. 08 12月, 2009 1 次提交
  16. 02 12月, 2009 1 次提交
  17. 27 11月, 2009 1 次提交
  18. 30 10月, 2009 1 次提交
    • D
      powerpc/mm: Cleanup management of kmem_caches for pagetables · a0668cdc
      David Gibson 提交于
      Currently we have a fair bit of rather fiddly code to manage the
      various kmem_caches used to store page tables of various levels.  We
      generally have two caches holding some combination of PGD, PUD and PMD
      tables, plus several more for the special hugepage pagetables.
      
      This patch cleans this all up by taking a different approach.  Rather
      than the caches being designated as for PUDs or for hugeptes for 16M
      pages, the caches are simply allocated to be a specific size.  Thus
      sharing of caches between different types/levels of pagetables happens
      naturally.  The pagetable size, where needed, is passed around encoded
      in the same way as {PGD,PUD,PMD}_INDEX_SIZE; that is n where the
      pagetable contains 2^n pointers.
      Signed-off-by: NDavid Gibson <dwg@au1.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      a0668cdc
  19. 28 7月, 2009 1 次提交
    • B
      mm: Pass virtual address to [__]p{te,ud,md}_free_tlb() · 9e1b32ca
      Benjamin Herrenschmidt 提交于
      mm: Pass virtual address to [__]p{te,ud,md}_free_tlb()
      
      Upcoming paches to support the new 64-bit "BookE" powerpc architecture
      will need to have the virtual address corresponding to PTE page when
      freeing it, due to the way the HW table walker works.
      
      Basically, the TLB can be loaded with "large" pages that cover the whole
      virtual space (well, sort-of, half of it actually) represented by a PTE
      page, and which contain an "indirect" bit indicating that this TLB entry
      RPN points to an array of PTEs from which the TLB can then create direct
      entries. Thus, in order to invalidate those when PTE pages are deleted,
      we need the virtual address to pass to tlbilx or tlbivax instructions.
      
      The old trick of sticking it somewhere in the PTE page struct page sucks
      too much, the address is almost readily available in all call sites and
      almost everybody implemets these as macros, so we may as well add the
      argument everywhere. I added it to the pmd and pud variants for consistency.
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Acked-by: David Howells <dhowells@redhat.com> [MN10300 & FRV]
      Acked-by: NNick Piggin <npiggin@suse.de>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> [s390]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9e1b32ca
  20. 03 12月, 2008 1 次提交
  21. 04 8月, 2008 1 次提交
  22. 25 7月, 2008 1 次提交
  23. 09 2月, 2008 1 次提交
    • M
      CONFIG_HIGHPTE vs. sub-page page tables. · 2f569afd
      Martin Schwidefsky 提交于
      Background: I've implemented 1K/2K page tables for s390.  These sub-page
      page tables are required to properly support the s390 virtualization
      instruction with KVM.  The SIE instruction requires that the page tables
      have 256 page table entries (pte) followed by 256 page status table entries
      (pgste).  The pgstes are only required if the process is using the SIE
      instruction.  The pgstes are updated by the hardware and by the hypervisor
      for a number of reasons, one of them is dirty and reference bit tracking.
      To avoid wasting memory the standard pte table allocation should return
      1K/2K (31/64 bit) and 2K/4K if the process is using SIE.
      
      Problem: Page size on s390 is 4K, page table size is 1K or 2K.  That means
      the s390 version for pte_alloc_one cannot return a pointer to a struct
      page.  Trouble is that with the CONFIG_HIGHPTE feature on x86 pte_alloc_one
      cannot return a pointer to a pte either, since that would require more than
      32 bit for the return value of pte_alloc_one (and the pte * would not be
      accessible since its not kmapped).
      
      Solution: The only solution I found to this dilemma is a new typedef: a
      pgtable_t.  For s390 pgtable_t will be a (pte *) - to be introduced with a
      later patch.  For everybody else it will be a (struct page *).  The
      additional problem with the initialization of the ptl lock and the
      NR_PAGETABLE accounting is solved with a constructor pgtable_page_ctor and
      a destructor pgtable_page_dtor.  The page table allocation and free
      functions need to call these two whenever a page table page is allocated or
      freed.  pmd_populate will get a pgtable_t instead of a struct page pointer.
       To get the pgtable_t back from a pmd entry that has been installed with
      pmd_populate a new function pmd_pgtable is added.  It replaces the pmd_page
      call in free_pte_range and apply_to_pte_range.
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2f569afd
  24. 06 2月, 2008 1 次提交
  25. 24 1月, 2008 1 次提交
    • P
      [POWERPC] Provide a way to protect 4k subpages when using 64k pages · fa28237c
      Paul Mackerras 提交于
      Using 64k pages on 64-bit PowerPC systems makes life difficult for
      emulators that are trying to emulate an ISA, such as x86, which use a
      smaller page size, since the emulator can no longer use the MMU and
      the normal system calls for controlling page protections.  Of course,
      the emulator can emulate the MMU by checking and possibly remapping
      the address for each memory access in software, but that is pretty
      slow.
      
      This provides a facility for such programs to control the access
      permissions on individual 4k sub-pages of 64k pages.  The idea is
      that the emulator supplies an array of protection masks to apply to a
      specified range of virtual addresses.  These masks are applied at the
      level where hardware PTEs are inserted into the hardware page table
      based on the Linux PTEs, so the Linux PTEs are not affected.  Note
      that this new mechanism does not allow any access that would otherwise
      be prohibited; it can only prohibit accesses that would otherwise be
      allowed.  This new facility is only available on 64-bit PowerPC and
      only when the kernel is configured for 64k pages.
      
      The masks are supplied using a new subpage_prot system call, which
      takes a starting virtual address and length, and a pointer to an array
      of protection masks in memory.  The array has a 32-bit word per 64k
      page to be protected; each 32-bit word consists of 16 2-bit fields,
      for which 0 allows any access (that is otherwise allowed), 1 prevents
      write accesses, and 2 or 3 prevent any access.
      
      Implicit in this is that the regions of the address space that are
      protected are switched to use 4k hardware pages rather than 64k
      hardware pages (on machines with hardware 64k page support).  In fact
      the whole process is switched to use 4k hardware pages when the
      subpage_prot system call is used, but this could be improved in future
      to switch only the affected segments.
      
      The subpage protection bits are stored in a 3 level tree akin to the
      page table tree.  The top level of this tree is stored in a structure
      that is appended to the top level of the page table tree, i.e., the
      pgd array.  Since it will often only be 32-bit addresses (below 4GB)
      that are protected, the pointers to the first four bottom level pages
      are also stored in this structure (each bottom level page contains the
      protection bits for 1GB of address space), so the protection bits for
      addresses below 4GB can be accessed with one fewer loads than those
      for higher addresses.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      fa28237c
  26. 02 6月, 2007 1 次提交
  27. 09 5月, 2007 1 次提交
  28. 02 5月, 2007 1 次提交
    • D
      [POWERPC] Remove arch/powerpc's dependence on asm-ppc/pg{alloc,table}.h · f88df14b
      David Gibson 提交于
      Currently, all 32-bit powerpc platforms use asm-ppc/pgtable.h and
      asm-ppc/pgalloc.h, even when otherwise compiled with ARCH=powerpc.
      Those asm-ppc files are a fairly nasty tangle of #ifdefs including a
      bunch of things which shouldn't be necessary any more in arch/powerpc.
      
      Cleaning up that mess is going to take a while, but this patch is a
      first step.  It separates the asm-powerpc/pg{alloc,table}.h into 64
      bit and 32 bit versions in asm-powerpc, which the basic .h files in
      asm-powerpc select based on config.  We make a few tiny tweaks to the
      innards of the files along the way, making the outermost ifdefs
      (double-inclusion protection and __KERNEL__) a little cleaner, and
      #including asm-generic/pgtable.h from the top-level
      asm-powerpc/pgtable.h (since both the old 32-bit and 64-bit versions
      ended with such an #include).
      Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      f88df14b
  29. 08 12月, 2006 1 次提交
  30. 24 8月, 2006 1 次提交
    • A
      [POWERPC] hugepage BUG fix · c9169f87
      Adam Litke 提交于
      On Tue, 2006-08-15 at 08:22 -0700, Dave Hansen wrote:
      > kernel BUG in cache_free_debugcheck at mm/slab.c:2748!
      
      Alright, this one is only triggered when slab debugging is enabled.  The
      slabs are assumed to be aligned on a HUGEPTE_TABLE_SIZE boundary.  The free
      path makes use of this assumption and uses the lowest nibble to pass around
      an index into an array of kmem_cache pointers.  With slab debugging turned
      on, the slab is still aligned, but the "working" object pointer is not.
      This would break the assumption above that a full nibble is available for
      the PGF_CACHENUM_MASK.
      
      The following patch reduces PGF_CACHENUM_MASK to cover only the two least
      significant bits, which is enough to cover the current number of 4 pgtable
      cache types.  Then use this constant to mask out the appropriate part of
      the huge pte pointer.
      Signed-off-by: NAdam Litke <agl@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      c9169f87
  31. 28 4月, 2006 1 次提交
    • D
      [PATCH] powerpc: Fix pagetable bloat for hugepages · f10a04c0
      David Gibson 提交于
      At present, ARCH=powerpc kernels can waste considerable space in
      pagetables when making large hugepage mappings.  Hugepage PTEs go in
      PMD pages, but each PMD page maps 256M and so contains only 16
      hugepage PTEs (128 bytes of data), but takes up a 1024 byte
      allocation.  With CONFIG_PPC_64K_PAGES enabled (64k base page size),
      the situation is worse.  Now hugepage PTEs are at the PTE page level
      (also mapping 256M), so we store 16 hugepage PTEs in a 64k allocation.
      
      The PowerPC MMU already means that any 256M region is either all
      hugepage, or all normal pages.  Thus, with some care, we can use a
      different allocation for the hugepage PTE tables and only allocate the
      128 bytes necessary.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      f10a04c0