1. 18 Mar, 2014 · 1 commit
  2. 13 Nov, 2013 · 1 commit
    • D
      sparc64: Move from 4MB to 8MB huge pages. · 37b3a8ff
      Committed by David S. Miller
      The impetus for this is that we would like to move to 64-bit PMDs and
      PGDs, but that would result in only supporting a 42-bit address space
      with the current page table layout.  It'd be nice to support at least
      43-bits.
      
      The reason we'd end up with only 42-bits after making PMDs and PGDs
      64-bit is that we only use half-page sized PTE tables in order to make
      PMDs line up to 4MB, the hardware huge page size we use.
      
      So what we do here is we make huge pages 8MB, and fabricate them using
      4MB hw TLB entries.
      
      Facilitate this by providing a "REAL_HPAGE_SHIFT" which is used in
      places that really need to operate on hardware 4MB pages.
      
      Use full pages (512 entries) for PTE tables, and adjust PMD_SHIFT,
      PGD_SHIFT, and the build time CPP test as needed.  Use a CPP test to
      make sure REAL_HPAGE_SHIFT and the _PAGE_SZHUGE_* we use match up.
      
      This makes the pgtable cache completely unused, so remove the code
      managing it and the state used in mm_context_t.  Now fewer spinlocks
      are taken in the page table allocation path.
      
      The technique we use to fabricate the 8MB pages is to transfer bit 22
      from the missing virtual address into the PTE's physical address field.
      That takes care of the transparent huge pages case.
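
      A minimal C sketch of that bit-transfer trick; the helper below is
      hypothetical, while REAL_HPAGE_SHIFT and HPAGE_SHIFT name the shifts
      for the 4MB hardware page and the 8MB software huge page:

      	#define REAL_HPAGE_SHIFT	22	/* 4MB hardware TLB page */
      	#define HPAGE_SHIFT		23	/* 8MB software huge page */
      	#define REAL_HPAGE_SIZE		(1UL << REAL_HPAGE_SHIFT)

      	/* Hypothetical helper: compute the physical address to load into
      	 * the TLB for a fault inside an 8MB software huge page.  Bit 22 of
      	 * the faulting virtual address picks which 4MB physical half gets
      	 * mapped.
      	 */
      	static unsigned long huge_tlb_paddr(unsigned long pte_paddr,
      					    unsigned long fault_vaddr)
      	{
      		return pte_paddr | (fault_vaddr & REAL_HPAGE_SIZE);
      	}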
      
      For hugetlb, we fill things in at the PTE level and that code already
      puts the sub huge page physical bits into the PTEs, based upon the
      offset, so there is nothing special we need to do.  It all just works
      out.
      
      So, a small amount of complexity in the THP case, but this code is
      about to get much simpler when we move to 64-bit PMDs, as we can move
      away from the fancy 32-bit huge PMD encoding and just put a real PTE
      value in there.
      
      With bug fixes and help from Bob Picco.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      37b3a8ff
  3. 20 Apr, 2013 · 1 commit
    • D
      sparc64: Fix race in TLB batch processing. · f36391d2
      Committed by David S. Miller
      As reported by Dave Kleikamp, when we emit cross calls to do batched
      TLB flush processing we have a race because we do not synchronize on
      the sibling cpus completing the cross call.
      
      So meanwhile the TLB batch can be reset (tb->tlb_nr set to zero, etc.)
      and either flushes are missed or flushes will flush the wrong
      addresses.
      
      Fix this by using generic infrastructure to synchronize on the
      completion of the cross call.
      
      This first required getting the flush_tlb_pending() call out from
      switch_to() which operates with locks held and interrupts disabled.
      The problem is that smp_call_function_many() cannot be invoked with
      IRQs disabled and this is explicitly checked for with WARN_ON_ONCE().
      
      We get the batch processing outside of locked IRQ disabled sections by
      using some ideas from the powerpc port. Namely, we only batch inside
      of arch_{enter,leave}_lazy_mmu_mode() calls.  If we're not in such a
      region, we flush TLBs synchronously.
      
      1) Get rid of xcall_flush_tlb_pending and per-cpu type
         implementations.
      
      2) Do TLB batch cross calls instead via:
      
      	smp_call_function_many()
      		tlb_pending_func()
      			__flush_tlb_pending()
      
      3) Batch only in lazy mmu sequences (see the sketch after this list):
      
      	a) Add 'active' member to struct tlb_batch
      	b) Define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
      	c) Set 'active' in arch_enter_lazy_mmu_mode()
      	d) Run batch and clear 'active' in arch_leave_lazy_mmu_mode()
      	e) Check 'active' in tlb_batch_add_one() and do a synchronous
                 flush if it's clear.
      
      4) Add infrastructure for synchronous TLB page flushes.
      
      	a) Implement __flush_tlb_page and per-cpu variants, patch
      	   as needed.
      	b) Likewise for xcall_flush_tlb_page.
      	c) Implement smp_flush_tlb_page() to invoke the cross-call.
      	d) Wire up global_flush_tlb_page() to the right routine based
                 upon CONFIG_SMP
      
      5) It turns out that singleton batches are very common, 2 out of every
         3 batch flushes have only a single entry in them.
      
         The batch flush waiting is very expensive, both because of the poll
         on sibling cpu completion, as well as because passing the tlb batch
         pointer to the sibling cpus invokes a shared memory dereference.
      
         Therefore, in flush_tlb_pending(), if there is only one entry in
         the batch perform a completely asynchronous global_flush_tlb_page()
         instead.
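
      A hedged sketch of the 'active' scheme from item 3 above; the types
      are simplified and the two flush helpers are hypothetical stand-ins,
      not the routines this patch adds:

      	#include <linux/mm_types.h>
      	#include <linux/percpu.h>

      	struct tlb_batch_sketch {
      		bool		active;
      		unsigned long	tlb_nr;
      		/* ... pending virtual addresses ... */
      	};

      	static DEFINE_PER_CPU(struct tlb_batch_sketch, batch_sketch);

      	static void enter_lazy_mmu_sketch(void)
      	{
      		this_cpu_ptr(&batch_sketch)->active = true;
      	}

      	static void batch_add_one_sketch(struct mm_struct *mm, unsigned long vaddr)
      	{
      		struct tlb_batch_sketch *tb = this_cpu_ptr(&batch_sketch);

      		if (!tb->active) {
      			sync_flush_one_page(mm, vaddr);	/* hypothetical synchronous flush */
      			return;
      		}
      		/* ... otherwise queue vaddr into the batch ... */
      	}

      	static void leave_lazy_mmu_sketch(void)
      	{
      		struct tlb_batch_sketch *tb = this_cpu_ptr(&batch_sketch);

      		if (tb->tlb_nr)
      			run_pending_batch(tb);	/* hypothetical batch flush */
      		tb->active = false;
      	}
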
      Reported-by: Dave Kleikamp <dave.kleikamp@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
      f36391d2
  4. 21 Feb, 2013 · 1 commit
  5. 09 Oct, 2012 · 3 commits
    • D
      sparc64: Support transparent huge pages. · 9e695d2e
      Committed by David Miller
      This is relatively easy since PMD's now cover exactly 4MB of memory.
      
      Our PMD entries are 32-bits each, so we use a special encoding.  The
      lowest bit, PMD_ISHUGE, determines the interpretation.  This is possible
      because sparc64's page tables are purely software entities so we can use
      whatever encoding scheme we want.  We just have to make the TLB miss
      assembler page table walkers aware of the layout.
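
      A hedged sketch of the low-bit test; PMD_ISHUGE and its placement come
      from the description above, while the helper name is illustrative:

      	#define PMD_ISHUGE	0x00000001U	/* lowest bit selects the interpretation */

      	/* Hypothetical helper: walkers test the low bit to decide whether
      	 * the 32-bit PMD is a huge-page descriptor or points to a PTE table.
      	 */
      	static inline int pmd_is_huge_sketch(unsigned int pmd_val)
      	{
      		return pmd_val & PMD_ISHUGE;
      	}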
      
      set_pmd_at() works much like set_pte_at(), but it has to operate in two
      regimes.  In the first we are going from non-huge-page to huge-page, and
      the existing PMD maps a table of non-huge PTEs, so we have to queue up
      TLB flushes based upon what mappings are valid in the PTE table.  In the
      second regime we are going from huge-page to non-huge-page, and in that
      case we need only queue up a single TLB flush to push out the huge page
      mapping.
      
      We still have 5 bits remaining in the huge PMD encoding so we can very
      likely support any new pieces of THP state tracking that might get added
      in the future.
      
      With lots of help from Johannes Weiner.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9e695d2e
    • D
      sparc64: Eliminate PTE table memory wastage. · c460bec7
      Committed by David Miller
      We've split up the PTE tables so that they take up half a page instead of
      a full page.  This is in order to facilitate transparent huge page
      support, which works much better if our PMDs cover 4MB instead of 8MB.
      
      What we do is have a one-behind cache for PTE table allocations in the
      mm struct.
      
      This logic triggers only on allocations.  For example, we don't try to
      keep track of free'd up page table blocks in the style that the s390 port
      does.
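
      A hedged sketch of such a one-behind cache; the 'pte_cache' slot and
      the helper name are hypothetical stand-ins for the state this patch
      keeps in the mm struct:

      	#include <linux/gfp.h>
      	#include <linux/mm.h>

      	static pte_t *pte_table_alloc_sketch(struct mm_struct *mm)
      	{
      		/* 'pte_cache' is a hypothetical unsigned long slot in mm->context. */
      		pte_t *pte = (pte_t *) mm->context.pte_cache;
      		struct page *page;

      		if (pte) {
      			/* Hand out the half page remembered from last time. */
      			mm->context.pte_cache = 0UL;
      			return pte;
      		}

      		page = alloc_page(GFP_KERNEL | __GFP_ZERO);
      		if (!page)
      			return NULL;

      		/* Use the first half now and remember the second half for
      		 * the next allocation -- the "one-behind" part.
      		 */
      		pte = (pte_t *) page_address(page);
      		mm->context.pte_cache = (unsigned long) pte + (PAGE_SIZE / 2);
      		return pte;
      	}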
      
      There were only two slightly annoying aspects to this change:
      
      1) Changing pgtable_t to be a "pte_t *".  There's all of this special
         logic in the TLB free paths that needed adjustments, as did the
         PMD populate interfaces.
      
      2) init_new_context() needs to zap the pointer, since the mm struct
         just gets copied from the parent on fork.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c460bec7
    • D
      sparc64: Only support 4MB huge pages and 8KB base pages. · 15b9350a
      Committed by David Miller
      Narrowing the scope of the page size configurations will make the
      transparent hugepage changes much simpler.
      
      In the end what we really want to do is have the kernel support multiple
      huge page sizes and use whatever is appropriate as the context dictates.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      15b9350a
  6. 29 Mar, 2012 · 1 commit
  7. 26 Jul, 2011 · 1 commit
  8. 08 Jun, 2011 · 1 commit
  9. 25 May, 2011 · 1 commit
    • P
      sparc: mmu_gather rework · 90f08e39
      Committed by Peter Zijlstra
      Rework the sparc mmu_gather usage to conform to the new world order :-)
      
      Sparc mmu_gather does two things:
       - tracks vaddrs to unhash
       - tracks pages to free
      
      Split these two things like powerpc has done and keep the vaddrs
      in per-cpu data structures and flush them on context switch.
      
      The remaining bits can then use the generic mmu_gather.
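
      A hedged sketch of the per-cpu side of that split; the batch size and
      field names here are illustrative:

      	#include <linux/mm_types.h>
      	#include <linux/percpu.h>

      	#define TLB_BATCH_NR	192	/* illustrative batch size */

      	/* Virtual addresses waiting to be unhashed/flushed; drained on
      	 * context switch.  Page freeing goes through the generic
      	 * mmu_gather instead.
      	 */
      	struct tlb_batch_sketch {
      		struct mm_struct	*mm;
      		unsigned long		tlb_nr;
      		unsigned long		vaddrs[TLB_BATCH_NR];
      	};

      	static DEFINE_PER_CPU(struct tlb_batch_sketch, tlb_batch_sketch);
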
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: David Miller <davem@davemloft.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      90f08e39
  10. 30 Mar, 2010 · 1 commit
    • T
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h · 5a0e3ad6
      Committed by Tejun Heo
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
      percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities to include those
      headers directly instead of assuming availability.  As this conversion
      needs to touch a large number of source files, the following script is
      used as the basis of conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the following:
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there, i.e. if only gfp is used,
        gfp.h; if slab is used, slab.h.
      
      * When the script inserts a new include, it looks at the include
        blocks and tries to put the new include such that its order conforms
        to its surrounding.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
        doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have a fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
        file.
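
      For example (an illustrative file, not one from the patch), a .c file
      that only reached the slab API through the implicit percpu.h -> slab.h
      chain ends up with an explicit include:

      	#include <linux/slab.h>	/* added by the sweep: kmalloc()/kfree() */

      	static void *make_buffer(size_t len)
      	{
      		/* previously compiled only because percpu.h dragged slab.h in */
      		return kmalloc(len, GFP_KERNEL);
      	}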
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition while adding it to implementation .h or
         embedding .c file was more appropriate for others.  This step added
         inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         widely available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build tests were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers which should be easily discoverable on most builds of
      the specific arch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      5a0e3ad6
  11. 05 Dec, 2008 · 3 commits
  12. 05 Aug, 2008 · 1 commit
  13. 18 Jul, 2008 · 1 commit
    • D
      sparc64: Remove 4MB and 512K base page size options. · f7fe9334
      Committed by David S. Miller
      Adrian Bunk reported that enabling 4MB page size breaks the build.
      The problem is that MAX_ORDER combined with the page shift exceeds the
      SECTION_SIZE_BITS we use in asm-sparc64/sparsemem.h
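
      To see the arithmetic: with a 4MB base page, PAGE_SHIFT is 22, and with
      the default MAX_ORDER of 11 the largest buddy allocation spans
      22 + 10 = 32 bits, which trips the sparsemem build-time sanity check,
      which has roughly this shape:

      	#if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
      	#error Allocator MAX_ORDER exceeds SECTION_SIZE
      	#endif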
      
      There are several ways I suppose we could work around this.  For one
      we could define a CONFIG_FORCE_MAX_ZONEORDER to decrease MAX_ORDER in
      these higher page size cases.
      
      But I also know that these page size cases are broken wrt. TLB miss
      handling especially on pre-hypervisor systems, and there isn't an easy
      way to fix that.
      
      These options were meant to be fun experimental hacks anyways, and
      only 8K and 64K make any sense to support.
      
      So remove 512K and 4M base page size support.  Of course, we still
      support these page sizes for huge pages.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f7fe9334
  14. 24 Apr, 2008 · 1 commit
  15. 01 Nov, 2007 · 1 commit
  16. 20 Jul, 2007 · 1 commit
    • P
      mm: Remove slab destructors from kmem_cache_create(). · 20c2df83
      Committed by Paul Mundt
      Slab destructors were no longer supported after Christoph's
      c59def9f change. They've been
      BUGs for both slab and slub, and slob never supported them
      either.
      
      This rips out support for the dtor pointer from kmem_cache_create()
      completely and fixes up every single callsite in the kernel (there were
      about 224, not including the slab allocator definitions themselves,
      or the documentation references).
      Signed-off-by: Paul Mundt <lethal@linux-sh.org>
      20c2df83
  17. 08 May, 2007 · 3 commits
  18. 08 Dec, 2006 · 1 commit
  19. 22 Mar, 2006 · 1 commit
  20. 20 Mar, 2006 · 15 commits
    • D
      [SPARC64]: Optimized TSB table initialization. · bb8646d8
      Committed by David S. Miller
      We only need to write an invalid tag every 16 bytes,
      so taking advantage of this can save many instructions
      compared to the simple memset() call we make now.
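
      A hedged C sketch of the idea; the real patch does this in assembler
      with prefetching (sun4u) and block-init stores (Niagara), but the
      16-byte entry layout of an 8-byte tag followed by an 8-byte PTE is why
      one store per entry is enough:

      	struct tsb_entry_sketch {
      		unsigned long tag;	/* only this word must be marked invalid */
      		unsigned long pte;
      	};

      	static void tsb_init_sketch(struct tsb_entry_sketch *tsb,
      				    unsigned long nentries, unsigned long invalid_tag)
      	{
      		unsigned long i;

      		for (i = 0; i < nentries; i++)
      			tsb[i].tag = invalid_tag;	/* one store per 16-byte entry */
      	}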
      
      A prefetching implementation is implemented for sun4u
      and a block-init store version is implemented for Niagara.
      
      The next trick is to be able to perform an init and
      a copy_tsb() in parallel when growing a TSB table.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bb8646d8
    • D
      [SPARC64]: Use SLAB caches for TSB tables. · 9b4006dc
      Committed by David S. Miller
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9b4006dc
    • D
      [SPARC64]: Don't kill the page allocator when growing a TSB. · b52439c2
      Committed by David S. Miller
      Try only lightly on > 1 order allocations.
      
      If a grow fails, we are under memory pressure, so do not try
      to grow the TSB for this address space any more.
      
      If a > 0 order TSB allocation fails on a new fork, retry using
      a 0 order allocation.
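
      A hedged sketch of that allocation policy using the standard GFP flags;
      the helper itself is illustrative, not the code this patch adds:

      	#include <linux/gfp.h>

      	static struct page *tsb_alloc_pages_sketch(unsigned int order)
      	{
      		gfp_t gfp = GFP_KERNEL;

      		if (order > 0)
      			/* Try only lightly: no retries, no allocation-failure warning. */
      			gfp |= __GFP_NORETRY | __GFP_NOWARN;

      		return alloc_pages(gfp, order);
      	}
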
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b52439c2
    • D
      [SPARC64]: Fix and re-enable dynamic TSB sizing. · 7a1ac526
      Committed by David S. Miller
      This is good for up to a 50% performance improvement on some test cases.
      The problem has been the race conditions, and hopefully I've plugged
      them all up here.
      
      1) There was a serious race in switch_mm() wrt. lazy TLB
         switching to and from kernel threads.
      
         We could erroneously skip a tsb_context_switch() and thus
         use a stale TSB across a TSB grow event.
      
         There is a big comment now in that function describing
         exactly how it can happen.
      
      2) All code paths that do something with the TSB need to be
         guarded with the mm->context.lock spinlock.  This makes
         page table flushing paths properly synchronize with both
         TSB growing and TLB context changes.
      
      3) TSB growing events are moved to the end of successful fault
         processing.  Previously it was in update_mmu_cache() but
         that is deadlock prone.  At the end of do_sparc64_fault()
         we hold no spinlocks that could deadlock the TSB grow
         sequence.  We also have dropped the address space semaphore.
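
      A hedged illustration of point 2 above (simplified; not the actual
      flush code): every TSB-touching path brackets its work with
      mm->context.lock, which is what lets it synchronize with TSB growing
      and TLB context changes.

      	#include <linux/mm_types.h>
      	#include <linux/spinlock.h>

      	static void flush_tsb_user_sketch(struct mm_struct *mm)
      	{
      		unsigned long flags;

      		spin_lock_irqsave(&mm->context.lock, flags);
      		/* ... look up the current TSB pointer and size, then
      		 * invalidate the entries for the flushed virtual addresses ...
      		 */
      		spin_unlock_irqrestore(&mm->context.lock, flags);
      	}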
      
      While we're here, add prefetching to the copy_tsb() routine
      and put it in assembler into the tsb.S file.  This piece of
      code is quite time critical.
      
      There are some small negative side effects to this code which
      can be improved upon.  In particular we grab the mm->context.lock
      even for the tsb insert done by update_mmu_cache() now and that's
      a bit excessive.  We can get rid of that locking, and the same
      lock taking in flush_tsb_user(), by disabling PSTATE_IE around
      the whole operation including the capturing of the tsb pointer
      and tsb_nentries value.  That would work because anyone growing
      the TSB won't free up the old TSB until all cpus respond to the
      TSB change cross call.
      
      I'm not quite so confident in that optimization to put it in
      right now, but eventually we might be able to and the description
      is here for reference.
      
      This code seems very solid now.  It passes several parallel GCC
      bootstrap builds, and our favorite "nut cruncher" stress test which is
      a full "make -j8192" build of a "make allmodconfig" kernel.  That puts
      about 256 processes on each cpu's run queue, makes lots of process cpu
      migrations occur, causes lots of page table and TLB flushing activity,
      incurs many context version number changes, and it swaps the machine
      real far out to disk even though there is 16GB of ram on this test
      system. :-)
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7a1ac526
    • D
      [SPARC64]: Bulletproof MMU context locking. · a77754b4
      Committed by David S. Miller
      1) Always spin_lock_init() in init_context().  The caller essentially
         clears it out, or copies the mm info from the parent.  In both
         cases we need to explicitly initialize the spinlock.
      
      2) Always do explicit IRQ disabling while taking mm->context.lock
         and ctx_alloc_lock.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a77754b4
    • D
      [SPARC64]: destroy_context() needs to disable interrupts. · 77b838fa
      Committed by David S. Miller
      get_new_mmu_context() can be invoked from interrupt context
      now for the new SMP version wrap handling.
      
      So disable interrupt while taking ctx_alloc_lock in destroy_context()
      so we don't deadlock.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      77b838fa
    • D
      [SPARC64]: More TLB/TSB handling fixes. · 8b234274
      Committed by David S. Miller
      The SUN4V convention with non-shared TSBs is that the context
      bit of the TAG is clear.  So we have to choose an "invalid"
      bit and initialize new TSBs appropriately.  Otherwise a zero
      TAG looks "valid".
      
      Make sure, for the window fixup cases, that we use the right
      global registers and that we don't potentially trample on
      the live global registers in etrap/rtrap handling (%g2 and
      %g6) and that we put the missing virtual address properly
      in %g5.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8b234274
    • D
      [SPARC64]: Fix flush_tsb_user() on SUN4V. · de635d83
      Committed by David S. Miller
      Needs to use physical addressing just like cheetah_plus.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      de635d83
    • D
      [SPARC64]: Deal with PTE layout differences in SUN4V. · c4bce90e
      Committed by David S. Miller
      Yes, you heard it right, they changed the PTE layout for
      SUN4V.  Ho hum...
      
      This is the simple and inefficient way to support this.
      It'll get optimized, don't worry.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c4bce90e
    • D
      618e9ed9
    • D
      [SPARC64]: Turn off TSB growing for now. · f4e841da
      Committed by David S. Miller
      There are several tricky races involved with growing the TSB.  So just
      use base-size TSBs for user contexts and we can revisit enabling this
      later.
      
      One part of the SMP problems is that tsb_context_switch() can see
      partially updated TSB configuration state if tsb_grow() is running in
      parallel.  That's easily solved with a seqlock taken as a writer by
      tsb_grow() and taken as a reader to capture all the TSB config state
      in tsb_context_switch().
      
      Then there is flush_tsb_user() running in parallel with a tsb_grow().
      In theory we could take the seqlock as a reader there too, and just
      resample the TSB pointer and reflush but that looks really ugly.
      
      Lastly, I believe there is a case with threads that results in a TSB
      entry lock bit being set spuriously which will cause the next access
      to that TSB entry to wedge the cpu (since the TSB entry lock bit will
      never clear).  It's either copy_tsb() or some bug elsewhere in the TSB
      assembly.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f4e841da
    • D
      [SPARC64]: Access TSB with physical addresses when possible. · 517af332
      Committed by David S. Miller
      This way we don't need to lock the TSB into the TLB.
      The trick is that every TSB load/store is registered into
      a special instruction patch section.  The default uses
      virtual addresses, and the patch instructions use physical
      address load/stores.
      
      We can't do this on all chips because only cheetah+ and later
      have the physical variant of the atomic quad load.
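
      A hedged sketch of the patch-section record and the boot-time pass; the
      structure layout and the loop are illustrative, and flushi() stands for
      the arch's instruction-cache flush helper:

      	/* One record per patched TSB access; layout is illustrative. */
      	struct tsb_phys_patch_entry_sketch {
      		unsigned long	addr;	/* location of the virtual-address access */
      		unsigned int	insn;	/* replacement physical-address load/store */
      	};

      	static void tsb_phys_patch_sketch(struct tsb_phys_patch_entry_sketch *start,
      					  struct tsb_phys_patch_entry_sketch *end)
      	{
      		struct tsb_phys_patch_entry_sketch *p;

      		for (p = start; p < end; p++) {
      			unsigned int *insn = (unsigned int *) p->addr;

      			*insn = p->insn;	/* switch to the physical-ASI variant */
      			flushi(insn);		/* flush the patched instruction */
      		}
      	}
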
      Signed-off-by: David S. Miller <davem@davemloft.net>
      517af332
    • D
      2f7ee7c6
    • D
      [SPARC64]: Fix incorrect TSB lock bit handling. · 4753eb2a
      Committed by David S. Miller
      The TSB_LOCK_BIT define is actually a special
      value shifted down by 32-bits for the assembler
      code macros.
      
      In C code, this isn't what we want.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4753eb2a