1. 12 October 2015, 1 commit
    • powerpc/mm: Differentiate between hugetlb and THP during page walk · 891121e6
      Authored by Aneesh Kumar K.V
      We need to properly identify whether a hugepage is an explicit or
      a transparent hugepage in follow_huge_addr(). We used to depend
      on the hugepage shift argument to do that, but in some cases that can
      give the wrong result. For example:

      On finding a transparent hugepage we set the hugepage shift to PMD_SHIFT.
      But we can end up clearing the THP pte via pmdp_huge_get_and_clear().
      We do prevent reuse of the pfn page via kick_all_cpus_sync(), but that
      happens only after we have already updated the pte to 0. Hence in
      follow_huge_addr() we can find the hugepage shift set while the
      transparent huge page check fails for a THP pte.
      
      NOTE: We fixed a variant of this race against thp split in commit
      691e95fd
      ("powerpc/mm/thp: Make page table walk safe against thp split/collapse")
      
      Without this patch, we may hit the BUG_ON(flags & FOLL_GET) in
      follow_page_mask occasionally.
      
      In the long term, we may want to switch the ppc64 64K page size config
      to enable CONFIG_ARCH_WANT_GENERAL_HUGETLB.
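
      For reference, a minimal sketch of the distinction being made. It assumes
      the interface where the walker reports an explicit is_thp flag alongside
      the page-size shift; the surrounding helper is illustrative only, not the
      exact code in this patch:

      	/* Sketch: trust an explicit THP flag from the walker rather than
      	 * inferring "hugetlb" from a non-zero hugepage shift alone. */
      	static struct page *sketch_follow_huge(struct mm_struct *mm,
      					       unsigned long ea)
      	{
      		bool is_thp;
      		unsigned int shift;
      		pte_t *ptep;
      		struct page *page = NULL;

      		local_irq_disable();	/* keep the walk stable, see 691e95fd */
      		ptep = find_linux_pte_or_hugepte(mm->pgd, ea, &is_thp, &shift);
      		if (ptep && shift && !is_thp)
      			page = pte_page(*ptep);	/* explicit hugetlb page */
      		/* is_thp set: a THP, handled via the pmd paths instead */
      		local_irq_enable();
      		return page;
      	}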
      Reported-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      891121e6
  2. 18 August 2015, 1 commit
  3. 10 June 2015, 1 commit
    • powerpc/mm: Add trace point for tracking hash pte fault · cfcb3d80
      Authored by Aneesh Kumar K.V
      This enables us to understand how many hash faults we are taking
      when running benchmarks.

      For example:
      -bash-4.2# ./perf stat -e  powerpc:hash_fault -e page-faults /tmp/ebizzy.ppc64 -S 30  -P -n 1000
      ...
      
       Performance counter stats for '/tmp/ebizzy.ppc64 -S 30 -P -n 1000':
      
             1,10,04,075      powerpc:hash_fault
             1,10,03,429      page-faults
      
            30.865978991 seconds time elapsed
      
      NOTE:
      The impact of the tracepoint was not noticeable when running the test;
      it was within the run-time variance of the test. For example:
      
      without-patch:
      --------------
      
       Performance counter stats for './a.out 3000 300':
      
      	       643      page-faults               #    0.089 M/sec
      	  7.236562      task-clock (msec)         #    0.928 CPUs utilized
      	 2,179,213      stalled-cycles-frontend   #    0.00% frontend cycles idle
      	17,174,367      stalled-cycles-backend    #    0.00% backend  cycles idle
      		 0      context-switches          #    0.000 K/sec
      
             0.007794658 seconds time elapsed
      
      And with-patch:
      ---------------
      
       Performance counter stats for './a.out 3000 300':
      
      	       643      page-faults               #    0.089 M/sec
      	  7.233746      task-clock (msec)         #    0.921 CPUs utilized
      		 0      context-switches          #    0.000 K/sec
      
             0.007854876 seconds time elapsed
      
       Performance counter stats for './a.out 3000 300':
      
      	       643      page-faults               #    0.087 M/sec
      	       649      powerpc:hash_fault        #    0.087 M/sec
      	  7.430376      task-clock (msec)         #    0.938 CPUs utilized
      	 2,347,174      stalled-cycles-frontend   #    0.00% frontend cycles idle
      	17,524,282      stalled-cycles-backend    #    0.00% backend  cycles idle
      		 0      context-switches          #    0.000 K/sec
      
             0.007920284 seconds time elapsed
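
      For reference, a sketch of how such a tracepoint is typically declared with
      TRACE_EVENT() and fired from the hash fault path; the field list below is an
      assumption for illustration, not necessarily the exact event added here:

      	TRACE_EVENT(hash_fault,

      		TP_PROTO(unsigned long addr, unsigned long access,
      			 unsigned long trap),
      		TP_ARGS(addr, access, trap),

      		TP_STRUCT__entry(
      			__field(unsigned long, addr)
      			__field(unsigned long, access)
      			__field(unsigned long, trap)
      		),

      		TP_fast_assign(
      			__entry->addr   = addr;
      			__entry->access = access;
      			__entry->trap   = trap;
      		),

      		TP_printk("hash fault with addr 0x%lx access 0x%lx trap 0x%lx",
      			  __entry->addr, __entry->access, __entry->trap)
      	);

      	/* ...and in hash_page_mm(), before doing the actual hashing: */
      	trace_hash_fault(ea, access, trap);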
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      cfcb3d80
  4. 02 June 2015, 1 commit
  5. 23 April 2015, 1 commit
  6. 17 April 2015, 1 commit
    • powerpc/mm/thp: Make page table walk safe against thp split/collapse · 691e95fd
      Authored by Aneesh Kumar K.V
      We can disable a THP split or a hugepage collapse by disabling irqs.
      We send an IPI to all the cpus in the early part of split/collapse,
      and disabling the local irq ensures we don't make progress with
      split/collapse. If the THP is getting split we return NULL from
      find_linux_pte_or_hugepte(). For all the current callers that should be ok.
      We need to be careful if we want to use the returned pte_t pointer outside
      the irq-disabled region. With respect to a THP split, the pfn remains the
      same, but a hugepage collapse will result in a pfn change. There are a few
      steps we can take to avoid a hugepage collapse. One way is to take a page
      reference inside the irq-disabled region. Another option is to take
      mmap_sem so that a parallel collapse will not happen. We can also
      disable collapse by taking the pmd_lock. Another method, used by the kvm
      subsystem, is to check whether we had an mmu_notifier update in between,
      using mmu_notifier_retry(). A sketch of the basic caller pattern follows.
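
      A sketch of that caller pattern (error handling trimmed; the walker is the
      one named above, the wrapper around it is illustrative):

      	static long sketch_ea_to_pfn(struct mm_struct *mm, unsigned long ea,
      				     unsigned long *pfn)
      	{
      		unsigned long flags;
      		unsigned int shift;
      		pte_t *ptep, pte;

      		local_irq_save(flags);	/* split/collapse cannot complete now */
      		ptep = find_linux_pte_or_hugepte(mm->pgd, ea, &shift);
      		if (!ptep) {
      			local_irq_restore(flags);
      			return -EFAULT;	/* no mapping, or a split in progress */
      		}
      		pte = READ_ONCE(*ptep);	/* stable copy taken while irqs are off */
      		local_irq_restore(flags);

      		if (!pte_present(pte))
      			return -EFAULT;
      		*pfn = pte_pfn(pte);	/* stays valid across a split; a collapse
      					 * needs one of the measures listed above */
      		return 0;
      	}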
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      691e95fd
  7. 14 December 2014, 1 commit
    • mm/debug-pagealloc: make debug-pagealloc boottime configurable · 031bc574
      Authored by Joonsoo Kim
      Now we are prepared to avoid enabling debug-pagealloc at boot time.  So
      introduce a new kernel parameter that controls debug-pagealloc at boot, and
      make the related functions do nothing when it is disabled.

      The only non-intuitive part is the change to the guard page functions.  Because
      guard pages are effective only if debug-pagealloc is enabled, turning them off
      along with debug-pagealloc is the reasonable thing to do.
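
      A minimal sketch of the boot-time switch this describes; the variable and
      helper names below are assumptions that mirror the idea, not necessarily
      the exact symbols introduced:

      	static bool _debug_pagealloc_enabled __read_mostly;

      	/* "debug_pagealloc=on" on the kernel command line turns it on */
      	static int __init early_debug_pagealloc(char *buf)
      	{
      		if (buf && !strcmp(buf, "on"))
      			_debug_pagealloc_enabled = true;
      		return 0;
      	}
      	early_param("debug_pagealloc", early_debug_pagealloc);

      	static inline bool debug_pagealloc_enabled(void)
      	{
      		return _debug_pagealloc_enabled;
      	}

      	/* guard page / kernel_map_pages() paths then start with:
      	 *	if (!debug_pagealloc_enabled())
      	 *		return;
      	 */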
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      031bc574
  8. 05 December 2014, 1 commit
    • powerpc/mm: don't do tlbie for updatepp request with NO HPTE fault · aefa5688
      Authored by Aneesh Kumar K.V
      updatepp can get called for a no-HPTE fault when we find from the Linux
      page table that the translation was hashed before. In that case
      we are sure that there is no existing translation, hence we can
      avoid doing the tlbie.

      We could possibly race with a parallel fault filling the TLB, but
      that should be ok because updatepp is only ever relaxing permissions.
      We also look at the Linux pte permission bits when filling the hash pte
      permission bits. We also hold the Linux pte busy bits while
      inserting/updating a hashpte entry, hence a parallel update of the
      Linux pte is not possible. On the other hand, mprotect involves
      ptep_modify_prot_start, which causes an hpte invalidate and not an updatepp.
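
      Roughly, the idea looks like this (the flag names follow this series; the
      helper wrapping them is illustrative only):

      	static void sketch_invalidate_after_updatepp(unsigned long vpn,
      						     int bpsize, int apsize,
      						     int ssize, unsigned long flags)
      	{
      		/* The fault path saw the translation was never hashed before
      		 * (no _PAGE_HASHPTE), so no stale TLB entry can exist and the
      		 * costly tlbie broadcast can be skipped entirely. */
      		if (flags & HPTE_NOHPTE_UPDATE)
      			return;

      		tlbie(vpn, bpsize, apsize, ssize,
      		      !!(flags & HPTE_LOCAL_UPDATE));
      	}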
      
      Performance numbers:
      We use random_access_bench written by Anton.
      
      Kernel with THP disabled and smaller hash page table size.
      
          86.60%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_updatepp
           2.10%  random_access_b  random_access_bench              [.] doit
           1.99%  random_access_b  [kernel.kallsyms]                [k] .do_raw_spin_lock
           1.85%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_insert
           1.26%  random_access_b  [kernel.kallsyms]                [k] .native_flush_hash_range
           1.18%  random_access_b  [kernel.kallsyms]                [k] .__delay
           0.69%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_remove
           0.37%  random_access_b  [kernel.kallsyms]                [k] .clear_user_page
           0.34%  random_access_b  [kernel.kallsyms]                [k] .__hash_page_64K
           0.32%  random_access_b  [kernel.kallsyms]                [k] fast_exception_return
           0.30%  random_access_b  [kernel.kallsyms]                [k] .hash_page_mm
      
      With Fix:
      
          27.54%  random_access_b  random_access_bench              [.] doit
          22.90%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_insert
           5.76%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_remove
           5.20%  random_access_b  [kernel.kallsyms]                [k] fast_exception_return
           5.12%  random_access_b  [kernel.kallsyms]                [k] .__hash_page_64K
           4.80%  random_access_b  [kernel.kallsyms]                [k] .hash_page_mm
           3.31%  random_access_b  [kernel.kallsyms]                [k] data_access_common
           1.84%  random_access_b  [kernel.kallsyms]                [k] .trace_hardirqs_on_caller
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      aefa5688
  9. 02 December 2014, 2 commits
  10. 03 November 2014, 1 commit
    • powerpc: Replace __get_cpu_var uses · 69111bac
      Authored by Christoph Lameter
      This still has not been merged and now powerpc is the only arch that does
      not have this change. Sorry about missing linuxppc-dev before.
      
      V2->V2
        - Fix up to work against 3.18-rc1
      
      __get_cpu_var() is used for multiple purposes in the kernel source. One of
      them is address calculation via the form &__get_cpu_var(x).  This calculates
      the address for the instance of the percpu variable of the current processor
      based on an offset.
      
      Other use cases are for storing and retrieving data from the current
      processors percpu area.  __get_cpu_var() can be used as an lvalue when
      writing data or on the right side of an assignment.
      
      __get_cpu_var() is defined as:

      #define __get_cpu_var(var) (*this_cpu_ptr(&(var)))

      __get_cpu_var() always only does an address determination. However, store
      and retrieve operations could use a segment prefix (or global register on
      other platforms) to avoid the address calculation.
      
      this_cpu_write() and this_cpu_read() can directly take an offset into a
      percpu area and use optimized assembly code to read and write per cpu
      variables.
      
      This patch converts __get_cpu_var into either an explicit address
      calculation using this_cpu_ptr() or into a use of this_cpu operations that
      use the offset.  Thereby address calculations are avoided and fewer registers
      are used when code is generated.
      
      At the end of the patch set all uses of __get_cpu_var have been removed so
      the macro is removed too.
      
      The patch set includes passes over all arches as well. Once these operations
      are used throughout, then specialized macros can be defined in non-x86
      arches as well in order to optimize per cpu access, e.g. by using a global
      register that may be set to the per cpu base.
      
      Transformations done to __get_cpu_var()
      
      1. Determine the address of the percpu instance of the current processor.
      
      	DEFINE_PER_CPU(int, y);
      	int *x = &__get_cpu_var(y);
      
          Converts to
      
      	int *x = this_cpu_ptr(&y);
      
      2. Same as #1 but this time an array structure is involved.
      
      	DEFINE_PER_CPU(int, y[20]);
      	int *x = __get_cpu_var(y);
      
          Converts to
      
      	int *x = this_cpu_ptr(y);
      
      3. Retrieve the content of the current processors instance of a per cpu
      variable.
      
      	DEFINE_PER_CPU(int, y);
      	int x = __get_cpu_var(y)
      
         Converts to
      
      	int x = __this_cpu_read(y);
      
      4. Retrieve the content of a percpu struct
      
      	DEFINE_PER_CPU(struct mystruct, y);
      	struct mystruct x = __get_cpu_var(y);
      
         Converts to
      
      	memcpy(&x, this_cpu_ptr(&y), sizeof(x));
      
      5. Assignment to a per cpu variable
      
      	DEFINE_PER_CPU(int, y)
      	__get_cpu_var(y) = x;
      
         Converts to
      
      	__this_cpu_write(y, x);
      
      6. Increment/Decrement etc of a per cpu variable
      
      	DEFINE_PER_CPU(int, y);
      	__get_cpu_var(y)++
      
         Converts to
      
      	__this_cpu_inc(y)
      
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      CC: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Christoph Lameter <cl@linux.com>
      [mpe: Fix build errors caused by set/or_softirq_pending(), and rework
            assignment in __set_breakpoint() to use memcpy().]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      69111bac
  11. 08 October 2014, 3 commits
  12. 25 September 2014, 3 commits
  13. 27 August 2014, 2 commits
    • Revert "powerpc: Replace __get_cpu_var uses" · 23f66e2d
      Authored by Tejun Heo
      This reverts commit 5828f666 due to
      build failure after merging with pending powerpc changes.
      
      Link: http://lkml.kernel.org/g/20140827142243.6277eaff@canb.auug.org.au
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      23f66e2d
    • powerpc: Replace __get_cpu_var uses · 5828f666
      Authored by Christoph Lameter
      __get_cpu_var() is used for multiple purposes in the kernel source. One of
      them is address calculation via the form &__get_cpu_var(x).  This calculates
      the address for the instance of the percpu variable of the current processor
      based on an offset.
      
      Other use cases are for storing and retrieving data from the current
      processors percpu area.  __get_cpu_var() can be used as an lvalue when
      writing data or on the right side of an assignment.
      
      __get_cpu_var() is defined as :
      
      #define __get_cpu_var(var) (*this_cpu_ptr(&(var)))
      
      __get_cpu_var() always only does an address determination. However, store
      and retrieve operations could use a segment prefix (or global register on
      other platforms) to avoid the address calculation.
      
      this_cpu_write() and this_cpu_read() can directly take an offset into a
      percpu area and use optimized assembly code to read and write per cpu
      variables.
      
      This patch converts __get_cpu_var into either an explicit address
      calculation using this_cpu_ptr() or into a use of this_cpu operations that
      use the offset.  Thereby address calculations are avoided and fewer registers
      are used when code is generated.
      
      At the end of the patch set all uses of __get_cpu_var have been removed so
      the macro is removed too.
      
      The patch set includes passes over all arches as well. Once these operations
      are used throughout, then specialized macros can be defined in non-x86
      arches as well in order to optimize per cpu access, e.g. by using a global
      register that may be set to the per cpu base.
      
      Transformations done to __get_cpu_var()
      
      1. Determine the address of the percpu instance of the current processor.
      
      	DEFINE_PER_CPU(int, y);
      	int *x = &__get_cpu_var(y);
      
          Converts to
      
      	int *x = this_cpu_ptr(&y);
      
      2. Same as #1 but this time an array structure is involved.
      
      	DEFINE_PER_CPU(int, y[20]);
      	int *x = __get_cpu_var(y);
      
          Converts to
      
      	int *x = this_cpu_ptr(y);
      
      3. Retrieve the content of the current processors instance of a per cpu
      variable.
      
      	DEFINE_PER_CPU(int, y);
      	int x = __get_cpu_var(y)
      
         Converts to
      
      	int x = __this_cpu_read(y);
      
      4. Retrieve the content of a percpu struct
      
      	DEFINE_PER_CPU(struct mystruct, y);
      	struct mystruct x = __get_cpu_var(y);
      
         Converts to
      
      	memcpy(&x, this_cpu_ptr(&y), sizeof(x));
      
      5. Assignment to a per cpu variable
      
      	DEFINE_PER_CPU(int, y)
      	__get_cpu_var(y) = x;
      
         Converts to
      
      	__this_cpu_write(y, x);
      
      6. Increment/Decrement etc of a per cpu variable
      
      	DEFINE_PER_CPU(int, y);
      	__get_cpu_var(y)++
      
         Converts to
      
      	__this_cpu_inc(y)
      
      tj: Folded a fix patch.
          http://lkml.kernel.org/g/alpine.DEB.2.11.1408172143020.9652@gentwo.org
      
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      CC: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      5828f666
  14. 05 August 2014, 1 commit
  15. 28 July 2014, 2 commits
  16. 06 June 2014, 1 commit
    • powerpc/mm: Check paca psize is up to date for huge mappings · 09567e7f
      Authored by Michael Ellerman
      We have a bug in our hugepage handling which exhibits as an infinite
      loop of hash faults. If the fault is being taken in the kernel it will
      typically trigger the softlockup detector, or the RCU stall detector.
      
      The bug is as follows:
      
        1. mmap(0xa0000000, ..., MAP_FIXED | MAP_HUGE_TLB | MAP_ANONYMOUS ..)
        2. Slice code converts the slice psize to 16M.
        3. The code on lines 539-540 of slice.c in slice_get_unmapped_area()
           synchronises the mm->context with the paca->context. So the paca slice
           mask is updated to include the 16M slice.
        4. Either:
           * mmap() fails because there are no huge pages available.
           * mmap() succeeds and the mapping is then munmapped.
           In both cases the slice psize remains at 16M in both the paca & mm.
        5. mmap(0xa0000000, ..., MAP_FIXED | MAP_ANONYMOUS ..)
        6. The slice psize is converted back to 64K. Because of the check on line 539
           of slice.c we DO NOT update the paca->context. The paca slice mask is now
           out of sync with the mm slice mask.
        7. User/kernel accesses 0xa0000000.
        8. The SLB miss handler slb_allocate_realmode() **uses the paca slice mask**
           to create an SLB entry and inserts it in the SLB.
        9. With the 16M SLB entry in place the hardware does a hash lookup, no entry
           is found so a data access exception is generated.
       10. The data access handler calls do_page_fault() -> handle_mm_fault().
       11. __handle_mm_fault() creates a THP mapping with do_huge_pmd_anonymous_page().
       12. The hardware retries the access, there is still nothing in the hash table
           so once again a data access exception is generated.
       13. hash_page() calls into __hash_page_thp() and inserts a mapping in the
           hash. Although the THP mapping maps 16M the hashing is done using 64K
           as the segment page size.
       14. hash_page() returns immediately after calling __hash_page_thp(), skipping
           over the code at line 1125. Resulting in the mismatch between the
           paca->context and mm->context not being detected.
       15. The hardware retries the access, the hash it generates using the 16M
           SLB entry does NOT match the hash we inserted.
       16. We take another data access and go into __hash_page_thp().
       17. We see a valid entry in the hpte_slot_array and so we call updatepp()
           which succeeds.
       18. Goto 15.
      
      We could fix this in two ways. The first would be to remove or modify
      the check on line 539 of slice.c.
      
      The second option is to cause the check of paca psize in hash_page() on
      line 1125 to also be done for THP pages.
      
      We prefer the latter, because the check & update of the paca psize is
      not done until we know it's necessary. It's also done only on the
      current cpu, so we don't need to IPI all other cpus.
      
      Without further rearranging the code, the simplest fix is to pull out
      the code that checks paca psize and call it in two places. Firstly for
      THP/hugetlb, and secondly for other mappings as before.
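
      Roughly, the factored-out helper looks like this (a sketch of the shape of
      the fix; the exact code in the patch may differ):

      	static void check_paca_psize(unsigned long ea, struct mm_struct *mm,
      				     int psize, int user_region)
      	{
      		if (user_region) {
      			if (psize != get_paca_psize(ea)) {
      				/* paca slice mask is stale: resync and rebolt */
      				get_paca()->context = mm->context;
      				slb_flush_and_rebolt();
      			}
      		} else if (get_paca()->vmalloc_sllp !=
      			   mmu_psize_defs[mmu_vmalloc_psize].sllp) {
      			get_paca()->vmalloc_sllp =
      				mmu_psize_defs[mmu_vmalloc_psize].sllp;
      			slb_vmalloc_update();
      		}
      	}

      	/* hash_page() then calls check_paca_psize() both on the THP/hugetlb
      	 * path and on the normal 4K/64K path. */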
      
      Thanks to Dave Jones for trinity, which originally found this bug.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      CC: stable@vger.kernel.org [v3.11+]
      09567e7f
  17. 01 May 2014, 1 commit
  18. 30 April 2014, 1 commit
  19. 29 April 2014, 1 commit
    • KVM guest: Make pv trampoline code executable · b18db0b8
      Authored by Alexander Graf
      Our PV guest patching code assembles chunks of instructions on the fly when it
      encounters more complicated instructions to hijack. These instructions need
      to live in a section that we don't mark as non-executable, as otherwise we
      fault when jumping there.
      
      Right now we put them into the .bss section, where they automatically get
      marked as non-executable. Add a check to the NX-setting function to ensure
      that we leave these particular pages executable.
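
      The shape of that check, as a hedged sketch (symbol names assumed; the
      trampolines are emitted into a kvm_tmp buffer in the guest kernel):

      	extern char kvm_tmp[];		/* PV patching emits trampolines here */
      	extern unsigned long kvm_tmp_size;	/* assumed export of the buffer size */

      	static inline bool overlaps_kvm_tmp(unsigned long start,
      					    unsigned long end)
      	{
      		return start < (unsigned long)kvm_tmp + kvm_tmp_size &&
      		       (unsigned long)kvm_tmp < end;
      	}

      	/* In the NX-setting loop: keep an executable protection for any
      	 * range where overlaps_kvm_tmp(addr, addr + step) is true. */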
      Signed-off-by: Alexander Graf <agraf@suse.de>
      b18db0b8
  20. 23 April 2014, 2 commits
  21. 11 February 2014, 1 commit
    • powerpc: Fix kdump hang issue on p8 with relocation on exception enabled. · 429d2e83
      Authored by Mahesh Salgaonkar
      On p8 systems, with relocation on exception feature enabled we are seeing
      kdump kernel hang at interrupt vector 0xc*4400. The reason is, with this
      feature enabled, exception are raised with MMU (IR=DR=1) ON with the
      default offset of 0xc*4000. Since exception is raised in virtual mode it
      requires the vector region to be executable without which it fails to
      fetch and execute instruction at 0xc*4xxx. For default kernel since kernel
      is loaded at real 0, the htab mappings sets the entire kernel text region
      executable. But for relocatable kernel (e.g. kdump case) we only copy
      interrupt vectors down to real 0 and never marked that region as
      executable because in p7 and below we always get exception in real mode.
      
      This patch fixes this issue by marking htab mapping range as executable
      that overlaps with the interrupt vector region for relocatable kernel.
      
      Thanks to Ben who helped me to debug this issue and find the root cause.
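
      A hedged sketch of the overlap test this implies (the interrupt-vector
      section symbols and the helper below are assumed for illustration, not
      necessarily the exact patch):

      	static bool overlaps_interrupt_vector_text(unsigned long start,
      						   unsigned long end)
      	{
      		unsigned long real_start, real_end;

      		/* offsets of the vectors within the image, which are also
      		 * their real addresses once copied down to real 0 */
      		real_start = (unsigned long)__start_interrupts -
      			     (unsigned long)_stext;
      		real_end   = (unsigned long)__end_interrupts -
      			     (unsigned long)_stext;

      		return start < (unsigned long)__va(real_end) &&
      		       (unsigned long)__va(real_start) < end;
      	}

      	/* htab_initialize() can then hash such ranges in with an executable
      	 * protection instead of the normal kernel protection. */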
      Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      429d2e83
  22. 09 December 2013, 1 commit
  23. 11 October 2013, 1 commit
  24. 14 August 2013, 1 commit
  25. 01 July 2013, 1 commit
    • powerpc: Delete __cpuinit usage from all users · 061d19f2
      Authored by Paul Gortmaker
      The __cpuinit type of throwaway sections might have made sense
      some time ago when RAM was more constrained, but now the savings
      do not offset the cost and complications.  For example, the fix in
      commit 5e427ec2 ("x86: Fix bit corruption at CPU resume time")
      is a good example of the nasty type of bugs that can be created
      with improper use of the various __init prefixes.
      
      After a discussion on LKML[1] it was decided that cpuinit should go
      the way of devinit and be phased out.  Once all the users are gone,
      we can then finally remove the macros themselves from linux/init.h.
      
      This removes all the powerpc uses of the __cpuinit macros.  There
      are no __CPUINIT users in assembly files in powerpc.
      
      [1] https://lkml.org/lkml/2013/5/20/589
      
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Josh Boyer <jwboyer@gmail.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Kumar Gala <galak@kernel.crashing.org>
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      061d19f2
  26. 21 June 2013, 4 commits
    • powerpc: Make linux pagetable walk safe with THP enabled · 0ac52dd7
      Authored by Aneesh Kumar K.V
      We need to have irqs disabled to handle all the possible parallel updates to
      the Linux page table without holding locks.

      Events that we are interested in while walking page tables are:
      1) Page fault
      2) unmap
      3) THP split
      4) THP collapse
      
      A) local_irq_disabled:
      ------------------------
      1) page fault:
      A none-to-valid transition via page fault is not an issue because we
      would either see a none or a valid entry. If it is none, we error out of
      the page table walk. We may need to use on-stack values when checking the
      type of page table elements, because if we do
      
      if (!is_hugepd()) {
          if (!pmd_none()) {
              if (pmd_bad()) {
      
      We could take the bad-pmd path because the pmd got converted to a hugepd
      after the !is_hugepd() check, via a hugetlb fault.

      The right way is to check for pmd_none higher up or to use the on-stack value.
      
      2) A valid to none conversion via unmap:
      We can safely walk the upper level table, because we don't remove the
      page table entries until the RCU grace period. So even if we followed a
      wrong pointer, the pointer stays valid until the grace period.

      A returned PTE pointer needs to be atomically checked for _PAGE_PRESENT and
      _PAGE_BUSY. A valid pointer returned could become none later. To prevent a
      pte_clear we take _PAGE_BUSY.
      
      3) THP split:
      A valid transparent hugepage is converted to a normal page. Before we split,
      we do pmd_splitting_flush, which sets the hugepage PTE to _PAGE_SPLITTING,
      so when walking the page table we need to check for pmd_trans_splitting and
      handle that. The returned pte also needs to be checked for
      _PAGE_SPLITTING, similar to _PAGE_PRESENT, before setting _PAGE_BUSY. We save
      the value of the PTE on the stack and check for the flag in the local pte value.
      If the flag is not set we can safely operate on the local pte value
      and we atomically set _PAGE_BUSY (see the sketch after this list).
      
      4) THP collapse:
      A normal page gets converted to a hugepage. In the collapse path, we
      mark the pmd none early (pmdp_clear_flush). With irqs disabled, if we
      are already walking the page table we would see the pmd_none and won't continue.
      If we see a valid PMD, we should still check for _PAGE_PRESENT before
      setting _PAGE_BUSY, to make sure we didn't collapse the PTE into a huge PTE.
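
      A sketch of the check-on-stack pattern from (3) above (the return values
      and the wrapper are illustrative; the flag names are the ones discussed here):

      	static int sketch_lock_pte(pte_t *ptep)
      	{
      		unsigned long old_pte, new_pte;

      		do {
      			old_pte = pte_val(*ptep);	/* on-stack copy */
      			if (old_pte & _PAGE_BUSY)
      				return -1;	/* another cpu is hashing it */
      			if (old_pte & _PAGE_SPLITTING)
      				return -1;	/* THP split in progress */
      			if (!(old_pte & _PAGE_PRESENT))
      				return -1;	/* unmapped/collapsed under us */
      			new_pte = old_pte | _PAGE_BUSY | _PAGE_ACCESSED;
      		} while (old_pte != __cmpxchg_u64((unsigned long *)ptep,
      						  old_pte, new_pte));

      		return 0;	/* pte now held busy; pte_clear must wait */
      	}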
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      0ac52dd7
    • powerpc/THP: Add code to handle HPTE faults for hugepages · 6d492ecc
      Authored by Aneesh Kumar K.V
      The deposited PTE page in the second half of the PMD table is used to
      track the state of the hash PTEs. After updating the HPTE, we mark the
      corresponding slot in the deposited PTE page valid.
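
      Roughly, the bookkeeping looks like this (helper names as used in this
      series; the fragment is illustrative, not the full hashing path):

      	unsigned char *hpte_slot_array;
      	unsigned long index, slot;

      	/* one byte of state per base-page-sized subpage of the huge page,
      	 * stored in the deposited (second-half) PTE page */
      	hpte_slot_array = get_hpte_slot_array(pmdp);
      	index = (ea & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;

      	/* ... insert/update the HPTE for this subpage, obtaining "slot" ... */

      	/* remember that this subpage now has a valid HPTE, and where */
      	mark_hpte_slot_valid(hpte_slot_array, index, slot);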
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      6d492ecc
    • powerpc: Replace find_linux_pte with find_linux_pte_or_hugepte · 12bc9f6f
      Authored by Aneesh Kumar K.V
      Replace find_linux_pte with find_linux_pte_or_hugepte and explicitly
      document why we don't need to handle transparent hugepages at callsites.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      12bc9f6f
    • powerpc/mm: handle hugepage size correctly when invalidating hpte entries · db3d8534
      Authored by Aneesh Kumar K.V
      If a hash bucket gets full, we "evict" a more/less random entry from it.
      When we do that we don't invalidate the TLB (hpte_remove) because we assume
      the old translation is still technically "valid". This implies that when
      we are invalidating or updating a pte, even if the HPTE entry is not valid
      we should do a tlb invalidate. With hugepages, we need to pass the correct
      actual page size value for tlb invalidation.
      
      This change updates patch 0608d692
      "powerpc/mm: Always invalidate tlb on hpte invalidate and update" to handle
      transparent hugepages correctly.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      db3d8534
  27. 14 May 2013, 1 commit
    • powerpc: Exception hooks for context tracking subsystem · ba12eede
      Authored by Li Zhong
      These are the exception hooks for the context tracking subsystem, covering
      data access, program check, single step, instruction breakpoint, machine check,
      alignment, fp unavailable, altivec assist and unknown exception, whose handlers
      might use RCU.
      
      This patch corresponds to
      [PATCH] x86: Exception hooks for userspace RCU extended QS
        commit 6ba3c97a
      
      But after the exception handling moved to generic code, and after the changes
      in the following two commits:

      56dd9470
        context_tracking: Move exception handling to generic code
      6c1e0256
        context_tracking: Restore correct previous context state on exception exit

      the exception hooks are able to use the generic code above instead of a
      redundant arch implementation.
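
      The hook pattern itself is small; a sketch using the generic
      context-tracking API (the handler name is a placeholder):

      	void do_example_exception(struct pt_regs *regs)
      	{
      		enum ctx_state prev_state = exception_enter();

      		/* ...existing handler body; may use RCU safely here... */

      		exception_exit(prev_state);
      	}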
      Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      ba12eede
  28. 06 May 2013, 1 commit
  29. 30 April 2013, 1 commit