1. 06 6月, 2014 1 次提交
    • M
      powerpc/mm: Check paca psize is up to date for huge mappings · 09567e7f
      Michael Ellerman 提交于
      We have a bug in our hugepage handling which exhibits as an infinite
      loop of hash faults. If the fault is being taken in the kernel it will
      typically trigger the softlockup detector, or the RCU stall detector.
      
      The bug is as follows:
      
       1. mmap(0xa0000000, ..., MAP_FIXED | MAP_HUGE_TLB | MAP_ANONYMOUS ..)
       2. Slice code converts the slice psize to 16M.
       3. The code on lines 539-540 of slice.c in slice_get_unmapped_area()
          synchronises the mm->context with the paca->context. So the paca slice
          mask is updated to include the 16M slice.
       3. Either:
          * mmap() fails because there are no huge pages available.
          * mmap() succeeds and the mapping is then munmapped.
          In both cases the slice psize remains at 16M in both the paca & mm.
       4. mmap(0xa0000000, ..., MAP_FIXED | MAP_ANONYMOUS ..)
       5. The slice psize is converted back to 64K. Because of the check on line 539
          of slice.c we DO NOT update the paca->context. The paca slice mask is now
          out of sync with the mm slice mask.
       6. User/kernel accesses 0xa0000000.
       7. The SLB miss handler slb_allocate_realmode() **uses the paca slice mask**
          to create an SLB entry and inserts it in the SLB.
      18. With the 16M SLB entry in place the hardware does a hash lookup, no entry
          is found so a data access exception is generated.
      19. The data access handler calls do_page_fault() -> handle_mm_fault().
      10. __handle_mm_fault() creates a THP mapping with do_huge_pmd_anonymous_page().
      11. The hardware retries the access, there is still nothing in the hash table
          so once again a data access exception is generated.
      12. hash_page() calls into __hash_page_thp() and inserts a mapping in the
          hash. Although the THP mapping maps 16M the hashing is done using 64K
          as the segment page size.
      13. hash_page() returns immediately after calling __hash_page_thp(), skipping
          over the code at line 1125. Resulting in the mismatch between the
          paca->context and mm->context not being detected.
      14. The hardware retries the access, the hash it generates using the 16M
          SLB entry does NOT match the hash we inserted.
      15. We take another data access and go into __hash_page_thp().
      16. We see a valid entry in the hpte_slot_array and so we call updatepp()
          which succeeds.
      17. Goto 14.
      
      We could fix this in two ways. The first would be to remove or modify
      the check on line 539 of slice.c.
      
      The second option is to cause the check of paca psize in hash_page() on
      line 1125 to also be done for THP pages.
      
      We prefer the latter, because the check & update of the paca psize is
      not done until we know it's necessary. It's also done only on the
      current cpu, so we don't need to IPI all other cpus.
      
      Without further rearranging the code, the simplest fix is to pull out
      the code that checks paca psize and call it in two places. Firstly for
      THP/hugetlb, and secondly for other mappings as before.
      
      Thanks to Dave Jones for trinity, which originally found this bug.
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      CC: stable@vger.kernel.org [v3.11+]
      09567e7f
  2. 05 6月, 2014 13 次提交
  3. 28 5月, 2014 17 次提交
    • S
      powerpc: Document sysfs DSCR interface · 26c88f93
      Sam bobroff 提交于
      Add some documentation about ...
      
      /sys/devices/system/cpu/dscr_default
      /sys/devices/system/cpu/cpuN/dscr
      
      ... to Documentation/ABI/stable.
      Signed-off-by: NSam Bobroff <sam.bobroff@au1.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      26c88f93
    • S
      powerpc: Fix regression of per-CPU DSCR setting · 1739ea9e
      Sam bobroff 提交于
      Since commit "efcac658 powerpc: Per process DSCR + some fixes (try#4)"
      it is no longer possible to set the DSCR on a per-CPU basis.
      
      The old behaviour was to minipulate the DSCR SPR directly but this is no
      longer sufficient: the value is quickly overwritten by context switching.
      
      This patch stores the per-CPU DSCR value in a kernel variable rather than
      directly in the SPR and it is used whenever a process has not set the DSCR
      itself. The sysfs interface (/sys/devices/system/cpu/cpuN/dscr) is unchanged.
      
      Writes to the old global default (/sys/devices/system/cpu/dscr_default)
      now set all of the per-CPU values and reads return the last written value.
      
      The new per-CPU default is added to the paca_struct and is used everywhere
      outside of sysfs.c instead of the old global default.
      Signed-off-by: NSam Bobroff <sam.bobroff@au1.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      1739ea9e
    • S
      powerpc: Split __SYSFS_SPRSETUP macro · 39a360ef
      Sam bobroff 提交于
      Split the __SYSFS_SPRSETUP macro into two parts so that registers requiring
      custom read and write functions can use common code for their show and store
      functions.
      Signed-off-by: NSam Bobroff <sam.bobroff@au1.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      39a360ef
    • R
      arch: powerpc/fadump: Cleaning up inconsistent NULL checks · b717d985
      Rickard Strandqvist 提交于
      Cleaning up inconsistent NULL checks.
      There is otherwise a risk of a possible null pointer dereference.
      
      Was largely found by using a static code analysis program called cppcheck.
      Signed-off-by: NRickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      b717d985
    • P
      powerpc: Fix comment around arch specific definition of RECLAIM_DISTANCE · 4750afa2
      Preeti U Murthy 提交于
      Commit 32e45ff4 changed the default value of
      RECLAIM_DISTANCE to 30. However the comment around arch
      specifc definition of RECLAIM_DISTANCE is not updated to
      reflect the same. Correct the value mentioned in the comment.
      Signed-off-by: NPreeti U Murthy <preeti@linux.vnet.ibm.com>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NKOSAKI Motohiro <Kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      4750afa2
    • M
      powerpc/powernv: Add support for POWER8 split core on powernv · e2186023
      Michael Ellerman 提交于
      Upcoming POWER8 chips support a concept called split core. This is where the
      core can be split into subcores that although not full cores, are able to
      appear as full cores to a guest.
      
      The splitting & unsplitting procedure is mildly complicated, and explained at
      length in the comments within the patch.
      
      One notable detail is that when splitting or unsplitting we need to pull
      offline cpus out of their offline state to do work as part of the procedure.
      
      The interface for changing the split mode is via a sysfs file, eg:
      
       $ echo 2 > /sys/devices/system/cpu/subcores_per_core
      
      Currently supported values are '1', '2' and '4'. And indicate respectively that
      the core should be unsplit, split in half, and split in quarters. These modes
      correspond to threads_per_subcore of 8, 4 and 2.
      
      We do not allow changing the split mode while KVM VMs are active. This is to
      prevent the value changing while userspace is configuring the VM, and also to
      prevent the mode being changed in such a way that existing guests are unable to
      be run.
      
      CPU hotplug fixes by Srivatsa.  max_cpus fixes by Mahesh.  cpuset fixes by
      benh.  Fix for irq race by paulus.  The rest by mikey and mpe.
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      Signed-off-by: NSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: NMahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      e2186023
    • M
      powerpc/kvm/book3s_hv: Use threads_per_subcore in KVM · 3102f784
      Michael Ellerman 提交于
      To support split core on POWER8 we need to modify various parts of the
      KVM code to use threads_per_subcore instead of threads_per_core. On
      systems that do not support split core threads_per_subcore ==
      threads_per_core and these changes are a nop.
      
      We use threads_per_subcore as the value reported by KVM_CAP_PPC_SMT.
      This communicates to userspace that guests can only be created with
      a value of threads_per_core that is less than or equal to the current
      threads_per_subcore. This ensures that guests can only be created with a
      thread configuration that we are able to run given the current split
      core mode.
      
      Although threads_per_subcore can change during the life of the system,
      the commit that enables that will ensure that threads_per_subcore does
      not change during the life of a KVM VM.
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      Acked-by: NAlexander Graf <agraf@suse.de>
      Acked-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      3102f784
    • M
      powerpc: Check cpu_thread_in_subcore() in __cpu_up() · 6f5e40a3
      Michael Ellerman 提交于
      To support split core we need to change the check in __cpu_up() that
      determines if a cpu is allowed to come online.
      
      Currently we refuse to online cpus which are not the primary thread
      within their core.
      
      On POWER8 with split core support this check needs to instead refuse to
      online cpus which are not the primary thread within their *sub* core.
      
      On POWER7 and other systems that do not support split core,
      threads_per_subcore == threads_per_core and so the check is equivalent.
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      6f5e40a3
    • M
      powerpc: Add threads_per_subcore · 5853aef1
      Michael Ellerman 提交于
      On POWER8 we have a new concept of a subcore. This is what happens when
      you take a regular core and split it. A subcore is a grouping of two or
      four SMT threads, as well as a handfull of SPRs which allows the subcore
      to appear as if it were a core from the point of view of a guest.
      
      Unlike threads_per_core which is fixed at boot, threads_per_subcore can
      change while the system is running. Most code will not want to use
      threads_per_subcore.
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      5853aef1
    • M
      powerpc/powernv: Make it possible to skip the IRQHAPPENED check in power7_nap() · 8d6f7c5a
      Michael Ellerman 提交于
      To support split core we need to be able to force all secondaries into
      nap, so the core can detect they are idle and do an unsplit.
      
      Currently power7_nap() will return without napping if there is an irq
      pending. We want to ignore the pending irq and nap anyway, we will deal
      with the interrupt later.
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      8d6f7c5a
    • M
      powerpc/kvm/book3s_hv: Rework the secondary inhibit code · 441c19c8
      Michael Ellerman 提交于
      As part of the support for split core on POWER8, we want to be able to
      block splitting of the core while KVM VMs are active.
      
      The logic to do that would be exactly the same as the code we currently
      have for inhibiting onlining of secondaries.
      
      Instead of adding an identical mechanism to block split core, rework the
      secondary inhibit code to be a "HV KVM is active" check. We can then use
      that in both the cpu hotplug code and the upcoming split core code.
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      Acked-by: NAlexander Graf <agraf@suse.de>
      Acked-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      441c19c8
    • N
      powerpc/numa: Enable CONFIG_HAVE_MEMORYLESS_NODES · 64bb80d8
      Nishanth Aravamudan 提交于
      Based off fd1197f1 for ia64, enable CONFIG_HAVE_MEMORYLESS_NODES if
      NUMA. Initialize the local memory node in start_secondary.
      
      With this commit and the preceding to enable
      CONFIG_USER_PERCPU_NUMA_NODE_ID, which is a prerequisite, in a PowerKVM
      guest with the following topology:
      
      numactl --hardware
      available: 3 nodes (0-2)
      node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
      23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
      47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70
      71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94
      95 96 97 98 99
      node 0 size: 1998 MB
      node 0 free: 521 MB
      node 1 cpus: 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114
      115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132
      133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150
      151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168
      169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186
      187 188 189 190 191 192 193 194 195 196 197 198 199
      node 1 size: 0 MB
      node 1 free: 0 MB
      node 2 cpus:
      node 2 size: 2039 MB
      node 2 free: 1739 MB
      node distances:
      node   0   1   2
        0:  10  40  40
        1:  40  10  40
        2:  40  40  10
      
      the unreclaimable slab is reduced by close to 130M:
      
      Before:
              Slab:             418176 kB
              SReclaimable:      26624 kB
              SUnreclaim:       391552 kB
      
      After:
              Slab:             298944 kB
              SReclaimable:      31744 kB
              SUnreclaim:       267200 kB
      Signed-off-by: NNishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      64bb80d8
    • N
      powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID · 8c272261
      Nishanth Aravamudan 提交于
      Based off 3bccd996 for ia64, convert powerpc to use the generic per-CPU
      topology tracking, specifically:
      
          initialize per cpu numa_node entry in start_secondary
          remove the powerpc cpu_to_node()
          define CONFIG_USE_PERCPU_NUMA_NODE_ID if NUMA
      Signed-off-by: NNishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      8c272261
    • B
      Merge branch 'merge' into next · 86969cf7
      Benjamin Herrenschmidt 提交于
      Merge the binutils and kexec fixes.
      86969cf7
    • S
      powerpc, kexec: Fix "Processor X is stuck" issue during kexec from ST mode · 011e4b02
      Srivatsa S. Bhat 提交于
      If we try to perform a kexec when the machine is in ST (Single-Threaded) mode
      (ppc64_cpu --smt=off), the kexec operation doesn't succeed properly, and we
      get the following messages during boot:
      
      [    0.089866] POWER8 performance monitor hardware support registered
      [    0.089985] power8-pmu: PMAO restore workaround active.
      [    5.095419] Processor 1 is stuck.
      [   10.097933] Processor 2 is stuck.
      [   15.100480] Processor 3 is stuck.
      [   20.102982] Processor 4 is stuck.
      [   25.105489] Processor 5 is stuck.
      [   30.108005] Processor 6 is stuck.
      [   35.110518] Processor 7 is stuck.
      [   40.113369] Processor 9 is stuck.
      [   45.115879] Processor 10 is stuck.
      [   50.118389] Processor 11 is stuck.
      [   55.120904] Processor 12 is stuck.
      [   60.123425] Processor 13 is stuck.
      [   65.125970] Processor 14 is stuck.
      [   70.128495] Processor 15 is stuck.
      [   75.131316] Processor 17 is stuck.
      
      Note that only the sibling threads are stuck, while the primary threads (0, 8,
      16 etc) boot just fine. Looking closer at the previous step of kexec, we observe
      that kexec tries to wakeup (bring online) the sibling threads of all the cores,
      before performing kexec:
      
      [ 9464.131231] Starting new kernel
      [ 9464.148507] kexec: Waking offline cpu 1.
      [ 9464.148552] kexec: Waking offline cpu 2.
      [ 9464.148600] kexec: Waking offline cpu 3.
      [ 9464.148636] kexec: Waking offline cpu 4.
      [ 9464.148671] kexec: Waking offline cpu 5.
      [ 9464.148708] kexec: Waking offline cpu 6.
      [ 9464.148743] kexec: Waking offline cpu 7.
      [ 9464.148779] kexec: Waking offline cpu 9.
      [ 9464.148815] kexec: Waking offline cpu 10.
      [ 9464.148851] kexec: Waking offline cpu 11.
      [ 9464.148887] kexec: Waking offline cpu 12.
      [ 9464.148922] kexec: Waking offline cpu 13.
      [ 9464.148958] kexec: Waking offline cpu 14.
      [ 9464.148994] kexec: Waking offline cpu 15.
      [ 9464.149030] kexec: Waking offline cpu 17.
      
      Instrumenting this piece of code revealed that the cpu_up() operation actually
      fails with -EBUSY. Thus, only the primary threads of all the cores are online
      during kexec, and hence this is a sure-shot receipe for disaster, as explained
      in commit e8e5c215 (powerpc/kexec: Fix orphaned offline CPUs across kexec),
      as well as in the comment above wake_offline_cpus().
      
      It turns out that cpu_up() was returning -EBUSY because the variable
      'cpu_hotplug_disabled' was set to 1; and this disabling of CPU hotplug was done
      by migrate_to_reboot_cpu() inside kernel_kexec().
      
      Now, migrate_to_reboot_cpu() was originally written with the assumption that
      any further code will not need to perform CPU hotplug, since we are anyway in
      the reboot path. However, kexec is clearly not such a case, since we depend on
      onlining CPUs, atleast on powerpc.
      
      So re-enable cpu-hotplug after returning from migrate_to_reboot_cpu() in the
      kexec path, to fix this regression in kexec on powerpc.
      
      Also, wrap the cpu_up() in powerpc kexec code within a WARN_ON(), so that we
      can catch such issues more easily in the future.
      
      Fixes: c97102ba (kexec: migrate to reboot cpu)
      Cc: stable@vger.kernel.org
      Signed-off-by: NSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      011e4b02
    • G
      powerpc: Fix 64 bit builds with binutils 2.24 · 7998eb3d
      Guenter Roeck 提交于
      With binutils 2.24, various 64 bit builds fail with relocation errors
      such as
      
      arch/powerpc/kernel/built-in.o: In function `exc_debug_crit_book3e':
      	(.text+0x165ee): relocation truncated to fit: R_PPC64_ADDR16_HI
      	against symbol `interrupt_base_book3e' defined in .text section
      	in arch/powerpc/kernel/built-in.o
      arch/powerpc/kernel/built-in.o: In function `exc_debug_crit_book3e':
      	(.text+0x16602): relocation truncated to fit: R_PPC64_ADDR16_HI
      	against symbol `interrupt_end_book3e' defined in .text section
      	in arch/powerpc/kernel/built-in.o
      
      The assembler maintainer says:
      
       I changed the ABI, something that had to be done but unfortunately
       happens to break the booke kernel code.  When building up a 64-bit
       value with lis, ori, shl, oris, ori or similar sequences, you now
       should use @high and @higha in place of @h and @ha.  @h and @ha
       (and their associated relocs R_PPC64_ADDR16_HI and R_PPC64_ADDR16_HA)
       now report overflow if the value is out of 32-bit signed range.
       ie. @h and @ha assume you're building a 32-bit value. This is needed
       to report out-of-range -mcmodel=medium toc pointer offsets in @toc@h
       and @toc@ha expressions, and for consistency I did the same for all
       other @h and @ha relocs.
      
      Replacing @h with @high in one strategic location fixes the relocation
      errors. This has to be done conditionally since the assembler either
      supports @h or @high but not both.
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NGuenter Roeck <linux@roeck-us.net>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      7998eb3d
    • B
      Merge remote-tracking branch 'scott/next' into next · b9d80095
      Benjamin Herrenschmidt 提交于
      <<
      Highlights include a few new boards, a device tree binding for CCF
      (including backwards-compatible device tree updates to distinguish
      incompatible versions), and some fixes.
      >>
      b9d80095
  4. 23 5月, 2014 9 次提交
新手
引导
客服 返回
顶部