1. 15 Dec, 2014 4 commits
    • powernv/powerpc: Add winkle support for offline cpus · 77b54e9f
      Shreyas B. Prabhu committed
      Winkle is a deep idle state supported in POWER8 chips. A core enters
      winkle when all the threads of the core enter winkle. In this state
      power supply to the entire chiplet, i.e. core, private L2 and private L3,
      is turned off. As a result it gives higher power savings compared to
      sleep.
      
      But entering winkle results in a total hypervisor state loss. Hence the
      hypervisor context has to be preserved before entering winkle and
      restored upon wake up.
      
      Power-on Reset Engine (PORE) is a dedicated engine which is responsible
      for powering on the chiplet during wake up. It can be programmed to
      restore the register contents of a few specific registers. This patch
      uses PORE to restore register state wherever possible and uses the stack
      to save and restore the rest of the necessary registers.
      
      With hypervisor state restore, things fall under three categories:
      per-core state, per-subcore state and per-thread state. To manage this,
      extend the infrastructure introduced for sleep. Mainly we add a paca
      variable subcore_sibling_mask. Using this and the core_idle_state we can
      distinguish the first thread in the core and in the subcore.
      Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
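The first-thread test that subcore_sibling_mask enables can be sketched in plain C. This is a minimal sketch, not the patch's code: it assumes the core state is a mask with one bit per thread that is already awake, and `struct wake_role` and `classify_wakeup()` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result: which shared state the waking thread has to restore. */
struct wake_role {
    bool first_in_core;     /* restore per-core state */
    bool first_in_subcore;  /* restore per-subcore state */
};

/*
 * Assumption: awake_mask has one bit set for every thread of the core that
 * is already awake, and subcore_sibling_mask covers the caller's subcore.
 */
static struct wake_role classify_wakeup(uint8_t awake_mask,
                                        uint8_t thread_bit,
                                        uint8_t subcore_sibling_mask)
{
    struct wake_role role;

    /* The caller must belong to the subcore it passes in. */
    assert((thread_bit & subcore_sibling_mask) != 0);

    role.first_in_core = (awake_mask == 0);
    role.first_in_subcore = ((awake_mask & subcore_sibling_mask) == 0);
    return role;
}
```

With a core split into two subcores of four threads, a thread that wakes while only threads of the other subcore are awake restores per-subcore state but not per-core state. Per-thread state is restored by every thread regardless.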
    • powernv/cpuidle: Redesign idle states management · 7cba160a
      Shreyas B. Prabhu committed
      Deep idle states like sleep and winkle are per core idle states. A core
      enters these states only when all the threads enter either the
      particular idle state or a deeper one. Tasks like the fastsleep
      hardware bug workaround and hypervisor core state save have to be
      done only by the last thread of the core entering a deep idle state.
      Similarly, tasks like timebase resync and hypervisor core register
      restore have to be done only by the first thread waking up from these
      states.
      
      The current idle state management does not have a way to distinguish the
      first/last thread of the core waking/entering idle states. Tasks like
      timebase resync are done for all the threads. This is not only
      suboptimal, but can cause functional issues when subcores and KVM are
      involved.
      
      This patch adds the necessary infrastructure to track idle states of
      threads in a per-core structure. It uses this info to perform tasks like
      fastsleep workaround and timebase resync only once per core.
      Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Originally-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: linux-pm@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
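A per-core record that lets a thread tell whether it is the last to go idle or the first to wake can be sketched with C11 atomics. This is an illustration under stated assumptions, not the kernel implementation: `struct core_idle`, `enter_deep_idle()` and `exit_deep_idle()` are hypothetical names, and the mask holds one bit per awake thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-core record: one bit per hardware thread that is awake. */
struct core_idle {
    _Atomic uint8_t awake_mask;
};

/* Returns true if the caller is the last thread of the core to go idle. */
static bool enter_deep_idle(struct core_idle *core, uint8_t thread_bit)
{
    uint8_t old = atomic_fetch_and(&core->awake_mask, (uint8_t)~thread_bit);

    assert(old & thread_bit);   /* the thread must have been awake */
    return (uint8_t)(old & ~thread_bit) == 0;
}

/* Returns true if the caller is the first thread of the core to wake up. */
static bool exit_deep_idle(struct core_idle *core, uint8_t thread_bit)
{
    uint8_t old = atomic_fetch_or(&core->awake_mask, thread_bit);

    return old == 0;
}
```

Only the thread for which `enter_deep_idle()` returns true would apply a once-per-core step such as the fastsleep workaround, and only the thread for which `exit_deep_idle()` returns true would resync the timebase. A real implementation also has to keep sibling threads from running ahead while that per-core work is in progress, which this sketch does not attempt.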
    • powerpc/powernv: Enable Offline CPUs to enter deep idle states · 8eb8ac89
      Shreyas B. Prabhu committed
      The secondary threads should enter deep idle states so as to gain maximum
      powersavings when the entire core is offline. To do so the offline path
      must be made aware of the available deepest idle state. Hence probe the
      device tree for the possible idle states in powernv core code and
      expose the deepest idle state through flags.
      
      Since the device tree is probed by the cpuidle driver as well, move
      the parameters required to discover the idle states into a place
      common to both the driver and the powernv core code.
      
      Another point is that the fastsleep idle state may require a workaround
      in the kernel to function properly. This workaround is introduced in the
      subsequent patches. However, neither the cpuidle driver nor the hotplug
      path needs to be bothered about this workaround; it will be taken care
      of by the core powernv code.
      Originally-by: Srivatsa S. Bhat <srivatsa@mit.edu>
      Signed-off-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
      Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Reviewed-by: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: linux-pm@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv: Switch off MMU before entering nap/sleep/rvwinkle mode · 8117ac6a
      Paul Mackerras committed
      Currently, when going idle, we set the flag indicating that we are in
      nap mode (paca->kvm_hstate.hwthread_state) and then execute the nap
      (or sleep or rvwinkle) instruction, all with the MMU on.  This is bad
      for two reasons: (a) the architecture specifies that those instructions
      must be executed with the MMU off, and in fact with only the SF, HV, ME
      and possibly RI bits set, and (b) this introduces a race, because as
      soon as we set the flag, another thread can switch the MMU to a guest
      context.  If the race is lost, this thread will typically start looping
      on relocation-on ISIs at 0xc...4400.
      
      This fixes it by setting the MSR as required by the architecture before
      setting the flag or executing the nap/sleep/rvwinkle instruction.
      
      Cc: stable@vger.kernel.org
      [ shreyas@linux.vnet.ibm.com: Edited to handle LE ]
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
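The MSR requirement stated in the changelog (only SF, HV, ME and possibly RI set) can be illustrated with a small mask computation. This is a sketch, not the patch: the bit positions follow the MSR_*_LG constants in Linux's asm/reg.h as quoted from memory, so treat them as assumptions, and `idle_msr()` is a hypothetical helper. The sketch ignores the little-endian handling mentioned in the changelog note.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions quoted from memory of Linux's asm/reg.h (assumption). */
#define MSR_SF (1ULL << 63)  /* 64-bit mode */
#define MSR_HV (1ULL << 60)  /* hypervisor state */
#define MSR_EE (1ULL << 15)  /* external interrupts enabled */
#define MSR_ME (1ULL << 12)  /* machine check enable */
#define MSR_IR (1ULL << 5)   /* instruction relocation (MMU on for fetches) */
#define MSR_DR (1ULL << 4)   /* data relocation (MMU on for loads/stores) */
#define MSR_RI (1ULL << 1)   /* recoverable interrupt */

/* Hypothetical helper: keep only the bits allowed around nap/sleep/rvwinkle. */
static uint64_t idle_msr(uint64_t current_msr)
{
    return current_msr & (MSR_SF | MSR_HV | MSR_ME | MSR_RI);
}
```

Because IR and DR are cleared before the flag is set, another thread switching the MMU to a guest context can no longer affect instruction fetches or data accesses on the thread that is about to go idle.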
  2. 14 Dec, 2014 1 commit
  3. 12 Dec, 2014 6 commits
  4. 09 Dec, 2014 1 commit
    • powerpc: Secondary CPUs must set cpu_callin_map after setting active and online · 7c5c92ed
      Anton Blanchard committed
      I have a busy ppc64le KVM box where guests sometimes hit the infamous
      "kernel BUG at kernel/smpboot.c:134!" issue during boot:
      
        BUG_ON(td->cpu != smp_processor_id());
      
      Basically a per CPU hotplug thread scheduled on the wrong CPU. The oops
      output confirms it:
      
        CPU: 0
        Comm: watchdog/130
      
      The problem is that we aren't ensuring the CPU active and online bits are set
      before allowing the master to continue on. The master unparks the secondary
      CPU's kthreads and the scheduler looks for a CPU to run them on. It calls
      select_task_rq and realises the suggested CPU is not in the cpus_allowed
      mask. It then ends up in select_fallback_rq, and since the active and
      online bits aren't set we choose some other CPU to run on.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
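The ordering requirement in the title can be modelled with C11 atomics. This is a minimal sketch under stated assumptions: `struct cpu_flags` and both functions are hypothetical stand-ins for the per-CPU online/active bits and cpu_callin_map, not the kernel's data structures.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the online/active bits and cpu_callin_map. */
struct cpu_flags {
    atomic_bool online;
    atomic_bool active;
    atomic_bool callin;
};

/* Secondary CPU: publish online and active first, callin last. */
static void secondary_bringup(struct cpu_flags *cpu)
{
    atomic_store_explicit(&cpu->online, true, memory_order_relaxed);
    atomic_store_explicit(&cpu->active, true, memory_order_relaxed);
    /* Release: whoever observes callin also observes the two stores above. */
    atomic_store_explicit(&cpu->callin, true, memory_order_release);
}

/* Master CPU: true once it is safe to unpark the secondary's kthreads. */
static bool master_may_continue(struct cpu_flags *cpu)
{
    if (!atomic_load_explicit(&cpu->callin, memory_order_acquire))
        return false;
    /* With the ordering above these can no longer be observed as unset. */
    assert(atomic_load_explicit(&cpu->online, memory_order_relaxed));
    assert(atomic_load_explicit(&cpu->active, memory_order_relaxed));
    return true;
}
```

If callin were stored before the other two flags, the master could proceed while the scheduler still sees the CPU as unavailable, which is the situation that leads to the fallback CPU selection described in the changelog.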
  5. 08 Dec, 2014 1 commit
    • powerpc/powernv: Return to cpu offline loop when finished in KVM guest · 56548fc0
      Paul Mackerras committed
      When a secondary hardware thread has finished running a KVM guest, we
      currently put that thread into nap mode using a nap instruction in
      the KVM code.  This changes the code so that instead of doing a nap
      instruction directly, we cause the call to power7_nap() that
      put the thread into nap mode to return.  The reason for doing this is
      so that the KVM code does not have to know what low-power mode to
      put the thread into.
      
      In the case of a secondary thread used to run a KVM guest, the thread
      will be offline from the point of view of the host kernel, and the
      relevant power7_nap() call is the one in pnv_smp_cpu_disable().
      In this case we don't want to clear pending IPIs in the offline loop
      in that function, since that might cause us to miss the wakeup for
      the next time the thread needs to run a guest.  To tell whether or
      not to clear the interrupt, we use the SRR1 value returned from
      power7_nap(), and check if it indicates an external interrupt.  We
      arrange that the return from power7_nap() when we have finished running
      a guest returns 0, so pending interrupts don't get flushed in that
      case.
      
      Note that it is important that a secondary thread that has finished
      executing in the guest, or that didn't have a guest to run, should
      not return to power7_nap's caller while the kvm_hstate.hwthread_req
      flag in the PACA is non-zero, because the return from power7_nap
      will reenable the MMU, and the MMU might still be in guest context.
      In this situation we spin at low priority in real mode waiting for
      hwthread_req to become zero.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
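The decision made in the offline loop from the SRR1 value can be sketched as follows. The wake-reason constants are quoted from memory of Linux's asm/reg.h and should be treated as assumptions, and `should_flush_interrupt()` is a hypothetical helper, not a function from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Wake-reason field of SRR1; values quoted from memory (assumption). */
#define SRR1_WAKEMASK 0x00380000UL
#define SRR1_WAKEEE   0x00200000UL  /* woken by an external interrupt */

/* Hypothetical helper: should the offline loop flush the pending interrupt? */
static bool should_flush_interrupt(uint64_t srr1)
{
    /* srr1 == 0 is the "finished running a guest" return: nothing is flushed. */
    return (srr1 & SRR1_WAKEMASK) == SRR1_WAKEEE;
}
```

Returning 0 after a guest run means the wake reason never matches the external-interrupt pattern, so a pending IPI that signals the next guest entry is left in place.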
  6. 05 Dec, 2014 2 commits
    • powerpc/book3s: Fix partial invalidation of TLBs in MCE code. · 682e77c8
      Mahesh Salgaonkar committed
      The existing MCE code calls flush_tlb hook with IS=0 (single page) resulting
      in partial invalidation of TLBs which is not right. This patch fixes
      that by passing IS=0xc00 to invalidate whole TLB for successful recovery
      from TLB and ERAT errors.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
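The difference between the two IS values can be made concrete by building the operands a flush loop would issue. This is a sketch under assumptions: the macro names, the position of the set index above the IS field, and `build_flush_operands()` are illustrative and not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* IS (invalidation selector) values from the changelog; names are illustrative. */
#define INVAL_PAGE 0x000UL  /* IS=0: only a single page's entry is targeted */
#define INVAL_SET  0xc00UL  /* IS=0xc00: every entry in the addressed set */
#define SET_SHIFT  12       /* assumption: the set index sits above the IS field */

/* Fill rb[] with one operand per TLB set; returns the number written. */
static size_t build_flush_operands(uint64_t is, unsigned int nr_sets,
                                   uint64_t *rb, size_t cap)
{
    size_t n = 0;

    for (unsigned int set = 0; set < nr_sets && n < cap; set++)
        rb[n++] = is | ((uint64_t)set << SET_SHIFT);
    return n;
}
```

Walking every set with `INVAL_SET` covers the whole TLB, whereas the same walk with `INVAL_PAGE` leaves most entries in place, which is the partial invalidation the commit fixes.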
    • powerpc/mm: don't do tlbie for updatepp request with NO HPTE fault · aefa5688
      Aneesh Kumar K.V committed
      updatepp can get called for a nohpte fault when we find from the linux
      page table that the translation was hashed before. In that case
      we are sure that there is no existing translation, hence we could
      avoid doing tlbie.
      
      We could possibly race with a parallel fault filling the TLB. But
      that should be ok because updatepp is only ever relaxing permissions.
      We also look at linux pte permission bits when filling hash pte
      permission bits. We also hold the linux pte busy bits while
      inserting/updating a hashpte entry, hence a parallel update of
      linux pte is not possible. On the other hand mprotect involves
      ptep_modify_prot_start which causes a hpte invalidate and not updatepp.
      
      Performance numbers:
      We use random_access_bench written by Anton.
      
      Kernel with THP disabled and smaller hash page table size.
      
          86.60%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_updatepp
           2.10%  random_access_b  random_access_bench              [.] doit
           1.99%  random_access_b  [kernel.kallsyms]                [k] .do_raw_spin_lock
           1.85%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_insert
           1.26%  random_access_b  [kernel.kallsyms]                [k] .native_flush_hash_range
           1.18%  random_access_b  [kernel.kallsyms]                [k] .__delay
           0.69%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_remove
           0.37%  random_access_b  [kernel.kallsyms]                [k] .clear_user_page
           0.34%  random_access_b  [kernel.kallsyms]                [k] .__hash_page_64K
           0.32%  random_access_b  [kernel.kallsyms]                [k] fast_exception_return
           0.30%  random_access_b  [kernel.kallsyms]                [k] .hash_page_mm
      
      With Fix:
      
          27.54%  random_access_b  random_access_bench              [.] doit
          22.90%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_insert
           5.76%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_remove
           5.20%  random_access_b  [kernel.kallsyms]                [k] fast_exception_return
           5.12%  random_access_b  [kernel.kallsyms]                [k] .__hash_page_64K
           4.80%  random_access_b  [kernel.kallsyms]                [k] .hash_page_mm
           3.31%  random_access_b  [kernel.kallsyms]                [k] data_access_common
           1.84%  random_access_b  [kernel.kallsyms]                [k] .trace_hardirqs_on_caller
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
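The optimisation amounts to skipping the flush when the caller knows no translation existed. A minimal sketch, assuming a hypothetical flag `UPDATE_NOHPTE_FAULT` and a counter that stands in for the tlbie instruction; none of these names come from the patch.

```c
#include <assert.h>

/* Hypothetical flag: the fault found no HPTE, so no stale TLB entry can exist. */
#define UPDATE_NOHPTE_FAULT 0x1u

static unsigned int tlbie_count;  /* counts simulated tlbie instructions */

static void simulated_tlbie(void)
{
    tlbie_count++;
}

/* Sketch of an updatepp-style routine: update permissions, flush only if needed. */
static void update_pp(unsigned int *hpte_pp, unsigned int new_pp,
                      unsigned int flags)
{
    *hpte_pp = new_pp;                  /* relax the permission bits */
    if (!(flags & UPDATE_NOHPTE_FAULT))
        simulated_tlbie();              /* an old translation may be cached */
}
```

The profiles in the changelog are consistent with this: once the flush is skipped for no-HPTE faults, the updatepp routine drops out of the top of the listing.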
  7. 02 Dec, 2014 14 commits
  8. 19 Nov, 2014 2 commits
    • powerpc: Remove more traces of bootmem · e39f223f
      Michael Ellerman committed
      Although we are now selecting NO_BOOTMEM, we still have some traces of
      bootmem lying around. That is because even with NO_BOOTMEM there is
      still a shim that converts bootmem calls into memblock calls, but
      ultimately we want to remove all traces of bootmem.
      
      Most of the patch is conversions from alloc_bootmem() to
      memblock_virt_alloc(). In general a call such as:
      
        p = (struct foo *)alloc_bootmem(x);
      
      Becomes:
      
        p = memblock_virt_alloc(x, 0);
      
      We don't need the cast because memblock_virt_alloc() returns a void *.
      The alignment value of zero tells memblock to use the default alignment,
      which is SMP_CACHE_BYTES, the same value alloc_bootmem() uses.
      
      We remove a number of NULL checks on the result of
      memblock_virt_alloc(). That is because memblock_virt_alloc() will panic
      if it can't allocate, in exactly the same way as alloc_bootmem(), so the
      NULL checks are and always have been redundant.
      
      The memory returned by memblock_virt_alloc() is already zeroed, so we
      remove several memsets of the result of memblock_virt_alloc().
      
      Finally we convert a few uses of __alloc_bootmem(x, y, MAX_DMA_ADDRESS)
      to just plain memblock_virt_alloc(). We don't use memblock_alloc_base()
      because MAX_DMA_ADDRESS is ~0ul on powerpc, so limiting the allocation
      to that is pointless, 16XB ought to be enough for anyone.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/pseries: Initialise nvram_pstore_info's buf_lock · a49ab6ee
      Li Zhong committed
      nvram_pstore_info's buf_lock is not initialized before registering,
      which is clearly incorrect.
      
      It causes some strange behavior when trying to obtain the lock during
      the kdump process.
      
      On a UP configuration, the console stopped for a couple of seconds, then
      a "lockup suspected" warning was printed, but then it continued to run.
      
      So the try lock fails and a lockup is reported, but then arch_spin_lock()
      passes.
      Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
      [mpe: Edited changelog]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  9. 18 Nov, 2014 4 commits
    • Merge remote-tracking branch 'scottwood/next' into next · 35891d40
      Michael Ellerman committed
      Scott says:
      
      "Highlights include a bunch of 8xx optimizations, device tree bindings
      for Freescale BMan, QMan, and FMan datapath components, misc device tree
      updates, and inbound rio window support."
    • cxl: Name interrupts in /proc/interrupt · 80fa93fc
      Michael Neuling committed
      Currently all interrupts generated by cxl are named "cxl".  This is not very
      informative as we can't distinguish between cards, AFUs, error interrupts, user
      contexts and user interrupt numbers.  Being able to distinguish them is useful
      for setting affinity.
      
      This patch gives each of these a distinct name in /proc/interrupts.
      
      A two card CAPI system, with afu0.0 having 2 active contexts with 4 user
      IRQs each, will now look like this:
      
          % grep cxl /proc/interrupts
          444:          0  OPAL ICS 141312 Level     cxl-card1-err
          445:          0  OPAL ICS 141313 Level     cxl-afu1.0-err
          446:          0  OPAL ICS 141314 Level     cxl-afu1.0
          462:          0  OPAL ICS 2052 Level     cxl-afu0.0-pe0-1
          463:      75517  OPAL ICS 2053 Level     cxl-afu0.0-pe0-2
          468:          0  OPAL ICS 2054 Level     cxl-afu0.0-pe0-3
          469:          0  OPAL ICS 2055 Level     cxl-afu0.0-pe0-4
          470:          0  OPAL ICS 2056 Level     cxl-afu0.0-pe1-1
          471:      75506  OPAL ICS 2057 Level     cxl-afu0.0-pe1-2
          472:          0  OPAL ICS 2058 Level     cxl-afu0.0-pe1-3
          473:          0  OPAL ICS 2059 Level     cxl-afu0.0-pe1-4
          502:       1066  OPAL ICS 2050 Level     cxl-afu0.0
          514:          0  OPAL ICS 2048 Level     cxl-card0-err
          515:          0  OPAL ICS 2049 Level     cxl-afu0.0-err
      Signed-off-by: Michael Neuling <mikey@neuling.org>
      Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
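The naming scheme visible in the listing can be reproduced with a few format strings. This is a sketch: the patterns are read off the /proc/interrupts output above, while the helper names and signatures are hypothetical and not the driver's own.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helpers that reproduce the names seen in the listing. */
static int card_err_name(char *buf, size_t len, int card)
{
    return snprintf(buf, len, "cxl-card%d-err", card);
}

static int afu_name(char *buf, size_t len, int card, int slice)
{
    return snprintf(buf, len, "cxl-afu%d.%d", card, slice);
}

static int afu_err_name(char *buf, size_t len, int card, int slice)
{
    return snprintf(buf, len, "cxl-afu%d.%d-err", card, slice);
}

/* pe is the context (process element) number, irq the user interrupt number. */
static int context_irq_name(char *buf, size_t len, int card, int slice,
                            int pe, int irq)
{
    return snprintf(buf, len, "cxl-afu%d.%d-pe%d-%d", card, slice, pe, irq);
}
```

Encoding the card, AFU, context and interrupt number in the name is what makes it possible to pick out a single line when setting affinity for one context's interrupt.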
    • cxl: Return error to PSL if IRQ demultiplexing fails & print clearer warning · bc78b05b
      Ian Munsie committed
      If an AFU has a hardware bug that causes it to acknowledge a context
      terminate or remove while that context has outstanding transactions, it
      is possible for the kernel to receive an interrupt for that context
      after we have removed it from the context list.
      
      The kernel will not be able to demultiplex the interrupt (or worse, if
      we have already reallocated the process handle we could mis-attribute it
      to the new context). Until now it printed a big scary warning but did
      not acknowledge the interrupt, which would effectively halt further
      translation fault processing on the PSL.
      
      This patch makes the warning clearer about the likely cause of the issue
      (i.e. hardware bug) to make it obvious to future AFU designers what
      needs to be fixed. It also prints out the process handle which can then
      be matched up with hardware and software traces for debugging.
      
      It also acknowledges the interrupt to the PSL with either an address
      error or acknowledge, so that the PSL can continue with other
      translations.
      Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/config: Enable memory driver · 76f3e292
      Prabhakar Kushwaha committed
      The Freescale IFC controller driver has been moved to drivers/memory,
      so enable the memory driver in the powerpc config.
      Signed-off-by: Prabhakar Kushwaha <prabhakar@freescale.com>
      Signed-off-by: Scott Wood <scottwood@freescale.com>
  10. 17 Nov, 2014 2 commits
    • rtc/tpo: Driver to support rtc and wakeup on PowerNV platform · 16b1d26e
      Neelesh Gupta committed
      The patch implements the OPAL rtc driver that binds with the rtc
      driver subsystem. The driver uses the platform device infrastructure
      to probe the rtc device and register it with the rtc class framework.
      'wakeup' is supported depending on whether the property 'has-tpo' is
      present in the OF node. It provides a way to load the generic rtc
      driver in the absence of an OPAL driver.
      
      The patch also moves the existing OPAL rtc get/set time interfaces to the
      new driver and exposes the necessary OPAL calls using EXPORT_SYMBOL_GPL.
      
      Test results:
      -------------
      Host:
      [root@tul169p1 ~]# ls -l /sys/class/rtc/
      total 0
      lrwxrwxrwx 1 root root 0 Oct 14 03:07 rtc0 -> ../../devices/opal-rtc/rtc/rtc0
      [root@tul169p1 ~]# cat /sys/devices/opal-rtc/rtc/rtc0/time
      08:10:07
      [root@tul169p1 ~]# echo `date '+%s' -d '+ 2 minutes'` > /sys/class/rtc/rtc0/wakealarm
      [root@tul169p1 ~]# cat /sys/class/rtc/rtc0/wakealarm
      1413274345
      [root@tul169p1 ~]#
      
      FSP:
      $ smgr mfgState
      standby
      $ rtim timeofday
      
      System time is valid: 2014/10/14 08:12:04.225115
      
      $ smgr mfgState
      ipling
      $
      
      CC: devicetree@vger.kernel.org
      CC: tglx@linutronix.de
      CC: rtc-linux@googlegroups.com
      CC: a.zummo@towertech.it
      Signed-off-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc: Use generic PIE randomization · 59994fb0
      Vineeth Vijayan committed
      Back in 2009 we merged 501cb16d "Randomise PIEs", which added support for
      randomizing PIE (Position Independent Executable) binaries.
      
      That commit added randomize_et_dyn(), which correctly randomized the addresses,
      but failed to honor PF_RANDOMIZE. That means it was not possible to disable PIE
      randomization via the personality flag, or /proc/sys/kernel/randomize_va_space.
      
      Since then there has been generic support for PIE randomization added to
      binfmt_elf.c, selectable via ARCH_BINFMT_ELF_RANDOMIZE_PIE.
      
      Enabling that allows us to drop randomize_et_dyn(), which means we start
      honoring PF_RANDOMIZE correctly.
      
      It also causes a fairly major change to how we lay out PIE binaries.
      
      Currently we will place the binary at 512MB-520MB for 32 bit binaries, or
      512MB-1.5GB for 64 bit binaries, eg:
      
          $ cat /proc/$$/maps
          4e550000-4e580000 r-xp 00000000 08:02 129813       /bin/dash
          4e580000-4e590000 rw-p 00020000 08:02 129813       /bin/dash
          10014110000-10014140000 rw-p 00000000 00:00 0      [heap]
          3fffaa3f0000-3fffaa5a0000 r-xp 00000000 08:02 921  /lib/powerpc64le-linux-gnu/libc-2.19.so
          3fffaa5a0000-3fffaa5b0000 rw-p 001a0000 08:02 921  /lib/powerpc64le-linux-gnu/libc-2.19.so
          3fffaa5c0000-3fffaa5d0000 rw-p 00000000 00:00 0
          3fffaa5d0000-3fffaa5f0000 r-xp 00000000 00:00 0    [vdso]
          3fffaa5f0000-3fffaa620000 r-xp 00000000 08:02 1246 /lib/powerpc64le-linux-gnu/ld-2.19.so
          3fffaa620000-3fffaa630000 rw-p 00020000 08:02 1246 /lib/powerpc64le-linux-gnu/ld-2.19.so
          3ffffc340000-3ffffc370000 rw-p 00000000 00:00 0    [stack]
      
      With this commit applied we don't do any special randomisation for the binary,
      and instead rely on mmap randomisation. This means the binary ends up at high
      addresses, eg:
      
          $ cat /proc/$$/maps
          3fff99820000-3fff999d0000 r-xp 00000000 08:02 921    /lib/powerpc64le-linux-gnu/libc-2.19.so
          3fff999d0000-3fff999e0000 rw-p 001a0000 08:02 921    /lib/powerpc64le-linux-gnu/libc-2.19.so
          3fff999f0000-3fff99a00000 rw-p 00000000 00:00 0
          3fff99a00000-3fff99a20000 r-xp 00000000 00:00 0      [vdso]
          3fff99a20000-3fff99a50000 r-xp 00000000 08:02 1246   /lib/powerpc64le-linux-gnu/ld-2.19.so
          3fff99a50000-3fff99a60000 rw-p 00020000 08:02 1246   /lib/powerpc64le-linux-gnu/ld-2.19.so
          3fff99a60000-3fff99a90000 r-xp 00000000 08:02 129813 /bin/dash
          3fff99a90000-3fff99aa0000 rw-p 00020000 08:02 129813 /bin/dash
          3fffc3de0000-3fffc3e10000 rw-p 00000000 00:00 0      [stack]
          3fffc55e0000-3fffc5610000 rw-p 00000000 00:00 0      [heap]
      
      Although this should be OK, it's possible it might break badly written
      binaries that make assumptions about the address space layout.
      Signed-off-by: Vineeth Vijayan <vvijayan@mvista.com>
      [mpe: Rewrite changelog]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
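What "honoring PF_RANDOMIZE" means can be shown with a small placement function. This is only an illustration of the missing check, not the commit's approach (the commit drops randomize_et_dyn() in favour of the generic code). `place_pie()` is a hypothetical helper, the PF_RANDOMIZE value is quoted from memory of Linux's sched.h, and the 64K page size is an assumption based on the maps above.

```c
#include <assert.h>

#define PF_RANDOMIZE     0x00400000u  /* quoted from memory (assumption) */
#define PAGE_SIZE_SKETCH 0x10000ul    /* assumption: 64K pages */

/* Hypothetical placement helper that honours the per-task flag. */
static unsigned long place_pie(unsigned long base, unsigned long rnd,
                               unsigned int task_flags)
{
    unsigned long addr;

    if (!(task_flags & PF_RANDOMIZE))
        return base;                    /* randomization disabled for this task */

    /* Round base + rnd up to the next page boundary. */
    addr = (base + rnd + PAGE_SIZE_SKETCH - 1) & ~(PAGE_SIZE_SKETCH - 1);
    return addr < base ? base : addr;   /* fall back to base on wrap-around */
}
```

With the flag clear the binary always lands at `base`, which is the behaviour a user expects after disabling randomization through the personality flag or randomize_va_space.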
  11. 14 Nov, 2014 3 commits