1. 12 5月, 2015 1 次提交
  2. 11 5月, 2015 4 次提交
  3. 01 5月, 2015 3 次提交
    • S
      powerpc/powernv: Restore non-volatile CRs after nap · 0aab3747
      Sam Bobroff 提交于
      Patches 7cba160a "powernv/cpuidle: Redesign idle states management"
      and 77b54e9f "powernv/powerpc: Add winkle support for offline cpus"
      use non-volatile condition registers (cr2, cr3 and cr4) early in the system
      reset interrupt handler (system_reset_pSeries()) before it has been determined
      if state loss has occurred. If state loss has not occurred, control returns via
      the power7_wakeup_noloss() path which does not restore those condition
      registers, leaving them corrupted.
      
      Fix this by restoring the condition registers in the power7_wakeup_noloss()
      case.
      
      This is apparent when running a KVM guest on hardware that does not
      support winkle or sleep and the guest makes use of secondary threads. In
      practice this means Power7 machines, though some early unreleased Power8
      machines may also be susceptible.
      
      The secondary CPUs are taken off line before the guest is started and
      they call pnv_smp_cpu_kill_self(). This checks support for sleep
      states (in this case there is no support) and power7_nap() is called.
      
      When the CPU is woken, power7_nap() returns and because the CPU is
      still off line, the main while loop executes again. The sleep states
      support test is executed again, but because the tested values cannot
      have changed, the compiler has optimized the test away and instead we
      rely on the result of the first test, which has been left in cr3
      and/or cr4. With the result overwritten, the wrong branch is taken and
      power7_winkle() is called on a CPU that does not support it, leading
      to it stalling.
      
      Fixes: 7cba160a ("powernv/cpuidle: Redesign idle states management")
      Fixes: 77b54e9f ("powernv/powerpc: Add winkle support for offline cpus")
      [mpe: Massage change log a bit more]
      Signed-off-by: NSam Bobroff <sam.bobroff@au1.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      0aab3747
    • G
      powerpc/eeh: Delay probing EEH device during hotplug · d91dafc0
      Gavin Shan 提交于
      Commit 1c509148b ("powerpc/eeh: Do probe on pci_dn") probes EEH
      devices in early stage, which is reasonable to pSeries platform.
      However, it's wrong for PowerNV platform because the PE# isn't
      determined until the resources (IO and MMIO) are assigned to
      PE in hotplug case. So we have to delay probing EEH devices
      for PowerNV platform until the PE# is assigned.
      
      Fixes: ff57b454 ("powerpc/eeh: Do probe on pci_dn")
      Signed-off-by: NGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      d91dafc0
    • G
      powerpc/eeh: Fix race condition in pcibios_set_pcie_reset_state() · 1ae79b78
      Gavin Shan 提交于
      When asserting reset in pcibios_set_pcie_reset_state(), the PE
      is enforced to (hardware) frozen state in order to drop unexpected
      PCI transactions (except PCI config read/write) automatically by
      hardware during reset, which would cause recursive EEH error.
      However, the (software) frozen state EEH_PE_ISOLATED is missed.
      When users get 0xFF from PCI config or MMIO read, EEH_PE_ISOLATED
      is set in PE state retrival backend. Unfortunately, nobody (the
      reset handler or the EEH recovery functinality in host) will clear
      EEH_PE_ISOLATED when the PE has been passed through to guest.
      
      The patch sets and clears EEH_PE_ISOLATED properly during reset
      in function pcibios_set_pcie_reset_state() to fix the issue.
      
      Fixes: 28158cd1 ("Enhance pcibios_set_pcie_reset_state()")
      Reported-by: NCarol L. Soto <clsoto@us.ibm.com>
      Signed-off-by: NGavin Shan <gwshan@linux.vnet.ibm.com>
      Tested-by: NCarol L. Soto <clsoto@us.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      1ae79b78
  4. 30 4月, 2015 1 次提交
    • M
      Revert "powerpc/tm: Abort syscalls in active transactions" · 68fc378c
      Michael Ellerman 提交于
      This reverts commit feba4036.
      
      Although the principle of this change is good, the implementation has a
      few issues.
      
      Firstly we can sometimes fail to abort a syscall because r12 may have
      been clobbered by C code if we went down the virtual CPU accounting
      path, or if syscall tracing was enabled.
      
      Secondly we have decided that it is safer to abort the syscall even
      earlier in the syscall entry path, so that we avoid the syscall tracing
      path when we are transactional.
      
      So that we have time to thoroughly test those changes we have decided to
      revert this for this merge window and will merge the fixed version in
      the next window.
      
      NB. Rather than reverting the selftest we just drop tm-syscall from
      TEST_PROGS so that it's not run by default.
      
      Fixes: feba4036 ("powerpc/tm: Abort syscalls in active transactions")
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      68fc378c
  5. 21 4月, 2015 5 次提交
    • P
      KVM: PPC: Book3S HV: Use msgsnd for signalling threads on POWER8 · 66feed61
      Paul Mackerras 提交于
      This uses msgsnd where possible for signalling other threads within
      the same core on POWER8 systems, rather than IPIs through the XICS
      interrupt controller.  This includes waking secondary threads to run
      the guest, the interrupts generated by the virtual XICS, and the
      interrupts to bring the other threads out of the guest when exiting.
      
      Aggregated statistics from debugfs across vcpus for a guest with 32
      vcpus, 8 threads/vcore, running on a POWER8, show this before the
      change:
      
       rm_entry:     3387.6ns (228 - 86600, 1008969 samples)
        rm_exit:     4561.5ns (12 - 3477452, 1009402 samples)
        rm_intr:     1660.0ns (12 - 553050, 3600051 samples)
      
      and this after the change:
      
       rm_entry:     3060.1ns (212 - 65138, 953873 samples)
        rm_exit:     4244.1ns (12 - 9693408, 954331 samples)
        rm_intr:     1342.3ns (12 - 1104718, 3405326 samples)
      
      for a test of booting Fedora 20 big-endian to the login prompt.
      
      The time taken for a H_PROD hcall (which is handled in the host
      kernel) went down from about 35 microseconds to about 16 microseconds
      with this change.
      
      The noinline added to kvmppc_run_core turned out to be necessary for
      good performance, at least with gcc 4.9.2 as packaged with Fedora 21
      and a little-endian POWER8 host.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      66feed61
    • P
      KVM: PPC: Book3S HV: Use bitmap of active threads rather than count · 7d6c40da
      Paul Mackerras 提交于
      Currently, the entry_exit_count field in the kvmppc_vcore struct
      contains two 8-bit counts, one of the threads that have started entering
      the guest, and one of the threads that have started exiting the guest.
      This changes it to an entry_exit_map field which contains two bitmaps
      of 8 bits each.  The advantage of doing this is that it gives us a
      bitmap of which threads need to be signalled when exiting the guest.
      That means that we no longer need to use the trick of setting the
      HDEC to 0 to pull the other threads out of the guest, which led in
      some cases to a spurious HDEC interrupt on the next guest entry.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      7d6c40da
    • P
      KVM: PPC: Book3S HV: Get rid of vcore nap_count and n_woken · 5d5b99cd
      Paul Mackerras 提交于
      We can tell when a secondary thread has finished running a guest by
      the fact that it clears its kvm_hstate.kvm_vcpu pointer, so there
      is no real need for the nap_count field in the kvmppc_vcore struct.
      This changes kvmppc_wait_for_nap to poll the kvm_hstate.kvm_vcpu
      pointers of the secondary threads rather than polling vc->nap_count.
      Besides reducing the size of the kvmppc_vcore struct by 8 bytes,
      this also means that we can tell which secondary threads have got
      stuck and thus print a more informative error message.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      5d5b99cd
    • P
      KVM: PPC: Book3S HV: Minor cleanups · 1f09c3ed
      Paul Mackerras 提交于
      * Remove unused kvmppc_vcore::n_busy field.
      * Remove setting of RMOR, since it was only used on PPC970 and the
        PPC970 KVM support has been removed.
      * Don't use r1 or r2 in setting the runlatch since they are
        conventionally reserved for other things; use r0 instead.
      * Streamline the code a little and remove the ext_interrupt_to_host
        label.
      * Add some comments about register usage.
      * hcall_try_real_mode doesn't need to be global, and can't be
        called from C code anyway.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      1f09c3ed
    • P
      KVM: PPC: Book3S HV: Accumulate timing information for real-mode code · b6c295df
      Paul Mackerras 提交于
      This reads the timebase at various points in the real-mode guest
      entry/exit code and uses that to accumulate total, minimum and
      maximum time spent in those parts of the code.  Currently these
      times are accumulated per vcpu in 5 parts of the code:
      
      * rm_entry - time taken from the start of kvmppc_hv_entry() until
        just before entering the guest.
      * rm_intr - time from when we take a hypervisor interrupt in the
        guest until we either re-enter the guest or decide to exit to the
        host.  This includes time spent handling hcalls in real mode.
      * rm_exit - time from when we decide to exit the guest until the
        return from kvmppc_hv_entry().
      * guest - time spend in the guest
      * cede - time spent napping in real mode due to an H_CEDE hcall
        while other threads in the same vcore are active.
      
      These times are exposed in debugfs in a directory per vcpu that
      contains a file called "timings".  This file contains one line for
      each of the 5 timings above, with the name followed by a colon and
      4 numbers, which are the count (number of times the code has been
      executed), the total time, the minimum time, and the maximum time,
      all in nanoseconds.
      
      The overhead of the extra code amounts to about 30ns for an hcall that
      is handled in real mode (e.g. H_SET_DABR), which is about 25%.  Since
      production environments may not wish to incur this overhead, the new
      code is conditional on a new config symbol,
      CONFIG_KVM_BOOK3S_HV_EXIT_TIMING.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      b6c295df
  6. 17 4月, 2015 1 次提交
    • A
      powerpc/mm/thp: Make page table walk safe against thp split/collapse · 691e95fd
      Aneesh Kumar K.V 提交于
      We can disable a THP split or a hugepage collapse by disabling irq.
      We do send IPI to all the cpus in the early part of split/collapse,
      and disabling local irq ensure we don't make progress with
      split/collapse. If the THP is getting split we return NULL from
      find_linux_pte_or_hugepte(). For all the current callers it should be ok.
      We need to be careful if we want to use returned pte_t pointer outside
      the irq disabled region. W.r.t to THP split, the pfn remains the same,
      but then a hugepage collapse will result in a pfn change. There are
      few steps we can take to avoid a hugepage collapse.One way is to take page
      reference inside the irq disable region. Other option is to take
      mmap_sem so that a parallel collapse will not happen. We can also
      disable collapse by taking pmd_lock. Another method used by kvm
      subsystem is to check whether we had a mmu_notifer update in between
      using mmu_notifier_retry().
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      691e95fd
  7. 14 4月, 2015 1 次提交
  8. 11 4月, 2015 13 次提交
  9. 10 4月, 2015 2 次提交
    • M
      powerpc: Reword the "returning from prom_init" message · 7e862d7e
      Michael Ellerman 提交于
      We get way too many bug reports that say "the kernel is hung in
      prom_init", which stems from the fact that the last piece of output
      people see is "returning from prom_init".
      
      The kernel is almost never hung in prom_init(), it's just that it's
      crashed somewhere after prom_init() but prior to the console coming up.
      
      The existing message should give a clue to that, ie. "returning from"
      indicates that prom_init() has finished, but it doesn't seem to work.
      Let's try something different.
      
      This prints:
      
        Quiescing Open Firmware ...
        Booting Linux via __start() ...
      
      Which hopefully makes it clear that prom_init() is not the problem, and
      although __start() probably isn't either, it's at least the right place
      to begin looking.
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Wistfully-Acked-by: NJeremy Kerr <jk@ozlabs.org>
      7e862d7e
    • M
      powerpc: Replace mem_init_done with slab_is_available() · f691fa10
      Michael Ellerman 提交于
      We have a powerpc specific global called mem_init_done which is "set on
      boot once kmalloc can be called".
      
      But that's not *quite* true. We set it at the bottom of mem_init(), and
      rely on the fact that mm_init() calls kmem_cache_init() immediately
      after that, and nothing is running in parallel.
      
      So replace it with the generic and 100% correct slab_is_available().
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      f691fa10
  10. 07 4月, 2015 1 次提交
  11. 31 3月, 2015 6 次提交
    • G
      powerpc/eeh: Fix PE#0 check in eeh_add_to_parent_pe() · 433185d2
      Gavin Shan 提交于
      The function eeh_add_parent_pe() is used to create a PE or add one
      edev to its parent PE. Current code checks if PE#0 is valid for the
      later case. Actually, we should validate PE#0 for both cases when
      EEH core regards PE#0 as invalid one (without flag EEH_VALID_PE_ZERO).
      Otherwise, not all EEH devices can be added to its parent PE#0 for
      EEH on P7IOC.
      
      The patch fixes the issue by validating PE#0 for the two cases. So far,
      we don't have PE#0 for EEH on P7IOC, but it will show up when we enable
      M64 for P7IOC. The patch also makes the error message more meaningful.
      Signed-off-by: NGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      433185d2
    • W
      powerpc/powernv: Shift VF resource with an offset · 781a868f
      Wei Yang 提交于
      On PowerNV platform, resource position in M64 BAR implies the PE# the
      resource belongs to. In some cases, adjustment of a resource is necessary
      to locate it to a correct position in M64 BAR .
      
      This patch adds pnv_pci_vf_resource_shift() to shift the 'real' PF IOV BAR
      address according to an offset.
      
      Note:
      
          After doing so, there would be a "hole" in the /proc/iomem when offset
          is a positive value. It looks like the device return some mmio back to
          the system, which actually no one could use it.
      
      [bhelgaas: rework loops, rework overlap check, index resource[]
      conventionally, remove pci_regs.h include, squashed with next patch]
      Signed-off-by: NWei Yang <weiyang@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      781a868f
    • W
      powerpc/powernv: Implement pcibios_iov_resource_alignment() on powernv · 5350ab3f
      Wei Yang 提交于
      Implement pcibios_iov_resource_alignment() on powernv platform.
      
      On PowerNV platform, there are 3 cases for the IOV BAR:
      1. initial state, the IOV BAR size is multiple times of VF BAR size
      2. after expanded, the IOV BAR size is expanded to meet the M64 segment size
      3. sizing stage, the IOV BAR is truncated to 0
      
      pnv_pci_iov_resource_alignment() handle these three cases respectively.
      
      [bhelgaas: adjust to drop "align" parameter, return pci_iov_resource_size()
      if no ppc_md machdep_call version]
      Signed-off-by: NWei Yang <weiyang@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      5350ab3f
    • W
      powerpc/powernv: Reserve additional space for IOV BAR according to the number of total_pe · 6e628c7d
      Wei Yang 提交于
      On PHB3, PF IOV BAR will be covered by M64 BAR to have better PE isolation.
      M64 BAR is a type of hardware resource in PHB3, which could map a range of
      MMIO to PE numbers on powernv platform. And this range is divided equally
      by the number of total_pe with each divided range mapping to a PE number.
      Also, the M64 BAR must map a MMIO range with power-of-two size.
      
      The total_pe number is usually different from total_VFs, which can lead to
      a conflict between MMIO space and the PE number.
      
      For example, if total_VFs is 128 and total_pe is 256, the second half of
      M64 BAR will be part of other PCI device, which may already belong to other
      PEs.
      
      This patch prevents the conflict by reserving additional space for the PF
      IOV BAR, which is total_pe number of VF's BAR size.
      
      [bhelgaas: make dev_printk() output more consistent, index resource[]
      conventionally]
      Signed-off-by: NWei Yang <weiyang@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      6e628c7d
    • W
      powerpc/pci: Don't unset PCI resources for VFs · c3b80fb0
      Wei Yang 提交于
      Flag PCI_REASSIGN_ALL_RSRC is used to ignore resources information setup by
      firmware, so that kernel would re-assign all resources of pci devices.
      
      On powerpc arch, this happens in a header fixup function
      pcibios_fixup_resources(), which will clean up the resources if this flag
      is set. This works fine for PFs, since after clean up, kernel will
      re-assign the resources in pcibios_resource_survey().
      
      Below is a simple call flow on how it works:
      
          pcibios_init
            pcibios_scan_phb
              pci_scan_child_bus
                ...
                  pci_device_add
                    pci_fixup_device(pci_fixup_header)
                      pcibios_fixup_resources                     # header fixup
                        for (i = 0; i < DEVICE_COUNT_RESOURCE; i++)
                          dev->resource[i].start = 0
            pcibios_resource_survey                               # re-assign
              pcibios_allocate_resources
      
      However, the VF resources won't be re-assigned, since the VF resources are
      completely determined by the PF resources, and the PF resources have
      already been reassigned. This means we need to leave VF's resources
      un-cleared in pcibios_fixup_resources().
      
      In this patch, we skip the resource unset process in
      pcibios_fixup_resources(), if the pci_dev is a VF.
      Signed-off-by: NWei Yang <weiyang@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      c3b80fb0
    • G
      powerpc/pci: Create pci_dn for VFs · a8b2f828
      Gavin Shan 提交于
      pci_dn is the extension of PCI device node and is created from device node.
      Unfortunately, VFs are enabled dynamically by PF's driver and they don't
      have corresponding device nodes and pci_dn, which is required to access
      VFs' config spaces.
      
      The patch creates pci_dn for VFs in pcibios_sriov_enable() on their PF,
      and removes pci_dn for VFs in pcibios_sriov_disable() on their PF. When
      VF's pci_dn is created, it's put to the child list of the pci_dn of PF's
      upstream bridge. The pci_dn is linked to pci_dev during early fixup time
      to setup the fast path.
      
      [bhelgaas: add ifdef around add_one_dev_pci_info(), use dev_printk()]
      Signed-off-by: NGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      a8b2f828
  12. 28 3月, 2015 2 次提交
    • M
      powerpc: Add a proper syscall for switching endianness · 529d235a
      Michael Ellerman 提交于
      We currently have a "special" syscall for switching endianness. This is
      syscall number 0x1ebe, which is handled explicitly in the 64-bit syscall
      exception entry.
      
      That has a few problems, firstly the syscall number is outside of the
      usual range, which confuses various tools. For example strace doesn't
      recognise the syscall at all.
      
      Secondly it's handled explicitly as a special case in the syscall
      exception entry, which is complicated enough without it.
      
      As a first step toward removing the special syscall, we need to add a
      regular syscall that implements the same functionality.
      
      The logic is simple, it simply toggles the MSR_LE bit in the userspace
      MSR. This is the same as the special syscall, with the caveat that the
      special syscall clobbers fewer registers.
      
      This version clobbers r9-r12, XER, CTR, and CR0-1,5-7.
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      529d235a
    • T
      powerpc/pseries: Simplify check for suspendability during suspend/migration · c03e7374
      Tyrel Datwyler 提交于
      During suspend/migration operation we must wait for the VASI state reported
      by the hypervisor to become Suspending prior to making the ibm,suspend-me
      RTAS call. Calling routines to rtas_ibm_supend_me() pass a vasi_state variable
      that exposes the VASI state to the caller. This is unnecessary as the caller
      only really cares about the following three conditions; if there is an error
      we should bailout, success indicating we have suspended and woken back up so
      proceed to device tree update, or we are not suspendable yet so try calling
      rtas_ibm_suspend_me again shortly.
      
      This patch removes the extraneous vasi_state variable and simply uses the
      return code to communicate how to proceed. We either succeed, fail, or get
      -EAGAIN in which case we sleep for a second before trying to call
      rtas_ibm_suspend_me again. The behaviour of ppc_rtas() remains the same,
      but migrate_store() now returns the propogated error code on failure.
      Previously -1 was returned from migrate_store() in the  failure case which
      equates to -EPERM and was clearly wrong.
      Signed-off-by: NTyrel Datwyler <tyreld@linux.vnet.ibm.com>
      Cc: Nathan Fontenont <nfont@linux.vnet.ibm.com>
      Cc: Cyril Bur <cyrilbur@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      c03e7374