1. 28 October 2015 (7 commits)
  2. 23 October 2015 (1 commit)
    • powerpc/85xx: Load all early TLB entries at once · d9e1831a
      Authored by Scott Wood
      Use an AS=1 trampoline TLB entry to allow all normal TLB1 entries to
      be loaded at once.  This avoids the need to keep the translation that
      code is executing from in the same TLB entry in the final TLB
      configuration as during early boot, which in turn is helpful for
      relocatable kernels (e.g. kdump) where the kernel is not running from
      what would be the first TLB entry.
      
      On e6500, we limit map_mem_in_cams() to the primary hwthread of a
      core (the boot cpu is always considered primary, as a kdump kernel
      can be entered on any cpu).  Each TLB only needs to be set up once,
      and when we do, we don't want another thread to be running when we
      create a temporary trampoline TLB1 entry.
      Signed-off-by: Scott Wood <scottwood@freescale.com>
      d9e1831a
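      A minimal sketch of the primary-hwthread restriction described above,
      assuming the powerpc cpu_thread_in_core() helper and boot_cpuid;
      setup_mem_cams() is a hypothetical stand-in for map_mem_in_cams():

        /* Only the primary hardware thread of a core (or the boot CPU,
         * which may be a secondary thread when entering a kdump kernel)
         * loads the TLB1 array; other threads must not be running while
         * the temporary AS=1 trampoline entry exists. */
        static void setup_mem_cams(unsigned long ram, int cpu)
        {
                if (cpu != boot_cpuid && cpu_thread_in_core(cpu) != 0)
                        return;
                /* ... load all TLB1 entries via the trampoline entry ... */
        }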
  3. 15 October 2015 (3 commits)
  4. 12 October 2015 (1 commit)
    • powerpc/mm: Differentiate between hugetlb and THP during page walk · 891121e6
      Authored by Aneesh Kumar K.V
      We need to properly identify whether a hugepage is an explicit or
      a transparent hugepage in follow_huge_addr(). We used to depend
      on the hugepage shift argument to do that, but in some cases that
      produces the wrong result. For example:
      
      On finding a transparent hugepage we set the hugepage shift to
      PMD_SHIFT, but we can end up clearing the THP pte via
      pmdp_huge_get_and_clear(). We do prevent the pfn page from being
      reused, via kick_all_cpus_sync(), but that happens after the pte
      has been updated to 0. Hence follow_huge_addr() can find the
      hugepage shift set while the transparent huge page check fails
      for that THP pte.
      
      NOTE: We fixed a variant of this race against thp split in commit
      691e95fd
      ("powerpc/mm/thp: Make page table walk safe against thp split/collapse")
      
      Without this patch, we may hit the BUG_ON(flags & FOLL_GET) in
      follow_page_mask occasionally.
      
      In the long term, we may want to switch the ppc64 64k page size
      config to enable CONFIG_ARCH_WANT_GENERAL_HUGETLB.
      Reported-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      891121e6
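      A hedged sketch (not the actual patch) of the distinction the commit
      relies on: a page-table walker must not infer "transparent hugepage"
      from the hugepage shift alone, but re-check the pmd itself, since it
      may have been cleared concurrently. pmd_present() and pmd_trans_huge()
      are standard kernel predicates; the helper name is illustrative:

        /* Illustrative only: a THP is only live if the pmd is still a
         * present transparent-huge entry; pmdp_huge_get_and_clear() may
         * have zeroed it after the shift was observed. */
        static bool pmd_is_live_thp(pmd_t pmd)
        {
                return pmd_present(pmd) && pmd_trans_huge(pmd);
        }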
  5. 02 October 2015 (2 commits)
  6. 01 October 2015 (2 commits)
    • powerpc/vdso: Avoid link stack corruption in __get_datapage() · c974809a
      Authored by Michael Neuling
      powerpc has a link register (lr) used for calling functions. We "bl
      <func>" to call a function, and "blr" to return back to the call site.
      
      The lr is only a single register, so if we call another function from
      inside this function (i.e. nested calls), software must save the lr
      on the software stack before calling the new function. Before
      returning (i.e. before the "blr"), the lr is restored by software from
      the software stack.
      
      This makes branch prediction quite difficult for the processor as it
      will only know the branch target just before the "blr".
      
      To help with this, modern powerpc processors keep a (non-architected)
      hardware stack of lr called a "link stack". When a "bl <func>" is
      run, the lr is pushed onto this stack. When a "blr" is called, the
      branch predictor pops the lr value from the top of the link stack, and
      uses it to predict the branch target. Hence the processor pipeline
      knows a lot earlier the branch target.
      
      This works great, but there are some cases where you execute a "bl"
      without a matching "blr". One such case is when trying to determine
      the program counter (which can't be read directly): here you use
      "bl +4; mflr" to get the program counter. If you do this, the link
      stack gets out of sync with reality, causing the branch predictor to
      mis-predict subsequent function returns.
      
      To avoid this, modern micro-architectures have a special case of bl:
      using the form "bcl 20,31,+4" ensures the processor doesn't push to
      the link stack.
      
      The 32 and 64 bit variants of __get_datapage() use a "bl; mflr"
      sequence to determine the loaded address of the VDSO. The current
      versions attempt to use this special bl variant, but unfortunately
      use +8 rather than the required +4. Hence the current code still
      leaves the link stack out of sync with reality, with the resulting
      performance degradation.
      
      This patch moves it to bcl+4 by moving __kernel_datapage_offset out of
      __get_datapage().
      
      With this patch, a gettimeofday() microbenchmark (which uses
      __get_datapage()) gets a decent bump in performance on POWER7/8.
      
      For the benchmark in tools/testing/selftests/powerpc/benchmarks/gettimeofday.c
        POWER8:
          64bit gets ~4% improvement
          32bit gets ~9% improvement
        POWER7:
          64bit gets ~7% improvement
      Signed-off-by: Michael Neuling <mikey@neuling.org>
      Reported-by: Aaron Sawdey <sawdey@us.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      c974809a
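      A minimal sketch of the link-stack-safe "get PC" idiom described
      above, written as GCC inline assembly; it is illustrative and not the
      actual VDSO code, which applies the same bcl+4 trick around
      __kernel_datapage_offset:

        /* bcl 20,31 to the very next instruction does not push the link
         * stack, so the following mflr reads the current address without
         * desynchronising the return-address predictor. */
        static inline unsigned long current_address(void)
        {
                unsigned long pc;

                asm volatile("bcl 20,31,1f\n"
                             "1:    mflr %0" : "=r" (pc) : : "lr");
                return pc;
        }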
    • powerpc/vdso: Emit GNU & SysV hashes · 787b393c
      Authored by Michael Ellerman
      Andy Lutomirski says:
      
        Some dynamic loaders may be slightly faster if a GNU hash is
        available.
      
        This is unlikely to have any measurable effect on the time it takes
        to resolve vdso symbols (since there are so few of them).  In some
        contexts, it can be a win for a different reason: if every DSO has a
        GNU hash section, then libc can avoid calculating SysV hashes at
        all. Both musl and glibc appear to have this optimization.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      787b393c
  7. 17 September 2015 (2 commits)
  8. 16 September 2015 (1 commit)
    • PCI: Revert "PCI: Call pci_read_bridge_bases() from core instead of arch code" · 237865f1
      Authored by Bjorn Helgaas
      Revert dff22d20 ("PCI: Call pci_read_bridge_bases() from core instead
      of arch code").
      
      Reading PCI bridge windows is not arch-specific in itself, but there is PCI
      core code that doesn't work correctly if we read them too early.  For
      example, Hannes found this case on an ARM Freescale i.mx6 board:
      
        pci_bus 0000:00: root bus resource [mem 0x01000000-0x01efffff]
        pci 0000:00:00.0: PCI bridge to [bus 01-ff]
        pci 0000:00:00.0: BAR 8: no space for [mem size 0x01000000] (mem window)
        pci 0000:01:00.0: BAR 2: failed to assign [mem size 0x00200000]
        pci 0000:01:00.0: BAR 1: failed to assign [mem size 0x00004000]
        pci 0000:01:00.0: BAR 0: failed to assign [mem size 0x00000100]
      
      The 00:00.0 mem window needs to be at least 3MB: the 01:00.0 device needs
      0x204100 bytes of space, and mem windows are megabyte-aligned.
      
      Bus sizing can increase a bridge window size, but never *decrease* it (see
      d65245c3 ("PCI: don't shrink bridge resources")).  Prior to
      dff22d20, ARM didn't read bridge windows at all, so the "original size"
      was zero, and we assigned a 3MB window.
      
      After dff22d20, we read the bridge windows before sizing the bus.  The
      firmware programmed a 16MB window (size 0x01000000) in 00:00.0, and since
      we never decrease the size, we kept 16MB even though we only needed 3MB.
      But 16MB doesn't fit in the host bridge aperture, so we failed to assign
      space for the window and the downstream devices.
      
      I think this is a defect in the PCI core: we shouldn't rely on the firmware
      to assign sensible windows.
      
      Ray reported a similar problem, also on ARM, with Broadcom iProc.
      
      Issues like this are too hard to fix right now, so revert dff22d20.
      Reported-by: Hannes <oe5hpm@gmail.com>
      Reported-by: Ray Jui <rjui@broadcom.com>
      Link: http://lkml.kernel.org/r/CAAa04yFQEUJm7Jj1qMT57-LG7ZGtnhNDBe=PpSRa70Mj+XhW-A@mail.gmail.com
      Link: http://lkml.kernel.org/r/55F75BB8.4070405@broadcom.com
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      Acked-by: Yinghai Lu <yinghai@kernel.org>
      Acked-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      237865f1
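      A small standalone example of the window arithmetic in the log above:
      the 01:00.0 BARs need 0x200000 + 0x4000 + 0x100 = 0x204100 bytes, and
      bridge memory windows are megabyte-aligned, so the minimum window is
      3MB, which fits the 15MB root bus aperture, while the firmware's 16MB
      window does not:

        #include <stdio.h>

        #define SZ_1M 0x100000UL

        int main(void)
        {
                unsigned long need = 0x200000 + 0x4000 + 0x100;  /* 0x204100 */
                unsigned long win  = (need + SZ_1M - 1) & ~(SZ_1M - 1);

                /* prints 0x300000 (3 MB) */
                printf("minimum bridge window: 0x%lx (%lu MB)\n",
                       win, win / SZ_1M);
                return 0;
        }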
  9. 15 September 2015 (1 commit)
  10. 28 August 2015 (1 commit)
    • powerpc/eeh: Fix fenced PHB caused by eeh_slot_error_detail() · 25980013
      Authored by Gavin Shan
      The config space of some PCI devices can't be accessed while their
      PEs are in the frozen state; doing so can fence the PHB. Those PEs
      are identified by the flag EEH_PE_CFG_RESTRICTED, meaning
      EEH_PE_CFG_BLOCKED is set automatically when the PE is put into the
      frozen state (EEH_PE_ISOLATED). eeh_slot_error_detail() restores
      PCI device BARs with eeh_pe_restore_bars(), which then calls
      eeh_ops->restore_config() to reinitialize the PCI device in (OPAL)
      firmware. Those restore_config() PCI config accesses are what fence
      the PHB. The problem was reported on the adapter below:
      
         0001:01:00.0 0200: 14e4:168e (rev 10)
         0001:01:00.0 Ethernet controller: Broadcom Corporation \
                      NetXtreme II BCM57810 10 Gigabit Ethernet (rev 10)
      
      This fixes the issue by skipping eeh_pe_restore_bars() in
      eeh_slot_error_detail() when EEH_PE_CFG_BLOCKED is set for the PE.
      
      Fixes: b6541db1 ("powerpc/eeh: Block PCI config access upon frozen PE")
      Cc: stable@vger.kernel.org # v4.0+
      Reported-by: Manvanthara B. Puttashankar <mputtash@in.ibm.com>
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      25980013
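      A hedged sketch of the guard this commit describes; the flag and
      helper names follow the commit text, but the snippet is illustrative
      rather than the literal patch:

        /* In eeh_slot_error_detail(): leave config space alone while the
         * PE has config access blocked, otherwise the PHB may be fenced. */
        if (!(pe->state & EEH_PE_CFG_BLOCKED))
                eeh_pe_restore_bars(pe);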
  11. 26 August 2015 (1 commit)
    • powerpc/PCI: Disable MSI/MSI-X interrupts at PCI probe time in OF case · 4d9aac39
      Authored by Guilherme G. Piccoli
      Since commit 1851617c ("PCI/MSI: Disable MSI at enumeration even if
      kernel doesn't support MSI"), the setup of dev->msi_cap/msix_cap and the
      disable of MSI/MSI-X interrupts isn't being done at PCI probe time, as
      the logic responsible for this was moved in the aforementioned commit
      from pci_device_add() to pci_setup_device(). The latter function is not
      reachable on PowerPC pseries platform during Open Firmware PCI probing
      time.
      
      This manifests as drivers not being able to enable MSI, e.g.:
      
        bnx2x 0000:01:00.0: no msix capability found
      
      This patch calls pci_msi_setup_pci_dev() explicitly to disable MSI/MSI-X
      during PCI probe time on pSeries platform.
      
      Fixes: 1851617c ("PCI/MSI: Disable MSI at enumeration even if kernel doesn't support MSI")
      [mpe: Flesh out change log and clarify comment]
      Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      4d9aac39
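      A hedged sketch of the fix described above: call pci_msi_setup_pci_dev()
      from the pseries Open Firmware device-creation path so msi_cap/msix_cap
      are recorded and MSI/MSI-X are disabled, as pci_setup_device() would
      normally do. The wrapper shown is hypothetical:

        static void of_pci_msi_fixup(struct pci_dev *dev)
        {
                /* pci_setup_device() is never reached when devices are
                 * created from the Open Firmware tree, so do its MSI
                 * setup here instead. */
                pci_msi_setup_pci_dev(dev);
        }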
  12. 22 August 2015 (2 commits)
    • KVM: PPC: Book3S HV: Implement dynamic micro-threading on POWER8 · b4deba5c
      Authored by Paul Mackerras
      This builds on the ability to run more than one vcore on a physical
      core by using the micro-threading (split-core) modes of the POWER8
      chip.  Previously, only vcores from the same VM could be run together,
      and (on POWER8) only if they had just one thread per core.  With the
      ability to split the core on guest entry and unsplit it on guest exit,
      we can run up to 8 vcpu threads from up to 4 different VMs, and we can
      run multiple vcores with 2 or 4 vcpus per vcore.
      
      Dynamic micro-threading is only available if the static configuration
      of the cores is whole-core mode (unsplit), and only on POWER8.
      
      To manage this, we introduce a new kvm_split_mode struct which is
      shared across all of the subcores in the core, with a pointer in the
      paca on each thread.  In addition we extend the core_info struct to
      have information on each subcore.  When deciding whether to add a
      vcore to the set already on the core, we now have two possibilities:
      (a) piggyback the vcore onto an existing subcore, or (b) start a new
      subcore.
      
      Currently, when any vcpu needs to exit the guest and switch to host
      virtual mode, we interrupt all the threads in all subcores and switch
      the core back to whole-core mode.  It may be possible in future to
      allow some of the subcores to keep executing in the guest while
      subcore 0 switches to the host, but that is not implemented in this
      patch.
      
      This adds a module parameter called dynamic_mt_modes which controls
      which micro-threading (split-core) modes the code will consider, as a
      bitmap.  In other words, if it is 0, no micro-threading mode is
      considered; if it is 2, only 2-way micro-threading is considered; if
      it is 4, only 4-way, and if it is 6, both 2-way and 4-way
      micro-threading mode will be considered.  The default is 6.
      
      With this, we now have secondary threads which are the primary thread
      for their subcore and therefore need to do the MMU switch.  These
      threads will need to be started even if they have no vcpu to run, so
      we use the vcore pointer in the PACA rather than the vcpu pointer to
      trigger them.
      
      It is now possible for thread 0 to find that an exit has been
      requested before it gets to switch the subcore state to the guest.  In
      that case we haven't added the guest's timebase offset to the
      timebase, so we need to be careful not to subtract the offset in the
      guest exit path.  In fact we just skip the whole path that switches
      back to host context, since we haven't switched to the guest context.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      b4deba5c
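      A hedged sketch of how the dynamic_mt_modes bitmap described above is
      consumed; the module-parameter declaration mirrors the text, while the
      helper is illustrative:

        /* Bit value 2 permits 2-way split-core, bit value 4 permits 4-way;
         * the default of 6 permits both, 0 disables dynamic micro-threading. */
        static int dynamic_mt_modes = 6;
        module_param(dynamic_mt_modes, int, S_IRUGO | S_IWUSR);

        static bool split_mode_allowed(int n_subcores)
        {
                return n_subcores == 1 || (dynamic_mt_modes & n_subcores);
        }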
    • KVM: PPC: Book3S HV: Make use of unused threads when running guests · ec257165
      Authored by Paul Mackerras
      When running a virtual core of a guest that is configured with fewer
      threads per core than the physical cores have, the extra physical
      threads are currently unused.  This makes it possible to use them to
      run one or more other virtual cores from the same guest when certain
      conditions are met.  This applies on POWER7, and on POWER8 to guests
      with one thread per virtual core.  (It doesn't apply to POWER8 guests
      with multiple threads per vcore because they require a 1-1 virtual to
      physical thread mapping in order to be able to use msgsndp and the
      TIR.)
      
      The idea is that we maintain a list of preempted vcores for each
      physical cpu (i.e. each core, since the host runs single-threaded).
      Then, when a vcore is about to run, it checks to see if there are
      any vcores on the list for its physical cpu that could be
      piggybacked onto this vcore's execution.  If so, those additional
      vcores are put into state VCORE_PIGGYBACK and their runnable VCPU
      threads are started as well as the original vcore, which is called
      the master vcore.
      
      After the vcores have exited the guest, the extra ones are put back
      onto the preempted list if any of their VCPUs are still runnable and
      not idle.
      
      This means that vcpu->arch.ptid is no longer necessarily the same as
      the physical thread that the vcpu runs on.  In order to make it easier
      for code that wants to send an IPI to know which CPU to target, we
      now store that in a new field in struct vcpu_arch, called thread_cpu.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Tested-by: Laurent Vivier <lvivier@redhat.com>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      ec257165
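      A hedged sketch of the thread_cpu field mentioned above: code that
      needs to interrupt a vcpu targets the physical thread it is actually
      running on rather than deriving it from ptid. Apart from
      vcpu->arch.thread_cpu, the names are illustrative:

        static void kick_vcpu_thread(struct kvm_vcpu *vcpu)
        {
                int cpu = vcpu->arch.thread_cpu;

                /* ptid no longer maps 1:1 to a physical thread */
                if (cpu >= 0)
                        smp_send_reschedule(cpu);
        }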
  13. 20 August 2015 (1 commit)
  14. 19 August 2015 (1 commit)
  15. 18 August 2015 (4 commits)
  16. 14 August 2015 (1 commit)
    • powerpc/eeh: Probe after unbalanced kref check · e642d11b
      Authored by Daniel Axtens
      In the complete hotplug case, EEH PEs are supposed to be released
      and set to NULL. Normally, this is done by eeh_remove_device(),
      which is called from pcibios_release_device().
      
      However, if something is holding a kref to the device, it will not
      be released, and the PE will remain. eeh_add_device_late() has
      a check for this which will explicitly destroy the PE in this case.
      
      This check in eeh_add_device_late() occurs after a call to
      eeh_ops->probe(). On PowerNV, probe is a pointer to pnv_eeh_probe(),
      which will exit without probing if there is an existing PE.
      
      This means that on PowerNV, devices with outstanding krefs will not
      be rediscovered by EEH correctly after a complete hotplug. This is
      affecting CXL (CAPI) devices in the field.
      
      Put the probe after the kref check so that the PE is destroyed
      and affected devices are correctly rediscovered by EEH.
      
      Fixes: d91dafc0 ("powerpc/eeh: Delay probing EEH device during hotplug")
      Cc: stable@vger.kernel.org
      Cc: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Daniel Axtens <dja@axtens.net>
      Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      e642d11b
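      A hedged sketch of the reordering described above; the call names are
      only illustrative, the point being that the stale-PE check now runs
      before eeh_ops->probe():

        /* eeh_add_device_late(), illustrative ordering only */
        if (edev->pe) {
                /* an unbalanced kref kept the device (and PE) alive across
                 * the hot unplug: destroy the stale PE first, otherwise the
                 * PowerNV probe bails out early and EEH never rediscovers
                 * the device */
                eeh_rmv_from_parent_pe(edev);
        }
        eeh_ops->probe(pdn, NULL);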
  17. 12 August 2015 (2 commits)
  18. 11 August 2015 (1 commit)
    • cleanup IORESOURCE_CACHEABLE vs ioremap() · 92b19ff5
      Authored by Dan Williams
      Quoting Arnd:
          I was thinking the opposite approach and basically removing all uses
          of IORESOURCE_CACHEABLE from the kernel. There are only a handful of
          them, and we can probably replace them all with hardcoded
          ioremap_cached() calls in the cases they are actually useful.
      
      All existing usages of IORESOURCE_CACHEABLE call ioremap() instead of
      ioremap_nocache() if the resource is cacheable, however ioremap() is
      uncached by default. Clearly none of the existing usages care about the
      cacheability. Particularly devm_ioremap_resource() never worked as
      advertised since it always fell back to plain ioremap().
      
      Clean this up as the new direction we want is to convert
      ioremap_<type>() usages to memremap(..., flags).
      Suggested-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      92b19ff5
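      A hedged sketch of the caller pattern this cleanup removes, per the
      commit text; resource and variable names are illustrative:

        /* The flag never changed the mapping type: ioremap() is uncached
         * by default, so both branches produced an uncached mapping. */
        void __iomem *addr;

        if (res->flags & IORESOURCE_CACHEABLE)
                addr = ioremap(res->start, resource_size(res));
        else
                addr = ioremap_nocache(res->start, resource_size(res));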
  19. 10 August 2015 (1 commit)
    • powerpc/time: Migrate to new 'set-state' interface · 37a13e78
      Authored by Viresh Kumar
      Migrate powerpc driver to the new 'set-state' interface provided by
      clockevents core, the earlier 'set-mode' interface is marked obsolete
      now.
      
      This also enables us to implement callbacks for new states of clockevent
      devices, for example: ONESHOT_STOPPED.
      
      We weren't doing anything in ->set_mode(ONESHOT) and so
      set_state_oneshot() isn't implemented.
      
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
      37a13e78
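      A hedged sketch of the interface shape being migrated to: per-state
      callbacks replace the single set_mode hook. The device and callback
      names are illustrative of the powerpc decrementer case described
      above:

        static int decrementer_shutdown(struct clock_event_device *dev)
        {
                /* stop the decrementer; there is no set_state_oneshot
                 * callback, matching the empty set_mode(ONESHOT) handler
                 * noted above */
                return 0;
        }

        static struct clock_event_device decrementer_clockevent = {
                .name               = "decrementer",
                .features           = CLOCK_EVT_FEAT_ONESHOT,
                .set_state_shutdown = decrementer_shutdown,
        };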
  20. 08 August 2015 (1 commit)
    • powerpc/fsl: Force coherent memory on e500mc derivatives · c6023202
      Authored by Scott Wood
      In CoreNet systems it is not allowed to mix M and non-M mappings to the
      same memory, and coherent DMA accesses are considered to be M mappings
      for this purpose.  Ignoring this has been observed to cause hard
      lockups in non-SMP kernels on e6500.
      
      Furthermore, e6500 implements the LRAT (logical to real address table)
      which allows KVM guests to control the WIMGE bits.  This means that
      KVM cannot force the M bit on the way it usually does, so the guest had
      better set it itself.
      Signed-off-by: Scott Wood <scottwood@freescale.com>
      c6023202
  21. 07 August 2015 (1 commit)
    • signal: fix information leak in copy_siginfo_from_user32 · 3c00cb5e
      Authored by Amanieu d'Antras
      This function can leak kernel stack data when the user siginfo_t has a
      positive si_code value.  The top 16 bits of si_code describe which fields
      in the siginfo_t union are active, but they are treated inconsistently
      between copy_siginfo_from_user32, copy_siginfo_to_user32 and
      copy_siginfo_to_user.
      
      copy_siginfo_from_user32 is called from rt_sigqueueinfo and
      rt_tgsigqueueinfo, in which the user has full control over the top 16
      bits of si_code.
      
      This fixes the following information leaks:
      x86:   8 bytes leaked when sending a signal from a 32-bit process to
             itself. This leak grows to 16 bytes if the process uses x32.
             (si_code = __SI_CHLD)
      x86:   100 bytes leaked when sending a signal from a 32-bit process to
             a 64-bit process. (si_code = -1)
      sparc: 4 bytes leaked when sending a signal from a 32-bit process to a
             64-bit process. (si_code = any)
      
      parisc and s390 have similar bugs, but they are not vulnerable because
      rt_[tg]sigqueueinfo have checks that prevent sending a positive si_code
      to a different process.  These bugs are also fixed for consistency.
      Signed-off-by: Amanieu d'Antras <amanieu@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3c00cb5e
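      A hedged sketch (not the actual patch) of the layout selector the
      commit describes: the top 16 bits of a kernel-generated si_code say
      which siginfo union member is live, so the copy routines must agree
      on them and must never trust a positive, user-supplied si_code. The
      mask below is illustrative:

        #define SI_LAYOUT_MASK  0xffff0000u     /* illustrative constant */

        static inline unsigned int si_layout(int si_code)
        {
                /* e.g. the __SI_CHLD layout selects the _sigchld member */
                return si_code & SI_LAYOUT_MASK;
        }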
  22. 06 August 2015 (3 commits)