1. 21 4月, 2015 7 次提交
    • P
      KVM: PPC: Book3S HV: Simplify handling of VCPUs that need a VPA update · d911f0be
      Paul Mackerras 提交于
      Previously, if kvmppc_run_core() was running a VCPU that needed a VPA
      update (i.e. one of its 3 virtual processor areas needed to be pinned
      in memory so the host real mode code can update it on guest entry and
      exit), we would drop the vcore lock and do the update there and then.
      Future changes will make it inconvenient to drop the lock, so instead
      we now remove it from the list of runnable VCPUs and wake up its
      VCPU task.  This will have the effect that the VCPU task will exit
      kvmppc_run_vcpu(), go around the do loop in kvmppc_vcpu_run_hv(), and
      re-enter kvmppc_run_vcpu(), whereupon it will do the necessary call
      to kvmppc_update_vpas() and then rejoin the vcore.
      
      The one complication is that the runner VCPU (whose VCPU task is the
      current task) might be one of the ones that gets removed from the
      runnable list.  In that case we just return from kvmppc_run_core()
      and let the code in kvmppc_run_vcpu() wake up another VCPU task to be
      the runner if necessary.
      
      This all means that the VCORE_STARTING state is no longer used, so we
      remove it.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      d911f0be
    • P
      KVM: PPC: Book3S HV: Accumulate timing information for real-mode code · b6c295df
      Paul Mackerras 提交于
      This reads the timebase at various points in the real-mode guest
      entry/exit code and uses that to accumulate total, minimum and
      maximum time spent in those parts of the code.  Currently these
      times are accumulated per vcpu in 5 parts of the code:
      
      * rm_entry - time taken from the start of kvmppc_hv_entry() until
        just before entering the guest.
      * rm_intr - time from when we take a hypervisor interrupt in the
        guest until we either re-enter the guest or decide to exit to the
        host.  This includes time spent handling hcalls in real mode.
      * rm_exit - time from when we decide to exit the guest until the
        return from kvmppc_hv_entry().
      * guest - time spend in the guest
      * cede - time spent napping in real mode due to an H_CEDE hcall
        while other threads in the same vcore are active.
      
      These times are exposed in debugfs in a directory per vcpu that
      contains a file called "timings".  This file contains one line for
      each of the 5 timings above, with the name followed by a colon and
      4 numbers, which are the count (number of times the code has been
      executed), the total time, the minimum time, and the maximum time,
      all in nanoseconds.
      
      The overhead of the extra code amounts to about 30ns for an hcall that
      is handled in real mode (e.g. H_SET_DABR), which is about 25%.  Since
      production environments may not wish to incur this overhead, the new
      code is conditional on a new config symbol,
      CONFIG_KVM_BOOK3S_HV_EXIT_TIMING.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      b6c295df
    • P
      KVM: PPC: Book3S HV: Create debugfs file for each guest's HPT · e23a808b
      Paul Mackerras 提交于
      This creates a debugfs directory for each HV guest (assuming debugfs
      is enabled in the kernel config), and within that directory, a file
      by which the contents of the guest's HPT (hashed page table) can be
      read.  The directory is named vmnnnn, where nnnn is the PID of the
      process that created the guest.  The file is named "htab".  This is
      intended to help in debugging problems in the host's management
      of guest memory.
      
      The contents of the file consist of a series of lines like this:
      
        3f48 4000d032bf003505 0000000bd7ff1196 00000003b5c71196
      
      The first field is the index of the entry in the HPT, the second and
      third are the HPT entry, so the third entry contains the real page
      number that is mapped by the entry if the entry's valid bit is set.
      The fourth field is the guest's view of the second doubleword of the
      entry, so it contains the guest physical address.  (The format of the
      second through fourth fields are described in the Power ISA and also
      in arch/powerpc/include/asm/mmu-hash64.h.)
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      e23a808b
    • A
      KVM: PPC: Book3S HV: Add helpers for lock/unlock hpte · a4bd6eb0
      Aneesh Kumar K.V 提交于
      This adds helper routines for locking and unlocking HPTEs, and uses
      them in the rest of the code.  We don't change any locking rules in
      this patch.
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      a4bd6eb0
    • A
      KVM: PPC: Book3S HV: Remove RMA-related variables from code · 31037eca
      Aneesh Kumar K.V 提交于
      We don't support real-mode areas now that 970 support is removed.
      Remove the remaining details of rma from the code.  Also rename
      rma_setup_done to hpte_setup_done to better reflect the changes.
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      31037eca
    • M
      KVM: PPC: Book3S HV: Add fast real-mode H_RANDOM implementation. · e928e9cb
      Michael Ellerman 提交于
      Some PowerNV systems include a hardware random-number generator.
      This HWRNG is present on POWER7+ and POWER8 chips and is capable of
      generating one 64-bit random number every microsecond.  The random
      numbers are produced by sampling a set of 64 unstable high-frequency
      oscillators and are almost completely entropic.
      
      PAPR defines an H_RANDOM hypercall which guests can use to obtain one
      64-bit random sample from the HWRNG.  This adds a real-mode
      implementation of the H_RANDOM hypercall.  This hypercall was
      implemented in real mode because the latency of reading the HWRNG is
      generally small compared to the latency of a guest exit and entry for
      all the threads in the same virtual core.
      
      Userspace can detect the presence of the HWRNG and the H_RANDOM
      implementation by querying the KVM_CAP_PPC_HWRNG capability.  The
      H_RANDOM hypercall implementation will only be invoked when the guest
      does an H_RANDOM hypercall if userspace first enables the in-kernel
      H_RANDOM implementation using the KVM_CAP_PPC_ENABLE_HCALL capability.
      Signed-off-by: NMichael Ellerman <michael@ellerman.id.au>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      e928e9cb
    • D
      kvmppc: Implement H_LOGICAL_CI_{LOAD,STORE} in KVM · 99342cf8
      David Gibson 提交于
      On POWER, storage caching is usually configured via the MMU - attributes
      such as cache-inhibited are stored in the TLB and the hashed page table.
      
      This makes correctly performing cache inhibited IO accesses awkward when
      the MMU is turned off (real mode).  Some CPU models provide special
      registers to control the cache attributes of real mode load and stores but
      this is not at all consistent.  This is a problem in particular for SLOF,
      the firmware used on KVM guests, which runs entirely in real mode, but
      which needs to do IO to load the kernel.
      
      To simplify this qemu implements two special hypercalls, H_LOGICAL_CI_LOAD
      and H_LOGICAL_CI_STORE which simulate a cache-inhibited load or store to
      a logical address (aka guest physical address).  SLOF uses these for IO.
      
      However, because these are implemented within qemu, not the host kernel,
      these bypass any IO devices emulated within KVM itself.  The simplest way
      to see this problem is to attempt to boot a KVM guest from a virtio-blk
      device with iothread / dataplane enabled.  The iothread code relies on an
      in kernel implementation of the virtio queue notification, which is not
      triggered by the IO hcalls, and so the guest will stall in SLOF unable to
      load the guest OS.
      
      This patch addresses this by providing in-kernel implementations of the
      2 hypercalls, which correctly scan the KVM IO bus.  Any access to an
      address not handled by the KVM IO bus will cause a VM exit, hitting the
      qemu implementation as before.
      
      Note that a userspace change is also required, in order to enable these
      new hcall implementations with KVM_CAP_PPC_ENABLE_HCALL.
      Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au>
      [agraf: fix compilation]
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      99342cf8
  2. 01 4月, 2015 1 次提交
  3. 20 3月, 2015 1 次提交
    • P
      powerpc/powernv: Fixes for hypervisor doorbell handling · 755563bc
      Paul Mackerras 提交于
      Since we can now use hypervisor doorbells for host IPIs, this makes
      sure we clear the host IPI flag when taking a doorbell interrupt, and
      clears any pending doorbell IPI in pnv_smp_cpu_kill_self() (as we
      already do for IPIs sent via the XICS interrupt controller).  Otherwise
      if there did happen to be a leftover pending doorbell interrupt for
      an offline CPU thread for any reason, it would prevent that thread from
      going into a power-saving mode; it would instead keep waking up because
      of the interrupt.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      755563bc
  4. 04 3月, 2015 1 次提交
  5. 23 2月, 2015 1 次提交
    • P
      powerpc: Re-enable dynticks · fea559f3
      Paul Clarke 提交于
      Implement arch_irq_work_has_interrupt() for powerpc
      
      Commit 9b01f5bf introduced a dependency on "IRQ work self-IPIs" for
      full dynamic ticks to be enabled, by expecting architectures to
      implement a suitable arch_irq_work_has_interrupt() routine.
      
      Several arches have implemented this routine, including x86 (3010279f)
      and arm (09f6edd4), but powerpc was omitted.
      
      This patch implements this routine for powerpc.
      
      The symptom, at boot (on powerpc systems) with "nohz_full=<CPU list>"
      is displayed:
      
           NO_HZ: Can't run full dynticks because arch doesn't support irq work self-IPIs
      
      after this patch:
      
           NO_HZ: Full dynticks CPUs: <CPU list>.
      
      Tested against 3.19.
      
      powerpc implements "IRQ work self-IPIs" by setting the decrementer to 1 in
      arch_irq_work_raise(), which causes a decrementer exception on the next
      timebase tick. We then handle the work in __timer_interrupt().
      
      CC: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NPaul A. Clarke <pc@us.ibm.com>
      Reviewed-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      [mpe: Flesh out change log, fix ws & include guards, remove include of processor.h]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      fea559f3
  6. 17 2月, 2015 1 次提交
  7. 13 2月, 2015 3 次提交
    • A
      all arches, signal: move restart_block to struct task_struct · f56141e3
      Andy Lutomirski 提交于
      If an attacker can cause a controlled kernel stack overflow, overwriting
      the restart block is a very juicy exploit target.  This is because the
      restart_block is held in the same memory allocation as the kernel stack.
      
      Moving the restart block to struct task_struct prevents this exploit by
      making the restart_block harder to locate.
      
      Note that there are other fields in thread_info that are also easy
      targets, at least on some architectures.
      
      It's also a decent simplification, since the restart code is more or less
      identical on all architectures.
      
      [james.hogan@imgtec.com: metag: align thread_info::supervisor_stack]
      Signed-off-by: NAndy Lutomirski <luto@amacapital.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: David Miller <davem@davemloft.net>
      Acked-by: NRichard Weinberger <richard@nod.at>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Steven Miao <realmz6@gmail.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Tested-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: NJames Hogan <james.hogan@imgtec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f56141e3
    • M
      mm: remove remaining references to NUMA hinting bits and helpers · 21d9ee3e
      Mel Gorman 提交于
      This patch removes the NUMA PTE bits and associated helpers.  As a
      side-effect it increases the maximum possible swap space on x86-64.
      
      One potential source of problems is races between the marking of PTEs
      PROT_NONE, NUMA hinting faults and migration.  It must be guaranteed that
      a PTE being protected is not faulted in parallel, seen as a pte_none and
      corrupting memory.  The base case is safe but transhuge has problems in
      the past due to an different migration mechanism and a dependance on page
      lock to serialise migrations and warrants a closer look.
      
      task_work hinting update			parallel fault
      ------------------------			--------------
      change_pmd_range
        change_huge_pmd
          __pmd_trans_huge_lock
            pmdp_get_and_clear
      						__handle_mm_fault
      						pmd_none
      						  do_huge_pmd_anonymous_page
      						  read? pmd_lock blocks until hinting complete, fail !pmd_none test
      						  write? __do_huge_pmd_anonymous_page acquires pmd_lock, checks pmd_none
            pmd_modify
            set_pmd_at
      
      task_work hinting update			parallel migration
      ------------------------			------------------
      change_pmd_range
        change_huge_pmd
          __pmd_trans_huge_lock
            pmdp_get_and_clear
      						__handle_mm_fault
      						  do_huge_pmd_numa_page
      						    migrate_misplaced_transhuge_page
      						    pmd_lock waits for updates to complete, recheck pmd_same
            pmd_modify
            set_pmd_at
      
      Both of those are safe and the case where a transhuge page is inserted
      during a protection update is unchanged.  The case where two processes try
      migrating at the same time is unchanged by this series so should still be
      ok.  I could not find a case where we are accidentally depending on the
      PTE not being cleared and flushed.  If one is missed, it'll manifest as
      corruption problems that start triggering shortly after this series is
      merged and only happen when NUMA balancing is enabled.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      21d9ee3e
    • M
      mm: add p[te|md] protnone helpers for use by NUMA balancing · e7bb4b6d
      Mel Gorman 提交于
      This is a preparatory patch that introduces protnone helpers for automatic
      NUMA balancing.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Acked-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e7bb4b6d
  8. 12 2月, 2015 1 次提交
  9. 06 2月, 2015 2 次提交
    • P
      kvm: add halt_poll_ns module parameter · f7819512
      Paolo Bonzini 提交于
      This patch introduces a new module parameter for the KVM module; when it
      is present, KVM attempts a bit of polling on every HLT before scheduling
      itself out via kvm_vcpu_block.
      
      This parameter helps a lot for latency-bound workloads---in particular
      I tested it with O_DSYNC writes with a battery-backed disk in the host.
      In this case, writes are fast (because the data doesn't have to go all
      the way to the platters) but they cannot be merged by either the host or
      the guest.  KVM's performance here is usually around 30% of bare metal,
      or 50% if you use cache=directsync or cache=writethrough (these
      parameters avoid that the guest sends pointless flush requests, and
      at the same time they are not slow because of the battery-backed cache).
      The bad performance happens because on every halt the host CPU decides
      to halt itself too.  When the interrupt comes, the vCPU thread is then
      migrated to a new physical CPU, and in general the latency is horrible
      because the vCPU thread has to be scheduled back in.
      
      With this patch performance reaches 60-65% of bare metal and, more
      important, 99% of what you get if you use idle=poll in the guest.  This
      means that the tunable gets rid of this particular bottleneck, and more
      work can be done to improve performance in the kernel or QEMU.
      
      Of course there is some price to pay; every time an otherwise idle vCPUs
      is interrupted by an interrupt, it will poll unnecessarily and thus
      impose a little load on the host.  The above results were obtained with
      a mostly random value of the parameter (500000), and the load was around
      1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU.
      
      The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
      that can be used to tune the parameter.  It counts how many HLT
      instructions received an interrupt during the polling period; each
      successful poll avoids that Linux schedules the VCPU thread out and back
      in, and may also avoid a likely trip to C1 and back for the physical CPU.
      
      While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second.
      Of these halts, almost all are failed polls.  During the benchmark,
      instead, basically all halts end within the polling period, except a more
      or less constant stream of 50 per second coming from vCPUs that are not
      running the benchmark.  The wasted time is thus very low.  Things may
      be slightly different for Windows VMs, which have a ~10 ms timer tick.
      
      The effect is also visible on Marcelo's recently-introduced latency
      test for the TSC deadline timer.  Though of course a non-RT kernel has
      awful latency bounds, the latency of the timer is around 8000-10000 clock
      cycles compared to 20000-120000 without setting halt_poll_ns.  For the TSC
      deadline timer, thus, the effect is both a smaller average latency and
      a smaller variance.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      f7819512
    • J
      mm/debug_pagealloc: fix build failure on ppc and some other archs · 7b02190c
      Joonsoo Kim 提交于
      Kim Phillips reported following build failure.
      
        LD      init/built-in.o
        mm/built-in.o: In function `free_pages_prepare':
        mm/page_alloc.c:770: undefined reference to `.kernel_map_pages'
        mm/built-in.o: In function `prep_new_page':
        mm/page_alloc.c:933: undefined reference to `.kernel_map_pages'
        mm/built-in.o: In function `map_pages':
        mm/compaction.c:61: undefined reference to `.kernel_map_pages'
        make: *** [vmlinux] Error 1
      
      Reason for this problem is that commit 031bc574
      ("mm/debug-pagealloc: make debug-pagealloc boottime configurable")
      forgot to remove the old declaration of kernel_map_pages() for some
      architectures.  This patch removes them to fix build failure.
      Reported-by: NKim Phillips <kim.phillips@freescale.com>
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7b02190c
  10. 04 2月, 2015 2 次提交
  11. 02 2月, 2015 1 次提交
    • R
      cxl: Fix device_node reference counting · 6f963ec2
      Ryan Grimm 提交于
      When unbinding and rebinding the driver on a system with a card in PHB0, this
      error condition is reached after a few attempts:
      
      ERROR: Bad of_node_put() on /pciex@3fffe40000000
      CPU: 0 PID: 3040 Comm: bash Not tainted 3.18.0-rc3-12545-g3627ffe #152
      Call Trace:
      [c000000721acb5c0] [c00000000086ef94] .dump_stack+0x84/0xb0 (unreliable)
      [c000000721acb640] [c00000000073a0a8] .of_node_release+0xd8/0xe0
      [c000000721acb6d0] [c00000000044bc44] .kobject_release+0x74/0xe0
      [c000000721acb760] [c0000000007394fc] .of_node_put+0x1c/0x30
      [c000000721acb7d0] [c000000000545cd8] .cxl_probe+0x1a98/0x1d50
      [c000000721acb900] [c0000000004845a0] .local_pci_probe+0x40/0xc0
      [c000000721acb980] [c000000000484998] .pci_device_probe+0x128/0x170
      [c000000721acba30] [c00000000052400c] .driver_probe_device+0xac/0x2a0
      [c000000721acbad0] [c000000000522468] .bind_store+0x108/0x160
      [c000000721acbb70] [c000000000521448] .drv_attr_store+0x38/0x60
      [c000000721acbbe0] [c000000000293840] .sysfs_kf_write+0x60/0xa0
      [c000000721acbc50] [c000000000292500] .kernfs_fop_write+0x140/0x1d0
      [c000000721acbcf0] [c000000000208648] .vfs_write+0xd8/0x260
      [c000000721acbd90] [c000000000208b18] .SyS_write+0x58/0x100
      [c000000721acbe30] [c000000000009258] syscall_exit+0x0/0x98
      
      We are missing a call to of_node_get(). pnv_pci_to_phb_node() should
      call of_node_get() otherwise np's reference count isn't incremented and
      it might go away. Rename pnv_pci_to_phb_node() to pnv_pci_get_phb_node()
      so it's clear it calls of_node_get().
      Signed-off-by: NRyan Grimm <grimm@linux.vnet.ibm.com>
      Acked-by: NIan Munsie <imunsie@au1.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      6f963ec2
  12. 30 1月, 2015 4 次提交
  13. 28 1月, 2015 1 次提交
    • M
      powerpc: Remove some unused functions · 8aa989b8
      Michael Ellerman 提交于
      Remove slice_set_psize() which is not used.
      
      It was added in 3a8247cc "powerpc: Only demote individual slices
      rather than whole process" but was never used.
      
      Remove vsx_assist_exception() which is not used.
      
      It was added in ce48b210 "powerpc: Add VSX context save/restore,
      ptrace and signal support" but was never used.
      
      Remove generic_mach_cpu_die() which is not used.
      
      Its last caller was removed in 375f561a "powerpc/powernv: Always go
      into nap mode when CPU is offline".
      
      Remove mpc7448_hpc2_power_off() and mpc7448_hpc2_halt() which are
      unused.
      
      These were introduced in c5d56332 "[POWERPC] Add general support for
      mpc7448hpc2 (Taiga) platform" but were never used.
      
      This was partially found by using a static code analysis program called
      cppcheck.
      Signed-off-by: NRickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
      [mpe: Update changelog with details on when/why they are unused]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      8aa989b8
  14. 27 1月, 2015 1 次提交
    • C
      powerpc/pseries: Fix endian problems with LE migration · 3df76a9d
      Cyril Bur 提交于
      RTAS events require arguments be passed in big endian while hypercalls
      have their arguments passed in registers and the values should therefore
      be in CPU endian.
      
      The "ibm,suspend_me" 'RTAS' call makes a sequence of hypercalls to setup
      one true RTAS call. This means that "ibm,suspend_me" is handled
      specially in the ppc_rtas() syscall.
      
      The ppc_rtas() syscall has its arguments in big endian and can therefore
      pass these arguments directly to the RTAS call. "ibm,suspend_me" is
      handled specially from within ppc_rtas() (by calling rtas_ibm_suspend_me())
      which has left an endian bug on little endian systems due to the
      requirement of hypercalls. The return value from rtas_ibm_suspend_me()
      gets returned in cpu endian, and is left unconverted, also a bug on
      little endian systems.
      
      rtas_ibm_suspend_me() does not actually make use of the rtas_args that
      it is passed. This patch removes the convoluted use of the rtas_args
      struct to pass params to rtas_ibm_suspend_me() in favour of passing what
      it needs as actual arguments. This patch also ensures the two callers of
      rtas_ibm_suspend_me() pass function parameters in cpu endian and in the
      case of ppc_rtas(), converts the return value.
      
      migrate_store() (the other caller of rtas_ibm_suspend_me()) is from a
      sysfs file which deals with everything in cpu endian so this function
      only underwent cleanup.
      
      This patch has been tested with KVM both LE and BE and on PowerVM both
      LE and BE. Under QEMU/KVM the migration happens without touching these
      code pathes.
      
      For PowerVM there is no obvious regression on BE and the LE code path
      now provides the correct parameters to the hypervisor.
      Signed-off-by: NCyril Bur <cyrilbur@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      3df76a9d
  15. 23 1月, 2015 6 次提交
    • G
      powerpc/eeh: Allow to set maximal frozen times · 1b28f170
      Gavin Shan 提交于
      When PE's frozen count hits maximal allowed frozen times, which is
      5 currently, it will be forced to be offline permanently. Once the
      PE is removed permanently, rebooting machine is required to bring
      the PE back. It's not convienent when testing EEH functionality.
      
      The patch exports the maximal allowed frozen times through debugfs
      entry (/sys/kernel/debug/powerpc/eeh_max_freezes).
      Requested-by: NRyan Grimm <grimm@linux.vnet.ibm.com>
      Signed-off-by: NGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      1b28f170
    • G
      powerpc/eeh: Introduce flag EEH_PE_REMOVED · 432227e9
      Gavin Shan 提交于
      The conditions that one specific PE's frozen count exceeds the maximal
      allowed times (EEH_MAX_ALLOWED_FREEZES) and it's in isolated or recovery
      state indicate the PE was removed permanently implicitly. The patch
      introduces flag EEH_PE_REMOVED to indicate that explicitly so that we
      don't depend on the fixed maximal allowed times, which can be varied as
      we do in subsequent patch.
      
      Flag EEH_PE_REMOVED is expected to be marked for the PE whose frozen
      count exceeds the maximal allowed times, or just failed from recovery.
      Requested-by: NRyan Grimm <grimm@linux.vnet.ibm.com>
      Signed-off-by: NGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      432227e9
    • G
      powerpc/eeh: Fix missed PE#0 on P7IOC · 2aa5cf9e
      Gavin Shan 提交于
      PE#0 should be regarded as valid for P7IOC, while it's invalid for
      PHB3. The patch adds flag EEH_VALID_PE_ZERO to differentiate those
      two cases. Without the patch, we possibly see frozen PE#0 state is
      cleared without EEH recovery taken on P7IOC as following kernel logs
      indicate:
      
      [root@ltcfbl8eb ~]# dmesg
             :
      pci 0000:00     : [PE# 000] Secondary bus 0 associated with PE#0
      pci 0000:01     : [PE# 001] Secondary bus 1 associated with PE#1
      pci 0001:00     : [PE# 000] Secondary bus 0 associated with PE#0
      pci 0001:01     : [PE# 001] Secondary bus 1 associated with PE#1
      pci 0002:00     : [PE# 000] Secondary bus 0 associated with PE#0
      pci 0002:01     : [PE# 001] Secondary bus 1 associated with PE#1
      pci 0003:00     : [PE# 000] Secondary bus 0 associated with PE#0
      pci 0003:01     : [PE# 001] Secondary bus 1 associated with PE#1
      pci 0003:20     : [PE# 002] Secondary bus 32..63 associated with PE#2
             :
      EEH: Clear non-existing PHB#3-PE#0
      EEH: PHB location: U78AE.001.WZS00M9-P1-002
      Signed-off-by: NGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      2aa5cf9e
    • N
      powerpc/kprobes: Fix kallsyms lookup across powerpc ABIv1 and ABIv2 · bf794bf5
      Naveen N. Rao 提交于
      Currently, all non-dot symbols are being treated as function descriptors
      in ABIv1. This is incorrect and is resulting in perf probe not working:
      
        # perf probe do_fork
        Added new event:
        Failed to write event: Invalid argument
          Error: Failed to add events.
        # dmesg | tail -1
        [192268.073063] Could not insert probe at _text+768432: -22
      
      perf probe bases all kernel probes on _text and writes,
      for example, "p:probe/do_fork _text+768432" to
      /sys/kernel/debug/tracing/kprobe_events. In-kernel, _text is being
      considered to be a function descriptor and is resulting in the above
      error.
      
      Fix this by changing how we lookup symbol addresses on ppc64. We first
      check for the dot variant of a symbol and look at the non-dot variant
      only if that fails. In this manner, we avoid having to look at the
      function descriptor.
      
      While at it, also separate out how this works on ABIv2 where
      we don't have dot symbols, but need to use the local entry point.
      Signed-off-by: NNaveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      bf794bf5
    • M
      powerpc: Rename _TIF_SYSCALL_T_OR_A to _TIF_SYSCALL_DOTRACE · 10ea8343
      Michael Ellerman 提交于
      Once upon a time, at least 9 years ago (< 2.6.12), _TIF_SYSCALL_T_OR_A
      meant "TRACE or AUDIT". But these days it means TRACE or AUDIT or
      SECCOMP or TRACEPOINT or NOHZ.
      
      All of those are implemented via syscall_dotrace() so rename the flag to
      that to try and clarify things.
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      10ea8343
    • M
      powerpc: Remove unused CPU_FTR_IABR · ed77d418
      Michael Ellerman 提交于
      We removed the last usage of CPU_FTR_IABR in commit 1ad7d705
      "powerpc/xmon: Enable HW instruction breakpoint on POWER8".
      
      Mark it as free.
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      ed77d418
  16. 22 1月, 2015 1 次提交
  17. 12 1月, 2015 2 次提交
    • J
      uio: uio_fsl_elbc_gpcm: new driver · fbc4a8a8
      John Ogness 提交于
      This driver provides UIO access to memory of a peripheral connected
      to the Freescale enhanced local bus controller (eLBC) interface
      using the general purpose chip-select mode (GPCM).
      Signed-off-by: NJohn Ogness <john.ogness@linutronix.de>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fbc4a8a8
    • M
      powerpc: Work around gcc bug in current_thread_info() · a87e810f
      Michael Ellerman 提交于
      In commit a3e5b356 "powerpc: Don't use local named register variable
      in current_thread_info" Anton changed the way we did current_thread_info()
      to accommodate LLVM, and it was not meant to have any effect elsewhere.
      
      Unfortunately it has exposed a gcc bug, where r1 gets copied into
      another register and then gcc uses that register to restore the toc
      after a function call, even when that register is volatile and has been
      clobbered by the function call.
      
      We could revert Anton's patch, but it's not clear the original code is
      safe either, we may just have been lucky.
      
      The cleanest solution is to just use the existing CURRENT_THREAD_INFO()
      asm macro, and call it using inline asm.
      
      Segher points out we don't need volatile on the asm, if the result of
      the shift is unused it's fine for the compiler to elide it.
      
      Fixes: a3e5b356 ("powerpc: Don't use local named register variable in current_thread_info")
      Reported-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      a87e810f
  18. 29 12月, 2014 2 次提交
    • H
      powerpc/kdump: Ignore failure in enabling big endian exception during crash · c1caae3d
      Hari Bathini 提交于
      In LE kernel, we currently have a hack for kexec that resets the exception
      endian before starting a new kernel as the kernel that is loaded could be a
      big endian or a little endian kernel. In kdump case, resetting exception
      endian fails when one or more cpus is disabled. But we can ignore the failure
      and still go ahead, as in most cases crashkernel will be of same endianess
      as primary kernel and reseting endianess is not even needed in those cases.
      This patch adds a new inline function to say if this is kdump path. This
      function is used at places where such a check is needed.
      Signed-off-by: NHari Bathini <hbathini@linux.vnet.ibm.com>
      [mpe: Rename to kdump_in_progress(), use bool, and edit comment]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      c1caae3d
    • P
      powerpc: Wire up sys_execveat() syscall · 1e5d0fdb
      Pranith Kumar 提交于
      Wire up sys_execveat(). This passes the selftests for the system call.
      
      Check success of execveat(3, '../execveat', 0)... [OK]
      Check success of execveat(5, 'execveat', 0)... [OK]
      Check success of execveat(6, 'execveat', 0)... [OK]
      Check success of execveat(-100, '/home/pranith/linux/...ftests/exec/execveat', 0)... [OK]
      Check success of execveat(99, '/home/pranith/linux/...ftests/exec/execveat', 0)... [OK]
      Check success of execveat(8, '', 4096)... [OK]
      Check success of execveat(17, '', 4096)... [OK]
      Check success of execveat(9, '', 4096)... [OK]
      Check success of execveat(14, '', 4096)... [OK]
      Check success of execveat(14, '', 4096)... [OK]
      Check success of execveat(15, '', 4096)... [OK]
      Check failure of execveat(8, '', 0) with ENOENT... [OK]
      Check failure of execveat(8, '(null)', 4096) with EFAULT... [OK]
      Check success of execveat(5, 'execveat.symlink', 0)... [OK]
      Check success of execveat(6, 'execveat.symlink', 0)... [OK]
      Check success of execveat(-100, '/home/pranith/linux/...xec/execveat.symlink', 0)... [OK]
      Check success of execveat(10, '', 4096)... [OK]
      Check success of execveat(10, '', 4352)... [OK]
      Check failure of execveat(5, 'execveat.symlink', 256) with ELOOP... [OK]
      Check failure of execveat(6, 'execveat.symlink', 256) with ELOOP... [OK]
      Check failure of execveat(-100, '/home/pranith/linux/tools/testing/selftests/exec/execveat.symlink', 256) with ELOOP... [OK]
      Check success of execveat(3, '../script', 0)... [OK]
      Check success of execveat(5, 'script', 0)... [OK]
      Check success of execveat(6, 'script', 0)... [OK]
      Check success of execveat(-100, '/home/pranith/linux/...elftests/exec/script', 0)... [OK]
      Check success of execveat(13, '', 4096)... [OK]
      Check success of execveat(13, '', 4352)... [OK]
      Check failure of execveat(18, '', 4096) with ENOENT... [OK]
      Check failure of execveat(7, 'script', 0) with ENOENT... [OK]
      Check success of execveat(16, '', 4096)... [OK]
      Check success of execveat(16, '', 4096)... [OK]
      Check success of execveat(4, '../script', 0)... [OK]
      Check success of execveat(4, 'script', 0)... [OK]
      Check success of execveat(4, '../script', 0)... [OK]
      Check failure of execveat(4, 'script', 0) with ENOENT... [OK]
      Check failure of execveat(5, 'execveat', 65535) with EINVAL... [OK]
      Check failure of execveat(5, 'no-such-file', 0) with ENOENT... [OK]
      Check failure of execveat(6, 'no-such-file', 0) with ENOENT... [OK]
      Check failure of execveat(-100, 'no-such-file', 0) with ENOENT... [OK]
      Check failure of execveat(5, '', 4096) with EACCES... [OK]
      Check failure of execveat(5, 'Makefile', 0) with EACCES... [OK]
      Check failure of execveat(11, '', 4096) with EACCES... [OK]
      Check failure of execveat(12, '', 4096) with EACCES... [OK]
      Check failure of execveat(99, '', 4096) with EBADF... [OK]
      Check failure of execveat(99, 'execveat', 0) with EBADF... [OK]
      Check failure of execveat(8, 'execveat', 0) with ENOTDIR... [OK]
      Invoke copy of 'execveat' via filename of length 4093:
      Check success of execveat(19, '', 4096)... [OK]
      Check success of execveat(5, 'xxxxxxxxxxxxxxxxxxxx...yyyyyyyyyyyyyyyyyyyy', 0)... [OK]
      Invoke copy of 'script' via filename of length 4093:
      Check success of execveat(20, '', 4096)... [OK]
      /bin/sh: 0: Can't open /dev/fd/5/xxxxxxx(... a long line of x's and y's, 0)... [OK]
      Check success of execveat(5, 'xxxxxxxxxxxxxxxxxxxx...yyyyyyyyyyyyyyyyyyyy', 0)... [OK]
      
      Tested on a 32-bit powerpc system.
      Signed-off-by: NPranith Kumar <bobby.prani@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      1e5d0fdb
  19. 18 12月, 2014 1 次提交
    • M
      powerpc/uaccess: Allow get_user() with bitwise types · 505e4283
      Michael S. Tsirkin 提交于
      At the moment, if p and x are both of the same bitwise type
      (eg. __le32), get_user(x, p) produces a sparse warning.
      
      This is because *p is loaded into a long then cast back to typeof(*p).
      
      When typeof(*p) is a bitwise type (which is uncommon), such a cast needs
      __force, otherwise sparse produces a warning.
      
      For non-bitwise types __force should have no effect, and should not hide
      any legitimate errors.
      
      Note that we are casting to typeof(*p) not typeof(x). Even with the
      cast, if x and *p are of different types we should get the warning, so I
      think we are not loosing the ability to detect any actual errors.
      
      virtio would like to use bitwise types with get_user() so fix these
      spurious warnings by adding __force.
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      [mpe: Fill in changelog with more details]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      505e4283
  20. 17 12月, 2014 1 次提交
    • S
      KVM: PPC: Book3S HV: Improve H_CONFER implementation · 90fd09f8
      Sam Bobroff 提交于
      Currently the H_CONFER hcall is implemented in kernel virtual mode,
      meaning that whenever a guest thread does an H_CONFER, all the threads
      in that virtual core have to exit the guest.  This is bad for
      performance because it interrupts the other threads even if they
      are doing useful work.
      
      The H_CONFER hcall is called by a guest VCPU when it is spinning on a
      spinlock and it detects that the spinlock is held by a guest VCPU that
      is currently not running on a physical CPU.  The idea is to give this
      VCPU's time slice to the holder VCPU so that it can make progress
      towards releasing the lock.
      
      To avoid having the other threads exit the guest unnecessarily,
      we add a real-mode implementation of H_CONFER that checks whether
      the other threads are doing anything.  If all the other threads
      are idle (i.e. in H_CEDE) or trying to confer (i.e. in H_CONFER),
      it returns H_TOO_HARD which causes a guest exit and allows the
      H_CONFER to be handled in virtual mode.
      
      Otherwise it spins for a short time (up to 10 microseconds) to give
      other threads the chance to observe that this thread is trying to
      confer.  The spin loop also terminates when any thread exits the guest
      or when all other threads are idle or trying to confer.  If the
      timeout is reached, the H_CONFER returns H_SUCCESS.  In this case the
      guest VCPU will recheck the spinlock word and most likely call
      H_CONFER again.
      
      This also improves the implementation of the H_CONFER virtual mode
      handler.  If the VCPU is part of a virtual core (vcore) which is
      runnable, there will be a 'runner' VCPU which has taken responsibility
      for running the vcore.  In this case we yield to the runner VCPU
      rather than the target VCPU.
      
      We also introduce a check on the target VCPU's yield count: if it
      differs from the yield count passed to H_CONFER, the target VCPU
      has run since H_CONFER was called and may have already released
      the lock.  This check is required by PAPR.
      Signed-off-by: NSam Bobroff <sam.bobroff@au1.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      90fd09f8