1. 30 January 2015 (1 commit)
    • vm: add VM_FAULT_SIGSEGV handling support · 33692f27
      Committed by Linus Torvalds
      The core VM already knows about VM_FAULT_SIGBUS, but cannot return a
      "you should SIGSEGV" error, because the SIGSEGV case was generally
      handled by the caller - usually the architecture fault handler.
      
      That results in lots of duplication - all the architecture fault
      handlers end up doing very similar "look up vma, check permissions, do
      retries etc" - but it generally works.  However, there are cases where
      the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV.
      
      In particular, when accessing the stack guard page, libsigsegv expects a
      SIGSEGV.  And it usually got one, because the stack growth is handled by
      that duplicated architecture fault handler.
      
      However, when the generic VM layer started propagating the error return
      from the stack expansion in commit fee7e49d ("mm: propagate error
      from stack expansion even for guard page"), that now exposed the
      existing VM_FAULT_SIGBUS result to user space.  And user space really
      expected SIGSEGV, not SIGBUS.
      
      To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those
      duplicate architecture fault handlers about it.  They all already have
      the code to handle SIGSEGV, so it's about just tying that new return
      value to the existing code, but it's all a bit annoying.
      
      This is the mindless minimal patch to do this.  A more extensive patch
      would be to try to gather up the mostly shared fault handling logic into
      one generic helper routine, and long-term we really should do that
      cleanup.
      
      Just from this patch, you can generally see that most architectures just
      copied (directly or indirectly) the old x86 way of doing things, but in
      the meantime that original x86 model has been improved to hold the VM
      semaphore for shorter times etc and to handle VM_FAULT_RETRY and other
      "newer" things, so it would be a good idea to bring all those
      improvements to the generic case and teach other architectures about
      them too.
      Reported-and-tested-by: Takashi Iwai <tiwai@suse.de>
      Tested-by: Jan Engelhardt <jengelh@inai.de>
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots"
      Cc: linux-arch@vger.kernel.org
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      33692f27
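      A hedged C sketch of what the per-architecture change amounts to: the fault
      handler's error path gains a SIGSEGV branch next to the existing SIGBUS one.
      The helper name force_sig_info_fault() is a stand-in for whatever
      signal-raising routine each architecture uses; this is illustrative, not any
      one architecture's exact code.

          /* Sketch of an arch fault handler's error path after this patch. */
          if (fault & VM_FAULT_OOM) {
                  pagefault_out_of_memory();
          } else if (fault & VM_FAULT_SIGSEGV) {
                  /* New: the core VM asked for a SIGSEGV directly. */
                  force_sig_info_fault(SIGSEGV, SEGV_MAPERR, address);
          } else if (fault & VM_FAULT_SIGBUS) {
                  force_sig_info_fault(SIGBUS, BUS_ADRERR, address);
          }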
  2. 22 January 2015 (1 commit)
    • powerpc/powernv: Restore LPCR with LPCR_PECE1 cleared · 0eb13208
      Committed by Shreyas B. Prabhu
      LPCR_PECE1 bit controls whether decrementer interrupts are allowed to
      cause exit from power-saving mode. While waking up from winkle, restoring
      LPCR with LPCR_PECE1 set (i.e. decrementer interrupts allowed) can cause an
      issue in the following scenario:
      
      - All the threads in a core are offlined. The core enters deep winkle.
      - Spurious interrupt wakes up a thread in the core. Here LPCR is restored
        with LPCR_PECE1 bit set.
      - Since it was a spurious interrupt on an offline thread, the thread clears
        the interrupt and goes back to winkle.
      - Here, before the thread executes winkle and puts the core into deep winkle,
        if a decrementer interrupt occurs on any of the sibling threads in the core,
        that thread wakes up.
      - Since the offline loop flushes an interrupt only in the external-interrupt
        case, the decrementer interrupt does not get flushed. So at this stage the
        thread is stuck in a loop: waking up at 0x100 due to the decrementer
        interrupt, not flushing it (as only external interrupts get flushed),
        entering winkle, and waking up at 0x100 again.
      
      Fix this by programming PORE to restore LPCR with LPCR_PECE1 bit
      cleared when waking up from winkle.
      Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      0eb13208
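      In C terms the fix is a one-line mask before handing LPCR to the PORE for
      restore; a hedged sketch (opal_slw_set_reg() is the OPAL call introduced by
      the winkle support patch further down this page; surrounding code elided):

          /* Restore LPCR with decrementer wakeups (PECE1) disabled, so a
           * spurious wakeup on an offline thread cannot loop forever on
           * decrementer interrupts at 0x100.
           */
          uint64_t lpcr_val = mfspr(SPRN_LPCR) & ~(uint64_t)LPCR_PECE1;

          rc = opal_slw_set_reg(pir, SPRN_LPCR, lpcr_val);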
  3. 21 January 2015 (1 commit)
  4. 20 January 2015 (1 commit)
    • module: remove mod arg from module_free, rename module_memfree(). · be1f221c
      Committed by Rusty Russell
      Nothing needs the module pointer any more, and the next patch will
      call it from RCU, where the module itself might no longer exist.
      Removing the arg is the safest approach.
      
      This just codifies the use of the module_alloc/module_free pattern
      which ftrace and bpf use.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: x86@kernel.org
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: linux-cris-kernel@axis.com
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mips@linux-mips.org
      Cc: nios2-dev@lists.rocketboards.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: sparclinux@vger.kernel.org
      Cc: netdev@vger.kernel.org
      be1f221c
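      The interface change itself is small; a sketch of the before and after (the
      weak default lives in kernel/module.c and simply vfree()s the region):

          /* Before: callers had to supply a module pointer they never used. */
          void module_free(struct module *mod, void *module_region);

          /* After: no module argument, so the next patch can safely call it
           * from an RCU callback where the module may already be gone.
           */
          void __weak module_memfree(void *module_region)
          {
                  vfree(module_region);
          }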
  5. 19 January 2015 (1 commit)
  6. 17 January 2015 (1 commit)
    • powerpc/PCI: Clip bridge windows to fit in upstream windows · 3ebfe46a
      Committed by Yinghai Lu
      Every PCI-PCI bridge window should fit inside an upstream bridge window
      because orphaned address space is unreachable from the primary side of the
      upstream bridge.  If we inherit invalid bridge windows that overlap an
      upstream window from firmware, clip them to fit and update the bridge
      accordingly.
      
      [bhelgaas: changelog]
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=85491
      Reported-by: Marek Kordik <kordikmarek@gmail.com>
      Fixes: 5b285415 ("PCI: Restrict 64-bit prefetchable bridge windows to 64-bit resources")
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      CC: Paul Mackerras <paulus@samba.org>
      CC: Michael Ellerman <mpe@ellerman.id.au>
      CC: Gavin Shan <gwshan@linux.vnet.ibm.com>
      CC: Anton Blanchard <anton@samba.org>
      CC: Sebastian Ott <sebott@linux.vnet.ibm.com>
      CC: Wei Yang <weiyang@linux.vnet.ibm.com>
      CC: Andrew Murray <amurray@embedded-bits.co.uk>
      CC: linuxppc-dev@lists.ozlabs.org
      3ebfe46a
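      The clipping itself is interval arithmetic; a hedged sketch of the idea (the
      real helper, pci_bus_clip_resource(), carries more bookkeeping around
      updating the bridge):

          /* Clip window 'res' so it fits inside the upstream window 'r';
           * space outside the parent is unreachable from the primary side
           * of the upstream bridge anyway.
           */
          static bool clip_to_parent(struct resource *res, struct resource *r)
          {
                  resource_size_t start = max(res->start, r->start);
                  resource_size_t end = min(res->end, r->end);

                  if (start > end)
                          return false;   /* no overlap to keep */

                  res->start = start;
                  res->end = end;
                  return true;
          }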
  7. 13 January 2015 (1 commit)
    • crypto: add missing crypto module aliases · 3e14dcf7
      Committed by Mathias Krause
      Commit 5d26a105 ("crypto: prefix module autoloading with "crypto-"")
      changed the automatic module loading when requesting crypto algorithms
      to prefix all module requests with "crypto-". This requires all crypto
      modules to have a crypto specific module alias even if their file name
      would otherwise match the requested crypto algorithm.
      
      Even though commit 5d26a105 added those aliases for a vast number of
      modules, it was missing a few. Add the required MODULE_ALIAS_CRYPTO
      annotations to those files to make them get loaded automatically again.
      This fixes, e.g., requesting 'ecb(blowfish-generic)', which used to work
      with kernels v3.18 and below.
      
      Also change MODULE_ALIAS() lines to MODULE_ALIAS_CRYPTO(). The former
      won't work for crypto modules any more.
      
      Fixes: 5d26a105 ("crypto: prefix module autoloading with "crypto-"")
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Mathias Krause <minipli@googlemail.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      3e14dcf7
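      The per-module fix is typically one line; for example (illustrative
      algorithm name, not a specific file from the patch):

          /* Old: matches requests for "sha1", but after 5d26a105 the crypto
           * layer asks for "crypto-sha1", so autoloading silently breaks.
           */
          MODULE_ALIAS("sha1");

          /* New: MODULE_ALIAS_CRYPTO emits both the plain and the
           * "crypto-"-prefixed alias, restoring autoloading.
           */
          MODULE_ALIAS_CRYPTO("sha1");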
  8. 12 January 2015 (2 commits)
    • powerpc: Work around gcc bug in current_thread_info() · a87e810f
      Committed by Michael Ellerman
      In commit a3e5b356 "powerpc: Don't use local named register variable
      in current_thread_info" Anton changed the way we did current_thread_info()
      to accommodate LLVM, and it was not meant to have any effect elsewhere.
      
      Unfortunately it has exposed a gcc bug, where r1 gets copied into
      another register and then gcc uses that register to restore the toc
      after a function call, even when that register is volatile and has been
      clobbered by the function call.
      
      We could revert Anton's patch, but it's not clear the original code is
      safe either, we may just have been lucky.
      
      The cleanest solution is to just use the existing CURRENT_THREAD_INFO()
      asm macro, and call it using inline asm.
      
      Segher points out we don't need volatile on the asm; if the result of
      the shift is unused, it's fine for the compiler to elide it.
      
      Fixes: a3e5b356 ("powerpc: Don't use local named register variable in current_thread_info")
      Reported-by: Alexander Graf <agraf@suse.de>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      a87e810f
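      The fix, reconstructed from the description above: compute the pointer
      through the existing CURRENT_THREAD_INFO() assembler macro via inline asm,
      so gcc never sees a copy of r1 it might wrongly reuse to restore the TOC:

          static inline struct thread_info *current_thread_info(void)
          {
                  unsigned long val;

                  /* Mask the stack pointer down to the thread_info base.
                   * Not volatile: if the result is unused the compiler may
                   * elide the shift, which is fine.
                   */
                  asm (CURRENT_THREAD_INFO(%0, 1) : "=r" (val));

                  return (struct thread_info *)val;
          }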
    • powernv: Fix OPAL tracepoint code · bfe5fda8
      Committed by Anton Blanchard
      Patch c49f6353 ("powernv: Add OPAL tracepoints") has a spurious
      store to the stack:
      
      	ld      r12,opal_tracepoint_refcount@toc(r2);           \
      	std     r12,32(r1);                                     \
      
      The store was originally used to save the current tracepoint status
      so the entry and the exit tracepoints were always balanced. In the
      end I just created a separate path when tracepoints are enabled.
      
      The offset on the stack used for this store is not valid for ABIv2
      and it causes strange issues. I noticed it because OPAL console input
      was broken.
      
      Fixes: c49f6353 ("powernv: Add OPAL tracepoints")
      Cc: <stable@vger.kernel.org> # v3.17+
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      bfe5fda8
  9. 29 December 2014 (3 commits)
    • Revert "powerpc: Secondary CPUs must set cpu_callin_map after setting active and online" · 1be6f10f
      Committed by Michael Ellerman
      This reverts commit 7c5c92ed.
      
      Although this did fix the bug it was aimed at, it also broke secondary
      startup on platforms that use give/take_timebase(). Unfortunately we
      didn't detect that while it was in next.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      1be6f10f
    • powerpc/kdump: Ignore failure in enabling big endian exception during crash · c1caae3d
      Committed by Hari Bathini
      In the LE kernel, we currently have a hack for kexec that resets the exception
      endian before starting a new kernel, as the kernel that is loaded could be
      either a big endian or a little endian kernel. In the kdump case, resetting
      the exception endian fails when one or more cpus are disabled. But we can
      ignore the failure and still go ahead, as in most cases the crashkernel will
      be of the same endianness as the primary kernel, and resetting the endianness
      is not even needed in those cases. This patch adds a new inline function to
      say whether this is the kdump path. This function is used at places where
      such a check is needed.
      Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
      [mpe: Rename to kdump_in_progress(), use bool, and edit comment]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      c1caae3d
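      The helper is tiny; a sketch of its shape on powerpc (crashing_cpu stays -1
      until a crash kernel launch begins; a non-kexec stub would just return
      false):

          static inline bool kdump_in_progress(void)
          {
                  return crashing_cpu >= 0;
          }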
    • powerpc: Wire up sys_execveat() syscall · 1e5d0fdb
      Committed by Pranith Kumar
      Wire up sys_execveat(). This passes the selftests for the system call.
      
      Check success of execveat(3, '../execveat', 0)... [OK]
      Check success of execveat(5, 'execveat', 0)... [OK]
      Check success of execveat(6, 'execveat', 0)... [OK]
      Check success of execveat(-100, '/home/pranith/linux/...ftests/exec/execveat', 0)... [OK]
      Check success of execveat(99, '/home/pranith/linux/...ftests/exec/execveat', 0)... [OK]
      Check success of execveat(8, '', 4096)... [OK]
      Check success of execveat(17, '', 4096)... [OK]
      Check success of execveat(9, '', 4096)... [OK]
      Check success of execveat(14, '', 4096)... [OK]
      Check success of execveat(14, '', 4096)... [OK]
      Check success of execveat(15, '', 4096)... [OK]
      Check failure of execveat(8, '', 0) with ENOENT... [OK]
      Check failure of execveat(8, '(null)', 4096) with EFAULT... [OK]
      Check success of execveat(5, 'execveat.symlink', 0)... [OK]
      Check success of execveat(6, 'execveat.symlink', 0)... [OK]
      Check success of execveat(-100, '/home/pranith/linux/...xec/execveat.symlink', 0)... [OK]
      Check success of execveat(10, '', 4096)... [OK]
      Check success of execveat(10, '', 4352)... [OK]
      Check failure of execveat(5, 'execveat.symlink', 256) with ELOOP... [OK]
      Check failure of execveat(6, 'execveat.symlink', 256) with ELOOP... [OK]
      Check failure of execveat(-100, '/home/pranith/linux/tools/testing/selftests/exec/execveat.symlink', 256) with ELOOP... [OK]
      Check success of execveat(3, '../script', 0)... [OK]
      Check success of execveat(5, 'script', 0)... [OK]
      Check success of execveat(6, 'script', 0)... [OK]
      Check success of execveat(-100, '/home/pranith/linux/...elftests/exec/script', 0)... [OK]
      Check success of execveat(13, '', 4096)... [OK]
      Check success of execveat(13, '', 4352)... [OK]
      Check failure of execveat(18, '', 4096) with ENOENT... [OK]
      Check failure of execveat(7, 'script', 0) with ENOENT... [OK]
      Check success of execveat(16, '', 4096)... [OK]
      Check success of execveat(16, '', 4096)... [OK]
      Check success of execveat(4, '../script', 0)... [OK]
      Check success of execveat(4, 'script', 0)... [OK]
      Check success of execveat(4, '../script', 0)... [OK]
      Check failure of execveat(4, 'script', 0) with ENOENT... [OK]
      Check failure of execveat(5, 'execveat', 65535) with EINVAL... [OK]
      Check failure of execveat(5, 'no-such-file', 0) with ENOENT... [OK]
      Check failure of execveat(6, 'no-such-file', 0) with ENOENT... [OK]
      Check failure of execveat(-100, 'no-such-file', 0) with ENOENT... [OK]
      Check failure of execveat(5, '', 4096) with EACCES... [OK]
      Check failure of execveat(5, 'Makefile', 0) with EACCES... [OK]
      Check failure of execveat(11, '', 4096) with EACCES... [OK]
      Check failure of execveat(12, '', 4096) with EACCES... [OK]
      Check failure of execveat(99, '', 4096) with EBADF... [OK]
      Check failure of execveat(99, 'execveat', 0) with EBADF... [OK]
      Check failure of execveat(8, 'execveat', 0) with ENOTDIR... [OK]
      Invoke copy of 'execveat' via filename of length 4093:
      Check success of execveat(19, '', 4096)... [OK]
      Check success of execveat(5, 'xxxxxxxxxxxxxxxxxxxx...yyyyyyyyyyyyyyyyyyyy', 0)... [OK]
      Invoke copy of 'script' via filename of length 4093:
      Check success of execveat(20, '', 4096)... [OK]
      /bin/sh: 0: Can't open /dev/fd/5/xxxxxxx(... a long line of x's and y's, 0)... [OK]
      Check success of execveat(5, 'xxxxxxxxxxxxxxxxxxxx...yyyyyyyyyyyyyyyyyyyy', 0)... [OK]
      
      Tested on a 32-bit powerpc system.
      Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      1e5d0fdb
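      Wiring a syscall on powerpc touches the usual spots; a hedged sketch (the
      exact table macro and the number shown are assumptions; check
      arch/powerpc/include/uapi/asm/unistd.h for what was actually assigned):

          /* arch/powerpc/include/asm/systbl.h: add the table entry
           * (or the compat-aware variant, per the table's conventions)
           */
          SYSCALL_SPU(execveat)

          /* arch/powerpc/include/asm/unistd.h: bump the syscall count */
          #define NR_syscalls             363

          /* arch/powerpc/include/uapi/asm/unistd.h: assign the number */
          #define __NR_execveat           362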
  10. 20 December 2014 (1 commit)
  11. 19 December 2014 (1 commit)
  12. 18 December 2014 (3 commits)
  13. 17 December 2014 (9 commits)
    • KVM: PPC: Book3S HV: Improve H_CONFER implementation · 90fd09f8
      Committed by Sam Bobroff
      Currently the H_CONFER hcall is implemented in kernel virtual mode,
      meaning that whenever a guest thread does an H_CONFER, all the threads
      in that virtual core have to exit the guest.  This is bad for
      performance because it interrupts the other threads even if they
      are doing useful work.
      
      The H_CONFER hcall is called by a guest VCPU when it is spinning on a
      spinlock and it detects that the spinlock is held by a guest VCPU that
      is currently not running on a physical CPU.  The idea is to give this
      VCPU's time slice to the holder VCPU so that it can make progress
      towards releasing the lock.
      
      To avoid having the other threads exit the guest unnecessarily,
      we add a real-mode implementation of H_CONFER that checks whether
      the other threads are doing anything.  If all the other threads
      are idle (i.e. in H_CEDE) or trying to confer (i.e. in H_CONFER),
      it returns H_TOO_HARD which causes a guest exit and allows the
      H_CONFER to be handled in virtual mode.
      
      Otherwise it spins for a short time (up to 10 microseconds) to give
      other threads the chance to observe that this thread is trying to
      confer.  The spin loop also terminates when any thread exits the guest
      or when all other threads are idle or trying to confer.  If the
      timeout is reached, the H_CONFER returns H_SUCCESS.  In this case the
      guest VCPU will recheck the spinlock word and most likely call
      H_CONFER again.
      
      This also improves the implementation of the H_CONFER virtual mode
      handler.  If the VCPU is part of a virtual core (vcore) which is
      runnable, there will be a 'runner' VCPU which has taken responsibility
      for running the vcore.  In this case we yield to the runner VCPU
      rather than the target VCPU.
      
      We also introduce a check on the target VCPU's yield count: if it
      differs from the yield count passed to H_CONFER, the target VCPU
      has run since H_CONFER was called and may have already released
      the lock.  This check is required by PAPR.
      Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      90fd09f8
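      A hedged C sketch of the real-mode decision described above; the two
      predicate helpers are hypothetical stand-ins for the per-thread state checks
      in kvm_hv:

          static int h_confer_rm_sketch(struct kvmppc_vcore *vc)
          {
                  /* ~10 microseconds in timebase ticks */
                  u64 timeout = get_tb() + 10 * tb_ticks_per_usec;

                  if (all_others_idle_or_conferring(vc))  /* hypothetical */
                          return H_TOO_HARD;  /* exit; handle in virtual mode */

                  while (get_tb() < timeout) {
                          if (any_thread_exited(vc) ||    /* hypothetical */
                              all_others_idle_or_conferring(vc))
                                  break;
                          cpu_relax();
                  }
                  /* Guest will recheck the spinlock and may confer again. */
                  return H_SUCCESS;
          }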
    • KVM: PPC: Book3S HV: Fix endianness of instruction obtained from HEIR register · 4a157d61
      Committed by Paul Mackerras
      There are two ways in which a guest instruction can be obtained from
      the guest in the guest exit code in book3s_hv_rmhandlers.S.  If the
      exit was caused by a Hypervisor Emulation interrupt (i.e. an illegal
      instruction), the offending instruction is in the HEIR register
      (Hypervisor Emulation Instruction Register).  If the exit was caused
      by a load or store to an emulated MMIO device, we load the instruction
      from the guest by turning data relocation on and loading the instruction
      with an lwz instruction.
      
      Unfortunately, in the case where the guest has opposite endianness to
      the host, these two methods give results of different endianness, but
      both get put into vcpu->arch.last_inst.  The HEIR value has been loaded
      using guest endianness, whereas the lwz will load the instruction using
      host endianness.  The rest of the code that uses vcpu->arch.last_inst
      assumes it was loaded using host endianness.
      
      To fix this, we define a new vcpu field to store the HEIR value.  Then,
      in kvmppc_handle_exit_hv(), we transfer the value from this new field to
      vcpu->arch.last_inst, doing a byte-swap if the guest and host endianness
      differ.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      4a157d61
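      The fix-up in kvmppc_handle_exit_hv() amounts to a conditional byte swap
      when moving the saved HEIR value into last_inst; a sketch (emul_inst is an
      assumed name for the new vcpu field, which the message does not name):

          /* HEIR was captured in guest byte order; last_inst is expected
           * in host byte order everywhere else.
           */
          vcpu->arch.last_inst = kvmppc_need_byteswap(vcpu) ?
                                  swab32(vcpu->arch.emul_inst) :
                                  vcpu->arch.emul_inst;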
    • KVM: PPC: Book3S HV: Remove code for PPC970 processors · c17b98cf
      Committed by Paul Mackerras
      This removes the code that was added to enable HV KVM to work
      on PPC970 processors.  The PPC970 is an old CPU that doesn't
      support virtualizing guest memory.  Removing PPC970 support also
      lets us remove the code for allocating and managing contiguous
      real-mode areas, the code for the !kvm->arch.using_mmu_notifiers
      case, the code for pinning pages of guest memory when first
      accessed and keeping track of which pages have been pinned, and
      the code for handling H_ENTER hypercalls in virtual mode.
      
      Book3S HV KVM is now supported only on POWER7 and POWER8 processors.
      The KVM_CAP_PPC_RMA capability now always returns 0.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      c17b98cf
    • KVM: PPC: Book3S HV: Tracepoints for KVM HV guest interactions · 3c78f78a
      Committed by Suresh E. Warrier
      This patch adds trace points in the guest entry and exit code and also
      for exceptions handled by the host in kernel mode - hypercalls and page
      faults. The new events are added to /sys/kernel/debug/tracing/events
      under a new subsystem called kvm_hv.
      Acked-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      3c78f78a
    • KVM: PPC: Book3S HV: Simplify locking around stolen time calculations · 2711e248
      Committed by Paul Mackerras
      Currently the calculations of stolen time for PPC Book3S HV guests
      uses fields in both the vcpu struct and the kvmppc_vcore struct.  The
      fields in the kvmppc_vcore struct are protected by the
      vcpu->arch.tbacct_lock of the vcpu that has taken responsibility for
      running the virtual core.  This works correctly but confuses lockdep,
      because it sees that the code takes the tbacct_lock for a vcpu in
      kvmppc_remove_runnable() and then takes another vcpu's tbacct_lock in
      vcore_stolen_time(), and it thinks there is a possibility of deadlock,
      causing it to print reports like this:
      
      =============================================
      [ INFO: possible recursive locking detected ]
      3.18.0-rc7-kvm-00016-g8db4bc6 #89 Not tainted
      ---------------------------------------------
      qemu-system-ppc/6188 is trying to acquire lock:
       (&(&vcpu->arch.tbacct_lock)->rlock){......}, at: [<d00000000ecb1fe8>] .vcore_stolen_time+0x48/0xd0 [kvm_hv]
      
      but task is already holding lock:
       (&(&vcpu->arch.tbacct_lock)->rlock){......}, at: [<d00000000ecb25a0>] .kvmppc_remove_runnable.part.3+0x30/0xd0 [kvm_hv]
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock(&(&vcpu->arch.tbacct_lock)->rlock);
        lock(&(&vcpu->arch.tbacct_lock)->rlock);
      
       *** DEADLOCK ***
      
       May be due to missing lock nesting notation
      
      3 locks held by qemu-system-ppc/6188:
       #0:  (&vcpu->mutex){+.+.+.}, at: [<d00000000eb93f98>] .vcpu_load+0x28/0xe0 [kvm]
       #1:  (&(&vcore->lock)->rlock){+.+...}, at: [<d00000000ecb41b0>] .kvmppc_vcpu_run_hv+0x530/0x1530 [kvm_hv]
       #2:  (&(&vcpu->arch.tbacct_lock)->rlock){......}, at: [<d00000000ecb25a0>] .kvmppc_remove_runnable.part.3+0x30/0xd0 [kvm_hv]
      
      stack backtrace:
      CPU: 40 PID: 6188 Comm: qemu-system-ppc Not tainted 3.18.0-rc7-kvm-00016-g8db4bc6 #89
      Call Trace:
      [c000000b2754f3f0] [c000000000b31b6c] .dump_stack+0x88/0xb4 (unreliable)
      [c000000b2754f470] [c0000000000faeb8] .__lock_acquire+0x1878/0x2190
      [c000000b2754f600] [c0000000000fbf0c] .lock_acquire+0xcc/0x1a0
      [c000000b2754f6d0] [c000000000b2954c] ._raw_spin_lock_irq+0x4c/0x70
      [c000000b2754f760] [d00000000ecb1fe8] .vcore_stolen_time+0x48/0xd0 [kvm_hv]
      [c000000b2754f7f0] [d00000000ecb25b4] .kvmppc_remove_runnable.part.3+0x44/0xd0 [kvm_hv]
      [c000000b2754f880] [d00000000ecb43ec] .kvmppc_vcpu_run_hv+0x76c/0x1530 [kvm_hv]
      [c000000b2754f9f0] [d00000000eb9f46c] .kvmppc_vcpu_run+0x2c/0x40 [kvm]
      [c000000b2754fa60] [d00000000eb9c9a4] .kvm_arch_vcpu_ioctl_run+0x54/0x160 [kvm]
      [c000000b2754faf0] [d00000000eb94538] .kvm_vcpu_ioctl+0x498/0x760 [kvm]
      [c000000b2754fcb0] [c000000000267eb4] .do_vfs_ioctl+0x444/0x770
      [c000000b2754fd90] [c0000000002682a4] .SyS_ioctl+0xc4/0xe0
      [c000000b2754fe30] [c0000000000092e4] syscall_exit+0x0/0x98
      
      In order to make the locking easier to analyse, we change the code to
      use a spinlock in the kvmppc_vcore struct to protect the stolen_tb and
      preempt_tb fields.  This lock needs to be an irq-safe lock since it is
      used in the kvmppc_core_vcpu_load_hv() and kvmppc_core_vcpu_put_hv()
      functions, which are called with the scheduler rq lock held, which is
      an irq-safe lock.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      2711e248
    • arch: powerpc: kvm: book3s_paired_singles.c: Remove unused function · a0499cf7
      Committed by Rickard Strandqvist
      Remove the function inst_set_field() that is not used anywhere.
      
      This was partially found by using a static code analysis program called cppcheck.
      Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      a0499cf7
    • arch: powerpc: kvm: book3s_pr.c: Remove unused function · 6178839b
      Committed by Rickard Strandqvist
      Remove the function get_fpr_index() that is not used anywhere.
      
      This was partially found by using a static code analysis program called cppcheck.
      Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      6178839b
    • arch: powerpc: kvm: book3s.c: Remove some unused functions · 54ca162a
      Committed by Rickard Strandqvist
      Removes some functions that are not used anywhere:
      kvmppc_core_load_guest_debugstate() kvmppc_core_load_host_debugstate()
      
      This was partially found by using a static code analysis program called cppcheck.
      Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      54ca162a
    • arch: powerpc: kvm: book3s_32_mmu.c: Remove unused function · 24aaaf22
      Committed by Rickard Strandqvist
      Remove the function sr_nx() that is not used anywhere.
      
      This was partially found by using a static code analysis program called cppcheck.
      Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      24aaaf22
  14. 15 December 2014 (13 commits)
    • KVM: PPC: Book3S HV: Check wait conditions before sleeping in kvmppc_vcore_blocked · 1bc5d59c
      Committed by Suresh E. Warrier
      The kvmppc_vcore_blocked() code does not check for the wait condition
      after putting the process on the wait queue. This means that it is
      possible for an external interrupt to become pending, but the vcpu to
      remain asleep until the next decrementer interrupt.  The fix is to
      make one last check for pending exceptions and ceded state before
      calling schedule().
      Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      1bc5d59c
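      This is the classic lost-wakeup pattern; the shape of the fix, sketched with
      a hypothetical condition helper:

          DEFINE_WAIT(wait);

          /* Re-check the wait condition *after* joining the wait queue, so
           * an interrupt that became pending in the window is not lost.
           */
          prepare_to_wait(&vc->wq, &wait, TASK_INTERRUPTIBLE);
          if (!vcpu_should_wake(vcpu))    /* hypothetical: pending-exception/ceded check */
                  schedule();
          finish_wait(&vc->wq, &wait);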
    • KVM: PPC: Book3S HV: ptes are big endian · ffada016
      Committed by Cédric Le Goater
      When being restored from qemu, the kvm_get_htab_header structures are in
      native endian, but the ptes are big endian.
      
      This patch fixes restore on a KVM LE host. Qemu also needs a fix for
      this:
      
           http://lists.nongnu.org/archive/html/qemu-ppc/2014-11/msg00008.html
      Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      ffada016
    • KVM: PPC: Book3S HV: Fix inaccuracies in ICP emulation for H_IPI · 5b88cda6
      Committed by Suresh E. Warrier
      This fixes some inaccuracies in the state machine for the virtualized
      ICP when implementing the H_IPI hcall (Set_MFRR and related states):
      
      1. The old code wipes out any pending interrupts when the new MFRR is
         more favored than the CPPR but less favored than a pending
         interrupt (by always modifying xisr and the pending_pri). This can
         cause us to lose a pending external interrupt.
      
         The correct code here is to only modify the pending_pri and xisr in
         the ICP if the MFRR is equal to or more favored than the current
         pending pri (since in this case, it is guaranteed that there
         cannot be a pending external interrupt). The code changes are
         required in both kvmppc_rm_h_ipi and kvmppc_h_ipi.
      
      2. Again, in both kvmppc_rm_h_ipi and kvmppc_h_ipi, there is a check
         for whether MFRR is being made less favored AND further if the new MFRR
         is also less favored than the current CPPR, we check for any
         resends pending in the ICP. These checks look like they are
         designed to cover the case where if the MFRR is being made less
         favored, we opportunistically trigger a resend of any interrupts
         that had been previously rejected. Although this is not a state
         described by PAPR, this is an action we actually need to do
         especially if the CPPR is already at 0xFF.  Because in this case,
         the resend bit will stay on until another ICP state change which
         may be a long time coming and the interrupt stays pending until
         then. The current code which checks for MFRR < CPPR is broken when
         CPPR is 0xFF since it will not get triggered in that case.
      
         Ideally, we would want to do a resend only if
      
         	prio(pending_interrupt) < mfrr && prio(pending_interrupt) < cppr
      
         where pending interrupt is the one that was rejected. But we don't
         have the priority of the pending interrupt state saved, so we
         simply trigger a resend whenever the MFRR is made less favored.
      
      3. In kvmppc_rm_h_ipi, where we save state to pass resends to the
         virtual mode, we also need to save the ICP whose need_resend we
         reset since this does not need to be my ICP (vcpu->arch.icp) as is
         incorrectly assumed by the current code. A new field rm_resend_icp
         is added to the kvmppc_icp structure for this purpose.
      Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      5b88cda6
    • KVM: PPC: Book3S HV: Fix KSM memory corruption · b4a83900
      Committed by Paul Mackerras
      Testing with KSM active in the host showed occasional corruption of
      guest memory.  Typically a page that should have contained zeroes
      would contain values that look like the contents of a user process
      stack (values such as 0x0000_3fff_xxxx_xxx).
      
      Code inspection in kvmppc_h_protect revealed that there was a race
      condition with the possibility of granting write access to a page
      which is read-only in the host page tables.  The code attempts to keep
      the host mapping read-only if the host userspace PTE is read-only, but
      if that PTE had been temporarily made invalid for any reason, the
      read-only check would not trigger and the host HPTE could end up
      read-write.  Examination of the guest HPT in the failure situation
      revealed that there were indeed shared pages which should have been
      read-only that were mapped read-write.
      
      To close this race, we don't let a page go from being read-only to
      being read-write, as far as the real HPTE mapping the page is
      concerned (the guest view can go to read-write, but the actual mapping
      stays read-only).  When the guest tries to write to the page, we take
      an HDSI and let kvmppc_book3s_hv_page_fault take care of providing a
      writable HPTE for the page.
      
      This eliminates the occasional corruption of shared pages
      that was previously seen with KSM active.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      b4a83900
    • KVM: PPC: Book3S HV: Fix an issue where guest is paused on receiving HMI · dee6f24c
      Committed by Mahesh Salgaonkar
      When we get an HMI (hypervisor maintenance interrupt) while in a
      guest, we see that the guest enters a paused state.  The reason is that
      kvmppc_handle_exit_hv falls through the default path and returns to the
      host instead of resuming the guest, which causes the guest to enter the
      paused state.  HMI is a hypervisor-only interrupt and it is safe to
      resume the guest since the host has handled it already.  This patch
      adds a switch case to resume the guest.
      
      Without this patch we see the guest entering the paused state with the
      following console messages:
      
      [ 3003.329351] Severe Hypervisor Maintenance interrupt [Recovered]
      [ 3003.329356]  Error detail: Timer facility experienced an error
      [ 3003.329359] 	HMER: 0840000000000000
      [ 3003.329360] 	TFMR: 4a12000980a84000
      [ 3003.329366] vcpu c0000007c35094c0 (40):
      [ 3003.329368] pc  = c0000000000c2ba0  msr = 8000000000009032  trap = e60
      [ 3003.329370] r 0 = c00000000021ddc0  r16 = 0000000000000046
      [ 3003.329372] r 1 = c00000007a02bbd0  r17 = 00003ffff27d5d98
      [ 3003.329375] r 2 = c0000000010980b8  r18 = 00001fffffc9a0b0
      [ 3003.329377] r 3 = c00000000142d6b8  r19 = c00000000142d6b8
      [ 3003.329379] r 4 = 0000000000000002  r20 = 0000000000000000
      [ 3003.329381] r 5 = c00000000524a110  r21 = 0000000000000000
      [ 3003.329383] r 6 = 0000000000000001  r22 = 0000000000000000
      [ 3003.329386] r 7 = 0000000000000000  r23 = c00000000524a110
      [ 3003.329388] r 8 = 0000000000000000  r24 = 0000000000000001
      [ 3003.329391] r 9 = 0000000000000001  r25 = c00000007c31da38
      [ 3003.329393] r10 = c0000000014280b8  r26 = 0000000000000002
      [ 3003.329395] r11 = 746f6f6c2f68656c  r27 = c00000000524a110
      [ 3003.329397] r12 = 0000000028004484  r28 = c00000007c31da38
      [ 3003.329399] r13 = c00000000fe01400  r29 = 0000000000000002
      [ 3003.329401] r14 = 0000000000000046  r30 = c000000003011e00
      [ 3003.329403] r15 = ffffffffffffffba  r31 = 0000000000000002
      [ 3003.329404] ctr = c00000000041a670  lr  = c000000000272520
      [ 3003.329405] srr0 = c00000000007e8d8 srr1 = 9000000000001002
      [ 3003.329406] sprg0 = 0000000000000000 sprg1 = c00000000fe01400
      [ 3003.329407] sprg2 = c00000000fe01400 sprg3 = 0000000000000005
      [ 3003.329408] cr = 48004482  xer = 2000000000000000  dsisr = 42000000
      [ 3003.329409] dar = 0000010015020048
      [ 3003.329410] fault dar = 0000010015020048 dsisr = 42000000
      [ 3003.329411] SLB (8 entries):
      [ 3003.329412]   ESID = c000000008000000 VSID = 40016e7779000510
      [ 3003.329413]   ESID = d000000008000001 VSID = 400142add1000510
      [ 3003.329414]   ESID = f000000008000004 VSID = 4000eb1a81000510
      [ 3003.329415]   ESID = 00001f000800000b VSID = 40004fda0a000d90
      [ 3003.329416]   ESID = 00003f000800000c VSID = 400039f536000d90
      [ 3003.329417]   ESID = 000000001800000d VSID = 0001251b35150d90
      [ 3003.329417]   ESID = 000001000800000e VSID = 4001e46090000d90
      [ 3003.329418]   ESID = d000080008000019 VSID = 40013d349c000400
      [ 3003.329419] lpcr = c048800001847001 sdr1 = 0000001b19000006 last_inst = ffffffff
      [ 3003.329421] trap=0xe60 | pc=0xc0000000000c2ba0 | msr=0x8000000000009032
      [ 3003.329524] Severe Hypervisor Maintenance interrupt [Recovered]
      [ 3003.329526]  Error detail: Timer facility experienced an error
      [ 3003.329527] 	HMER: 0840000000000000
      [ 3003.329527] 	TFMR: 4a12000980a94000
      [ 3006.359786] Severe Hypervisor Maintenance interrupt [Recovered]
      [ 3006.359792]  Error detail: Timer facility experienced an error
      [ 3006.359795] 	HMER: 0840000000000000
      [ 3006.359797] 	TFMR: 4a12000980a84000
      
       Id    Name                           State
      ----------------------------------------------------
       2     guest2                         running
       3     guest3                         paused
       4     guest4                         running
      Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      dee6f24c
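      The fix is a short case in kvmppc_handle_exit_hv()'s switch; sketched:

          case BOOK3S_INTERRUPT_HMI:
                  /* The host has already recovered from the HMI; it is
                   * safe to simply resume the guest.
                   */
                  r = RESUME_GUEST;
                  break;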
    • KVM: PPC: Book3S HV: Fix computation of tlbie operand · d506735b
      Committed by Paul Mackerras
      The B (segment size) field in the RB operand for the tlbie
      instruction is two bits, which we get from the top two bits of
      the first doubleword of the HPT entry to be invalidated.  These
      bits go in bits 8 and 9 of the RB operand (bits 54 and 55 in IBM
      bit numbering).
      
      The compute_tlbie_rb() function gets these bits as v >> (62 - 8),
      which is not correct as it will bring in the top 10 bits, not
      just the top two.  These extra bits could corrupt the AP, AVAL
      and L fields in the RB value.  To fix this we shift right 62 bits
      and then shift left 8 bits, so we only get the two bits of the
      B field.
      
      The first doubleword of the HPT entry is under the control of the
      guest kernel.  In fact, Linux guests will always put zeroes in bits
      54 -- 61 (IBM bits 2 -- 9), but we should not rely on guests doing
      this.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      d506735b
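      The arithmetic, exactly as the message describes it:

          /* Buggy: v >> (62 - 8) keeps the top 10 bits of v, which can
           * corrupt the AP, AVAL and L fields already placed in rb.
           */
          rb |= v >> (62 - 8);

          /* Fixed: shift right 62 to isolate the 2-bit B field, then
           * shift left 8 to land it in bits 8-9 of the RB operand.
           */
          rb |= (v >> 62) << 8;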
    • KVM: PPC: Book3S HV: Add missing HPTE unlock · f6fb9e84
      Committed by Aneesh Kumar K.V
      In kvm_test_clear_dirty(), if we find an invalid HPTE we move on to the
      next HPTE without unlocking the invalid one.  In fact we should never
      find an invalid and unlocked HPTE in the rmap chain, but for robustness
      we should unlock it.  This adds the missing unlock.
      Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      f6fb9e84
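      The shape of the fix, sketched (HPTE_V_HVLOCK is the lock bit in the first
      doubleword of the HPTE; hptep points at that big-endian doubleword pair):

          if (!(be64_to_cpu(hptep[0]) & HPTE_V_VALID)) {
                  /* Unexpected invalid HPTE in the rmap chain: unlock
                   * it before moving on, for robustness.
                   */
                  hptep[0] &= ~cpu_to_be64(HPTE_V_HVLOCK);
                  continue;
          }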
    • KVM: PPC: BookE: Improve irq inject tracepoint · b6b61257
      Committed by Alexander Graf
      When injecting an IRQ, we only document which IRQ priority (which translates
      to IRQ type) gets injected. However, when reading traces you don't necessarily
      have all the numbers in your head to know which IRQ really is meant.
      
      This patch converts the IRQ number field to a symbolic name that is in sync
      with the respective define. That way it's a lot easier for readers to figure
      out what interrupt gets injected.
      Signed-off-by: Alexander Graf <agraf@suse.de>
      b6b61257
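      The usual mechanism for this is __print_symbolic() in the tracepoint's
      TP_printk; a hedged sketch of the relevant fragment (the value/name table
      mirrors the BOOKE_IRQPRIO_* defines; names here are illustrative):

          #define kvm_trace_symbol_irqprio                 \
                  {BOOKE_IRQPRIO_DECREMENTER, "DEC"},      \
                  {BOOKE_IRQPRIO_EXTERNAL,    "EXT"}       /* ... */

          TP_printk("vcpu=%x prio=%s pending=%lx",
                    __entry->cpu_nr,
                    __print_symbolic(__entry->priority, kvm_trace_symbol_irqprio),
                    __entry->pending)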
    • powerpc/powernv: Expose OPAL firmware symbol map · c8742f85
      Committed by Benjamin Herrenschmidt
      Newer versions of OPAL will provide this, so let's expose it to user
      space so tools like perf can use it to properly decode samples in
      firmware space.
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      c8742f85
    • powernv/powerpc: Add winkle support for offline cpus · 77b54e9f
      Committed by Shreyas B. Prabhu
      Winkle is a deep idle state supported in power8 chips. A core enters
      winkle when all the threads of the core enter winkle. In this state
      the power supply to the entire chiplet, i.e. core, private L2 and private L3,
      is turned off. As a result it gives higher power savings compared to
      sleep.
      
      But entering winkle results in a total hypervisor state loss. Hence the
      hypervisor context has to be preserved before entering winkle and
      restored upon wake up.
      
      Power-on Reset Engine (PORE) is a dedicated engine which is responsible
      for powering on the chiplet during wake up. It can be programmed to
      restore the register contents of a few specific registers. This patch
      uses PORE to restore register state wherever possible and uses stack to
      save and restore rest of the necessary registers.
      
      With hypervisor state restore things fall under three categories-
      per-core state, per-subcore state and per-thread state. To manage this,
      extend the infrastructure introduced for sleep. Mainly we add a paca
      variable subcore_sibling_mask. Using this and the core_idle_state we can
      distinguish the first thread in the core and in the subcore.
      Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      77b54e9f
    • powernv/cpuidle: Redesign idle states management · 7cba160a
      Committed by Shreyas B. Prabhu
      Deep idle states like sleep and winkle are per core idle states. A core
      enters these states only when all the threads enter either the
      particular idle state or a deeper one. There are tasks like fastsleep
      hardware bug workaround and hypervisor core state save which have to be
      done only by the last thread of the core entering deep idle state and
      similarly tasks like timebase resync, hypervisor core register restore
      that have to be done only by the first thread waking up from these
      states.
      
      The current idle state management does not have a way to distinguish the
      first/last thread of the core waking/entering idle states. Tasks like
      timebase resync are done for all the threads. This is not only
      suboptimal, but can cause functionality issues when subcores and kvm are
      involved.
      
      This patch adds the necessary infrastructure to track idle states of
      threads in a per-core structure. It uses this info to perform tasks like
      fastsleep workaround and timebase resync only once per core.
      Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Originally-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: linux-pm@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      7cba160a
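      A hedged C-level sketch of the per-core bookkeeping this describes (the real
      code does this in the assembly idle entry/exit path, with a lock bit in the
      shared word; all names below are illustrative, not the kernel's):

          struct core_idle_state {                /* illustrative mirror */
                  unsigned long idle_bits;        /* one bit per thread */
          };

          /* Called with the core's idle-state lock held (elided). */
          static void thread_enters_deep_idle(struct core_idle_state *cs,
                                              int thread, unsigned long all_mask)
          {
                  cs->idle_bits |= 1UL << thread;
                  if (cs->idle_bits == all_mask)          /* last thread in */
                          apply_fastsleep_workaround();   /* hypothetical */
          }

          static void thread_wakes_from_deep_idle(struct core_idle_state *cs,
                                                  int thread, unsigned long all_mask)
          {
                  if (cs->idle_bits == all_mask)          /* first thread out */
                          timebase_resync();              /* hypothetical */
                  cs->idle_bits &= ~(1UL << thread);
          }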
    • powerpc/powernv: Enable Offline CPUs to enter deep idle states · 8eb8ac89
      Committed by Shreyas B. Prabhu
      The secondary threads should enter deep idle states so as to gain maximum
      powersavings when the entire core is offline. To do so the offline path
      must be made aware of the available deepest idle state. Hence probe the
      device tree for the possible idle states in powernv core code and
      expose the deepest idle state through flags.
      
      Since the device tree is probed by the cpuidle driver as well, move
      the parameters required to discover the idle states into an appropriate
      common place to both the driver and the powernv core code.
      
      Another point is that the fastsleep idle state may require workarounds in
      the kernel to function properly. This workaround is introduced in the
      subsequent patches. However, neither the cpuidle driver nor the hotplug
      path needs to be bothered about this workaround; it will be taken care of
      by the core powernv code.
      Originally-by: Srivatsa S. Bhat <srivatsa@mit.edu>
      Signed-off-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
      Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Reviewed-by: Paul Mackerras <paulus@samba.org>
      
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: linux-pm@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      8eb8ac89
    • powerpc/powernv: Switch off MMU before entering nap/sleep/rvwinkle mode · 8117ac6a
      Committed by Paul Mackerras
      Currently, when going idle, we set the flag indicating that we are in
      nap mode (paca->kvm_hstate.hwthread_state) and then execute the nap
      (or sleep or rvwinkle) instruction, all with the MMU on.  This is bad
      for two reasons: (a) the architecture specifies that those instructions
      must be executed with the MMU off, and in fact with only the SF, HV, ME
      and possibly RI bits set, and (b) this introduces a race, because as
      soon as we set the flag, another thread can switch the MMU to a guest
      context.  If the race is lost, this thread will typically start looping
      on relocation-on ISIs at 0xc...4400.
      
      This fixes it by setting the MSR as required by the architecture before
      setting the flag or executing the nap/sleep/rvwinkle instruction.
      
      Cc: stable@vger.kernel.org
      [ shreyas@linux.vnet.ibm.com: Edited to handle LE ]
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      8117ac6a
  15. 14 December 2014 (1 commit)