1. 13 9月, 2015 3 次提交
    • S
      perf/core: Drop PERF_EVENT_TXN · 8f3e5684
      Sukadev Bhattiprolu 提交于
      We currently use PERF_EVENT_TXN flag to determine if we are in the middle
      of a transaction. If in a transaction, we defer the schedulability checks
      from pmu->add() operation to the pmu->commit() operation.
      
      Now that we have "transaction types" (PERF_PMU_TXN_ADD, PERF_PMU_TXN_READ)
      we can use the type to determine if we are in a transaction and drop the
      PERF_EVENT_TXN flag.
      
      When PERF_EVENT_TXN is dropped, the cpuhw->group_flag on some architectures
      becomes unused, so drop that field as well.
      
      This is an extension of the Powerpc patch from Peter Zijlstra to s390,
      Sparc and x86 architectures.
      Signed-off-by: NSukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1441336073-22750-11-git-send-email-sukadev@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      8f3e5684
    • S
      powerpc, perf/powerpc/hv-24x7: Use PMU_TXN_READ interface · 88a48613
      Sukadev Bhattiprolu 提交于
      The 24x7 counters in Powerpc allow monitoring a large number of counters
      simultaneously. They also allow reading several counters in a single
      HCALL so we can get a more consistent snapshot of the system.
      
      Use the PMU's transaction interface to monitor and read several event
      counters at once. The idea is that users can group several 24x7 events
      into a single group of events. We use the following logic to submit
      the group of events to the PMU and read the values:
      
      	pmu->start_txn()		// Initialize before first event
      
      	for each event in group
      		pmu->read(event);	// Queue each event to be read
      
      	pmu->commit_txn()		// Read/update all queuedcounters
      
      The ->commit_txn() also updates the event counts in the respective
      perf_event objects.  The perf subsystem can then directly get the
      event counts from the perf_event and can avoid submitting a new
      ->read() request to the PMU.
      
      Thanks to input from Peter Zijlstra.
      Signed-off-by: NSukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NMichael Ellerman <mpe@ellerman.id.au>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1441336073-22750-10-git-send-email-sukadev@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      88a48613
    • S
      perf/core: Add a 'flags' parameter to the PMU transactional interfaces · fbbe0701
      Sukadev Bhattiprolu 提交于
      Currently, the PMU interface allows reading only one counter at a time.
      But some PMUs like the 24x7 counters in Power, support reading several
      counters at once. To leveage this functionality, extend the transaction
      interface to support a "transaction type".
      
      The first type, PERF_PMU_TXN_ADD, refers to the existing transactions,
      i.e. used to _schedule_ all the events on the PMU as a group. A second
      transaction type, PERF_PMU_TXN_READ, will be used in a follow-on patch,
      by the 24x7 counters to read several counters at once.
      
      Extend the transaction interfaces to the PMU to accept a 'txn_flags'
      parameter and use this parameter to ignore any transactions that are
      not of type PERF_PMU_TXN_ADD.
      
      Thanks to Peter Zijlstra for his input.
      Signed-off-by: NSukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      [peterz: s390 compile fix]
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NMichael Ellerman <mpe@ellerman.id.au>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1441336073-22750-3-git-send-email-sukadev@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      fbbe0701
  2. 11 9月, 2015 6 次提交
    • C
      dma-mapping: consolidate dma_set_mask · 452e06af
      Christoph Hellwig 提交于
      Almost everyone implements dma_set_mask the same way, although some time
      that's hidden in ->set_dma_mask methods.
      
      This patch consolidates those into a common implementation that either
      calls ->set_dma_mask if present or otherwise uses the default
      implementation.  Some architectures used to only call ->set_dma_mask
      after the initial checks, and those instance have been fixed to do the
      full work.  h8300 implemented dma_set_mask bogusly as a no-ops and has
      been fixed.
      
      Unfortunately some architectures overload unrelated semantics like changing
      the dma_ops into it so we still need to allow for an architecture override
      for now.
      
      [jcmvbkbc@gmail.com: fix xtensa]
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
      Signed-off-by: NMax Filippov <jcmvbkbc@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      452e06af
    • C
      dma-mapping: consolidate dma_supported · ee196371
      Christoph Hellwig 提交于
      Most architectures just call into ->dma_supported, but some also return 1
      if the method is not present, or 0 if no dma ops are present (although
      that should never happeb). Consolidate this more broad version into
      common code.
      
      Also fix h8300 which inorrectly always returned 0, which would have been
      a problem if it's dma_set_mask implementation wasn't a similarly buggy
      noop.
      
      As a few architectures have much more elaborate implementations, we
      still allow for arch overrides.
      
      [jcmvbkbc@gmail.com: fix xtensa]
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
      Signed-off-by: NMax Filippov <jcmvbkbc@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ee196371
    • C
      dma-mapping: cosolidate dma_mapping_error · efa21e43
      Christoph Hellwig 提交于
      Currently there are three valid implementations of dma_mapping_error:
      
       (1) call ->mapping_error
       (2) check for a hardcoded error code
       (3) always return 0
      
      This patch provides a common implementation that calls ->mapping_error
      if present, then checks for DMA_ERROR_CODE if defined or otherwise
      returns 0.
      
      [jcmvbkbc@gmail.com: fix xtensa]
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
      Signed-off-by: NMax Filippov <jcmvbkbc@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      efa21e43
    • C
      dma-mapping: consolidate dma_{alloc,free}_noncoherent · 1e893752
      Christoph Hellwig 提交于
      Most architectures do not support non-coherent allocations and either
      define dma_{alloc,free}_noncoherent to their coherent versions or stub
      them out.
      
      Openrisc uses dma_{alloc,free}_attrs to implement them, and only Mips
      implements them directly.
      
      This patch moves the Openrisc version to common code, and handles the
      DMA_ATTR_NON_CONSISTENT case in the mips dma_map_ops instance.
      
      Note that actual non-coherent allocations require a dma_cache_sync
      implementation, so if non-coherent allocations didn't work on
      an architecture before this patch they still won't work after it.
      
      [jcmvbkbc@gmail.com: fix xtensa]
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
      Signed-off-by: NMax Filippov <jcmvbkbc@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1e893752
    • C
      dma-mapping: consolidate dma_{alloc,free}_{attrs,coherent} · 6894258e
      Christoph Hellwig 提交于
      Since 2009 we have a nice asm-generic header implementing lots of DMA API
      functions for architectures using struct dma_map_ops, but unfortunately
      it's still missing a lot of APIs that all architectures still have to
      duplicate.
      
      This series consolidates the remaining functions, although we still need
      arch opt outs for two of them as a few architectures have very
      non-standard implementations.
      
      This patch (of 5):
      
      The coherent DMA allocator works the same over all architectures supporting
      dma_map operations.
      
      This patch consolidates them and converges the minor differences:
      
       - the debug_dma helpers are now called from all architectures, including
         those that were previously missing them
       - dma_alloc_from_coherent and dma_release_from_coherent are now always
         called from the generic alloc/free routines instead of the ops
         dma-mapping-common.h always includes dma-coherent.h to get the defintions
         for them, or the stubs if the architecture doesn't support this feature
       - checks for ->alloc / ->free presence are removed.  There is only one
         magic instead of dma_map_ops without them (mic_dma_ops) and that one
         is x86 only anyway.
      
      Besides that only x86 needs special treatment to replace a default devices
      if none is passed and tweak the gfp_flags.  An optional arch hook is provided
      for that.
      
      [linux@roeck-us.net: fix build]
      [jcmvbkbc@gmail.com: fix xtensa]
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
      Signed-off-by: NGuenter Roeck <linux@roeck-us.net>
      Signed-off-by: NMax Filippov <jcmvbkbc@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6894258e
    • D
      kexec: split kexec_load syscall from kexec core code · 2965faa5
      Dave Young 提交于
      There are two kexec load syscalls, kexec_load another and kexec_file_load.
       kexec_file_load has been splited as kernel/kexec_file.c.  In this patch I
      split kexec_load syscall code to kernel/kexec.c.
      
      And add a new kconfig option KEXEC_CORE, so we can disable kexec_load and
      use kexec_file_load only, or vice verse.
      
      The original requirement is from Ted Ts'o, he want kexec kernel signature
      being checked with CONFIG_KEXEC_VERIFY_SIG enabled.  But kexec-tools use
      kexec_load syscall can bypass the checking.
      
      Vivek Goyal proposed to create a common kconfig option so user can compile
      in only one syscall for loading kexec kernel.  KEXEC/KEXEC_FILE selects
      KEXEC_CORE so that old config files still work.
      
      Because there's general code need CONFIG_KEXEC_CORE, so I updated all the
      architecture Kconfig with a new option KEXEC_CORE, and let KEXEC selects
      KEXEC_CORE in arch Kconfig.  Also updated general kernel code with to
      kexec_load syscall.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NDave Young <dyoung@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Petr Tesarik <ptesarik@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Josh Boyer <jwboyer@fedoraproject.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2965faa5
  3. 09 9月, 2015 1 次提交
    • V
      mm: rename alloc_pages_exact_node() to __alloc_pages_node() · 96db800f
      Vlastimil Babka 提交于
      alloc_pages_exact_node() was introduced in commit 6484eb3e ("page
      allocator: do not check NUMA node ID when the caller knows the node is
      valid") as an optimized variant of alloc_pages_node(), that doesn't
      fallback to current node for nid == NUMA_NO_NODE.  Unfortunately the
      name of the function can easily suggest that the allocation is
      restricted to the given node and fails otherwise.  In truth, the node is
      only preferred, unless __GFP_THISNODE is passed among the gfp flags.
      
      The misleading name has lead to mistakes in the past, see for example
      commits 5265047a ("mm, thp: really limit transparent hugepage
      allocation to local node") and b360edb4 ("mm, mempolicy:
      migrate_to_node should only migrate to node").
      
      Another issue with the name is that there's a family of
      alloc_pages_exact*() functions where 'exact' means exact size (instead
      of page order), which leads to more confusion.
      
      To prevent further mistakes, this patch effectively renames
      alloc_pages_exact_node() to __alloc_pages_node() to better convey that
      it's an optimized variant of alloc_pages_node() not intended for general
      usage.  Both functions get described in comments.
      
      It has been also considered to really provide a convenience function for
      allocations restricted to a node, but the major opinion seems to be that
      __GFP_THISNODE already provides that functionality and we shouldn't
      duplicate the API needlessly.  The number of users would be small
      anyway.
      
      Existing callers of alloc_pages_exact_node() are simply converted to
      call __alloc_pages_node(), with the exception of sba_alloc_coherent()
      which open-codes the check for NUMA_NO_NODE, so it is converted to use
      alloc_pages_node() instead.  This means it no longer performs some
      VM_BUG_ON checks, and since the current check for nid in
      alloc_pages_node() uses a 'nid < 0' comparison (which includes
      NUMA_NO_NODE), it may hide wrong values which would be previously
      exposed.
      
      Both differences will be rectified by the next patch.
      
      To sum up, this patch makes no functional changes, except temporarily
      hiding potentially buggy callers.  Restricting the checks in
      alloc_pages_node() is left for the next patch which can in turn expose
      more existing buggy callers.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NRobin Holt <robinmholt@gmail.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NMichael Ellerman <mpe@ellerman.id.au>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Gleb Natapov <gleb@kernel.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Cliff Whickman <cpw@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      96db800f
  4. 04 9月, 2015 2 次提交
  5. 03 9月, 2015 3 次提交
    • T
      KVM: PPC: Book3S: Fix size of the PSPB register · f35f3a48
      Thomas Huth 提交于
      The size of the Problem State Priority Boost Register is only
      32 bits, but the kvm_vcpu_arch->pspb variable is declared as
      "ulong", ie. 64-bit. However, the assembler code accesses this
      variable with 32-bit accesses, and the KVM_REG_PPC_PSPB macro
      is defined with SIZE_U32, too, so that the current code is
      broken on big endian hosts: kvmppc_get_one_reg_hv() will only
      return zero for this register since it is using the wrong half
      of the pspb variable. Let's fix this problem by adjusting the
      size of the pspb field in the kvm_vcpu_arch structure.
      Signed-off-by: NThomas Huth <thuth@redhat.com>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      f35f3a48
    • G
      KVM: PPC: Book3S HV: Exit on H_DOORBELL if HOST_IPI is set · 06554d9f
      Gautham R. Shenoy 提交于
      The code that handles the case when we receive a H_DOORBELL interrupt
      has a comment which says "Hypervisor doorbell - exit only if host IPI
      flag set".  However, the current code does not actually check if the
      host IPI flag is set.  This is due to a comparison instruction that
      got missed.
      
      As a result, the current code performs the exit to host only
      if some sibling thread or a sibling sub-core is exiting to the
      host.  This implies that, an IPI sent to a sibling core in
      (subcores-per-core != 1) mode will be missed by the host unless the
      sibling core is on the exit path to the host.
      
      This patch adds the missing comparison operation which will ensure
      that when HOST_IPI flag is set, we unconditionally exit to the host.
      
      Fixes: 66feed61
      Cc: stable@vger.kernel.org # v4.1+
      Signed-off-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      06554d9f
    • G
      KVM: PPC: Book3S HV: Fix race in starting secondary threads · 7f235328
      Gautham R. Shenoy 提交于
      The current dynamic micro-threading code has a race due to which a
      secondary thread naps when it is supposed to be running a vcpu. As a
      side effect of this, on a guest exit, the primary thread in
      kvmppc_wait_for_nap() finds that this secondary thread hasn't cleared
      its vcore pointer. This results in "CPU X seems to be stuck!"
      warnings.
      
      The race is possible since the primary thread on exiting the guests
      only waits for all the secondaries to clear its vcore pointer. It
      subsequently expects the secondary threads to enter nap while it
      unsplits the core. A secondary thread which hasn't yet entered the nap
      will loop in kvm_no_guest until its vcore pointer and the do_nap flag
      are unset. Once the core has been unsplit, a new vcpu thread can grab
      the core and set the do_nap flag *before* setting the vcore pointers
      of the secondary. As a result, the secondary thread will now enter nap
      via kvm_unsplit_nap instead of running the guest vcpu.
      
      Fix this by setting the do_nap flag after setting the vcore pointer in
      the PACA of the secondary in kvmppc_run_core. Also, ensure that a
      secondary thread doesn't nap in kvm_unsplit_nap when the vcore pointer
      in its PACA struct is set.
      
      Fixes: b4deba5cSigned-off-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      7f235328
  6. 28 8月, 2015 5 次提交
    • G
      powerpc/eeh: Fix fenced PHB caused by eeh_slot_error_detail() · 25980013
      Gavin Shan 提交于
      The config space of some PCI devices can't be accessed when their
      PEs are in frozen state. Otherwise, fenced PHB might be seen.
      Those PEs are identified with flag EEH_PE_CFG_RESTRICTED, meaing
      EEH_PE_CFG_BLOCKED is set automatically when the PE is put to
      frozen state (EEH_PE_ISOLATED). eeh_slot_error_detail() restores
      PCI device BARs with eeh_pe_restore_bars(), which then calls
      eeh_ops->restore_config() to reinitialize the PCI device in
      (OPAL) firmware. eeh_ops->restore_config() produces PCI config
      access that causes fenced PHB. The problem was reported on below
      adapter:
      
         0001:01:00.0 0200: 14e4:168e (rev 10)
         0001:01:00.0 Ethernet controller: Broadcom Corporation \
                      NetXtreme II BCM57810 10 Gigabit Ethernet (rev 10)
      
      This fixes the issue by skipping eeh_pe_restore_bars() in
      eeh_slot_error_detail() when EEH_PE_CFG_BLOCKED is set for the PE.
      
      Fixes: b6541db1 ("powerpc/eeh: Block PCI config access upon frozen PE")
      Cc: stable@vger.kernel.org # v4.0+
      Reported-by: NManvanthara B. Puttashankar <mputtash@in.ibm.com>
      Signed-off-by: NGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      25980013
    • G
      powerpc/pseries: Cleanup on pci_dn_reconfig_notifier() · ea0f8acf
      Gavin Shan 提交于
      This applies cleanup on pci_dn_reconfig_notifier(), no functional
      changes:
      
         * Rename variable "pci" to "pdn" to indicate its purpose clearly.
         * The parent node can be released at any time. So it should be
           hold with of_get_parent() before accessing it.
         * The device node doesn't have to have parent node in theory.
           More check on this.
      Signed-off-by: NGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      ea0f8acf
    • G
      powerpc/pseries: Fix corrupted pdn list · 590c7567
      Gavin Shan 提交于
      Commit cca87d30 ("powerpc/pci: Refactor pci_dn") introduced pdn
      list for SRIOV VFs. It means the pdn is be put into the child list
      of its parent pdn when the pdn is created. When doing PCI hot
      unplugging on pSeries, the PCI device node as well as its pdn are
      released through procfs entry "powerpc/ofdt". Some one else grabs
      the memory chunk of the pdn and update it accordingly. At the same
      time, the pdn is still tracked in the child list of parent pdn. It
      leads to corrupted child list in the parent pdn.
      
      This fixes above issue by removing the pdn from the child list of
      its parent pdn when the device node is detached from the system.
      Note the pdn is free'd when the device node is released if the
      device node is dynamic one. Otherwise, the device node as well
      as the pdn won't be released.
      
      Fixes: cca87d30 ("powerpc/pci: Refactor pci_dn")
      Cc: stable@vger.kernel.org # 4.1+
      Reported-by: NSantwana Samantray <santwana.samantray@in.ibm.com>
      Signed-off-by: NGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      590c7567
    • D
      mm: ZONE_DEVICE for "device memory" · 033fbae9
      Dan Williams 提交于
      While pmem is usable as a block device or via DAX mappings to userspace
      there are several usage scenarios that can not target pmem due to its
      lack of struct page coverage. In preparation for "hot plugging" pmem
      into the vmemmap add ZONE_DEVICE as a new zone to tag these pages
      separately from the ones that are subject to standard page allocations.
      Importantly "device memory" can be removed at will by userspace
      unbinding the driver of the device.
      
      Having a separate zone prevents allocation and otherwise marks these
      pages that are distinct from typical uniform memory.  Device memory has
      different lifetime and performance characteristics than RAM.  However,
      since we have run out of ZONES_SHIFT bits this functionality currently
      depends on sacrificing ZONE_DMA.
      
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Jerome Glisse <j.glisse@gmail.com>
      [hch: various simplifications in the arch interface]
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      033fbae9
    • D
      dax: drop size parameter to ->direct_access() · cb389b9c
      Dan Williams 提交于
      None of the implementations currently use it.  The common
      bdev_direct_access() entry point handles all the size checks before
      calling ->direct_access().
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      cb389b9c
  7. 27 8月, 2015 2 次提交
  8. 26 8月, 2015 1 次提交
    • G
      powerpc/PCI: Disable MSI/MSI-X interrupts at PCI probe time in OF case · 4d9aac39
      Guilherme G. Piccoli 提交于
      Since commit 1851617c ("PCI/MSI: Disable MSI at enumeration even if
      kernel doesn't support MSI"), the setup of dev->msi_cap/msix_cap and the
      disable of MSI/MSI-X interrupts isn't being done at PCI probe time, as
      the logic responsible for this was moved in the aforementioned commit
      from pci_device_add() to pci_setup_device(). The latter function is not
      reachable on PowerPC pseries platform during Open Firmware PCI probing
      time.
      
      This exhibits as drivers not being able to enable MSI, eg:
      
        bnx2x 0000:01:00.0: no msix capability found
      
      This patch calls pci_msi_setup_pci_dev() explicitly to disable MSI/MSI-X
      during PCI probe time on pSeries platform.
      
      Fixes: 1851617c ("PCI/MSI: Disable MSI at enumeration even if kernel doesn't support MSI")
      [mpe: Flesh out change log and clarify comment]
      Signed-off-by: NGuilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      4d9aac39
  9. 22 8月, 2015 13 次提交
    • M
      powerpc/powernv: Fix mis-merge of OPAL support for LEDS driver · 5d53be7d
      Michael Ellerman 提交于
      When I merged the OPAL support for the powernv LEDS driver I missed a
      hunk.
      
      This is slightly modified from the original patch, as the original added
      code to opal-api.h which is not in the skiboot version, which is
      discouraged.
      
      Instead those values are moved into the driver, which is the only place
      they are used.
      
      Fixes: 8a8d9181 ("powerpc/powernv: Add OPAL interfaces for accessing and modifying system LED states")
      Reviewed-by: NVasant Hegde <hegdevasant@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      5d53be7d
    • S
      KVM: PPC: Book3S: correct width in XER handling · c63517c2
      Sam bobroff 提交于
      In 64 bit kernels, the Fixed Point Exception Register (XER) is a 64
      bit field (e.g. in kvm_regs and kvm_vcpu_arch) and in most places it is
      accessed as such.
      
      This patch corrects places where it is accessed as a 32 bit field by a
      64 bit kernel.  In some cases this is via a 32 bit load or store
      instruction which, depending on endianness, will cause either the
      lower or upper 32 bits to be missed.  In another case it is cast as a
      u32, causing the upper 32 bits to be cleared.
      
      This patch corrects those places by extending the access methods to
      64 bits.
      Signed-off-by: NSam Bobroff <sam.bobroff@au1.ibm.com>
      Reviewed-by: NLaurent Vivier <lvivier@redhat.com>
      Reviewed-by: NThomas Huth <thuth@redhat.com>
      Tested-by: NThomas Huth <thuth@redhat.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      c63517c2
    • P
      KVM: PPC: Book3S HV: Fix preempted vcore stolen time calculation · 563a1e93
      Paul Mackerras 提交于
      Whenever a vcore state is VCORE_PREEMPT we need to be counting stolen
      time for it.  This currently isn't the case when we have a vcore that
      no longer has any runnable threads in it but still has a runner task,
      so we do an explicit call to kvmppc_core_start_stolen() in that case.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      563a1e93
    • P
      KVM: PPC: Book3S HV: Fix preempted vcore list locking · 402813fe
      Paul Mackerras 提交于
      When a vcore gets preempted, we put it on the preempted vcore list for
      the current CPU.  The runner task then calls schedule() and comes back
      some time later and takes itself off the list.  We need to be careful
      to lock the list that it was put onto, which may not be the list for the
      current CPU since the runner task may have moved to another CPU.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      402813fe
    • P
      KVM: PPC: Book3S HV: Implement H_CLEAR_REF and H_CLEAR_MOD · cdeee518
      Paul Mackerras 提交于
      This adds implementations for the H_CLEAR_REF (test and clear reference
      bit) and H_CLEAR_MOD (test and clear changed bit) hypercalls.
      
      When clearing the reference or change bit in the guest view of the HPTE,
      we also have to clear it in the real HPTE so that we can detect future
      references or changes.  When we do so, we transfer the R or C bit value
      to the rmap entry for the underlying host page so that kvm_age_hva_hv(),
      kvm_test_age_hva_hv() and kvmppc_hv_get_dirty_log() know that the page
      has been referenced and/or changed.
      
      These hypercalls are not used by Linux guests.  These implementations
      have been tested using a FreeBSD guest.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      cdeee518
    • P
      KVM: PPC: Book3S HV: Fix bug in dirty page tracking · 08fe1e7b
      Paul Mackerras 提交于
      This fixes a bug in the tracking of pages that get modified by the
      guest.  If the guest creates a large-page HPTE, writes to memory
      somewhere within the large page, and then removes the HPTE, we only
      record the modified state for the first normal page within the large
      page, when in fact the guest might have modified some other normal
      page within the large page.
      
      To fix this we use some unused bits in the rmap entry to record the
      order (log base 2) of the size of the page that was modified, when
      removing an HPTE.  Then in kvm_test_clear_dirty_npages() we use that
      order to return the correct number of modified pages.
      
      The same thing could in principle happen when removing a HPTE at the
      host's request, i.e. when paging out a page, except that we never
      page out large pages, and the guest can only create large-page HPTEs
      if the guest RAM is backed by large pages.  However, we also fix
      this case for the sake of future-proofing.
      
      The reference bit is also subject to the same loss of information.  We
      don't make the same fix here for the reference bit because there isn't
      an interface for userspace to find out which pages the guest has
      referenced, whereas there is one for userspace to find out which pages
      the guest has modified.  Because of this loss of information, the
      kvm_age_hva_hv() and kvm_test_age_hva_hv() functions might incorrectly
      say that a page has not been referenced when it has, but that doesn't
      matter greatly because we never page or swap out large pages.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      08fe1e7b
    • P
      KVM: PPC: Book3S HV: Fix race in reading change bit when removing HPTE · 1e5bf454
      Paul Mackerras 提交于
      The reference (R) and change (C) bits in a HPT entry can be set by
      hardware at any time up until the HPTE is invalidated and the TLB
      invalidation sequence has completed.  This means that when removing
      a HPTE, we need to read the HPTE after the invalidation sequence has
      completed in order to obtain reliable values of R and C.  The code
      in kvmppc_do_h_remove() used to do this.  However, commit 6f22bd32
      ("KVM: PPC: Book3S HV: Make HTAB code LE host aware") removed the
      read after invalidation as a side effect of other changes.  This
      restores the read of the HPTE after invalidation.
      
      The user-visible effect of this bug would be that when migrating a
      guest, there is a small probability that a page modified by the guest
      and then unmapped by the guest might not get re-transmitted and thus
      the destination might end up with a stale copy of the page.
      
      Fixes: 6f22bd32Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      1e5bf454
    • P
      KVM: PPC: Book3S HV: Implement dynamic micro-threading on POWER8 · b4deba5c
      Paul Mackerras 提交于
      This builds on the ability to run more than one vcore on a physical
      core by using the micro-threading (split-core) modes of the POWER8
      chip.  Previously, only vcores from the same VM could be run together,
      and (on POWER8) only if they had just one thread per core.  With the
      ability to split the core on guest entry and unsplit it on guest exit,
      we can run up to 8 vcpu threads from up to 4 different VMs, and we can
      run multiple vcores with 2 or 4 vcpus per vcore.
      
      Dynamic micro-threading is only available if the static configuration
      of the cores is whole-core mode (unsplit), and only on POWER8.
      
      To manage this, we introduce a new kvm_split_mode struct which is
      shared across all of the subcores in the core, with a pointer in the
      paca on each thread.  In addition we extend the core_info struct to
      have information on each subcore.  When deciding whether to add a
      vcore to the set already on the core, we now have two possibilities:
      (a) piggyback the vcore onto an existing subcore, or (b) start a new
      subcore.
      
      Currently, when any vcpu needs to exit the guest and switch to host
      virtual mode, we interrupt all the threads in all subcores and switch
      the core back to whole-core mode.  It may be possible in future to
      allow some of the subcores to keep executing in the guest while
      subcore 0 switches to the host, but that is not implemented in this
      patch.
      
      This adds a module parameter called dynamic_mt_modes which controls
      which micro-threading (split-core) modes the code will consider, as a
      bitmap.  In other words, if it is 0, no micro-threading mode is
      considered; if it is 2, only 2-way micro-threading is considered; if
      it is 4, only 4-way, and if it is 6, both 2-way and 4-way
      micro-threading mode will be considered.  The default is 6.
      
      With this, we now have secondary threads which are the primary thread
      for their subcore and therefore need to do the MMU switch.  These
      threads will need to be started even if they have no vcpu to run, so
      we use the vcore pointer in the PACA rather than the vcpu pointer to
      trigger them.
      
      It is now possible for thread 0 to find that an exit has been
      requested before it gets to switch the subcore state to the guest.  In
      that case we haven't added the guest's timebase offset to the
      timebase, so we need to be careful not to subtract the offset in the
      guest exit path.  In fact we just skip the whole path that switches
      back to host context, since we haven't switched to the guest context.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      b4deba5c
    • P
      KVM: PPC: Book3S HV: Make use of unused threads when running guests · ec257165
      Paul Mackerras 提交于
      When running a virtual core of a guest that is configured with fewer
      threads per core than the physical cores have, the extra physical
      threads are currently unused.  This makes it possible to use them to
      run one or more other virtual cores from the same guest when certain
      conditions are met.  This applies on POWER7, and on POWER8 to guests
      with one thread per virtual core.  (It doesn't apply to POWER8 guests
      with multiple threads per vcore because they require a 1-1 virtual to
      physical thread mapping in order to be able to use msgsndp and the
      TIR.)
      
      The idea is that we maintain a list of preempted vcores for each
      physical cpu (i.e. each core, since the host runs single-threaded).
      Then, when a vcore is about to run, it checks to see if there are
      any vcores on the list for its physical cpu that could be
      piggybacked onto this vcore's execution.  If so, those additional
      vcores are put into state VCORE_PIGGYBACK and their runnable VCPU
      threads are started as well as the original vcore, which is called
      the master vcore.
      
      After the vcores have exited the guest, the extra ones are put back
      onto the preempted list if any of their VCPUs are still runnable and
      not idle.
      
      This means that vcpu->arch.ptid is no longer necessarily the same as
      the physical thread that the vcpu runs on.  In order to make it easier
      for code that wants to send an IPI to know which CPU to target, we
      now store that in a new field in struct vcpu_arch, called thread_cpu.
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Tested-by: NLaurent Vivier <lvivier@redhat.com>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      ec257165
    • T
      KVM: PPC: add missing pt_regs initialization · 845ac985
      Tudor Laurentiu 提交于
      On this switch branch the regs initialization
      doesn't happen so add it.
      This was found with the help of a static
      code analysis tool.
      Signed-off-by: NLaurentiu Tudor <Laurentiu.Tudor@freescale.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      845ac985
    • T
      KVM: PPC: Fix warnings from sparse · 5358a963
      Thomas Huth 提交于
      When compiling the KVM code for POWER with "make C=1", sparse
      complains about functions missing proper prototypes and a 64-bit
      constant missing the ULL prefix. Let's fix this by making the
      functions static or by including the proper header with the
      prototypes, and by appending a ULL prefix to the constant
      PPC_MPPE_ADDRESS_MASK.
      Signed-off-by: NThomas Huth <thuth@redhat.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      5358a963
    • T
      KVM: PPC: Remove PPC970 from KVM_BOOK3S_64_HV text in Kconfig · 129fd423
      Thomas Huth 提交于
      Since the PPC970 support has been removed from the kvm-hv kernel
      module recently, we should also reflect this change in the help
      text of the corresponding Kconfig option.
      Signed-off-by: NThomas Huth <thuth@redhat.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      129fd423
    • T
      KVM: PPC: fix suspicious use of conditional operator · f5ffe330
      Tudor Laurentiu 提交于
      This was signaled by a static code analysis tool.
      Signed-off-by: NLaurentiu Tudor <Laurentiu.Tudor@freescale.com>
      Reviewed-by: NScott Wood <scottwood@freescale.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      f5ffe330
  10. 21 8月, 2015 1 次提交
  11. 20 8月, 2015 3 次提交