1. 02 Oct 2018, 7 commits
    • perf/x86/intel: Add quirk for Goldmont Plus · 7c5314b8
      Kan Liang authored
      A microcode (ucode) patch is needed for Goldmont Plus while the counter
      freezing feature is enabled. Otherwise there will be issues, e.g. a PMI
      flood with some events.
      
      Add a quirk to check microcode version. If the system starts with the
      wrong ucode, leave the counter-freezing feature permanently disabled.
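      A minimal sketch of such a quirk; the function name and per-stepping
      minimum-revision cutoffs below are placeholders, not the real values:

        /* sketch: keep counter freezing off when the ucode is too old */
        static bool intel_counter_freezing_broken(int cpu)
        {
        	u32 rev = UINT_MAX; /* assume broken for unknown steppings */

        	switch (cpu_data(cpu).x86_stepping) {
        	case 1:
        		rev = 0x28;	/* placeholder minimum revision */
        		break;
        	}

        	return cpu_data(cpu).microcode < rev;
        }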
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Link: http://lkml.kernel.org/r/1533712328-2834-3-git-send-email-kan.liang@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/cpu: Sanitize FAM6_ATOM naming · f2c4db1b
      Peter Zijlstra authored
      Going primarily by:
      
        https://en.wikipedia.org/wiki/List_of_Intel_Atom_microprocessors
      
      with additional information gleaned from other related pages; notably:
      
       - Bonnell shrink was called Saltwell
       - Moorefield is the Merrifield refresh, which makes it Airmont
      
      The general naming scheme is: FAM6_ATOM_UARCH_SOCTYPE
      
        for i in `git grep -l FAM6_ATOM` ; do
      	sed -i  -e 's/ATOM_PINEVIEW/ATOM_BONNELL/g'		\
      		-e 's/ATOM_LINCROFT/ATOM_BONNELL_MID/'		\
      		-e 's/ATOM_PENWELL/ATOM_SALTWELL_MID/g'		\
      		-e 's/ATOM_CLOVERVIEW/ATOM_SALTWELL_TABLET/g'	\
      		-e 's/ATOM_CEDARVIEW/ATOM_SALTWELL/g'		\
      		-e 's/ATOM_SILVERMONT1/ATOM_SILVERMONT/g'	\
      		-e 's/ATOM_SILVERMONT2/ATOM_SILVERMONT_X/g'	\
      		-e 's/ATOM_MERRIFIELD/ATOM_SILVERMONT_MID/g'	\
      		-e 's/ATOM_MOOREFIELD/ATOM_AIRMONT_MID/g'	\
      		-e 's/ATOM_DENVERTON/ATOM_GOLDMONT_X/g'		\
      		-e 's/ATOM_GEMINI_LAKE/ATOM_GOLDMONT_PLUS/g' ${i}
        done
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: dave.hansen@linux.intel.com
      Cc: len.brown@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86/intel: Add a separate Arch Perfmon v4 PMI handler · af3bdb99
      Andi Kleen authored
      Implement counter freezing for Arch Perfmon v4 (Skylake and newer).
      This speeds up the PMI handler by avoiding unnecessary MSR writes, and
      makes it more accurate.
      
      The Arch Perfmon v4 PMI handler is substantially different from the
      older PMI handler.
      
      Differences to the old handler:
      
      - It relies on counter freezing, which eliminates several MSR
        writes from the PMI handler and lowers the overhead significantly.
      
        It makes the PMI handler more accurate, as all counters get
        frozen atomically as soon as any counter overflows. So there is
        much less counting of the PMI handler itself.
      
        With the freezing we don't need to disable or enable counters or
        PEBS. Only BTS, which does not support auto-freezing, still needs to
        be explicitly managed.
      
      - The PMU acking is done at the end, not the beginning.
        This makes it possible to avoid manual enabling/disabling
        of the PMU, instead we just rely on the freezing/acking.
      
      - The APIC is acked before reenabling the PMU, which avoids problems
        with LBRs occasionally not getting unfrozen on Skylake.
      
      - Looping is only needed to work around a corner case in which several
        PMIs arrive very close to each other. In the common case the counters
        are frozen during the PMI handler, so no re-check is needed.
      
      This patch:

      - Adds code to enable v4 counter freezing (a sketch of the enable path
        follows below).
      - Forks the <=v3 and >=v4 PMI handlers into separate functions.
      - Adds a kernel parameter to disable counter freezing. Counter freezing
        took some time to debug, so an option to turn it off is provided in
        case new problems show up; it is not expected to be used unless new
        bugs appear.
      - Applies only to big core. The patch for small core will be posted
        separately.
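      A minimal sketch of the enable path, assuming v4 counter freezing is
      armed via the FREEZE_PERFMON_ON_PMI bit in IA32_DEBUGCTL (which v4
      redefines to freeze all counters while a PMI is pending):

        /* sketch: arm v4 counter freezing on this CPU */
        static void enable_counter_freeze(void)
        {
        	update_debugctlmsr(get_debugctlmsr() |
        			   DEBUGCTLMSR_FREEZE_PERFMON_ON_PMI);
        }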
      
      Performance:
      
      When profiling a kernel build on Kabylake with different perf options,
      measuring the length of all NMI handlers using the nmi handler
      trace point:
      
      V3 is without counter freezing, V4 is with counter freezing. The value
      is the average cost of the PMI handler (lower is better):

        perf options                  V3 (ns)  V4 (ns)  delta
        -c 100000                     1088     894      -18%
        -g -c 100000                  1862     1646     -12%
        --call-graph lbr -c 100000    3649     3367     -8%
        --call-graph dwarf -c 100000  2248     1982     -12%
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Link: http://lkml.kernel.org/r/1533712328-2834-2-git-send-email-kan.liang@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86/intel: Factor out common code of PMI handler · ba12d20e
      Kan Liang authored
      The Arch Perfmon v4 PMI handler is substantially different from the
      older PMI handler. Instead of adding more and more ifs, cleanly fork
      the new handler into a new function, with the main common code factored
      out into a common function.

      Also fix a complaint from checkpatch.pl by removing the redundant
      "false" from "static bool warned".
      
      No functional change.
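      A rough sketch of the factored-out core (simplified; sample collection
      and throttling elided):

        /* sketch: overflow processing shared by the <=v3 and >=v4 handlers */
        static int handle_pmi_common(struct pt_regs *regs, u64 status)
        {
        	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
        	int bit, handled = 0;

        	for_each_set_bit(bit, (unsigned long *)&status, X86_PMC_IDX_MAX) {
        		struct perf_event *event = cpuc->events[bit];

        		handled++;
        		if (!test_bit(bit, cpuc->active_mask))
        			continue;
        		x86_perf_event_update(event);
        		/* ... overflow handling elided ... */
        	}
        	return handled;
        }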
      
      Based-on-code-from: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Link: http://lkml.kernel.org/r/1533712328-2834-1-git-send-email-kan.liang@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86/amd/uncore: Set ThreadMask and SliceMask for L3 Cache perf events · d7cbbe49
      Janakarajan Natarajan authored
      In Family 17h, some L3 Cache Performance events require the ThreadMask
      and SliceMask to be set. For other events, these fields do not affect
      the count either way.
      
      Set ThreadMask and SliceMask to 0xFF and 0xF respectively.
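      A sketch of what this looks like in the event-init path, assuming the
      two fields sit at bits 48-51 (SliceMask) and 56-63 (ThreadMask) of the
      L3 PMC event-select value:

        #define AMD64_L3_SLICE_SHIFT	48
        #define AMD64_L3_SLICE_MASK	(0xFULL << AMD64_L3_SLICE_SHIFT)
        #define AMD64_L3_THREAD_SHIFT	56
        #define AMD64_L3_THREAD_MASK	(0xFFULL << AMD64_L3_THREAD_SHIFT)

        	/* sketch: count on all slices and all threads for L3 events */
        	hwc->config |= (AMD64_L3_SLICE_MASK | AMD64_L3_THREAD_MASK);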
      Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H . Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Suravee <Suravee.Suthikulpanit@amd.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/Message-ID:
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86/intel/uncore: Fix PCI BDF address of M3UPI on SKX · 9d92cfea
      Kan Liang authored
      The counters on M3UPI Link 0 and Link 3 don't count properly, and
      writing 0 to these counters may cause a system crash on some machines.
      
      The PCI BDF addresses of the M3UPI in the current code are incorrect.
      
      The correct addresses should be:
      
        D18:F1	0x204D
        D18:F2	0x204E
        D18:F5	0x204D
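      A sketch of one corrected PCI match entry (macro layout assumed from
      the existing SKX uncore tables):

        { /* M3UPI0 Link 0 */
        	PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x204D),
        	.driver_data = UNCORE_PCI_DEV_FULL_DATA(18, 1, SKX_PCI_UNCORE_M3UPI, 0),
        },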
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: cd34cd97 ("perf/x86/intel/uncore: Add Skylake server uncore support")
      Link: http://lkml.kernel.org/r/1537538826-55489-1-git-send-email-kan.liang@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86/intel/uncore: Use boot_cpu_data.phys_proc_id instead of hardcoded physical package ID 0 · 6265adb9
      Masayoshi Mizuma authored
      Physical package id 0 doesn't always exist; use
      boot_cpu_data.phys_proc_id here instead.
      Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/20180910144750.6782-1-msys.mizuma@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 29 Sep 2018, 4 commits
    • x86/intel_rdt: Use perf infrastructure for measurements · dd45407c
      Reinette Chatre authored
      The success of a cache pseudo-locked region is measured using
      performance monitoring events that are programmed directly at the time
      the user requests a measurement.
      
      Modifying the performance event registers directly is not appropriate
      since it circumvents the in-kernel perf infrastructure that exists to
      manage these resources and provide resource arbitration to the
      performance monitoring hardware.
      
      The cache pseudo-locking measurements are modified to use the in-kernel
      perf infrastructure. Performance events are created and validated with
      the appropriate perf API. The performance counters are still read as
      directly as possible to avoid the additional cache hits. This is
      done safely by first ensuring with the perf API that the counters have
      been programmed correctly and only accessing the counters in an
      interrupt disabled section where they are not able to be moved.
      
      As part of the transition to the in-kernel perf infrastructure the L2
      and L3 measurements are split into two separate measurements that can
      be triggered independently. This separation prevents additional cache
      misses incurred by the extra testing code used to decide whether an
      L2 and/or L3 measurement should be made.
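      A minimal sketch of the pattern, assuming an attribute structure like
      the hypothetical perf_miss_attr and an l2_miss_count variable (the
      rdpmcl() index helper is introduced in a companion patch below):

        /* sketch: create a validated event, then read its counter directly */
        miss_event = perf_event_create_kernel_counter(&perf_miss_attr, cpu,
        					      NULL, NULL, NULL);
        if (IS_ERR(miss_event))
        	goto out;

        local_irq_disable();
        /* the event cannot be rescheduled while interrupts are off */
        rdpmcl(x86_perf_rdpmc_index(miss_event), l2_miss_count);
        local_irq_enable();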
      Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: fenghua.yu@intel.com
      Cc: tony.luck@intel.com
      Cc: peterz@infradead.org
      Cc: acme@kernel.org
      Cc: gavin.hindman@intel.com
      Cc: jithu.joseph@intel.com
      Cc: dave.hansen@intel.com
      Cc: hpa@zytor.com
      Link: https://lkml.kernel.org/r/fc24e728b446404f42c78573c506e98cd0599873.1537468643.git.reinette.chatre@intel.com
    • x86/intel_rdt: Create required perf event attributes · 0a701c9d
      Reinette Chatre authored
      A perf event has many attributes that are maintained in a separate
      structure that should be provided when a new perf_event is created.
      
      In preparation for the transition to perf_events the required attribute
      structures are created for all the events that may be used in the
      measurements. Most attributes for all the events are identical. The
      actual configuration, i.e. what specifies what needs to be measured, is
      what differs between the events used. This configuration needs to be
      done with X86_CONFIG, which cannot be used as part of the designated
      initializers used here; it will be introduced later.
      
      Although they do look identical at this time, the attribute structures
      need to be maintained separately since a perf_event will maintain a
      pointer to its unique attributes.
      
      In support of patch testing the new structs are given the unused attribute
      until their use in later patches.
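      A sketch of one such structure (field choices assumed; the actual
      config value is filled in later via X86_CONFIG):

        static struct perf_event_attr perf_miss_attr = {
        	.type		= PERF_TYPE_RAW,
        	.size		= sizeof(struct perf_event_attr),
        	.pinned		= 1,
        	.disabled	= 0,
        	.exclude_user	= 1,
        };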
      Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: fenghua.yu@intel.com
      Cc: tony.luck@intel.com
      Cc: acme@kernel.org
      Cc: gavin.hindman@intel.com
      Cc: jithu.joseph@intel.com
      Cc: dave.hansen@intel.com
      Cc: hpa@zytor.com
      Link: https://lkml.kernel.org/r/1822f6164e221a497648d108913d056ab675d5d0.1537377064.git.reinette.chatre@intel.com
    • x86/intel_rdt: Remove local register variables · b5e4274e
      Reinette Chatre authored
      Local register variables were used in an effort to improve the
      accuracy of the measurement of cache residency of a pseudo-locked
      region. This was done to ensure that only the cache residency of
      the memory is measured and not the cache residency of the variables
      used to perform the measurement.
      
      While local register variables do accomplish the goal, they require
      significant care since different architectures have different registers
      available. Local register variables also cannot be used with valuable
      developer tools like KASAN.
      
      Significant testing has shown that similar accuracy in measurement
      results can be obtained by replacing local register variables with
      regular local variables.
      
      Make use of local variables in the critical code but do so with
      READ_ONCE() to prevent the compiler from merging or refetching reads.
      Ensure these variables are initialized before the measurement starts,
      and ensure it is only the local variables that are accessed during
      the measurement.
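      A sketch of the resulting pattern, assuming a pseudo-lock region struct
      (plr) with kmem, size and line_size fields as in the existing code:

        void *mem_r;
        unsigned long i, size, line_size;

        /* plain locals, initialized up front via READ_ONCE() */
        mem_r = READ_ONCE(plr->kmem);
        size = READ_ONCE(plr->size);
        line_size = READ_ONCE(plr->line_size);

        for (i = 0; i < size; i += line_size)
        	asm volatile("mov (%0,%1,1), %%eax\n\t"
        		     :
        		     : "r" (mem_r), "r" (i)
        		     : "%eax", "memory");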
      
      With the removal of the local register variables and using READ_ONCE()
      there is no longer a motivation for using a direct wrmsr call (that
      avoids the additional tracing code that may clobber the local register
      variables).
      Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: fenghua.yu@intel.com
      Cc: tony.luck@intel.com
      Cc: acme@kernel.org
      Cc: gavin.hindman@intel.com
      Cc: jithu.joseph@intel.com
      Cc: dave.hansen@intel.com
      Cc: hpa@zytor.com
      Link: https://lkml.kernel.org/r/f430f57347414e0691765d92b144758ab93d8407.1537377064.git.reinette.chatre@intel.com
    • perf/x86: Add helper to obtain performance counter index · 1182a495
      Reinette Chatre authored
      perf_event_read_local() is the safest way to obtain measurements
      associated with performance events. In some cases the overhead
      introduced by perf_event_read_local() affects the measurements and the
      use of rdpmcl() is needed. rdpmcl() requires the index
      of the performance counter used so a helper is introduced to determine
      the index used by a provided performance event.
      
      The index used by a performance event may change when interrupts are
      enabled. A check is added to ensure that the index is only accessed
      with interrupts disabled. Even with this check the use of this counter
      needs to be done with care to ensure it is queried and used within the
      same disabled interrupts section.
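      A sketch of the helper, matching the shape described above (the field
      name is assumed from the existing struct hw_perf_event):

        /* sketch: only valid while interrupts stay disabled */
        int x86_perf_rdpmc_index(struct perf_event *event)
        {
        	lockdep_assert_irqs_disabled();

        	return event->hw.event_base_rdpmc;
        }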
      
      This change introduces a new checkpatch warning:
      CHECK: extern prototypes should be avoided in .h files
      +extern int x86_perf_rdpmc_index(struct perf_event *event);
      
      This warning was discussed and designated as a false positive in
      http://lkml.kernel.org/r/20180919091759.GZ24124@hirez.programming.kicks-ass.net
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: fenghua.yu@intel.com
      Cc: tony.luck@intel.com
      Cc: acme@kernel.org
      Cc: gavin.hindman@intel.com
      Cc: jithu.joseph@intel.com
      Cc: dave.hansen@intel.com
      Cc: hpa@zytor.com
      Link: https://lkml.kernel.org/r/b277ffa78a51254f5414f7b1bc1923826874566e.1537377064.git.reinette.chatre@intel.com
  3. 28 Sep 2018, 1 commit
    • x86/boot: Fix kexec booting failure in the SEV bit detection code · bdec8d7f
      Kairui Song authored
      Commit
      
        1958b5fc ("x86/boot: Add early boot support when running with SEV active")
      
      can occasionally cause system resets when kexec-ing a second kernel even
      if SEV is not active.
      
      That's because get_sev_encryption_bit() uses 32-bit rIP-relative
      addressing to read the value of enc_bit - a variable which caches a
      previously detected encryption bit position - but kexec may allocate
      the early boot code to a higher location, beyond the 32-bit addressing
      limit.
      
      In this case, garbage will be read and get_sev_encryption_bit() will
      return the wrong value, leading to accessing memory with the wrong
      encryption setting.
      
      Therefore, remove enc_bit, and thus get rid of the need to do 32-bit
      rIP-relative addressing in the first place.
      
       [ bp: massage commit message heavily. ]
      
      Fixes: 1958b5fc ("x86/boot: Add early boot support when running with SEV active")
      Suggested-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Kairui Song <kasong@redhat.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: linux-kernel@vger.kernel.org
      Cc: tglx@linutronix.de
      Cc: mingo@redhat.com
      Cc: hpa@zytor.com
      Cc: brijesh.singh@amd.com
      Cc: kexec@lists.infradead.org
      Cc: dyoung@redhat.com
      Cc: bhe@redhat.com
      Cc: ghook@redhat.com
      Link: https://lkml.kernel.org/r/20180927123845.32052-1-kasong@redhat.com
  4. 25 Sep 2018, 4 commits
    • powerpc/numa: Use associativity if VPHN hcall is successful · 2483ef05
      Srikar Dronamraju authored
      Currently associativity is used to look up the node-id even if the
      preceding VPHN hcall failed. This can cause a CPU to be made part of
      the wrong node (most likely node 0), because VPHN is not enabled on
      KVM guests.

      With 2ea62630 ("powerpc/topology: Get topology for shared processors at
      boot"), associativity is used and sets the wrong node, so KVM guest
      topology is broken.
      
      For example, a 4-node KVM guest would previously have reported:
      
        [root@localhost ~]#  numactl -H
        available: 4 nodes (0-3)
        node 0 cpus: 0 1 2 3
        node 0 size: 1746 MB
        node 0 free: 1604 MB
        node 1 cpus: 4 5 6 7
        node 1 size: 2044 MB
        node 1 free: 1765 MB
        node 2 cpus: 8 9 10 11
        node 2 size: 2044 MB
        node 2 free: 1837 MB
        node 3 cpus: 12 13 14 15
        node 3 size: 2044 MB
        node 3 free: 1903 MB
        node distances:
        node   0   1   2   3
          0:  10  40  40  40
          1:  40  10  40  40
          2:  40  40  10  40
          3:  40  40  40  10
      
      It would now report:
      
        [root@localhost ~]# numactl -H
        available: 4 nodes (0-3)
        node 0 cpus: 0 2 3 4 5 6 7 8 9 10 11 12 13 14 15
        node 0 size: 1746 MB
        node 0 free: 1244 MB
        node 1 cpus:
        node 1 size: 2044 MB
        node 1 free: 2032 MB
        node 2 cpus: 1
        node 2 size: 2044 MB
        node 2 free: 2028 MB
        node 3 cpus:
        node 3 size: 2044 MB
        node 3 free: 2032 MB
        node distances:
        node   0   1   2   3
          0:  10  40  40  40
          1:  40  10  40  40
          2:  40  40  10  40
          3:  40  40  40  10
      
      Fix this by skipping associativity lookup if the VPHN hcall failed.
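      A sketch of the fix, assuming the VPHN helper names used elsewhere in
      the powerpc NUMA code:

        /* sketch: only trust the associativity buffer on H_SUCCESS */
        rc = hcall_vphn(cpu, 1, associativity);
        if (rc == H_SUCCESS)
        	new_nid = associativity_to_nid(associativity);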
      
      Fixes: 2ea62630 ("powerpc/topology: Get topology for shared processors at boot")
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/tm: Avoid possible userspace r1 corruption on reclaim · 96dc89d5
      Michael Neuling authored
      Currently we store the userspace r1 to PACATMSCRATCH before finally
      saving it to the thread struct.
      
      In theory an exception could be taken here (like a machine check or
      SLB miss) that could write PACATMSCRATCH and hence corrupt the
      userspace r1. The SLB fault currently doesn't touch PACATMSCRATCH, but
      others do.
      
      We've never actually seen this happen but it's theoretically
      possible. Either way, the code is fragile as it is.
      
      This patch saves r1 to the kernel stack (which can't fault) before we
      turn MSR[RI] back on. PACATMSCRATCH is still used but only with
      MSR[RI] off. We then copy r1 from the kernel stack to the thread
      struct once we have MSR[RI] back on.
      Suggested-by: Breno Leitao <leitao@debian.org>
      Signed-off-by: Michael Neuling <mikey@neuling.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/tm: Fix userspace r13 corruption · cf13435b
      Michael Neuling authored
      When we treclaim we store the userspace checkpointed r13 to a scratch
      SPR and then later save the scratch SPR to the user thread struct.
      
      Unfortunately, this doesn't work as accessing the user thread struct
      can take an SLB fault and the SLB fault handler will write the same
      scratch SPRG that now contains the userspace r13.
      
      To fix this, we store r13 to the kernel stack (which can't fault)
      before we access the user thread struct.
      
      Found by running P8 guest + powervm + disable_1tb_segments + TM. Seen
      as a random userspace segfault with r13 looking like a kernel address.
      Signed-off-by: Michael Neuling <mikey@neuling.org>
      Reviewed-by: Breno Leitao <leitao@debian.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • RISC-V: include linux/ftrace.h in asm-prototypes.h · 57a48978
      James Cowgill authored
      Building a riscv kernel with CONFIG_FUNCTION_TRACER and
      CONFIG_MODVERSIONS enabled results in these two warnings:
      
        MODPOST vmlinux.o
      WARNING: EXPORT symbol "return_to_handler" [vmlinux] version generation failed, symbol will not be versioned.
      WARNING: EXPORT symbol "_mcount" [vmlinux] version generation failed, symbol will not be versioned.
      
      When exporting symbols from an assembly file, the MODVERSIONS code
      requires their prototypes to be defined in asm-prototypes.h (see
      scripts/Makefile.build). Since both of these symbols have prototypes
      defined in linux/ftrace.h, include this header from RISC-V's
      asm-prototypes.h.
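      A sketch of the resulting header:

        /* arch/riscv/include/asm/asm-prototypes.h */
        #ifndef _ASM_RISCV_PROTOTYPES_H
        #define _ASM_RISCV_PROTOTYPES_H

        #include <linux/ftrace.h>
        #include <asm-generic/asm-prototypes.h>

        #endif /* _ASM_RISCV_PROTOTYPES_H */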
      Reported-by: Karsten Merker <merker@debian.org>
      Signed-off-by: James Cowgill <jcowgill@debian.org>
      Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
  5. 24 Sep 2018, 1 commit
    • powerpc/pseries: Fix uninitialized timer reset on migration · 8604895a
      Michael Bringmann authored
      After migration of a powerpc LPAR, the kernel executes code to
      update the system state to reflect new platform characteristics.
      
      Such changes include modifications to device tree properties provided
      to the system by PHYP. Property notifications received by the
      post_mobility_fixup() code are passed along to the kernel in general
      through a call to of_update_property() which in turn passes such
      events back to all modules through entries like the '.notifier_call'
      function within the NUMA module.
      
      When the NUMA module updates its state, it resets its event timer. If
      this occurs after a previous call to stop_topology_update() or on a
      system without VPHN enabled, the code runs into an uninitialized timer
      structure and crashes. This patch adds a safety check along this path
      toward the problem code.
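      A sketch of the guarded reset, assuming the existing vphn_enabled flag
      and topology timer in the powerpc NUMA code:

        static void reset_topology_timer(void)
        {
        	/* sketch: never touch a timer that was not set up */
        	if (vphn_enabled)
        		mod_timer(&topology_timer,
        			  jiffies + topology_timer_secs * HZ);
        }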
      
      An example crash log is as follows.
      
        ibmvscsi 30000081: Re-enabling adapter!
        ------------[ cut here ]------------
        kernel BUG at kernel/time/timer.c:958!
        Oops: Exception in kernel mode, sig: 5 [#1]
        LE SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in: nfsv3 nfs_acl nfs tcp_diag udp_diag inet_diag lockd unix_diag af_packet_diag netlink_diag grace fscache sunrpc xts vmx_crypto pseries_rng sg binfmt_misc ip_tables xfs libcrc32c sd_mod ibmvscsi ibmveth scsi_transport_srp dm_mirror dm_region_hash dm_log dm_mod
        CPU: 11 PID: 3067 Comm: drmgr Not tainted 4.17.0+ #179
        ...
        NIP mod_timer+0x4c/0x400
        LR  reset_topology_timer+0x40/0x60
        Call Trace:
          0xc0000003f9407830 (unreliable)
          reset_topology_timer+0x40/0x60
          dt_update_callback+0x100/0x120
          notifier_call_chain+0x90/0x100
          __blocking_notifier_call_chain+0x60/0x90
          of_property_notify+0x90/0xd0
          of_update_property+0x104/0x150
          update_dt_property+0xdc/0x1f0
          pseries_devicetree_update+0x2d0/0x510
          post_mobility_fixup+0x7c/0xf0
          migration_store+0xa4/0xc0
          kobj_attr_store+0x30/0x60
          sysfs_kf_write+0x64/0xa0
          kernfs_fop_write+0x16c/0x240
          __vfs_write+0x40/0x200
          vfs_write+0xc8/0x240
          ksys_write+0x5c/0x100
          system_call+0x58/0x6c
      
      Fixes: 5d88aa85 ("powerpc/pseries: Update CPU maps when device tree is updated")
      Cc: stable@vger.kernel.org # v3.10+
      Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  6. 21 Sep 2018, 2 commits
  7. 20 Sep 2018, 19 commits
    • powerpc/pkeys: Fix reading of ibm,processor-storage-keys property · c716a25b
      Thiago Jung Bauermann authored
      scan_pkey_feature() uses of_property_read_u32_array() to read the
      ibm,processor-storage-keys property and calls be32_to_cpu() on the
      value it gets. The problem is that of_property_read_u32_array() already
      returns the value converted to the CPU byte order.
      
      The value of pkeys_total ends up more or less sane because there's a min()
      call in pkey_initialize() which reduces pkeys_total to 32. So in practice
      the kernel ignores the fact that the hypervisor reserved one key for
      itself (the device tree advertises 31 keys in my test VM).
      
      This is wrong, but the effect in practice is that when a process tries
      to allocate the 32nd key, it gets an -EINVAL error instead of -ENOSPC,
      which would indicate that there aren't any keys available.
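      A sketch of the corrected read (variable names assumed):

        /* sketch: the values are already CPU-endian after this call */
        ret = of_property_read_u32_array(pn, "ibm,processor-storage-keys",
        				 vals, 2);
        if (ret == 0)
        	pkeys_total = vals[0];	/* no be32_to_cpu() needed */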
      
      Fixes: cf43d3b2 ("powerpc: Enable pkey subsystem")
      Cc: stable@vger.kernel.org # v4.16+
      Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc: fix csum_ipv6_magic() on little endian platforms · 85682a7e
      Christophe Leroy authored
      On little endian platforms, csum_ipv6_magic() keeps len and proto in
      CPU byte order. This generates bad results, leading to ICMPv6 packets
      from other hosts being dropped by powerpc64le platforms.

      In order to fix this, len and proto should be converted to network
      byte order, i.e. big endian byte order. However, checksumming
      0x12345678 and 0x56341278 provides the exact same result, so it is
      enough to rotate the sum of len and proto by one byte.

      PPC32 only supports big endian, so the fix is needed for PPC64 only.
      
      Fixes: e9c4943a ("powerpc: Implement csum_ipv6_magic in assembly")
      Reported-by: Jianlin Shi <jishi@redhat.com>
      Reported-by: Xin Long <lucien.xin@gmail.com>
      Cc: <stable@vger.kernel.org> # 4.18+
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Tested-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv/ioda2: Reduce upper limit for DMA window size (again) · 7233b8ca
      Alexey Kardashevskiy authored
      mpe: This was fixed originally in commit d3d4ffaa
      ("powerpc/powernv/ioda2: Reduce upper limit for DMA window size"), but
      contrary to what the merge commit says was inadvertently lost by me in
      commit ce57c661 ("Merge branch 'topic/ppc-kvm' into next") which
      brought in changes that moved the code to a new file. So reapply it to
      the new file.
      
      Original commit message follows:
      
      We use PHB in mode1, which uses bit 59 to select a correct DMA window.
      However there is mode2, which uses bits 59:55 and allows up to 32 DMA
      windows per PE.
      
      Even though documentation does not clearly specify that, it seems that
      the actual hardware does not support bits 59:55 even in mode1, in
      other words we can create a window as big as 1<<58 but DMA simply
      won't work.
      
      This reduces the upper limit from 59 to 55 bits to let the userspace
      know about the hardware limits.
      
      Fixes: ce57c661 ("Merge branch 'topic/ppc-kvm' into next")
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • KVM: x86: Control guest reads of MSR_PLATFORM_INFO · 6fbbde9a
      Drew Schmitt authored
      Add KVM_CAP_MSR_PLATFORM_INFO so that userspace can disable guest access
      to reads of MSR_PLATFORM_INFO.
      
      Disabling access to reads of this MSR gives userspace the control to "expose"
      this platform-dependent information to guests in a clear way. As it exists
      today, guests that read this MSR would get unpopulated information if userspace
      hadn't already set it (and prior to this patch series, only the CPUID faulting
      information could have been populated). This existing interface could be
      confusing if guests don't handle the potential for incorrect/incomplete
      information gracefully (e.g. zero reported for base frequency).
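      A sketch of how userspace might use the capability (the args[0]
      semantics are assumed: 0 makes guest reads fault, 1 allows them):

        #include <linux/kvm.h>
        #include <sys/ioctl.h>

        struct kvm_enable_cap cap = {
        	.cap  = KVM_CAP_MSR_PLATFORM_INFO,
        	.args = { 0 },	/* disable guest reads of the MSR */
        };
        ioctl(vm_fd, KVM_ENABLE_CAP, &cap);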
      Signed-off-by: Drew Schmitt <dasch@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Turbo bits in MSR_PLATFORM_INFO · d84f1cff
      Drew Schmitt authored
      Allow userspace to set turbo bits in MSR_PLATFORM_INFO. Previously,
      only the CPUID faulting bit was settable; now any bit in
      MSR_PLATFORM_INFO is settable. This can be used, for example, to
      convey frequency information about the platform on which the guest is
      running.
      Signed-off-by: Drew Schmitt <dasch@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • nVMX x86: Check VPID value on vmentry of L2 guests · ba8e23db
      Krish Sadhukhan authored
      According to section "Checks on VMX Controls" in Intel SDM vol 3C, the
      following check needs to be enforced on vmentry of L2 guests:
      
          If the 'enable VPID' VM-execution control is 1, the value of the
          VPID VM-execution control field must not be 0000H.
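      A sketch of the check (helper and error-code names assumed from the
      surrounding nested-VMX code):

        if (nested_cpu_has_vpid(vmcs12) && !vmcs12->virtual_processor_id)
        	return VMXERR_ENTRY_INVALID_CONTROL_FIELD;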
      Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • nVMX x86: check posted-interrupt descriptor address on vmentry of L2 · 6de84e58
      Krish Sadhukhan authored
      According to section "Checks on VMX Controls" in Intel SDM vol 3C,
      the following check needs to be enforced on vmentry of L2 guests:
      
         - Bits 5:0 of the posted-interrupt descriptor address are all 0.
         - The posted-interrupt descriptor address does not set any bits
           beyond the processor's physical-address width.
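      A sketch of these checks (helper and error-code names assumed from the
      surrounding nested-VMX code):

        if (nested_cpu_has_posted_intr(vmcs12) &&
            ((vmcs12->posted_intr_desc_addr & 0x3f) ||
             (vmcs12->posted_intr_desc_addr >> cpuid_maxphyaddr(vcpu))))
        	return VMXERR_ENTRY_INVALID_CONTROL_FIELD;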
      Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
      Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Wake blocked vCPU in guest-mode if pending interrupt in virtual APICv · e6c67d8c
      Liran Alon authored
      If L1 does not intercept L2 HLT or enters L2 in HLT activity-state,
      it is possible for a vCPU to be blocked while it is in guest-mode.
      
      According to Intel SDM 26.6.5 Interrupt-Window Exiting and
      Virtual-Interrupt Delivery: "These events wake the logical processor
      if it just entered the HLT state because of a VM entry".
      Therefore, if L1 enters L2 in HLT activity-state and L2 has a pending
      deliverable interrupt in vmcs12->guest_intr_status.RVI, then the vCPU
      should be woken from the HLT state and injected with the interrupt.
      
      In addition, if while the vCPU is blocked (while it is in guest-mode),
      it receives a nested posted-interrupt, then the vCPU should also be
      woken and injected with the posted interrupt.
      
      To handle these cases, this patch enhances kvm_vcpu_has_events() to also
      check if there is a pending interrupt in L2 virtual APICv provided by
      L1. That is, it evaluates if there is a pending virtual interrupt for L2
      by checking RVI[7:4] > VPPR[7:4] as specified in Intel SDM 29.2.1
      Evaluation of Pending Interrupts.
      
      Note that this also handles the case of nested posted-interrupt by the
      fact RVI is updated in vmx_complete_nested_posted_interrupt() which is
      called from kvm_vcpu_check_block() -> kvm_arch_vcpu_runnable() ->
      kvm_vcpu_running() -> vmx_check_nested_events() ->
      vmx_complete_nested_posted_interrupt().
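      A sketch of the RVI/VPPR comparison (assuming the virtual-APIC page is
      already mapped as vapic_page):

        /* sketch: pending virtual interrupt for L2, per SDM 29.2.1 */
        rvi  = vmcs12->guest_intr_status & 0xff;
        vppr = *((u32 *)(vapic_page + APIC_PROCPRI));

        return ((rvi & 0xf0) > (vppr & 0xf0));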
      Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: check nested state and CR4.VMXE against SMM · 5bea5123
      Paolo Bonzini authored
      VMX cannot be enabled under SMM, check it when CR4 is set and when nested
      virtualization state is restored.
      
      This should fix some WARNs reported by syzkaller, mostly around
      alloc_shadow_vmcs.
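      A sketch of the CR4 side of the check, using the existing is_smm()
      helper:

        /* sketch: in the CR4 setter */
        if ((cr4 & X86_CR4_VMXE) && is_smm(vcpu))
        	return 1;	/* VMX cannot be enabled while in SMM */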
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: make kvm_{load|put}_guest_fpu() static · 822f312d
      Sebastian Andrzej Siewior authored
      The functions
      	kvm_load_guest_fpu()
      	kvm_put_guest_fpu()
      
      are only used locally; make them static. This also requires moving
      both functions, because they are used before their implementation.
      Those functions were exported (via EXPORT_SYMBOL) before commit
      e5bb4025 ("KVM: Drop kvm_{load,put}_guest_fpu() exports").
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/hyper-v: rename ipi_arg_{ex,non_ex} structures · a1efa9b7
      Vitaly Kuznetsov authored
      These structures are going to be used from KVM code, so let's make
      their names reflect their Hyper-V origin.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
      Acked-by: K. Y. Srinivasan <kys@microsoft.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: use preemption timer to force immediate VMExit · d264ee0c
      Sean Christopherson authored
      A VMX preemption timer value of '0' is guaranteed to cause a VMExit
      prior to the CPU executing any instructions in the guest.  Use the
      preemption timer (if it's supported) to trigger immediate VMExit
      in place of the current method of sending a self-IPI.  This ensures
      that pending VMExit injection to L1 occurs prior to executing any
      instructions in the guest (regardless of nesting level).
      
      When deferring VMExit injection, KVM generates an immediate VMExit
      from the (possibly nested) guest by sending itself an IPI.  Because
      hardware interrupts are blocked prior to VMEnter and are unblocked
      (in hardware) after VMEnter, this results in taking a VMExit(INTR)
      before any guest instruction is executed.  But, as this approach
      relies on the IPI being received before VMEnter executes, it only
      works as intended when KVM is running as L0.  Because there are no
      architectural guarantees regarding when IPIs are delivered, when
      running nested the INTR may "arrive" long after L2 is running e.g.
      L0 KVM doesn't force an immediate switch to L1 to deliver an INTR.
      
      For the most part, this unintended delay is not an issue since the
      events being injected to L1 also do not have architectural guarantees
      regarding their timing.  The notable exception is the VMX preemption
      timer[1], which is architecturally guaranteed to cause a VMExit prior
      to executing any instructions in the guest if the timer value is '0'
      at VMEnter.  Specifically, the delay in injecting the VMExit causes
      the preemption timer KVM unit test to fail when run in a nested guest.
      
      Note: this approach is viable even on CPUs with a broken preemption
      timer, as broken in this context only means the timer counts at the
      wrong rate.  There are no known errata affecting timer value of '0'.
      
      [1] I/O SMIs also have guarantees on when they arrive, but I have
          no idea if/how those are emulated in KVM.
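      A sketch of the mechanism (hook and field names assumed):

        /* sketch: request an exit before any guest instruction runs */
        static void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
        {
        	to_vmx(vcpu)->req_immediate_exit = true;
        }

        	/* later, on the VMEnter path */
        	if (vmx->req_immediate_exit)
        		vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, 0);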
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      [Use a hook for SVM instead of leaving the default in x86.c - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: modify preemption timer bit only when arming timer · f459a707
      Sean Christopherson authored
      Provide a singular location where the VMX preemption timer bit is
      set/cleared so that future usages of the preemption timer can ensure
      the VMCS bit is up-to-date without having to modify unrelated code
      paths.  For example, the preemption timer can be used to force an
      immediate VMExit.  Cache the status of the timer to avoid redundant
      VMREAD and VMWRITE, e.g. if the timer stays armed across multiple
      VMEnters/VMExits.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: immediately mark preemption timer expired only for zero value · 4c008127
      Sean Christopherson authored
      A VMX preemption timer value of '0' at the time of VMEnter is
      architecturally guaranteed to cause a VMExit prior to the CPU
      executing any instructions in the guest.  This architectural
      definition is in place to ensure that a previously expired timer
      is correctly recognized by the CPU as it is possible for the timer
      to reach zero and not trigger a VMexit due to a higher priority
      VMExit being signalled instead, e.g. a pending #DB that morphs into
      a VMExit.
      
      Whether by design or coincidence, commit f4124500 ("KVM: nVMX:
      Fully emulate preemption timer") special cased timer values of '0'
      and '1' to ensure prompt delivery of the VMExit.  Unlike '0', a
      timer value of '1' has no architectural guarantees regarding
      when it is delivered.
      
      Modify the timer emulation to trigger immediate VMExit if and only
      if the timer value is '0', and document precisely why '0' is special.
      Do this even if calibration of the virtual TSC failed, i.e. VMExit
      will occur immediately regardless of the frequency of the timer.
      Making only '0' a special case gives KVM leeway to be more aggressive
      in ensuring the VMExit is injected prior to executing instructions in
      the nested guest, and also eliminates any ambiguity as to why '1' is
      a special case, e.g. why wasn't the threshold for a "short timeout"
      set to 10, 100, 1000, etc...
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Switch to bitmap_zalloc() · a101c9d6
      Andy Shevchenko authored
      Switch to bitmap_zalloc() to show clearly what we are allocating.
      Besides that, it returns a pointer of bitmap type instead of an opaque
      void *.
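      A sketch of the change (variable names assumed from the SEV ASID
      allocator):

        /* before: size computation exposed, returns void * */
        sev_asid_bitmap = kcalloc(BITS_TO_LONGS(max_sev_asid),
        			  sizeof(long), GFP_KERNEL);

        /* after: intent is explicit, returns unsigned long * */
        sev_asid_bitmap = bitmap_zalloc(max_sev_asid, GFP_KERNEL);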
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM/MMU: Fix comment in walk_shadow_page_lockless_end() · 9a984586
      Tianyu Lan authored
      kvm_commit_zap_page() has been renamed to kvm_mmu_commit_zap_page().
      This patch fixes the comment accordingly.
      Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: don't reset root in kvm_mmu_setup() · 83b20b28
      Wei Yang authored
      Here is the code path which shows that kvm_mmu_setup() is invoked after
      kvm_mmu_create(). Since kvm_mmu_setup() is only invoked in this code
      path, the root_hpa and prev_roots are guaranteed to be invalid, and it
      is not necessary to reset them again.
      
          kvm_vm_ioctl_create_vcpu()
              kvm_arch_vcpu_create()
                  vmx_create_vcpu()
                      kvm_vcpu_init()
                          kvm_arch_vcpu_init()
                              kvm_mmu_create()
              kvm_arch_vcpu_setup()
                  kvm_mmu_setup()
                      kvm_init_mmu()
      
      This patch sets reset_roots to false in kvm_mmu_setup().
      
      Fixes: 50c28f21
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: mmu: Don't read PDPTEs when paging is not enabled · d35b34a9
      Junaid Shahid authored
      kvm should not attempt to read guest PDPTEs when CR0.PG = 0 and
      CR4.PAE = 1.
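      A minimal sketch of the intended guard, using existing mode helpers
      (the exact call site is not shown here):

        /* sketch: only consult guest PDPTEs under PAE paging */
        if (is_paging(vcpu) && is_pae(vcpu) && !is_long_mode(vcpu)) {
        	if (!load_pdptrs(vcpu, vcpu->arch.walk_mmu, kvm_read_cr3(vcpu)))
        		return 1;
        }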
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/kvm/lapic: always disable MMIO interface in x2APIC mode · d1766202
      Vitaly Kuznetsov authored
      When VMX is used with flexpriority disabled (because of no support, or
      disabled via module parameter) the MMIO interface to the lAPIC is still
      available in x2APIC mode while it shouldn't be (kvm-unit-tests):
      
      PASS: apic_disable: Local apic enabled in x2APIC mode
      PASS: apic_disable: CPUID.1H:EDX.APIC[bit 9] is set
      FAIL: apic_disable: *0xfee00030: 50014
      
      The issue appears because we basically do nothing while switching to
      x2APIC mode when the APIC access page is not used. apic_mmio_{read,write}
      only check if the lAPIC is disabled before proceeding to the actual access.
      
      When APIC access is virtualized we correctly manipulate with VMX controls
      in vmx_set_virtual_apic_mode() and we don't get vmexits from memory writes
      in x2APIC mode so there's no issue.
      
      Disabling MMIO interface seems to be easy. The question is: what do we
      do with these reads and writes? If we add apic_x2apic_mode() check to
      apic_mmio_in_range() and return -EOPNOTSUPP these reads and writes will
      go to userspace. When the lAPIC is in-kernel, Qemu uses this interface
      to inject MSIs only (see kvm_apic_mem_write() in hw/i386/kvm/apic.c).
      This somehow works with a disabled lAPIC, but when we're in xAPIC mode
      we will get a real injected MSI from every write to the lAPIC. Not good.
      
      The simplest solution seems to be to just ignore writes to the region
      and return ~0 for all reads when we're in x2APIC mode. This is what this
      patch does. However, this approach is inconsistent with what currently
      happens when flexpriority is enabled: we allocate APIC access page and
      create KVM memory region so in x2APIC modes all reads and writes go to
      this pre-allocated page which is, btw, the same for all vCPUs.
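      A sketch of the resulting behavior in the MMIO handlers (read, data
      and len stand for the handler arguments):

        /* sketch: in apic_mmio_read()/apic_mmio_write() */
        if (apic_x2apic_mode(apic)) {
        	/* the MMIO window is gone in x2APIC mode */
        	if (read)
        		memset(data, 0xff, len);	/* reads return all ones */
        	return 0;				/* writes are ignored */
        }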
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  8. 19 Sep 2018, 2 commits