1. 25 3月, 2016 1 次提交
  2. 21 3月, 2016 17 次提交
    • S
      perf/x86/intel/rapl: Add missing Broadwell models · 7b0fd569
      Srinivas Pandruvada 提交于
      Added Broadwell-H and Broadwell-Server.
      Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: bp@alien8.de
      Link: http://lkml.kernel.org/r/1458517938-25308-1-git-send-email-srinivas.pandruvada@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      7b0fd569
    • K
      perf/x86/intel/uncore: Remove ev_sel_ext bit support for PCU · cb225252
      Kan Liang 提交于
      The ev_sel_ext in PCU_MSR_PMON_CTL is locked on some CPU models, so despite
      it being documented in the SDM, if we write 1 to that bit then we can get a #GP
      fault.
      
      Which #GP the perf fuzzer happily triggered in Peter Zijlstra's testing.
      
      Also, there are no public events which use that bit, so remove ev_sel_ext
      bit support for PCU.
      Reported-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NKan Liang <kan.liang@intel.com>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1458500301-3594-1-git-send-email-kan.liang@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      cb225252
    • H
      perf/x86/amd/power: Add AMD accumulated power reporting mechanism · c7ab62bf
      Huang Rui 提交于
      Introduce an AMD accumlated power reporting mechanism for the Family
      15h, Model 60h processor that can be used to calculate the average
      power consumed by a processor during a measurement interval. The
      feature support is indicated by CPUID Fn8000_0007_EDX[12].
      
      This feature will be implemented both in hwmon and perf. The current
      design provides one event to report per package/processor power
      consumption by counting each compute unit power value.
      
      Here the gory details of how the computation is done:
      
      * Tsample: compute unit power accumulator sample period
      * Tref: the PTSC counter period (PTSC: performance timestamp counter)
      * N: the ratio of compute unit power accumulator sample period to the
        PTSC period
      
      * Jmax: max compute unit accumulated power which is indicated by
        MSR_C001007b[MaxCpuSwPwrAcc]
      
      * Jx/Jy: compute unit accumulated power which is indicated by
        MSR_C001007a[CpuSwPwrAcc]
      
      * Tx/Ty: the value of performance timestamp counter which is indicated
        by CU_PTSC MSR_C0010280[PTSC]
      * PwrCPUave: CPU average power
      
      i. Determine the ratio of Tsample to Tref by executing CPUID Fn8000_0007.
      	N = value of CPUID Fn8000_0007_ECX[CpuPwrSampleTimeRatio[15:0]].
      
      ii. Read the full range of the cumulative energy value from the new
          MSR MaxCpuSwPwrAcc.
      	Jmax = value returned.
      
      iii. At time x, software reads CpuSwPwrAcc and samples the PTSC.
      	Jx = value read from CpuSwPwrAcc and Tx = value read from PTSC.
      
      iv. At time y, software reads CpuSwPwrAcc and samples the PTSC.
      	Jy = value read from CpuSwPwrAcc and Ty = value read from PTSC.
      
      v. Calculate the average power consumption for a compute unit over
      time period (y-x). Unit of result is uWatt:
      
      	if (Jy < Jx) // Rollover has occurred
      		Jdelta = (Jy + Jmax) - Jx
      	else
      		Jdelta = Jy - Jx
      	PwrCPUave = N * Jdelta * 1000 / (Ty - Tx)
      
      Simple example:
      
        root@hr-zp:/home/ray/tip# ./tools/perf/perf stat -a -e 'power/power-pkg/' make -j4
          CHK     include/config/kernel.release
          CHK     include/generated/uapi/linux/version.h
          CHK     include/generated/utsrelease.h
          CHK     include/generated/timeconst.h
          CHK     include/generated/bounds.h
          CHK     include/generated/asm-offsets.h
          CALL    scripts/checksyscalls.sh
          CHK     include/generated/compile.h
          SKIPPED include/generated/compile.h
          Building modules, stage 2.
        Kernel: arch/x86/boot/bzImage is ready  (#40)
          MODPOST 4225 modules
      
         Performance counter stats for 'system wide':
      
                    183.44 mWatts power/power-pkg/
      
             341.837270111 seconds time elapsed
      
        root@hr-zp:/home/ray/tip# ./tools/perf/perf stat -a -e 'power/power-pkg/' sleep 10
      
         Performance counter stats for 'system wide':
      
                      0.18 mWatts power/power-pkg/
      
              10.012551815 seconds time elapsed
      Suggested-by: NPeter Zijlstra <peterz@infradead.org>
      Suggested-by: NIngo Molnar <mingo@kernel.org>
      Suggested-by: NBorislav Petkov <bp@suse.de>
      Signed-off-by: NHuang Rui <ray.huang@amd.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: jacob.w.shin@gmail.com
      Link: http://lkml.kernel.org/r/1457502306-2559-1-git-send-email-ray.huang@amd.com
      [ Fixed the modular build. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      c7ab62bf
    • H
      x86/cpufeature, perf/x86: Add AMD Accumulated Power Mechanism feature flag · 01fe03ff
      Huang Rui 提交于
      AMD CPU family 15h model 0x60 introduces a mechanism for measuring
      accumulated power. It is used to report the processor power consumption
      and support for it is indicated by CPUID Fn8000_0007_EDX[12].
      Signed-off-by: NHuang Rui <ray.huang@amd.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andreas Herrmann <herrmann.der.user@googlemail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Frédéric Weisbecker <fweisbec@gmail.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hector Marco-Gisbert <hecmargi@upv.es>
      Cc: Jacob Shin <jacob.w.shin@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Kristen Carlson Accardi <kristen@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Wan Zongshun <Vincent.Wan@amd.com>
      Cc: spg_linux_kernel@amd.com
      Link: http://lkml.kernel.org/r/1452739808-11871-4-git-send-email-ray.huang@amd.com
      [ Resolved conflict and moved the synthetic CPUID slot to 19. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      01fe03ff
    • S
      perf/x86/amd: Add support for new IOMMU performance events · f8519155
      Suravee Suthikulpanit 提交于
      This patch adds new IOMMU performance event based on
      the information in table 74 of the AMD I/O Virtualization Technology
      (IOMMU) Specification (Document Id: 4882, Rev 2.62, Feb 2015)
      Signed-off-by: NSuravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NJoerg Roedel <jroedel@suse.de>
      Acked-by: NJoerg Roedel <jroedel@suse.de>
      Cc: <acme@kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://support.amd.com/TechDocs/48882_IOMMU.pdfSigned-off-by: NIngo Molnar <mingo@kernel.org>
      f8519155
    • H
      perf/x86/amd: Move nodes_per_socket into bsp_init_amd() · 8dfeae0d
      Huang Rui 提交于
      nodes_per_socket is static and it needn't be initialized many
      times during every CPU core init. So move its initialization into
      bsp_init_amd().
      Signed-off-by: NHuang Rui <ray.huang@amd.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andreas Herrmann <herrmann.der.user@googlemail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Frédéric Weisbecker <fweisbec@gmail.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hector Marco-Gisbert <hecmargi@upv.es>
      Cc: Jacob Shin <jacob.w.shin@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: spg_linux_kernel@amd.com
      Link: http://lkml.kernel.org/r/1452739808-11871-2-git-send-email-ray.huang@amd.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      8dfeae0d
    • P
      perf/x86/cqm: Factor out some common code · 27348f38
      Peter Zijlstra 提交于
      Having the same code twice (and once quite ugly) is fragile.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      27348f38
    • V
      perf/x86/mbm: Add support for MBM counter overflow handling · e7ee3e8c
      Vikas Shivappa 提交于
      This patch adds a per package timer which periodically updates the
      memory bandwidth counters for the events that are currently active.
      
      Current patch has a periodic timer every 1s since the SDM guarantees
      that the counter will not overflow in 1s but this time can be definitely
      improved by calibrating on the system. The overflow is really a function
      of the max memory b/w that the socket can support, max counter value and
      scaling factor.
      Signed-off-by: NVikas Shivappa <vikas.shivappa@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NTony Luck <tony.luck@intel.com>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: fenghua.yu@intel.com
      Cc: h.peter.anvin@intel.com
      Cc: ravi.v.shankar@intel.com
      Cc: vikas.shivappa@intel.com
      Link: http://lkml.kernel.org/r/013b756c5006b1c4ca411f3ecf43ed52f19fbf87.1457723885.git.tony.luck@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      e7ee3e8c
    • V
      perf/x86/mbm: Implement RMID recycling · 2d4de837
      Vikas Shivappa 提交于
      RMID could be allocated or deallocated as part of RMID recycling.
      
      When an RMID is allocated for MBM event, the MBM counter needs to be
      initialized because next time we read the counter we need the previous
      value to account for total bytes that went to the memory controller.
      
      Similarly, when RMID is deallocated we need to update the ->count
      variable.
      Signed-off-by: NVikas Shivappa <vikas.shivappa@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NTony Luck <tony.luck@intel.com>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: fenghua.yu@intel.com
      Cc: h.peter.anvin@intel.com
      Cc: ravi.v.shankar@intel.com
      Cc: vikas.shivappa@intel.com
      Link: http://lkml.kernel.org/r/1457652732-4499-6-git-send-email-vikas.shivappa@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      2d4de837
    • T
      perf/x86/mbm: Add memory bandwidth monitoring event management · 87f01cc2
      Tony Luck 提交于
      Includes all the core infrastructure to measure the total_bytes and
      bandwidth.
      
      We have per socket counters for both total system wide L3 external
      bytes and local socket memory-controller bytes. The OS does MSR writes
      to MSR_IA32_QM_EVTSEL and MSR_IA32_QM_CTR to read the counters and
      uses the IA32_PQR_ASSOC_MSR to associate the RMID with the task. The
      tasks have a common RMID for CQM (cache quality of service monitoring)
      and MBM. Hence most of the scheduling code is reused from CQM.
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      [ Restructured rmid_read to not have an obvious hole, removed MBM_CNTR_MAX as its unused. ]
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NVikas Shivappa <vikas.shivappa@linux.intel.com>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: fenghua.yu@intel.com
      Cc: h.peter.anvin@intel.com
      Cc: ravi.v.shankar@intel.com
      Cc: vikas.shivappa@intel.com
      Link: http://lkml.kernel.org/r/abd7aac9a18d93b95b985b931cf258df0164746d.1457723885.git.tony.luck@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      87f01cc2
    • V
      perf/x86/mbm: Add Intel Memory B/W Monitoring enumeration and init · 33c3cc7a
      Vikas Shivappa 提交于
      The MBM init patch enumerates the Intel MBM (Memory b/w monitoring)
      and initializes the perf events and datastructures for monitoring the
      memory b/w.
      
      Its based on original patch series by Tony Luck and Kanaka Juvva.
      
      Memory bandwidth monitoring (MBM) provides OS/VMM a way to monitor
      bandwidth from one level of cache to another. The current patches
      support L3 external bandwidth monitoring. It supports both 'local
      bandwidth' and 'total bandwidth' monitoring for the socket. Local
      bandwidth measures the amount of data sent through the memory controller
      on the socket and total b/w measures the total system bandwidth.
      
      Extending the cache quality of service monitoring (CQM) we add two
      more events to the perf infrastructure:
      
        intel_cqm_llc/local_bytes - bytes sent through local socket memory controller
        intel_cqm_llc/total_bytes - total L3 external bytes sent
      
      The tasks are associated with a Resouce Monitoring ID (RMID) just like
      in CQM and OS uses a MSR write to indicate the RMID of the task during
      scheduling.
      Signed-off-by: NVikas Shivappa <vikas.shivappa@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NTony Luck <tony.luck@intel.com>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: fenghua.yu@intel.com
      Cc: h.peter.anvin@intel.com
      Cc: ravi.v.shankar@intel.com
      Cc: vikas.shivappa@intel.com
      Link: http://lkml.kernel.org/r/1457652732-4499-4-git-send-email-vikas.shivappa@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      33c3cc7a
    • V
      perf/x86/cqm: Fix CQM memory leak and notifier leak · ada2f634
      Vikas Shivappa 提交于
      Fixes the hotcpu notifier leak and other global variable memory leaks
      during CQM (cache quality of service monitoring) initialization.
      Signed-off-by: NVikas Shivappa <vikas.shivappa@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NTony Luck <tony.luck@intel.com>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: fenghua.yu@intel.com
      Cc: h.peter.anvin@intel.com
      Cc: ravi.v.shankar@intel.com
      Cc: vikas.shivappa@intel.com
      Link: http://lkml.kernel.org/r/1457652732-4499-3-git-send-email-vikas.shivappa@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      ada2f634
    • V
      perf/x86/cqm: Fix CQM handling of grouping events into a cache_group · a223c1c7
      Vikas Shivappa 提交于
      Currently CQM (cache quality of service monitoring) is grouping all
      events belonging to same PID to use one RMID. However its not counting
      all of these different events. Hence we end up with a count of zero
      for all events other than the group leader.
      
      The patch tries to address the issue by keeping a flag in the
      perf_event.hw which has other CQM related fields. The field is updated
      at event creation and during grouping.
      Signed-off-by: NVikas Shivappa <vikas.shivappa@linux.intel.com>
      [peterz: Changed hw_perf_event::is_group_event to an int]
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NTony Luck <tony.luck@intel.com>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: fenghua.yu@intel.com
      Cc: h.peter.anvin@intel.com
      Cc: ravi.v.shankar@intel.com
      Cc: vikas.shivappa@intel.com
      Link: http://lkml.kernel.org/r/1457652732-4499-2-git-send-email-vikas.shivappa@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      a223c1c7
    • P
      perf/x86/BTS: Fix RCU usage · e8d8a90f
      Peter Zijlstra 提交于
      This splat reminds us:
      
      [ 8166.045595] [ INFO: suspicious RCU usage. ]
      
      [ 8166.168972]  [<ffffffff81127837>] lockdep_rcu_suspicious+0xe7/0x120
      [ 8166.175966]  [<ffffffff811e0bae>] perf_callchain+0x23e/0x250
      [ 8166.182280]  [<ffffffff811dda3d>] perf_prepare_sample+0x27d/0x350
      [ 8166.189082]  [<ffffffff8100f503>] intel_pmu_drain_bts_buffer+0x133/0x200
      
      ... that as the core code does, one should hold rcu_read_lock() over that
      entire BTS event-output generation sequence as well.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      e8d8a90f
    • P
      perf/x86/ibs: Add IBS interrupt to the dynamic throttle · c2872d38
      Peter Zijlstra 提交于
      Interrupt throttling is normally only done against
      sysctl_perf_event_sample_rate. This means that if that number is too
      high (for whatever reason) you can lock up your machine.
      
      We have, however, a dynamic throttling scheme too, but for that to
      work, we need to add a callback to the interrupt handler, IBS did not
      have this, so add it.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      c2872d38
    • P
      perf/x86/ibs: Fix race with IBS_STARTING state · 5a50f529
      Peter Zijlstra 提交于
      While tracing the IBS bits I saw the NMI hitting between clearing
      IBS_STARTING and the actual MSR writes to disable the counter.
      
      Since IBS_STARTING was cleared, the handler assumed these were spurious
      NMIs and because STOPPING wasn't set yet either, insta-triggered an
      "Unknown NMI".
      
      Cure this by clearing IBS_STARTING after disabling the hardware.
      Tested-by: NBorislav Petkov <bp@suse.de>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      5a50f529
    • P
      perf/x86/ibs: Fix IBS throttle · 0158b83f
      Peter Zijlstra 提交于
      When the IBS IRQ handler get a !0 return from perf_event_overflow;
      meaning it should throttle the event, it only disables it, it doesn't
      call perf_ibs_stop().
      
      This confuses the state machine, as we'll use pmu::start() ->
      perf_ibs_start() to unthrottle.
      Tested-by: NBorislav Petkov <bp@alien8.de>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vince@deater.net>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: dvyukov@google.com
      Cc: oleg@redhat.com
      Cc: panand@redhat.com
      Cc: sasha.levin@oracle.com
      Link: http://lkml.kernel.org/r/20160311142346.GE6344@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      0158b83f
  3. 19 3月, 2016 7 次提交
  4. 18 3月, 2016 3 次提交
    • T
      x86/irq: Cure live lock in fixup_irqs() · 551adc60
      Thomas Gleixner 提交于
      Harry reported, that he's able to trigger a system freeze with cpu hot
      unplug. The freeze turned out to be a live lock caused by recent changes in
      irq_force_complete_move().
      
      When fixup_irqs() and from there irq_force_complete_move() is called on the
      dying cpu, then all other cpus are in stop machine an wait for the dying cpu
      to complete the teardown. If there is a move of an interrupt pending then
      irq_force_complete_move() sends the cleanup IPI to the cpus in the old_domain
      mask and waits for them to clear the mask. That's obviously impossible as
      those cpus are firmly stuck in stop machine with interrupts disabled.
      
      I should have known that, but I completely overlooked it being concentrated on
      the locking issues around the vectors. And the existance of the call to
      __irq_complete_move() in the code, which actually sends the cleanup IPI made
      it reasonable to wait for that cleanup to complete. That call was bogus even
      before the recent changes as it was just a pointless distraction.
      
      We have to look at two cases:
      
      1) The move_in_progress flag of the interrupt is set
      
         This means the ioapic has been updated with the new vector, but it has not
         fired yet. In theory there is a race:
      
         set_ioapic(new_vector) <-- Interrupt is raised before update is effective,
         			      i.e. it's raised on the old vector. 
      
         So if the target cpu cannot handle that interrupt before the old vector is
         cleaned up, we get a spurious interrupt and in the worst case the ioapic
         irq line becomes stale, but my experiments so far have only resulted in
         spurious interrupts.
      
         But in case of cpu hotplug this should be a non issue because if the
         affinity update happens right before all cpus rendevouz in stop machine,
         there is no way that the interrupt can be blocked on the target cpu because
         all cpus loops first with interrupts enabled in stop machine, so the old
         vector is not yet cleaned up when the interrupt fires.
      
         So the only way to run into this issue is if the delivery of the interrupt
         on the apic/system bus would be delayed beyond the point where the target
         cpu disables interrupts in stop machine. I doubt that it can happen, but at
         least there is a theroretical chance. Virtualization might be able to
         expose this, but AFAICT the IOAPIC emulation is not as stupid as the real
         hardware.
      
         I've spent quite some time over the weekend to enforce that situation,
         though I was not able to trigger the delayed case.
      
      2) The move_in_progress flag is not set and the old_domain cpu mask is not
         empty.
      
         That means, that an interrupt was delivered after the change and the
         cleanup IPI has been sent to the cpus in old_domain, but not all CPUs have
         responded to it yet.
      
      In both cases we can assume that the next interrupt will arrive on the new
      vector, so we can cleanup the old vectors on the cpus in the old_domain cpu
      mask.
      
      Fixes: 98229aa3 "x86/irq: Plug vector cleanup race"
      Reported-by: NHarry Junior <harryjr@outlook.fr>
      Tested-by: NTony Luck <tony.luck@intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Joe Lawrence <joe.lawrence@stratus.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1603140931430.3657@nanosSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      551adc60
    • T
      x86/tsc: Prevent NULL pointer deref in calibrate_delay_is_known() · f508a5ba
      Thomas Gleixner 提交于
      The topology_core_cpumask is used to find a neighbour cpu in
      calibrate_delay_is_known(). It might not be allocated at the first invocation
      of that function on the boot cpu, when CONFIG_CPUMASK_OFFSTACK is set.
      
      The mask is allocated later in native_smp_prepare_cpus. As a consequence the
      underlying find_next_bit() call dereferences a NULL pointer.
      
      Add a proper check to prevent this.
      
      Fixes: c25323c0 "x86/tsc: Use topology functions"
      Reported-and-tested-by: NRichard W.M. Jones <rjones@redhat.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Josh Boyer <jwboyer@fedoraproject.org>
      Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1603180843270.3978@nanosSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      f508a5ba
    • D
      x86/apic: Fix suspicious RCU usage in smp_trace_call_function_interrupt() · 7834c103
      Dave Jones 提交于
      Since 4.4, I've been able to trigger this occasionally:
      
      ===============================
      [ INFO: suspicious RCU usage. ]
      4.5.0-rc7-think+ #3 Not tainted
      Cc: Andi Kleen <ak@linux.intel.com>
      Link: http://lkml.kernel.org/r/20160315012054.GA17765@codemonkey.org.ukSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      
      -------------------------------
      ./arch/x86/include/asm/msr-trace.h:47 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      RCU used illegally from idle CPU!
      rcu_scheduler_active = 1, debug_locks = 1
      RCU used illegally from extended quiescent state!
      no locks held by swapper/3/0.
      
      stack backtrace:
      CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.5.0-rc7-think+ #3
       ffffffff92f821e0 1f3e5c340597d7fc ffff880468e07f10 ffffffff92560c2a
       ffff880462145280 0000000000000001 ffff880468e07f40 ffffffff921376a6
       ffffffff93665ea0 0000cc7c876d28da 0000000000000005 ffffffff9383dd60
      Call Trace:
       <IRQ>  [<ffffffff92560c2a>] dump_stack+0x67/0x9d
       [<ffffffff921376a6>] lockdep_rcu_suspicious+0xe6/0x100
       [<ffffffff925ae7a7>] do_trace_write_msr+0x127/0x1a0
       [<ffffffff92061c83>] native_apic_msr_eoi_write+0x23/0x30
       [<ffffffff92054408>] smp_trace_call_function_interrupt+0x38/0x360
       [<ffffffff92d1ca60>] trace_call_function_interrupt+0x90/0xa0
       <EOI>  [<ffffffff92ac5124>] ? cpuidle_enter_state+0x1b4/0x520
      
      Move the entering_irq() call before ack_APIC_irq(), because entering_irq()
      tells the RCU susbstems to end the extended quiescent state, so that the
      following trace call in ack_APIC_irq() works correctly.
      Suggested-by: NAndi Kleen <ak@linux.intel.com>
      Fixes: 4787c368 "x86/tracing: Add irq_enter/exit() in smp_trace_reschedule_interrupt()"
      Signed-off-by: NDave Jones <davej@codemonkey.org.uk>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      7834c103
  5. 17 3月, 2016 2 次提交
  6. 16 3月, 2016 1 次提交
    • T
      x86/mm, x86/mce: Fix return type/value for memcpy_mcsafe() · cbf8b5a2
      Tony Luck 提交于
      Returning a 'bool' was very unpopular. Doubly so because the
      code was just wrong (returning zero for true, one for false;
      great for shell programming, not so good for C).
      
      Change return type to "int". Keep zero as the success indicator
      because it matches other similar code and people may be more
      comfortable writing:
      
      	if (memcpy_mcsafe(to, from, count)) {
      		printk("Sad panda, copy failed\n");
      		...
      	}
      
      Make the failure return value -EFAULT for now.
      
      Reported by: Mika Penttilä <mika.penttila@nextfour.com>
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: mika.penttila@nextfour.com
      Fixes: 92b0729c ("x86/mm, x86/mce: Add memcpy_mcsafe()")
      Link: http://lkml.kernel.org/r/695f14233fa7a54fcac4406c706d7fec228e3f4c.1457993040.git.tony.luck@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      cbf8b5a2
  7. 15 3月, 2016 1 次提交
    • V
      x86/video: Don't assume all FB devices are PCI devices · 743146db
      Vitaly Kuznetsov 提交于
      When booting Hyper-V Generation 2 guests KASAN reports the following
      out-of-bounds access:
      
        BUG: KASAN: slab-out-of-bounds in fb_is_primary_device+0x58/0x70 at addr ffff880079cf0eb0
        Read of size 8 by task swapper/0/1
        ...
         [<ffffffff81581308>] dump_stack+0x63/0x8b
         [<ffffffff812e1f99>] print_trailer+0xf9/0x150
         [<ffffffff812e7344>] object_err+0x34/0x40
         [<ffffffff812e9630>] kasan_report_error+0x230/0x550
         [<ffffffff812e9ee8>] kasan_report+0x58/0x60
         [<ffffffff812e4500>] ? ___slab_alloc+0x80/0x490
         [<ffffffff81878a28>] ? fb_is_primary_device+0x58/0x70
         [<ffffffff812e87cd>] __asan_load8+0x5d/0x70
         [<ffffffff81878a28>] fb_is_primary_device+0x58/0x70
         [<ffffffff8162357a>] register_framebuffer+0xda/0x5b0
         [<ffffffff816234a0>] ? remove_conflicting_framebuffers+0x50/0x50
        ...
      
      The issue is caused by the to_pci_dev() call with no check that the given
      info->device is in fact a PCI device and some FB devices (Hyper-V FB, EFI
      FB,...) are not.
      
      While on it, clean up the function.
      Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Acked-by: NBjorn Helgaas <bhelgaas@google.com>
      Cc: Bjorn Helgaas <helgaas@kernel.org>
      Cc: Cathy Avery <cavery@redhat.com>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1458030033-10122-1-git-send-email-vkuznets@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      743146db
  8. 13 3月, 2016 1 次提交
    • F
      x86/cpufeature: Enable new AVX-512 features · d0500494
      Fenghua Yu 提交于
      A few new AVX-512 instruction groups/features are added in cpufeatures.h
      for enuermation: AVX512DQ, AVX512BW, and AVX512VL.
      
      Clear the flags in fpu__xstate_clear_all_cpu_caps().
      
      The specification for latest AVX-512 including the features can be found at:
      
        https://software.intel.com/sites/default/files/managed/07/b7/319433-023.pdf
      
      Note, I didn't enable the flags in KVM. Hopefully the KVM guys can pick up
      the flags and enable them in KVM.
      Signed-off-by: NFenghua Yu <fenghua.yu@intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Gleb Natapov <gleb@kernel.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Ravi V Shankar <ravi.v.shankar@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kvm@vger.kernel.org
      Link: http://lkml.kernel.org/r/1457667498-37357-1-git-send-email-fenghua.yu@intel.com
      [ Added more detailed feature descriptions. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      d0500494
  9. 12 3月, 2016 2 次提交
    • M
      x86/efi: Fix boot crash by always mapping boot service regions into new EFI page tables · 452308de
      Matt Fleming 提交于
      Some machines have EFI regions in page zero (physical address
      0x00000000) and historically that region has been added to the e820
      map via trim_bios_range(), and ultimately mapped into the kernel page
      tables. It was not mapped via efi_map_regions() as one would expect.
      
      Alexis reports that with the new separate EFI page tables some boot
      services regions, such as page zero, are not mapped. This triggers an
      oops during the SetVirtualAddressMap() runtime call.
      
      For the EFI boot services quirk on x86 we need to memblock_reserve()
      boot services regions until after SetVirtualAddressMap(). Doing that
      while respecting the ownership of regions that may have already been
      reserved by the kernel was the motivation behind this commit:
      
        7d68dc3f ("x86, efi: Do not reserve boot services regions within reserved areas")
      
      That patch was merged at a time when the EFI runtime virtual mappings
      were inserted into the kernel page tables as described above, and the
      trick of setting ->numpages (and hence the region size) to zero to
      track regions that should not be freed in efi_free_boot_services()
      meant that we never mapped those regions in efi_map_regions(). Instead
      we were relying solely on the existing kernel mappings.
      
      Now that we have separate page tables we need to make sure the EFI
      boot services regions are mapped correctly, even if someone else has
      already called memblock_reserve(). Instead of stashing a tag in
      ->numpages, set the EFI_MEMORY_RUNTIME bit of ->attribute. Since it
      generally makes no sense to mark a boot services region as required at
      runtime, it's pretty much guaranteed the firmware will not have
      already set this bit.
      
      For the record, the specific circumstances under which Alexis
      triggered this bug was that an EFI runtime driver on his machine was
      responding to the EVT_SIGNAL_VIRTUAL_ADDRESS_CHANGE event during
      SetVirtualAddressMap().
      
      The event handler for this driver looks like this,
      
        sub rsp,0x28
        lea rdx,[rip+0x2445] # 0xaa948720
        mov ecx,0x4
        call func_aa9447c0  ; call to ConvertPointer(4, & 0xaa948720)
        mov r11,QWORD PTR [rip+0x2434] # 0xaa948720
        xor eax,eax
        mov BYTE PTR [r11+0x1],0x1
        add rsp,0x28
        ret
      
      Which is pretty typical code for an EVT_SIGNAL_VIRTUAL_ADDRESS_CHANGE
      handler. The "mov r11, QWORD PTR [rip+0x2424]" was the faulting
      instruction because ConvertPointer() was being called to convert the
      address 0x0000000000000000, which when converted is left unchanged and
      remains 0x0000000000000000.
      
      The output of the oops trace gave the impression of a standard NULL
      pointer dereference bug, but because we're accessing physical
      addresses during ConvertPointer(), it wasn't. EFI boot services code
      is stored at that address on Alexis' machine.
      Reported-by: NAlexis Murzeau <amurzeau@gmail.com>
      Signed-off-by: NMatt Fleming <matt@codeblueprint.co.uk>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maarten Lankhorst <maarten.lankhorst@canonical.com>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Raphael Hertzog <hertzog@debian.org>
      Cc: Roger Shimizu <rogershimizu@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-efi@vger.kernel.org
      Link: http://lkml.kernel.org/r/1457695163-29632-2-git-send-email-matt@codeblueprint.co.uk
      Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=815125Signed-off-by: NIngo Molnar <mingo@kernel.org>
      452308de
    • B
      x86/fpu: Fix eager-FPU handling on legacy FPU machines · 6e686709
      Borislav Petkov 提交于
      i486 derived cores like Intel Quark support only the very old,
      legacy x87 FPU (FSAVE/FRSTOR, CPUID bit FXSR is not set), and
      our FPU code wasn't handling the saving and restoring there
      properly in the 'eagerfpu' case.
      
      So after we made eagerfpu the default for all CPU types:
      
        58122bf1 x86/fpu: Default eagerfpu=on on all CPUs
      
      these old FPU designs broke. First, Andy Shevchenko reported a splat:
      
        WARNING: CPU: 0 PID: 823 at arch/x86/include/asm/fpu/internal.h:163 fpu__clear+0x8c/0x160
      
      which was us trying to execute FXRSTOR on those machines even though
      they don't support it.
      
      After taking care of that, Bryan O'Donoghue reported that a simple FPU
      test still failed because we weren't initializing the FPU state properly
      on those machines.
      
      Take care of all that.
      Reported-and-tested-by: NBryan O'Donoghue <pure.logic@nexus-software.ie>
      Reported-by: NAndy Shevchenko <andy.shevchenko@gmail.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yu-cheng <yu-cheng.yu@intel.com>
      Link: http://lkml.kernel.org/r/20160311113206.GD4312@pd.tnicSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6e686709
  10. 11 3月, 2016 2 次提交
    • H
      x86/mm/32: Enable full randomization on i386 and X86_32 · 8b8addf8
      Hector Marco-Gisbert 提交于
      Currently on i386 and on X86_64 when emulating X86_32 in legacy mode, only
      the stack and the executable are randomized but not other mmapped files
      (libraries, vDSO, etc.). This patch enables randomization for the
      libraries, vDSO and mmap requests on i386 and in X86_32 in legacy mode.
      
      By default on i386 there are 8 bits for the randomization of the libraries,
      vDSO and mmaps which only uses 1MB of VA.
      
      This patch preserves the original randomness, using 1MB of VA out of 3GB or
      4GB. We think that 1MB out of 3GB is not a big cost for having the ASLR.
      
      The first obvious security benefit is that all objects are randomized (not
      only the stack and the executable) in legacy mode which highly increases
      the ASLR effectiveness, otherwise the attackers may use these
      non-randomized areas. But also sensitive setuid/setgid applications are
      more secure because currently, attackers can disable the randomization of
      these applications by setting the ulimit stack to "unlimited". This is a
      very old and widely known trick to disable the ASLR in i386 which has been
      allowed for too long.
      
      Another trick used to disable the ASLR was to set the ADDR_NO_RANDOMIZE
      personality flag, but fortunately this doesn't work on setuid/setgid
      applications because there is security checks which clear Security-relevant
      flags.
      
      This patch always randomizes the mmap_legacy_base address, removing the
      possibility to disable the ASLR by setting the stack to "unlimited".
      Signed-off-by: NHector Marco-Gisbert <hecmargi@upv.es>
      Acked-by: NIsmael Ripoll Ripoll <iripoll@upv.es>
      Acked-by: NKees Cook <keescook@chromium.org>
      Acked-by: NArjan van de Ven <arjan@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: kees Cook <keescook@chromium.org>
      Link: http://lkml.kernel.org/r/1457639460-5242-1-git-send-email-hecmargi@upv.esSigned-off-by: NIngo Molnar <mingo@kernel.org>
      8b8addf8
    • J
      x86/entry/traps: Show unhandled signal for i386 in do_trap() · 10ee7386
      Jianyu Zhan 提交于
      Commit abd4f750 ("x86: i386-show-unhandled-signals-v3") did turn on
      the showing-unhandled-signal behaviour for i386 for some exception handlers,
      but for no reason do_trap() is left out (my naive guess is because turning it on
      for do_trap() would be too noisy since do_trap() is shared by several exceptions).
      
      And since the same commit make "show_unhandled_signals" a debug tunable(in
      /proc/sys/debug/exception-trace), and x86 by default turning it on.
      
      So it would be strange for i386 users who turing it on manually and expect
      seeing the unhandled signal output in log, but nothing.
      
      This patch turns it on for i386 in do_trap() as well.
      Signed-off-by: NJianyu Zhan <nasa4836@gmail.com>
      Reviewed-by: NJan Beulich <jbeulich@suse.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bp@suse.de
      Cc: dave.hansen@linux.intel.com
      Cc: heukelum@fastmail.fm
      Cc: jbeulich@novell.com
      Cc: jdike@addtoit.com
      Cc: joe@perches.com
      Cc: luto@kernel.org
      Link: http://lkml.kernel.org/r/1457612398-4568-1-git-send-email-nasa4836@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      10ee7386
  11. 10 3月, 2016 3 次提交
    • B
      x86/delay: Avoid preemptible context checks in delay_mwaitx() · 84477336
      Borislav Petkov 提交于
      We do use this_cpu_ptr(&cpu_tss) as a cacheline-aligned, seldomly
      accessed per-cpu var as the MONITORX target in delay_mwaitx(). However,
      when called in preemptible context, this_cpu_ptr -> smp_processor_id() ->
      debug_smp_processor_id() fires:
      
        BUG: using smp_processor_id() in preemptible [00000000] code: udevd/312
        caller is delay_mwaitx+0x40/0xa0
      
      But we don't care about that check - we only need cpu_tss as a MONITORX
      target and it doesn't really matter which CPU's var we're touching as
      we're going idle anyway. Fix that.
      Suggested-by: NAndy Lutomirski <luto@kernel.org>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Huang Rui <ray.huang@amd.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: spg_linux_kernel@amd.com
      Link: http://lkml.kernel.org/r/20160309205622.GG6564@pd.tnicSigned-off-by: NIngo Molnar <mingo@kernel.org>
      84477336
    • P
      KVM: MMU: fix reserved bit check for ept=0/CR0.WP=0/CR4.SMEP=1/EFER.NX=0 · 5f0b8199
      Paolo Bonzini 提交于
      KVM has special logic to handle pages with pte.u=1 and pte.w=0 when
      CR0.WP=1.  These pages' SPTEs flip continuously between two states:
      U=1/W=0 (user and supervisor reads allowed, supervisor writes not allowed)
      and U=0/W=1 (supervisor reads and writes allowed, user writes not allowed).
      
      When SMEP is in effect, however, U=0 will enable kernel execution of
      this page.  To avoid this, KVM also sets NX=1 in the shadow PTE together
      with U=0, making the two states U=1/W=0/NX=gpte.NX and U=0/W=1/NX=1.
      When guest EFER has the NX bit cleared, the reserved bit check thinks
      that the latter state is invalid; teach it that the smep_andnot_wp case
      will also use the NX bit of SPTEs.
      
      Cc: stable@vger.kernel.org
      Reviewed-by: NXiao Guangrong <guangrong.xiao@linux.inel.com>
      Fixes: c258b62bSigned-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      5f0b8199
    • P
      KVM: MMU: fix ept=0/pte.u=1/pte.w=0/CR0.WP=0/CR4.SMEP=1/EFER.NX=0 combo · 844a5fe2
      Paolo Bonzini 提交于
      Yes, all of these are needed. :) This is admittedly a bit odd, but
      kvm-unit-tests access.flat tests this if you run it with "-cpu host"
      and of course ept=0.
      
      KVM runs the guest with CR0.WP=1, so it must handle supervisor writes
      specially when pte.u=1/pte.w=0/CR0.WP=0.  Such writes cause a fault
      when U=1 and W=0 in the SPTE, but they must succeed because CR0.WP=0.
      When KVM gets the fault, it sets U=0 and W=1 in the shadow PTE and
      restarts execution.  This will still cause a user write to fault, while
      supervisor writes will succeed.  User reads will fault spuriously now,
      and KVM will then flip U and W again in the SPTE (U=1, W=0).  User reads
      will be enabled and supervisor writes disabled, going back to the
      originary situation where supervisor writes fault spuriously.
      
      When SMEP is in effect, however, U=0 will enable kernel execution of
      this page.  To avoid this, KVM also sets NX=1 in the shadow PTE together
      with U=0.  If the guest has not enabled NX, the result is a continuous
      stream of page faults due to the NX bit being reserved.
      
      The fix is to force EFER.NX=1 even if the CPU is taking care of the EFER
      switch.  (All machines with SMEP have the CPU_LOAD_IA32_EFER vm-entry
      control, so they do not use user-return notifiers for EFER---if they did,
      EFER.NX would be forced to the same value as the host).
      
      There is another bug in the reserved bit check, which I've split to a
      separate patch for easier application to stable kernels.
      
      Cc: stable@vger.kernel.org
      Cc: Andy Lutomirski <luto@amacapital.net>
      Reviewed-by: NXiao Guangrong <guangrong.xiao@linux.intel.com>
      Fixes: f6577a5fSigned-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      844a5fe2