1. 29 3月, 2016 1 次提交
    • T
      x86/mm/pat: Add support of non-default PAT MSR setting · 02f037d6
      Toshi Kani 提交于
      In preparation for fixing a regression caused by:
      
        9cd25aac ("x86/mm/pat: Emulate PAT when it is disabled")'
      
      ... PAT needs to support a case that PAT MSR is initialized with a
      non-default value.
      
      When pat_init() is called and PAT is disabled, it initializes the
      PAT table with the BIOS default value. Xen, however, sets PAT MSR
      with a non-default value to enable WC. This causes inconsistency
      between the PAT table and PAT MSR when PAT is set to disable on Xen.
      
      Change pat_init() to handle the PAT disable cases properly.  Add
      init_cache_modes() to handle two cases when PAT is set to disable.
      
       1. CPU supports PAT: Set PAT table to be consistent with PAT MSR.
       2. CPU does not support PAT: Set PAT table to be consistent with
          PWT and PCD bits in a PTE.
      
      Note, __init_cache_modes(), renamed from pat_init_cache_modes(),
      will be changed to a static function in a later patch.
      Signed-off-by: NToshi Kani <toshi.kani@hpe.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: elliott@hpe.com
      Cc: konrad.wilk@oracle.com
      Cc: paul.gortmaker@windriver.com
      Cc: xen-devel@lists.xenproject.org
      Link: http://lkml.kernel.org/r/1458769323-24491-2-git-send-email-toshi.kani@hpe.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      02f037d6
  2. 26 3月, 2016 2 次提交
    • A
      mm, kasan: stackdepot implementation. Enable stackdepot for SLAB · cd11016e
      Alexander Potapenko 提交于
      Implement the stack depot and provide CONFIG_STACKDEPOT.  Stack depot
      will allow KASAN store allocation/deallocation stack traces for memory
      chunks.  The stack traces are stored in a hash table and referenced by
      handles which reside in the kasan_alloc_meta and kasan_free_meta
      structures in the allocated memory chunks.
      
      IRQ stack traces are cut below the IRQ entry point to avoid unnecessary
      duplication.
      
      Right now stackdepot support is only enabled in SLAB allocator.  Once
      KASAN features in SLAB are on par with those in SLUB we can switch SLUB
      to stackdepot as well, thus removing the dependency on SLUB stack
      bookkeeping, which wastes a lot of memory.
      
      This patch is based on the "mm: kasan: stack depots" patch originally
      prepared by Dmitry Chernenkov.
      
      Joonsoo has said that he plans to reuse the stackdepot code for the
      mm/page_owner.c debugging facility.
      
      [akpm@linux-foundation.org: s/depot_stack_handle/depot_stack_handle_t]
      [aryabinin@virtuozzo.com: comment style fixes]
      Signed-off-by: NAlexander Potapenko <glider@google.com>
      Signed-off-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cd11016e
    • A
      arch, ftrace: for KASAN put hard/soft IRQ entries into separate sections · be7635e7
      Alexander Potapenko 提交于
      KASAN needs to know whether the allocation happens in an IRQ handler.
      This lets us strip everything below the IRQ entry point to reduce the
      number of unique stack traces needed to be stored.
      
      Move the definition of __irq_entry to <linux/interrupt.h> so that the
      users don't need to pull in <linux/ftrace.h>.  Also introduce the
      __softirq_entry macro which is similar to __irq_entry, but puts the
      corresponding functions to the .softirqentry.text section.
      Signed-off-by: NAlexander Potapenko <glider@google.com>
      Acked-by: NSteven Rostedt <rostedt@goodmis.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      be7635e7
  3. 23 3月, 2016 5 次提交
    • P
      x86/msr: Remove unused native_read_tscp() · 9da77666
      Prarit Bhargava 提交于
      After e76b027e ("x86,vdso: Use LSL unconditionally for vgetcpu")
      native_read_tscp() is unused in the kernel. The function can be removed like
      native_read_tsc() was.
      Signed-off-by: NPrarit Bhargava <prarit@redhat.com>
      Acked-by: NAndy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@suse.de>
      Link: http://lkml.kernel.org/r/1458687968-9106-1-git-send-email-prarit@redhat.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      9da77666
    • A
      x86/extable: use generic search and sort routines · 29934b0f
      Ard Biesheuvel 提交于
      Replace the arch specific versions of search_extable() and
      sort_extable() with calls to the generic ones, which now support
      relative exception tables as well.
      Signed-off-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: NH. Peter Anvin <hpa@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      29934b0f
    • D
      kernel: add kcov code coverage · 5c9a8750
      Dmitry Vyukov 提交于
      kcov provides code coverage collection for coverage-guided fuzzing
      (randomized testing).  Coverage-guided fuzzing is a testing technique
      that uses coverage feedback to determine new interesting inputs to a
      system.  A notable user-space example is AFL
      (http://lcamtuf.coredump.cx/afl/).  However, this technique is not
      widely used for kernel testing due to missing compiler and kernel
      support.
      
      kcov does not aim to collect as much coverage as possible.  It aims to
      collect more or less stable coverage that is function of syscall inputs.
      To achieve this goal it does not collect coverage in soft/hard
      interrupts and instrumentation of some inherently non-deterministic or
      non-interesting parts of kernel is disbled (e.g.  scheduler, locking).
      
      Currently there is a single coverage collection mode (tracing), but the
      API anticipates additional collection modes.  Initially I also
      implemented a second mode which exposes coverage in a fixed-size hash
      table of counters (what Quentin used in his original patch).  I've
      dropped the second mode for simplicity.
      
      This patch adds the necessary support on kernel side.  The complimentary
      compiler support was added in gcc revision 231296.
      
      We've used this support to build syzkaller system call fuzzer, which has
      found 90 kernel bugs in just 2 months:
      
        https://github.com/google/syzkaller/wiki/Found-Bugs
      
      We've also found 30+ bugs in our internal systems with syzkaller.
      Another (yet unexplored) direction where kcov coverage would greatly
      help is more traditional "blob mutation".  For example, mounting a
      random blob as a filesystem, or receiving a random blob over wire.
      
      Why not gcov.  Typical fuzzing loop looks as follows: (1) reset
      coverage, (2) execute a bit of code, (3) collect coverage, repeat.  A
      typical coverage can be just a dozen of basic blocks (e.g.  an invalid
      input).  In such context gcov becomes prohibitively expensive as
      reset/collect coverage steps depend on total number of basic
      blocks/edges in program (in case of kernel it is about 2M).  Cost of
      kcov depends only on number of executed basic blocks/edges.  On top of
      that, kernel requires per-thread coverage because there are always
      background threads and unrelated processes that also produce coverage.
      With inlined gcov instrumentation per-thread coverage is not possible.
      
      kcov exposes kernel PCs and control flow to user-space which is
      insecure.  But debugfs should not be mapped as user accessible.
      
      Based on a patch by Quentin Casasnovas.
      
      [akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
      [akpm@linux-foundation.org: unbreak allmodconfig]
      [akpm@linux-foundation.org: follow x86 Makefile layout standards]
      Signed-off-by: NDmitry Vyukov <dvyukov@google.com>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Cc: syzkaller <syzkaller@googlegroups.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Tavis Ormandy <taviso@google.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: David Drysdale <drysdale@google.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5c9a8750
    • A
      x86/compat: remove is_compat_task() · f970165b
      Andy Lutomirski 提交于
      x86's is_compat_task always checked the current syscall type, not the
      task type.  It has no non-arch users any more, so just remove it to
      avoid confusion.
      
      On x86, nothing should really be checking the task ABI.  There are
      legitimate users for the syscall ABI and for the mm ABI.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f970165b
    • P
      KVM: page_track: fix access to NULL slot · a6adb106
      Paolo Bonzini 提交于
      This happens when doing the reboot test from virt-tests:
      
      [  131.833653] BUG: unable to handle kernel NULL pointer dereference at           (null)
      [  131.842461] IP: [<ffffffffa0950087>] kvm_page_track_is_active+0x17/0x60 [kvm]
      [  131.850500] PGD 0
      [  131.852763] Oops: 0000 [#1] SMP
      [  132.007188] task: ffff880075fbc500 ti: ffff880850a3c000 task.ti: ffff880850a3c000
      [  132.138891] Call Trace:
      [  132.141639]  [<ffffffffa092bd11>] page_fault_handle_page_track+0x31/0x40 [kvm]
      [  132.149732]  [<ffffffffa093380f>] paging64_page_fault+0xff/0x910 [kvm]
      [  132.172159]  [<ffffffffa092c734>] kvm_mmu_page_fault+0x64/0x110 [kvm]
      [  132.179372]  [<ffffffffa06743c2>] handle_exception+0x1b2/0x430 [kvm_intel]
      [  132.187072]  [<ffffffffa067a301>] vmx_handle_exit+0x1e1/0xc50 [kvm_intel]
      ...
      
      Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
      Fixes: 3d0c27adSigned-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a6adb106
  4. 22 3月, 2016 17 次提交
  5. 21 3月, 2016 15 次提交
    • S
      perf/x86/intel/rapl: Add missing Broadwell models · 7b0fd569
      Srinivas Pandruvada 提交于
      Added Broadwell-H and Broadwell-Server.
      Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: bp@alien8.de
      Link: http://lkml.kernel.org/r/1458517938-25308-1-git-send-email-srinivas.pandruvada@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      7b0fd569
    • K
      perf/x86/intel/uncore: Remove ev_sel_ext bit support for PCU · cb225252
      Kan Liang 提交于
      The ev_sel_ext in PCU_MSR_PMON_CTL is locked on some CPU models, so despite
      it being documented in the SDM, if we write 1 to that bit then we can get a #GP
      fault.
      
      Which #GP the perf fuzzer happily triggered in Peter Zijlstra's testing.
      
      Also, there are no public events which use that bit, so remove ev_sel_ext
      bit support for PCU.
      Reported-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NKan Liang <kan.liang@intel.com>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1458500301-3594-1-git-send-email-kan.liang@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      cb225252
    • H
      perf/x86/amd/power: Add AMD accumulated power reporting mechanism · c7ab62bf
      Huang Rui 提交于
      Introduce an AMD accumlated power reporting mechanism for the Family
      15h, Model 60h processor that can be used to calculate the average
      power consumed by a processor during a measurement interval. The
      feature support is indicated by CPUID Fn8000_0007_EDX[12].
      
      This feature will be implemented both in hwmon and perf. The current
      design provides one event to report per package/processor power
      consumption by counting each compute unit power value.
      
      Here the gory details of how the computation is done:
      
      * Tsample: compute unit power accumulator sample period
      * Tref: the PTSC counter period (PTSC: performance timestamp counter)
      * N: the ratio of compute unit power accumulator sample period to the
        PTSC period
      
      * Jmax: max compute unit accumulated power which is indicated by
        MSR_C001007b[MaxCpuSwPwrAcc]
      
      * Jx/Jy: compute unit accumulated power which is indicated by
        MSR_C001007a[CpuSwPwrAcc]
      
      * Tx/Ty: the value of performance timestamp counter which is indicated
        by CU_PTSC MSR_C0010280[PTSC]
      * PwrCPUave: CPU average power
      
      i. Determine the ratio of Tsample to Tref by executing CPUID Fn8000_0007.
      	N = value of CPUID Fn8000_0007_ECX[CpuPwrSampleTimeRatio[15:0]].
      
      ii. Read the full range of the cumulative energy value from the new
          MSR MaxCpuSwPwrAcc.
      	Jmax = value returned.
      
      iii. At time x, software reads CpuSwPwrAcc and samples the PTSC.
      	Jx = value read from CpuSwPwrAcc and Tx = value read from PTSC.
      
      iv. At time y, software reads CpuSwPwrAcc and samples the PTSC.
      	Jy = value read from CpuSwPwrAcc and Ty = value read from PTSC.
      
      v. Calculate the average power consumption for a compute unit over
      time period (y-x). Unit of result is uWatt:
      
      	if (Jy < Jx) // Rollover has occurred
      		Jdelta = (Jy + Jmax) - Jx
      	else
      		Jdelta = Jy - Jx
      	PwrCPUave = N * Jdelta * 1000 / (Ty - Tx)
      
      Simple example:
      
        root@hr-zp:/home/ray/tip# ./tools/perf/perf stat -a -e 'power/power-pkg/' make -j4
          CHK     include/config/kernel.release
          CHK     include/generated/uapi/linux/version.h
          CHK     include/generated/utsrelease.h
          CHK     include/generated/timeconst.h
          CHK     include/generated/bounds.h
          CHK     include/generated/asm-offsets.h
          CALL    scripts/checksyscalls.sh
          CHK     include/generated/compile.h
          SKIPPED include/generated/compile.h
          Building modules, stage 2.
        Kernel: arch/x86/boot/bzImage is ready  (#40)
          MODPOST 4225 modules
      
         Performance counter stats for 'system wide':
      
                    183.44 mWatts power/power-pkg/
      
             341.837270111 seconds time elapsed
      
        root@hr-zp:/home/ray/tip# ./tools/perf/perf stat -a -e 'power/power-pkg/' sleep 10
      
         Performance counter stats for 'system wide':
      
                      0.18 mWatts power/power-pkg/
      
              10.012551815 seconds time elapsed
      Suggested-by: NPeter Zijlstra <peterz@infradead.org>
      Suggested-by: NIngo Molnar <mingo@kernel.org>
      Suggested-by: NBorislav Petkov <bp@suse.de>
      Signed-off-by: NHuang Rui <ray.huang@amd.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: jacob.w.shin@gmail.com
      Link: http://lkml.kernel.org/r/1457502306-2559-1-git-send-email-ray.huang@amd.com
      [ Fixed the modular build. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      c7ab62bf
    • H
      x86/cpufeature, perf/x86: Add AMD Accumulated Power Mechanism feature flag · 01fe03ff
      Huang Rui 提交于
      AMD CPU family 15h model 0x60 introduces a mechanism for measuring
      accumulated power. It is used to report the processor power consumption
      and support for it is indicated by CPUID Fn8000_0007_EDX[12].
      Signed-off-by: NHuang Rui <ray.huang@amd.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andreas Herrmann <herrmann.der.user@googlemail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Frédéric Weisbecker <fweisbec@gmail.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hector Marco-Gisbert <hecmargi@upv.es>
      Cc: Jacob Shin <jacob.w.shin@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Kristen Carlson Accardi <kristen@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Wan Zongshun <Vincent.Wan@amd.com>
      Cc: spg_linux_kernel@amd.com
      Link: http://lkml.kernel.org/r/1452739808-11871-4-git-send-email-ray.huang@amd.com
      [ Resolved conflict and moved the synthetic CPUID slot to 19. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      01fe03ff
    • S
      perf/x86/amd: Add support for new IOMMU performance events · f8519155
      Suravee Suthikulpanit 提交于
      This patch adds new IOMMU performance event based on
      the information in table 74 of the AMD I/O Virtualization Technology
      (IOMMU) Specification (Document Id: 4882, Rev 2.62, Feb 2015)
      Signed-off-by: NSuravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NJoerg Roedel <jroedel@suse.de>
      Acked-by: NJoerg Roedel <jroedel@suse.de>
      Cc: <acme@kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://support.amd.com/TechDocs/48882_IOMMU.pdfSigned-off-by: NIngo Molnar <mingo@kernel.org>
      f8519155
    • H
      perf/x86/amd: Move nodes_per_socket into bsp_init_amd() · 8dfeae0d
      Huang Rui 提交于
      nodes_per_socket is static and it needn't be initialized many
      times during every CPU core init. So move its initialization into
      bsp_init_amd().
      Signed-off-by: NHuang Rui <ray.huang@amd.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andreas Herrmann <herrmann.der.user@googlemail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Frédéric Weisbecker <fweisbec@gmail.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hector Marco-Gisbert <hecmargi@upv.es>
      Cc: Jacob Shin <jacob.w.shin@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: spg_linux_kernel@amd.com
      Link: http://lkml.kernel.org/r/1452739808-11871-2-git-send-email-ray.huang@amd.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      8dfeae0d
    • P
      perf/x86/cqm: Factor out some common code · 27348f38
      Peter Zijlstra 提交于
      Having the same code twice (and once quite ugly) is fragile.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      27348f38
    • V
      perf/x86/mbm: Add support for MBM counter overflow handling · e7ee3e8c
      Vikas Shivappa 提交于
      This patch adds a per package timer which periodically updates the
      memory bandwidth counters for the events that are currently active.
      
      Current patch has a periodic timer every 1s since the SDM guarantees
      that the counter will not overflow in 1s but this time can be definitely
      improved by calibrating on the system. The overflow is really a function
      of the max memory b/w that the socket can support, max counter value and
      scaling factor.
      Signed-off-by: NVikas Shivappa <vikas.shivappa@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NTony Luck <tony.luck@intel.com>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: fenghua.yu@intel.com
      Cc: h.peter.anvin@intel.com
      Cc: ravi.v.shankar@intel.com
      Cc: vikas.shivappa@intel.com
      Link: http://lkml.kernel.org/r/013b756c5006b1c4ca411f3ecf43ed52f19fbf87.1457723885.git.tony.luck@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      e7ee3e8c
    • V
      perf/x86/mbm: Implement RMID recycling · 2d4de837
      Vikas Shivappa 提交于
      RMID could be allocated or deallocated as part of RMID recycling.
      
      When an RMID is allocated for MBM event, the MBM counter needs to be
      initialized because next time we read the counter we need the previous
      value to account for total bytes that went to the memory controller.
      
      Similarly, when RMID is deallocated we need to update the ->count
      variable.
      Signed-off-by: NVikas Shivappa <vikas.shivappa@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NTony Luck <tony.luck@intel.com>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: fenghua.yu@intel.com
      Cc: h.peter.anvin@intel.com
      Cc: ravi.v.shankar@intel.com
      Cc: vikas.shivappa@intel.com
      Link: http://lkml.kernel.org/r/1457652732-4499-6-git-send-email-vikas.shivappa@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      2d4de837
    • T
      perf/x86/mbm: Add memory bandwidth monitoring event management · 87f01cc2
      Tony Luck 提交于
      Includes all the core infrastructure to measure the total_bytes and
      bandwidth.
      
      We have per socket counters for both total system wide L3 external
      bytes and local socket memory-controller bytes. The OS does MSR writes
      to MSR_IA32_QM_EVTSEL and MSR_IA32_QM_CTR to read the counters and
      uses the IA32_PQR_ASSOC_MSR to associate the RMID with the task. The
      tasks have a common RMID for CQM (cache quality of service monitoring)
      and MBM. Hence most of the scheduling code is reused from CQM.
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      [ Restructured rmid_read to not have an obvious hole, removed MBM_CNTR_MAX as its unused. ]
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NVikas Shivappa <vikas.shivappa@linux.intel.com>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: fenghua.yu@intel.com
      Cc: h.peter.anvin@intel.com
      Cc: ravi.v.shankar@intel.com
      Cc: vikas.shivappa@intel.com
      Link: http://lkml.kernel.org/r/abd7aac9a18d93b95b985b931cf258df0164746d.1457723885.git.tony.luck@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      87f01cc2
    • V
      perf/x86/mbm: Add Intel Memory B/W Monitoring enumeration and init · 33c3cc7a
      Vikas Shivappa 提交于
      The MBM init patch enumerates the Intel MBM (Memory b/w monitoring)
      and initializes the perf events and datastructures for monitoring the
      memory b/w.
      
      Its based on original patch series by Tony Luck and Kanaka Juvva.
      
      Memory bandwidth monitoring (MBM) provides OS/VMM a way to monitor
      bandwidth from one level of cache to another. The current patches
      support L3 external bandwidth monitoring. It supports both 'local
      bandwidth' and 'total bandwidth' monitoring for the socket. Local
      bandwidth measures the amount of data sent through the memory controller
      on the socket and total b/w measures the total system bandwidth.
      
      Extending the cache quality of service monitoring (CQM) we add two
      more events to the perf infrastructure:
      
        intel_cqm_llc/local_bytes - bytes sent through local socket memory controller
        intel_cqm_llc/total_bytes - total L3 external bytes sent
      
      The tasks are associated with a Resouce Monitoring ID (RMID) just like
      in CQM and OS uses a MSR write to indicate the RMID of the task during
      scheduling.
      Signed-off-by: NVikas Shivappa <vikas.shivappa@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NTony Luck <tony.luck@intel.com>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: fenghua.yu@intel.com
      Cc: h.peter.anvin@intel.com
      Cc: ravi.v.shankar@intel.com
      Cc: vikas.shivappa@intel.com
      Link: http://lkml.kernel.org/r/1457652732-4499-4-git-send-email-vikas.shivappa@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      33c3cc7a
    • V
      perf/x86/cqm: Fix CQM memory leak and notifier leak · ada2f634
      Vikas Shivappa 提交于
      Fixes the hotcpu notifier leak and other global variable memory leaks
      during CQM (cache quality of service monitoring) initialization.
      Signed-off-by: NVikas Shivappa <vikas.shivappa@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NTony Luck <tony.luck@intel.com>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: fenghua.yu@intel.com
      Cc: h.peter.anvin@intel.com
      Cc: ravi.v.shankar@intel.com
      Cc: vikas.shivappa@intel.com
      Link: http://lkml.kernel.org/r/1457652732-4499-3-git-send-email-vikas.shivappa@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      ada2f634
    • V
      perf/x86/cqm: Fix CQM handling of grouping events into a cache_group · a223c1c7
      Vikas Shivappa 提交于
      Currently CQM (cache quality of service monitoring) is grouping all
      events belonging to same PID to use one RMID. However its not counting
      all of these different events. Hence we end up with a count of zero
      for all events other than the group leader.
      
      The patch tries to address the issue by keeping a flag in the
      perf_event.hw which has other CQM related fields. The field is updated
      at event creation and during grouping.
      Signed-off-by: NVikas Shivappa <vikas.shivappa@linux.intel.com>
      [peterz: Changed hw_perf_event::is_group_event to an int]
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NTony Luck <tony.luck@intel.com>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: fenghua.yu@intel.com
      Cc: h.peter.anvin@intel.com
      Cc: ravi.v.shankar@intel.com
      Cc: vikas.shivappa@intel.com
      Link: http://lkml.kernel.org/r/1457652732-4499-2-git-send-email-vikas.shivappa@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      a223c1c7
    • P
      perf/x86/BTS: Fix RCU usage · e8d8a90f
      Peter Zijlstra 提交于
      This splat reminds us:
      
      [ 8166.045595] [ INFO: suspicious RCU usage. ]
      
      [ 8166.168972]  [<ffffffff81127837>] lockdep_rcu_suspicious+0xe7/0x120
      [ 8166.175966]  [<ffffffff811e0bae>] perf_callchain+0x23e/0x250
      [ 8166.182280]  [<ffffffff811dda3d>] perf_prepare_sample+0x27d/0x350
      [ 8166.189082]  [<ffffffff8100f503>] intel_pmu_drain_bts_buffer+0x133/0x200
      
      ... that as the core code does, one should hold rcu_read_lock() over that
      entire BTS event-output generation sequence as well.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      e8d8a90f
    • P
      perf/x86/ibs: Add IBS interrupt to the dynamic throttle · c2872d38
      Peter Zijlstra 提交于
      Interrupt throttling is normally only done against
      sysctl_perf_event_sample_rate. This means that if that number is too
      high (for whatever reason) you can lock up your machine.
      
      We have, however, a dynamic throttling scheme too, but for that to
      work, we need to add a callback to the interrupt handler, IBS did not
      have this, so add it.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      c2872d38