1. 03 Jul 2016, 1 commit
• perf/x86: Fix 32-bit perf user callgraph collection · fc188225
  Committed by Josh Poimboeuf
      A basic perf callgraph record operation causes an immediate panic on a
      32-bit kernel compiled with CONFIG_CC_STACKPROTECTOR=y:
      
        $ perf record -g ls
        Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: c0404fbd
      
        CPU: 0 PID: 998 Comm: ls Not tainted 4.7.0-rc5+ #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
         c0dd5967 ff7afe1c 00000086 f41dbc2c c07445a0 464c457f f41dbca8 f41dbc44
         c05646f4 f41dbca8 464c457f f41dbca8 464c457f f41dbc54 c04625be c0ce56fc
         c0404fbd f41dbc88 c0404fbd b74668f0 f41dc000 00000000 c0000000 00000000
        Call Trace:
         [<c07445a0>] dump_stack+0x58/0x78
         [<c05646f4>] panic+0x8e/0x1c6
         [<c04625be>] __stack_chk_fail+0x1e/0x30
         [<c0404fbd>] ? perf_callchain_user+0x22d/0x230
         [<c0404fbd>] perf_callchain_user+0x22d/0x230
         [<c055f89f>] get_perf_callchain+0x1ff/0x270
         [<c055f988>] perf_callchain+0x78/0x90
         [<c055c7eb>] perf_prepare_sample+0x24b/0x370
         [<c055c934>] perf_event_output_forward+0x24/0x70
         [<c05531c0>] __perf_event_overflow+0xa0/0x210
         [<c0550a93>] ? cpu_clock_event_read+0x43/0x50
         [<c0553431>] perf_swevent_hrtimer+0x101/0x180
         [<c0456235>] ? kmap_atomic_prot+0x35/0x140
         [<c056dc69>] ? get_page_from_freelist+0x279/0x950
         [<c058fdd8>] ? vma_interval_tree_remove+0x158/0x230
         [<c05939f4>] ? wp_page_copy.isra.82+0x2f4/0x630
         [<c05a050d>] ? page_add_file_rmap+0x1d/0x50
         [<c0565611>] ? unlock_page+0x61/0x80
         [<c0566755>] ? filemap_map_pages+0x305/0x320
         [<c059769f>] ? handle_mm_fault+0xb7f/0x1560
         [<c074cbeb>] ? timerqueue_del+0x1b/0x70
         [<c04cfefe>] ? __remove_hrtimer+0x2e/0x60
         [<c04d017b>] __hrtimer_run_queues+0xcb/0x2a0
         [<c0553330>] ? __perf_event_overflow+0x210/0x210
         [<c04d0a2a>] hrtimer_interrupt+0x8a/0x180
         [<c043ecc2>] local_apic_timer_interrupt+0x32/0x60
         [<c043f643>] smp_apic_timer_interrupt+0x33/0x50
         [<c0b0cd38>] apic_timer_interrupt+0x34/0x3c
        Kernel Offset: disabled
        ---[ end Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: c0404fbd
      
      The panic is caused by the fact that perf_callchain_user() mistakenly
      assumes it's 64-bit only and ends up corrupting the stack.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: stable@vger.kernel.org # v4.5+
      Fixes: 75925e1a ("perf/x86: Optimize stack walk user accesses")
Link: http://lkml.kernel.org/r/1a547f5077ec30f75f9b57074837c3c80df86e5e.1467432113.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      fc188225
2. 17 May 2016, 2 commits
• perf core: Add a 'nr' field to perf_event_callchain_context · 3b1fff08
  Committed by Arnaldo Carvalho de Melo
      We will use it to count how many addresses are in the entry->ip[] array,
      excluding PERF_CONTEXT_{KERNEL,USER,etc} entries, so that we can really
      return the number of entries specified by the user via the relevant
sysctl, kernel.perf_event_max_stack, or via the per event
      perf_event_attr.sample_max_stack knob.
      
      This way we keep the perf_sample->ip_callchain->nr meaning, that is the
      number of entries, be it real addresses or PERF_CONTEXT_ entries, while
      honouring the max_stack knobs, i.e. the end result will be max_stack
      entries if we have at least that many entries in a given stack trace.
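
For reference, the callchain sample layout whose 'nr' semantics this
preserves is, roughly (mirroring the comment in the perf uapi header):

	struct {
		u64 nr;       /* total entries, including PERF_CONTEXT_* markers */
		u64 ips[nr];  /* addresses interleaved with those markers */
	} callchain;          /* emitted when PERF_SAMPLE_CALLCHAIN is set */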
      
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-s8teto51tdqvlfhefndtat9r@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      3b1fff08
• perf core: Pass max stack as a perf_callchain_entry context · cfbcf468
  Committed by Arnaldo Carvalho de Melo
      This makes perf_callchain_{user,kernel}() receive the max stack
      as context for the perf_callchain_entry, instead of accessing
      the global sysctl_perf_event_max_stack.
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: He Kuang <hekuang@huawei.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Milian Wolff <milian.wolff@kdab.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Wang Nan <wangnan0@huawei.com>
      Cc: Zefan Li <lizefan@huawei.com>
Link: http://lkml.kernel.org/n/tip-kolmn1yo40p7jhswxwrc7rrd@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      cfbcf468
3. 05 May 2016, 1 commit
4. 27 Apr 2016, 1 commit
• perf core: Allow setting up max frame stack depth via sysctl · c5dfd78e
  Committed by Arnaldo Carvalho de Melo
      The default remains 127, which is good for most cases, and not even hit
      most of the time, but then for some cases, as reported by Brendan, 1024+
      deep frames are appearing on the radar for things like groovy, ruby.
      
And in some workloads putting a _lower_ cap on this may make sense. One
that is per event still needs to be put in place, though.
      
      The new file is:
      
        # cat /proc/sys/kernel/perf_event_max_stack
        127
      
Changing it:
      
        # echo 256 > /proc/sys/kernel/perf_event_max_stack
        # cat /proc/sys/kernel/perf_event_max_stack
        256
      
      But as soon as there is some event using callchains we get:
      
        # echo 512 > /proc/sys/kernel/perf_event_max_stack
        -bash: echo: write error: Device or resource busy
        #
      
Because we only allocate the callchain percpu data structures when there
is a user, which allows for changing the max easily, it's just a matter
of having no callchain users at that point.
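
As a concrete illustration of what counts as a "callchain user", the
following minimal user-space sketch (error handling omitted; plain
perf_event_open() usage) opens one sampling event with
PERF_SAMPLE_CALLCHAIN. While it is running, the sysctl write above
returns EBUSY, and it succeeds again once the fd is closed:

	#include <linux/perf_event.h>
	#include <string.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	int main(void)
	{
		struct perf_event_attr attr;

		memset(&attr, 0, sizeof(attr));
		attr.size          = sizeof(attr);
		attr.type          = PERF_TYPE_SOFTWARE;
		attr.config        = PERF_COUNT_SW_CPU_CLOCK;
		attr.sample_period = 100000;
		attr.sample_type   = PERF_SAMPLE_CALLCHAIN; /* allocates the callchain buffers */

		int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);

		pause();        /* keep the callchain user alive */
		close(fd);
		return 0;
	}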
Reported-and-Tested-by: Brendan Gregg <brendan.d.gregg@gmail.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: David Ahern <dsahern@gmail.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: He Kuang <hekuang@huawei.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Milian Wolff <milian.wolff@kdab.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Wang Nan <wangnan0@huawei.com>
      Cc: Zefan Li <lizefan@huawei.com>
Link: http://lkml.kernel.org/r/20160426002928.GB16708@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      c5dfd78e
5. 19 Apr 2016, 1 commit
6. 13 Apr 2016, 1 commit
7. 21 Mar 2016, 1 commit
• perf/x86/amd/power: Add AMD accumulated power reporting mechanism · c7ab62bf
  Committed by Huang Rui
Introduce an AMD accumulated power reporting mechanism for the Family
      15h, Model 60h processor that can be used to calculate the average
      power consumed by a processor during a measurement interval. The
      feature support is indicated by CPUID Fn8000_0007_EDX[12].
      
      This feature will be implemented both in hwmon and perf. The current
      design provides one event to report per package/processor power
      consumption by counting each compute unit power value.
      
Here are the gory details of how the computation is done:
      
      * Tsample: compute unit power accumulator sample period
      * Tref: the PTSC counter period (PTSC: performance timestamp counter)
      * N: the ratio of compute unit power accumulator sample period to the
        PTSC period
      
      * Jmax: max compute unit accumulated power which is indicated by
        MSR_C001007b[MaxCpuSwPwrAcc]
      
      * Jx/Jy: compute unit accumulated power which is indicated by
        MSR_C001007a[CpuSwPwrAcc]
      
      * Tx/Ty: the value of performance timestamp counter which is indicated
        by CU_PTSC MSR_C0010280[PTSC]
      * PwrCPUave: CPU average power
      
      i. Determine the ratio of Tsample to Tref by executing CPUID Fn8000_0007.
      	N = value of CPUID Fn8000_0007_ECX[CpuPwrSampleTimeRatio[15:0]].
      
      ii. Read the full range of the cumulative energy value from the new
          MSR MaxCpuSwPwrAcc.
      	Jmax = value returned.
      
      iii. At time x, software reads CpuSwPwrAcc and samples the PTSC.
      	Jx = value read from CpuSwPwrAcc and Tx = value read from PTSC.
      
      iv. At time y, software reads CpuSwPwrAcc and samples the PTSC.
      	Jy = value read from CpuSwPwrAcc and Ty = value read from PTSC.
      
      v. Calculate the average power consumption for a compute unit over
      time period (y-x). Unit of result is uWatt:
      
      	if (Jy < Jx) // Rollover has occurred
      		Jdelta = (Jy + Jmax) - Jx
      	else
      		Jdelta = Jy - Jx
      	PwrCPUave = N * Jdelta * 1000 / (Ty - Tx)
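
A plain-C transcription of step v (an illustration of the arithmetic only,
not the driver code; names follow the definitions above):

	#include <stdint.h>

	/* Average compute-unit power in uW over the interval [x, y]. */
	static uint64_t pwr_cpu_ave(uint64_t jx, uint64_t jy, uint64_t jmax,
				    uint64_t tx, uint64_t ty, uint64_t n)
	{
		uint64_t jdelta;

		if (jy < jx)                    /* accumulator rollover */
			jdelta = (jy + jmax) - jx;
		else
			jdelta = jy - jx;

		return n * jdelta * 1000 / (ty - tx);
	}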
      
      Simple example:
      
        root@hr-zp:/home/ray/tip# ./tools/perf/perf stat -a -e 'power/power-pkg/' make -j4
          CHK     include/config/kernel.release
          CHK     include/generated/uapi/linux/version.h
          CHK     include/generated/utsrelease.h
          CHK     include/generated/timeconst.h
          CHK     include/generated/bounds.h
          CHK     include/generated/asm-offsets.h
          CALL    scripts/checksyscalls.sh
          CHK     include/generated/compile.h
          SKIPPED include/generated/compile.h
          Building modules, stage 2.
        Kernel: arch/x86/boot/bzImage is ready  (#40)
          MODPOST 4225 modules
      
         Performance counter stats for 'system wide':
      
                    183.44 mWatts power/power-pkg/
      
             341.837270111 seconds time elapsed
      
        root@hr-zp:/home/ray/tip# ./tools/perf/perf stat -a -e 'power/power-pkg/' sleep 10
      
         Performance counter stats for 'system wide':
      
                      0.18 mWatts power/power-pkg/
      
              10.012551815 seconds time elapsed
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Ingo Molnar <mingo@kernel.org>
Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: jacob.w.shin@gmail.com
      Link: http://lkml.kernel.org/r/1457502306-2559-1-git-send-email-ray.huang@amd.com
      [ Fixed the modular build. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c7ab62bf
8. 08 Mar 2016, 1 commit
• perf/x86/intel: Fix PEBS warning by only restoring active PMU in pmi · c3d266c8
  Committed by Kan Liang
      This patch tries to fix a PEBS warning found in my stress test. The
      following perf command can easily trigger the pebs warning or spurious
      NMI error on Skylake/Broadwell/Haswell platforms:
      
        sudo perf record -e 'cpu/umask=0x04,event=0xc4/pp,cycles,branches,ref-cycles,cache-misses,cache-references' --call-graph fp -b -c1000 -a
      
      Also the NMI watchdog must be enabled.
      
In this case, the number of events is larger than the number of counters,
so perf has to multiplex.
      
      In perf_mux_hrtimer_handler, it does perf_pmu_disable(), schedule out
      old events, rotate_ctx, schedule in new events and finally
      perf_pmu_enable().
      
If the old events include a precise event, MSR_IA32_PEBS_ENABLE should be
cleared by perf_pmu_disable(), and it should stay 0 until perf_pmu_enable()
is called and the new event is a precise event.
      
However, there is a corner case which can restore PEBS_ENABLE to a stale
value during this window. In perf_pmu_disable(), GLOBAL_CTRL is set to 0
to stop overflows and the PMIs that follow them. But a PMI from an earlier
overflow may still be pending and cannot be stopped, so even with
GLOBAL_CTRL cleared the kernel can still take a PMI. At the end of the PMI
handler, __intel_pmu_enable_all() is called, which restores the stale
values if the old events have not been scheduled out yet.
      
Once the stale PEBS value is set, it cannot be corrected if the new events
are non-precise, because pebs_enabled will be 0 and x86_pmu.enable_all()
will ignore the MSR_IA32_PEBS_ENABLE setting. As a result, the next NMI
with the stale PEBS_ENABLE triggers the PEBS warning.
      
The pending PMI after enabled=0 becomes harmless if the NMI handler does
not change the state. This patch checks cpuc->enabled in the PMI handler
and only restores the state when the PMU is active.
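
A rough sketch of that check (simplified, not the full handler; the helper
name follows the current x86 PMU code):

	/* at the tail of the PMI handler */
	if (cpuc->enabled)
		__intel_pmu_enable_all(0, true);   /* only restore state for an active PMU */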
      
      Here is the dump:
      
        Call Trace:
         <NMI>  [<ffffffff813c3a2e>] dump_stack+0x63/0x85
         [<ffffffff810a46f2>] warn_slowpath_common+0x82/0xc0
         [<ffffffff810a483a>] warn_slowpath_null+0x1a/0x20
         [<ffffffff8100fe2e>] intel_pmu_drain_pebs_nhm+0x2be/0x320
         [<ffffffff8100caa9>] intel_pmu_handle_irq+0x279/0x460
         [<ffffffff810639b6>] ? native_write_msr_safe+0x6/0x40
         [<ffffffff811f290d>] ? vunmap_page_range+0x20d/0x330
         [<ffffffff811f2f11>] ?  unmap_kernel_range_noflush+0x11/0x20
         [<ffffffff8148379f>] ? ghes_copy_tofrom_phys+0x10f/0x2a0
         [<ffffffff814839c8>] ? ghes_read_estatus+0x98/0x170
         [<ffffffff81005a7d>] perf_event_nmi_handler+0x2d/0x50
         [<ffffffff810310b9>] nmi_handle+0x69/0x120
         [<ffffffff810316f6>] default_do_nmi+0xe6/0x100
         [<ffffffff810317f2>] do_nmi+0xe2/0x130
         [<ffffffff817aea71>] end_repeat_nmi+0x1a/0x1e
         [<ffffffff810639b6>] ? native_write_msr_safe+0x6/0x40
         [<ffffffff810639b6>] ? native_write_msr_safe+0x6/0x40
         [<ffffffff810639b6>] ? native_write_msr_safe+0x6/0x40
         <<EOE>>  <IRQ>  [<ffffffff81006df8>] ?  x86_perf_event_set_period+0xd8/0x180
         [<ffffffff81006eec>] x86_pmu_start+0x4c/0x100
         [<ffffffff8100722d>] x86_pmu_enable+0x28d/0x300
         [<ffffffff811994d7>] perf_pmu_enable.part.81+0x7/0x10
         [<ffffffff8119cb70>] perf_mux_hrtimer_handler+0x200/0x280
         [<ffffffff8119c970>] ?  __perf_install_in_context+0xc0/0xc0
         [<ffffffff8110f92d>] __hrtimer_run_queues+0xfd/0x280
         [<ffffffff811100d8>] hrtimer_interrupt+0xa8/0x190
         [<ffffffff81199080>] ?  __perf_read_group_add.part.61+0x1a0/0x1a0
         [<ffffffff81051bd8>] local_apic_timer_interrupt+0x38/0x60
         [<ffffffff817af01d>] smp_apic_timer_interrupt+0x3d/0x50
         [<ffffffff817ad15c>] apic_timer_interrupt+0x8c/0xa0
         <EOI>  [<ffffffff81199080>] ?  __perf_read_group_add.part.61+0x1a0/0x1a0
         [<ffffffff81123de5>] ?  smp_call_function_single+0xd5/0x130
         [<ffffffff81123ddb>] ?  smp_call_function_single+0xcb/0x130
         [<ffffffff81199080>] ?  __perf_read_group_add.part.61+0x1a0/0x1a0
         [<ffffffff8119765a>] event_function_call+0x10a/0x120
         [<ffffffff8119c660>] ? ctx_resched+0x90/0x90
         [<ffffffff811971e0>] ? cpu_clock_event_read+0x30/0x30
         [<ffffffff811976d0>] ? _perf_event_disable+0x60/0x60
         [<ffffffff8119772b>] _perf_event_enable+0x5b/0x70
         [<ffffffff81197388>] perf_event_for_each_child+0x38/0xa0
         [<ffffffff811976d0>] ? _perf_event_disable+0x60/0x60
         [<ffffffff811a0ffd>] perf_ioctl+0x12d/0x3c0
         [<ffffffff8134d855>] ? selinux_file_ioctl+0x95/0x1e0
         [<ffffffff8124a3a1>] do_vfs_ioctl+0xa1/0x5a0
         [<ffffffff81036d29>] ? sched_clock+0x9/0x10
         [<ffffffff8124a919>] SyS_ioctl+0x79/0x90
         [<ffffffff817ac4b2>] entry_SYSCALL_64_fastpath+0x1a/0xa4
        ---[ end trace aef202839fe9a71d ]---
        Uhhuh. NMI received for unknown reason 2d on CPU 2.
        Do you have a strange power saving mode enabled?
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1457046448-6184-1-git-send-email-kan.liang@intel.com
      [ Fixed various typos and other small details. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c3d266c8
9. 17 Feb 2016, 1 commit
10. 09 Feb 2016, 1 commit
11. 03 Feb 2016, 1 commit
12. 06 Jan 2016, 2 commits
• perf/x86: Fix filter_events() bug with event mappings · 61b87cae
  Committed by Stephane Eranian
This patch fixes a bug in the filter_events() function whereby, if some
mappings did not exist, e.g. STALLED_CYCLES_FRONTEND, then any event after
it in the attrs array would disappear from the published list of events in
/sys/devices/cpu/events. This could easily be verified on any post-SNB
system (which does not publish STALLED_CYCLES_FRONTEND):
      
      	$ ./perf stat -e cycles,ref-cycles true
      	Performance counter stats for 'true':
                    1,217,348      cycles
      	<not supported>      ref-cycles
      
      The problem is that in filter_events() there is an assumption
      that the argument (attrs) is organized in increasing continuous
      event indexes related to the event_map(). But if we remove the
non-supported events by shifting their position in the array, then
      the lookup x86_pmu.event_map() needs to compensate for it, otherwise
      we are looking up the wrong index. This patch corrects this problem
      by compensating for the deleted events and with that ref-cycles
      reappears (here shown on Haswell):
      
      	$ perf stat -e ref-cycles,cycles true
      	Performance counter stats for 'true':
               4,525,910      ref-cycles
               1,064,920      cycles
             0.002943888 seconds time elapsed
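
A self-contained toy illustration of the compensation (the event values
below are made up; this is not the kernel code):

	#include <stdio.h>

	#define NR_EVENTS 5

	/* 0 means "mapping does not exist", as for STALLED_CYCLES_FRONTEND post-SNB */
	static const unsigned long event_map[NR_EVENTS] = { 0x3c, 0xc0, 0, 0x4f2e, 0x412e };

	int main(void)
	{
		int i, deleted = 0;

		for (i = 0; i + deleted < NR_EVENTS; i++) {
			/* position i in the compacted attrs array corresponds to
			 * original index i + deleted; skip missing mappings */
			while (i + deleted < NR_EVENTS && !event_map[i + deleted])
				deleted++;
			if (i + deleted >= NR_EVENTS)
				break;
			printf("attrs[%d] -> event_map[%d] = %#lx\n",
			       i, i + deleted, event_map[i + deleted]);
		}
		return 0;
	}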
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: jolsa@kernel.org
      Cc: kan.liang@intel.com
      Fixes: 8300daa2 ("perf/x86: Filter out undefined events from sysfs events attribute")
Link: http://lkml.kernel.org/r/1449516805-6637-1-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      61b87cae
• perf/x86: Use INST_RETIRED.PREC_DIST for cycles: ppp · 72469764
  Committed by Andi Kleen
      Add a new 'three-p' precise level, that uses INST_RETIRED.PREC_DIST as
      base. The basic mechanism of abusing the inverse cmask to get all
      cycles works the same as before.
      
      PREC_DIST is available on Sandy Bridge or later. It had some problems
      on Sandy Bridge, so we only use it on IvyBridge and later. I tested it
      on Broadwell and Skylake.
      
PREC_DIST has special support for avoiding shadow effects, which can
give better results compared to UOPS_RETIRED. The drawback is that
      PREC_DIST can only schedule on counter 1, but that is ok for cycle
      sampling, as there is normally no need to do multiple cycle sampling
      runs in parallel. It is still possible to run perf top in parallel, as
      that doesn't use precise mode. Also of course the multiplexing can
      still allow parallel operation.
      
      :pp stays with the previous event.
      
      Example:
      
      Sample a loop with 10 sqrt with old cycles:pp
      
      	  0.14 │10:   sqrtps %xmm1,%xmm0     <--------------
      	  9.13 │      sqrtps %xmm1,%xmm0
      	 11.58 │      sqrtps %xmm1,%xmm0
      	 11.51 │      sqrtps %xmm1,%xmm0
      	  6.27 │      sqrtps %xmm1,%xmm0
      	 10.38 │      sqrtps %xmm1,%xmm0
      	 12.20 │      sqrtps %xmm1,%xmm0
      	 12.74 │      sqrtps %xmm1,%xmm0
      	  5.40 │      sqrtps %xmm1,%xmm0
      	 10.14 │      sqrtps %xmm1,%xmm0
      	 10.51 │    ↑ jmp    10
      
We expect all 10 sqrt instructions to get roughly the same number of samples.
      
      But you can see that the instruction directly after the JMP is
      systematically underestimated in the result, due to sampling shadow
      effects.
      
      With the new PREC_DIST based sampling this problem is gone and all
      instructions show up roughly evenly:
      
      	  9.51 │10:   sqrtps %xmm1,%xmm0
      	 11.74 │      sqrtps %xmm1,%xmm0
      	 11.84 │      sqrtps %xmm1,%xmm0
      	  6.05 │      sqrtps %xmm1,%xmm0
      	 10.46 │      sqrtps %xmm1,%xmm0
      	 12.25 │      sqrtps %xmm1,%xmm0
      	 12.18 │      sqrtps %xmm1,%xmm0
      	  5.26 │      sqrtps %xmm1,%xmm0
      	 10.13 │      sqrtps %xmm1,%xmm0
      	 10.43 │      sqrtps %xmm1,%xmm0
      	  0.16 │    ↑ jmp    10
      
      Even with PREC_DIST there is still sampling skid and the result is not
      completely even, but systematic shadow effects are significantly
      reduced.
      
      The improvements are mainly expected to make a difference in high IPC
      code. With low IPC it should be similar.
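
Requesting the new level is just a matter of adding the third 'p'
modifier (the workload name below is only a placeholder):

	$ perf record -e cycles:ppp ./sqrt-loop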
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: hpa@zytor.com
Link: http://lkml.kernel.org/r/1448929689-13771-2-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      72469764
13. 23 Nov 2015, 2 commits
• perf/x86: Optimize stack walk user accesses · 75925e1a
  Committed by Andi Kleen
Change the perf user stack walking to use the new
__copy_from_user_nmi(), and split each access into word-sized transfers.
This allows the complete access to be inlined and optimized down to a
single load.
      
      The main advantage is that this avoids the overhead of double page
      faults.  When normal copy_from_user() fails it reexecutes the copy to
      compute an accurate number of non copied bytes. This leads to
      executing the expensive page fault twice.
      
      While walking stacks having a fault at some point is relatively common
      (typically when some part of the program isn't compiled with frame
      pointers), so this is a large overhead.
      
      With the optimized copies we avoid this problem because they only do
      all accesses once. And of course they're much faster too when the
      access does not fault because they're just single instructions instead
      of complex function calls.
      
      While profiling a kernel build with -g, the patch brings down the
      average time of the PMI handler from 966ns to 552ns (-43%).
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1445551641-13379-2-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      75925e1a
• treewide: Remove old email address · 90eec103
  Committed by Peter Zijlstra
      There were still a number of references to my old Red Hat email
      address in the kernel source. Remove these while keeping the
      Red Hat copyright notices intact.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      90eec103
14. 13 Sep 2015, 2 commits
• perf/core: Drop PERF_EVENT_TXN · 8f3e5684
  Committed by Sukadev Bhattiprolu
      We currently use PERF_EVENT_TXN flag to determine if we are in the middle
      of a transaction. If in a transaction, we defer the schedulability checks
      from pmu->add() operation to the pmu->commit() operation.
      
      Now that we have "transaction types" (PERF_PMU_TXN_ADD, PERF_PMU_TXN_READ)
      we can use the type to determine if we are in a transaction and drop the
      PERF_EVENT_TXN flag.
      
      When PERF_EVENT_TXN is dropped, the cpuhw->group_flag on some architectures
      becomes unused, so drop that field as well.
      
      This is an extension of the Powerpc patch from Peter Zijlstra to s390,
      Sparc and x86 architectures.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1441336073-22750-11-git-send-email-sukadev@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8f3e5684
• perf/core: Add a 'flags' parameter to the PMU transactional interfaces · fbbe0701
  Committed by Sukadev Bhattiprolu
      Currently, the PMU interface allows reading only one counter at a time.
But some PMUs, like the 24x7 counters in Power, support reading several
counters at once. To leverage this functionality, extend the transaction
      interface to support a "transaction type".
      
      The first type, PERF_PMU_TXN_ADD, refers to the existing transactions,
      i.e. used to _schedule_ all the events on the PMU as a group. A second
      transaction type, PERF_PMU_TXN_READ, will be used in a follow-on patch,
      by the 24x7 counters to read several counters at once.
      
      Extend the transaction interfaces to the PMU to accept a 'txn_flags'
      parameter and use this parameter to ignore any transactions that are
      not of type PERF_PMU_TXN_ADD.
      
      Thanks to Peter Zijlstra for his input.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
[peterz: s390 compile fix]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1441336073-22750-3-git-send-email-sukadev@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      fbbe0701
15. 04 Aug 2015, 1 commit
16. 31 Jul 2015, 2 commits
• x86/ldt: Make modify_ldt() optional · a5b9e5a2
  Committed by Andy Lutomirski
      The modify_ldt syscall exposes a large attack surface and is
      unnecessary for modern userspace.  Make it optional.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jan Beulich <jbeulich@suse.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: security@kernel.org <security@kernel.org>
      Cc: xen-devel <xen-devel@lists.xen.org>
      Link: http://lkml.kernel.org/r/a605166a771c343fd64802dece77a903507333bd.1438291540.git.luto@kernel.org
      [ Made MATH_EMULATION dependent on MODIFY_LDT_SYSCALL. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a5b9e5a2
• x86/ldt: Make modify_ldt synchronous · 37868fe1
  Committed by Andy Lutomirski
      modify_ldt() has questionable locking and does not synchronize
      threads.  Improve it: redesign the locking and synchronize all
      threads' LDTs using an IPI on all modifications.
      
      This will dramatically slow down modify_ldt in multithreaded
      programs, but there shouldn't be any multithreaded programs that
      care about modify_ldt's performance in the first place.
      
      This fixes some fallout from the CVE-2015-5157 fixes.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jan Beulich <jbeulich@suse.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: security@kernel.org <security@kernel.org>
      Cc: <stable@vger.kernel.org>
      Cc: xen-devel <xen-devel@lists.xen.org>
Link: http://lkml.kernel.org/r/4c6978476782160600471bd865b318db34c7b628.1438291540.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      37868fe1
17. 06 Jul 2015, 1 commit
18. 30 Jun 2015, 1 commit
• perf/x86: Fix 'active_events' imbalance · 93472aff
  Committed by Peter Zijlstra
      Commit 1b7b938f ("perf/x86/intel: Fix PMI handling for Intel PT") conditionally
      increments active_events in x86_add_exclusive() but unconditionally decrements in
      x86_del_exclusive().
      
      These extra decrements can lead to the situation where
      active_events is zero and thus the PMI handler is 'disabled'
      while we have active events on the PMU generating PMIs.
      
      This leads to a truckload of:
      
        Uhhuh. NMI received for unknown reason 21 on CPU 28.
        Do you have a strange power saving mode enabled?
        Dazed and confused, but trying to continue
      
      messages and generally messes up perf.
      
Remove the condition on the increment; a double increment balanced
by a double decrement is perfectly fine.
      
      Restructure the code a little bit to make the unconditional inc
      a bit more natural.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: alexander.shishkin@linux.intel.com
      Cc: brgerst@gmail.com
      Cc: dvlasenk@redhat.com
      Cc: luto@amacapital.net
      Cc: oleg@redhat.com
      Fixes: 1b7b938f ("perf/x86/intel: Fix PMI handling for Intel PT")
Link: http://lkml.kernel.org/r/20150624144750.GJ18673@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      93472aff
19. 19 Jun 2015, 2 commits
• perf/x86/intel: Fix PMI handling for Intel PT · 1b7b938f
  Committed by Alexander Shishkin
      Intel PT is a separate PMU and it is not using any of the x86_pmu
      code paths, which means in particular that the active_events counter
      remains intact when new PT events are created.
      
      However, PT uses the generic x86_pmu PMI handler for its PMI handling needs.
      
      The problem here is that the latter checks active_events and in case of it
being zero, exits without calling the actual x86_pmu.handle_irq(), which
      results in unknown NMI errors and massive data loss for PT.
      
      The effect is not visible if there are other perf events in the system
      at the same time that keep active_events counter non-zero, for instance
      if the NMI watchdog is running, so one needs to disable it to reproduce
      the problem.
      
      At the same time, the active_events counter besides doing what the name
      suggests also implicitly serves as a PMC hardware and DS area reference
      counter.
      
      This patch adds a separate reference counter for the PMC hardware, leaving
      active_events for actually counting the events and makes sure it also
      counts PT and BTS events.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: adrian.hunter@intel.com
Link: http://lkml.kernel.org/r/87k2v92t0s.fsf@ashishki-desk.ger.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1b7b938f
• perf/x86/intel/bts: Fix DS area sharing with x86_pmu events · 6b099d9b
  Committed by Alexander Shishkin
      Currently, the intel_bts driver relies on the DS area allocated by the x86_pmu
      code in its event_init() path, which is a bug: creating a BTS event while
      no x86_pmu events are present results in a NULL pointer dereference.
      
      The same DS area is also used by PEBS sampling, which makes it quite a bit
      trickier to have a separate one for intel_bts' purposes.
      
      This patch makes intel_bts driver use the same DS allocation and reference
      counting code as x86_pmu to make sure it is always present when either
      intel_bts or x86_pmu need it.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: adrian.hunter@intel.com
Link: http://lkml.kernel.org/r/1434024837-9916-2-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6b099d9b
20. 07 Jun 2015, 1 commit
21. 27 May 2015, 5 commits
• perf/x86: Simplify the x86_schedule_events() logic · 8736e548
  Committed by Peter Zijlstra
      	!x && y == ! (x || !y)
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8736e548
• perf/x86/intel: Remove intel_excl_states::init_state · 43ef205b
  Committed by Peter Zijlstra
      For some obscure reason intel_{start,stop}_scheduling() copy the HT
      state to an intermediate array. This would make sense if we ever were
      to make changes to it which we'd have to discard.
      
Except we don't. By the time we call intel_commit_scheduling() we're,
as the name implies, committed to them. We'll never back out.
      
A further hint that it's pointless is that stop_scheduling() unconditionally
      publishes the state.
      
      So the intermediate array is pointless, modify the state in place and
      kill the extra array.
      
      And remove the pointless array initialization: INTEL_EXCL_UNUSED == 0.
      
Note: all of this is serialized by intel_excl_cntrs::lock.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      43ef205b
• perf/x86: Tweak broken BIOS rules during check_hw_exists() · 68ab7476
  Committed by Don Zickus
      I stumbled upon an AMD box that had the BIOS using a hardware performance
      counter. Instead of printing out a warning and continuing, it failed and
      blocked further perf counter usage.
      
      Looking through the history, I found this commit:
      
        a5ebe0ba ("perf/x86: Check all MSRs before passing hw check")
      
      which tweaked the rules for a Xen guest on an almost identical box and now
      changed the behaviour.
      
      Unfortunately the rules were tweaked incorrectly and will always lead to
      MSR failures even though the MSRs are completely fine.
      
      What happens now is in arch/x86/kernel/cpu/perf_event.c::check_hw_exists():
      
      <snip>
              for (i = 0; i < x86_pmu.num_counters; i++) {
                      reg = x86_pmu_config_addr(i);
                      ret = rdmsrl_safe(reg, &val);
                      if (ret)
                              goto msr_fail;
                      if (val & ARCH_PERFMON_EVENTSEL_ENABLE) {
                              bios_fail = 1;
                              val_fail = val;
                              reg_fail = reg;
                      }
              }
      
      <snip>
              /*
               * Read the current value, change it and read it back to see if it
               * matches, this is needed to detect certain hardware emulators
               * (qemu/kvm) that don't trap on the MSR access and always return 0s.
               */
              reg = x86_pmu_event_addr(0);
      				^^^^
      
      if the first perf counter is enabled, then this routine will always fail
      because the counter is running. :-(
      
              if (rdmsrl_safe(reg, &val))
                      goto msr_fail;
              val ^= 0xffffUL;
              ret = wrmsrl_safe(reg, val);
              ret |= rdmsrl_safe(reg, &val_new);
              if (ret || val != val_new)
                      goto msr_fail;
      
      The above bios_fail used to be a 'goto' which is why it worked in the past.
      
Further, most vendors have migrated to using fixed counters to hide their
evilness, hence this problem rarely shows up nowadays except on a few old boxes.
      
I fixed my problem and kept the spirit of the original Xen fix by recording a
non-enabled register to be used safely for the read/write check. Because it is
not enabled, this passes on bare-metal boxes (like mine), but should continue
to throw an msr_fail on Xen guests because the register isn't emulated yet.
      
      Now I get a proper bios_fail error message and Xen should still see their
      msr_fail message (untested).
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: george.dunlap@eu.citrix.com
      Cc: konrad.wilk@oracle.com
Link: http://lkml.kernel.org/r/1431976608-56970-1-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      68ab7476
• perf/x86: Improve HT workaround GP counter constraint · cc1790cf
  Committed by Peter Zijlstra
      The (SNB/IVB/HSW) HT bug only affects events that can be programmed
      onto GP counters, therefore we should only limit the number of GP
      counters that can be used per cpu -- iow we should not constrain the
      FP counters.
      
Furthermore, we should only enforce such a limit when there are in fact
exclusive events being scheduled on either sibling.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      [ Fixed build fail for the !CONFIG_CPU_SUP_INTEL case. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      cc1790cf
• perf/x86: Fix event/group validation · b371b594
  Committed by Peter Zijlstra
Commit 43b45780 ("perf/x86: Reduce stack usage of
x86_schedule_events()") violated the rule that 'fake' scheduling, as
used for event/group validation, should not change the event state.
      
      This went mostly un-noticed because repeated calls of
      x86_pmu::get_event_constraints() would give the same result. And
      x86_pmu::put_event_constraints() would mostly not do anything.
      
      Commit e979121b ("perf/x86/intel: Implement cross-HT corruption
      bug workaround") made the situation much worse by actually setting the
      event->hw.constraint value to NULL, so when validation and actual
      scheduling interact we get NULL ptr derefs.
      
      Fix it by removing the constraint pointer from the event and move it
      back to an array, this time in cpuc instead of on the stack.
      
      validate_group()
        x86_schedule_events()
          event->hw.constraint = c; # store
      
            <context switch>
              perf_task_event_sched_in()
                ...
                  x86_schedule_events();
                    event->hw.constraint = c2; # store
      
                    ...
      
                    put_event_constraints(event); # assume failure to schedule
                      intel_put_event_constraints()
                        event->hw.constraint = NULL;
      
            <context switch end>
      
          c = event->hw.constraint; # read -> NULL
      
          if (!test_bit(hwc->idx, c->idxmsk)) # <- *BOOM* NULL deref
      
      This in particular is possible when the event in question is a
      cpu-wide event and group-leader, where the validate_group() tries to
      add an event to the group.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 43b45780 ("perf/x86: Reduce stack usage of x86_schedule_events()")
      Fixes: e979121b ("perf/x86/intel: Implement cross-HT corruption bug workaround")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b371b594
22. 02 Apr 2015, 7 commits
23. 31 Mar 2015, 1 commit
• x86/asm/entry: Remove user_mode_ignore_vm86() · 55474c48
  Committed by Ingo Molnar
      user_mode_ignore_vm86() can be used instead of user_mode(), in
      places where we have already done a v8086_mode() security
      check of ptregs.
      
      But doing this check in the wrong place would be a bug that
      could result in security problems, and also the naming still
      isn't very clear.
      
      Furthermore, it only affects 32-bit kernels, while most
      development happens on 64-bit kernels.
      
      If we replace them with user_mode() checks then the cost is only
      a very minor increase in various slowpaths:
      
         text             data   bss     dec              hex    filename
         10573391         703562 1753042 13029995         c6d26b vmlinux.o.before
         10573423         703562 1753042 13030027         c6d28b vmlinux.o.after
      
So let's get rid of this distinction once and for all.
Acked-by: Borislav Petkov <bp@suse.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
      Cc: Andrew Lutomirski <luto@kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brad Spengler <spender@grsecurity.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150329090233.GA1963@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      55474c48
24. 27 Mar 2015, 1 commit
• perf: Add per event clockid support · 34f43927
  Committed by Peter Zijlstra
      While thinking on the whole clock discussion it occurred to me we have
      two distinct uses of time:
      
       1) the tracking of event/ctx/cgroup enabled/running/stopped times
          which includes the self-monitoring support in struct
          perf_event_mmap_page.
      
       2) the actual timestamps visible in the data records.
      
      And we've been conflating them.
      
      The first is all about tracking time deltas, nobody should really care
      in what time base that happens, its all relative information, as long
      as its internally consistent it works.
      
      The second however is what people are worried about when having to
      merge their data with external sources. And here we have the
      discussion on MONOTONIC vs MONOTONIC_RAW etc..
      
Where MONOTONIC is good for correlating between machines (static
offset), MONOTONIC_RAW is required for correlating against a fixed-rate
hardware clock.
      
      This means configurability; now 1) makes that hard because it needs to
      be internally consistent across groups of unrelated events; which is
      why we had to have a global perf_clock().
      
      However, for 2) it doesn't really matter, perf itself doesn't care
      what it writes into the buffer.
      
      The below patch makes the distinction between these two cases by
      adding perf_event_clock() which is used for the second case. It
      further makes this configurable on a per-event basis, but adds a few
      sanity checks such that we cannot combine events with different clocks
      in confusing ways.
      
      And since we then have per-event configurability we might as well
      retain the 'legacy' behaviour as a default.
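
A minimal user-space sketch of the resulting per-event configurability
(illustrative; error handling omitted):

	#include <linux/perf_event.h>
	#include <string.h>
	#include <sys/syscall.h>
	#include <time.h>
	#include <unistd.h>

	int main(void)
	{
		struct perf_event_attr attr;

		memset(&attr, 0, sizeof(attr));
		attr.size        = sizeof(attr);
		attr.type        = PERF_TYPE_SOFTWARE;
		attr.config      = PERF_COUNT_SW_CPU_CLOCK;
		attr.use_clockid = 1;                   /* opt out of the legacy perf clock */
		attr.clockid     = CLOCK_MONOTONIC_RAW; /* timestamps in records use this clock */

		int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
		/* ... mmap the ring buffer and read samples ... */
		close(fd);
		return 0;
	}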
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      34f43927