1. 03 6月, 2016 4 次提交
    • V
      perf/abi: Change the errno for sampling event not supported in hardware · a1396555
      Vineet Gupta 提交于
      Change the return code for sampling event not supported from -ENOTSUPP
      to -EOPNOTSUPP.
      
      This allows userspace to identify this case specifically, instead of
      printing the catch-all error message it did previously.
      
      Technically this is an ABI change, but we think we can get away
      with it.
      
      Old behavior:
       -------
       | # perf record ls
       | Error:
       | The sys_perf_event_open() syscall returned with 524 (Unknown error 524)
       | for event (cycles:ppp).
       | /bin/dmesg may provide additional information.
       | No CONFIG_PERF_EVENTS=y kernel support configured?
      
      New behavior:
       -------
       | # perf record ls
       | Error:
       | PMU Hardware doesn't support sampling/overflow-interrupts.
      Signed-off-by: NVineet Gupta <vgupta@synopsys.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <acme@redhat.com>
      Cc: <linux-snps-arc@lists.infradead.org>
      Cc: <vincent.weaver@maine.edu>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Vineet Gupta <Vineet.Gupta1@synopsys.com>
      Link: http://lkml.kernel.org/r/1462786660-2900-3-git-send-email-vgupta@synopsys.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      a1396555
    • K
      perf/core: Fix implicitly enable dynamic interrupt throttle · ab7fdefb
      Kan Liang 提交于
      This patch fixes an issue which was introduced by commit:
      
        91a612ee ("perf/core: Fix dynamic interrupt throttle")
      
      ... which commit unconditionally sets the perf_sample_allowed_ns value
      to !0. But that could trigger a bug in the following corner case:
      
      The user can disable the dynamic interrupt throttle mechanism by setting
      perf_cpu_time_max_percent to 0. Then they change perf_event_max_sample_rate.
      For this case, the mechanism will be enabled implicitly, because
      perf_sample_allowed_ns becomes !0 - which is not what we want.
      
      This patch only updates perf_sample_allowed_ns when the dynamic
      interrupt throttle mechanism is enabled.
      Signed-off-by: NKan Liang <kan.liang@intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Link: http://lkml.kernel.org/r/1462260366-3160-1-git-send-email-kan.liang@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      ab7fdefb
    • P
      perf/core: Rename the perf_event_aux*() APIs to perf_event_sb*(), to separate... · aab5b71e
      Peter Zijlstra 提交于
      perf/core: Rename the perf_event_aux*() APIs to perf_event_sb*(), to separate them from AUX ring-buffer records
      
      There are now two different things called AUX in perf, the
      infrastructure to deliver the mmap/comm/task records and the
      AUX part in the mmap buffer (with associated AUX_RECORD).
      
      Since the former is internal, rename it to side-band to reduce
      the confusion factor.
      
      No change in functionality.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      aab5b71e
    • K
      perf/core: Optimize side-band event delivery · f2fb6bef
      Kan Liang 提交于
      The perf_event_aux() function iterates all PMUs and all events in
      their respective per-CPU contexts to find the events to deliver
      side-band records to.
      
      For example, the brk test case in lkp triggers many mmap() operations,
      which, if we're also running perf, results in many perf_event_aux()
      invocations.
      
      If we enable uncore PMU support (even when uncore events are not used),
      dozens of uncore PMUs will be iterated, which can significantly
      decrease brk_test's throughput.
      
      For example, the brk throughput:
      
        without uncore PMUs: 2647573 ops_per_sec
        with    uncore PMUs: 1768444 ops_per_sec
      
      ... a 33% reduction.
      
      To get at the per-CPU events that need side-band records, this patch
      puts these events on a per-CPU list, this avoids iterating the PMUs
      and any events that do not need side-band records.
      
      Per task events are unchanged to avoid extra overhead on the context
      switch paths.
      Suggested-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reported-by: NHuang, Ying <ying.huang@linux.intel.com>
      Signed-off-by: NKan Liang <kan.liang@intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1458757477-3781-1-git-send-email-kan.liang@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      f2fb6bef
  2. 30 5月, 2016 1 次提交
    • A
      perf core: Per event callchain limit · 97c79a38
      Arnaldo Carvalho de Melo 提交于
      Additionally to being able to control the system wide maximum depth via
      /proc/sys/kernel/perf_event_max_stack, now we are able to ask for
      different depths per event, using perf_event_attr.sample_max_stack for
      that.
      
      This uses an u16 hole at the end of perf_event_attr, that, when
      perf_event_attr.sample_type has the PERF_SAMPLE_CALLCHAIN, if
      sample_max_stack is zero, means use perf_event_max_stack, otherwise
      it'll be bounds checked under callchain_mutex.
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: He Kuang <hekuang@huawei.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Milian Wolff <milian.wolff@kdab.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Wang Nan <wangnan0@huawei.com>
      Cc: Zefan Li <lizefan@huawei.com>
      Link: http://lkml.kernel.org/n/tip-kolmn1yo40p7jhswxwrc7rrd@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      97c79a38
  3. 17 5月, 2016 5 次提交
    • A
      perf core: Separate accounting of contexts and real addresses in a stack trace · c85b0334
      Arnaldo Carvalho de Melo 提交于
      The perf_sample->ip_callchain->nr value includes all the entries in the
      ip_callchain->ip[] array, real addresses and PERF_CONTEXT_{KERNEL,USER,etc},
      while what the user expects is that what is in the kernel.perf_event_max_stack
      sysctl or in the upcoming per event perf_event_attr.sample_max_stack knob be
      honoured in terms of IP addresses in the stack trace.
      
      So allocate a bunch of extra entries for contexts, and do the accounting
      via perf_callchain_entry_ctx struct members.
      
      A new sysctl, kernel.perf_event_max_contexts_per_stack is also
      introduced for investigating possible bugs in the callchain
      implementation by some arch.
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: He Kuang <hekuang@huawei.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Milian Wolff <milian.wolff@kdab.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Wang Nan <wangnan0@huawei.com>
      Cc: Zefan Li <lizefan@huawei.com>
      Link: http://lkml.kernel.org/n/tip-3b4wnqk340c4sg4gwkfdi9yk@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      c85b0334
    • A
      perf core: Add perf_callchain_store_context() helper · 3e4de4ec
      Arnaldo Carvalho de Melo 提交于
      We need have different helpers to account how many contexts we have in
      the sample and for real addresses, so do it now as a prep patch, to
      ease review.
      
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/n/tip-q964tnyuqrxw5gld18vizs3c@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      3e4de4ec
    • A
      perf core: Add a 'nr' field to perf_event_callchain_context · 3b1fff08
      Arnaldo Carvalho de Melo 提交于
      We will use it to count how many addresses are in the entry->ip[] array,
      excluding PERF_CONTEXT_{KERNEL,USER,etc} entries, so that we can really
      return the number of entries specified by the user via the relevant
      sysctl, kernel.perf_event_max_contexts, or via the per event
      perf_event_attr.sample_max_stack knob.
      
      This way we keep the perf_sample->ip_callchain->nr meaning, that is the
      number of entries, be it real addresses or PERF_CONTEXT_ entries, while
      honouring the max_stack knobs, i.e. the end result will be max_stack
      entries if we have at least that many entries in a given stack trace.
      
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/n/tip-s8teto51tdqvlfhefndtat9r@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      3b1fff08
    • A
      perf core: Pass max stack as a perf_callchain_entry context · cfbcf468
      Arnaldo Carvalho de Melo 提交于
      This makes perf_callchain_{user,kernel}() receive the max stack
      as context for the perf_callchain_entry, instead of accessing
      the global sysctl_perf_event_max_stack.
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: He Kuang <hekuang@huawei.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Milian Wolff <milian.wolff@kdab.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Wang Nan <wangnan0@huawei.com>
      Cc: Zefan Li <lizefan@huawei.com>
      Link: http://lkml.kernel.org/n/tip-kolmn1yo40p7jhswxwrc7rrd@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      cfbcf468
    • A
      perf core: Generalize max_stack sysctl handler · a831100a
      Arnaldo Carvalho de Melo 提交于
      So that it can be used for other stack related knobs, such as the
      upcoming one to tweak the max number of of contexts per stack sample.
      
      In all those cases we can only change the value if there are no perf
      sessions collecting stacks, so they need to grab that mutex, etc.
      
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/n/tip-8t3fk94wuzp8m2z1n4gc0s17@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      a831100a
  4. 12 5月, 2016 2 次提交
  5. 10 5月, 2016 1 次提交
  6. 05 5月, 2016 5 次提交
  7. 28 4月, 2016 1 次提交
    • P
      perf/core: Fix perf_event_open() vs. execve() race · 79c9ce57
      Peter Zijlstra 提交于
      Jann reported that the ptrace_may_access() check in
      find_lively_task_by_vpid() is racy against exec().
      
      Specifically:
      
        perf_event_open()		execve()
      
        ptrace_may_access()
      				commit_creds()
        ...				if (get_dumpable() != SUID_DUMP_USER)
      				  perf_event_exit_task();
        perf_install_in_context()
      
      would result in installing a counter across the creds boundary.
      
      Fix this by wrapping lots of perf_event_open() in cred_guard_mutex.
      This should be fine as perf_event_exit_task() is already called with
      cred_guard_mutex held, so all perf locks already nest inside it.
      Reported-by: NJann Horn <jannh@google.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      79c9ce57
  8. 27 4月, 2016 1 次提交
    • A
      perf core: Allow setting up max frame stack depth via sysctl · c5dfd78e
      Arnaldo Carvalho de Melo 提交于
      The default remains 127, which is good for most cases, and not even hit
      most of the time, but then for some cases, as reported by Brendan, 1024+
      deep frames are appearing on the radar for things like groovy, ruby.
      
      And in some workloads putting a _lower_ cap on this may make sense. One
      that is per event still needs to be put in place tho.
      
      The new file is:
      
        # cat /proc/sys/kernel/perf_event_max_stack
        127
      
      Chaging it:
      
        # echo 256 > /proc/sys/kernel/perf_event_max_stack
        # cat /proc/sys/kernel/perf_event_max_stack
        256
      
      But as soon as there is some event using callchains we get:
      
        # echo 512 > /proc/sys/kernel/perf_event_max_stack
        -bash: echo: write error: Device or resource busy
        #
      
      Because we only allocate the callchain percpu data structures when there
      is a user, which allows for changing the max easily, its just a matter
      of having no callchain users at that point.
      Reported-and-Tested-by: NBrendan Gregg <brendan.d.gregg@gmail.com>
      Reviewed-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDavid Ahern <dsahern@gmail.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: He Kuang <hekuang@huawei.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Milian Wolff <milian.wolff@kdab.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Wang Nan <wangnan0@huawei.com>
      Cc: Zefan Li <lizefan@huawei.com>
      Link: http://lkml.kernel.org/r/20160426002928.GB16708@kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      c5dfd78e
  9. 23 4月, 2016 2 次提交
    • W
      perf/core: Add ::write_backward attribute to perf event · 9ecda41a
      Wang Nan 提交于
      This patch introduces 'write_backward' bit to perf_event_attr, which
      controls the direction of a ring buffer. After set, the corresponding
      ring buffer is written from end to beginning. This feature is design to
      support reading from overwritable ring buffer.
      
      Ring buffer can be created by mapping a perf event fd. Kernel puts event
      records into ring buffer, user tooling like perf fetch them from
      address returned by mmap(). To prevent racing between kernel and tooling,
      they communicate to each other through 'head' and 'tail' pointers.
      Kernel maintains 'head' pointer, points it to the next free area (tail
      of the last record). Tooling maintains 'tail' pointer, points it to the
      tail of last consumed record (record has already been fetched). Kernel
      determines the available space in a ring buffer using these two
      pointers to avoid overwrite unfetched records.
      
      By mapping without 'PROT_WRITE', an overwritable ring buffer is created.
      Different from normal ring buffer, tooling is unable to maintain 'tail'
      pointer because writing is forbidden. Therefore, for this type of ring
      buffers, kernel overwrite old records unconditionally, works like flight
      recorder. This feature would be useful if reading from overwritable ring
      buffer were as easy as reading from normal ring buffer. However,
      there's an obscure problem.
      
      The following figure demonstrates a full overwritable ring buffer. In
      this figure, the 'head' pointer points to the end of last record, and a
      long record 'E' is pending. For a normal ring buffer, a 'tail' pointer
      would have pointed to position (X), so kernel knows there's no more
      space in the ring buffer. However, for an overwritable ring buffer,
      kernel ignore the 'tail' pointer.
      
         (X)                              head
          .                                |
          .                                V
          +------+-------+----------+------+---+
          |A....A|B.....B|C........C|D....D|   |
          +------+-------+----------+------+---+
      
      Record 'A' is overwritten by event 'E':
      
            head
             |
             V
          +--+---+-------+----------+------+---+
          |.E|..A|B.....B|C........C|D....D|E..|
          +--+---+-------+----------+------+---+
      
      Now tooling decides to read from this ring buffer. However, none of these
      two natural positions, 'head' and the start of this ring buffer, are
      pointing to the head of a record. Even the full ring buffer can be
      accessed by tooling, it is unable to find a position to start decoding.
      
      The first attempt tries to solve this problem AFAIK can be found from
      [1]. It makes kernel to maintain 'tail' pointer: updates it when ring
      buffer is half full. However, this approach introduces overhead to
      fast path. Test result shows a 1% overhead [2]. In addition, this method
      utilizes no more tham 50% records.
      
      Another attempt can be found from [3], which allows putting the size of
      an event at the end of each record. This approach allows tooling to find
      records in a backward manner from 'head' pointer by reading size of a
      record from its tail. However, because of alignment requirement, it
      needs 8 bytes to record the size of a record, which is a huge waste. Its
      performance is also not good, because more data need to be written.
      This approach also introduces some extra branch instructions to fast
      path.
      
      'write_backward' is a better solution to this problem.
      
      Following figure demonstrates the state of the overwritable ring buffer
      when 'write_backward' is set before overwriting:
      
             head
              |
              V
          +---+------+----------+-------+------+
          |   |D....D|C........C|B.....B|A....A|
          +---+------+----------+-------+------+
      
      and after overwriting:
                                           head
                                            |
                                            V
          +---+------+----------+-------+---+--+
          |..E|D....D|C........C|B.....B|A..|E.|
          +---+------+----------+-------+---+--+
      
      In each situation, 'head' points to the beginning of the newest record.
      From this record, tooling can iterate over the full ring buffer and fetch
      records one by one.
      
      The only limitation that needs to be considered is back-to-back reading.
      Due to the non-deterministic of user programs, it is impossible to ensure
      the ring buffer keeps stable during reading. Consider an extreme situation:
      tooling is scheduled out after reading record 'D', then a burst of events
      come, eat up the whole ring buffer (one or multiple rounds). When the
      tooling process comes back, reading after 'D' is incorrect now.
      
      To prevent this problem, we need to find a way to ensure the ring buffer
      is stable during reading. ioctl(PERF_EVENT_IOC_PAUSE_OUTPUT) is
      suggested because its overhead is lower than
      ioctl(PERF_EVENT_IOC_ENABLE).
      
      By carefully verifying 'header' pointer, reader can avoid pausing the
      ring-buffer. For example:
      
          /* A union of all possible events */
          union perf_event event;
      
          p = head = perf_mmap__read_head();
          while (true) {
              /* copy header of next event */
              fetch(&event.header, p, sizeof(event.header));
      
              /* read 'head' pointer */
              head = perf_mmap__read_head();
      
              /* check overwritten: is the header good? */
              if (!verify(sizeof(event.header), p, head))
                  break;
      
              /* copy the whole event */
              fetch(&event, p, event.header.size);
      
              /* read 'head' pointer again */
              head = perf_mmap__read_head();
      
              /* is the whole event good? */
              if (!verify(event.header.size, p, head))
                  break;
              p += event.header.size;
          }
      
      However, the overhead is high because:
      
       a) In-place decoding is not safe.
          Copying-verifying-decoding is required.
       b) Fetching 'head' pointer requires additional synchronization.
      
      (From Alexei Starovoitov:
      
      Even when this trick works, pause is needed for more than stability of
      reading. When we collect the events into overwrite buffer we're waiting
      for some other trigger (like all cpu utilization spike or just one cpu
      running and all others are idle) and when it happens the buffer has
      valuable info from the past. At this point new events are no longer
      interesting and buffer should be paused, events read and unpaused until
      next trigger comes.)
      
      This patch utilizes event's default overflow_handler introduced
      previously. perf_event_output_backward() is created as the default
      overflow handler for backward ring buffers. To avoid extra overhead to
      fast path, original perf_event_output() becomes __perf_event_output()
      and marked '__always_inline'. In theory, there's no extra overhead
      introduced to fast path.
      
      Performance testing:
      
      Calling 3000000 times of 'close(-1)', use gettimeofday() to check
      duration.  Use 'perf record -o /dev/null -e raw_syscalls:*' to capture
      system calls. In ns.
      
      Testing environment:
      
        CPU    : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
        Kernel : v4.5.0
                          MEAN         STDVAR
       BASE            800214.950    2853.083
       PRE1           2253846.700    9997.014
       PRE2           2257495.540    8516.293
       POST           2250896.100    8933.921
      
      Where 'BASE' is pure performance without capturing. 'PRE1' is test
      result of pure 'v4.5.0' kernel. 'PRE2' is test result before this
      patch. 'POST' is test result after this patch. See [4] for the detailed
      experimental setup.
      
      Considering the stdvar, this patch doesn't introduce performance
      overhead to the fast path.
      
       [1] http://lkml.iu.edu/hypermail/linux/kernel/1304.1/04584.html
       [2] http://lkml.iu.edu/hypermail/linux/kernel/1307.1/00535.html
       [3] http://lkml.iu.edu/hypermail/linux/kernel/1512.0/01265.html
       [4] http://lkml.kernel.org/g/56F89DCD.1040202@huawei.comSigned-off-by: NWang Nan <wangnan0@huawei.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Cc: <acme@kernel.org>
      Cc: <pi3orama@163.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: He Kuang <hekuang@huawei.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Zefan Li <lizefan@huawei.com>
      Link: http://lkml.kernel.org/r/1459865478-53413-1-git-send-email-wangnan0@huawei.com
      [ Fixed the changelog some more. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      9ecda41a
    • P
      perf/core: Make sysctl_perf_cpu_time_max_percent conform to documentation · b303e7c1
      Peter Zijlstra 提交于
      Markus reported that 0 should also disable the throttling we per
      Documentation/sysctl/kernel.txt.
      Reported-by: NMarkus Trippelsdorf <markus@trippelsdorf.de>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: 91a612ee ("perf/core: Fix dynamic interrupt throttle")
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      b303e7c1
  10. 05 4月, 2016 1 次提交
    • K
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov 提交于
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  11. 31 3月, 2016 9 次提交
    • W
      perf/ring_buffer: Prepare writing into the ring-buffer from the end · d1b26c70
      Wang Nan 提交于
      Convert perf_output_begin() to __perf_output_begin() and make the later
      function able to write records from the end of the ring-buffer.
      
      Following commits will utilize the 'backward' flag.
      
      This is the core patch to support writing to the ring-buffer backwards,
      which will be introduced by upcoming patches to support reading from
      overwritable ring-buffers.
      
      In theory, this patch should not introduce any extra performance
      overhead since we use always_inline, but it does not hurt to double
      check that assumption:
      
      When CONFIG_OPTIMIZE_INLINING is disabled, the output object is nearly
      identical to original one. See:
      
         http://lkml.kernel.org/g/56F52E83.70409@huawei.com
      
      When CONFIG_OPTIMIZE_INLINING is enabled, the resuling object file becomes
      smaller:
      
       $ size kernel/events/ring_buffer.o*
         text       data        bss        dec        hex    filename
         4641          4          8       4653       122d kernel/events/ring_buffer.o.old
         4545          4          8       4557       11cd kernel/events/ring_buffer.o.new
      
      Performance testing results:
      
      Calling 3000000 times of 'close(-1)', use gettimeofday() to check
      duration.  Use 'perf record -o /dev/null -e raw_syscalls:*' to capture
      system calls. In ns.
      
      Testing environment:
      
       CPU    : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
       Kernel : v4.5.0
      
                           MEAN         STDVAR
        BASE            800214.950    2853.083
        PRE            2253846.700    9997.014
        POST           2257495.540    8516.293
      
      Where 'BASE' is pure performance without capturing. 'PRE' is test
      result of pure 'v4.5.0' kernel. 'POST' is test result after this
      patch.
      
      Considering the stdvar, this patch doesn't hurt performance, within
      noise margin.
      
      For testing details, see:
      
        http://lkml.kernel.org/g/56F89DCD.1040202@huawei.comSigned-off-by: NWang Nan <wangnan0@huawei.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <pi3orama@163.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: He Kuang <hekuang@huawei.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Zefan Li <lizefan@huawei.com>
      Link: http://lkml.kernel.org/r/1459147292-239310-4-git-send-email-wangnan0@huawei.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d1b26c70
    • W
      perf/core: Set event's default ::overflow_handler() · 1879445d
      Wang Nan 提交于
      Set a default event->overflow_handler in perf_event_alloc() so don't
      need to check event->overflow_handler in __perf_event_overflow().
      Following commits can give a different default overflow_handler.
      
      Initial idea comes from Peter:
      
        http://lkml.kernel.org/r/20130708121557.GA17211@twins.programming.kicks-ass.net
      
      Since the default value of event->overflow_handler is not NULL, existing
      'if (!overflow_handler)' checks need to be changed.
      
      is_default_overflow_handler() is introduced for this.
      
      No extra performance overhead is introduced into the hot path because in the
      original code we still need to read this handler from memory. A conditional
      branch is avoided so actually we remove some instructions.
      Signed-off-by: NWang Nan <wangnan0@huawei.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <pi3orama@163.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: He Kuang <hekuang@huawei.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Zefan Li <lizefan@huawei.com>
      Link: http://lkml.kernel.org/r/1459147292-239310-3-git-send-email-wangnan0@huawei.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      1879445d
    • W
      perf/ring_buffer: Introduce new ioctl options to pause and resume the ring-buffer · 86e7972f
      Wang Nan 提交于
      Add new ioctl() to pause/resume ring-buffer output.
      
      In some situations we want to read from the ring-buffer only when we
      ensure nothing can write to the ring-buffer during reading. Without
      this patch we have to turn off all events attached to this ring-buffer
      to achieve this.
      
      This patch is a prerequisite to enable overwrite support for the
      perf ring-buffer support. Following commits will introduce new methods
      support reading from overwrite ring buffer. Before reading, caller
      must ensure the ring buffer is frozen, or the reading is unreliable.
      Signed-off-by: NWang Nan <wangnan0@huawei.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <pi3orama@163.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: He Kuang <hekuang@huawei.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Zefan Li <lizefan@huawei.com>
      Link: http://lkml.kernel.org/r/1459147292-239310-2-git-send-email-wangnan0@huawei.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      86e7972f
    • A
      perf/ring_buffer: Document AUX API usage · af5bb4ed
      Alexander Shishkin 提交于
      In order to ensure safe AUX buffer management, we rely on the assumption
      that pmu::stop() stops its ongoing AUX transaction and not just the hw.
      
      This patch documents this requirement for the perf_aux_output_{begin,end}()
      APIs.
      Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/1457098969-21595-4-git-send-email-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      af5bb4ed
    • A
      perf/core: Free AUX pages in unmap path · 95ff4ca2
      Alexander Shishkin 提交于
      Now that we can ensure that when ring buffer's AUX area is on the way
      to getting unmapped new transactions won't start, we only need to stop
      all events that can potentially be writing aux data to our ring buffer.
      
      Having done that, we can safely free the AUX pages and corresponding
      PMU data, as this time it is guaranteed to be the last aux reference
      holder.
      
      This partially reverts:
      
        57ffc5ca ("perf: Fix AUX buffer refcounting")
      
      ... which was made to defer deallocation that was otherwise possible
      from an NMI context. Now it is no longer the case; the last call to
      rb_free_aux() that drops the last AUX reference has to happen in
      perf_mmap_close() on that AUX area.
      Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/87d1qtz23d.fsf@ashishki-desk.ger.corp.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      95ff4ca2
    • A
      perf/ring_buffer: Refuse to begin AUX transaction after rb->aux_mmap_count drops · dcb10a96
      Alexander Shishkin 提交于
      When ring buffer's AUX area is unmapped and rb->aux_mmap_count drops to
      zero, new AUX transactions into this buffer can still be started,
      even though the buffer in en route to deallocation.
      
      This patch adds a check to perf_aux_output_begin() for rb->aux_mmap_count
      being zero, in which case there is no point starting new transactions,
      in other words, the ring buffers that pass a certain point in
      perf_mmap_close will not have their events sending new data, which
      clears path for freeing those buffers' pages right there and then,
      provided that no active transactions are holding the AUX reference.
      Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/1457098969-21595-2-git-send-email-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      dcb10a96
    • P
      perf/core: Verify we have a single perf_hw_context PMU · 26657848
      Peter Zijlstra 提交于
      There should (and can) only be a single PMU for perf_hw_context
      events.
      
      This is because of how we schedule events: once a hardware event fails to
      schedule (the PMU is 'full') we stop trying to add more. The trivial
      'fix' would break the Round-Robin scheduling we do.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      26657848
    • A
      perf/core: Don't leak event in the syscall error path · 201c2f85
      Alexander Shishkin 提交于
      In the error path, event_file not being NULL is used to determine
      whether the event itself still needs to be free'd, so fix it up to
      avoid leaking.
      Reported-by: NLeon Yu <chianglungyu@gmail.com>
      Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: 13005627 ("perf: Do not double free")
      Link: http://lkml.kernel.org/r/87twk06yxp.fsf@ashishki-desk.ger.corp.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      201c2f85
    • P
      perf/core: Fix time tracking bug with multiplexing · 8fdc6539
      Peter Zijlstra 提交于
      Stephane reported that commit:
      
        3cbaa590 ("perf: Fix ctx time tracking by introducing EVENT_TIME")
      
      introduced a regression wrt. time tracking, as easily observed by:
      
      > This patch introduce a bug in the time tracking of events when
      > multiplexing is used.
      >
      > The issue is easily reproducible with the following perf run:
      >
      >  $ perf stat -a -C 0 -e branches,branches,branches,branches,branches,branches -I 1000
      >      1.000730239            652,394      branches   (66.41%)
      >      1.000730239            597,809      branches   (66.41%)
      >      1.000730239            593,870      branches   (66.63%)
      >      1.000730239            651,440      branches   (67.03%)
      >      1.000730239            656,725      branches   (66.96%)
      >      1.000730239      <not counted>      branches
      >
      > One branches event is shown as not having run. Yet, with
      > multiplexing, all events should run especially with a 1s (-I 1000)
      > interval. The delta for time_running comes out to 0. Yet, the event
      > has run because the kernel is actually multiplexing the events. The
      > problem is that the time tracking is the kernel and especially in
      > ctx_sched_out() is wrong now.
      >
      > The problem is that in case that the kernel enters ctx_sched_out() with the
      > following state:
      >    ctx->is_active=0x7 event_type=0x1
      >    Call Trace:
      >     [<ffffffff813ddd41>] dump_stack+0x63/0x82
      >     [<ffffffff81182bdc>] ctx_sched_out+0x2bc/0x2d0
      >     [<ffffffff81183896>] perf_mux_hrtimer_handler+0xf6/0x2c0
      >     [<ffffffff811837a0>] ? __perf_install_in_context+0x130/0x130
      >     [<ffffffff810f5818>] __hrtimer_run_queues+0xf8/0x2f0
      >     [<ffffffff810f6097>] hrtimer_interrupt+0xb7/0x1d0
      >     [<ffffffff810509a8>] local_apic_timer_interrupt+0x38/0x60
      >     [<ffffffff8175ca9d>] smp_apic_timer_interrupt+0x3d/0x50
      >     [<ffffffff8175ac7c>] apic_timer_interrupt+0x8c/0xa0
      >
      > In that case, the test:
      >       if (is_active & EVENT_TIME)
      >
      > will be false and the time will not be updated. Time must always be updated on
      > sched out.
      
      Fix this by always updating time if EVENT_TIME was set, as opposed to
      only updating time when EVENT_TIME changed.
      Reported-by: NStephane Eranian <eranian@google.com>
      Tested-by: NStephane Eranian <eranian@google.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: kan.liang@intel.com
      Cc: namhyung@kernel.org
      Fixes: 3cbaa590 ("perf: Fix ctx time tracking by introducing EVENT_TIME")
      Link: http://lkml.kernel.org/r/20160329072644.GB3408@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      8fdc6539
  12. 21 3月, 2016 4 次提交
    • P
      perf/core: Document some hotplug bits · 1dcaac1c
      Peter Zijlstra 提交于
      Document some of the hotplug notifier usage.
      Requested-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      1dcaac1c
    • P
      perf/core: Fix Undefined behaviour in rb_alloc() · 8184059e
      Peter Zijlstra 提交于
      Sasha reported:
      
       [ 3494.030114] UBSAN: Undefined behaviour in kernel/events/ring_buffer.c:685:22
       [ 3494.030647] shift exponent -1 is negative
      
      Andrey spotted that this is because:
      
        It happens if nr_pages = 0:
           rb->page_order = ilog2(nr_pages);
      
      Fix it by making both assignments conditional on nr_pages; since
      otherwise they should both be 0 anyway, and will be because of the
      kzalloc() used to allocate the structure.
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Reported-by: NAndrey Ryabinin <ryabinin.a.a@gmail.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/20160129141751.GA407@worktopSigned-off-by: NIngo Molnar <mingo@kernel.org>
      8184059e
    • P
      perf/core: Fix dynamic interrupt throttle · 91a612ee
      Peter Zijlstra 提交于
      There were two problems with the dynamic interrupt throttle mechanism,
      both triggered by the same action.
      
      When you (or perf_fuzzer) write a huge value into
      /proc/sys/kernel/perf_event_max_sample_rate the computed
      perf_sample_allowed_ns becomes 0. This effectively disables the whole
      dynamic throttle.
      
      This is fixed by ensuring update_perf_cpu_limits() never sets the
      value to 0. However, we allow disabling of the dynamic throttle by
      writing 100 to /proc/sys/kernel/perf_cpu_time_max_percent. This will
      generate a warning in dmesg.
      
      The second problem is that by setting the max_sample_rate to a huge
      number, the adaptive process can take a few tries, since it halfs the
      limit each time. Change that to directly compute a new value based on
      the observed duration.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      91a612ee
    • P
      perf/core: Fix the unthrottle logic · 1e02cd40
      Peter Zijlstra 提交于
      Its possible to IOC_PERIOD while the event is throttled, this would
      re-start the event and the next tick would then try to unthrottle it,
      and find the event wasn't actually stopped anymore.
      
      This would tickle a WARN in the x86-pmu code which isn't expecting to
      start a !stopped event.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: dvyukov@google.com
      Cc: oleg@redhat.com
      Cc: panand@redhat.com
      Cc: sasha.levin@oracle.com
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/20160310143924.GR6356@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      1e02cd40
  13. 08 3月, 2016 1 次提交
  14. 02 3月, 2016 1 次提交
    • F
      perf: Migrate perf to use new tick dependency mask model · 555e0c1e
      Frederic Weisbecker 提交于
      Instead of providing asynchronous checks for the nohz subsystem to verify
      perf event tick dependency, migrate perf to the new mask.
      
      Perf needs the tick for two situations:
      
      1) Freq events. We could set the tick dependency when those are
      installed on a CPU context. But setting a global dependency on top of
      the global freq events accounting is much easier. If people want that
      to be optimized, we can still refine that on the per-CPU tick dependency
      level. This patch dooesn't change the current behaviour anyway.
      
      2) Throttled events: this is a per-cpu dependency.
      Reviewed-by: NChris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      555e0c1e
  15. 29 2月, 2016 2 次提交