提交 · 1a5941312414c71dece6717da9a0fa1303127afa · openanolis / cloud-kernel

02 4月, 2015 14 次提交

perf: Add wakeup watermark control to the AUX area · 1a594131

由 Alexander Shishkin 提交于 1月 14, 2015

When AUX area gets a certain amount of new data, we want to wake up
userspace to collect it. This adds a new control to specify how much
data will cause a wakeup. This is then passed down to pmu drivers via
output handle's "wakeup" field, so that the driver can find the nearest
point where it can generate an interrupt.

We repurpose __reserved_2 in the event attribute for this, even though
it was never checked to be zero before, aux_watermark will only matter
for new AUX-aware code, so the old code should still be fine.
Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kaixu Xia <kaixu.xia@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Robert Richter <rric@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: kan.liang@intel.com
Cc: markus.t.metzger@intel.com
Cc: mathieu.poirier@linaro.org
Link: http://lkml.kernel.org/r/1421237903-181015-10-git-send-email-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

1a594131

perf: Support overwrite mode for the AUX area · 2023a0d2

由 Alexander Shishkin 提交于 1月 14, 2015

This adds support for overwrite mode in the AUX area, which means "keep
collecting data till you're stopped", turning AUX area into a circular
buffer, where new data overwrites old data. It does not depend on data
buffer's overwrite mode, so that it doesn't lose sideband data that is
instrumental for processing AUX data.

Overwrite mode is enabled at mapping AUX area read only. Even though
aux_tail in the buffer's user page might be user writable, it will be
ignored in this mode.

A PERF_RECORD_AUX with PERF_AUX_FLAG_OVERWRITE set is written to the perf
data stream every time an event writes new data to the AUX area. The pmu
driver might not be able to infer the exact beginning of the new data in
each snapshot, some drivers will only provide the tail, which is
aux_offset + aux_size in the AUX record. Consumer has to be able to tell
the new data from the old one, for example, by means of time stamps if
such are provided in the trace.

Consumer is also responsible for disabling any events that might write
to the AUX area (thus potentially racing with the consumer) before
collecting the data.
Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kaixu Xia <kaixu.xia@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Robert Richter <rric@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: kan.liang@intel.com
Cc: markus.t.metzger@intel.com
Cc: mathieu.poirier@linaro.org
Link: http://lkml.kernel.org/r/1421237903-181015-9-git-send-email-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

2023a0d2

perf: Add API for PMUs to write to the AUX area · fdc26706

由 Alexander Shishkin 提交于 1月 14, 2015

For pmus that wish to write data to ring buffer's AUX area, provide
perf_aux_output_{begin,end}() calls to initiate/commit data writes,
similarly to perf_output_{begin,end}. These also use the same output
handle structure. Also, similarly to software counterparts, these
will direct inherited events' output to parents' ring buffers.

After the perf_aux_output_begin() returns successfully, handle->size
is set to the maximum amount of data that can be written wrt aux_tail
pointer, so that no data that the user hasn't seen will be overwritten,
therefore this should always be called before hardware writing is
enabled. On success, this will return the pointer to pmu driver's
private structure allocated for this aux area by pmu::setup_aux. Same
pointer can also be retrieved using perf_get_aux() while hardware
writing is enabled.

PMU driver should pass the actual amount of data written as a parameter
to perf_aux_output_end(). All hardware writes should be completed and
visible before this one is called.

Additionally, perf_aux_output_skip() will adjust output handle and
aux_head in case some part of the buffer has to be skipped over to
maintain hardware's alignment constraints.

Nested writers are forbidden and guards are in place to catch such
attempts.
Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kaixu Xia <kaixu.xia@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Robert Richter <rric@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: kan.liang@intel.com
Cc: markus.t.metzger@intel.com
Cc: mathieu.poirier@linaro.org
Link: http://lkml.kernel.org/r/1421237903-181015-8-git-send-email-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

fdc26706

perf: Add AUX record · 68db7e98

由 Alexander Shishkin 提交于 1月 14, 2015

When there's new data in the AUX space, output a record indicating its
offset and size and a set of flags, such as PERF_AUX_FLAG_TRUNCATED, to
mean the described data was truncated to fit in the ring buffer.
Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kaixu Xia <kaixu.xia@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Robert Richter <rric@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: adrian.hunter@intel.com
Cc: kan.liang@intel.com
Cc: markus.t.metzger@intel.com
Cc: mathieu.poirier@linaro.org
Link: http://lkml.kernel.org/r/1421237903-181015-7-git-send-email-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

68db7e98

perf: Add a pmu capability for "exclusive" events · bed5b25a

由 Alexander Shishkin 提交于 1月 30, 2015

Usually, pmus that do, for example, instruction tracing, would only ever
be able to have one event per task per cpu (or per perf_event_context). For
such pmus it makes sense to disallow creating conflicting events early on,
so as to provide consistent behavior for the user.

This patch adds a pmu capability that indicates such constraint on event
creation.
Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kaixu Xia <kaixu.xia@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Robert Richter <rric@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: kan.liang@intel.com
Cc: markus.t.metzger@intel.com
Cc: mathieu.poirier@linaro.org
Link: http://lkml.kernel.org/r/1422613866-113186-1-git-send-email-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

bed5b25a

perf: Add a capability for AUX_NO_SG pmus to do software double buffering · 6a279230

由 Alexander Shishkin 提交于 1月 14, 2015

For pmus that don't support scatter-gather for AUX data in hardware, it
might still make sense to implement software double buffering to avoid
losing data while the user is reading data out. For this purpose, add
a pmu capability that guarantees multiple high-order chunks for AUX buffer,
so that the pmu driver can do switchover tricks.

To make use of this feature, add PERF_PMU_CAP_AUX_SW_DOUBLEBUF to your
pmu's capability mask. This will make the ring buffer AUX allocation code
ensure that the biggest high order allocation for the aux buffer pages is
no bigger than half of the total requested buffer size, thus making sure
that the buffer has at least two high order allocations.
Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kaixu Xia <kaixu.xia@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Robert Richter <rric@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: kan.liang@intel.com
Cc: markus.t.metzger@intel.com
Cc: mathieu.poirier@linaro.org
Link: http://lkml.kernel.org/r/1421237903-181015-5-git-send-email-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

6a279230

perf: Support high-order allocations for AUX space · 0a4e38e6

由 Alexander Shishkin 提交于 1月 14, 2015

Some pmus (such as BTS or Intel PT without multiple-entry ToPA capability)
don't support scatter-gather and will prefer larger contiguous areas for
their output regions.

This patch adds a new pmu capability to request higher order allocations.
Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kaixu Xia <kaixu.xia@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Robert Richter <rric@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: kan.liang@intel.com
Cc: markus.t.metzger@intel.com
Cc: mathieu.poirier@linaro.org
Link: http://lkml.kernel.org/r/1421237903-181015-4-git-send-email-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

0a4e38e6

perf: Add AUX area to ring buffer for raw data streams · 45bfb2e5

由 Peter Zijlstra 提交于 1月 14, 2015

This patch introduces "AUX space" in the perf mmap buffer, intended for
exporting high bandwidth data streams to userspace, such as instruction
flow traces.

AUX space is a ring buffer, defined by aux_{offset,size} fields in the
user_page structure, and read/write pointers aux_{head,tail}, which abide
by the same rules as data_* counterparts of the main perf buffer.

In order to allocate/mmap AUX, userspace needs to set up aux_offset to
such an offset that will be greater than data_offset+data_size and
aux_size to be the desired buffer size. Both need to be page aligned.
Then, same aux_offset and aux_size should be passed to mmap() call and
if everything adds up, you should have an AUX buffer as a result.

Pages that are mapped into this buffer also come out of user's mlock
rlimit plus perf_event_mlock_kb allowance.
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kaixu Xia <kaixu.xia@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Robert Richter <rric@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: kan.liang@intel.com
Cc: markus.t.metzger@intel.com
Cc: mathieu.poirier@linaro.org
Link: http://lkml.kernel.org/r/1421237903-181015-3-git-send-email-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

45bfb2e5

perf: Add data_{offset,size} to user_page · e8c6deac

由 Alexander Shishkin 提交于 1月 14, 2015

Currently, the actual perf ring buffer is one page into the mmap area,
following the user page and the userspace follows this convention. This
patch adds data_{offset,size} fields to user_page that can be used by
userspace instead for locating perf data in the mmap area. This is also
helpful when mapping existing or shared buffers if their size is not
known in advance.

Right now, it is made to follow the existing convention that

	data_offset == PAGE_SIZE and
	data_offset + data_size == mmap_size.
Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kaixu Xia <kaixu.xia@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Robert Richter <rric@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: kan.liang@intel.com
Cc: markus.t.metzger@intel.com
Cc: mathieu.poirier@linaro.org
Link: http://lkml.kernel.org/r/1421237903-181015-2-git-send-email-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

e8c6deac

tracing: Allow BPF programs to call bpf_trace_printk() · 9c959c86

由 Alexei Starovoitov 提交于 3月 25, 2015

Debugging of BPF programs needs some form of printk from the
program, so let programs call limited trace_printk() with %d %u
%x %p modifiers only.

Similar to kernel modules, during program load verifier checks
whether program is calling bpf_trace_printk() and if so, kernel
allocates trace_printk buffers and emits big 'this is debug
only' banner.
Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
Reviewed-by: NSteven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1427312966-8434-6-git-send-email-ast@plumgrid.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

9c959c86

tracing: Allow BPF programs to call bpf_ktime_get_ns() · d9847d31

由 Alexei Starovoitov 提交于 3月 25, 2015

bpf_ktime_get_ns() is used by programs to compute time delta
between events or as a timestamp
Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
Reviewed-by: NSteven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1427312966-8434-5-git-send-email-ast@plumgrid.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

d9847d31

tracing, perf: Implement BPF programs attached to kprobes · 2541517c

由 Alexei Starovoitov 提交于 3月 25, 2015

BPF programs, attached to kprobes, provide a safe way to execute
user-defined BPF byte-code programs without being able to crash or
hang the kernel in any way. The BPF engine makes sure that such
programs have a finite execution time and that they cannot break
out of their sandbox.

The user interface is to attach to a kprobe via the perf syscall:

	struct perf_event_attr attr = {
		.type	= PERF_TYPE_TRACEPOINT,
		.config	= event_id,
		...
	};

	event_fd = perf_event_open(&attr,...);
	ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);

'prog_fd' is a file descriptor associated with BPF program
previously loaded.

'event_id' is an ID of the kprobe created.

Closing 'event_fd':

	close(event_fd);

... automatically detaches BPF program from it.

BPF programs can call in-kernel helper functions to:

  - lookup/update/delete elements in maps

  - probe_read - wraper of probe_kernel_read() used to access any
    kernel data structures

BPF programs receive 'struct pt_regs *' as an input ('struct pt_regs' is
architecture dependent) and return 0 to ignore the event and 1 to store
kprobe event into the ring buffer.

Note, kprobes are a fundamentally _not_ a stable kernel ABI,
so BPF programs attached to kprobes must be recompiled for
every kernel version and user must supply correct LINUX_VERSION_CODE
in attr.kern_version during bpf_prog_load() call.
Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
Reviewed-by: NSteven Rostedt <rostedt@goodmis.org>
Reviewed-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1427312966-8434-4-git-send-email-ast@plumgrid.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

2541517c

tracing: Add kprobe flag · 72cbbc89

由 Alexei Starovoitov 提交于 3月 25, 2015

add TRACE_EVENT_FL_KPROBE flag to differentiate kprobe type of
tracepoints, since bpf programs can only be attached to kprobe
type of PERF_TYPE_TRACEPOINT perf events.
Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
Reviewed-by: NSteven Rostedt <rostedt@goodmis.org>
Reviewed-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1427312966-8434-3-git-send-email-ast@plumgrid.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

72cbbc89

bpf: Make internal bpf API independent of CONFIG_BPF_SYSCALL #ifdefs · 4e537f7f

由 Daniel Borkmann 提交于 3月 25, 2015

Socket filter code and other subsystems with upcoming eBPF
support should not need to deal with the fact that we have
CONFIG_BPF_SYSCALL defined or not.

Having the bpf syscall as a config option is a nice thing and
I'd expect it to stay that way for expert users (I presume one
day the default setting of it might change, though), but code
making use of it should not care if it's actually enabled or
not.

Instead, hide this via header files and let the rest deal with it.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
Reviewed-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1427312966-8434-2-git-send-email-ast@plumgrid.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

4e537f7f

27 3月, 2015 5 次提交

perf: Add per event clockid support · 34f43927

由 Peter Zijlstra 提交于 2月 20, 2015

While thinking on the whole clock discussion it occurred to me we have
two distinct uses of time:

 1) the tracking of event/ctx/cgroup enabled/running/stopped times
    which includes the self-monitoring support in struct
    perf_event_mmap_page.

 2) the actual timestamps visible in the data records.

And we've been conflating them.

The first is all about tracking time deltas, nobody should really care
in what time base that happens, its all relative information, as long
as its internally consistent it works.

The second however is what people are worried about when having to
merge their data with external sources. And here we have the
discussion on MONOTONIC vs MONOTONIC_RAW etc..

Where MONOTONIC is good for correlating between machines (static
offset), MONOTNIC_RAW is required for correlating against a fixed rate
hardware clock.

This means configurability; now 1) makes that hard because it needs to
be internally consistent across groups of unrelated events; which is
why we had to have a global perf_clock().

However, for 2) it doesn't really matter, perf itself doesn't care
what it writes into the buffer.

The below patch makes the distinction between these two cases by
adding perf_event_clock() which is used for the second case. It
further makes this configurable on a per-event basis, but adds a few
sanity checks such that we cannot combine events with different clocks
in confusing ways.

And since we then have per-event configurability we might as well
retain the 'legacy' behaviour as a default.
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: NIngo Molnar <mingo@kernel.org>

34f43927

time: Add ktime_get_tai_ns() · fe5fba05

由 Peter Zijlstra 提交于 3月 17, 2015

Because it was the only clock for which we didn't have a _ns()
accessor yet.
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: NIngo Molnar <mingo@kernel.org>

fe5fba05

time: Introduce tk_fast_raw · f09cb9a1

由 Peter Zijlstra 提交于 3月 19, 2015

Add the NMI safe CLOCK_MONOTONIC_RAW accessor..
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: NJohn Stultz <john.stultz@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150319093400.562746929@infradead.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

f09cb9a1

time: Add timerkeeper::tkr_raw · 4a4ad80d

由 Peter Zijlstra 提交于 3月 19, 2015

Introduce tkr_raw and make use of it.

  base_raw -> tkr_raw.base
  clock->{mult,shift} -> tkr_raw.{mult.shift}

Kill timekeeping_get_ns_raw() in favour of
timekeeping_get_ns(&tkr_raw), this removes all mono_raw special
casing.

Duplicate the updates to tkr_mono.cycle_last into tkr_raw.cycle_last,
both need the same value.
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: NJohn Stultz <john.stultz@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150319093400.422589590@infradead.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

4a4ad80d

time: Rename timekeeper::tkr to timekeeper::tkr_mono · 876e7881

由 Peter Zijlstra 提交于 3月 19, 2015

In preparation of adding another tkr field, rename this one to
tkr_mono. Also rename tk_read_base::base_mono to tk_read_base::base,
since the structure is not specific to CLOCK_MONOTONIC and the mono
name got added to the tk_read_base instance.

Lots of trivial churn.
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: NJohn Stultz <john.stultz@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150319093400.344679419@infradead.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

876e7881

26 3月, 2015 1 次提交

mm: numa: slow PTE scan rate if migration failures occur · 074c2381

由 Mel Gorman 提交于 3月 25, 2015

Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226

  Across the board the 4.0-rc1 numbers are much slower, and the degradation
  is far worse when using the large memory footprint configs. Perf points
  straight at the cause - this is from 4.0-rc1 on the "-o bhash=101073" config:

   -   56.07%    56.07%  [kernel]            [k] default_send_IPI_mask_sequence_phys
      - default_send_IPI_mask_sequence_phys
         - 99.99% physflat_send_IPI_mask
            - 99.37% native_send_call_func_ipi
                 smp_call_function_many
               - native_flush_tlb_others
                  - 99.85% flush_tlb_page
                       ptep_clear_flush
                       try_to_unmap_one
                       rmap_walk
                       try_to_unmap
                       migrate_pages
                       migrate_misplaced_page
                     - handle_mm_fault
                        - 99.73% __do_page_fault
                             trace_do_page_fault
                             do_async_page_fault
                           + async_page_fault
              0.63% native_send_call_func_single_ipi
                 generic_exec_single
                 smp_call_function_single

This is showing excessive migration activity even though excessive
migrations are meant to get throttled.  Normally, the scan rate is tuned
on a per-task basis depending on the locality of faults.  However, if
migrations fail for any reason then the PTE scanner may scan faster if
the faults continue to be remote.  This means there is higher system CPU
overhead and fault trapping at exactly the time we know that migrations
cannot happen.  This patch tracks when migration failures occur and
slows the PTE scanner.
Signed-off-by: NMel Gorman <mgorman@suse.de>
Reported-by: NDave Chinner <david@fromorbit.com>
Tested-by: NDave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

074c2381

23 3月, 2015 1 次提交

perf: Remove type specific target pointers · 50f16a8b

由 Peter Zijlstra 提交于 3月 05, 2015

The only reason CQM had to use a hard-coded pmu type was so it could use
cqm_target in hw_perf_event.

Do away with the {tp,bp,cqm}_target pointers and provide a non type
specific one.

This allows us to do away with that silly pmu type as well.
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Vince Weaver <vince@deater.net>
Cc: acme@kernel.org
Cc: acme@redhat.com
Cc: hpa@zytor.com
Cc: jolsa@redhat.com
Cc: kanaka.d.juvva@intel.com
Cc: matt.fleming@intel.com
Cc: tglx@linutronix.de
Cc: torvalds@linux-foundation.org
Cc: vikas.shivappa@linux.intel.com
Link: http://lkml.kernel.org/r/20150305211019.GU21418@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>

50f16a8b

20 3月, 2015 3 次提交

target: do not reject FUA CDBs when write cache is enabled but emulate_write_cache is 0 · 9bc6548f

由 Christophe Vu-Brugier 提交于 3月 19, 2015

A check that rejects a CDB with FUA bit set if no write cache is
emulated was added by the following commit:

  fde9f50f target: Add sanity checks for DPO/FUA bit usage

The condition is as follows:

  if (!dev->dev_attrib.emulate_fua_write ||
      !dev->dev_attrib.emulate_write_cache)

However, this check is wrong if the backend device supports WCE but
"emulate_write_cache" is disabled.

This patch uses se_dev_check_wce() (previously named
spc_check_dev_wce) to invoke transport->get_write_cache() if the
device has a write cache or check the "emulate_write_cache" attribute
otherwise.
Reported-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NChristophe Vu-Brugier <cvubrugier@fastmail.fm>
Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>

9bc6548f

regmap: introduce regmap_name to fix syscon regmap trace events · c6b570d9

由 Philipp Zabel 提交于 3月 09, 2015

This patch fixes a NULL pointer dereference when enabling regmap event
tracing in the presence of a syscon regmap, introduced by commit bdb0066d
("mfd: syscon: Decouple syscon interface from platform devices").
That patch introduced syscon regmaps that have their dev field set to NULL.
The regmap trace events expect it to point to a valid struct device and feed
it to dev_name():

  $ echo 1 > /sys/kernel/debug/tracing/events/regmap/enable

  Unable to handle kernel NULL pointer dereference at virtual address 0000002c
  pgd = 80004000
  [0000002c] *pgd=00000000
  Internal error: Oops: 17 [#1] SMP ARM
  Modules linked in: coda videobuf2_vmalloc
  CPU: 0 PID: 304 Comm: kworker/0:2 Not tainted 4.0.0-rc2+ #9197
  Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
  Workqueue: events_freezable thermal_zone_device_check
  task: 9f25a200 ti: 9f1ee000 task.ti: 9f1ee000
  PC is at ftrace_raw_event_regmap_block+0x3c/0xe4
  LR is at _regmap_raw_read+0x1bc/0x1cc
  pc : [<803636e8>]    lr : [<80365f2c>]    psr: 600f0093
  sp : 9f1efd78  ip : 9f1efdb8  fp : 9f1efdb4
  r10: 00000004  r9 : 00000001  r8 : 00000001
  r7 : 00000180  r6 : 00000000  r5 : 9f00e3c0  r4 : 00000003
  r3 : 00000001  r2 : 00000180  r1 : 00000000  r0 : 9f00e3c0
  Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
  Control: 10c5387d  Table: 2d91004a  DAC: 00000015
  Process kworker/0:2 (pid: 304, stack limit = 0x9f1ee210)
  Stack: (0x9f1efd78 to 0x9f1f0000)
  fd60:                                                       9f1efda4 9f1efd88
  fd80: 800708c0 805f9510 80927140 800f0013 9f1fc800 9eb2f490 00000000 00000180
  fda0: 808e3840 00000001 9f1efdfc 9f1efdb8 80365f2c 803636b8 805f8958 800708e0
  fdc0: a00f0013 803636ac 9f16de00 00000180 80927140 9f1fc800 9f1fc800 9f1efe6c
  fde0: 9f1efe6c 9f732400 00000000 00000000 9f1efe1c 9f1efe00 80365f70 80365d7c
  fe00: 80365f3c 9f1fc800 9f1fc800 00000180 9f1efe44 9f1efe20 803656a4 80365f48
  fe20: 9f1fc800 00000180 9f1efe6c 9f1efe6c 9f732400 00000000 9f1efe64 9f1efe48
  fe40: 803657bc 80365634 00000001 9e95f910 9f1fc800 9f1efeb4 9f1efe8c 9f1efe68
  fe60: 80452ac0 80365778 9f1efe8c 9f1efe78 9e93d400 9e93d5e8 9f1efeb4 9f72ef40
  fe80: 9f1efeac 9f1efe90 8044e11c 80452998 8045298c 9e93d608 9e93d400 808e1978
  fea0: 9f1efecc 9f1efeb0 8044fd14 8044e0d0 ffffffff 9f25a200 9e93d608 9e481380
  fec0: 9f1efedc 9f1efed0 8044fde8 8044fcec 9f1eff1c 9f1efee0 80038d50 8044fdd8
  fee0: 9f1ee020 9f72ef40 9e481398 00000000 00000008 9f72ef54 9f1ee020 9f72ef40
  ff00: 9e481398 9e481380 00000008 9f72ef40 9f1eff5c 9f1eff20 80039754 80038bfc
  ff20: 00000000 9e481380 80894100 808e1662 00000000 9e4f2ec0 00000000 9e481380
  ff40: 800396f8 00000000 00000000 00000000 9f1effac 9f1eff60 8003e020 80039704
  ff60: ffffffff 00000000 ffffffff 9e481380 00000000 00000000 9f1eff78 9f1eff78
  ff80: 00000000 00000000 9f1eff88 9f1eff88 9e4f2ec0 8003df30 00000000 00000000
  ffa0: 00000000 9f1effb0 8000eb60 8003df3c 00000000 00000000 00000000 00000000
  ffc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  ffe0: 00000000 00000000 00000000 00000000 00000013 00000000 ffffffff ffffffff
  Backtrace:
  [<803636ac>] (ftrace_raw_event_regmap_block) from [<80365f2c>] (_regmap_raw_read+0x1bc/0x1cc)
   r9:00000001 r8:808e3840 r7:00000180 r6:00000000 r5:9eb2f490 r4:9f1fc800
  [<80365d70>] (_regmap_raw_read) from [<80365f70>] (_regmap_bus_read+0x34/0x6c)
   r10:00000000 r9:00000000 r8:9f732400 r7:9f1efe6c r6:9f1efe6c r5:9f1fc800
   r4:9f1fc800
  [<80365f3c>] (_regmap_bus_read) from [<803656a4>] (_regmap_read+0x7c/0x144)
   r6:00000180 r5:9f1fc800 r4:9f1fc800 r3:80365f3c
  [<80365628>] (_regmap_read) from [<803657bc>] (regmap_read+0x50/0x70)
   r9:00000000 r8:9f732400 r7:9f1efe6c r6:9f1efe6c r5:00000180 r4:9f1fc800
  [<8036576c>] (regmap_read) from [<80452ac0>] (imx_get_temp+0x134/0x1a4)
   r6:9f1efeb4 r5:9f1fc800 r4:9e95f910 r3:00000001
  [<8045298c>] (imx_get_temp) from [<8044e11c>] (thermal_zone_get_temp+0x58/0x74)
   r7:9f72ef40 r6:9f1efeb4 r5:9e93d5e8 r4:9e93d400
  [<8044e0c4>] (thermal_zone_get_temp) from [<8044fd14>] (thermal_zone_device_update+0x34/0xec)
   r6:808e1978 r5:9e93d400 r4:9e93d608 r3:8045298c
  [<8044fce0>] (thermal_zone_device_update) from [<8044fde8>] (thermal_zone_device_check+0x1c/0x20)
   r5:9e481380 r4:9e93d608
  [<8044fdcc>] (thermal_zone_device_check) from [<80038d50>] (process_one_work+0x160/0x3d4)
  [<80038bf0>] (process_one_work) from [<80039754>] (worker_thread+0x5c/0x4f4)
   r10:9f72ef40 r9:00000008 r8:9e481380 r7:9e481398 r6:9f72ef40 r5:9f1ee020
   r4:9f72ef54
  [<800396f8>] (worker_thread) from [<8003e020>] (kthread+0xf0/0x108)
   r10:00000000 r9:00000000 r8:00000000 r7:800396f8 r6:9e481380 r5:00000000
   r4:9e4f2ec0
  [<8003df30>] (kthread) from [<8000eb60>] (ret_from_fork+0x14/0x34)
   r7:00000000 r6:00000000 r5:8003df30 r4:9e4f2ec0
  Code: e3140040 1a00001a e3140020 1a000016 (e596002c)
  ---[ end trace 193c15c2494ec960 ]---

Fixes: bdb0066d (mfd: syscon: Decouple syscon interface from platform devices)
Signed-off-by: NPhilipp Zabel <p.zabel@pengutronix.de>
Signed-off-by: NMark Brown <broonie@kernel.org>
Cc: stable@vger.kernel.org

c6b570d9

ata: Add a new flag to destinguish sas controller · 5067c046

由 Shaohua Li 提交于 3月 12, 2015

SAS controller has its own tag allocation, which doesn't directly match to ATA
tag, so SAS and SATA have different code path for ata tags. Originally we use
port->scsi_host (98bd4be1) to destinguish SAS controller, but libsas set
->scsi_host too, so we can't use it for the destinguish, we add a new flag for
this purpose.

Without this patch, the following oops can happen because scsi-mq uses
a host-wide tag map shared among all devices with some integer tag
values >= ATA_MAX_QUEUE.  These unexpectedly high tag values cause
__ata_qc_from_tag() to return NULL, which is then dereferenced in
ata_qc_new_init().

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
  IP: [<ffffffff804fd46e>] ata_qc_new_init+0x3e/0x120
  PGD 32adf0067 PUD 32adf1067 PMD 0
  Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
  Modules linked in: iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi igb
  i2c_algo_bit ptp pps_core pm80xx libsas scsi_transport_sas sg coretemp
  eeprom w83795 i2c_i801
  CPU: 4 PID: 1450 Comm: cydiskbench Not tainted 4.0.0-rc3 #1
  Hardware name: Supermicro X8DTH-i/6/iF/6F/X8DTH, BIOS 2.1b       05/04/12
  task: ffff8800ba86d500 ti: ffff88032a064000 task.ti: ffff88032a064000
  RIP: 0010:[<ffffffff804fd46e>]  [<ffffffff804fd46e>] ata_qc_new_init+0x3e/0x120
  RSP: 0018:ffff88032a067858  EFLAGS: 00010046
  RAX: 0000000000000000 RBX: ffff8800ba0d2230 RCX: 000000000000002a
  RDX: ffffffff80505ae0 RSI: 0000000000000020 RDI: ffff8800ba0d2230
  RBP: ffff88032a067868 R08: 0000000000000201 R09: 0000000000000001
  R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800ba0d0000
  R13: ffff8800ba0d2230 R14: ffffffff80505ae0 R15: ffff8800ba0d0000
  FS:  0000000041223950(0063) GS:ffff88033e480000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  CR2: 0000000000000058 CR3: 000000032a0a3000 CR4: 00000000000006e0
  Stack:
   ffff880329eee758 ffff880329eee758 ffff88032a0678a8 ffffffff80502dad
   ffff8800ba167978 ffff880329eee758 ffff88032bf9c520 ffff8800ba167978
   ffff88032bf9c520 ffff88032bf9a290 ffff88032a0678b8 ffffffff80506909
  Call Trace:
   [<ffffffff80502dad>] ata_scsi_translate+0x3d/0x1b0
   [<ffffffff80506909>] ata_sas_queuecmd+0x149/0x2a0
   [<ffffffffa0046650>] sas_queuecommand+0xa0/0x1f0 [libsas]
   [<ffffffff804ea544>] scsi_dispatch_cmd+0xd4/0x1a0
   [<ffffffff804eb50f>] scsi_queue_rq+0x66f/0x7f0
   [<ffffffff803e5098>] __blk_mq_run_hw_queue+0x208/0x3f0
   [<ffffffff803e54b8>] blk_mq_run_hw_queue+0x88/0xc0
   [<ffffffff803e5c74>] blk_mq_insert_request+0xc4/0x130
   [<ffffffff803e0b63>] blk_execute_rq_nowait+0x73/0x160
   [<ffffffffa0023fca>] sg_common_write+0x3da/0x720 [sg]
   [<ffffffffa0025100>] sg_new_write+0x250/0x360 [sg]
   [<ffffffffa0025feb>] sg_write+0x13b/0x450 [sg]
   [<ffffffff8032ec91>] vfs_write+0xd1/0x1b0
   [<ffffffff8032ee54>] SyS_write+0x54/0xc0
   [<ffffffff80689932>] system_call_fastpath+0x12/0x17

tj: updated description.

Fixes: 12cb5ce1 ("libata: use blk taging")
Reported-and-tested-by: NTony Battersby <tonyb@cybernetics.com>
Signed-off-by: NShaohua Li <shli@fb.com>
Signed-off-by: NTejun Heo <tj@kernel.org>

5067c046

19 3月, 2015 1 次提交

netfilter: restore rule tracing via nfnetlink_log · 4017a7ee

由 Pablo Neira Ayuso 提交于 3月 02, 2015

Since fab4085f ("netfilter: log: nf_log_packet() as real unified
interface"), the loginfo structure that is passed to nf_log_packet() is
used to explicitly indicate the logger type you want to use.

This is a problem for people tracing rules through nfnetlink_log since
packets are always routed to the NF_LOG_TYPE logger after the
aforementioned patch.

We can fix this by removing the trace loginfo structures, but that still
changes the log level from 4 to 5 for tracing messages and there may be
someone relying on this outthere. So let's just introduce a new
nf_log_trace() function that restores the former behaviour.
Reported-by: NMarkus Kötter <koetter@rrzn.uni-hannover.de>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

4017a7ee

18 3月, 2015 2 次提交

regulator: Fix documentation for regmap in the config · cf39284d

由 Axel Lin 提交于 3月 18, 2015

dev_get_regulator() does not exist, fix the typo.
Signed-off-by: NAxel Lin <axel.lin@ingics.com>
Signed-off-by: NMark Brown <broonie@kernel.org>

cf39284d

netdevice.h: fix ndo_bridge_* comments · ad41faa8

由 Nicolas Dichtel 提交于 3月 17, 2015

The argument 'flags' was missing in ndo_bridge_setlink().
ndo_bridge_dellink() was missing.

Fixes: 407af329 ("bridge: Add netlink interface to configure vlans on bridge ports")
Fixes: add511b3 ("bridge: add flags argument to ndo_bridge_setlink and ndo_bridge_dellink")
CC: Vlad Yasevich <vyasevic@redhat.com>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ad41faa8

17 3月, 2015 2 次提交

regulator: palmas: Correct TPS659038 register definition for REGEN2 · e03826d5

由 Keerthy 提交于 3月 17, 2015

The register offset for REGEN2_CTRL in different for TPS659038 chip as when
compared with other Palmas family PMICs. In the case of TPS659038 the wrong
offset pointed to PLLEN_CTRL and was causing a hang. Correcting the same.
Signed-off-by: NKeerthy <j-keerthy@ti.com>
Signed-off-by: NMark Brown <broonie@kernel.org>
Cc: stable@vger.kernel.org

e03826d5

livepatch: Fix subtle race with coming and going modules · 8cb2c2dc

由 Petr Mladek 提交于 3月 12, 2015

There is a notifier that handles live patches for coming and going modules.
It takes klp_mutex lock to avoid races with coming and going patches but
it does not keep the lock all the time. Therefore the following races are
possible:

  1. The notifier is called sometime in STATE_MODULE_COMING. The module
     is visible by find_module() in this state all the time. It means that
     new patch can be registered and enabled even before the notifier is
     called. It might create wrong order of stacked patches, see below
     for an example.

   2. New patch could still see the module in the GOING state even after
      the notifier has been called. It will try to initialize the related
      object structures but the module could disappear at any time. There
      will stay mess in the structures. It might even cause an invalid
      memory access.

This patch solves the problem by adding a boolean variable into struct module.
The value is true after the coming and before the going handler is called.
New patches need to be applied when the value is true and they need to ignore
the module when the value is false.

Note that we need to know state of all modules on the system. The races are
related to new patches. Therefore we do not know what modules will get
patched.

Also note that we could not simply ignore going modules. The code from the
module could be called even in the GOING state until mod->exit() finishes.
If we start supporting patches with semantic changes between function
calls, we need to apply new patches to any still usable code.
See below for an example.

Finally note that the patch solves only the situation when a new patch is
registered. There are no such problems when the patch is being removed.
It does not matter who disable the patch first, whether the normal
disable_patch() or the module notifier. There is nothing to do
once the patch is disabled.

Alternative solutions:
======================

+ reject new patches when a patched module is coming or going; this is ugly

+ wait with adding new patch until the module leaves the COMING and GOING
  states; this might be dangerous and complicated; we would need to release
  kgr_lock in the middle of the patch registration to avoid a deadlock
  with the coming and going handlers; also we might need a waitqueue for
  each module which seems to be even bigger overhead than the boolean

+ stop modules from entering COMING and GOING states; wait until modules
  leave these states when they are already there; looks complicated; we would
  need to ignore the module that asked to stop the others to avoid a deadlock;
  also it is unclear what to do when two modules asked to stop others and
  both are in COMING state (situation when two new patches are applied)

+ always register/enable new patches and fix up the potential mess (registered
  patches order) in klp_module_init(); this is nasty and prone to regressions
  in the future development

+ add another MODULE_STATE where the kallsyms are visible but the module is not
  used yet; this looks too complex; the module states are checked on "many"
  locations

Example of patch stacking breakage:
===================================

The notifier could _not_ _simply_ ignore already initialized module objects.
For example, let's have three patches (P1, P2, P3) for functions a() and b()
where a() is from vmcore and b() is from a module M. Something like:

	a()	b()
P1	a1()	b1()
P2	a2()	b2()
P3	a3()	b3(3)

If you load the module M after all patches are registered and enabled.
The ftrace ops for function a() and b() has listed the functions in this
order:

	ops_a->func_stack -> list(a3,a2,a1)
	ops_b->func_stack -> list(b3,b2,b1)

, so the pointer to b3() is the first and will be used.

Then you might have the following scenario. Let's start with state when patches
P1 and P2 are registered and enabled but the module M is not loaded. Then ftrace
ops for b() does not exist. Then we get into the following race:

CPU0					CPU1

load_module(M)

  complete_formation()

  mod->state = MODULE_STATE_COMING;
  mutex_unlock(&module_mutex);

					klp_register_patch(P3);
					klp_enable_patch(P3);

					# STATE 1

  klp_module_notify(M)
    klp_module_notify_coming(P1);
    klp_module_notify_coming(P2);
    klp_module_notify_coming(P3);

					# STATE 2

The ftrace ops for a() and b() then looks:

  STATE1:

	ops_a->func_stack -> list(a3,a2,a1);
	ops_b->func_stack -> list(b3);

  STATE2:
	ops_a->func_stack -> list(a3,a2,a1);
	ops_b->func_stack -> list(b2,b1,b3);

therefore, b2() is used for the module but a3() is used for vmcore
because they were the last added.

Example of the race with going modules:
=======================================

CPU0					CPU1

delete_module()  #SYSCALL

   try_stop_module()
     mod->state = MODULE_STATE_GOING;

   mutex_unlock(&module_mutex);

					klp_register_patch()
					klp_enable_patch()

					#save place to switch universe

					b()     # from module that is going
					  a()   # from core (patched)

   mod->exit();

Note that the function b() can be called until we call mod->exit().

If we do not apply patch against b() because it is in MODULE_STATE_GOING,
it will call patched a() with modified semantic and things might get wrong.

[jpoimboe@redhat.com: use one boolean instead of two]
Signed-off-by: NPetr Mladek <pmladek@suse.cz>
Acked-by: NJosh Poimboeuf <jpoimboe@redhat.com>
Acked-by: NRusty Russell <rusty@rustcorp.com.au>
Signed-off-by: NJiri Kosina <jkosina@suse.cz>

8cb2c2dc

14 3月, 2015 2 次提交

arm/arm64: KVM: Keep elrsr/aisr in sync with software model · ae705930

由 Christoffer Dall 提交于 3月 13, 2015

There is an interesting bug in the vgic code, which manifests itself
when the KVM run loop has a signal pending or needs a vmid generation
rollover after having disabled interrupts but before actually switching
to the guest.

In this case, we flush the vgic as usual, but we sync back the vgic
state and exit to userspace before entering the guest.  The consequence
is that we will be syncing the list registers back to the software model
using the GICH_ELRSR and GICH_EISR from the last execution of the guest,
potentially overwriting a list register containing an interrupt.

This showed up during migration testing where we would capture a state
where the VM has masked the arch timer but there were no interrupts,
resulting in a hung test.

Cc: Marc Zyngier <marc.zyngier@arm.com>
Reported-by: NAlex Bennee <alex.bennee@linaro.org>
Signed-off-by: NChristoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: NAlex Bennée <alex.bennee@linaro.org>
Acked-by: NMarc Zyngier <marc.zyngier@arm.com>
Signed-off-by: NChristoffer Dall <christoffer.dall@linaro.org>

ae705930

vxlan: fix wrong usage of VXLAN_VID_MASK · 40fb70f3

由 Alexey Kodanev 提交于 3月 13, 2015

commit dfd8645e wrongly assumes that VXLAN_VDI_MASK includes
eight lower order reserved bits of VNI field that are using for remote
checksum offload.

Right now, when VNI number greater then 0xffff, vxlan_udp_encap_recv()
will always return with 'bad_flag' error, reducing the usable vni range
from 0..16777215 to 0..65535. Also, it doesn't really check whether RCO
bits processed or not.

Fix it by adding new VNI mask which has all 32 bits of VNI field:
24 bits for id and 8 bits for other usage.
Signed-off-by: NAlexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

40fb70f3

13 3月, 2015 6 次提交

of/platform: Fix sparc:allmodconfig build · a697c2ef

由 Guenter Roeck 提交于 3月 10, 2015

sparc:allmodconfig fails to build with:

drivers/built-in.o: In function `platform_bus_init':
(.init.text+0x3684): undefined reference to `of_platform_register_reconfig_notifier'

of_platform_register_reconfig_notifier is only declared if both OF_ADDRESS
and OF_DYNAMIC are configured. Yet, the include file only declares a dummy
function if OF_DYNAMIC is not configured. The sparc architecture does not
configure OF_ADDRESS, but does configure OF_DYNAMIC, causing above error.

Fixes: 801d728c ("of/reconfig: Add OF_DYNAMIC notifier for platform_bus_type")
Cc: Pantelis Antoniou <pantelis.antoniou@konsulko.com>
Signed-off-by: NGuenter Roeck <linux@roeck-us.net>
Signed-off-by: NRob Herring <robh@kernel.org>

a697c2ef

clocksource: Rename __clocksource_updatefreq_*() to __clocksource_update_freq_*() · fba9e072

由 John Stultz 提交于 3月 11, 2015

Ingo requested this function be renamed to improve readability,
so I've renamed __clocksource_updatefreq_scale() as well as the
__clocksource_updatefreq_hz/khz() functions to avoid
squishedtogethernames.

This touches some of the sh clocksources, which I've not tested.

The arch/arm/plat-omap change is just a comment change for
consistency.
Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1426133800-29329-13-git-send-email-john.stultz@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

fba9e072

clocksource: Mostly kill clocksource_register() · f8935983

由 John Stultz 提交于 3月 11, 2015

A long running project has been to clean up remaining uses
of clocksource_register(), replacing it with the simpler
clocksource_register_khz/hz() functions.

However, there are a few cases where we need to self-define
our mult/shift values, so switch the function to a more
obviously internal __clocksource_register() name, and
consolidate much of the internal logic so we don't have
duplication.
Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: David S. Miller <davem@davemloft.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1426133800-29329-10-git-send-email-john.stultz@linaro.org
[ Minor cleanups. ]
Signed-off-by: NIngo Molnar <mingo@kernel.org>

f8935983

uapi/virtio_scsi: allow overriding CDB/SENSE size · a4994b81

由 Michael S. Tsirkin 提交于 3月 13, 2015

QEMU wants to use virtio scsi structures with
a different VIRTIO_SCSI_CDB_SIZE/VIRTIO_SCSI_SENSE_SIZE,
let's add ifdefs to allow overriding them.

Keep the old defines under new names:
VIRTIO_SCSI_CDB_DEFAULT_SIZE/VIRTIO_SCSI_SENSE_DEFAULT_SIZE,
since that's what these values really are:
defaults for cdb/sense size fields.
Suggested-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>

a4994b81

kasan, module: move MODULE_ALIGN macro into <linux/moduleloader.h> · d3733e5c

由 Andrey Ryabinin 提交于 3月 12, 2015

include/linux/moduleloader.h is more suitable place for this macro.
Also change alignment to PAGE_SIZE for CONFIG_KASAN=n as such
alignment already assumed in several places.
Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Acked-by: NRusty Russell <rusty@rustcorp.com.au>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d3733e5c

kasan, module, vmalloc: rework shadow allocation for modules · a5af5aa8

由 Andrey Ryabinin 提交于 3月 12, 2015

Current approach in handling shadow memory for modules is broken.

Shadow memory could be freed only after memory shadow corresponds it is no
longer used.  vfree() called from interrupt context could use memory its
freeing to store 'struct llist_node' in it:

    void vfree(const void *addr)
    {
    ...
        if (unlikely(in_interrupt())) {
            struct vfree_deferred *p = this_cpu_ptr(&vfree_deferred);
            if (llist_add((struct llist_node *)addr, &p->list))
                    schedule_work(&p->wq);

Later this list node used in free_work() which actually frees memory.
Currently module_memfree() called in interrupt context will free shadow
before freeing module's memory which could provoke kernel crash.

So shadow memory should be freed after module's memory.  However, such
deallocation order could race with kasan_module_alloc() in module_alloc().

Free shadow right before releasing vm area.  At this point vfree()'d
memory is not used anymore and yet not available for other allocations.
New VM_KASAN flag used to indicate that vm area has dynamically allocated
shadow memory so kasan frees shadow only if it was previously allocated.
Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
Acked-by: NRusty Russell <rusty@rustcorp.com.au>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a5af5aa8

12 3月, 2015 3 次提交

clocksource: Add 'max_cycles' to 'struct clocksource' · fb82fe2f

由 John Stultz 提交于 3月 11, 2015

In order to facilitate clocksource validation, add a
'max_cycles' field to the clocksource structure which
will hold the maximum cycle value that can safely be
multiplied without potentially causing an overflow.
Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1426133800-29329-4-git-send-email-john.stultz@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

fb82fe2f

xps: must clear sender_cpu before forwarding · c29390c6

由 Eric Dumazet 提交于 3月 11, 2015

John reported that my previous commit added a regression
on his router.

This is because sender_cpu & napi_id share a common location,
so get_xps_queue() can see garbage and perform an out of bound access.

We need to make sure sender_cpu is cleared before doing the transmit,
otherwise any NIC busy poll enabled (skb_mark_napi_id()) can trigger
this bug.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: NJohn <jw@nuclearfallout.net>
Bisected-by: NJohn <jw@nuclearfallout.net>
Fixes: 2bd82484 ("xps: fix xps for stacked devices")
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c29390c6

clk: introduce clk_is_match · 3d3801ef

由 Michael Turquette 提交于 2月 25, 2015

Some drivers compare struct clk pointers as a means of knowing
if the two pointers reference the same clock hardware. This behavior is
dubious (drivers must not dereference struct clk), but did not cause any
regressions until the per-user struct clk patch was merged. Now the test
for matching clk's will always fail with per-user struct clk's.

clk_is_match is introduced to fix the regression and prevent drivers
from comparing the pointers manually.

Fixes: 035a61c3 ("clk: Make clk API return per-user struct clk instances")
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Shawn Guo <shawn.guo@linaro.org>
Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
Signed-off-by: NMichael Turquette <mturquette@linaro.org>
[arnd@arndb.de: Fix COMMON_CLK=N && HAS_CLK=Y config]
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
[sboyd@codeaurora.org: const arguments to clk_is_match() and
remove unnecessary ternary operation]
Signed-off-by: NStephen Boyd <sboyd@codeaurora.org>

3d3801ef

openanolis / cloud-kernel 接近 2 年 前同步成功

openanolis / cloud-kernel
接近 2 年前同步成功