1. 11 Feb, 2017: 4 commits
  2. 07 Feb, 2017: 2 commits
  3. 26 Jan, 2017: 2 commits
  4. 25 Jan, 2017: 2 commits
    • bpf: enable verifier to better track const alu ops · 3fadc801
      Committed by Daniel Borkmann
      William reported a couple of issues in relation to direct packet
      access. The typical scheme is to check for data + [off] <= data_end,
      where [off] can either be an immediate or come from a tracked
      register that contains an immediate; depending on the branch, we
      can then access the data. However, when [off] is calculated in a
      more "complex" way, either for the mentioned test itself or for an
      access after the test, the verifier stops tracking the CONST_IMM
      marked register and marks it as an UNKNOWN_VALUE one.
      
      When that UNKNOWN_VALUE typed register is added to a pkt() marked
      register, the verifier bails out in check_packet_ptr_add() as it
      finds the register's imm value below 48. In the first example
      below, that is due to evaluate_reg_imm_alu() not handling right
      shifts and thus marking the register as UNKNOWN_VALUE via the
      helper __mark_reg_unknown_value(), which resets imm to 0.
      
      In the second case the same happens at the time when r4 is set
      to r4 &= r5, where it transitions to UNKNOWN_VALUE from
      evaluate_reg_imm_alu(). Later on, r4 is shifted right by 3 inside
      evaluate_reg_alu(), where the register's imm turns into 3. That
      is, for registers with type UNKNOWN_VALUE, an imm of 0 means that
      we don't know what value the register has, and imm > 0 means that
      the value has [imm] upper zero bits. For example, when shifting
      an UNKNOWN_VALUE register by 3 to the right, no matter what value
      it had, we know that the 3 uppermost bits must now be zero. This
      is to make sure that ALU operations with unknown registers don't
      overflow. Meaning, once we know that we have more than 48 upper
      zero bits, or, in other words, cannot go beyond a 0xffff offset
      with ALU ops, such an addition will track the target register as
      a new pkt() register with a new id, but 0 offset and 0 range, so
      a new data/data_end test will be required for it. If the source
      register is a CONST_IMM one that is to be added to the pkt()
      register, or the source instruction is an add instruction with an
      immediate value, then it will get added if it stays within the
      max 0xffff bounds. From there, the pkt() type can be accessed
      should reg->off + imm be within the access range of pkt().
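
      As a hedged illustration (not the reporter's reproducer; the field
      names and the offset computation are made up, clang may constant-fold
      such arithmetic, and the actual reproducers operate at the
      instruction level as in the logs below; the usual SEC() macro and
      uapi types from the selftests headers are assumed), the pattern
      boils down to a program that derives the packet offset through ALU
      ops such as add, and, and right shift before re-checking against
      data_end:

      SEC("xdp")
      int xdp_prog(struct xdp_md *ctx)
      {
      	void *data     = (void *)(long)ctx->data;
      	void *data_end = (void *)(long)ctx->data_end;
      	__u64 off;

      	if (data + 1 > data_end)	/* initial data/data_end test */
      		return XDP_DROP;

      	/* offset derived via ALU ops, mirroring insns 80-84 below */
      	off = *(__u8 *)data;
      	off = ((off + 71) & 0xfffffff8) >> 3;

      	if (data + off + 1 > data_end)	/* new test for the new pkt id */
      		return XDP_DROP;

      	return *(__u8 *)(data + off) ? XDP_PASS : XDP_DROP;
      }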
      
        [...]
        from 28 to 30: R0=imm1,min_value=1,max_value=1
          R1=pkt(id=0,off=0,r=22) R2=pkt_end
          R3=imm144,min_value=144,max_value=144
          R4=imm0,min_value=0,max_value=0
          R5=inv48,min_value=2054,max_value=2054 R10=fp
        30: (bf) r5 = r3
        31: (07) r5 += 23
        32: (77) r5 >>= 3
        33: (bf) r6 = r1
        34: (0f) r6 += r5
        cannot add integer value with 0 upper zero bits to ptr_to_packet
      
        [...]
        from 52 to 80: R0=imm1,min_value=1,max_value=1
          R1=pkt(id=0,off=0,r=34) R2=pkt_end R3=inv
          R4=imm272 R5=inv56,min_value=17,max_value=17
          R6=pkt(id=0,off=26,r=34) R10=fp
        80: (07) r4 += 71
        81: (18) r5 = 0xfffffff8
        83: (5f) r4 &= r5
        84: (77) r4 >>= 3
        85: (0f) r1 += r4
        cannot add integer value with 3 upper zero bits to ptr_to_packet
      
      Thus, to get the above use cases working, evaluate_reg_imm_alu()
      has been extended to handle further ALU ops. This is fine, because
      we operate strictly within the realm of CONST_IMM types, so we
      don't care about overflows here, as they will happen in the
      simulated as well as the real execution, and the interaction with
      pkt() in check_packet_ptr_add() will check the actual imm value
      once it is added to pkt(), but it's irrelevant before that.
      
      With regard to 06c1c049 ("bpf: allow helpers access to variable
      memory"), which works on UNKNOWN_VALUE registers, the verifier now
      becomes a bit smarter as it can better resolve ALU ops, so two
      test cases there need to be adapted, as min/max bound tracking
      only becomes necessary when registers were spilled to the stack.
      So while a mask was set before to track the upper bound for the
      UNKNOWN_VALUE case, it is now resolved directly as CONST_IMM, and
      such constructs are only necessary when, e.g., registers are
      spilled.
      
      For commit 6b173873 ("bpf: recognize 64bit immediate loads as
      consts"), which initially enabled dw load tracking only for the
      nfp jit/analyzer, I did a couple of tests on large, complex
      programs and we don't increase complexity badly (my tests were in
      the ~3% range on average). I've added a couple of tests similar
      to the affected code above, and it now works fine with the
      verifier.
      Reported-by: William Tu <u9012063@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Gianluca Borello <g.borello@gmail.com>
      Cc: William Tu <u9012063@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add prog tag test case to bpf selftests · 62b64660
      Committed by Daniel Borkmann
      Add the test case used to compare the results from fdinfo with
      af_alg's output on the tag. Tests range from minimum to maximum
      sized programs, with and without maps included.
      
        # ./test_tag
        test_tag: OK (40945 tests)
      
      Tested on x86_64 and s390x.
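
      A minimal sketch of what the comparison amounts to (not the
      selftest itself; prog_fd and the raw instruction image are assumed
      to come from an already loaded program, and for programs with maps
      the kernel hashes a rewritten image with map references masked,
      which this sketch ignores): read the tag the kernel reports via
      fdinfo and recompute it as the truncated SHA-1 of the instructions
      over an AF_ALG hash socket.

      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>
      #include <sys/socket.h>
      #include <linux/if_alg.h>

      static void fdinfo_tag(int prog_fd, char *tag)
      {
      	char path[64], line[256];
      	FILE *f;

      	snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", prog_fd);
      	f = fopen(path, "r");
      	while (f && fgets(line, sizeof(line), f))
      		if (sscanf(line, "prog_tag:\t%s", tag) == 1)
      			break;
      	if (f)
      		fclose(f);
      }

      static void alg_sha1(const void *data, size_t size, unsigned char *out)
      {
      	struct sockaddr_alg sa = {
      		.salg_family = AF_ALG,
      		.salg_type   = "hash",
      		.salg_name   = "sha1",
      	};
      	int tfm = socket(AF_ALG, SOCK_SEQPACKET, 0), op;

      	bind(tfm, (struct sockaddr *)&sa, sizeof(sa));
      	op = accept(tfm, NULL, NULL);
      	write(op, data, size);	/* hash the raw insn image */
      	read(op, out, 20);	/* the prog tag is the first 8 digest bytes */
      	close(op);
      	close(tfm);
      }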
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 24 Jan, 2017: 1 commit
  6. 20 Jan, 2017: 2 commits
  7. 19 Jan, 2017: 1 commit
  8. 18 Jan, 2017: 2 commits
  9. 17 Jan, 2017: 3 commits
    • perf probe: Fix to probe on gcc generated functions in modules · 613f050d
      Committed by Masami Hiramatsu
      Fix probing on gcc-generated functions in modules. Since probing
      on a module is based on its symbol name, it should be adjusted to
      the actual symbols.
      
      E.g. without this fix, perf probe shows a probe definition on a
      non-existent symbol, as below.
      
        $ perf probe -m build-x86_64/net/netfilter/nf_nat.ko -F in_range*
        in_range.isra.12
        $ perf probe -m build-x86_64/net/netfilter/nf_nat.ko -D in_range
        p:probe/in_range nf_nat:in_range+0
      
      With this fix, perf probe correctly shows a probe on the
      gcc-generated symbol.
      
        $ perf probe -m build-x86_64/net/netfilter/nf_nat.ko -D in_range
        p:probe/in_range nf_nat:in_range.isra.12+0
      
      This also fixes the same problem on an online module, as below.
      
        $ perf probe -m i915 -D assert_plane
        p:probe/assert_plane i915:assert_plane.constprop.134+0
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/148411450673.9978.14905987549651656075.stgit@devbox
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf probe: Add error checks to offline probe post-processing · 3e96dac7
      Committed by Masami Hiramatsu
      Add error checks to the post-processing and improve it for offline
      probe events:
      
       - Post-processing fails if no matched symbol is found in the map
         (-ENOENT) or strdup() fails (-ENOMEM).
      
       - Even if the symbol name is the same, it updates the symbol
         address and offset.
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/148411443738.9978.4617979132625405545.stgit@devbox
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf probe: Fix to show correct locations for events on modules · d2d4edbe
      Committed by Masami Hiramatsu
      Fix showing correct locations for events on modules by relocating
      the given address instead of retrying after failure.
      
      This happens when the module text size is big enough, i.e. bigger
      than sh_addr, because the original code retries with the given
      address + sh_addr if it fails to find a CU DIE at the given
      address.
      
      Any address smaller than sh_addr always fails, and it retries with
      the correct address, but addresses bigger than sh_addr will get a
      CU DIE which is at the given address (not adjusted by sh_addr).
      
      In my environment (x86-64), the sh_addr of the ".text" section is
      0x10030. Since i915 is a huge kernel module, we can see this issue
      as below.
      
        $ grep "[Tt] .*\[i915\]" /proc/kallsyms | sort | head -n1
        ffffffffc0270000 t i915_switcheroo_can_switch	[i915]
      
      ffffffffc0270000 + 0x10030 = ffffffffc0280030, so we'll check
      symbols crossing this boundary.
      
        $ grep "[Tt] .*\[i915\]" /proc/kallsyms | grep -B1 ^ffffffffc028\
        | head -n 2
        ffffffffc027ff80 t haswell_init_clock_gating	[i915]
        ffffffffc0280110 t valleyview_init_clock_gating	[i915]
      
      So set up probes on both functions and see what happens.
      
        $ sudo ./perf probe -m i915 -a haswell_init_clock_gating \
              -a valleyview_init_clock_gating
        Added new events:
          probe:haswell_init_clock_gating (on haswell_init_clock_gating in i915)
          probe:valleyview_init_clock_gating (on valleyview_init_clock_gating in i915)
      
        You can now use it in all perf tools, such as:
      
        	perf record -e probe:valleyview_init_clock_gating -aR sleep 1
      
        $ sudo ./perf probe -l
          probe:haswell_init_clock_gating (on haswell_init_clock_gating@gpu/drm/i915/intel_pm.c in i915)
          probe:valleyview_init_clock_gating (on i915_vga_set_decode:4@gpu/drm/i915/i915_drv.c in i915)
      
      As you can see, haswell_init_clock_gating is correctly shown,
      but valleyview_init_clock_gating is not.
      
      With this patch, both events are shown correctly.
      
        $ sudo ./perf probe -l
          probe:haswell_init_clock_gating (on haswell_init_clock_gating@gpu/drm/i915/intel_pm.c in i915)
          probe:valleyview_init_clock_gating (on valleyview_init_clock_gating@gpu/drm/i915/intel_pm.c in i915)
      
      Committer notes:
      
      In my case:
      
        # perf probe -m i915 -a haswell_init_clock_gating -a valleyview_init_clock_gating
        Added new events:
          probe:haswell_init_clock_gating (on haswell_init_clock_gating in i915)
          probe:valleyview_init_clock_gating (on valleyview_init_clock_gating in i915)
      
        You can now use it in all perf tools, such as:
      
      	  perf record -e probe:valleyview_init_clock_gating -aR sleep 1
      
        # perf probe -l
          probe:haswell_init_clock_gating (on i915_getparam+432@gpu/drm/i915/i915_drv.c in i915)
          probe:valleyview_init_clock_gating (on __i915_printk+240@gpu/drm/i915/i915_drv.c in i915)
        #
      
        # readelf -SW /lib/modules/4.9.0+/build/vmlinux | egrep -w '.text|Name'
         [Nr] Name   Type      Address          Off    Size   ES Flg Lk Inf Al
         [ 1] .text  PROGBITS  ffffffff81000000 200000 822fd3 00  AX  0   0 4096
        #
      
        So both are b0rked, now with the fix:
      
        # perf probe -m i915 -a haswell_init_clock_gating -a valleyview_init_clock_gating
        Added new events:
          probe:haswell_init_clock_gating (on haswell_init_clock_gating in i915)
          probe:valleyview_init_clock_gating (on valleyview_init_clock_gating in i915)
      
        You can now use it in all perf tools, such as:
      
      	perf record -e probe:valleyview_init_clock_gating -aR sleep 1
      
        # perf probe -l
          probe:haswell_init_clock_gating (on haswell_init_clock_gating@gpu/drm/i915/intel_pm.c in i915)
          probe:valleyview_init_clock_gating (on valleyview_init_clock_gating@gpu/drm/i915/intel_pm.c in i915)
        #
      
      Both look correct.
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/148411436777.9978.1440275861947194930.stgit@devbox
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
  10. 12 Jan, 2017: 2 commits
  11. 11 Jan, 2017: 1 commit
  12. 10 Jan, 2017: 3 commits
    • bpf: allow helpers access to variable memory · 06c1c049
      Committed by Gianluca Borello
      Currently, helpers that read and write from/to the stack can do so using
      a pair of arguments of type ARG_PTR_TO_STACK and ARG_CONST_STACK_SIZE.
      ARG_CONST_STACK_SIZE accepts a constant register of type CONST_IMM, so
      that the verifier can safely check the memory access. However, requiring
      the argument to be a constant can be limiting in some circumstances.
      
      Since the current logic keeps track of the minimum and maximum value of
      a register throughout the simulated execution, ARG_CONST_STACK_SIZE can
      be changed to also accept an UNKNOWN_VALUE register in case its
      boundaries have been set and the range doesn't cause invalid memory
      accesses.
      
      One common situation when this is useful:
      
      int len;
      char buf[BUFSIZE]; /* BUFSIZE is 128 */
      
      if (some_condition)
      	len = 42;
      else
      	len = 84;
      
      some_helper(..., buf, len & (BUFSIZE - 1));
      
      The compiler can often decide to assign the constant values 42 or
      84 to a variable on the stack, instead of keeping it in a register.
      When the variable is then read back from the stack into a register
      in order to
      be passed to the helper, the verifier will not be able to recognize the
      register as constant (the verifier is not currently tracking all
      constant writes into memory), and the program won't be valid.
      
      However, by allowing the helper to accept an UNKNOWN_VALUE register,
      this program will work because the bitwise AND operation will set the
      range of possible values for the UNKNOWN_VALUE register to [0, BUFSIZE),
      so the verifier can guarantee the helper call will be safe (assuming the
      argument is of type ARG_CONST_STACK_SIZE_OR_ZERO, otherwise one more
      check against 0 would be needed). Custom ranges can be set not only with
      ALU operations, but also by explicitly comparing the UNKNOWN_VALUE
      register with constants.
      
      Another very common example happens when intercepting system call
      arguments and accessing user-provided data of variable size using
      bpf_probe_read(). One can load the user-provided length at runtime
      into an UNKNOWN_VALUE register, and then read that exact amount of
      data, up to a compile-time determined limit, in order to fit into
      the proper local storage allocated on the stack, without having to
      guess a suboptimal access size at compile time (see the sketch
      below).
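
      A hedged sketch of that pattern, in the style of the example above
      (the kprobe target, the pt_regs fields used for the arguments and
      the usual SEC()/bpf_probe_read() definitions from bpf_helpers.h
      are assumptions; as noted above, the explicit zero check is only
      needed if the size argument is not ARG_CONST_STACK_SIZE_OR_ZERO):

      #define BUFSIZE 256	/* compile-time limit, power of two */

      SEC("kprobe/sys_write")
      int bpf_sys_write(struct pt_regs *ctx)
      {
      	char buf[BUFSIZE];
      	/* user-provided length, only known at run time */
      	unsigned long len = ctx->dx;

      	len &= BUFSIZE - 1;	/* bound the register to [0, BUFSIZE) */
      	if (!len)
      		return 0;
      	bpf_probe_read(buf, len, (void *)ctx->si);
      	/* consume buf, e.g. via bpf_perf_event_output() */
      	return 0;
      }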
      
      Also, in case a helper accepting the UNKNOWN_VALUE register
      operates in raw mode, disable the raw mode so that the program is
      required to initialize all memory, since there is no guarantee the
      helper will fill it completely, leaving possibilities for a data
      leak (only relevant when the memory used by the helper is the
      stack, not when using a pointer to a map element value or a
      packet). In other words, ARG_PTR_TO_RAW_STACK will be treated as
      ARG_PTR_TO_STACK.
      Signed-off-by: Gianluca Borello <g.borello@gmail.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: allow adjusted map element values to spill · f0318d01
      Committed by Gianluca Borello
      commit 48461135 ("bpf: allow access into map value arrays")
      introduces the ability to do pointer math inside a map element value via
      the PTR_TO_MAP_VALUE_ADJ register type.
      
      The current support doesn't handle the case where a PTR_TO_MAP_VALUE_ADJ
      is spilled into the stack, limiting several use cases, especially when
      generating bpf code from a compiler.
      
      Handle this case by explicitly enabling the register type
      PTR_TO_MAP_VALUE_ADJ to be spilled. Also, make sure that min_value and
      max_value are reset just for BPF_LDX operations that don't result in a
      restore of a spilled register from stack.
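
      A hedged illustration of the kind of code a compiler can emit (the
      map layout, kprobe target and bounds are made up, and the usual
      bpf_helpers.h definitions are assumed): after the bounds check,
      'p + idx' is a PTR_TO_MAP_VALUE_ADJ, and with enough live
      variables clang may spill that adjusted pointer to the stack and
      reload it before the access, which previously made the verifier
      reject the program.

      struct bpf_map_def SEC("maps") scratch = {
      	.type        = BPF_MAP_TYPE_ARRAY,
      	.key_size    = sizeof(u32),
      	.value_size  = 64,
      	.max_entries = 1,
      };

      SEC("kprobe/sys_open")
      int bpf_sys_open(struct pt_regs *ctx)
      {
      	int key = 0;
      	u64 idx = ctx->di;
      	char *p = bpf_map_lookup_elem(&scratch, &key);

      	if (!p)
      		return 0;
      	if (idx >= 64)		/* bound the adjustment */
      		return 0;
      	p += idx;		/* p becomes PTR_TO_MAP_VALUE_ADJ */
      	/* ... more work; the compiler may spill/reload p here ... */
      	return *p;
      }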
      Signed-off-by: Gianluca Borello <g.borello@gmail.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: allow helpers access to map element values · 5722569b
      Committed by Gianluca Borello
      Enable helpers to directly access a map element value by passing a
      register type PTR_TO_MAP_VALUE (or PTR_TO_MAP_VALUE_ADJ) to helper
      arguments ARG_PTR_TO_STACK or ARG_PTR_TO_RAW_STACK.
      
      This enables several use cases. For example, a typical tracing program
      might want to capture pathnames passed to sys_open() with:
      
      struct trace_data {
      	char pathname[PATHLEN];
      };
      
      SEC("kprobe/sys_open")
      void bpf_sys_open(struct pt_regs *ctx)
      {
      	struct trace_data data;
      	bpf_probe_read(data.pathname, sizeof(data.pathname), ctx->di);
      
      	/* consume data.pathname, for example via
      	 * bpf_trace_printk() or bpf_perf_event_output()
      	 */
      }
      
      Such a program could easily hit the stack limit in case PATHLEN needs to
      be large or more local variables need to exist, both of which are quite
      common scenarios. By allowing direct helper access to map element
      values, one could instead do:
      
      struct bpf_map_def SEC("maps") scratch_map = {
      	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
      	.key_size = sizeof(u32),
      	.value_size = sizeof(struct trace_data),
      	.max_entries = 1,
      };
      
      SEC("kprobe/sys_open")
      int bpf_sys_open(struct pt_regs *ctx)
      {
      	int id = 0;
      	struct trace_data *p = bpf_map_lookup_elem(&scratch_map, &id);
      	if (!p)
      		return 0;
      	bpf_probe_read(p->pathname, sizeof(p->pathname), ctx->di);
      
      	/* consume p->pathname, for example via
      	 * bpf_trace_printk() or bpf_perf_event_output()
      	 */
      	return 0;
      }
      
      And it wouldn't risk exhausting the stack.
      
      Code changes are loosely modeled after commit 6841de8b ("bpf: allow
      helpers access the packet directly"). Unlike with PTR_TO_PACKET, these
      changes just work with ARG_PTR_TO_STACK and ARG_PTR_TO_RAW_STACK (not
      ARG_PTR_TO_MAP_KEY, ARG_PTR_TO_MAP_VALUE, ...): adding those would be
      trivial, but since there is not currently a use case for that, it's
      reasonable to limit the set of changes.
      
      Also, add new tests to make sure accesses to map element values
      from helpers never go out of bounds, even when adjusted.
      Signed-off-by: Gianluca Borello <g.borello@gmail.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 06 Jan, 2017: 5 commits
  14. 04 Jan, 2017: 6 commits
  15. 03 Jan, 2017: 4 commits