- 25 May 2019: 3 commits
-
-
By Jiong Wang
After previous patches, the verifier will mark an insn if it really needs zero extension on dst_reg. It is then up to back-ends to decide how to use such information to eliminate unnecessary zero extension code-gen during JIT compilation.

One approach is for the verifier to insert explicit zero extensions for those insns that need them, in a generic way; JIT back-ends then do not generate zero extension for sub-register writes by default. However, only those back-ends which do not have hardware zero extension want this optimization. Back-ends like x86_64 and AArch64 have hardware zero extension support, so the insertion should be disabled for them.

This patch introduces the new target hook "bpf_jit_needs_zext", which returns false by default, meaning verifier zero extension insertion is disabled by default. A back-end can override this hook to return true if it doesn't have hardware support and wants the verifier to insert zero extensions explicitly. Offload targets do not use this native target hook; instead, they can get the optimization results using bpf_prog_offload_ops.finalize.

NOTE: arches can have diversified features; it is possible for one arch to have hardware zero extension support for some sub-register write insns but not for all. For example, PowerPC and SPARC have zero-extended loads, but not for alu32. So when verifier zero extension insertion is enabled, these JIT back-ends need to peephole insns to remove those zero extensions inserted for insns that actually have hardware zero extension support. The peephole can be as simple as looking at the next insn: if it is the special zero extension insn, it is safe to eliminate it when the current insn has hardware zero extension support.

Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
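As a rough sketch of the hook shape described above: a weak default in the core that individual JIT back-ends override with a strong definition (file placement here is assumed for illustration):

```c
#include <linux/types.h>

/* Core default (e.g. in kernel/bpf/core.c): verifier inserts no zext. */
bool __weak bpf_jit_needs_zext(void)
{
	return false;
}

/* A back-end without hardware zero extension, in its own bpf_jit_comp.c,
 * provides a strong definition that overrides the weak default:
 */
bool bpf_jit_needs_zext(void)
{
	return true;
}
```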
-
By Jiong Wang
The encoding for this new variant is based on the BPF_X format. The "imm" field could previously only be 0; now it can also be 1, which means doing zero extension unconditionally:

  .code    = BPF_ALU | BPF_MOV | BPF_X
  .dst_reg = DST
  .src_reg = SRC
  .imm     = 1

We use this new form for doing zero extension, for which the verifier will guarantee SRC == DST.

Implications on JIT back-ends when doing code-gen for BPF_ALU | BPF_MOV | BPF_X:

1. No change if hardware already does zero extension unconditionally for sub-register writes.

2. Otherwise, when seeing imm == 1, just generate insns to clear the high 32 bits. There is no need to generate insns for the move, because when imm == 1, dst_reg is the same as src_reg at that moment.

The interpreter doesn't need changes either; it already does unconditional zero extension for mov32.

One helper macro, BPF_ZEXT_REG, is added to help create a zero extension insn using this new mov32 variant. One helper function, insn_is_zext, is added for checking whether an insn is a zero extension on dst. This will be widely used by a few JIT back-ends in later patches in this set.

Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
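A sketch of what the two helpers could look like, following the encoding above (field names per struct bpf_insn; treat the exact definitions as illustrative rather than verbatim kernel code):

```c
#include <linux/filter.h>

/* Build the special mov32: imm == 1 requests unconditional zero extension;
 * the verifier guarantees SRC == DST, so DST is used for both fields.
 */
#define BPF_ZEXT_REG(DST)					\
	((struct bpf_insn) {					\
		.code    = BPF_ALU | BPF_MOV | BPF_X,		\
		.dst_reg = DST,					\
		.src_reg = DST,					\
		.off     = 0,					\
		.imm     = 1 })

/* True if @insn is the explicit zero extension insn described above. */
static inline bool insn_is_zext(const struct bpf_insn *insn)
{
	return insn->code == (BPF_ALU | BPF_MOV | BPF_X) && insn->imm == 1;
}
```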
-
By Jiong Wang
The eBPF ISA specification requires the high 32 bits to be cleared when a low 32-bit sub-register is written. This applies to the destination register of ALU32 etc. JIT back-ends must guarantee this semantic when doing code-gen. The x86_64 and AArch64 ISAs have the same semantics, so the corresponding JIT back-ends don't need to do extra work.

However, 32-bit arches (arm, x86, nfp etc.) and some other 64-bit arches (PowerPC, SPARC etc.) need to do explicit zero extension to meet this requirement, otherwise code like the following will fail:

  u64_value = (u64) u32_value
  ... other uses of u64_value

This is because the compiler can exploit the semantic described above and omit the zero extensions when extending u32_value to u64_value. These JIT back-ends are expected to guarantee the semantic by inserting extra zero extensions, which however can be a significant increase in code size. Some benchmarks show there can be ~40% sub-register writes out of total insns, meaning at least ~40% extra code-gen.

One observation is that these extra zero extensions are not always necessary. Take the code snippet above for example: it is possible that u32_value is never cast into a u64, in which case the high 32 bits of u32_value can be ignored and the extra zero extension eliminated. This patch implements this idea: insns defining sub-registers are marked when the high 32 bits of the defined sub-register matter. For unmarked insns, it is safe to eliminate the high 32-bit clearance.

Algo:
 - Split read flags into READ32 and READ64.
 - Record the index of the insn that does a sub-register write. Keep the index inside the reg state and update it during verifier insn walking.
 - A full register read on a sub-register marks its definition insn as needing zero extension on the dst register. A new sub-register write overrides the old one.
 - When propagating read64 during path pruning, also mark any insn defining a sub-register that is read in the pruned path as a full-register read.

Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
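A simplified sketch of the bookkeeping in the algo above; the names and types here are hypothetical, not the actual verifier structures:

```c
#define DEF_NOT_SUBREG	0	/* last def wrote the full 64-bit register */

struct reg_liveness_sketch {
	int subreg_def;		/* insn index of the last sub-register write */
};

/* A full 64-bit read means the defining sub-register write must leave the
 * high 32 bits cleared: mark that insn as needing explicit zero extension.
 */
static void mark_read64_sketch(struct reg_liveness_sketch *reg,
			       bool *needs_zext)
{
	if (reg->subreg_def != DEF_NOT_SUBREG)
		needs_zext[reg->subreg_def] = true;
}
```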
-
- 24 May 2019: 2 commits
-
-
By Alexei Starovoitov
All prune points inside a callee bpf function most likely will have different callsites. For example, if function foo() is called from two callsites, half of the explored states in all prune points in foo() will be useless for subsequent walking of one of those callsites. Fortunately the explored_states pruning heuristic keeps the number of states per prune point small, but walking these states is still a waste of cpu time when the callsite of the current state differs from the callsite of the explored state.

To improve the pruning logic, convert explored_states into a hash table and use a simple insn_idx ^ callsite hash to select the hash bucket. This optimization has no effect on programs without bpf2bpf calls and drastically improves programs with calls. In the latter case it reduces total memory consumption in 1M scale tests by almost 3 times (peak_states drops from 5752 to 2016).

Care should be taken when comparing the states for equivalency. Since the same hash bucket can now contain states with different indices, the insn_idx has to be part of verifier_state and compared.

Different hash table sizes and different hash functions were explored, but the results were not significantly better vs this patch. They can be improved in the future.

The hit/miss heuristic does not count an index miscompare as a miss; otherwise verifier stats become unstable when experimenting with different hash functions.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
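A sketch of the bucket selection (table size, types and helper names are illustrative, not the real constants):

```c
#include <linux/list.h>

#define STATE_HTAB_SIZE	1024	/* illustrative size, not the real constant */

struct explored_htab_sketch {
	struct hlist_head buckets[STATE_HTAB_SIZE];
};

/* States explored at the same insn from the same callsite share a bucket,
 * so a walk from a different callsite never touches them.
 */
static struct hlist_head *explored_bucket(struct explored_htab_sketch *ht,
					  int insn_idx, int callsite)
{
	return &ht->buckets[(insn_idx ^ callsite) % STATE_HTAB_SIZE];
}
```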
-
By Alexei Starovoitov
Split explored_states into a prune_point boolean mark and a linked list of explored states. This removes the STATE_LIST_MARK hack and allows marks to be separate from states.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
- 23 May 2019: 1 commit
-
-
By Eric Dumazet
Removing two 4-byte holes allows using the kmalloc-32 kmem cache instead of kmalloc-64 on 64-bit kernels.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
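A hypothetical layout (not the actual struct from this commit) illustrating how grouping the 4-byte members removes the padding holes and drops the object into the next-smaller kmem cache:

```c
#include <linux/types.h>

struct before_sketch {		/* 40 bytes on 64-bit -> kmalloc-64 */
	void *a;		/* 0..7   */
	u32 x;			/* 8..11, 4-byte hole at 12..15 */
	void *b;		/* 16..23 */
	u32 y;			/* 24..27, 4-byte hole at 28..31 */
	void *c;		/* 32..39 */
};

struct after_sketch {		/* 32 bytes -> fits kmalloc-32 */
	void *a;
	void *b;
	void *c;		/* pointers packed at 0..23 */
	u32 x;			/* 24..27 */
	u32 y;			/* 28..31, no holes left */
};
```

The `pahole` tool reports such holes directly, which is how layouts like this are usually found.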
-
- 21 May 2019: 8 commits
-
-
By Thomas Gleixner
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 or at your option any later version this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details you should have received a copy of the gnu general public license along with this program see the file copying if not write to the free software foundation 675 mass ave cambridge ma 02139 usa

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 52 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jilayne Lovejoy <opensource@jilayne.com>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190519154042.342335923@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
By Thomas Gleixner
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 or at your option any later version this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details you should have received a copy of the gnu general public license along with this program if not see http www gnu org licenses

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 13 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Jilayne Lovejoy <opensource@jilayne.com>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190519154042.236620792@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
By Thomas Gleixner
Based on 2 normalized pattern(s):

  this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details you should have received a copy of the gnu general public license along with this program if not see http www gnu org licenses

  this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details [based] [from] [clk] [highbank] [c] you should have received a copy of the gnu general public license along with this program if not see http www gnu org licenses

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 355 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Jilayne Lovejoy <opensource@jilayne.com>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190519154041.837383322@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
By Thomas Gleixner
Based on 1 normalized pattern(s):

  licensed under the fsf s gnu public license v2 or later

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 2 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Jilayne Lovejoy <opensource@jilayne.com>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190519154041.526489261@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
By Thomas Gleixner
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details the full gnu general public license is included in this distribution in the file called copying

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 9 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Jilayne Lovejoy <opensource@jilayne.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190519154041.244154651@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
By Thomas Gleixner
Based on 1 normalized pattern(s):

  licensed under gplv2 or later

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 118 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jilayne Lovejoy <opensource@jilayne.com>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190519154040.961286471@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
By Thomas Gleixner
Based on 2 normalized pattern(s):

  this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details you should have received a copy of the gnu general public license along with this program if not write to the free software foundation inc 51 franklin street fifth floor boston ma 02110 1301 usa

  this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option [no]_[pad]_[ctrl] any later version this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details you should have received a copy of the gnu general public license along with this program if not write to the free software foundation inc 51 franklin street fifth floor boston ma 02110 1301 usa

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 176 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jilayne Lovejoy <opensource@jilayne.com>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190519154040.652910950@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
By Thomas Gleixner
Add SPDX license identifiers to all files which:

 - Have no license information of any form

 - Have EXPORT_.*_SYMBOL_GPL inside, which was used in the initial scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is:

  GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 20 May 2019: 1 commit
-
-
By Petr Štetiar
of_get_mac_address prior to commit d01f449c ("of_net: add NVMEM support to of_get_mac_address") could return only a valid pointer or NULL; after this change it can return only a valid pointer or an ERR_PTR encoded error value. I forgot to change the return value of of_get_mac_address for the case where the kernel is compiled without CONFIG_OF, so I'm doing so now.

Cc: Mirko Lindner <mlindner@marvell.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Fixes: d01f449c ("of_net: add NVMEM support to of_get_mac_address")
Reported-by: Octavio Alvarez <octallk1@alvarezp.org>
Signed-off-by: Petr Štetiar <ynezz@true.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
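The !CONFIG_OF stub presumably ends up looking like this sketch; the exact errno is an assumption of this illustration, the point being that callers can now test uniformly with IS_ERR():

```c
#include <linux/err.h>

#if !defined(CONFIG_OF)
/* Return convention now matches the CONFIG_OF build: valid pointer or
 * ERR_PTR, never NULL (errno assumed for illustration).
 */
static inline const void *of_get_mac_address(struct device_node *np)
{
	return ERR_PTR(-ENODEV);
}
#endif
```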
-
- 19 May 2019: 2 commits
-
-
By Feng Tang
Currently on panic, the kernel will lower the loglevel and print out only pending printk messages, via console_flush_on_panic(). Add an option for users to configure "panic_print" to replay all dmesg in the buffer, some of which they may never have seen due to the loglevel setting; this will help panic debugging.

[feng.tang@intel.com: keep the original console_flush_on_panic() inside panic()]
  Link: http://lkml.kernel.org/r/1556199137-14163-1-git-send-email-feng.tang@intel.com
[feng.tang@intel.com: use logbuf lock to protect the console log index]
  Link: http://lkml.kernel.org/r/1556269868-22654-1-git-send-email-feng.tang@intel.com
Link: http://lkml.kernel.org/r/1556095872-36838-1-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang <feng.tang@intel.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Aaro Koskinen <aaro.koskinen@nokia.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Borislav Petkov <bp@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Uladzislau Rezki (Sony)
Patch series "improve vmap allocation", v3. Objective --------- Please have a look for the description at: https://lkml.org/lkml/2018/10/19/786 but let me also summarize it a bit here as well. The current implementation has O(N) complexity. Requests with different permissive parameters can lead to long allocation time. When i say "long" i mean milliseconds. Description ----------- This approach organizes the KVA memory layout into free areas of the 1-ULONG_MAX range, i.e. an allocation is done over free areas lookups, instead of finding a hole between two busy blocks. It allows to have lower number of objects which represent the free space, therefore to have less fragmented memory allocator. Because free blocks are always as large as possible. It uses the augment tree where all free areas are sorted in ascending order of va->va_start address in pair with linked list that provides O(1) access to prev/next elements. Since the tree is augment, we also maintain the "subtree_max_size" of VA that reflects a maximum available free block in its left or right sub-tree. Knowing that, we can easily traversal toward the lowest (left most path) free area. Allocation: ~O(log(N)) complexity. It is sequential allocation method therefore tends to maximize locality. The search is done until a first suitable block is large enough to encompass the requested parameters. Bigger areas are split. I copy paste here the description of how the area is split, since i described it in https://lkml.org/lkml/2018/10/19/786 <snip> A free block can be split by three different ways. Their names are FL_FIT_TYPE, LE_FIT_TYPE/RE_FIT_TYPE and NE_FIT_TYPE, i.e. they correspond to how requested size and alignment fit to a free block. FL_FIT_TYPE - in this case a free block is just removed from the free list/tree because it fully fits. Comparing with current design there is an extra work with rb-tree updating. LE_FIT_TYPE/RE_FIT_TYPE - left/right edges fit. In this case what we do is just cutting a free block. It is as fast as a current design. Most of the vmalloc allocations just end up with this case, because the edge is always aligned to 1. NE_FIT_TYPE - Is much less common case. Basically it happens when requested size and alignment does not fit left nor right edges, i.e. it is between them. In this case during splitting we have to build a remaining left free area and place it back to the free list/tree. Comparing with current design there are two extra steps. First one is we have to allocate a new vmap_area structure. Second one we have to insert that remaining free block to the address sorted list/tree. In order to optimize a first case there is a cache with free_vmap objects. Instead of allocating from slab we just take an object from the cache and reuse it. Second one is pretty optimized. Since we know a start point in the tree we do not do a search from the top. Instead a traversal begins from a rb-tree node we split. <snip> De-allocation. ~O(log(N)) complexity. An area is not inserted straight away to the tree/list, instead we identify the spot first, checking if it can be merged around neighbors. The list provides O(1) access to prev/next, so it is pretty fast to check it. Summarizing. If merged then large coalesced areas are created, if not the area is just linked making more fragments. There is one more thing that i should mention here. After modification of VA node, its subtree_max_size is updated if it was/is the biggest area in its left or right sub-tree. 
Apart of that it can also be populated back to upper levels to fix the tree. For more details please have a look at the __augment_tree_propagate_from() function and the description. Tests and stressing ------------------- I use the "test_vmalloc.sh" test driver available under "tools/testing/selftests/vm/" since 5.1-rc1 kernel. Just trigger "sudo ./test_vmalloc.sh" to find out how to deal with it. Tested on different platforms including x86_64/i686/ARM64/x86_64_NUMA. Regarding last one, i do not have any physical access to NUMA system, therefore i emulated it. The time of stressing is days. If you run the test driver in "stress mode", you also need the patch that is in Andrew's tree but not in Linux 5.1-rc1. So, please apply it: http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/commit/?id=e0cf7749bade6da318e98e934a24d8b62fab512c After massive testing, i have not identified any problems like memory leaks, crashes or kernel panics. I find it stable, but more testing would be good. Performance analysis -------------------- I have used two systems to test. One is i5-3320M CPU @ 2.60GHz and another is HiKey960(arm64) board. i5-3320M runs on 4.20 kernel, whereas Hikey960 uses 4.15 kernel. I have both system which could run on 5.1-rc1 as well, but the results have not been ready by time i an writing this. Currently it consist of 8 tests. There are three of them which correspond to different types of splitting(to compare with default). We have 3 ones(see above). Another 5 do allocations in different conditions. a) sudo ./test_vmalloc.sh performance When the test driver is run in "performance" mode, it runs all available tests pinned to first online CPU with sequential execution test order. We do it in order to get stable and repeatable results. Take a look at time difference in "long_busy_list_alloc_test". It is not surprising because the worst case is O(N). # i5-3320M How many cycles all tests took: CPU0=646919905370(default) cycles vs CPU0=193290498550(patched) cycles # See detailed table with results here: ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_default.txt ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_patched.txt # Hikey960 8x CPUs How many cycles all tests took: CPU0=3478683207 cycles vs CPU0=463767978 cycles # See detailed table with results here: ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_default.txt ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_patched.txt b) time sudo ./test_vmalloc.sh test_repeat_count=1 With this configuration, all tests are run on all available online CPUs. Before running each CPU shuffles its tests execution order. It gives random allocation behaviour. So it is rough comparison, but it puts in the picture for sure. # i5-3320M <default> vs <patched> real 101m22.813s real 0m56.805s user 0m0.011s user 0m0.015s sys 0m5.076s sys 0m0.023s # See detailed table with results here: ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_default.txt ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_patched.txt # Hikey960 8x CPUs <default> vs <patched> real unknown real 4m25.214s user unknown user 0m0.011s sys unknown sys 0m0.670s I did not manage to complete this test on "default Hikey960" kernel version. After 24 hours it was still running, therefore i had to cancel it. That is why real/user/sys are "unknown". 
This patch (of 3): Currently an allocation of the new vmap area is done over busy list iteration(complexity O(n)) until a suitable hole is found between two busy areas. Therefore each new allocation causes the list being grown. Due to over fragmented list and different permissive parameters an allocation can take a long time. For example on embedded devices it is milliseconds. This patch organizes the KVA memory layout into free areas of the 1-ULONG_MAX range. It uses an augment red-black tree that keeps blocks sorted by their offsets in pair with linked list keeping the free space in order of increasing addresses. Nodes are augmented with the size of the maximum available free block in its left or right sub-tree. Thus, that allows to take a decision and traversal toward the block that will fit and will have the lowest start address, i.e. it is sequential allocation. Allocation: to allocate a new block a search is done over the tree until a suitable lowest(left most) block is large enough to encompass: the requested size, alignment and vstart point. If the block is bigger than requested size - it is split. De-allocation: when a busy vmap area is freed it can either be merged or inserted to the tree. Red-black tree allows efficiently find a spot whereas a linked list provides a constant-time access to previous and next blocks to check if merging can be done. In case of merging of de-allocated memory chunk a large coalesced area is created. Complexity: ~O(log(N)) [urezki@gmail.com: v3] Link: http://lkml.kernel.org/r/20190402162531.10888-2-urezki@gmail.com [urezki@gmail.com: v4] Link: http://lkml.kernel.org/r/20190406183508.25273-2-urezki@gmail.com Link: http://lkml.kernel.org/r/20190321190327.11813-2-urezki@gmail.comSigned-off-by: NUladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: NRoman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Thomas Garnier <thgarnie@google.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Joel Fernandes <joelaf@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
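A hedged sketch of the augmented lookup described above; field and helper names are simplified, not the exact mm/vmalloc.c implementation:

```c
#include <linux/rbtree.h>

struct free_area {
	struct rb_node node;
	unsigned long va_start, va_end;
	unsigned long subtree_max_size;	/* largest free block in this subtree */
};

#define fa_size(fa)	((fa)->va_end - (fa)->va_start)

static unsigned long subtree_max(struct rb_node *n)
{
	return n ? rb_entry(n, struct free_area, node)->subtree_max_size : 0;
}

/* Find the free block with the lowest start address that can hold @size. */
static struct free_area *find_lowest_fit(struct rb_node *n, unsigned long size)
{
	while (n) {
		struct free_area *fa = rb_entry(n, struct free_area, node);

		if (subtree_max(n->rb_left) >= size)
			n = n->rb_left;		/* a fit exists at a lower address */
		else if (fa_size(fa) >= size)
			return fa;		/* this node is the lowest fit */
		else
			n = n->rb_right;	/* only higher addresses can fit */
	}
	return NULL;
}
```

The subtree_max_size augmentation is what keeps this traversal O(log(N)): at each node, one comparison decides whether a fit exists anywhere in the cheaper (lower-address) half.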
-
- 18 May 2019: 2 commits
-
-
By Parav Pandit
To avoid any ambiguity between vport index and vport number, rename functions that had vport in the name to vport_num or vport_index, as appropriate.

vport_num is u16, hence change the mlx5_eswitch_index_to_vport_num() return type to u16. vport_index is an int index into the vport array, hence change the input type of the vport index in mlx5_eswitch_index_to_vport_num() to int.

Correct the multiple eswitch representor interfaces that used the u16 rep->vport as an int vport_index. Send vport FW commands with the correct u16 eswitch vport_num instead of the host's int vport_index.

Fixes: 5ae51620 ("net/mlx5: E-Switch, Assign a different position for uplink rep and vport")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Reviewed-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
By Heiner Kallweit
i2c_new_dummy is typically called from the probe function of the driver for the primary i2c client. It requires calls to i2c_unregister_device in the error path of the probe function and in the remove function. This can be simplified by introducing a device-managed version.

Note the changed error-case return value type: i2c_new_dummy returns NULL whilst devm_i2c_new_dummy_device returns an ERR_PTR.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
[wsa: rename new functions and fix minor kdoc issues]
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Reviewed-by: Peter Rosin <peda@axentia.se>
Reviewed-by: Kieran Bingham <kieran.bingham+renesas@ideasonboard.com>
Reviewed-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
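A usage sketch (the driver and second-address choice are hypothetical); the key point is the IS_ERR() check replacing the NULL check that i2c_new_dummy required:

```c
#include <linux/i2c.h>
#include <linux/err.h>

static int foo_probe(struct i2c_client *client,
		     const struct i2c_device_id *id)
{
	struct i2c_client *bank2;

	/* Second address of a hypothetical two-address chip. */
	bank2 = devm_i2c_new_dummy_device(&client->dev, client->adapter,
					  client->addr + 1);
	if (IS_ERR(bank2))
		return PTR_ERR(bank2);	/* ERR_PTR convention, not NULL */

	/* No i2c_unregister_device() needed in error paths or remove():
	 * devres tears the dummy client down with the primary device.
	 */
	return 0;
}
```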
-
- 17 May 2019: 3 commits
-
-
By Qian Cai
It turned out that DEBUG_SLAB_LEAK is still broken even after recent rescue efforts. When there is a large number of objects, like kmemleak_object, which is normal on a debug kernel:

  # grep kmemleak /proc/slabinfo
  kmemleak_object   2243606 3436210 ...

reading /proc/slab_allocators could easily loop forever while processing the kmemleak_object cache, and any additional freeing or allocating of objects will trigger a reprocessing. To make the situation worse, soft-lockups could easily happen, which will call printk() to allocate more kmemleak objects, guaranteeing an infinite loop.

Also, since it seems no one noticed when it was totally broken more than 2 years ago - see commit fcf88917 ("slab: fix a crash by reading /proc/slab_allocators") - probably nobody cares about it anymore due to the decline of SLAB. Just remove it entirely.

Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Willem de Bruijn
Zerocopy skbs without completion notification were added for packet sockets with PACKET_TX_RING user buffers. Those signal completion through the TP_STATUS_USER bit in the ring. Zerocopy annotation was added only to avoid premature notification after clone or orphan, by triggering a copy on these paths for these packets.

The mechanism had to define a special "no-uarg" mode because packet sockets already use skb_uarg(skb) == skb_shinfo(skb)->destructor_arg for a different pointer.

Before dereferencing skb_uarg(skb), verify that it is a real pointer.

Fixes: 5cd8d46e ("packet: copy user buffers before orphan or clone")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Herbert Xu
The opaque type rhash_lock_head should not be marked with __rcu because it can never be dereferenced. We should apply the RCU marking when we turn it into a pointer which can be dereferenced. This patch does exactly that.

This fixes a number of sparse warnings as well as getting rid of some unnecessary RCU checking.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 16 May 2019: 3 commits
-
-
By Stephen Boyd
Now that we've gotten rid of clk_readl() we can remove io.h from the clk-provider header and push out the io.h include to any code that isn't already including the io.h header but using things like readl/writel, etc. Found with this grep: git grep -l clk-provider.h | grep '.c$' | xargs git grep -L 'linux/io.h' | \ xargs git grep -l \ -e '\<__iowrite32_copy\>' --or \ -e '\<__ioread32_copy\>' --or \ -e '\<__iowrite64_copy\>' --or \ -e '\<ioremap_page_range\>' --or \ -e '\<ioremap_huge_init\>' --or \ -e '\<arch_ioremap_pud_supported\>' --or \ -e '\<arch_ioremap_pmd_supported\>' --or \ -e '\<devm_ioport_map\>' --or \ -e '\<devm_ioport_unmap\>' --or \ -e '\<IOMEM_ERR_PTR\>' --or \ -e '\<devm_ioremap\>' --or \ -e '\<devm_ioremap_nocache\>' --or \ -e '\<devm_ioremap_wc\>' --or \ -e '\<devm_iounmap\>' --or \ -e '\<devm_ioremap_release\>' --or \ -e '\<devm_memremap\>' --or \ -e '\<devm_memunmap\>' --or \ -e '\<__devm_memremap_pages\>' --or \ -e '\<pci_remap_cfgspace\>' --or \ -e '\<arch_has_dev_port\>' --or \ -e '\<arch_phys_wc_add\>' --or \ -e '\<arch_phys_wc_del\>' --or \ -e '\<memremap\>' --or \ -e '\<memunmap\>' --or \ -e '\<arch_io_reserve_memtype_wc\>' --or \ -e '\<arch_io_free_memtype_wc\>' --or \ -e '\<__io_aw\>' --or \ -e '\<__io_pbw\>' --or \ -e '\<__io_paw\>' --or \ -e '\<__io_pbr\>' --or \ -e '\<__io_par\>' --or \ -e '\<__raw_readb\>' --or \ -e '\<__raw_readw\>' --or \ -e '\<__raw_readl\>' --or \ -e '\<__raw_readq\>' --or \ -e '\<__raw_writeb\>' --or \ -e '\<__raw_writew\>' --or \ -e '\<__raw_writel\>' --or \ -e '\<__raw_writeq\>' --or \ -e '\<readb\>' --or \ -e '\<readw\>' --or \ -e '\<readl\>' --or \ -e '\<readq\>' --or \ -e '\<writeb\>' --or \ -e '\<writew\>' --or \ -e '\<writel\>' --or \ -e '\<writeq\>' --or \ -e '\<readb_relaxed\>' --or \ -e '\<readw_relaxed\>' --or \ -e '\<readl_relaxed\>' --or \ -e '\<readq_relaxed\>' --or \ -e '\<writeb_relaxed\>' --or \ -e '\<writew_relaxed\>' --or \ -e '\<writel_relaxed\>' --or \ -e '\<writeq_relaxed\>' --or \ -e '\<readsb\>' --or \ -e '\<readsw\>' --or \ -e '\<readsl\>' --or \ -e '\<readsq\>' --or \ -e '\<writesb\>' --or \ -e '\<writesw\>' --or \ -e '\<writesl\>' --or \ -e '\<writesq\>' --or \ -e '\<inb\>' --or \ -e '\<inw\>' --or \ -e '\<inl\>' --or \ -e '\<outb\>' --or \ -e '\<outw\>' --or \ -e '\<outl\>' --or \ -e '\<inb_p\>' --or \ -e '\<inw_p\>' --or \ -e '\<inl_p\>' --or \ -e '\<outb_p\>' --or \ -e '\<outw_p\>' --or \ -e '\<outl_p\>' --or \ -e '\<insb\>' --or \ -e '\<insw\>' --or \ -e '\<insl\>' --or \ -e '\<outsb\>' --or \ -e '\<outsw\>' --or \ -e '\<outsl\>' --or \ -e '\<insb_p\>' --or \ -e '\<insw_p\>' --or \ -e '\<insl_p\>' --or \ -e '\<outsb_p\>' --or \ -e '\<outsw_p\>' --or \ -e '\<outsl_p\>' --or \ -e '\<ioread8\>' --or \ -e '\<ioread16\>' --or \ -e '\<ioread32\>' --or \ -e '\<ioread64\>' --or \ -e '\<iowrite8\>' --or \ -e '\<iowrite16\>' --or \ -e '\<iowrite32\>' --or \ -e '\<iowrite64\>' --or \ -e '\<ioread16be\>' --or \ -e '\<ioread32be\>' --or \ -e '\<ioread64be\>' --or \ -e '\<iowrite16be\>' --or \ -e '\<iowrite32be\>' --or \ -e '\<iowrite64be\>' --or \ -e '\<ioread8_rep\>' --or \ -e '\<ioread16_rep\>' --or \ -e '\<ioread32_rep\>' --or \ -e '\<ioread64_rep\>' --or \ -e '\<iowrite8_rep\>' --or \ -e '\<iowrite16_rep\>' --or \ -e '\<iowrite32_rep\>' --or \ -e '\<iowrite64_rep\>' --or \ -e '\<__io_virt\>' --or \ -e '\<pci_iounmap\>' --or \ -e '\<virt_to_phys\>' --or \ -e '\<phys_to_virt\>' --or \ -e '\<ioremap_uc\>' --or \ -e '\<ioremap\>' --or \ -e '\<__ioremap\>' --or \ -e '\<iounmap\>' --or \ -e '\<ioremap\>' --or \ -e 
'\<ioremap_nocache\>' --or \ -e '\<ioremap_uc\>' --or \ -e '\<ioremap_wc\>' --or \ -e '\<ioremap_wc\>' --or \ -e '\<ioremap_wt\>' --or \ -e '\<ioport_map\>' --or \ -e '\<ioport_unmap\>' --or \ -e '\<ioport_map\>' --or \ -e '\<ioport_unmap\>' --or \ -e '\<xlate_dev_kmem_ptr\>' --or \ -e '\<xlate_dev_mem_ptr\>' --or \ -e '\<unxlate_dev_mem_ptr\>' --or \ -e '\<virt_to_bus\>' --or \ -e '\<bus_to_virt\>' --or \ -e '\<memset_io\>' --or \ -e '\<memcpy_fromio\>' --or \ -e '\<memcpy_toio\>'

I also reordered a couple of includes when they weren't alphabetical and removed clk.h from kona, replacing it with clk-provider.h, because that driver doesn't use clk consumer APIs.

Acked-by: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Chen-Yu Tsai <wens@csie.org>
Acked-by: Maxime Ripard <maxime.ripard@bootlin.com>
Acked-by: Tero Kristo <t-kristo@ti.com>
Acked-by: Sekhar Nori <nsekhar@ti.com>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Acked-by: Mark Brown <broonie@kernel.org>
Cc: Chris Zankel <chris@zankel.net>
Acked-by: Max Filippov <jcmvbkbc@gmail.com>
Acked-by: John Crispin <john@phrozen.org>
Acked-by: Heiko Stuebner <heiko@sntech.de>
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
-
By David Howells
Add wait_var_event_interruptible() to allow interruptible waits for events.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
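A usage sketch (the surrounding struct and flag are hypothetical); the waiter returns 0 when the condition becomes true and -ERESTARTSYS if interrupted by a signal, and the waker pairs with wake_up_var() on the same address:

```c
#include <linux/wait_bit.h>
#include <linux/bitops.h>

struct foo {
	unsigned long flags;
};
#define FOO_READY 0

/* Sleep interruptibly until FOO_READY is set. */
static int foo_wait_ready(struct foo *f)
{
	return wait_var_event_interruptible(&f->flags,
					    test_bit(FOO_READY, &f->flags));
}

/* Waker side: set the condition, then wake waiters on the variable. */
static void foo_mark_ready(struct foo *f)
{
	set_bit(FOO_READY, &f->flags);
	wake_up_var(&f->flags);
}
```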
-
By David Howells
Allow used DNS resolver keys to be invalidated after use if the caller is doing its own caching of the results. This reduces the amount of resources required.

Fix AFS to invalidate DNS results to kill off permanent failure records that get lodged in the resolver keyring and prevent future lookups from happening.

Fixes: 0a5143f2 ("afs: Implement VL server rotation")
Signed-off-by: David Howells <dhowells@redhat.com>
-
- 15 May 2019: 15 commits
-
-
By Johannes Weiner
Right now, when somebody needs to know the recursive memory statistics and events of a cgroup subtree, they need to walk the entire subtree and sum up the counters manually. There are two issues with this:

1. When a cgroup gets deleted, its stats are lost. The state counters should all be 0 at that point, of course, but the events are not. When this happens, the event counters, which are supposed to be monotonic, can go backwards in the parent cgroups.

2. During regular operation, we always have a certain number of lazily freed cgroups sitting around that have been deleted, have no tasks, but have a few cache pages remaining. These groups' statistics do not change until we eventually hit memory pressure, but somebody watching, say, memory.stat on an ancestor has to iterate those every time.

This patch addresses both issues by introducing recursive counters at each level that are propagated from the write side when stats change. Upward propagation happens when the per-cpu caches spill over into the local atomic counter. This is the same thing we do during charge and uncharge, except that the latter uses atomic RMWs, which are more expensive; stat changes happen at around the same rate.

In a sparse file test (page faults and reclaim at maximum CPU speed) with 5 cgroup nesting levels, perf shows __mod_memcg_page_state at ~1%.

Link: http://lkml.kernel.org/r/20190412151507.2769-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
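A simplified sketch of the write-side spill described above; the batch constant and field names only approximate the memcg code, so treat this as an illustration rather than the actual implementation:

```c
static void mod_memcg_state_sketch(struct mem_cgroup *memcg,
				   int idx, int val)
{
	long x = val + __this_cpu_read(memcg->stat_cpu->count[idx]);

	if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
		struct mem_cgroup *mi;

		/* Spill the batched delta into this level and every
		 * ancestor, so readers see recursive totals without
		 * walking the subtree.
		 */
		for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
			atomic_long_add(x, &mi->vmstats[idx]);
		x = 0;
	}
	__this_cpu_write(memcg->stat_cpu->count[idx], x);
}
```

Batching through the existing per-cpu cache is what keeps the propagation cheap: the atomic adds up the hierarchy happen only once per MEMCG_CHARGE_BATCH worth of changes, not per update.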
-
By Johannes Weiner
These are getting too big to be inlined at every callsite. They were stolen from vmstat.c, which already out-of-lines them, and they have only been growing since. The callsites aren't that hot, either.

Move __mod_memcg_state(), __mod_lruvec_state() and __count_memcg_events() out of line and add kerneldoc comments.

Link: http://lkml.kernel.org/r/20190412151507.2769-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Johannes Weiner
Patch series "mm: memcontrol: memory.stat cost & correctness". The cgroup memory.stat file holds recursive statistics for the entire subtree. The current implementation does this tree walk on-demand whenever the file is read. This is giving us problems in production. 1. The cost of aggregating the statistics on-demand is high. A lot of system service cgroups are mostly idle and their stats don't change between reads, yet we always have to check them. There are also always some lazily-dying cgroups sitting around that are pinned by a handful of remaining page cache; the same applies to them. In an application that periodically monitors memory.stat in our fleet, we have seen the aggregation consume up to 5% CPU time. 2. When cgroups die and disappear from the cgroup tree, so do their accumulated vm events. The result is that the event counters at higher-level cgroups can go backwards and confuse some of our automation, let alone people looking at the graphs over time. To address both issues, this patch series changes the stat implementation to spill counts upwards when the counters change. The upward spilling is batched using the existing per-cpu cache. In a sparse file stress test with 5 level cgroup nesting, the additional cost of the flushing was negligible (a little under 1% of CPU at 100% CPU utilization, compared to the 5% of reading memory.stat during regular operation). This patch (of 4): memcg_page_state(), lruvec_page_state(), memcg_sum_events() are currently returning the state of the local memcg or lruvec, not the recursive state. In practice there is a demand for both versions, although the callers that want the recursive counts currently sum them up by hand. Per default, cgroups are considered recursive entities and generally we expect more users of the recursive counters, with the local counts being special cases. To reflect that in the name, add a _local suffix to the current implementations. The following patch will re-incarnate these functions with recursive semantics, but with an O(1) implementation. [hannes@cmpxchg.org: fix bisection hole] Link: http://lkml.kernel.org/r/20190417160347.GC23013@cmpxchg.org Link: http://lkml.kernel.org/r/20190412151507.2769-2-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Reviewed-by: NRoman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
By Chris Down
I spent literally an hour trying to work out why an earlier version of my memory.events aggregation code didn't work properly, only to find out I was calling memcg->events instead of memcg->memory_events, which is fairly confusing.

This naming seems in need of reworking, so make it harder to do the wrong thing by using vmevents instead of events, which makes it clearer that these are vm counters rather than memcg-specific counters.

There are also a few other inconsistent names in both the percpu and aggregated structs, so these are all cleaned up to be more coherent and easier to understand.

This commit contains code cleanup only: there are no logic changes.

[akpm@linux-foundation.org: fix it for preceding changes]
Link: http://lkml.kernel.org/r/20190208224319.GA23801@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Andrei Vagin
This file uses "task" 85 times and "tsk" 25 times. It is better to be consistent. Link: http://lkml.kernel.org/r/20181129180547.15976-1-avagin@gmail.comSigned-off-by: NAndrei Vagin <avagin@gmail.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
By Manfred Spraul
Rewrite, based on the patch from Waiman Long:

The mixing of a sequence number into the IPC IDs is probably meant to avoid ID reuse in userspace as much as possible. With ipcmni_extend mode, the number of usable sequence numbers is greatly reduced, leading to a higher chance of ID reuse. To address this issue, we need to conserve the sequence number space as much as possible.

Right now, the sequence number is incremented for every new ID created. In reality, we only need to increment the sequence number when the newly allocated ID is not greater than the last one allocated; it is only in such a case that the new ID may collide with an existing one. This is done irrespective of the ipcmni mode.

In order to avoid any races, the index is first allocated and then the pointer is replaced.

Changes compared to the initial patch:
 - Handle failures from idr_alloc().
 - Avoid that concurrent operations can see the wrong sequence number. (This is achieved by using idr_replace().)
 - IPCMNI_SEQ_SHIFT is not a constant, thus renamed to ipcmni_seq_shift().
 - IPCMNI_SEQ_MAX is not a constant, thus renamed to ipcmni_seq_max().

Link: http://lkml.kernel.org/r/20190329204930.21620-2-longman@redhat.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Cc: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
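A sketch of the conserving rule (struct and helper names follow the description above but are illustrative, not the actual ipc/util.c code):

```c
extern unsigned int ipcmni_seq_max(void);	/* per the commit text */

struct ipc_ids_sketch {
	int last_idx;
	unsigned int seq;
};

static void ipc_update_seq_sketch(struct ipc_ids_sketch *ids, int idx)
{
	/* Only a new index at or below the last one handed out can collide
	 * with a still-remembered ID, so only then burn a sequence number.
	 */
	if (idx <= ids->last_idx) {
		ids->seq++;
		if (ids->seq > ipcmni_seq_max())
			ids->seq = 0;
	}
	ids->last_idx = idx;
}
```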
-
By Tom Burkart
This patch implements the PPS ECHO functionality for pps-gpio that sysfs claims is available already. Configuration is done via device tree bindings. No changes are made to userspace interfaces.

This patch was originally written by Lukas Senger as part of a master's thesis project and modified for inclusion into the linux kernel by Tom Burkart.

Link: http://lkml.kernel.org/r/20190324043305.6627-4-tom@aussec.com
Signed-off-by: Tom Burkart <tom@aussec.com>
Acked-by: Rodolfo Giometti <giometti@enneenne.com>
Signed-off-by: Lukas Senger <lukas@fridolin.com>
Cc: Philipp Zabel <philipp.zabel@gmail.com>
Cc: Rob Herring <robh@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Tom Burkart
This patch changes the GPIO access for the pps-gpio driver from the integer-based API to the descriptor-based API. The integer-based API is considered deprecated, and the descriptor-based API is the preferred way to access GPIOs, as per Documentation/driver-api/gpio/intro.rst. No changes are made to userspace interfaces.

Link: http://lkml.kernel.org/r/20190324043305.6627-2-tom@aussec.com
Signed-off-by: Tom Burkart <tom@aussec.com>
Acked-by: Rodolfo Giometti <giometti@enneenne.com>
Reviewed-by: Philipp Zabel <philipp.zabel@gmail.com>
Cc: Lukas Senger <lukas@fridolin.com>
Cc: Rob Herring <robh@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Aaro Koskinen
Allow specifying reboot_mode for panic only. This is needed on systems where ramoops is used to store panic logs, and the user wants to use warm reset to preserve those, while still having cold reset on normal reboots.

Link: http://lkml.kernel.org/r/20190322004735.27702-1-aaro.koskinen@iki.fi
Signed-off-by: Aaro Koskinen <aaro.koskinen@nokia.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Feng Tang
When a kernel panic happens, it will first print the panic call stack, then the ending msg like:

  [   35.743249] ---[ end Kernel panic - not syncing: Fatal exception
  [   35.749975] ------------[ cut here ]------------

The above messages are very useful for debugging. But if the system is configured not to reboot on panic, say the "panic_timeout" parameter equals 0, it will likely print out many noisy messages, like WARN() call stacks for each and every CPU except the panic one, messages like below:

  WARNING: CPU: 1 PID: 280 at kernel/sched/core.c:1198 set_task_cpu+0x183/0x190
  Call Trace:
   <IRQ>
   try_to_wake_up
   default_wake_function
   autoremove_wake_function
   __wake_up_common
   __wake_up_common_lock
   __wake_up
   wake_up_klogd_work_func
   irq_work_run_list
   irq_work_tick
   update_process_times
   tick_sched_timer
   __hrtimer_run_queues
   hrtimer_interrupt
   smp_apic_timer_interrupt
   apic_timer_interrupt

For people working in console mode, the screen will first show the panic call stack, but it is immediately overridden by these noisy extra messages, which makes debugging much more difficult, as the original context gets lost on screen. Also these noisy messages will confuse some users, as I have seen many bug reporters post the noisy messages into bugzilla instead of the real panic call stack and context.

Add a flag "suppress_printk" which gets set in panic() to avoid those noisy messages, without changing current kernel behavior: both panic blinking and the sysrq magic key work as before, as suggested by Petr Mladek.

To verify this, make sure the kernel is not configured to reboot on panic, and in a console run:

  # echo c > /proc/sysrq-trigger

to check that the console only prints out the panic call stack.

Link: http://lkml.kernel.org/r/1551430186-24169-1-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang <feng.tang@intel.com>
Suggested-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Sasha Levin <sashal@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
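A sketch of the mechanism, simplified from the description above (function bodies here are placeholders, not the real panic()/vprintk_emit() code):

```c
#include <linux/compiler.h>

int suppress_printk;

/* In panic(): once the stack trace and ending msg are out and the
 * system is configured not to reboot, stop further console noise.
 */
static void panic_quiesce_sketch(void)
{
	suppress_printk = 1;
}

/* In vprintk_emit(): new messages are dropped while suppressed. */
static int vprintk_emit_sketch(const char *fmt)
{
	if (unlikely(suppress_printk))
		return 0;
	/* ... normal printk path ... */
	return 1;
}
```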
-
By Yury Norov
cpumask_parse() finds the first occurrence of either '\n' or '\0' using strchr() and strlen(). We can do it better with a single call of strchrnul().

[akpm@linux-foundation.org: remove unneeded cast]
Link: http://lkml.kernel.org/r/20190409204208.12190-1-ynorov@marvell.com
Signed-off-by: Yury Norov <ynorov@marvell.com>
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
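The resulting helper presumably looks something like this one-scan version (sketch, not the verbatim header code):

```c
#include <linux/string.h>
#include <linux/cpumask.h>

static inline int cpumask_parse_sketch(const char *buf, struct cpumask *dstp)
{
	/* strchrnul() returns a pointer to the first '\n', or to the
	 * trailing '\0' if there is none, so one scan replaces the old
	 * strchr() + strlen() pair.
	 */
	unsigned int len = strchrnul(buf, '\n') - buf;

	return bitmap_parse(buf, len, cpumask_bits(dstp), nr_cpumask_bits);
}
```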
-
By Alexey Dobriyan
struct linux_binprm::buf is the first field and is exactly 128 bytes in size. That means on x86_64 all accesses to other fields will go through the [r64 + disp32] addressing mode, which is 3 bytes bloatier than the [r64 + disp8] addressing mode. Given that accesses to other fields outnumber accesses to ->buf, move it down.

Space savings on x86_64 defconfig (more on distro configs, because LSMs actively dereference "bprm" but do not care about the first 128 bytes of the executable itself):

  add/remove: 0/0 grow/shrink: 0/24 up/down: 0/-492 (-492)
  Function                                     old     new   delta
  selinux_bprm_committing_creds                552     549      -3
  finalize_exec                                 94      91      -3
  __audit_log_bprm_fcaps                       283     280      -3
  __audit_bprm                                  39      36      -3
  perf_trace_sched_process_exec                347     341      -6
  install_exec_creds                           105      99      -6
  cap_bprm_set_creds.cold                       60      54      -6
  would_dump                                   137     128      -9
  load_script                                  637     628      -9
  bprm_change_interp                            61      52      -9
  trace_event_raw_event_sched_process_exec     260     250     -10
  search_binary_handler                        255     240     -15
  remove_arg_zero                              295     277     -18
  free_bprm                                    119     101     -18
  prepare_binprm                               379     360     -19
  setup_new_exec                               336     315     -21
  flush_old_exec                              1638    1617     -21
  copy_strings.isra                            746     724     -22
  setup_arg_pages                              559     530     -29
  load_misc_binary                            1151    1118     -33
  selinux_bprm_set_creds                       792     753     -39
  load_elf_binary                            11111   11072     -39
  cap_bprm_set_creds                          1496    1454     -42
  __do_execve_file.isra                       2395    2286    -109

Link: http://lkml.kernel.org/r/20190421165025.GA26843@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Rasmus Villemoes
The ror32 implementation (word >> shift) | (word << (32 - shift)) has undefined behaviour if shift is outside the [1, 31] range. Similarly for the 64-bit variants. Most callers pass a compile-time constant (naturally in that range), but there's a UBSAN report that these may actually be called with a shift count of 0.

Instead of special-casing that, we can make them DTRT for all values of shift while also avoiding UB. For some reason, this was already partly done for rol32 (which was well-defined for [0, 31]). gcc 8 recognizes these patterns as rotates, so for example

  __u32 rol32(__u32 word, unsigned int shift)
  {
          return (word << (shift & 31)) | (word >> ((-shift) & 31));
  }

compiles to

  0000000000000020 <rol32>:
    20:   89 f8                   mov    %edi,%eax
    22:   89 f1                   mov    %esi,%ecx
    24:   d3 c0                   rol    %cl,%eax
    26:   c3                      retq

Older compilers unfortunately do not do as well, but this only affects the small minority of users that don't pass constants.

Due to integer promotions, ro[lr]8 were already well-defined for shifts in [0, 8], and ro[lr]16 were mostly well-defined for shifts in [0, 16] (only mostly - u16 gets promoted to _signed_ int, so if bit 15 is set, word << 16 is undefined). For consistency, update those as well.

Link: http://lkml.kernel.org/r/20190410211906.2190-1-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reported-by: Ido Schimmel <idosch@mellanox.com>
Tested-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Cc: Vadim Pasternak <vadimp@mellanox.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Jacek Anaszewski <jacek.anaszewski@gmail.com>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
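Under the same scheme, the ror32 counterpart would be the mirror image; masking keeps both shift counts in [0, 31], so shift == 0 degenerates to a plain copy with no UB:

```c
#include <linux/types.h>

static inline __u32 ror32(__u32 word, unsigned int shift)
{
	/* For shift == 0: (word >> 0) | (word << 0) == word, well-defined. */
	return (word >> (shift & 31)) | (word << ((-shift) & 31));
}
```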
-
By Andy Shevchenko
The integer exponentiation is used in a few places and might be used in the future by other call sites. Move it to wider use.

Link: http://lkml.kernel.org/r/20190323172531.80025-2-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Lee Jones <lee.jones@linaro.org>
Cc: Ray Jui <rjui@broadcom.com>
Cc: Thierry Reding <thierry.reding@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
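For reference, integer exponentiation by squaring takes O(log(exp)) multiplies; a sketch consistent with the described helper:

```c
#include <linux/types.h>

u64 int_pow(u64 base, unsigned int exp)
{
	u64 result = 1;

	while (exp) {
		if (exp & 1)		/* low bit set: fold in current base */
			result *= base;
		exp >>= 1;
		base *= base;		/* square for the next bit */
	}
	return result;
}
```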
-
By George Spelvin
Rather than a fixed-size array of pending sorted runs, use the ->prev links to keep track of things. This reduces stack usage, eliminates some ugly overflow handling, and reduces the code size.

Also:
 * merge() no longer needs to handle NULL inputs, so simplify.
 * The same applies to merge_and_restore_back_links(), which is renamed to the less ponderous merge_final(). (It's a static helper function, so we don't need a super-descriptive name; comments will do.)
 * Document the actual return value requirements on the (*cmp)() function; some callers are already using this feature.

x86-64 code size 1086 -> 739 bytes (-347)

(Yes, I see checkpatch complaining about no space after the comma in "__attribute__((nonnull(2,3,4,5)))". Checkpatch is wrong.)

Feedback from Rasmus Villemoes, Andy Shevchenko and Geert Uytterhoeven.

[akpm@linux-foundation.org: remove __pure usage due to mysterious warning]
Link: http://lkml.kernel.org/r/f63c410e0ff76009c9b58e01027e751ff7fdb749.1552704200.git.lkml@sdf.org
Signed-off-by: George Spelvin <lkml@sdf.org>
Acked-by: Andrey Abramov <st5pub@yandex.ru>
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Daniel Wagner <daniel.wagner@siemens.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Don Mullis <don.mullis@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
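With NULL inputs excluded, merge() reduces to a straightforward stable two-way merge; a sketch (the exact kernel signature may differ):

```c
#include <linux/list.h>

typedef int (*cmp_func)(void *priv, struct list_head *a, struct list_head *b);

/* Merge two non-empty, NULL-terminated runs linked through ->next only;
 * merge_final() later restores the ->prev links and the list head.
 */
static struct list_head *merge_sketch(void *priv, cmp_func cmp,
				      struct list_head *a, struct list_head *b)
{
	struct list_head *head, **tail = &head;

	for (;;) {
		/* cmp() <= 0 keeps the sort stable: ties pick from @a. */
		if (cmp(priv, a, b) <= 0) {
			*tail = a;
			tail = &a->next;
			a = a->next;
			if (!a) {
				*tail = b;	/* append the rest of @b */
				break;
			}
		} else {
			*tail = b;
			tail = &b->next;
			b = b->next;
			if (!b) {
				*tail = a;	/* append the rest of @a */
				break;
			}
		}
	}
	return head;
}
```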
-