- 19 June 2019, 2 commits
-
-
Committed by Jason Gunthorpe
Update the struct ib_client for all modules exporting cdevs related to the ibdevice to also implement RDMA_NLDEV_CMD_GET_CHARDEV. All cdevs are now autoloadable and discoverable by userspace over netlink instead of relying on sysfs. uverbs also exposes the DRIVER_ID for drivers that are able to support driver id binding in rdma-core. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
-
Committed by Jason Gunthorpe
Allow userspace to issue a netlink query against the ib_device for something like "uverbs" and get back the char dev name, inode major/minor, and interface ABI information for "uverbs0". Since we are now in netlink, this can also trigger a module autoload to make the uverbs device come into existence. Largely this will let us replace searching and reading inside sysfs to set up devices, and it provides an alternative (using driver_id) to device-name-based provider binding for things like rxe. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
-
- 18 June 2019, 1 commit
-
-
Committed by Jason Gunthorpe
This enum is exposed over the sysfs file 'node_type' and over netlink via RDMA_NLDEV_ATTR_DEV_NODE_TYPE, so declare it in the uapi headers. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
-
- 16 June 2019, 2 commits
-
-
Committed by Maor Gottlieb
Add an API to get the current Eswitch encap mode. It will be used in downstream patches to check whether a flow table can be created with encap support or not. Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Reviewed-by: Petr Vorel <pvorel@suse.cz> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com>
-
Committed by Leon Romanovsky
Devlink has a UAPI declaration for the encap mode, so there is no need to be loose about the data that drivers get and set. Update call sites to use enum devlink_eswitch_encap_mode instead of plain u8. Suggested-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Petr Vorel <pvorel@suse.cz>
-
- 14 June 2019, 6 commits
-
-
Committed by Yuval Avnery
Previously, the EQ joined the chain notifier on creation. This forced the caller to be ready to handle events before creating the EQ through the eq_create_generic interface. To let the caller control when the created EQ is attached to the IRQ, add an enable/disable API. Signed-off-by: Yuval Avnery <yuvalav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
Committed by Ariel Levkovich
The patch modifies the IRQ allocation so that all async EQs are assigned to the same IRQ, resulting in more available IRQs for completion EQs. The changes rely on the support for IRQ sharing and the EQ polling budget introduced in previous patches, so when the shared interrupt is triggered, the kernel serially calls the handler of each of the sharing EQs with a certain budget of EQEs to poll in order to prevent starvation. Signed-off-by: Ariel Levkovich <lariel@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
Committed by Yuval Avnery
The IRQ table should exist only for the PF and VF mlx5_core_dev. The EQ table of mediated devices should hold a pointer to the IRQ table of the parent PCI device. Signed-off-by: Yuval Avnery <yuvalav@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
Committed by Yuval Avnery
Instead of requesting IRQs at EQ creation, IRQs will be requested before EQ table creation. Instead of freeing the IRQs after EQ destroy, free them after EQ table destroy. Signed-off-by: Yuval Avnery <yuvalav@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
Committed by Yuval Avnery
Multiple EQs may share the same IRQ in subsequent patches. Instead of calling the IRQ handler directly, the EQ will register to an atomic chain notifier. The Linux built-in shared IRQ mechanism is not used because it forces the caller to disable the IRQ and clear affinity before free_irq() can be called. This patch is the first step in the separation of IRQ and EQ logic. Signed-off-by: Yuval Avnery <yuvalav@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
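The atomic notifier chain referred to here is the generic kernel facility from <linux/notifier.h>. A minimal sketch of the pattern, with purely illustrative names rather than the actual mlx5 code, shows how one IRQ handler can fan out to several registered consumers:

    #include <linux/notifier.h>
    #include <linux/interrupt.h>

    /* One chain per IRQ; every EQ-like consumer registers a notifier_block. */
    static ATOMIC_NOTIFIER_HEAD(foo_irq_chain);

    static int foo_eq_notify(struct notifier_block *nb, unsigned long action,
                             void *data)
    {
            /* poll this EQ up to its budget here */
            return NOTIFY_OK;
    }

    static struct notifier_block foo_eq_nb = { .notifier_call = foo_eq_notify };

    static irqreturn_t foo_irq_handler(int irq, void *dev_id)
    {
            /* Fan the interrupt out to every registered consumer atomically. */
            atomic_notifier_call_chain(&foo_irq_chain, 0, NULL);
            return IRQ_HANDLED;
    }

    /* Consumers attach and detach without touching the IRQ itself:       */
    /*     atomic_notifier_chain_register(&foo_irq_chain, &foo_eq_nb);    */
    /*     atomic_notifier_chain_unregister(&foo_irq_chain, &foo_eq_nb);  */

This is why the built-in shared-IRQ mechanism is not needed: consumers come and go on the chain while the IRQ stays requested.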
-
Committed by Bodong Wang
For an ECPF with eswitch manager privilege, query the host max VF count by querying the device with the query_functions command. With this enhancement: 1. flow steering entries are created only for valid vports, based on the max VF count of the PF; 2. the driver only queries caps of valid vports. Eswitch requires the max VF count when doing initialization, so do the SR-IOV init before the eswitch init. Signed-off-by: Bodong Wang <bodong@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
- 12 June 2019, 2 commits
-
-
Committed by Leon Romanovsky
Ensure that the CQ is allocated and freed by IB/core and not by drivers. Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Acked-by: Gal Pressman <galpress@amazon.com> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Tested-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
-
Committed by Leon Romanovsky
Like all other destroy commands, the .destroy_cq() call is not supposed to fail. In all flows, the attempt to return early caused memory leaks. This patch converts .destroy_cq() so that it does not return any errors. Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Acked-by: Gal Pressman <galpress@amazon.com> Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
-
- 11 June 2019, 4 commits
-
-
Committed by Jason Gunthorpe
This more closely follows how other subsystems work, with owner being a member of the structure containing the function pointers. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
Committed by Jason Gunthorpe
There is no reason for every driver to emit code to set this; just make it part of the driver's existing static const ops structure. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
Committed by Jason Gunthorpe
There is no reason for every driver to emit code to set this; just make it part of the driver's existing static const ops structure. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
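These three adjacent patches follow one pattern: per-driver constants move out of registration-time assignments and into the driver's static const struct ib_device_ops, next to its callbacks. A minimal sketch for a hypothetical driver (the field values are placeholders, not taken from any real driver):

    #include <linux/module.h>
    #include <rdma/ib_verbs.h>

    /* Hypothetical "foo" driver: constants sit beside the verbs callbacks. */
    static const struct ib_device_ops foo_dev_ops = {
            .owner          = THIS_MODULE,
            .driver_id      = RDMA_DRIVER_UNKNOWN, /* placeholder; a real driver uses its own id */
            .uverbs_abi_ver = 1,                   /* placeholder ABI version */

            /* ... the driver's existing verbs callbacks ... */
    };

The structure is then handed to the core once via ib_set_device_ops(), as before.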
-
Committed by Jason Gunthorpe
This has been marked CONFIG_BROKEN for over a year now with no complaints. Delete the whole thing for good. The module provided the /dev/infiniband/ucmX interface. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
- 01 June 2019, 5 commits
-
-
Committed by Parav Pandit
Currently, for every representor type and for every single vport, a copy of the representor function pointers is stored even though they do not change from one vport to another. Additionally, the priv data entry for the rep is not passed during registration, but it is copied; it is used (set and cleared) by the user of the reps. As we want to scale vports, to simplify and also to split constants from data: 1. Rename mlx5_eswitch_rep_if to mlx5_eswitch_rep_ops to match the _ops suffix used by other standard netdev and ibdev ops. 2. Constify the IB and Ethernet rep ops structures. 3. Instead of storing a copy of all rep function pointers, store one copy per eswitch rep type. 4. Split data and function pointers into mlx5_eswitch_rep_ops and mlx5_eswitch_rep_data. Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
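The resulting split is roughly of the following shape; this is reconstructed from the commit description, so the exact member lists should be treated as illustrative rather than the authoritative header contents:

    /* Constant callbacks, shared by every vport of a given rep type. */
    struct mlx5_eswitch_rep_ops {
            int   (*load)(struct mlx5_core_dev *dev, struct mlx5_eswitch_rep *rep);
            void  (*unload)(struct mlx5_eswitch_rep *rep);
            void *(*get_proto_dev)(struct mlx5_eswitch_rep *rep);
    };

    /* Small mutable per-rep state, set and cleared by the user of the rep. */
    struct mlx5_eswitch_rep_data {
            void            *priv;
            atomic_t        state;
    };

Keeping the ops const and per-type, while the data stays per-rep, is what lets the vport count scale without duplicating function-pointer tables.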
-
Committed by Vu Pham
Whenever the device supports the eswitch functions changed event, honor that device setting. Do not limit it to the ECPF. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Vu Pham <vuhuong@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
Committed by Vu Pham
To support SR-IOV on an E-Switch manager, num_vfs is queried from the firmware whenever the E-Switch manager is notified by the esw_functions_changed event. Replace the host_params event with the esw_functions_changed event, which is a more appropriate name. While at it, also correct the type of num_vfs from int to u16, as expected by the function mlx5_esw_query_functions(). Signed-off-by: Vu Pham <vuhuong@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Bodong Wang <bodong@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
Committed by Eli Britstein
A termination table is a flow table with a termination flag. The flag allows the firmware to assume that the specified actions are the last in the actions list. This assumption allows the FW to safely perform potential looping logic (e.g. hairpin). Introduce the bits for this attribute. Signed-off-by: Eli Britstein <elibr@mellanox.com> Reviewed-by: Oz Shlomo <ozsh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
Committed by Moshe Shemesh
Add firmware core dump registers and HW definitions. Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
- 22 May 2019, 2 commits
-
-
Committed by Leon Romanovsky
Kernel destroy CQ flows can't fail, and the return value of ib_destroy_cq() is of no interest in those flows. Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
Committed by Jason Gunthorpe
This value has always been set to PAGE_SHIFT in the core code; the only thing that did anything differently was the ODP path. Move the value into the ODP struct and still use it for ODP, but change all the non-ODP users to just use PAGE_SHIFT/PAGE_SIZE/PAGE_MASK directly. Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
-
- 19 May 2019, 2 commits
-
-
Committed by Feng Tang
Currently on panic, the kernel will lower the loglevel and print out pending printk messages only with console_flush_on_panic(). Add an option for users to configure "panic_print" to replay all dmesg in the buffer, some of which they may never have seen due to the loglevel setting; this will help panic debugging. [feng.tang@intel.com: keep the original console_flush_on_panic() inside panic()] Link: http://lkml.kernel.org/r/1556199137-14163-1-git-send-email-feng.tang@intel.com [feng.tang@intel.com: use logbuf lock to protect the console log index] Link: http://lkml.kernel.org/r/1556269868-22654-1-git-send-email-feng.tang@intel.com Link: http://lkml.kernel.org/r/1556095872-36838-1-git-send-email-feng.tang@intel.com Signed-off-by: Feng Tang <feng.tang@intel.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: Aaro Koskinen <aaro.koskinen@nokia.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Borislav Petkov <bp@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
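Assuming the new replay option is an additional bit in the existing panic_print bitmask (bit 5, i.e. 0x20, per the parameter documentation of that release; treat the exact value as an assumption), a user would enable it either on the kernel command line or via the sysctl:

    panic_print=0x20                            (boot-time kernel parameter)
    echo 0x20 > /proc/sys/kernel/panic_print    (at run time)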
-
Committed by Uladzislau Rezki (Sony)
Patch series "improve vmap allocation", v3.

Objective: Please have a look at the description at https://lkml.org/lkml/2018/10/19/786 but let me also summarize it a bit here. The current implementation has O(N) complexity. Requests with different permissive parameters can lead to long allocation times. When I say "long" I mean milliseconds.

Description: This approach organizes the KVA memory layout into free areas of the 1-ULONG_MAX range, i.e. an allocation is done by looking up free areas instead of finding a hole between two busy blocks. It allows having a lower number of objects representing the free space, and therefore a less fragmented memory allocator, because free blocks are always as large as possible. It uses an augmented tree where all free areas are sorted in ascending order of va->va_start address, paired with a linked list that provides O(1) access to prev/next elements. Since the tree is augmented, we also maintain the "subtree_max_size" of a VA that reflects the maximum available free block in its left or right sub-tree. Knowing that, we can easily traverse toward the lowest (left-most path) free area.

Allocation: ~O(log(N)) complexity. It is a sequential allocation method and therefore tends to maximize locality. The search continues until the first suitable block that is large enough to encompass the requested parameters is found. Bigger areas are split. I copy-paste here the description of how an area is split, since I described it in https://lkml.org/lkml/2018/10/19/786

<snip> A free block can be split in three different ways. Their names are FL_FIT_TYPE, LE_FIT_TYPE/RE_FIT_TYPE and NE_FIT_TYPE, i.e. they correspond to how the requested size and alignment fit into a free block. FL_FIT_TYPE - in this case a free block is just removed from the free list/tree because it fully fits. Compared with the current design there is extra work with rb-tree updating. LE_FIT_TYPE/RE_FIT_TYPE - the left/right edge fits. In this case we just cut a free block. It is as fast as the current design. Most vmalloc allocations end up in this case, because the edge is always aligned to 1. NE_FIT_TYPE - a much less common case. Basically it happens when the requested size and alignment fit neither the left nor the right edge, i.e. the request lies between them. In this case during splitting we have to build a remaining left free area and place it back into the free list/tree. Compared with the current design there are two extra steps: first, we have to allocate a new vmap_area structure; second, we have to insert that remaining free block into the address-sorted list/tree. In order to optimize the first case there is a cache of free_vmap objects; instead of allocating from the slab we just take an object from the cache and reuse it. The second one is already well optimized: since we know a start point in the tree we do not search from the top, instead the traversal begins from the rb-tree node we split. <snip>

De-allocation: ~O(log(N)) complexity. An area is not inserted straight away into the tree/list; instead we identify the spot first, checking whether it can be merged with its neighbors. The list provides O(1) access to prev/next, so it is pretty fast to check. Summarizing: if merged, large coalesced areas are created; if not, the area is just linked, making more fragments. There is one more thing that I should mention here: after modification of a VA node, its subtree_max_size is updated if it was/is the biggest area in its left or right sub-tree.
Apart from that, it can also be propagated back to upper levels to fix the tree. For more details please have a look at the __augment_tree_propagate_from() function and its description.

Tests and stressing: I use the "test_vmalloc.sh" test driver available under "tools/testing/selftests/vm/" since the 5.1-rc1 kernel. Just trigger "sudo ./test_vmalloc.sh" to find out how to deal with it. Tested on different platforms including x86_64/i686/ARM64/x86_64_NUMA. Regarding the last one, I do not have any physical access to a NUMA system, therefore I emulated it. The stressing time is days. If you run the test driver in "stress mode", you also need the patch that is in Andrew's tree but not in Linux 5.1-rc1, so please apply it: http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/commit/?id=e0cf7749bade6da318e98e934a24d8b62fab512c After massive testing, I have not identified any problems like memory leaks, crashes or kernel panics. I find it stable, but more testing would be good.

Performance analysis: I have used two systems to test. One is an i5-3320M CPU @ 2.60GHz and the other is a HiKey960 (arm64) board. The i5-3320M runs a 4.20 kernel, whereas the HiKey960 uses a 4.15 kernel. Both systems could also run 5.1-rc1, but those results were not ready by the time I am writing this. Currently it consists of 8 tests. Three of them correspond to the different types of splitting (to compare with the default); we have 3 of those (see above). The other 5 do allocations under different conditions.

a) sudo ./test_vmalloc.sh performance. When the test driver is run in "performance" mode, it runs all available tests pinned to the first online CPU with sequential test execution order. We do it in order to get stable and repeatable results. Take a look at the time difference in "long_busy_list_alloc_test"; it is not surprising because the worst case is O(N). # i5-3320M, how many cycles all tests took: CPU0=646919905370 (default) cycles vs CPU0=193290498550 (patched) cycles. See detailed tables with results here: ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_default.txt ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_patched.txt # HiKey960 8x CPUs, how many cycles all tests took: CPU0=3478683207 cycles vs CPU0=463767978 cycles. See detailed tables with results here: ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_default.txt ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_patched.txt

b) time sudo ./test_vmalloc.sh test_repeat_count=1. With this configuration, all tests are run on all available online CPUs. Before running, each CPU shuffles its test execution order, which gives random allocation behaviour. So it is a rough comparison, but it puts things in the picture for sure. # i5-3320M, <default> vs <patched>: real 101m22.813s vs 0m56.805s, user 0m0.011s vs 0m0.015s, sys 0m5.076s vs 0m0.023s. See detailed tables with results here: ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_default.txt ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_patched.txt # HiKey960 8x CPUs, <default> vs <patched>: real unknown vs 4m25.214s, user unknown vs 0m0.011s, sys unknown vs 0m0.670s. I did not manage to complete this test on the default HiKey960 kernel version; after 24 hours it was still running, therefore I had to cancel it. That is why real/user/sys are "unknown".
This patch (of 3): Currently an allocation of a new vmap area is done over a busy-list iteration (complexity O(n)) until a suitable hole is found between two busy areas, so each new allocation causes the list to grow. Due to an over-fragmented list and different permissive parameters an allocation can take a long time; for example on embedded devices it is milliseconds. This patch organizes the KVA memory layout into free areas of the 1-ULONG_MAX range. It uses an augmented red-black tree that keeps blocks sorted by their offsets, paired with a linked list keeping the free space in order of increasing addresses. Nodes are augmented with the size of the maximum available free block in their left or right sub-tree. Thus, that allows taking a decision and traversing toward the block that will fit and will have the lowest start address, i.e. sequential allocation. Allocation: to allocate a new block a search is done over the tree until a suitable lowest (left-most) block is large enough to encompass the requested size, alignment and vstart point. If the block is bigger than the requested size, it is split. De-allocation: when a busy vmap area is freed it can either be merged or inserted into the tree. The red-black tree allows efficiently finding a spot, whereas the linked list provides constant-time access to the previous and next blocks to check whether merging can be done. In case of merging of a de-allocated memory chunk, a large coalesced area is created. Complexity: ~O(log(N)) [urezki@gmail.com: v3] Link: http://lkml.kernel.org/r/20190402162531.10888-2-urezki@gmail.com [urezki@gmail.com: v4] Link: http://lkml.kernel.org/r/20190406183508.25273-2-urezki@gmail.com Link: http://lkml.kernel.org/r/20190321190327.11813-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Thomas Garnier <thgarnie@google.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Joel Fernandes <joelaf@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
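A minimal, self-contained C sketch of the augmentation idea described above, in which each node caches the largest free block anywhere in its subtree so a search can skip subtrees that cannot satisfy a request, might look like this (illustrative only, not the kernel's vmap_area code):

    /* Each node caches the largest free block found in its subtree. */
    struct free_area {
            unsigned long va_start;
            unsigned long size;              /* size of this free block */
            unsigned long subtree_max_size;  /* largest free block in this subtree */
            struct free_area *left, *right, *parent;
    };

    static unsigned long child_max(const struct free_area *n)
    {
            return n ? n->subtree_max_size : 0;
    }

    /* Re-derive subtree_max_size for a node and its ancestors after the node
     * changed, stopping as soon as a cached value is already correct. */
    static void propagate_up(struct free_area *n)
    {
            while (n) {
                    unsigned long m = n->size;

                    if (child_max(n->left) > m)
                            m = child_max(n->left);
                    if (child_max(n->right) > m)
                            m = child_max(n->right);
                    if (n->subtree_max_size == m)
                            break;          /* upper levels are already correct */
                    n->subtree_max_size = m;
                    n = n->parent;
            }
    }

During a lookup, a subtree whose cached subtree_max_size is smaller than the requested size is skipped entirely, which is what gives the ~O(log(N)) behaviour described in the commit.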
-
- 18 May 2019, 1 commit
-
-
Committed by Heiner Kallweit
i2c_new_dummy is typically called from the probe function of the driver for the primary i2c client. It requires calls to i2c_unregister_device in the error path of the probe function and in the remove function. This can be simplified by introducing a device-managed version. Note the changed error-case return type: i2c_new_dummy returns NULL, whilst devm_i2c_new_dummy_device returns an ERR_PTR. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> [wsa: rename new functions and fix minor kdoc issues] Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Reviewed-by: Peter Rosin <peda@axentia.se> Reviewed-by: Kieran Bingham <kieran.bingham+renesas@ideasonboard.com> Reviewed-by: Bartosz Golaszewski <bgolaszewski@baylibre.com> Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
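A minimal sketch of the intended usage in a hypothetical driver's probe path (the "foo" driver and the second address are made up; the assumed shape of the call is devm_i2c_new_dummy_device(dev, adapter, address)):

    #include <linux/i2c.h>
    #include <linux/err.h>

    static int foo_probe(struct i2c_client *client,
                         const struct i2c_device_id *id)
    {
            struct i2c_client *second;

            /* The secondary address is illustrative; real drivers read it
             * from platform data or the device tree. */
            second = devm_i2c_new_dummy_device(&client->dev, client->adapter,
                                               client->addr + 1);
            if (IS_ERR(second))
                    return PTR_ERR(second);    /* ERR_PTR, not NULL, on failure */

            /* No i2c_unregister_device() needed in remove() or error paths:
             * the dummy client goes away with the primary client's device. */
            return 0;
    }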
-
- 17 May 2019, 2 commits
-
-
Committed by Qian Cai
It turned out that DEBUG_SLAB_LEAK is still broken even after recent rescue efforts: when there is a large number of objects like kmemleak_object, which is normal on a debug kernel, # grep kmemleak /proc/slabinfo kmemleak_object 2243606 3436210 ... reading /proc/slab_allocators could easily loop forever while processing the kmemleak_object cache, and any additional freeing or allocating of objects will trigger a reprocessing. To make the situation worse, soft lockups could easily happen in this situation, which will call printk() to allocate more kmemleak objects, guaranteeing an infinite loop. Also, since it seems no one had noticed when it was totally broken more than 2 years ago - see commit fcf88917 ("slab: fix a crash by reading /proc/slab_allocators") - probably nobody cares about it anymore due to the decline of SLAB. Just remove it entirely. Suggested-by: Vlastimil Babka <vbabka@suse.cz> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by David Howells
Wire up the mount API syscalls on non-x86 arches. Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
- 16 May 2019, 5 commits
-
-
Committed by David Howells
Allow kernel services using AF_RXRPC to indicate that a call should be non-interruptible. This allows kafs to make things like lock-extension and writeback data storage calls non-interruptible. If this is set, signals will be ignored for operations on that call where possible - such as waiting to get a call channel on an rxrpc connection. It doesn't prevent UDP sendmsg from being interrupted, but that will be handled by packet retransmission. rxrpc_kernel_recv_data() isn't affected by this since it never waits, preferring instead to return -EAGAIN and leave the waiting to the caller. Userspace-initiated calls can't be set to be uninterruptible at this time. Signed-off-by: David Howells <dhowells@redhat.com>
-
Committed by David Howells
Provide an interface to set the maximum lifespan on a call from inside the kernel without having to call kernel_sendmsg(). Signed-off-by: David Howells <dhowells@redhat.com>
-
Committed by Stephen Boyd
Now that we've gotten rid of clk_readl() we can remove io.h from the clk-provider header and push out the io.h include to any code that isn't already including the io.h header but using things like readl/writel, etc. Found with this grep: git grep -l clk-provider.h | grep '.c$' | xargs git grep -L 'linux/io.h' | \ xargs git grep -l \ -e '\<__iowrite32_copy\>' --or \ -e '\<__ioread32_copy\>' --or \ -e '\<__iowrite64_copy\>' --or \ -e '\<ioremap_page_range\>' --or \ -e '\<ioremap_huge_init\>' --or \ -e '\<arch_ioremap_pud_supported\>' --or \ -e '\<arch_ioremap_pmd_supported\>' --or \ -e '\<devm_ioport_map\>' --or \ -e '\<devm_ioport_unmap\>' --or \ -e '\<IOMEM_ERR_PTR\>' --or \ -e '\<devm_ioremap\>' --or \ -e '\<devm_ioremap_nocache\>' --or \ -e '\<devm_ioremap_wc\>' --or \ -e '\<devm_iounmap\>' --or \ -e '\<devm_ioremap_release\>' --or \ -e '\<devm_memremap\>' --or \ -e '\<devm_memunmap\>' --or \ -e '\<__devm_memremap_pages\>' --or \ -e '\<pci_remap_cfgspace\>' --or \ -e '\<arch_has_dev_port\>' --or \ -e '\<arch_phys_wc_add\>' --or \ -e '\<arch_phys_wc_del\>' --or \ -e '\<memremap\>' --or \ -e '\<memunmap\>' --or \ -e '\<arch_io_reserve_memtype_wc\>' --or \ -e '\<arch_io_free_memtype_wc\>' --or \ -e '\<__io_aw\>' --or \ -e '\<__io_pbw\>' --or \ -e '\<__io_paw\>' --or \ -e '\<__io_pbr\>' --or \ -e '\<__io_par\>' --or \ -e '\<__raw_readb\>' --or \ -e '\<__raw_readw\>' --or \ -e '\<__raw_readl\>' --or \ -e '\<__raw_readq\>' --or \ -e '\<__raw_writeb\>' --or \ -e '\<__raw_writew\>' --or \ -e '\<__raw_writel\>' --or \ -e '\<__raw_writeq\>' --or \ -e '\<readb\>' --or \ -e '\<readw\>' --or \ -e '\<readl\>' --or \ -e '\<readq\>' --or \ -e '\<writeb\>' --or \ -e '\<writew\>' --or \ -e '\<writel\>' --or \ -e '\<writeq\>' --or \ -e '\<readb_relaxed\>' --or \ -e '\<readw_relaxed\>' --or \ -e '\<readl_relaxed\>' --or \ -e '\<readq_relaxed\>' --or \ -e '\<writeb_relaxed\>' --or \ -e '\<writew_relaxed\>' --or \ -e '\<writel_relaxed\>' --or \ -e '\<writeq_relaxed\>' --or \ -e '\<readsb\>' --or \ -e '\<readsw\>' --or \ -e '\<readsl\>' --or \ -e '\<readsq\>' --or \ -e '\<writesb\>' --or \ -e '\<writesw\>' --or \ -e '\<writesl\>' --or \ -e '\<writesq\>' --or \ -e '\<inb\>' --or \ -e '\<inw\>' --or \ -e '\<inl\>' --or \ -e '\<outb\>' --or \ -e '\<outw\>' --or \ -e '\<outl\>' --or \ -e '\<inb_p\>' --or \ -e '\<inw_p\>' --or \ -e '\<inl_p\>' --or \ -e '\<outb_p\>' --or \ -e '\<outw_p\>' --or \ -e '\<outl_p\>' --or \ -e '\<insb\>' --or \ -e '\<insw\>' --or \ -e '\<insl\>' --or \ -e '\<outsb\>' --or \ -e '\<outsw\>' --or \ -e '\<outsl\>' --or \ -e '\<insb_p\>' --or \ -e '\<insw_p\>' --or \ -e '\<insl_p\>' --or \ -e '\<outsb_p\>' --or \ -e '\<outsw_p\>' --or \ -e '\<outsl_p\>' --or \ -e '\<ioread8\>' --or \ -e '\<ioread16\>' --or \ -e '\<ioread32\>' --or \ -e '\<ioread64\>' --or \ -e '\<iowrite8\>' --or \ -e '\<iowrite16\>' --or \ -e '\<iowrite32\>' --or \ -e '\<iowrite64\>' --or \ -e '\<ioread16be\>' --or \ -e '\<ioread32be\>' --or \ -e '\<ioread64be\>' --or \ -e '\<iowrite16be\>' --or \ -e '\<iowrite32be\>' --or \ -e '\<iowrite64be\>' --or \ -e '\<ioread8_rep\>' --or \ -e '\<ioread16_rep\>' --or \ -e '\<ioread32_rep\>' --or \ -e '\<ioread64_rep\>' --or \ -e '\<iowrite8_rep\>' --or \ -e '\<iowrite16_rep\>' --or \ -e '\<iowrite32_rep\>' --or \ -e '\<iowrite64_rep\>' --or \ -e '\<__io_virt\>' --or \ -e '\<pci_iounmap\>' --or \ -e '\<virt_to_phys\>' --or \ -e '\<phys_to_virt\>' --or \ -e '\<ioremap_uc\>' --or \ -e '\<ioremap\>' --or \ -e '\<__ioremap\>' --or \ -e '\<iounmap\>' --or \ -e '\<ioremap\>' --or \ -e 
'\<ioremap_nocache\>' --or \ -e '\<ioremap_uc\>' --or \ -e '\<ioremap_wc\>' --or \ -e '\<ioremap_wc\>' --or \ -e '\<ioremap_wt\>' --or \ -e '\<ioport_map\>' --or \ -e '\<ioport_unmap\>' --or \ -e '\<ioport_map\>' --or \ -e '\<ioport_unmap\>' --or \ -e '\<xlate_dev_kmem_ptr\>' --or \ -e '\<xlate_dev_mem_ptr\>' --or \ -e '\<unxlate_dev_mem_ptr\>' --or \ -e '\<virt_to_bus\>' --or \ -e '\<bus_to_virt\>' --or \ -e '\<memset_io\>' --or \ -e '\<memcpy_fromio\>' --or \ -e '\<memcpy_toio\>' I also reordered a couple includes when they weren't alphabetical and removed clk.h from kona, replacing it with clk-provider.h because that driver doesn't use clk consumer APIs. Acked-by: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Chen-Yu Tsai <wens@csie.org> Acked-by: Maxime Ripard <maxime.ripard@bootlin.com> Acked-by: Tero Kristo <t-kristo@ti.com> Acked-by: Sekhar Nori <nsekhar@ti.com> Cc: Krzysztof Kozlowski <krzk@kernel.org> Acked-by: Mark Brown <broonie@kernel.org> Cc: Chris Zankel <chris@zankel.net> Acked-by: Max Filippov <jcmvbkbc@gmail.com> Acked-by: John Crispin <john@phrozen.org> Acked-by: Heiko Stuebner <heiko@sntech.de> Signed-off-by: Stephen Boyd <sboyd@kernel.org>
-
Committed by David Howells
Add wait_var_event_interruptible() to allow interruptible waits for events. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
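A small sketch of how a caller might use the new helper (the reference-count variable and its owner are hypothetical):

    #include <linux/atomic.h>
    #include <linux/wait_bit.h>

    /* Sleep until a shared counter drops to zero, but let a signal interrupt
     * the wait: returns 0 on success, -ERESTARTSYS if a signal arrived. */
    static int wait_for_users_to_drop(atomic_t *nr_users)
    {
            return wait_var_event_interruptible(nr_users,
                                                atomic_read(nr_users) == 0);
    }

    /* The waking side pairs with it in the usual wait_var_event() way:
     *      if (atomic_dec_and_test(&obj->nr_users))
     *              wake_up_var(&obj->nr_users);
     */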
-
Committed by David Howells
Allow used DNS resolver keys to be invalidated after use if the caller is doing its own caching of the results. This reduces the amount of resources required. Fix AFS to invalidate DNS results, killing off permanent-failure records that get lodged in the resolver keyring and prevent future lookups from happening. Fixes: 0a5143f2 ("afs: Implement VL server rotation") Signed-off-by: David Howells <dhowells@redhat.com>
-
- 15 May 2019, 6 commits
-
-
Committed by Johannes Weiner
Right now, when somebody needs to know the recursive memory statistics and events of a cgroup subtree, they need to walk the entire subtree and sum up the counters manually. There are two issues with this: 1. When a cgroup gets deleted, its stats are lost. The state counters should all be 0 at that point, of course, but the events are not. When this happens, the event counters, which are supposed to be monotonic, can go backwards in the parent cgroups. 2. During regular operation, we always have a certain number of lazily freed cgroups sitting around that have been deleted, have no tasks, but have a few cache pages remaining. These groups' statistics do not change until we eventually hit memory pressure, but somebody watching, say, memory.stat on an ancestor has to iterate over them every time. This patch addresses both issues by introducing recursive counters at each level that are propagated from the write side when stats change. Upward propagation happens when the per-cpu caches spill over into the local atomic counter. This is the same thing we do during charge and uncharge, except that the latter uses atomic RMWs, which are more expensive; stat changes happen at around the same rate. In a sparse file test (page faults and reclaim at maximum CPU speed) with 5 cgroup nesting levels, perf shows __mod_memcg_page_state at ~1%. Link: http://lkml.kernel.org/r/20190412151507.2769-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
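Conceptually, the write-side propagation amounts to the following sketch (illustrative plain C, not the memcontrol code): small updates accumulate in a cheap per-cpu style delta, and once the delta crosses a threshold it is flushed into the atomic counter of the group and of every ancestor, so readers get recursive totals in O(1):

    #include <stdatomic.h>

    #define FLUSH_THRESHOLD 64

    struct group {
            struct group *parent;
            _Atomic long stat;       /* recursive total, cheap to read */
            long pcpu_delta;         /* stand-in for the per-cpu cache */
    };

    /* Write side: batch small updates, then push the batch up the hierarchy. */
    static void mod_stat(struct group *g, long val)
    {
            g->pcpu_delta += val;
            if (g->pcpu_delta >= FLUSH_THRESHOLD ||
                g->pcpu_delta <= -FLUSH_THRESHOLD) {
                    long delta = g->pcpu_delta;

                    g->pcpu_delta = 0;
                    for (struct group *p = g; p; p = p->parent)
                            atomic_fetch_add_explicit(&p->stat, delta,
                                                      memory_order_relaxed);
            }
    }

    /* Read side: the recursive value is just the local atomic counter. */
    static long read_stat(struct group *g)
    {
            return atomic_load_explicit(&g->stat, memory_order_relaxed);
    }

The batching is what keeps the write-side cost close to the existing per-cpu scheme while removing the on-demand subtree walk from the read side.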
-
Committed by Johannes Weiner
These are getting too big to be inlined into every callsite. They were stolen from vmstat.c, which already out-of-lines them, and they have only been growing since. The callsites aren't that hot, either. Move __mod_memcg_state(), __mod_lruvec_state() and __count_memcg_events() out of line and add kerneldoc comments. Link: http://lkml.kernel.org/r/20190412151507.2769-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Johannes Weiner
Patch series "mm: memcontrol: memory.stat cost & correctness". The cgroup memory.stat file holds recursive statistics for the entire subtree. The current implementation does this tree walk on demand whenever the file is read. This is giving us problems in production. 1. The cost of aggregating the statistics on demand is high. A lot of system-service cgroups are mostly idle and their stats don't change between reads, yet we always have to check them. There are also always some lazily dying cgroups sitting around that are pinned by a handful of remaining page cache; the same applies to them. In an application that periodically monitors memory.stat in our fleet, we have seen the aggregation consume up to 5% CPU time. 2. When cgroups die and disappear from the cgroup tree, so do their accumulated vm events. The result is that the event counters at higher-level cgroups can go backwards and confuse some of our automation, let alone people looking at the graphs over time. To address both issues, this patch series changes the stat implementation to spill counts upwards when the counters change. The upward spilling is batched using the existing per-cpu cache. In a sparse-file stress test with 5-level cgroup nesting, the additional cost of the flushing was negligible (a little under 1% of CPU at 100% CPU utilization, compared to the 5% of reading memory.stat during regular operation). This patch (of 4): memcg_page_state(), lruvec_page_state() and memcg_sum_events() currently return the state of the local memcg or lruvec, not the recursive state. In practice there is demand for both versions, although the callers that want the recursive counts currently sum them up by hand. By default, cgroups are considered recursive entities, and generally we expect more users of the recursive counters, with the local counts being special cases. To reflect that in the name, add a _local suffix to the current implementations. The following patch will re-incarnate these functions with recursive semantics, but with an O(1) implementation. [hannes@cmpxchg.org: fix bisection hole] Link: http://lkml.kernel.org/r/20190417160347.GC23013@cmpxchg.org Link: http://lkml.kernel.org/r/20190412151507.2769-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Chris Down
I spent literally an hour trying to work out why an earlier version of my memory.events aggregation code didn't work properly, only to find out I was calling memcg->events instead of memcg->memory_events, which is fairly confusing. This naming seems in need of reworking, so make it harder to do the wrong thing by using vmevents instead of events, which makes it clearer that these are vm counters rather than memcg-specific counters. There are also a few other inconsistent names in both the percpu and aggregated structs, so these are all cleaned up to be more coherent and easy to understand. This commit contains code cleanup only: there are no logic changes. [akpm@linux-foundation.org: fix it for preceding changes] Link: http://lkml.kernel.org/r/20190208224319.GA23801@chrisdown.name Signed-off-by: Chris Down <chris@chrisdown.name> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Dennis Zhou <dennis@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Masahiro Yamada
Now that all instances of #include <asm/sizes.h> have been replaced with #include <linux/sizes.h>, we can remove these. Link: http://lkml.kernel.org/r/1553267665-27228-2-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Andrei Vagin
This file uses "task" 85 times and "tsk" 25 times. It is better to be consistent. Link: http://lkml.kernel.org/r/20181129180547.15976-1-avagin@gmail.com Signed-off-by: Andrei Vagin <avagin@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-