- 19 June 2019, 2 commits
-
-
Committed by Jason Gunthorpe
Update the struct ib_client for all modules exporting cdevs related to the ibdevice to also implement RDMA_NLDEV_CMD_GET_CHARDEV. All cdevs are now autoloadable and discoverable by userspace over netlink instead of relying on sysfs. uverbs also exposes the DRIVER_ID for drivers that are able to support driver id binding in rdma-core. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
-
Committed by Jason Gunthorpe
Allow userspace to issue a netlink query against the ib_device for something like "uverbs" and get back the char dev name, inode major/minor, and interface ABI information for "uverbs0". Since we are now in netlink, this can also trigger a module autoload to make the uverbs device come into existence. Largely this will let us replace searching and reading inside sysfs to set up devices, and it provides an alternative (using driver_id) to device-name-based provider binding for things like rxe. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
-
- 18 June 2019, 1 commit
-
-
Committed by Jason Gunthorpe
This enum is exposed over the sysfs file 'node_type' and over netlink via RDMA_NLDEV_ATTR_DEV_NODE_TYPE, so declare it in the uapi headers. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
-
- 16 June 2019, 2 commits
-
-
Committed by Maor Gottlieb
Add an API to get the current Eswitch encap mode. It will be used in downstream patches to check whether a flow table can be created with encap support or not. Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Reviewed-by: Petr Vorel <pvorel@suse.cz> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com>
-
Committed by Leon Romanovsky
Devlink has a UAPI declaration for the encap mode, so there is no need to be loose about the data that drivers get and set. Update call sites to use enum devlink_eswitch_encap_mode instead of plain u8. Suggested-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Petr Vorel <pvorel@suse.cz>
-
- 14 June 2019, 6 commits
-
-
Committed by Yuval Avnery
Previously, the EQ joined the chain notifier on creation. This forced the caller to be ready to handle events before creating the EQ through the eq_create_generic interface. To let the caller control when the created EQ is attached to the IRQ, add an enable/disable API. Signed-off-by: Yuval Avnery <yuvalav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
Committed by Ariel Levkovich
The patch modifies the IRQ allocation so that all async EQs are assigned to the same IRQ, resulting in more available IRQs for completion EQs. The changes rely on the support for IRQ sharing and the EQ polling budget introduced in previous patches, so when the shared interrupt is triggered, the kernel serially calls the handler of each of the sharing EQs with a certain budget of EQEs to poll in order to prevent starvation. Signed-off-by: Ariel Levkovich <lariel@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
Committed by Yuval Avnery
The IRQ table should exist only for the PF and VF mlx5_core_dev. The EQ table of mediated devices should hold a pointer to the IRQ table of the parent PCI device. Signed-off-by: Yuval Avnery <yuvalav@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
Committed by Yuval Avnery
Instead of requesting IRQs at EQ creation, IRQs will be requested before EQ table creation. Instead of freeing the IRQs after EQ destroy, free them after EQ table destroy. Signed-off-by: Yuval Avnery <yuvalav@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
Committed by Yuval Avnery
Multiple EQs may share the same IRQ in subsequent patches. Instead of calling the IRQ handler directly, the EQ will register to an atomic chain notifier. The Linux built-in shared IRQ mechanism is not used because it forces the caller to disable the IRQ and clear affinity before free_irq() can be called. This patch is the first step in the separation of IRQ and EQ logic. Signed-off-by: Yuval Avnery <yuvalav@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
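The atomic notifier chain referred to here is the generic kernel facility from <linux/notifier.h>. A minimal sketch of the pattern, with purely illustrative names rather than the actual mlx5 code, shows how one IRQ handler can fan out to several registered consumers:

    #include <linux/notifier.h>
    #include <linux/interrupt.h>

    /* One chain per IRQ; every EQ-like consumer registers a notifier_block. */
    static ATOMIC_NOTIFIER_HEAD(foo_irq_chain);

    static int foo_eq_notify(struct notifier_block *nb, unsigned long action,
                             void *data)
    {
            /* poll this EQ up to its budget here */
            return NOTIFY_OK;
    }

    static struct notifier_block foo_eq_nb = { .notifier_call = foo_eq_notify };

    static irqreturn_t foo_irq_handler(int irq, void *dev_id)
    {
            /* Fan the interrupt out to every registered consumer atomically. */
            atomic_notifier_call_chain(&foo_irq_chain, 0, NULL);
            return IRQ_HANDLED;
    }

    /* Consumers attach and detach without touching the IRQ itself:       */
    /*     atomic_notifier_chain_register(&foo_irq_chain, &foo_eq_nb);    */
    /*     atomic_notifier_chain_unregister(&foo_irq_chain, &foo_eq_nb);  */

This is why the built-in shared-IRQ mechanism is not needed: consumers come and go on the chain while the IRQ stays requested.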
-
Committed by Bodong Wang
For an ECPF with eswitch manager privilege, query the host max VF count by querying the device with the query_functions command. With this enhancement: 1. flow steering entries are created only for valid vports, based on the max VF count of the PF; 2. the driver only queries caps of valid vports. Eswitch requires the max VF count when doing initialization, so do the SR-IOV init before the eswitch init. Signed-off-by: Bodong Wang <bodong@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
- 12 June 2019, 2 commits
-
-
Committed by Leon Romanovsky
Ensure that the CQ is allocated and freed by IB/core and not by drivers. Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Acked-by: Gal Pressman <galpress@amazon.com> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Tested-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
-
Committed by Leon Romanovsky
Like all other destroy commands, the .destroy_cq() call is not supposed to fail. In all flows, the attempt to return early caused memory leaks. This patch converts .destroy_cq() so that it does not return any errors. Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Acked-by: Gal Pressman <galpress@amazon.com> Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
-
- 11 June 2019, 4 commits
-
-
Committed by Jason Gunthorpe
This more closely follows how other subsystems work, with owner being a member of the structure containing the function pointers. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
Committed by Jason Gunthorpe
There is no reason for every driver to emit code to set this; just make it part of the driver's existing static const ops structure. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
Committed by Jason Gunthorpe
There is no reason for every driver to emit code to set this; just make it part of the driver's existing static const ops structure. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
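These three adjacent patches follow one pattern: per-driver constants move out of registration-time assignments and into the driver's static const struct ib_device_ops, next to its callbacks. A minimal sketch for a hypothetical driver (the field values are placeholders, not taken from any real driver):

    #include <linux/module.h>
    #include <rdma/ib_verbs.h>

    /* Hypothetical "foo" driver: constants sit beside the verbs callbacks. */
    static const struct ib_device_ops foo_dev_ops = {
            .owner          = THIS_MODULE,
            .driver_id      = RDMA_DRIVER_UNKNOWN, /* placeholder; a real driver uses its own id */
            .uverbs_abi_ver = 1,                   /* placeholder ABI version */

            /* ... the driver's existing verbs callbacks ... */
    };

The structure is then handed to the core once via ib_set_device_ops(), as before.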
-
Committed by Jason Gunthorpe
This has been marked CONFIG_BROKEN for over a year now with no complaints. Delete the whole thing for good. The module provided the /dev/infiniband/ucmX interface. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
- 01 June 2019, 5 commits
-
-
Committed by Parav Pandit
Currently, for every representor type and for every single vport, a copy of the representor function pointers is stored even though they do not change from one vport to another. Additionally, the priv data entry for the rep is not passed during registration, but it is copied; it is used (set and cleared) by the user of the reps. As we want to scale vports, to simplify and also to split constants from data: 1. Rename mlx5_eswitch_rep_if to mlx5_eswitch_rep_ops to match the _ops suffix used by other standard netdev and ibdev ops. 2. Constify the IB and Ethernet rep ops structures. 3. Instead of storing a copy of all rep function pointers, store one copy per eswitch rep type. 4. Split data and function pointers into mlx5_eswitch_rep_ops and mlx5_eswitch_rep_data. Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
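The resulting split is roughly of the following shape; this is reconstructed from the commit description, so the exact member lists should be treated as illustrative rather than the authoritative header contents:

    /* Constant callbacks, shared by every vport of a given rep type. */
    struct mlx5_eswitch_rep_ops {
            int   (*load)(struct mlx5_core_dev *dev, struct mlx5_eswitch_rep *rep);
            void  (*unload)(struct mlx5_eswitch_rep *rep);
            void *(*get_proto_dev)(struct mlx5_eswitch_rep *rep);
    };

    /* Small mutable per-rep state, set and cleared by the user of the rep. */
    struct mlx5_eswitch_rep_data {
            void            *priv;
            atomic_t        state;
    };

Keeping the ops const and per-type, while the data stays per-rep, is what lets the vport count scale without duplicating function-pointer tables.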
-
Committed by Vu Pham
Whenever the device supports the eswitch functions changed event, honor that device setting. Do not limit it to the ECPF. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Vu Pham <vuhuong@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
Committed by Vu Pham
To support SR-IOV on an E-Switch manager, num_vfs is queried from the firmware whenever the E-Switch manager is notified by the esw_functions_changed event. Replace the host_params event with the esw_functions_changed event, which is a more appropriate name. While at it, also correct the type of num_vfs from int to u16, as expected by the function mlx5_esw_query_functions(). Signed-off-by: Vu Pham <vuhuong@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Bodong Wang <bodong@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
Committed by Eli Britstein
A termination table is a flow table with a termination flag. The flag allows the firmware to assume that the specified actions are the last in the actions list. This assumption allows the FW to safely perform potential looping logic (e.g. hairpin). Introduce the bits for this attribute. Signed-off-by: Eli Britstein <elibr@mellanox.com> Reviewed-by: Oz Shlomo <ozsh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
Committed by Moshe Shemesh
Add firmware core dump registers and HW definitions. Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
- 22 May 2019, 2 commits
-
-
Committed by Leon Romanovsky
Kernel destroy CQ flows can't fail, and the return value of ib_destroy_cq() is of no interest in those flows. Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
Committed by Jason Gunthorpe
This value has always been set to PAGE_SHIFT in the core code; the only thing that did anything differently was the ODP path. Move the value into the ODP struct and still use it for ODP, but change all the non-ODP users to just use PAGE_SHIFT/PAGE_SIZE/PAGE_MASK directly. Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
-
- 19 May 2019, 2 commits
-
-
Committed by Feng Tang
Currently on panic, the kernel will lower the loglevel and print out pending printk messages only with console_flush_on_panic(). Add an option for users to configure "panic_print" to replay all dmesg in the buffer, some of which they may never have seen due to the loglevel setting; this will help panic debugging. [feng.tang@intel.com: keep the original console_flush_on_panic() inside panic()] Link: http://lkml.kernel.org/r/1556199137-14163-1-git-send-email-feng.tang@intel.com [feng.tang@intel.com: use logbuf lock to protect the console log index] Link: http://lkml.kernel.org/r/1556269868-22654-1-git-send-email-feng.tang@intel.com Link: http://lkml.kernel.org/r/1556095872-36838-1-git-send-email-feng.tang@intel.com Signed-off-by: Feng Tang <feng.tang@intel.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: Aaro Koskinen <aaro.koskinen@nokia.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Borislav Petkov <bp@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
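Assuming the new replay option is an additional bit in the existing panic_print bitmask (bit 5, i.e. 0x20, per the parameter documentation of that release; treat the exact value as an assumption), a user would enable it either on the kernel command line or via the sysctl:

    panic_print=0x20                            (boot-time kernel parameter)
    echo 0x20 > /proc/sys/kernel/panic_print    (at run time)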
-
Committed by Uladzislau Rezki (Sony)
Patch series "improve vmap allocation", v3.

Objective: Please have a look at the description at https://lkml.org/lkml/2018/10/19/786 but let me also summarize it a bit here. The current implementation has O(N) complexity. Requests with different permissive parameters can lead to long allocation times. When I say "long" I mean milliseconds.

Description: This approach organizes the KVA memory layout into free areas of the 1-ULONG_MAX range, i.e. an allocation is done by looking up free areas instead of finding a hole between two busy blocks. It allows having a lower number of objects representing the free space, and therefore a less fragmented memory allocator, because free blocks are always as large as possible. It uses an augmented tree where all free areas are sorted in ascending order of va->va_start address, paired with a linked list that provides O(1) access to prev/next elements. Since the tree is augmented, we also maintain the "subtree_max_size" of a VA that reflects the maximum available free block in its left or right sub-tree. Knowing that, we can easily traverse toward the lowest (left-most path) free area.

Allocation: ~O(log(N)) complexity. It is a sequential allocation method and therefore tends to maximize locality. The search continues until the first suitable block that is large enough to encompass the requested parameters is found. Bigger areas are split. I copy-paste here the description of how an area is split, since I described it in https://lkml.org/lkml/2018/10/19/786

<snip> A free block can be split in three different ways. Their names are FL_FIT_TYPE, LE_FIT_TYPE/RE_FIT_TYPE and NE_FIT_TYPE, i.e. they correspond to how the requested size and alignment fit into a free block. FL_FIT_TYPE - in this case a free block is just removed from the free list/tree because it fully fits. Compared with the current design there is extra work with rb-tree updating. LE_FIT_TYPE/RE_FIT_TYPE - the left/right edge fits. In this case we just cut a free block. It is as fast as the current design. Most vmalloc allocations end up in this case, because the edge is always aligned to 1. NE_FIT_TYPE - a much less common case. Basically it happens when the requested size and alignment fit neither the left nor the right edge, i.e. the request lies between them. In this case during splitting we have to build a remaining left free area and place it back into the free list/tree. Compared with the current design there are two extra steps: first, we have to allocate a new vmap_area structure; second, we have to insert that remaining free block into the address-sorted list/tree. In order to optimize the first case there is a cache of free_vmap objects; instead of allocating from the slab we just take an object from the cache and reuse it. The second one is already well optimized: since we know a start point in the tree we do not search from the top, instead the traversal begins from the rb-tree node we split. <snip>

De-allocation: ~O(log(N)) complexity. An area is not inserted straight away into the tree/list; instead we identify the spot first, checking whether it can be merged with its neighbors. The list provides O(1) access to prev/next, so it is pretty fast to check. Summarizing: if merged, large coalesced areas are created; if not, the area is just linked, making more fragments. There is one more thing that I should mention here: after modification of a VA node, its subtree_max_size is updated if it was/is the biggest area in its left or right sub-tree.
Apart from that, it can also be propagated back to upper levels to fix the tree. For more details please have a look at the __augment_tree_propagate_from() function and its description.

Tests and stressing: I use the "test_vmalloc.sh" test driver available under "tools/testing/selftests/vm/" since the 5.1-rc1 kernel. Just trigger "sudo ./test_vmalloc.sh" to find out how to deal with it. Tested on different platforms including x86_64/i686/ARM64/x86_64_NUMA. Regarding the last one, I do not have any physical access to a NUMA system, therefore I emulated it. The stressing time is days. If you run the test driver in "stress mode", you also need the patch that is in Andrew's tree but not in Linux 5.1-rc1, so please apply it: http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/commit/?id=e0cf7749bade6da318e98e934a24d8b62fab512c After massive testing, I have not identified any problems like memory leaks, crashes or kernel panics. I find it stable, but more testing would be good.

Performance analysis: I have used two systems to test. One is an i5-3320M CPU @ 2.60GHz and the other is a HiKey960 (arm64) board. The i5-3320M runs a 4.20 kernel, whereas the HiKey960 uses a 4.15 kernel. Both systems could also run 5.1-rc1, but those results were not ready by the time I am writing this. Currently it consists of 8 tests. Three of them correspond to the different types of splitting (to compare with the default); we have 3 of those (see above). The other 5 do allocations under different conditions.

a) sudo ./test_vmalloc.sh performance. When the test driver is run in "performance" mode, it runs all available tests pinned to the first online CPU with sequential test execution order. We do it in order to get stable and repeatable results. Take a look at the time difference in "long_busy_list_alloc_test"; it is not surprising because the worst case is O(N). # i5-3320M, how many cycles all tests took: CPU0=646919905370 (default) cycles vs CPU0=193290498550 (patched) cycles. See detailed tables with results here: ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_default.txt ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_patched.txt # HiKey960 8x CPUs, how many cycles all tests took: CPU0=3478683207 cycles vs CPU0=463767978 cycles. See detailed tables with results here: ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_default.txt ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_patched.txt

b) time sudo ./test_vmalloc.sh test_repeat_count=1. With this configuration, all tests are run on all available online CPUs. Before running, each CPU shuffles its test execution order, which gives random allocation behaviour. So it is a rough comparison, but it puts things in the picture for sure. # i5-3320M, <default> vs <patched>: real 101m22.813s vs 0m56.805s, user 0m0.011s vs 0m0.015s, sys 0m5.076s vs 0m0.023s. See detailed tables with results here: ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_default.txt ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_patched.txt # HiKey960 8x CPUs, <default> vs <patched>: real unknown vs 4m25.214s, user unknown vs 0m0.011s, sys unknown vs 0m0.670s. I did not manage to complete this test on the default HiKey960 kernel version; after 24 hours it was still running, therefore I had to cancel it. That is why real/user/sys are "unknown".
This patch (of 3): Currently an allocation of a new vmap area is done over a busy-list iteration (complexity O(n)) until a suitable hole is found between two busy areas, so each new allocation causes the list to grow. Due to an over-fragmented list and different permissive parameters an allocation can take a long time; for example on embedded devices it is milliseconds. This patch organizes the KVA memory layout into free areas of the 1-ULONG_MAX range. It uses an augmented red-black tree that keeps blocks sorted by their offsets, paired with a linked list keeping the free space in order of increasing addresses. Nodes are augmented with the size of the maximum available free block in their left or right sub-tree. Thus, that allows taking a decision and traversing toward the block that will fit and will have the lowest start address, i.e. sequential allocation. Allocation: to allocate a new block a search is done over the tree until a suitable lowest (left-most) block is large enough to encompass the requested size, alignment and vstart point. If the block is bigger than the requested size, it is split. De-allocation: when a busy vmap area is freed it can either be merged or inserted into the tree. The red-black tree allows efficiently finding a spot, whereas the linked list provides constant-time access to the previous and next blocks to check whether merging can be done. In case of merging of a de-allocated memory chunk, a large coalesced area is created. Complexity: ~O(log(N)) [urezki@gmail.com: v3] Link: http://lkml.kernel.org/r/20190402162531.10888-2-urezki@gmail.com [urezki@gmail.com: v4] Link: http://lkml.kernel.org/r/20190406183508.25273-2-urezki@gmail.com Link: http://lkml.kernel.org/r/20190321190327.11813-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Thomas Garnier <thgarnie@google.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Joel Fernandes <joelaf@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
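A minimal, self-contained C sketch of the augmentation idea described above, in which each node caches the largest free block anywhere in its subtree so a search can skip subtrees that cannot satisfy a request, might look like this (illustrative only, not the kernel's vmap_area code):

    /* Each node caches the largest free block found in its subtree. */
    struct free_area {
            unsigned long va_start;
            unsigned long size;              /* size of this free block */
            unsigned long subtree_max_size;  /* largest free block in this subtree */
            struct free_area *left, *right, *parent;
    };

    static unsigned long child_max(const struct free_area *n)
    {
            return n ? n->subtree_max_size : 0;
    }

    /* Re-derive subtree_max_size for a node and its ancestors after the node
     * changed, stopping as soon as a cached value is already correct. */
    static void propagate_up(struct free_area *n)
    {
            while (n) {
                    unsigned long m = n->size;

                    if (child_max(n->left) > m)
                            m = child_max(n->left);
                    if (child_max(n->right) > m)
                            m = child_max(n->right);
                    if (n->subtree_max_size == m)
                            break;          /* upper levels are already correct */
                    n->subtree_max_size = m;
                    n = n->parent;
            }
    }

During a lookup, a subtree whose cached subtree_max_size is smaller than the requested size is skipped entirely, which is what gives the ~O(log(N)) behaviour described in the commit.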
-
- 18 May 2019, 1 commit
-
-
Committed by Heiner Kallweit
i2c_new_dummy is typically called from the probe function of the driver for the primary i2c client. It requires calls to i2c_unregister_device in the error path of the probe function and in the remove function. This can be simplified by introducing a device-managed version. Note the changed error-case return type: i2c_new_dummy returns NULL, whilst devm_i2c_new_dummy_device returns an ERR_PTR. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> [wsa: rename new functions and fix minor kdoc issues] Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Reviewed-by: Peter Rosin <peda@axentia.se> Reviewed-by: Kieran Bingham <kieran.bingham+renesas@ideasonboard.com> Reviewed-by: Bartosz Golaszewski <bgolaszewski@baylibre.com> Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
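A minimal sketch of the intended usage in a hypothetical driver's probe path (the "foo" driver and the second address are made up; the assumed shape of the call is devm_i2c_new_dummy_device(dev, adapter, address)):

    #include <linux/i2c.h>
    #include <linux/err.h>

    static int foo_probe(struct i2c_client *client,
                         const struct i2c_device_id *id)
    {
            struct i2c_client *second;

            /* The secondary address is illustrative; real drivers read it
             * from platform data or the device tree. */
            second = devm_i2c_new_dummy_device(&client->dev, client->adapter,
                                               client->addr + 1);
            if (IS_ERR(second))
                    return PTR_ERR(second);    /* ERR_PTR, not NULL, on failure */

            /* No i2c_unregister_device() needed in remove() or error paths:
             * the dummy client goes away with the primary client's device. */
            return 0;
    }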
-
- 17 May 2019, 2 commits
-
-
Committed by Qian Cai
It turned out that DEBUG_SLAB_LEAK is still broken even after recent rescue efforts: when there is a large number of objects like kmemleak_object, which is normal on a debug kernel, # grep kmemleak /proc/slabinfo kmemleak_object 2243606 3436210 ... reading /proc/slab_allocators could easily loop forever while processing the kmemleak_object cache, and any additional freeing or allocating of objects will trigger a reprocessing. To make the situation worse, soft lockups could easily happen in this situation, which will call printk() to allocate more kmemleak objects, guaranteeing an infinite loop. Also, since it seems no one had noticed when it was totally broken more than 2 years ago - see commit fcf88917 ("slab: fix a crash by reading /proc/slab_allocators") - probably nobody cares about it anymore due to the decline of SLAB. Just remove it entirely. Suggested-by: Vlastimil Babka <vbabka@suse.cz> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by David Howells
Wire up the mount API syscalls on non-x86 arches. Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
- 16 May 2019, 5 commits
-
-
Committed by David Howells
Allow kernel services using AF_RXRPC to indicate that a call should be non-interruptible. This allows kafs to make things like lock-extension and writeback data storage calls non-interruptible. If this is set, signals will be ignored for operations on that call where possible - such as waiting to get a call channel on an rxrpc connection. It doesn't prevent UDP sendmsg from being interrupted, but that will be handled by packet retransmission. rxrpc_kernel_recv_data() isn't affected by this since it never waits, preferring instead to return -EAGAIN and leave the waiting to the caller. Userspace-initiated calls can't be set to be uninterruptible at this time. Signed-off-by: David Howells <dhowells@redhat.com>
-
Committed by David Howells
Provide an interface to set the maximum lifespan on a call from inside the kernel without having to call kernel_sendmsg(). Signed-off-by: David Howells <dhowells@redhat.com>
-
Committed by Stephen Boyd
Now that we've gotten rid of clk_readl() we can remove io.h from the clk-provider header and push out the io.h include to any code that isn't already including the io.h header but using things like readl/writel, etc. Found with this grep: git grep -l clk-provider.h | grep '.c$' | xargs git grep -L 'linux/io.h' | \ xargs git grep -l \ -e '\<__iowrite32_copy\>' --or \ -e '\<__ioread32_copy\>' --or \ -e '\<__iowrite64_copy\>' --or \ -e '\<ioremap_page_range\>' --or \ -e '\<ioremap_huge_init\>' --or \ -e '\<arch_ioremap_pud_supported\>' --or \ -e '\<arch_ioremap_pmd_supported\>' --or \ -e '\<devm_ioport_map\>' --or \ -e '\<devm_ioport_unmap\>' --or \ -e '\<IOMEM_ERR_PTR\>' --or \ -e '\<devm_ioremap\>' --or \ -e '\<devm_ioremap_nocache\>' --or \ -e '\<devm_ioremap_wc\>' --or \ -e '\<devm_iounmap\>' --or \ -e '\<devm_ioremap_release\>' --or \ -e '\<devm_memremap\>' --or \ -e '\<devm_memunmap\>' --or \ -e '\<__devm_memremap_pages\>' --or \ -e '\<pci_remap_cfgspace\>' --or \ -e '\<arch_has_dev_port\>' --or \ -e '\<arch_phys_wc_add\>' --or \ -e '\<arch_phys_wc_del\>' --or \ -e '\<memremap\>' --or \ -e '\<memunmap\>' --or \ -e '\<arch_io_reserve_memtype_wc\>' --or \ -e '\<arch_io_free_memtype_wc\>' --or \ -e '\<__io_aw\>' --or \ -e '\<__io_pbw\>' --or \ -e '\<__io_paw\>' --or \ -e '\<__io_pbr\>' --or \ -e '\<__io_par\>' --or \ -e '\<__raw_readb\>' --or \ -e '\<__raw_readw\>' --or \ -e '\<__raw_readl\>' --or \ -e '\<__raw_readq\>' --or \ -e '\<__raw_writeb\>' --or \ -e '\<__raw_writew\>' --or \ -e '\<__raw_writel\>' --or \ -e '\<__raw_writeq\>' --or \ -e '\<readb\>' --or \ -e '\<readw\>' --or \ -e '\<readl\>' --or \ -e '\<readq\>' --or \ -e '\<writeb\>' --or \ -e '\<writew\>' --or \ -e '\<writel\>' --or \ -e '\<writeq\>' --or \ -e '\<readb_relaxed\>' --or \ -e '\<readw_relaxed\>' --or \ -e '\<readl_relaxed\>' --or \ -e '\<readq_relaxed\>' --or \ -e '\<writeb_relaxed\>' --or \ -e '\<writew_relaxed\>' --or \ -e '\<writel_relaxed\>' --or \ -e '\<writeq_relaxed\>' --or \ -e '\<readsb\>' --or \ -e '\<readsw\>' --or \ -e '\<readsl\>' --or \ -e '\<readsq\>' --or \ -e '\<writesb\>' --or \ -e '\<writesw\>' --or \ -e '\<writesl\>' --or \ -e '\<writesq\>' --or \ -e '\<inb\>' --or \ -e '\<inw\>' --or \ -e '\<inl\>' --or \ -e '\<outb\>' --or \ -e '\<outw\>' --or \ -e '\<outl\>' --or \ -e '\<inb_p\>' --or \ -e '\<inw_p\>' --or \ -e '\<inl_p\>' --or \ -e '\<outb_p\>' --or \ -e '\<outw_p\>' --or \ -e '\<outl_p\>' --or \ -e '\<insb\>' --or \ -e '\<insw\>' --or \ -e '\<insl\>' --or \ -e '\<outsb\>' --or \ -e '\<outsw\>' --or \ -e '\<outsl\>' --or \ -e '\<insb_p\>' --or \ -e '\<insw_p\>' --or \ -e '\<insl_p\>' --or \ -e '\<outsb_p\>' --or \ -e '\<outsw_p\>' --or \ -e '\<outsl_p\>' --or \ -e '\<ioread8\>' --or \ -e '\<ioread16\>' --or \ -e '\<ioread32\>' --or \ -e '\<ioread64\>' --or \ -e '\<iowrite8\>' --or \ -e '\<iowrite16\>' --or \ -e '\<iowrite32\>' --or \ -e '\<iowrite64\>' --or \ -e '\<ioread16be\>' --or \ -e '\<ioread32be\>' --or \ -e '\<ioread64be\>' --or \ -e '\<iowrite16be\>' --or \ -e '\<iowrite32be\>' --or \ -e '\<iowrite64be\>' --or \ -e '\<ioread8_rep\>' --or \ -e '\<ioread16_rep\>' --or \ -e '\<ioread32_rep\>' --or \ -e '\<ioread64_rep\>' --or \ -e '\<iowrite8_rep\>' --or \ -e '\<iowrite16_rep\>' --or \ -e '\<iowrite32_rep\>' --or \ -e '\<iowrite64_rep\>' --or \ -e '\<__io_virt\>' --or \ -e '\<pci_iounmap\>' --or \ -e '\<virt_to_phys\>' --or \ -e '\<phys_to_virt\>' --or \ -e '\<ioremap_uc\>' --or \ -e '\<ioremap\>' --or \ -e '\<__ioremap\>' --or \ -e '\<iounmap\>' --or \ -e '\<ioremap\>' --or \ -e 
'\<ioremap_nocache\>' --or \ -e '\<ioremap_uc\>' --or \ -e '\<ioremap_wc\>' --or \ -e '\<ioremap_wc\>' --or \ -e '\<ioremap_wt\>' --or \ -e '\<ioport_map\>' --or \ -e '\<ioport_unmap\>' --or \ -e '\<ioport_map\>' --or \ -e '\<ioport_unmap\>' --or \ -e '\<xlate_dev_kmem_ptr\>' --or \ -e '\<xlate_dev_mem_ptr\>' --or \ -e '\<unxlate_dev_mem_ptr\>' --or \ -e '\<virt_to_bus\>' --or \ -e '\<bus_to_virt\>' --or \ -e '\<memset_io\>' --or \ -e '\<memcpy_fromio\>' --or \ -e '\<memcpy_toio\>' I also reordered a couple includes when they weren't alphabetical and removed clk.h from kona, replacing it with clk-provider.h because that driver doesn't use clk consumer APIs. Acked-by: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Chen-Yu Tsai <wens@csie.org> Acked-by: Maxime Ripard <maxime.ripard@bootlin.com> Acked-by: Tero Kristo <t-kristo@ti.com> Acked-by: Sekhar Nori <nsekhar@ti.com> Cc: Krzysztof Kozlowski <krzk@kernel.org> Acked-by: Mark Brown <broonie@kernel.org> Cc: Chris Zankel <chris@zankel.net> Acked-by: Max Filippov <jcmvbkbc@gmail.com> Acked-by: John Crispin <john@phrozen.org> Acked-by: Heiko Stuebner <heiko@sntech.de> Signed-off-by: Stephen Boyd <sboyd@kernel.org>
-
Committed by David Howells
Add wait_var_event_interruptible() to allow interruptible waits for events. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
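A small sketch of how a caller might use the new helper (the reference-count variable and its owner are hypothetical):

    #include <linux/atomic.h>
    #include <linux/wait_bit.h>

    /* Sleep until a shared counter drops to zero, but let a signal interrupt
     * the wait: returns 0 on success, -ERESTARTSYS if a signal arrived. */
    static int wait_for_users_to_drop(atomic_t *nr_users)
    {
            return wait_var_event_interruptible(nr_users,
                                                atomic_read(nr_users) == 0);
    }

    /* The waking side pairs with it in the usual wait_var_event() way:
     *      if (atomic_dec_and_test(&obj->nr_users))
     *              wake_up_var(&obj->nr_users);
     */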
-
Committed by David Howells
Allow used DNS resolver keys to be invalidated after use if the caller is doing its own caching of the results. This reduces the amount of resources required. Fix AFS to invalidate DNS results, killing off permanent-failure records that get lodged in the resolver keyring and prevent future lookups from happening. Fixes: 0a5143f2 ("afs: Implement VL server rotation") Signed-off-by: David Howells <dhowells@redhat.com>
-
- 15 May 2019, 6 commits
-
-
Committed by Johannes Weiner
Right now, when somebody needs to know the recursive memory statistics and events of a cgroup subtree, they need to walk the entire subtree and sum up the counters manually. There are two issues with this: 1. When a cgroup gets deleted, its stats are lost. The state counters should all be 0 at that point, of course, but the events are not. When this happens, the event counters, which are supposed to be monotonic, can go backwards in the parent cgroups. 2. During regular operation, we always have a certain number of lazily freed cgroups sitting around that have been deleted, have no tasks, but have a few cache pages remaining. These groups' statistics do not change until we eventually hit memory pressure, but somebody watching, say, memory.stat on an ancestor has to iterate over them every time. This patch addresses both issues by introducing recursive counters at each level that are propagated from the write side when stats change. Upward propagation happens when the per-cpu caches spill over into the local atomic counter. This is the same thing we do during charge and uncharge, except that the latter uses atomic RMWs, which are more expensive; stat changes happen at around the same rate. In a sparse file test (page faults and reclaim at maximum CPU speed) with 5 cgroup nesting levels, perf shows __mod_memcg_page_state at ~1%. Link: http://lkml.kernel.org/r/20190412151507.2769-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
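Conceptually, the write-side propagation amounts to the following sketch (illustrative plain C, not the memcontrol code): small updates accumulate in a cheap per-cpu style delta, and once the delta crosses a threshold it is flushed into the atomic counter of the group and of every ancestor, so readers get recursive totals in O(1):

    #include <stdatomic.h>

    #define FLUSH_THRESHOLD 64

    struct group {
            struct group *parent;
            _Atomic long stat;       /* recursive total, cheap to read */
            long pcpu_delta;         /* stand-in for the per-cpu cache */
    };

    /* Write side: batch small updates, then push the batch up the hierarchy. */
    static void mod_stat(struct group *g, long val)
    {
            g->pcpu_delta += val;
            if (g->pcpu_delta >= FLUSH_THRESHOLD ||
                g->pcpu_delta <= -FLUSH_THRESHOLD) {
                    long delta = g->pcpu_delta;

                    g->pcpu_delta = 0;
                    for (struct group *p = g; p; p = p->parent)
                            atomic_fetch_add_explicit(&p->stat, delta,
                                                      memory_order_relaxed);
            }
    }

    /* Read side: the recursive value is just the local atomic counter. */
    static long read_stat(struct group *g)
    {
            return atomic_load_explicit(&g->stat, memory_order_relaxed);
    }

The batching is what keeps the write-side cost close to the existing per-cpu scheme while removing the on-demand subtree walk from the read side.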
-
Committed by Johannes Weiner
These are getting too big to be inlined into every callsite. They were stolen from vmstat.c, which already out-of-lines them, and they have only been growing since. The callsites aren't that hot, either. Move __mod_memcg_state(), __mod_lruvec_state() and __count_memcg_events() out of line and add kerneldoc comments. Link: http://lkml.kernel.org/r/20190412151507.2769-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Johannes Weiner
Patch series "mm: memcontrol: memory.stat cost & correctness". The cgroup memory.stat file holds recursive statistics for the entire subtree. The current implementation does this tree walk on demand whenever the file is read. This is giving us problems in production. 1. The cost of aggregating the statistics on demand is high. A lot of system-service cgroups are mostly idle and their stats don't change between reads, yet we always have to check them. There are also always some lazily dying cgroups sitting around that are pinned by a handful of remaining page cache; the same applies to them. In an application that periodically monitors memory.stat in our fleet, we have seen the aggregation consume up to 5% CPU time. 2. When cgroups die and disappear from the cgroup tree, so do their accumulated vm events. The result is that the event counters at higher-level cgroups can go backwards and confuse some of our automation, let alone people looking at the graphs over time. To address both issues, this patch series changes the stat implementation to spill counts upwards when the counters change. The upward spilling is batched using the existing per-cpu cache. In a sparse-file stress test with 5-level cgroup nesting, the additional cost of the flushing was negligible (a little under 1% of CPU at 100% CPU utilization, compared to the 5% of reading memory.stat during regular operation). This patch (of 4): memcg_page_state(), lruvec_page_state() and memcg_sum_events() currently return the state of the local memcg or lruvec, not the recursive state. In practice there is demand for both versions, although the callers that want the recursive counts currently sum them up by hand. By default, cgroups are considered recursive entities, and generally we expect more users of the recursive counters, with the local counts being special cases. To reflect that in the name, add a _local suffix to the current implementations. The following patch will re-incarnate these functions with recursive semantics, but with an O(1) implementation. [hannes@cmpxchg.org: fix bisection hole] Link: http://lkml.kernel.org/r/20190417160347.GC23013@cmpxchg.org Link: http://lkml.kernel.org/r/20190412151507.2769-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Chris Down
I spent literally an hour trying to work out why an earlier version of my memory.events aggregation code didn't work properly, only to find out I was calling memcg->events instead of memcg->memory_events, which is fairly confusing. This naming seems in need of reworking, so make it harder to do the wrong thing by using vmevents instead of events, which makes it clearer that these are vm counters rather than memcg-specific counters. There are also a few other inconsistent names in both the percpu and aggregated structs, so these are all cleaned up to be more coherent and easy to understand. This commit contains code cleanup only: there are no logic changes. [akpm@linux-foundation.org: fix it for preceding changes] Link: http://lkml.kernel.org/r/20190208224319.GA23801@chrisdown.name Signed-off-by: Chris Down <chris@chrisdown.name> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Dennis Zhou <dennis@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Masahiro Yamada
Now that all instances of #include <asm/sizes.h> have been replaced with #include <linux/sizes.h>, we can remove these. Link: http://lkml.kernel.org/r/1553267665-27228-2-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Andrei Vagin
This file uses "task" 85 times and "tsk" 25 times. It is better to be consistent. Link: http://lkml.kernel.org/r/20181129180547.15976-1-avagin@gmail.com Signed-off-by: Andrei Vagin <avagin@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-