- 18 August 2018, 11 commits
-
-
Committed by Shakeel Butt
Patch series "Directed kmem charging", v8.

The Linux kernel's memory cgroup allows limiting the memory usage of the jobs running on the system to provide isolation between the jobs. All the kernel memory allocated in the context of the job and marked with __GFP_ACCOUNT will also be included in the memory usage and be limited by the job's limit. The kernel memory can only be charged to the memcg of the process in whose context the kernel memory was allocated. However, there are cases where the allocated kernel memory should be charged to a memcg different from the current process's memcg. This patch series contains two such concrete use-cases, i.e. fsnotify and buffer_head. The fsnotify event objects can consume a lot of system memory for large or unlimited queues if there is either no or a slow listener. The events are allocated in the context of the event producer; however, they should be charged to the event consumer. Similarly, the buffer_head objects can be allocated in a memcg different from the memcg of the page for which buffer_head objects are being allocated.

To solve this issue, this patch series introduces a mechanism to charge kernel memory to a given memcg. For fsnotify events, the memcg of the consumer can be used for charging, and for buffer_head, the memcg of the page can be used. For directed charging, the caller can use the scope API memalloc_[un]use_memcg() to specify the memcg to charge for all the __GFP_ACCOUNT allocations within the scope.

This patch (of 2):

A lot of memory can be consumed by the events generated for the huge or unlimited queues if there is either no or a slow listener. This can cause system-level memory pressure or OOMs. So, it's better to account the fsnotify kmem caches to the memcg of the listener. However, the listener can be in a different memcg than the memcg of the producer, and these allocations happen in the context of the event producer. This patch introduces a remote memcg charging API which the producer can use to charge the allocations to the memcg of the listener.

There are seven fsnotify kmem caches, and among them allocations from dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and inotify_inode_mark_cachep happen in the context of a syscall from the listener, so SLAB_ACCOUNT is enough for these caches. The objects from fsnotify_mark_connector_cachep are not accounted as they are small compared to the notification mark or events, and it is unclear whom to account the connector to since it is shared by all events attached to the inode.

The allocations from the event caches happen in the context of the event producer. For such caches we will need to remote-charge the allocations to the listener's memcg. Thus we save the memcg reference in the fsnotify_group structure of the listener. This patch has also moved the members of fsnotify_group to keep the size the same, at least for a 64-bit build, even with the additional member, by filling the holes.
[shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it] Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Amir Goldstein <amir73il@gmail.com> Cc: Greg Thelen <gthelen@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
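For context, a minimal sketch of how the directed-charging scope API described above would be used by an event producer. memalloc_use_memcg()/memalloc_unuse_memcg() and GFP_KERNEL_ACCOUNT are the pieces named in the series; the helper name, the cache pointer and the exact memcg field access are illustrative assumptions, not the fanotify code itself.

```c
/*
 * Illustrative sketch: charge the event allocation to the listener's
 * memcg (saved in the fsnotify_group) instead of the producer's memcg.
 */
static struct fsnotify_event *alloc_event_for(struct fsnotify_group *group)
{
	struct fsnotify_event *event;

	memalloc_use_memcg(group->memcg);	/* scope: charge the listener */
	event = kmem_cache_alloc(event_cachep, GFP_KERNEL_ACCOUNT);
	memalloc_unuse_memcg();			/* back to current's memcg */

	return event;
}
```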
-
Committed by Roman Gushchin
Introduce the mem_cgroup_put() helper, which helps to eliminate guarding memcg css release with "#ifdef CONFIG_MEMCG" in multiple places. Link: http://lkml.kernel.org/r/20180623000600.5818-2-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
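A sketch of the shape such a helper takes, assuming the usual pattern of a real body under CONFIG_MEMCG and a no-op stub otherwise (the exact definition in memcontrol.h may differ in detail):

```c
#ifdef CONFIG_MEMCG
static inline void mem_cgroup_put(struct mem_cgroup *memcg)
{
	if (memcg)
		css_put(&memcg->css);	/* drop the css reference */
}
#else
static inline void mem_cgroup_put(struct mem_cgroup *memcg)
{
	/* no-op: callers need no #ifdef CONFIG_MEMCG guard */
}
#endif
```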
-
Committed by Luis R. Rodriguez
Some architectures just don't have PAGE_KERNEL_EXEC. The mm/nommu.c and mm/vmalloc.c code have been using PAGE_KERNEL as a fallback for years. Move this fallback to asm-generic. Link: http://lkml.kernel.org/r/20180510185507.2439-3-mcgrof@kernel.org Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Suggested-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Luis R. Rodriguez
Some architectures do not define certain PAGE_KERNEL_* flags. This is either because: a) the way to implement some of these flags is *not yet ported*, or b) the architecture *has no way* to describe them. Over time we have accumulated a few PAGE_KERNEL_* fallback workarounds for architectures in the kernel which do not define them, using *relatively safe* equivalents. Move these scattered fallback hacks into asm-generic. We start off with PAGE_KERNEL_RO, using PAGE_KERNEL as a fallback. This has been in place in the firmware loader for years. Move the fallback into the respective asm-generic header. Link: http://lkml.kernel.org/r/20180510185507.2439-2-mcgrof@kernel.org Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
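The shape of the asm-generic fallbacks described by this patch and the previous one, as a sketch (the exact header location aside): architectures that define their own values are untouched, everyone else falls back to PAGE_KERNEL, which every architecture provides.

```c
/* Fall back to PAGE_KERNEL when the architecture has nothing better. */
#ifndef PAGE_KERNEL_RO
# define PAGE_KERNEL_RO   PAGE_KERNEL
#endif

#ifndef PAGE_KERNEL_EXEC
# define PAGE_KERNEL_EXEC PAGE_KERNEL
#endif
```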
-
Committed by Oscar Salvador
link_mem_sections() and walk_memory_range() share most of the code, so we can convert link_mem_sections() into a dummy function that calls walk_memory_range() with a callback to register_mem_sect_under_node(). This patch converts register_mem_sect_under_node() to match walk_memory_range's callback signature, getting rid of the check_nid argument and instead checking whether the system is still booting, since we only have to check for the nid if the system is in that state. Link: http://lkml.kernel.org/r/20180622111839.10071-4-osalvador@techadventures.net Signed-off-by: Oscar Salvador <osalvador@suse.de> Suggested-by: Pavel Tatashin <pasha.tatashin@oracle.com> Tested-by: Reza Arbab <arbab@linux.vnet.ibm.com> Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Huang Ying
Huge pages help to reduce the TLB miss rate, but they have a higher cache footprint, and sometimes this may cause issues. For example, when copying a huge page on an x86_64 platform, the cache footprint is 4M. But on a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only 45M of LLC (last level cache). That is, on average, there is 2.5M of LLC for each core and 1.25M of LLC for each thread. If the cache contention is heavy when copying the huge page, and we copy the huge page from beginning to end, it is possible that the beginning of the huge page is evicted from the cache after we finish copying the end of the huge page. And it is possible for the application to access the beginning of the huge page after the copy.

In c79b57e4 ("mm: hugetlb: clear target sub-page last when clearing huge page"), to keep the cache lines of the target subpage hot, the order in which clear_huge_page() clears the subpages of the huge page was changed to clear the subpage which is furthest from the target subpage first, and the target subpage last. A similar ordering change helps huge page copying too; that is what this patch implements. Because we have put the order algorithm into a separate function, the implementation is quite simple. The patch is a generic optimization which should benefit quite a few workloads, not a specific use case.

To demonstrate the performance benefit of the patch, we tested it with vm-scalability run on transparent huge pages. With this patch, the throughput increases ~16.6% in the vm-scalability anon-cow-seq test case with 36 processes on a 2-socket Xeon E5 v3 2699 system (36 cores, 72 threads). The test case sets /sys/kernel/mm/transparent_hugepage/enabled to always, mmap()s a big anonymous memory area and populates it, then forks 36 child processes, each of which writes to the anonymous memory area from beginning to end, causing copy-on-write. For each child process, the other child processes can be seen as other workloads which generate heavy cache pressure. At the same time, the IPC (instructions per cycle) increased from 0.63 to 0.78, and the time spent in user space was reduced ~7.2%. Link: http://lkml.kernel.org/r/20180524005851.4079-3-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shaohua Li <shli@fb.com> Cc: Christopher Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
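To make the access-order idea concrete, here is a small user-space sketch, not the kernel's process_huge_page() code; the 4K sub-page size and the greedy "furthest remaining end first" walk are assumptions chosen to illustrate the principle: copy sub-pages starting from whichever remaining end is furthest from the target, and copy the target sub-page last so its cache lines are the most recently touched.

```c
#include <string.h>

#define SUBPAGE_SIZE 4096UL	/* assumed base page size */

static void copy_huge_target_last(void *dst, const void *src,
				  unsigned long nr_subpages,
				  unsigned long target)
{
	unsigned long l = 0, r = nr_subpages - 1;

	while (l < target || r > target) {
		/* Pick whichever remaining end is further from target. */
		unsigned long i = (target - l >= r - target) ? l++ : r--;

		memcpy((char *)dst + i * SUBPAGE_SIZE,
		       (const char *)src + i * SUBPAGE_SIZE, SUBPAGE_SIZE);
	}

	/* Target sub-page last: its cache lines stay hot for the user. */
	memcpy((char *)dst + target * SUBPAGE_SIZE,
	       (const char *)src + target * SUBPAGE_SIZE, SUBPAGE_SIZE);
}
```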
-
Committed by Jens Axboe
a_ops->readpages() is only ever used for read-ahead, yet we don't flag the IO being submitted as such. Fix that up. Any file system that uses mpage_readpages() as its ->readpages() implementation will now get this right. Since we're passing in whether the IO is read-ahead or not, we don't need to pass in the 'gfp' separately, as it is dependent on the IO being read-ahead. Kill off that member. Add some documentation notes on ->readpages() being purely for read-ahead. Link: http://lkml.kernel.org/r/20180621010725.17813-3-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <clm@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Kirill A. Shutemov
lookup_page_ext() finds 'struct page_ext' for a given page. It requires only read access to the given struct page. The current implementation takes 'struct page *' as an argument, which makes the compiler complain when a 'const struct page *' is passed. Change the argument to 'const struct page *'. Link: http://lkml.kernel.org/r/20180531135457.20167-3-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Kirill A. Shutemov
After commit bd33ef36 ("mm: enable page poisoning early at boot"), PAGE_EXT_DEBUG_POISON is no longer used. Remove it. Link: http://lkml.kernel.org/r/20180531135457.20167-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by NeilBrown
This flag was introduced in 2.1.37pre1 and the only place it was tested was removed in 2.1.43pre1. The flag was never set. Let's discard it properly. Link: http://lkml.kernel.org/r/877en0hewz.fsf@notabene.neil.brown.name Signed-off-by: NeilBrown <neilb@suse.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Arnd Bergmann
Passing an enum into FIELD_GET() produces a long but harmless warning on newer compilers:

 from include/linux/linkage.h:7,
 from include/linux/kernel.h:7,
 from include/linux/skbuff.h:17,
 from include/linux/if_ether.h:23,
 from include/linux/etherdevice.h:25,
 from drivers/net/wireless/intel/iwlwifi/mvm/rxmq.c:63:
 drivers/net/wireless/intel/iwlwifi/mvm/rxmq.c: In function 'iwl_mvm_rx_mpdu_mq':
 include/linux/bitfield.h:56:20: error: enum constant in boolean context [-Werror=int-in-bool-context]
   BUILD_BUG_ON_MSG(!(_mask), _pfx "mask is zero"); \
                    ^
 ...
 include/linux/bitfield.h:103:3: note: in expansion of macro '__BF_FIELD_CHECK'
   __BF_FIELD_CHECK(_mask, _reg, 0U, "FIELD_GET: "); \
   ^~~~~~~~~~~~~~~~
 drivers/net/wireless/intel/iwlwifi/mvm/rxmq.c:1025:21: note: in expansion of macro 'FIELD_GET'
   le16_encode_bits(FIELD_GET(IWL_RX_HE_PHY_SIBG_SYM_OR_USER_NUM_MASK,

The problem here is that the caller has no idea how the macro gets expanded, leading to a false positive. It can be trivially avoided by doing a comparison against zero. This only recently started appearing as the iwlwifi driver was patched to use FIELD_GET. Link: http://lkml.kernel.org/r/20180813220950.194841-1-arnd@arndb.de Fixes: 514c3069 ("iwlwifi: add support for IEEE802.11ax") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Jakub Kicinski <jakub.kicinski@netronome.com> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: David Laight <David.Laight@ACULAB.COM> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
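The general pattern behind the fix, as an illustrative sketch rather than the exact bitfield.h change: in a macro that may receive an enum constant, compare the value against zero explicitly instead of using it directly in a boolean context, which is what triggers -Wint-in-bool-context.

```c
#include <linux/build_bug.h>

/* Warns with gcc-8 when _mask is an enum constant: '!' is a boolean
 * context. (Illustrative macro names, not the real bitfield.h ones.) */
#define MASK_CHECK_OLD(_mask) \
	BUILD_BUG_ON_MSG(!(_mask), "mask is zero")

/* Quiet: an explicit comparison is not a boolean context. */
#define MASK_CHECK_NEW(_mask) \
	BUILD_BUG_ON_MSG((_mask) == 0, "mask is zero")
```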
-
- 17 August 2018, 1 commit
-
-
Committed by Yannik Sembritzki
Replace the magic number used to indicate that verify_*_signature() should use the secondary keyring with a proper symbol. Signed-off-by: Yannik Sembritzki <yannik@sembritzki.me> Signed-off-by: David Howells <dhowells@redhat.com> Cc: keyrings@vger.kernel.org Cc: linux-security-module@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 15 August 2018, 3 commits
-
-
Committed by Jakub Kicinski
Like the NFP4000 and NFP6000, the NFP5000 has an erratum where reading/writing to PCI config space addresses above 0x600 can cause the NFP to generate PCIe completion timeouts. Limit the NFP5000's PF's config space size to 0x600 bytes, as is already done for the NFP4000 and NFP6000. The NFP5000's VF is 0x6003 (PCI_DEVICE_ID_NETRONOME_NFP6000_VF), the same device ID as the NFP6000's VF. Thus, its config space is already limited by the existing use of quirk_nfp6000(). Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Tony Egan <tony.egan@netronome.com>
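A sketch of how such a config-space quirk is typically wired up; quirk_nfp6000() and DECLARE_PCI_FIXUP_EARLY() are the existing mechanisms the commit refers to, while the NFP5000 PF device ID macro name is an assumption here.

```c
/* Limit PF config space to 0x600 bytes so accesses above that offset
 * cannot trigger PCIe completion timeouts on the NFP. */
static void quirk_nfp6000(struct pci_dev *dev)
{
	dev->cfg_size = 0x600;
}
DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_NETRONOME,
			PCI_DEVICE_ID_NETRONOME_NFP5000,	/* name assumed */
			quirk_nfp6000);
```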
-
Committed by Andy Shevchenko
There are a lot of places in the kernel where PCI_VDEVICE() is used, and it still looks inconvenient due to the additional driver_data field that has to be attached. Introduce a PCI_DEVICE_DATA() macro to fully describe a device ID entry in the shortest possible form. For example, before:

 { PCI_VDEVICE(INTEL, PCI_DEVICE_ID_INTEL_MRFLD), (kernel_ulong_t) &dwc3_pci_mrfld_properties, },

after:

 { PCI_DEVICE_DATA(INTEL, MRFLD, &dwc3_pci_mrfld_properties) },

Drivers can be converted later on independently. While here, remove the unused macro with the same name from the Ralink wireless driver. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Kalle Valo <kvalo@codeaurora.org> # for rt2x00
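For reference, a sketch of what such a macro expands to; it mirrors the intent described above, though the exact definition in pci.h may differ slightly.

```c
/* Describe a vendor/device ID pair plus driver_data in one macro. */
#define PCI_DEVICE_DATA(vend, dev, data)				\
	.vendor = PCI_VENDOR_ID_##vend,					\
	.device = PCI_DEVICE_ID_##vend##_##dev,				\
	.subvendor = PCI_ANY_ID, .subdevice = PCI_ANY_ID,		\
	.driver_data = (kernel_ulong_t)(data)
```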
-
Committed by Nick Desaulniers
As part of the effort to reduce the code duplication between _THIS_IP_ and current_text_addr(), let's consolidate callers of current_text_addr() to use _THIS_IP_. Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 14 August 2018, 3 commits
-
-
Committed by Eli Cohen
The last argument to mlx5_add_flow_rules passes the number of destinations in the struct pointed to by the dest arg. Change the name to better reflect this fact. Signed-off-by: Eli Cohen <eli@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
Committed by Zong Li
There is an unaligned access to the structure 'trace_event_raw_fib_table_lookup'. In include/trace/events/fib.h, there is a memory operation which casts the 'src' data member to a pointer and then stores a value through it:

 p32 = (__be32 *) __entry->src;
 *p32 = flp->saddr;

The offset of 'src' in struct trace_event_raw_fib_table_lookup is not four-byte aligned. Some architectures don't permit unaligned accesses, and they have to pay the price of handling this situation in the exception handler. Adjust the layout of the structure to avoid this case. Fixes: 9f323973 ("net/ipv4: Udate fib_table_lookup tracepoint") Signed-off-by: Zong Li <zong@andestech.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Willem de Bruijn
The kbuild robot shows a build failure on machines without CONFIG_SMP:

 drivers/net/virtio_net.c:1916:10: error: implicit declaration of function 'cpumask_next_wrap'

cpumask_next_wrap is exported from lib/cpumask.o, which is only built for SMP (lib-$(CONFIG_SMP) += cpumask.o). As with other such functions, also define it as static inline in the NR_CPUS==1 branch of include/linux/cpumask.h. If wrap is true and next == start, return nr_cpumask_bits, or 1. Else wrap across the range of valid cpus, here [0]. Fixes: 2ca653d6 ("virtio_net: Stripe queue affinities across cores.") Signed-off-by: Willem de Bruijn <willemb@google.com> Tested-by: Krzysztof Kozlowski <krzk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
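A sketch of an NR_CPUS==1 stub following the semantics spelled out above: with only cpu 0, "next" always wraps back onto "start", so the result depends only on "wrap". This is an illustration of the described logic, not necessarily the exact header text.

```c
static inline int cpumask_next_wrap(int n, const struct cpumask *mask,
				    int start, bool wrap)
{
	/* Only cpu 0 exists: wrapping back onto start means we are done
	 * (nr_cpumask_bits == 1); otherwise the only valid cpu is 0. */
	return wrap ? 1 : 0;
}
```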
-
- 13 August 2018, 5 commits
-
-
Committed by Virgile Jarry
Preventing the kernel from responding to ICMP Echo Request messages can be useful in several ways. The sysctl parameter 'icmp_echo_ignore_all' can be used to prevent the kernel from responding to IPv4 ICMP echo requests. For IPv6 pings, no such sysctl kernel parameter existed. Add the ability to prevent the kernel from responding to IPv6 ICMP echo requests through the use of the following sysctl parameter: /proc/sys/net/ipv6/icmp/echo_ignore_all. Update the documentation to reflect this change. Signed-off-by: Virgile Jarry <virgile@acceis.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
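Kernel-side, the check amounts to something like the following sketch in the ICMPv6 receive path; the netns sysctl field name is assumed to mirror the IPv4 one, and the surrounding dispatch code is simplified.

```c
/* Sketch of the receive-path decision for an ICMPv6 Echo Request. */
static void icmpv6_handle_echo_request(struct net *net, struct sk_buff *skb)
{
	/* Field name assumed to mirror the IPv4 sysctl. */
	if (net->ipv6.sysctl.icmpv6_echo_ignore_all)
		return;		/* ignore: no Echo Reply is sent */

	icmpv6_echo_reply(skb);
}
```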
-
Committed by Vakul Garg
To prepare a decryption request, several memory chunks are required (aead_req, sgin, sgout, iv, aad). For submitting the decrypt request to an accelerator, it is required that the buffers which are read by the accelerator be dma-able and not come from the stack. The buffers for aad and iv could each be kmalloced separately, but that is inefficient. This patch does a combined allocation for preparing the decryption request and then segments it into aead_req || sgin || sgout || iv || aad. Signed-off-by: Vakul Garg <vakul.garg@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Andrey Ignatov
== Problem description ==

It's useful to be able to identify the cgroup associated with an skb in TC so that a policy can be applied to it, and the existing bpf_skb_cgroup_id helper can help with this. Though in real life the cgroup hierarchy and the hierarchy to which a policy should be applied don't map 1:1. It's often the case that there is a container and a corresponding cgroup, but there are many more sub-cgroups inside the container, e.g. because control of resources for its subsystems is delegated to the containerized application, or to separate the application inside the container from infra that belongs to the containerization system (e.g. sshd). At the same time it may be useful to apply a policy to the container as a whole. If multiple containers like this are run on a host (which is often the case) and many of them have sub-cgroups, it may not be possible to apply a per-container policy in TC with existing helpers such as bpf_skb_under_cgroup or bpf_skb_cgroup_id:

* bpf_skb_cgroup_id will return the id of the immediate cgroup associated with the skb, i.e. if it's a sub-cgroup inside a container, it can't be used to identify the container's cgroup;
* bpf_skb_under_cgroup can work only with one cgroup and doesn't scale, i.e. if there are N containers on a host and a policy has to be applied to M of them (0 <= M <= N), it'd require M calls to bpf_skb_under_cgroup, and, if M changes, it'd require rebuilding & loading a new BPF program.

== Solution ==

The patch introduces a new helper bpf_skb_ancestor_cgroup_id that can be used to get the id of the cgroup v2 that is an ancestor of the cgroup associated with the skb at a specified level of the cgroup hierarchy. That way an admin can place all containers on one level of the cgroup hierarchy (which is a good practice in general and already used in many configurations) and identify the specific cgroup on this level no matter what sub-cgroup the skb is associated with. E.g. if there is a cgroup hierarchy:

 root/
 root/container1/
 root/container1/app11/
 root/container1/app11/sub-app-a/
 root/container1/app12/
 root/container2/
 root/container2/app21/
 root/container2/app22/
 root/container2/app22/sub-app-b/

then, having an skb associated with root/container1/app11/sub-app-a/, it's possible to get the ancestor at level 1, which is container1, and apply a policy for this container, or apply another policy if it's container2. Policies can be kept e.g. in a hash map where the key is a container cgroup id and the value is an action. The levels where container cgroups are created are usually known in advance, whereas the cgroup hierarchy inside a container may be hard to predict, especially when its creation is delegated to the containerized application.

== Implementation details ==

The helper gets the ancestor by walking parents up to the specified level. Another option would be to get a different kind of "id" from cgroup->ancestor_ids[level] and use it with idr_find() to get the struct cgroup for the ancestor. But that would require a radix lookup, which doesn't seem to be better (at least it's not obviously better). The format of the return value of the new helper is the same as that of bpf_skb_cgroup_id. Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
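A hypothetical tc BPF program using the new helper along the lines described above. The map name, the chosen ancestor level and the verdict values are illustrative, and the map definition uses current libbpf BTF-style syntax rather than the 2018-era one.

```c
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, __u64);	/* container-level cgroup id */
	__type(value, __u32);	/* TC_ACT_* verdict for that container */
} container_policies SEC(".maps");

SEC("tc")
int container_policy(struct __sk_buff *skb)
{
	/* Level 1 = the per-container cgroup in the layout shown above. */
	__u64 cgid = bpf_skb_ancestor_cgroup_id(skb, 1);
	__u32 *verdict = bpf_map_lookup_elem(&container_policies, &cgid);

	return verdict ? *verdict : TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";
```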
-
Committed by Daniel Borkmann
Commit a26ca7c9 ("bpf: btf: Add pretty print support to the basic arraymap") and 699c86d6 ("bpf: btf: add pretty print for hash/lru_hash maps") enabled support for BTF and dumping via BPF fs for array and hash/lru maps. However, the two can be decoupled from each other such that regular BPF maps can be supported for attaching BTF key/value information, while not all maps necessarily need to dump via the map_seq_show_elem() callback. The basic sanity check which is a prerequisite for all maps is that key/value size has to match in any case, and some maps can have extra checks via the map_check_btf() callback, e.g. probing certain types or indicating no support in general. With that we can also enable retrieving BTF info for per-cpu map types and lpm. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com>
-
Committed by Linus Torvalds
This is purely a preparatory patch for upcoming changes during the 4.19 merge window. We have a function called "boot_cpu_state_init()" that isn't really about the bootup cpu state: that is done much earlier by the similarly named "boot_cpu_init()" (note lack of "state" in name). This function initializes some hotplug CPU state, and needs to run after the percpu data has been properly initialized. It even has a comment to that effect. Except it _doesn't_ actually run after the percpu data has been properly initialized. On x86 it happens to do that, but on at least arm and arm64, the percpu base pointers are initialized by the arch-specific 'smp_prepare_boot_cpu()' hook, which ran _after_ boot_cpu_state_init(). This had some unexpected results, and in particular we have a patch pending for the merge window that did the obvious cleanup of using 'this_cpu_write()' in the cpu hotplug init code:

 - per_cpu_ptr(&cpuhp_state, smp_processor_id())->state = CPUHP_ONLINE;
 + this_cpu_write(cpuhp_state.state, CPUHP_ONLINE);

which is obviously the right thing to do. Except because of the ordering issue, it actually failed miserably and unexpectedly on arm64. So this just fixes the ordering, and changes the name of the function to be 'boot_cpu_hotplug_init()' to make it obvious that it's about cpu hotplug state, because the core CPU state was supposed to have already been done earlier. Marked for stable, since the (not yet merged) patch that will show this problem is marked for stable. Reported-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Mian Yousaf Kaukab <yousaf.kaukab@suse.com> Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
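A simplified ordering sketch of start_kernel() after this change; unrelated calls are elided, and the exact placement should be treated as an assumption derived from the description above rather than the verbatim init/main.c code.

```c
asmlinkage __visible void __init start_kernel(void)
{
	boot_cpu_init();		/* early boot-CPU bring-up          */
	/* ... */
	setup_arch(&command_line);
	/* ... */
	setup_per_cpu_areas();		/* per-cpu bases valid on x86 here  */
	smp_prepare_boot_cpu();		/* arm/arm64 set per-cpu bases here */
	boot_cpu_hotplug_init();	/* now this_cpu_write() is safe     */
	/* ... */
}
```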
-
- 12 August 2018, 9 commits
-
-
Committed by Peter Oskolkov
This patch introduces several helper functions/macros that will be used in the follow-up patch. No runtime changes yet. The new logic (fully implemented in the second patch) is as follows:

* Nodes in the rb-tree will now contain not single fragments, but lists of consecutive fragments ("runs").
* At each point in time, the current "active" run at the tail is maintained/tracked. Fragments that arrive in-order, adjacent to the previous tail fragment, are added to this tail run without triggering the re-balancing of the rb-tree.
* If a fragment arrives out of order with the offset _before_ the tail run, it is inserted into the rb-tree as a single fragment.
* If a fragment arrives after the current tail fragment (with a gap), it starts a new "tail" run, and is inserted into the rb-tree at the end as the head of the new run.

skb->cb is used to store additional information needed here (suggested by Eric Dumazet). Reported-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Peter Oskolkov <posk@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Bart Van Assche
For legacy queues the only call of blkg_root_lookup() happens after bypass mode has been enabled. Since blkg_lookup() returns NULL for queues in bypass mode, modify blkg_root_lookup() such that it no longer depends on bypass mode. Rename the function into blk_queue_root_blkg() as suggested by Tejun. Suggested-by: Tejun Heo <tj@kernel.org> Fixes: 6bad9b21 ("blkcg: Introduce blkg_root_lookup()") Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Vlad Buslov
Extend gen_new_estimator() to also take stats_lock when re-assigning rate estimator statistics pointer. (to be used by unlocked actions) Rename 'stats_lock' to 'lock' and change argument description to explain that it is now also used for control path. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Vlad Buslov
As a preparation for removing the dependency on the rtnl lock from the rules update path, all users of shared objects must take a reference while working with them. Extend action ops with a put_dev() API to be used on the net device returned by get_dev(). Modify the mirred action (the only action that implements the get_dev callback):

- Take a reference to the net device in get_dev.
- Implement the put_dev API that releases the reference to the net device.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
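A sketch of the get/put pairing this describes for mirred; field access, locking and RCU details are simplified here, so treat it as an illustration of the ownership rule (get_dev hands out a held reference, put_dev drops it) rather than the exact act_mirred.c code.

```c
static struct net_device *tcf_mirred_get_dev(const struct tc_action *a)
{
	struct tcf_mirred *m = to_mirred(a);
	struct net_device *dev = m->tcfm_dev;

	if (dev)
		dev_hold(dev);		/* caller now owns a reference */

	return dev;
}

static void tcf_mirred_put_dev(struct net_device *dev)
{
	dev_put(dev);			/* release the get_dev() reference */
}
```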
-
Committed by Konstantin Khorenko
This patch replaces physically contiguous memory arrays allocated using kmalloc_array() with flexible arrays. This makes it possible to avoid memory allocation failures on systems under memory stress. Signed-off-by: Oleg Babin <obabin@virtuozzo.com> Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Konstantin Khorenko
This patch introduces wrappers for accessing in/out streams indirectly. This will make it possible to replace the physically contiguous memory arrays of streams with flexible arrays (or any other appropriate mechanism) which do memory allocation on a per-page basis. Signed-off-by: Oleg Babin <obabin@virtuozzo.com> Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
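The wrapper idea, sketched with an illustrative helper name (the real accessor names/macros in the patch may differ): callers go through an accessor instead of indexing ->out[]/->in[] directly, so the backing storage can later become a flex array without touching every caller.

```c
static inline struct sctp_stream_out *
sctp_stream_out_ptr(struct sctp_stream *stream, __u16 sid)
{
	return &stream->out[sid];	/* today: plain contiguous array */
}

/* Callers then use sctp_stream_out_ptr(stream, sid) instead of
 * &stream->out[sid], so only this helper changes when the storage does. */
```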
-
Committed by Guillaume Nault
pppol2tp_ioctl() has everything in place for handling PPPIOCGL2TPSTATS on session sockets. We just need to copy the stats and set ->session_id. As a side effect of sharing session and tunnel code, ->using_ipsec is properly set even when the request was made using a session socket. Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Caleb Raitto
Make vp_set_vq_affinity() take a cpumask instead of taking a single CPU. If there are fewer queues than cores, queue affinity should be able to map to multiple cores. Link: https://patchwork.ozlabs.org/patch/948149/ Suggested-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Caleb Raitto <caraitto@google.com> Acked-by: Gonglei <arei.gonglei@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Yuchung Cheng
Add a new flag to indicate a one-time immediate ACK. This flag is occasionally set under specific TCP protocol states in addition to the more common quickack mechanism for interactive applications. In several cases in the TCP code we want to force an immediate ACK but do not want to call tcp_enter_quickack_mode() because we do not want to forget the icsk_ack.pingpong or icsk_ack.ato state. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
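A sketch of the intended usage: the flag name ICSK_ACK_NOW comes from this patch, while the helper below and its surroundings are simplified illustration. Setting the one-shot flag does not disturb the quick-ack state, which is the whole point.

```c
static void tcp_mark_ack_now(struct sock *sk)
{
	/* Unlike tcp_enter_quickack_mode(), this does not modify
	 * icsk_ack.pingpong or icsk_ack.ato; the next ACK is simply
	 * sent immediately and the flag is then cleared. */
	inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_NOW;
}
```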
-
- 11 August 2018, 8 commits
-
-
Committed by Martin KaFai Lau
This patch allows a BPF_PROG_TYPE_SK_REUSEPORT bpf prog to select a SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY introduced in the earlier patch. "bpf_run_sk_reuseport()" will return -ECONNREFUSED when the BPF_PROG_TYPE_SK_REUSEPORT prog returns SK_DROP. The callers, in inet[6]_hashtable.c and ipv[46]/udp.c, are modified to handle this case and return NULL immediately instead of continuing the sk search from its hashtable. It re-uses the existing SO_ATTACH_REUSEPORT_EBPF setsockopt to attach BPF_PROG_TYPE_SK_REUSEPORT. The "sk_reuseport_attach_bpf()" will check if the attaching bpf prog is in the new SK_REUSEPORT or the existing SOCKET_FILTER type and then check different things accordingly. One level of "__reuseport_attach_prog()" call is removed. The "sk_unhashed() && ..." and "sk->sk_reuseport_cb" tests are pushed back to "reuseport_attach_prog()" in sock_reuseport.c. sock_reuseport.c seems to have more knowledge on those test requirements than filter.c. In "reuseport_attach_prog()", after new_prog is attached to reuse->prog, the old_prog (if any) is also directly freed instead of returning the old_prog to the caller and asking the caller to free. The sysctl_optmem_max check is moved back to the "sk_reuseport_attach_filter()" and "sk_reuseport_attach_bpf()". As of other bpf prog types, the new BPF_PROG_TYPE_SK_REUSEPORT is only bounded by the usual "bpf_prog_charge_memlock()" during load time instead of bounded by both bpf_prog_charge_memlock and sysctl_optmem_max. Signed-off-by: NMartin KaFai Lau <kafai@fb.com> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
Committed by Martin KaFai Lau
This patch adds a BPF_PROG_TYPE_SK_REUSEPORT which can select a SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY. Like other non SK_FILTER/CGROUP_SKB program, it requires CAP_SYS_ADMIN. BPF_PROG_TYPE_SK_REUSEPORT introduces "struct sk_reuseport_kern" to store the bpf context instead of using the skb->cb[48]. At the SO_REUSEPORT sk lookup time, it is in the middle of transiting from a lower layer (ipv4/ipv6) to a upper layer (udp/tcp). At this point, it is not always clear where the bpf context can be appended in the skb->cb[48] to avoid saving-and-restoring cb[]. Even putting aside the difference between ipv4-vs-ipv6 and udp-vs-tcp. It is not clear if the lower layer is only ipv4 and ipv6 in the future and will it not touch the cb[] again before transiting to the upper layer. For example, in udp_gro_receive(), it uses the 48 byte NAPI_GRO_CB instead of IP[6]CB and it may still modify the cb[] after calling the udp[46]_lib_lookup_skb(). Because of the above reason, if sk->cb is used for the bpf ctx, saving-and-restoring is needed and likely the whole 48 bytes cb[] has to be saved and restored. Instead of saving, setting and restoring the cb[], this patch opts to create a new "struct sk_reuseport_kern" and setting the needed values in there. The new BPF_PROG_TYPE_SK_REUSEPORT and "struct sk_reuseport_(kern|md)" will serve all ipv4/ipv6 + udp/tcp combinations. There is no protocol specific usage at this point and it is also inline with the current sock_reuseport.c implementation (i.e. no protocol specific requirement). In "struct sk_reuseport_md", this patch exposes data/data_end/len with semantic similar to other existing usages. Together with "bpf_skb_load_bytes()" and "bpf_skb_load_bytes_relative()", the bpf prog can peek anywhere in the skb. The "bind_inany" tells the bpf prog that the reuseport group is bind-ed to a local INANY address which cannot be learned from skb. The new "bind_inany" is added to "struct sock_reuseport" which will be used when running the new "BPF_PROG_TYPE_SK_REUSEPORT" bpf prog in order to avoid repeating the "bind INANY" test on "sk_v6_rcv_saddr/sk->sk_rcv_saddr" every time a bpf prog is run. It can only be properly initialized when a "sk->sk_reuseport" enabled sk is adding to a hashtable (i.e. during "reuseport_alloc()" and "reuseport_add_sock()"). The new "sk_select_reuseport()" is the main helper that the bpf prog will use to select a SO_REUSEPORT sk. It is the only function that can use the new BPF_MAP_TYPE_REUSEPORT_ARRAY. As mentioned in the earlier patch, the validity of a selected sk is checked in run time in "sk_select_reuseport()". Doing the check in verification time is difficult and inflexible (consider the map-in-map use case). The runtime check is to compare the selected sk's reuseport_id with the reuseport_id that we want. This helper will return -EXXX if the selected sk cannot serve the incoming request (e.g. reuseport_id not match). The bpf prog can decide if it wants to do SK_DROP as its discretion. When the bpf prog returns SK_PASS, the kernel will check if a valid sk has been selected (i.e. "reuse_kern->selected_sk != NULL"). If it does , it will use the selected sk. If not, the kernel will select one from "reuse->socks[]" (as before this patch). The SK_DROP and SK_PASS handling logic will be in the next patch. Signed-off-by: NMartin KaFai Lau <kafai@fb.com> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
Committed by Martin KaFai Lau
This patch introduces a new map type BPF_MAP_TYPE_REUSEPORT_SOCKARRAY. To unleash the full potential of a bpf prog, it is essential for the userspace to be capable of directly setting up a bpf map which can then be consumed by the bpf prog to make decision. In this case, decide which SO_REUSEPORT sk to serve the incoming request. By adding BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, the userspace has total control and visibility on where a SO_REUSEPORT sk should be located in a bpf map. The later patch will introduce BPF_PROG_TYPE_SK_REUSEPORT such that the bpf prog can directly select a sk from the bpf map. That will raise the programmability of the bpf prog attached to a reuseport group (a group of sk serving the same IP:PORT). For example, in UDP, the bpf prog can peek into the payload (e.g. through the "data" pointer introduced in the later patch) to learn the application level's connection information and then decide which sk to pick from a bpf map. The userspace can tightly couple the sk's location in a bpf map with the application logic in generating the UDP payload's connection information. This connection info contact/API stays within the userspace. Also, when used with map-in-map, the userspace can switch the old-server-process's inner map to a new-server-process's inner map in one call "bpf_map_update_elem(outer_map, &index, &new_reuseport_array)". The bpf prog will then direct incoming requests to the new process instead of the old process. The old process can finish draining the pending requests (e.g. by "accept()") before closing the old-fds. [Note that deleting a fd from a bpf map does not necessary mean the fd is closed] During map_update_elem(), Only SO_REUSEPORT sk (i.e. which has already been added to a reuse->socks[]) can be used. That means a SO_REUSEPORT sk that is "bind()" for UDP or "bind()+listen()" for TCP. These conditions are ensured in "reuseport_array_update_check()". A SO_REUSEPORT sk can only be added once to a map (i.e. the same sk cannot be added twice even to the same map). SO_REUSEPORT already allows another sk to be created for the same IP:PORT. There is no need to re-create a similar usage in the BPF side. When a SO_REUSEPORT is deleted from the "reuse->socks[]" (e.g. "close()"), it will notify the bpf map to remove it from the map also. It is done through "bpf_sk_reuseport_detach()" and it will only be called if >=1 of the "reuse->sock[]" has ever been added to a bpf map. The map_update()/map_delete() has to be in-sync with the "reuse->socks[]". Hence, the same "reuseport_lock" used by "reuse->socks[]" has to be used here also. Care has been taken to ensure the lock is only acquired when the adding sk passes some strict tests. and freeing the map does not require the reuseport_lock. The reuseport_array will also support lookup from the syscall side. It will return a sock_gen_cookie(). The sock_gen_cookie() is on-demand (i.e. a sk's cookie is not generated until the very first map_lookup_elem()). The lookup cookie is 64bits but it goes against the logical userspace expectation on 32bits sizeof(fd) (and as other fd based bpf maps do also). It may catch user in surprise if we enforce value_size=8 while userspace still pass a 32bits fd during update. Supporting different value_size between lookup and update seems unintuitive also. We also need to consider what if other existing fd based maps want to return 64bits value from syscall's lookup in the future. Hence, reuseport_array supports both value_size 4 and 8, and assuming user will usually use value_size=4. 
The syscall's lookup will return ENOSPC on value_size=4. It will only return the 64-bit value from sock_gen_cookie() when the user consciously chooses value_size=8 (as a signal that lookup is desired), which then requires a 64-bit value in both lookup and update. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Committed by Martin KaFai Lau
A later patch will introduce a BPF_MAP_TYPE_REUSEPORT_ARRAY which allows a SO_REUSEPORT sk to be added to a bpf map. When a sk is removed from reuse->socks[], it also needs to be removed from the bpf map. Also, when adding a sk to a bpf map, the bpf map needs to ensure it is indeed in a reuse->socks[]. Hence, reuseport_lock is needed by the bpf map to ensure its map_update_elem() and map_delete_elem() operations are in-sync with the reuse->socks[]. The BPF_MAP_TYPE_REUSEPORT_ARRAY map will only acquire the reuseport_lock after ensuring the adding sk is already in a reuseport group (i.e. reuse->socks[]). The map_lookup_elem() will be lockless. This patch also adds an ID to sock_reuseport. A later patch will introduce BPF_PROG_TYPE_SK_REUSEPORT which allows a bpf prog to select a sk from a bpf map. It is inflexible to statically enforce a bpf map can only contain the sk belonging to a particular reuse->socks[] (i.e. same IP:PORT) during the bpf verification time. For example, think about the the map-in-map situation where the inner map can be dynamically changed in runtime and the outer map may have inner maps belonging to different reuseport groups. Hence, when the bpf prog (in the new BPF_PROG_TYPE_SK_REUSEPORT type) selects a sk, this selected sk has to be checked to ensure it belongs to the requesting reuseport group (i.e. the group serving that IP:PORT). The "sk->sk_reuseport_cb" pointer cannot be used for this checking purpose because the pointer value will change after reuseport_grow(). Instead of saving all checking conditions like the ones preced calling "reuseport_add_sock()" and compare them everytime a bpf_prog is run, a 32bits ID is introduced to survive the reuseport_grow(). The ID is only acquired if any of the reuse->socks[] is added to the newly introduced "BPF_MAP_TYPE_REUSEPORT_ARRAY" map. If "BPF_MAP_TYPE_REUSEPORT_ARRAY" is not used, the changes in this patch is a no-op. Signed-off-by: NMartin KaFai Lau <kafai@fb.com> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
Committed by Martin KaFai Lau
Although the actual cookie check "__cookie_v[46]_check()" does not involve sk specific info, it checks whether the sk has recent synq overflow event in "tcp_synq_no_recent_overflow()". The tcp_sk(sk)->rx_opt.ts_recent_stamp is updated every second when it has sent out a syncookie (through "tcp_synq_overflow()"). The above per sk "recent synq overflow event timestamp" works well for non SO_REUSEPORT use case. However, it may cause random connection request reject/discard when SO_REUSEPORT is used with syncookie because it fails the "tcp_synq_no_recent_overflow()" test. When SO_REUSEPORT is used, it usually has multiple listening socks serving TCP connection requests destinated to the same local IP:PORT. There are cases that the TCP-ACK-COOKIE may not be received by the same sk that sent out the syncookie. For example, if reuse->socks[] began with {sk0, sk1}, 1) sk1 sent out syncookies and tcp_sk(sk1)->rx_opt.ts_recent_stamp was updated. 2) the reuse->socks[] became {sk1, sk2} later. e.g. sk0 was first closed and then sk2 was added. Here, sk2 does not have ts_recent_stamp set. There are other ordering that will trigger the similar situation below but the idea is the same. 3) When the TCP-ACK-COOKIE comes back, sk2 was selected. "tcp_synq_no_recent_overflow(sk2)" returns true. In this case, all syncookies sent by sk1 will be handled (and rejected) by sk2 while sk1 is still alive. The userspace may create and remove listening SO_REUSEPORT sockets as it sees fit. E.g. Adding new thread (and SO_REUSEPORT sock) to handle incoming requests, old process stopping and new process starting...etc. With or without SO_ATTACH_REUSEPORT_[CB]BPF, the sockets leaving and joining a reuseport group makes picking the same sk to check the syncookie very difficult (if not impossible). The later patches will allow bpf prog more flexibility in deciding where a sk should be located in a bpf map and selecting a particular SO_REUSEPORT sock as it sees fit. e.g. Without closing any sock, replace the whole bpf reuseport_array in one map_update() by using map-in-map. Getting the syncookie check working smoothly across socks in the same "reuse->socks[]" is important. A partial solution is to set the newly added sk's ts_recent_stamp to the max ts_recent_stamp of a reuseport group but that will require to iterate through reuse->socks[] OR pessimistically set it to "now - TCP_SYNCOOKIE_VALID" when a sk is joining a reuseport group. However, neither of them will solve the existing sk getting moved around the reuse->socks[] and that sk may not have ts_recent_stamp updated, unlikely under continuous synflood but not impossible. This patch opts to treat the reuseport group as a whole when considering the last synq overflow timestamp since they are serving the same IP:PORT from the userspace (and BPF program) perspective. "synq_overflow_ts" is added to "struct sock_reuseport". The tcp_synq_overflow() and tcp_synq_no_recent_overflow() will update/check reuse->synq_overflow_ts if the sk is in a reuseport group. Similar to the reuseport decision in __inet_lookup_listener(), both sk->sk_reuseport and sk->sk_reuseport_cb are tested for SO_REUSEPORT usage. Update on "synq_overflow_ts" happens at roughly once every second. A synflood test was done with a 16 rx-queues and 16 reuseport sockets. No meaningful performance change is observed. Before and after the change is ~9Mpps in IPv4. 
Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
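A simplified sketch of the idea, not the exact patch: the synq-overflow timestamp is kept per reuseport group (the synq_overflow_ts field described above) rather than per socket, so any member of the group sees the same recent-overflow state; the per-sk fallback and the exact rate-limiting details are assumptions here.

```c
static inline void tcp_synq_overflow(const struct sock *sk)
{
	unsigned long now = jiffies;

	if (sk->sk_reuseport) {
		struct sock_reuseport *reuse;

		reuse = rcu_dereference(sk->sk_reuseport_cb);
		if (reuse) {
			unsigned long last = READ_ONCE(reuse->synq_overflow_ts);

			/* shared per-IP:PORT timestamp, updated at most
			 * about once per second */
			if (time_after32(now, last + HZ))
				WRITE_ONCE(reuse->synq_overflow_ts, now);
			return;
		}
	}

	/* non-reuseport sockets keep using the per-sk timestamp */
	tcp_sk(sk)->rx_opt.ts_recent_stamp = now;
}
```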
-
Committed by Linus Walleij
The bgpio_init() takes one of two arguments to specify a register to set the direction of the GPIO line: either dirout, which indicates that a 1 in the bit in that register sets the corresponding line to output, or dirin, which indicates that a 1 in the bit in that register sets the corresponding line to input. Conversely, setting the bit to 0 on these will turn the line into input and output respectively. One of these can be defined but not both. This means that a platform that sets a bit to 1 for output only defines dirout, and a platform that sets a bit to 0 for output only defines dirin. In short this defines the polarity of the direction register. Both can also be left as NULL, meaning the GPIO chip is either input-only or output-only.

Tomer Maimon discovered that for get/set chips (those where the get and set registers are defined but no separate clear register, and specifying BGPIOF_READ_OUTPUT_REG_SET so that we say we want to read the output value from the SET register) we are unconditionally reading the value from the SET register when the direction bit is 1 and from the DAT register when the direction bit is 0, not taking the direction bit polarity into account. It would be expected that when the direction bit is inverted (dirin is defined but not dirout) we read the current value from the DAT register when the bit is 1 and from the SET register when the bit is 0.

Currently only some versions of ATH79, brcmstb, some versions of CLP711x, GE, IOP and Loongson use the dirin mode (a 1 in the register means input). They are unaffected because BGPIOF_READ_OUTPUT_REG_SET is not set on any of them. (They do not read back the SET register to figure out the output value.) So this is no regression with current drivers. However the behaviour is wrong and does not work with Tomer's new driver where he needs to use BGPIOF_READ_OUTPUT_REG_SET.

This fixes the above issue by:

- Instead of defining separate functions for the inverted case, set up a flag in the gpio_chip that indicates that the direction is inverted.
- Remove the special inverted functions for setting input/output and getting the direction, rely on the flag instead.
- Respect this flag in bgpio_get_set() and bgpio_get_set_multiple().

Reported-by: Tomer Maimon <tmaimon77@gmail.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
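An illustrative sketch of the intended read-back logic, not the exact gpio-mmio code; the bgpio_dir_inverted flag name stands in for the new polarity flag described above, and the direction is read straight from the register for simplicity.

```c
static int bgpio_get_set_sketch(struct gpio_chip *gc, unsigned int gpio)
{
	unsigned long pinmask = bgpio_line2mask(gc, gpio);
	bool dir_out = !!(gc->read_reg(gc->reg_dir) & pinmask);

	/* A set bit normally means "output"; with an inverted direction
	 * register (only dirin defined) it means "input". */
	if (gc->bgpio_dir_inverted)
		dir_out = !dir_out;

	return dir_out ? !!(gc->read_reg(gc->reg_set) & pinmask)
		       : !!(gc->read_reg(gc->reg_dat) & pinmask);
}
```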
-
Committed by Krzysztof Kozlowski
Building virtio_net driver without CONFIG_XPS fails with:

 drivers/net/virtio_net.c: In function ‘virtnet_set_affinity’:
 drivers/net/virtio_net.c:1910:3: error: implicit declaration of function ‘__netif_set_xps_queue’ [-Werror=implicit-function-declaration]
   __netif_set_xps_queue(vi->dev, mask, i, false);
   ^

Fixes: 4d99f660 ("net: allow to call netif_reset_xps_queues() under cpus_read_lock") Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by David Collins
Introduce bindings for RPMh regulator devices found on some Qualcomm Technologies, Inc. SoCs. These devices allow a given processor within the SoC to make PMIC regulator requests which are aggregated within the RPMh hardware block along with requests from other processors in the SoC to determine the final PMIC regulator hardware state. Signed-off-by: David Collins <collinsd@codeaurora.org> Reviewed-by: Rob Herring <robh@kernel.org> Reviewed-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Mark Brown <broonie@kernel.org>
-