- 08 8月, 2020 13 次提交
-
-
由 Roman Gushchin 提交于
To make the memcg_kmem_bypass() function available outside of the memcontrol.c, let's move it to memcontrol.h. The function is small and nicely fits into static inline sort of functions. It will be used from the slab code. Signed-off-by: NRoman Gushchin <guro@fb.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NVlastimil Babka <vbabka@suse.cz> Reviewed-by: NShakeel Butt <shakeelb@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200623174037.3951353-12-guro@fb.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Roman Gushchin 提交于
Store the obj_cgroup pointer in the corresponding place of page->obj_cgroups for each allocated non-root slab object. Make sure that each allocated object holds a reference to obj_cgroup. Objcg pointer is obtained from the memcg->objcg dereferencing in memcg_kmem_get_cache() and passed from pre_alloc_hook to post_alloc_hook. Then in case of successful allocation(s) it's getting stored in the page->obj_cgroups vector. The objcg obtaining part look a bit bulky now, but it will be simplified by next commits in the series. Signed-off-by: NRoman Gushchin <guro@fb.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NVlastimil Babka <vbabka@suse.cz> Reviewed-by: NShakeel Butt <shakeelb@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200623174037.3951353-9-guro@fb.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Roman Gushchin 提交于
Allocate and release memory to store obj_cgroup pointers for each non-root slab page. Reuse page->mem_cgroup pointer to store a pointer to the allocated space. This commit temporarily increases the memory footprint of the kernel memory accounting. To store obj_cgroup pointers we'll need a place for an objcg_pointer for each allocated object. However, the following patches in the series will enable sharing of slab pages between memory cgroups, which will dramatically increase the total slab utilization. And the final memory footprint will be significantly smaller than before. To distinguish between obj_cgroups and memcg pointers in case when it's not obvious which one is used (as in page_cgroup_ino()), let's always set the lowest bit in the obj_cgroup case. The original obj_cgroups pointer is marked to be ignored by kmemleak, which otherwise would report a memory leak for each allocated vector. Signed-off-by: NRoman Gushchin <guro@fb.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NVlastimil Babka <vbabka@suse.cz> Reviewed-by: NShakeel Butt <shakeelb@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200623174037.3951353-8-guro@fb.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Roman Gushchin 提交于
Obj_cgroup API provides an ability to account sub-page sized kernel objects, which potentially outlive the original memory cgroup. The top-level API consists of the following functions: bool obj_cgroup_tryget(struct obj_cgroup *objcg); void obj_cgroup_get(struct obj_cgroup *objcg); void obj_cgroup_put(struct obj_cgroup *objcg); int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size); void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size); struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg); struct obj_cgroup *get_obj_cgroup_from_current(void); Object cgroup is basically a pointer to a memory cgroup with a per-cpu reference counter. It substitutes a memory cgroup in places where it's necessary to charge a custom amount of bytes instead of pages. All charged memory rounded down to pages is charged to the corresponding memory cgroup using __memcg_kmem_charge(). It implements reparenting: on memcg offlining it's getting reattached to the parent memory cgroup. Each online memory cgroup has an associated active object cgroup to handle new allocations and the list of all attached object cgroups. On offlining of a cgroup this list is reparented and for each object cgroup in the list the memcg pointer is swapped to the parent memory cgroup. It prevents long-living objects from pinning the original memory cgroup in the memory. The implementation is based on byte-sized per-cpu stocks. A sub-page sized leftover is stored in an atomic field, which is a part of obj_cgroup object. So on cgroup offlining the leftover is automatically reparented. memcg->objcg is rcu protected. objcg->memcg is a raw pointer, which is always pointing at a memory cgroup, but can be atomically swapped to the parent memory cgroup. So a user must ensure the lifetime of the cgroup, e.g. grab rcu_read_lock or css_set_lock. Suggested-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NRoman Gushchin <guro@fb.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/20200623174037.3951353-7-guro@fb.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Roman Gushchin 提交于
This commit implements SLUB version of the obj_to_index() function, which will be required to calculate the offset of obj_cgroup in the obj_cgroups vector to store/obtain the objcg ownership data. To make it faster, let's repeat the SLAB's trick introduced by commit 6a2d7a95 ("SLAB: use a multiply instead of a divide in obj_to_index()") and avoid an expensive division. Vlastimil Babka noticed, that SLUB does have already a similar function called slab_index(), which is defined only if SLUB_DEBUG is enabled. The function does a similar math, but with a division, and it also takes a page address instead of a page pointer. Let's remove slab_index() and replace it with the new helper __obj_to_index(), which takes a page address. obj_to_index() will be a simple wrapper taking a page pointer and passing page_address(page) into __obj_to_index(). Signed-off-by: NRoman Gushchin <guro@fb.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NVlastimil Babka <vbabka@suse.cz> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200623174037.3951353-5-guro@fb.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Roman Gushchin 提交于
In order to prepare for per-object slab memory accounting, convert NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat items to bytes. To make it obvious, rename them to NR_SLAB_RECLAIMABLE_B and NR_SLAB_UNRECLAIMABLE_B (similar to NR_KERNEL_STACK_KB). Internally global and per-node counters are stored in pages, however memcg and lruvec counters are stored in bytes. This scheme may look weird, but only for now. As soon as slab pages will be shared between multiple cgroups, global and node counters will reflect the total number of slab pages. However memcg and lruvec counters will be used for per-memcg slab memory tracking, which will take separate kernel objects in the account. Keeping global and node counters in pages helps to avoid additional overhead. The size of slab memory shouldn't exceed 4Gb on 32-bit machines, so it will fit into atomic_long_t we use for vmstats. Signed-off-by: NRoman Gushchin <guro@fb.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200623174037.3951353-4-guro@fb.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Roman Gushchin 提交于
To implement per-object slab memory accounting, we need to convert slab vmstat counters to bytes. Actually, out of 4 levels of counters: global, per-node, per-memcg and per-lruvec only two last levels will require byte-sized counters. It's because global and per-node counters will be counting the number of slab pages, and per-memcg and per-lruvec will be counting the amount of memory taken by charged slab objects. Converting all vmstat counters to bytes or even all slab counters to bytes would introduce an additional overhead. So instead let's store global and per-node counters in pages, and memcg and lruvec counters in bytes. To make the API clean all access helpers (both on the read and write sides) are dealing with bytes. To avoid back-and-forth conversions a new flavor of read-side helpers is introduced, which always returns values in pages: node_page_state_pages() and global_node_page_state_pages(). Actually new helpers are just reading raw values. Old helpers are simple wrappers, which will complain on an attempt to read byte value, because at the moment no one actually needs bytes. Thanks to Johannes Weiner for the idea of having the byte-sized API on top of the page-sized internal storage. Signed-off-by: NRoman Gushchin <guro@fb.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NVlastimil Babka <vbabka@suse.cz> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200623174037.3951353-3-guro@fb.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Roman Gushchin 提交于
Patch series "The new cgroup slab memory controller", v7. The patchset moves the accounting from the page level to the object level. It allows to share slab pages between memory cgroups. This leads to a significant win in the slab utilization (up to 45%) and the corresponding drop in the total kernel memory footprint. The reduced number of unmovable slab pages should also have a positive effect on the memory fragmentation. The patchset makes the slab accounting code simpler: there is no more need in the complicated dynamic creation and destruction of per-cgroup slab caches, all memory cgroups use a global set of shared slab caches. The lifetime of slab caches is not more connected to the lifetime of memory cgroups. The more precise accounting does require more CPU, however in practice the difference seems to be negligible. We've been using the new slab controller in Facebook production for several months with different workloads and haven't seen any noticeable regressions. What we've seen were memory savings in order of 1 GB per host (it varied heavily depending on the actual workload, size of RAM, number of CPUs, memory pressure, etc). The third version of the patchset added yet another step towards the simplification of the code: sharing of slab caches between accounted and non-accounted allocations. It comes with significant upsides (most noticeable, a complete elimination of dynamic slab caches creation) but not without some regression risks, so this change sits on top of the patchset and is not completely merged in. So in the unlikely event of a noticeable performance regression it can be reverted separately. The slab memory accounting works in exactly the same way for SLAB and SLUB. With both allocators the new controller shows significant memory savings, with SLUB the difference is bigger. On my 16-core desktop machine running Fedora 32 the size of the slab memory measured after the start of the system was lower by 58% and 38% with SLUB and SLAB correspondingly. As an estimation of a potential CPU overhead, below are results of slab_bulk_test01 test, kindly provided by Jesper D. Brouer. He also helped with the evaluation of results. The test can be found here: https://github.com/netoptimizer/prototype-kernel/ The smallest number in each row should be picked for a comparison. SLUB-patched - bulk-API - SLUB-patched : bulk_quick_reuse objects=1 : 187 - 90 - 224 cycles(tsc) - SLUB-patched : bulk_quick_reuse objects=2 : 110 - 53 - 133 cycles(tsc) - SLUB-patched : bulk_quick_reuse objects=3 : 88 - 95 - 42 cycles(tsc) - SLUB-patched : bulk_quick_reuse objects=4 : 91 - 85 - 36 cycles(tsc) - SLUB-patched : bulk_quick_reuse objects=8 : 32 - 66 - 32 cycles(tsc) SLUB-original - bulk-API - SLUB-original: bulk_quick_reuse objects=1 : 87 - 87 - 142 cycles(tsc) - SLUB-original: bulk_quick_reuse objects=2 : 52 - 53 - 53 cycles(tsc) - SLUB-original: bulk_quick_reuse objects=3 : 42 - 42 - 91 cycles(tsc) - SLUB-original: bulk_quick_reuse objects=4 : 91 - 37 - 37 cycles(tsc) - SLUB-original: bulk_quick_reuse objects=8 : 31 - 79 - 76 cycles(tsc) SLAB-patched - bulk-API - SLAB-patched : bulk_quick_reuse objects=1 : 67 - 67 - 140 cycles(tsc) - SLAB-patched : bulk_quick_reuse objects=2 : 55 - 46 - 46 cycles(tsc) - SLAB-patched : bulk_quick_reuse objects=3 : 93 - 94 - 39 cycles(tsc) - SLAB-patched : bulk_quick_reuse objects=4 : 35 - 88 - 85 cycles(tsc) - SLAB-patched : bulk_quick_reuse objects=8 : 30 - 30 - 30 cycles(tsc) SLAB-original- bulk-API - SLAB-original: bulk_quick_reuse objects=1 : 143 - 136 - 67 cycles(tsc) - SLAB-original: bulk_quick_reuse objects=2 : 45 - 46 - 46 cycles(tsc) - SLAB-original: bulk_quick_reuse objects=3 : 38 - 39 - 39 cycles(tsc) - SLAB-original: bulk_quick_reuse objects=4 : 35 - 87 - 87 cycles(tsc) - SLAB-original: bulk_quick_reuse objects=8 : 29 - 66 - 30 cycles(tsc) This patch (of 19): To convert memcg and lruvec slab counters to bytes there must be a way to change these counters without touching node counters. Factor out __mod_memcg_lruvec_state() out of __mod_lruvec_state(). Signed-off-by: NRoman Gushchin <guro@fb.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NVlastimil Babka <vbabka@suse.cz> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200623174037.3951353-1-guro@fb.com Link: http://lkml.kernel.org/r/20200623174037.3951353-2-guro@fb.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Chris Down 提交于
The default is still set to inode32 for backwards compatibility, but system administrators can opt in to the new 64-bit inode numbers by either: 1. Passing inode64 on the command line when mounting, or 2. Configuring the kernel with CONFIG_TMPFS_INODE64=y The inode64 and inode32 names are used based on existing precedent from XFS. [hughd@google.com: Kconfig fixes] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008011928010.13320@eggly.anvilsSigned-off-by: NChris Down <chris@chrisdown.name> Signed-off-by: NHugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NAmir Goldstein <amir73il@gmail.com> Acked-by: NHugh Dickins <hughd@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Matthew Wilcox <willy@infradead.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/8b23758d0c66b5e2263e08baf9c4b6a7565cbd8f.1594661218.git.chris@chrisdown.nameSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Chris Down 提交于
Patch series "tmpfs: inode: Reduce risk of inum overflow", v7. In Facebook production we are seeing heavy i_ino wraparounds on tmpfs. On affected tiers, in excess of 10% of hosts show multiple files with different content and the same inode number, with some servers even having as many as 150 duplicated inode numbers with differing file content. This causes actual, tangible problems in production. For example, we have complaints from those working on remote caches that their application is reporting cache corruptions because it uses (device, inodenum) to establish the identity of a particular cache object, but because it's not unique any more, the application refuses to continue and reports cache corruption. Even worse, sometimes applications may not even detect the corruption but may continue anyway, causing phantom and hard to debug behaviour. In general, userspace applications expect that (device, inodenum) should be enough to be uniquely point to one inode, which seems fair enough. One might also need to check the generation, but in this case: 1. That's not currently exposed to userspace (ioctl(...FS_IOC_GETVERSION...) returns ENOTTY on tmpfs); 2. Even with generation, there shouldn't be two live inodes with the same inode number on one device. In order to mitigate this, we take a two-pronged approach: 1. Moving inum generation from being global to per-sb for tmpfs. This itself allows some reduction in i_ino churn. This works on both 64- and 32- bit machines. 2. Adding inode{64,32} for tmpfs. This fix is supported on machines with 64-bit ino_t only: we allow users to mount tmpfs with a new inode64 option that uses the full width of ino_t, or CONFIG_TMPFS_INODE64. You can see how this compares to previous related patches which didn't implement this per-superblock: - https://patchwork.kernel.org/patch/11254001/ - https://patchwork.kernel.org/patch/11023915/ This patch (of 2): get_next_ino has a number of problems: - It uses and returns a uint, which is susceptible to become overflowed if a lot of volatile inodes that use get_next_ino are created. - It's global, with no specificity per-sb or even per-filesystem. This means it's not that difficult to cause inode number wraparounds on a single device, which can result in having multiple distinct inodes with the same inode number. This patch adds a per-superblock counter that mitigates the second case. This design also allows us to later have a specific i_ino size per-device, for example, allowing users to choose whether to use 32- or 64-bit inodes for each tmpfs mount. This is implemented in the next commit. For internal shmem mounts which may be less tolerant to spinlock delays, we implement a percpu batching scheme which only takes the stat_lock at each batch boundary. Signed-off-by: NChris Down <chris@chrisdown.name> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NHugh Dickins <hughd@google.com> Cc: Amir Goldstein <amir73il@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Matthew Wilcox <willy@infradead.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/cover.1594661218.git.chris@chrisdown.name Link: http://lkml.kernel.org/r/1986b9d63b986f08ec07a4aa4b2275e718e47d8a.1594661218.git.chris@chrisdown.nameSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 John Hubbard 提交于
If a compound page is being split while dump_page() is being run on that page, we can end up calling compound_mapcount() on a page that is no longer compound. This leads to a crash (already seen at least once in the field), due to the VM_BUG_ON_PAGE() assertion inside compound_mapcount(). (The above is from Matthew Wilcox's analysis of Qian Cai's bug report.) A similar problem is possible, via compound_pincount() instead of compound_mapcount(). In order to avoid this kind of crash, make dump_page() slightly more robust, by providing a pair of simpler routines that don't contain assertions: head_mapcount() and head_pincount(). For debug tools, we don't want to go *too* far in this direction, but this is a simple small fix, and the crash has already been seen, so it's a good trade-off. Reported-by: NQian Cai <cai@lca.pw> Suggested-by: NMatthew Wilcox <willy@infradead.org> Signed-off-by: NJohn Hubbard <jhubbard@nvidia.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: William Kucharski <william.kucharski@oracle.com> Link: http://lkml.kernel.org/r/20200804214807.169256-1-jhubbard@nvidia.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Waiman Long 提交于
As said by Linus: A symmetric naming is only helpful if it implies symmetries in use. Otherwise it's actively misleading. In "kzalloc()", the z is meaningful and an important part of what the caller wants. In "kzfree()", the z is actively detrimental, because maybe in the future we really _might_ want to use that "memfill(0xdeadbeef)" or something. The "zero" part of the interface isn't even _relevant_. The main reason that kzfree() exists is to clear sensitive information that should not be leaked to other future users of the same memory objects. Rename kzfree() to kfree_sensitive() to follow the example of the recently added kvfree_sensitive() and make the intention of the API more explicit. In addition, memzero_explicit() is used to clear the memory to make sure that it won't get optimized away by the compiler. The renaming is done by using the command sequence: git grep -w --name-only kzfree |\ xargs sed -i 's/kzfree/kfree_sensitive/' followed by some editing of the kfree_sensitive() kerneldoc and adding a kzfree backward compatibility macro in slab.h. [akpm@linux-foundation.org: fs/crypto/inline_crypt.c needs linux/slab.h] [akpm@linux-foundation.org: fix fs/crypto/inline_crypt.c some more] Suggested-by: NJoe Perches <joe@perches.com> Signed-off-by: NWaiman Long <longman@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NDavid Howells <dhowells@redhat.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Cc: James Morris <jmorris@namei.org> Cc: "Serge E. Hallyn" <serge@hallyn.com> Cc: Joe Perches <joe@perches.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: David Rientjes <rientjes@google.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: "Jason A . Donenfeld" <Jason@zx2c4.com> Link: http://lkml.kernel.org/r/20200616154311.12314-3-longman@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Ralph Campbell 提交于
On x86_64, when CONFIG_MMU_NOTIFIER is not set/enabled, there is a compiler error: mm/migrate.c: In function 'migrate_vma_collect': mm/migrate.c:2481:7: error: 'struct mmu_notifier_range' has no member named 'migrate_pgmap_owner' range.migrate_pgmap_owner = migrate->pgmap_owner; ^ Fixes: 998427b3 ("mm/notifier: add migration invalidation type") Reported-by: NRandy Dunlap <rdunlap@infradead.org> Signed-off-by: NRalph Campbell <rcampbell@nvidia.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Tested-by: NRandy Dunlap <rdunlap@infradead.org> Acked-by: NRandy Dunlap <rdunlap@infradead.org> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Cc: "Jason Gunthorpe" <jgg@mellanox.com> Link: http://lkml.kernel.org/r/20200806193353.7124-1-rcampbell@nvidia.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 05 8月, 2020 1 次提交
-
-
由 Stefano Brivio 提交于
Currently, processes sending traffic to a local bridge with an encapsulation device as a port don't get ICMP errors if they exceed the PMTU of the encapsulated link. David Ahern suggested this as a hack, but it actually looks like the correct solution: when we update the PMTU for a given destination by means of updating or creating a route exception, the encapsulation might trigger this because of PMTU discovery happening either on the encapsulation device itself, or its lower layer. This happens on bridged encapsulations only. The output interface shouldn't matter, because we already have a valid destination. Drop the output interface restriction from the associated route lookup. For UDP tunnels, we will now have a route exception created for the encapsulation itself, with a MTU value reflecting its headroom, which allows a bridge forwarding IP packets originated locally to deliver errors back to the sending socket. The behaviour is now consistent with IPv6 and verified with selftests pmtu_ipv{4,6}_br_{geneve,vxlan}{4,6}_exception introduced later in this series. v2: - reset output interface only for bridge ports (David Ahern) - add and use netif_is_any_bridge_port() helper (David Ahern) Suggested-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NStefano Brivio <sbrivio@redhat.com> Reviewed-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 04 8月, 2020 6 次提交
-
-
由 Linus Torvalds 提交于
The addition of percpu.h to the list of includes in random.h revealed some circular dependencies on arm64 and possibly other platforms. This include was added solely for the pseudo-random definitions, which have nothing to do with the rest of the definitions in this file but are still there for legacy reasons. This patch moves the pseudo-random parts to linux/prandom.h and the percpu.h include with it, which is now guarded by _LINUX_PRANDOM_H and protected against recursive inclusion. A further cleanup step would be to remove this from <linux/random.h> entirely, and make people who use the prandom infrastructure include just the new header file. That's a bit of a churn patch, but grepping for "prandom_" and "next_pseudo_random32" "struct rnd_state" should catch most users. But it turns out that that nice cleanup step is fairly painful, because a _lot_ of code currently seems to depend on the implicit include of <linux/random.h>, which can currently come in a lot of ways, including such fairly core headfers as <linux/net.h>. So the "nice cleanup" part may or may never happen. Fixes: 1c9df907 ("random: fix circular include dependency on arm64 after addition of percpu.h") Tested-by: NGuenter Roeck <linux@roeck-us.net> Acked-by: NWilly Tarreau <w@1wt.eu> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Florian Fainelli 提交于
For now we simply store the port MTU into a per-port member. Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com> Reviewed-by: NAndrew Lunn <andrew@lunn.ch> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Florian Fainelli 提交于
In preparation for adding support for a mockup data path, move the driver data structures to include/linux/dsa/loop.h such that we can share them between net/dsa/ and drivers/net/dsa/ later on. Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Ahmad Fatoum 提交于
Commit 959bc7b2 ("gpio: Automatically add lockdep keys") documents in its commits message its intention to "create a unique class key for each driver". It does so by having gpiochip_add_data add in-place the definition of two static lockdep classes for LOCKDEP use. That way, every caller of the macro adds their gpiochip with unique lockdep classes. There are many indirect callers of gpiochip_add_data, however, via use of devm_gpiochip_add_data. devm_gpiochip_add_data has external linkage and all its users will share the same lockdep classes, which probably is not intended. Fix this by replicating the gpio_chip_add_data statics-in-macro for the devm_ version as well. Fixes: 959bc7b2 ("gpio: Automatically add lockdep keys") Signed-off-by: NAhmad Fatoum <a.fatoum@pengutronix.de> Reviewed-by: NAndy Shevchenko <andy.shevchenko@gmail.com> Reviewed-by: NBartosz Golaszewski <bgolaszewski@baylibre.com> Link: https://lore.kernel.org/r/20200731123835.8003-1-a.fatoum@pengutronix.deSigned-off-by: NLinus Walleij <linus.walleij@linaro.org>
-
由 wenxu 提交于
When openvswitch conntrack offload with act_ct action. Fragment packets defrag in the ingress tc act_ct action and miss the next chain. Then the packet pass to the openvswitch datapath without the mru. The over mtu packet will be dropped in output action in openvswitch for over mtu. "kernel: net2: dropped over-mtu packet: 1528 > 1500" This patch add mru in the tc_skb_ext for adefrag and miss next chain situation. And also add mru in the qdisc_skb_cb. The act_ct set the mru to the qdisc_skb_cb when the packet defrag. And When the chain miss, The mru is set to tc_skb_ext which can be got by ovs datapath. Fixes: b57dc7c1 ("net/sched: Introduce action ct") Signed-off-by: Nwenxu <wenxu@ucloud.cn> Reviewed-by: NCong Wang <xiyou.wangcong@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Bruno Thomsen 提交于
Load new "reset-post-delay-us" value from MDIO properties, and if configured to a greater then zero delay do a flexible sleeping delay after MDIO bus reset deassert. This allows devices to exit reset state before start bus communication. Signed-off-by: NBruno Thomsen <bruno.thomsen@gmail.com> Reviewed-by: NAndrew Lunn <andrew@lunn.ch> Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 03 8月, 2020 2 次提交
-
-
由 Jouni Malinen 提交于
SAE authentication has been extended with H2E (IEEE 802.11 REVmd) and PK (WFA) options. Those extensions use special status code values in the SAE commit messages (Authentication frame with transaction sequence number 1) to identify which extension is in use. mac80211 was interpreting those new values as the AP denying authentication and that resulted in failure to complete SAE authentication in some cases. Fix this by adding exceptions for the new status code values 126 and 127. Signed-off-by: NJouni Malinen <jouni@codeaurora.org> Link: https://lore.kernel.org/r/20200731183830.18735-1-jouni@codeaurora.orgSigned-off-by: NJohannes Berg <johannes.berg@intel.com>
-
由 Linus Torvalds 提交于
That gives us ordering guarantees around the pair. Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 02 8月, 2020 3 次提交
-
-
由 Ajay Singh 提交于
Moved macros used for Vendor/Device ID from wilc1000 driver to common header file and changed macro name for consistency with other macros. Signed-off-by: NAjay Singh <ajay.kathat@microchip.com> Acked-by: NUlf Hansson <ulf.hansson@linaro.org> Acked-by: NPali Rohár <pali@kernel.org> Signed-off-by: NKalle Valo <kvalo@codeaurora.org> Link: https://lore.kernel.org/r/20200717051134.19160-1-ajay.kathat@microchip.com
-
由 Andrii Nakryiko 提交于
Add LINK_DETACH command to force-detach bpf_link without destroying it. It has the same behavior as auto-detaching of bpf_link due to cgroup dying for bpf_cgroup_link or net_device being destroyed for bpf_xdp_link. In such case, bpf_link is still a valid kernel object, but is defuncts and doesn't hold BPF program attached to corresponding BPF hook. This functionality allows users with enough access rights to manually force-detach attached bpf_link without killing respective owner process. This patch implements LINK_DETACH for cgroup, xdp, and netns links, mostly re-using existing link release handling code. Signed-off-by: NAndrii Nakryiko <andriin@fb.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Acked-by: NSong Liu <songliubraving@fb.com> Acked-by: NJohn Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200731182830.286260-2-andriin@fb.com
-
由 Pavel Begunkov 提交于
Use a local var to collect flags in kiocb_set_rw_flags(). That spares some memory writes and allows to replace most of the jumps with MOVEcc. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Reviewed-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 01 8月, 2020 3 次提交
-
-
由 Valentin Schneider 提交于
Rather that hide their purpose in some dark, damp corner of Documentation/, add some documentation to the default implementations. Signed-off-by: NValentin Schneider <valentin.schneider@arm.com> Signed-off-by: NIngo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200731192016.7484-2-valentin.schneider@arm.com
-
由 Roopa Prabhu 提交于
netdev protodown is a mechanism that allows protocols to hold an interface down. It was initially introduced in the kernel to hold links down by a multihoming protocol. There was also an attempt to introduce protodown reason at the time but was rejected. protodown and protodown reason is supported by almost every switching and routing platform. It was ok for a while to live without a protodown reason. But, its become more critical now given more than one protocol may need to keep a link down on a system at the same time. eg: vrrp peer node, port security, multihoming protocol. Its common for Network operators and protocol developers to look for such a reason on a networking box (Its also known as errDisable by most networking operators) This patch adds support for link protodown reason attribute. There are two ways to maintain protodown reasons. (a) enumerate every possible reason code in kernel - A protocol developer has to make a request and have that appear in a certain kernel version (b) provide the bits in the kernel, and allow user-space (sysadmin or NOS distributions) to manage the bit-to-reasonname map. - This makes extending reason codes easier (kind of like the iproute2 table to vrf-name map /etc/iproute2/rt_tables.d/) This patch takes approach (b). a few things about the patch: - It treats the protodown reason bits as counter to indicate active protodown users - Since protodown attribute is already an exposed UAPI, the reason is not enforced on a protodown set. Its a no-op if not used. the patch follows the below algorithm: - presence of reason bits set indicates protodown is in use - user can set protodown and protodown reason in a single or multiple setlink operations - setlink operation to clear protodown, will return -EBUSY if there are active protodown reason bits - reason is not included in link dumps if not used example with patched iproute2: $cat /etc/iproute2/protodown_reasons.d/r.conf 0 mlag 1 evpn 2 vrrp 3 psecurity $ip link set dev vxlan0 protodown on protodown_reason vrrp on $ip link set dev vxlan0 protodown_reason mlag on $ip link show 14: vxlan0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether f6:06:be:17:91:e7 brd ff:ff:ff:ff:ff:ff protodown on <mlag,vrrp> $ip link set dev vxlan0 protodown_reason mlag off $ip link set dev vxlan0 protodown off protodown_reason vrrp off Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Yousuk Seung 提交于
This change adds TCP_NLA_EDT to SCM_TIMESTAMPING_OPT_STATS that reports the earliest departure time(EDT) of the timestamped skb. By tracking EDT values of the skb from different timestamps, we can observe when and how much the value changed. This allows to measure the precise delay injected on the sender host e.g. by a bpf-base throttler. Signed-off-by: NYousuk Seung <ysseung@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Acked-by: NNeal Cardwell <ncardwell@google.com> Acked-by: NSoheil Hassas Yeganeh <soheil@google.com> Acked-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 31 7月, 2020 7 次提交
-
-
由 Marco Elver 提交于
To improve the general usefulness of the IRQ state trace events with KCSAN enabled, save and restore the trace information when entering and exiting the KCSAN runtime as well as when generating a KCSAN report. Without this, reporting the IRQ trace events (whether via a KCSAN report or outside of KCSAN via a lockdep report) is rather useless due to continuously being touched by KCSAN. This is because if KCSAN is enabled, every instrumented memory access causes changes to IRQ trace events (either by KCSAN disabling/enabling interrupts or taking report_lock when generating a report). Before "lockdep: Prepare for NMI IRQ state tracking", KCSAN avoided touching the IRQ trace events via raw_local_irq_save/restore() and lockdep_off/on(). Fixes: 248591f5 ("kcsan: Make KCSAN compatible with new IRQ state tracking") Signed-off-by: NMarco Elver <elver@google.com> Signed-off-by: NIngo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200729110916.3920464-2-elver@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Marco Elver 提交于
Refactor the IRQ trace events fields, used for printing information about the IRQ trace events, into a separate struct 'irqtrace_events'. This improves readability by separating the information only used in reporting, as well as enables (simplified) storing/restoring of irqtrace_events snapshots. No functional change intended. Signed-off-by: NMarco Elver <elver@google.com> Signed-off-by: NIngo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200729110916.3920464-1-elver@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Nick Terrell 提交于
- Add unzstd() and the zstd decompress interface. - Add zstd support to decompress_method(). The decompress_method() and unzstd() functions are used to decompress the initramfs and the initrd. The __decompress() function is used in the preboot environment to decompress a zstd compressed kernel. The zstd decompression function allows the input and output buffers to overlap because that is used by x86 kernel decompression. Signed-off-by: NNick Terrell <terrelln@fb.com> Signed-off-by: NIngo Molnar <mingo@kernel.org> Tested-by: NSedat Dilek <sedat.dilek@gmail.com> Reviewed-by: NKees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20200730190841.2071656-3-nickrterrell@gmail.com
-
由 Marcelo Henrique Cerri 提交于
Add mpi_sub_ui() based on Gnu MP mpz_sub_ui() function from file mpz/aors_ui.h[1] from change id 510b83519d1c adapting the code to the kernel's data structures, helper functions and coding style and also removing the defines used to produce mpz_sub_ui() and mpz_add_ui() from the same code. [1] https://gmplib.org/repo/gmp-6.2/file/510b83519d1c/mpz/aors.hSigned-off-by: NMarcelo Henrique Cerri <marcelo.cerri@canonical.com> Signed-off-by: NStephan Mueller <smueller@chronox.de> Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
-
由 Romain Perier 提交于
Nowadays, modern kernel subsystems that use callbacks pass the data structure associated with a given callback as argument to the callback. The tasklet subsystem remains one which passes an arbitrary unsigned long to the callback function. This has several problems: - This keeps an extra field for storing the argument in each tasklet data structure, it bloats the tasklet_struct structure with a redundant .data field - No type checking can be performed on this argument. Instead of using container_of() like other callback subsystems, it forces callbacks to do explicit type cast of the unsigned long argument into the required object type. - Buffer overflows can overwrite the .func and the .data field, so an attacker can easily overwrite the function and its first argument to whatever it wants. Add a new tasklet initialization API, via DECLARE_TASKLET() and tasklet_setup(), which will replace the existing ones. This work is greatly inspired by the timer_struct conversion series, see commit e99e88a9 ("treewide: setup_timer() -> timer_setup()") To avoid problems with both -Wcast-function-type (which is enabled in the kernel via -Wextra is several subsystems), and with mismatched function prototypes when build with Control Flow Integrity enabled, this adds the "use_callback" member to let the tasklet caller choose which union member to call through. Once all old API uses are removed, this and the .data member will be removed as well. (On 64-bit this does not grow the struct size as the new member fills the hole after atomic_t, which is also "int" sized.) Signed-off-by: NRomain Perier <romain.perier@gmail.com> Co-developed-by: NAllen Pais <allen.lkml@gmail.com> Signed-off-by: NAllen Pais <allen.lkml@gmail.com> Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: NThomas Gleixner <tglx@linutronix.de> Co-developed-by: NKees Cook <keescook@chromium.org> Signed-off-by: NKees Cook <keescook@chromium.org>
-
由 Kees Cook 提交于
This converts all the existing DECLARE_TASKLET() (and ...DISABLED) macros with DECLARE_TASKLET_OLD() in preparation for refactoring the tasklet callback type. All existing DECLARE_TASKLET() users had a "0" data argument, it has been removed here as well. Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NKees Cook <keescook@chromium.org>
-
由 Willy Tarreau 提交于
Daniel Díaz and Kees Cook independently reported that commit f227e3ec ("random32: update the net random state on interrupt and activity") broke arm64 due to a circular dependency on include files since the addition of percpu.h in random.h. The correct fix would definitely be to move all the prandom32 stuff out of random.h but for backporting, a smaller solution is preferred. This one replaces linux/percpu.h with asm/percpu.h, and this fixes the problem on x86_64, arm64, arm, and mips. Note that moving percpu.h around didn't change anything and that removing it entirely broke differently. When backporting, such options might still be considered if this patch fails to help. [ It turns out that an alternate fix seems to be to just remove the troublesome <asm/pointer_auth.h> remove from the arm64 <asm/smp.h> that causes the circular dependency. But we might as well do the whole belt-and-suspenders thing, and minimize inclusion in <linux/random.h> too. Either will fix the problem, and both are good changes. - Linus ] Reported-by: NDaniel Díaz <daniel.diaz@linaro.org> Reported-by: NKees Cook <keescook@chromium.org> Tested-by: NMarc Zyngier <maz@kernel.org> Fixes: f227e3ec Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NWilly Tarreau <w@1wt.eu> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 30 7月, 2020 5 次提交
-
-
由 Chanwoo Choi 提交于
Until now, the devfreq driver using polling mode like simple_ondemand governor have used only deferrable timer for reducing the redundant power consumption. It reduces the CPU wake-up from idle due to polling mode which check the status of Non-CPU device. But, it has a problem for Non-CPU device like DMC device with DMA operation. Some Non-CPU device need to do monitor continuously regardless of CPU state in order to decide the proper next status of Non-CPU device. So, add support the delayed timer for polling mode to support the repetitive monitoring. The devfreq driver and user can select the kind of timer on either deferrable and delayed timer. For example, change the timer type of DMC device based on Exynos5422-based Odroid-XU3 as following: - If want to use deferrable timer as following: echo deferrable > /sys/class/devfreq/10c20000.memory-controller/timer - If want to use delayed timer as following: echo delayed > /sys/class/devfreq/10c20000.memory-controller/timer Reviewed-by: NBartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Reviewed-by: NLukasz Luba <lukasz.luba@arm.com> Signed-off-by: NChanwoo Choi <cw00.choi@samsung.com>
-
由 Andrzej Hajda 提交于
During probe every time driver gets resource it should usually check for error printk some message if it is not -EPROBE_DEFER and return the error. This pattern is simple but requires adding few lines after any resource acquisition code, as a result it is often omitted or implemented only partially. dev_err_probe helps to replace such code sequences with simple call, so code: if (err != -EPROBE_DEFER) dev_err(dev, ...); return err; becomes: return dev_err_probe(dev, err, ...); Signed-off-by: NAndrzej Hajda <a.hajda@samsung.com> Reviewed-by: NRafael J. Wysocki <rafael@kernel.org> Reviewed-by: NMark Brown <broonie@kernel.org> Reviewed-by: NAndy Shevchenko <andy.shevchenko@gmail.com> Link: https://lore.kernel.org/r/20200713144324.23654-2-a.hajda@samsung.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Linus Torvalds 提交于
It turns out that the plugin right now ends up being really unhappy about the change from 'static' to 'extern' storage that happened in commit f227e3ec ("random32: update the net random state on interrupt and activity"). This is probably a trivial fix for the latent_entropy plugin, but for now, just remove net_rand_state from the list of things the plugin worries about. Reported-by: NStephen Rothwell <sfr@canb.auug.org.au> Cc: Emese Revfy <re.emese@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Willy Tarreau 提交于
This modifies the first 32 bits out of the 128 bits of a random CPU's net_rand_state on interrupt or CPU activity to complicate remote observations that could lead to guessing the network RNG's internal state. Note that depending on some network devices' interrupt rate moderation or binding, this re-seeding might happen on every packet or even almost never. In addition, with NOHZ some CPUs might not even get timer interrupts, leaving their local state rarely updated, while they are running networked processes making use of the random state. For this reason, we also perform this update in update_process_times() in order to at least update the state when there is user or system activity, since it's the only case we care about. Reported-by: NAmit Klein <aksecurity@gmail.com> Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org> Cc: Eric Dumazet <edumazet@google.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: NWilly Tarreau <w@1wt.eu> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Neal Liu 提交于
Control Flow Integrity(CFI) is a security mechanism that disallows changes to the original control flow graph of a compiled binary, making it significantly harder to perform such attacks. init_state_node() assign same function callback to different function pointer declarations. static int init_state_node(struct cpuidle_state *idle_state, const struct of_device_id *matches, struct device_node *state_node) { ... idle_state->enter = match_id->data; ... idle_state->enter_s2idle = match_id->data; } Function declarations: struct cpuidle_state { ... int (*enter) (struct cpuidle_device *dev, struct cpuidle_driver *drv, int index); void (*enter_s2idle) (struct cpuidle_device *dev, struct cpuidle_driver *drv, int index); }; In this case, either enter() or enter_s2idle() would cause CFI check failed since they use same callee. Align function prototype of enter() since it needs return value for some use cases. The return value of enter_s2idle() is no need currently. Signed-off-by: NNeal Liu <neal.liu@mediatek.com> Reviewed-by: NSami Tolvanen <samitolvanen@google.com> Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
-