- 03 4月, 2020 40 次提交
-
-
由 Roman Gushchin 提交于
These functions are charging the given number of kernel pages to the given memory cgroup. The number doesn't have to be a power of two. Let's make them to take the unsigned int nr_pages as an argument instead of the page order. It makes them look consistent with the corresponding uncharge functions and functions like: mem_cgroup_charge_skmem(memcg, nr_pages). Signed-off-by: NRoman Gushchin <guro@fb.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Link: http://lkml.kernel.org/r/20200109202659.752357-5-guro@fb.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Roman Gushchin 提交于
Rename (__)memcg_kmem_(un)charge() into (__)memcg_kmem_(un)charge_page() to better reflect what they are actually doing: 1) call __memcg_kmem_(un)charge_memcg() to actually charge or uncharge the current memcg 2) set or clear the PageKmemcg flag Signed-off-by: NRoman Gushchin <guro@fb.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Link: http://lkml.kernel.org/r/20200109202659.752357-4-guro@fb.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Roman Gushchin 提交于
Drop the unused page argument and put the memcg pointer at the first place. This make the function consistent with its peers: __memcg_kmem_uncharge_memcg(), memcg_kmem_charge_memcg(), etc. Signed-off-by: NRoman Gushchin <guro@fb.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Link: http://lkml.kernel.org/r/20200109202659.752357-3-guro@fb.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Roman Gushchin 提交于
Patch series "mm: memcg: kmem API cleanup", v2. This patchset aims to clean up the kernel memory charging API. It doesn't bring any functional changes, just removes unused arguments, renames some functions and fixes some comments. Currently it's not obvious which functions are most basic (memcg_kmem_(un)charge_memcg()) and which are based on them (memcg_kmem_(un)charge()). The patchset renames these functions and removes unused arguments: TL;DR: was: memcg_kmem_charge_memcg(page, gfp, order, memcg) memcg_kmem_uncharge_memcg(memcg, nr_pages) memcg_kmem_charge(page, gfp, order) memcg_kmem_uncharge(page, order) now: memcg_kmem_charge(memcg, gfp, nr_pages) memcg_kmem_uncharge(memcg, nr_pages) memcg_kmem_charge_page(page, gfp, order) memcg_kmem_uncharge_page(page, order) This patch (of 6): The first argument of memcg_kmem_charge_memcg() and __memcg_kmem_charge_memcg() is the page pointer and it's not used. Let's drop it. Memcg pointer is passed as the last argument. Move it to the first place for consistency with other memcg functions, e.g. __memcg_kmem_uncharge_memcg() or try_charge(). Signed-off-by: NRoman Gushchin <guro@fb.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Link: http://lkml.kernel.org/r/20200109202659.752357-2-guro@fb.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Roman Gushchin 提交于
Sometimes we need to get a memcg pointer from a charged kernel object. The right way to get it depends on whether it's a proper slab object or it's backed by raw pages (e.g. it's a vmalloc alloction). In the first case the kmem_cache->memcg_params.memcg indirection should be used; in other cases it's just page->mem_cgroup. To simplify this task and hide the implementation details let's use the mem_cgroup_from_obj() helper, which takes a pointer to any kernel object and returns a valid memcg pointer or NULL. Passing a kernel address rather than a pointer to a page will allow to use this helper for per-object (rather than per-page) tracked objects in the future. The caller is still responsible to ensure that the returned memcg isn't going away underneath: take the rcu read lock, cgroup mutex etc; depending on the context. mem_cgroup_from_kmem() defined in mm/list_lru.c is now obsolete and can be removed. Signed-off-by: NRoman Gushchin <guro@fb.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NYafang Shao <laoar.shao@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Link: http://lkml.kernel.org/r/20200117203609.3146239-1-guro@fb.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill Tkhai 提交于
The shrinker_map may be touched from any cpu (e.g., a bit there may be set by a task running everywhere) but kswapd is always bound to specific node. So allocate shrinker_map from the related NUMA node to respect its NUMA locality. Also, this follows generic way we use for allocation of memcg's per-node data. Signed-off-by: NKirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NDavid Hildenbrand <david@redhat.com> Reviewed-by: NShakeel Butt <shakeelb@google.com> Reviewed-by: NRoman Gushchin <guro@fb.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Link: http://lkml.kernel.org/r/fff0e636-4c36-ed10-281c-8cdb0687c839@virtuozzo.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Yafang Shao 提交于
When I manually set default n to MEMCG_KMEM in init/Kconfig, bellow error occurs, mm/slab_common.c: In function 'memcg_slab_start': mm/slab_common.c:1530:30: error: 'struct mem_cgroup' has no member named 'kmem_caches' return seq_list_start(&memcg->kmem_caches, *pos); ^ mm/slab_common.c: In function 'memcg_slab_next': mm/slab_common.c:1537:32: error: 'struct mem_cgroup' has no member named 'kmem_caches' return seq_list_next(p, &memcg->kmem_caches, pos); ^ mm/slab_common.c: In function 'memcg_slab_show': mm/slab_common.c:1551:16: error: 'struct mem_cgroup' has no member named 'kmem_caches' if (p == memcg->kmem_caches.next) ^ CC arch/x86/xen/smp.o mm/slab_common.c: In function 'memcg_slab_start': mm/slab_common.c:1531:1: warning: control reaches end of non-void function [-Wreturn-type] } ^ mm/slab_common.c: In function 'memcg_slab_next': mm/slab_common.c:1538:1: warning: control reaches end of non-void function [-Wreturn-type] } ^ That's because kmem_caches is defined only when CONFIG_MEMCG_KMEM is set, while memcg_slab_start() will use it no matter CONFIG_MEMCG_KMEM is defined or not. By the way, the reason I mannuly undefined CONFIG_MEMCG_KMEM is to verify whether my some other code change is still stable when CONFIG_MEMCG_KMEM is not set. Unfortunately, the existing code has been already unstable since v4.11. Fixes: bc2791f8 ("slab: link memcg kmem_caches on their associated memory cgroup") Signed-off-by: NYafang Shao <laoar.shao@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Link: http://lkml.kernel.org/r/1580970260-2045-1-git-send-email-laoar.shao@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Wei Yang 提交于
add_to_swap_cache() and delete_from_swap_cache() are counterparts, while currently they use different ways to count pages. It doesn't break anything because we only have two sizes for PageAnon, but this is confusing and not good practice. This patch corrects it by making both functions use hpage_nr_pages(). Signed-off-by: NWei Yang <richard.weiyang@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Link: http://lkml.kernel.org/r/20200315012920.2687-1-richard.weiyang@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Yang Shi 提交于
Memory barrier is needed after setting LRU bit, but smp_mb() is too strong. Some architectures, i.e. x86, imply memory barrier with atomic operations, so replacing it with smp_mb__after_atomic() sounds better, which is nop on strong ordered machines, and full memory barriers on others. With this change the vm-scalability cases would perform better on x86, I saw total 6% improvement with this patch and previous inline fix. The test data (lru-file-readtwice throughput) against v5.6-rc4: mainline w/ inline fix w/ both (adding this) 150MB 154MB 159MB Fixes: 9c4e6b1a ("mm, mlock, vmscan: no more skipping pagevecs") Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Tested-by: NShakeel Butt <shakeelb@google.com> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Link: http://lkml.kernel.org/r/1584500541-46817-2-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Yang Shi 提交于
When backporting commit 9c4e6b1a ("mm, mlock, vmscan: no more skipping pagevecs") to our 4.9 kernel, our test bench noticed around 10% down with a couple of vm-scalability's test cases (lru-file-readonce, lru-file-readtwice and lru-file-mmap-read). I didn't see that much down on my VM (32c-64g-2nodes). It might be caused by the test configuration, which is 32c-256g with NUMA disabled and the tests were run in root memcg, so the tests actually stress only one inactive and active lru. It sounds not very usual in mordern production environment. That commit did two major changes: 1. Call page_evictable() 2. Use smp_mb to force the PG_lru set visible It looks they contribute the most overhead. The page_evictable() is a function which does function prologue and epilogue, and that was used by page reclaim path only. However, lru add is a very hot path, so it sounds better to make it inline. However, it calls page_mapping() which is not inlined either, but the disassemble shows it doesn't do push and pop operations and it sounds not very straightforward to inline it. Other than this, it sounds smp_mb() is not necessary for x86 since SetPageLRU is atomic which enforces memory barrier already, replace it with smp_mb__after_atomic() in the following patch. With the two fixes applied, the tests can get back around 5% on that test bench and get back normal on my VM. Since the test bench configuration is not that usual and I also saw around 6% up on the latest upstream, so it sounds good enough IMHO. The below is test data (lru-file-readtwice throughput) against the v5.6-rc4: mainline w/ inline fix 150MB 154MB With this patch the throughput gets 2.67% up. The data with using smp_mb__after_atomic() is showed in the following patch. Shakeel Butt did the below test: On a real machine with limiting the 'dd' on a single node and reading 100 GiB sparse file (less than a single node). Just ran a single instance to not cause the lru lock contention. The cmdline used is "dd if=file-100GiB of=/dev/null bs=4k". Ran the cmd 10 times with drop_caches in between and measured the time it took. Without patch: 56.64143 +- 0.672 sec With patches: 56.10 +- 0.21 sec [akpm@linux-foundation.org: move page_evictable() to internal.h] Fixes: 9c4e6b1a ("mm, mlock, vmscan: no more skipping pagevecs") Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Tested-by: NShakeel Butt <shakeelb@google.com> Reviewed-by: NShakeel Butt <shakeelb@google.com> Reviewed-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Acked-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Link: http://lkml.kernel.org/r/1584500541-46817-1-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Wei Yang 提交于
Currently we use a tmp pointer, pentry, to transfer and reset swap cache slot, which is a little redundant. Swap cache slot stores the entry value directly, assign and reset it by value would be straight forward. Also this patch merges the else and if, since this is the only case we refill and repeat swap cache. Signed-off-by: NWei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NTim Chen <tim.c.chen@linux.intel.com> Link: http://lkml.kernel.org/r/20200311055352.50574-1-richard.weiyang@linux.alibaba.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Qian Cai 提交于
si->inuse_pages could be accessed concurrently as noticed by KCSAN, write to 0xffff98b00ebd04dc of 4 bytes by task 82262 on cpu 92: swap_range_free+0xbe/0x230 swap_range_free at mm/swapfile.c:719 swapcache_free_entries+0x1be/0x250 free_swap_slot+0x1c8/0x220 __swap_entry_free.constprop.19+0xa3/0xb0 free_swap_and_cache+0x53/0xa0 unmap_page_range+0x7e0/0x1ce0 unmap_single_vma+0xcd/0x170 unmap_vmas+0x18b/0x220 exit_mmap+0xee/0x220 mmput+0xe7/0x240 do_exit+0x598/0xfd0 do_group_exit+0x8b/0x180 get_signal+0x293/0x13d0 do_signal+0x37/0x5d0 prepare_exit_to_usermode+0x1b7/0x2c0 ret_from_intr+0x32/0x42 read to 0xffff98b00ebd04dc of 4 bytes by task 82499 on cpu 46: try_to_unuse+0x86b/0xc80 try_to_unuse at mm/swapfile.c:2185 __x64_sys_swapoff+0x372/0xd40 do_syscall_64+0x91/0xb05 entry_SYSCALL_64_after_hwframe+0x49/0xbe The plain reads in try_to_unuse() are outside si->lock critical section which result in data races that could be dangerous to be used in a loop. Fix them by adding READ_ONCE(). Signed-off-by: NQian Cai <cai@lca.pw> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Marco Elver <elver@google.com> Cc: Hugh Dickins <hughd@google.com> Link: http://lkml.kernel.org/r/1582578903-29294-1-git-send-email-cai@lca.pwSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Wei Yang 提交于
__pagevec_lru_add() is only used in mm directory now. Remove the export symbol. Signed-off-by: NWei Yang <richardw.yang@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200126011436.22979-1-richardw.yang@linux.intel.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Chen Wandun 提交于
The -EEXIST returned by __swap_duplicate means there is a swap cache instead -EBUSY Signed-off-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200212145754.27123-1-chenwandun@huawei.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Pingfan Liu 提交于
FOLL_LONGTERM is a special case of FOLL_PIN. It suggests a pin which is going to be given to hardware and can't move. It would truncate CMA permanently and should be excluded. In gup slow path, where __gup_longterm_locked->check_and_migrate_cma_pages() handles FOLL_LONGTERM, but in fast path, there lacks such a check, which means a possible leak of CMA page to longterm pinned. Place a check in try_grab_compound_head() in the fast path to fix the leak, and if FOLL_LONGTERM happens on CMA, it will fall back to slow path to migrate the page. Some note about the check: Huge page's subpages have the same migrate type due to either allocation from a free_list[] or alloc_contig_range() with param MIGRATE_MOVABLE. So it is enough to check on a single subpage by is_migrate_cma_page(subpage) Signed-off-by: NPingfan Liu <kernelfans@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NJason Gunthorpe <jgg@mellanox.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Link: http://lkml.kernel.org/r/1584876733-17405-3-git-send-email-kernelfans@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Pingfan Liu 提交于
To better reflect the held state of pages and make code self-explaining, rename nr as nr_pinned. Signed-off-by: NPingfan Liu <kernelfans@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NJohn Hubbard <jhubbard@nvidia.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Link: http://lkml.kernel.org/r/1584876733-17405-2-git-send-email-kernelfans@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Claudio Imbrenda 提交于
With the introduction of protected KVM guests on s390 there is now a concept of inaccessible pages. These pages need to be made accessible before the host can access them. While cpu accesses will trigger a fault that can be resolved, I/O accesses will just fail. We need to add a callback into architecture code for places that will do I/O, namely when writeback is started or when a page reference is taken. This is not only to enable paging, file backing etc, it is also necessary to protect the host against a malicious user space. For example a bad QEMU could simply start direct I/O on such protected memory. We do not want userspace to be able to trigger I/O errors and thus the logic is "whenever somebody accesses that page (gup) or does I/O, make sure that this page can be accessed". When the guest tries to access that page we will wait in the page fault handler for writeback to have finished and for the page_ref to be the expected value. On s390x the function is not supposed to fail, so it is ok to use a WARN_ON on failure. If we ever need some more finegrained handling we can tackle this when we know the details. Signed-off-by: NClaudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NDavid Hildenbrand <david@redhat.com> Reviewed-by: NChristian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: NJohn Hubbard <jhubbard@nvidia.com> Acked-by: NWill Deacon <will@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/20200306132537.783769-3-imbrenda@linux.ibm.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 John Hubbard 提交于
As part of pin_user_pages() and related API calls, pages are "dma-pinned". For the case of compound pages of order > 1, the per-page accounting of dma pins is accomplished via the 3rd struct page in the compound page. In order to support debugging of any pin_user_pages()- related problems, enhance dump_page() so as to report the pin count in that case. Documentation/core-api/pin_user_pages.rst is also updated accordingly. Signed-off-by: NJohn Hubbard <jhubbard@nvidia.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/20200211001536.1027652-13-jhubbard@nvidia.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Matthew Wilcox (Oracle) 提交于
There was no protection against a corrupted struct page having an implausible compound_head(). Sanity check that a compound page has a head within reach of the maximum allocatable page (this will need to be adjusted if one of the plans to allocate 1GB pages comes to fruition). In addition, - Print the mapping pointer using %p insted of %px. The actual value of the pointer can be read out of the raw page dump and using %p gives a chance to correlate it with an earlier printk of the mapping pointer - Print the mapping pointer from the head page, not the tail page (the tail ->mapping pointer may be in use for other purposes, eg part of a list_head) - Print the order of the page for compound pages - Dump the raw head page as well as the raw page - Print the refcount from the head page, not the tail page Suggested-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Co-developed-by: NJohn Hubbard <jhubbard@nvidia.com> Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: NJohn Hubbard <jhubbard@nvidia.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/20200211001536.1027652-12-jhubbard@nvidia.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 John Hubbard 提交于
Up until now, gup_benchmark supported testing of the following kernel functions: * get_user_pages(): via the '-U' command line option * get_user_pages_longterm(): via the '-L' command line option * get_user_pages_fast(): as the default (no options required) Add test coverage for the new corresponding pin_*() functions: * pin_user_pages_fast(): via the '-a' command line option * pin_user_pages(): via the '-b' command line option Also, add an option for clarity: '-u' for what is now (still) the default choice: get_user_pages_fast(). Also, for the commands that set FOLL_PIN, verify that the pages really are dma-pinned, via the new is_dma_pinned() routine. Those commands are: PIN_FAST_BENCHMARK : calls pin_user_pages_fast() PIN_BENCHMARK : calls pin_user_pages() In between the calls to pin_*() and unpin_user_pages(), check each page: if page_maybe_dma_pinned() returns false, then WARN and return. Do this outside of the benchmark timestamps, so that it doesn't affect reported times. Signed-off-by: NJohn Hubbard <jhubbard@nvidia.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NIra Weiny <ira.weiny@intel.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/20200211001536.1027652-10-jhubbard@nvidia.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 John Hubbard 提交于
Now that pages are "DMA-pinned" via pin_user_page*(), and unpinned via unpin_user_pages*(), we need some visibility into whether all of this is working correctly. Add two new fields to /proc/vmstat: nr_foll_pin_acquired nr_foll_pin_released These are documented in Documentation/core-api/pin_user_pages.rst. They represent the number of pages (since boot time) that have been pinned ("nr_foll_pin_acquired") and unpinned ("nr_foll_pin_released"), via pin_user_pages*() and unpin_user_pages*(). In the absence of long-running DMA or RDMA operations that hold pages pinned, the above two fields will normally be equal to each other. Also: update Documentation/core-api/pin_user_pages.rst, to remove an earlier (now confirmed untrue) claim about a performance problem with /proc/vmstat. Also: update Documentation/core-api/pin_user_pages.rst to rename the new /proc/vmstat entries, to the names listed here. Signed-off-by: NJohn Hubbard <jhubbard@nvidia.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NJan Kara <jack@suse.cz> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/20200211001536.1027652-9-jhubbard@nvidia.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 John Hubbard 提交于
For huge pages (and in fact, any compound page), the GUP_PIN_COUNTING_BIAS scheme tends to overflow too easily, each tail page increments the head page->_refcount by GUP_PIN_COUNTING_BIAS (1024). That limits the number of huge pages that can be pinned. This patch removes that limitation, by using an exact form of pin counting for compound pages of order > 1. The "order > 1" is required because this approach uses the 3rd struct page in the compound page, and order 1 compound pages only have two pages, so that won't work there. A new struct page field, hpage_pinned_refcount, has been added, replacing a padding field in the union (so no new space is used). This enhancement also has a useful side effect: huge pages and compound pages (of order > 1) do not suffer from the "potential false positives" problem that is discussed in the page_dma_pinned() comment block. That is because these compound pages have extra space for tracking things, so they get exact pin counts instead of overloading page->_refcount. Documentation/core-api/pin_user_pages.rst is updated accordingly. Suggested-by: NJan Kara <jack@suse.cz> Signed-off-by: NJohn Hubbard <jhubbard@nvidia.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NJan Kara <jack@suse.cz> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/20200211001536.1027652-8-jhubbard@nvidia.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 John Hubbard 提交于
Add tracking of pages that were pinned via FOLL_PIN. This tracking is implemented via overloading of page->_refcount: pins are added by adding GUP_PIN_COUNTING_BIAS (1024) to the refcount. This provides a fuzzy indication of pinning, and it can have false positives (and that's OK). Please see the pre-existing Documentation/core-api/pin_user_pages.rst for details. As mentioned in pin_user_pages.rst, callers who effectively set FOLL_PIN (typically via pin_user_pages*()) are required to ultimately free such pages via unpin_user_page(). Please also note the limitation, discussed in pin_user_pages.rst under the "TODO: for 1GB and larger huge pages" section. (That limitation will be removed in a following patch.) The effect of a FOLL_PIN flag is similar to that of FOLL_GET, and may be thought of as "FOLL_GET for DIO and/or RDMA use". Pages that have been pinned via FOLL_PIN are identifiable via a new function call: bool page_maybe_dma_pinned(struct page *page); What to do in response to encountering such a page, is left to later patchsets. There is discussion about this in [1], [2], [3], and [4]. This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask(). [1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/ [2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/ [3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/ [4] LWN kernel index: get_user_pages(): https://lwn.net/Kernel/Index/#Memory_management-get_user_pages [jhubbard@nvidia.com: add kerneldoc] Link: http://lkml.kernel.org/r/20200307021157.235726-1-jhubbard@nvidia.com [imbrenda@linux.ibm.com: if pin fails, we need to unpin, a simple put_page will not be enough] Link: http://lkml.kernel.org/r/20200306132537.783769-2-imbrenda@linux.ibm.com [akpm@linux-foundation.org: fix put_compound_head defined but not used] Suggested-by: NJan Kara <jack@suse.cz> Suggested-by: NJérôme Glisse <jglisse@redhat.com> Signed-off-by: NJohn Hubbard <jhubbard@nvidia.com> Signed-off-by: NClaudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NJan Kara <jack@suse.cz> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/20200211001536.1027652-7-jhubbard@nvidia.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 John Hubbard 提交于
Internal to mm/gup.c, require that get_user_pages_fast() and __get_user_pages_fast() identify themselves, by setting FOLL_GET. This is required in order to be able to make decisions based on "FOLL_PIN, or FOLL_GET, or both or neither are set", in upcoming patches. Signed-off-by: NJohn Hubbard <jhubbard@nvidia.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NJan Kara <jack@suse.cz> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/20200211001536.1027652-6-jhubbard@nvidia.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 John Hubbard 提交于
In preparation for an upcoming patch, send gup flags args to two more routines: put_compound_head(), and undo_dev_pagemap(). Signed-off-by: NJohn Hubbard <jhubbard@nvidia.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NJan Kara <jack@suse.cz> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/20200211001536.1027652-5-jhubbard@nvidia.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 John Hubbard 提交于
A subsequent patch requires access to gup flags, so pass the flags argument through to the __gup_device_* functions. Also placate checkpatch.pl by shortening a nearby line. Signed-off-by: NJohn Hubbard <jhubbard@nvidia.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NJan Kara <jack@suse.cz> Reviewed-by: NJérôme Glisse <jglisse@redhat.com> Reviewed-by: NIra Weiny <ira.weiny@intel.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/20200211001536.1027652-3-jhubbard@nvidia.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 John Hubbard 提交于
Patch series "mm/gup: track FOLL_PIN pages", v6. This activates tracking of FOLL_PIN pages. This is in support of fixing the get_user_pages()+DMA problem described in [1]-[4]. FOLL_PIN support is now in the main linux tree. However, the patch to use FOLL_PIN to track pages was *not* submitted, because Leon saw an RDMA test suite failure that involved (I think) page refcount overflows when huge pages were used. This patch definitively solves that kind of overflow problem, by adding an exact pincount, for compound pages (of order > 1), in the 3rd struct page of a compound page. If available, that form of pincounting is used, instead of the GUP_PIN_COUNTING_BIAS approach. Thanks again to Jan Kara for that idea. Other interesting changes: * dump_page(): added one, or two new things to report for compound pages: head refcount (for all compound pages), and map_pincount (for compound pages of order > 1). * Documentation/core-api/pin_user_pages.rst: removed the "TODO" for the huge page refcount upper limit problems, and added notes about how it works now. Also added a note about the dump_page() enhancements. * Added some comments in gup.c and mm.h, to explain that there are two ways to count pinned pages: exact (for compound pages of order > 1) and fuzzy (GUP_PIN_COUNTING_BIAS: for all other pages). ============================================================ General notes about the tracking patch: This is a prerequisite to solving the problem of proper interactions between file-backed pages, and [R]DMA activities, as discussed in [1], [2], [3], [4] and in a remarkable number of email threads since about 2017. :) In contrast to earlier approaches, the page tracking can be incrementally applied to the kernel call sites that, until now, have been simply calling get_user_pages() ("gup"). In other words, opt-in by changing from this: get_user_pages() (sets FOLL_GET) put_page() to this: pin_user_pages() (sets FOLL_PIN) unpin_user_page() ============================================================ Future steps: * Convert more subsystems from get_user_pages() to pin_user_pages(). The first probably needs to be bio/biovecs, because any filesystem testing is too difficult without those in place. * Change VFS and filesystems to respond appropriately when encountering dma-pinned pages. * Work with Ira and others to connect this all up with file system leases. [1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/ [2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/ [3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/ [4] LWN kernel index: get_user_pages() https://lwn.net/Kernel/Index/#Memory_management-get_user_pages This patch (of 12): An upcoming patch requires reusing the implementation of get_user_pages_remote(). Split up get_user_pages_remote() into an outer routine that checks flags, and an implementation routine that will be reused. This makes subsequent changes much easier to understand. There should be no change in behavior due to this patch. Signed-off-by: NJohn Hubbard <jhubbard@nvidia.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NJan Kara <jack@suse.cz> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/20200211001536.1027652-2-jhubbard@nvidia.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Matthew Wilcox (Oracle) 提交于
- These were never called PCG flags; they've been called FGP flags since their introduction in 2014. - The FGP_FOR_MMAP flag was misleadingly documented as if it was an alternative to FGP_CREAT instead of an option to it. - Rename the 'offset' parameter to 'index'. - Capitalisation, formatting, rewording. Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NChristoph Hellwig <hch@lst.de> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Link: http://lkml.kernel.org/r/20200318140253.6141-9-willy@infradead.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Matthew Wilcox (Oracle) 提交于
No in-tree users (proc, madvise, memcg, mincore) can be built as a module. Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NChristoph Hellwig <hch@lst.de> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Link: http://lkml.kernel.org/r/20200318140253.6141-8-willy@infradead.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Matthew Wilcox (Oracle) 提交于
Dumping the page information in this circumstance helps for debugging. Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NChristoph Hellwig <hch@lst.de> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Link: http://lkml.kernel.org/r/20200318140253.6141-7-willy@infradead.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Matthew Wilcox (Oracle) 提交于
Use VM_FAULT_OOM instead of indirecting through vmf_error(-ENOMEM). Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NChristoph Hellwig <hch@lst.de> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Link: http://lkml.kernel.org/r/20200318140253.6141-2-willy@infradead.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Souptick Joarder 提交于
The first argument of shrink_readahead_size_eio() is not used. Hence remove it from the function definition and from all the callers. Signed-off-by: NSouptick Joarder <jrdr.linux@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1583868093-24342-1-git-send-email-jrdr.linux@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Xianting Tian 提交于
Mount failure issue happens under the scenario: Application forked dozens of threads to mount the same number of cramfs images separately in docker, but several mounts failed with high probability. Mount failed due to the checking result of the page(read from the superblock of loop dev) is not uptodate after wait_on_page_locked(page) returned in function cramfs_read: wait_on_page_locked(page); if (!PageUptodate(page)) { ... } The reason of the checking result of the page not uptodate: systemd-udevd read the loopX dev before mount, because the status of loopX is Lo_unbound at this time, so loop_make_request directly trigger the calling of io_end handler end_buffer_async_read, which called SetPageError(page). So It caused the page can't be set to uptodate in function end_buffer_async_read: if(page_uptodate && !PageError(page)) { SetPageUptodate(page); } Then mount operation is performed, it used the same page which is just accessed by systemd-udevd above, Because this page is not uptodate, it will launch a actual read via submit_bh, then wait on this page by calling wait_on_page_locked(page). When the I/O of the page done, io_end handler end_buffer_async_read is called, because no one cleared the page error(during the whole read path of mount), which is caused by systemd-udevd reading, so this page is still in "PageError" status, which can't be set to uptodate in function end_buffer_async_read, then caused mount failure. But sometimes mount succeed even through systemd-udeved read loopX dev just before, The reason is systemd-udevd launched other loopX read just between step 3.1 and 3.2, the steps as below: 1, loopX dev default status is Lo_unbound; 2, systemd-udved read loopX dev (page is set to PageError); 3, mount operation 1) set loopX status to Lo_bound; ==>systemd-udevd read loopX dev<== 2) read loopX dev(page has no error) 3) mount succeed As the loopX dev status is set to Lo_bound after step 3.1, so the other loopX dev read by systemd-udevd will go through the whole I/O stack, part of the call trace as below: SYS_read vfs_read do_sync_read blkdev_aio_read generic_file_aio_read do_generic_file_read: ClearPageError(page); mapping->a_ops->readpage(filp, page); here, mapping->a_ops->readpage() is blkdev_readpage. In latest kernel, some function name changed, the call trace as below: blkdev_read_iter generic_file_read_iter generic_file_buffered_read: /* * A previous I/O error may have been due to temporary * failures, eg. mutipath errors. * Pg_error will be set again if readpage fails. */ ClearPageError(page); /* Start the actual read. The read will unlock the page*/ error=mapping->a_ops->readpage(flip, page); We can see ClearPageError(page) is called before the actual read, then the read in step 3.2 succeed. This patch is to add the calling of ClearPageError just before the actual read of read path of cramfs mount. Without the patch, the call trace as below when performing cramfs mount: do_mount cramfs_read cramfs_blkdev_read read_cache_page do_read_cache_page: filler(data, page); or mapping->a_ops->readpage(data, page); With the patch, the call trace as below when performing mount: do_mount cramfs_read cramfs_blkdev_read read_cache_page: do_read_cache_page: ClearPageError(page); <== new add filler(data, page); or mapping->a_ops->readpage(data, page); With the patch, mount operation trigger the calling of ClearPageError(page) before the actual read, the page has no error if no additional page error happen when I/O done. Signed-off-by: NXianting Tian <xianting_tian@126.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Cc: Jan Kara <jack@suse.cz> Cc: <yubin@h3c.com> Link: http://lkml.kernel.org/r/1583318844-22971-1-git-send-email-xianting_tian@126.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
There used to be a 'retry' label in between the two (identical) checks when first introduced in commit f446daae ("mm: implement writeback livelock avoidance using page tagging"), and later modified/updated in commit 6e6938b6 ("writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage"). The label has been removed in commit 64081362 ("mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock"), and the (identical) checks are now present / performed immediately one after another. So, remove/deduplicate the latter check, moving tag_pages_for_writeback() into the former check before the 'tag' variable assignment, so it's clear that it's not used in this (similarly-named) function call but only later in pagevec_lookup_range_tag(). Signed-off-by: NMauricio Faria de Oliveira <mfo@canonical.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NIra Weiny <ira.weiny@intel.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Jan Kara <jack@suse.cz> Link: http://lkml.kernel.org/r/20200218221716.1648-1-mfo@canonical.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jan Kara 提交于
When handling a page fault, we drop mmap_sem to start async readahead so that we don't block on IO submission with mmap_sem held. However there's no point to drop mmap_sem in case readahead is disabled. Handle that case to avoid pointless dropping of mmap_sem and retrying the fault. This was actually reported to block mlockall(MCL_CURRENT) indefinitely. Fixes: 6b4c9f44 ("filemap: drop the mmap_sem for all blocking operations") Reported-by: NMinchan Kim <minchan@kernel.org> Reported-by: NRobert Stupp <snazy@gmx.de> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NMinchan Kim <minchan@kernel.org> Link: http://lkml.kernel.org/r/20200212101356.30759-1-jack@suse.czSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Qian Cai 提交于
Kmemleak could scan task stacks while plain writes happens to those stack variables which could results in data races. For example, in sys_rt_sigaction and do_sigaction(), it could have plain writes in a 32-byte size. Since the kmemleak does not care about the actual values of a non-pointer and all do_sigaction() call sites only copy to stack variables, just disable KCSAN for kmemleak to avoid annotating anything outside Kmemleak just because Kmemleak scans everything. Suggested-by: NMarco Elver <elver@google.com> Signed-off-by: NQian Cai <cai@lca.pw> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NMarco Elver <elver@google.com> Acked-by: NCatalin Marinas <catalin.marinas@arm.com> Link: http://lkml.kernel.org/r/1583263716-25150-1-git-send-email-cai@lca.pwSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nathan Chancellor 提交于
Clang warns: mm/kmemleak.c:1955:28: warning: array comparison always evaluates to a constant [-Wtautological-compare] if (__start_ro_after_init < _sdata || __end_ro_after_init > _edata) ^ mm/kmemleak.c:1955:60: warning: array comparison always evaluates to a constant [-Wtautological-compare] if (__start_ro_after_init < _sdata || __end_ro_after_init > _edata) These are not true arrays, they are linker defined symbols, which are just addresses. Using the address of operator silences the warning and does not change the resulting assembly with either clang/ld.lld or gcc/ld (tested with diff + objdump -Dr). Suggested-by: NNick Desaulniers <ndesaulniers@google.com> Signed-off-by: NNathan Chancellor <natechancellor@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NCatalin Marinas <catalin.marinas@arm.com> Link: https://github.com/ClangBuiltLinux/linux/issues/895 Link: http://lkml.kernel.org/r/20200220051551.44000-1-natechancellor@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vlastimil Babka 提交于
This reverts commit ad2c8144. The function node_to_mem_node() was introduced by that commit for use in SLUB on systems with memoryless nodes, but it turned out to be unreliable on some architectures/configurations and a simpler solution exists than fixing it up. Thus commit 0715e6c5 ("mm, slub: prevent kmalloc_node crashes and memory leaks") removed the only user of node_to_mem_node() and we can revert the commit that introduced the function. Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Bharata B Rao <bharata@linux.ibm.com> Cc: Christopher Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: PUVICHAKRAVARTHY RAMACHANDRAN <puvichakravarthy@in.ibm.com> Cc: Sachin Sant <sachinp@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20200320115533.9604-2-vbabka@suse.czSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kees Cook 提交于
In a recent discussion[1] with Vitaly Nikolenko and Silvio Cesare, it became clear that moving the freelist pointer away from the edge of allocations would likely improve the overall defensive posture of the inline freelist pointer. My benchmarks show no meaningful change to performance (they seem to show it being faster), so this looks like a reasonable change to make. Instead of having the freelist pointer at the very beginning of an allocation (offset 0) or at the very end of an allocation (effectively offset -sizeof(void *) from the next allocation), move it away from the edges of the allocation and into the middle. This provides some protection against small-sized neighboring overflows (or underflows), for which the freelist pointer is commonly the target. (Large or well controlled overwrites are much more likely to attack live object contents, instead of attempting freelist corruption.) The vaunted kernel build benchmark, across 5 runs. Before: Mean: 250.05 Std Dev: 1.85 and after, which appears mysteriously faster: Mean: 247.13 Std Dev: 0.76 Attempts at running "sysbench --test=memory" show the change to be well in the noise (sysbench seems to be pretty unstable here -- it's not really measuring allocation). Hackbench is more allocation-heavy, and while the std dev is above the difference, it looks like may manifest as an improvement as well: 20 runs of "hackbench -g 20 -l 1000", before: Mean: 36.322 Std Dev: 0.577 and after: Mean: 36.056 Std Dev: 0.598 [1] https://twitter.com/vnik5287/status/1235113523098685440Signed-off-by: NKees Cook <keescook@chromium.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NChristoph Lameter <cl@linux.com> Cc: Vitaly Nikolenko <vnik@duasynt.com> Cc: Silvio Cesare <silvio.cesare@gmail.com> Cc: Christoph Lameter <cl@linux.com>Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Link: http://lkml.kernel.org/r/202003051624.AAAC9AECC@keescookSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kees Cook 提交于
Under CONFIG_SLAB_FREELIST_HARDENED=y, the obfuscation was relatively weak in that the ptr and ptr address were usually so close that the first XOR would result in an almost entirely 0-byte value[1], leaving most of the "secret" number ultimately being stored after the third XOR. A single blind memory content exposure of the freelist was generally sufficient to learn the secret. Add a swab() call to mix bits a little more. This is a cheap way (1 cycle) to make attacks need more than a single exposure to learn the secret (or to know _where_ the exposure is in memory). kmalloc-32 freelist walk, before: ptr ptr_addr stored value secret ffff90c22e019020@ffff90c22e019000 is 86528eb656b3b5bd (86528eb656b3b59d) ffff90c22e019040@ffff90c22e019020 is 86528eb656b3b5fd (86528eb656b3b59d) ffff90c22e019060@ffff90c22e019040 is 86528eb656b3b5bd (86528eb656b3b59d) ffff90c22e019080@ffff90c22e019060 is 86528eb656b3b57d (86528eb656b3b59d) ffff90c22e0190a0@ffff90c22e019080 is 86528eb656b3b5bd (86528eb656b3b59d) ... after: ptr ptr_addr stored value secret ffff9eed6e019020@ffff9eed6e019000 is 793d1135d52cda42 (86528eb656b3b59d) ffff9eed6e019040@ffff9eed6e019020 is 593d1135d52cda22 (86528eb656b3b59d) ffff9eed6e019060@ffff9eed6e019040 is 393d1135d52cda02 (86528eb656b3b59d) ffff9eed6e019080@ffff9eed6e019060 is 193d1135d52cdae2 (86528eb656b3b59d) ffff9eed6e0190a0@ffff9eed6e019080 is f93d1135d52cdac2 (86528eb656b3b59d) [1] https://blog.infosectcbr.com.au/2020/03/weaknesses-in-linux-kernel-heap.html Fixes: 2482ddec ("mm: add SLUB free list pointer obfuscation") Reported-by: NSilvio Cesare <silvio.cesare@gmail.com> Signed-off-by: NKees Cook <keescook@chromium.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/202003051623.AF4F8CB@keescookSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-