- 16 4月, 2020 6 次提交
-
-
由 Xu Yu 提交于
to #26424368 This introduces the new bool kconfig MEMSLI, determining whether the memsli (memory latency histogram) feature should be built-in or not. Signed-off-by: NXu Yu <xuyu@linux.alibaba.com> Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com> Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
-
由 Xu Yu 提交于
to #26424368 Since memsli also records latency histogram for swapout and swapin, which are NOT in the slow memory path, the overhead of memsli could be nonnegligible in some specific scenarios. For example, in scenarios with frequent swapping out and in, memsli could introduce overhead of ~1% of total run time of the synthetic testcase. This adds procfs interface for memsli switch. The memsli feature is enabled by default, and you can now disable it by: $ echo 0 > /proc/memsli/enabled Apparently, you can check current memsli switch status by: $ cat /proc/memsli/enabled Note that disabling memsli at runtime will NOT clear the existing latency histogram. You still need to manually reset the specified latency histogram(s) by echo 0 into the corresponding cgroup control file(s). Signed-off-by: NXu Yu <xuyu@linux.alibaba.com> Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com> Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
-
由 Xu Yu 提交于
to #26424368 Probe and calculate the latency of global swapout, memcg swapout and swapin respectively, and then group into the latency histogram in struct mem_cgroup. Note that the latency in each memcg is aggregated from all child memcgs. Usage: $ cat memory.direct_swapout_global_latency 0-1ms: 98313 1-5ms: 0 5-10ms: 0 10-100ms: 0 100-500ms: 0 500-1000ms: 0 >=1000ms: 0 total(ms): 52 Each line is the count of global swapout within the appropriate latency range. To clear the latency histogram: $ echo 0 > memory.direct_swapout_global_latency $ cat memory.direct_swapout_global_latency 0-1ms: 0 1-5ms: 0 5-10ms: 0 10-100ms: 0 100-500ms: 0 500-1000ms: 0 >=1000ms: 0 total(ms): 0 The usage of memory.direct_swapout_memcg_latency and memory.direct_swapin_latency is the same as memory.direct_swapout_global_latency. Signed-off-by: NXu Yu <xuyu@linux.alibaba.com> Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com> Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
-
由 Xu Yu 提交于
to #26424368 There are some duplicate codes in the original implementation of memory latency histogram, such as {x, y, z}_show, and {x, y, z}_write, where x, y, z represents various types of memory latency. This reworks common codes of memory latency histogram to make it easier to add more types of memory latency later. Signed-off-by: NXu Yu <xuyu@linux.alibaba.com> Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com> Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
-
由 Xu Yu 提交于
to #26424368 Probe and calculate the latency of direct compact, and then group into the latency histogram in struct mem_cgroup. Note that the latency in each memcg is aggregated from all child memcgs. Usage: $ cat memory.direct_compact_latency 0-1ms: 1176 1-5ms: 259 5-10ms: 17 10-100ms: 10 100-500ms: 0 500-1000ms: 0 >=1000ms: 0 total(ms): 921 Each line is the count of direct compact within the appropriate latency range. To clear the latency histogram: $ echo 0 > memory.direct_compact_latency $ cat memory.direct_compact_latency 0-1ms: 0 1-5ms: 0 5-10ms: 0 10-100ms: 0 100-500ms: 0 500-1000ms: 0 >=1000ms: 0 total(ms): 0 Signed-off-by: NXu Yu <xuyu@linux.alibaba.com> Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
-
由 Xu Yu 提交于
to #26424368 Probe and calculate the latency of global direct reclaim and memcg direct reclaim, respectively, and then group into the latency histogram in struct mem_cgroup. Besides, the total latency is accumulated each time the histogram is updated. Note that the latency in each memcg is aggregated from all child memcgs. Usage: $ cat memory.direct_reclaim_global_latency 0-1ms: 228 1-5ms: 283 5-10ms: 0 10-100ms: 0 100-500ms: 0 500-1000ms: 0 >=1000ms: 0 total(ms): 539 Each line is the count of global direct reclaim within the appropriate latency range. To clear the latency histogram: $ echo 0 > memory.direct_reclaim_global_latency $ cat memory.direct_reclaim_global_latency 0-1ms: 0 1-5ms: 0 5-10ms: 0 10-100ms: 0 100-500ms: 0 500-1000ms: 0 >=1000ms: 0 total(ms): 0 The usage of memory.direct_reclaim_memcg_latency is the same as memory.direct_reclaim_global_latency. Signed-off-by: NXu Yu <xuyu@linux.alibaba.com> Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
-
- 13 4月, 2020 5 次提交
-
-
由 Alexander Duyck 提交于
to #26589565 In order to keep ourselves from reporting pages that are just going to be reused again in the case of heavy churn we can put a limit on how many total pages we will process per pass. Doing this will allow the worker thread to go into idle much more quickly so that we avoid competing with other threads that might be allocating or freeing pages. The logic added here will limit the worker thread to no more than one sixteenth of the total free pages in a given area per list. Once that limit is reached it will update the state so that at the end of the pass we will reschedule the worker to try again in 2 seconds when the memory churn has hopefully settled down. Again this optimization doesn't show much of a benefit in the standard case as the memory churn is minmal. However with page allocator shuffling enabled the gain is quite noticeable. Below are the results with a THP enabled version of the will-it-scale page_fault1 test showing the improvement in iterations for 16 processes or threads. Without: tasks processes processes_idle threads threads_idle 16 8283274.75 0.17 5594261.00 38.15 With: tasks processes processes_idle threads threads_idle 16 8767010.50 0.21 5791312.75 36.98 Link: http://lkml.kernel.org/r/20200211224719.29318.72113.stgit@localhost.localdomainSigned-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nitesh Narayan Lal <nitesh@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pagupta@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Wang <wei.w.wang@intel.com> Cc: Yang Zhang <yang.zhang.wz@gmail.com> Cc: wei qi <weiqi4@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com> Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com> Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
-
由 Alexander Duyck 提交于
to #26589565 In order to pave the way for free page reporting in virtualized environments we will need a way to get pages out of the free lists and identify those pages after they have been returned. To accomplish this, this patch adds the concept of a Reported Buddy, which is essentially meant to just be the Uptodate flag used in conjunction with the Buddy page type. To prevent the reported pages from leaking outside of the buddy lists I added a check to clear the PageReported bit in the del_page_from_free_list function. As a result any reported page that is split, merged, or allocated will have the flag cleared prior to the PageBuddy value being cleared. The process for reporting pages is fairly simple. Once we free a page that meets the minimum order for page reporting we will schedule a worker thread to start 2s or more in the future. That worker thread will begin working from the lowest supported page reporting order up to MAX_ORDER - 1 pulling unreported pages from the free list and storing them in the scatterlist. When processing each individual free list it is necessary for the worker thread to release the zone lock when it needs to stop and report the full scatterlist of pages. To reduce the work of the next iteration the worker thread will rotate the free list so that the first unreported page in the free list becomes the first entry in the list. It will then call a reporting function providing information on how many entries are in the scatterlist. Once the function completes it will return the pages to the free area from which they were allocated and start over pulling more pages from the free areas until there are no longer enough pages to report on to keep the worker busy, or we have processed as many pages as were contained in the free area when we started processing the list. The worker thread will work in a round-robin fashion making its way though each zone requesting reporting, and through each reportable free list within that zone. Once all free areas within the zone have been processed it will check to see if there have been any requests for reporting while it was processing. If so it will reschedule the worker thread to start up again in roughly 2s and exit. Link: http://lkml.kernel.org/r/20200211224635.29318.19750.stgit@localhost.localdomainSigned-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nitesh Narayan Lal <nitesh@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pagupta@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Wang <wei.w.wang@intel.com> Cc: Yang Zhang <yang.zhang.wz@gmail.com> Cc: wei qi <weiqi4@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com> Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com> Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
-
由 Alexander Duyck 提交于
to #26589565 In order to enable the use of the zone from the list manipulator functions I will need access to the zone pointer. As it turns out most of the accessors were always just being directly passed &zone->free_area[order] anyway so it would make sense to just fold that into the function itself and pass the zone and order as arguments instead of the free area. In order to be able to reference the zone we need to move the declaration of the functions down so that we have the zone defined before we define the list manipulation functions. Since the functions are only used in the file mm/page_alloc.c we can just move them there to reduce noise in the header. Link: http://lkml.kernel.org/r/20200211224613.29318.43080.stgit@localhost.localdomainSigned-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com> Acked-by: NMel Gorman <mgorman@techsingularity.net> Reviewed-by: NDan Williams <dan.j.williams@intel.com> Reviewed-by: NDavid Hildenbrand <david@redhat.com> Reviewed-by: NPankaj Gupta <pagupta@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nitesh Narayan Lal <nitesh@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Wang <wei.w.wang@intel.com> Cc: Yang Zhang <yang.zhang.wz@gmail.com> Cc: wei qi <weiqi4@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com> Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com> Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
-
由 Dan Williams 提交于
to #26589565 commit b03641af680959df57c275a80ff0dc116627c7ae upstream In preparation for runtime randomization of the zone lists, take all (well, most of) the list_*() functions in the buddy allocator and put them in helper functions. Provide a common control point for injecting additional behavior when freeing pages. [dan.j.williams@intel.com: fix buddy list helpers] Link: http://lkml.kernel.org/r/155033679702.1773410.13041474192173212653.stgit@dwillia2-desk3.amr.corp.intel.com [vbabka@suse.cz: remove del_page_from_free_area() migratetype parameter] Link: http://lkml.kernel.org/r/4672701b-6775-6efd-0797-b6242591419e@suse.cz Link: http://lkml.kernel.org/r/154899812264.3165233.5219320056406926223.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: NDan Williams <dan.j.williams@intel.com> Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Tested-by: NTetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Kees Cook <keescook@chromium.org> Cc: Keith Busch <keith.busch@intel.com> Cc: Robert Elliott <elliott@hpe.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com> Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com> Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
-
由 Yang Shi 提交于
to #26589565 The list_is_first() and list_rotate_to_front() functions are needed by free page hinting. list_is_first() was used by i915 driver, so moved it to common header file. Backported list_rotate_to_front() from commit a16b53849913 ("list: add function list_rotate_to_front()"). Not backport the whole patch, just copy the implementation of the function. Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com> Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com> Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
-
- 02 4月, 2020 1 次提交
-
-
由 Jiri Slaby 提交于
fix #25967152 commit dce05aa6eec977f1472abed95ccd71276b9a3864 upstream Avoid global variables (namely sel_cons) by introducing vc_is_sel. It checks whether the parameter is the current selection console. This will help putting sel_cons to a struct later. Signed-off-by: NJiri Slaby <jslaby@suse.cz> Link: https://lore.kernel.org/r/20200219073951.16151-1-jslaby@suse.czSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
- 25 3月, 2020 1 次提交
-
-
由 Stephen Rothwell 提交于
fix #25924252 commit 7dcddef6f769d7e60691c732eb6d09cdb1d9df76 upstream An x86_64 allmodconfig build produces these errors: x86_64-linux-gnu-ld: kernel/sched/core.o: in function `cpuidle_poll_time': core.c:(.text+0x230): multiple definition of `cpuidle_poll_time'; arch/x86/= kernel/process.o:process.c:(.text+0xc0): first defined here (and more) Fixes: 259231a04561 ("cpuidle: add poll_limit_ns to cpuidle_device structure") Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
- 18 3月, 2020 27 次提交
-
-
由 Xu Yu 提交于
Since commit e0205ae40f12 ("mm: memcontrol: use CSS_TASK_ITER_PROCS at mem_cgroup_scan_tasks()") made mem_cgroup_scan_tasks() to check only one thread from each thread group, we can make cgroup_subsys_state::nr_tasks to record only the thread group leader, i.e., process, instead of thread(s). Furthermore, this renames cgroup_subsys_state::nr_tasks to cgroup_subsys_state::nr_procs. Fixes: f061cd88 ("alinux: kernel: cgroup: account number of tasks in the css and its descendants") Signed-off-by: NXu Yu <xuyu@linux.alibaba.com> Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com> Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
-
由 Xu Yu 提交于
The memcg background async page reclaim, a.k.a, memcg kswapd, is implemented with a dedicated unbound workqueue currently. However, memcg kswapd will run too frequently, resulting in high overhead, page cache thrashing, frequent dirty page writeback, etc., due to improper memcg memory.wmark_ratio, unreasonable memcg memor capacity, or even abnormal memcg memory usage. We need to find out the problematic memcg(s) where memcg kswapd introduces significant overhead. This records the latency of each run of memcg kswapd work, and then aggregates into the exstat of per memcg. Signed-off-by: NXu Yu <xuyu@linux.alibaba.com> Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
-
由 Marc Zyngier 提交于
commit 5bd90b0989731520f2cdcfbbe467f1271f3cc803 upstream. In order to find out whether a vcpu is likely to be the target of VLPIs (and to further optimize the way we deal with those), let's track the number of VLPIs a vcpu can receive. This gets implemented with an atomic variable that gets incremented or decremented on map, unmap and move of a VLPI. Signed-off-by: NMarc Zyngier <maz@kernel.org> Reviewed-by: NZenghui Yu <yuzenghui@huawei.com> Reviewed-by: NChristoffer Dall <christoffer.dall@arm.com> Link: https://lore.kernel.org/r/20191107160412.30301-2-maz@kernel.orgSigned-off-by: NShannon Zhao <shannon.zhao@linux.alibaba.com> Acked-by: NZou Cao <zoucao@linux.alibaba.com>
-
由 Marc Zyngier 提交于
commit 8e01d9a396e6db153d94a6004e6473d9ff251a6a upstream. When the VHE code was reworked, a lot of the vgic stuff was moved around, but the GICv4 residency code did stay untouched, meaning that we come in and out of residency on each flush/sync, which is obviously suboptimal. To address this, let's move things around a bit: - Residency entry (flush) moves to vcpu_load - Residency exit (sync) moves to vcpu_put - On blocking (entry to WFI), we "put" - On unblocking (exit from WFI), we "load" Because these can nest (load/block/put/load/unblock/put, for example), we now have per-VPE tracking of the residency state. Additionally, vgic_v4_put gains a "need doorbell" parameter, which only gets set to true when blocking because of a WFI. This allows a finer control of the doorbell, which now also gets disabled as soon as it gets signaled. Signed-off-by: NMarc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20191027144234.8395-2-maz@kernel.orgSigned-off-by: NShannon Zhao <shannon.zhao@linux.alibaba.com> Acked-by: NZou Cao <zoucao@linux.alibaba.com>
-
由 Minchan Kim 提交于
commit 1a4e58cce84ee88129d5d49c064bd2852b481357 upstream When a process expects no accesses to a certain memory range for a long time, it could hint kernel that the pages can be reclaimed instantly but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_PAGEOUT hint to madvise(2) syscall. MADV_PAGEOUT can be used by a process to mark a memory range as not expected to be used for a long time so that kernel reclaims *any LRU* pages instantly. The hint can help kernel in deciding which pages to evict proactively. A note: It doesn't apply SWAP_CLUSTER_MAX LRU page isolation limit intentionally because it's automatically bounded by PMD size. If PMD size(e.g., 256) makes some trouble, we could fix it later by limit it to SWAP_CLUSTER_MAX[1]. - man-page material MADV_PAGEOUT (since Linux x.x) Do not expect access in the near future so pages in the specified regions could be reclaimed instantly regardless of memory pressure. Thus, access in the range after successful operation could cause major page fault but never lose the up-to-date contents unlike MADV_DONTNEED. Pages belonging to a shared mapping are only processed if a write access is allowed for the calling process. MADV_PAGEOUT cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [1] https://lore.kernel.org/lkml/20190710194719.GS29695@dhcp22.suse.cz/ [minchan@kernel.org: clear PG_active on MADV_PAGEOUT] Link: http://lkml.kernel.org/r/20190802200643.GA181880@google.com [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-5-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org> Reported-by: Nkbuild test robot <lkp@intel.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com> Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
-
由 Minchan Kim 提交于
commit 9c276cc65a58faf98be8e56962745ec99ab87636 upstream Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org> Reported-by: Nkbuild test robot <lkp@intel.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com> Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
-
由 Sai Praneeth 提交于
[ Upstream commit 9dbbedaa6171247c4c7c40b83f05b200a117c2e0 ] After the kernel has booted, if any accesses by firmware causes a page fault, the efi page fault handler would freeze efi_rts_wq and schedules a new process. To do this, the efi page fault handler needs efi_rts_work. Hence, make it accessible. There will be no race conditions in accessing this structure, because all the calls to efi runtime services are already serialized. [ Wen: This patch also fixes a memory corruption: #define efi_queue_work(_rts, _arg1, _arg2, _arg3, _arg4, _arg5)\ ({ \ struct efi_runtime_work efi_rts_work; \ … init_completion(&efi_rts_work.efi_rts_comp); \ INIT_WORK(&efi_rts_work.work, efi_call_rts); \ … efi_rts_work is on the stack, registering it to workqueue will cause the following error: ODEBUG: object (____ptrval____) is on stack (____ptrval____), but NOT annotated. ------------[ cut here ]------------ WARNING: CPU: 6 PID: 1 at lib/debugobjects.c:368 __debug_object_init+0x218/0x538 Modules linked in: CPU: 6 PID: 1 Comm: swapper/0 Tainted: G W 4.19.91 #19 … Call trace: __debug_object_init+0x218/0x538 debug_object_init+0x20/0x28 __init_work+0x34/0x58 virt_efi_get_time.part.5+0x6c/0x12c virt_efi_get_time+0x4c/0x58 efi_read_time+0x40/0x9c __rtc_read_time+0x50/0x118 rtc_read_time+0x60/0x1f0 rtc_hctosys+0x74/0x124 do_one_initcall+0xac/0x3d4 kernel_init_freeable+0x49c/0x59c kernel_init+0x18/0x110 ] Tested-by: NBhupesh Sharma <bhsharma@redhat.com> Suggested-by: NMatt Fleming <matt@codeblueprint.co.uk> Based-on-code-from: Ricardo Neri <ricardo.neri@intel.com> Signed-off-by: NSai Praneeth Prakhya <sai.praneeth.prakhya@intel.com> Signed-off-by: NArd Biesheuvel <ard.biesheuvel@linaro.org> Fixes: 3eb420e7 ("efi: Use a work queue to invoke EFI Runtime Services") Signed-off-by: NWen Yang <wenyang@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
由 Jan Kara 提交于
commit 13ef954445df4fd1d7c003a500ec5ce49573e14b upstream Notes from Xiaoguang Wang: Indeed this patch should be appled before "ext4: introduce direct I/O read using iomap infrastructure", but given that we have already appled "ext4: introduce direct I/O read using iomap infrastructure" previously, we need to update iomap_dio_rw() calls with the new argument in ext4. Filesystems do not support doing IO as asynchronous in some cases. For example in case of unaligned writes or in case file size needs to be extended (e.g. for ext4). Instead of forcing filesystem to wait for AIO in such cases, add argument to iomap_dio_rw() which makes the function wait for IO completion. This also results in executing iomap_dio_complete() inline in iomap_dio_rw() providing its return value to the caller as for ordinary sync IO. Signed-off-by: NJan Kara <jack@suse.cz> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Joao Martins 提交于
commit 11c59eae6633b8a7e77b8ee1cf908964d80c78cd upstream The recently introduced haltpoll driver is largely only useful with haltpoll governor. To allow drivers to associate with a particular idle behaviour, add a @governor property to 'struct cpuidle_driver' and thus allow a cpuidle driver to switch to a *preferred* governor on idle driver registration. We save the previous governor, and when an idle driver is unregistered we switch back to that. The @governor can be overridden by cpuidle.governor= boot param or alternatively be ignored if the governor doesn't exist. Signed-off-by: NJoao Martins <joao.m.martins@oracle.com> Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
-
由 Marcelo Tosatti 提交于
commit 7d4daeedd575bbc3c40c87fc6708a8b88c50fe7e upstream Since this field is shared by all governors, move it to cpuidle device structure. Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com> Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
-
由 Marcelo Tosatti 提交于
commit 259231a045616c4101d023a8f4dcc8379af265a6 upstream Add a poll_limit_ns variable to cpuidle_device structure. Calculate and configure it in the new cpuidle_poll_time function, in case its zero. Individual governors are allowed to override this value. Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com> Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
-
由 Wenwei Tao 提交于
Under memory pressure reclaim and oom would happen, with multiple cgroups exist in one system, we might want some of their memory or tasks survived the reclaim and oom while there are other candidates. The @memory.low and @memory.min have make that happen during reclaim, this patch introduces memcg priority oom to meet above requirement in the oom. The priority is from 0 to 12, the higher number the higher priority. When oom happens it always choose victim from low priority memcg. And it works both for memcg oom and global oom, it can be enabled/disabled through @memory.use_priority_oom, for global oom through the root memcg's @memory.use_priority_oom, it is disabled by default. Signed-off-by: NWenwei Tao <wenwei.tao@linux.alibaba.com> Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
-
由 Wenwei Tao 提交于
Account number of the tasks in the css and its descendants, this is prepared for the incoming memcg priority patch. In memcg priority oom, we will select victim cgroup which has victim tasks in it. We need to know whether the memcg and its descendants have tasks before the selection can move on. Signed-off-by: NWenwei Tao <wenwei.tao@linux.alibaba.com> Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
-
由 Xunlei Pang 提交于
Accessing original memory.stat turned out to be one heavy operation which has been caused many real product problems. Introduce new cgroup memory.exstat, memory.exstat stands for "extra/extended memory.stat", which contains dedicated statistics from Alibaba Clould Kernel. memory.exstat is supposed to provide hierarchical statistics. Export its first "wmark_min_throttled_ms", and will add more like direct reclaim, direct compaction, etc. Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com> Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com> Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
-
由 Xunlei Pang 提交于
In co-location environment, there are more or less some memory overcommitment, then BATCH tasks may break the shared global min watermark resulting in all types of applications falling into the direct reclaim slow path hurting the RT of LS tasks. (NOTE: BATCH tasks tolerate big latency spike even in seconds as long as doesn't hurt its overal throughput. While LS tasks are very Latency-Sensitive, they may time out or fail in case of sudden latency spike lasts like hundreds of ms typically.) Actually BATCH tasks are not sensitive to memory latency, they can be assigned a strict min watermark which is different from that of LS tasks(which can be aissgned a lenient min watermark accordingly), thus isolating each other in case of global memory allocation. This is kind of like the idea behind ALLOC_HARDER for rt_task(), see gfp_to_alloc_flags(). memory.wmark_min_adj stands for memcg global WMARK_MIN adjustment, it is used to realize separate min watermarks above-mentioned for memcgs, its valid value is within [-25, 50], specifically: negative value means to be relative to [0, WMARK_MIN], positive value means to be relative to [WMARK_MIN, WMARK_LOW]. For examples, -25 means "WMARK_MIN + (WMARK_MIN - 0) * (-25%)" 50 means "WMARK_MIN + (WMARK_LOW - WMARK_MIN) * 50%" Note that the minimum -25 is what ALLOC_HARDER uses which is safe for us to adopt, and the maximum 50 is one experienced value. Negative memory.wmark_min_adj means high QoS requirements, it can allocate below the global WMARK_MIN, which is kind of like the idea behind ALLOC_HARDER, see gfp_to_alloc_flags(). Positive memory.wmark_min_adj means low QoS requirements, thus when allocation broke memcg min watermark, it should trigger direct reclaim traditionally, and we trigger throttle instead to further prevent them from disturbing others. With this interface, we can assign positive values for BATCH memcgs and negative values for LS memcgs. memory.wmark_min_adj default value is 0, and inherit from its parent, Note that the final effective wmark_min_adj will consider all the hierarchical values, its value is the maximal(most conservative) wmark_min_adj along the hierarchy but excluding intermediate default values(zero). Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com> Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com> Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
-
由 Xunlei Pang 提交于
After memcg was deleted, page caches still reference to this memcg causing large number of dead(zombie) memcgs in the system. Then it slows down access to "/sys/fs/cgroup/cpu/memory.stat", etc due to tons of iterations, further causing various latencies. This patch introduces two ways to reclaim these zombie memcgs. 1) Background kthread reaper Introduce a kernel thread "memcg_zombie_reaper" to reclaim zombie memcgs at background periodically. Several knobs are also added to control the reaper scan frequency: - /sys/kernel/mm/memcg_reaper/scan_interval The scan period in second. Default 5s. - /sys/kernel/mm/memcg_reaper/pages_scan The scan rate of pages per scan. Default 1310720(5GiB for 4KiB page). - /sys/kernel/mm/memcg_reaper/verbose Output some zombie memcg information for debug purpose. Default off. - /sys/kernel/mm/memcg_reaper/reap_background "on/off" switch. Default "0" means off. Write "1" to switch it on. 2) One-shot trigger by users - /sys/kernel/mm/memcg_reaper/reap Write "1" to trigger one round of zombie memcg reaping, but without any guarantee, you may need to launch multiple rounds as needed. Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com> Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
-
由 Christoph Hellwig 提交于
commit 838c4f3d7515efe9d0e32c846fb5d102b6d8a29d upstream. Add a new iomap_dio_ops structure that for now just contains the end_io handler. This avoid storing the function pointer in a mutable structure, which is a possible exploit vector for kernel code execution, and prepares for adding a submit_io handler that btrfs needs. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Goldwyn Rodrigues 提交于
commit c039b99792726346ad46ff17c5a5bcb77a5edac4 upstream. The srcmap is used to identify where the read is to be performed from. It is passed to ->iomap_begin, which can fill it in if we need to read data for partially written blocks from a different location than the write target. The srcmap is only supported for buffered writes so far. Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com> [hch: merged two patches, removed the IOMAP_F_COW flag, use iomap as srcmap if not set, adjust length down to srcmap end as well] Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Acked-by: NGoldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Jens Axboe 提交于
commit de2ea4b64b75a79ed9cdf9bf30e0e197901084e4 upstream. This is identical to __sys_accept4(), except it takes a struct file instead of an fd, and it also allows passing in extra file->f_flags flags. The latter is done to support masking in O_NONBLOCK without manipulating the original file flags. No functional changes in this patch. Cc: netdev@vger.kernel.org Acked-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Jens Axboe 提交于
commit 771b53d033e8663abdf59704806aa856b236dcdb upstream. This adds support for io-wq, a smaller and specialized thread pool implementation. This is meant to replace workqueues for io_uring. Among the reasons for this addition are: - We can assign memory context smarter and more persistently if we manage the life time of threads. - We can drop various work-arounds we have in io_uring, like the async_list. - We can implement hashed work insertion, to manage concurrency of buffered writes without needing a) an extra workqueue, or b) needlessly making the concurrency of said workqueue very low which hurts performance of multiple buffered file writers. - We can implement cancel through signals, for cancelling interruptible work like read/write (or send/recv) to/from sockets. - We need the above cancel for being able to assign and use file tables from a process. - We can implement a more thorough cancel operation in general. - We need it to move towards a syslet/threadlet model for even faster async execution. For that we need to take ownership of the used threads. This list is just off the top of my head. Performance should be the same, or better, at least that's what I've seen in my testing. io-wq supports basic NUMA functionality, setting up a pool per node. io-wq hooks up to the scheduler schedule in/out just like workqueue and uses that to drive the need for more/less workers. Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NJens Axboe <axboe@kernel.dk> [Joseph: Cherry-pick allow_kernel_signal() from upstream commit 33da8e7c814f] Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Thomas Gleixner 提交于
commit 15917dc02841862840efcbfe1da0830f88078b5c upstream. The RTMUTEX tester was removed long ago but the PF bit stayed around. Remove it and free up the space. Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Sam Protsenko 提交于
commit 94e297c50b529f5d01cfd1dbc808d61e95180ab7 upstream. ctags indexing ("make tags" command) throws this warning: ctags: Warning: include/linux/notifier.h:125: null expansion of name pattern "\1" This is the result of DEFINE_PER_CPU() macro expansion. Fix that by getting rid of line break. Similar fix was already done in commit 25528213 ("tags: Fix DEFINE_PER_CPU expansions"), but this one probably wasn't noticed. Link: http://lkml.kernel.org/r/20181030202808.28027-1-semen.protsenko@linaro.org Fixes: 9c80172b ("kernel/SRCU: provide a static initializer") Signed-off-by: NSam Protsenko <semen.protsenko@linaro.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NCambda Zhu <cambda@linux.alibaba.com> Acked-by: NXunlei Pang <xlpang@linux.alibaba.com>
-
由 Zhengyuan Liu 提交于
commit 9310a7ba6de8cce6209e3e8a3cdf733f824cdd9b upstream. We are using PAGE_SIZE as the unit to determine if the total len in async_list has exceeded max_pages, it's not fair for smaller io sizes. For example, if we are doing 1k-size io streams, we will never exceed max_pages since len >>= PAGE_SHIFT always gets zero. So use original bytes to make it more accurate. Signed-off-by: NZhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Oleg Nesterov 提交于
commit b772434be0891ed1081a08ae7cfd4666728f8e82 upstream. task->saved_sigmask and ->restore_sigmask are only used in the ret-from- syscall paths. This means that set_user_sigmask() can save ->blocked in ->saved_sigmask and do set_restore_sigmask() to indicate that ->blocked was modified. This way the callers do not need 2 sigset_t's passed to set/restore and restore_user_sigmask() renamed to restore_saved_sigmask_unless() turns into the trivial helper which just calls restore_saved_sigmask(). Link: http://lkml.kernel.org/r/20190606113206.GA9464@redhat.comSigned-off-by: NOleg Nesterov <oleg@redhat.com> Cc: Deepa Dinamani <deepa.kernel@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Eric Wong <e@80x24.org> Cc: Jason Baron <jbaron@akamai.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: David Laight <David.Laight@aculab.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Jens Axboe 提交于
commit aa1fa28fc73ea6b740ee7b62bf3b07141883dbb8 upstream. This is done through IORING_OP_RECVMSG. This opcode uses the same sqe->msg_flags that IORING_OP_SENDMSG added, and we pass in the msghdr struct in the sqe->addr field as well. We use MSG_DONTWAIT to force an inline fast path if recvmsg() doesn't block, and punt to async execution if it would have. Acked-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Jens Axboe 提交于
commit 0fa03c624d8fc9932d0f27c39a9deca6a37e0e17 upstream. This is done through IORING_OP_SENDMSG. There's a new sqe->msg_flags for the flags argument, and the msghdr struct is passed in the sqe->addr field. We use MSG_DONTWAIT to force an inline fast path if sendmsg() doesn't block, and punt to async execution if it would have. Acked-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Christoph Hellwig 提交于
Cherry-pick from commit b620743077e291ae7d0debd21f50413a8c266229 upstream. If we pass pages through an iov_iter we always already have a reference in the caller. Thus remove the ITER_BVEC_FLAG_NO_REF and don't take reference to pages by default for bvec backed iov_iters. [Joseph] Resolve conflicts since we don't have: 81ba6abd2bcd "block: loop: mark bvec as ITER_BVEC_FLAG_NO_REF" 7321ecbfc7cf "block: change how we get page references in bio_iov_iter_get_pages" Reviewed-by: NMinwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-