- 02 Sep, 2020 40 commits
-
-
Committed by Dave Jiang
to #27305291
commit 037c8489ade669e0f09ad40d5b91e5e1159a14b1 upstream.

Add a zero key to standardize handling of hardware that wants a key of all zeros to be passed. Some platforms default to a zero key with security enabled rather than allowing the OS to enable security. The zero key allows us to manage those platforms as well. This also fixes secure erase so that it can use the zero key to do a crypto erase. Some other security commands already use zero keys. This introduces a standard zero key to unify semantics across nvdimm security commands.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
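As a rough illustration of the zero-key idea, the sketch below treats key id 0 as "use an all-zero passphrase". The names `PASSPHRASE_LEN`, `ZERO_KEY`, and `resolve_passphrase` are hypothetical and chosen for this example only; the 32-byte length is assumed from the nvdimm key format described elsewhere in this log, and the dictionary stands in for the kernel keyring.

```python
# Sketch only: a toy model of zero-key selection, not the libnvdimm code.
PASSPHRASE_LEN = 32  # assumed, based on the 32-byte nvdimm key format

# The "zero key": a passphrase consisting entirely of 0x00 bytes.
ZERO_KEY = bytes(PASSPHRASE_LEN)


def resolve_passphrase(keyid, keyring):
    """Return the passphrase for keyid; key id 0 selects the zero key.

    keyring is a plain dict {keyid: bytes} standing in for a real keyring.
    """
    if keyid == 0:
        return ZERO_KEY
    key = keyring[keyid]
    if len(key) != PASSPHRASE_LEN:
        raise ValueError("passphrase must be exactly %d bytes" % PASSPHRASE_LEN)
    return key
```

With this convention every security command can take a key id, and callers that have no passphrase simply pass 0.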
-
Committed by Dave Jiang
to #27305291
commit 1f4883f300da4f4d9d31eaa80f7debf6ce74843b upstream.

Add theory of operation for the security support that's going into libnvdimm.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jing Lin <jing.lin@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Dave Jiang
to #27305291
commit ecaa4a97b3908be0bf3ad12181ae8c44d1816d40 upstream.

Add test support for the new Intel DSMs from v1.8. The ability to simulate master passphrase update and master secure erase has been added to nfit_test.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Dave Jiang
to #27305291
commit 926f74802cb1ce0ef0c3b9f806ea542beb57e50d upstream.

With the implementation of Intel NVDIMM DSM overwrite, add a unit test to nfit_test for testing the overwrite operation.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Dave Jiang
to #27305291
commit 3c13e2ac747a37e683597d3d875f839f2bc150e1 upstream.

Add nfit_test support for the DSM functions "Get Security State", "Set Passphrase", "Disable Passphrase", "Unlock Unit", "Freeze Lock", and "Secure Erase" for the fake DIMMs. Also add a sysfs knob to put the DIMMs in the "locked" state. The order of testing DIMM unlocking would be:

1a. Disable DIMM X.
1b. Set Passphrase to DIMM X.
2. Write to /sys/devices/platform/nfit_test.0/nfit_test_dimm/test_dimmX/lock_dimm
3. Re-enable DIMM X.
4. Check DIMM X state via the sysfs "security" attribute for nmemX.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Dave Jiang
to #27305291
commit 89fa9d8ea7bdfa841d19044485cec5f4171069e5 upstream.

With Intel DSM 1.8 [1], two new security DSMs are introduced: enable/update master passphrase and master secure erase. The master passphrase allows a secure erase to be performed without the user passphrase that is set on the NVDIMM. The commands master_update and master_erase are added to the sysfs knob in order to initiate the DSMs. They operate similarly to update and erase.

[1]: http://pmem.io/documents/NVDIMM_DSM_Interface-V1.8.pdf

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Dave Jiang
to #27305291
commit 7d988097c546187ada602cc9bccd0f03d473eb8f upstream.

Add support for the NVDIMM_FAMILY_INTEL "overwrite" capability as described by the Intel DSM spec v1.7. This allows triggering of overwrite on Intel NVDIMMs. The overwrite operation can take tens of minutes. When the overwrite DSM is issued successfully, the NVDIMMs will be inaccessible. The kernel will do backoff polling to detect when the overwrite process is completed. According to the DSM spec v1.7, the 128G NVDIMMs can take up to 15 minutes to perform overwrite, and larger DIMMs will take longer.

Given that overwrite puts the DIMM in an indeterminate state until it completes, introduce the NDD_SECURITY_OVERWRITE flag to prevent other operations from executing while overwrite is in progress. The NDD_WORK_PENDING flag is added to denote that there is a device reference on the nvdimm device for an async workqueue thread context.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
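The "backoff polling" mentioned above can be pictured with a generic sketch like the one below. The commit message does not state the interval schedule the kernel uses, so the exponential growth, the starting delay, and the cap here are assumptions for illustration; `poll_until_done` and `query_done` are hypothetical names.

```python
import time


def poll_until_done(query_done, initial_delay=1.0, max_delay=900.0,
                    factor=2.0, sleep=time.sleep):
    """Call query_done() until it returns True, waiting longer each time.

    Returns the number of polls performed. The delay grows by `factor`
    after every unsuccessful poll and is capped at `max_delay` seconds.
    """
    delay = initial_delay
    polls = 0
    while True:
        polls += 1
        if query_done():
            return polls
        sleep(delay)  # back off before asking the device again
        delay = min(delay * factor, max_delay)
```

Passing a custom `sleep` callable makes the schedule easy to inspect without actually waiting, which is convenient for a long-running operation such as an overwrite.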
-
Committed by Dave Jiang
to #27305291
commit 64e77c8c047fb91ea8c7800c1238108a72f0bf9c upstream.

Add support to issue a secure erase DSM to the Intel nvdimm. The required passphrase is acquired from an encrypted key in the kernel user keyring. To trigger the action, "erase <keyid>" is written to the "security" sysfs attribute.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Dave Jiang
to #27305291
commit d2a4ac73f56a5d0709d28b41fec8d15e4500f8f1 upstream.

Add support for enabling and updating the passphrase on Intel nvdimms. The passphrase is an encrypted key in the kernel user keyring. The update is triggered by writing "update <old_keyid> <new_keyid>" to the sysfs attribute "security". If no <old_keyid> exists (for enabling security), then a 0 should be used.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
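The command strings described in these commits ("update <old_keyid> <new_keyid>", "disable <keyid>", "erase <keyid>", "freeze") can be assembled with a small helper such as the sketch below. The function name `security_command` and its argument checking are hypothetical; the sketch only builds the text and deliberately does not write to sysfs.

```python
# Number of key ids each command takes, per the commit messages in this log.
EXPECTED_KEYIDS = {"update": 2, "disable": 1, "erase": 1, "freeze": 0}


def security_command(action, *keyids):
    """Build the string that would be written to the "security" attribute."""
    if action not in EXPECTED_KEYIDS:
        raise ValueError("unknown action: %r" % (action,))
    if len(keyids) != EXPECTED_KEYIDS[action]:
        raise ValueError("%s takes %d key id(s)" % (action, EXPECTED_KEYIDS[action]))
    return " ".join([action, *map(str, keyids)])


# Enabling security for the first time: there is no old key, so pass 0.
print(security_command("update", 0, 12345))
```

An existing passphrase would be rotated with both key ids non-zero, for example `security_command("update", 12345, 67890)`.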
-
Committed by Dave Jiang
to #27305291
commit 03b65b22ada8115a7a7bfdf0789f6a94adfd6070 upstream.

Add support to disable the passphrase (security) for the Intel nvdimm. The passphrase used for disabling is pulled from an encrypted key in the kernel user keyring. The action is triggered by writing "disable <keyid>" to the sysfs attribute "security".

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Dave Jiang
to #27305291
commit 4c6926a23b76ea23403976290cd45a7a143f6500 upstream.

Add support to unlock the DIMM via the kernel key management APIs. The passphrase is expected to be pulled from userspace through keyutils. The key management and sysfs attributes are libnvdimm generic.

Encrypted keys are used to protect the nvdimm passphrase at rest. The master key can be a trusted key sealed in a TPM (preferred) or an encrypted key (more flexible, but with more exposure to a potential attacker).

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Co-developed-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Dave Jiang
to #27305291
commit 37833fb7989a9d3c3e26354e6878e682c340d718 upstream.

Add support for freezing security on the Intel nvdimm. This locks out any changes to security for the DIMM until a hard reset of the DIMM is performed. This is triggered by writing "freeze" to the generic nvdimm/nmemX "security" sysfs attribute.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Co-developed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Dave Jiang
to #27305291
commit f2989396553a0bd13f4b25f567a3dee3d722ce40 upstream.

Some NVDIMMs, like the ones defined by the NVDIMM_FAMILY_INTEL command set, expose a security capability to lock the DIMMs at poweroff and require a passphrase to unlock them. The security model is derived from ATA security. In anticipation of other DIMMs implementing a similar scheme, and to abstract the core security implementation away from the device-specific details, introduce nvdimm_security_ops.

Initially only a status retrieval operation, ->state(), is defined, along with the base infrastructure and definitions for future operations.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Co-developed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Dave Jiang
to #27305291
commit 9db67581b91d9e9e05c35570ac3f93872e6c84ca upstream.

Add an nvdimm key format type to encrypted keys in order to limit the size of the key to 32 bytes.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Acked-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Dave Jiang
to #27305291
commit 76ef5e17252789da79db78341851922af0c16181 upstream.

Export the lookup_user_key() symbol in order to allow nvdimm passphrase update to retrieve user-injected keys.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Dave Jiang
to #27305291
commit d6548ae4d16dc231dec22860c9c472bcb991fb15 upstream.

The generated dimm id is needed for the sysfs attribute as well as being used as the identifier/description for the security key. Since it's constant and should never change, store it as a member of struct nvdimm.

As nvdimm_create() continues to grow parameters relative to NFIT driver requirements, do not require other implementations to keep pace. Introduce __nvdimm_create() to carry the new parameters and keep nvdimm_create() with the long-standing default API.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
[ Shile: fixed conflict in drivers/acpi/nfit/nfit.h ]
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Dave Jiang
to #27305291
commit b3ed2ce024c36054e51cca2eb31a1cdbe4a5f11e upstream.

Add command definitions for the security commands defined in Intel DSM specification v1.8 [1]. This includes "get security state", "set passphrase", "unlock unit", "freeze lock", "secure erase", "overwrite", "overwrite query", "master passphrase enable/disable", and "master erase". Since this adds several Intel definitions, move the relevant bits to their own header.

These commands mutate physical data, but that manipulation is not cache coherent. The requirement to flush and invalidate caches makes these commands unsuitable to be called from userspace, so extra logic is added to detect and block these commands from being submitted via the ioctl command submission path.

Lastly, the commands may contain sensitive key material that should not be dumped in a standard debug session. Update the nvdimm-command payload-dump facility to move security command payloads behind a default-off compile time switch.

[1]: http://pmem.io/documents/NVDIMM_DSM_Interface-V1.8.pdf

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
[ Shile: fixed conflicts:
  This patch updates the file "drivers/acpi/nfit/intel.h". The header file was introduced upstream by commit 0ead111 ("acpi, nfit: Collect shutdown status"), which also updates the test files. So fetch this part to fix the conflict:
  - tools/testing/nvdimm/test/nfit.c
  - tools/testing/nvdimm/test/nfit_test.h ]
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Pavel Begunkov
to #29197839
commit d60b5fbc1ce8210759b568da49d149b868e7c6d3 upstream.

Don't reissue requests from io_iopoll_reap_events(), the task may not have mm, which ends up with NULL. It's better to kill everything off on exit anyway.

[ 677.734670] RIP: 0010:io_iopoll_complete+0x27e/0x630
...
[ 677.734679] Call Trace:
[ 677.734695] ? __send_signal+0x1f2/0x420
[ 677.734698] ? _raw_spin_unlock_irqrestore+0x24/0x40
[ 677.734699] ? send_signal+0xf5/0x140
[ 677.734700] io_iopoll_getevents+0x12f/0x1a0
[ 677.734702] io_iopoll_reap_events.part.0+0x5e/0xa0
[ 677.734703] io_ring_ctx_wait_and_kill+0x132/0x1c0
[ 677.734704] io_uring_release+0x20/0x30
[ 677.734706] __fput+0xcd/0x230
[ 677.734707] ____fput+0xe/0x10
[ 677.734709] task_work_run+0x67/0xa0
[ 677.734710] do_exit+0x35d/0xb70
[ 677.734712] do_group_exit+0x43/0xa0
[ 677.734713] get_signal+0x140/0x900
[ 677.734715] do_signal+0x37/0x780
[ 677.734717] ? enqueue_hrtimer+0x41/0xb0
[ 677.734718] ? recalibrate_cpu_khz+0x10/0x10
[ 677.734720] ? ktime_get+0x3e/0xa0
[ 677.734721] ? lapic_next_deadline+0x26/0x30
[ 677.734723] ? tick_program_event+0x4d/0x90
[ 677.734724] ? __hrtimer_get_next_event+0x4d/0x80
[ 677.734726] __prepare_exit_to_usermode+0x126/0x1c0
[ 677.734741] prepare_exit_to_usermode+0x9/0x40
[ 677.734742] idtentry_exit_cond_rcu+0x4c/0x60
[ 677.734743] sysvec_reschedule_ipi+0x92/0x160
[ 677.734744] ? asm_sysvec_reschedule_ipi+0xa/0x20
[ 677.734745] asm_sysvec_reschedule_ipi+0x12/0x20

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
-
Committed by Pavel Begunkov
to #29197839
commit cd664b0e35cb1202f40c259a1a5ea791d18c879d upstream.

io_do_iopoll() won't do anything with a request unless req->iopoll_completed is set. So io_complete_rw_iopoll() has to set it, otherwise io_do_iopoll() will poll a file again and again even though the request of interest was completed a long time ago.

Also, remove the -EAGAIN check from io_issue_sqe() as it races with the changed lines. The request will take the long way and be resubmitted from io_iopoll*().

Fixes: bbde017a32b3 ("io_uring: add memory barrier to synchronize io_kiocb's result and iopoll_completed")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
-
Committed by Shile Zhang
to #27182371

Restore all the previously tuned configs for Cloud Kernel, to keep the configs unified across x86_64 and ARM64.

Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Shile Zhang
to #27182371

Reconfigure ARM64 with the help of the Alibaba internal kernel, to keep the kernel configs unified.

Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Charan Teja Reddy
to #28825456
commit aa09259109583b98b9d9e7ed0d8eb1b880d1eb97 upstream.

Updating the zone watermarks by any means, such as min_free_kbytes or watermark_scale_factor, while ->watermark_boost is set results in higher low and high watermarks than the user asked for.

Below are the steps to reproduce the problem on a system running an Android kernel on Snapdragon hardware.

1) Default settings of the system are as below:
   #cat /proc/sys/vm/min_free_kbytes = 5162
   #cat /proc/zoneinfo | grep -e boost -e low -e "high " -e min -e Node
     Node 0, zone Normal
       min 797
       low 8340
       high 8539

2) Monitor zone->watermark_boost (by adding a debug print in the kernel) and, whenever it is greater than zero, write the same value of min_free_kbytes obtained from step 1.
   #echo 5162 > /proc/sys/vm/min_free_kbytes

3) Then read the zone watermarks in the system while ->watermark_boost is zero. This should show the same watermark values as step 1, but it shows higher values than requested.
   #cat /proc/zoneinfo | grep -e boost -e low -e "high " -e min -e Node
     Node 0, zone Normal
       min 797
       low 21148
       high 21347

These higher values result from updating the zone watermarks using the macro min_wmark_pages(zone), which also adds zone->watermark_boost.

   #define min_wmark_pages(z) (z->_watermark[WMARK_MIN] + z->watermark_boost)

So the steps that lead to the issue are:

1) On the extfrag event, watermarks are boosted by storing the required value in ->watermark_boost.
2) The user tries to update the zone watermark levels in the system through min_free_kbytes or watermark_scale_factor.
3) Later, when kswapd wakes up, it resets zone->watermark_boost to zero.

In step 2), we use the min_wmark_pages() macro to store the watermarks in the zone structure, so the values are always offset by the ->watermark_boost value. This can be avoided by resetting ->watermark_boost to zero before it is used.

Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Link: http://lkml.kernel.org/r/1589457511-4255-1-git-send-email-charante@codeaurora.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
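The effect described in this commit can be reproduced with a toy model. The sketch below is a simplification under stated assumptions: `Zone`, `setup_wmarks`, and the fixed `gap` between watermarks are hypothetical stand-ins, and the numbers are illustrative rather than taken from the report. Only `min_wmark_pages()` mirrors the macro quoted in the commit message.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    wmark_min: int = 0
    wmark_low: int = 0
    wmark_high: int = 0
    watermark_boost: int = 0


def min_wmark_pages(zone):
    # Same shape as the quoted macro: min watermark plus the current boost.
    return zone.wmark_min + zone.watermark_boost


def setup_wmarks(zone, new_min, gap, reset_boost):
    """Recompute low/high from min_wmark_pages(), optionally clearing boost first."""
    zone.wmark_min = new_min
    if reset_boost:
        zone.watermark_boost = 0  # the fix: drop the boost before it is used
    zone.wmark_low = min_wmark_pages(zone) + gap
    zone.wmark_high = min_wmark_pages(zone) + 2 * gap


buggy = Zone(watermark_boost=1000)
setup_wmarks(buggy, new_min=800, gap=200, reset_boost=False)
fixed = Zone(watermark_boost=1000)
setup_wmarks(fixed, new_min=800, gap=200, reset_boost=True)
print(buggy.wmark_low, fixed.wmark_low)
```

In the unfixed path the stored low and high watermarks carry the boost permanently, even after kswapd later clears `watermark_boost`, which matches the inflated values seen in step 3 of the reproduction.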
-
Committed by Henry Willard
to #28825456
commit 14f69140ff9c92a0928547ceefb153a842e8492c upstream.

Commit 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs") adds a boost_watermark() function which increases the min watermark in a zone by at least pageblock_nr_pages or the number of pages in a page block. On Arm64, with 64K pages and 512M huge pages, this is 8192 pages or 512M. It does this regardless of the number of managed pages in the zone or the likelihood of success.

This can put the zone immediately under water in terms of allocating pages from the zone, and can cause a small machine to fail immediately due to OoM. Unlike set_recommended_min_free_kbytes(), which substantially increases min_free_kbytes and is tied to THP, boost_watermark() can be called even if THP is not active.

The problem is most likely to appear on architectures such as Arm64 where pageblock_nr_pages is very large.

It is desirable to run the kdump capture kernel in as small a space as possible to avoid wasting memory. In some architectures, such as Arm64, there are restrictions on where the capture kernel can run, and therefore, the space available. A capture kernel running in 768M can fail due to OoM immediately after boost_watermark() sets the min in zone DMA32, where most of the memory is, to 512M. It fails even though there is over 500M of free memory. With boost_watermark() suppressed, the capture kernel can run successfully in 448M.

This patch limits boost_watermark() to boosting a zone's min watermark only when there are enough pages that the boost will produce positive results. In this case that is estimated to be four times as many pages as pageblock_nr_pages.

Mel said:

: There is no harm in marking it stable. Clearly it does not happen very
: often but it's not impossible. 32-bit x86 is a lot less common now
: which would previously have been vulnerable to triggering this easily.
: ppc64 has a larger base page size but typically only has one zone.
: arm64 is likely the most vulnerable, particularly when CMA is
: configured with a small movable zone.

Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
Signed-off-by: Henry Willard <henry.willard@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/1588294148-6586-1-git-send-email-henry.willard@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[xuyu: expand zone_managed_pages function]
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
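The arithmetic behind this commit can be checked with a short sketch. The helper names below are hypothetical, and the check is one reading of "four times as many pages as pageblock_nr_pages"; treating the 768M capture kernel as a single zone is a simplifying assumption.

```python
KIB = 1024
MIB = 1024 * KIB


def pageblock_nr_pages(page_size, pageblock_bytes):
    """Number of base pages in one page block."""
    return pageblock_bytes // page_size


def should_boost(zone_managed_pages, pageblock_pages):
    """Boost only if the zone holds at least four page blocks' worth of pages."""
    return zone_managed_pages >= 4 * pageblock_pages


# Arm64 with 64K base pages and 512M huge pages, as in the commit message.
block = pageblock_nr_pages(64 * KIB, 512 * MIB)
capture_zone_pages = (768 * MIB) // (64 * KIB)

print(block, capture_zone_pages, should_boost(capture_zone_pages, block))
```

Under these assumptions a 768M zone holds 12288 pages, well below the 32768 needed, so the boost would be skipped for the kdump capture kernel case described above.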
-
Committed by Mel Gorman
to #28825456 commit 28360f398778d7623a5ff8a8e90958c0d925e120 upstream. Dave Chinner reported a problem pointing a finger at commit 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs"). The report is extensive: https://lore.kernel.org/linux-mm/20190807091858.2857-1-david@fromorbit.com/ and it's worth recording the most relevant parts (colorful language and typos included). When running a simple, steady state 4kB file creation test to simulate extracting tarballs larger than memory full of small files into the filesystem, I noticed that once memory fills up the cache balance goes to hell. The workload is creating one dirty cached inode for every dirty page, both of which should require a single IO each to clean and reclaim, and creation of inodes is throttled by the rate at which dirty writeback runs at (via balance dirty pages). Hence the ingest rate of new cached inodes and page cache pages is identical and steady. As a result, memory reclaim should quickly find a steady balance between page cache and inode caches. The moment memory fills, the page cache is reclaimed at a much faster rate than the inode cache, and evidence suggests that the inode cache shrinker is not being called when large batches of pages are being reclaimed. In roughly the same time period that it takes to fill memory with 50% pages and 50% slab caches, memory reclaim reduces the page cache down to just dirty pages and slab caches fill the entirety of memory. The LRU is largely full of dirty pages, and we're getting spikes of random writeback from memory reclaim so it's all going to shit. Behaviour never recovers, the page cache remains pinned at just dirty pages, and nothing I could tune would make any difference. vfs_cache_pressure makes no difference - I would set it so high it should trim the entire inode caches in a single pass, yet it didn't do anything. 
It was clear from tracing and live telemetry that the shrinkers were pretty much not running except when there was absolutely no memory free at all, and then they did the minimum necessary to free memory to make progress. So I went looking at the code, trying to find places where pages got reclaimed and the shrinkers weren't called. There's only one - kswapd doing boosted reclaim as per commit 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs"). The watermark boosting introduced by the commit is triggered in response to an allocation "fragmentation event". The boosting was not intended to target THP specifically and triggers even if THP is disabled. However, with Dave's perfectly reasonable workload, fragmentation events can be very common given the ratio of slab to page cache allocations so boosting remains active for long periods of time. As high-order allocations might use compaction and compaction cannot move slab pages the decision was made in the commit to special-case kswapd when watermarks are boosted -- kswapd avoids reclaiming slab as reclaiming slab does not directly help compaction. As Dave notes, this decision means that slab can be artificially protected for long periods of time and messes up the balance with slab and page caches. Removing the special casing can still indirectly help avoid fragmentation by avoiding fragmentation-causing events due to slab allocation as pages from a slab pageblock will have some slab objects freed. Furthermore, with the special casing, reclaim behaviour is unpredictable as kswapd sometimes examines slab and sometimes does not in a manner that is tricky to tune or analyse. This patch removes the special casing. The downside is that this is not a universal performance win. Some benchmarks that depend on the residency of data when rereading metadata may see a regression when slab reclaim is restored to its original behaviour. 
Similarly, some benchmarks that only read-once or write-once may perform better when page reclaim is too aggressive. The primary upside is that slab shrinker is less surprising (arguably more sane but that's a matter of opinion), behaves consistently regardless of the fragmentation state of the system and properly obeys VM sysctls. A fsmark benchmark configuration was constructed similar to what Dave reported and is codified by the mmtest configuration config-io-fsmark-small-file-stream. It was evaluated on a 1-socket machine to avoid dealing with NUMA-related issues and the timing of reclaim. The storage was an SSD Samsung Evo and a fresh trimmed XFS filesystem was used for the test data. This is not an exact replication of Dave's setup. The configuration scales its parameters depending on the memory size of the SUT to behave similarly across machines. The parameters mean the first sample reported by fs_mark is using 50% of RAM which will barely be throttled and look like a big outlier. Dave used fake NUMA to have multiple kswapd instances which I didn't replicate. Finally, the number of iterations differ from Dave's test as the target disk was not large enough. While not identical, it should be representative. 
fsmark
                                   5.3.0-rc3              5.3.0-rc3
                                     vanilla          shrinker-v1r1
Min       1-files/sec     4444.80 (   0.00%)     4765.60 (   7.22%)
1st-qrtle 1-files/sec     5005.10 (   0.00%)     5091.70 (   1.73%)
2nd-qrtle 1-files/sec     4917.80 (   0.00%)     4855.60 (  -1.26%)
3rd-qrtle 1-files/sec     4667.40 (   0.00%)     4831.20 (   3.51%)
Max-1     1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
Max-5     1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
Max-10    1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
Max-90    1-files/sec     4649.60 (   0.00%)     4780.70 (   2.82%)
Max-95    1-files/sec     4491.00 (   0.00%)     4768.20 (   6.17%)
Max-99    1-files/sec     4491.00 (   0.00%)     4768.20 (   6.17%)
Max       1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
Hmean     1-files/sec     5004.75 (   0.00%)     5075.96 (   1.42%)
Stddev    1-files/sec     1778.70 (   0.00%)     1369.66 (  23.00%)
CoeffVar  1-files/sec       33.70 (   0.00%)       26.05 (  22.71%)
BHmean-99 1-files/sec     5053.72 (   0.00%)     5101.52 (   0.95%)
BHmean-95 1-files/sec     5053.72 (   0.00%)     5101.52 (   0.95%)
BHmean-90 1-files/sec     5107.05 (   0.00%)     5131.41 (   0.48%)
BHmean-75 1-files/sec     5208.45 (   0.00%)     5206.68 (  -0.03%)
BHmean-50 1-files/sec     5405.53 (   0.00%)     5381.62 (  -0.44%)
BHmean-25 1-files/sec     6179.75 (   0.00%)     6095.14 (  -1.37%)

                   5.3.0-rc3      5.3.0-rc3
                     vanilla  shrinker-v1r1
Duration User         501.82         497.29
Duration System      4401.44        4424.08
Duration Elapsed     8124.76        8358.05

This shows a slight skew for the max result, representing a large outlier, while the 1st, 2nd and 3rd quartiles are similar, indicating that the bulk of the results show little difference. Note that an earlier version of the fsmark configuration showed a regression but that included more samples taken while memory was still filling.

Note that the elapsed time is higher. Part of this is that the configuration included time to delete all the test files when the test completes -- the test automation handles the possibility of testing fsmark with multiple thread counts. Without the patch, many of these objects would be memory resident which is part of what the patch is addressing.
There are other important observations that justify the patch.

1. With the vanilla kernel, the number of dirty pages in the system is very low for much of the test. With this patch, dirty pages are generally kept at 10%, which matches vm.dirty_background_ratio and is the normal expected historical behaviour.

2. With the vanilla kernel, the ratio of Slab/Pagecache is close to 0.95 for much of the test, i.e. Slab is being left alone and dominating memory consumption. With the patch applied, the ratio varies between 0.35 and 0.45 with the bulk of the measured ratios roughly half way between those values. This is a different balance to what Dave reported but it was at least consistent.

3. Slabs are scanned throughout the entire test with the patch applied. The vanilla kernel has periods with no scan activity and then relatively massive spikes.

4. Without the patch, kswapd scan rates are very variable. With the patch, the scan rates remain quite steady.

5. Overall vmstats are closer to normal expectations

                                    5.3.0-rc3      5.3.0-rc3
                                      vanilla  shrinker-v1r1
Ops Direct pages scanned             99388.00      328410.00
Ops Kswapd pages scanned          45382917.00    33451026.00
Ops Kswapd pages reclaimed        30869570.00    25239655.00
Ops Direct pages reclaimed           74131.00        5830.00
Ops Kswapd efficiency %                 68.02          75.45
Ops Kswapd velocity                   5585.75        4002.25
Ops Page reclaim immediate         1179721.00      430927.00
Ops Slabs scanned                 62367361.00    73581394.00
Ops Direct inode steals               2103.00        1002.00
Ops Kswapd inode steals             570180.00     5183206.00

o Vanilla kernel is hitting direct reclaim more frequently, not very much in absolute terms but the fact the patch reduces it is interesting

o "Page reclaim immediate" in the vanilla kernel indicates dirty pages are being encountered at the tail of the LRU. This is generally bad and means in this case that the LRU is not long enough for dirty pages to be cleaned by the background flush in time. This is much reduced by the patch.
o With the patch, kswapd is reclaiming 10 times more slab pages than with the vanilla kernel. This is indicative of the watermark boosting over-protecting slab

A more complete set of tests were run that were part of the basis for introducing boosting and while there are some differences, they are well within tolerances.

Bottom line, special casing kswapd to avoid slab is unpredictable and can lead to abnormal results for normal workloads.

This patch restores the expected behaviour that slab and page cache are balanced consistently for a workload with a steady allocation ratio of slab/pagecache pages. It also means that if there are workloads that favour the preservation of slab over pagecache, this can be tuned via vm.vfs_cache_pressure, whereas the vanilla kernel effectively ignores the parameter when boosting is active.

Link: http://lkml.kernel.org/r/20190808182946.GM2739@techsingularity.net
Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org> [5.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
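As a cross-check of the vmstat figures in this commit message, the "Kswapd efficiency %" values are consistent with reclaimed pages divided by scanned pages. The sketch below assumes that definition; the function name is hypothetical, and the inputs are copied from the commit message.

```python
def kswapd_efficiency(reclaimed, scanned):
    """Percentage of scanned pages that kswapd actually reclaimed (assumed definition)."""
    return 100.0 * reclaimed / scanned


vanilla = kswapd_efficiency(30869570, 45382917)
patched = kswapd_efficiency(25239655, 33451026)

# Ratio of "Kswapd inode steals" with the patch versus the vanilla kernel.
inode_steal_ratio = 5183206 / 570180

print(round(vanilla, 2), round(patched, 2), round(inode_steal_ratio, 1))
```

The inode steal ratio comes out at roughly nine, in line with the "10 times more slab" observation in the message.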
-
Committed by Andrey Ryabinin
to #28825456
commit 8118b82eb756e271929697e8ada5f637dc443af1 upstream.

Commit 0a79cdad5eb2 ("mm: use alloc_flags to record if kswapd can wake") removed setting of the ALLOC_NOFRAGMENT flag. Bring it back.

The runtime effect is that ALLOC_NOFRAGMENT behaviour is restored so that allocations are spread across local zones to avoid fragmentation due to mixing pageblocks as long as possible.

Link: http://lkml.kernel.org/r/20190423120806.3503-2-aryabinin@virtuozzo.com
Fixes: 0a79cdad5eb2 ("mm: use alloc_flags to record if kswapd can wake")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Andrey Ryabinin
to #28825456
commit 8139ad043d632c0e9e12d760068a7a8e91659aa1 upstream.

ac.preferred_zoneref->zone passed to alloc_flags_nofragment() can be NULL. The 'zone' pointer is unconditionally dereferenced in alloc_flags_nofragment(). Bail out on a NULL zone to avoid a potential crash. Currently we don't see any crashes only because alloc_flags_nofragment() has another bug which allows the compiler to optimize away all accesses to 'zone'.

Link: http://lkml.kernel.org/r/20190423120806.3503-1-aryabinin@virtuozzo.com
Fixes: 6bb154504f8b ("mm, page_alloc: spread allocations across zones before introducing fragmentation")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Mel Gorman
to #28825456

commit 24512228b7a3f412b5a51f189df302616b021c33 upstream.

Mikulas Patocka reported that commit 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs") "broke" memory management on parisc.

The machine is not NUMA but the DISCONTIG model creates three pgdats even though it's a UMA machine for the following ranges

  0) Start 0x0000000000000000 End 0x000000003fffffff Size   1024 MB
  1) Start 0x0000000100000000 End 0x00000001bfdfffff Size   3070 MB
  2) Start 0x0000004040000000 End 0x00000040ffffffff Size   3072 MB

Mikulas reported:

  With the patch 1c30844d2, the kernel will incorrectly reclaim the first zone when it fills up, ignoring the fact that there are two completely free zones. Basically, it limits cache size to 1GiB.

  For example, if I run:
  # dd if=/dev/sda of=/dev/null bs=1M count=2048

  - with the proper kernel, there should be "Buffers - 2GiB" when this command finishes. With the patch 1c30844d2, buffers will consume just 1GiB or slightly more, because the kernel was incorrectly reclaiming them.

The page allocator and reclaim make assumptions that pgdats really represent NUMA nodes and zones represent ranges, and make decisions on that basis. Watermark boosting for small pgdats leads to unexpected results even though this would have behaved reasonably on SPARSEMEM.

DISCONTIG is essentially deprecated and even parisc plans to move to SPARSEMEM so there is no need to be fancy; this patch simply disables watermark boosting by default on DISCONTIGMEM.
Link: http://lkml.kernel.org/r/20190419094335.GJ18914@techsingularity.net
Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Mel Gorman
to #28825456

commit 94b3334cbebea34d56a7e6321c6fe9d89b309a49 upstream.

Yury Norov reported that an arm64 KVM instance could not boot since v5.0-rc1 and that the problem could be addressed by reverting the patches

  1c30844d2dfe272d58c ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
  73444bc4d8f92e46a20 ("mm, page_alloc: do not wake kswapd with zone lock held")

The problem is that a division by zero error is possible if boosting occurs very early in boot on a system with very little memory. This patch avoids the division by zero error.

Link: http://lkml.kernel.org/r/20190213143012.GT9565@techsingularity.net
Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Yury Norov <yury.norov@gmail.com>
Tested-by: Yury Norov <yury.norov@gmail.com>
Tested-by: Will Deacon <will.deacon@arm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
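The commit message does not say where the division happens, so the following is only a sketch of one plausible guard: skip boosting entirely while the high watermark is still zero, so that no later arithmetic sees a boosted zone with uninitialised watermarks. The struct, the function name, the 512-page pageblock size and the "fractions of 10,000" scaling of the boost factor are assumptions made for this illustration.

```c
/* Hypothetical model of a zone's watermark state, counted in pages. */
struct zone_wmark_sketch {
    unsigned long high_wmark;      /* high watermark; 0 until initialised */
    unsigned long watermark_boost; /* current temporary boost */
};

/* Assumption: order-9 pageblocks, i.e. 512 base pages. */
#define PAGEBLOCK_NR_PAGES_SKETCH 512UL

/*
 * Raise the boost by one pageblock, capped relative to the high
 * watermark. boost_factor is assumed to be in units of 1/10000.
 */
void boost_watermark_sketch(struct zone_wmark_sketch *z,
                            unsigned long boost_factor)
{
    unsigned long max_boost;

    if (boost_factor == 0)
        return; /* boosting disabled */

    max_boost = z->high_wmark * boost_factor / 10000;

    /* Watermarks not set up yet (very early boot): do not boost. */
    if (max_boost == 0)
        return;

    if (max_boost < PAGEBLOCK_NR_PAGES_SKETCH)
        max_boost = PAGEBLOCK_NR_PAGES_SKETCH;

    z->watermark_boost += PAGEBLOCK_NR_PAGES_SKETCH;
    if (z->watermark_boost > max_boost)
        z->watermark_boost = max_boost;
}
```

Returning early, rather than falling through to a minimum boost, keeps an uninitialised zone in its unboosted state.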
-
Committed by Mel Gorman
to #28825456

commit 73444bc4d8f92e46a20cb6bd3342fc2ea75c6787 upstream.

syzbot reported the following regression in the latest merge window and it was confirmed by Qian Cai that a similar bug was visible from a different context.

  ======================================================
  WARNING: possible circular locking dependency detected
  4.20.0+ #297 Not tainted
  ------------------------------------------------------
  syz-executor0/8529 is trying to acquire lock:
  000000005e7fb829 (&pgdat->kswapd_wait){....}, at: __wake_up_common_lock+0x19e/0x330 kernel/sched/wait.c:120

  but task is already holding lock:
  000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: spin_lock include/linux/spinlock.h:329 [inline]
  000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue_bulk mm/page_alloc.c:2548 [inline]
  000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: __rmqueue_pcplist mm/page_alloc.c:3021 [inline]
  000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue_pcplist mm/page_alloc.c:3050 [inline]
  000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue mm/page_alloc.c:3072 [inline]
  000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: get_page_from_freelist+0x1bae/0x52a0 mm/page_alloc.c:3491

It appears to be a false positive in that the only way the lock ordering should be inverted is if kswapd is waking itself and the wakeup allocates debugging objects, which should already be allocated if it's kswapd doing the waking. Nevertheless, the possibility exists and so it's best to avoid the problem.

This patch flags a zone as needing a kswapd wakeup using the, surprisingly, unused zone flags field. The flag is read without the lock held to do the wakeup. It's possible that the flag setting context is not the same as the flag clearing context or for small races to occur. However, each race possibility is harmless and there is no visible degradation in fragmentation treatment.
While zone->flag could have continued to be unused, there is potential for moving some existing fields into the flags field instead. Particularly read-mostly ones like zone->initialized and zone->contiguous.

Link: http://lkml.kernel.org/r/20190103225712.GJ31517@techsingularity.net
Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
Reported-by: syzbot+93d94a001cfbce9e60e1@syzkaller.appspotmail.com
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Qian Cai <cai@lca.pw>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Conflicts:
	include/linux/mmzone.h

Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
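The "record under the lock, wake up after dropping it" idea can be sketched in user-space C as below. This is not the kernel code: the struct, the bit name and both function names are hypothetical, and the sketch uses a C11 atomic test-and-clear, which is stricter than the lockless flag read the commit message describes (the message explicitly tolerates small races).

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit position for "this zone wants kswapd woken". */
#define ZONE_NEEDS_KSWAPD_BIT_SKETCH 0

/* Hypothetical stand-in for the relevant parts of struct zone. */
struct zone_flags_sketch {
    atomic_ulong flags;  /* per-zone flag word */
    int kswapd_wakeups;  /* counts simulated wakeups */
};

/* Called while the (simulated) zone lock is held: only record the request. */
void note_kswapd_needed(struct zone_flags_sketch *z)
{
    atomic_fetch_or(&z->flags, 1UL << ZONE_NEEDS_KSWAPD_BIT_SKETCH);
}

/*
 * Called after the zone lock has been released: clear the flag and,
 * if it was set, perform the wakeup outside the lock.
 */
bool wake_kswapd_if_flagged(struct zone_flags_sketch *z)
{
    unsigned long bit = 1UL << ZONE_NEEDS_KSWAPD_BIT_SKETCH;
    unsigned long old = atomic_fetch_and(&z->flags, ~bit);

    if (!(old & bit))
        return false;

    z->kswapd_wakeups++; /* stands in for the real wakeup call */
    return true;
}
```

Because the wakeup never runs with the zone lock held, the lock ordering reported by lockdep cannot occur in this arrangement.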
-
Committed by Mel Gorman
to #28825456

commit 1c30844d2dfe272d58c8fc000960b835d13aa2ac upstream.

An external fragmentation event was previously described as

    When the page allocator fragments memory, it records the event using the mm_page_alloc_extfrag event. If the fallback_order is smaller than a pageblock order (order-9 on 64-bit x86) then it's considered an event that will cause external fragmentation issues in the future.

The kernel reduces the probability of such events by increasing the watermark sizes by calling set_recommended_min_free_kbytes early in the lifetime of the system. This works reasonably well in general but if there are enough sparsely populated pageblocks then the problem can still occur as enough memory is free overall and kswapd stays asleep.

This patch introduces a watermark_boost_factor sysctl that allows a zone watermark to be temporarily boosted when an external fragmentation causing event occurs. The boosting will stall allocations that would decrease free memory below the boosted low watermark and kswapd is woken, if the calling context allows, to reclaim an amount of memory relative to the size of the high watermark and the watermark_boost_factor until the boost is cleared. When kswapd finishes, it wakes kcompactd at the pageblock order to clean some of the pageblocks that may have been affected by the fragmentation event. kswapd avoids any writeback, slab shrinkage and swap from reclaim context during this operation to avoid excessive system disruption in the name of fragmentation avoidance. Care is taken so that kswapd will do normal reclaim work if the system is really low on memory.

This was evaluated using the same workloads as "mm, page_alloc: Spread allocations across zones before introducing fragmentation".
1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------

4.20-rc3 extfrag events < order 9:   804694
4.20-rc3+patch:                      408912 (49% reduction)
4.20-rc3+patch1-4:                    18421 (98% reduction)

                                   4.20.0-rc3             4.20.0-rc3
                                 lowzone-v5r8             boost-v5r8
Amean     fault-base-1      653.58 (   0.00%)      652.71 (   0.13%)
Amean     fault-huge-1        0.00 (   0.00%)      178.93 * -99.00%*

                                   4.20.0-rc3             4.20.0-rc3
                                 lowzone-v5r8             boost-v5r8
Percentage huge-1             0.00 (   0.00%)        5.12 ( 100.00%)

Note that external fragmentation causing events are massively reduced by this patch whether in comparison to the previous kernel or the vanilla kernel. The fault latency for huge pages appears to be increased but that is only because THP allocations were successful with the patch applied.

1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:   291392
4.20-rc3+patch:                      191187 (34% reduction)
4.20-rc3+patch1-4:                    13464 (95% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                 lowzone-v5r8             boost-v5r8
Min       fault-base-1      912.00 (   0.00%)      905.00 (   0.77%)
Min       fault-huge-1      127.00 (   0.00%)      135.00 (  -6.30%)
Amean     fault-base-1     1467.55 (   0.00%)     1481.67 (  -0.96%)
Amean     fault-huge-1     1127.11 (   0.00%)     1063.88 *   5.61%*

                                   4.20.0-rc3             4.20.0-rc3
                                 lowzone-v5r8             boost-v5r8
Percentage huge-1            77.64 (   0.00%)       83.46 (   7.49%)

As before, massive reduction in external fragmentation events, some jitter on latencies and an increase in THP allocation success rates.
2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:   215698
4.20-rc3+patch:                      200210 (7% reduction)
4.20-rc3+patch1-4:                    14263 (93% reduction)

                                   4.20.0-rc3             4.20.0-rc3
                                 lowzone-v5r8             boost-v5r8
Amean     fault-base-5     1346.45 (   0.00%)     1306.87 (   2.94%)
Amean     fault-huge-5     3418.60 (   0.00%)     1348.94 (  60.54%)

                                   4.20.0-rc3             4.20.0-rc3
                                 lowzone-v5r8             boost-v5r8
Percentage huge-5             0.78 (   0.00%)        7.91 ( 910.64%)

There is a 93% reduction in fragmentation causing events, there is a big reduction in the huge page fault latency and allocation success rate is higher.

2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:   166352
4.20-rc3+patch:                      147463 (11% reduction)
4.20-rc3+patch1-4:                    11095 (93% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                 lowzone-v5r8             boost-v5r8
Amean     fault-base-5     6217.43 (   0.00%)     7419.67 * -19.34%*
Amean     fault-huge-5     3163.33 (   0.00%)     3263.80 (  -3.18%)

                                   4.20.0-rc3             4.20.0-rc3
                                 lowzone-v5r8             boost-v5r8
Percentage huge-5            95.14 (   0.00%)       87.98 (  -7.53%)

There is a large reduction in fragmentation events with some jitter around the latencies and success rates. As before, the high THP allocation success rate does mean the system is under a lot of pressure. However, as the fragmentation events are reduced, it would be expected that the long-term allocation success rate would be higher.
Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
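The "% reduction" figures quoted in this commit message appear to be the relative drop in extfrag event counts, truncated to a whole percent (for example 804694 to 408912 is about 49.2%, reported as 49%). A small helper to reproduce that arithmetic, assuming truncation rather than rounding; the function name is made up for this sketch:

```c
/*
 * Percentage reduction from `before` to `after`, truncated towards
 * zero. Returns 0 when there was no reduction or `before` is 0.
 */
unsigned int percent_reduction(unsigned long before, unsigned long after)
{
    if (before == 0 || after >= before)
        return 0;

    return (unsigned int)((before - after) * 100 / before);
}
```

For the 2-socket Haswell run without madvise, `percent_reduction(215698, 200210)` gives 7, which matches the reported 7% even though the exact value is closer to 7.2%.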
-
Committed by Mel Gorman
to #28825456

commit 0a79cdad5eb213b3a629e624565b1b3bf9192b7c upstream.

This is a preparation patch that copies the GFP flag __GFP_KSWAPD_RECLAIM into alloc_flags. Its only purpose is to avoid having to pass gfp_mask through a long callchain in a future patch.

Note that the setting in the fast path happens in alloc_flags_nofragment() and it may be claimed that this has nothing to do with ALLOC_NOFRAGMENT. That's true in this patch but is not true later, so it's done now for easier review to show where the flag needs to be recorded.

No functional change.

[mgorman@techsingularity.net: ALLOC_KSWAPD flag needs to be applied in the !CONFIG_ZONE_DMA32 case]
Link: http://lkml.kernel.org/r/20181126143503.GO23260@techsingularity.net
Link: http://lkml.kernel.org/r/20181123114528.28802-4-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
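The flag-copying step can be illustrated with a few lines of C. The flag names ALLOC_KSWAPD and __GFP_KSWAPD_RECLAIM come from the commit message; the numeric values and the two function names below are placeholders invented for this sketch and do not match the kernel's definitions.

```c
/* Placeholder bit values; the kernel's real definitions differ. */
#define GFP_KSWAPD_RECLAIM_SKETCH 0x1000u
#define ALLOC_KSWAPD_SKETCH       0x0200u

/*
 * Record once, at the top of the allocation path, whether the caller's
 * GFP mask permits waking kswapd.
 */
unsigned int record_kswapd_flag(unsigned int gfp_mask, unsigned int alloc_flags)
{
    if (gfp_mask & GFP_KSWAPD_RECLAIM_SKETCH)
        alloc_flags |= ALLOC_KSWAPD_SKETCH;

    return alloc_flags;
}

/* Deep in the call chain only alloc_flags is needed, not gfp_mask. */
int may_wake_kswapd(unsigned int alloc_flags)
{
    return (alloc_flags & ALLOC_KSWAPD_SKETCH) != 0;
}
```

The benefit is purely structural: callees that already receive `alloc_flags` no longer need an extra `gfp_mask` parameter just to answer this one question.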
-
Committed by Mel Gorman
to #28825456

commit 6bb154504f8b496780ec53ec81aba957a12981fa upstream.

Patch series "Fragmentation avoidance improvements", v5.

It has been noted before that fragmentation avoidance (aka anti-fragmentation) is not perfect. Given sufficient time or an adverse workload, memory gets fragmented and the long-term success of high-order allocations degrades. This series defines an adverse workload, a definition of external fragmentation events (including serious ones) and a series that reduces the level of those fragmentation events.

The details of the workload and the consequences are described in more detail in the changelogs. However, from patch 1, this is a high-level summary of the adverse workload. The exact details are found in the mmtests implementation.

The broad details of the workload are as follows;

1. Create an XFS filesystem (not specified in the configuration but done as part of the testing for this patch)
2. Start 4 fio threads that write a number of 64K files inefficiently. Inefficiently means that files are created on first access and not created in advance (fio parameter create_on_open=1) and fallocate is not used (fallocate=none). With multiple IO issuers this creates a mix of slab and page cache allocations over time. The total size of the files is 150% physical memory so that the slabs and page cache pages get mixed
3. Warm up a number of fio read-only threads accessing the same files created in step 2. This part runs for the same length of time it took to create the files. It'll fault back in old data and further interleave slab and page cache allocations. As it's now low on memory due to step 2, fragmentation occurs as pageblocks get stolen.
4. While step 3 is still running, start a process that tries to allocate 75% of memory as huge pages with a number of threads. The number of threads is based on a (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP threads contending with fio, any other threads or forcing cross-NUMA scheduling. Note that the test has not been used on a machine with fewer than 8 cores. The benchmark records whether huge pages were allocated and what the fault latency was in microseconds
5. Measure the number of events potentially causing external fragmentation, the fault latency and the huge page allocation success rate.
6. Cleanup

Overall the series reduces external fragmentation causing events by over 94% on 1 and 2 socket machines, which in turn impacts high-order allocation success rates over the long term. There are differences in latencies and high-order allocation success rates. Latencies are a mixed bag as they are vulnerable to exact system state and whether allocations succeeded so they are treated as a secondary metric.

Patch 1 uses lower zones if they are populated and have free memory instead of fragmenting a higher zone. It's special cased to handle a Normal->DMA32 fallback with the reasons explained in the changelog.

Patches 2-4 boost watermarks temporarily when an external fragmentation event occurs. kswapd wakes to reclaim a small amount of old memory and then wakes kcompactd on completion to recover the system slightly. This introduces some overhead in the slowpath. The level of boosting can be tuned or disabled depending on the tolerance for fragmentation vs allocation latency.

Patch 5 stalls some movable allocation requests to let kswapd from patch 4 make some progress. The duration of the stalls is very low but it is possible to tune the system to avoid fragmentation events if larger stalls can be tolerated.

The bulk of the improvement in fragmentation avoidance is from patches 1-4 but patch 5 can deal with a rare corner case and provides the option of tuning a system for THP allocation success rates in exchange for some stalls to control fragmentation.

This patch (of 5):

The page allocator zone lists are iterated based on the watermarks of each zone which does not take anti-fragmentation into account.
On x86, node 0 may have multiple zones while other nodes have one zone. A consequence is that tasks running on node 0 may fragment ZONE_NORMAL even though ZONE_DMA32 has plenty of free memory. This patch special cases the allocator fast path such that it'll try an allocation from a lower local zone before fragmenting a higher zone. In this case, stealing of pageblocks or orders larger than a pageblock are still allowed in the fast path as they are uninteresting from a fragmentation point of view.

This was evaluated using a benchmark designed to fragment memory before attempting THP allocations. It's implemented in mmtests as the following configurations

  configs/config-global-dhp__workload_thpfioscale
  configs/config-global-dhp__workload_thpfioscale-defrag
  configs/config-global-dhp__workload_thpfioscale-madvhugepage

e.g. from mmtests

  ./run-mmtests.sh --run-monitor --config configs/config-global-dhp__workload_thpfioscale test-run-1

The broad details of the workload are as follows;

1. Create an XFS filesystem (not specified in the configuration but done as part of the testing for this patch).
2. Start 4 fio threads that write a number of 64K files inefficiently. Inefficiently means that files are created on first access and not created in advance (fio parameter create_on_open=1) and fallocate is not used (fallocate=none). With multiple IO issuers this creates a mix of slab and page cache allocations over time. The total size of the files is 150% physical memory so that the slabs and page cache pages get mixed.
3. Warm up a number of fio read-only processes accessing the same files created in step 2. This part runs for the same length of time it took to create the files. It'll refault old data and further interleave slab and page cache allocations. As it's now low on memory due to step 2, fragmentation occurs as pageblocks get stolen.
4. While step 3 is still running, start a process that tries to allocate 75% of memory as huge pages with a number of threads. The number of threads is based on a (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP threads contending with fio, any other threads or forcing cross-NUMA scheduling. Note that the test has not been used on a machine with fewer than 8 cores. The benchmark records whether huge pages were allocated and what the fault latency was in microseconds.
5. Measure the number of events potentially causing external fragmentation, the fault latency and the huge page allocation success rate.
6. Cleanup the test files.

Note that due to the use of IO and page cache, this benchmark is not suitable for running on large machines where the time to fragment memory may be excessive. Also note that while this is one mix that generates fragmentation, it's not the only mix that generates fragmentation. Differences in workload that are more slab-intensive or whether SLUB is used with high-order pages may yield different results.

When the page allocator fragments memory, it records the event using the mm_page_alloc_extfrag ftrace event. If the fallback_order is smaller than a pageblock order (order-9 on 64-bit x86) then it's considered to be an "external fragmentation event" that may cause issues in the future. Hence, the primary metric here is the number of external fragmentation events that occur with order < 9. The secondary metric is allocation latency and huge page allocation success rates, but note that differences in latencies and in the success rate can also affect the number of external fragmentation events, which is why it's a secondary metric.
1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------

4.20-rc3 extfrag events < order 9:   804694
4.20-rc3+patch:                      408912 (49% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                      vanilla           lowzone-v5r8
Amean     fault-base-1      662.92 (   0.00%)      653.58 *   1.41%*
Amean     fault-huge-1        0.00 (   0.00%)        0.00 (   0.00%)

                                   4.20.0-rc3             4.20.0-rc3
                                      vanilla           lowzone-v5r8
Percentage huge-1             0.00 (   0.00%)        0.00 (   0.00%)

Fault latencies are slightly reduced while allocation success rates remain at zero as this configuration does not make any special effort to allocate THP and fio is heavily active at the time and either filling memory or keeping pages resident. However, a 49% reduction of serious fragmentation events reduces the chances of external fragmentation being a problem in the future.

Vlastimil asked during review for a breakdown of the allocation types that are falling back.

vanilla
   3816 MIGRATE_UNMOVABLE
 800845 MIGRATE_MOVABLE
     33 MIGRATE_UNRECLAIMABLE

patch
    735 MIGRATE_UNMOVABLE
 408135 MIGRATE_MOVABLE
     42 MIGRATE_UNRECLAIMABLE

The majority of the fallbacks are due to movable allocations and this is consistent for the workload throughout the series so will not be presented again as the primary source of fallbacks are movable allocations.

Movable fallbacks are sometimes considered "ok" to fall back because they can be migrated. The problem is that they can fill an unmovable/reclaimable pageblock causing those allocations to fall back later and polluting pageblocks with pages that cannot move. If there is a movable fallback, it is pretty much guaranteed to affect an unmovable/reclaimable pageblock and while it might not be enough to actually cause an unmovable/reclaimable fallback in the future, we cannot know that in advance so the patch takes the only option available to it. Hence, it's important to control them.
This point is also consistent throughout the series and will not be repeated.

1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:   291392
4.20-rc3+patch:                      191187 (34% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                      vanilla           lowzone-v5r8
Amean     fault-base-1     1495.14 (   0.00%)     1467.55 (   1.85%)
Amean     fault-huge-1     1098.48 (   0.00%)     1127.11 (  -2.61%)

thpfioscale Percentage Faults Huge
                                   4.20.0-rc3             4.20.0-rc3
                                      vanilla           lowzone-v5r8
Percentage huge-1            78.57 (   0.00%)       77.64 (  -1.18%)

Fragmentation events were reduced quite a bit although this is known to be a little variable. The latencies and allocation success rates are similar but they were already quite high.

2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:   215698
4.20-rc3+patch:                      200210 (7% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                      vanilla           lowzone-v5r8
Amean     fault-base-5     1350.05 (   0.00%)     1346.45 (   0.27%)
Amean     fault-huge-5     4181.01 (   0.00%)     3418.60 (  18.24%)

                                   4.20.0-rc3             4.20.0-rc3
                                      vanilla           lowzone-v5r8
Percentage huge-5             1.15 (   0.00%)        0.78 ( -31.88%)

The reduction of external fragmentation events is slight and this is partially due to the removal of __GFP_THISNODE in commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings") as THP allocations can now spill over to remote nodes instead of fragmenting local memory.
2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:   166352
4.20-rc3+patch:                      147463 (11% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                      vanilla           lowzone-v5r8
Amean     fault-base-5     6138.97 (   0.00%)     6217.43 (  -1.28%)
Amean     fault-huge-5     2294.28 (   0.00%)     3163.33 * -37.88%*

thpfioscale Percentage Faults Huge
                                   4.20.0-rc3             4.20.0-rc3
                                      vanilla           lowzone-v5r8
Percentage huge-5            96.82 (   0.00%)       95.14 (  -1.74%)

There was a slight reduction in external fragmentation events although the latencies were higher. The allocation success rate is high enough that the system is struggling and there is quite a lot of parallel reclaim and compaction activity. There is also a certain degree of luck on whether processes start on node 0 or not for this patch but the relevance is reduced later in the series.

Overall, the patch reduces the number of external fragmentation causing events so the success of THP over long periods of time would be improved for this adverse workload.

Link: http://lkml.kernel.org/r/20181123114528.28802-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Conflicts:
	mm/page_alloc.c

Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
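Two small rules from this changelog lend themselves to a worked example: the test that classifies a fallback as an external fragmentation event (fallback order below the pageblock order, order-9 on 64-bit x86), and the benchmark's THP thread count of (NR_CPUS_SOCKET - NR_FIO_THREADS)/4. The function names below are invented for this sketch, and the CPU counts used in the note after it are illustrative inputs only, not statements about the test machines.

```c
/* Pageblock order quoted in the changelog for 64-bit x86. */
#define PAGEBLOCK_ORDER_X86_64 9

/*
 * A fallback counts as an external fragmentation event when the
 * fallback order is smaller than the pageblock order.
 */
int is_extfrag_event(int fallback_order, int pageblock_order)
{
    return fallback_order < pageblock_order;
}

/*
 * Number of THP-allocating threads used by the benchmark:
 * (NR_CPUS_SOCKET - NR_FIO_THREADS) / 4, using integer division.
 */
int thp_thread_count(int nr_cpus_socket, int nr_fio_threads)
{
    return (nr_cpus_socket - nr_fio_threads) / 4;
}
```

With 4 fio threads, an 8-CPU socket would give 1 THP thread and a 24-CPU socket would give 5, which is consistent with the thread counts listed in the result headers.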
-
Committed by Jan Kara
to #28718400

commit 5c72feee3e45b40a3c96c7145ec422899d0e8964 upstream.

When handling a page fault, we drop mmap_sem to start async readahead so that we don't block on IO submission with mmap_sem held. However, there's no point in dropping mmap_sem when readahead is disabled. Handle that case to avoid pointlessly dropping mmap_sem and retrying the fault. This was actually reported to block mlockall(MCL_CURRENT) indefinitely.

Fixes: 6b4c9f446981 ("filemap: drop the mmap_sem for all blocking operations")
Reported-by: Minchan Kim <minchan@kernel.org>
Reported-by: Robert Stupp <snazy@gmx.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Link: http://lkml.kernel.org/r/20200212101356.30759-1-jack@suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
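The check described here boils down to "do not drop the lock when there is no readahead to start". A minimal sketch, assuming a readahead window size of zero is how "readahead disabled" is represented; the struct and function names are hypothetical and only loosely modelled on the kernel's readahead state:

```c
/* Hypothetical stand-in for per-file readahead state. */
struct ra_state_sketch {
    unsigned int ra_pages; /* readahead window in pages; 0 means disabled */
};

/*
 * Return non-zero if the fault path should drop mmap_sem in order to
 * start asynchronous readahead. With readahead disabled nothing would
 * be submitted, so dropping the lock would only force a pointless retry.
 */
int should_drop_mmap_sem_for_readahead(const struct ra_state_sketch *ra)
{
    if (ra->ra_pages == 0)
        return 0;

    return 1;
}
```

A caller that gets 0 back keeps the lock and continues with the fault directly, which avoids the retry loop reported to stall mlockall(MCL_CURRENT).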
-
Committed by Yang Shi
to #28718400

commit 7a30df49f63ad92318ddf1f7498d1129a77dd4bd upstream.

A few new fields were added to mmu_gather to make TLB flush smarter for huge pages by telling what level of page table is changed.

__tlb_reset_range() is used to reset all of this page table state to unchanged, and is called by TLB flush for parallel mapping changes for the same range under a non-exclusive lock (i.e. read mmap_sem).

Before commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap"), the syscalls (e.g. MADV_DONTNEED, MADV_FREE) which may update PTEs in parallel don't remove page tables. But the aforementioned commit may do munmap() under read mmap_sem and free page tables. This may result in a program hang on aarch64, as reported by Jan Stancek. The problem could be reproduced by his test program, slightly modified, below.

---8<---

static int map_size = 4096;
static int num_iter = 500;
static long threads_total;

static void *distant_area;

void *map_write_unmap(void *ptr)
{
	int *fd = ptr;
	unsigned char *map_address;
	int i, j = 0;

	for (i = 0; i < num_iter; i++) {
		map_address = mmap(distant_area, (size_t) map_size,
				   PROT_WRITE | PROT_READ,
				   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
		if (map_address == MAP_FAILED) {
			perror("mmap");
			exit(1);
		}

		for (j = 0; j < map_size; j++)
			map_address[j] = 'b';

		if (munmap(map_address, map_size) == -1) {
			perror("munmap");
			exit(1);
		}
	}

	return NULL;
}

void *dummy(void *ptr)
{
	return NULL;
}

int main(void)
{
	pthread_t thid[2];

	/* hint for mmap in map_write_unmap() */
	distant_area = mmap(0, DISTANT_MMAP_SIZE, PROT_WRITE | PROT_READ,
			    MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	munmap(distant_area, (size_t)DISTANT_MMAP_SIZE);
	distant_area += DISTANT_MMAP_SIZE / 2;

	while (1) {
		pthread_create(&thid[0], NULL, map_write_unmap, NULL);
		pthread_create(&thid[1], NULL, dummy, NULL);

		pthread_join(thid[0], NULL);
		pthread_join(thid[1], NULL);
	}
}
---8<---

The program may bring in parallel execution like below:

        t1                                        t2
munmap(map_address)
  downgrade_write(&mm->mmap_sem);
  unmap_region()
  tlb_gather_mmu()
    inc_tlb_flush_pending(tlb->mm);
  free_pgtables()
    tlb->freed_tables = 1
    tlb->cleared_pmds = 1

                                        pthread_exit()
                                        madvise(thread_stack, 8M, MADV_DONTNEED)
                                          zap_page_range()
                                            tlb_gather_mmu()
                                              inc_tlb_flush_pending(tlb->mm);

  tlb_finish_mmu()
    if (mm_tlb_flush_nested(tlb->mm))
      __tlb_reset_range()

__tlb_reset_range() would reset the freed_tables and cleared_* bits, but this may cause inconsistency for munmap(), which does free page tables. As a result, some architectures, e.g. aarch64, may not flush the TLB completely as expected, leaving stale TLB entries behind.

Use fullmm flush since it yields much better performance on aarch64 and non-fullmm doesn't yield a significant difference on x86.

The original proposed fix came from Jan Stancek, who mainly debugged this issue; I just wrapped up everything together.

Jan's testing results:

v5.2-rc2-24-gbec7550cca10
--------------------------
         mean     stddev
real    37.382   2.780
user     1.420   0.078
sys     54.658   1.855

v5.2-rc2-24-gbec7550cca10 + "mm: mmu_gather: remove __tlb_reset_range() for force flush"
----------------------------------------------------------------------------------------
         mean     stddev
real    37.119   2.105
user     1.548   0.087
sys     55.698   1.357

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1558322252-113575-1-git-send-email-yang.shi@linux.alibaba.com
Fixes: dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Reported-by: Jan Stancek <jstancek@redhat.com>
Tested-by: Jan Stancek <jstancek@redhat.com>
Suggested-by: Will Deacon <will.deacon@arm.com>
Tested-by: Will Deacon <will.deacon@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: <stable@vger.kernel.org> [4.20+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[xuyu: backport from mm/mmu_gather.c to mm/memory.c]
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
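The hazard and the chosen remedy can be modelled with a toy state struct. This is only a sketch of the idea in the changelog (on a nested flush, fall back to a full-mm flush and keep the "page tables were freed" information instead of resetting it); the struct, its fields as modelled here, and the function name are simplifications invented for illustration, not the kernel's mmu_gather.

```c
/* Toy model of the bits of mmu_gather state discussed above. */
struct mmu_gather_sketch {
    int fullmm;        /* flush the whole address space */
    int freed_tables;  /* page tables were freed in this range */
    int cleared_pmds;  /* PMD-level entries were cleared */
};

/*
 * Finish a gather. If another thread gathered the same mm in parallel
 * (nested != 0), its view of what changed cannot be trusted, so be
 * conservative: request a full flush and assume tables were freed,
 * rather than resetting the state to "nothing changed".
 */
void tlb_finish_sketch(struct mmu_gather_sketch *tlb, int nested)
{
    if (nested) {
        tlb->fullmm = 1;
        tlb->freed_tables = 1;
    }
}
```

Erring towards a larger flush costs some performance in the rare nested case but cannot leave stale entries behind, which is the failure mode the reproducer above triggers.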
-
Committed by Josef Bacik
to #28718400

commit 6b4c9f4469819a0c1a38a0a4541337e0f9bf6c11 upstream.

Currently we only drop the mmap_sem if there is contention on the page lock. The idea is that we issue readahead and then go to lock the page while it is under IO and we want to not hold the mmap_sem during the IO.

The problem with this is the assumption that the readahead does anything. In the case that the box is under extreme memory or IO pressure we may end up not reading anything at all for readahead, which means we will end up reading in the page under the mmap_sem.

Even if the readahead does something, it could get throttled because of IO pressure on the system and the process is in a lower priority cgroup.

Holding the mmap_sem while doing IO is problematic because it can cause system-wide priority inversions. Consider some large company that does a lot of web traffic. This large company has load balancing logic in its core web server, because some engineer thought this was a brilliant plan. This load balancing logic gets statistics from /proc about the system, which trip over processes' mmap_sem for various reasons. Now the web server application is in a protected cgroup, but these other processes may not be, and if they are being throttled while their mmap_sem is held we'll stall, and cause this nice death spiral.

Instead rework the filemap fault path to drop the mmap_sem at any point that we may do IO or block for an extended period of time. This includes while issuing readahead, locking the page, or needing to call ->readpage because readahead did not occur. Then once we have a fully uptodate page we can return with VM_FAULT_RETRY and come back again to find our nicely in-cache page that was gotten outside of the mmap_sem.

This patch also adds a new helper for locking the page with the mmap_sem dropped. This doesn't make sense currently as generally speaking if the page is already locked it'll have been read in (unless there was an error) before it was unlocked.
However a forthcoming patchset will change this with the ability to abort read-ahead bio's if necessary, making it more likely that we could contend for a page lock and still have a not uptodate page. This allows us to deal with this case by grabbing the lock and issuing the IO without the mmap_sem held, and then returning VM_FAULT_RETRY to come back around. [josef@toxicpanda.com: v6] Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com [kirill@shutemov.name: fix race in filemap_fault()] Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1 [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.comSigned-off-by: NJosef Bacik <josef@toxicpanda.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NJan Kara <jack@suse.cz> Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com Cc: Dave Chinner <david@fromorbit.com> Cc: Rik van Riel <riel@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NXu Yu <xuyu@linux.alibaba.com> Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
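The "drop the lock before IO, then return VM_FAULT_RETRY" flow described in this commit message can be illustrated with a small userspace model. This is only a sketch under stated assumptions: it is single-threaded, the lock is a plain boolean, and every name (`toy_mm`, `toy_page`, `toy_filemap_fault`, `toy_handle_fault`) is hypothetical and not taken from the kernel source.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model, not kernel code: all names below are hypothetical. */
enum fault_result { FAULT_DONE, FAULT_RETRY };

struct toy_mm   { bool mmap_sem_held; };  /* stands in for the mmap_sem read lock */
struct toy_page { bool uptodate; };       /* stands in for a page-cache page */

/* Pretend to read the page from disk; must never run with the lock held. */
static void toy_read_page(struct toy_mm *mm, struct toy_page *page)
{
    assert(!mm->mmap_sem_held);
    page->uptodate = true;
}

/*
 * Called with the lock held. Returns FAULT_DONE with the lock still held,
 * or FAULT_RETRY after dropping the lock and doing the IO without it.
 */
static enum fault_result toy_filemap_fault(struct toy_mm *mm,
                                           struct toy_page *page)
{
    assert(mm->mmap_sem_held);
    if (page->uptodate)
        return FAULT_DONE;          /* fast path: page already in cache */

    mm->mmap_sem_held = false;      /* drop the lock before any blocking IO */
    toy_read_page(mm, page);
    return FAULT_RETRY;             /* caller re-takes the lock and retries */
}

/* Caller loop: take the lock, fault, retry if asked. Returns the attempt count. */
static int toy_handle_fault(struct toy_mm *mm, struct toy_page *page)
{
    int attempts = 0;
    enum fault_result r;

    do {
        mm->mmap_sem_held = true;   /* models down_read() */
        r = toy_filemap_fault(mm, page);
        attempts++;
    } while (r == FAULT_RETRY);

    mm->mmap_sem_held = false;      /* models up_read() */
    return attempts;
}
```

In this model a fault on a cold page takes two passes (one that does the IO unlocked, one that finds the page in cache), while a fault on a cached page takes one pass and never drops the lock.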
-
Committed by Josef Bacik
to #28718400

commit a75d4c33377277b6034dd1e2663bce444f952c14 upstream.

Patch series "drop the mmap_sem when doing IO in the fault path", v6.

Now that we have proper isolation in place with cgroups2 we have started going through and fixing the various priority inversions. Most are gone now, but this one is sort of weird since it's not necessarily a priority inversion that happens within the kernel, but rather because of something userspace does.

We have giant applications that we want to protect, and parts of these giant applications do things like watch the system state to determine how healthy the box is for load balancing and such. This involves running 'ps' or other such utilities. These utilities will often walk /proc/<pid>/whatever, and these files can sometimes need to down_read(&task->mmap_sem). Not usually a big deal, but we noticed when we are stress testing that sometimes our protected application has latency spikes trying to get the mmap_sem for tasks that are in lower priority cgroups.

This is because any down_write() on a semaphore essentially turns it into a mutex, so even if we currently have it held for reading, any new readers will not be allowed on, to keep from starving the writer. This is fine, except a lower priority task could be stuck doing IO because it has been throttled to the point that its IO is taking much longer than normal. But because a higher priority group depends on this completing, it is now stuck behind lower priority work.

In order to avoid this particular priority inversion we want to use the existing retry mechanism to avoid holding the mmap_sem at all if we are going to do IO. This already exists in the read case sort of, but needed to be extended for more than just grabbing the page lock. With io.latency we throttle at submit_bio() time, so the readahead stuff can block and even page_cache_read can block, so all these paths need to have the mmap_sem dropped.

The other big thing is ->page_mkwrite. btrfs is particularly shitty here because we have to reserve space for the dirty page, which can be a very expensive operation. We use the same retry method as the read path, and simply cache the page and verify the page is still set up properly the next pass through ->page_mkwrite().

I've tested these patches with xfstests and there are no regressions.

This patch (of 3):

If we do not have a page at filemap_fault time we'll do this weird forced page_cache_read thing to populate the page, and then drop it again and loop around and find it. This makes for 2 ways we can read a page in filemap_fault, and it's not really needed. Instead add a FGP_FOR_MMAP flag so that pagecache_get_page() will return an unlocked page that's in pagecache. Then use the normal page locking and readpage logic already in filemap_fault. This simplifies the no page in page cache case significantly.

[akpm@linux-foundation.org: fix comment text]
[josef@toxicpanda.com: don't unlock null page in FGP_FOR_MMAP case]
  Link: http://lkml.kernel.org/r/20190312201742.22935-1-josef@toxicpanda.com
Link: http://lkml.kernel.org/r/20181211173801.29535-2-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Conflicts:
	mm/filemap.c

Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
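The priority inversion described in the cover letter above (a queued writer blocks every new reader behind a slow existing reader) can be modeled in a few lines. This is a rough single-threaded sketch, assuming a writer-fair lock as the commit message describes; `toy_rwsem` and its helpers are hypothetical names and do not reflect the kernel's rwsem implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy, single-threaded model of a writer-fair reader/writer semaphore. */
struct toy_rwsem {
    int  readers;         /* current read holders */
    bool writer_held;     /* a writer currently owns the lock */
    bool writer_waiting;  /* a writer is queued behind the readers */
};

/* A new reader is admitted only if no writer holds or waits for the lock. */
static bool toy_try_down_read(struct toy_rwsem *sem)
{
    if (sem->writer_held || sem->writer_waiting)
        return false;               /* the real call would block here */
    sem->readers++;
    return true;
}

/* A writer gets the lock only when it is completely free; otherwise it queues. */
static bool toy_try_down_write(struct toy_rwsem *sem)
{
    if (sem->readers > 0 || sem->writer_held) {
        sem->writer_waiting = true; /* from now on new readers are refused */
        return false;
    }
    sem->writer_waiting = false;
    sem->writer_held = true;
    return true;
}

/* Release one read hold. */
static void toy_up_read(struct toy_rwsem *sem)
{
    assert(sem->readers > 0);
    sem->readers--;
}
```

Read as a timeline: a throttled task takes the read lock and stalls in IO, some writer queues behind it, and from that point a `ps`-style reader from the protected cgroup is refused until the throttled IO completes.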
-
Committed by Josef Bacik
to #28718400

commit 2a1180f1bd389e9d47693e5eb384b95f482d8d19 upstream.

All of the arguments to these functions come from the vmf. Cut down on the number of arguments passed by simply passing in the vmf to these two helpers.

Link: http://lkml.kernel.org/r/20181211173801.29535-3-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
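The shape of this refactoring can be sketched as follows. The structures and helper names here (`toy_vmf`, `toy_ra_end_old`, `toy_ra_end_new`) are hypothetical stand-ins chosen for illustration, not the kernel's actual types or the two helpers the commit touches.

```c
#include <assert.h>

/* Hypothetical, simplified stand-ins for the fault context. */
struct toy_file { unsigned long ra_pages; };
struct toy_vma  { struct toy_file *file; };
struct toy_vmf  { struct toy_vma *vma; unsigned long pgoff; };

/* Before: the caller unpacks the fault context and passes each piece. */
static unsigned long toy_ra_end_old(struct toy_vma *vma,
                                    struct toy_file *file,
                                    unsigned long pgoff)
{
    (void)vma;                      /* carried along but redundant here */
    return pgoff + file->ra_pages;
}

/* After: the helper takes the fault context and derives what it needs. */
static unsigned long toy_ra_end_new(const struct toy_vmf *vmf)
{
    const struct toy_file *file = vmf->vma->file;

    return vmf->pgoff + file->ra_pages;
}
```

Both variants compute the same value; the second keeps call sites shorter and removes the chance of passing arguments that disagree with the fault context.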
-
Committed by Yang Shi
to #28718400

commit cb4922496ae40a775a1b17025eaa1060e8991253 upstream.

When unmapping VM_PFNMAP mappings, vm flags need to be updated. Since the vmas have already been detached, it sounds safe to update vm flags with read mmap_sem.

Link: http://lkml.kernel.org/r/1537376621-51150-4-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Yang Shi
to #28718400

commit b4cefb36051244bcb5651026d862c332a6cac7df upstream.

When unmapping VM_HUGETLB mappings, vm flags need to be updated. Since the vmas have already been detached, it sounds safe to update vm flags with read mmap_sem.

Link: http://lkml.kernel.org/r/1537376621-51150-3-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Yang Shi
to #28718400

commit dd2283f2605e3b3e9c61bcae844b34f2afa4813f upstream.

Patch series "mm: zap pages with read mmap_sem in munmap for large mapping", v11.

Background:
Recently, when we ran some vm scalability tests on machines with large memory, we ran into a couple of mmap_sem scalability issues when unmapping large memory space, please refer to https://lkml.org/lkml/2017/12/14/733 and https://lkml.org/lkml/2018/2/20/576.

History:
Then akpm suggested unmapping a large mapping section by section and dropping mmap_sem each time to mitigate it (see https://lkml.org/lkml/2018/3/6/784).

V1 patch series was submitted to the mailing list per Andrew's suggestion (see https://lkml.org/lkml/2018/3/20/786). Then I received a lot of great feedback and suggestions.

Then this topic was discussed at the LSFMM summit 2018. At the summit, Michal Hocko suggested (also in the v1 patches review) trying a "two phases" approach: zapping pages with read mmap_sem, then doing vma cleanup with write mmap_sem (for discussion detail, see https://lwn.net/Articles/753269/).

Approach:
Zapping pages is the most time consuming part. According to the suggestion from Michal Hocko [1], zapping pages can be done while holding read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write mmap_sem to clean up vmas.

But, we can't call MADV_DONTNEED directly, since there are two major drawbacks:
  * The unexpected state from PF if it wins the race in the middle of munmap. It may return zero page, instead of the content or SIGSEGV.
  * Can't handle VM_LOCKED | VM_HUGETLB | VM_PFNMAP and uprobe mappings, which is a showstopper from akpm.

But, some parts may need write mmap_sem, for example, vma splitting. So, the design is as follows:
        acquire write mmap_sem
        lookup vmas (find and split vmas)
        deal with special mappings
        detach vmas
        downgrade_write

        zap pages
        free page tables
        release mmap_sem

The vm events with read mmap_sem may come in during page zapping, but since vmas have been detached before, they (e.g. page fault, gup, etc.) will not be able to find a valid vma, and then just return SIGSEGV or -EFAULT as expected.

If the vma has VM_HUGETLB | VM_PFNMAP, they are considered as special mappings. They will be handled by falling back to regular do_munmap() with exclusive mmap_sem held in this patch since they may update vm flags.

But, with the "detach vmas first" approach, the vmas have been detached when vm flags are updated, so it sounds safe to update vm flags with read mmap_sem for this specific case. So, VM_HUGETLB and VM_PFNMAP will be handled by using the optimized path in the following separate patches, for the sake of bisectability.

Unmapping uprobe areas may need to update mm flags (MMF_RECALC_UPROBES). However it is fine to have false-positive MMF_RECALC_UPROBES according to the uprobes developer. So, uprobe unmap will not be handled by the regular path.

With the "detach vmas first" approach we don't have to re-acquire mmap_sem to clean up vmas, which avoids a race window in which the address space might change: downgrade_write() never releases the lock, it simply downgrades it to a read lock, so no regression is introduced.

And, since the lock acquire/release cost is kept to the minimum and almost the same as before, the optimization could be extended to any size of mapping without incurring significant penalty to small mappings.

For the time being, just do this in the munmap syscall path. Other vm_munmap() or do_munmap() call sites (e.g. mmap, mremap, etc.) remain intact due to some implementation difficulties, since they acquire write mmap_sem from the very beginning and hold it until the end, and do_munmap() might be called in the middle. But, the optimized do_munmap would like to be called without mmap_sem held so that we can do the optimization. So, if we want to do a similar optimization for the mmap/mremap path, I'm afraid we would have to redesign them. mremap might be called on a very large area depending on the use cases; the optimization for it will be considered in the future.

This patch (of 3):

When running some mmap/munmap scalability tests with large memory (i.e. > 300GB), the below hung task issue may happen occasionally.

    INFO: task ps:14018 blocked for more than 120 seconds.
           Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
     "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
     ps              D    0 14018      1 0x00000004
      ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
      ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
      00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
     Call Trace:
      [<ffffffff817154d0>] ? __schedule+0x250/0x730
      [<ffffffff817159e6>] schedule+0x36/0x80
      [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
      [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
      [<ffffffff81717db0>] down_read+0x20/0x40
      [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
      [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
      [<ffffffff81241d87>] __vfs_read+0x37/0x150
      [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
      [<ffffffff81242266>] vfs_read+0x96/0x130
      [<ffffffff812437b5>] SyS_read+0x55/0xc0
      [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5

It is because munmap holds mmap_sem exclusively from the very beginning all the way down to the end, and doesn't release it in the middle. When unmapping a large mapping, it may take a long time (~18 seconds to unmap a 320GB mapping with every single page mapped on an idle machine).

Zapping pages is the most time consuming part. According to the suggestion from Michal Hocko [1], zapping pages can be done while holding read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write mmap_sem to clean up vmas.

But, some parts may need write mmap_sem, for example, vma splitting. So, the design is as follows:
        acquire write mmap_sem
        lookup vmas (find and split vmas)
        deal with special mappings
        detach vmas
        downgrade_write

        zap pages
        free page tables
        release mmap_sem

The vm events with read mmap_sem may come in during page zapping, but since vmas have been detached before, they (e.g. page fault, gup, etc.) will not be able to find a valid vma, and then just return SIGSEGV or -EFAULT as expected.

If the vma has VM_HUGETLB | VM_PFNMAP, they are considered as special mappings. They will be handled without downgrading mmap_sem in this patch since they may update vm flags.

But, with the "detach vmas first" approach, the vmas have been detached when vm flags are updated, so it sounds safe to update vm flags with read mmap_sem for this specific case. So, VM_HUGETLB and VM_PFNMAP will be handled by using the optimized path in the following separate patches, for the sake of bisectability.

Unmapping uprobe areas may need to update mm flags (MMF_RECALC_UPROBES). However it is fine to have false-positive MMF_RECALC_UPROBES according to the uprobes developer.

With the "detach vmas first" approach we don't have to re-acquire mmap_sem to clean up vmas, which avoids a race window in which the address space might change: downgrade_write() never releases the lock, it simply downgrades it to a read lock, so no regression is introduced.

And, since the lock acquire/release cost is kept to the minimum and almost the same as before, the optimization could be extended to any size of mapping without incurring significant penalty to small mappings.

For the time being, just do this in the munmap syscall path. Other vm_munmap() or do_munmap() call sites (e.g. mmap, mremap, etc.) remain intact due to some implementation difficulties, since they acquire write mmap_sem from the very beginning and hold it until the end, and do_munmap() might be called in the middle. But, the optimized do_munmap would like to be called without mmap_sem held so that we can do the optimization. So, if we want to do a similar optimization for the mmap/mremap path, I'm afraid we would have to redesign them. mremap might be called on a very large area depending on the use cases; the optimization for it will be considered in the future.

With the patches, the exclusive mmap_sem hold time when munmapping an 80GB address space on a machine with 32 cores of E5-2680 @ 2.70GHz dropped from seconds to the microsecond level.

    munmap_test-15002 [008]   594.380138: funcgraph_entry: |  __vm_munmap() {
    munmap_test-15002 [008]   594.380146: funcgraph_entry:      !2485684 us |    unmap_region();
    munmap_test-15002 [008]   596.865836: funcgraph_exit:       !2485692 us |  }

Here the execution time of unmap_region() approximates the time spent holding the read mmap_sem; the remaining time (2485692 us - 2485684 us = 8 us) is spent holding the exclusive lock.

[1] https://lwn.net/Articles/753269/

Link: http://lkml.kernel.org/r/1537376621-51150-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
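The "detach vmas first, then downgrade" sequence from the design in this commit message can be walked through with a toy model. This is a minimal single-threaded sketch under several assumptions: the lock is just a state variable, vma splitting and special mappings are not modeled (the range is assumed to line up with vma boundaries), and all names (`toy_mm`, `toy_vma`, `toy_munmap`, `toy_fault_lookup`) are hypothetical rather than kernel API.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy single-threaded model; every name here is hypothetical. */
enum toy_lock { TOY_UNLOCKED, TOY_READ_LOCKED, TOY_WRITE_LOCKED };

struct toy_vma {
    unsigned long start, end;   /* covers [start, end) */
    bool attached;              /* still reachable through the mm? */
    int mapped_pages;
};

struct toy_mm {
    enum toy_lock mmap_sem;
    struct toy_vma *vmas;
    size_t nr_vmas;
};

/* What a concurrent fault would see: 0 if a vma covers addr, else -EFAULT. */
static int toy_fault_lookup(const struct toy_mm *mm, unsigned long addr)
{
    for (size_t i = 0; i < mm->nr_vmas; i++) {
        const struct toy_vma *v = &mm->vmas[i];

        if (v->attached && addr >= v->start && addr < v->end)
            return 0;
    }
    return -EFAULT;
}

/* Unmap [start, end); returns the lock state that was held while zapping. */
static enum toy_lock toy_munmap(struct toy_mm *mm,
                                unsigned long start, unsigned long end)
{
    enum toy_lock state_while_zapping;
    size_t i;

    assert(mm->mmap_sem == TOY_UNLOCKED);
    mm->mmap_sem = TOY_WRITE_LOCKED;            /* acquire write mmap_sem */

    for (i = 0; i < mm->nr_vmas; i++) {         /* detach vmas */
        struct toy_vma *v = &mm->vmas[i];

        if (v->start >= start && v->end <= end)
            v->attached = false;
    }

    mm->mmap_sem = TOY_READ_LOCKED;             /* downgrade, never unlocked */
    state_while_zapping = mm->mmap_sem;

    for (i = 0; i < mm->nr_vmas; i++) {         /* zap pages (the slow part) */
        struct toy_vma *v = &mm->vmas[i];

        if (!v->attached && v->start >= start && v->end <= end)
            v->mapped_pages = 0;
    }

    mm->mmap_sem = TOY_UNLOCKED;                /* release mmap_sem */
    return state_while_zapping;
}
```

The point the model makes is ordering: because the vmas are detached while the write lock is still held, any lookup that runs during the read-locked zapping phase already fails with -EFAULT, so nothing needs to be cleaned up under the write lock afterwards.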
-