- 02 September 2020, 11 commits
-
-
By Mel Gorman
to #28825456

commit 1c30844d2dfe272d58c8fc000960b835d13aa2ac upstream.

An external fragmentation event was previously described as

  When the page allocator fragments memory, it records the event using
  the mm_page_alloc_extfrag event. If the fallback_order is smaller
  than a pageblock order (order-9 on 64-bit x86) then it's considered
  an event that will cause external fragmentation issues in the future.

The kernel reduces the probability of such events by increasing the
watermark sizes by calling set_recommended_min_free_kbytes early in the
lifetime of the system. This works reasonably well in general but if
there are enough sparsely populated pageblocks then the problem can still
occur as enough memory is free overall and kswapd stays asleep.

This patch introduces a watermark_boost_factor sysctl that allows a zone
watermark to be temporarily boosted when an external fragmentation
causing event occurs. The boosting will stall allocations that would
decrease free memory below the boosted low watermark and kswapd is woken
if the calling context allows to reclaim an amount of memory relative to
the size of the high watermark and the watermark_boost_factor until the
boost is cleared. When kswapd finishes, it wakes kcompactd at the
pageblock order to clean some of the pageblocks that may have been
affected by the fragmentation event. kswapd avoids any writeback, slab
shrinkage and swap from reclaim context during this operation to avoid
excessive system disruption in the name of fragmentation avoidance. Care
is taken so that kswapd will do normal reclaim work if the system is
really low on memory.

This was evaluated using the same workloads as "mm, page_alloc: Spread
allocations across zones before introducing fragmentation".

1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------

4.20-rc3 extfrag events < order 9:   804694
4.20-rc3+patch:                      408912 (49% reduction)
4.20-rc3+patch1-4:                    18421 (98% reduction)

                                   4.20.0-rc3             4.20.0-rc3
                                 lowzone-v5r8             boost-v5r8
Amean     fault-base-1      653.58 (   0.00%)      652.71 (   0.13%)
Amean     fault-huge-1        0.00 (   0.00%)      178.93 * -99.00%*

                              4.20.0-rc3             4.20.0-rc3
                            lowzone-v5r8             boost-v5r8
Percentage huge-1        0.00 (   0.00%)        5.12 ( 100.00%)

Note that external fragmentation causing events are massively reduced by
this patch whether in comparison to the previous kernel or the vanilla
kernel. The fault latency for huge pages appears to be increased but that
is only because THP allocations were successful with the patch applied.

1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:   291392
4.20-rc3+patch:                      191187 (34% reduction)
4.20-rc3+patch1-4:                    13464 (95% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                 lowzone-v5r8             boost-v5r8
Min       fault-base-1      912.00 (   0.00%)      905.00 (   0.77%)
Min       fault-huge-1      127.00 (   0.00%)      135.00 (  -6.30%)
Amean     fault-base-1     1467.55 (   0.00%)     1481.67 (  -0.96%)
Amean     fault-huge-1     1127.11 (   0.00%)     1063.88 *   5.61%*

                              4.20.0-rc3             4.20.0-rc3
                            lowzone-v5r8             boost-v5r8
Percentage huge-1       77.64 (   0.00%)       83.46 (   7.49%)

As before, massive reduction in external fragmentation events, some
jitter on latencies and an increase in THP allocation success rates.

2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:   215698
4.20-rc3+patch:                      200210 (7% reduction)
4.20-rc3+patch1-4:                    14263 (93% reduction)

                                   4.20.0-rc3             4.20.0-rc3
                                 lowzone-v5r8             boost-v5r8
Amean     fault-base-5     1346.45 (   0.00%)     1306.87 (   2.94%)
Amean     fault-huge-5     3418.60 (   0.00%)     1348.94 (  60.54%)

                              4.20.0-rc3             4.20.0-rc3
                            lowzone-v5r8             boost-v5r8
Percentage huge-5        0.78 (   0.00%)        7.91 ( 910.64%)

There is a 93% reduction in fragmentation causing events, there is a big
reduction in the huge page fault latency and allocation success rate is
higher.

2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:   166352
4.20-rc3+patch:                      147463 (11% reduction)
4.20-rc3+patch1-4:                    11095 (93% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                 lowzone-v5r8             boost-v5r8
Amean     fault-base-5     6217.43 (   0.00%)     7419.67 * -19.34%*
Amean     fault-huge-5     3163.33 (   0.00%)     3263.80 (  -3.18%)

                              4.20.0-rc3             4.20.0-rc3
                            lowzone-v5r8             boost-v5r8
Percentage huge-5       95.14 (   0.00%)       87.98 (  -7.53%)

There is a large reduction in fragmentation events with some jitter
around the latencies and success rates. As before, the high THP
allocation success rate does mean the system is under a lot of pressure.
However, as the fragmentation events are reduced, it would be expected
that the long-term allocation success rate would be higher.

Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
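For quick reference, the knob this backport exposes lives at /proc/sys/vm/watermark_boost_factor, where 0 disables boosting and larger values boost more aggressively. Below is a minimal userspace sketch for inspecting and tuning it; the written value of 5000 is chosen purely as an example, not a recommendation:

  #include <stdio.h>

  int main(void)
  {
          const char *path = "/proc/sys/vm/watermark_boost_factor";
          char buf[32];
          FILE *f = fopen(path, "r");

          if (!f) {
                  perror("read watermark_boost_factor");
                  return 1;
          }
          if (fgets(buf, sizeof(buf), f))
                  printf("current watermark_boost_factor: %s", buf);
          fclose(f);

          f = fopen(path, "w");           /* needs root */
          if (!f) {
                  perror("write watermark_boost_factor");
                  return 1;
          }
          fprintf(f, "5000\n");           /* example value; 0 disables boosting */
          fclose(f);
          return 0;
  }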
-
By Josef Bacik
to #28718400

commit a75d4c33377277b6034dd1e2663bce444f952c14 upstream.

Patch series "drop the mmap_sem when doing IO in the fault path", v6.

Now that we have proper isolation in place with cgroups2 we have started
going through and fixing the various priority inversions. Most are all
gone now, but this one is sort of weird since it's not necessarily a
priority inversion that happens within the kernel, but rather because of
something userspace does.

We have giant applications that we want to protect, and parts of these
giant applications do things like watch the system state to determine how
healthy the box is for load balancing and such. This involves running
'ps' or other such utilities. These utilities will often walk
/proc/<pid>/whatever, and these files can sometimes need to
down_read(&task->mmap_sem). Not usually a big deal, but we noticed when
we are stress testing that sometimes our protected application has
latency spikes trying to get the mmap_sem for tasks that are in lower
priority cgroups.

This is because any down_write() on a semaphore essentially turns it into
a mutex, so even if we currently have it held for reading, any new
readers will not be allowed on to keep from starving the writer. This is
fine, except a lower priority task could be stuck doing IO because it has
been throttled to the point that its IO is taking much longer than
normal. But because a higher priority group depends on this completing
it is now stuck behind lower priority work.

In order to avoid this particular priority inversion we want to use the
existing retry mechanism to stop from holding the mmap_sem at all if we
are going to do IO. This already exists in the read case sort of, but
needed to be extended for more than just grabbing the page lock. With
io.latency we throttle at submit_bio() time, so the readahead stuff can
block and even page_cache_read can block, so all these paths need to have
the mmap_sem dropped.

The other big thing is ->page_mkwrite. btrfs is particularly shitty here
because we have to reserve space for the dirty page, which can be a very
expensive operation. We use the same retry method as the read path, and
simply cache the page and verify the page is still setup properly the
next pass through ->page_mkwrite().

I've tested these patches with xfstests and there are no regressions.

This patch (of 3):

If we do not have a page at filemap_fault time we'll do this weird forced
page_cache_read thing to populate the page, and then drop it again and
loop around and find it. This makes for 2 ways we can read a page in
filemap_fault, and it's not really needed. Instead add a FGP_FOR_MMAP
flag so that pagecache_get_page() will return an unlocked page that's in
pagecache. Then use the normal page locking and readpage logic already
in filemap_fault. This simplifies the no page in page cache case
significantly.

[akpm@linux-foundation.org: fix comment text]
[josef@toxicpanda.com: don't unlock null page in FGP_FOR_MMAP case]
Link: http://lkml.kernel.org/r/20190312201742.22935-1-josef@toxicpanda.com
Link: http://lkml.kernel.org/r/20181211173801.29535-2-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Conflicts:
	mm/filemap.c

Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
By Jeremy Linton
fix #26734090 commit d24a0c7099b32b6981d7f126c45348e381718350 upstream ACPI 6.3 adds additional fields to the MADT GICC structure to describe SPE PPI's. We pick these out of the cached reference to the madt_gicc structure similarly to the core PMU code. We then create a platform device referring to the IRQ and let the user/module loader decide whether to load the SPE driver. Tested-by: NHanjun Guo <hanjun.guo@linaro.org> Reviewed-by: NSudeep Holla <sudeep.holla@arm.com> Reviewed-by: NLorenzo Pieralisi <lorenzo.pieralisi@arm.com> Signed-off-by: NJeremy Linton <jeremy.linton@arm.com> Signed-off-by: NWill Deacon <will@kernel.org> Signed-off-by: NXin Hao <xhao@linux.alibaba.com> Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
-
By Marc Zyngier
task #25552995 commit f226650494c6aa87526d12135b7de8b8c074f3de upstream. The GICv3 architecture specification is incredibly misleading when it comes to PMR and the requirement for a DSB. It turns out that this DSB is only required if the CPU interface sends an Upstream Control message to the redistributor in order to update the RD's view of PMR. This message is only sent when ICC_CTLR_EL1.PMHE is set, which isn't the case in Linux. It can still be set from EL3, so some special care is required. But the upshot is that in the (hopefuly large) majority of the cases, we can drop the DSB altogether. This relies on a new static key being set if the boot CPU has PMHE set. The drawback is that this static key has to be exported to modules. Cc: Will Deacon <will@kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Julien Thierry <julien.thierry.kdev@gmail.com> Cc: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: NMarc Zyngier <maz@kernel.org> Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com> Signed-off-by: NZou Cao <zoucao@linux.alibaba.com> Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
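The mechanism is easier to see as code: a static key, set during boot only if ICC_CTLR_EL1.PMHE turns out to be set, gates the barrier. The sketch below follows the shape of the upstream change; treat it as an illustration rather than the exact hunk:

  /* Set at GIC probe time if the boot CPU has ICC_CTLR_EL1.PMHE set. */
  DEFINE_STATIC_KEY_FALSE(gic_pmr_sync);
  EXPORT_SYMBOL(gic_pmr_sync);

  /* Only pay for the DSB when the redistributor must be told about PMR updates. */
  #define pmr_sync()                                              \
          do {                                                    \
                  if (static_branch_unlikely(&gic_pmr_sync))      \
                          dsb(sy);                                \
          } while (0)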
-
By Julien Thierry
task #25552995 commit 13b210ddf474d9f3368766008a89fe82a6f90b48 upstream Currently, irqflags are saved before calling runtime services and checked for mismatch on return. Provide a pair of overridable macros to save and restore (if needed) the state that need to be preserved on return from a runtime service. This allows to check for flags that are not necesarly related to irqflags. Signed-off-by: NJulien Thierry <julien.thierry@arm.com> Acked-by: NCatalin Marinas <catalin.marinas@arm.com> Acked-by: NArd Biesheuvel <ard.biesheuvel@linaro.org> Acked-by: NMarc Zyngier <marc.zyngier@arm.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: linux-efi@vger.kernel.org Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com> Signed-off-by: NZou Cao <zoucao@linux.alibaba.com> Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
-
By Julien Thierry
task #25552995 commit 6e4933a006616343f66c4702dc4fc56bb25e7b02 upstream NMI handling code should be executed between calls to nmi_enter and nmi_exit. Add a separate domain handler to properly setup NMI context when handling an interrupt requested as NMI. Signed-off-by: NJulien Thierry <julien.thierry@arm.com> Acked-by: NMarc Zyngier <marc.zyngier@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com> Signed-off-by: NZou Cao <zoucao@linux.alibaba.com> Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
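Roughly, the new domain handler differs from handle_domain_irq() only in wrapping the dispatch in nmi_enter()/nmi_exit(); a sketch of its shape, not the literal kernel/irq hunk:

  int handle_domain_nmi(struct irq_domain *domain, unsigned int hwirq,
                        struct pt_regs *regs)
  {
          struct pt_regs *old_regs = set_irq_regs(regs);
          unsigned int irq;
          int ret = 0;

          nmi_enter();
          irq = irq_find_mapping(domain, hwirq);
          if (likely(irq))
                  generic_handle_irq(irq);
          else
                  ret = -EINVAL;
          nmi_exit();

          set_irq_regs(old_regs);
          return ret;
  }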
-
By Julien Thierry
task #25552995 commit 2dcf1fbcad352baaa5f47b17e57c5743c8eedbad upstream Provide flow handlers that are NMI safe for interrupts and percpu_devid interrupts. Signed-off-by: NJulien Thierry <julien.thierry@arm.com> Acked-by: NMarc Zyngier <marc.zyngier@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com> Signed-off-by: NZou Cao <zoucao@linux.alibaba.com> Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
-
By Julien Thierry
task #25552995 commit 4b078c3f1a26487c39363089ba0d5c6b09f2a89f upstream Add support for percpu_devid interrupts treated as NMIs. Percpu_devid NMIs need to be setup/torn down on each CPU they target. The same restrictions as for global NMIs still apply for percpu_devid NMIs. Signed-off-by: NJulien Thierry <julien.thierry@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com> Signed-off-by: NZou Cao <zoucao@linux.alibaba.com> Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com> [ Shile: fixed conflicts in include/linux/interrupt.h ] Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
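A sketch of how a driver is expected to use the percpu_devid NMI API: request once globally, then prepare and enable on every CPU that should receive the NMI. The my_* names and the per-CPU cookie are placeholders, not real kernel symbols:

  static DEFINE_PER_CPU(int, my_nmi_cookie);

  static irqreturn_t my_nmi_handler(int irq, void *dev_id)
  {
          return IRQ_HANDLED;     /* must stay NMI-safe: no locks, no sleeping */
  }

  static int setup_my_percpu_nmi(unsigned int irq)
  {
          int ret = request_percpu_nmi(irq, my_nmi_handler, "my-nmi",
                                       &my_nmi_cookie);
          if (ret)
                  return ret;

          /* Then, on each target CPU (e.g. from a CPU hotplug callback): */
          ret = prepare_percpu_nmi(irq);
          if (!ret)
                  enable_percpu_nmi(irq, IRQ_TYPE_NONE);
          return ret;
  }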
-
By Julien Thierry
task #25552995 commit b525903c254dab2491410f0f23707691b7c2c317 upstream Add functionality to allocate interrupt lines that will deliver IRQs as Non-Maskable Interrupts. These allocations are only successful if the irqchip provides the necessary support and allows NMI delivery for the interrupt line. Interrupt lines allocated for NMI delivery must be enabled/disabled through enable_nmi/disable_nmi_nosync to keep their state consistent. To treat a PERCPU IRQ as NMI, the interrupt must not be shared nor threaded, the irqchip directly managing the IRQ must be the root irqchip and the irqchip cannot be behind a slow bus. Signed-off-by: NJulien Thierry <julien.thierry@arm.com> Reviewed-by: NMarc Zyngier <marc.zyngier@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com> Signed-off-by: NZou Cao <zoucao@linux.alibaba.com> Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
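Usage looks roughly like a normal request_irq()/enable_irq() pair, but through the NMI-specific entry points; per the commit, the line is then controlled via enable_nmi()/disable_nmi_nosync(). The handler, name and cookie below are placeholders:

  static irqreturn_t my_watchdog_nmi(int irq, void *data)
  {
          return IRQ_HANDLED;
  }

  static int setup_watchdog_nmi(unsigned int irq, void *cookie)
  {
          int ret = request_nmi(irq, my_watchdog_nmi, 0, "my-watchdog", cookie);

          if (ret)
                  return ret;     /* the irqchip refused NMI delivery */

          enable_nmi(irq);
          return 0;
  }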
-
By Julien Thierry
task #25552995 commit 2130b789b3ef6a518b9c9c6f245642620e2b0c0c upstream. LPIs use the same priority value as other GIC interrupts. Make the GIC default priority definition visible to ITS implementation and use this same definition for LPI priorities. Tested-by: NDaniel Thompson <daniel.thompson@linaro.org> Signed-off-by: NJulien Thierry <julien.thierry@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jason Cooper <jason@lakedaemon.net> Cc: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com> Signed-off-by: NZou Cao <zoucao@linux.alibaba.com> Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
-
By Robin Murphy
fix #27432135 commit 6af588fed39178c8e118fcf9cb6664e58a1fbe88 upstream While iommu_get_domain_for_dev() is the robust way for arbitrary IOMMU API callers to retrieve the domain pointer, for DMA ops domains it doesn't scale well for large systems and multi-queue devices, since the momentary refcount adjustment will lead to exclusive cacheline contention when multiple CPUs are operating in parallel on different mappings for the same device. In the case of DMA ops domains, however, this refcounting is actually unnecessary, since they already imply that the group exists and is managed by platform code and IOMMU internals (by virtue of iommu_group_get_for_dev()) such that a reference will already be held for the lifetime of the device. Thus we can avoid the bottleneck by providing a fast lookup specifically for the DMA code to retrieve the default domain it already knows it has set up - a simple read-only dereference plays much nicer with cache-coherency protocols. Signed-off-by: NRobin Murphy <robin.murphy@arm.com> Tested-by: NWill Deacon <will.deacon@arm.com> Signed-off-by: NJoerg Roedel <jroedel@suse.de> Signed-off-by: NRongwei Wang <rongwei.wang@linux.alibaba.com> Acked-by: Nzou cao <zoucao@linux.alibaba.com>
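The fast path amounts to a refcount-free dereference; a sketch of the shape of the helper (the DMA layer already knows the default domain exists for the lifetime of the device, so no get/put is needed):

  struct iommu_domain *iommu_get_dma_domain(struct device *dev)
  {
          return dev->iommu_group->default_domain;
  }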
-
- 29 June 2020, 10 commits
-
-
By Pavel Begunkov
to #28736503 commit 9dafdfc2f0a3ae551711098de3d7b621a469f11a upstream. Export do_tee() for use in io_uring. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
By James Morse
fix #28612342 commit f9f05395f384ee858520b6c65d7e3e436af20c53 upstream If the GHES notification type is SDEI, register the provided event using the SDEI-GHES helper. SDEI may be one of two types of event, normal and critical. Critical events can interrupt normal events, so these must have separate fixmap slots and locks in case both event types are in use. Signed-off-by: NJames Morse <james.morse@arm.com> Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: NAlex Shi <alex.shi@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com> Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
-
By James Morse
fix #28612342 commit f96935d3bc38a5f4b5188b6470a10e3fb8c3f0cc upstream APEI's Generic Hardware Error Source structures do not describe whether the SDEI event is shared or private, as this information is discoverable via the API. GHES needs to know whether an event is normal or critical to avoid sharing locks or fixmap entries, but GHES shouldn't have to know about the SDEI API. Add a helper to register the GHES using the appropriate normal or critical callback. Signed-off-by: NJames Morse <james.morse@arm.com> Acked-by: NCatalin Marinas <catalin.marinas@arm.com> Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: NAlex Shi <alex.shi@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com> Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
-
By James Morse
fix #28612342 commit 062022315e8ad9e0628515dfc756ab54b5fdb26b upstream The GHES code calls memory_failure_queue() from IRQ context to schedule work on the current CPU so that memory_failure() can sleep. For synchronous memory errors the arch code needs to know any signals that memory_failure() will trigger are pending before it returns to user-space, possibly when exiting from the IRQ. Add a helper to kick the memory failure queue, to ensure the scheduled work has happened. This has to be called from process context, so may have been migrated from the original cpu. Pass the cpu the work was queued on. Change memory_failure_work_func() to permit being called on the 'wrong' cpu. Signed-off-by: NJames Morse <james.morse@arm.com> Tested-by: NTyler Baicar <baicar@os.amperecomputing.com> Acked-by: NNaoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: NAlex Shi <alex.shi@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com> Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
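A sketch of the helper this adds, following the description above: flush the per-CPU memory_failure work for the CPU the error was queued on, from process context, so any SIGBUS is pending before returning to user-space. Illustrative shape, not necessarily the literal mm/memory-failure.c hunk:

  void memory_failure_queue_kick(int cpu)
  {
          struct memory_failure_cpu *mf_cpu;

          mf_cpu = &per_cpu(memory_failure_cpu, cpu);
          cancel_work_sync(&mf_cpu->work);
          memory_failure_work_func(&mf_cpu->work);
  }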
-
By Jens Axboe
fix #28871358 commit d666ba98f849ad44c4405ecc2180390ebe80f4f9 upstream blk-mq passes information to the hardware about any given request being the last that we will issue in this sequence. The point is that hardware can defer costly doorbell type writes to the last request. But if we run into errors issuing a sequence of requests, we may never send the request with bd->last == true set. For that case, we need a hook that tells the hardware that nothing else is coming right now. For failures returned by the drivers ->queue_rq() hook, the driver is responsible for flushing pending requests, if it uses bd->last to optimize that part. This works like before, no changes there. Reviewed-by: NOmar Sandoval <osandov@fb.com> Reviewed-by: NMing Lei <ming.lei@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
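From a driver's point of view the new hook looks like the sketch below: blk-mq calls ->commit_rqs() when a dispatch sequence ends without the bd->last request being issued, so the driver can still ring its doorbell for what was queued. The my_* structure and register names are placeholders, not a real driver:

  struct my_hw_queue {
          void __iomem *doorbell;
          u16 sq_tail;
  };

  /* Flush (ring the doorbell for) everything submitted so far. */
  static void my_commit_rqs(struct blk_mq_hw_ctx *hctx)
  {
          struct my_hw_queue *hwq = hctx->driver_data;

          writel(hwq->sq_tail, hwq->doorbell);
  }

  /* Wired up next to ->queue_rq() in the driver's blk_mq_ops:
   *         .commit_rqs = my_commit_rqs,
   */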
-
By Jens Axboe
fix #28871358 Only do it if we have requests for multiple queues in the same plug. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
By Christoph Hellwig
fix #28339081 commit edfbcb321faf07ca970e4191abe061deeb7d3788 upstream The USB buffer allocation code is the only place in the usb core (and in fact the whole kernel) that uses is_device_dma_capable, while the URB mapping code uses the uses_dma flag in struct usb_bus. Switch the buffer allocation to use the uses_dma flag used by the rest of the USB code, and create a helper in hcd.h that checks this flag as well as the CONFIG_HAS_DMA to simplify the caller a bit. Signed-off-by: NChristoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20190811080520.21712-3-hch@lst.deSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: NAlex Shi <alex.shi@linux.alibaba.com>
-
By Laurentiu Tudor
fix #28339081 commit 2d7a3dc3e24f43504b1f25eae8195e600f4cce8b upstream With the addition of the local memory allocator, the HCD_LOCAL_MEM flag can be dropped and the checks against it replaced with a check for the localmem_pool ptr being initialized. Signed-off-by: NLaurentiu Tudor <laurentiu.tudor@nxp.com> Tested-by: NFredrik Noring <noring@nocrew.org> Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: NAlex Shi <alex.shi@linux.alibaba.com>
-
By Laurentiu Tudor
fix #28339081 commit b0310c2f09bbe8aebefb97ed67949a3a7092aca6 upstream For HCs that have local memory, replace the current DMA API usage with a genalloc generic allocator to manage the mappings for these devices. To help users, introduce a new HCD API, usb_hcd_setup_local_mem() that will setup up the genalloc backing up the device local memory. It will be used in subsequent patches. This is in preparation for dropping the existing "coherent" dma mem declaration APIs. The current implementation was relying on a short circuit in the DMA API that in the end, was acting as an allocator for these type of devices. Signed-off-by: NLaurentiu Tudor <laurentiu.tudor@nxp.com> Tested-by: NFredrik Noring <noring@nocrew.org> Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: NAlex Shi <alex.shi@linux.alibaba.com>
-
By Fredrik Noring
fix #28339081 commit da83a722959a82733c3ca60030cc364ca2318c5a upstream gen_pool_dma_zalloc() is a zeroed memory variant of gen_pool_dma_alloc(). Also document the return values of both, and indicate NULL as a "%NULL" constant. Signed-off-by: NFredrik Noring <noring@nocrew.org> Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: NAlex Shi <alex.shi@linux.alibaba.com>
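A minimal usage sketch of the new helper: it behaves like gen_pool_dma_alloc() plus zeroing, and the pool is assumed to have been created and populated elsewhere (for example by usb_hcd_setup_local_mem() as in the patches above):

  static void *alloc_cleared_local_buf(struct gen_pool *pool, size_t size,
                                       dma_addr_t *dma)
  {
          void *vaddr = gen_pool_dma_zalloc(pool, size, dma);

          if (!vaddr)
                  return NULL;    /* pool exhausted */
          /* vaddr is zeroed; *dma holds the device-visible address */
          return vaddr;
  }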
-
- 24 June 2020, 3 commits
-
-
By Dietmar Eggemann
to #28739709 commit 0e1fef63d92d61ed561e504c3a078a827a0f9bfe upstream The sched domain per rq load index files also disappear from the /proc/sys/kernel/sched_domain/cpuX/domainY directories. Signed-off-by: NDietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NRik van Riel <riel@surriel.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Patrick Bellasi <patrick.bellasi@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20190527062116.11512-6-dietmar.eggemann@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Reviewed-by: NShanpei Chen <shanpeic@linux.alibaba.com>
-
By Dietmar Eggemann
to #28739709 commit 5e83eafbfd3b351537c0d74467fc43e8a88f4ae4 upstream With LB_BIAS disabled, there is no need to update the rq->cpu_load[idx] any more. Signed-off-by: NDietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NRik van Riel <riel@surriel.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Patrick Bellasi <patrick.bellasi@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20190527062116.11512-2-dietmar.eggemann@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Reviewed-by: NShanpei Chen <shanpeic@linux.alibaba.com>
-
By Daniel Lezcano
to #28739709

commit a7fe5190c03f8137ef08db84a58dd4daf2c4785d upstream

The function get_loadavg() returns almost always zero. To be more
precise, statistically speaking for a total of 1023379 times passing in
the function, the load is equal to zero 1020728 times, greater than 100,
610 times, the remaining is between 0 and 5.

In 2011, the get_loadavg() was removed from the Android tree because of
the above [1]. At this time, the load was:

  unsigned long this_cpu_load(void)
  {
          struct rq *this = this_rq();
          return this->cpu_load[0];
  }

In 2014, the code was changed by commit 372ba8cb (cpuidle: menu: Lookup
CPU runqueues less) and the load is:

  void get_iowait_load(unsigned long *nr_waiters, unsigned long *load)
  {
          struct rq *rq = this_rq();
          *nr_waiters = atomic_read(&rq->nr_iowait);
          *load = rq->load.weight;
  }

with the same result.

Both measurements show using the load in this code path does not matter
anymore. Removing it.

[1] https://android.googlesource.com/kernel/common/+/4dedd9f124703207895777ac6e91dacde0f7cc17

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Reviewed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
-
- 23 June 2020, 3 commits
-
-
By Yihao Wu
to #28739709

Assume a workload composed of massive numbers of short tasks. Then
periodic load tracking is unnecessary, because load tracking is already
guaranteed by the frequent sleeps and wake-ups. If these massive short
tasks run in their individual cgroups, the load tracking becomes
extremely heavy.

This patch adds a switch to bypass scheduler_tick load tracking, in order
to reduce scheduler overhead without sacrificing much balance in this
scenario.

Performance tests:

1) 1100+ tasks in their individual cgroups, on a 96-HT Skylake machine
   sched overhead (each HT): 0.74% -> 0.48%
   (This test's baseline is from the previous patch)

2) sysbench-threads with 96 threads, running for 5min
   latency_ms 95th: 63.07 -> 54.01

Besides these, no regression is found on our test platform.

Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
-
By Yihao Wu
to #28739709

Unless the workload is IO-bound, update_blocked_averages doesn't help
load balance. This patch adds a switch to bypass update_blocked_averages
if prior knowledge about the workload indicates IO is negligible.

Performance tests:

1) 1100+ tasks in their individual cgroups, on a 96-HT Skylake machine
   sched overhead (each HT): 3.78% -> 0.74%

2) cgroup-overhead benchmark in our sched-test suite on a 96-HT Skylake
   overhead: 21.06 -> 18.08

3) unixbench context1 with 96 threads running for 1min
   Score: 15409.40 -> 16821.77

Besides these, UnixBench has some performance ups and downs. But
generally, the performance of UnixBench hasn't changed.

Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
-
By Yang Shi
task #27327988 The commit ("thp: change CoW semantics for anon-THP") rewrites the THP CoW page fault handler to allocate a base page only, but there is a request to keep the old behavior just in case. So, introduce a new sysfs knob, fast_cow, to control the behavior. The default is the new behavior; write 0 to the knob to switch back to the old behavior. Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> [ caspar: fix checkpatch.pl warnings ] Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
- 15 June 2020, 2 commits
-
-
By Sahitya Tummala
task #28557799

[ Upstream commit 30a2da7b7e225ef6c87a660419ea04d3cef3f6a7 ]

There is a potential race between ioc_release_fn() and ioc_clear_queue()
as shown below, due to which the kernel crash below is observed. It can
also result in a use-after-free issue.

  context#1:                           context#2:
  ioc_release_fn()                     __ioc_clear_queue() gets the same icq
  ->spin_lock(&ioc->lock);             ->spin_lock(&ioc->lock);
  ->ioc_destroy_icq(icq);
    ->list_del_init(&icq->q_node);
    ->call_rcu(&icq->__rcu_head,
               icq_free_icq_rcu);
  ->spin_unlock(&ioc->lock);
                                       ->ioc_destroy_icq(icq);
                                         ->hlist_del_init(&icq->ioc_node);

This results in the crash below, as this memory is now used by
icq->__rcu_head in context#1. There is a chance that icq could be free'd
as well.

  22150.386550: <6> Unable to handle kernel write to read-only memory at
  virtual address ffffffaa8d31ca50
  ...
  Call trace:
  22150.607350: <2> ioc_destroy_icq+0x44/0x110
  22150.611202: <2> ioc_clear_queue+0xac/0x148
  22150.615056: <2> blk_cleanup_queue+0x11c/0x1a0
  22150.619174: <2> __scsi_remove_device+0xdc/0x128
  22150.623465: <2> scsi_forget_host+0x2c/0x78
  22150.627315: <2> scsi_remove_host+0x7c/0x2a0
  22150.631257: <2> usb_stor_disconnect+0x74/0xc8
  22150.635371: <2> usb_unbind_interface+0xc8/0x278
  22150.639665: <2> device_release_driver_internal+0x198/0x250
  22150.644897: <2> device_release_driver+0x24/0x30
  22150.649176: <2> bus_remove_device+0xec/0x140
  22150.653204: <2> device_del+0x270/0x460
  22150.656712: <2> usb_disable_device+0x120/0x390
  22150.660918: <2> usb_disconnect+0xf4/0x2e0
  22150.664684: <2> hub_event+0xd70/0x17e8
  22150.668197: <2> process_one_work+0x210/0x480
  22150.672222: <2> worker_thread+0x32c/0x4c8

Fix this by adding a new ICQ_DESTROYED flag in ioc_destroy_icq() to
indicate this icq is once marked as destroyed. Also, ensure
__ioc_clear_queue() is accessing icq within rcu_read_lock/unlock so that
icq doesn't get free'd up while it is still using it.

Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Co-developed-by: Pradeep P V K <ppvk@codeaurora.org>
Signed-off-by: Pradeep P V K <ppvk@codeaurora.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
By Mikulas Patocka
task #28557799 commit ad6bf88a6c19a39fb3b0045d78ea880325dfcf15 upstream. Logical block size has type unsigned short. That means that it can be at most 32768. However, there are architectures that can run with 64k pages (for example arm64) and on these architectures, it may be possible to create block devices with 64k block size. For exmaple (run this on an architecture with 64k pages): Mount will fail with this error because it tries to read the superblock using 2-sector access: device-mapper: writecache: I/O is not aligned, sector 2, size 1024, block size 65536 EXT4-fs (dm-0): unable to read superblock This patch changes the logical block size from unsigned short to unsigned int to avoid the overflow. Cc: stable@vger.kernel.org Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com> Reviewed-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com> Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
- 11 June 2020, 1 commit
-
-
By Joseph Qi
fix #28528017 In the case of a virtio-blk device, /sys/block/<device>/queue/io_poll shows 1 and the user can't disable it. Virtio-blk doesn't actually support poll yet, so this confuses end users. The root cause is that mq initialization sets the QUEUE_FLAG_POLL bit by default. This fix takes ideas from the following upstream commits: 6544d229bf43 ("block: enable polling by default if a poll map is initalized") and 6e0de61107f0 ("blk-mq: remove QUEUE_FLAG_POLL from default MQ flags"). Since we don't want to involve the HCTX_TYPE_POLL related logic, just check mq_ops->poll before setting QUEUE_FLAG_POLL. Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
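The idea of the fix, roughly, is to advertise io_poll only when the driver actually implements ->poll(), instead of setting QUEUE_FLAG_POLL on every blk-mq queue. A sketch, not the literal hunk:

  static void init_poll_flag(struct blk_mq_tag_set *set, struct request_queue *q)
  {
          if (set->ops->poll)
                  blk_queue_flag_set(QUEUE_FLAG_POLL, q);
  }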
-
- 09 June 2020, 5 commits
-
-
By Johannes Weiner
task #28327019 commit b8e24a9300b0836a9d39f6b20746766b3b81f1bd upstream psi tracks the time tasks wait for refaulting pages to become uptodate, but it does not track the time spent submitting the IO. The submission part can be significant if backing storage is contended or when cgroup throttling (io.latency) is in effect - a lot of time is spent in submit_bio(). In that case, we underreport memory pressure. Annotate submit_bio() to account submission time as memory stall when the bio is reading userspace workingset pages. Tested-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
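The shape of the annotation, shown here as a sketch wrapped around submit_bio() rather than the literal block-layer hunk; bio_has_workingset_pages() stands in for the real per-page PageWorkingset check and is not an existing kernel helper:

  blk_qc_t submit_bio_accounted(struct bio *bio)
  {
          bool workingset_read = false;
          unsigned long pflags;
          blk_qc_t ret;

          if (bio_op(bio) == REQ_OP_READ && bio_has_workingset_pages(bio))
                  workingset_read = true;

          if (workingset_read)
                  psi_memstall_enter(&pflags);

          ret = submit_bio(bio);

          if (workingset_read)
                  psi_memstall_leave(&pflags);

          return ret;
  }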
-
By zhongjiang-ali
task #28327019 Commit bc0cc360 ("alinux: blk-throttle: fix tg NULL pointer dereference") add an self-defined bio flags to fix an issue of use-after-free. But it is limited to 13 entry and has used up, hence it will fails to sync related patch. The patch replace reserved field with extended bio_flags to allow us to define more bio flags. Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com> Signed-off-by: Nzhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
-
By Yafang Shao
task #28327019 commit 1066d1b6974e095d5a6c472ad9180a957b496cd6 upstream The task->flags is a 32-bits flag, in which 31 bits have already been consumed. So it is hardly to introduce other new per process flag. Currently there're still enough spaces in the bit-field section of task_struct, so we can define the memstall state as a single bit in task_struct instead. This patch also removes an out-of-date comment pointed by Matthew. Suggested-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NYafang Shao <laoar.shao@gmail.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Link: https://lkml.kernel.org/r/1584408485-1921-1-git-send-email-laoar.shao@gmail.comSigned-off-by: Nzhongjiang-ali <zhongjiang-ali@linux.alibaba.com> Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
-
By Johannes Weiner
task #28327019 commit 36b238d5717279163859fb6ba0f4360abcafab83 upstream When switching tasks running on a CPU, the psi state of a cgroup containing both of these tasks does not change. Right now, we don't exploit that, and can perform many unnecessary state changes in nested hierarchies, especially when most activity comes from one leaf cgroup. This patch implements an optimization where we only update cgroups whose state actually changes during a task switch. These are all cgroups that contain one task but not the other, up to the first shared ancestor. When both tasks are in the same group, we don't need to update anything at all. We can identify the first shared ancestor by walking the groups of the incoming task until we see TSK_ONCPU set on the local CPU; that's the first group that also contains the outgoing task. The new psi_task_switch() is similar to psi_task_change(). To allow code reuse, move the task flag maintenance code into a new function and the poll/avg worker wakeups into the shared psi_group_change(). Suggested-by: NPeter Zijlstra <peterz@infradead.org> Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200316191333.115523-3-hannes@cmpxchg.orgSigned-off-by: Nzhongjiang-ali <zhongjiang-ali@linux.alibaba.com> Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
-
By Johannes Weiner
task #28327019 commit b05e75d611380881e73edc58a20fd8c6bb71720b upstream For simplicity, cpu pressure is defined as having more than one runnable task on a given CPU. This works on the system-level, but it has limitations in a cgrouped reality: When cpu.max is in use, it doesn't capture the time in which a task is not executing on the CPU due to throttling. Likewise, it doesn't capture the time in which a competing cgroup is occupying the CPU - meaning it only reflects cgroup-internal competitive pressure, not outside pressure. Enable tracking of currently executing tasks, and then change the definition of cpu pressure in a cgroup from NR_RUNNING > 1 to NR_RUNNING > ON_CPU which will capture the effects of cpu.max as well as competition from outside the cgroup. After this patch, a cgroup running `stress -c 1` with a cpu.max setting of 5000 10000 shows ~50% continuous CPU pressure. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200316191333.115523-2-hannes@cmpxchg.orgSigned-off-by: Nzhongjiang-ali <zhongjiang-ali@linux.alibaba.com> Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
-
- 05 June 2020, 2 commits
-
-
By Christian Borntraeger
fix #28092200 commit cdd6ad3ac63d2fa320baefcf92a02a918375c30f upstream There are cases where halt polling is unwanted. For example when running KVM on an over committed LPAR we rather want to give back the CPU to neighbour LPARs instead of polling. Let us provide a callback that allows architectures to disable polling. Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com> Acked-by: NPaolo Bonzini <pbonzini@redhat.com> Reviewed-by: NCornelia Huck <cohuck@redhat.com> Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Nchenxiangzuo <cxz18821786681@linux.alibaba.com> Reviewed-by: NCaspar Zhang <caspar@linux.alibaba.com>
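A sketch of the hook: a weak default keeps the existing behaviour, an architecture (s390 here) overrides it to suppress polling, and kvm_vcpu_block() consults it before entering the polling loop. Illustrative shape only:

  bool __weak kvm_arch_no_poll(struct kvm_vcpu *vcpu)
  {
          return false;
  }

  /* In kvm_vcpu_block(), roughly:
   *         if (vcpu->halt_poll_ns && !kvm_arch_no_poll(vcpu)) {
   *                 ... do halt polling ...
   *         }
   */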
-
By Paolo Bonzini
to #28092200 commit d970a325561da5e611596cbb06475db3755ce823 upstream Reported with "make W=1" due to -Wmissing-prototypes. Reported-by: Qian Cai <cai@lca.pw> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: chenxiangzuo <cxz18821786681@linux.alibaba.com> Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
-
- 04 June 2020, 2 commits
-
-
By Jens Axboe
to #28170604 commit 0a384abfae66651b28e4bbe16883b1ff046ba3b3 upstream This splits it into two parts, one that imports the message, and one that imports the iovec. This allows a caller to only do the first part, and import the iovec manually afterwards. No functional changes in this patch. Acked-by: NDavid Miller <davem@davemloft.net> Signed-off-by: NJens Axboe <axboe@kernel.dk> Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
By Pavel Begunkov
to #28170604 commit 444ebb5768c5c43aadfc60111fecd6c4f946e77b upstream Make do_splice() public, so other kernel parts can reuse it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
- 28 May 2020, 1 commit
-
-
By Jens Axboe
to #26323588 commit 09952e3e7826119ddd4357c453d54bcc7ef25156 upstream. Just like commit 4022e7af86be, this fixes the fact that IORING_OP_ACCEPT ends up using get_unused_fd_flags(), which checks current->signal->rlim[] for limits. Add an extra argument to __sys_accept4_file() that allows us to pass in the proper nofile limit, and grab it at request prep time. Acked-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-