- 11 Jun 2019, 2 commits
-
-
Committed by Johannes Weiner
commit 1899ad18c6072d689896badafb81267b0a1092a4 upstream.

Refaults happen during transitions between workingsets as well as in-place thrashing. Knowing the difference between the two has a range of applications, including measuring the impact of memory shortage on the system performance, as well as the ability to smarter balance pressure between the filesystem cache and the swap-backed workingset.

During workingset transitions, inactive cache refaults and pushes out established active cache. When that active cache isn't stale, however, and also ends up refaulting, that's bonafide thrashing.

Introduce a new page flag that tells on eviction whether the page has been active or not in its lifetime. This bit is then stored in the shadow entry, to classify refaults as transitioning or thrashing.

How many page->flags does this leave us with on 32-bit?

20 bits are always page flags
21 if you have an MMU
23 with the zone bits for DMA, Normal, HighMem, Movable
29 with the sparsemem section bits
30 if PAE is enabled
31 with this patch.

So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If that's not enough, the system can switch to discontigmem and re-gain the 6 or 7 sparsemem section bits.

Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
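The idea of carrying one extra bit of eviction history through the shadow entry can be illustrated in plain C. This is a minimal sketch of the encoding concept only; the bit layout and names below are assumptions for illustration, not the kernel's actual shadow-entry format.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Illustrative only: pack an eviction "timestamp" and a single
 * "was active (workingset)" bit into one word, the way a shadow
 * entry can carry a little history about the evicted page.
 */
#define WORKINGSET_SHIFT 1
#define WORKINGSET_BIT   0x1UL

static uintptr_t pack_shadow(uintptr_t eviction, bool was_active)
{
	return (eviction << WORKINGSET_SHIFT) | (was_active ? WORKINGSET_BIT : 0);
}

static void unpack_shadow(uintptr_t shadow, uintptr_t *eviction, bool *was_active)
{
	*was_active = shadow & WORKINGSET_BIT;
	*eviction = shadow >> WORKINGSET_SHIFT;
}

int main(void)
{
	uintptr_t eviction;
	bool active;

	unpack_shadow(pack_shadow(12345, true), &eviction, &active);
	/* A refault of a page that was active in its lifetime is thrashing. */
	printf("eviction=%lu thrashing=%d\n", (unsigned long)eviction, active);
	return 0;
}
```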
-
Committed by Johannes Weiner
commit 95f9ab2d596e8cbb388315e78c82b9a131bf2928 upstream.

Patch series "psi: pressure stall information for CPU, memory, and IO", v4.

Overview

PSI reports the overall wallclock time in which the tasks in a system (or cgroup) wait for (contended) hardware resources.

This helps users understand the resource pressure their workloads are under, which allows them to rootcause and fix throughput and latency problems caused by overcommitting, underprovisioning, suboptimal job placement in a grid; as well as anticipate major disruptions like OOM.

Real-world applications

We're using the data collected by PSI (and its previous incarnation, memdelay) quite extensively at Facebook, and with several success stories.

One usecase is avoiding OOM hangs/livelocks. The reason these happen is because the OOM killer is triggered by reclaim not being able to free pages, but with fast flash devices there is *always* some clean and uptodate cache to reclaim; the OOM killer never kicks in, even as tasks spend 90% of the time thrashing the cache pages of their own executables. There is no situation where this ever makes sense in practice. We wrote a <100 line POC python script to monitor memory pressure and kill stuff way before such pathological thrashing leads to full system losses that would require forcible hard resets.

We've since extended and deployed this code into other places to guarantee latency and throughput SLAs, since they're usually violated way before the kernel OOM killer would ever kick in. It is available here: https://github.com/facebookincubator/oomd

Eventually we probably want to trigger the in-kernel OOM killer based on extreme sustained pressure as well, so that Linux can avoid memory livelocks - which technically aren't deadlocks, but to the user indistinguishable from them - out of the box. We'd continue using OOMD as the first line of defense to ensure workload health and implement complex kill policies that are beyond the scope of the kernel.

We also use PSI memory pressure for loadshedding. Our batch job infrastructure used to use heuristics based on various VM stats to anticipate OOM situations, with lackluster success. We switched it to PSI and managed to anticipate and avoid OOM kills and lockups fairly reliably. The reduction of OOM outages in the worker pool raised the pool's aggregate productivity, and we were able to switch that service to smaller machines.

Lastly, we use cgroups to isolate a machine's main workload from maintenance crap like package upgrades, logging, configuration, as well as to prevent multiple workloads on a machine from stepping on each others' toes. We were not able to configure this properly without the pressure metrics; we would see latency or bandwidth drops, but it would often be hard to impossible to rootcause it post-mortem.

We now log and graph pressure for the containers in our fleet and can trivially link latency spikes and throughput drops to shortages of specific resources after the fact, and fix the job config/scheduling.

PSI has also received testing, feedback, and feature requests from Android and EndlessOS for the purpose of low-latency OOM killing, to intervene in pressure situations before the UI starts hanging.

How do you use this feature?

A kernel with CONFIG_PSI=y will create a /proc/pressure directory with 3 files: cpu, memory, and io. If using cgroup2, cgroups will also have cpu.pressure, memory.pressure and io.pressure files, which simply aggregate task stalls at the cgroup level instead of system-wide.

The cpu file contains one line:

	some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722

The averages give the percentage of walltime in which one or more tasks are delayed on the runqueue while another task has the CPU. They're recent averages over 10s, 1m, 5m windows, so you can tell short term trends from long term ones, similarly to the load average.

The total= value gives the absolute stall time in microseconds. This allows detecting latency spikes that might be too short to sway the running averages. It also allows custom time averaging in case the 10s/1m/5m windows aren't adequate for the usecase (or are too coarse with future hardware).

What to make of this "some" metric? If CPU utilization is at 100% and CPU pressure is 0, it means the system is perfectly utilized, with one runnable thread per CPU and nobody waiting. At two or more runnable tasks per CPU, the system is 100% overcommitted and the pressure average will indicate as much. From a utilization perspective this is a great state of course: no CPU cycles are being wasted, even when 50% of the threads were to go idle (as most workloads do vary). From the perspective of the individual job it's not great, however, and they would do better with more resources. Depending on what your priority and options are, raised "some" numbers may or may not require action.

The memory file contains two lines:

	some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
	full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258

The some line is the same as for cpu, the time in which at least one task is stalled on the resource. In the case of memory, this includes waiting on swap-in, page cache refaults and page reclaim.

The full line, however, indicates time in which *nobody* is using the CPU productively due to pressure: all non-idle tasks are waiting for memory in one form or another. Significant time spent in there is a good trigger for killing things, moving jobs to other machines, or dropping incoming requests, since neither the jobs nor the machine overall are making too much headway.

The io file is similar to memory. Because the block layer doesn't have a concept of hardware contention right now (how much longer is my IO request taking due to other tasks?), it reports CPU potential lost on all IO delays, not just the potential lost due to competition.

FAQ

Q: How is PSI's CPU component different from the load average?

A: There are several quirks in the load average that make it hard to impossible to tell how overcommitted the CPU really is.

1. The load average is reported as a raw number of active tasks. You need to know how many CPUs there are in the system, how many CPUs the workload is allowed to use, then think about what the proportion between load and the number of CPUs mean for the tasks trying to run.

PSI reports the percentage of wallclock time in which tasks are waiting for a CPU to run on. It doesn't matter how many CPUs are present or usable. The number always tells the quality of life of tasks in the system or in a particular cgroup.

2. The shortest averaging window is 1m, which is extremely coarse, and it's sampled in 5s intervals. A *lot* can happen on a CPU in 5 seconds. This *may* be able to identify persistent long-term trends and very clear and obvious overloads, but it's unusable for latency spikes and more subtle overutilization.

PSI's shortest window is 10s. It also exports the cumulative stall times (in microseconds) of synchronously recorded events.

3. On Linux, the load average for historical reasons includes all TASK_UNINTERRUPTIBLE tasks. This gives a broader sense of how busy the system is, but on the flipside it doesn't distinguish whether tasks are likely to contend over the CPU or IO - which obviously requires very different interventions from a sys admin or a job scheduler.

PSI reports independent metrics for CPU and IO. You can tell which resource is making the tasks wait, but in conjunction still see how overloaded the system is overall.

Q: What's the cost / performance impact of this feature?

A: PSI's primary cost is in the scheduler, in particular task wakeups and sleeps.

I benchmarked this code using Facebook's two most scheduling sensitive workloads: memcache and webserver. They handle a ton of small requests - lots of wakeups and sleeps with little actual work in between - so they tend to be canaries for scheduler regressions.

In the tests, the boxes were handling live traffic over the course of several hours. Half the machines, the control, ran with CONFIG_PSI=n.

For memcache I used eight machines total. They're 2-socket, 14 core, 56 thread boxes. The test runs for half the test period, flips the test and control kernels on the hardware to rule out HW factors, DC location etc., then runs the other half of the test.

For the webservers, I used 32 machines total. They're single socket, 16 core, 32 thread machines.

During the memcache test, CPU load was nopsi=78.05% psi=78.98% in the first half and nopsi=77.52% psi=78.25%, so PSI added between 0.7 and 0.9 percentage points to the CPU load, a difference of about 1%.

UPDATE: I re-ran this test with the v3 version of this patch set and the CPU utilization was equivalent between test and control.

UPDATE: v4 is on par with v3.

As far as end-to-end request latency from the client perspective goes, we don't sample those finely enough to capture the requests going to those particular machines during the test, but we know the p50 turnaround time in this workload is 54us, and perf bench sched pipe on those machines show nopsi=5.232666 us/op and psi=5.587347 us/op, so this doesn't add much here either.

The profile for the pipe benchmark shows:

	0.87% sched-pipe [kernel.vmlinux] [k] psi_group_change
	0.83% perf.real  [kernel.vmlinux] [k] psi_group_change
	0.82% perf.real  [kernel.vmlinux] [k] psi_task_change
	0.58% sched-pipe [kernel.vmlinux] [k] psi_task_change

The webserver load is running inside 4 nested cgroup levels. The CPU load with both nopsi and psi kernels was indistinguishable at 81%.

For comparison, we had to disable the cgroup cpu controller on the webservers because it added 4 percentage points to the CPU% during this same exact test.

Versions of this accounting code now run on 80% of our fleet. None of our workloads have reported regressions during the rollout.

Daniel Drake said:

: I just retested the latest version at
: http://git.cmpxchg.org/cgit.cgi/linux-psi.git (Linux 4.18) and the results
: are great.
:
: Test setup:
: Endless OS
: GeminiLake N4200 low end laptop
: 2GB RAM
: swap (and zram swap) disabled
:
: Baseline test: open a handful of large-ish apps and several website
: tabs in Google Chrome.
:
: Results: after a couple of minutes, system is excessively thrashing, mouse
: cursor can barely be moved, UI is not responding to mouse clicks, so it's
: impractical to recover from this situation as an ordinary user
:
: Add my simple killer:
: https://gist.github.com/dsd/a8988bf0b81a6163475988120fe8d9cd
:
: Results: when the thrashing causes the UI to become sluggish, the killer
: steps in and kills something (usually a chrome tab), and the system
: remains usable. I repeatedly opened more apps and more websites over a 15
: minute period but I wasn't able to get the system to a point of UI
: unresponsiveness.

Suren said:

: Backported to 4.9 and retested on ARMv8 8 code system running Android.
: Signals behave as expected reacting to memory pressure, no jumps in
: "total" counters that would indicate an overflow/underflow issues. Nicely
: done!

This patch (of 9):

If we keep just enough refault information to match the *current* page cache during reclaim time, we could lose a lot of events when there is only a temporary spike in non-cache memory consumption that pushes out all the cache. Once cache comes back, we won't see those refaults. They might not be actionable for LRU aging, but we want to know about them for measuring memory pressure.

[hannes@cmpxchg.org: switch to NUMA-aware lru and slab counters]
Link: http://lkml.kernel.org/r/20181009184732.762-2-hannes@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <jweiner@fb.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
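The file format described above is simple enough to parse directly. A minimal userspace sketch, assuming a CONFIG_PSI=y kernel exposing /proc/pressure/memory in the documented "some"/"full" format:

```c
#include <stdio.h>

/*
 * Minimal sketch: read /proc/pressure/memory and print the avg10 and
 * total values for the "some" and "full" lines described above.
 * Error handling is kept to a minimum.
 */
int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/pressure/memory", "r");

	if (!f) {
		perror("open /proc/pressure/memory");
		return 1;
	}

	while (fgets(line, sizeof(line), f)) {
		char kind[8];
		double avg10, avg60, avg300;
		unsigned long long total;

		if (sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
			   kind, &avg10, &avg60, &avg300, &total) == 5)
			printf("%s: avg10=%.2f%% total=%lluus\n", kind, avg10, total);
	}

	fclose(f);
	return 0;
}
```

The same loop works unchanged for /proc/pressure/cpu and /proc/pressure/io, or for the per-cgroup *.pressure files on cgroup2.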
-
- 10 Jun 2019, 3 commits
-
-
Committed by Nadav Amit
[ Upstream commit 7298e24f904224fa79eb8fd7e0fbd78950ccf2db ]

Set the page as executable after allocation. This patch is a preparatory patch for a following patch that makes module allocated pages non-executable.

While at it, do some small cleanup of what appears to be unnecessary masking.

Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <akpm@linux-foundation.org>
Cc: <ard.biesheuvel@linaro.org>
Cc: <deneen.t.dock@intel.com>
Cc: <kernel-hardening@lists.openwall.com>
Cc: <kristen@linux.intel.com>
Cc: <linux_dti@icloud.com>
Cc: <will.deacon@arm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190426001143.4983-11-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Nadav Amit
[ Upstream commit 3c0dab44e22782359a0a706cbce72de99a22aa75 ]

Since alloc_module() will not set the pages as executable soon, set ftrace trampoline pages as executable after they are allocated.

For the time being, do not change ftrace to use the text_poke() interface. As a result, ftrace still breaks W^X.

Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: <akpm@linux-foundation.org>
Cc: <ard.biesheuvel@linaro.org>
Cc: <deneen.t.dock@intel.com>
Cc: <kernel-hardening@lists.openwall.com>
Cc: <kristen@linux.intel.com>
Cc: <linux_dti@icloud.com>
Cc: <will.deacon@arm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190426001143.4983-10-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Steven Rostedt (VMware)
[ Upstream commit d2a68c4effd821f0871d20368f76b609349c8a3b ]

Since commit 79922b80 ("ftrace: Optimize function graph to be called directly"), dynamic trampolines should not be calling the function graph tracer at the end. If they do, it could cause the function graph tracer to trace functions that it filtered out.

Right now it does not cause a problem because there's a test to check if the function graph tracer is attached to the same function as the function tracer, which for now is true. But the function graph tracer is undergoing changes that can make this no longer true which will cause the function graph tracer to trace other functions.

For example:

 # cd /sys/kernel/tracing/
 # echo do_IRQ > set_ftrace_filter
 # mkdir instances/foo
 # echo ip_rcv > instances/foo/set_ftrace_filter
 # echo function_graph > current_tracer
 # echo function > instances/foo/current_tracer

Would cause the function graph tracer to trace both do_IRQ and ip_rcv, if the current tests change.

As the current tests prevent this from being a problem, this code does not need to be backported. But it does make the code cleaner.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
- 04 Jun 2019, 35 commits
-
-
Committed by Darrick J. Wong
commit 17614445576b6af24e9cf36607c6448164719c96 upstream.

In commit 4721a601099, we tried to fix a problem wherein directio reads into a splice pipe will bounce EFAULT/EAGAIN all the way out to userspace by simulating a zero-byte short read. This happens because some directio read implementations (xfs) will call bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous reads, but as soon as we run out of pipe buffers that _get_pages call returns EFAULT, which the splice code translates to EAGAIN and bounces out to userspace.

In that commit, the iomap code catches the EFAULT and simulates a zero-byte read, but that causes assertion errors on regular splice reads because xfs doesn't allow short directio reads.

The brokenness is compounded by splice_direct_to_actor immediately bailing on do_splice_to returning <= 0 without ever calling ->actor (which empties out the pipe), so if userspace calls back we'll EFAULT again on the full pipe, and nothing ever gets copied.

Therefore, teach splice_direct_to_actor to clamp its requests to the amount of free space in the pipe and remove the simulated short read from the iomap directio code.

Fixes: 4721a601099 ("iomap: dio data corruption and spurious errors when pipes fill")
Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
Ranted-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Darrick J. Wong
commit 3b50086f0c0d78c144d9483fa292c1509c931b70 upstream.

For VFS listxattr calls, xfs_xattr_put_listent calls __xfs_xattr_put_listent twice if it sees an attribute "trusted.SGI_ACL_FILE": once for that name, and again for "system.posix_acl_access". Unfortunately, if we happen to run out of buffer space while emitting the first name, we set count to -1 (so that we can feed ERANGE to the caller). The second invocation doesn't check that the context parameters make sense and overwrites the byte before the buffer, triggering a KASAN report:

==================================================================
BUG: KASAN: slab-out-of-bounds in strncpy+0xb3/0xd0
Write of size 1 at addr ffff88807fbd317f by task syz/1113

CPU: 3 PID: 1113 Comm: syz Not tainted 5.0.0-rc6-xfsx #rc6
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014
Call Trace:
 dump_stack+0xcc/0x180
 print_address_description+0x6c/0x23c
 kasan_report.cold.3+0x1c/0x35
 strncpy+0xb3/0xd0
 __xfs_xattr_put_listent+0x1a9/0x2c0 [xfs]
 xfs_attr_list_int_ilocked+0x11af/0x1800 [xfs]
 xfs_attr_list_int+0x20c/0x2e0 [xfs]
 xfs_vn_listxattr+0x225/0x320 [xfs]
 listxattr+0x11f/0x1b0
 path_listxattr+0xbd/0x130
 do_syscall_64+0x139/0x560

While we're at it we add an assert to the other put_listent to avoid this sort of thing ever happening to the attrlist_by_handle code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by George Zhang
By default the tcp_tw_timeout value is 60 seconds. The minimum is 1 second and the maximum is 600. This setting is useful on systems under heavy TCP load.

NOTE: setting tcp_tw_timeout below 60 seconds violates the "quiet time" restriction and puts your system at risk of some receivers accepting old data as new, or rejecting new data as old duplicates.

Link: http://web.archive.org/web/20150102003320/http://tools.ietf.org/html/rfc793
Signed-off-by: George Zhang <georgezhang@linux.alibaba.com>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
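A quick way to exercise the knob is through procfs. A hedged sketch, assuming the sysctl is exposed as /proc/sys/net/ipv4/tcp_tw_timeout on kernels carrying this patch:

```c
#include <stdio.h>

/*
 * Sketch: lower tcp_tw_timeout to 5 seconds. The procfs path is an
 * assumption based on the sysctl name; the valid range per the text
 * above is 1-600 seconds, with 60 as the default.
 */
int main(void)
{
	FILE *f = fopen("/proc/sys/net/ipv4/tcp_tw_timeout", "w");

	if (!f) {
		perror("tcp_tw_timeout");
		return 1;
	}
	fprintf(f, "5\n");
	fclose(f);
	return 0;
}
```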
-
Committed by Guilherme G. Piccoli
Commit 37f9579f ("blk-mq: Avoid that submitting a bio concurrently with device removal triggers a crash") introduced a NULL pointer dereference in generic_make_request(). The patch sets q to NULL and enter_succeeded to false; right after, there's an 'if (enter_succeeded)' which is not taken, and then the 'else' will dereference q in blk_queue_dying(q).

This patch just moves the 'q = NULL' to a point in which it won't trigger the oops, although the semantics of this NULLification remains untouched.

A simple test case/reproducer is as follows:

a) Build kernel v5.2-rc1 with CONFIG_BLK_CGROUP=n.

b) Create a raid0 md array with 2 NVMe devices as members, and mount it with an ext4 filesystem.

c) Run the following oneliner (supposing the raid0 is mounted in /mnt):

(dd of=/mnt/tmp if=/dev/zero bs=1M count=999 &); sleep 0.3; echo 1 > /sys/block/nvme0n1/device/device/remove

(whereas nvme0n1 is the 2nd array member)

This will trigger the following oops:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
RIP: 0010:generic_make_request+0x32b/0x400
Call Trace:
 submit_bio+0x73/0x140
 ext4_io_submit+0x4d/0x60
 ext4_writepages+0x626/0xe90
 do_writepages+0x4b/0xe0
[...]

This patch has no functional changes and preserves the md/raid0 behavior when a member is removed before kernel v4.17.

Cc: stable@vger.kernel.org # v4.17
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Eric Ren <renzhengeek@gmail.com>
Fixes: 37f9579f ("blk-mq: Avoid that submitting a bio concurrently with device removal triggers a crash")
Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Ren <renzhen@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Alex Williamson
commit ddefc033eecf23f1e8b81d0663c5db965adf5516 upstream

The commit referenced below introduced device locking around save and restore of state for each device during a PCI bus "try" reset, making it decidedly non-"try" and prone to deadlock in the event that a device is already locked. Restore __pci_reset_bus() and __pci_reset_slot() to their advertised locking semantics by pushing the save and restore functions into the branch where the entire tree is already locked. Extend the helper function names with "_locked" and update the comment to reflect this calling requirement.

Fixes: b014e96d ("PCI: Protect pci_error_handlers->reset_notify() usage with device_lock()")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Zhiyuan Hou <zhiyuan2048@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by kbuild test robot
The lkp-build bot reported the following link errors with IPv6 disabled:

ld: net/hookers/hookers.o:(.data+0x40): undefined reference to `ipv6_specific'
ld: net/hookers/hookers.o:(.data+0x78): undefined reference to `ipv6_mapped'
ld: net/hookers/hookers.o:(.data+0xe8): undefined reference to `inet6_stream_ops'

Fix this issue by adding an IS_ENABLED(CONFIG_IPV6) check.

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
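IS_ENABLED(CONFIG_IPV6) is the usual kernel guard for code that is optional at build time; it evaluates to 1 whether IPv6 is built in or modular. A generic kernel-side sketch of the pattern (not the actual net/hookers.c change):

```c
/* Kernel-side sketch; compiles only in-tree. */
#include <linux/kconfig.h>
#include <linux/printk.h>

static void __maybe_unused init_hooks(void)
{
	pr_info("installing IPv4 hooks\n");
#if IS_ENABLED(CONFIG_IPV6)
	/* References to IPv6-only symbols live only inside this block,
	 * so the object still links when CONFIG_IPV6=n. */
	pr_info("installing IPv6 hooks\n");
#endif
}
```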
-
Committed by kbuild test robot
Fixes: 60448d43 ("writeback: add memcg_blkcg_link tree")
Signed-off-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
-
Committed by Caspar Zhang
read/write_cr0() are used in net/hookers.c, but they are only available on the x86 platform. Add a dependency in Kconfig to disable this feature on other platforms.

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Joseph Qi
Wrap cgroup writeback v1 logic to prevent build errors without CONFIG_CGROUPS or CONFIG_CGROUP_WRITEBACK.

Reported-by: kbuild test robot <lkp@intel.com>
Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Joseph Qi
Since the variable 'flags' is no longer used, remove it as well. Otherwise there is a compile warning "unused variable 'flags'".

Fixes: e1843f01 ("NFSv4.1: Again fix a race where CB_NOTIFY_LOCK fails to wake a waiter")
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Yihao Wu
When a waiter is woken by CB_NOTIFY_LOCK, it retries nfs4_proc_setlk(). The retry may fail, in which case the waiter goes back to sleep. However, the waiter was already removed from clp->cl_lock_waitq when handling CB_NOTIFY_LOCK in nfs4_wake_lock_waiter(), so any subsequent CB_NOTIFY_LOCK won't wake this waiter anymore. We should put the waiter back on clp->cl_lock_waitq before retrying.

Cc: stable@vger.kernel.org #4.9+
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Yihao Wu
Commit b7dbcc0e ("NFSv4.1: Fix a race where CB_NOTIFY_LOCK fails to wake a waiter") found this bug. However, it didn't fix it.

This commit replaces schedule_timeout() with wait_woken() and default_wake_function() with woken_wake_function() in nfs4_retry_setlk() and nfs4_wake_lock_waiter(). wait_woken() uses memory barriers in its implementation to avoid the potential race condition when putting a process into the sleeping state and then waking it up.

Fixes: a1d617d8 ("nfs: allow blocking locks to be awoken by lock callbacks")
Cc: stable@vger.kernel.org #4.9+
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
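The wait_woken() idiom the commit switches to looks roughly like the sketch below. The waitqueue name is taken from the description above; everything else is simplified and not the actual NFS code.

```c
#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/errno.h>

/*
 * Sketch of the wait_woken() pattern: woken_wake_function() and
 * wait_woken() cooperate through a flag in the wait entry, so a wakeup
 * racing with the task going to sleep is not lost.
 */
static int wait_for_lock_callback(struct wait_queue_head *waitq, long timeout)
{
	DEFINE_WAIT_FUNC(wait, woken_wake_function);
	long left;

	add_wait_queue(waitq, &wait);
	/* ... issue the non-blocking lock attempt here ... */
	left = wait_woken(&wait, TASK_INTERRUPTIBLE, timeout);
	remove_wait_queue(waitq, &wait);

	return left == 0 ? -ETIMEDOUT : 0;
}
```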
-
Committed by Jiufei Xue
Writeback control is now supported for the cgroup v1 interface. However, it comes with some restrictions, so introduce a new kernel boot parameter to control the behavior; it is disabled by default. Users can enable writeback control for cgroup v1 with the command line option "cgwb_v1".

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
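Boot-time switches like this are typically wired up with __setup(). A minimal kernel-side sketch of how a "cgwb_v1" flag could be parsed; the symbol names here are illustrative, not necessarily the patch's actual code.

```c
#include <linux/init.h>
#include <linux/cache.h>
#include <linux/types.h>

/* Illustrative only: parse "cgwb_v1" from the kernel command line. */
static bool cgwb_v1_enabled __read_mostly;

static int __init cgwb_v1_setup(char *s)
{
	cgwb_v1_enabled = true;
	return 1;	/* parameter handled */
}
__setup("cgwb_v1", cgwb_v1_setup);
```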
-
Committed by luanshi
There might be tons of files queued in writeback, waiting to be written back. Unfortunately, the writeback's cgroup may have died. In this case we try to reassociate the inode with another writeback, but we possibly can't because the writeback associated with the dead cgroup is the only valid one. A new writeback is then allocated, initialized and associated with the inode over and over until all data resident in the inode's page cache has been flushed to disk. This causes unnecessarily high system load.

Fix the issue by forcing the inode to move to the root cgroup when its previously bound cgroup becomes dead. With this, no more unnecessary writebacks are created and populated, and the system load decreased by about 6x in the test case we carried out:

Without the patch: 30% system load
With the patch: 5% system load

Signed-off-by: luanshi <zhangliguang@linux.alibaba.com>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Jiufei Xue
We have gotten a WARNING when releasing blkcg_css:

[332489.681635] WARNING: CPU: 55 PID: 14859 at lib/list_debug.c:56 __list_del_entry+0x81/0xc0
[332489.682191] list_del corruption, ffff883e6b94d450->prev is LIST_POISON2 (dead000000000200)
......
[332489.683895] CPU: 55 PID: 14859 Comm: kworker/55:2 Tainted: G
[332489.684477] Hardware name: Inspur SA5248M4/X10DRT-PS, BIOS 4.05A 10/11/2016
[332489.685061] Workqueue: cgroup_destroy css_release_work_fn
[332489.685654] ffffc9001d92bd28 ffffffff81380042 ffffc9001d92bd78 0000000000000000
[332489.686269] ffffc9001d92bd68 ffffffff81088f8b 0000003800000000 ffff883e6b94d4a0
[332489.686867] ffff883e6b94d400 ffffffff81ce8fe0 ffff88375b24f400 ffff883e6b94d4a0
[332489.687479] Call Trace:
[332489.688078] [<ffffffff81380042>] dump_stack+0x63/0x81
[332489.688681] [<ffffffff81088f8b>] __warn+0xcb/0xf0
[332489.689276] [<ffffffff8108900f>] warn_slowpath_fmt+0x5f/0x80
[332489.689877] [<ffffffff8139e7c1>] __list_del_entry+0x81/0xc0
[332489.690481] [<ffffffff81125552>] css_release_work_fn+0x42/0x140
[332489.691090] [<ffffffff810a2db9>] process_one_work+0x189/0x420
[332489.691693] [<ffffffff810a309e>] worker_thread+0x4e/0x4b0
[332489.692293] [<ffffffff810a3050>] ? process_one_work+0x420/0x420
[332489.692905] [<ffffffff810a9616>] kthread+0xe6/0x100
[332489.693504] [<ffffffff810a9530>] ? kthread_park+0x60/0x60
[332489.694099] [<ffffffff817184e1>] ret_from_fork+0x41/0x50
[332489.694722] ---[ end trace 0cf869c4a5cfba87 ]---
......

This is caused by calling css_get after the css has been killed by another thread, as described below:

Thread 1:
  cgroup_rmdir
    -> kill_css
      -> percpu_ref_kill_and_confirm
        -> css_killed_ref_fn
  css_killed_work_fn
    -> css_put
      -> css_release

Thread 2:
  wb_get_create
    -> find_blkcg_css
      -> css_get
    -> css_put
      -> css_release (double free)
        -> css_release_workfn
          -> css_free_work_fn
            -> blkcg_css_free

When the double free happens, it may free memory still used by other threads and cause a kernel panic.

Fix this by using css_tryget_online in find_blkcg_css, which will return false if the css has been killed.

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
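The fix relies on css_tryget_online() failing once the css has been killed, instead of blindly taking a reference. A kernel-side sketch of the lookup pattern; the function name is a placeholder, not the patch's actual code.

```c
#include <linux/cgroup.h>

/*
 * Placeholder sketch: take a reference only if the css is still online,
 * so a writeback lookup racing with cgroup_rmdir() cannot over-put it.
 */
static struct cgroup_subsys_state *get_live_css(struct cgroup_subsys_state *css)
{
	if (css && css_tryget_online(css))
		return css;	/* caller must css_put() when done */
	return NULL;		/* css is being destroyed; fall back */
}
```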
-
Committed by Jiufei Xue
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Jiufei Xue
Add a global radix tree to link the memcg and blkcg that the user attaches tasks to when using cgroup v1; this link is used for cgroup writeback.

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
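As a rough illustration of the idea, a radix tree keyed by a memcg identifier can store a pointer to the associated blkcg. The key choice, locking, and names below are assumptions for the sketch, not the patch's actual design.

```c
#include <linux/radix-tree.h>
#include <linux/gfp.h>

/* Illustrative only: one global tree mapping a memcg id to a blkcg pointer. */
static RADIX_TREE(memcg_blkcg_tree, GFP_ATOMIC);

static int link_memcg_blkcg(unsigned long memcg_id, void *blkcg)
{
	/* Record which blkcg the tasks of this memcg were attached to. */
	return radix_tree_insert(&memcg_blkcg_tree, memcg_id, blkcg);
}

static void *lookup_blkcg(unsigned long memcg_id)
{
	/* Writeback can later resolve the memcg back to its blkcg. */
	return radix_tree_lookup(&memcg_blkcg_tree, memcg_id);
}
```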
-
Committed by Jiufei Xue
We found that setting the IMMUTABLE_FL flag on a file in docker returns success even though the container does not have the capability CAP_LINUX_IMMUTABLE.

Commit d1d04ef8 ("ovl: stack file ops") and commit dab5ca8f ("ovl: add lsattr/chattr support") implemented chattr operations on a regular overlay file. ovl_real_ioctl() overrides the current process's subjective credentials with ofs->creator_cred, which has the capability CAP_LINUX_IMMUTABLE, so the call succeeds in vfs_ioctl()->cap_capable().

Fix this by checking the capability before the credentials are overridden. Here we only care about APPEND_FL and IMMUTABLE_FL, so get this information from the inode.

[SzM: move check and call to underlying fs inside inode locked region to prevent two such calls from racing with each other]

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
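The essence of the fix is to check the caller's capability before the credential override whenever the requested flag change touches immutable/append state. A simplified kernel-side sketch of that check, not the exact ovl_real_ioctl() code:

```c
#include <linux/capability.h>
#include <linux/fs.h>

/*
 * Sketch: refuse FS_IOC_SETFLAGS changes to immutable/append state unless
 * the caller has CAP_LINUX_IMMUTABLE, before switching to ofs->creator_cred.
 */
static int check_immutable_caps(struct inode *inode, unsigned int new_flags)
{
	unsigned int old = (IS_IMMUTABLE(inode) ? FS_IMMUTABLE_FL : 0) |
			   (IS_APPEND(inode) ? FS_APPEND_FL : 0);

	if (((new_flags ^ old) & (FS_IMMUTABLE_FL | FS_APPEND_FL)) &&
	    !capable(CAP_LINUX_IMMUTABLE))
		return -EPERM;

	return 0;
}
```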
-
Committed by George Zhang
LVS fullnat replaces the network traffic's source IP with its local IP, so the backend servers cannot obtain the real client IP. To solve this, LVS introduced the TCP option address (TOA) to store the essential IP address information in the last TCP ACK packet of the 3-way handshake, and the backend servers need to retrieve it from the packet header.

In this patch, we introduce the sk_toa_data member in the sock structure to hold the TOA information. There used to be an in-tree module for TOA management, whereas it is now maintained as a standalone module. In this case, the toa module should register its hook function(s) using the interfaces provided by the hookers module.

TOA in the sock structure:

	__be32 sk_toa_data[16];

The hookers module only provides the sk_toa_data placeholder, and the toa module can use this variable through whatever layout it needs.

Hook interfaces: the hookers module replaces the kernel's syn_recv_sock and getname handlers with a stub that chains the toa module's hook function(s) to the original handling function. The hookers module allows hook functions to be installed and uninstalled in any order.

toa module: the external toa module will be provided in a separate RPM package.

[xuyu@linux.alibaba.com: amend commit log]
Signed-off-by: George Zhang <georgezhang@linux.alibaba.com>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
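Since sk_toa_data is only a 64-byte placeholder whose layout is up to the toa module, a userspace sketch can show how such a blob might be filled and read back. The struct below is a hypothetical layout for illustration, not the actual toa module format.

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>
#include <arpa/inet.h>

/* Hypothetical TOA layout: option kind/size, then client port and IPv4 address. */
struct toa_data {
	uint8_t  opcode;
	uint8_t  opsize;
	uint16_t port;	/* network byte order */
	uint32_t ip;	/* network byte order */
};

int main(void)
{
	uint32_t sk_toa_data[16] = { 0 };	/* mirrors __be32 sk_toa_data[16] */
	struct toa_data toa = { .opcode = 254, .opsize = sizeof(toa) };
	struct toa_data out;
	struct in_addr addr;

	toa.port = htons(40000);
	toa.ip = inet_addr("203.0.113.7");
	memcpy(sk_toa_data, &toa, sizeof(toa));	/* what a syn_recv_sock hook might store */

	memcpy(&out, sk_toa_data, sizeof(out));	/* what a getname hook could read back */
	addr.s_addr = out.ip;
	printf("real client %s:%u\n", inet_ntoa(addr), ntohs(out.port));
	return 0;
}
```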
-
Committed by Changpeng Liu
commit 1f23816b8eb8fdc39990abe166c10a18c16f6b21 upstream.

In commit 88c85538, "virtio-blk: add discard and write zeroes features to specification" (https://github.com/oasis-tcs/virtio-spec), the virtio block specification has been extended to add VIRTIO_BLK_T_DISCARD and VIRTIO_BLK_T_WRITE_ZEROES commands. This patch enables support for discard and write zeroes in the virtio-blk driver when the device advertises the corresponding features, VIRTIO_BLK_F_DISCARD and VIRTIO_BLK_F_WRITE_ZEROES.

Signed-off-by: Changpeng Liu <changpeng.liu@intel.com>
Signed-off-by: Daniel Verkamp <dverkamp@chromium.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Jiufei Xue
An unstable TSC will trigger the clocksource watchdog and disable itself; as a result another clocksource will be elected as the current clocksource, which results in performance issues on our servers.

RHEL7 also disabled this feature for some issues, see changelog: [x86] disable clocksource watchdog (Prarit Bhargava) [914709]

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Jiufei Xue
This reverts commit 76d3b851. The returned value of check_tsc_warp() is useless now, remove it.

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Jiufei Xue
This reverts commit cc4db268.

When we hot-add and enable a vCPU, the time inside the VM jumps and then the VM gets stuck. The dmesg shows the following:

[ 48.402948] CPU2 has been hot-added
[ 48.413774] smpboot: Booting Node 0 Processor 2 APIC 0x2
[ 48.415155] kvm-clock: cpu 2, msr 6b615081, secondary cpu clock
[ 48.453690] TSC ADJUST compensate: CPU2 observed 139318776350 warp. Adjust: 139318776350
[ 102.060874] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
[ 102.060874] clocksource: 'kvm-clock' wd_now: 1cb1cfc4bf8 wd_last: 1be9588f1fe mask: ffffffffffffffff
[ 102.060874] clocksource: 'tsc' cs_now: 207d794f7e cs_last: 205a32697a mask: ffffffffffffffff
[ 102.060874] tsc: Marking TSC unstable due to clocksource watchdog
[ 102.070188] KVM setup async PF for cpu 2
[ 102.071461] kvm-stealtime: cpu 2, msr 13ba95000
[ 102.074530] Will online and init hotplugged CPU: 2

This happens because the TSC for the newly added vCPU is initialized to 0 while the others are ahead. The guest then performs TSC ADJUST compensation, which causes the time jump.

Commit bd8fab39 ("KVM: x86: fix maintaining of kvm_clock stability on guest CPU hotplug") can fix this problem. However, the host kernel version may be older, so do not adjust the TSC if the sync test fails; just mark it unstable.

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Joseph Qi
ECI has a use case of configuring each device-mapper disk's throttling policy directly under the root blkio cgroup while actually using the disks in different containers. Since hierarchical throttling is currently only supported on cgroup v2 and ECI uses cgroup v1, we have to enable hierarchical throttling on cgroup v1.

This is ported from redhat 7u, and a year ago Jiufei already ported it to alikernel 4.9 as well, so this change should be acceptable.

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
-
Committed by Eryu Guan
Prior to the xdragon platform 20181230 release (e.g. the 0930 release), vring_use_dma_api() is required to return 'true' unconditionally. Introduce a new kernel boot parameter called "vring_force_dma_api" to control the behavior; boot the xdragon host with "vring_force_dma_api" on the command line to make ENI hotplug work, so that normal ECS hosts keep the original behavior.

Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>
-
Committed by Arjan van de Ven
Cherry-picked from the clear-linux patches: https://github.com/clearlinux-pkgs/linux-kvm/0104-give-rdrand-some-credit.patch

Try to credit rdrand/rdseed with some entropy. In VMs, and even on modern hardware, we're super starved for entropy, and while we can and do wear a tin foil hat, it's very hard to argue that rdrand and rdtsc add zero entropy.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
-
Committed by Julio Montes
Cherry-picked from the kata-containers patches: https://github.com/kata-containers/packaging/tree/master/kernel/patches/0001-NO-UPSTREAM-9P-always-use-cached-inode-to-fill-in-v9.patch

With this, in cache=none mode we don't have to look up a server that might not support the open-unlink-fstat operation.

fixes https://github.com/01org/cc-oci-runtime/issues/47
fixes https://github.com/01org/cc-oci-runtime/issues/1062

Signed-off-by: Julio Montes <julio.montes@intel.com>
Signed-off-by: Peng Tao <bergwolf@gmail.com>
Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Arjan van de Ven
Cherry-picked from the kata-containers patches: https://github.com/kata-containers/packaging/tree/master/kernel/patches/0002-Compile-in-evged-always.patch

We need evged for NEMU (and in general for hw reduced). The config option cannot be set normally since it breaks all regular systems, and hardware reduced is really a runtime choice.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Eric Whitney
commit f456767d3391e9f7d9d25a2e7241d75676dc19da upstream.

Add new code to count canceled pending cluster reservations on bigalloc file systems and to reduce the cluster reservation count on all file systems using delayed allocation. This replaces old code in ext4_da_page_release_reservations that was incorrect.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
-
Committed by Eric Whitney
commit 9fe671496b6c286f9033aedfc1718d67721da0ae upstream.

Modify ext4_ext_remove_space() and the code it calls to correct the reserved cluster count for pending reservations (delayed allocated clusters shared with allocated blocks) when a block range is removed from the extent tree. Pending reservations may be found for the clusters at the ends of written or unwritten extents when a block range is removed.

If a physical cluster at the end of an extent is freed, it's necessary to increment the reserved cluster count to maintain correct accounting if the corresponding logical cluster is shared with at least one delayed and unwritten extent as found in the extents status tree.

Add a new function, ext4_rereserve_cluster(), to reapply a reservation on a delayed allocated cluster sharing blocks with a freed allocated cluster. To avoid ENOSPC on reservation, a flag is applied to ext4_free_blocks() to briefly defer updating the freeclusters counter when an allocated cluster is freed. This prevents another thread from allocating the freed block before the reservation can be reapplied.

Redefine the partial cluster object as a struct to carry more state information and to clarify the code using it.

Adjust the conditional code structure in ext4_ext_remove_space to reduce the indentation level in the main body of the code to improve readability.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
-
Committed by Eric Whitney
commit b6bf9171ef5c37b66d446378ba63af5339a56a97 upstream.

Ext4 does not always reduce the reserved cluster count by the number of clusters allocated when mapping a delayed extent. It sometimes adds back one or more clusters after allocation if delalloc blocks adjacent to the range allocated by ext4_ext_map_blocks() share the clusters newly allocated for that range. However, this overcounts the number of clusters needed to satisfy future mapping requests (holding one or more reservations for clusters that have already been allocated) and premature ENOSPC and quota failures, etc., result.

Ext4 also does not reduce the reserved cluster count when allocating clusters for non-delayed allocated writes that have previously been reserved for delayed writes. This also results in overcounts.

To make it possible to handle reserved cluster accounting for fallocated regions in the same manner as used for other non-delayed writes, do the reserved cluster accounting for them at the time of allocation. In the current code, this is only done later when a delayed extent sharing the fallocated region is finally mapped.

Address comment correcting handling of unsigned long long constant from Jan Kara's review of RFC version of this patch.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
-
Committed by Eric Whitney
commit 0b02f4c0d6d9e2c611dfbdd4317193e9dca740e6 upstream.

The code in ext4_da_map_blocks sometimes reserves space for more delayed allocated clusters than it should, resulting in premature ENOSPC, exceeded quota, and inaccurate free space reporting.

Fix this by checking for written and unwritten blocks shared in the same cluster with the newly delayed allocated block. A cluster reservation should not be made for a cluster for which physical space has already been allocated.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
-
Committed by Eric Whitney
commit 1dc0aa46e74a3366e12f426b7caaca477853e9c3 upstream.

Add a new pending reservation mechanism to help manage reserved cluster accounting. Its primary function is to avoid the need to read extents from the disk when invalidating pages as a result of a truncate, punch hole, or collapse range operation.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
-
Committed by Eric Whitney
commit ad431025aecda85d3ebef5e4a3aca5c1c681d0c7 upstream.

Ext4 contains a few functions that are used to search for delayed extents or blocks in the extents status tree. Rather than duplicate code to add new functions to search for extents with different status values, such as written or a combination of delayed and unwritten, generalize the existing code to search for caller-specified extents status values. Also, move this code into extents_status.c where it is better associated with the data structures it operates upon, and where it can be more readily used to implement new extents status tree functions that might want a broader scope for i_es_lock.

Three missing static specifiers in RFC version of patch reported and fixed by Fengguang Wu <fengguang.wu@intel.com>.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
-
Committed by Greg Kroah-Hartman
-