Commits · 41c9ca31fd412540933c8e1252828dd49c88f0ef · openanolis / cloud-kernel

30 Oct, 2019 30 commits

mm: workingset: tell cache transitions from workingset thrashing · 41c9ca31

Committed by Johannes Weiner on Oct 26, 2018

commit 1899ad18c6072d689896badafb81267b0a1092a4 upstream.

Refaults happen during transitions between workingsets as well as in-place
thrashing.  Knowing the difference between the two has a range of
applications, including measuring the impact of memory shortage on the
system performance, as well as the ability to balance pressure more smartly
between the filesystem cache and the swap-backed workingset.

During workingset transitions, inactive cache refaults and pushes out
established active cache.  When that active cache isn't stale, however,
and also ends up refaulting, that's bona fide thrashing.

Introduce a new page flag that tells on eviction whether the page has been
active or not in its lifetime.  This bit is then stored in the shadow
entry, to classify refaults as transitioning or thrashing.
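
The classification described above can be sketched as a toy model. This is only an illustration: the bit position, the field layout and the function names are assumptions, not the kernel's actual shadow entry format.

```python
from typing import Tuple

# Assumed layout for illustration: lowest bit = "page was active at some point".
WORKINGSET_BIT = 0x1


def pack_shadow(eviction: int, was_active: bool) -> int:
    """Fold an eviction counter and the workingset bit into one integer."""
    return (eviction << 1) | (WORKINGSET_BIT if was_active else 0)


def unpack_shadow(shadow: int) -> Tuple[int, bool]:
    """Recover the eviction counter and the workingset bit."""
    return shadow >> 1, bool(shadow & WORKINGSET_BIT)


def classify_refault(shadow: int) -> str:
    """A refault of a page that was once active counts as thrashing here."""
    _eviction, was_active = unpack_shadow(shadow)
    return "thrashing" if was_active else "transition"
```

A page evicted from the inactive list without ever having been active would refault as "transition"; one that had been active would refault as "thrashing".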

How many page->flags does this leave us with on 32-bit?

	20 bits are always page flags

	21 if you have an MMU

	23 with the zone bits for DMA, Normal, HighMem, Movable

	29 with the sparsemem section bits

	30 if PAE is enabled

	31 with this patch.

So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes.  If
that's not enough, the system can switch to discontigmem and re-gain the 6
or 7 sparsemem section bits.
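
The bit budget above is simple running arithmetic. A small sketch, using only the counts listed in this message (the function name is made up for illustration):

```python
def spare_flag_bits(word_bits: int = 32) -> int:
    """Return the page->flags bits left over on 32-bit PAE, per the list above."""
    used = 20   # always page flags
    used += 1   # MMU                      -> 21
    used += 2   # zone bits                -> 23
    used += 6   # sparsemem section bits   -> 29
    used += 1   # PAE                      -> 30
    used += 1   # this patch               -> 31
    return word_bits - used


spare = spare_flag_bits()
numa_nodes = 2 ** spare  # number of NUMA nodes the spare bits can distinguish
```

With the numbers above, `spare` is 1 and `numa_nodes` is 2, which matches the statement that one bit is left for distinguishing two NUMA nodes.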

Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

41c9ca31

mm: workingset: don't drop refault information prematurely · d9799717

Committed by Johannes Weiner on Oct 26, 2018

commit 95f9ab2d596e8cbb388315e78c82b9a131bf2928 upstream.

Patch series "psi: pressure stall information for CPU, memory, and IO", v4.

		Overview

PSI reports the overall wallclock time in which the tasks in a system (or
cgroup) wait for (contended) hardware resources.

This helps users understand the resource pressure their workloads are
under, which allows them to rootcause and fix throughput and latency
problems caused by overcommitting, underprovisioning, suboptimal job
placement in a grid; as well as anticipate major disruptions like OOM.

		Real-world applications

We're using the data collected by PSI (and its previous incarnation,
memdelay) quite extensively at Facebook, and with several success stories.

One usecase is avoiding OOM hangs/livelocks.  The reason these happen is
because the OOM killer is triggered by reclaim not being able to free
pages, but with fast flash devices there is *always* some clean and
uptodate cache to reclaim; the OOM killer never kicks in, even as tasks
spend 90% of the time thrashing the cache pages of their own executables.
There is no situation where this ever makes sense in practice.  We wrote a
<100 line POC python script to monitor memory pressure and kill stuff way
before such pathological thrashing leads to full system losses that would
require forcible hard resets.

We've since extended and deployed this code into other places to guarantee
latency and throughput SLAs, since they're usually violated way before the
kernel OOM killer would ever kick in.

It is available here: https://github.com/facebookincubator/oomd

Eventually we probably want to trigger the in-kernel OOM killer based on
extreme sustained pressure as well, so that Linux can avoid memory
livelocks - which technically aren't deadlocks, but to the user
indistinguishable from them - out of the box.  We'd continue using OOMD as
the first line of defense to ensure workload health and implement complex
kill policies that are beyond the scope of the kernel.

We also use PSI memory pressure for loadshedding.  Our batch job
infrastructure used to use heuristics based on various VM stats to
anticipate OOM situations, with lackluster success.  We switched it to PSI
and managed to anticipate and avoid OOM kills and lockups fairly reliably.
The reduction of OOM outages in the worker pool raised the pool's
aggregate productivity, and we were able to switch that service to smaller
machines.

Lastly, we use cgroups to isolate a machine's main workload from
maintenance crap like package upgrades, logging, configuration, as well as
to prevent multiple workloads on a machine from stepping on each others'
toes.  We were not able to configure this properly without the pressure
metrics; we would see latency or bandwidth drops, but it would often be
hard to impossible to rootcause it post-mortem.

We now log and graph pressure for the containers in our fleet and can
trivially link latency spikes and throughput drops to shortages of
specific resources after the fact, and fix the job config/scheduling.

PSI has also received testing, feedback, and feature requests from Android
and EndlessOS for the purpose of low-latency OOM killing, to intervene in
pressure situations before the UI starts hanging.

		How do you use this feature?

A kernel with CONFIG_PSI=y will create a /proc/pressure directory with 3
files: cpu, memory, and io.  If using cgroup2, cgroups will also have
cpu.pressure, memory.pressure and io.pressure files, which simply
aggregate task stalls at the cgroup level instead of system-wide.

The cpu file contains one line:

	some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722

The averages give the percentage of walltime in which one or more tasks
are delayed on the runqueue while another task has the CPU.  They're
recent averages over 10s, 1m, 5m windows, so you can tell short term
trends from long term ones, similarly to the load average.

The total= value gives the absolute stall time in microseconds.  This
allows detecting latency spikes that might be too short to sway the
running averages.  It also allows custom time averaging in case the
10s/1m/5m windows aren't adequate for the usecase (or are too coarse with
future hardware).
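
As a rough userspace sketch of how these fields could be consumed: the helper names below are made up, and the only thing taken from this message is the line format. The second function shows custom time averaging from two samples of `total=`.

```python
def parse_psi_line(line: str) -> dict:
    """Parse one pressure line, e.g. 'some avg10=2.04 ... total=157656722'."""
    kind, *fields = line.split()
    parsed = {"kind": kind}
    for field in fields:
        key, value = field.split("=", 1)
        # total= is a microsecond counter; the avg* fields are percentages.
        parsed[key] = int(value) if key == "total" else float(value)
    return parsed


def stall_percent(total_before_us: int, total_after_us: int, elapsed_s: float) -> float:
    """Percentage of wallclock time stalled between two samples of total=."""
    return (total_after_us - total_before_us) / (elapsed_s * 1_000_000) * 100.0


sample = parse_psi_line("some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722")
```

Sampling `total=` twice, one second apart, and passing both values to `stall_percent` gives a 1s average, a window shorter than the built-in 10s one.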

What to make of this "some" metric?  If CPU utilization is at 100% and CPU
pressure is 0, it means the system is perfectly utilized, with one
runnable thread per CPU and nobody waiting.  At two or more runnable tasks
per CPU, the system is 100% overcommitted and the pressure average will
indicate as much.  From a utilization perspective this is a great state of
course: no CPU cycles are being wasted, even when 50% of the threads were
to go idle (as most workloads do vary).  From the perspective of the
individual job it's not great, however, and they would do better with more
resources.  Depending on what your priority and options are, raised "some"
numbers may or may not require action.

The memory file contains two lines:

	some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
	full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258

The some line is the same as for cpu, the time in which at least one task
is stalled on the resource.  In the case of memory, this includes waiting
on swap-in, page cache refaults and page reclaim.

The full line, however, indicates time in which *nobody* is using the CPU
productively due to pressure: all non-idle tasks are waiting for memory in
one form or another.  Significant time spent in there is a good trigger
for killing things, moving jobs to other machines, or dropping incoming
requests, since neither the jobs nor the machine overall are making too
much headway.
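
A minimal sketch of using the full line as such a trigger. The 40% threshold and the function names are assumptions chosen for illustration; a real policy would need tuning for the workload.

```python
def full_avg10(memory_pressure_text: str) -> float:
    """Extract avg10 from the 'full' line of a memory pressure file's contents."""
    for line in memory_pressure_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "full":
            fields = dict(part.split("=", 1) for part in parts[1:])
            return float(fields["avg10"])
    raise ValueError("no 'full' line found")


def should_shed_load(memory_pressure_text: str, threshold_percent: float = 40.0) -> bool:
    """True when recent 'full' memory pressure reaches the (assumed) threshold."""
    return full_avg10(memory_pressure_text) >= threshold_percent


example = (
    "some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828\n"
    "full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258\n"
)
```

With the example contents shown earlier, `should_shed_load(example)` is true at the assumed 40% threshold and false at 60%.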

The io file is similar to memory.  Because the block layer doesn't have a
concept of hardware contention right now (how much longer is my IO request
taking due to other tasks?), it reports CPU potential lost on all IO
delays, not just the potential lost due to competition.

		FAQ

Q: How is PSI's CPU component different from the load average?

A: There are several quirks in the load average that make it hard to
   impossible to tell how overcommitted the CPU really is.

   1. The load average is reported as a raw number of active tasks.
      You need to know how many CPUs there are in the system, how many
      CPUs the workload is allowed to use, then think about what the
      proportion between load and the number of CPUs means for the
      tasks trying to run.

      PSI reports the percentage of wallclock time in which tasks are
      waiting for a CPU to run on. It doesn't matter how many CPUs are
      present or usable. The number always tells the quality of life
      of tasks in the system or in a particular cgroup.

   2. The shortest averaging window is 1m, which is extremely coarse,
      and it's sampled in 5s intervals. A *lot* can happen on a CPU in
      5 seconds. This *may* be able to identify persistent long-term
      trends and very clear and obvious overloads, but it's unusable
      for latency spikes and more subtle overutilization.

      PSI's shortest window is 10s. It also exports the cumulative
      stall times (in microseconds) of synchronously recorded events.

   3. On Linux, the load average for historical reasons includes all
      TASK_UNINTERRUPTIBLE tasks. This gives a broader sense of how
      busy the system is, but on the flipside it doesn't distinguish
      whether tasks are likely to contend over the CPU or IO - which
      obviously requires very different interventions from a sys admin
      or a job scheduler.

      PSI reports independent metrics for CPU and IO. You can tell
      which resource is making the tasks wait, but in conjunction
      still see how overloaded the system is overall.

Q: What's the cost / performance impact of this feature?

A: PSI's primary cost is in the scheduler, in particular task wakeups
   and sleeps.

   I benchmarked this code using Facebook's two most scheduling
   sensitive workloads: memcache and webserver. They handle a ton of
   small requests - lots of wakeups and sleeps with little actual work
   in between - so they tend to be canaries for scheduler regressions.

   In the tests, the boxes were handling live traffic over the course
   of several hours. Half the machines, the control, ran with
   CONFIG_PSI=n.

   For memcache I used eight machines total. They're 2-socket, 14
   core, 56 thread boxes. The test runs for half the test period,
   flips the test and control kernels on the hardware to rule out HW
   factors, DC location etc., then runs the other half of the test.

   For the webservers, I used 32 machines total. They're single
   socket, 16 core, 32 thread machines.

   During the memcache test, CPU load was nopsi=78.05% psi=78.98% in
   the first half and nopsi=77.52% psi=78.25% in the second half, so
   PSI added between 0.7 and 0.9 percentage points to the CPU load, a
   difference of about 1%.

   UPDATE: I re-ran this test with the v3 version of this patch set
   and the CPU utilization was equivalent between test and control.

   UPDATE: v4 is on par with v3.

   As far as end-to-end request latency from the client perspective
   goes, we don't sample those finely enough to capture the requests
   going to those particular machines during the test, but we know the
   p50 turnaround time in this workload is 54us, and perf bench sched
   pipe on those machines shows nopsi=5.232666 us/op and psi=5.587347
   us/op, so this doesn't add much here either.

   The profile for the pipe benchmark shows:

        0.87%  sched-pipe  [kernel.vmlinux]    [k] psi_group_change
        0.83%  perf.real   [kernel.vmlinux]    [k] psi_group_change
        0.82%  perf.real   [kernel.vmlinux]    [k] psi_task_change
        0.58%  sched-pipe  [kernel.vmlinux]    [k] psi_task_change

   The webserver load is running inside 4 nested cgroup levels. The
   CPU load with both nopsi and psi kernels was indistinguishable at
   81%.

   For comparison, we had to disable the cgroup cpu controller on the
   webservers because it added 4 percentage points to the CPU% during
   this same exact test.

   Versions of this accounting code now run on 80% of our fleet. None
   of our workloads have reported regressions during the rollout.

Daniel Drake said:

: I just retested the latest version at
: http://git.cmpxchg.org/cgit.cgi/linux-psi.git (Linux 4.18) and the results
: are great.
:
: Test setup:
: Endless OS
: GeminiLake N4200 low end laptop
: 2GB RAM
: swap (and zram swap) disabled
:
: Baseline test: open a handful of large-ish apps and several website
: tabs in Google Chrome.
:
: Results: after a couple of minutes, system is excessively thrashing, mouse
: cursor can barely be moved, UI is not responding to mouse clicks, so it's
: impractical to recover from this situation as an ordinary user
:
: Add my simple killer:
: https://gist.github.com/dsd/a8988bf0b81a6163475988120fe8d9cd
:
: Results: when the thrashing causes the UI to become sluggish, the killer
: steps in and kills something (usually a chrome tab), and the system
: remains usable.  I repeatedly opened more apps and more websites over a 15
: minute period but I wasn't able to get the system to a point of UI
: unresponsiveness.

Suren said:

: Backported to 4.9 and retested on an ARMv8 8-core system running Android.
: Signals behave as expected, reacting to memory pressure, with no jumps in
: "total" counters that would indicate overflow/underflow issues.  Nicely
: done!

This patch (of 9):

If we keep just enough refault information to match the *current* page
cache during reclaim time, we could lose a lot of events when there is
only a temporary spike in non-cache memory consumption that pushes out all
the cache.  Once cache comes back, we won't see those refaults.  They
might not be actionable for LRU aging, but we want to know about them for
measuring memory pressure.

[hannes@cmpxchg.org: switch to NUMA-aware lru and slab counters]
  Link: http://lkml.kernel.org/r/20181009184732.762-2-hannes@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <jweiner@fb.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

d9799717

splice: don't read more than available pipe space · 65ff07f8

Committed by Darrick J. Wong on Jun 01, 2019

commit 17614445576b6af24e9cf36607c6448164719c96 upstream.

In commit 4721a601099, we tried to fix a problem wherein directio reads
into a splice pipe will bounce EFAULT/EAGAIN all the way out to
userspace by simulating a zero-byte short read.  This happens because
some directio read implementations (xfs) will call
bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous
reads, but as soon as we run out of pipe buffers that _get_pages call
returns EFAULT, which the splice code translates to EAGAIN and bounces
out to userspace.

In that commit, the iomap code catches the EFAULT and simulates a
zero-byte read, but that causes assertion errors on regular splice reads
because xfs doesn't allow short directio reads.

The brokenness is compounded by splice_direct_to_actor immediately
bailing on do_splice_to returning <= 0 without ever calling ->actor
(which empties out the pipe), so if userspace calls back we'll EFAULT
again on the full pipe, and nothing ever gets copied.

Therefore, teach splice_direct_to_actor to clamp its requests to the
amount of free space in the pipe and remove the simulated short read
from the iomap directio code.
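
The clamping idea can be modelled in a few lines. This is a sketch of the arithmetic only, not the kernel code: the 4096-byte page size and the parameter names are assumptions.

```python
PAGE_SIZE = 4096  # assumed page size for the illustration


def clamp_to_pipe_space(requested_len: int, pipe_buffers: int,
                        used_buffers: int, page_size: int = PAGE_SIZE) -> int:
    """Never ask the reader for more bytes than the pipe's free buffers can hold."""
    free_bytes = (pipe_buffers - used_buffers) * page_size
    return min(requested_len, free_bytes)
```

For a pipe with 16 buffers of which 12 are in use, a 1 MiB request would be clamped to 4 pages (16384 bytes), so the read never runs out of pipe buffers partway through.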

Fixes: 4721a601099 ("iomap: dio data corruption and spurious errors when pipes fill")
Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
Ranted-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

65ff07f8

net/tcp: Support tunable tcp timeout value in TIME-WAIT state · b416e029

Committed by George Zhang on Mar 28, 2018

By default the tcp_tw_timeout value is 60 seconds. The minimum is
1 second and the maximum is 600. This setting is useful on systems under
heavy tcp load.

NOTE: setting tcp_tw_timeout below 60 seconds violates the "quiet time"
restriction and puts your system at risk of having old data accepted as
new, or new data rejected as an old duplicate, by some receivers.

Link: http://web.archive.org/web/20150102003320/http://tools.ietf.org/html/rfc793
Signed-off-by: George Zhang <georgezhang@linux.alibaba.com>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>

b416e029

PCI: Fix "try" semantics of bus and slot reset · 31d3114e

Committed by Alex Williamson on May 24, 2019

commit ddefc033eecf23f1e8b81d0663c5db965adf5516 upstream

The commit referenced below introduced device locking around save and
restore of state for each device during a PCI bus "try" reset, making
it decidedly non-"try" and prone to deadlock in the event that a device
is already locked. Restore __pci_reset_bus() and __pci_reset_slot()
to their advertised locking semantics by pushing the save and restore
functions into the branch where the entire tree is already locked.
Extend the helper function names with "_locked" and update the comment
to reflect this calling requirement.

Fixes: b014e96d ("PCI: Protect pci_error_handlers->reset_notify() usage with device_lock()")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Zhiyuan Hou <zhiyuan2048@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

31d3114e

net/hookers: fix link error with ipv6 disabled · 19ac03e6

Committed by kbuild test robot on May 23, 2019

lkp-build bot reported the following link error with ipv6 disabled:

ld: net/hookers/hookers.o:(.data+0x40): undefined reference to `ipv6_specific'
ld: net/hookers/hookers.o:(.data+0x78): undefined reference to `ipv6_mapped'
ld: net/hookers/hookers.o:(.data+0xe8): undefined reference to `inet6_stream_ops'

Fix this issue by adding an IS_ENABLED(CONFIG_IPV6) check.

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

19ac03e6

writeback: memcg_blkcg_tree_lock can be static · 0019fa8c

Committed by kbuild test robot on May 23, 2019

Fixes: 60448d43 ("writeback: add memcg_blkcg_link tree")
Signed-off-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>

0019fa8c

net/hookers: only enable on x86 platform · 04fab98a

Committed by Caspar Zhang on May 23, 2019

read/write_cr0() are used in net/hookers.c, but they are only available
on the x86 platform. Add a "depends on" entry in Kconfig to disable this
feature on other platforms.

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

04fab98a

fs/writeback: wrap cgroup writeback v1 logic · 38485c5c

Committed by Joseph Qi on May 22, 2019

Wrap cgroup writeback v1 logic to prevent build errors without
CONFIG_CGROUPS or CONFIG_CGROUP_WRITEBACK.

Reported-by: kbuild test robot <lkp@intel.com>
Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

38485c5c

writeback: introduce cgwb_v1 boot param · f91c270d

Committed by Jiufei Xue on May 13, 2019

So far writeback control is supported for the cgroup v1 interface.
However, it has some restrictions, so introduce a new kernel boot
parameter to control the behavior, which is disabled by default. Users
can enable writeback control for cgroup v1 with the command line
parameter "cgwb_v1".

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

f91c270d

fs/writeback: Attach inode's wb to root if needed · 48a4f267

Committed by luanshi on Oct 09, 2018

There may be tons of files queued in the writeback, waiting to be
written back, while the writeback's cgroup is already dead. In this
case we try to reassociate the inode with another writeback, but this
may fail because the writeback associated with the dead cgroup is the
only valid one. A new writeback is then allocated, initialized and
associated with the inode over and over until all data resident in the
inode's page cache is flushed to disk. This causes unnecessarily high
system load.

Fix the issue by forcibly moving the inode to the root cgroup when the
cgroup it was previously bound to becomes dead. With this change, no
more unnecessary writebacks are created and populated, and the system
load decreased by about 6x in the test case we carried out:
    Without the patch: 30% system load
    With the patch:    5%  system load

Signed-off-by: luanshi <zhangliguang@linux.alibaba.com>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

48a4f267

fs/writeback: fix double free of blkcg_css · f935fb62

Committed by Jiufei Xue on Jan 29, 2018

We have gotten a WARNING when releasing blkcg_css:

[332489.681635] WARNING: CPU: 55 PID: 14859 at lib/list_debug.c:56 __list_del_entry+0x81/0xc0
[332489.682191] list_del corruption, ffff883e6b94d450->prev is LIST_POISON2 (dead000000000200)
......
[332489.683895] CPU: 55 PID: 14859 Comm: kworker/55:2 Tainted: G
[332489.684477] Hardware name: Inspur SA5248M4/X10DRT-PS, BIOS 4.05A 10/11/2016
[332489.685061] Workqueue: cgroup_destroy css_release_work_fn
[332489.685654]  ffffc9001d92bd28 ffffffff81380042 ffffc9001d92bd78 0000000000000000
[332489.686269]  ffffc9001d92bd68 ffffffff81088f8b 0000003800000000 ffff883e6b94d4a0
[332489.686867]  ffff883e6b94d400 ffffffff81ce8fe0 ffff88375b24f400 ffff883e6b94d4a0
[332489.687479] Call Trace:
[332489.688078]  [<ffffffff81380042>] dump_stack+0x63/0x81
[332489.688681]  [<ffffffff81088f8b>] __warn+0xcb/0xf0
[332489.689276]  [<ffffffff8108900f>] warn_slowpath_fmt+0x5f/0x80
[332489.689877]  [<ffffffff8139e7c1>] __list_del_entry+0x81/0xc0
[332489.690481]  [<ffffffff81125552>] css_release_work_fn+0x42/0x140
[332489.691090]  [<ffffffff810a2db9>] process_one_work+0x189/0x420
[332489.691693]  [<ffffffff810a309e>] worker_thread+0x4e/0x4b0
[332489.692293]  [<ffffffff810a3050>] ? process_one_work+0x420/0x420
[332489.692905]  [<ffffffff810a9616>] kthread+0xe6/0x100
[332489.693504]  [<ffffffff810a9530>] ? kthread_park+0x60/0x60
[332489.694099]  [<ffffffff817184e1>] ret_from_fork+0x41/0x50
[332489.694722] ---[ end trace 0cf869c4a5cfba87 ]---
......

This is caused by calling css_get after the css is killed by another
thread, as described below:

           Thread 1                       Thread 2
cgroup_rmdir
  -> kill_css
    -> percpu_ref_kill_and_confirm
      -> css_killed_ref_fn

css_killed_work_fn
  -> css_put
    -> css_release
                                        wb_get_create
					  -> find_blkcg_css
					    -> css_get
					  -> css_put
					    -> css_release (double free)
    -> css_release_workfn
      -> css_free_work_fn
       -> blkcg_css_free

When the double free happens, it may free memory still used by
other threads and cause a kernel panic.

Fix this by using css_tryget_online in find_blkcg_css, which will return
false if the css is killed.
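
The difference between an unconditional get and a tryget can be shown with a toy reference counter. This is a simplified model written for illustration: `ToyCss` and the two lookup functions are hypothetical and do not mirror the kernel's percpu_ref implementation.

```python
class ToyCss:
    """Toy refcounted object; counts how often its release path runs."""

    def __init__(self):
        self.refs = 1        # base reference held by the cgroup
        self.online = True
        self.releases = 0

    def put(self):
        self.refs -= 1
        if self.refs == 0:
            self.releases += 1   # release path runs

    def kill(self):
        self.online = False
        self.put()               # drop the base reference

    def get(self):
        self.refs += 1           # unconditional, even if already killed

    def tryget_online(self) -> bool:
        if not self.online:
            return False         # refuse to revive a killed object
        self.refs += 1
        return True


def lookup_buggy(css: ToyCss) -> None:
    css.get()
    css.put()                    # refs drops to 0 again: second release


def lookup_fixed(css: ToyCss) -> bool:
    if not css.tryget_online():
        return False
    css.put()
    return True


buggy = ToyCss()
buggy.kill()
lookup_buggy(buggy)

fixed = ToyCss()
fixed.kill()
found = lookup_fixed(fixed)
```

In this model `buggy.releases` ends up at 2 (the double free), while `fixed.releases` stays at 1 because the lookup backs off once the object has been killed.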

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

f935fb62

writeback: add debug info for memcg-blkcg link · 37231c89

Committed by Jiufei Xue on Dec 07, 2017

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

37231c89

writeback: add memcg_blkcg_link tree · 86c80145

Committed by Jiufei Xue on Dec 06, 2017

Add a global radix tree that links the memcg and blkcg the user
attaches tasks to when using cgroup v1; it is used for cgroup
writeback.

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

86c80145

net: kernel hookers service for toa module · 3327025e

Committed by George Zhang on Mar 15, 2019

LVS fullnat will replace network traffic's source ip with its local ip,
and thus the backend servers cannot obtain the real client ip.

To solve this, LVS has introduced the tcp option address (TOA) to store
the essential ip address information in the last tcp ack packet of the
3-way handshake, and the backend servers need to retrieve it from the
packet header.

In this patch, we have introduced the sk_toa_data member in the sock
structure to hold the TOA information. There used to be an in-tree
module for managing TOA, but it is now maintained as a
standalone module.

In this case, the toa module should register its hook function(s) using
the provided interfaces in the hookers module.

TOA in sock structure:

	__be32 sk_toa_data[16];

The hookers module only provides the sk_toa_data placeholder, and the
toa module can use this variable through the layout it needs.

Hook interfaces:

The hookers module replaces the kernel's syn_recv_sock and getname
handler with a stub that chains the toa module's hook function(s) to the
original handling function. The hookers module allows hook functions to
be installed and uninstalled in any order.
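
The stub-and-chain arrangement can be pictured with a small model. Everything here is hypothetical: the class, the hook signatures and the call order (original handler first, then hooks in install order) are assumptions made for the sketch, not the hookers module's interface.

```python
class HookedHandler:
    """Stub that runs the original handler, then any installed hooks."""

    def __init__(self, original):
        self._original = original
        self._hooks = []

    def install(self, hook):
        self._hooks.append(hook)

    def uninstall(self, hook):
        # Hooks can be removed in any order, independent of install order.
        self._hooks.remove(hook)

    def __call__(self, *args):
        result = self._original(*args)
        for hook in list(self._hooks):
            result = hook(result)
        return result


def original_getname(addr):
    return {"addr": addr}


def toa_hook(result):
    # Hypothetical hook: substitute the real client address.
    return dict(result, addr="client-ip")


def audit_hook(result):
    # Hypothetical second hook, only to demonstrate ordering.
    return dict(result, audited=True)


getname = HookedHandler(original_getname)
getname.install(toa_hook)
getname.install(audit_hook)
with_both = getname("lvs-local-ip")
getname.uninstall(toa_hook)      # remove the first hook, keep the second
with_audit_only = getname("lvs-local-ip")
```

Removing `toa_hook` while `audit_hook` stays installed leaves the chain intact, which is the property the last sentence above describes.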

toa module:

The external toa module will be provided in a separate RPM package.

[xuyu@linux.alibaba.com: amend commit log]
Signed-off-by: George Zhang <georgezhang@linux.alibaba.com>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>

3327025e

virtio_blk: add discard and write zeroes support · 4b4424fe

Committed by Changpeng Liu on Nov 01, 2018

commit 1f23816b8eb8fdc39990abe166c10a18c16f6b21 upstream.

In commit 88c85538, "virtio-blk: add discard and write zeroes features
to specification" (https://github.com/oasis-tcs/virtio-spec), the virtio
block specification has been extended to add VIRTIO_BLK_T_DISCARD and
VIRTIO_BLK_T_WRITE_ZEROES commands.  This patch enables support for
discard and write zeroes in the virtio-blk driver when the device
advertises the corresponding features, VIRTIO_BLK_F_DISCARD and
VIRTIO_BLK_F_WRITE_ZEROES.

Signed-off-by: Changpeng Liu <changpeng.liu@intel.com>
Signed-off-by: Daniel Verkamp <dverkamp@chromium.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>

4b4424fe

kconfig: Disable x86 clocksource watchdog · 728f5e05

Committed by Jiufei Xue on Jan 14, 2019

An unstable TSC triggers the clocksource watchdog, which disables the
TSC clocksource. As a result another clocksource is elected as the
current clocksource, which causes performance issues on our servers.

RHEL7 also disabled this feature because of some issues, see changelog:
[x86] disable clocksource watchdog (Prarit Bhargava) [914709]

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>

728f5e05

Revert "x86/tsc: Prepare warp test for TSC adjustment" · 82f6442e

Committed by Jiufei Xue on Jan 10, 2019

This reverts commit 76d3b851.

The return value of check_tsc_warp() is no longer used, so remove it.

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

82f6442e

Revert "x86/tsc: Try to adjust TSC if sync test fails" · 708cb367

Committed by Jiufei Xue on Jan 10, 2019

This reverts commit cc4db268.

When we hot-add and enable a vCPU, the time inside the VM jumps and
the VM then hangs.
The dmesg output looks like this:
[   48.402948] CPU2 has been hot-added
[   48.413774] smpboot: Booting Node 0 Processor 2 APIC 0x2
[   48.415155] kvm-clock: cpu 2, msr 6b615081, secondary cpu clock
[   48.453690] TSC ADJUST compensate: CPU2 observed 139318776350 warp.  Adjust: 139318776350
[  102.060874] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
[  102.060874] clocksource:                       'kvm-clock' wd_now: 1cb1cfc4bf8 wd_last: 1be9588f1fe mask: ffffffffffffffff
[  102.060874] clocksource:                       'tsc' cs_now: 207d794f7e cs_last: 205a32697a mask: ffffffffffffffff
[  102.060874] tsc: Marking TSC unstable due to clocksource watchdog
[  102.070188] KVM setup async PF for cpu 2
[  102.071461] kvm-stealtime: cpu 2, msr 13ba95000
[  102.074530] Will online and init hotplugged CPU: 2

This is because the TSC for the newly added vCPU is initialized to 0
while the others are ahead. The guest performs the TSC ADJUST
compensation, which causes the time jump.

Commit bd8fab39 ("KVM: x86: fix maintaining of kvm_clock stability
on guest CPU hotplug") can fix this problem.  However, the host kernel
version may be older, so do not adjust the TSC if the sync test fails;
just mark it unstable.

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

708cb367

block-throttle: enable hierarchical throttling even on traditional hierarchy · 61518922

Committed by Joseph Qi on Dec 12, 2017

ECI may have a use case in which the throttling policy of each device
mapper disk is configured just under the root blkio cgroup, while the
disks are actually used in different containers.
Since hierarchical throttling is currently only supported on cgroup v2
and ECI uses cgroup v1, we have to enable hierarchical throttling on
cgroup v1.
This is ported from redhat 7u, and a year ago Jiufei already ported it
to alikernel 4.9 as well. So I think this change should be acceptable.

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>

61518922

eci: drivers/virtio: add vring_force_dma_api boot param · 132cd87e

Committed by Eryu Guan on Dec 24, 2018

Prior to xdragon platform 20181230 release (e.g. 0930 release),
vring_use_dma_api() is required to return 'true' unconditionally.

Introduce a new kernel boot parameter called "vring_force_dma_api" to
control the behavior. Boot the xdragon host with "vring_force_dma_api"
on the command line to make ENI hotplug work; normal ECS hosts keep the
original behavior.

Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>

132cd87e

boot: give rdrand some credit · 0d6d72af

Committed by Arjan van de Ven on Jul 29, 2016

Cherry-pick from clear-linux patches:
https://github.com/clearlinux-pkgs/linux-kvm/0104-give-rdrand-some-credit.patch

try to credit rdrand/rdseed with some entropy

In VMs but even modern hardware, we're super starved for entropy, and while we can
and do wear a tin foil hat, it's very hard to argue that
rdrand and rdtsc add zero entropy.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>

0d6d72af

NO-UPSTREAM: 9P: always use cached inode to fill in v9fs_vfs_getattr · f3757350

Committed by Julio Montes on Sep 18, 2017

Cherry-pick from kata-container patches:
https://github.com/kata-containers/packaging/tree/master/kernel/patches/0001-NO-UPSTREAM-9P-always-use-cached-inode-to-fill-in-v9.patch

So that in cache=none mode, we don't have to look up a server that
might not support the open-unlink-fstat operation.

fixes https://github.com/01org/cc-oci-runtime/issues/47
fixes https://github.com/01org/cc-oci-runtime/issues/1062
Signed-off-by: Julio Montes <julio.montes@intel.com>
Signed-off-by: Peng Tao <bergwolf@gmail.com>
Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

f3757350

NEMU: Compile in evged always · 1c63c40d

Committed by Arjan van de Ven on Aug 10, 2018

Cherry-pick from kata-container patches:
https://github.com/kata-containers/packaging/tree/master/kernel/patches/0002-Compile-in-evged-always.patch

We need evged for NEMU (and in general for hw reduced)

The config option cannot be set normally since it breaks all
regular systems, and hardware reduced is really a runtime choice.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

1c63c40d

ext4: fix reserved cluster accounting at page invalidation time · af1b3490

Committed by Eric Whitney on Oct 01, 2018

commit f456767d3391e9f7d9d25a2e7241d75676dc19da upstream.

Add new code to count canceled pending cluster reservations on bigalloc
file systems and to reduce the cluster reservation count on all file
systems using delayed allocation.  This replaces old code in
ext4_da_page_release_reservations that was incorrect.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>

af1b3490

ext4: adjust reserved cluster count when removing extents · eb792dc6

Committed by Eric Whitney on Oct 01, 2018

commit 9fe671496b6c286f9033aedfc1718d67721da0ae upstream.

Modify ext4_ext_remove_space() and the code it calls to correct the
reserved cluster count for pending reservations (delayed allocated
clusters shared with allocated blocks) when a block range is removed
from the extent tree. Pending reservations may be found for the clusters
at the ends of written or unwritten extents when a block range is removed.
If a physical cluster at the end of an extent is freed, it's necessary
to increment the reserved cluster count to maintain correct accounting
if the corresponding logical cluster is shared with at least one
delayed and unwritten extent as found in the extents status tree.

Add a new function, ext4_rereserve_cluster(), to reapply a reservation
on a delayed allocated cluster sharing blocks with a freed allocated
cluster. To avoid ENOSPC on reservation, a flag is applied to
ext4_free_blocks() to briefly defer updating the freeclusters counter
when an allocated cluster is freed. This prevents another thread
from allocating the freed block before the reservation can be reapplied.

Redefine the partial cluster object as a struct to carry more state
information and to clarify the code using it.

Adjust the conditional code structure in ext4_ext_remove_space to
reduce the indentation level in the main body of the code to improve
readability.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>

eb792dc6

ext4: reduce reserved cluster count by number of allocated clusters · 39ad46f5

Committed by Eric Whitney on Oct 01, 2018

commit b6bf9171ef5c37b66d446378ba63af5339a56a97 upstream.

Ext4 does not always reduce the reserved cluster count by the number
of clusters allocated when mapping a delayed extent. It sometimes
adds back one or more clusters after allocation if delalloc blocks
adjacent to the range allocated by ext4_ext_map_blocks() share the
clusters newly allocated for that range. However, this overcounts
the number of clusters needed to satisfy future mapping requests
(holding one or more reservations for clusters that have already been
allocated) and premature ENOSPC and quota failures, etc., result.

Ext4 also does not reduce the reserved cluster count when allocating
clusters for non-delayed allocated writes that have previously been
reserved for delayed writes. This also results in overcounts.

To make it possible to handle reserved cluster accounting for
fallocated regions in the same manner as used for other non-delayed
writes, do the reserved cluster accounting for them at the time of
allocation. In the current code, this is only done later when a
delayed extent sharing the fallocated region is finally mapped.

Address comment correcting handling of unsigned long long constant
from Jan Kara's review of RFC version of this patch.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>

39ad46f5

ext4: fix reserved cluster accounting at delayed write time · b834eb8a

Committed by Eric Whitney on Oct 01, 2018

commit 0b02f4c0d6d9e2c611dfbdd4317193e9dca740e6 upstream.

The code in ext4_da_map_blocks sometimes reserves space for more
delayed allocated clusters than it should, resulting in premature
ENOSPC, exceeded quota, and inaccurate free space reporting.

Fix this by checking for written and unwritten blocks shared in the
same cluster with the newly delayed allocated block.  A cluster
reservation should not be made for a cluster for which physical space
has already been allocated.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>

b834eb8a
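
The rule stated in this commit message (do not reserve a cluster that already has physical space allocated) can be illustrated with a toy model. This is only a sketch and not the ext4 code: the `blk_state` enum, the flat per-block array, and `cluster_needs_reservation()` are hypothetical names invented for illustration. Skipping clusters that already hold a delayed block is an additional assumption of this model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-block states; not the kernel's extent status flags. */
enum blk_state { BLK_HOLE = 0, BLK_DELAYED, BLK_WRITTEN, BLK_UNWRITTEN };

/*
 * Decide whether delaying logical block `lblk` needs a new cluster
 * reservation. A cluster holds (1 << cluster_bits) blocks.
 */
bool cluster_needs_reservation(const enum blk_state *blocks, size_t nblocks,
                               unsigned int cluster_bits, size_t lblk)
{
    size_t first, past, b;

    assert(blocks != NULL && lblk < nblocks);

    /* First block of the cluster containing lblk, and one past its end. */
    first = (lblk >> cluster_bits) << cluster_bits;
    past = first + ((size_t)1 << cluster_bits);
    if (past > nblocks)
        past = nblocks;

    for (b = first; b < past; b++) {
        /* Physical space already allocated: no reservation needed. */
        if (blocks[b] == BLK_WRITTEN || blocks[b] == BLK_UNWRITTEN)
            return false;
        /* Assumption of this model: a delayed block already reserved it. */
        if (blocks[b] == BLK_DELAYED)
            return false;
    }
    return true;
}
```

With 4 blocks per cluster (`cluster_bits` of 2), delaying a block next to a written block in the same cluster needs no reservation, while delaying a block in an empty cluster does.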

ext4: add new pending reservation mechanism · a7fecf4b

Committed by Eric Whitney on Oct 01, 2018

commit 1dc0aa46e74a3366e12f426b7caaca477853e9c3 upstream.

Add new pending reservation mechanism to help manage reserved cluster
accounting.  Its primary function is to avoid the need to read extents
from the disk when invalidating pages as a result of a truncate, punch
hole, or collapse range operation.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>

a7fecf4b

ext4: generalize extents status tree search functions · 05cc8939

Committed by Eric Whitney on Oct 01, 2018

commit ad431025aecda85d3ebef5e4a3aca5c1c681d0c7 upstream.

Ext4 contains a few functions that are used to search for delayed
extents or blocks in the extents status tree.  Rather than duplicate
code to add new functions to search for extents with different status
values, such as written or a combination of delayed and unwritten,
generalize the existing code to search for caller-specified extents
status values.  Also, move this code into extents_status.c where it
is better associated with the data structures it operates upon, and
where it can be more readily used to implement new extents status tree
functions that might want a broader scope for i_es_lock.

Three missing static specifiers in RFC version of patch reported and
fixed by Fengguang Wu <fengguang.wu@intel.com>.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>

05cc8939
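
The idea of searching for "caller-specified extents status values" can be sketched with a predicate passed as a function pointer. The following is a simplified illustration, not the kernel implementation: it scans a flat array rather than the extents status tree, takes no lock, and every name in it (`struct extent`, `match_fn`, `find_extent_range()`, the `ES_*` bits) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status bits, loosely modeled on extent status values. */
enum { ES_WRITTEN = 1, ES_UNWRITTEN = 2, ES_DELAYED = 4 };

struct extent {
    unsigned int start;   /* first logical block */
    unsigned int len;     /* number of blocks */
    unsigned int status;  /* combination of ES_* bits */
};

/* Caller-supplied predicate deciding which extents qualify. */
typedef int (*match_fn)(const struct extent *es);

int match_delayed(const struct extent *es)
{
    return (es->status & ES_DELAYED) != 0;
}

int match_delayed_or_unwritten(const struct extent *es)
{
    return (es->status & (ES_DELAYED | ES_UNWRITTEN)) != 0;
}

/*
 * Return the first extent that overlaps [lblk, end] and is accepted by
 * `match`, or NULL if there is none. One search routine serves every
 * status combination; only the predicate changes.
 */
const struct extent *find_extent_range(const struct extent *tbl, size_t n,
                                       match_fn match,
                                       unsigned int lblk, unsigned int end)
{
    size_t i;

    assert(match != NULL);

    for (i = 0; i < n; i++) {
        const struct extent *es = &tbl[i];
        unsigned int last;

        if (es->len == 0)
            continue;
        last = es->start + es->len - 1;
        if (last < lblk || es->start > end)
            continue;   /* no overlap with the requested range */
        if (match(es))
            return es;
    }
    return NULL;
}
```

Adding a search for a new status combination then means writing one small predicate rather than duplicating the search loop.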

29 Oct, 2019 10 commits

Linux 4.19.81 · ef244c30

Committed by Greg Kroah-Hartman on Oct 29, 2019

ef244c30

RDMA/cxgb4: Do not dma memory off of the stack · 27414f90

Committed by Greg KH on Oct 01, 2019

commit 3840c5b78803b2b6cc1ff820100a74a092c40cbb upstream.

Nicolas pointed out that the cxgb4 driver is doing dma off of the stack,
which is generally considered a very bad thing. On some architectures it
could be a security problem, but odds are none of them actually run this
driver, so it's just a "normal" bug.

Resolve this by allocating the memory for a message off of the heap
instead of the stack. kmalloc() always will give us a proper memory
location that DMA will work correctly from.

Link: https://lore.kernel.org/r/20191001165611.GA3542072@kroah.com
Reported-by: Nicolas Waisman <nico@semmle.com>
Tested-by: Potnuri Bharat Teja <bharat@chelsio.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

27414f90

blk-rq-qos: fix first node deletion of rq_qos_del() · 05444118

Committed by Tejun Heo on Oct 15, 2019

commit 307f4065b9d7c1e887e8bdfb2487e4638559fea1 upstream.

rq_qos_del() incorrectly assigns the node being deleted to the head if
it was the first on the list in the !prev path.  Fix it by iterating
with ** instead.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Fixes: a7905043 ("blk-rq-qos: refactor out common elements of blk-wbt")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

05444118
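
"Iterating with **" refers to the usual pointer-to-pointer idiom for deleting from a singly linked list. A minimal sketch of that idiom follows, with a hypothetical `struct node` and `list_del_node()` standing in for the kernel's types; it is not the rq_qos_del() code itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a singly linked list element. */
struct node {
    int id;
    struct node *next;
};

/*
 * Unlink `victim` from the list rooted at *head.
 *
 * `cur` always points at the link leading to the current node: first
 * the head pointer itself, then each node's `next` field. Writing
 * through `cur` therefore updates the head when the victim is the
 * first node, with no separate "no previous node" branch.
 *
 * Returns 1 if the node was found and unlinked, 0 otherwise.
 */
int list_del_node(struct node **head, struct node *victim)
{
    struct node **cur;

    assert(head != NULL && victim != NULL);

    for (cur = head; *cur; cur = &(*cur)->next) {
        if (*cur == victim) {
            *cur = victim->next;
            victim->next = NULL;
            return 1;
        }
    }
    return 0;
}
```

Because the head pointer and the `next` fields are handled uniformly, removing the first element takes the same code path as removing any other element.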

PCI: PM: Fix pci_power_up() · 2ada4030

Committed by Rafael J. Wysocki on Oct 14, 2019

commit 45144d42f299455911cc29366656c7324a3a7c97 upstream.

There is an arbitrary difference between the system resume and
runtime resume code paths for PCI devices regarding the delay to
apply when switching the devices from D3cold to D0.

Namely, pci_restore_standard_config() used in the runtime resume
code path calls pci_set_power_state() which in turn invokes
__pci_start_power_transition() to power up the device through the
platform firmware and that function applies the transition delay
(as per PCI Express Base Specification Revision 2.0, Section 6.6.1).
However, pci_pm_default_resume_early() used in the system resume
code path calls pci_power_up() which doesn't apply the delay at
all and that causes issues to occur during resume from
suspend-to-idle on some systems where the delay is required.

Since there is no reason for that difference to exist, modify
pci_power_up() to follow pci_set_power_state() more closely and
invoke __pci_start_power_transition() from there to call the
platform firmware to power up the device (in case that's necessary).

Fixes: db288c9c ("PCI / PM: restore the original behavior of pci_set_power_state()")
Reported-by: Daniel Drake <drake@endlessm.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Link: https://lore.kernel.org/linux-pm/CAD8Lp44TYxrMgPLkHCqF9hv6smEurMXvmmvmtyFhZ6Q4SE+dig@mail.gmail.com/T/#m21be74af263c6a34f36e0fc5c77c5449d9406925
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: 3.10+ <stable@vger.kernel.org> # 3.10+
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

2ada4030

xen/netback: fix error path of xenvif_connect_data() · ccb02adf

Committed by Juergen Gross on Oct 18, 2019

commit 3d5c1a037d37392a6859afbde49be5ba6a70a6b3 upstream.

xenvif_connect_data() calls module_put() in case of error. This is
wrong as there is no related module_get().

Remove the superfluous module_put().

Fixes: 279f438e ("xen-netback: Don't destroy the netdev until the vif is shut down")
Cc: <stable@vger.kernel.org> # 3.12
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Wei Liu <wei.liu@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ccb02adf

cpufreq: Avoid cpufreq_suspend() deadlock on system shutdown · 89ab39da

Committed by Rafael J. Wysocki on Oct 09, 2019

commit 65650b35133ff20f0c9ef0abd5c3c66dbce3ae57 upstream.

It is incorrect to set the cpufreq syscore shutdown callback pointer
to cpufreq_suspend(), because that function cannot be run in the
syscore stage of system shutdown for two reasons: (a) it may attempt
to carry out actions depending on devices that have already been shut
down at that point and (b) the RCU synchronization carried out by it
may not be able to make progress then.

The latter issue has been present since commit 45975c7d21a1 ("rcu:
Define RCU-sched API in terms of RCU for Tree RCU PREEMPT builds"),
but the former one has been there since commit 90de2a4a ("cpufreq:
suspend cpufreq governors on shutdown") regardless.

Fix that by dropping cpufreq_syscore_ops altogether and making
device_shutdown() call cpufreq_suspend() directly before shutting
down devices, which is along the lines of what system-wide power
management does.

Fixes: 45975c7d21a1 ("rcu: Define RCU-sched API in terms of RCU for Tree RCU PREEMPT builds")
Fixes: 90de2a4a ("cpufreq: suspend cpufreq governors on shutdown")
Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: 4.0+ <stable@vger.kernel.org> # 4.0+
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

89ab39da

memstick: jmb38x_ms: Fix an error handling path in 'jmb38x_ms_probe()' · 5f19cbb3

Committed by Christophe JAILLET on Oct 05, 2019

commit 28c9fac09ab0147158db0baeec630407a5e9b892 upstream.

If 'jmb38x_ms_count_slots()' returns 0, we must undo the previous
'pci_request_regions()' call.

Goto 'err_out_int' to fix it.

Fixes: 60fdd931 ("memstick: add support for JMicron jmb38x MemoryStick host controller")
Cc: stable@vger.kernel.org
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

5f19cbb3
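
The fix follows the common goto-based unwinding pattern, in which each error label releases what was acquired before the failure point. Below is a userspace sketch of that pattern under stated assumptions: `struct fake_dev`, `fake_probe()`, and both label names are hypothetical, and the boolean fields only model the effect of acquiring and releasing resources such as PCI regions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device state standing in for real PCI resources. */
struct fake_dev {
    bool enabled;       /* models an enabled device */
    bool regions_held;  /* models successfully requested regions */
    int slots;          /* models the detected slot count */
};

/* Returns 0 on success or a negative errno value on failure. */
int fake_probe(struct fake_dev *dev)
{
    int rc;

    assert(dev != NULL);

    dev->enabled = true;        /* step 1: enable the device */
    dev->regions_held = true;   /* step 2: request its regions */

    if (dev->slots == 0) {
        /* Regions are already held, so jump to the label that frees them. */
        rc = -ENODEV;
        goto err_release_regions;
    }

    return 0;

err_release_regions:
    dev->regions_held = false;  /* undo step 2 */
    /* fall through to undo step 1 */
    dev->enabled = false;
    return rc;
}
```

Jumping to a label placed after the region release instead would leave `regions_held` set on the failure path, which is the kind of leak the commit message describes.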

btrfs: tracepoints: Fix bad entry members of qgroup events · 0b95aaae

Committed by Qu Wenruo on Oct 17, 2019

commit 1b2442b4ae0f234daeadd90e153b466332c466d8 upstream.

[BUG]
For btrfs:qgroup_meta_reserve event, the trace event can output garbage:

  qgroup_meta_reserve: 9c7f6acc-b342-4037-bc47-7f6e4d2232d7: refroot=5(FS_TREE) type=DATA diff=2
  qgroup_meta_reserve: 9c7f6acc-b342-4037-bc47-7f6e4d2232d7: refroot=5(FS_TREE) type=0x258792 diff=2

The @type can be completely garbage, as DATA type is not possible for
trace_qgroup_meta_reserve() trace event.

[CAUSE]
There are several problems related to qgroup trace events:
- Unassigned entry member
  Member entry::type of trace_qgroup_update_reserve() and
  trace_qgroup_meta_reserve() is not assigned

- Redundant entry member
  Member entry::type is completely useless in
  trace_qgroup_meta_convert()

Fixes: 4ee0d883 ("btrfs: qgroup: Update trace events for metadata reservation")
CC: stable@vger.kernel.org # 4.10+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

0b95aaae

Btrfs: check for the full sync flag while holding the inode lock during fsync · 1b921b5b

Committed by Filipe Manana on Oct 16, 2019

commit ba0b084ac309283db6e329785c1dc4f45fdbd379 upstream.

We were checking for the full fsync flag in the inode before locking the
inode, which is racy, since at that time it might not be set but
after we acquire the inode lock some other task set it. One case where
this can happen is on a system low on memory and some concurrent task
failed to allocate an extent map and therefore set the full sync flag on
the inode, to force the next fsync to work in full mode.

A consequence of missing the full fsync flag set is hitting the problems
fixed by commit 0c713cbab620 ("Btrfs: fix race between ranged fsync and
writeback of adjacent ranges"), BUG_ON() when dropping extents from a log
tree, hitting assertion failures at tree-log.c:copy_items() or all sorts
of weird inconsistencies after replaying a log due to file extents items
representing ranges that overlap.

So just move the check such that it's done after locking the inode and
before starting writeback again.

Fixes: 0c713cbab620 ("Btrfs: fix race between ranged fsync and writeback of adjacent ranges")
CC: stable@vger.kernel.org # 5.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

1b921b5b

Btrfs: add missing extents release on file extent cluster relocation error · ac6bae2b

Committed by Filipe Manana on Oct 09, 2019

commit 44db1216efe37bf670f8d1019cdc41658d84baf5 upstream.

If we error out when finding a page at relocate_file_extent_cluster(), we
need to release the outstanding extents counter on the relocation inode,
set by the previous call to btrfs_delalloc_reserve_metadata(), otherwise
the inode's block reserve size can never decrease to zero and metadata
space is leaked. Therefore add a call to btrfs_delalloc_release_extents()
in case we can't find the target page.

Fixes: 8b62f87b ("Btrfs: rework outstanding_extents")
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

ac6bae2b

openanolis / cloud-kernel: synced successfully more than 1 year ago