Commits · 57c681c0fa49884856165a30480c72b261a28ae6 · openanolis / cloud-kernel

Aug 17, 2019 · 33 commits

sched: loadavg: make calc_load_n() public · 57c681c0

Authored by Johannes Weiner on Oct 26, 2018

commit 5c54f5b9edb1aa2eabbb1091c458f1b6776a1896 upstream.

It's going to be used in a later patch. Keep the churn separate.

Link: http://lkml.kernel.org/r/20180828172258.3185-6-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

57c681c0

sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD · b0406fce

Authored by Johannes Weiner on Oct 26, 2018

commit 8508cf3ffad4defa202b303e5b6379efc4cd9054 upstream.

There are several definitions of those functions/macros in places that
mess with fixed-point load averages.  Provide an official version.
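For readers unfamiliar with the fixed-point format, the following Python sketch models what such helpers compute. It is not the patch itself: the constants (11 fractional bits, a 1-minute decay factor of 1884) are values commonly seen in the kernel's load-average code and are treated here as assumptions, and the lowercase function names are illustrative.

```python
# Illustrative model of fixed-point load-average helpers (not kernel code).
FSHIFT = 11                # assumed number of fractional bits
FIXED_1 = 1 << FSHIFT      # the value 1.0 in fixed point
EXP_1 = 1884               # assumed decay factor for the 1-minute average


def load_int(x):
    """Integer part of a fixed-point load value."""
    return x >> FSHIFT


def load_frac(x):
    """Two decimal digits of the fractional part."""
    return load_int((x & (FIXED_1 - 1)) * 100)


def calc_load(load, exp, active):
    """One decay step: blend the old load with the current active count."""
    newload = load * exp + active * (FIXED_1 - exp)
    if active >= load:
        newload += FIXED_1 - 1   # round up while the load is rising
    return newload // FIXED_1


def format_load(x):
    return "%d.%02d" % (load_int(x), load_frac(x))
```

Starting from an idle system, one step with a single active task (`calc_load(0, EXP_1, FIXED_1)`) yields a small fixed-point value that `format_load` renders as a two-digit decimal.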

[akpm@linux-foundation.org: fix missed conversion in block/blk-iolatency.c]
Link: http://lkml.kernel.org/r/20180828172258.3185-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[Joseph: use stat.mean instead of stat->rqs.mean to solve conflict]
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

Conflicts:
    block/blk-iolatency.c

b0406fce

delayacct: track delays from thrashing cache pages · 72dfed31

Authored by Johannes Weiner on Oct 26, 2018

commit b1d29ba82cf2bc784f4c963ddd6a2cf29e229b33 upstream.

Delay accounting already measures the time a task spends in direct reclaim
and waiting for swapin, but in low memory situations tasks can spend
a significant amount of their time waiting on thrashing page cache.  This
isn't tracked right now.

To know the full impact of memory contention on an individual task,
measure the delay when waiting for a recently evicted active cache page to
read back into memory.

Also update tools/accounting/getdelays.c:

     [hannes@computer accounting]$ sudo ./getdelays -d -p 1
     print delayacct stats ON
     PID     1

     CPU             count     real total  virtual total    delay total  delay average
                     50318      745000000      847346785      400533713          0.008ms
     IO              count    delay total  delay average
                       435      122601218              0ms
     SWAP            count    delay total  delay average
                         0              0              0ms
     RECLAIM         count    delay total  delay average
                         0              0              0ms
     THRASHING       count    delay total  delay average
                        19       12621439              0ms
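The averages in this output are the delay total divided by the count. The Python sketch below assumes the totals are in nanoseconds (an assumption, although it is consistent with the 0.008ms CPU figure); the function name is illustrative. It also suggests why sub-millisecond averages would show up as 0ms if they are printed as whole milliseconds.

```python
def delay_average_ms(delay_total_ns, count):
    """Average delay per event in milliseconds; 0.0 when nothing was counted."""
    if count == 0:
        return 0.0
    return delay_total_ns / count / 1_000_000


# Values taken from the sample output above.
cpu_avg = delay_average_ms(400533713, 50318)
thrashing_avg = delay_average_ms(12621439, 19)
swap_avg = delay_average_ms(0, 0)
```

With the sample numbers, the thrashing delay works out to well under one millisecond per event.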

Link: http://lkml.kernel.org/r/20180828172258.3185-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

72dfed31

mm: workingset: tell cache transitions from workingset thrashing · fe2c611d

Authored by Johannes Weiner on Oct 26, 2018

commit 1899ad18c6072d689896badafb81267b0a1092a4 upstream.

Refaults happen during transitions between workingsets as well as in-place
thrashing.  Knowing the difference between the two has a range of
applications, including measuring the impact of memory shortage on the
system performance, as well as the ability to smarter balance pressure
between the filesystem cache and the swap-backed workingset.

During workingset transitions, inactive cache refaults and pushes out
established active cache.  When that active cache isn't stale, however,
and also ends up refaulting, that's bona fide thrashing.

Introduce a new page flag that tells on eviction whether the page has been
active or not in its lifetime.  This bit is then stored in the shadow
entry, to classify refaults as transitioning or thrashing.
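As a rough illustration of storing such a bit in a shadow entry, the Python sketch below packs an eviction counter, a memcg id, a node id and a one-bit "was active" flag into a single integer and reads the flag back on refault. The field widths, field order and function names are hypothetical and chosen only for this example; they are not taken from the patch.

```python
# Hypothetical field widths, chosen only for this illustration.
NODE_BITS = 6
MEMCG_BITS = 16


def pack_shadow(memcg_id, node_id, eviction, workingset):
    """Pack the eviction data plus the workingset bit into one integer."""
    entry = eviction
    entry = (entry << MEMCG_BITS) | memcg_id
    entry = (entry << NODE_BITS) | node_id
    entry = (entry << 1) | int(workingset)
    return entry


def unpack_shadow(entry):
    """Reverse of pack_shadow: peel the fields off in the opposite order."""
    workingset = bool(entry & 1)
    entry >>= 1
    node_id = entry & ((1 << NODE_BITS) - 1)
    entry >>= NODE_BITS
    memcg_id = entry & ((1 << MEMCG_BITS) - 1)
    entry >>= MEMCG_BITS
    return memcg_id, node_id, entry, workingset


def classify_refault(entry):
    """A refault of a page that was once active counts as thrashing."""
    return "thrashing" if unpack_shadow(entry)[3] else "transition"
```

The point of the sketch is only that a single extra bit survives eviction alongside the existing data, so the refault path can tell the two cases apart.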

How many page->flags does this leave us with on 32-bit?

	20 bits are always page flags
	21 if you have an MMU
	23 with the zone bits for DMA, Normal, HighMem, Movable
	29 with the sparsemem section bits
	30 if PAE is enabled
	31 with this patch.

So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes.  If
that's not enough, the system can switch to discontigmem and re-gain the 6
or 7 sparsemem section bits.
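The bit budget above can be re-added step by step. This Python sketch only restates the arithmetic from the commit message; the labels are illustrative.

```python
# Per-step increments derived from the running totals in the commit message.
budget = [
    ("always-present page flags", 20),
    ("MMU", 1),
    ("zone bits (DMA, Normal, HighMem, Movable)", 2),
    ("sparsemem section bits", 6),
    ("extra bit with PAE", 1),
    ("flag added by this patch", 1),
]

used = sum(bits for _, bits in budget)
spare = 32 - used                # bits left in a 32-bit page->flags
numa_nodes = 2 ** spare          # nodes distinguishable with the spare bits
section_bits_with_pae = 6 + 1    # what switching away from sparsemem regains
```

The result matches the statement that one bit remains on 32-bit PAE, which is enough to tell two NUMA nodes apart.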

Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

fe2c611d

mm: workingset: don't drop refault information prematurely · 51396adf

Authored by Johannes Weiner on Oct 26, 2018

commit 95f9ab2d596e8cbb388315e78c82b9a131bf2928 upstream.

Patch series "psi: pressure stall information for CPU, memory, and IO", v4.

		Overview

PSI reports the overall wallclock time in which the tasks in a system (or
cgroup) wait for (contended) hardware resources.

This helps users understand the resource pressure their workloads are
under, which allows them to rootcause and fix throughput and latency
problems caused by overcommitting, underprovisioning, suboptimal job
placement in a grid; as well as anticipate major disruptions like OOM.

		Real-world applications

We're using the data collected by PSI (and its previous incarnation,
memdelay) quite extensively at Facebook, and with several success stories.

One usecase is avoiding OOM hangs/livelocks.  The reason these happen is
because the OOM killer is triggered by reclaim not being able to free
pages, but with fast flash devices there is *always* some clean and
uptodate cache to reclaim; the OOM killer never kicks in, even as tasks
spend 90% of the time thrashing the cache pages of their own executables.
There is no situation where this ever makes sense in practice.  We wrote a
<100 line POC python script to monitor memory pressure and kill stuff way
before such pathological thrashing leads to full system losses that would
require forcible hard resets.

We've since extended and deployed this code into other places to guarantee
latency and throughput SLAs, since they're usually violated way before the
kernel OOM killer would ever kick in.

It is available here: https://github.com/facebookincubator/oomd

Eventually we probably want to trigger the in-kernel OOM killer based on
extreme sustained pressure as well, so that Linux can avoid memory
livelocks - which technically aren't deadlocks, but to the user
indistinguishable from them - out of the box.  We'd continue using OOMD as
the first line of defense to ensure workload health and implement complex
kill policies that are beyond the scope of the kernel.

We also use PSI memory pressure for loadshedding.  Our batch job
infrastructure used to use heuristics based on various VM stats to
anticipate OOM situations, with lackluster success.  We switched it to PSI
and managed to anticipate and avoid OOM kills and lockups fairly reliably.
The reduction of OOM outages in the worker pool raised the pool's
aggregate productivity, and we were able to switch that service to smaller
machines.

Lastly, we use cgroups to isolate a machine's main workload from
maintenance crap like package upgrades, logging, configuration, as well as
to prevent multiple workloads on a machine from stepping on each others'
toes.  We were not able to configure this properly without the pressure
metrics; we would see latency or bandwidth drops, but it would often be
hard to impossible to rootcause it post-mortem.

We now log and graph pressure for the containers in our fleet and can
trivially link latency spikes and throughput drops to shortages of
specific resources after the fact, and fix the job config/scheduling.

PSI has also received testing, feedback, and feature requests from Android
and EndlessOS for the purpose of low-latency OOM killing, to intervene in
pressure situations before the UI starts hanging.

		How do you use this feature?

A kernel with CONFIG_PSI=y will create a /proc/pressure directory with 3
files: cpu, memory, and io.  If using cgroup2, cgroups will also have
cpu.pressure, memory.pressure and io.pressure files, which simply
aggregate task stalls at the cgroup level instead of system-wide.

The cpu file contains one line:

	some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722

The averages give the percentage of walltime in which one or more tasks
are delayed on the runqueue while another task has the CPU.  They're
recent averages over 10s, 1m, 5m windows, so you can tell short term
trends from long term ones, similarly to the load average.
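A line in this format is straightforward to parse from userspace. The following Python sketch works on the sample line quoted above; the function name is illustrative, and the sketch assumes the field layout shown in this commit message.

```python
def parse_pressure_line(line):
    """Split 'some avg10=... avg60=... avg300=... total=...' into typed fields."""
    kind, *fields = line.split()
    values = dict(field.split("=", 1) for field in fields)
    return kind, {
        "avg10": float(values["avg10"]),
        "avg60": float(values["avg60"]),
        "avg300": float(values["avg300"]),
        "total": int(values["total"]),
    }


sample = "some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722"
kind, stats = parse_pressure_line(sample)
```

On a kernel with the feature enabled, the same function could be applied to each line read from the pressure files.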

The total= value gives the absolute stall time in microseconds.  This
allows detecting latency spikes that might be too short to sway the
running averages.  It also allows custom time averaging in case the
10s/1m/5m windows aren't adequate for the usecase (or are too coarse with
future hardware).
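Custom time averaging from total= amounts to sampling the counter twice and dividing the difference by the elapsed wall time. A minimal Python sketch, assuming the counter is in microseconds as stated above; the function name and the sample delta are made up for illustration.

```python
def stall_percentage(total_start_us, total_end_us, elapsed_seconds):
    """Percentage of wall time spent stalled between two total= samples."""
    stalled_us = total_end_us - total_start_us
    return 100.0 * stalled_us / (elapsed_seconds * 1_000_000)


# Hypothetical example: the counter advanced by 50 ms within a 1 s window.
pct = stall_percentage(157656722, 157656722 + 50_000, 1.0)
```

Choosing a window shorter than 10s makes brief latency spikes visible that the built-in averages would smooth out.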

What to make of this "some" metric?  If CPU utilization is at 100% and CPU
pressure is 0, it means the system is perfectly utilized, with one
runnable thread per CPU and nobody waiting.  At two or more runnable tasks
per CPU, the system is 100% overcommitted and the pressure average will
indicate as much.  From a utilization perspective this is a great state of
course: no CPU cycles are being wasted, even when 50% of the threads were
to go idle (as most workloads do vary).  From the perspective of the
individual job it's not great, however, and they would do better with more
resources.  Depending on what your priority and options are, raised "some"
numbers may or may not require action.

The memory file contains two lines:

	some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
	full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258

The some line is the same as for cpu, the time in which at least one task
is stalled on the resource.  In the case of memory, this includes waiting
on swap-in, page cache refaults and page reclaim.

The full line, however, indicates time in which *nobody* is using the CPU
productively due to pressure: all non-idle tasks are waiting for memory in
one form or another.  Significant time spent in there is a good trigger
for killing things, moving jobs to other machines, or dropping incoming
requests, since neither the jobs nor the machine overall are making too
much headway.
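A trigger of this kind can be sketched in a few lines of Python. The threshold below is a hypothetical value picked for the example, not a recommendation from the patch series, and the function names are illustrative.

```python
FULL_AVG10_LIMIT = 10.0  # hypothetical threshold, in percent of wall time


def parse_memory_pressure(text):
    """Parse the two-line memory pressure format into nested dicts."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {k: float(v) for k, v in (f.split("=", 1) for f in fields)}
    return result


def should_shed_load(text, limit=FULL_AVG10_LIMIT):
    """True when the short-term 'full' average exceeds the chosen limit."""
    return parse_memory_pressure(text)["full"]["avg10"] > limit


sample = """some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258"""
```

What to do when the check fires (kill a task, migrate a job, reject requests) is a policy decision left to the caller.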

The io file is similar to memory.  Because the block layer doesn't have a
concept of hardware contention right now (how much longer is my IO request
taking due to other tasks?), it reports CPU potential lost on all IO
delays, not just the potential lost due to competition.

		FAQ

Q: How is PSI's CPU component different from the load average?

A: There are several quirks in the load average that make it hard to
   impossible to tell how overcommitted the CPU really is.

   1. The load average is reported as a raw number of active tasks.
      You need to know how many CPUs there are in the system, how many
      CPUs the workload is allowed to use, then think about what the
      proportion between load and the number of CPUs means for the
      tasks trying to run.

      PSI reports the percentage of wallclock time in which tasks are
      waiting for a CPU to run on. It doesn't matter how many CPUs are
      present or usable. The number always tells the quality of life
      of tasks in the system or in a particular cgroup.

   2. The shortest averaging window is 1m, which is extremely coarse,
      and it's sampled in 5s intervals. A *lot* can happen on a CPU in
      5 seconds. This *may* be able to identify persistent long-term
      trends and very clear and obvious overloads, but it's unusable
      for latency spikes and more subtle overutilization.

      PSI's shortest window is 10s. It also exports the cumulative
      stall times (in microseconds) of synchronously recorded events.

   3. On Linux, the load average for historical reasons includes all
      TASK_UNINTERRUPTIBLE tasks. This gives a broader sense of how
      busy the system is, but on the flipside it doesn't distinguish
      whether tasks are likely to contend over the CPU or IO - which
      obviously requires very different interventions from a sys admin
      or a job scheduler.

      PSI reports independent metrics for CPU and IO. You can tell
      which resource is making the tasks wait, but in conjunction
      still see how overloaded the system is overall.

Q: What's the cost / performance impact of this feature?

A: PSI's primary cost is in the scheduler, in particular task wakeups
   and sleeps.

   I benchmarked this code using Facebook's two most scheduling
   sensitive workloads: memcache and webserver. They handle a ton of
   small requests - lots of wakeups and sleeps with little actual work
   in between - so they tend to be canaries for scheduler regressions.

   In the tests, the boxes were handling live traffic over the course
   of several hours. Half the machines, the control, ran with
   CONFIG_PSI=n.

   For memcache I used eight machines total. They're 2-socket, 14
   core, 56 thread boxes. The test runs for half the test period,
   flips the test and control kernels on the hardware to rule out HW
   factors, DC location etc., then runs the other half of the test.

   For the webservers, I used 32 machines total. They're single
   socket, 16 core, 32 thread machines.

   During the memcache test, CPU load was nopsi=78.05% psi=78.98% in
   the first half and nopsi=77.52% psi=78.25% in the second half, so
   PSI added between 0.7 and 0.9 percentage points to the CPU load, a
   difference of about 1%.

   UPDATE: I re-ran this test with the v3 version of this patch set
   and the CPU utilization was equivalent between test and control.

   UPDATE: v4 is on par with v3.

   As far as end-to-end request latency from the client perspective
   goes, we don't sample those finely enough to capture the requests
   going to those particular machines during the test, but we know the
   p50 turnaround time in this workload is 54us, and perf bench sched
   pipe on those machines shows nopsi=5.232666 us/op and psi=5.587347
   us/op, so this doesn't add much here either.
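The overhead figures quoted in this answer can be rechecked with simple arithmetic. The Python sketch below uses only numbers from the text; comparing the per-operation pipe cost against the 54us p50 is an interpretation made for illustration, and the variable names are assumptions.

```python
def overhead(nopsi, psi):
    """Return (added percentage points, relative increase in percent)."""
    points = psi - nopsi
    return points, 100.0 * points / nopsi


first = overhead(78.05, 78.98)     # first measurement pair
second = overhead(77.52, 78.25)    # second measurement pair

pipe_extra_us = 5.587347 - 5.232666            # extra cost per pipe operation
pipe_share_of_p50 = 100.0 * pipe_extra_us / 54.0  # relative to the 54us p50
```

Both CPU measurements land near a 1% relative increase, and the extra pipe cost is well under 1% of the quoted p50 turnaround time.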

   The profile for the pipe benchmark shows:

        0.87%  sched-pipe  [kernel.vmlinux]    [k] psi_group_change
        0.83%  perf.real   [kernel.vmlinux]    [k] psi_group_change
        0.82%  perf.real   [kernel.vmlinux]    [k] psi_task_change
        0.58%  sched-pipe  [kernel.vmlinux]    [k] psi_task_change

   The webserver load is running inside 4 nested cgroup levels. The
   CPU load with both nopsi and psi kernels was indistinguishable at
   81%.

   For comparison, we had to disable the cgroup cpu controller on the
   webservers because it added 4 percentage points to the CPU% during
   this same exact test.

   Versions of this accounting code now run on 80% of our fleet. None
   of our workloads have reported regressions during the rollout.

Daniel Drake said:

: I just retested the latest version at
: http://git.cmpxchg.org/cgit.cgi/linux-psi.git (Linux 4.18) and the results
: are great.
:
: Test setup:
: Endless OS
: GeminiLake N4200 low end laptop
: 2GB RAM
: swap (and zram swap) disabled
:
: Baseline test: open a handful of large-ish apps and several website
: tabs in Google Chrome.
:
: Results: after a couple of minutes, system is excessively thrashing, mouse
: cursor can barely be moved, UI is not responding to mouse clicks, so it's
: impractical to recover from this situation as an ordinary user
:
: Add my simple killer:
: https://gist.github.com/dsd/a8988bf0b81a6163475988120fe8d9cd
:
: Results: when the thrashing causes the UI to become sluggish, the killer
: steps in and kills something (usually a chrome tab), and the system
: remains usable.  I repeatedly opened more apps and more websites over a 15
: minute period but I wasn't able to get the system to a point of UI
: unresponsiveness.

Suren said:

: Backported to 4.9 and retested on ARMv8 8 core system running Android.
: Signals behave as expected reacting to memory pressure, no jumps in
: "total" counters that would indicate overflow/underflow issues.  Nicely
: done!

This patch (of 9):

If we keep just enough refault information to match the *current* page
cache during reclaim time, we could lose a lot of events when there is
only a temporary spike in non-cache memory consumption that pushes out all
the cache.  Once cache comes back, we won't see those refaults.  They
might not be actionable for LRU aging, but we want to know about them for
measuring memory pressure.

[hannes@cmpxchg.org: switch to NUMA-aware lru and slab counters]
  Link: http://lkml.kernel.org/r/20181009184732.762-2-hannes@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <jweiner@fb.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

51396adf

splice: don't read more than available pipe space · cfe522e4

Authored by Darrick J. Wong on Jun 01, 2019

commit 17614445576b6af24e9cf36607c6448164719c96 upstream.

In commit 4721a601099, we tried to fix a problem wherein directio reads
into a splice pipe will bounce EFAULT/EAGAIN all the way out to
userspace by simulating a zero-byte short read.  This happens because
some directio read implementations (xfs) will call
bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous
reads, but as soon as we run out of pipe buffers that _get_pages call
returns EFAULT, which the splice code translates to EAGAIN and bounces
out to userspace.

In that commit, the iomap code catches the EFAULT and simulates a
zero-byte read, but that causes assertion errors on regular splice reads
because xfs doesn't allow short directio reads.

The brokenness is compounded by splice_direct_to_actor immediately
bailing on do_splice_to returning <= 0 without ever calling ->actor
(which empties out the pipe), so if userspace calls back we'll EFAULT
again on the full pipe, and nothing ever gets copied.

Therefore, teach splice_direct_to_actor to clamp its requests to the
amount of free space in the pipe and remove the simulated short read
from the iomap directio code.
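The clamping idea can be modelled in a few lines. This Python sketch is a toy model, not the kernel change: the function name, the buffer-count parameters and the 4096-byte page size are assumptions made for the example.

```python
PAGE_SIZE = 4096  # assumed page size for the example


def clamp_read_len(requested, total_buffers, used_buffers, page_size=PAGE_SIZE):
    """Never ask for more bytes than the pipe's free buffers can hold."""
    free_bytes = (total_buffers - used_buffers) * page_size
    return min(requested, free_bytes)


# A 1 MiB request against a 16-buffer pipe with 10 buffers already in use.
clamped = clamp_read_len(1 << 20, 16, 10)
```

With the request capped this way, the read side should not run out of pipe buffers part-way through, which is the condition that produced the EFAULT described above.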

Fixes: 4721a601099 ("iomap: dio data corruption and spurious errors when pipes fill")
Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
Ranted-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

cfe522e4

net/tcp: Support tunable tcp timeout value in TIME-WAIT state · 834327af

Authored by George Zhang on Mar 28, 2018

By default, the tcp_tw_timeout value is 60 seconds. The minimum is
1 second and the maximum is 600 seconds. This setting is useful on
systems under heavy TCP load.

NOTE: setting tcp_tw_timeout below 60 seconds violates the "quiet time"
restriction and puts your system at risk of having old data accepted as
new, or new data rejected as an old duplicate, by some receivers.
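The documented range and the quiet-time caveat can be expressed as a small check. This Python sketch only restates the limits from the commit message; the function and constant names are illustrative and do not correspond to any kernel or sysctl interface.

```python
TW_TIMEOUT_MIN = 1        # seconds, from the commit message
TW_TIMEOUT_MAX = 600      # seconds, from the commit message
TW_TIMEOUT_DEFAULT = 60   # seconds, from the commit message


def check_tw_timeout(seconds):
    """Validate a candidate value; return True if it shortens the quiet time."""
    if not TW_TIMEOUT_MIN <= seconds <= TW_TIMEOUT_MAX:
        raise ValueError("tcp_tw_timeout must be between 1 and 600 seconds")
    return seconds < TW_TIMEOUT_DEFAULT
```

A caller could use the returned flag to log a warning before applying a value below the default.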

Link: http://web.archive.org/web/20150102003320/http://tools.ietf.org/html/rfc793
Signed-off-by: George Zhang <georgezhang@linux.alibaba.com>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>

834327af

PCI: Fix "try" semantics of bus and slot reset · 6987a10b

Authored by Alex Williamson on May 24, 2019

commit ddefc033eecf23f1e8b81d0663c5db965adf5516 upstream

The commit referenced below introduced device locking around save and
restore of state for each device during a PCI bus "try" reset, making
it decidedly non-"try" and prone to deadlock in the event that a device
is already locked. Restore __pci_reset_bus() and __pci_reset_slot()
to their advertised locking semantics by pushing the save and restore
functions into the branch where the entire tree is already locked.
Extend the helper function names with "_locked" and update the comment
to reflect this calling requirement.
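The "try" semantics described here follow a general pattern: take every lock without blocking, back out if any is busy, and touch per-device state only once everything is held. The Python sketch below models that pattern with ordinary locks; the class and function names are hypothetical and it is not a rendering of the PCI code.

```python
import threading


class Device:
    """Toy device with its own lock and a log of operations."""
    def __init__(self, name):
        self.name = name
        self.lock = threading.Lock()
        self.log = []


def try_reset_bus(devices):
    """Reset only if every device lock can be taken without blocking."""
    locked = []
    for dev in devices:
        if dev.lock.acquire(blocking=False):
            locked.append(dev)
        else:
            # Someone else holds this device: undo and report failure.
            for held in reversed(locked):
                held.lock.release()
            return False
    try:
        # Save and restore happen only in the branch where all locks are held.
        for dev in devices:
            dev.log.append("save")
        for dev in devices:
            dev.log.append("reset")
        for dev in devices:
            dev.log.append("restore")
    finally:
        for dev in reversed(locked):
            dev.lock.release()
    return True
```

Because no lock is ever waited on, a device that is already locked makes the call fail quickly instead of deadlocking.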

Fixes: b014e96d ("PCI: Protect pci_error_handlers->reset_notify() usage with device_lock()")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Zhiyuan Hou <zhiyuan2048@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

6987a10b

net/hookers: fix link error with ipv6 disabled · 58b454f7

Authored by kbuild test robot on May 23, 2019

lkp-build bot reported the following link error with ipv6 disabled:

ld: net/hookers/hookers.o:(.data+0x40): undefined reference to `ipv6_specific'
ld: net/hookers/hookers.o:(.data+0x78): undefined reference to `ipv6_mapped'
ld: net/hookers/hookers.o:(.data+0xe8): undefined reference to `inet6_stream_ops'

Fix this issue by adding an IS_ENABLED(CONFIG_IPV6) check.

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

58b454f7

writeback: memcg_blkcg_tree_lock can be static · bf856efa

Authored by kbuild test robot on May 23, 2019

Fixes: 60448d43 ("writeback: add memcg_blkcg_link tree")
Signed-off-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>

bf856efa

net/hookers: only enable on x86 platform · 71574b6c

Authored by Caspar Zhang on May 23, 2019

read_cr0() and write_cr0() are used in net/hookers.c, but they are only
available on the x86 platform. Add a "depends on" entry in Kconfig to
disable this feature on other platforms.

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

71574b6c

fs/writeback: wrap cgroup writeback v1 logic · f12b4ad4

Authored by Joseph Qi on May 22, 2019

Wrap cgroup writeback v1 logic to prevent build errors without
CONFIG_CGROUPS or CONFIG_CGROUP_WRITEBACK.
Reported-by: kbuild test robot <lkp@intel.com>
Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

f12b4ad4

writeback: introduce cgwb_v1 boot param · c95122f9

Authored by Jiufei Xue on May 13, 2019

Writeback control is now supported for the cgroup v1 interface. However,
it has some restrictions, so introduce a new kernel boot parameter to
control the behavior; it is disabled by default. Users can enable
writeback control for cgroup v1 with the command line option "cgwb_v1".

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

c95122f9

fs/writeback: Attach inode's wb to root if needed · b6a234bc

Authored by luanshi on Oct 09, 2018

There might be tons of files queued for writeback, waiting to be written
back. Unfortunately, the writeback's cgroup may already be dead. In this
case, we try to reassociate the inode with another writeback, but we
possibly can't because the writeback associated with the dead cgroup is
the only valid one. A new writeback is then allocated, initialized and
associated with the inode over and over again until all data resident in
the inode's page cache is flushed to disk. This causes unnecessarily
high system load.

Fix the issue by forcibly moving the inode to the root cgroup when the
previously bound cgroup becomes dead. With this, no more unnecessary
writebacks are created and populated, and the system load decreased by
about 6x in the test case we carried out:
    Without the patch: 30% system load
    With the patch:    5%  system load

Signed-off-by: luanshi <zhangliguang@linux.alibaba.com>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

b6a234bc

fs/writeback: fix double free of blkcg_css · 1b621d22

Authored by Jiufei Xue on Jan 29, 2018

We got a WARNING when releasing blkcg_css:

[332489.681635] WARNING: CPU: 55 PID: 14859 at lib/list_debug.c:56 __list_del_entry+0x81/0xc0
[332489.682191] list_del corruption, ffff883e6b94d450->prev is LIST_POISON2 (dead000000000200)
......
[332489.683895] CPU: 55 PID: 14859 Comm: kworker/55:2 Tainted: G
[332489.684477] Hardware name: Inspur SA5248M4/X10DRT-PS, BIOS 4.05A 10/11/2016
[332489.685061] Workqueue: cgroup_destroy css_release_work_fn
[332489.685654]  ffffc9001d92bd28 ffffffff81380042 ffffc9001d92bd78 0000000000000000
[332489.686269]  ffffc9001d92bd68 ffffffff81088f8b 0000003800000000 ffff883e6b94d4a0
[332489.686867]  ffff883e6b94d400 ffffffff81ce8fe0 ffff88375b24f400 ffff883e6b94d4a0
[332489.687479] Call Trace:
[332489.688078]  [<ffffffff81380042>] dump_stack+0x63/0x81
[332489.688681]  [<ffffffff81088f8b>] __warn+0xcb/0xf0
[332489.689276]  [<ffffffff8108900f>] warn_slowpath_fmt+0x5f/0x80
[332489.689877]  [<ffffffff8139e7c1>] __list_del_entry+0x81/0xc0
[332489.690481]  [<ffffffff81125552>] css_release_work_fn+0x42/0x140
[332489.691090]  [<ffffffff810a2db9>] process_one_work+0x189/0x420
[332489.691693]  [<ffffffff810a309e>] worker_thread+0x4e/0x4b0
[332489.692293]  [<ffffffff810a3050>] ? process_one_work+0x420/0x420
[332489.692905]  [<ffffffff810a9616>] kthread+0xe6/0x100
[332489.693504]  [<ffffffff810a9530>] ? kthread_park+0x60/0x60
[332489.694099]  [<ffffffff817184e1>] ret_from_fork+0x41/0x50
[332489.694722] ---[ end trace 0cf869c4a5cfba87 ]---
......

This is caused by calling css_get after the css is killed by another
thread described below:

           Thread 1                       Thread 2
cgroup_rmdir
  -> kill_css
    -> percpu_ref_kill_and_confirm
      -> css_killed_ref_fn

css_killed_work_fn
  -> css_put
    -> css_release
                                        wb_get_create
                                          -> find_blkcg_css
                                            -> css_get
                                          -> css_put
                                            -> css_release (double free)
    -> css_release_work_fn
      -> css_free_work_fn
        -> blkcg_css_free

When the double free happens, it may free memory still used by other
threads and cause a kernel panic.

Fix this by using css_tryget_online in find_blkcg_css, which will return
false if the css is killed.
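The difference between an unconditional get and a "try get if still online" can be shown with a toy reference counter. This Python model is deliberately simplified and is not how the kernel's reference counting is implemented; all class and function names are illustrative.

```python
class Css:
    """Toy reference-counted object; not the kernel's percpu_ref."""
    def __init__(self):
        self.refs = 1          # base reference held by the cgroup
        self.online = True
        self.releases = 0      # how many times the release path ran

    def kill(self):
        self.online = False

    def get(self):
        self.refs += 1         # unconditional, even after kill()

    def tryget_online(self):
        if not self.online:
            return False       # refuse new references once killed
        self.refs += 1
        return True

    def put(self):
        self.refs -= 1
        if self.refs == 0:
            self.releases += 1


def lookup_with_get(css):
    css.get()
    css.put()


def lookup_with_tryget(css):
    if css.tryget_online():
        css.put()


def run(lookup):
    css = Css()
    css.kill()      # Thread 1: the css is killed
    css.put()       # Thread 1: base reference dropped, release runs once
    lookup(css)     # Thread 2: writeback looks up the css afterwards
    return css.releases
```

In this model the unconditional variant runs the release path twice for the same object, while the tryget variant leaves it at one.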
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

1b621d22

writeback: add debug info for memcg-blkcg link · a3bcea61

Authored by Jiufei Xue on Dec 07, 2017

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

a3bcea61

writeback: add memcg_blkcg_link tree · 53f2c710

Authored by Jiufei Xue on Dec 06, 2017

Add a global radix tree that links the memcg and the blkcg to which the
user attaches tasks when using cgroup v1. It is used for writeback
cgroup.

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

53f2c710

net: kernel hookers service for toa module · 0ecf4747

Authored by George Zhang on Mar 15, 2019

LVS fullnat will replace network traffic's source ip with its local ip,
and thus the backend servers cannot obtain the real client ip.

To solve this, LVS has introduced the tcp option address (TOA) to store
the essential ip address information in the last tcp ack packet of the
3-way handshake, and the backend servers need to retrieve it from the
packet header.

In this patch, we introduce the sk_toa_data member in the sock
structure to hold the TOA information. There used to be an in-tree
module for managing TOA, but it is now maintained as a standalone
module.

In this case, the toa module should register its hook function(s) using
the provided interfaces in the hookers module.

TOA in sock structure:

	__be32 sk_toa_data[16];

The hookers module only provides the sk_toa_data placeholder, and the
toa module can use this variable through the layout it needs.

Hook interfaces:

The hookers module replaces the kernel's syn_recv_sock and getname
handler with a stub that chains the toa module's hook function(s) to the
original handling function. The hookers module allows hook functions to
be installed and uninstalled in any order.
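The chaining behaviour can be pictured with a small model. In this Python sketch the stub calls the original handler first and then passes its result through each installed hook; that ordering, like the class and method names, is an assumption made for illustration rather than a description of the module's code.

```python
class Hooker:
    """Toy model of a stub that chains hook functions to an original handler."""
    def __init__(self, original):
        self.original = original
        self.hooks = []

    def install(self, hook):
        self.hooks.append(hook)

    def uninstall(self, hook):
        self.hooks.remove(hook)   # any installed hook, regardless of order

    def stub(self, *args):
        result = self.original(*args)
        # Iterate over a copy so hooks may be changed while the stub runs.
        for hook in list(self.hooks):
            result = hook(result, *args)
        return result
```

Because hooks live in a list rather than wrapping each other, removing one does not disturb the others, which is what allows install and uninstall in any order.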

toa module:

The external toa module will be provided in a separate RPM package.

[xuyu@linux.alibaba.com: amend commit log]
Signed-off-by: George Zhang <georgezhang@linux.alibaba.com>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>

0ecf4747

virtio_blk: add discard and write zeroes support · 311efc03

Authored by Changpeng Liu on Nov 01, 2018

commit 1f23816b8eb8fdc39990abe166c10a18c16f6b21 upstream.

In commit 88c85538, "virtio-blk: add discard and write zeroes features
to specification" (https://github.com/oasis-tcs/virtio-spec), the virtio
block specification has been extended to add VIRTIO_BLK_T_DISCARD and
VIRTIO_BLK_T_WRITE_ZEROES commands.  This patch enables support for
discard and write zeroes in the virtio-blk driver when the device
advertises the corresponding features, VIRTIO_BLK_F_DISCARD and
VIRTIO_BLK_F_WRITE_ZEROES.
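Feature-gated support of this kind boils down to testing bits in the feature set the device offers. The Python sketch below illustrates the idea; the bit numbers 13 and 14 are assumptions recalled from the virtio specification and should be checked against it, and the function names are illustrative.

```python
VIRTIO_BLK_F_DISCARD = 13        # assumed feature bit number
VIRTIO_BLK_F_WRITE_ZEROES = 14   # assumed feature bit number


def has_feature(features, bit):
    """True when the given feature bit is set in the device's feature mask."""
    return bool(features & (1 << bit))


def enabled_operations(device_features):
    """List the optional operations a driver could enable for this device."""
    ops = []
    if has_feature(device_features, VIRTIO_BLK_F_DISCARD):
        ops.append("discard")
    if has_feature(device_features, VIRTIO_BLK_F_WRITE_ZEROES):
        ops.append("write_zeroes")
    return ops
```

A device that advertises neither bit leaves both operations disabled, so older devices keep working unchanged.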
Signed-off-by: Changpeng Liu <changpeng.liu@intel.com>
Signed-off-by: Daniel Verkamp <dverkamp@chromium.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>

kconfig: Disable x86 clocksource watchdog · 89f55e0b

Committed by Jiufei Xue on Jan 14, 2019

An unstable TSC triggers the clocksource watchdog, which disables the
TSC clocksource. Another clocksource is then elected as the current
clocksource, which causes performance issues on our servers.

RHEL7 also disabled this feature for some issues, see changelog:
[x86] disable clocksource watchdog (Prarit Bhargava) [914709]

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>

Revert "x86/tsc: Prepare warp test for TSC adjustment" · 74343ee8

Committed by Jiufei Xue on Jan 10, 2019

This reverts commit 76d3b851.

The return value of check_tsc_warp() is now unused, so remove it.

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

Revert "x86/tsc: Try to adjust TSC if sync test fails" · 08fc9dcb

Committed by Jiufei Xue on Jan 10, 2019

This reverts commit cc4db268.

When we hot-add and enable a vCPU, the time inside the VM jumps and
the VM then gets stuck.
The dmesg output looks like this:
[   48.402948] CPU2 has been hot-added
[   48.413774] smpboot: Booting Node 0 Processor 2 APIC 0x2
[   48.415155] kvm-clock: cpu 2, msr 6b615081, secondary cpu clock
[   48.453690] TSC ADJUST compensate: CPU2 observed 139318776350 warp.  Adjust: 139318776350
[  102.060874] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
[  102.060874] clocksource:                       'kvm-clock' wd_now: 1cb1cfc4bf8 wd_last: 1be9588f1fe mask: ffffffffffffffff
[  102.060874] clocksource:                       'tsc' cs_now: 207d794f7e cs_last: 205a32697a mask: ffffffffffffffff
[  102.060874] tsc: Marking TSC unstable due to clocksource watchdog
[  102.070188] KVM setup async PF for cpu 2
[  102.071461] kvm-stealtime: cpu 2, msr 13ba95000
[  102.074530] Will online and init hotplugged CPU: 2

This is because the TSC of the newly added vCPU is initialized to 0
while the others are ahead. The guest then applies the TSC ADJUST
compensation, which causes the time to jump.

Commit bd8fab39 ("KVM: x86: fix maintaining of kvm_clock stability
on guest CPU hotplug") can fix this problem.  However, the host kernel
version may be older, so do not adjust the TSC if the sync test fails;
just mark it unstable.

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

block-throttle: enable hierarchical throttling even on traditional hierarchy · 868591a0

Committed by Joseph Qi on Dec 12, 2017

ECI may have a use case in which the throttling policy of each device
mapper disk is configured directly under the root blkio cgroup, while
the disks are actually used in different containers.
Hierarchical throttling is currently supported only on cgroup v2, and
ECI uses cgroup v1, so we have to enable hierarchical throttling on
cgroup v1.
This is ported from redhat 7u, and a year ago Jiufei already ported it
to alikernel 4.9 as well. So I think this change should be acceptable.

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>

eci: drivers/virtio: add vring_force_dma_api boot param · 64b5a541

Committed by Eryu Guan on Dec 24, 2018

Prior to the xdragon platform 20181230 release (e.g. the 0930 release),
vring_use_dma_api() is required to return 'true' unconditionally.

Introduce a new kernel boot parameter called "vring_force_dma_api" to
control this behavior. Boot xdragon hosts with "vring_force_dma_api" on
the kernel command line to make ENI hotplug work; normal ECS hosts keep
the original behavior.

Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>

boot: give rdrand some credit · 5f00d7ad

Committed by Arjan van de Ven on Jul 29, 2016

Cherry-pick from clear-linux patches:
https://github.com/clearlinux-pkgs/linux-kvm/0104-give-rdrand-some-credit.patch

try to credit rdrand/rdseed with some entropy

In VMs, and even on modern hardware, we're super starved for entropy,
and while we can and do wear a tin foil hat, it's very hard to argue
that rdrand and rdtsc add zero entropy.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>

NO-UPSTREAM: 9P: always use cached inode to fill in v9fs_vfs_getattr · 7c58b4ee

Committed by Julio Montes on Sep 18, 2017

Cherry-pick from kata-container patches:
https://github.com/kata-containers/packaging/tree/master/kernel/patches/0001-NO-UPSTREAM-9P-always-use-cached-inode-to-fill-in-v9.patch

In cache=none mode, this avoids a lookup on a server that might not
support the open-unlink-fstat operation.

fixes https://github.com/01org/cc-oci-runtime/issues/47
fixes https://github.com/01org/cc-oci-runtime/issues/1062

Signed-off-by: Julio Montes <julio.montes@intel.com>
Signed-off-by: Peng Tao <bergwolf@gmail.com>
Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

NEMU: Compile in evged always · 91e41111

Committed by Arjan van de Ven on Aug 10, 2018

Cherry-pick from kata-container patches:
https://github.com/kata-containers/packaging/tree/master/kernel/patches/0002-Compile-in-evged-always.patch

We need evged for NEMU (and in general for hw reduced)

The config option cannot be set normally since it breaks all
regular systems, and hardware reduced is really a runtime choice.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

ext4: fix reserved cluster accounting at page invalidation time · 9d4fad52

Committed by Eric Whitney on Oct 01, 2018

commit f456767d3391e9f7d9d25a2e7241d75676dc19da upstream.

Add new code to count canceled pending cluster reservations on bigalloc
file systems and to reduce the cluster reservation count on all file
systems using delayed allocation.  This replaces old code in
ext4_da_page_release_reservations that was incorrect.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>

ext4: adjust reserved cluster count when removing extents · 8962ca9b

Committed by Eric Whitney on Oct 01, 2018

commit 9fe671496b6c286f9033aedfc1718d67721da0ae upstream.

Modify ext4_ext_remove_space() and the code it calls to correct the
reserved cluster count for pending reservations (delayed allocated
clusters shared with allocated blocks) when a block range is removed
from the extent tree. Pending reservations may be found for the clusters
at the ends of written or unwritten extents when a block range is removed.
If a physical cluster at the end of an extent is freed, it's necessary
to increment the reserved cluster count to maintain correct accounting
if the corresponding logical cluster is shared with at least one
delayed and unwritten extent as found in the extents status tree.

Add a new function, ext4_rereserve_cluster(), to reapply a reservation
on a delayed allocated cluster sharing blocks with a freed allocated
cluster. To avoid ENOSPC on reservation, a flag is applied to
ext4_free_blocks() to briefly defer updating the freeclusters counter
when an allocated cluster is freed. This prevents another thread
from allocating the freed block before the reservation can be reapplied.

Redefine the partial cluster object as a struct to carry more state
information and to clarify the code using it.

Adjust the conditional code structure in ext4_ext_remove_space to
reduce the indentation level in the main body of the code to improve
readability.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>

ext4: reduce reserved cluster count by number of allocated clusters · 51b2c187

Committed by Eric Whitney on Oct 01, 2018

commit b6bf9171ef5c37b66d446378ba63af5339a56a97 upstream.

Ext4 does not always reduce the reserved cluster count by the number
of clusters allocated when mapping a delayed extent. It sometimes
adds back one or more clusters after allocation if delalloc blocks
adjacent to the range allocated by ext4_ext_map_blocks() share the
clusters newly allocated for that range. However, this overcounts
the number of clusters needed to satisfy future mapping requests
(holding one or more reservations for clusters that have already been
allocated) and premature ENOSPC and quota failures, etc., result.

Ext4 also does not reduce the reserved cluster count when allocating
clusters for non-delayed allocated writes that have previously been
reserved for delayed writes. This also results in overcounts.

To make it possible to handle reserved cluster accounting for
fallocated regions in the same manner as used for other non-delayed
writes, do the reserved cluster accounting for them at the time of
allocation. In the current code, this is only done later when a
delayed extent sharing the fallocated region is finally mapped.

Address comment correcting handling of unsigned long long constant
from Jan Kara's review of RFC version of this patch.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>

ext4: fix reserved cluster accounting at delayed write time · e986893f

Committed by Eric Whitney on Oct 01, 2018

commit 0b02f4c0d6d9e2c611dfbdd4317193e9dca740e6 upstream.

The code in ext4_da_map_blocks sometimes reserves space for more
delayed allocated clusters than it should, resulting in premature
ENOSPC, exceeded quota, and inaccurate free space reporting.

Fix this by checking for written and unwritten blocks shared in the
same cluster with the newly delayed allocated block.  A cluster
reservation should not be made for a cluster for which physical space
has already been allocated.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>

ext4: add new pending reservation mechanism · e0a71b08

Committed by Eric Whitney on Oct 01, 2018

commit 1dc0aa46e74a3366e12f426b7caaca477853e9c3 upstream.

Add new pending reservation mechanism to help manage reserved cluster
accounting.  Its primary function is to avoid the need to read extents
from the disk when invalidating pages as a result of a truncate, punch
hole, or collapse range operation.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>

ext4: generalize extents status tree search functions · af4cc672

Committed by Eric Whitney on Oct 01, 2018

commit ad431025aecda85d3ebef5e4a3aca5c1c681d0c7 upstream.

Ext4 contains a few functions that are used to search for delayed
extents or blocks in the extents status tree.  Rather than duplicate
code to add new functions to search for extents with different status
values, such as written or a combination of delayed and unwritten,
generalize the existing code to search for caller-specified extents
status values.  Also, move this code into extents_status.c where it
is better associated with the data structures it operates upon, and
where it can be more readily used to implement new extents status tree
functions that might want a broader scope for i_es_lock.

Three missing static specifiers in RFC version of patch reported and
fixed by Fengguang Wu <fengguang.wu@intel.com>.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>

16 Aug, 2019 7 commits

Linux 4.19.67 · a5aa8058

Committed by Greg Kroah-Hartman on Aug 16, 2019

iwlwifi: mvm: fix version check for GEO_TX_POWER_LIMIT support · ac295111

Committed by Luca Coelho on Jul 19, 2019

commit f5a47fae6aa3eb06f100e701d2342ee56b857bee upstream.

We erroneously added a check for FW API version 41 before sending
GEO_TX_POWER_LIMIT, but this was already implemented in version 38.
Additionally, it was cherry-picked to older versions, namely 17, 26
and 29, so check for those as well.
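
The corrected condition can be written as a small predicate. This is a
sketch based only on the version numbers named above;
geo_tx_power_limit_supported() and its fw_api_ver parameter are
hypothetical and stand in for however the driver obtains the firmware
API version.

```c
#include <stdbool.h>

/*
 * Hypothetical predicate: the command is available from API version 38
 * onwards, and also in the older versions 17, 26 and 29 that received
 * the cherry-pick.
 */
static bool geo_tx_power_limit_supported(unsigned int fw_api_ver)
{
	return fw_api_ver >= 38 ||
	       fw_api_ver == 17 ||
	       fw_api_ver == 26 ||
	       fw_api_ver == 29;
}
```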

Cc: stable@vger.kernel.org
Fixes: eca1e56ceedd ("iwlwifi: mvm: don't send GEO_TX_POWER_LIMIT to old firmwares")
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

iwlwifi: mvm: don't send GEO_TX_POWER_LIMIT on version < 41 · 6a81677a

Committed by Luca Coelho on Jun 24, 2019

commit 39bd984c203e86f3109b49c2a2e20677c4d3ab65 upstream.

Firmware versions before 41 don't support the GEO_TX_POWER_LIMIT
command, and sending it to the firmware will cause a firmware crash.
We allow this via debugfs, so we need to return an error value in case
it's not supported.

This had already been fixed during init, when we send the command if
the ACPI WGDS table is present.  Fix it also for the other,
userspace-triggered case.

Cc: stable@vger.kernel.org
Fixes: 7fe90e0e ("iwlwifi: mvm: refactor geo init")
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

iwlwifi: mvm: fix an out-of-bound access · 80bac45e

Committed by Emmanuel Grumbach on Jul 22, 2019

commit ba3224db78034435e9ff0247277cce7c7bb1756c upstream.

The index for the elements of the ACPI object we dereference
was static. This means that if we called the function twice
we wouldn't start from 3 again, but rather from the latest
index we reached in the previous call.
This was dutifully reported by KASAN.

Fix this.
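
The effect of a static index can be reproduced in a few lines. The
sketch below is not the driver code: parse_buggy(), parse_fixed() and
ELEMS_PER_CALL are hypothetical, and only the starting index 3 comes
from the description above. Each function returns the index it started
from, so no array is actually read out of bounds.

```c
#include <stddef.h>

#define FIRST_ELEM	3	/* parsing is meant to start at element 3 */
#define ELEMS_PER_CALL	4	/* hypothetical number of elements per call */

/* Buggy variant: the static index keeps its value between calls. */
static size_t parse_buggy(void)
{
	static size_t idx = FIRST_ELEM;	/* initialized only once */
	size_t start = idx;

	for (int i = 0; i < ELEMS_PER_CALL; i++)
		idx++;	/* stands in for reading element idx */
	return start;
}

/* Fixed variant: an ordinary local restarts at FIRST_ELEM every call. */
static size_t parse_fixed(void)
{
	size_t idx = FIRST_ELEM;
	size_t start = idx;

	for (int i = 0; i < ELEMS_PER_CALL; i++)
		idx++;	/* stands in for reading element idx */
	(void)idx;
	return start;
}
```

A second call to the buggy variant starts past the elements consumed by
the first call, which is how the real code ran off the end of the ACPI
object.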

Cc: stable@vger.kernel.org
Fixes: 69964905 ("iwlwifi: mvm: add support for EWRD (Dynamic SAR) ACPI table")
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

iwlwifi: don't unmap as page memory that was mapped as single · 7626b510

Committed by Emmanuel Grumbach on Jul 21, 2019

commit 87e7e25aee6b59fef740856f4e86d4b60496c9e1 upstream.

In order to remember how to unmap memory (as single or as page), we
maintain a bitmap in the meta data (structure iwl_cmd_meta), with
1 bit per Transmit Buffer (TB).
If the bit for a TB is set, we will unmap its memory as a page.
This bitmap was never cleared. Fix this.
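
A minimal model of the per-TB bitmap is sketched below. struct cmd_meta,
page_tbs and the helper names are hypothetical stand-ins for the real
iwl_cmd_meta handling, and where exactly the driver clears the bitmap is
an assumption left out of the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-command meta data: 1 bit per TB. */
struct cmd_meta {
	uint32_t page_tbs;
};

/* Record that this TB was mapped as a page. */
static void mark_tb_as_page(struct cmd_meta *meta, unsigned int tb)
{
	meta->page_tbs |= UINT32_C(1) << tb;
}

/* Decide how to unmap: true means "as page", false means "as single". */
static bool tb_is_page(const struct cmd_meta *meta, unsigned int tb)
{
	return (meta->page_tbs & (UINT32_C(1) << tb)) != 0;
}

/* Clear stale bits before the meta data is reused for a new command. */
static void meta_reset(struct cmd_meta *meta)
{
	meta->page_tbs = 0;
}
```

Without the reset, a bit set for an earlier command would still be set
when the slot is reused, and a TB mapped as single would be unmapped as
a page.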

Cc: stable@vger.kernel.org
Fixes: 3cd1980b ("iwlwifi: pcie: introduce new tfd and tb formats")
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

mwifiex: fix 802.11n/WPA detection · b38c56b7

Committed by Brian Norris on Jul 24, 2019

commit df612421fe2566654047769c6852ffae1a31df16 upstream.

Commit 63d7ef36103d ("mwifiex: Don't abort on small, spec-compliant
vendor IEs") adjusted the ieee_types_vendor_header struct, which
inadvertently messed up the offsets used in
mwifiex_is_wpa_oui_present(). Add that offset back in, mirroring
mwifiex_is_rsn_oui_present().

As it stands, commit 63d7ef36103d breaks compatibility with WPA (not
WPA2) 802.11n networks, since we hit the "info: Disable 11n if AES is
not supported by AP" case in mwifiex_is_network_compatible().

Fixes: 63d7ef36103d ("mwifiex: Don't abort on small, spec-compliant vendor IEs")
Cc: <stable@vger.kernel.org>
Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

KVM: Fix leak vCPU's VMCS value into other pCPU · 2bc73d91

Committed by Wanpeng Li on Aug 05, 2019

commit 17e433b54393a6269acbcb792da97791fe1592d8 upstream.

After commit d73eb57b80b (KVM: Boost vCPUs that are delivering interrupts), a
five-year-old bug is exposed. When running the ebizzy benchmark in three
80-vCPU VMs on one 80-pCPU Skylake server, many rcu_sched stall warnings
are splattered in the VMs after stress testing:

 INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073)
 Call Trace:
   flush_tlb_mm_range+0x68/0x140
   tlb_flush_mmu.part.75+0x37/0xe0
   tlb_finish_mmu+0x55/0x60
   zap_page_range+0x142/0x190
   SyS_madvise+0x3cd/0x9c0
   system_call_fastpath+0x1c/0x21

swait_active() remains true until finish_swait() is called in
kvm_vcpu_block(). Because the kvm_vcpu_on_spin() loop also takes
voluntarily preempted vCPUs into account, the probability that the
condition kvm_arch_vcpu_runnable(vcpu) is checked and evaluates to true
increases greatly. When APICv is enabled, the yield-candidate vCPU's
VMCS RVI field then leaks (via vmx_sync_pir_to_irr()) into the current
VMCS of the vCPU that is spinning on a taken lock.

This patch fixes it by checking conservatively a subset of events.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Marc Zyngier <Marc.Zyngier@arm.com>
Cc: stable@vger.kernel.org
Fixes: 98f4a146 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop)
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
