Commits · fa418988c52e6d7893ec0f5d20f5c1c093bd1632 · openanolis / cloud-kernel

24 Jun, 2020 5 commits

alinux: sched: Finer grain of sched latency · fa418988

Committed by Yihao Wu on May 21, 2020

to #28739709

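Many samples fall between 10ms and 50ms. To display a more informative
latency distribution, split the 10ms-50ms range uniformly into four
10ms-wide buckets. The former 10-100ms bucket thus becomes five buckets:
10-20ms, 20-30ms, 30-40ms, 40-50ms and 50-100ms.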

Example:

  $ cat /sys/fs/cgroup/cpuacct/a/cpuacct.wait_latency
	0-1ms: 	59726433
	1-4ms: 	167
	4-7ms: 	0
	7-10ms: 	0
	10-20ms: 	5
	20-30ms: 	0
	30-40ms: 	3
	40-50ms: 	0
	50-100ms: 	0
	100-500ms: 	0
	500-1000ms: 	0
	1000-5000ms: 	0
	5000-10000ms: 	0
	>=10000ms: 	0
	total(ms): 	45554
	nr: 	59726600

Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
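
The bucket layout in the example output above can be sketched as a lookup
table. This is only an illustration in Python, not the kernel code: the
bounds are copied from the example, while the names and the assumption that
each bucket is a half-open [low, high) interval are hypothetical.

```python
from bisect import bisect_right

# Upper bounds (in ms) of every bucket except the last one,
# copied from the example output of cpuacct.wait_latency.
BOUNDS_MS = [1, 4, 7, 10, 20, 30, 40, 50, 100, 500, 1000, 5000, 10000]

LABELS = [
    "0-1ms", "1-4ms", "4-7ms", "7-10ms",
    "10-20ms", "20-30ms", "30-40ms", "40-50ms", "50-100ms",
    "100-500ms", "500-1000ms", "1000-5000ms", "5000-10000ms", ">=10000ms",
]


def bucket_for(latency_ms):
    """Return the histogram label for a latency in milliseconds.

    Assumption: a sample equal to a bound belongs to the higher bucket.
    """
    return LABELS[bisect_right(BOUNDS_MS, latency_ms)]
```

For instance, a 15ms sample lands in "10-20ms", which was part of the single
10-100ms bucket before this change.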

alinux: sched: Add "nr" to sched latency histogram · 2abfd07b

Committed by Yihao Wu on Jun 18, 2020

to #28739709

Sometimes the histogram is not precise enough, because each sample is
only coarsely accounted into a histogram bucket. Average latency is also
more practical for some users.

This patch adds an "nr" field to the 4 latency histogram interfaces, so
that

	lat(avg) = total(ms) / nr

Compared to the histogram, average latency is also better suited for use
as an SLI because of its simplicity.

Example

    $ cat /sys/fs/cgroup/cpuacct/a/cpuacct.wait_latency
      0-1ms:  4139
      1-4ms:  317
      4-7ms:  568
      7-10ms:         0
      10-100ms:       42324
      100-500ms:      9131
      500-1000ms:     95
      1000-5000ms:    134
      5000-10000ms:   0
      >=10000ms:      0
      total(ms):      4256455
      nr:      182128

Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
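
The lat(avg) formula can be applied from user space by parsing the
interface output. The following is a minimal Python sketch, assuming the
"label: value" layout shown in the example; the function names and the
shortened SAMPLE text are hypothetical.

```python
def parse_latency(text):
    """Parse 'label: value' lines of a latency histogram into a dict."""
    fields = {}
    for line in text.splitlines():
        label, sep, value = line.rpartition(":")
        if sep and value.strip().isdigit():
            fields[label.strip()] = int(value)
    return fields


def average_latency_ms(fields):
    """lat(avg) = total(ms) / nr, or 0.0 when no sample was recorded."""
    nr = fields.get("nr", 0)
    return fields["total(ms)"] / nr if nr else 0.0


# Shortened copy of the example output (only some of the buckets).
SAMPLE = """\
      0-1ms:  4139
      10-100ms:       42324
      total(ms):      4256455
      nr:      182128
"""

if __name__ == "__main__":
    print(average_latency_ms(parse_latency(SAMPLE)))
```

With the values from the example, the average is roughly 23.4 ms per sample.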

alinux: sched: Add cgroup's scheduling latency histograms · 6dbaddaa

Committed by Yihao Wu on May 21, 2020

to #28739709

This patch adds the cpuacct.cgroup_wait_latency interface. It exports the
histogram of the sched entity's scheduling latency. Unlike wait_latency,
the sched entity here is a cgroup rather than a task.

This is useful when tasks are not directly clustered under one cgroup.
For example:

cgroup1 --- cgroupA --- task1
        --- cgroupB --- task2
cgroup2 --- cgroupC --- task3
        --- cgroupD --- task4

This is a common cgroup hierarchy used by many applications. With
cgroup_wait_latency, we can just read from cgroup1 to get the aggregated
wait latency information of task1 and task2.

The interface output format is identical to cpuacct.wait_latency.

Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>

alinux: sched: Add cgroup-level blocked time histograms · a055ee2c

Committed by Yihao Wu on Dec 26, 2019

to #28739709

This patch measures the time that tasks in a cpuacct cgroup spend
blocked. There are two types: blocked on IO, and blocked on anything else,
such as locks. They are exported in "cpuacct.ioblock_latency" and
"cpuacct.block_latency" respectively.

The histogram gives the detailed distribution of the blocked durations,
and total(ms) gives the percentage of time that tasks spent off the rq,
waiting for resources:

(Δioblock_latency.total(ms) + Δblock_latency.total(ms)) / Δwall_time

The interface output format is identical to cpuacct.wait_latency.

Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
Reviewed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
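
The off-rq ratio above can be computed from two readings of the total(ms)
fields. This is a minimal Python sketch; the function name and the dict
keys are hypothetical, and the two readings are assumed to be taken
wall_time_ms apart.

```python
def blocked_ratio(before, after, wall_time_ms):
    """Fraction of wall time spent blocked between two readings.

    'before' and 'after' are dicts holding the total(ms) value of
    cpuacct.ioblock_latency under "ioblock" and of
    cpuacct.block_latency under "block" (hypothetical key names).
    """
    delta_io = after["ioblock"] - before["ioblock"]
    delta_other = after["block"] - before["block"]
    return (delta_io + delta_other) / wall_time_ms
```

Since total(ms) is summed over all tasks in the cgroup, the ratio may exceed
1 when several tasks are blocked at the same time.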

alinux: sched: Introduce cfs scheduling latency histograms · 76d98609

Committed by Yihao Wu on Jan 15, 2020

to #28739709

Export wait_latency in "cpuacct.wait_latency", which indicates the
time that tasks in a cpuacct cgroup wait on a cfs_rq to be scheduled.

This is like "perf sched", but with lower overhead, so it can be used
for continuous monitoring.

wait_latency is useful for debugging an application's high RT problems.
It can tell whether they are caused by scheduling or not. If they are,
loadavg can tell whether the cause is bad scheduling behaviour or system
overload.

System admins can also use wait_latency to define an SLA. To ensure the
SLA is met, there are various ways to decrease wait_latency.

This feature is disabled by default because of performance concerns. It
can be switched on dynamically by "echo 1 > /proc/cpusli/sched_lat_enable"

Example:

  $ cat /sys/fs/cgroup/cpuacct/a/cpuacct.wait_latency
    0-1ms:  4139
    1-4ms:  317
    4-7ms:  568
    7-10ms:         0
    10-100ms:       42324
    100-500ms:      9131
    500-1000ms:     95
    1000-5000ms:    134
    5000-10000ms:   0
    >=10000ms:      0
    total(ms):      4256455

Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
Reviewed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>

23 Jun, 2020 13 commits

alinux: sched: Add switch for scheduler_tick load tracking · bcaf8afd

Committed by Yihao Wu on May 13, 2020

to #28739709

Assume workloads are composed of a massive number of short tasks. Then
periodic load tracking is unnecessary, because load tracking is already
guaranteed by the frequent sleeps and wake-ups.

If these massive short tasks run in their individual cgroups, the load
tracking becomes extremely heavy.

This patch adds a switch to bypass scheduler_tick load tracking, in
order to reduce scheduler overhead, without sacrificing much balance
in this scenario.

Performance Tests:

1) 1100+ tasks in their individual cgroups, on a 96-HT Skylake machine

	sched overhead(each HT): 0.74% -> 0.48%

	(This test's baseline is from the previous patch)

2) sysbench-threads with 96 threads, running for 5min

	latency_ms 95th: 63.07 -> 54.01

Besides these, no regression is found on our test platform.

Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>

alinux: sched: Add switch for update_blocked_averages · bb48b716

Committed by Yihao Wu on May 14, 2020

to #28739709

Unless the workloads are IO-bound, update_blocked_averages doesn't help
load balance. This patch adds a switch to bypass update_blocked_averages
if prior knowledge about workloads indicates IO is negligible.

Performance Tests:

1) 1100+ tasks in their individual cgroups, on a 96-HT Skylake machine

	sched overhead(each HT): 3.78% -> 0.74%

2) cgroup-overhead benchmark in our sched-test suite on a 96-HT Skylake

	overhead: 21.06 -> 18.08

3) unixbench context1 with 96 threads running for 1min

	Score: 15409.40 -> 16821.77

Besides these, UnixBench shows some performance ups and downs, but
overall its performance is unchanged.

Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>

alinux: mm: thp: add fast_cow switch · 56a432f5

Committed by Yang Shi on Jun 16, 2020

task #27327988

The commit ("thp: change CoW semantics for anon-THP") rewrites the THP CoW
page fault handler to allocate a base page only, but there is a request to
keep the old behavior just in case.  So, introduce a new sysfs knob,
fast_cow, to control the behavior.  The default is the new behavior.
Write 0 to that knob to switch to the old behavior.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
[ caspar: fix checkpatch.pl warnings ]
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

khugepaged: introduce 'max_ptes_shared' tunable · e5b2cc5d

Committed by Kirill A. Shutemov on Jun 16, 2020

task #27327988

commit 71a2c112a0f6da497e1b44e18e97b1716c240518 upstream

'max_ptes_shared' specifies how many pages can be shared across multiple
processes.  Exceeding the number would block the collapse::

        /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared

A higher value may increase memory footprint for some workloads.

By default, at least half of the pages have to be not shared.

[colin.king@canonical.com: fix several spelling mistakes]
  Link: http://lkml.kernel.org/r/20200420084241.65433-1-colin.king@canonical.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Link: http://lkml.kernel.org/r/20200416160026.16538-9-kirill.shutemov@linux.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
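
The effect of the tunable can be sketched in Python as follows. This is an
illustration only: the value 512 for the number of PTEs per huge page
assumes 4 KiB base pages with a 2 MiB PMD, the default of 256 is derived
from the "at least half not shared" statement, and the function names are
hypothetical.

```python
# Assumption: 4 KiB base pages and a 2 MiB PMD-sized huge page.
HPAGE_PMD_NR = 512
# "At least half of the pages have to be not shared" by default.
DEFAULT_MAX_PTES_SHARED = HPAGE_PMD_NR // 2

KNOB = "/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared"


def read_max_ptes_shared(path=KNOB):
    """Read the tunable, falling back to the assumed default if absent."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return DEFAULT_MAX_PTES_SHARED


def collapse_allowed(shared_ptes, max_ptes_shared=DEFAULT_MAX_PTES_SHARED):
    """Collapse is blocked once the shared count exceeds the limit."""
    return shared_ptes <= max_ptes_shared
```

A range with 256 shared pages would still be eligible under the assumed
default, while one with 257 shared pages would not.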

thp: change CoW semantics for anon-THP · 33b1aabe

Committed by Kirill A. Shutemov on Jun 16, 2020

task #27327988

commit 3917c80280c93a7123f1a3a6dcdb10a3ea19737d upstream

Currently we have different copy-on-write semantics for anon- and
file-THP.  For anon-THP we try to allocate huge page on the write fault,
but on file-THP we split PMD and allocate 4k page.

Arguably, file-THP semantics is more desirable: we don't necessarily want
to unshare the full PMD range from the parent on the first access.  This is
the primary reason THP is unusable for some workloads, like Redis.

The original THP refcounting didn't allow PTE-mapped compound pages, so
we had no option but to allocate a huge page on CoW (with fallback to 512
4k pages).

The current refcounting doesn't have such limitations and we can cut a lot
of complex code out of fault path.

khugepaged is now able to recover THP from such ranges if the
configuration allows.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Link: http://lkml.kernel.org/r/20200416160026.16538-8-kirill.shutemov@linux.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

khugepaged: allow to collapse PTE-mapped compound pages · 55b41d47

Committed by Kirill A. Shutemov on Jun 16, 2020

task #27327988

commit 5503fbf2b0b80c1a47a7dca0e4f060f52f522cfd upstream

We can collapse PTE-mapped compound pages.  We only need to avoid handling
them more than once: lock/unlock the page only once if it's present in the
PMD range multiple times, as it is handled on the compound level.  The same
goes for LRU isolation and putback.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Link: http://lkml.kernel.org/r/20200416160026.16538-7-kirill.shutemov@linux.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

khugepaged: allow to collapse a page shared across fork · 85592147

Committed by Kirill A. Shutemov on Jun 03, 2020

task #27327988

commit 9445689f3b6170c6145a8772eee692482199cdd6 upstream

The page can be included into collapse as long as it doesn't have extra
pins (from GUP or otherwise).

Logic to check the refcount is moved to a separate function.  For pages in
swap cache, add compound_nr(page) to the expected refcount, in order to
handle the compound page case.  This is in preparation for the following
patch.

VM_BUG_ON_PAGE() was removed from __collapse_huge_page_copy() as the
invariant it checks is no longer valid: the source can be mapped multiple
times now.

[yang.shi@linux.alibaba.com: remove error message when checking external pins]
  Link: http://lkml.kernel.org/r/1589317383-9595-1-git-send-email-yang.shi@linux.alibaba.com
[cai@lca.pw: fix set-but-not-used warning]
  Link: http://lkml.kernel.org/r/20200521145644.GA6367@ovpn-112-192.phx2.redhat.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Link: http://lkml.kernel.org/r/20200416160026.16538-6-kirill.shutemov@linux.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
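
The refcount rule described in this commit can be modelled in a few lines
of Python. This is a toy model, not the kernel function: the commit only
states that compound_nr(page) is added for pages in swap cache, so using
the number of mappings as the baseline of the expected refcount is an
assumption, and all names are hypothetical.

```python
def is_refcount_suitable(page_count, mapcount, in_swap_cache=False,
                         compound_nr=1):
    """Return True if the page has no extra pins (from GUP or otherwise).

    Assumption: every mapping holds exactly one reference, and a page in
    swap cache holds compound_nr additional references.
    """
    expected = mapcount
    if in_swap_cache:
        expected += compound_nr
    return page_count == expected
```

Under this model a page mapped by both parent and child after fork (two
mappings, two references) is eligible for collapse, while the same page
with a third, unexplained reference is not.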

khugepaged: drain LRU add pagevec after swapin · 733e2524

Committed by Kirill A. Shutemov on Jun 03, 2020

task #27327988

commit ae2c5d8042426b69c5f4a74296d1a20bb769a8ad upstream

collapse_huge_page() tries to swap in pages that are part of the PMD
range.  A page that was just swapped in goes through the LRU add cache.
The cache takes an extra reference on the page.

The extra reference can cause the collapse to fail: the following
__collapse_huge_page_isolate() would check the refcount and abort the
collapse on seeing an unexpected refcount.

The fix is to drain local LRU add cache in
__collapse_huge_page_swapin() if we successfully swapped in any pages.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Link: http://lkml.kernel.org/r/20200416160026.16538-5-kirill.shutemov@linux.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

khugepaged: drain all LRU caches before scanning pages · 4adb1303

Committed by Kirill A. Shutemov on Jun 03, 2020

task #27327988

commit a980df33e9351e5474c06ec0fd96b2f409e2ff22 upstream

Having a page in LRU add cache offsets page refcount and gives
false-negative on PageLRU().  It reduces collapse success rate.

Drain all LRU add caches before scanning.  This happens relatively rarely
and should not disturb the system too much.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Link: http://lkml.kernel.org/r/20200416160026.16538-4-kirill.shutemov@linux.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

khugepaged: do not stop collapse if less than half PTEs are referenced · ed3a7ca7

Committed by Kirill A. Shutemov on Jun 03, 2020

task #27327988

commit ffe945e633b527d5a4577b42cbadec3c7cbcf096 upstream

__collapse_huge_page_swapin() checks the number of referenced PTE to
decide if the memory range is hot enough to justify swapin.

We have a few problems with the approach:

 - It is way too late: we can do the check much earlier and save time.
   khugepaged_scan_pmd() already knows if we have any pages to swap in
   and the number of referenced pages.

 - It stops collapse altogether if there are not enough referenced pages,
   not only the swap-in.

Fix it by making the right check early. We can also avoid additional
page table scanning if khugepaged_scan_pmd() hasn't found any swap
entries.

Fixes: 0db501f7 ("mm, thp: convert from optimistic swapin collapsing to conservative")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Link: http://lkml.kernel.org/r/20200416160026.16538-3-kirill.shutemov@linux.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
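
One possible reading of the early check is sketched below in Python. It is
a simplified model and not the kernel code: the exact condition (collapse
needs at least one referenced PTE, and the half-of-the-range threshold only
applies when there are swap entries) is an assumption, other checks such as
writability are left out, and the names are hypothetical.

```python
HPAGE_PMD_NR = 512  # assumption: 4 KiB base pages, 2 MiB huge page


def scan_decision(referenced, swap_entries, hpage_pmd_nr=HPAGE_PMD_NR):
    """Return (collapse, swapin) for one scanned PMD range.

    referenced:   number of referenced PTEs seen by the scan
    swap_entries: number of swap entries seen by the scan
    """
    if referenced == 0:
        return (False, False)      # nothing in the range is hot
    if swap_entries and referenced < hpage_pmd_nr // 2:
        return (False, False)      # too cold to justify swap-in
    # Swap-in (and the extra page table walk) only if there is
    # something to swap in.
    return (True, swap_entries > 0)
```

In this model, a range with only 10 referenced PTEs and no swap entries is
still collapsed, which matches the title of the commit.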

khugepaged: add self test · 98415d0c

Committed by Kirill A. Shutemov on Jun 16, 2020

task #27327988

commit e0c13f9761df8f97cf5e81495d12ecbc4075684a upstream

Patch series "thp/khugepaged improvements and CoW semantics", v4.

The patchset adds khugepaged selftest (anon-THP only for now), expands
cases khugepaged can handle and switches anon-THP copy-on-write handling
to 4k.

This patch (of 8):

The test checks if khugepaged is able to recover huge page where we expect
to do so.  It only covers anon-THP for now.

Currently the test shows a few failures.  They are going to be addressed by
the following patches.

[colin.king@canonical.com: fix several spelling mistakes]
  Link: http://lkml.kernel.org/r/20200420084241.65433-1-colin.king@canonical.com
[aneesh.kumar@linux.ibm.com: replace the usage of system(3) in the test]
  Link: http://lkml.kernel.org/r/20200429110727.89388-1-aneesh.kumar@linux.ibm.com
[kirill@shutemov.name: fixup for issues I've noticed]
  Link: http://lkml.kernel.org/r/20200429124816.jp272trghrzxx5j5@box
[jhubbard@nvidia.com: add khugepaged to .gitignore]
  Link: http://lkml.kernel.org/r/20200517002509.362401-1-jhubbard@nvidia.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Link: http://lkml.kernel.org/r/20200416160026.16538-1-kirill.shutemov@linux.intel.com
Link: http://lkml.kernel.org/r/20200416160026.16538-2-kirill.shutemov@linux.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

device-dax: don't leak kernel memory to user space after unloading kmem · e7f3612b

Committed by David Hildenbrand on May 22, 2020

task #28135435

commit 60858c00e5f018eda711a3aa84cf62214ef62d61 upstream

Assume we have kmem configured and loaded:

  [root@localhost ~]# cat /proc/iomem
  ...
  140000000-33fffffff : Persistent Memory
    140000000-1481fffff : namespace0.0
    150000000-33fffffff : dax0.0
      150000000-33fffffff : System RAM

Assume we try to unload kmem. This force-unloading will work, even if
memory cannot get removed from the system.

  [root@localhost ~]# rmmod kmem
  [   86.380228] removing memory fails, because memory [0x0000000150000000-0x0000000157ffffff] is onlined
  ...
  [   86.431225] kmem dax0.0: DAX region [mem 0x150000000-0x33fffffff] cannot be hotremoved until the next reboot

Now, we can reconfigure the namespace:

  [root@localhost ~]# ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax
  [  131.409351] nd_pmem namespace0.0: could not reserve region [mem 0x140000000-0x33fffffff]
  [  131.410147] nd_pmem: probe of namespace0.0 failed with error -16
  ...

This fails as expected due to the busy memory resource, and the memory
cannot be used.  However, the dax0.0 device is removed, and along with it
its name.

The name of the memory resource now points at freed memory (name of the
device):

  [root@localhost ~]# cat /proc/iomem
  ...
  140000000-33fffffff : Persistent Memory
    140000000-1481fffff : namespace0.0
    150000000-33fffffff : �_�^7_��/_��wR��WQ���^��� ...
    150000000-33fffffff : System RAM

We have to make sure to duplicate the string.  While at it, remove the
superfluous setting of the name and fixup a stale comment.

Fixes: 9f960da72b25 ("device-dax: "Hotremove" persistent memory that is used like normal RAM")
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <stable@vger.kernel.org>	[5.3]
Link: http://lkml.kernel.org/r/20200508084217.9160-2-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>

device-dax: "Hotremove" persistent memory that is used like normal RAM · f7bf25bd

Committed by Pavel Tatashin on Jun 19, 2020

task #28135435

commit 9f960da72b25054163cf555e622dcdc3b8ccc488 upstream

It is now allowed to use persistent memory like regular RAM, but
currently there is no way to remove this memory until the machine is
rebooted.

This work expands the functionality to also allow hot-removing
previously hotplugged persistent memory, and recovering the device for
use for other purposes.

To hotremove persistent memory, the management software must first
offline all memory blocks of the dax region, and then unbind it from the
device-dax/kmem driver.  So, operations should look like this:

  echo offline > /sys/devices/system/memory/memoryN/state
  ...
  echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind

Note: if unbind is done without offlining memory beforehand, it won't be
possible to do dax0.0 hotremove, and dax's memory is going to be part of
System RAM until reboot.

Link: http://lkml.kernel.org/r/20190517215438.6487-4-pasha.tatashin@soleen.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
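
The ordering requirement (offline every memory block first, unbind last)
could be scripted by management software roughly as follows. This Python
sketch only uses the two sysfs paths quoted in the commit message; the
function names, the block numbers and the dry-run default are hypothetical,
and really applying the plan would need root privileges.

```python
def hotremove_plan(memory_blocks, dax_dev):
    """Build the ordered list of (sysfs path, value) writes.

    memory_blocks: numbers N of the memoryN blocks backing the dax region
    dax_dev:       device name, e.g. "dax0.0"
    """
    steps = [
        (f"/sys/devices/system/memory/memory{n}/state", "offline")
        for n in memory_blocks
    ]
    # Unbind only after all blocks are offline; otherwise the memory
    # stays part of System RAM until reboot.
    steps.append(("/sys/bus/dax/drivers/kmem/unbind", dax_dev))
    return steps


def apply_plan(steps, dry_run=True):
    """Print the equivalent shell commands, or perform the writes."""
    for path, value in steps:
        if dry_run:
            print(f"echo {value} > {path}")
        else:
            with open(path, "w") as f:
                f.write(value)


if __name__ == "__main__":
    apply_plan(hotremove_plan([40, 41], "dax0.0"))
```

Keeping plan construction separate from execution makes the ordering easy
to inspect before anything is written to sysfs.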

22 Jun, 2020 3 commits

configs: disable ext4 encryption · 015371fd

Committed by Joseph Qi on Jun 22, 2020

fix #28198752

ext4 encryption increases lock contention when opening a directory, which
results in a performance drop in the will-it-scale open1 case.
Since we have no explicit use cases as of now, we decided to disable
it.

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>

alinux: sched: Fix %steal in cpuacct.proc_stat in guest OS · 1c5ab7a7

Committed by Yihao Wu on May 20, 2020

to #28143829

rq_clock_task is less than rq_clock when running in a VM, or when
IRQ_TIME_ACCOUNTING is on, so the two are not comparable when accounting
elapsed time. This bug has not been observed on the host yet, because
neither of these two conditions is met there.

Use rq_clock at both the beginning and the end of the exec_start_raw
accumulation to fix this bug, because we expect steal% in
cpuacct.proc_stat of a VM's cgroups to reflect the cpu time the host
steals from the guest.

Fixes: c7552980 ("alinux: sched: Introduce per-cgroup steal accounting")
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>

kbuild: mark prepare0 as PHONY to fix external module build · 54fce2ae

Committed by Masahiro Yamada on Jan 15, 2019

fix #28883562

commit e00d8880481497474792d28c14479a9fb6752046 upstream

Commit c3ff2a5193fa ("powerpc/32: add stack protector support")
caused kernel panic on PowerPC when an external module is used with
CONFIG_STACKPROTECTOR because the 'prepare' target was not executed
for the external module build.

Commit e07db28eea38 ("kbuild: fix single target build for external
module") turned it into a build error because the 'prepare' target is
now executed but the 'prepare0' target is missing for the external
module build.

External module on arm/arm64 with CONFIG_STACKPROTECTOR_PER_TASK is
also broken in the same way.

Move 'PHONY += prepare0' to the common place. GNU Make is fine with
missing rule for phony targets. I also removed the comment which is
wrong irrespective of this commit.

I minimize the change so it can be easily backported to 4.20.x

To fix v4.20, please backport e07db28eea38 ("kbuild: fix single target
build for external module"), and then this commit.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=201891
Fixes: e07db28eea38 ("kbuild: fix single target build for external module")
Fixes: c3ff2a5193fa ("powerpc/32: add stack protector support")
Fixes: 189af4657186 ("ARM: smp: add support for per-task stack canaries")
Fixes: 0a1213fa7432 ("arm64: enable per-task stack canaries")
Cc: linux-stable <stable@vger.kernel.org> # v4.20
Reported-by: Samuel Holland <samuel@sholland.org>
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Chunmei Xu <xuchunmei@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

19 Jun, 2020 1 commit

configs: arm64: use 48-bit virtual address · f44f084b

Committed by Xu Yu on Jun 19, 2020

fix #28506983

Some ARM machines may have large memory capacity (e.g., more than 256G),
or large hole(s) in memory layout among nodes.

A kernel with CONFIG_ARM64_VA_BITS set to 39 has a linear region size of
256G, and any memory that the linear mapping cannot cover is removed.
This may cause part of the physical memory to become unavailable, system
deadlock on memory, or even boot failure, on such ARM machines.

This changes CONFIG_ARM64_VA_BITS to 48, which supports a 128T linear
mapping, in order to adapt to most scenarios.

Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
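
The 256G and 128T figures follow if the linear region spans half of the
kernel virtual address space, i.e. 2^(VA_BITS - 1) bytes. That relation is
an assumption that happens to reproduce both numbers in this message; the
Python helper below is a hypothetical illustration of the arithmetic only.

```python
GIB = 1 << 30
TIB = 1 << 40


def linear_region_bytes(va_bits):
    """Size of the linear region, assuming it covers 2^(va_bits - 1)."""
    return 1 << (va_bits - 1)


if __name__ == "__main__":
    print(linear_region_bytes(39) // GIB, "GiB")
    print(linear_region_bytes(48) // TIB, "TiB")
```

A machine with more than 256G of RAM, or with holes that spread its RAM
over more than 256G of physical address space, therefore cannot be fully
covered with 39 bits.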

16 Jun, 2020 18 commits

pvpanic: add crash loaded event · fbb2f06e

Committed by Shile Zhang on Jun 15, 2020

to #28727280

commit 191941692a3d1b6a9614502b279be062926b70f5 upstream.

Some users prefer to use kdump tools to generate the guest kernel dump
file and, at the same time, need an out-of-band kernel panic event.

Currently, if the guest kernel is booted with
'crash_kexec_post_notifiers', QEMU will receive the PVPANIC_PANICKED
event and stop the VM. If the guest kernel is booted without
'crash_kexec_post_notifiers', the guest will not call the notifier chain.

Add a PVPANIC_CRASH_LOADED bit for the pvpanic event. It means that the
guest kernel actually hit a kernel panic, but wants to handle it by
itself.

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Link: https://lore.kernel.org/r/20200102023513.318836-3-pizhenwei@bytedance.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
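
A receiver of the pvpanic event byte could decode the two bits as sketched
below in Python. The bit positions (bit 0 for PVPANIC_PANICKED, bit 1 for
PVPANIC_CRASH_LOADED) are stated here as an assumption, since this commit
message does not spell them out, and decode_pvpanic is a hypothetical
helper.

```python
# Assumed bit layout of the pvpanic event byte.
PVPANIC_PANICKED = 1 << 0
PVPANIC_CRASH_LOADED = 1 << 1


def decode_pvpanic(event):
    """Return the names of the event bits set in 'event'."""
    names = []
    if event & PVPANIC_PANICKED:
        names.append("PANICKED")
    if event & PVPANIC_CRASH_LOADED:
        names.append("CRASH_LOADED")
    return names
```

On CRASH_LOADED the hypervisor side would typically only record the event
and let the guest continue into its kdump kernel, instead of stopping the
VM.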

NFS: Fix memory leaks in nfs_pageio_stop_mirroring() · 97aad837

Committed by Trond Myklebust on Mar 29, 2020

task #28557789

[ Upstream commit 862f35c94730c9270833f3ad05bd758a29f204ed ]

If we just set the mirror count to 1 without first clearing out
the mirrors, we can leak queued up requests.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>

NFS: direct.c: Fix memory leak of dreq when nfs_get_lock_context fails · f249cb6f

Committed by Misono Tomohiro on Aug 28, 2019

task #28557789

[ Upstream commit 8605cf0e852af3b2c771c18417499dc4ceed03d5 ]

When dreq is allocated by nfs_direct_req_alloc(), dreq->kref is
initialized to 2. Therefore we need to call nfs_direct_req_release()
twice to release the allocated dreq. Usually it is called in
nfs_file_direct_{read, write}() and nfs_direct_complete().

However, the current code only calls nfs_direct_req_release() once if
nfs_get_lock_context() fails in nfs_file_direct_{read, write}().
So, that case would result in a memory leak.

Fix this by adding the missing call.

Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
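
The reference counting argument can be replayed with a toy model. The
Python below is not the NFS code: DirectReq and direct_io are hypothetical
stand-ins that only mimic a refcount starting at 2 and the number of
release calls on each path.

```python
class DirectReq:
    """Toy stand-in for a dreq whose kref starts at 2."""

    def __init__(self):
        self.refcount = 2
        self.freed = False

    def release(self):
        if self.freed:
            raise RuntimeError("release after free")
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True


def direct_io(lock_context_ok, fixed=True):
    """Mimic the release calls of a direct read/write request."""
    dreq = DirectReq()
    if not lock_context_ok:
        dreq.release()          # the only call before the fix
        if fixed:
            dreq.release()      # the missing call added by the fix
        return dreq
    dreq.release()              # in nfs_file_direct_{read, write}()
    dreq.release()              # in nfs_direct_complete()
    return dreq
```

Without the second release on the error path the model never reaches a
refcount of zero, which corresponds to the leaked dreq.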

NFS: Fix a page leak in nfs_destroy_unlinked_subrequests() · 1cf4a89d

Committed by Trond Myklebust on Apr 01, 2020

task #28557789

commit add42de31721fa29ed77a7ce388674d69f9d31a4 upstream.

When we detach a subrequest from the list, we must also release the
reference it holds to the parent.

Fixes: 5b2b5187 ("NFS: Fix nfs_page_group_destroy() and nfs_lock_and_join_requests() race cases")
Cc: stable@vger.kernel.org # v4.14+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>

nfs: add minor version to nfs_server_key for fscache · 73c7c995

Committed by Scott Mayhew on Feb 24, 2020

task #28557789

[ Upstream commit 55dee1bc0d72877b99805e42e0205087e98b9edd ]

An NFS client that mounts multiple exports from the same NFS
server with higher NFSv4 versions disabled (i.e. 4.2) and without
forcing a specific NFS version results in fscache index cookie
collisions and the following messages:
[  570.004348] FS-Cache: Duplicate cookie detected

Each nfs_client structure should have its own fscache index cookie,
so add the minorversion to nfs_server_key.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=200145
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>

NFS: Fix memory leaks · 7be0908c

Committed by Wenwen Wang on Feb 03, 2020

task #28557789

[ Upstream commit 123c23c6a7b7ecd2a3d6060bea1d94019f71fd66 ]

In _nfs42_proc_copy(), 'res->commit_res.verf' is allocated through
kzalloc() if 'args->sync' is true. In the following code, if
'res->synchronous' is false, handle_async_copy() will be invoked. If an
error occurs during the invocation, the following code will not be executed
and the error will be returned. However, the allocated
'res->commit_res.verf' is not deallocated, leading to a memory leak. This
is also true if the invocation of process_copy_commit() returns an error.

To fix the above leaks, redirect the execution to the 'out' label if an
error is encountered.
Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>

7be0908c
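
The "redirect to the 'out' label" fix described in 7be0908c follows the usual single-exit cleanup pattern. Below is a minimal userspace sketch of that pattern, not the actual _nfs42_proc_copy() code: `demo_res`, `demo_step()` and `demo_copy()` are hypothetical names, and `calloc()`/`free()` stand in for kzalloc()/kfree(). For simplicity the sketch frees the buffer on every exit path.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical result structure; 'verf' stands in for res->commit_res.verf. */
struct demo_res {
    void *verf;
};

/* Hypothetical helper: simulates a sub-operation that may fail. */
static int demo_step(int fail)
{
    return fail ? -EIO : 0;
}

/*
 * Every error after the allocation jumps to 'out', so the buffer is
 * released on all exit paths instead of leaking on early returns.
 */
static int demo_copy(struct demo_res *res, int sync,
                     int fail_async, int fail_commit)
{
    int status;

    res->verf = NULL;
    if (sync) {
        res->verf = calloc(1, 16);
        if (!res->verf)
            return -ENOMEM;   /* nothing allocated yet, plain return is fine */
    }

    status = demo_step(fail_async);   /* stands in for the async-copy call */
    if (status)
        goto out;                     /* was: early return, leaking verf */

    status = demo_step(fail_commit);  /* stands in for the commit call */
    if (status)
        goto out;

out:
    free(res->verf);                  /* free(NULL) is a no-op */
    res->verf = NULL;
    return status;
}
```

Calling `demo_copy(&res, 1, 1, 0)` returns `-EIO` and leaves `res.verf` set to NULL, which shows the failure path going through the same cleanup as the success path.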

NFS/pnfs: Fix pnfs_generic_prepare_to_resend_writes() · 10c7b865

Committed by Trond Myklebust on Jan 06, 2020

task #28557789

commit 221203ce6406273cf00e5c6397257d986c003ee6 upstream.

Instead of making assumptions about the commit verifier contents, change
the commit code to ensure we always check that the verifier was set
by the XDR code.

Fixes: f54bcf2e ("pnfs: Prepare for flexfiles by pulling out common code")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>

10c7b865

NFS: Revalidate the file size on a fatal write error · e29b0e35

Committed by Trond Myklebust on Jan 06, 2020

task #28557789

commit 0df68ced55443243951d02cc497be31fadf28173 upstream.

If we suffer a fatal error upon writing a file, which causes us to
need to revalidate the entire mapping, then we should also revalidate
the file size.

Fixes: d2ceb7e57086 ("NFS: Don't use page_file_mapping after removing the page")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>

e29b0e35

NFS: Directory page cache pages need to be locked when read · 4d911ee7

Committed by Trond Myklebust on Feb 02, 2020

task #28557789

commit 114de38225d9b300f027e2aec9afbb6e0def154b upstream.

When an NFS directory page cache page is removed from the page cache,
its contents are freed through a call to nfs_readdir_clear_array().
To prevent the removal of the page cache entry until after we've
finished reading it, we must take the page lock.

Fixes: 11de3b11 ("NFS: Fix a memory leak in nfs_readdir")
Cc: stable@vger.kernel.org # v2.6.37+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>

4d911ee7

NFS: Fix memory leaks and corruption in readdir · d387750f

Committed by Trond Myklebust on Feb 02, 2020

task #28557789

commit 4b310319c6a8ce708f1033d57145e2aa027a883c upstream.

nfs_readdir_xdr_to_array() must not exit without having initialised
the array, so that the page cache deletion routines can safely
call nfs_readdir_clear_array().
Furthermore, we should ensure that if we exit nfs_readdir_filler()
with an error, we free up any page contents to prevent a leak
if we try to fill the page again.

Fixes: 11de3b11 ("NFS: Fix a memory leak in nfs_readdir")
Cc: stable@vger.kernel.org # v2.6.37+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>

d387750f

NFSv4/flexfiles: Fix invalid deref in FF_LAYOUT_DEVID_NODE() · c12c9fc7

Committed by Trond Myklebust on Feb 26, 2019

task #28557789

[ Upstream commit 108bb4afd351d65826648a47f11fa3104e250d9b ]

If the attempt to instantiate the mirror's layout DS pointer failed,
then that pointer may hold a value of type ERR_PTR(), so we need
to check that before we dereference it.

Fixes: 65990d1a ("pNFS/flexfiles: Fix a deadlock on LAYOUTGET")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>

c12c9fc7
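
The ERR_PTR() hazard described in c12c9fc7 can be illustrated in userspace. The sketch below re-creates simplified stand-ins for the kernel's ERR_PTR()/IS_ERR()/IS_ERR_OR_NULL() helpers and checks the pointer before dereferencing it. `demo_ds`, `demo_mirror` and `demo_devid_node()` are hypothetical, simplified names and are not the real flexfiles structures or the exact upstream change.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel helpers in include/linux/err.h. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
    return (void *)(intptr_t)error;
}

static inline int IS_ERR(const void *ptr)
{
    /* Error codes are encoded in the top MAX_ERRNO addresses. */
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static inline int IS_ERR_OR_NULL(const void *ptr)
{
    return !ptr || IS_ERR(ptr);
}

/* Hypothetical, simplified types for illustration only. */
struct demo_ds {
    int node_id;
};

struct demo_mirror {
    struct demo_ds *mirror_ds;   /* may be NULL or an ERR_PTR() value */
};

/* Return the node id, or -1 when the DS pointer is not safe to dereference. */
static int demo_devid_node(const struct demo_mirror *mirror)
{
    if (mirror == NULL || IS_ERR_OR_NULL(mirror->mirror_ds))
        return -1;
    return mirror->mirror_ds->node_id;
}
```

A mirror whose `mirror_ds` is `ERR_PTR(-ENODEV)` is non-NULL, so a plain NULL check would let the dereference through; the IS_ERR_OR_NULL() test rejects both cases.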

NFS: Add missing encode / decode sequence_maxsz to v4.2 operations · 1fd0bb40

Committed by Anna Schumaker on Mar 01, 2019

task #28557789

[ Upstream commit 1a3466aed3a17eed41cd9411f89eb637f58349b0 ]

These really should have been there from the beginning, but we never
noticed because there was enough slack in the RPC request for the extra
bytes. Chuck's recent patch to use au_cslack and au_rslack to compute
buffer size shrunk the buffer enough that this was now a problem for
SEEK operations on my test client.

Fixes: f4ac1674 ("nfs: Add ALLOCATE support")
Fixes: 2e72448b ("NFS: Add COPY nfs operation")
Fixes: cb95deea ("NFS OFFLOAD_CANCEL xdr")
Fixes: 624bd5b7 ("nfs: Add DEALLOCATE support")
Fixes: 1c6dcbe5 ("NFS: Implement SEEK")
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>

1fd0bb40

NFS/pnfs: Bulk destroy of layouts needs to be safe w.r.t. umount · ad300be7

Committed by Trond Myklebust on Feb 22, 2019

task #28557789

[ Upstream commit 5085607d209102b37b169bc94d0aa39566a9842a ]

If a bulk layout recall or a metadata server reboot coincides with a
umount, then holding a reference to an inode is unsafe unless we
also hold a reference to the super block.

Fixes: fd9a8d71 ("NFSv4.1: Fix bulk recall and destroy of layouts")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>

ad300be7

NFS: Fix a soft lockup in the delegation recovery code · 2eebd943

Committed by Trond Myklebust on Feb 21, 2019

task #28557789

[ Upstream commit 6f9449be53f3ce383caed797708b332ede8d952c ]

Fix a soft lockup when NFS client delegation recovery is attempted
but the inode is in the process of being freed. When the
igrab(inode) call fails, and we have to restart the recovery process,
we need to ensure that we won't attempt to recover the same delegation
again.

Fixes: 45870d69 ("NFSv4.1: Test delegation stateids when server...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>

2eebd943

NFSv4.x: Drop the slot if nfs4_delegreturn_prepare waits for layoutreturn · f6211125

Committed by Trond Myklebust on Nov 13, 2019

task #28557789

commit 5326de9e94bedcf7366e7e7625d4deb8c1f1ca8a upstream.

If nfs4_delegreturn_prepare needs to wait for a layoutreturn to complete
then make sure we drop the sequence slot if we hold it.

Fixes: 1c5bd76d ("pNFS: Enable layoutreturn operation for return-on-close")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>

f6211125

NFSv2: Fix a typo in encode_sattr() · 0f6c2ec0

Committed by Trond Myklebust on Oct 04, 2019

task #28557789

commit ad97a995d8edff820d4238bd0dfc69f440031ae6 upstream.

Encode the mtime correctly.

Fixes: 95582b00 ("vfs: change inode times to use struct timespec64")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>

0f6c2ec0

xfs: fix partially uninitialized structure in xfs_reflink_remap_extent · 5cfdc142

Committed by Darrick J. Wong on Apr 12, 2020

task #28557760

[ Upstream commit c142932c29e533ee892f87b44d8abc5719edceec ]

In the reflink extent remap function, it turns out that uirec (the block
mapping corresponding only to the part of the passed-in mapping that got
unmapped) was not fully initialized.  Specifically, br_state was not
being copied from the passed-in struct to the uirec.  This could lead to
unpredictable results such as the reflinked mapping being marked
unwritten in the destination file.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>

5cfdc142
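
The class of bug described in 5cfdc142, a derived record that is built field by field and misses one field, can be avoided by copying the whole source structure first and then narrowing it. The sketch below is a generic illustration under that assumption and is not the code from xfs_reflink_remap_extent(): `demo_irec`, `demo_state` and `demo_sub_mapping()` are hypothetical names, with `state` standing in for br_state.

```c
#include <assert.h>

/* Hypothetical stand-in for an extent state flag such as br_state. */
enum demo_state {
    DEMO_NORM = 0,
    DEMO_UNWRITTEN = 1
};

/* Hypothetical, simplified block-mapping record. */
struct demo_irec {
    unsigned long long startoff;
    unsigned long long startblock;
    unsigned long long blockcount;
    enum demo_state state;
};

/*
 * Build the record describing the first 'len' blocks of 'irec'.
 * Starting from a full structure copy guarantees that no field,
 * including 'state', is left uninitialized.
 */
static struct demo_irec demo_sub_mapping(const struct demo_irec *irec,
                                         unsigned long long len)
{
    struct demo_irec sub = *irec;    /* copies every field */

    if (len < sub.blockcount)
        sub.blockcount = len;        /* narrow only what must differ */
    return sub;
}
```

With this shape, adding a new field to the record later does not require touching every place that derives a sub-mapping.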

xfs: clear PF_MEMALLOC before exiting xfsaild thread · 0ff2b809

Committed by Eric Biggers on Mar 10, 2020

task #28557760

commit 10a98cb16d80be3595fdb165fad898bb28b8b6d2 upstream.

Leaving PF_MEMALLOC set when exiting a kthread causes it to remain set
during do_exit().  That can confuse things.  In particular, if BSD
process accounting is enabled, then do_exit() writes data to an
accounting file.  If that file has FS_SYNC_FL set, then this write
occurs synchronously and can misbehave if PF_MEMALLOC is set.

For example, if the accounting file is located on an XFS filesystem,
then a WARN_ON_ONCE() in iomap_do_writepage() is triggered and the data
doesn't get written when it should.  Or if the accounting file is
located on an ext4 filesystem without a journal, then a WARN_ON_ONCE()
in ext4_write_inode() is triggered and the inode doesn't get written.

Fix this in xfsaild() by using the helper functions to save and restore
PF_MEMALLOC.

This can be reproduced as follows in the kvm-xfstests test appliance
modified to add the 'acct' Debian package, and with kvm-xfstests's
recommended kconfig modified to add CONFIG_BSD_PROCESS_ACCT=y:

        mkfs.xfs -f /dev/vdb
        mount /vdb
        touch /vdb/file
        chattr +S /vdb/file
        accton /vdb/file
        mkfs.xfs -f /dev/vdc
        mount /vdc
        umount /vdc

It causes:
	WARNING: CPU: 1 PID: 336 at fs/iomap/buffered-io.c:1534
	CPU: 1 PID: 336 Comm: xfsaild/vdc Not tainted 5.6.0-rc5 #3
	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20191223_100556-anatol 04/01/2014
	RIP: 0010:iomap_do_writepage+0x16b/0x1f0 fs/iomap/buffered-io.c:1534
	[...]
	Call Trace:
	 write_cache_pages+0x189/0x4d0 mm/page-writeback.c:2238
	 iomap_writepages+0x1c/0x33 fs/iomap/buffered-io.c:1642
	 xfs_vm_writepages+0x65/0x90 fs/xfs/xfs_aops.c:578
	 do_writepages+0x41/0xe0 mm/page-writeback.c:2344
	 __filemap_fdatawrite_range+0xd2/0x120 mm/filemap.c:421
	 file_write_and_wait_range+0x71/0xc0 mm/filemap.c:760
	 xfs_file_fsync+0x7a/0x2b0 fs/xfs/xfs_file.c:114
	 generic_write_sync include/linux/fs.h:2867 [inline]
	 xfs_file_buffered_aio_write+0x379/0x3b0 fs/xfs/xfs_file.c:691
	 call_write_iter include/linux/fs.h:1901 [inline]
	 new_sync_write+0x130/0x1d0 fs/read_write.c:483
	 __kernel_write+0x54/0xe0 fs/read_write.c:515
	 do_acct_process+0x122/0x170 kernel/acct.c:522
	 slow_acct_process kernel/acct.c:581 [inline]
	 acct_process+0x1d4/0x27c kernel/acct.c:607
	 do_exit+0x83d/0xbc0 kernel/exit.c:791
	 kthread+0xf1/0x140 kernel/kthread.c:257
	 ret_from_fork+0x27/0x50 arch/x86/entry/entry_64.S:352

This bug was originally reported by syzbot at
https://lore.kernel.org/r/0000000000000e7156059f751d7b@google.com.

Reported-by: syzbot+1f9dc49e8de2582d90c2@syzkaller.appspotmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>

0ff2b809
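
The "save and restore PF_MEMALLOC" approach described in 0ff2b809 can be sketched in userspace as follows. This is an illustration of the save/restore idea only, assuming a simplified task structure: `demo_task`, `DEMO_PF_MEMALLOC`, `demo_memalloc_save()`, `demo_memalloc_restore()` and `demo_worker()` are hypothetical names, and the bit value is illustrative rather than taken from the kernel headers.

```c
#include <assert.h>

/* Illustrative flag bit; a stand-in for PF_MEMALLOC. */
#define DEMO_PF_MEMALLOC 0x0800u

/* Hypothetical, simplified task structure. */
struct demo_task {
    unsigned int flags;
};

/* Set the flag and return its previous state so it can be restored. */
static unsigned int demo_memalloc_save(struct demo_task *t)
{
    unsigned int old = t->flags & DEMO_PF_MEMALLOC;

    t->flags |= DEMO_PF_MEMALLOC;
    return old;
}

/* Put the flag back exactly as it was before the matching save. */
static void demo_memalloc_restore(struct demo_task *t, unsigned int old)
{
    t->flags = (t->flags & ~DEMO_PF_MEMALLOC) | old;
}

/* Thread body: the flag is set only for the duration of the work loop. */
static void demo_worker(struct demo_task *t)
{
    unsigned int saved = demo_memalloc_save(t);

    /* ... main loop would run here with the flag set ... */

    demo_memalloc_restore(t, saved);   /* flag no longer leaks into exit */
}
```

Because the restore writes back the saved state rather than unconditionally clearing the bit, the pattern also behaves correctly when the flag was already set by an outer scope.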
