1. 08 Dec 2019, 3 commits
    • alios: mm: memcontrol: make distance between wmark_low and wmark_high configurable · 33ef4784
      Committed by Yang Shi
      Introduce a new interface, wmark_scale_factor, which defines the
      distance between wmark_high and wmark_low.  The unit is in fractions of
      10,000. The default value of 50 means the distance between wmark_high
      and wmark_low is 0.5% of the max limit of the cgroup.  The maximum value
      is 1000, or 10% of the max limit.
      
      The distance between wmark_low and wmark_high has an impact on how hard
      memcg kswapd reclaims.
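
      As a rough illustration of the arithmetic this implies (not the actual
      memcontrol.c code; the helper name is made up):

      /* Illustrative only: the gap between wmark_high and wmark_low for a
       * given max limit, with wmark_scale_factor in units of 1/10000. */
      static unsigned long wmark_gap(unsigned long max, unsigned int wmark_scale_factor)
      {
              return max / 10000 * wmark_scale_factor;  /* 50 -> 0.5%, 1000 -> 10% */
      }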
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      33ef4784
    • alios: mm: vmscan: make memcg kswapd set memcg state to dirty or writeback · e10c247b
      Committed by Yang Shi
      The global kswapd can mark a memory node as dirty or writeback if the
      current scan finds that all pages are unqueued dirty or under writeback;
      kswapd then writes out the dirty pages or waits for writeback to finish.
      The memcg kswapd behaves like the global kswapd, so it should set the
      dirty or writeback state on the memcg too when the same condition is met.

      Since direct reclaim can't write out page cache, the system depends on
      kswapd to write out dirty pages when a scan finds too many of them, in
      order to avoid premature OOM.  But if the page cache is dirtied too
      fast, writeback can't keep up with the dirtying; it is the
      responsibility of dirty page balancing to throttle the dirtiers.
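
      Schematically, the memcg-side check mirrors what the global kswapd does
      for the node; the flag and helper names below are placeholders for
      illustration only, not the patch's actual identifiers:

      /* Sketch: if an entire memcg LRU scan was unqueued dirty or under
       * writeback, record that state so memcg kswapd can write out pages or
       * wait, like the global kswapd does for the pgdat. */
      if (current_is_kswapd() && !global_reclaim(sc)) {
              if (nr_unqueued_dirty == nr_scanned)
                      memcg_set_reclaim_state(memcg, MEMCG_DIRTY);      /* hypothetical */
              if (nr_writeback == nr_scanned)
                      memcg_set_reclaim_state(memcg, MEMCG_WRITEBACK);  /* hypothetical */
      }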
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      e10c247b
    • alios: mm: memcontrol: support background async page reclaim · 6b2ef082
      Committed by Yang Shi
      Currently, when memory usage exceeds the memory cgroup limit, the memory
      cgroup can only do synchronous direct reclaim.  This may incur
      unexpected stalls in applications that are sensitive to latency.
      Introduce a background asynchronous page reclaim mechanism, similar to
      what kswapd does.

      Define the memcg memory usage watermark by introducing the wmark_ratio
      interface, which ranges from 0 to 100 and represents a percentage of the
      max limit.  wmark_high is calculated as (max * wmark_ratio / 100), and
      wmark_low is (wmark_high - (wmark_high >> 8)), which is an empirical
      value.  If wmark_ratio is 0 (the default), the watermark is disabled and
      both wmark_low and wmark_high are set to max.

      If wmark_ratio is set, then when charging a page, if usage is greater
      than wmark_high (which means the available memory of the memcg is low),
      a work item is scheduled to do background page reclaim until usage is
      reduced to wmark_low if possible.

      Define a dedicated unbound workqueue for scheduling watermark reclaim
      work items.
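
      The watermark arithmetic described above, as a standalone sketch
      (illustrative only, not the actual memcontrol.c code):

      static void calc_wmarks(unsigned long max, unsigned int wmark_ratio,
                              unsigned long *wmark_low, unsigned long *wmark_high)
      {
              if (!wmark_ratio) {                     /* 0 (default): watermarks disabled */
                      *wmark_low = *wmark_high = max;
                      return;
              }
              *wmark_high = max / 100 * wmark_ratio;  /* percentage of the max limit */
              *wmark_low = *wmark_high - (*wmark_high >> 8);  /* empirical gap */
      }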
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      6b2ef082
  2. 05 Dec 2019, 1 commit
  3. 28 Nov 2019, 7 commits
    • alios: blk-throttle: limit bios to fix amount of pages entering writeback prematurely · 06a67773
      Committed by Xiaoguang Wang
      Currently in blk_throtl_bio(), if a bio exceeds its throtl_grp's bps or
      iops limit, the bio is queued on the throtl_grp's throtl_service_queue.
      The mm subsystem then keeps submitting more pages even though the
      underlying device cannot handle the io requests, which makes a large
      number of pages enter writeback prematurely; if some process later
      writes one of these pages, it will wait for a long time.
      
      I have done some tests: one process does buffered writes to a 1GB file
      with its blkcg max bps limit set to 10MB/s, and I observe this:
      	#cat /proc/meminfo  | grep -i back
      	Writeback:        900024 kB
      	WritebackTmp:          0 kB
      
      This Writeback value is just too big: many bios have been queued in the
      throtl_grp's throtl_service_queue, and if a process tries to write the
      page of the last bio in this queue, it will call
      wait_on_page_writeback(page), which must wait for all the previous bios
      to finish and can take a long time.  We have also seen 120s hung task
      warnings on our servers.
      
       INFO: task kworker/u128:0:30072 blocked for more than 120 seconds.
             Tainted: G            E 4.9.147-013.ali3000_015_test.alios7.x86_64 #1
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       kworker/u128:0  D    0 30072      2 0x00000000
       Workqueue: writeback wb_workfn (flush-8:16)
        ffff882ddd066b40 0000000000000000 ffff882e5cad3400 ffff882fbe959e80
        ffff882fa50b1a00 ffffc9003a5a3768 ffffffff8173325d ffffc9003a5a3780
        00ff882e5cad3400 ffff882fbe959e80 ffffffff81360b49 ffff882e5cad3400
       Call Trace:
        [<ffffffff8173325d>] ? __schedule+0x23d/0x6d0
        [<ffffffff81360b49>] ? alloc_request_struct+0x19/0x20
        [<ffffffff81733726>] schedule+0x36/0x80
        [<ffffffff81736c56>] schedule_timeout+0x206/0x4b0
        [<ffffffff81036c69>] ? sched_clock+0x9/0x10
        [<ffffffff81363073>] ? get_request+0x403/0x810
        [<ffffffff8110ca10>] ? ktime_get+0x40/0xb0
        [<ffffffff81732f8a>] io_schedule_timeout+0xda/0x170
        [<ffffffff81733f90>] ? bit_wait+0x60/0x60
        [<ffffffff81733fab>] bit_wait_io+0x1b/0x60
        [<ffffffff81733b28>] __wait_on_bit+0x58/0x90
        [<ffffffff811b0d91>] ? find_get_pages_tag+0x161/0x2e0
        [<ffffffff811aff62>] wait_on_page_bit+0x82/0xa0
        [<ffffffff810d47f0>] ? wake_atomic_t_function+0x60/0x60
        [<ffffffffa02fc181>] mpage_prepare_extent_to_map+0x2d1/0x310 [ext4]
        [<ffffffff8121ff65>] ? kmem_cache_alloc+0x185/0x1a0
        [<ffffffffa0305a2f>] ? ext4_init_io_end+0x1f/0x40 [ext4]
        [<ffffffffa0300294>] ext4_writepages+0x404/0xef0 [ext4]
        [<ffffffff81508c64>] ? scsi_init_io+0x44/0x200
        [<ffffffff81398a0f>] ? fprop_fraction_percpu+0x2f/0x80
        [<ffffffff811c139e>] do_writepages+0x1e/0x30
        [<ffffffff8127c0f5>] __writeback_single_inode+0x45/0x320
        [<ffffffff8127c942>] writeback_sb_inodes+0x272/0x600
        [<ffffffff8127cf6b>] wb_writeback+0x10b/0x300
        [<ffffffff8127d884>] wb_workfn+0xb4/0x380
        [<ffffffff810b85e9>] ? try_to_wake_up+0x59/0x3e0
        [<ffffffff810a5759>] process_one_work+0x189/0x420
        [<ffffffff810a5a3e>] worker_thread+0x4e/0x4b0
        [<ffffffff810a59f0>] ? process_one_work+0x420/0x420
        [<ffffffff810ac026>] kthread+0xe6/0x100
        [<ffffffff810abf40>] ? kthread_park+0x60/0x60
        [<ffffffff81738499>] ret_from_fork+0x39/0x50
      
      To fix this issue, we can simply limit the throtl_service_queue's
      maximum number of queued bios.  Currently we limit it to the
      throtl_grp's bps limit or iops limit; if the limit is still exceeded, we
      just sleep for a while.
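
      The idea, in rough pseudocode (the helpers below are hypothetical, not
      the patch's actual code):

      /* Cap the number of bios parked on the service queue at roughly the
       * group's per-second bps/iops budget; if the cap is still exceeded,
       * let the submitter sleep until the queue drains. */
      while (tg_queued_bios(tg) >= tg_queued_limit(tg))
              msleep(10);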
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      06a67773
    • alios: blk-throttle: fix tg NULL pointer dereference · 4667e926
      Committed by Joseph Qi
      The io throttle stats code calls blkg_get at the beginning of throttling
      and blkg_put in the newly introduced bi_tg_end_io.  This causes the blkg
      to be freed if end_io is called twice, as dm-thin does: it saves the
      original end_io first, calls its own overriding end_io, and then the
      saved end_io.  After that, accessing the blkg is invalid and finally
      triggers a BUG:
      
      [ 4417.235048] BUG: unable to handle kernel NULL pointer dereference at 00000000000001e0
      [ 4417.236475] IP: [<ffffffff812e7c71>] throtl_update_dispatch_stats+0x21/0xb0
      [ 4417.237865] PGD 98395067 PUD 362e1067 PMD 0
      [ 4417.239232] Oops: 0000 [#1] SMP
      ......
      [ 4417.274070] Call Trace:
      [ 4417.275407]  [<ffffffff812ea93d>] blk_throtl_bio+0xfd/0x630
      [ 4417.276760]  [<ffffffff810b3613>] ? wake_up_process+0x23/0x40
      [ 4417.278079]  [<ffffffff81094c04>] ? wake_up_worker+0x24/0x30
      [ 4417.279387]  [<ffffffff81095772>] ? insert_work+0x62/0xa0
      [ 4417.280697]  [<ffffffff8116c2c7>] ? mempool_free_slab+0x17/0x20
      [ 4417.282019]  [<ffffffff8116c6c9>] ? mempool_free+0x49/0x90
      [ 4417.283326]  [<ffffffff812c9acf>] generic_make_request_checks+0x16f/0x360
      [ 4417.284637]  [<ffffffffa0340d97>] ? thin_map+0x227/0x2c0 [dm_thin_pool]
      [ 4417.285951]  [<ffffffff812c9ce7>] generic_make_request+0x27/0x130
      [ 4417.287240]  [<ffffffffa0230b3d>] __map_bio+0xad/0x100 [dm_mod]
      [ 4417.288503]  [<ffffffffa023257e>] __clone_and_map_data_bio+0x15e/0x240 [dm_mod]
      [ 4417.289778]  [<ffffffffa02329ea>] __split_and_process_bio+0x38a/0x500 [dm_mod]
      [ 4417.291062]  [<ffffffffa0232c91>] dm_make_request+0x131/0x1a0 [dm_mod]
      [ 4417.292344]  [<ffffffff812c9da2>] generic_make_request+0xe2/0x130
      [ 4417.293626]  [<ffffffff812c9e61>] submit_bio+0x71/0x150
      [ 4417.294909]  [<ffffffff8121ab1d>] ? bio_alloc_bioset+0x20d/0x360
      [ 4417.296195]  [<ffffffff81215acb>] _submit_bh+0x14b/0x220
      [ 4417.297484]  [<ffffffff81215bb0>] submit_bh+0x10/0x20
      [ 4417.298744]  [<ffffffffa016d8d8>] jbd2_journal_commit_transaction+0x6c8/0x19a0 [jbd2]
      [ 4417.300014]  [<ffffffff810135b8>] ? __switch_to+0xf8/0x4c0
      [ 4417.301268]  [<ffffffffa01731e9>] kjournald2+0xc9/0x270 [jbd2]
      [ 4417.302524]  [<ffffffff810a0fd0>] ? wake_up_atomic_t+0x30/0x30
      [ 4417.303753]  [<ffffffffa0173120>] ? commit_timeout+0x10/0x10 [jbd2]
      [ 4417.304950]  [<ffffffff8109ffef>] kthread+0xcf/0xe0
      [ 4417.306107]  [<ffffffff8109ff20>] ? kthread_create_on_node+0x140/0x140
      [ 4417.307255]  [<ffffffff81647f18>] ret_from_fork+0x58/0x90
      [ 4417.308349]  [<ffffffff8109ff20>] ? kthread_create_on_node+0x140/0x140
      ......
      
      Now we introduce a new bio flag, BIO_THROTL_STATED, to make sure
      blkg_get/blkg_put are only called once for the same bio.
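
      The intent, roughly (simplified; the end_io handler name is assumed for
      illustration):

      /* get side: take the blkg reference and hook end_io only once per bio */
      if (!bio_flagged(bio, BIO_THROTL_STATED)) {
              blkg_get(tg_to_blkg(tg));
              bio_set_flag(bio, BIO_THROTL_STATED);
              bio->bi_tg_end_io = bio->bi_end_io;
              bio->bi_end_io = blk_throtl_bio_endio;   /* name assumed */
      }

      /* put side: drop the reference only once, even if end_io runs twice */
      if (bio_flagged(bio, BIO_THROTL_STATED)) {
              blkg_put(tg_to_blkg(tg));
              bio_clear_flag(bio, BIO_THROTL_STATED);
      }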
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      4667e926
    • alios: blk-throttle: support io delay stats · 65e6966a
      Committed by Joseph Qi
      Add blkio.throttle.io_service_time and blkio.throttle.io_wait_time to
      provide per-cgroup io delay statistics.
      io_service_time represents the time spent from io throttle dispatch to
      io completion, while io_wait_time represents the time spent on the
      throttle queue.
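
      In other words (an illustrative definition only, not the patch's code):

      /* How the two metrics relate to a bio's timeline. */
      struct bio_times { unsigned long long submit_ns, dispatch_ns, complete_ns; };

      static void account_delays(const struct bio_times *t,
                                 unsigned long long *io_wait_time,
                                 unsigned long long *io_service_time)
      {
              *io_wait_time    += t->dispatch_ns - t->submit_ns;   /* time on throttle queue */
              *io_service_time += t->complete_ns - t->dispatch_ns; /* dispatch to completion */
      }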
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      65e6966a
    • alios: block: add counter to track io request's d2c time · 07232d74
      Committed by Xiaoguang Wang
      The await metric reported by the iostat tool is not good enough: it is
      somewhat coarse and cannot show a request's latency on the device
      driver's side.

      Here we add a new counter to track an io request's d2c
      (dispatch-to-complete) time; with this patch, we can also extend iostat
      to show this value easily.

      Note:
      I have checked how iostat is implemented; it just reads the fields it
      needs, so iostat won't be affected by this change, and neither will
      tsar.
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      07232d74
    • alios: jbd2: add proc entry to control whether doing buffer copy-out · ac452d09
      Committed by Xiaoguang Wang
      When jbd2 tries to get write access to a buffer that is under writeback
      with the BH_Shadow flag set, it waits until the buffer has been written
      to disk, but sometimes that wait can be very long, especially when the
      disk is almost full.

      Here we add a proc entry "force_copy"; if its value is non-zero, jbd2
      will always do a metadata buffer copy-out, eliminating the unnecessary
      waiting and reducing long-tail latency for buffered writes.
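
      Schematically, the change in the jbd2 write-access path looks like this
      (a sketch only; the j_force_copy field name is assumed for
      illustration):

      /* When force_copy is enabled, take the copy-out path instead of
       * sleeping until the shadow buffer has hit the disk. */
      if (buffer_shadow(bh)) {
              if (journal->j_force_copy)      /* field name assumed */
                      need_copy_out = true;   /* copy into a frozen buffer */
              else
                      wait_on_bit_io(&bh->b_state, BH_Shadow, TASK_UNINTERRUPTIBLE);
      }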
      
      I constructed the following test case:
      
      $cat offline.fio
      ; fio-rand-RW.job for fiotest
      
      [global]
      name=fio-rand-RW
      filename=fio-rand-RW
      rw=randrw
      rwmixread=60
      rwmixwrite=40
      bs=4K
      direct=0
      numjobs=4
      time_based=1
      runtime=900
      
      [file1]
      size=60G
      ioengine=sync
      iodepth=16
      
      $cat online.fio
      ; fio-seq-write.job for fiotest
      
      [global]
      name=fio-seq-write
      filename=fio-seq-write
      rw=write
      bs=256K
      direct=0
      numjobs=1
      time_based=1
      runtime=60
      
      [file1]
      rate=50m
      size=10G
      ioengine=sync
      iodepth=16
      
      With this patch:
      $cat /proc/fs/jbd2/sda5-8/force_copy
      0
      
      online fio almost always gets long tail latency like this:
      
      Jobs: 1 (f=1), 0B/s-0B/s: [W(1)][100.0%][w=50.0MiB/s][w=200 IOPS][eta 00m:00s]
      file1: (groupid=0, jobs=1): err= 0: pid=17855: Thu Nov 15 09:45:57 2018
        write: IOPS=200, BW=50.0MiB/s (52.4MB/s)(3000MiB/60001msec)
          clat (usec): min=135, max=4086.6k, avg=867.21, stdev=50338.22
           lat (usec): min=139, max=4086.6k, avg=871.16, stdev=50338.22
          clat percentiles (usec):
           |  1.00th=[    141],  5.00th=[    143], 10.00th=[    145],
           | 20.00th=[    147], 30.00th=[    147], 40.00th=[    149],
           | 50.00th=[    149], 60.00th=[    151], 70.00th=[    153],
           | 80.00th=[    155], 90.00th=[    159], 95.00th=[    163],
           | 99.00th=[    255], 99.50th=[    273], 99.90th=[    429],
           | 99.95th=[    441], 99.99th=[3640656]
      
      $cat /proc/fs/jbd2/sda5-8/force_copy
      1
      
      online fio latency is much better.
      
      Jobs: 1 (f=1), 0B/s-0B/s: [W(1)][100.0%][w=50.0MiB/s][w=200 IOPS][eta 00m:00s]
      file1: (groupid=0, jobs=1): err= 0: pid=8084: Thu Nov 15 09:31:15 2018
        write: IOPS=200, BW=50.0MiB/s (52.4MB/s)(3000MiB/60001msec)
          clat (usec): min=137, max=545, avg=151.35, stdev=16.22
           lat (usec): min=140, max=548, avg=155.31, stdev=16.65
          clat percentiles (usec):
            |  1.00th=[  143],  5.00th=[  145], 10.00th=[  145], 20.00th=[  147],
            | 30.00th=[  147], 40.00th=[  147], 50.00th=[  149], 60.00th=[  149],
            | 70.00th=[  151], 80.00th=[  155], 90.00th=[  157], 95.00th=[  161],
            | 99.00th=[  239], 99.50th=[  269], 99.90th=[  420], 99.95th=[  429],
            | 99.99th=[  537]
      
      As to the cost: because we always need to copy the metadata buffer, this
      consumes a little extra cpu time and some memory (at most 32MB for a
      128MB journal size).
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      ac452d09
    • alios: fs,ext4: remove projid limit when create hard link · 28df06b3
      Committed by zhangliguang
      This is a temporary workaround to avoid the limitation when creating a
      hard link across two projids.
      Signed-off-by: zhangliguang <zhangliguang@linux.alibaba.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      28df06b3
    • alios: jbd2: create jbd2-ckpt thread for journal checkpoint · c31b17e5
      Committed by Joseph Qi
      This does the jbd2 checkpoint in a dedicated kernel thread, so that
      checkpointing is not under io throttle control.
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Signed-off-by: zhangliguang <zhangliguang@linux.alibaba.com>
      Reviewed-by: Baoyou Xie <baoyou.xie@linux.alibaba.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      c31b17e5
  4. 20 Nov 2019, 22 commits
  5. 19 Nov 2019, 3 commits
    • block: fix .bi_size overflow · a74e2556
      Committed by Ming Lei
      commit 79d08f89bb1b5c2c1ff90d9bb95497ab9e8aa7e0 upstream
      
      'bio->bi_iter.bi_size' is 'unsigned int', which can hold at most 4G - 1
      bytes.

      Before 07173c3ec276 ("block: enable multipage bvecs"), one bio could
      include only a limited number of pages, usually at most 256, so an fs
      bio would not be bigger than 1M bytes most of the time.

      Since we support multi-page bvecs, in theory more than 1M pages really
      can be added to one fs bio, especially with hugepages or big writeback
      with too many dirty pages, so there is a chance that .bi_size
      overflows.

      Fix this issue by using bio_full() to check whether the added segment
      may overflow .bi_size.
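
      The upstream helper is roughly:

      static inline bool bio_full(struct bio *bio, unsigned len)
      {
              if (bio->bi_vcnt >= bio->bi_max_vecs)
                      return true;
              if (bio->bi_iter.bi_size > UINT_MAX - len)
                      return true;    /* adding len more bytes would overflow bi_size */
              return false;
      }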
      Signed-off-by: Hui Zhu <teawaterz@linux.alibaba.com>
      Cc: Liu Yiding <liuyd.fnst@cn.fujitsu.com>
      Cc: kernel test robot <rong.a.chen@intel.com>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: linux-xfs@vger.kernel.org
      Cc: linux-fsdevel@vger.kernel.org
      Cc: stable@vger.kernel.org
      Fixes: 07173c3ec276 ("block: enable multipage bvecs")
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      a74e2556
    • mm, swap: fix race between swapoff and some swap operations · 73c29467
      Committed by Huang Ying
      commit eb085574a7526c4375965c5fbf7e5b0c19cdd336 upstream.
      Change SWP_VALID to (1 << 12).
      
      When swapin is performed, after getting the swap entry information from
      the page table, the system will swap in the entry without holding any
      lock that prevents the swap device from being swapped off.  This may
      cause a race like the one below:
      
      CPU 1				CPU 2
      -----				-----
      				do_swap_page
      				  swapin_readahead
      				    __read_swap_cache_async
      swapoff				      swapcache_prepare
        p->swap_map = NULL		        __swap_duplicate
      					  p->swap_map[?] /* !!! NULL pointer access */
      
      Because swapoff is usually done only at system shutdown, the race may
      not hit many people in practice, but it is still a race that needs to be
      fixed.

      To fix the race, get_swap_device() is added to check whether the
      specified swap entry is valid in its swap device.  If so, it will keep
      the swap entry valid by preventing the swap device from being swapped
      off, until put_swap_device() is called.

      Because swapoff() is a very rare code path, to make the normal path run
      as fast as possible, rcu_read_lock/unlock() and synchronize_rcu() are
      used instead of a reference count to implement get/put_swap_device().
      From get_swap_device() to put_swap_device(), the RCU read side is
      locked, so synchronize_rcu() in swapoff() will wait until
      put_swap_device() is called.
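
      The resulting caller pattern is roughly:

      struct swap_info_struct *si;

      si = get_swap_device(entry);            /* enters the RCU read side, validates entry */
      if (si) {
              /* safe to access si->swap_map, the swap cache, etc. here */
              put_swap_device(si);            /* leaves the RCU read side */
      } else {
              /* raced with swapoff; the entry is no longer valid */
      }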
      
      In addition to the swap_map, cluster_info, etc. data structures in
      struct swap_info_struct, the swap cache radix tree is freed after
      swapoff, so this patch also fixes the race between swap cache lookup and
      swapoff.

      Races between some other swap cache usages and swapoff are fixed too,
      by calling synchronize_rcu() between clearing PageSwapCache() and
      freeing the swap cache data structure.
      
      Another possible method to fix this is to use preempt_off() +
      stop_machine() to prevent the swap device from being swapped off while
      its data structures are being accessed.  The hot-path overhead of both
      methods is similar.  The advantages of the RCU-based method are:
      
      1. stop_machine() may disturb the normal execution code path on other
         CPUs.
      
      2. File cache uses RCU to protect its radix tree.  If the similar
         mechanism is used for swap cache too, it is easier to share code
         between them.
      
      3. RCU is used to protect swap cache in total_swapcache_pages() and
         exit_swap_address_space() already.  The two mechanisms can be
         merged to simplify the logic.
      
      Link: http://lkml.kernel.org/r/20190522015423.14418-1-ying.huang@intel.com
      Fixes: 235b6217 ("mm/swap: add cluster lock")
      Signed-off-by: Hui Zhu <teawaterz@linux.alibaba.com>
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
      Not-nacked-by: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      73c29467
    • vmscan: return NODE_RECLAIM_NOSCAN in node_reclaim() when CONFIG_NUMA is n · e431b612
      Committed by Wei Yang
      commit 8b09549c2bfd9f3f8f4cdad74107ef4f4ff9cdd7 upstream.
      
      Commit fa5e084e ("vmscan: do not unconditionally treat zones that
      fail zone_reclaim() as full") changed the return value of
      node_reclaim().  After that commit, the former return value 0 means
      NODE_RECLAIM_SOME.

      However, the return value of node_reclaim() when CONFIG_NUMA is n was
      not changed accordingly, which leads to zone_watermark_ok() being
      called again.

      This patch fixes the return value by adjusting it to NODE_RECLAIM_NOSCAN.
      Since node_reclaim() is only called from page_alloc.c, move its
      declaration to mm/internal.h.
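
      After the move, the !CONFIG_NUMA stub in mm/internal.h is roughly:

      #ifdef CONFIG_NUMA
      extern int node_reclaim(struct pglist_data *, gfp_t, unsigned int);
      #else
      static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask,
                                     unsigned int order)
      {
              return NODE_RECLAIM_NOSCAN;     /* was 0, i.e. NODE_RECLAIM_SOME */
      }
      #endif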
      
      Link: http://lkml.kernel.org/r/20181113080436.22078-1-richard.weiyang@gmail.com
      Signed-off-by: Hui Zhu <teawaterz@linux.alibaba.com>
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      e431b612
  6. 07 Nov 2019, 4 commits
    • blkcg: implement blk-iocost · abd942b2
      Committed by Tejun Heo
      commit 7caa47151ab2e644dd221f741ec7578d9532c9a3 upstream.
      
      This patchset implements an IO-cost-model-based, work-conserving
      proportional controller.
      
      While io.latency provides the capability to comprehensively prioritize
      and protect IOs depending on the cgroups, its protection is binary -
      the lowest latency target cgroup which is suffering is protected at
      the cost of all others.  In many use cases including stacking multiple
      workload containers in a single system, it's necessary to distribute
      IO capacity with better granularity.
      
      One challenge of controlling IO resources is the lack of trivially
      observable cost metric.  The most common metrics - bandwidth and iops
      - can be off by orders of magnitude depending on the device type and
      IO pattern.  However, the cost isn't a complete mystery.  Given
      several key attributes, we can make fairly reliable predictions on how
      expensive a given stream of IOs would be, at least compared to other
      IO patterns.
      
      The function which determines the cost of a given IO is the IO cost
      model for the device.  This controller distributes IO capacity based
      on the costs estimated by such model.  The more accurate the cost
      model the better but the controller adapts based on IO completion
      latency and as long as the relative costs across different IO
      patterns are consistent and sensible, it'll adapt to the actual
      performance of the device.
      
      Currently, the only implemented cost model is a simple linear one with
      a few sets of default parameters for different classes of device.
      This covers most common devices reasonably well.  All the
      infrastructure to tune and add different cost models is already in
      place and a later patch will also allow using bpf progs for cost
      models.
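
      For intuition, a linear model of this shape can be sketched as follows
      (the structure and coefficients here are hypothetical, not blk-iocost's
      actual parameters or code):

      struct linear_cost { unsigned long long seq_base, rand_base, per_page; };

      static unsigned long long io_abs_cost(const struct linear_cost *c,
                                            int is_random, unsigned long long pages)
      {
              unsigned long long base = is_random ? c->rand_base : c->seq_base;
              return base + c->per_page * pages;   /* per-IO base cost plus size-proportional cost */
      }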
      
      Please see the top comment in blk-iocost.c and documentation for
      more details.
      
      v2: Rebased on top of RQ_ALLOC_TIME changes and folded in Rik's fix
          for a divide-by-zero bug in current_hweight() triggered by zero
          inuse_sum.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Andy Newell <newella@fb.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      [Joseph: fix conflicts with ioc_rqos_throttle()]
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      abd942b2
    • cgroup: add cgroup_parse_float() · cb2f6e75
      Committed by Tejun Heo
      commit a5e112e6424adb77d953eac20e6936b952fd6b32 upstream.
      
      cgroup already uses floating point for percent[ile] numbers and there
      are several controllers which want to take them as input.  Add a
      generic parse helper to handle inputs.
      
      Update the interface convention documentation about the use of
      percentage numbers.  While at it, also clarify the default time unit.
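
      A usage sketch, assuming the helper takes the input string, the number
      of fractional digits to keep, and an output value scaled accordingly:

      s64 v;

      /* parse "12.34" keeping two fractional digits: v becomes 1234 */
      if (cgroup_parse_float("12.34", 2, &v))
              return -EINVAL;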
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      cb2f6e75
    • blk-mq: add optional request->alloc_time_ns · 4edb756c
      Committed by Tejun Heo
      commit 6f816b4b746c2241540e537682d30d8e9997d674 upstream.
      
      There are currently two start time timestamps - start_time_ns and
      io_start_time_ns.  The former marks the request allocation and the
      latter the issue-to-device time.  The planned io.weight controller needs
      to measure the total time bios take to execute after it leaves rq_qos
      including the time spent waiting for request to become available,
      which can easily dominate on saturated devices.
      
      This patch adds request->alloc_time_ns which records when the request
      allocation attempt started.  As it isn't used for the usual stats,
      make it optional behind CONFIG_BLK_RQ_ALLOC_TIME and
      QUEUE_FLAG_RQ_ALLOC_TIME so that it can be compiled out when there are
      no users and it's active only on queues which need it even when
      compiled in.
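
      Schematically, the gating looks like this (simplified from the patch):

      #ifdef CONFIG_BLK_RQ_ALLOC_TIME
              /* only pay for the extra timestamp on queues that asked for it */
              if (blk_queue_rq_alloc_time(rq->q))
                      rq->alloc_time_ns = alloc_time_ns;
      #endif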
      
      v2: s/pre_start_time/alloc_time/ and add CONFIG_BLK_RQ_ALLOC_TIME
          gating as suggested by Jens.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      4edb756c
    • blkcg: separate blkcg_conf_get_disk() out of blkg_conf_prep() · 5525bb81
      Committed by Tejun Heo
      commit 015d254cb02b6d8eec4b3366274bf4672f9e0b64 upstream.
      
      Separate out blkcg_conf_get_disk() so that it can be used by blkcg
      policy interface file input parsers before the policy is actually
      enabled.  This doesn't introduce any functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      5525bb81