Commits · e65b696142fca81a46650c6989dad5099d9e4235 · openanolis / cloud-kernel

Jan 15, 2020 · 33 commits

mm: thp: extract split_queue_* into a struct · e65b6961

Committed by Yang Shi on Oct 22, 2019

commit 364c1eebe453f06f0c1e837eb155a5725c9cd272 upstream

Patch series "Make deferred split shrinker memcg aware", v6.

Currently the THP deferred split shrinker is not memcg aware, which may
cause premature OOM with some configurations.  For example, the test below
easily runs into premature OOM:

$ cgcreate -g memory:thp
$ echo 4G > /sys/fs/cgroup/memory/thp/memory.limit_in_bytes
$ cgexec -g memory:thp transhuge-stress 4000

transhuge-stress comes from kernel selftest.

It is easy to hit OOM while there are still a lot of THPs on the deferred
split queue; memcg direct reclaim can't touch them since the deferred split
shrinker is not memcg aware.

Make the deferred split shrinker memcg aware by introducing a per-memcg
deferred split queue.  A THP is placed on the per-memcg deferred split
queue if it belongs to a memcg, and on the per-node queue otherwise.  When
the page is migrated to another memcg, it is moved to the target memcg's
deferred split queue too.

Reuse the second tail page's deferred_list for per memcg list since the
same THP can't be on multiple deferred split queues.

Make deferred split shrinker not depend on memcg kmem since it is not
slab.  It doesn't make sense to not shrink THP even though memcg kmem is
disabled.

With the above change, the test demonstrated above doesn't trigger OOM
even with cgroup.memory=nokmem.

This patch (of 4):

Put split_queue, split_queue_lock and split_queue_len into a struct in
order to reduce code duplication when we convert deferred_split to memcg
aware in the later patches.
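
As a rough user-space model of why the grouping helps, the sketch below keeps the queue, its lock and its length in one object so that a single helper can serve more than one owner. The class name, helper name and the two owner variables are illustrative assumptions, not the kernel's definitions.

```python
import threading
from collections import deque
from dataclasses import dataclass, field


@dataclass
class DeferredSplit:
    """Toy model: queue, its lock, and its length kept in one object."""
    split_queue_lock: threading.Lock = field(default_factory=threading.Lock)
    split_queue: deque = field(default_factory=deque)
    split_queue_len: int = 0


def deferred_split_add(ds: DeferredSplit, page: str) -> None:
    """Queue a page once; the same helper works for any owner of `ds`."""
    with ds.split_queue_lock:
        if page not in ds.split_queue:
            ds.split_queue.append(page)
            ds.split_queue_len += 1


# Hypothetical owners: each one embeds its own DeferredSplit instance.
node_queue = DeferredSplit()
memcg_queue = DeferredSplit()

deferred_split_add(node_queue, "thp-a")
deferred_split_add(memcg_queue, "thp-b")
deferred_split_add(memcg_queue, "thp-b")  # duplicate add is ignored
```

Because the helper only touches the object it is handed, adding a second kind of owner does not require a second copy of the queueing logic.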

Link: http://lkml.kernel.org/r/1565144277-36240-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Suggested-by: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

alinux: mm: Support kidled · a29243e2

Committed by Gavin Shan on Aug 30, 2019

This enables scanning pages at a fixed interval to determine their access
frequency (hot/cold). The result is exported to user land per memory
cgroup through "memory.idle_page_stats". The design is highlighted
below:

   * A kernel thread is spawned when this feature is enabled by writing a
     non-zero value to "/sys/kernel/mm/kidled/scan_period_in_seconds".
     The thread sequentially scans the nodes and their pages that have
     been chained up in the LRU lists.

   * For each page, its corresponding age information is stored in the
     page flags or in a per-node array. The age represents the number of
     scanning intervals in which the page hasn't been accessed. The page
     flag PG_idle is also leveraged: the page's age is increased by one if
     the idle flag hasn't been cleared between two consecutive scans.
     Otherwise, the page's age is reset. The age information is also
     cleared when the page is freed, so that stale age information won't
     be picked up when the page is allocated again.

   * Initially, the flag is set, while the access bit in its PTE is cleared
     out by the thread. In the next scanning period, the PTE access bit is
     synchronized with the page flag: the flag is cleared if the access bit
     is set, and kept otherwise. For unmapped pages, the flag is cleared
     when the page is accessed.

   * Eventually, the page's aging information is accumulated into the
     unstable bucket of its corresponding memory cgroup as statistics. The
     unstable bucket (statistics) is copied to the stable bucket once all
     pages in all nodes have been scanned. The stable bucket (statistics)
     is exported to user land through "memory.idle_page_stats".
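
The aging rule in the design notes above can be modeled in a few lines. This is only a sketch under simplifying assumptions: pages are plain dictionaries, the idle flag is a boolean that an access clears, and the statistics count pages per age instead of the byte counts per bucket that the real interface reports. All names are hypothetical.

```python
from collections import Counter


def scan_once(pages, unstable):
    """Visit every page once and record its age in the unstable bucket."""
    for page in pages:
        if page["idle"]:
            page["age"] += 1   # idle flag survived the whole interval
        else:
            page["age"] = 0    # flag was cleared by an access: reset age
        page["idle"] = True    # re-arm the flag for the next interval
        unstable[page["age"]] += 1


def full_scan(pages):
    """Run one complete scan, then publish unstable stats as stable."""
    unstable = Counter()
    scan_once(pages, unstable)
    return dict(unstable)      # the "stable" snapshot a reader would see


pages = [{"idle": True, "age": 0}, {"idle": True, "age": 0}]
first = full_scan(pages)       # neither page was touched
pages[0]["idle"] = False       # simulate an access to page 0
second = full_scan(pages)
```

Publishing only after a complete pass is what keeps readers from seeing a half-updated histogram.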

TESTING
=======

   * cgroup1, unmapped pagecache

     # dd if=/dev/zero of=/ext4/test.data oflag=direct bs=1M count=128
     #
     # echo 1 > /sys/kernel/mm/kidled/use_hierarchy
     # echo 15 > /sys/kernel/mm/kidled/scan_period_in_seconds
     # mkdir -p /cgroup/memory
     # mount -t cgroup -o memory none /cgroup/memory
     # echo 1 > /cgroup/memory/memory.use_hierarchy
     # mkdir -p /cgroup/memory/test
     # echo 1 > /cgroup/memory/test/memory.use_hierarchy
     #
     # echo $$ > /cgroup/memory/test/cgroup.procs
     # dd if=/ext4/test.data of=/dev/null bs=1M count=128
     # < wait a few minutes >
     # cat /cgroup/memory/test/memory.idle_page_stats | grep cfei
       cfei   0   0   0   134217728   0   0   0   0
     # cat /cgroup/memory/memory.idle_page_stats | grep cfei
       cfei   0   0   0   134217728   0   0   0   0

   * cgroup1, mapped pagecache

     # < create same file and memory cgroups as above >
     #
     # echo $$ > /cgroup/memory/test/cgroup.procs
     # < run program to mmap the whole created file and access the area >
     # < wait a few minutes >
     # cat /cgroup/memory/test/memory.idle_page_stats | grep cfei
       cfei   0   134217728   0   0   0   0   0   0
     # cat /cgroup/memory/memory.idle_page_stats | grep cfei
       cfei   0   134217728   0   0   0   0   0   0

   * cgroup1, mapped and locked pagecache

     # < create same file and memory cgroups as above >
     #
     # echo $$ > /cgroup/memory/test/cgroup.procs
     # < run program to mmap the whole created file and mlock the area >
     # < wait a few minutes >
     # cat /cgroup/memory/test/memory.idle_page_stats | grep cfui
       cfui   0   134217728   0   0   0   0   0   0
     # cat /cgroup/memory/memory.idle_page_stats | grep cfui
       cfui   0   134217728   0   0   0   0   0   0

   * cgroup1, anonymous and locked area

     # < create memory cgroups as above >
     #
     # echo $$ > /cgroup/memory/test/cgroup.procs
     # < run program to mmap anonymous area and mlock it >
     # < wait a few minutes >
     # cat /cgroup/memory/test/memory.idle_page_stats | grep csui
       csui   0   0   134217728   0   0   0   0   0
     # cat /cgroup/memory/memory.idle_page_stats | grep csui
       csui   0   0   134217728   0   0   0   0   0

   * Rerun the above test cases in cgroup2; the results are the same.
     However, the cgroups are populated in a different way, as below:

     # mkdir -p /cgroup
     # mount -tcgroup2 none /cgroup
     # echo "+memory" > /cgroup/cgroup.subtree_control
     # mkdir -p /cgroup/test

Signed-off-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

alinux: mm: memcontrol: make distance between wmark_low and wmark_high configurable · bbaee3af

Committed by Yang Shi on Aug 17, 2019

Introduce a new interface, wmark_scale_factor, which defines the
distance between wmark_high and wmark_low.  The unit is in fractions of
10,000. The default value of 50 means the distance between wmark_high
and wmark_low is 0.5% of the max limit of the cgroup.  The maximum value
is 1000, or 10% of the max limit.
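
The arithmetic can be checked with a small helper. This is a sketch only: the function name is hypothetical, and rounding down with integer division is an assumption about how the fraction is applied.

```python
def wmark_distance(max_limit_bytes: int, wmark_scale_factor: int = 50) -> int:
    """Gap between wmark_high and wmark_low, in bytes.

    The factor is expressed in fractions of 10,000, so 50 means 0.5%.
    """
    if wmark_scale_factor > 1000:
        raise ValueError("wmark_scale_factor is capped at 1000 (10%)")
    return max_limit_bytes * wmark_scale_factor // 10000


default_gap = wmark_distance(4 << 30)        # 4 GiB limit, default factor
largest_gap = wmark_distance(4 << 30, 1000)  # 4 GiB limit, maximum factor
```

For a 4 GiB limit the default factor gives a gap of about 20 MiB, while the maximum factor gives about 410 MiB.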

The distance between wmark_low and wmark_high has an impact on how hard
memcg kswapd reclaims.

Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>

alinux: mm: vmscan: make memcg kswapd set memcg state to dirty or writeback · c69c12cc

Committed by Yang Shi on Aug 02, 2019

The global kswapd can set a memory node to the dirty or writeback state if
the current scan finds that all pages are unqueued dirty or under
writeback. kswapd then writes out dirty pages or waits for writeback to
finish. The memcg kswapd behaves like the global kswapd, so it should set
the dirty or writeback state on the memcg too if the same condition is met.

Since direct reclaim can't write out page cache, the system depends on
kswapd to write out dirty pages when a scan finds too many of them, in
order to avoid premature OOM. However, if page cache is dirtied too fast,
writing out pages can't keep up with dirtying them. It is the
responsibility of dirty page balancing to throttle dirtying pages.

Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>

alinux: mm: memcontrol: support background async page reclaim · 6967792f

Committed by Yang Shi on Aug 14, 2019

Currently, when memory usage exceeds the memory cgroup limit, the memory
cgroup can only do sync direct reclaim.  This may cause unexpected stalls
in applications that are sensitive to latency.  Introduce a background
async page reclaim mechanism, like what kswapd does.

Define memcg memory usage water marks by introducing the wmark_ratio
interface, which ranges from 0 to 100 and represents a percentage of the
max limit.  wmark_high is calculated as (max * wmark_ratio / 100), and
wmark_low is (wmark_high - (wmark_high >> 8)), which is an empirical value.
If wmark_ratio is 0, water marks are disabled and both wmark_low and
wmark_high equal max, which is the default.

If wmark_ratio is set and, when charging a page, usage is greater than
wmark_high (meaning the available memory of the memcg is low), a work
item is scheduled to do background page reclaim until memory usage is
reduced to wmark_low, if possible.
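
The water mark formulas and the trigger condition can be sketched as follows. The function names and the fixed reclaim step are assumptions made for illustration; the shift is parenthesized explicitly because `-` binds tighter than `>>`, so the unparenthesized form would evaluate to zero.

```python
def watermarks(max_limit: int, wmark_ratio: int):
    """Return (wmark_low, wmark_high) for a limit and a ratio in 0..100."""
    if wmark_ratio == 0:
        return max_limit, max_limit        # water marks disabled
    wmark_high = max_limit * wmark_ratio // 100
    wmark_low = wmark_high - (wmark_high >> 8)
    return wmark_low, wmark_high


def background_reclaim(usage: int, wmark_low: int, wmark_high: int,
                       step: int) -> int:
    """Model the async worker: reclaim in steps until usage hits wmark_low."""
    if step <= 0:
        raise ValueError("step must be positive")
    if usage <= wmark_high:
        return usage                       # no work is scheduled
    while usage > wmark_low:
        usage = max(wmark_low, usage - step)
    return usage


low, high = watermarks(1 << 30, 80)        # 1 GiB limit, 80% ratio
after = background_reclaim(high + 4096, low, high, step=1 << 20)
```

With a 1 GiB limit and a ratio of 80, the gap between the two marks is a little over 3 MiB, which is how far the background worker tries to push usage down once it is triggered.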

Define a dedicated unbound workqueue for scheduling water mark reclaim
works.

Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>

alinux: vfs: add vfs_iocb_iter_[read|write] helper functions · 6011bef7

Committed by Jiufei Xue on Nov 14, 2019

This doesn't cause any behavior change and will be used by the overlay
async IO implementation.

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

alinux: blk-throttle: limit bios to fix amount of pages entering writeback prematurely · 0fd4aa6d

Committed by Xiaoguang Wang on Dec 28, 2018

Currently in blk_throtl_bio(), if a bio exceeds its throtl_grp's bps
or iops limit, it is queued on the throtl_grp's throtl_service_queue.
The mm subsystem then keeps submitting more pages even though the
underlying device cannot handle these io requests. This makes a large
number of pages enter writeback prematurely; if some process later writes
to one of these pages, it will wait for a long time.

I have done some tests: one process does buffered writes on a 1GB file,
with its blkcg max bps limit set to 10MB/s. I observe this:
	#cat /proc/meminfo  | grep -i back
	Writeback:        900024 kB
	WritebackTmp:          0 kB

I think this Writeback value is just too big. Indeed, many bios have been
queued in the throtl_grp's throtl_service_queue. If a process tries to
write the page of the last bio in this queue, it calls
wait_on_page_writeback(page), which must wait for the previous bios to
finish and takes a long time. We have also seen 120s hung task warnings on
our servers.

 INFO: task kworker/u128:0:30072 blocked for more than 120 seconds.
       Tainted: G            E 4.9.147-013.ali3000_015_test.alios7.x86_64 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 kworker/u128:0  D    0 30072      2 0x00000000
 Workqueue: writeback wb_workfn (flush-8:16)
  ffff882ddd066b40 0000000000000000 ffff882e5cad3400 ffff882fbe959e80
  ffff882fa50b1a00 ffffc9003a5a3768 ffffffff8173325d ffffc9003a5a3780
  00ff882e5cad3400 ffff882fbe959e80 ffffffff81360b49 ffff882e5cad3400
 Call Trace:
  [<ffffffff8173325d>] ? __schedule+0x23d/0x6d0
  [<ffffffff81360b49>] ? alloc_request_struct+0x19/0x20
  [<ffffffff81733726>] schedule+0x36/0x80
  [<ffffffff81736c56>] schedule_timeout+0x206/0x4b0
  [<ffffffff81036c69>] ? sched_clock+0x9/0x10
  [<ffffffff81363073>] ? get_request+0x403/0x810
  [<ffffffff8110ca10>] ? ktime_get+0x40/0xb0
  [<ffffffff81732f8a>] io_schedule_timeout+0xda/0x170
  [<ffffffff81733f90>] ? bit_wait+0x60/0x60
  [<ffffffff81733fab>] bit_wait_io+0x1b/0x60
  [<ffffffff81733b28>] __wait_on_bit+0x58/0x90
  [<ffffffff811b0d91>] ? find_get_pages_tag+0x161/0x2e0
  [<ffffffff811aff62>] wait_on_page_bit+0x82/0xa0
  [<ffffffff810d47f0>] ? wake_atomic_t_function+0x60/0x60
  [<ffffffffa02fc181>] mpage_prepare_extent_to_map+0x2d1/0x310 [ext4]
  [<ffffffff8121ff65>] ? kmem_cache_alloc+0x185/0x1a0
  [<ffffffffa0305a2f>] ? ext4_init_io_end+0x1f/0x40 [ext4]
  [<ffffffffa0300294>] ext4_writepages+0x404/0xef0 [ext4]
  [<ffffffff81508c64>] ? scsi_init_io+0x44/0x200
  [<ffffffff81398a0f>] ? fprop_fraction_percpu+0x2f/0x80
  [<ffffffff811c139e>] do_writepages+0x1e/0x30
  [<ffffffff8127c0f5>] __writeback_single_inode+0x45/0x320
  [<ffffffff8127c942>] writeback_sb_inodes+0x272/0x600
  [<ffffffff8127cf6b>] wb_writeback+0x10b/0x300
  [<ffffffff8127d884>] wb_workfn+0xb4/0x380
  [<ffffffff810b85e9>] ? try_to_wake_up+0x59/0x3e0
  [<ffffffff810a5759>] process_one_work+0x189/0x420
  [<ffffffff810a5a3e>] worker_thread+0x4e/0x4b0
  [<ffffffff810a59f0>] ? process_one_work+0x420/0x420
  [<ffffffff810ac026>] kthread+0xe6/0x100
  [<ffffffff810abf40>] ? kthread_park+0x60/0x60
  [<ffffffff81738499>] ret_from_fork+0x39/0x50

To fix this issue, we can simply limit throtl_service_queue's max queued
bios. Currently we limit it to the throtl_grp's bps_limit or iops limit; if
the queue still exceeds that, we just sleep for a while.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

alinux: blk-throttle: fix tg NULL pointer dereference · bc0cc360

Committed by Joseph Qi on Dec 07, 2017

io throtl stats call blkg_get at the beginning of throttling and then
blkg_put in the newly introduced bi_tg_end_io. This causes the blkg to be
freed if end_io is called twice, as in dm-thin, which saves the original
end_io first, then calls its overriding end_io and then the saved end_io.
After that, accessing the blkg is invalid and finally triggers a BUG:

[ 4417.235048] BUG: unable to handle kernel NULL pointer dereference at 00000000000001e0
[ 4417.236475] IP: [<ffffffff812e7c71>] throtl_update_dispatch_stats+0x21/0xb0
[ 4417.237865] PGD 98395067 PUD 362e1067 PMD 0
[ 4417.239232] Oops: 0000 [#1] SMP
......
[ 4417.274070] Call Trace:
[ 4417.275407]  [<ffffffff812ea93d>] blk_throtl_bio+0xfd/0x630
[ 4417.276760]  [<ffffffff810b3613>] ? wake_up_process+0x23/0x40
[ 4417.278079]  [<ffffffff81094c04>] ? wake_up_worker+0x24/0x30
[ 4417.279387]  [<ffffffff81095772>] ? insert_work+0x62/0xa0
[ 4417.280697]  [<ffffffff8116c2c7>] ? mempool_free_slab+0x17/0x20
[ 4417.282019]  [<ffffffff8116c6c9>] ? mempool_free+0x49/0x90
[ 4417.283326]  [<ffffffff812c9acf>] generic_make_request_checks+0x16f/0x360
[ 4417.284637]  [<ffffffffa0340d97>] ? thin_map+0x227/0x2c0 [dm_thin_pool]
[ 4417.285951]  [<ffffffff812c9ce7>] generic_make_request+0x27/0x130
[ 4417.287240]  [<ffffffffa0230b3d>] __map_bio+0xad/0x100 [dm_mod]
[ 4417.288503]  [<ffffffffa023257e>] __clone_and_map_data_bio+0x15e/0x240 [dm_mod]
[ 4417.289778]  [<ffffffffa02329ea>] __split_and_process_bio+0x38a/0x500 [dm_mod]
[ 4417.291062]  [<ffffffffa0232c91>] dm_make_request+0x131/0x1a0 [dm_mod]
[ 4417.292344]  [<ffffffff812c9da2>] generic_make_request+0xe2/0x130
[ 4417.293626]  [<ffffffff812c9e61>] submit_bio+0x71/0x150
[ 4417.294909]  [<ffffffff8121ab1d>] ? bio_alloc_bioset+0x20d/0x360
[ 4417.296195]  [<ffffffff81215acb>] _submit_bh+0x14b/0x220
[ 4417.297484]  [<ffffffff81215bb0>] submit_bh+0x10/0x20
[ 4417.298744]  [<ffffffffa016d8d8>] jbd2_journal_commit_transaction+0x6c8/0x19a0 [jbd2]
[ 4417.300014]  [<ffffffff810135b8>] ? __switch_to+0xf8/0x4c0
[ 4417.301268]  [<ffffffffa01731e9>] kjournald2+0xc9/0x270 [jbd2]
[ 4417.302524]  [<ffffffff810a0fd0>] ? wake_up_atomic_t+0x30/0x30
[ 4417.303753]  [<ffffffffa0173120>] ? commit_timeout+0x10/0x10 [jbd2]
[ 4417.304950]  [<ffffffff8109ffef>] kthread+0xcf/0xe0
[ 4417.306107]  [<ffffffff8109ff20>] ? kthread_create_on_node+0x140/0x140
[ 4417.307255]  [<ffffffff81647f18>] ret_from_fork+0x58/0x90
[ 4417.308349]  [<ffffffff8109ff20>] ? kthread_create_on_node+0x140/0x140
......

Now we introduce a new bio flag BIO_THROTL_STATED to make sure
blkg_get/put is only called once for the same bio.

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
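
The idea behind the flag in the blk-throttle fix above can be modeled as a guard around a reference count. This is a toy sketch, not the patch itself: only the flag name comes from the commit message, while the classes, the helper names, the bit position and the exact places where the flag is set and cleared are assumptions.

```python
BIO_THROTL_STATED = 1 << 0  # bit position is arbitrary in this model


class Blkg:
    def __init__(self):
        self.refcount = 1   # one reference held by the owner


class Bio:
    def __init__(self):
        self.flags = 0


def throtl_stat_start(bio, blkg):
    """Take a reference only if this bio has not been accounted yet."""
    if bio.flags & BIO_THROTL_STATED:
        return
    bio.flags |= BIO_THROTL_STATED
    blkg.refcount += 1


def throtl_stat_end(bio, blkg):
    """Drop the reference once, even if end_io runs more than once."""
    if not bio.flags & BIO_THROTL_STATED:
        return
    bio.flags &= ~BIO_THROTL_STATED
    blkg.refcount -= 1


blkg, bio = Blkg(), Bio()
throtl_stat_start(bio, blkg)
throtl_stat_end(bio, blkg)
throtl_stat_end(bio, blkg)  # second end_io, as in the dm-thin case
```

Without the guard, the second completion would drop the owner's reference as well, which corresponds to the premature free described in the commit message.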

alinux: blk-throttle: support io delay stats · dc61ad52

Committed by Joseph Qi on Dec 19, 2017

Add blkio.throttle.io_service_time and blkio.throttle.io_wait_time to
get per-cgroup io delay statistics.
io_service_time represents the time spent from leaving io throttle until
io completion, while io_wait_time represents the time spent on the
throttle queue.

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
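
The two statistics in the io delay stats commit above split one request's lifetime at the point where it leaves the throttle queue. A minimal sketch, assuming three hypothetical timestamps per request (the field names are not taken from the patch):

```python
from dataclasses import dataclass


@dataclass
class IoTimestamps:
    queued_ns: int      # bio entered the throttle queue
    dispatched_ns: int  # bio left throttling
    completed_ns: int   # io completion


def io_wait_time(ts: IoTimestamps) -> int:
    """Time spent sitting on the throttle queue."""
    return ts.dispatched_ns - ts.queued_ns


def io_service_time(ts: IoTimestamps) -> int:
    """Time from leaving the throttle until completion."""
    return ts.completed_ns - ts.dispatched_ns


ts = IoTimestamps(queued_ns=1_000, dispatched_ns=4_000, completed_ns=9_500)
```

Here the request waits 3,000 ns on the throttle queue and takes a further 5,500 ns to complete, so the two counters together cover the whole latency seen by the submitter.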

alinux: block: add counter to track io request's d2c time · ba2896ac

Committed by Xiaoguang Wang on Jun 19, 2019

The await value reported by iostat is not good enough: it is somewhat
rough and cannot show a request's latency on the device driver's side.

Here we add a new counter to track an io request's d2c time. With this
patch, we can also easily extend iostat to show this value.

Note:
I have checked how iostat is implemented: it just reads the fields it
needs, so iostat won't be affected by this change, and neither will tsar.

Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

alinux: jbd2: add proc entry to control whether doing buffer copy-out · 1ced8a5c

Committed by Xiaoguang Wang on Nov 15, 2018

When jbd2 tries to get write access to a buffer that is under writeback
with the BH_Shadow flag set, jbd2 waits until the buffer has been written
to disk. Sometimes this wait can be very long, especially when the disk is
almost full.

Add a proc entry "force_copy": if its value is not zero, jbd2 will
always do meta buffer copy-out, which eliminates the unnecessary
waiting time here and reduces long tail latency for buffered writes.

I constructed the test case below:

$cat offline.fio
; fio-rand-RW.job for fiotest

[global]
name=fio-rand-RW
filename=fio-rand-RW
rw=randrw
rwmixread=60
rwmixwrite=40
bs=4K
direct=0
numjobs=4
time_based=1
runtime=900

[file1]
size=60G
ioengine=sync
iodepth=16

$cat online.fio
; fio-seq-write.job for fiotest

[global]
name=fio-seq-write
filename=fio-seq-write
rw=write
bs=256K
direct=0
numjobs=1
time_based=1
runtime=60

[file1]
rate=50m
size=10G
ioengine=sync
iodepth=16

With this patch:
$cat /proc/fs/jbd2/sda5-8/force_copy
0

online fio almost always gets such long tail latency:

Jobs: 1 (f=1), 0B/s-0B/s: [W(1)][100.0%][w=50.0MiB/s][w=200 IOPS][eta 00m:00s]
file1: (groupid=0, jobs=1): err= 0: pid=17855: Thu Nov 15 09:45:57 2018
  write: IOPS=200, BW=50.0MiB/s (52.4MB/s)(3000MiB/60001msec)
    clat (usec): min=135, max=4086.6k, avg=867.21, stdev=50338.22
     lat (usec): min=139, max=4086.6k, avg=871.16, stdev=50338.22
    clat percentiles (usec):
     |  1.00th=[    141],  5.00th=[    143], 10.00th=[    145],
     | 20.00th=[    147], 30.00th=[    147], 40.00th=[    149],
     | 50.00th=[    149], 60.00th=[    151], 70.00th=[    153],
     | 80.00th=[    155], 90.00th=[    159], 95.00th=[    163],
     | 99.00th=[    255], 99.50th=[    273], 99.90th=[    429],
     | 99.95th=[    441], 99.99th=[3640656]

$cat /proc/fs/jbd2/sda5-8/force_copy
1

online fio latency is much better.

Jobs: 1 (f=1), 0B/s-0B/s: [W(1)][100.0%][w=50.0MiB/s][w=200 IOPS][eta 00m:00s]
file1: (groupid=0, jobs=1): err= 0: pid=8084: Thu Nov 15 09:31:15 2018
  write: IOPS=200, BW=50.0MiB/s (52.4MB/s)(3000MiB/60001msec)
    clat (usec): min=137, max=545, avg=151.35, stdev=16.22
     lat (usec): min=140, max=548, avg=155.31, stdev=16.65
    clat percentiles (usec):
     |  1.00th=[  143],  5.00th=[  145], 10.00th=[  145], 20.00th=[  147],
     | 30.00th=[  147], 40.00th=[  147], 50.00th=[  149], 60.00th=[  149],
     | 70.00th=[  151], 80.00th=[  155], 90.00th=[  157], 95.00th=[  161],
     | 99.00th=[  239], 99.50th=[  269], 99.90th=[  420], 99.95th=[  429],
     | 99.99th=[  537]

As to the cost: because we always need to copy the meta buffer, this
consumes minor cpu time and some memory (at most 32MB for a 128MB journal
size).

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

alinux: fs,ext4: remove projid limit when create hard link · 08e6d768

Committed by zhangliguang on Dec 27, 2018

This is a temporary workaround to avoid the limitation when
creating a hard link across two projids.

Signed-off-by: zhangliguang <zhangliguang@linux.alibaba.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

alinux: jbd2: create jbd2-ckpt thread for journal checkpoint · 3999cdd9

Committed by Joseph Qi on Mar 07, 2018

This does jbd2 checkpointing in a dedicated kernel thread, so that
checkpointing won't be under io throttle control.

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: zhangliguang <zhangliguang@linux.alibaba.com>
Reviewed-by: Baoyou Xie <baoyou.xie@linux.alibaba.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

acpi/nfit, device-dax: Identify differentiated memory with a unique numa-node · a0a4e71f

Committed by Dan Williams on Aug 24, 2019

commit 8fc5c73554db0ac18c0c6ac5b2099ab917f83bdf upstream

Persistent memory, as described by the ACPI NFIT (NVDIMM Firmware
Interface Table), is the first known instance of a memory range
described by a unique "target" proximity domain. "Initiator" and
"target" proximity domains are an approach that the ACPI HMAT
(Heterogeneous Memory Attributes Table) uses to describe the unique
performance properties of a memory range relative to a given initiator
(e.g. CPU or DMA device).

Currently the numa-node for a /dev/pmemX block-device or /dev/daxX.Y
char-device follows the traditional notion of 'numa-node' where the
attribute conveys the closest online numa-node. That numa-node attribute
is useful for cpu-binding and memory-binding processes *near* the
device. However, when the memory range backing a 'pmem', or 'dax' device
is onlined (memory hot-add) the memory-only-numa-node representing that
address needs to be differentiated from the set of online nodes. In
other words, the numa-node association of the device depends on whether
you can bind processes *near* the cpu-numa-node in the offline
device-case, or bind processes *on* the memory-range directly after the
backing address range is onlined.

Allow for the case that platform firmware describes persistent memory
with a unique proximity domain, i.e. when it is distinct from the
proximity of DRAM and CPUs that are on the same socket. Plumb the Linux
numa-node translation of that proximity through the libnvdimm region
device to namespaces that are in device-dax mode. With this in place the
proposed kmem driver [1] can optionally discover a unique numa-node
number for the address range as it transitions the memory from an
offline state managed by a device-driver to an online memory range
managed by the core-mm.

[1]: https://lore.kernel.org/lkml/20181022201317.8558C1D8@viggo.jf.intel.com

Reported-by: Fan Du <fan.du@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Oliver O'Halloran" <oohall@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
[yshi: Removed PowerPC stuff which is not applicable to 4.19]
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>

ICX: ACPI/ADXL: Add address translation interface using an ACPI DSM · 353d73cb

Committed by Tony Luck on Oct 15, 2018

commit 4cf841e398503990df640f7a7c5b2ea56f11c08c upstream.

Some new Intel servers provide an interface so that the OS can ask the
BIOS to translate a system physical address to a memory address (socket,
memory controller, channel, rank, dimm, etc.). This is useful for EDAC
drivers that want to take the address of an error reported in a machine
check bank and let the user know which DIMM may need to be replaced.

Specification for this interface is available at:

  https://cdrdv2.intel.com/v1/dl/getContent/603354

 [ Based on earlier code by Qiuxu Zhuo <qiuxu.zhuo@intel.com>. ]

 [ bp: Make the first pr_info() in adxl_init() pr_debug() so that it
   doesn't pollute every dmesg. ]

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
CC: Len Brown <lenb@kernel.org>
CC: linux-acpi@vger.kernel.org
CC: linux-edac@vger.kernel.org
Link: http://lkml.kernel.org/r/20181015202620.23610-1-tony.luck@intel.com
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

ICX: intel_rapl: support two power limits for every RAPL domain · e3084076

Committed by Zhang Rui on Jul 10, 2019

commit 0c2ddedd8bcb88c4100acb9e0fc5ac8752d09501 upstream.

RAPL MSR interface supports 2 power limits for package domain, and 1 power
limit for other domains, while RAPL MMIO interface supports 2 power limits
for both package and dram domains.
And when 2 power limits are supported, the FW_LOCK bit is in bit 63 of the
register, instead of bit 31.
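
The lock-bit rule can be written down directly. This is a sketch with hypothetical helper names; only the bit positions (63 with two power limits, 31 with one) come from the text above.

```python
def fw_lock_mask(power_limits: int) -> int:
    """Mask of the FW_LOCK bit for a domain with 1 or 2 power limits."""
    if power_limits not in (1, 2):
        raise ValueError("a RAPL domain has either 1 or 2 power limits")
    return 1 << (63 if power_limits == 2 else 31)


def is_fw_locked(reg_value: int, power_limits: int) -> bool:
    """Check the lock bit at the position that matches the domain."""
    return bool(reg_value & fw_lock_mask(power_limits))
```

Testing bit 31 on a register that carries two power limits would read part of the second limit's field instead of the lock bit, which is why the position has to follow the number of limits.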

Remove the assumption that only the package domain supports 2 power limits,
and allow the RAPL interface driver to specify the number of power limits
supported for every single RAPL domain it owns.

Reviewed-by: Pandruvada, Srinivas <srinivas.pandruvada@intel.com>
Tested-by: Pandruvada, Srinivas <srinivas.pandruvada@intel.com>
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

ICX: intel_rapl: support 64 bit register · 2421ad36

Committed by Zhang Rui on Jul 10, 2019

commit d978e755aabe215cb67bf713e103ed3916ec306d upstream.

The RAPL MMIO interface uses 64-bit registers, so force the use of 64-bit
registers for all the RAPL code.

Reviewed-by: Pandruvada, Srinivas <srinivas.pandruvada@intel.com>
Tested-by: Pandruvada, Srinivas <srinivas.pandruvada@intel.com>
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

ICX: intel_rapl: abstract RAPL common code · d5fc42c5

Committed by Zhang Rui on Jul 10, 2019

commit 3382388d714891fc0f575926189f33d22e7c960b upstream.

Split intel_rapl.c into intel_rapl_common.c and intel_rapl_msr.c, where
intel_rapl_common.c contains the common code that can be used by both the
MSR and MMIO interfaces.
intel_rapl_msr.c contains the implementation of the RAPL MSR interface.

Reviewed-by: Pandruvada, Srinivas <srinivas.pandruvada@intel.com>
Tested-by: Pandruvada, Srinivas <srinivas.pandruvada@intel.com>
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

ICX: intel_rapl: abstract register access operations · e2590cd6

Committed by Zhang Rui on Jul 10, 2019

commit beea8df821d928e7755917da6c1e45d6afde5148 upstream.

MSR and MMIO RAPL interfaces have different ways to access the registers,
thus in order to abstract the register access operations, two callbacks,
.read_raw()/.write_raw(), are introduced, and they should be implemented by
the MSR RAPL and MMIO RAPL interface drivers respectively.
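
The callback split can be illustrated with a small model. Only the callback names read_raw and write_raw come from the text above; the container class, the fake backend, the argument lists and the sample register number are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class RaplInterface:
    """Hypothetical holder for the two backend-specific callbacks."""
    read_raw: Callable[[int, int], int]
    write_raw: Callable[[int, int, int], None]


def make_fake_msr_interface(store: Dict[Tuple[int, int], int]) -> RaplInterface:
    """Backend stand-in: registers live in a dict keyed by (cpu, reg)."""
    def read_raw(cpu: int, reg: int) -> int:
        return store.get((cpu, reg), 0)

    def write_raw(cpu: int, reg: int, value: int) -> None:
        store[(cpu, reg)] = value

    return RaplInterface(read_raw, write_raw)


def set_bits(iface: RaplInterface, cpu: int, reg: int, mask: int) -> None:
    """Common code: read-modify-write without knowing the backend."""
    iface.write_raw(cpu, reg, iface.read_raw(cpu, reg) | mask)


store = {}
msr = make_fake_msr_interface(store)
set_bits(msr, cpu=0, reg=0x610, mask=0x8000)  # 0x610 is just a sample key
```

A second backend would only need to supply its own pair of callbacks; set_bits and any other common code stay unchanged.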

This patch implements them for the MSR I/F only.

Reviewed-by: Pandruvada, Srinivas <srinivas.pandruvada@intel.com>
Tested-by: Pandruvada, Srinivas <srinivas.pandruvada@intel.com>
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

ICX: intel_rapl: abstract register address · 4d3d37d1

Committed by Zhang Rui on Jul 10, 2019

commit 7fde2712a7adab721eaabafbd8ff93dff3262d35 upstream.

MSR and MMIO RAPL interfaces have different sets of registers, thus the
RAPL register address should be obtained from an interface-specific
structure, i.e. struct rapl_if_private, instead.

Reviewed-by: Pandruvada, Srinivas <srinivas.pandruvada@intel.com>
Tested-by: Pandruvada, Srinivas <srinivas.pandruvada@intel.com>
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

ICX: intel_rapl: introduce struct rapl_if_private · a84bf2ff

Committed by Zhang Rui on Jul 10, 2019

commit 7ebf8eff63b4f349e7b2ded6aa5036d94bdf94b9 upstream.

Introduce a new structure, rapl_if_private, to save the private data
for different RAPL interfaces.

Reviewed-by: Pandruvada, Srinivas <srinivas.pandruvada@intel.com>
Tested-by: Pandruvada, Srinivas <srinivas.pandruvada@intel.com>
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

ICX: intel_rapl: introduce intel_rapl.h · 60b08413

Committed by Zhang Rui on Jul 10, 2019

commit ff956826a403f5cf189978d5ff6b3eb53aa11610 upstream.

Create a new header file for the common definitions that might be used
by different RAPL interfaces.

Reviewed-by: Pandruvada, Srinivas <srinivas.pandruvada@intel.com>
Tested-by: Pandruvada, Srinivas <srinivas.pandruvada@intel.com>
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

ICX: cpu/topology: Export die_id · 747b36fe

Committed by Len Brown on May 13, 2019

commit 0e344d8c709fe01d882fc0fb5452bedfe5eba67a upstream.

Export die_id in cpu topology, for the benefit of hardware that has
multiple-die/package.

Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-doc@vger.kernel.org
Link: https://lkml.kernel.org/r/e7d1caaf4fbd24ee40db6d557ab28d7d83298900.1557769318.git.len.brown@intel.com
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

ICX: perf/x86: Disable extended registers for non-supported PMUs · 43ccd37f

Committed by Kan Liang on May 28, 2019

commit e321d02db87af7840da29ef833a2a71fc0eab198 upstream.

The perf fuzzer caused a Skylake machine to crash:

[ 9680.085831] Call Trace:
[ 9680.088301]  <IRQ>
[ 9680.090363]  perf_output_sample_regs+0x43/0xa0
[ 9680.094928]  perf_output_sample+0x3aa/0x7a0
[ 9680.099181]  perf_event_output_forward+0x53/0x80
[ 9680.103917]  __perf_event_overflow+0x52/0xf0
[ 9680.108266]  ? perf_trace_run_bpf_submit+0xc0/0xc0
[ 9680.113108]  perf_swevent_hrtimer+0xe2/0x150
[ 9680.117475]  ? check_preempt_wakeup+0x181/0x230
[ 9680.122091]  ? check_preempt_curr+0x62/0x90
[ 9680.126361]  ? ttwu_do_wakeup+0x19/0x140
[ 9680.130355]  ? try_to_wake_up+0x54/0x460
[ 9680.134366]  ? reweight_entity+0x15b/0x1a0
[ 9680.138559]  ? __queue_work+0x103/0x3f0
[ 9680.142472]  ? update_dl_rq_load_avg+0x1cd/0x270
[ 9680.147194]  ? timerqueue_del+0x1e/0x40
[ 9680.151092]  ? __remove_hrtimer+0x35/0x70
[ 9680.155191]  __hrtimer_run_queues+0x100/0x280
[ 9680.159658]  hrtimer_interrupt+0x100/0x220
[ 9680.163835]  smp_apic_timer_interrupt+0x6a/0x140
[ 9680.168555]  apic_timer_interrupt+0xf/0x20
[ 9680.172756]  </IRQ>

The XMM registers can only be collected by PEBS hardware events on the
platforms with PEBS baseline support, e.g. Icelake, not software/probe
events.

Add the capability flag PERF_PMU_CAP_EXTENDED_REGS to indicate that a
PMU supports extended registers. For X86, the extended registers are
XMM registers.

Add has_extended_regs() to check whether extended registers are requested.

The generic code defines the mask of extended registers as 0 if arch
headers haven't overridden it.
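
The check described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: the struct, the capability bit value, and the register mask are hypothetical stand-ins, not the kernel's definitions, and the rejection logic is a guess at how a core-side check could combine the capability flag with the register masks.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical values for illustration only; not the kernel's definitions. */
#define SKETCH_PMU_CAP_EXTENDED_REGS (1u << 0)
/* Pretend that register indices 32..63 are the extended (XMM) registers. */
#define SKETCH_REG_EXTENDED_MASK     (~((1ULL << 32) - 1))

/* Minimal stand-in for the parts of an event that the check looks at. */
struct sketch_event {
	uint64_t sample_regs_user;      /* registers sampled at user level */
	uint64_t sample_regs_intr;      /* registers sampled at interrupt  */
	unsigned int pmu_capabilities;  /* capability bits of the owning PMU */
};

/* True if either register mask asks for any extended register. */
bool sketch_has_extended_regs(const struct sketch_event *e)
{
	return (e->sample_regs_user & SKETCH_REG_EXTENDED_MASK) ||
	       (e->sample_regs_intr & SKETCH_REG_EXTENDED_MASK);
}

/* Returns 0 if the event is acceptable, -1 if it should be rejected. */
int sketch_event_init_check(const struct sketch_event *e)
{
	if (sketch_has_extended_regs(e) &&
	    !(e->pmu_capabilities & SKETCH_PMU_CAP_EXTENDED_REGS))
		return -1;
	return 0;
}
```

In this model a software event that requests an extended register is rejected up front instead of reaching the sample output path.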

Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 878068ea270e ("perf/x86: Support outputting XMM registers")
Link: https://lkml.kernel.org/r/1559081314-9714-1-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Shen, Xiaochen <xiaochen.shen@intel.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

ICX: perf/core: Add PERF_PMU_CAP_NO_EXCLUDE for exclusion incapable PMUs · f9bd6f63

Committed by Andrew Murray on Jan 10, 2019

commit cc6795aeffea0a80d0baf9ad31ba926a6c42cef5 upstream.

Many PMU drivers do not have the capability to exclude counting events
that occur in specific contexts such as idle, kernel, guest, etc. These
drivers indicate this by returning an error in their event_init upon
testing the event's attribute flags. This approach is error-prone and
often inconsistent.

Let's instead allow PMU drivers to advertise their inability to exclude
based on context via a new capability: PERF_PMU_CAP_NO_EXCLUDE. This
allows the perf core to reject requests for exclusion events where
there is no support in the PMU.
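
A minimal userspace sketch of this idea, assuming a hypothetical capability bit value and a cut-down attribute struct (the names prefixed with `sketch_` are illustrative and not kernel symbols): the core tests the exclusion flags once and rejects the event when the PMU has advertised that it cannot exclude.

```c
#include <stdbool.h>

/* Hypothetical capability bit; the real value lives in the kernel headers. */
#define SKETCH_PMU_CAP_NO_EXCLUDE (1u << 0)

/* Cut-down stand-in for the exclusion bits of a perf event attribute. */
struct sketch_attr {
	unsigned int exclude_user   : 1;
	unsigned int exclude_kernel : 1;
	unsigned int exclude_hv     : 1;
	unsigned int exclude_idle   : 1;
	unsigned int exclude_host   : 1;
	unsigned int exclude_guest  : 1;
};

/* True if the caller asked to exclude counting in any context. */
bool sketch_has_any_exclude_flag(const struct sketch_attr *attr)
{
	return attr->exclude_user || attr->exclude_kernel || attr->exclude_hv ||
	       attr->exclude_idle || attr->exclude_host || attr->exclude_guest;
}

/* Returns 0 when the event may be created, -1 when it should be rejected. */
int sketch_core_check(unsigned int pmu_capabilities,
		      const struct sketch_attr *attr)
{
	if ((pmu_capabilities & SKETCH_PMU_CAP_NO_EXCLUDE) &&
	    sketch_has_any_exclude_flag(attr))
		return -1;
	return 0;
}
```

Centralizing the test this way means individual drivers no longer need their own, possibly inconsistent, flag checks in event_init.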

Signed-off-by: Andrew Murray <andrew.murray@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Sascha Hauer <s.hauer@pengutronix.de>
Cc: Shawn Guo <shawnguo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: robin.murphy@arm.com
Cc: suzuki.poulose@arm.com
Link: https://lkml.kernel.org/r/1547128414-50693-4-git-send-email-andrew.murray@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Shen, Xiaochen <xiaochen.shen@intel.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

ICX: perf/core: Add function to test for event exclusion flags · 15c5b822

Committed by Andrew Murray on Jan 10, 2019

commit 486efe9f8e30bac1e236f867df164f4966f3e207 upstream.

Add a function that tests if any of the perf event exclusion flags
are set on a given event.

Signed-off-by: Andrew Murray <andrew.murray@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Sascha Hauer <s.hauer@pengutronix.de>
Cc: Shawn Guo <shawnguo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: robin.murphy@arm.com
Cc: suzuki.poulose@arm.com
Link: https://lkml.kernel.org/r/1547128414-50693-3-git-send-email-andrew.murray@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Shen, Xiaochen <xiaochen.shen@intel.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

ICX: perf/x86/intel/pt: Remove software double buffering PMU capability · 70486b49

Committed by Alexander Shishkin on May 03, 2019

commit 72e830f68428ab9ea9eca65d160795f4e02cecfc upstream.

Now that all AUX allocations are high-order by default, the software
double buffering PMU capability no longer makes sense, so get rid
of it. In case some PMUs choose to opt out, we can re-introduce it.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: adrian.hunter@intel.com
Link: http://lkml.kernel.org/r/20190503085536.24119-3-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Shen, Xiaochen <xiaochen.shen@intel.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

ICX: node: Add memory-side caching attributes · 7731d5c9

Committed by Keith Busch on Mar 11, 2019

commit acc02a109b0497e917c83f986a89c51e47d0022c upstream.

System memory may have caches to help improve access speed to frequently
requested address ranges. While the system provided cache is transparent
to the software accessing these memory ranges, applications can optimize
their own access based on cache attributes.

Provide a new API for the kernel to register these memory-side caches
under the memory node that provides it.

The new sysfs representation is modeled from the existing cpu cacheinfo
attributes, as seen from /sys/devices/system/cpu/<cpu>/cache/.  Unlike CPU
cacheinfo though, the node cache level is reported from the view of the
memory. A higher level number is nearer to the CPU, while lower levels
are closer to the last level memory.

The exported attributes are the cache size, the line size, associativity
indexing, and write-back policy. Also add the attributes for the system
memory caches to the sysfs stable documentation.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Brice Goglin <Brice.Goglin@inria.fr>
Tested-by: Brice Goglin <Brice.Goglin@inria.fr>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

ICX: node: Add heterogenous memory access attributes · d41f7984

Committed by Keith Busch on Mar 11, 2019

commit e1cf33aafb8462c7d0a0e6349925870316f040ee upstream.

Heterogeneous memory systems provide memory nodes with different latency
and bandwidth performance attributes. Provide a new kernel interface
for subsystems to register the attributes under the memory target
node's initiator access class. If the system provides this information,
applications may query these attributes when deciding which node to
request memory from.

The following example shows the new sysfs hierarchy for a node exporting
performance attributes:

  # tree -P "read*|write*" /sys/devices/system/node/nodeY/accessZ/initiators/
  /sys/devices/system/node/nodeY/accessZ/initiators/
  |-- read_bandwidth
  |-- read_latency
  |-- write_bandwidth
  `-- write_latency

The bandwidth is exported as MB/s and latency is reported in
nanoseconds. The values are taken from the platform as reported by the
manufacturer.
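
As a rough illustration of how an application might query one of these attributes, the following C sketch reads a single unsigned integer from a sysfs-style text file. The helper name `sketch_read_ulong_attr` is hypothetical, and the assumption that each attribute file contains one decimal number is not confirmed by the text above.

```c
#include <stdio.h>

/*
 * Read one unsigned decimal value from a sysfs-style attribute file.
 * Returns 0 on success and stores the value in *out; returns -1 if the
 * file cannot be opened or does not start with a number.
 */
int sketch_read_ulong_attr(const char *path, unsigned long *out)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%lu", out) == 1);
	fclose(f);
	return ok ? 0 : -1;
}
```

A caller would pass a path such as /sys/devices/system/node/node0/access0/initiators/read_bandwidth, where node 0 and access class 0 are placeholders for nodeY and accessZ, and treat a -1 return as "attribute not provided by this platform".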

Memory accesses from an initiator node that is not one of the memory's
access "Z" initiator nodes linked in the same directory may observe
different performance than reported here. When a subsystem makes use
of this interface, initiators of a different access number may not have
the same performance relative to initiators in other access numbers, or
omitted from any access class's initiators.

Descriptions for memory access initiator performance access attributes
are added to sysfs stable documentation.

Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Brice Goglin <Brice.Goglin@inria.fr>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

ICX: node: Link memory nodes to their compute nodes · d40877b4

Committed by Keith Busch on Mar 11, 2019

commit 08d9dbe72b1f899468b2b34f9309e88a84f440f2 upstream.

Systems may be constructed with various specialized nodes. Some nodes
may provide memory, some provide compute devices that access and use
that memory, and others may provide both. Nodes that provide memory are
referred to as memory targets, and nodes that can initiate memory access
are referred to as memory initiators.

Memory targets will often have varying access characteristics from
different initiators, and platforms may have ways to express those
relationships. In preparation for these systems, provide interfaces for
the kernel to export the memory relationships among different nodes' memory
targets and their initiators with symlinks to each other.

If a system provides access locality for each initiator-target pair, nodes
may be grouped into ranked access classes relative to other nodes. The
new interface allows a subsystem to register relationships of varying
classes if available and desired to be exported.

A memory initiator may have multiple memory targets in the same access
class. The target memory's initiators in a given class indicate that the
nodes' access characteristics share the same performance relative to other
linked initiator nodes. The targets within an initiator's access class,
though, do not necessarily perform the same as each other.

A memory target node may have multiple memory initiators. All linked
initiators in a target's class have the same access characteristics to
that target.

The following example shows the nodes' new sysfs hierarchy for a memory
target node 'Y' with access class 0 from initiator node 'X':

  # symlinks -v /sys/devices/system/node/nodeX/access0/
  relative: /sys/devices/system/node/nodeX/access0/targets/nodeY -> ../../nodeY

  # symlinks -v /sys/devices/system/node/nodeY/access0/
  relative: /sys/devices/system/node/nodeY/access0/initiators/nodeX -> ../../nodeX

The new attributes are added to the sysfs stable documentation.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Brice Goglin <Brice.Goglin@inria.fr>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

ICX: acpi: Add HMAT to generic parsing tables · b34bc5db

Committed by Keith Busch on Mar 11, 2019

commit 3bc0e8eb179deebf1c06f5c4261d362c24b26ce1 upstream.

The Heterogeneous Memory Attribute Table (HMAT) header has different
field lengths than the existing parsing uses. Add the HMAT type to the
parsing rules so it may be generically parsed.

Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Tested-by: Brice Goglin <Brice.Goglin@inria.fr>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

ICX: acpi: Create subtable parsing infrastructure · c9e676ba

Committed by Keith Busch on Mar 11, 2019

commit 60574d1e05b094d222162260dd9cac49f4d0996a upstream.

Parsing entries in an ACPI table had assumed a generic header
structure. There is no standard ACPI header, though, so less common
layouts with different field sizes required custom parsers to go through
their subtable entry list.

Create the infrastructure for adding different table types so that the
code parsing the entries array can be reused for all ACPI system tables
and the common code doesn't need to be duplicated.

Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Tested-by: Brice Goglin <Brice.Goglin@inria.fr>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

ICX: PCI: Add support for Immediate Readiness · aa7729ab

Committed by Felipe Balbi on Sep 07, 2018

commit d6112f8def514e019658bcc9b57d53acdb71ca3f upstream.

PCIe r4.0, sec 7.5.1.1.4 defines a new bit in the Status Register:

  Immediate Readiness – This optional bit, when Set, indicates the Function
  is guaranteed to be ready to successfully complete valid configuration
  accesses at any time following any reset that the host is capable of
  issuing Configuration Requests to this Function.

  When this bit is Set, for accesses to this Function, software is exempt
  from all requirements to delay configuration accesses following any type
  of reset, including but not limited to the timing requirements defined in
  Section 6.6.

This means that all delays after a Conventional or Function Reset can be
skipped.

This patch reads this bit and caches its value in a flag inside struct
pci_dev, so that we can later check whether to delay or to skip delays
after a reset.  While at it, also move the explicit msleep(100) call from
pcie_flr() and pci_af_flr() to pci_dev_wait().
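
The caching scheme can be modeled in a few lines of userspace C. This is a sketch under assumptions: `struct sketch_pci_dev` and both helper names are hypothetical, Immediate Readiness is modeled as bit 0 of the 16-bit Status register value passed in by the caller, and the 100 ms figure is taken from the msleep(100) mentioned above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed position of the Immediate Readiness bit in the Status register. */
#define SKETCH_STATUS_IMM_READY 0x01

/* Minimal stand-in for the per-device state that caches the bit. */
struct sketch_pci_dev {
	bool imm_ready;
};

/* Cache the Immediate Readiness bit from a Status register value. */
void sketch_probe_imm_ready(struct sketch_pci_dev *dev, uint16_t status)
{
	dev->imm_ready = (status & SKETCH_STATUS_IMM_READY) != 0;
}

/* Post-reset delay in milliseconds: skipped when the bit was set. */
unsigned int sketch_reset_delay_ms(const struct sketch_pci_dev *dev)
{
	return dev->imm_ready ? 0 : 100;
}
```

Reading the bit once at probe time keeps the reset path free of extra configuration reads.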

Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
[bhelgaas: rename PCI_STATUS_IMMEDIATE to PCI_STATUS_IMM_READY]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Lin Wang <lin.x.wang@intel.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

02 Jan, 2020 3 commits

x86/mm: Split vmalloc_sync_all() · 4797417e

Committed by Joerg Roedel on Nov 19, 2019

commit 1a0a610d5f056c6195ae9808962477a94d1d72c8 upstream.

Commit 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in
__purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in the
vunmap() code-path.  While this change was necessary to maintain
correctness on x86-32-pae kernels, it also adds additional cycles for
architectures that don't need it.

Specifically on x86-64 with CONFIG_VMAP_STACK=y some people reported
severe performance regressions in micro-benchmarks because it now also
calls the x86-64 implementation of vmalloc_sync_all() on vunmap().  But
the vmalloc_sync_all() implementation on x86-64 is only needed for newly
created mappings.

To avoid the unnecessary work on x86-64 and to gain the performance back,
split up vmalloc_sync_all() into two functions:

	* vmalloc_sync_mappings(), and
	* vmalloc_sync_unmappings()

Most call-sites to vmalloc_sync_all() only care about new mappings being
synchronized.  The only exception is the new call-site added in the above
mentioned commit.

Shile Zhang directed us to a report of an 80% regression in reaim
throughput.

Link: http://lkml.kernel.org/r/20191009124418.8286-1-joro@8bytes.org
Link: https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/
Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.com
Fixes: 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()")
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	[GHES]
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>

block: fix .bi_size overflow · 842ed2ab

Committed by Ming Lei on Nov 04, 2019

commit 79d08f89bb1b5c2c1ff90d9bb95497ab9e8aa7e0 upstream

'bio->bi_iter.bi_size' is an 'unsigned int', which can hold at most
4G - 1 bytes.

Before 07173c3ec276 ("block: enable multipage bvecs"), one bio could
include only a limited number of pages, usually at most 256, so the fs
bio size would not exceed 1M bytes most of the time.

Since we support multi-page bvecs, in theory one fs bio can really have
more than 1M pages added, especially in the case of hugepages, or big
writeback with too many dirty pages. There is then a chance that
.bi_size overflows.

Fix this issue by using bio_full() to check whether the added segment
may overflow .bi_size.
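
The overflow test itself is a standard unsigned-arithmetic guard. The sketch below models it in userspace with a hypothetical `struct sketch_bio`; it is not the kernel's bio_full(), only an illustration of checking "would adding len wrap the 32-bit size?" without performing the addition.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical, cut-down model of the fields the check needs. */
struct sketch_bio {
	unsigned int bi_size;     /* bytes already in the bio     */
	unsigned short vcnt;      /* segments currently used      */
	unsigned short max_vecs;  /* segments available in total  */
};

/*
 * True if no more data of length len can be added: either every
 * segment slot is used, or bi_size + len would exceed UINT_MAX.
 * The comparison is rearranged so that it cannot itself overflow.
 */
bool sketch_bio_full(const struct sketch_bio *bio, unsigned int len)
{
	if (bio->vcnt >= bio->max_vecs)
		return true;
	if (bio->bi_size > UINT_MAX - len)
		return true;
	return false;
}
```

Writing the test as `bi_size > UINT_MAX - len` rather than `bi_size + len > UINT_MAX` matters, because the latter wraps around and is never true for unsigned int.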

Signed-off-by: Hui Zhu <teawaterz@linux.alibaba.com>
Cc: Liu Yiding <liuyd.fnst@cn.fujitsu.com>
Cc: kernel test robot <rong.a.chen@intel.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: linux-xfs@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: 07173c3ec276 ("block: enable multipage bvecs")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

mm, swap: fix race between swapoff and some swap operations · 8afafd92

Committed by Huang Ying on Nov 04, 2019

commit eb085574a7526c4375965c5fbf7e5b0c19cdd336 upstream.

[zhuhui] Change SWP_VALID to (1 << 12).
[Joseph] call the new __swap_count() in swap_slot_free_notify()

When swapin is performed, after getting the swap entry information from
the page table, the system will swap in the swap entry without holding any
lock that prevents the swap device from being swapped off.  This may cause
a race like the one below:

CPU 1				CPU 2
-----				-----
				do_swap_page
				  swapin_readahead
				    __read_swap_cache_async
swapoff				      swapcache_prepare
  p->swap_map = NULL		        __swap_duplicate
					  p->swap_map[?] /* !!! NULL pointer access */

Because swapoff is usually done only at system shutdown, the race may
not hit many people in practice.  But it is still a race that needs to be
fixed.

To fix the race, get_swap_device() is added to check whether the specified
swap entry is valid in its swap device.  If so, it will keep the swap
entry valid by preventing the swap device from being swapped off until
put_swap_device() is called.

Because swapoff() is a very rare code path, to make the normal path run as
fast as possible, rcu_read_lock/unlock() and synchronize_rcu() are used
instead of a reference count to implement get/put_swap_device().  From
get_swap_device() to put_swap_device(), the RCU read side is locked, so
synchronize_rcu() in swapoff() will wait until put_swap_device() is
called.

In addition to the swap_map, cluster_info, etc. data structures in struct
swap_info_struct, the swap cache radix tree will be freed after swapoff,
so this patch also fixes the race between swap cache lookup and swapoff.

Races between some other swap cache usages and swapoff are fixed too via
calling synchronize_rcu() between clearing PageSwapCache() and freeing
swap cache data structure.

Another possible method to fix this is to use preempt_off() +
stop_machine() to prevent the swap device from being swapped off while its
data structures are being accessed.  The overhead in the hot path of both
methods is similar.  The advantages of the RCU-based method are:

1. stop_machine() may disturb the normal execution code path on other
   CPUs.

2. File cache uses RCU to protect its radix tree.  If a similar
   mechanism is used for swap cache too, it is easier to share code
   between them.

3. RCU is used to protect swap cache in total_swapcache_pages() and
   exit_swap_address_space() already.  The two mechanisms can be
   merged to simplify the logic.

Link: http://lkml.kernel.org/r/20190522015423.14418-1-ying.huang@intel.com
Fixes: 235b6217 ("mm/swap: add cluster lock")
Signed-off-by: Hui Zhu <teawaterz@linux.alibaba.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Not-nacked-by: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>

27 Dec, 2019 4 commits

blkcg: implement blk-iocost · e383d72b

Committed by Tejun Heo on Aug 28, 2019

commit 7caa47151ab2e644dd221f741ec7578d9532c9a3 upstream.

This patchset implements an IO cost model based work-conserving
proportional controller.

While io.latency provides the capability to comprehensively prioritize
and protect IOs depending on the cgroups, its protection is binary -
the lowest latency target cgroup which is suffering is protected at
the cost of all others.  In many use cases including stacking multiple
workload containers in a single system, it's necessary to distribute
IO capacity with better granularity.

One challenge of controlling IO resources is the lack of a trivially
observable cost metric.  The most common metrics - bandwidth and iops
- can be off by orders of magnitude depending on the device type and
IO pattern.  However, the cost isn't a complete mystery.  Given
several key attributes, we can make fairly reliable predictions on how
expensive a given stream of IOs would be, at least compared to other
IO patterns.

The function which determines the cost of a given IO is the IO cost
model for the device.  This controller distributes IO capacity based
on the costs estimated by such model.  The more accurate the cost
model the better, but the controller adapts based on IO completion
latency and as long as the relative costs across different IO
patterns are consistent and sensible, it'll adapt to the actual
performance of the device.

Currently, the only implemented cost model is a simple linear one with
a few sets of default parameters for different classes of device.
This covers most common devices reasonably well.  All the
infrastructure to tune and add different cost models is already in
place and a later patch will also allow using bpf progs for cost
models.
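
To make the notion of a linear cost model concrete, here is a small C sketch. The coefficient names and the exact formula (a per-IO base cost that depends on whether the IO is sequential or random, plus a per-page cost) are assumptions chosen for illustration; the parameters actually used by blk-iocost are described in the top comment of blk-iocost.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical coefficients of a linear model, in abstract cost units. */
struct sketch_lcoefs {
	uint64_t seq_io;   /* base cost of one sequential IO */
	uint64_t rand_io;  /* base cost of one random IO     */
	uint64_t page;     /* additional cost per page moved */
};

/* Estimated cost of one IO: base cost plus a size-proportional term. */
uint64_t sketch_io_cost(const struct sketch_lcoefs *c, bool sequential,
			uint64_t pages)
{
	uint64_t base = sequential ? c->seq_io : c->rand_io;

	return base + pages * c->page;
}
```

With coefficients such as {10, 100, 2}, an 8-page sequential IO costs 26 units and the same IO issued randomly costs 116, which captures the point that only the relative costs need to be sensible.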

Please see the top comment in blk-iocost.c and documentation for
more details.

v2: Rebased on top of RQ_ALLOC_TIME changes and folded in Rik's fix
    for a divide-by-zero bug in current_hweight() triggered by zero
    inuse_sum.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[Joseph: fix conflicts with ioc_rqos_throttle()]
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

cgroup: add cgroup_parse_float() · e4b4935f

Committed by Tejun Heo on May 13, 2019

commit a5e112e6424adb77d953eac20e6936b952fd6b32 upstream.

cgroup already uses floating point for percent[ile] numbers and there
are several controllers which want to take them as input.  Add a
generic parse helper to handle inputs.

Update the interface convention documentation about the use of
percentage numbers.  While at it, also clarify the default time unit.
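
Since kernel code avoids floating-point arithmetic, one plausible way to accept inputs like "12.34" is to parse them into a scaled integer. The sketch below is a userspace illustration of that fixed-point idea only; the function name, signature, truncation behavior, and lack of overflow checking are all assumptions and do not describe the actual cgroup_parse_float() helper.

```c
#include <ctype.h>
#include <stdint.h>

/*
 * Parse a decimal string such as "12.34" into an integer scaled by
 * 10^dec_shift (dec_shift = 2 gives 1234). Fractional digits beyond
 * dec_shift are truncated. Returns 0 on success, -1 on malformed input.
 * Overflow is not checked in this sketch.
 */
int sketch_parse_fixed(const char *s, unsigned int dec_shift, int64_t *out)
{
	int64_t val = 0;
	unsigned int frac_digits = 0;
	int neg = 0, seen_digit = 0;

	if (*s == '-') {
		neg = 1;
		s++;
	}
	/* Integer part. */
	while (isdigit((unsigned char)*s)) {
		val = val * 10 + (*s - '0');
		seen_digit = 1;
		s++;
	}
	/* Optional fractional part, kept up to dec_shift digits. */
	if (*s == '.') {
		s++;
		while (isdigit((unsigned char)*s)) {
			if (frac_digits < dec_shift) {
				val = val * 10 + (*s - '0');
				frac_digits++;
			}
			seen_digit = 1;
			s++;
		}
	}
	/* Require at least one digit and nothing but an optional newline after. */
	if (!seen_digit || (*s != '\0' && *s != '\n'))
		return -1;
	/* Pad with zeros when fewer than dec_shift fractional digits were given. */
	while (frac_digits < dec_shift) {
		val *= 10;
		frac_digits++;
	}
	*out = neg ? -val : val;
	return 0;
}
```

With a shift of 2, a percentage written as "50" becomes 5000 and "0.5" becomes 50, so a controller can compare values with plain integer arithmetic.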

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

blk-mq: add optional request->alloc_time_ns · 378f7c75

Committed by Tejun Heo on Aug 28, 2019

commit 6f816b4b746c2241540e537682d30d8e9997d674 upstream.

There are currently two start time timestamps - start_time_ns and
io_start_time_ns.  The former marks the request allocation and the
latter the issue-to-device time.  The planned io.weight controller needs
to measure the total time a bio takes to execute after it leaves rq_qos,
including the time spent waiting for a request to become available,
which can easily dominate on saturated devices.

This patch adds request->alloc_time_ns which records when the request
allocation attempt started.  As it isn't used for the usual stats,
make it optional behind CONFIG_BLK_RQ_ALLOC_TIME and
QUEUE_FLAG_RQ_ALLOC_TIME so that it can be compiled out when there are
no users and it's active only on queues which need it even when
compiled in.

v2: s/pre_start_time/alloc_time/ and add CONFIG_BLK_RQ_ALLOC_TIME
    gating as suggested by Jens.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

blkcg: separate blkcg_conf_get_disk() out of blkg_conf_prep() · cf569826

Committed by Tejun Heo on Aug 28, 2019

commit 015d254cb02b6d8eec4b3366274bf4672f9e0b64 upstream.

Separate out blkcg_conf_get_disk() so that it can be used by blkcg
policy interface file input parsers before the policy is actually
enabled.  This doesn't introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

openanolis / cloud-kernel synced successfully more than 1 year ago