提交 · 4667e926ac4a0ca09845ebd78fd96fbdca74d0be · openanolis / cloud-kernel

28 11月, 2019 11 次提交

alios: blk-throttle: fix tg NULL pointer dereference · 4667e926

由 Joseph Qi 提交于 12月 07, 2017

io throtl stats will blkg_get at the beginning of throttle and then
blkg_put at the new introduced bi_tg_end_io. This will cause blkg to be
freed if end_io is called twice like dm-thin, which will save origin
end_io first, and call its overwrite end_io and then the saved end_io.
After that, access blkg is invalid and finally BUG:

[ 4417.235048] BUG: unable to handle kernel NULL pointer dereference at 00000000000001e0
[ 4417.236475] IP: [<ffffffff812e7c71>] throtl_update_dispatch_stats+0x21/0xb0
[ 4417.237865] PGD 98395067 PUD 362e1067 PMD 0
[ 4417.239232] Oops: 0000 [#1] SMP
......
[ 4417.274070] Call Trace:
[ 4417.275407]  [<ffffffff812ea93d>] blk_throtl_bio+0xfd/0x630
[ 4417.276760]  [<ffffffff810b3613>] ? wake_up_process+0x23/0x40
[ 4417.278079]  [<ffffffff81094c04>] ? wake_up_worker+0x24/0x30
[ 4417.279387]  [<ffffffff81095772>] ? insert_work+0x62/0xa0
[ 4417.280697]  [<ffffffff8116c2c7>] ? mempool_free_slab+0x17/0x20
[ 4417.282019]  [<ffffffff8116c6c9>] ? mempool_free+0x49/0x90
[ 4417.283326]  [<ffffffff812c9acf>] generic_make_request_checks+0x16f/0x360
[ 4417.284637]  [<ffffffffa0340d97>] ? thin_map+0x227/0x2c0 [dm_thin_pool]
[ 4417.285951]  [<ffffffff812c9ce7>] generic_make_request+0x27/0x130
[ 4417.287240]  [<ffffffffa0230b3d>] __map_bio+0xad/0x100 [dm_mod]
[ 4417.288503]  [<ffffffffa023257e>] __clone_and_map_data_bio+0x15e/0x240 [dm_mod]
[ 4417.289778]  [<ffffffffa02329ea>] __split_and_process_bio+0x38a/0x500 [dm_mod]
[ 4417.291062]  [<ffffffffa0232c91>] dm_make_request+0x131/0x1a0 [dm_mod]
[ 4417.292344]  [<ffffffff812c9da2>] generic_make_request+0xe2/0x130
[ 4417.293626]  [<ffffffff812c9e61>] submit_bio+0x71/0x150
[ 4417.294909]  [<ffffffff8121ab1d>] ? bio_alloc_bioset+0x20d/0x360
[ 4417.296195]  [<ffffffff81215acb>] _submit_bh+0x14b/0x220
[ 4417.297484]  [<ffffffff81215bb0>] submit_bh+0x10/0x20
[ 4417.298744]  [<ffffffffa016d8d8>] jbd2_journal_commit_transaction+0x6c8/0x19a0 [jbd2]
[ 4417.300014]  [<ffffffff810135b8>] ? __switch_to+0xf8/0x4c0
[ 4417.301268]  [<ffffffffa01731e9>] kjournald2+0xc9/0x270 [jbd2]
[ 4417.302524]  [<ffffffff810a0fd0>] ? wake_up_atomic_t+0x30/0x30
[ 4417.303753]  [<ffffffffa0173120>] ? commit_timeout+0x10/0x10 [jbd2]
[ 4417.304950]  [<ffffffff8109ffef>] kthread+0xcf/0xe0
[ 4417.306107]  [<ffffffff8109ff20>] ? kthread_create_on_node+0x140/0x140
[ 4417.307255]  [<ffffffff81647f18>] ret_from_fork+0x58/0x90
[ 4417.308349]  [<ffffffff8109ff20>] ? kthread_create_on_node+0x140/0x140
......

Now we introduce a new bio flag BIO_THROTL_STATED to make sure
blkg_get/put only get called once for the same bio.
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

4667e926

alios: blk-throttle: support io delay stats · 65e6966a

由 Joseph Qi 提交于 12月 19, 2017

Add blkio.throttle.io_service_time and blkio.throttle.io_wait_time to
get per-cgroup io delay statistics.
io_service_time represents the time spent after io throttle to io
completion, while io_wait_time represents the time spent on throttle
queue.
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

65e6966a

alios: nvme-pci: Disable dicard zero-out functionality on Intel's P3600 NVMe disk drive · d79c6eda

由 Wenwei Tao 提交于 12月 13, 2017

We found huge performance lost on below particular Intel's disk drive
when discard zeroout functionality is enabled on it. The issue was
found when we have ext4 filesystem mounted on the disk drive and
started regular FIO testing. With it disabled, we don't observe
performance lost any more.

81:00.0 Non-Volatile memory controller: Intel Corporation \
             PCIe Data Center SSD (rev 01)

This imposes to disable the discard zero-out functionality on above
disk drive in order to regain the high performance that NVMe disk
driver supposes to provide.

Differential Revision: https://aone.alibaba-inc.com/code/D377540Signed-off-by: NWenwei Tao <wenwei.tao@linux.alibaba.com>
Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

d79c6eda

alios: memcg: Point wb to root memcg/blkcg when offlining to avoid zombie · 8514dbc7

由 Xunlei Pang 提交于 4月 03, 2019

After turning off the memcg kmem charging, we still suffer
from various zombie memcg problems on production environment
because of its non-zero reference count from both page caches
and per-memcg writeback related structure(bdi_writeback takes
a reference).

After we reclaimed all the page caches of the zombie memcg,
it still can't be dropped due to its bdi_writeback.

bdi_writeback is further referenced by the inodes of files,
so the memcg can't be truely released until the inodes are
destroyed afterwards which is quite unlikely in short term.

When memcg is offlining, change it's bdi_writeback to root,
and call css_put to formally release it. We've tested on
product environment, it yields pretty good effect.

Ditto for wb_blkcg_offline().
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

8514dbc7

alios: block: add counter to track io request's d2c time · 07232d74

由 Xiaoguang Wang 提交于 6月 19, 2019

Indeed tool iostat's await is not good enough, which is somewhat sketchy
and could not show request's latency on device driver's side.

Here we add a new counter to track io request's d2c time, also with this
patch, we can extend iostat to show this value easily.

Note:
I had checked how iostat is implemented, it just reads fields it needs,
so iostat won't be affected by this change, so does tsar.
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

07232d74

alios: fuse: add sysfs api to flush processing queue requests · 2292db66

由 Ma Jie Yue 提交于 3月 26, 2019

The failover of fuse userspace daemon will reuse the existing fuse conn,
without unmounting it, during daemon crashing and recovery procedure.
But some requests might be in process in the daemon before sending out reply,
when the crash happens. This will stuck the application since it will
never get the reply after the failover.

We add the sysfs api to flush these requests, after the daemon crash, before
recovery. It is easy to reproduce the issue in the fuse userspace daemon,
just exit after receiving the request and before sending the reply back.
The application will hang up in some read/write operation, before
echo 1 > /sys/fs/fuse/connection/xxx/flush. The flush operation will make
the io fail and return the error to the application.
Signed-off-by: NMa Jie Yue <majieyue@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

2292db66

alios: jbd2: add proc entry to control whether doing buffer copy-out · ac452d09

由 Xiaoguang Wang 提交于 11月 15, 2018

When jbd2 tries to get write access to one buffer, and if this buffer
is under writeback with BH_Shadow flag, jbd2 will wait until this buffer
has been written to disk, but sometimes the time taken to wait may be
much long, especially disk capacity is almost full.

Here add a proc entry "force-copy", if its value is not zero, jbd2 will
always do meta buffer copy-cout, then we can eliminate the unnecessary
wating time here, and reduce long tail latency for buffered-write.

I construct such test case below:

$cat offline.fio
; fio-rand-RW.job for fiotest

[global]
name=fio-rand-RW
filename=fio-rand-RW
rw=randrw
rwmixread=60
rwmixwrite=40
bs=4K
direct=0
numjobs=4
time_based=1
runtime=900

[file1]
size=60G
ioengine=sync
iodepth=16

$cat online.fio
; fio-seq-write.job for fiotest

[global]
name=fio-seq-write
filename=fio-seq-write
rw=write
bs=256K
direct=0
numjobs=1
time_based=1
runtime=60

[file1]
rate=50m
size=10G
ioengine=sync
iodepth=16

With this patch:
$cat /proc/fs/jbd2/sda5-8/force_copy
0

online fio almost always get such long tail latency:

Jobs: 1 (f=1), 0B/s-0B/s: [W(1)][100.0%][w=50.0MiB/s][w=200 IOPS][eta
00m:00s]
file1: (groupid=0, jobs=1): err= 0: pid=17855: Thu Nov 15 09:45:57 2018
  write: IOPS=200, BW=50.0MiB/s (52.4MB/s)(3000MiB/60001msec)
    clat (usec): min=135, max=4086.6k, avg=867.21, stdev=50338.22
     lat (usec): min=139, max=4086.6k, avg=871.16, stdev=50338.22
    clat percentiles (usec):
     |  1.00th=[    141],  5.00th=[    143], 10.00th=[    145],
     | 20.00th=[    147], 30.00th=[    147], 40.00th=[    149],
     | 50.00th=[    149], 60.00th=[    151], 70.00th=[    153],
     | 80.00th=[    155], 90.00th=[    159], 95.00th=[    163],
     | 99.00th=[    255], 99.50th=[    273], 99.90th=[    429],
     | 99.95th=[    441], 99.99th=[3640656]

$cat /proc/fs/jbd2/sda5-8/force_copy
1

online fio latency is much better.

Jobs: 1 (f=1), 0B/s-0B/s: [W(1)][100.0%][w=50.0MiB/s][w=200 IOPS][eta
00m:00s]
file1: (groupid=0, jobs=1): err= 0: pid=8084: Thu Nov 15 09:31:15 2018
  write: IOPS=200, BW=50.0MiB/s (52.4MB/s)(3000MiB/60001msec)
    clat (usec): min=137, max=545, avg=151.35, stdev=16.22
     lat (usec): min=140, max=548, avg=155.31, stdev=16.65
    clat percentiles (usec):
     |  1.00th=[  143],  5.00th=[  145], 10.00th=[  145], 20.00th=[
147],
     | 30.00th=[  147], 40.00th=[  147], 50.00th=[  149], 60.00th=[
149],
     | 70.00th=[  151], 80.00th=[  155], 90.00th=[  157], 95.00th=[
161],
     | 99.00th=[  239], 99.50th=[  269], 99.90th=[  420], 99.95th=[
429],
     | 99.99th=[  537]

As to the cost: because we'll always need to copy meta buffer, will
consume minor cpu time and some memory(at most 32MB for 128MB journal
size).
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

ac452d09

alios: ext4: don't submit unwritten extent while holding active jbd2 handle · a8366d32

由 Xiaoguang Wang 提交于 12月 25, 2018

In ext4_writepages(), for every iteration, mpage_prepare_extent_to_map()
will try to find 2048 pages to map and normally one bio can contain 256
pages at most. If we really found 2048 pages to map, there will be 4 bios
and 4 ext4_io_submit() calls which are called both in ext4_writepages()
and mpage_map_and_submit_extent().

But note that in mpage_map_and_submit_extent(), we hold a valid jbd2 handle,
when dioread_nolock is enabled and extent is unwritten, jbd2 commit thread
will wait this handle to finish, so wait the unwritten extent is written to
disk, this will introduce unnecessary stall time, especially longer when
the writeback operation is io throttled, need to fix this issue.

Here for this scene, we accumulate bios in ext4_io_submit's io_bio, and
only submit these bios after dropping the jbd2 handle.
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

a8366d32

alios: fs,ext4: remove projid limit when create hard link · 28df06b3

由 zhangliguang 提交于 12月 27, 2018

This is a temporary workaround plan to avoid the limitation when
creating hard link cross two projids.
Signed-off-by: Nzhangliguang <zhangliguang@linux.alibaba.com>
Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

28df06b3

alios: jbd2: add new "stats" proc file · 3550da0c

由 Xiaoguang Wang 提交于 6月 18, 2019

/proc/fs/jbd2/${device}/info only shows whole average statistical
info about jbd2's life cycle, but it can not show jbd2 info in
specified time interval and sometimes this capability is very useful
for trouble shooting. For example, we can not see how rs_locked and
rs_flushing grows in specified time interval, but these two indexes
can explain some reasons for app's behaviours.

Here we add a new "stats" proc file like /proc/diskstats, then we can
implement a simple tool jbd2_stats which'll display detailed jbd2 info
in specified time interval. Like below(time interval 5s):

[lege@localhost ~]$ cat /proc/fs/jbd2/vdb1-8/stats
51 30 8192 0 1 241616 0 0 22 0 47158 891 942 1000 1000

[lege@localhost ~]$ gcc -o jbd2_stat jbd2_stat.c ; ./jbd2_stat

Device              tid     trans   handles    locked  flushing
logging
vdb1-8             1861       158       359     13.00      0.00
2.00

Device              tid     trans   handles    locked  flushing
logging
vdb1-8             1974       113       389     26.00      0.00
5.00

Device              tid     trans   handles    locked  flushing
logging
vdb1-8             2188       214       308     10.00      0.00
7.00

Device              tid     trans   handles    locked  flushing
logging
vdb1-8             2344       156       332     19.00      0.00
4.00
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

3550da0c

alios: jbd2: create jbd2-ckpt thread for journal checkpoint · c31b17e5

由 Joseph Qi 提交于 3月 07, 2018

This is trying to do jbd2 checkpoint in a specific kernel thread, then
checkpoint won't be under io throttle control.
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Nzhangliguang <zhangliguang@linux.alibaba.com>
Reviewed by: Baoyou Xie <baoyou.xie@linux.alibaba.com>
Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

c31b17e5

21 11月, 2019 1 次提交

mm, memcg: add missing memory stall section in mem_cgroup_handle_over_high · fc2036b9

由 Caspar Zhang 提交于 9月 23, 2019

When backporting commit 0e4b01df8659 ("mm, memcg: throttle allocators
when failing reclaim over memory.high"), memory stall section was
inadvertently missing. Fix this issue by adding it back.

Fixes: eda29cc0 ("mm, memcg: throttle allocators when failing reclaim over memory.high")
Signed-off-by: NCaspar Zhang <caspar@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

fc2036b9

20 11月, 2019 28 次提交

mm: thp: handle page cache THP correctly in PageTransCompoundMap · c9f8166a

由 Yang Shi 提交于 11月 08, 2019

commit 169226f7e0d275c1879551f37484ef6683579a5c upstream

We have a usecase to use tmpfs as QEMU memory backend and we would like
to take the advantage of THP as well.  But, our test shows the EPT is
not PMD mapped even though the underlying THP are PMD mapped on host.
The number showed by /sys/kernel/debug/kvm/largepage is much less than
the number of PMD mapped shmem pages as the below:

  7f2778200000-7f2878200000 rw-s 00000000 00:14 262232 /dev/shm/qemu_back_mem.mem.Hz2hSf (deleted)
  Size:            4194304 kB
  [snip]
  AnonHugePages:         0 kB
  ShmemPmdMapped:   579584 kB
  [snip]
  Locked:                0 kB

  cat /sys/kernel/debug/kvm/largepages
  12

And some benchmarks do worse than with anonymous THPs.

By digging into the code we figured out that commit 127393fb ("mm:
thp: kvm: fix memory corruption in KVM with THP enabled") checks if
there is a single PTE mapping on the page for anonymous THP when setting
up EPT map.  But the _mapcount < 0 check doesn't work for page cache THP
since every subpage of page cache THP would get _mapcount inc'ed once it
is PMD mapped, so PageTransCompoundMap() always returns false for page
cache THP.  This would prevent KVM from setting up PMD mapped EPT entry.

So we need handle page cache THP correctly.  However, when page cache
THP's PMD gets split, kernel just remove the map instead of setting up
PTE map like what anonymous THP does.  Before KVM calls get_user_pages()
the subpages may get PTE mapped even though it is still a THP since the
page cache THP may be mapped by other processes at the mean time.

Checking its _mapcount and whether the THP has PTE mapped or not.
Although this may report some false negative cases (PTE mapped by other
processes), it looks not trivial to make this accurate.

With this fix /sys/kernel/debug/kvm/largepage would show reasonable
pages are PMD mapped by EPT as the below:

  7fbeaee00000-7fbfaee00000 rw-s 00000000 00:14 275464 /dev/shm/qemu_back_mem.mem.SKUvat (deleted)
  Size:            4194304 kB
  [snip]
  AnonHugePages:         0 kB
  ShmemPmdMapped:   557056 kB
  [snip]
  Locked:                0 kB

  cat /sys/kernel/debug/kvm/largepages
  271

And the benchmarks are as same as anonymous THPs.

[yang.shi@linux.alibaba.com: v4]
  Link: http://lkml.kernel.org/r/1571865575-42913-1-git-send-email-yang.shi@linux.alibaba.com
Link: http://lkml.kernel.org/r/1571769577-89735-1-git-send-email-yang.shi@linux.alibaba.com
Fixes: dd78fedd ("rmap: support file thp")
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reported-by: NGang Deng <gavin.dg@linux.alibaba.com>
Tested-by: NGang Deng <gavin.dg@linux.alibaba.com>
Suggested-by: NHugh Dickins <hughd@google.com>
Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>    [4.8+]
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

c9f8166a

ICX: perf/x86/intel: Fix invalid Bit 13 for Icelake MSR_OFFCORE_RSP_x register · c154e184

由 Yunying Sun 提交于 7月 24, 2019

commit 3b238a64c3009fed36eaea1af629d9377759d87d upstream.

The Intel SDM states that bit 13 of Icelake's MSR_OFFCORE_RSP_x
register is valid, and used for counting hardware generated prefetches
of L3 cache. Update the bitmask to allow bit 13.

Before:
$ perf stat -e cpu/event=0xb7,umask=0x1,config1=0x1bfff/u sleep 3
 Performance counter stats for 'sleep 3':
   <not supported>      cpu/event=0xb7,umask=0x1,config1=0x1bfff/u

After:
$ perf stat -e cpu/event=0xb7,umask=0x1,config1=0x1bfff/u sleep 3
 Performance counter stats for 'sleep 3':
             9,293      cpu/event=0xb7,umask=0x1,config1=0x1bfff/u
Signed-off-by: NYunying Sun <yunying.sun@intel.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: NKan Liang <kan.liang@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: alexander.shishkin@linux.intel.com
Cc: bp@alien8.de
Cc: hpa@zytor.com
Cc: jolsa@redhat.com
Cc: namhyung@kernel.org
Link: https://lkml.kernel.org/r/20190724082932.12833-1-yunying.sun@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
Signed-off-by: NLin Wang <lin.x.wang@intel.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

c154e184

ICX: perf/x86/intel: Add more Icelake CPUIDs · e4ed6f52

由 Kan Liang 提交于 6月 03, 2019

commit faaeff98666c24376cebd0b106504d05a36881d1 upstream.

Add new model number for Icelake desktop and server to perf.

The data source encoding for Icelake server is the same as Skylake
server.
Signed-off-by: NKan Liang <kan.liang@linux.intel.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@alien8.de
Cc: qiuxu.zhuo@intel.com
Cc: rui.zhang@intel.com
Cc: tony.luck@intel.com
Link: https://lkml.kernel.org/r/20190603134122.13853-2-kan.liang@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
Signed-off-by: NLin Wang <lin.x.wang@intel.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

e4ed6f52

resource/docs: Complete kernel-doc style function documentation · 1de9c7c3

由 Borislav Petkov 提交于 11月 05, 2018

commit f26621e60b35369bca9228bc936dc723b3e421af upstream.

Add the missing kernel-doc style function parameters documentation.
Signed-off-by: NBorislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: linux-tip-commits@vger.kernel.org
Cc: rdunlap@infradead.org
Fixes: b69c2e20f6e4 ("resource: Clean it up a bit")
Link: http://lkml.kernel.org/r/20181105093307.GA12445@zn.tnicSigned-off-by: NIngo Molnar <mingo@kernel.org>
[joseph: fix find_next_iomem_res() documentation]
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NCaspar Zhang <caspar@linux.alibaba.com>

1de9c7c3

resource/docs: Fix new kernel-doc warnings · 39cecf2f

由 Randy Dunlap 提交于 11月 04, 2018

commit f75d651587f719a813ebbbfeee570e6570731d55 upstream.

The first group of warnings is caused by a "/**" kernel-doc notation
marker but the function comments are not in kernel-doc format.
Also add another error return value here.

  ../kernel/resource.c:337: warning: Function parameter or member 'start' not described in 'find_next_iomem_res'
  ../kernel/resource.c:337: warning: Function parameter or member 'end' not described in 'find_next_iomem_res'
  ../kernel/resource.c:337: warning: Function parameter or member 'flags' not described in 'find_next_iomem_res'
  ../kernel/resource.c:337: warning: Function parameter or member 'desc' not described in 'find_next_iomem_res'
  ../kernel/resource.c:337: warning: Function parameter or member 'first_lvl' not described in 'find_next_iomem_res'
  ../kernel/resource.c:337: warning: Function parameter or member 'res' not described in 'find_next_iomem_res'

Add the missing function parameter documentation for the other warnings:

  ../kernel/resource.c:409: warning: Function parameter or member 'arg' not described in 'walk_iomem_res_desc'
  ../kernel/resource.c:409: warning: Function parameter or member 'func' not described in 'walk_iomem_res_desc'
Signed-off-by: NRandy Dunlap <rdunlap@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: b69c2e20f6e4 ("resource: Clean it up a bit")
Link: http://lkml.kernel.org/r/dda2e4d8-bedd-3167-20fe-8c7d2d35b354@infradead.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
[joseph: fix find_next_iomem_res() documentation]
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NCaspar Zhang <caspar@linux.alibaba.com>

39cecf2f

acpi/hmat: fix an uninitialized memory_target · ffd11878

由 Qian Cai 提交于 4月 06, 2019

commit ab3a9f2ccc080d27873f76869c9a780be45e581e upstream.

The commit 665ac7e92757 ("acpi/hmat: Register processor domain to its
memory") introduced an uninitialized "struct memory_target" that could
cause an incorrect branching.

drivers/acpi/hmat/hmat.c:385:6: warning: variable 'target' is used
uninitialized whenever 'if' condition is false
[-Wsometimes-uninitialized]
        if (p->flags & ACPI_HMAT_MEMORY_PD_VALID) {
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/acpi/hmat/hmat.c:392:6: note: uninitialized use occurs here
        if (target && p->flags & ACPI_HMAT_PROCESSOR_PD_VALID) {
            ^~~~~~
drivers/acpi/hmat/hmat.c:385:2: note: remove the 'if' if its condition
is always true
        if (p->flags & ACPI_HMAT_MEMORY_PD_VALID) {
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/acpi/hmat/hmat.c:369:30: note: initialize the variable 'target'
to silence this warning
        struct memory_target *target;
                                    ^
                                     = NULL
Signed-off-by: NQian Cai <cai@lca.pw>
Reviewed-by: NMukesh Ojha <mojha@codeaurora.org>
Fixes: 665ac7e92757 ("acpi/hmat: Register processor domain to its memory")
Reviewed-by: NNathan Chancellor <natechancellor@gmail.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NCaspar Zhang <caspar@linux.alibaba.com>

ffd11878

ICX: EDAC, i10nm: Fix randconfig builds · ea396c30

由 Tony Luck 提交于 2月 05, 2019

commit d6a9f7336d925364daca00557afa59a68e78b422 upstream.

I10NM_EDAC depends on CONFIG_ACPI so make that dependency explicit.
Reported-by: NBorislav Petkov <bp@suse.de>
Signed-off-by: NTony Luck <tony.luck@intel.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Cc: Aristeu Rozanski <aris@redhat.com>
Cc: James Morse <james.morse@arm.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: https://lkml.kernel.org/r/20190205180200.26865-1-tony.luck@intel.comSigned-off-by: NCaspar Zhang <caspar@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

ea396c30

tools x86 uapi asm: Sync the pt_regs.h copy with the kernel sources · fae5be70

由 Arnaldo Carvalho de Melo 提交于 9月 20, 2019

commit 0ceb5499a8001e5ddac2c8bd7b45eb4c643469ad upstream.

To get the changes in:

  878068ea270e ("perf/x86: Support outputting XMM registers")

That will be used in a followup patch to allow users to ask for some or
all of those registers to be collected in certain contatexts.

This silences the following perf build warning:

  Warning: Kernel ABI header at 'tools/arch/x86/include/uapi/asm/perf_regs.h' differs from latest version at 'arch/x86/include/uapi/asm/perf_regs.h'
  diff -u tools/arch/x86/include/uapi/asm/perf_regs.h arch/x86/include/uapi/asm/perf_regs.h

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/n/tip-6pjnnrzqt3x3n2cd6br3wk7k@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

fae5be70

device-dax: fix memory and resource leak if hotplug fails · d0219d42

由 Pavel Tatashin 提交于 7月 16, 2019

commit 31e4ca92a7dd4cdebd7fe1456b3b0b6ace9a816f upstream

Patch series ""Hotremove" persistent memory", v6.

Recently, adding a persistent memory to be used like a regular RAM was
added to Linux.  This work extends this functionality to also allow hot
removing persistent memory.

We (Microsoft) have an important use case for this functionality.

The requirement is for physical machines with small amount of RAM (~8G)
to be able to reboot in a very short period of time (<1s).  Yet, there
is a userland state that is expensive to recreate (~2G).

The solution is to boot machines with 2G preserved for persistent
memory.

Copy the state, and hotadd the persistent memory so machine still has
all 8G available for runtime.  Before reboot, offline and hotremove
device-dax 2G, copy the memory that is needed to be preserved to pmem0
device, and reboot.

The series of operations look like this:

1. After boot restore /dev/pmem0 to ramdisk to be consumed by apps.
   and free ramdisk.
2. Convert raw pmem0 to devdax
   ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f
3. Hotadd to System RAM
   echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
   echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
   echo online_movable > /sys/devices/system/memoryXXX/state
4. Before reboot hotremove device-dax memory from System RAM
   echo offline > /sys/devices/system/memoryXXX/state
   echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
5. Create raw pmem0 device
   ndctl create-namespace --mode raw  -e namespace0.0 -f
6. Copy the state that was stored by apps to ramdisk to pmem device
7. Do kexec reboot or reboot through firmware if firmware does not
   zero memory in pmem0 region (These machines have only regular
   volatile memory). So to have pmem0 device either memmap kernel
   parameter is used, or devices nodes in dtb are specified.

This patch (of 3):

When add_memory() fails, the resource and the memory should be freed.

Link: http://lkml.kernel.org/r/20190517215438.6487-2-pasha.tatashin@soleen.com
Fixes: c221c0b0308f ("device-dax: "Hotplug" persistent memory for use like normal RAM")
Signed-off-by: NPavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: NDave Hansen <dave.hansen@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>

d0219d42

device-dax: Add a 'resource' attribute · 651aadb6

由 Vishal Verma 提交于 6月 20, 2019

commit 40cdc60ac16a42eb4e013f84d0e7aa1d6ee060d3 upstream

device-dax based devices were missing a 'resource' attribute to indicate
the physical address range contributed by the device in question. This
information is desirable to userspace tooling that may want to use the
dax device as system-ram, and wants to selectively hotplug and online
the memory blocks associated with a given device.

Without this, the tooling would have to parse /proc/iomem for the memory
ranges contributed by dax devices, which can be a workaround, but it is
far easier to provide this information in the sysfs hierarchy.

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NVishal Verma <vishal.l.verma@intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>

651aadb6

drivers/dax: Allow to include DEV_DAX_PMEM as builtin · c83b6a3a

由 Aneesh Kumar K.V 提交于 4月 01, 2019

commit 67476656febd7ec5f1fe1aeec3c441fcf53b1e45 upstream

This move the dependency to DEV_DAX_PMEM_COMPAT such that only
if DEV_DAX_PMEM is built as module we can allow the compat support.

This allows to test the new code easily in a emulation setup where we
often build things without module support.

Cc: <stable@vger.kernel.org>
Fixes: 730926c3b099 ("device-dax: Add /sys/class/dax backwards compatibility")
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>

c83b6a3a

device-dax: "Hotplug" persistent memory for use like normal RAM · 370de25f

由 Dave Hansen 提交于 2月 25, 2019

commit c221c0b0308fd01d9fb33a16f64d2fd95f8830a4 upstream

This is intended for use with NVDIMMs that are physically persistent
(physically like flash) so that they can be used as a cost-effective
RAM replacement.  Intel Optane DC persistent memory is one
implementation of this kind of NVDIMM.

Currently, a persistent memory region is "owned" by a device driver,
either the "Direct DAX" or "Filesystem DAX" drivers.  These drivers
allow applications to explicitly use persistent memory, generally
by being modified to use special, new libraries. (DIMM-based
persistent memory hardware/software is described in great detail
here: Documentation/nvdimm/nvdimm.txt).

However, this limits persistent memory use to applications which
*have* been modified.  To make it more broadly usable, this driver
"hotplugs" memory into the kernel, to be managed and used just like
normal RAM would be.

To make this work, management software must remove the device from
being controlled by the "Device DAX" infrastructure:

	echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind

and then tell the new driver that it can bind to the device:

	echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id

After this, there will be a number of new memory sections visible
in sysfs that can be onlined, or that may get onlined by existing
udev-initiated memory hotplug rules.

This rebinding procedure is currently a one-way trip.  Once memory
is bound to "kmem", it's there permanently and can not be
unbound and assigned back to device_dax.

The kmem driver will never bind to a dax device unless the device
is *explicitly* bound to the driver.  There are two reasons for
this: One, since it is a one-way trip, it can not be undone if
bound incorrectly.  Two, the kmem driver destroys data on the
device.  Think of if you had good data on a pmem device.  It
would be catastrophic if you compile-in "kmem", but leave out
the "device_dax" driver.  kmem would take over the device and
write volatile data all over your good data.

This inherits any existing NUMA information for the newly-added
memory from the persistent memory device that came from the
firmware.  On Intel platforms, the firmware has guarantees that
require each socket's persistent memory to be in a separate
memory-only NUMA node.  That means that this patch is not expected
to create NUMA nodes, but will simply hotplug memory into existing
nodes.

Because NUMA nodes are created, the existing NUMA APIs and tools
are sufficient to create policies for applications or memory areas
to have affinity for or an aversion to using this memory.

There is currently some metadata at the beginning of pmem regions.
The section-size memory hotplug restrictions, plus this small
reserved area can cause the "loss" of a section or two of capacity.
This should be fixable in follow-on patches.  But, as a first step,
losing 256MB of memory (worst case) out of hundreds of gigabytes
is a good tradeoff vs. the required code to fix this up precisely.
This calculation is also the reason we export
memory_block_size_bytes().
Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: NDan Williams <dan.j.williams@intel.com>
Reviewed-by: NKeith Busch <keith.busch@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying <ying.huang@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Jerome Glisse <jglisse@redhat.com>
Reviewed-by: NVishal Verma <vishal.l.verma@intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>

370de25f

mm/resource: Let walk_system_ram_range() search child resources · 3ed62604

由 Dave Hansen 提交于 2月 25, 2019

commit 2b539aefe9e48e3908cff02699aa63a8b9bd268e upstream

In the process of onlining memory, we use walk_system_ram_range()
to find the actual RAM areas inside of the area being onlined.

However, it currently only finds memory resources which are
"top-level" iomem_resources.  Children are not currently
searched which causes it to skip System RAM in areas like this
(in the format of /proc/iomem):

a0000000-bfffffff : Persistent Memory (legacy)
  a0000000-afffffff : System RAM

Changing the true->false here allows children to be searched
as well.  We need this because we add a new "System RAM"
resource underneath the "persistent memory" resource when
we use persistent memory in a volatile mode.
Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying <ying.huang@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>

3ed62604

mm/memory-hotplug: Allow memory resources to be children · 71009d66

由 Dave Hansen 提交于 2月 25, 2019

commit 2794129e902d8eb69413d884dc6404b8716ed9ed upstream

The mm/resource.c code is used to manage the physical address
space.  The current resource configuration can be viewed in
/proc/iomem.  An example of this is at the bottom of this
description.

The nvdimm subsystem "owns" the physical address resources which
map to persistent memory and has resources inserted for them as
"Persistent Memory".  The best way to repurpose this for volatile
use is to leave the existing resource in place, but add a "System
RAM" resource underneath it. This clearly communicates the
ownership relationship of this memory.

The request_resource_conflict() API only deals with the
top-level resources.  Replace it with __request_region() which
will search for !IORESOURCE_BUSY areas lower in the resource
tree than the top level.

We *could* also simply truncate the existing top-level
"Persistent Memory" resource and take over the released address
space.  But, this means that if we ever decide to hot-unplug the
"RAM" and give it back, we need to recreate the original setup,
which may mean going back to the BIOS tables.

This should have no real effect on the existing collision
detection because the areas that truly conflict should be marked
IORESOURCE_BUSY.

00000000-00000fff : Reserved
00001000-0009fbff : System RAM
0009fc00-0009ffff : Reserved
000a0000-000bffff : PCI Bus 0000:00
000c0000-000c97ff : Video ROM
000c9800-000ca5ff : Adapter ROM
000f0000-000fffff : Reserved
  000f0000-000fffff : System ROM
00100000-9fffffff : System RAM
  01000000-01e071d0 : Kernel code
  01e071d1-027dfdff : Kernel data
  02dc6000-0305dfff : Kernel bss
a0000000-afffffff : Persistent Memory (legacy)
  a0000000-a7ffffff : System RAM
b0000000-bffdffff : System RAM
bffe0000-bfffffff : Reserved
c0000000-febfffff : PCI Bus 0000:00
Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: NDan Williams <dan.j.williams@intel.com>
Reviewed-by: NVishal Verma <vishal.l.verma@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying <ying.huang@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>

71009d66

mm/resource: Move HMM pr_debug() deeper into resource code · cdb8d31f

由 Dave Hansen 提交于 2月 25, 2019

commit b926b7f3baecb2a855db629e6822e1a85212e91c upstream

HMM consumes physical address space for its own use, even
though nothing is mapped or accessible there.  It uses a
special resource description (IORES_DESC_DEVICE_PRIVATE_MEMORY)
to uniquely identify these areas.

When HMM consumes address space, it makes a best guess about
what to consume.  However, it is possible that a future memory
or device hotplug can collide with the reserved area.  In the
case of these conflicts, there is an error message in
register_memory_resource().

Later patches in this series move register_memory_resource()
from using request_resource_conflict() to __request_region().
Unfortunately, __request_region() does not return the conflict
like the previous function did, which makes it impossible to
check for IORES_DESC_DEVICE_PRIVATE_MEMORY in a conflicting
resource.

Instead of warning in register_memory_resource(), move the
check into the core resource code itself (__request_region())
where the conflicting resource _is_ available.  This has the
added bonus of producing a warning in case of HMM conflicts
with devices *or* RAM address space, as opposed to the RAM-
only warnings that were there previously.
Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: NJerome Glisse <jglisse@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying <ying.huang@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>

cdb8d31f

mm/resource: Return real error codes from walk failures · 2c33206f

由 Dave Hansen 提交于 2月 25, 2019

commit 5cd401ace914dc68556c6d2fcae0c349444d5f86 upstream

walk_system_ram_range() can return an error code either becuase
*it* failed, or because the 'func' that it calls returned an
error.  The memory hotplug does the following:

	ret = walk_system_ram_range(..., func);
        if (ret)
		return ret;

and 'ret' makes it out to userspace, eventually.  The problem
s, walk_system_ram_range() failues that result from *it* failing
(as opposed to 'func') return -1.  That leads to a very odd
-EPERM (-1) return code out to userspace.

Make walk_system_ram_range() return -EINVAL for internal
failures to keep userspace less confused.

This return code is compatible with all the callers that I
audited.
Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: NBjorn Helgaas <bhelgaas@google.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying <ying.huang@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>

2c33206f

kernel, resource: check for IORESOURCE_SYSRAM in release_mem_region_adjustable · d0bc6e68

由 Oscar Salvador 提交于 12月 28, 2018

commit 65c78784135f847e49eb98e6b976e453e71100c3 upstream

This is a preparation for the next patch.

Currently, we only call release_mem_region_adjustable() in __remove_pages
if the zone is not ZONE_DEVICE, because resources that belong to HMM/devm
are being released by themselves with devm_release_mem_region.

Since we do not want to touch any zone/page stuff during the removing of
the memory (but during the offlining), we do not want to check for the
zone here.  So we need another way to tell release_mem_region_adjustable()
to not realease the resource in case it belongs to HMM/devm.

HMM/devm acquires/releases a resource through
devm_request_mem_region/devm_release_mem_region.

These resources have the flag IORESOURCE_MEM, while resources acquired by
hot-add memory path (register_memory_resource()) contain
IORESOURCE_SYSTEM_RAM.

So, we can check for this flag in release_mem_region_adjustable, and if
the resource does not contain such flag, we know that we are dealing with
a HMM/devm resource, so we can back off.

Link: http://lkml.kernel.org/r/20181127162005.15833-3-osalvador@suse.deSigned-off-by: NOscar Salvador <osalvador@suse.de>
Reviewed-by: NDavid Hildenbrand <david@redhat.com>
Reviewed-by: NPavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>

d0bc6e68

resource: Clean it up a bit · 02cc5207

由 Borislav Petkov 提交于 10月 09, 2018

commit b69c2e20f6e4046da84ce5b33ba1ef89cb087b40 upstream

- Drop BUG_ON()s and do normal error handling instead, in
  find_next_iomem_res().

- Align function arguments on opening braces.

- Get rid of local var sibling_only in find_next_iomem_res().

- Shorten unnecessarily long first_level_children_only arg name.
Signed-off-by: NBorislav Petkov <bp@suse.de>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Bjorn Helgaas <bhelgaas@google.com>
CC: Brijesh Singh <brijesh.singh@amd.com>
CC: Dan Williams <dan.j.williams@intel.com>
CC: H. Peter Anvin <hpa@zytor.com>
CC: Lianbo Jiang <lijiang@redhat.com>
CC: Takashi Iwai <tiwai@suse.de>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Tom Lendacky <thomas.lendacky@amd.com>
CC: Vivek Goyal <vgoyal@redhat.com>
CC: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
CC: bhe@redhat.com
CC: dan.j.williams@intel.com
CC: dyoung@redhat.com
CC: kexec@lists.infradead.org
CC: mingo@redhat.com
Link: <new submission>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>

02cc5207

device-dax: Add a 'modalias' attribute to DAX 'bus' devices · 3405f4ea

由 Vishal Verma 提交于 2月 22, 2019

commit c347bd71dcdb2d0ac8b3a771486584dca8c8dd80 upstream

Add a 'modalias' attribute to devices under the DAX bus so that userspace
is able to dynamically load modules as needed.

Normally, udev can get the modalias from 'uevent', and that is correctly
set up by the DAX bus. However other tooling such as 'libndctl' for
interacting with drivers/nvdimm/, and 'libdaxctl' for drivers/dax/ can
also use the modalias to dynamically load modules via libkmod lookups.

The 'nd' bus set up by the libnvdimm subsystem exports a modalias
attribute. Imitate this to export the same for the 'dax' bus.

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: NVishal Verma <vishal.l.verma@intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>

3405f4ea

device-dax: Add a 'target_node' attribute · 539796f4

由 Dan Williams 提交于 2月 20, 2019

commit 21c75763a3ae18679e5c4e2260aa9379b073566b upstream

The target-node attribute is the Linux numa-node that a device-dax
instance may create when it is online. Prior to being online the
device's 'numa_node' property reflects the closest online cpu node which
is the typical expectation of a device 'numa_node'. Once it is online it
becomes its own distinct numa node, i.e. 'target_node'.

Export the 'target_node' property to give userspace tooling the ability
to predict the effective numa-node from a device-dax instance configured
to provide 'System RAM' capacity.

Cc: Vishal Verma <vishal.l.verma@intel.com>
Reported-by: NDave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>

539796f4

device-dax: Auto-bind device after successful new_id · e663208d

由 Dan Williams 提交于 1月 24, 2019

commit 664525b2d84abca1074c9546654ae9689de8a818 upstream

The typical 'new_id' attribute behavior is to immediately attach a
device to its driver after a new device-id is added. Implement this
behavior for the dax bus.
Reported-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
Reported-by: NBrice Goglin <Brice.Goglin@inria.fr>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>

e663208d

acpi/nfit, device-dax: Identify differentiated memory with a unique numa-node · 159d387d

由 Dan Williams 提交于 8月 24, 2019

commit 8fc5c73554db0ac18c0c6ac5b2099ab917f83bdf upstream

Persistent memory, as described by the ACPI NFIT (NVDIMM Firmware
Interface Table), is the first known instance of a memory range
described by a unique "target" proximity domain. Where "initiator" and
"target" proximity domains is an approach that the ACPI HMAT
(Heterogeneous Memory Attributes Table) uses to described the unique
performance properties of a memory range relative to a given initiator
(e.g. CPU or DMA device).

Currently the numa-node for a /dev/pmemX block-device or /dev/daxX.Y
char-device follows the traditional notion of 'numa-node' where the
attribute conveys the closest online numa-node. That numa-node attribute
is useful for cpu-binding and memory-binding processes *near* the
device. However, when the memory range backing a 'pmem', or 'dax' device
is onlined (memory hot-add) the memory-only-numa-node representing that
address needs to be differentiated from the set of online nodes. In
other words, the numa-node association of the device depends on whether
you can bind processes *near* the cpu-numa-node in the offline
device-case, or bind process *on* the memory-range directly after the
backing address range is onlined.

Allow for the case that platform firmware describes persistent memory
with a unique proximity domain, i.e. when it is distinct from the
proximity of DRAM and CPUs that are on the same socket. Plumb the Linux
numa-node translation of that proximity through the libnvdimm region
device to namespaces that are in device-dax mode. With this in place the
proposed kmem driver [1] can optionally discover a unique numa-node
number for the address range as it transitions the memory from an
offline state managed by a device-driver to an online memory range
managed by the core-mm.

[1]: https://lore.kernel.org/lkml/20181022201317.8558C1D8@viggo.jf.intel.comReported-by: NFan Du <fan.du@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Oliver O'Halloran" <oohall@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
[yshi: Removed PowerPC stuff which is not applicable 4.19]
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>

159d387d

device-dax: Add /sys/class/dax backwards compatibility · 796f6939

由 Dan Williams 提交于 8月 24, 2019

commit 730926c3b0998943654019f00296cf8e3b02277e upstream

On the expectation that some environments may not upgrade libdaxctl
(userspace component that depends on the /sys/class/dax hierarchy),
provide a default / legacy dax_pmem_compat driver. The dax_pmem_compat
driver implements the original /sys/class/dax sysfs layout rather than
/sys/bus/dax. When userspace is upgraded it can blacklist this module
and switch to the dax_pmem driver going forward.

CONFIG_DEV_DAX_PMEM_COMPAT and supporting code will be deleted according
to the dax_pmem entry in Documentation/ABI/obsolete/.
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>

796f6939

device-dax: Add support for a dax override driver · e1da20e7

由 Dan Williams 提交于 11月 07, 2018

commit d200781ef237a354d918ceff5cee350d88a93d42 upstream

Introduce the 'new_id' concept for enabling a custom device-driver attach
policy for dax-bus drivers. The intended use is to have a mechanism for
hot-plugging device-dax ranges into the page allocator on-demand. With
this in place the default policy of using device-dax for performance
differentiated memory can be overridden by user-space policy that can
arrange for the memory range to be managed as 'System RAM' with
user-defined NUMA and other performance attributes.
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>

e1da20e7

device-dax: Move resource pinning+mapping into the common driver · ced768b9

由 Dan Williams 提交于 10月 29, 2018

commit 89ec9f2cfa36cc5fca2fb445ed221bb9add7b536 upstream

Move the responsibility of calling devm_request_resource() and
devm_memremap_pages() into the common device-dax driver. This is another
preparatory step to allowing an alternate personality driver for a
device-dax range.
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>

ced768b9

device-dax: Introduce bus + driver model · 1ea15e85

由 Dan Williams 提交于 7月 12, 2017

commit 9567da0b408a2553d32ca83cba4f1fc5a8aad459 upstream

In support of multiple device-dax instances per device-dax-region and
allowing the 'kmem' driver to attach to dax-instances instead of the
current device-node access, convert the dax sub-system from a class to a
bus. Recall that the kmem driver takes reserved / special purpose
memories and assigns them to be managed by the core-mm.

Aside from the fact the device-dax instances are registered and probed
on a bus, two other lifetime-management changes are made:

1/ Delay attaching a cdev until driver probe time

2/ A new run_dax() helper is introduced to allow restoring dax-operation
after a kill_dax() event. So, at driver ->probe() time we run_dax()
and at ->remove() time we kill_dax() and invalidate all mappings.
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>

1ea15e85

device-dax: Start defining a dax bus model · d19c14e6

由 Dan Williams 提交于 7月 12, 2017

commit 51cf784c42d07fbd62cb604836a9270cf3361509 upstream

Towards eliminating the dax_class, move the dax-device-attribute
enabling to a new bus.c file in the core. The amount of code
thrash of sub-sequent patches is reduced as no logic changes are made,
just pure code movement.

A temporary export of unregister_dex_dax() and dax_attribute_groups is
needed to preserve compilation, but those symbols become static again in
a follow-on patch.
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>

d19c14e6

device-dax: Remove multi-resource infrastructure · c1b93b62

由 Dan Williams 提交于 7月 14, 2017

commit 753a0850e707e9a8c5861356222f9b9e4eba7945 upstream

The multi-resource implementation anticipated discontiguous sub-division
support. That has not yet materialized, delete the infrastructure and
related code.
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>

c1b93b62

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功