Commits · 8978f14a99a3eacb5504b986737bd91ef0ed4fa2 · openanolis / cloud-kernel

09 Dec, 2019 1 commit

alios: mm/thp: remove unused variable 'pgdata' in split_huge_page_to_list() · 8978f14a

Committed by Joseph Qi on Dec 05, 2019

This fixes the following build warning:
mm/huge_memory.c: In function ‘split_huge_page_to_list’:
mm/huge_memory.c:2656:22: warning: unused variable ‘pgdata’ [-Wunused-variable]
  struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
                      ^

Fixes: 6c52af5ee5c5 ("mm: thp: extract split_queue_* into a struct")
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>

8978f14a

08 Dec, 2019 12 commits

mm: thp: move deferred split queue to memcg's nodeinfo · 1d1b4c6c

Committed by Yang Shi on Oct 22, 2019

The commit 87eaceb3faa59b9b4d940ec9554ce251325d83fe ("mm: thp: make
deferred split shrinker memcg aware") makes the deferred split queue per
memcg to resolve the memcg premature OOM problem.  But all nodes end up
sharing the same queue, instead of one queue per node as before the
commit.  It is not a big deal for memcg limit reclaim, but it may cause
global kswapd to shrink THPs from a different node.

And 0-day testing reported a -19.6% regression of stress-ng's madvise
test [1].  I didn't see that much regression on my test box (24 threads,
48GB memory, 2 nodes).  With the same test (stress-ng --timeout 1
--metrics-brief --sequential 72  --class vm --exclude spawn,exec), I
sometimes saw an average -3% regression (the same test was run 10 times
and averaged, since the test itself may have at most 15% variation
according to my test); other times I didn't see any regression at all.

This might be caused by deferred split queue lock contention.  With some
configurations (e.g. just one root memcg) the lock contention may be
worse than before (given 2 nodes, two locks are reduced to one lock).

So, move the deferred split queue to memcg's nodeinfo to make it NUMA
aware again.

With this change stress-ng's madvise test sometimes shows an average 4%
improvement, and I didn't see degradation anymore.
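
The locking difference can be illustrated with a small userspace model. This is only a sketch in Python, not kernel code: the class names and the queue_for() and distinct_locks() helpers are hypothetical, and a deferred split queue is reduced to a lock plus a list.

```python
import threading
from collections import defaultdict


def _new_queue():
    # A deferred split queue reduced to its lock and its list of THPs.
    return {"lock": threading.Lock(), "pages": []}


class PerMemcgQueues:
    """Layout before this change: one queue per memcg, shared by all nodes."""

    def __init__(self):
        self._queues = defaultdict(_new_queue)

    def queue_for(self, memcg, nid):
        # The node id is ignored, so every node contends on the same lock.
        return self._queues[memcg]


class PerMemcgPerNodeQueues:
    """Layout after this change: one queue per (memcg, node) pair."""

    def __init__(self):
        self._queues = defaultdict(_new_queue)

    def queue_for(self, memcg, nid):
        return self._queues[(memcg, nid)]


def distinct_locks(layout, memcg, nids):
    """Count how many different locks the given nodes would take."""
    return len({id(layout.queue_for(memcg, nid)["lock"]) for nid in nids})
```

With a single root memcg on a 2-node box, distinct_locks(PerMemcgQueues(), "root", [0, 1]) gives 1, while the per-node layout gives 2, which corresponds to the "two locks are reduced to one lock" observation.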

[1]: https://lore.kernel.org/lkml/20190930084604.GC17687@shao2-debian/

Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

1d1b4c6c

mm: thp: make deferred split shrinker memcg aware · ace35514

Committed by Yang Shi on Oct 22, 2019

commit 87eaceb3faa59b9b4d940ec9554ce251325d83fe upstream

Currently the THP deferred split shrinker is not memcg aware, which may
cause premature OOM with some configurations.  For example, the test
below runs into premature OOM easily:

$ cgcreate -g memory:thp
$ echo 4G > /sys/fs/cgroup/memory/thp/memory.limit_in_bytes
$ cgexec -g memory:thp transhuge-stress 4000

transhuge-stress comes from the kernel selftests.

It is easy to hit OOM while there are still a lot of THPs on the deferred
split queue, because memcg direct reclaim can't touch them as long as the
deferred split shrinker is not memcg aware.

Make the deferred split shrinker memcg aware by introducing a per memcg
deferred split queue.  A THP is on the per memcg deferred split queue if
it belongs to a memcg, and on the per node queue otherwise.  When the
page is migrated to another memcg, it is moved to the target memcg's
deferred split queue too.

Reuse the second tail page's deferred_list for per memcg list since the
same THP can't be on multiple deferred split queues.

[yang.shi@linux.alibaba.com: simplify deferred split queue dereference per Kirill Tkhai]
  Link: http://lkml.kernel.org/r/1566496227-84952-5-git-send-email-yang.shi@linux.alibaba.com
Link: http://lkml.kernel.org/r/1565144277-36240-5-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

ace35514

mm: shrinker: make shrinker not depend on memcg kmem · b382ffa5

Committed by Yang Shi on Oct 22, 2019

commit 0a432dcbeb32edcd211a5d8f7847d0da7642a8b4 upstream

Currently a shrinker is only allocated, and can only work, when memcg kmem
is enabled.  But the THP deferred split shrinker is not a slab shrinker,
so it doesn't make much sense to have such a shrinker depend on memcg
kmem.  It should be able to reclaim THPs even when memcg kmem is
disabled.

Introduce a new shrinker flag, SHRINKER_NONSLAB, for non-slab shrinkers.
When memcg kmem is disabled, only such shrinkers can be called when
shrinking memcg slab.

[yang.shi@linux.alibaba.com: add comment]
  Link: http://lkml.kernel.org/r/1566496227-84952-4-git-send-email-yang.shi@linux.alibaba.com
Link: http://lkml.kernel.org/r/1565144277-36240-4-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

b382ffa5

mm: move mem_cgroup_uncharge out of __page_cache_release() · 79044939

Committed by Yang Shi on Oct 22, 2019

commit 7ae88534cdd96235cd775c03b32a75009355740b upstream

A later patch makes THP deferred split shrinker memcg aware, but it
needs page->mem_cgroup information in THP destructor, which is called after
mem_cgroup_uncharge() now.

So move mem_cgroup_uncharge() from __page_cache_release() to compound
page destructor, which is called by both THP and other compound pages except
HugeTLB.  And call it in __put_single_page() for single order page.

Link: http://lkml.kernel.org/r/1565144277-36240-3-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Suggested-by: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

79044939

mm: thp: extract split_queue_* into a struct · c9acf2bd

Committed by Yang Shi on Oct 22, 2019

commit 364c1eebe453f06f0c1e837eb155a5725c9cd272 upstream

Patch series "Make deferred split shrinker memcg aware", v6.

Currently the THP deferred split shrinker is not memcg aware, which may
cause premature OOM with some configurations.  For example, the test
below runs into premature OOM easily:

$ cgcreate -g memory:thp
$ echo 4G > /sys/fs/cgroup/memory/thp/memory.limit_in_bytes
$ cgexec -g memory:thp transhuge-stress 4000

transhuge-stress comes from the kernel selftests.

It is easy to hit OOM while there are still a lot of THPs on the deferred
split queue, because memcg direct reclaim can't touch them as long as the
deferred split shrinker is not memcg aware.

Make the deferred split shrinker memcg aware by introducing a per memcg
deferred split queue.  A THP is on the per memcg deferred split queue if
it belongs to a memcg, and on the per node queue otherwise.  When the
page is migrated to another memcg, it is moved to the target memcg's
deferred split queue too.

Reuse the second tail page's deferred_list for per memcg list since the
same THP can't be on multiple deferred split queues.

Make the deferred split shrinker not depend on memcg kmem since it is not
slab.  It doesn't make sense to not shrink THPs just because memcg kmem
is disabled.

With the above change the test demonstrated above doesn't trigger OOM
even with cgroup.memory=nokmem.

This patch (of 4):

Put split_queue, split_queue_lock and split_queue_len into a struct in
order to reduce code duplication when we convert deferred_split to memcg
aware in the later patches.

Link: http://lkml.kernel.org/r/1565144277-36240-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Suggested-by: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

c9acf2bd

alios: mm: Support kidled · fd952d8c

Committed by Gavin Shan on Aug 30, 2019

This enables scanning pages at a fixed interval to determine their access
frequency (hot/cold). The result is exported to user land per memory
cgroup through "memory.idle_page_stats". The design is highlighted
below:

   * A kernel thread is spawned when this feature is enabled by writing
     a non-zero value to "/sys/kernel/mm/kidled/scan_period_in_seconds".
     The thread sequentially scans the nodes and their pages that have
     been chained up in the LRU lists.

   * For each page, its corresponding age information is stored in the
     page flags or in an array in the node. The age represents the number
     of scanning intervals in which the page isn't accessed. Also, the
     page flag (PG_idle) is leveraged. The page's age is increased by one
     if the idle flag isn't cleared between two consecutive scans.
     Otherwise, the page's age is cleared out. Also, the page's age
     information is cleared when it's freed so that stale age information
     won't be fetched when it's allocated again.

   * Initially, the flag is set, while the access bit in its PTE is cleared
     out by the thread. In the next scanning period, its PTE access bit is
     synchronized with the page flag: the flag is cleared if the access
     bit is set, and kept otherwise. For unmapped pages, the flag is
     cleared when the page is accessed.

   * Eventually, the page's aging information is added to the unstable
     bucket of its corresponding memory cgroup, which serves as statistics.
     The unstable bucket (statistics) is copied to the stable bucket once
     all pages in all nodes have been scanned. The stable bucket
     (statistics) is exported to user land through "memory.idle_page_stats".
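
The aging rules above can be sketched as a small userspace model. This is an illustration in Python rather than the kernel implementation: the Page class, touch(), scan_page() and scan_round() are hypothetical names, and the statistics are keyed by exact age here, whereas the real "memory.idle_page_stats" groups pages into fixed buckets and page types that this sketch does not reproduce.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Page:
    size: int = 4096         # bytes accounted for this page
    idle: bool = False       # models the PG_idle page flag
    pte_young: bool = False  # models the access bit in the page's PTE
    age: int = 0             # scan intervals without an access


def touch(page):
    """Model a userspace access to a mapped page."""
    page.pte_young = True


def scan_page(page):
    """One scanner visit to a page; returns the updated age."""
    # Synchronize the PTE access bit with the idle flag.
    if page.pte_young:
        page.idle = False
    # Flag survived since the previous scan: one more idle interval.
    if page.idle:
        page.age += 1
    else:
        page.age = 0
    # Re-arm for the next interval: set the flag, clear the access bit.
    page.idle = True
    page.pte_young = False
    return page.age


def scan_round(pages):
    """Scan every page once and return the statistics to publish."""
    unstable = Counter()
    for page in pages:
        unstable[scan_page(page)] += page.size
    # The unstable bucket becomes the stable one only after a full pass.
    return dict(unstable)
```

A page that is never touched ages by one on every round after the first, and a single touch() resets it to zero on the next scan.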

TESTING
=======

   * cgroup1, unmapped pagecache

     # dd if=/dev/zero of=/ext4/test.data oflag=direct bs=1M count=128
     #
     # echo 1 > /sys/kernel/mm/kidled/use_hierarchy
     # echo 15 > /sys/kernel/mm/kidled/scan_period_in_seconds
     # mkdir -p /cgroup/memory
     # mount -tcgroup -o memory /cgroup/memory
     # echo 1 > /cgroup/memory/memory.use_hierarchy
     # mkdir -p /cgroup/memory/test
     # echo 1 > /cgroup/memory/test/memory.use_hierarchy
     #
     # echo $$ > /cgroup/memory/test/cgroup.procs
     # dd if=/ext4/test.data of=/dev/null bs=1M count=128
     # < wait a few minutes >
     # cat /cgroup/memory/test/memory.idle_page_stats | grep cfei
       cfei   0   0   0   134217728   0   0   0   0
     # cat /cgroup/memory/memory.idle_page_stats | grep cfei
       cfei   0   0   0   134217728   0   0   0   0

   * cgroup1, mapped pagecache

     # < create same file and memory cgroups as above >
     #
     # echo $$ > /cgroup/memory/test/cgroup.procs
     # < run program to mmap the whole created file and access the area >
     # < wait a few minutes >
     # cat /cgroup/memory/test/memory.idle_page_stats | grep cfei
       cfei   0   134217728   0   0   0   0   0   0
     # cat /cgroup/memory/memory.idle_page_stats | grep cfei
       cfei   0   134217728   0   0   0   0   0   0

   * cgroup1, mapped and locked pagecache

     # < create same file and memory cgroups as above >
     #
     # echo $$ > /cgroup/memory/test/cgroup.procs
     # < run program to mmap the whole created file and mlock the area >
     # < wait a few minutes >
     # cat /cgroup/memory/test/memory.idle_page_stats | grep cfui
       cfui   0   134217728   0   0   0   0   0   0
     # cat /cgroup/memory/memory.idle_page_stats | grep cfui
       cfui   0   134217728   0   0   0   0   0   0

   * cgroup1, anonymous and locked area

     # < create memory cgroups as above >
     #
     # echo $$ > /cgroup/memory/test/cgroup.procs
     # < run program to mmap anonymous area and mlock it >
     # < wait a few minutes >
     # cat /cgroup/memory/test/memory.idle_page_stats | grep csui
       csui   0   0   134217728   0   0   0   0   0
     # cat /cgroup/memory/memory.idle_page_stats | grep csui
       csui   0   0   134217728   0   0   0   0   0

   * Rerun the above test cases in cgroup2; the results show no exceptions.
     However, the cgroups are populated in a different way, as below:

     # mkdir -p /cgroup
     # mount -tcgroup2 none /cgroup
     # echo "+memory" > /cgroup/cgroup.subtree_control
     # mkdir -p /cgroup/test

Signed-off-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

fd952d8c

alios: mm: memcontrol: make distance between wmark_low and wmark_high configurable · 33ef4784

Committed by Yang Shi on Aug 17, 2019

Introduce a new interface, wmark_scale_factor, which defines the
distance between wmark_high and wmark_low.  The unit is in fractions of
10,000. The default value of 50 means the distance between wmark_high
and wmark_low is 0.5% of the max limit of the cgroup.  The maximum value
is 1000, or 10% of the max limit.
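
As a worked example of the arithmetic, here is a minimal Python sketch. The function name wmark_gap() is hypothetical, and the integer division and the lower bound of the accepted range are assumptions, since the message only states the default (50) and the maximum (1000).

```python
WMARK_SCALE_FACTOR_DEFAULT = 50
WMARK_SCALE_FACTOR_MAX = 1000


def wmark_gap(max_limit, scale_factor=WMARK_SCALE_FACTOR_DEFAULT):
    """Distance between wmark_high and wmark_low, in the unit of max_limit."""
    # Assumed lower bound: only the maximum of 1000 is stated in the text.
    if not 0 < scale_factor <= WMARK_SCALE_FACTOR_MAX:
        raise ValueError("wmark_scale_factor out of range")
    # The factor is expressed in fractions of 10,000.
    return max_limit * scale_factor // 10000
```

For a max limit of 10,000,000 units, the default factor yields a gap of 50,000 (0.5%) and the maximum factor yields 1,000,000 (10%).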

The distance between wmark_low and wmark_high has an impact on how hard
memcg kswapd reclaims.

Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>

33ef4784

alios: mm: vmscan: make memcg kswapd set memcg state to dirty or writeback · e10c247b

Committed by Yang Shi on Aug 02, 2019

The global kswapd could set the memory node to dirty or writeback if the
current scan finds that all pages are unqueued dirty or under writeback.
Then kswapd would write out dirty pages or wait for writeback to finish.
The memcg kswapd behaves like the global kswapd, so it should set the
dirty or writeback state on the memcg too if the same condition is met.

Since direct reclaim can't write out page cache, the system depends on
kswapd to write out dirty pages when a scan finds too many dirty pages,
in order to avoid premature OOM. But if page cache is dirtied too fast,
writing out pages definitely can't catch up with dirtying pages. It is
the responsibility of dirty page balancing to throttle dirtying pages.

Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>

e10c247b

alios: mm: memcontrol: treat memcg wmark reclaim work as kswapd · f7c87fa3

Committed by Yang Shi on Aug 02, 2019

Since background water mark reclaim is scheduled by a workqueue, it can
do more work than direct reclaim, e.g. write out dirty pages.

So, add the PF_KSWAPD flag so that current_is_kswapd() returns true for
memcg background reclaim.  The condition "current_is_kswapd() &&
!global_reclaim(sc)" is good enough to tell whether current is global
kswapd or memcg background reclaim.

And kswapd is not allowed to break memory.low protection for now, so
memcg kswapd should not break it either.

Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>

f7c87fa3

alios: mm: memcontrol: add background reclaim support for cgroupv2 · 256b5d94

Committed by Yang Shi on Aug 14, 2019

Like v1, add background reclaim support for cgroup v2. The interfaces
are exactly the same as in v1. However, if a high limit is set up for v2,
the water mark is calculated from the high limit instead of the max limit.

Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>

256b5d94

alios: mm: memcontrol: support background async page reclaim · 6b2ef082

Committed by Yang Shi on Aug 14, 2019

Currently, when memory usage exceeds the memory cgroup limit, the memory
cgroup can only do sync direct reclaim.  This may incur unexpected stalls
on some applications which are sensitive to latency.  Introduce a
background async page reclaim mechanism, like what kswapd does.

Define the memcg memory usage water mark by introducing the wmark_ratio
interface, which ranges from 0 to 100 and represents a percentage of the
max limit.  The wmark_high is calculated by (max * wmark_ratio / 100), and
the wmark_low is (wmark_high - (wmark_high >> 8)), which is an empirical
value.  If wmark_ratio is 0, the water mark is disabled and both wmark_low
and wmark_high are max, which is the default.

If wmark_ratio is set, then when charging a page, if usage is greater than
wmark_high, which means the available memory of the memcg is low, a work
item is scheduled to do background page reclaim until memory usage is
reduced to wmark_low if possible.
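
The water mark arithmetic and the resulting hysteresis can be sketched as follows. This is a Python model under stated assumptions, not the kernel code: memcg_watermarks() and background_reclaim() are hypothetical names, integer division is assumed, and the reclaim loop assumes every pass succeeds in freeing up to one batch.

```python
def memcg_watermarks(max_limit, wmark_ratio):
    """Return (wmark_low, wmark_high) for a memcg, in the unit of max_limit."""
    if not 0 <= wmark_ratio <= 100:
        raise ValueError("wmark_ratio must be within 0..100")
    if wmark_ratio == 0:
        # Water mark disabled: both marks sit at the max limit.
        return max_limit, max_limit
    high = max_limit * wmark_ratio // 100
    # Parenthesized on purpose: '-' binds tighter than '>>'.
    low = high - (high >> 8)
    return low, high


def background_reclaim(usage, low, high, batch=32):
    """Model the scheduled work item; returns the usage it leaves behind."""
    if usage <= high:
        return usage  # the charge path schedules nothing
    while usage > low:
        # Assumes every pass manages to reclaim up to one batch.
        usage -= min(batch, usage - low)
    return usage
```

With a max limit of 25600 pages and wmark_ratio set to 50, wmark_high is 12800 and wmark_low is 12750, so background reclaim starts once usage exceeds 12800 and stops at 12750.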

Define a dedicated unbound workqueue for scheduling water mark reclaim
work items.

Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>

6b2ef082

alios: mm: vmscan: make it sane reclaim if cgwb_v1 is enabled · 76e0403d

Committed by Yang Shi on Aug 02, 2019

AliOS Cloud Kernel has cgroup writeback support for v1, so the reclaim can be
treated as sane reclaim if cgwb_v1 is enabled.

Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>

76e0403d

05 Dec, 2019 3 commits

iocost: rename weight to cost.weight to avoid conflict with cfq · 78e38d28

Committed by Jiufei Xue on Dec 05, 2019

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

78e38d28

ovl: implement async IO routines · b7706b3d

Committed by Jiufei Xue on Nov 14, 2019

A performance regression has been observed since Linux v4.19 when running
an aio test using fio with iodepth 128 on overlayfs. We found that the
queue depth of the device is always 1, which is unexpected.

After investigation, it is found that commit 16914e6f
("ovl: add ovl_read_iter()") and commit 2a92e07e
("ovl: add ovl_write_iter()") use do_iter_readv_writev() to submit
requests to real filesystem. Async IOs are converted to sync IOs here
and cause performance regression.

So implement async IO for stacked reading and writing.

Changes since v1:
  - add a cleanup helper for completion/error handling
  - handle the case when aio_req allocation failed

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

b7706b3d

vfs: add vfs_iocb_iter_[read|write] helper functions · 7ff6623e

Committed by Jiufei Xue on Nov 14, 2019

This doesn't cause any behavior changes and will be used by the overlay
async IO implementation.

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

7ff6623e

29 Nov, 2019 3 commits

alios: mm, memcg: fix possible soft lockup in try_charge · 1f6142a0

Committed by Xu Yu on Nov 26, 2019

When events such as direct reclaim and OOM occur intensively, a soft
lockup is very likely to happen on instances with 1 vCPU and kernel
preemption disabled.

An example soft lockup is as follows.

[  160.555984] watchdog: BUG: soft lockup - CPU#0 stuck for 112s! [malloc:2188]
[  160.557975] Modules linked in: button
[  160.559495] CPU: 0 PID: 2188 Comm: malloc Not tainted 4.19.57-15.457.al7.x86_64 #1
[  160.561546] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014
[  160.563707] RIP: 0010:shrink_node+0x1ae/0x450
[  160.565391] Code: 00 00 00 49 8b 4f 20 ba 01 00 00 00 4c 8b 74 24 10 4d 8b 47 28 49 8b 77 10 48 2b 4c 24 08 41 8b 7f 1c 4d8
[  160.570747] RSP: 0000:ffff9d0ec07a3b58 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
[  160.572889] RAX: ffff982ab6014330 RBX: ffff982ab6014000 RCX: 0000000000000000
[  160.574992] RDX: 0000000000000001 RSI: ffff982ab6014000 RDI: ffff982ab6014000
[  160.577106] RBP: ffff982afffb6000 R08: 0000000000000000 R09: ffff982ab6014000
[  160.579219] R10: 0000000000000004 R11: 0000000000aaaaaa R12: 0000000000000000
[  160.581326] R13: 0000000000000000 R14: 0000000000000000 R15: ffff9d0ec07a3c50
[  160.583450] FS:  00007f8b414f7740(0000) GS:ffff982afda00000(0000) knlGS:0000000000000000
[  160.585704] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  160.587662] CR2: 00007f8adb800010 CR3: 000000007ac9e001 CR4: 00000000003606b0
[  160.589835] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  160.591971] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  160.594133] Call Trace:
[  160.595602]  do_try_to_free_pages+0xcc/0x390
[  160.597356]  try_to_free_mem_cgroup_pages+0xf9/0x1d0
[  160.599198]  ? out_of_memory+0xb5/0x4a0
[  160.600882]  try_charge+0x244/0x750
[  160.602522]  ? __pagevec_lru_add_fn+0x1d0/0x330
[  160.604310]  mem_cgroup_try_charge+0xb4/0x1d0
[  160.606085]  mem_cgroup_try_charge_delay+0x1c/0x40
[  160.607892]  do_anonymous_page+0xf7/0x540
[  160.609574]  __handle_mm_fault+0x665/0xa00
[  160.611233]  ? __switch_to_asm+0x35/0x70
[  160.612838]  handle_mm_fault+0x122/0x1e0
[  160.614407]  __do_page_fault+0x1b7/0x470
[  160.615962]  do_page_fault+0x32/0x140
[  160.617474]  ? async_page_fault+0x8/0x30
[  160.619012]  async_page_fault+0x1e/0x30
[  160.620526] RIP: 0033:0x40068e

Fix it by adding cond_resched() in try_charge(), just before goto retry
after OOM_SUCCESS, in order to let OOM free some memory first.

Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

1f6142a0

iocost: add ioc_gq stat · 86068d0f

Committed by Jiufei Xue on Nov 26, 2019

Add a stat file to monitor the ioc_gq stat.

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

86068d0f

dm thin: wakeup worker only when deferred bios exist · 6a2b7b88

Committed by Jeffle Xu on Nov 18, 2019

commit d256d796279de0bdc227ff4daef565aa7e80c898 upstream.

Single thread fio test (read, bs=4k, ioengine=libaio, iodepth=128,
numjobs=1) over dm-thin device has poor performance versus bare nvme
device.

Further investigation with perf indicates that queue_work_on() consumes
over 20% CPU time when doing IO over dm-thin device. The call stack is
as follows.

- 40.57% thin_map
    + 22.07% queue_work_on
    + 9.95% dm_thin_find_block
    + 2.80% cell_defer_no_holder
      1.91% inc_all_io_entry.isra.33.part.34
    + 1.78% bio_detain.isra.35

In cell_defer_no_holder(), wake_worker() is always called, no matter
whether tc->deferred_bio_list is empty or not. In the single thread IO
model, this list is most likely empty. So skip waking up the worker
thread if tc->deferred_bio_list is empty.
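
The idea can be modeled in a few lines of Python. This is only a userspace sketch, not the dm-thin code: ThinPoolModel and its wakeups counter are hypothetical, and the locking around the list is left out.

```python
class ThinPoolModel:
    """Toy stand-in for a thin device and its pool worker."""

    def __init__(self):
        self.deferred_bio_list = []  # models tc->deferred_bio_list
        self.wakeups = 0             # how often the worker was kicked

    def wake_worker(self):
        self.wakeups += 1

    def cell_defer_no_holder(self, released_bios):
        # Bios released from the cell are moved onto the deferred list.
        self.deferred_bio_list.extend(released_bios)
        # Only kick the worker when it actually has something to process.
        if self.deferred_bio_list:
            self.wake_worker()
```

Calling cell_defer_no_holder([]) on a fresh model leaves wakeups at 0, which is the common case the commit optimizes for; passing a non-empty list still wakes the worker.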

Single thread IO performance improves from 448 MiB/s to 646 MiB/s (+44%)
once the needless wake_worker() calls are properly skipped.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>

6a2b7b88

28 Nov, 2019 15 commits

alios: blk-throttle: limit bios to fix amount of pages entering writeback prematurely · 06a67773

Committed by Xiaoguang Wang on Dec 28, 2018

Currently in blk_throtl_bio(), if a bio exceeds its throtl_grp's bps or
iops limit, the bio is queued on the throtl_grp's throtl_service_queue.
The mm subsystem then keeps submitting more pages even though the
underlying device can not handle these io requests.  This makes a large
amount of pages enter writeback prematurely, and if some process later
writes to some of these pages, it has to wait for a long time.

I have done some tests: one process does buffered writes on a 1GB file,
with its blkcg max bps limit set to 10MB/s.  I observe this:
	#cat /proc/meminfo  | grep -i back
	Writeback:        900024 kB
	WritebackTmp:          0 kB

I think this Writeback value is just too big.  Indeed, many bios have been
queued in the throtl_grp's throtl_service_queue.  If a process tries to
write the page of the last bio in this queue, it calls
wait_on_page_writeback(page), which must wait for the previous bios to
finish and takes a long time.  We have also seen 120s hung task warnings
on our servers.

 INFO: task kworker/u128:0:30072 blocked for more than 120 seconds.
       Tainted: G            E 4.9.147-013.ali3000_015_test.alios7.x86_64 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 kworker/u128:0  D    0 30072      2 0x00000000
 Workqueue: writeback wb_workfn (flush-8:16)
  ffff882ddd066b40 0000000000000000 ffff882e5cad3400 ffff882fbe959e80
  ffff882fa50b1a00 ffffc9003a5a3768 ffffffff8173325d ffffc9003a5a3780
  00ff882e5cad3400 ffff882fbe959e80 ffffffff81360b49 ffff882e5cad3400
 Call Trace:
  [<ffffffff8173325d>] ? __schedule+0x23d/0x6d0
  [<ffffffff81360b49>] ? alloc_request_struct+0x19/0x20
  [<ffffffff81733726>] schedule+0x36/0x80
  [<ffffffff81736c56>] schedule_timeout+0x206/0x4b0
  [<ffffffff81036c69>] ? sched_clock+0x9/0x10
  [<ffffffff81363073>] ? get_request+0x403/0x810
  [<ffffffff8110ca10>] ? ktime_get+0x40/0xb0
  [<ffffffff81732f8a>] io_schedule_timeout+0xda/0x170
  [<ffffffff81733f90>] ? bit_wait+0x60/0x60
  [<ffffffff81733fab>] bit_wait_io+0x1b/0x60
  [<ffffffff81733b28>] __wait_on_bit+0x58/0x90
  [<ffffffff811b0d91>] ? find_get_pages_tag+0x161/0x2e0
  [<ffffffff811aff62>] wait_on_page_bit+0x82/0xa0
  [<ffffffff810d47f0>] ? wake_atomic_t_function+0x60/0x60
  [<ffffffffa02fc181>] mpage_prepare_extent_to_map+0x2d1/0x310 [ext4]
  [<ffffffff8121ff65>] ? kmem_cache_alloc+0x185/0x1a0
  [<ffffffffa0305a2f>] ? ext4_init_io_end+0x1f/0x40 [ext4]
  [<ffffffffa0300294>] ext4_writepages+0x404/0xef0 [ext4]
  [<ffffffff81508c64>] ? scsi_init_io+0x44/0x200
  [<ffffffff81398a0f>] ? fprop_fraction_percpu+0x2f/0x80
  [<ffffffff811c139e>] do_writepages+0x1e/0x30
  [<ffffffff8127c0f5>] __writeback_single_inode+0x45/0x320
  [<ffffffff8127c942>] writeback_sb_inodes+0x272/0x600
  [<ffffffff8127cf6b>] wb_writeback+0x10b/0x300
  [<ffffffff8127d884>] wb_workfn+0xb4/0x380
  [<ffffffff810b85e9>] ? try_to_wake_up+0x59/0x3e0
  [<ffffffff810a5759>] process_one_work+0x189/0x420
  [<ffffffff810a5a3e>] worker_thread+0x4e/0x4b0
  [<ffffffff810a59f0>] ? process_one_work+0x420/0x420
  [<ffffffff810ac026>] kthread+0xe6/0x100
  [<ffffffff810abf40>] ? kthread_park+0x60/0x60
  [<ffffffff81738499>] ret_from_fork+0x39/0x50

To fix this issue, we can simply limit throtl_service_queue's max queued
bios.  Currently we limit it to the throtl_grp's bps_limit or iops limit;
if it still exceeds the limit, we just sleep for a while.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

06a67773

alios: block-throttle: add counters for completed io · 6bb5d410

Committed by Jiufei Xue on Apr 10, 2018

Now we have counters for wait_time and service_time, but none for
completed ios, so the average latency can not be measured.

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

6bb5d410

alios: block-throttle: code cleanup · 4a7c0663

Committed by Jiufei Xue on Apr 10, 2018

This patch cleans up the code, because the seq_show handlers for the tg
counters are the same. No functional changes.

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

4a7c0663

alios: blk-throttle: add throttled io/bytes counter · eeb720d8

Committed by Joseph Qi on Mar 08, 2018

Add 2 interfaces to stat io throttle information:
  blkio.throttle.total_io_queued
  blkio.throttle.total_bytes_queued

These interfaces are used for monitoring throttled io/bytes and
analyzing whether delays are related to io throttling.

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

eeb720d8

alios: blk-throttle: fix tg NULL pointer dereference · 4667e926

Committed by Joseph Qi on Dec 07, 2017

io throtl stats call blkg_get at the beginning of throttling and then
blkg_put in the newly introduced bi_tg_end_io. This causes the blkg to be
freed if end_io is called twice, as with dm-thin, which saves the original
end_io first, then calls its overriding end_io and then the saved end_io.
After that, accessing the blkg is invalid and finally hits a BUG:

[ 4417.235048] BUG: unable to handle kernel NULL pointer dereference at 00000000000001e0
[ 4417.236475] IP: [<ffffffff812e7c71>] throtl_update_dispatch_stats+0x21/0xb0
[ 4417.237865] PGD 98395067 PUD 362e1067 PMD 0
[ 4417.239232] Oops: 0000 [#1] SMP
......
[ 4417.274070] Call Trace:
[ 4417.275407]  [<ffffffff812ea93d>] blk_throtl_bio+0xfd/0x630
[ 4417.276760]  [<ffffffff810b3613>] ? wake_up_process+0x23/0x40
[ 4417.278079]  [<ffffffff81094c04>] ? wake_up_worker+0x24/0x30
[ 4417.279387]  [<ffffffff81095772>] ? insert_work+0x62/0xa0
[ 4417.280697]  [<ffffffff8116c2c7>] ? mempool_free_slab+0x17/0x20
[ 4417.282019]  [<ffffffff8116c6c9>] ? mempool_free+0x49/0x90
[ 4417.283326]  [<ffffffff812c9acf>] generic_make_request_checks+0x16f/0x360
[ 4417.284637]  [<ffffffffa0340d97>] ? thin_map+0x227/0x2c0 [dm_thin_pool]
[ 4417.285951]  [<ffffffff812c9ce7>] generic_make_request+0x27/0x130
[ 4417.287240]  [<ffffffffa0230b3d>] __map_bio+0xad/0x100 [dm_mod]
[ 4417.288503]  [<ffffffffa023257e>] __clone_and_map_data_bio+0x15e/0x240 [dm_mod]
[ 4417.289778]  [<ffffffffa02329ea>] __split_and_process_bio+0x38a/0x500 [dm_mod]
[ 4417.291062]  [<ffffffffa0232c91>] dm_make_request+0x131/0x1a0 [dm_mod]
[ 4417.292344]  [<ffffffff812c9da2>] generic_make_request+0xe2/0x130
[ 4417.293626]  [<ffffffff812c9e61>] submit_bio+0x71/0x150
[ 4417.294909]  [<ffffffff8121ab1d>] ? bio_alloc_bioset+0x20d/0x360
[ 4417.296195]  [<ffffffff81215acb>] _submit_bh+0x14b/0x220
[ 4417.297484]  [<ffffffff81215bb0>] submit_bh+0x10/0x20
[ 4417.298744]  [<ffffffffa016d8d8>] jbd2_journal_commit_transaction+0x6c8/0x19a0 [jbd2]
[ 4417.300014]  [<ffffffff810135b8>] ? __switch_to+0xf8/0x4c0
[ 4417.301268]  [<ffffffffa01731e9>] kjournald2+0xc9/0x270 [jbd2]
[ 4417.302524]  [<ffffffff810a0fd0>] ? wake_up_atomic_t+0x30/0x30
[ 4417.303753]  [<ffffffffa0173120>] ? commit_timeout+0x10/0x10 [jbd2]
[ 4417.304950]  [<ffffffff8109ffef>] kthread+0xcf/0xe0
[ 4417.306107]  [<ffffffff8109ff20>] ? kthread_create_on_node+0x140/0x140
[ 4417.307255]  [<ffffffff81647f18>] ret_from_fork+0x58/0x90
[ 4417.308349]  [<ffffffff8109ff20>] ? kthread_create_on_node+0x140/0x140
......

Now we introduce a new bio flag, BIO_THROTL_STATED, to make sure
blkg_get/put is called only once for the same bio.

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

4667e926
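
The BIO_THROTL_STATED idea in 4667e926 can be illustrated with a small user-space model. This is only a sketch under assumptions, not the kernel code: `model_blkg`, `model_bio`, `throtl_stat_start` and `throtl_stat_end` are hypothetical names, and a boolean stands in for the bio flag. It shows how a per-bio flag keeps the get/put pair balanced even when the completion path runs twice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's blkg and bio; not the real structs. */
struct model_blkg {
	int refcnt;
};

struct model_bio {
	struct model_blkg *blkg;
	bool throtl_stated;	/* models the BIO_THROTL_STATED flag */
};

/* Take the stats reference at most once per bio. */
static void throtl_stat_start(struct model_bio *bio, struct model_blkg *blkg)
{
	assert(bio != NULL && blkg != NULL);
	if (bio->throtl_stated)
		return;		/* this bio is already accounted */
	blkg->refcnt++;		/* models blkg_get() */
	bio->blkg = blkg;
	bio->throtl_stated = true;
}

/* Drop the reference at most once, even if end_io runs twice. */
static void throtl_stat_end(struct model_bio *bio)
{
	assert(bio != NULL);
	if (!bio->throtl_stated)
		return;		/* second end_io call: nothing left to put */
	bio->throtl_stated = false;
	bio->blkg->refcnt--;	/* models blkg_put() */
	bio->blkg = NULL;
}
```

Calling `throtl_stat_end()` twice on the same bio drops the reference only once, which is the double-put that the dm-thin case would otherwise trigger.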

alios: blk-throttle: support io delay stats · 65e6966a

Committed by Joseph Qi on Dec 19, 2017

Add blkio.throttle.io_service_time and blkio.throttle.io_wait_time to
get per-cgroup io delay statistics.
io_service_time represents the time spent after io throttle to io
completion, while io_wait_time represents the time spent on throttle
queue.

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

65e6966a
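
The two statistics added in 65e6966a can be pictured as differences between three timestamps on a bio. The sketch below is an assumption-laden illustration rather than the patch itself: the struct and its field names are hypothetical, and nanosecond units are chosen only for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-bio timestamps in nanoseconds; names are illustrative. */
struct io_timestamps {
	uint64_t throttle_enter_ns;	/* bio enters the throttle layer */
	uint64_t throttle_exit_ns;	/* bio leaves the throttle queue */
	uint64_t complete_ns;		/* bio completion */
};

/* io_wait_time: time spent on the throttle queue. */
static uint64_t io_wait_time_ns(const struct io_timestamps *t)
{
	assert(t->throttle_exit_ns >= t->throttle_enter_ns);
	return t->throttle_exit_ns - t->throttle_enter_ns;
}

/* io_service_time: time from leaving the throttle layer to completion. */
static uint64_t io_service_time_ns(const struct io_timestamps *t)
{
	assert(t->complete_ns >= t->throttle_exit_ns);
	return t->complete_ns - t->throttle_exit_ns;
}
```

A per-cgroup counter would then accumulate these two values for every completed bio of that cgroup.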

alios: nvme-pci: Disable discard zero-out functionality on Intel's P3600 NVMe disk drive · d79c6eda

Committed by Wenwei Tao on Dec 13, 2017

We found a huge performance loss on the Intel disk drive below when the
discard zero-out functionality is enabled on it. The issue was found
with an ext4 filesystem mounted on the drive while running regular FIO
tests. With the functionality disabled, we no longer observe the
performance loss.

81:00.0 Non-Volatile memory controller: Intel Corporation \
             PCIe Data Center SSD (rev 01)

This requires disabling the discard zero-out functionality on the above
disk drive in order to regain the high performance that an NVMe disk
is supposed to provide.

Differential Revision: https://aone.alibaba-inc.com/code/D377540
Signed-off-by: Wenwei Tao <wenwei.tao@linux.alibaba.com>
Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

d79c6eda

alios: memcg: Point wb to root memcg/blkcg when offlining to avoid zombie · 8514dbc7

Committed by Xunlei Pang on Apr 03, 2019

After turning off memcg kmem charging, we still suffer from various
zombie memcg problems in the production environment, because the memcg
keeps a non-zero reference count held by both page caches and the
per-memcg writeback structure (bdi_writeback takes a reference).

After we reclaimed all the page caches of the zombie memcg,
it still can't be dropped due to its bdi_writeback.

bdi_writeback is further referenced by the inodes of files, so the
memcg can't be truly released until those inodes are destroyed, which
is quite unlikely in the short term.

When a memcg goes offline, switch its bdi_writeback to the root memcg
and call css_put to formally release it. We've tested this in the
production environment and it works well.

Ditto for wb_blkcg_offline().

Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

8514dbc7

alios: block: add counter to track io request's d2c time · 07232d74

Committed by Xiaoguang Wang on Jun 19, 2019

iostat's await is not good enough: it is somewhat coarse and cannot
show a request's latency on the device driver's side.

Here we add a new counter to track an io request's d2c time. With this
patch, iostat can easily be extended to show this value.

Note:
I checked how iostat is implemented: it only reads the fields it needs,
so iostat won't be affected by this change, and neither will tsar.

Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

07232d74

alios: fuse: add sysfs api to flush processing queue requests · 2292db66

Committed by Ma Jie Yue on Mar 26, 2019

When a fuse userspace daemon fails over, it reuses the existing fuse
conn without unmounting it during the crash and recovery procedure.
But when the crash happens, some requests may already be in process in
the daemon without a reply having been sent. The application then gets
stuck, since it will never receive the reply after the failover.

We add a sysfs api to flush these requests after the daemon crashes and
before recovery. The issue is easy to reproduce in the fuse userspace
daemon: just exit after receiving a request and before sending the
reply back. The application will hang in some read/write operation
until echo 1 > /sys/fs/fuse/connection/xxx/flush is run. The flush
operation makes the io fail and returns the error to the application.

Signed-off-by: Ma Jie Yue <majieyue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

2292db66

alios: jbd2: add proc entry to control whether doing buffer copy-out · ac452d09

Committed by Xiaoguang Wang on Nov 15, 2018

When jbd2 tries to get write access to a buffer that is under writeback
with the BH_Shadow flag set, it waits until the buffer has been written
to disk. Sometimes this wait can be very long, especially when the disk
is almost full.

Add a proc entry "force_copy". If its value is not zero, jbd2 always
does meta buffer copy-out, which eliminates the unnecessary waiting
time here and reduces long tail latency for buffered writes.

I constructed the test case below:

$cat offline.fio
; fio-rand-RW.job for fiotest

[global]
name=fio-rand-RW
filename=fio-rand-RW
rw=randrw
rwmixread=60
rwmixwrite=40
bs=4K
direct=0
numjobs=4
time_based=1
runtime=900

[file1]
size=60G
ioengine=sync
iodepth=16

$cat online.fio
; fio-seq-write.job for fiotest

[global]
name=fio-seq-write
filename=fio-seq-write
rw=write
bs=256K
direct=0
numjobs=1
time_based=1
runtime=60

[file1]
rate=50m
size=10G
ioengine=sync
iodepth=16

With this patch:
$cat /proc/fs/jbd2/sda5-8/force_copy
0

the online fio job almost always shows a long tail latency like this:

Jobs: 1 (f=1), 0B/s-0B/s: [W(1)][100.0%][w=50.0MiB/s][w=200 IOPS][eta 00m:00s]
file1: (groupid=0, jobs=1): err= 0: pid=17855: Thu Nov 15 09:45:57 2018
  write: IOPS=200, BW=50.0MiB/s (52.4MB/s)(3000MiB/60001msec)
    clat (usec): min=135, max=4086.6k, avg=867.21, stdev=50338.22
     lat (usec): min=139, max=4086.6k, avg=871.16, stdev=50338.22
    clat percentiles (usec):
     |  1.00th=[    141],  5.00th=[    143], 10.00th=[    145],
     | 20.00th=[    147], 30.00th=[    147], 40.00th=[    149],
     | 50.00th=[    149], 60.00th=[    151], 70.00th=[    153],
     | 80.00th=[    155], 90.00th=[    159], 95.00th=[    163],
     | 99.00th=[    255], 99.50th=[    273], 99.90th=[    429],
     | 99.95th=[    441], 99.99th=[3640656]

$cat /proc/fs/jbd2/sda5-8/force_copy
1

the online fio job's latency is much better:

Jobs: 1 (f=1), 0B/s-0B/s: [W(1)][100.0%][w=50.0MiB/s][w=200 IOPS][eta 00m:00s]
file1: (groupid=0, jobs=1): err= 0: pid=8084: Thu Nov 15 09:31:15 2018
  write: IOPS=200, BW=50.0MiB/s (52.4MB/s)(3000MiB/60001msec)
    clat (usec): min=137, max=545, avg=151.35, stdev=16.22
     lat (usec): min=140, max=548, avg=155.31, stdev=16.65
    clat percentiles (usec):
     |  1.00th=[  143],  5.00th=[  145], 10.00th=[  145], 20.00th=[  147],
     | 30.00th=[  147], 40.00th=[  147], 50.00th=[  149], 60.00th=[  149],
     | 70.00th=[  151], 80.00th=[  155], 90.00th=[  157], 95.00th=[  161],
     | 99.00th=[  239], 99.50th=[  269], 99.90th=[  420], 99.95th=[  429],
     | 99.99th=[  537]

As to the cost: because we always need to copy the meta buffer, this
consumes a little cpu time and some memory (at most 32MB for a 128MB
journal).

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

ac452d09

alios: ext4: don't submit unwritten extent while holding active jbd2 handle · a8366d32

Committed by Xiaoguang Wang on Dec 25, 2018

In ext4_writepages(), each iteration of mpage_prepare_extent_to_map()
tries to find 2048 pages to map, and normally one bio can contain at
most 256 pages. If we really find 2048 pages to map, there will be 8
bios (2048 / 256) and 8 ext4_io_submit() calls, made from both
ext4_writepages() and mpage_map_and_submit_extent().

But note that mpage_map_and_submit_extent() holds a valid jbd2 handle.
When dioread_nolock is enabled and the extent is unwritten, the jbd2
commit thread waits for this handle to finish, which means it also
waits for the unwritten extent to be written to disk. This introduces
unnecessary stall time, which is even longer when the writeback
operation is io throttled, so the issue needs to be fixed.

For this case, we accumulate bios in ext4_io_submit's io_bio and only
submit them after dropping the jbd2 handle.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

a8366d32
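
The deferral described in a8366d32 can be modelled in user space as "queue while the handle is held, issue after it is dropped". The sketch below is a simplified illustration under assumptions and not the ext4 code: `model_submit`, `model_io_submit` and `model_stop_handle_and_flush` are hypothetical names, and a fixed-size array stands in for the accumulated bio list.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PENDING 16

/* Hypothetical submission state; not the real struct ext4_io_submit. */
struct model_submit {
	int pending[MAX_PENDING];	/* ids of bios waiting to be issued */
	size_t nr_pending;
	size_t nr_submitted;
	int handle_held;		/* nonzero while the journal handle is active */
};

/* Queue a bio while the handle is held; issue it directly otherwise. */
static int model_io_submit(struct model_submit *s, int bio_id)
{
	assert(s != NULL);
	if (!s->handle_held) {
		s->nr_submitted++;
		return 0;
	}
	if (s->nr_pending == MAX_PENDING)
		return -1;		/* caller must deal with a full list */
	s->pending[s->nr_pending++] = bio_id;
	return 0;
}

/* Drop the handle first, then issue everything that was queued. */
static void model_stop_handle_and_flush(struct model_submit *s)
{
	assert(s != NULL);
	s->handle_held = 0;
	s->nr_submitted += s->nr_pending;
	s->nr_pending = 0;
}
```

In this model no io is issued while `handle_held` is set, so a thread waiting for the handle never has to wait for the io itself.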

alios: fs,ext4: remove projid limit when create hard link · 28df06b3

Committed by zhangliguang on Dec 27, 2018

This is a temporary workaround to avoid the limitation when creating a
hard link across two projids.

Signed-off-by: zhangliguang <zhangliguang@linux.alibaba.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

28df06b3

alios: jbd2: add new "stats" proc file · 3550da0c

Committed by Xiaoguang Wang on Jun 18, 2019

/proc/fs/jbd2/${device}/info only shows overall average statistics
over jbd2's whole life cycle. It cannot show jbd2 info for a specified
time interval, a capability that is sometimes very useful for
troubleshooting. For example, we cannot see how rs_locked and
rs_flushing grow in a specified time interval, yet these two indicators
can explain some of an app's behaviour.

Here we add a new "stats" proc file like /proc/diskstats, so that we can
implement a simple tool, jbd2_stats, which displays detailed jbd2 info
for a specified time interval, as shown below (time interval 5s):

[lege@localhost ~]$ cat /proc/fs/jbd2/vdb1-8/stats
51 30 8192 0 1 241616 0 0 22 0 47158 891 942 1000 1000

[lege@localhost ~]$ gcc -o jbd2_stat jbd2_stat.c ; ./jbd2_stat

Device              tid     trans   handles    locked  flushing   logging
vdb1-8             1861       158       359     13.00      0.00      2.00

Device              tid     trans   handles    locked  flushing   logging
vdb1-8             1974       113       389     26.00      0.00      5.00

Device              tid     trans   handles    locked  flushing   logging
vdb1-8             2188       214       308     10.00      0.00      7.00

Device              tid     trans   handles    locked  flushing   logging
vdb1-8             2344       156       332     19.00      0.00      4.00

Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

3550da0c
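
An interval tool such as the one mentioned in 3550da0c would typically sample the "stats" file twice and subtract. The sketch below is not the jbd2_stat.c from the commit, which is not shown here. It assumes only that the file is one line of whitespace-separated unsigned integers, as in the sample output above, and it treats the fields as opaque because their meaning is not documented in the message. `parse_counters` and `counter_delta` are hypothetical helper names.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_FIELDS 32

/* Parse whitespace-separated unsigned counters; returns how many were read. */
static size_t parse_counters(const char *line, unsigned long long *out,
			     size_t max)
{
	size_t n = 0;
	char *end;

	assert(line != NULL && out != NULL);
	while (n < max) {
		unsigned long long v = strtoull(line, &end, 10);

		if (end == line)
			break;	/* no further digits on this line */
		out[n++] = v;
		line = end;
	}
	return n;
}

/*
 * delta[i] = cur[i] - prev[i]. Only meaningful for fields that are
 * monotonically increasing counters; other fields should be read directly.
 */
static void counter_delta(const unsigned long long *prev,
			  const unsigned long long *cur,
			  unsigned long long *delta, size_t n)
{
	for (size_t i = 0; i < n; i++)
		delta[i] = cur[i] - prev[i];
}
```

A caller would read the file, sleep for the interval, read it again, and print the per-field differences.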

alios: jbd2: create jbd2-ckpt thread for journal checkpoint · c31b17e5

Committed by Joseph Qi on Mar 07, 2018

Do the jbd2 checkpoint in a dedicated kernel thread, so that
checkpointing is not subject to io throttle control.

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: zhangliguang <zhangliguang@linux.alibaba.com>
Reviewed-by: Baoyou Xie <baoyou.xie@linux.alibaba.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

c31b17e5

21 Nov, 2019 1 commit

mm, memcg: add missing memory stall section in mem_cgroup_handle_over_high · fc2036b9

Committed by Caspar Zhang on Sep 23, 2019

When backporting commit 0e4b01df8659 ("mm, memcg: throttle allocators
when failing reclaim over memory.high"), the memory stall section was
inadvertently left out. Fix this by adding it back.

Fixes: eda29cc0 ("mm, memcg: throttle allocators when failing reclaim over memory.high")
Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>

fc2036b9

20 Nov, 2019 5 commits

mm: thp: handle page cache THP correctly in PageTransCompoundMap · c9f8166a

Committed by Yang Shi on Nov 08, 2019

commit 169226f7e0d275c1879551f37484ef6683579a5c upstream

We have a use case that uses tmpfs as the QEMU memory backend, and we
would like to take advantage of THP as well.  But our tests show that
the EPT is not PMD mapped even though the underlying THPs are PMD mapped
on the host.  The number shown by /sys/kernel/debug/kvm/largepages is
much lower than the number of PMD mapped shmem pages, as shown below:

  7f2778200000-7f2878200000 rw-s 00000000 00:14 262232 /dev/shm/qemu_back_mem.mem.Hz2hSf (deleted)
  Size:            4194304 kB
  [snip]
  AnonHugePages:         0 kB
  ShmemPmdMapped:   579584 kB
  [snip]
  Locked:                0 kB

  cat /sys/kernel/debug/kvm/largepages
  12

And some benchmarks do worse than with anonymous THPs.

By digging into the code we figured out that commit 127393fb ("mm:
thp: kvm: fix memory corruption in KVM with THP enabled") checks if
there is a single PTE mapping on the page for anonymous THP when setting
up EPT map.  But the _mapcount < 0 check doesn't work for page cache THP
since every subpage of page cache THP would get _mapcount inc'ed once it
is PMD mapped, so PageTransCompoundMap() always returns false for page
cache THP.  This would prevent KVM from setting up PMD mapped EPT entry.

So we need to handle page cache THP correctly.  However, when a page
cache THP's PMD gets split, the kernel just removes the map instead of
setting up a PTE map as it does for anonymous THP.  Before KVM calls
get_user_pages() the subpages may get PTE mapped even though the page is
still a THP, since the page cache THP may be mapped by other processes
in the meantime.

So check both its _mapcount and whether the THP is PTE mapped.
Although this may report some false negatives (PTE mapped by other
processes), it does not look trivial to make this accurate.

With this fix, /sys/kernel/debug/kvm/largepages shows a reasonable
number of pages PMD mapped by EPT, as below:

  7fbeaee00000-7fbfaee00000 rw-s 00000000 00:14 275464 /dev/shm/qemu_back_mem.mem.SKUvat (deleted)
  Size:            4194304 kB
  [snip]
  AnonHugePages:         0 kB
  ShmemPmdMapped:   557056 kB
  [snip]
  Locked:                0 kB

  cat /sys/kernel/debug/kvm/largepages
  271

And the benchmarks perform the same as with anonymous THPs.

[yang.shi@linux.alibaba.com: v4]
  Link: http://lkml.kernel.org/r/1571865575-42913-1-git-send-email-yang.shi@linux.alibaba.com
Link: http://lkml.kernel.org/r/1571769577-89735-1-git-send-email-yang.shi@linux.alibaba.com
Fixes: dd78fedd ("rmap: support file thp")
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reported-by: Gang Deng <gavin.dg@linux.alibaba.com>
Tested-by: Gang Deng <gavin.dg@linux.alibaba.com>
Suggested-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>    [4.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

c9f8166a
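
The check described in c9f8166a can be modelled in user space as follows. This is a sketch under assumptions, not the kernel helper itself: `model_subpage` is a hypothetical flattened view of one subpage, and both counts follow the kernel convention of starting at -1 when there are no mappings. It shows why the `_mapcount < 0` test is kept for anonymous THP while page cache THP compares the subpage count with the compound count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flattened view of one THP subpage; counts start at -1. */
struct model_subpage {
	bool is_thp;
	bool is_anon;
	int mapcount;		/* the subpage's _mapcount */
	int compound_mapcount;	/* the head page's compound mapcount */
};

/* Returns true when the subpage may be backed by a PMD-sized EPT entry. */
static bool model_trans_compound_map(const struct model_subpage *p)
{
	assert(p != NULL);
	if (!p->is_thp)
		return false;
	if (p->is_anon)
		return p->mapcount < 0;	/* no PTE mapping of this subpage */
	/*
	 * Page cache THP: every PMD mapping also bumps each subpage's
	 * count, so the two counts differ only if a PTE mapping exists.
	 */
	return p->mapcount == p->compound_mapcount;
}
```

In this model a PTE mapping of the same subpage by another process makes the counts differ, which corresponds to the false negative case acknowledged in the message.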

ICX: perf/x86/intel: Fix invalid Bit 13 for Icelake MSR_OFFCORE_RSP_x register · c154e184

Committed by Yunying Sun on Jul 24, 2019

commit 3b238a64c3009fed36eaea1af629d9377759d87d upstream.

The Intel SDM states that bit 13 of Icelake's MSR_OFFCORE_RSP_x
register is valid, and used for counting hardware generated prefetches
of L3 cache. Update the bitmask to allow bit 13.

Before:
$ perf stat -e cpu/event=0xb7,umask=0x1,config1=0x1bfff/u sleep 3
 Performance counter stats for 'sleep 3':
   <not supported>      cpu/event=0xb7,umask=0x1,config1=0x1bfff/u

After:
$ perf stat -e cpu/event=0xb7,umask=0x1,config1=0x1bfff/u sleep 3
 Performance counter stats for 'sleep 3':
             9,293      cpu/event=0xb7,umask=0x1,config1=0x1bfff/u

Signed-off-by: Yunying Sun <yunying.sun@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: alexander.shishkin@linux.intel.com
Cc: bp@alien8.de
Cc: hpa@zytor.com
Cc: jolsa@redhat.com
Cc: namhyung@kernel.org
Link: https://lkml.kernel.org/r/20190724082932.12833-1-yunying.sun@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Lin Wang <lin.x.wang@intel.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>

c154e184
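
The effect of allowing bit 13 in c154e184 can be shown with a plain bitmask check. This is an illustrative sketch only: the mask constants below are assumptions chosen so that the "after" mask equals the "before" mask plus bit 13, and they should be checked against the kernel source before being relied on. `config1_is_valid` is a hypothetical helper. Only the config1 value 0x1bfff comes from the perf commands above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define OFFCORE_RSP_BIT13	(1ULL << 13)

/* Assumed masks for illustration; "after" is "before" with bit 13 added. */
#define VALID_MASK_BEFORE	0x3fffff9fffULL
#define VALID_MASK_AFTER	(VALID_MASK_BEFORE | OFFCORE_RSP_BIT13)

/* A config1 value is accepted only if it sets no bit outside the mask. */
static bool config1_is_valid(uint64_t config1, uint64_t valid_mask)
{
	assert(valid_mask != 0);
	return (config1 & ~valid_mask) == 0;
}
```

With these assumed masks, 0x1bfff is rejected before the change because it sets bit 13 and is accepted afterwards, matching the `<not supported>` and counted results shown in the message.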

ICX: perf/x86/intel: Add more Icelake CPUIDs · e4ed6f52

Committed by Kan Liang on Jun 03, 2019

commit faaeff98666c24376cebd0b106504d05a36881d1 upstream.

Add new model number for Icelake desktop and server to perf.

The data source encoding for Icelake server is the same as Skylake
server.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@alien8.de
Cc: qiuxu.zhuo@intel.com
Cc: rui.zhang@intel.com
Cc: tony.luck@intel.com
Link: https://lkml.kernel.org/r/20190603134122.13853-2-kan.liang@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Lin Wang <lin.x.wang@intel.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>

e4ed6f52

resource/docs: Complete kernel-doc style function documentation · 1de9c7c3

Committed by Borislav Petkov on Nov 05, 2018

commit f26621e60b35369bca9228bc936dc723b3e421af upstream.

Add the missing kernel-doc style function parameters documentation.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: linux-tip-commits@vger.kernel.org
Cc: rdunlap@infradead.org
Fixes: b69c2e20f6e4 ("resource: Clean it up a bit")
Link: http://lkml.kernel.org/r/20181105093307.GA12445@zn.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[joseph: fix find_next_iomem_res() documentation]
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>

1de9c7c3

resource/docs: Fix new kernel-doc warnings · 39cecf2f

Committed by Randy Dunlap on Nov 04, 2018

commit f75d651587f719a813ebbbfeee570e6570731d55 upstream.

The first group of warnings is caused by a "/**" kernel-doc notation
marker but the function comments are not in kernel-doc format.
Also add another error return value here.

  ../kernel/resource.c:337: warning: Function parameter or member 'start' not described in 'find_next_iomem_res'
  ../kernel/resource.c:337: warning: Function parameter or member 'end' not described in 'find_next_iomem_res'
  ../kernel/resource.c:337: warning: Function parameter or member 'flags' not described in 'find_next_iomem_res'
  ../kernel/resource.c:337: warning: Function parameter or member 'desc' not described in 'find_next_iomem_res'
  ../kernel/resource.c:337: warning: Function parameter or member 'first_lvl' not described in 'find_next_iomem_res'
  ../kernel/resource.c:337: warning: Function parameter or member 'res' not described in 'find_next_iomem_res'

Add the missing function parameter documentation for the other warnings:

  ../kernel/resource.c:409: warning: Function parameter or member 'arg' not described in 'walk_iomem_res_desc'
  ../kernel/resource.c:409: warning: Function parameter or member 'func' not described in 'walk_iomem_res_desc'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: b69c2e20f6e4 ("resource: Clean it up a bit")
Link: http://lkml.kernel.org/r/dda2e4d8-bedd-3167-20fe-8c7d2d35b354@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[joseph: fix find_next_iomem_res() documentation]
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>

39cecf2f
