1. 18 3月, 2020 40 次提交
    • M
      mm: introduce MADV_PAGEOUT · 23757dcc
      Minchan Kim 提交于
      commit 1a4e58cce84ee88129d5d49c064bd2852b481357 upstream
      
      When a process expects no accesses to a certain memory range for a long
      time, it could hint kernel that the pages can be reclaimed instantly but
      data should be preserved for future use.  This could reduce workingset
      eviction so it ends up increasing performance.
      
      This patch introduces the new MADV_PAGEOUT hint to madvise(2) syscall.
      MADV_PAGEOUT can be used by a process to mark a memory range as not
      expected to be used for a long time so that kernel reclaims *any LRU*
      pages instantly.  The hint can help kernel in deciding which pages to
      evict proactively.
      
      A note: It doesn't apply SWAP_CLUSTER_MAX LRU page isolation limit
      intentionally because it's automatically bounded by PMD size.  If PMD
      size(e.g., 256) makes some trouble, we could fix it later by limit it to
      SWAP_CLUSTER_MAX[1].
      
      - man-page material
      
      MADV_PAGEOUT (since Linux x.x)
      
      Do not expect access in the near future so pages in the specified
      regions could be reclaimed instantly regardless of memory pressure.
      Thus, access in the range after successful operation could cause
      major page fault but never lose the up-to-date contents unlike
      MADV_DONTNEED. Pages belonging to a shared mapping are only processed
      if a write access is allowed for the calling process.
      
      MADV_PAGEOUT cannot be applied to locked pages, Huge TLB pages, or
      VM_PFNMAP pages.
      
      [1] https://lore.kernel.org/lkml/20190710194719.GS29695@dhcp22.suse.cz/
      
      [minchan@kernel.org: clear PG_active on MADV_PAGEOUT]
        Link: http://lkml.kernel.org/r/20190802200643.GA181880@google.com
      [akpm@linux-foundation.org: resolve conflicts with hmm.git]
      Link: http://lkml.kernel.org/r/20190726023435.214162-5-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
      23757dcc
    • M
      mm: introduce MADV_COLD · 1af766e8
      Minchan Kim 提交于
      commit 9c276cc65a58faf98be8e56962745ec99ab87636 upstream
      
      Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.
      
      - Background
      
      The Android terminology used for forking a new process and starting an app
      from scratch is a cold start, while resuming an existing app is a hot
      start.  While we continually try to improve the performance of cold
      starts, hot starts will always be significantly less power hungry as well
      as faster so we are trying to make hot start more likely than cold start.
      
      To increase hot start, Android userspace manages the order that apps
      should be killed in a process called ActivityManagerService.
      ActivityManagerService tracks every Android app or service that the user
      could be interacting with at any time and translates that into a ranked
      list for lmkd(low memory killer daemon).  They are likely to be killed by
      lmkd if the system has to reclaim memory.  In that sense they are similar
      to entries in any other cache.  Those apps are kept alive for
      opportunistic performance improvements but those performance improvements
      will vary based on the memory requirements of individual workloads.
      
      - Problem
      
      Naturally, cached apps were dominant consumers of memory on the system.
      However, they were not significant consumers of swap even though they are
      good candidate for swap.  Under investigation, swapping out only begins
      once the low zone watermark is hit and kswapd wakes up, but the overall
      allocation rate in the system might trip lmkd thresholds and cause a
      cached process to be killed(we measured performance swapping out vs.
      zapping the memory by killing a process.  Unsurprisingly, zapping is 10x
      times faster even though we use zram which is much faster than real
      storage) so kill from lmkd will often satisfy the high zone watermark,
      resulting in very few pages actually being moved to swap.
      
      - Approach
      
      The approach we chose was to use a new interface to allow userspace to
      proactively reclaim entire processes by leveraging platform information.
      This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
      that are known to be cold from userspace and to avoid races with lmkd by
      reclaiming apps as soon as they entered the cached state.  Additionally,
      it could provide many chances for platform to use much information to
      optimize memory efficiency.
      
      To achieve the goal, the patchset introduce two new options for madvise.
      One is MADV_COLD which will deactivate activated pages and the other is
      MADV_PAGEOUT which will reclaim private pages instantly.  These new
      options complement MADV_DONTNEED and MADV_FREE by adding non-destructive
      ways to gain some free memory space.  MADV_PAGEOUT is similar to
      MADV_DONTNEED in a way that it hints the kernel that memory region is not
      currently needed and should be reclaimed immediately; MADV_COLD is similar
      to MADV_FREE in a way that it hints the kernel that memory region is not
      currently needed and should be reclaimed when memory pressure rises.
      
      This patch (of 5):
      
      When a process expects no accesses to a certain memory range, it could
      give a hint to kernel that the pages can be reclaimed when memory pressure
      happens but data should be preserved for future use.  This could reduce
      workingset eviction so it ends up increasing performance.
      
      This patch introduces the new MADV_COLD hint to madvise(2) syscall.
      MADV_COLD can be used by a process to mark a memory range as not expected
      to be used in the near future.  The hint can help kernel in deciding which
      pages to evict early during memory pressure.
      
      It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves
      
      	active file page -> inactive file LRU
      	active anon page -> inacdtive anon LRU
      
      Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file
      LRU's head because MADV_COLD is a little bit different symantic.
      MADV_FREE means it's okay to discard when the memory pressure because the
      content of the page is *garbage* so freeing such pages is almost zero
      overhead since we don't need to swap out and access afterward causes just
      minor fault.  Thus, it would make sense to put those freeable pages in
      inactive file LRU to compete other used-once pages.  It makes sense for
      implmentaion point of view, too because it's not swapbacked memory any
      longer until it would be re-dirtied.  Even, it could give a bonus to make
      them be reclaimed on swapless system.  However, MADV_COLD doesn't mean
      garbage so reclaiming them requires swap-out/in in the end so it's bigger
      cost.  Since we have designed VM LRU aging based on cost-model, anonymous
      cold pages would be better to position inactive anon's LRU list, not file
      LRU.  Furthermore, it would help to avoid unnecessary scanning if system
      doesn't have a swap device.  Let's start simpler way without adding
      complexity at this moment.  However, keep in mind, too that it's a caveat
      that workloads with a lot of pages cache are likely to ignore MADV_COLD on
      anonymous memory because we rarely age anonymous LRU lists.
      
      * man-page material
      
      MADV_COLD (since Linux x.x)
      
      Pages in the specified regions will be treated as less-recently-accessed
      compared to pages in the system with similar access frequencies.  In
      contrast to MADV_FREE, the contents of the region are preserved regardless
      of subsequent writes to pages.
      
      MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
      pages.
      
      [akpm@linux-foundation.org: resolve conflicts with hmm.git]
      Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
      1af766e8
    • M
      mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM · a0747c91
      Minchan Kim 提交于
      commit 8940b34a4e082ae11498ddae8432f2ac07685d1c upstream
      
      The local variable references in shrink_page_list is PAGEREF_RECLAIM_CLEAN
      as default.  It is for preventing to reclaim dirty pages when CMA try to
      migrate pages.  Strictly speaking, we don't need it because CMA didn't
      allow to write out by .may_writepage = 0 in reclaim_clean_pages_from_list.
      
      Moreover, it has a problem to prevent anonymous pages's swap out even
      though force_reclaim = true in shrink_page_list on upcoming patch.  So
      this patch makes references's default value to PAGEREF_RECLAIM and rename
      force_reclaim with ignore_references to make it more clear.
      
      This is a preparatory work for next patch.
      
      Link: http://lkml.kernel.org/r/20190726023435.214162-3-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: kbuild test robot <lkp@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
      a0747c91
    • A
      tools build: Check if gettid() is available before providing helper · 27a374d1
      Arnaldo Carvalho de Melo 提交于
      commit 4541a8bb13a86e504416a13360c8dc64d2fd612a upstream
      
      Laura reported that the perf build failed in fedora when we got a glibc
      that provides gettid(), which I reproduced using fedora rawhide with the
      glibc-devel-2.29.9000-26.fc31.x86_64 package.
      
      Add a feature check to avoid providing a gettid() helper in such
      systems.
      
      On a fedora rawhide system with this patch applied we now get:
      
        [root@7a5f55352234 perf]# grep gettid /tmp/build/perf/FEATURE-DUMP
        feature-gettid=1
        [root@7a5f55352234 perf]# cat /tmp/build/perf/feature/test-gettid.make.output
        [root@7a5f55352234 perf]# ldd /tmp/build/perf/feature/test-gettid.bin
                linux-vdso.so.1 (0x00007ffc6b1f6000)
                libc.so.6 => /lib64/libc.so.6 (0x00007f04e0a74000)
                /lib64/ld-linux-x86-64.so.2 (0x00007f04e0c47000)
        [root@7a5f55352234 perf]# nm /tmp/build/perf/feature/test-gettid.bin | grep -w gettid
                         U gettid@@GLIBC_2.30
        [root@7a5f55352234 perf]#
      
      While on a fedora:29 system:
      
        [acme@quaco perf]$ grep gettid /tmp/build/perf/FEATURE-DUMP
        feature-gettid=0
        [acme@quaco perf]$ cat /tmp/build/perf/feature/test-gettid.make.output
        test-gettid.c: In function ‘main’:
        test-gettid.c:8:9: error: implicit declaration of function ‘gettid’; did you mean ‘getgid’? [-Werror=implicit-function-declaration]
          return gettid();
                 ^~~~~~
                 getgid
        cc1: all warnings being treated as errors
        [acme@quaco perf]$
      Reported-by: NLaura Abbott <labbott@redhat.com>
      Tested-by: NLaura Abbott <labbott@redhat.com>
      Acked-by: NJiri Olsa &lt;jolsa@kernel.org&gt;Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Link: https://lkml.kernel.org/n/tip-yfy3ch53agmklwu9o7rlgf9c@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Nluanshi <zhangliguang@linux.alibaba.com>
      Reviewed-by: NAlex Shi <alex.shi@linux.alibaba.com>
      27a374d1
    • X
      alinux: mm: add proc interface to control context readahead · 2e38a0f2
      Xiaoguang Wang 提交于
      For some workloads whose io activities are mostly random, context
      readahead feature can introduce unnecessary io read operations, which
      will impact app's performance. Context readahead's algorithm is
      straightforward and not that smart.
      
      This patch adds "/proc/sys/vm/enable_context_readahead" to control
      whether to disable or enable this feature. Currently we enable context
      readahead default, user can echo 0 to /proc/sys/vm/enable_context_readahead
      to disable context readahead.
      
      We also have tested mongodb's performance in 'random point select' case,
      With context readahead enabled:
        mongodb eps 12409
      With context readahead disabled:
        mongodb eps 14443
      About 16% performance improvement.
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      2e38a0f2
    • Z
      alinux: hookers: add arm64 dependency · db153a10
      Zou Cao 提交于
      Now hookers has been support in arm64, add the Kconfig depend for
      arm64, otherwise it won't be built.
      Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
      Acked-by: NXunlei Pang <xlpang@linux.alibaba.com>
      db153a10
    • Z
      alinux: Hookers: add arm64 support · bc56b7b3
      Zou Cao 提交于
      arm64 has't global write protect set, here we remove the write
      proctect in pmd table and update the hook data.
      Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
      Acked-by: NXunlei Pang <xlpang@linux.alibaba.com>
      bc56b7b3
    • Z
      alinux: arm64: use __kernel_text_address to replace kthread_return_to_user · 64259ab4
      Zou Cao 提交于
      We don't need to use kthread_return_to_user to tell unwind it is kernel
      thread, we can use __kernel_text_address, it is a normal way in other
      arch like x86/ppc.
      Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      64259ab4
    • T
      arm64: reliable stacktraces · 46ad7da7
      Torsten Duwe 提交于
      cherry-picked from: https://patchwork.kernel.org/patch/10657429/
      
      Enhance the stack unwinder so that it reports whether it had to stop
      normally or due to an error condition; unwind_frame() will report
      continue/error/normal ending and walk_stackframe() will pass that
      info. __save_stack_trace() is used to check the validity of a stack;
      save_stack_trace_tsk_reliable() can now trivially be implemented.
      Modify arch/arm64/kernel/time.c as the only external caller so far
      to recognise the new semantics.
      
      I had to introduce a marker symbol kthread_return_to_user to tell
      the normal origin of a kernel thread.
      Signed-off-by: NTorsten Duwe <duwe@suse.de>
      Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      46ad7da7
    • Z
      alinux: arm64: add livepatch support · 7d9b185c
      Zou Cao 提交于
      Now we support FTRACE_WITH_REGS with -fpatchable-function-entry, here
      enable the livepatch support depend on FTRACE_WITH_REGS.
      
      Use task flag bit 6 to track patch transisiton state for the consistency
      model. Add it to the work mask so it gets cleared on all kernel exits to
      userland.
      
      Tell livepatch regs->pc + 2*AARCH64_INSN_SIZE is the place to change the
      return address.
      
      these codes have a big change against reference link, beacause we use new
      gcc featrue.
      
      References:
      https://patchwork.kernel.org/patch/10657431/
      
      Based-on-code-from: Torsten Duwe <duwe@suse.de>
      Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      7d9b185c
    • S
      efi: Make efi_rts_work accessible to efi page fault handler · 95fc4624
      Sai Praneeth 提交于
      [ Upstream commit 9dbbedaa6171247c4c7c40b83f05b200a117c2e0 ]
      
      After the kernel has booted, if any accesses by firmware causes a page
      fault, the efi page fault handler would freeze efi_rts_wq and schedules
      a new process. To do this, the efi page fault handler needs
      efi_rts_work. Hence, make it accessible.
      
      There will be no race conditions in accessing this structure, because
      all the calls to efi runtime services are already serialized.
      
      [ Wen: This patch also fixes a memory corruption:
             #define efi_queue_work(_rts, _arg1, _arg2, _arg3, _arg4, _arg5)\
             ({                                                             \
              struct efi_runtime_work efi_rts_work;                           \
             …
              init_completion(&efi_rts_work.efi_rts_comp);                    \
              INIT_WORK(&efi_rts_work.work, efi_call_rts);                    \
             …
      
             efi_rts_work is on the stack, registering it to workqueue will cause
             the following error:
      
             ODEBUG: object (____ptrval____) is on stack (____ptrval____),
             but NOT annotated.
             ------------[ cut here ]------------
             WARNING: CPU: 6 PID: 1 at lib/debugobjects.c:368
             __debug_object_init+0x218/0x538
             Modules linked in:
             CPU: 6 PID: 1 Comm: swapper/0 Tainted: G        W         4.19.91 #19
             …
             Call trace:
             __debug_object_init+0x218/0x538
             debug_object_init+0x20/0x28
             __init_work+0x34/0x58
             virt_efi_get_time.part.5+0x6c/0x12c
             virt_efi_get_time+0x4c/0x58
             efi_read_time+0x40/0x9c
             __rtc_read_time+0x50/0x118
             rtc_read_time+0x60/0x1f0
             rtc_hctosys+0x74/0x124
             do_one_initcall+0xac/0x3d4
             kernel_init_freeable+0x49c/0x59c
             kernel_init+0x18/0x110 ]
      Tested-by: NBhupesh Sharma <bhsharma@redhat.com>
      Suggested-by: NMatt Fleming <matt@codeblueprint.co.uk>
      Based-on-code-from: Ricardo Neri <ricardo.neri@intel.com>
      Signed-off-by: NSai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
      Signed-off-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Fixes: 3eb420e7 ("efi: Use a work queue to invoke EFI Runtime Services")
      Signed-off-by: NWen Yang <wenyang@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      95fc4624
    • F
      netfilter: conntrack: udp: set stream timeout to 2 minutes · 0127f777
      Florian Westphal 提交于
      commit 294304e4c522d797b7ea8200ab74354843fa68e9 upstream
      
      We have no explicit signal when a UDP stream has terminated, peers just
      stop sending.
      
      For suspected stream connections a timeout of two minutes is sane to keep
      NAT mapping alive a while longer.
      
      It matches tcp conntracks 'timewait' default timeout value.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NTony Lu <tonylu@linux.alibaba.com>
      Acked-by: NDust Li <dust.li@linux.alibaba.com>
      0127f777
    • F
      netfilter: conntrack: udp: only extend timeout to stream mode after 2s · 617c7456
      Florian Westphal 提交于
      commit d535c8a69c1924e70186d80be0a9cecaf475f166 upstream
      
      Currently DNS resolvers that send both A and AAAA queries from same source port
      can trigger stream mode prematurely, which results in non-early-evictable conntrack entry
      for three minutes, even though DNS requests are done in a few milliseconds.
      
      Add a two second grace period where we continue to use the ordinary
      30-second default timeout.  Its enough for DNS request/response traffic,
      even if two request/reply packets are involved.
      
      ASSURED is still set, else conntrack (and thus a possible
      NAT mapping ...) gets zapped too in case conntrack table runs full.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NTony Lu <tonylu@linux.alibaba.com>
      Acked-by: NDust Li <dust.li@linux.alibaba.com>
      617c7456
    • J
      iomap: Allow forcing of waiting for running DIO in iomap_dio_rw() · 96d39291
      Jan Kara 提交于
      commit 13ef954445df4fd1d7c003a500ec5ce49573e14b upstream
      
      Notes from Xiaoguang Wang:
          Indeed this patch should be appled before "ext4: introduce direct I/O
      read using iomap infrastructure", but given that we have already appled
      "ext4: introduce direct I/O read using iomap infrastructure" previously,
      we need to update iomap_dio_rw() calls with the new argument  in ext4.
      
      Filesystems do not support doing IO as asynchronous in some cases. For
      example in case of unaligned writes or in case file size needs to be
      extended (e.g. for ext4). Instead of forcing filesystem to wait for AIO
      in such cases, add argument to iomap_dio_rw() which makes the function
      wait for IO completion. This also results in executing
      iomap_dio_complete() inline in iomap_dio_rw() providing its return value
      to the caller as for ordinary sync IO.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      96d39291
    • X
      io_uring: fix poll_list race for SETUP_IOPOLL|SETUP_SQPOLL · a3a1829e
      Xiaoguang Wang 提交于
      commit bdcd3eab2a9ae0ac93f27275b6895dd95e5bf360 upstream
      
      After making ext4 support iopoll method:
        let ext4_file_operations's iopoll method be iomap_dio_iopoll(),
      we found fio can easily hang in fio_ioring_getevents() with below fio
      job:
          rm -f testfile; sync;
          sudo fio -name=fiotest -filename=testfile -iodepth=128 -thread
      -rw=write -ioengine=io_uring  -hipri=1 -sqthread_poll=1 -direct=1
      -bs=4k -size=10G -numjobs=8 -runtime=2000 -group_reporting
      with IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL enabled.
      
      There are two issues that results in this hang, one reason is that
      when IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL are enabled, fio
      does not use io_uring_enter to get completed events, it relies on
      kernel io_sq_thread to poll for completed events.
      
      Another reason is that there is a race: when io_submit_sqes() in
      io_sq_thread() submits a batch of sqes, variable 'inflight' will
      record the number of submitted reqs, then io_sq_thread will poll for
      reqs which have been added to poll_list. But note, if some previous
      reqs have been punted to io worker, these reqs will won't be in
      poll_list timely. io_sq_thread() will only poll for a part of previous
      submitted reqs, and then find poll_list is empty, reset variable
      'inflight' to be zero. If app just waits these deferred reqs and does
      not wake up io_sq_thread again, then hang happens.
      
      For app that entirely relies on io_sq_thread to poll completed requests,
      let io_iopoll_req_issued() wake up io_sq_thread properly when adding new
      element to poll_list, and when io_sq_thread prepares to sleep, check
      whether poll_list is empty again, if not empty, continue to poll.
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a3a1829e
    • Z
      cpuidle: haltpoll: Take 'idle=' override into account · 0e2f72bd
      Zhenzhong Duan 提交于
      commit be721b9b24394c088326384d44dcb9ec4b321aae upstream
      
      Currenly haltpoll isn't aware of the 'idle=' override, the priority is
      'idle=poll' > haltpoll > 'idle=halt'. When 'idle=poll' is used, cpuidle
      driver is bypassed but current_driver in sys still shows 'haltpoll'.
      
      When 'idle=halt' is used, haltpoll takes precedence and makes
      'idle=halt' have no effect.
      
      Add a check to prevent the haltpoll driver from loading if 'idle=' is
      present.
      Signed-off-by: NZhenzhong Duan <zhenzhong.duan@oracle.com>
      Co-developed-by: NJoao Martins <joao.m.martins@oracle.com>
      [ rjw: Subject ]
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
      0e2f72bd
    • J
      cpuidle-haltpoll: do not set an owner to allow modunload · 6805e4ba
      Joao Martins 提交于
      commit b0513562cf8c97035a60b6463b459edf8b3c9e5d upstream
      
      cpuidle-haltpoll can be built as a module to allow optional late load.
      Given we are setting @owner to THIS_MODULE, cpuidle will attempt to grab a
      module reference every time a cpuidle_device is registered -- so
      essentially all online cpus get a reference.
      
      This prevents for the module to be unloaded later, which makes the
      module_exit callback entirely unused. Thus remove the @owner and allow
      module to be unloaded.
      
      Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver")
      Signed-off-by: NJoao Martins <joao.m.martins@oracle.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
      6805e4ba
    • J
      cpuidle-haltpoll: return -ENODEV on modinit failure · f9706f22
      Joao Martins 提交于
      commit 26ee75559bd10f07b5b99e40e05df5f0d7d7b7ec upstream
      
      When a user loads cpuidle-haltpoll on a non KVM guest the module will
      successfully load, even though idle driver registration didn't take
      place.
      
      We should instead return -ENODEV signaling the user that the driver can't
      be loaded, like other error paths in haltpoll_init().  An example of such
      error paths is when we return -EBUSY when attempting to register an idle
      driver when it had one already (e.g. intel_idle loads at boot and then we
      attempt to insert module cpuidle-haltpoll).
      
      Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver")
      Signed-off-by: NJoao Martins <joao.m.martins@oracle.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
      f9706f22
    • J
      cpuidle-haltpoll: set haltpoll as preferred governor · aee4b74e
      Joao Martins 提交于
      commit c194c2ad74696ce686a2ee1239f44d9750cb5e5c upstream
      
      Right now, guest current governors have the following ratings:
      
       * ladder            -> 10
       * teo               -> 19
       * menu              -> 20
       * haltpoll          -> 21
       * ladder + nohz=off -> 25
      
      haltpoll governor got introduced and it is now the default governor given
      its highest rating -- with ladder+nohz being the exception -- regardless of
      idle driver in the guest. An example of an undesirable case is x86 KVM
      guests with MWAIT which have intel_idle registered first, and consequently
      will have haltpoll be used as governor which would get limited to a poll
      state and state 1 and the other states wouldn't get used.
      
      To keep the previous defaults we decrease rating of governor to 9 (below
      current lowest rating) and thus rely on @governor switch on
      cpuidle_register_driver() to tie in haltpoll idle driver and governor
      together.
      Signed-off-by: NJoao Martins <joao.m.martins@oracle.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
      aee4b74e
    • J
      cpuidle: allow governor switch on cpuidle_register_driver() · 0176005e
      Joao Martins 提交于
      commit 11c59eae6633b8a7e77b8ee1cf908964d80c78cd upstream
      
      The recently introduced haltpoll driver is largely only useful with
      haltpoll governor. To allow drivers to associate with a particular idle
      behaviour, add a @governor property to 'struct cpuidle_driver' and thus
      allow a cpuidle driver to switch to a *preferred* governor on idle driver
      registration. We save the previous governor, and when an idle driver is
      unregistered we switch back to that.
      
      The @governor can be overridden by cpuidle.governor= boot param or
      alternatively be ignored if the governor doesn't exist.
      Signed-off-by: NJoao Martins <joao.m.martins@oracle.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
      0176005e
    • M
      cpuidle-haltpoll: vcpu hotplug support · 1e18beca
      Marcelo Tosatti 提交于
      commit 97d3eb9da84cae0548359b0aecb8619faad003b7 upstream
      
      When cpus != maxcpus cpuidle-haltpoll will fail to register all vcpus
      past the online ones and thus fail to register the idle driver.
      This is because cpuidle_add_sysfs() will return with -ENODEV as a
      consequence from get_cpu_device() return no device for a non-existing
      CPU.
      
      Instead switch to cpuidle_register_driver() and manually register each
      of the present cpus through cpuhp_setup_state() callbacks and future
      ones that get onlined or offlined. This mimmics similar logic that
      intel_idle does.
      
      Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver")
      Signed-off-by: NJoao Martins <joao.m.martins@oracle.com>
      Signed-off-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Reviewed-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
      1e18beca
    • M
      cpuidle: add haltpoll governor · 7b905bca
      Marcelo Tosatti 提交于
      commit 2cffe9f6b96fece065ee8522673c90e92ef2085d upstream
      
      The cpuidle_haltpoll governor, in conjunction with the haltpoll cpuidle
      driver, allows guest vcpus to poll for a specified amount of time before
      halting.
      This provides the following benefits to host side polling:
      
              1) The POLL flag is set while polling is performed, which allows
                 a remote vCPU to avoid sending an IPI (and the associated
                 cost of handling the IPI) when performing a wakeup.
      
              2) The VM-exit cost can be avoided.
      
      The downside of guest side polling is that polling is performed
      even with other runnable tasks in the host.
      
      Results comparing halt_poll_ns and server/client application
      where a small packet is ping-ponged:
      
      host                                        --> 31.33
      halt_poll_ns=300000 / no guest busy spin    --> 33.40   (93.8%)
      halt_poll_ns=0 / guest_halt_poll_ns=300000  --> 32.73   (95.7%)
      
      For the SAP HANA benchmarks (where idle_spin is a parameter
      of the previous version of the patch, results should be the
      same):
      
      hpns == halt_poll_ns
      
                                idle_spin=0/   idle_spin=800/    idle_spin=0/
                                hpns=200000    hpns=0            hpns=800000
      DeleteC06T03 (100 thread) 1.76           1.71 (-3%)        1.78   (+1%)
      InsertC16T02 (100 thread) 2.14           2.07 (-3%)        2.18   (+1.8%)
      DeleteC00T01 (1 thread)   1.34           1.28 (-4.5%)      1.29   (-3.7%)
      UpdateC00T03 (1 thread)   4.72           4.18 (-12%)       4.53   (-5%)
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
      7b905bca
    • M
      add cpuidle-haltpoll driver · 6d2ef95f
      Marcelo Tosatti 提交于
      commit fa86ee90eb1111267de67cb4272b5ce711f18cbb upstream
      
      Add a cpuidle driver that calls the architecture default_idle routine.
      
      To be used in conjunction with the haltpoll governor.
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
      6d2ef95f
    • M
      governors: unify last_state_idx · dc59f8b0
      Marcelo Tosatti 提交于
      commit 7d4daeedd575bbc3c40c87fc6708a8b88c50fe7e upstream
      
      Since this field is shared by all governors, move it to
      cpuidle device structure.
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
      dc59f8b0
    • R
      cpuidle: menu: Do not update last_state_idx in menu_select() · 5e083c5c
      Rafael J. Wysocki 提交于
      commit eb40a380bff28f84b6583bba6786b46ef26ef548 upstream
      
      It is not necessary to update data->last_state_idx in menu_select()
      as it only is used in menu_update() which only runs when
      data->needs_update is set and that is set only when updating
      data->last_state_idx in menu_reflect().
      
      Accordingly, drop the update of data->last_state_idx from
      menu_select() and get rid of the (now redundant) "out" label
      from it.
      
      No intentional behavior changes.
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NDaniel Lezcano <daniel.lezcano@linaro.org>
      Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
      5e083c5c
    • M
      cpuidle: add poll_limit_ns to cpuidle_device structure · 2163e221
      Marcelo Tosatti 提交于
      commit 259231a045616c4101d023a8f4dcc8379af265a6 upstream
      
      Add a poll_limit_ns variable to cpuidle_device structure.
      
      Calculate and configure it in the new cpuidle_poll_time
      function, in case its zero.
      
      Individual governors are allowed to override this value.
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
      2163e221
    • D
      cpuidle: poll_state: Fix default time limit · a72c57d7
      Doug Smythies 提交于
      commit 1617971c6616c87185cbc78fa1a86dfc70dd16b6 upstream
      
      The default time is declared in units of microsecnds,
      but is used as nanoseconds, resulting in significant
      accounting errors for idle state 0 time when all idle
      states deeper than 0 are disabled.
      
      Under these unusual conditions, we don't really care
      about the poll time limit anyhow.
      
      Fixes: 800fb34a99ce ("cpuidle: poll_state: Disregard disable idle states")
      Signed-off-by: NDoug Smythies <dsmythies@telus.net>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
      a72c57d7
    • R
      cpuidle: Add cpuidle.governor= command line parameter · 4488ba36
      Rafael J. Wysocki 提交于
      commit 61cb5758d3c46bc1ba87694fefc0d9653613ce6b upstream
      
      Add cpuidle.governor= command line parameter to allow the default
      cpuidle governor to be replaced.
      
      That is useful, for example, if someone running a tickful kernel
      wants to use the menu governor on it.
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
      4488ba36
    • R
      cpuidle: poll_state: Disregard disable idle states · 907fd899
      Rafael J. Wysocki 提交于
      commit 800fb34a99ce7d22dca839c90f869c7a12b50f70 upstream
      
      When computing the limit of time to spend in the loop in poll_idle(),
      use the target residency of the first enabled idle state deeper than
      state 0 instead of always using the target residency of state 1.
      
      This helps when state 1 is disabled for diagnostics, for instance.
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
      907fd899
    • R
      cpuidle: poll_state: Revise loop termination condition · a89bfc68
      Rafael J. Wysocki 提交于
      commit 01bad1c6896db021db82042e71c2bf1f97cc026b upstream
      
      If need_resched() returns "false", breaking out of the loop in
      poll_idle() will cause a new idle state to be selected, so in fact
      it usually doesn't make sense to spin in it longer than the target
      residency of the second state.  [Note that the "polling" state is
      used only if there is at least one "real" state defined in addition
      to it, so the second state is always there.]  On the other hand,
      breaking out of it early (say in case the next state is disabled)
      shouldn't hurt as it is polling anyway.
      
      For this reason, make the loop in poll_idle() break if the CPU has
      been spinning longer than the target residency of the second state
      (the "polling" state can only be state[0]).
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
      a89bfc68
    • X
      io_uring: fix __io_iopoll_check deadlock in io_sq_thread · a715bf8d
      Xiaoguang Wang 提交于
      commit c7849be9cc2dd2754c48ddbaca27c2de6d80a95d upstream.
      
      Since commit a3a0e43fd770 ("io_uring: don't enter poll loop if we have
      CQEs pending"), if we already events pending, we won't enter poll loop.
      In case SETUP_IOPOLL and SETUP_SQPOLL are both enabled, if app has
      been terminated and don't reap pending events which are already in cq
      ring, and there are some reqs in poll_list, io_sq_thread will enter
      __io_iopoll_check(), and find pending events, then return, this loop
      will never have a chance to exit.
      
      I have seen this issue in fio stress tests, to fix this issue, let
      io_sq_thread call io_iopoll_getevents() with argument 'min' being zero,
      and remove __io_iopoll_check().
      
      Fixes: a3a0e43fd770 ("io_uring: don't enter poll loop if we have CQEs pending")
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a715bf8d
    • C
      alinux: doc: use unified official project name Cloud Kernel · a60721b9
      Caspar Zhang 提交于
      Cloud Kernel is the official name of our project, this patch unitizes
      the project names used in docs and comments.
      Signed-off-by: NCaspar Zhang <caspar@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      a60721b9
    • W
      alinux: mm: oom_kill: show killed task's cgroup info in global oom · 5028e358
      Wenwei Tao 提交于
      Some users want to know the killed task's cgroup info in global
      oom, this message would help them to make upper decision.
      Signed-off-by: NWenwei Tao <wenwei.tao@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      5028e358
    • W
      alinux: mm: memcontrol: enable oom.group on cgroup-v1 · 7d41295c
      Wenwei Tao 提交于
      Enable oom.group on cgroup-v1.
      Signed-off-by: NWenwei Tao <wenwei.tao@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      7d41295c
    • W
      alinux: doc: alibaba: Add priority oom descriptions · 0c8648d9
      Wenwei Tao 提交于
      Add "memory.priority" and "memory.use_priority_oom" descriptions.
      Signed-off-by: NWenwei Tao <wenwei.tao@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      0c8648d9
    • W
      alinux: mm: memcontrol: introduce memcg priority oom · 52e375fc
      Wenwei Tao 提交于
      Under memory pressure reclaim and oom would happen, with multiple
      cgroups exist in one system, we might want some of their memory
      or tasks survived the reclaim and oom while there are other
      candidates.
      
      The @memory.low and @memory.min have make that happen during reclaim,
      this patch introduces memcg priority oom to meet above requirement in
      the oom.
      
      The priority is from 0 to 12, the higher number the higher priority.
      When oom happens it always choose victim from low priority memcg.
      And it works both for memcg oom and global oom, it can be enabled/disabled
      through @memory.use_priority_oom, for global oom through the root
      memcg's @memory.use_priority_oom, it is disabled by default.
      Signed-off-by: NWenwei Tao <wenwei.tao@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      52e375fc
    • W
      alinux: kernel: cgroup: account number of tasks in the css and its descendants · 1e91d392
      Wenwei Tao 提交于
      Account number of the tasks in the css and its descendants, this is
      prepared for the incoming memcg priority patch.
      
      In memcg priority oom, we will select victim cgroup which has victim
      tasks in it. We need to know whether the memcg and its descendants
      have tasks before the selection can move on.
      Signed-off-by: NWenwei Tao <wenwei.tao@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      1e91d392
    • X
      alinux: doc: Add Documentation/alibaba/interfaces.rst · 279df2ff
      Xunlei Pang 提交于
      This file collects all the interfaces specific to Alibaba Cloud Kernel.
      
      Add "memory.wmark_min_adj", "memory.exstat", and "zombie memcgs reaper"
      descriptions.
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
      Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
      279df2ff
    • X
      alinux: memcg: Account throttled time due to memory.wmark_min_adj · ef467b9d
      Xunlei Pang 提交于
      Accessing original memory.stat turned out to be one heavy
      operation which has been caused many real product problems.
      
      Introduce new cgroup memory.exstat, memory.exstat stands
      for "extra/extended memory.stat", which contains dedicated
      statistics from Alibaba Clould Kernel.
      
      memory.exstat is supposed to provide hierarchical statistics.
      
      Export its first "wmark_min_throttled_ms", and will add more
      like direct reclaim, direct compaction, etc.
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
      Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
      ef467b9d
    • X
      alinux: memcg: Introduce memory.wmark_min_adj · 60be0f54
      Xunlei Pang 提交于
      In co-location environment, there are more or less some memory
      overcommitment, then BATCH tasks may break the shared global min
      watermark resulting in all types of applications falling into
      the direct reclaim slow path hurting the RT of LS tasks.
      (NOTE: BATCH tasks tolerate big latency spike even in seconds
      as long as doesn't hurt its overal throughput. While LS tasks
      are very Latency-Sensitive, they may time out or fail in case
      of sudden latency spike lasts like hundreds of ms typically.)
      
      Actually BATCH tasks are not sensitive to memory latency, they
      can be assigned a strict min watermark which is different from
      that of LS tasks(which can be aissgned a lenient min watermark
      accordingly), thus isolating each other in case of global memory
      allocation. This is kind of like the idea behind ALLOC_HARDER
      for rt_task(), see gfp_to_alloc_flags().
      
      memory.wmark_min_adj stands for memcg global WMARK_MIN adjustment,
      it is used to realize separate min watermarks above-mentioned for
      memcgs, its valid value is within [-25, 50], specifically:
      negative value means to be relative to [0, WMARK_MIN],
      positive value means to be relative to [WMARK_MIN, WMARK_LOW].
      For examples,
        -25 means "WMARK_MIN + (WMARK_MIN - 0) * (-25%)"
         50 means "WMARK_MIN + (WMARK_LOW - WMARK_MIN) * 50%"
      
      Note that the minimum -25 is what ALLOC_HARDER uses which is safe
      for us to adopt, and the maximum 50 is one experienced value.
      
      Negative memory.wmark_min_adj means high QoS requirements, it can
      allocate below the global WMARK_MIN, which is kind of like the idea
      behind ALLOC_HARDER, see gfp_to_alloc_flags().
      
      Positive memory.wmark_min_adj means low QoS requirements, thus when
      allocation broke memcg min watermark, it should trigger direct reclaim
      traditionally, and we trigger throttle instead to further prevent
      them from disturbing others.
      
      With this interface, we can assign positive values for BATCH memcgs
      and negative values for LS memcgs.
      
      memory.wmark_min_adj default value is 0, and inherit from its parent,
      Note that the final effective wmark_min_adj will consider all the
      hierarchical values, its value is the maximal(most conservative)
      wmark_min_adj along the hierarchy but excluding intermediate default
      values(zero).
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
      Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
      60be0f54