1. 23 4月, 2020 3 次提交
  2. 22 4月, 2020 2 次提交
  3. 17 4月, 2020 2 次提交
  4. 16 4月, 2020 6 次提交
    • X
      alinux: mm, memcg: add kconfig MEMSLI · 9e58d704
      Xu Yu 提交于
      to #26424368
      
      This introduces the new bool kconfig MEMSLI, determining whether the
      memsli (memory latency histogram) feature should be built-in or not.
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      9e58d704
    • X
      alinux: mm, memcg: add memsli procfs switch interface · 892970b7
      Xu Yu 提交于
      to #26424368
      
      Since memsli also records latency histogram for swapout and swapin,
      which are NOT in the slow memory path, the overhead of memsli could
      be nonnegligible in some specific scenarios.
      
      For example, in scenarios with frequent swapping out and in, memsli
      could introduce overhead of ~1% of total run time of the synthetic
      testcase.
      
      This adds procfs interface for memsli switch. The memsli feature is
      enabled by default, and you can now disable it by:
      
      $ echo 0 > /proc/memsli/enabled
      
      Apparently, you can check current memsli switch status by:
      
      $ cat /proc/memsli/enabled
      
      Note that disabling memsli at runtime will NOT clear the existing
      latency histogram. You still need to manually reset the specified
      latency histogram(s) by echo 0 into the corresponding cgroup control
      file(s).
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      892970b7
    • X
      alinux: mm, memcg: record latency of swapout and swapin in every memcg · ddfd4d5e
      Xu Yu 提交于
      to #26424368
      
      Probe and calculate the latency of global swapout, memcg swapout and
      swapin respectively, and then group into the latency histogram in struct
      mem_cgroup.
      
      Note that the latency in each memcg is aggregated from all child memcgs.
      
      Usage:
      
      $ cat memory.direct_swapout_global_latency
      0-1ms:  98313
      1-5ms:  0
      5-10ms:         0
      10-100ms:       0
      100-500ms:      0
      500-1000ms:     0
      >=1000ms:       0
      total(ms):      52
      
      Each line is the count of global swapout within the appropriate latency
      range.
      
      To clear the latency histogram:
      
      $ echo 0 > memory.direct_swapout_global_latency
      $ cat memory.direct_swapout_global_latency
      0-1ms:  0
      1-5ms:  0
      5-10ms:         0
      10-100ms:       0
      100-500ms:      0
      500-1000ms:     0
      >=1000ms:       0
      total(ms):      0
      
      The usage of memory.direct_swapout_memcg_latency and
      memory.direct_swapin_latency is the same as
      memory.direct_swapout_global_latency.
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      ddfd4d5e
    • X
      alinux: mm, memcg: rework memory latency histogram interfaces · 837e53ab
      Xu Yu 提交于
      to #26424368
      
      There are some duplicate codes in the original implementation of memory
      latency histogram, such as {x, y, z}_show, and {x, y, z}_write, where x,
      y, z represents various types of memory latency.
      
      This reworks common codes of memory latency histogram to make it easier
      to add more types of memory latency later.
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      837e53ab
    • X
      alinux: mm, memcg: record latency of direct compact in every memcg · 4bec5cfe
      Xu Yu 提交于
      to #26424368
      
      Probe and calculate the latency of direct compact, and then group into
      the latency histogram in struct mem_cgroup.
      
      Note that the latency in each memcg is aggregated from all child memcgs.
      
      Usage:
      
      $ cat memory.direct_compact_latency
      0-1ms:  1176
      1-5ms:  259
      5-10ms:         17
      10-100ms:       10
      100-500ms:      0
      500-1000ms:     0
      >=1000ms:       0
      total(ms):      921
      
      Each line is the count of direct compact within the appropriate latency
      range.
      
      To clear the latency histogram:
      
      $ echo 0 > memory.direct_compact_latency
      $ cat memory.direct_compact_latency
      0-1ms:  0
      1-5ms:  0
      5-10ms:         0
      10-100ms:       0
      100-500ms:      0
      500-1000ms:     0
      >=1000ms:       0
      total(ms):      0
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      4bec5cfe
    • X
      alinux: mm, memcg: record latency of direct reclaim in every memcg · 83058e75
      Xu Yu 提交于
      to #26424368
      
      Probe and calculate the latency of global direct reclaim and memcg
      direct reclaim, respectively, and then group into the latency histogram
      in struct mem_cgroup. Besides, the total latency is accumulated each
      time the histogram is updated.
      
      Note that the latency in each memcg is aggregated from all child memcgs.
      
      Usage:
      
      $ cat memory.direct_reclaim_global_latency
      0-1ms:  228
      1-5ms:  283
      5-10ms:         0
      10-100ms:       0
      100-500ms:      0
      500-1000ms:     0
      >=1000ms:       0
      total(ms):      539
      
      Each line is the count of global direct reclaim within the appropriate
      latency range.
      
      To clear the latency histogram:
      
      $ echo 0 > memory.direct_reclaim_global_latency
      $ cat memory.direct_reclaim_global_latency
      0-1ms:  0
      1-5ms:  0
      5-10ms:         0
      10-100ms:       0
      100-500ms:      0
      500-1000ms:     0
      >=1000ms:       0
      total(ms):      0
      
      The usage of memory.direct_reclaim_memcg_latency is the same as
      memory.direct_reclaim_global_latency.
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      83058e75
  5. 13 4月, 2020 8 次提交
    • A
      mm/page_reporting: add budget limit on how many pages can be reported per pass · d2932bdb
      Alexander Duyck 提交于
      to #26589565
      
      In order to keep ourselves from reporting pages that are just going to be
      reused again in the case of heavy churn we can put a limit on how many
      total pages we will process per pass.  Doing this will allow the worker
      thread to go into idle much more quickly so that we avoid competing with
      other threads that might be allocating or freeing pages.
      
      The logic added here will limit the worker thread to no more than one
      sixteenth of the total free pages in a given area per list.  Once that
      limit is reached it will update the state so that at the end of the pass
      we will reschedule the worker to try again in 2 seconds when the memory
      churn has hopefully settled down.
      
      Again this optimization doesn't show much of a benefit in the standard
      case as the memory churn is minmal.  However with page allocator shuffling
      enabled the gain is quite noticeable.  Below are the results with a THP
      enabled version of the will-it-scale page_fault1 test showing the
      improvement in iterations for 16 processes or threads.
      
      Without:
      tasks   processes       processes_idle  threads         threads_idle
      16      8283274.75      0.17            5594261.00      38.15
      
      With:
      tasks   processes       processes_idle  threads         threads_idle
      16      8767010.50      0.21            5791312.75      36.98
      
      Link: http://lkml.kernel.org/r/20200211224719.29318.72113.stgit@localhost.localdomainSigned-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Cc: Yang Zhang <yang.zhang.wz@gmail.com>
      Cc: wei qi <weiqi4@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      d2932bdb
    • A
      virtio-balloon: add support for providing free page reports to host · 671b4f34
      Alexander Duyck 提交于
      to #26589565
      
      Add support for the page reporting feature provided by virtio-balloon.
      Reporting differs from the regular balloon functionality in that is is
      much less durable than a standard memory balloon.  Instead of creating a
      list of pages that cannot be accessed the pages are only inaccessible
      while they are being indicated to the virtio interface.  Once the
      interface has acknowledged them they are placed back into their respective
      free lists and are once again accessible by the guest system.
      
      Unlike a standard balloon we don't inflate and deflate the pages.  Instead
      we perform the reporting, and once the reporting is completed it is
      assumed that the page has been dropped from the guest and will be faulted
      back in the next time the page is accessed.
      
      Link: http://lkml.kernel.org/r/20200211224657.29318.68624.stgit@localhost.localdomainSigned-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Acked-by: NMichael S. Tsirkin <mst@redhat.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Cc: Yang Zhang <yang.zhang.wz@gmail.com>
      Cc: wei qi <weiqi4@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      671b4f34
    • A
      mm: introduce Reported pages · 2b197730
      Alexander Duyck 提交于
      to #26589565
      
      In order to pave the way for free page reporting in virtualized
      environments we will need a way to get pages out of the free lists and
      identify those pages after they have been returned.  To accomplish this,
      this patch adds the concept of a Reported Buddy, which is essentially
      meant to just be the Uptodate flag used in conjunction with the Buddy page
      type.
      
      To prevent the reported pages from leaking outside of the buddy lists I
      added a check to clear the PageReported bit in the del_page_from_free_list
      function.  As a result any reported page that is split, merged, or
      allocated will have the flag cleared prior to the PageBuddy value being
      cleared.
      
      The process for reporting pages is fairly simple.  Once we free a page
      that meets the minimum order for page reporting we will schedule a
      worker thread to start 2s or more in the future.  That worker thread will begin
      working from the lowest supported page reporting order up to MAX_ORDER - 1
      pulling unreported pages from the free list and storing them in the
      scatterlist.
      
      When processing each individual free list it is necessary for the worker
      thread to release the zone lock when it needs to stop and report the full
      scatterlist of pages.  To reduce the work of the next iteration the worker
      thread will rotate the free list so that the first unreported page in
      the free list becomes the first entry in the list.
      
      It will then call a reporting function providing information on how many
      entries are in the scatterlist.  Once the function completes it will
      return the pages to the free area from which they were allocated and start
      over pulling more pages from the free areas until there are no longer
      enough pages to report on to keep the worker busy, or we have processed as
      many pages as were contained in the free area when we started processing
      the list.
      
      The worker thread will work in a round-robin fashion making its way though
      each zone requesting reporting, and through each reportable free list
      within that zone.  Once all free areas within the zone have been processed
      it will check to see if there have been any requests for reporting while
      it was processing.  If so it will reschedule the worker thread to start up
      again in roughly 2s and exit.
      
      Link: http://lkml.kernel.org/r/20200211224635.29318.19750.stgit@localhost.localdomainSigned-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Cc: Yang Zhang <yang.zhang.wz@gmail.com>
      Cc: wei qi <weiqi4@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      2b197730
    • A
      mm: use zone and order instead of free area in free_list manipulators · 999c6c60
      Alexander Duyck 提交于
      to #26589565
      
      In order to enable the use of the zone from the list manipulator
      functions
      I will need access to the zone pointer.  As it turns out most of the
      accessors were always just being directly passed &zone->free_area[order]
      anyway so it would make sense to just fold that into the function itself
      and pass the zone and order as arguments instead of the free area.
      
      In order to be able to reference the zone we need to move the declaration
      of the functions down so that we have the zone defined before we define
      the list manipulation functions.  Since the functions are only used in the
      file mm/page_alloc.c we can just move them there to reduce noise in the
      header.
      
      Link: http://lkml.kernel.org/r/20200211224613.29318.43080.stgit@localhost.localdomainSigned-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NPankaj Gupta <pagupta@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Cc: Yang Zhang <yang.zhang.wz@gmail.com>
      Cc: wei qi <weiqi4@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      999c6c60
    • D
      mm: move buddy list manipulations into helpers · 54c14b30
      Dan Williams 提交于
      to #26589565
      
      commit b03641af680959df57c275a80ff0dc116627c7ae upstream
      
      In preparation for runtime randomization of the zone lists, take all
      (well, most of) the list_*() functions in the buddy allocator and put
      them in helper functions.  Provide a common control point for injecting
      additional behavior when freeing pages.
      
      [dan.j.williams@intel.com: fix buddy list helpers]
        Link: http://lkml.kernel.org/r/155033679702.1773410.13041474192173212653.stgit@dwillia2-desk3.amr.corp.intel.com
      [vbabka@suse.cz: remove del_page_from_free_area() migratetype parameter]
        Link: http://lkml.kernel.org/r/4672701b-6775-6efd-0797-b6242591419e@suse.cz
      Link: http://lkml.kernel.org/r/154899812264.3165233.5219320056406926223.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Tested-by: NTetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Robert Elliott <elliott@hpe.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      54c14b30
    • Y
      alinux: list: add list_is_first() and list_rotate_to_front() · 00625113
      Yang Shi 提交于
      to #26589565
      
      The list_is_first() and list_rotate_to_front() functions are needed by
      free page hinting.  list_is_first() was used by i915 driver, so moved it
      to common header file.
      
      Backported list_rotate_to_front() from commit a16b53849913 ("list: add
      function list_rotate_to_front()"). Not backport the whole patch, just
      copy the implementation of the function.
      Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      00625113
    • W
      virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON · 67595fa8
      Wei Wang 提交于
      to #26589565
      
      commit 2e991629bcf55a43681aec1ee096eeb03cf81709 upstream
      
      The VIRTIO_BALLOON_F_PAGE_POISON feature bit is used to indicate if the
      guest is using page poisoning. Guest writes to the poison_val config
      field to tell host about the page poisoning value that is in use.
      Suggested-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NWei Wang <wei.w.wang@intel.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      67595fa8
    • W
      virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT · 0db17e11
      Wei Wang 提交于
      to #26589565
      
      commit 86a559787e6f5cf662c081363f64a20cad654195 upstream
      
      Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature indicates the
      support of reporting hints of guest free pages to host via virtio-balloon.
      Currenlty, only free page blocks of MAX_ORDER - 1 are reported. They are
      obtained one by one from the mm free list via the regular allocation
      function.
      
      Host requests the guest to report free page hints by sending a new cmd id
      to the guest via the free_page_report_cmd_id configuration register. When
      the guest starts to report, it first sends a start cmd to host via the
      free page vq, which acks to host the cmd id received. When the guest
      finishes reporting free pages, a stop cmd is sent to host via the vq.
      Host may also send a stop cmd id to the guest to stop the reporting.
      
      VIRTIO_BALLOON_CMD_ID_STOP: Host sends this cmd to stop the guest reporting.
      VIRTIO_BALLOON_CMD_ID_DONE: Host sends this cmd to tell the guest that
      the reported pages are ready to be freed.
      
      Why does the guest free the reported pages when host tells it is ready to free?
      This is because freeing pages appears to be expensive for live migration.
      free_pages() dirties memory very quickly and makes the live migraion not
      converge in some cases. So it is good to delay the free_page operation
      when the migration is done, and host sends a command to guest about that.
      
      Why do we need the new VIRTIO_BALLOON_CMD_ID_DONE, instead of reusing
      VIRTIO_BALLOON_CMD_ID_STOP?
      This is because live migration is usually done in several rounds. At the
      end of each round, host needs to send a VIRTIO_BALLOON_CMD_ID_STOP cmd to
      the guest to stop (or say pause) the reporting. The guest resumes the
      reporting when it receives a new command id at the beginning of the next
      round. So we need a new cmd id to distinguish between "stop reporting" and
      "ready to free the reported pages".
      
      TODO:
      - Add a batch page allocation API to amortize the allocation overhead.
      Signed-off-by: NWei Wang <wei.w.wang@intel.com>
      Signed-off-by: NLiang Li <liang.z.li@intel.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      0db17e11
  6. 02 4月, 2020 1 次提交
  7. 25 3月, 2020 1 次提交
  8. 18 3月, 2020 17 次提交
    • X
      alinux: mm, memcg: account number of processes in the css · 2061acd6
      Xu Yu 提交于
      Since commit e0205ae40f12 ("mm: memcontrol: use CSS_TASK_ITER_PROCS at
      mem_cgroup_scan_tasks()") made mem_cgroup_scan_tasks() to check only one
      thread from each thread group, we can make cgroup_subsys_state::nr_tasks
      to record only the thread group leader, i.e., process, instead of
      thread(s). Furthermore, this renames cgroup_subsys_state::nr_tasks to
      cgroup_subsys_state::nr_procs.
      
      Fixes: f061cd88 ("alinux: kernel: cgroup: account number of tasks in
      the css and its descendants")
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      2061acd6
    • X
      alinux: mm, memcg: record latency of memcg wmark reclaim · 40969475
      Xu Yu 提交于
      The memcg background async page reclaim, a.k.a, memcg kswapd, is
      implemented with a dedicated unbound workqueue currently.
      
      However, memcg kswapd will run too frequently, resulting in high
      overhead, page cache thrashing, frequent dirty page writeback, etc., due
      to improper memcg memory.wmark_ratio, unreasonable memcg memor capacity,
      or even abnormal memcg memory usage.
      
      We need to find out the problematic memcg(s) where memcg kswapd
      introduces significant overhead.
      
      This records the latency of each run of memcg kswapd work, and then
      aggregates into the exstat of per memcg.
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      40969475
    • M
      KVM: vgic-v4: Track the number of VLPIs per vcpu · 27840020
      Marc Zyngier 提交于
      commit 5bd90b0989731520f2cdcfbbe467f1271f3cc803 upstream.
      
      In order to find out whether a vcpu is likely to be the target of
      VLPIs (and to further optimize the way we deal with those), let's
      track the number of VLPIs a vcpu can receive.
      
      This gets implemented with an atomic variable that gets incremented
      or decremented on map, unmap and move of a VLPI.
      Signed-off-by: NMarc Zyngier <maz@kernel.org>
      Reviewed-by: NZenghui Yu <yuzenghui@huawei.com>
      Reviewed-by: NChristoffer Dall <christoffer.dall@arm.com>
      Link: https://lore.kernel.org/r/20191107160412.30301-2-maz@kernel.orgSigned-off-by: NShannon Zhao <shannon.zhao@linux.alibaba.com>
      Acked-by: NZou Cao <zoucao@linux.alibaba.com>
      27840020
    • M
      KVM: arm64: vgic-v4: Move the GICv4 residency flow to be driven by vcpu_load/put · 42993070
      Marc Zyngier 提交于
      commit 8e01d9a396e6db153d94a6004e6473d9ff251a6a upstream.
      
      When the VHE code was reworked, a lot of the vgic stuff was moved around,
      but the GICv4 residency code did stay untouched, meaning that we come
      in and out of residency on each flush/sync, which is obviously suboptimal.
      
      To address this, let's move things around a bit:
      
      - Residency entry (flush) moves to vcpu_load
      - Residency exit (sync) moves to vcpu_put
      - On blocking (entry to WFI), we "put"
      - On unblocking (exit from WFI), we "load"
      
      Because these can nest (load/block/put/load/unblock/put, for example),
      we now have per-VPE tracking of the residency state.
      
      Additionally, vgic_v4_put gains a "need doorbell" parameter, which only
      gets set to true when blocking because of a WFI. This allows a finer
      control of the doorbell, which now also gets disabled as soon as
      it gets signaled.
      Signed-off-by: NMarc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20191027144234.8395-2-maz@kernel.orgSigned-off-by: NShannon Zhao <shannon.zhao@linux.alibaba.com>
      Acked-by: NZou Cao <zoucao@linux.alibaba.com>
      42993070
    • M
      mm: introduce MADV_PAGEOUT · 23757dcc
      Minchan Kim 提交于
      commit 1a4e58cce84ee88129d5d49c064bd2852b481357 upstream
      
      When a process expects no accesses to a certain memory range for a long
      time, it could hint kernel that the pages can be reclaimed instantly but
      data should be preserved for future use.  This could reduce workingset
      eviction so it ends up increasing performance.
      
      This patch introduces the new MADV_PAGEOUT hint to madvise(2) syscall.
      MADV_PAGEOUT can be used by a process to mark a memory range as not
      expected to be used for a long time so that kernel reclaims *any LRU*
      pages instantly.  The hint can help kernel in deciding which pages to
      evict proactively.
      
      A note: It doesn't apply SWAP_CLUSTER_MAX LRU page isolation limit
      intentionally because it's automatically bounded by PMD size.  If PMD
      size(e.g., 256) makes some trouble, we could fix it later by limit it to
      SWAP_CLUSTER_MAX[1].
      
      - man-page material
      
      MADV_PAGEOUT (since Linux x.x)
      
      Do not expect access in the near future so pages in the specified
      regions could be reclaimed instantly regardless of memory pressure.
      Thus, access in the range after successful operation could cause
      major page fault but never lose the up-to-date contents unlike
      MADV_DONTNEED. Pages belonging to a shared mapping are only processed
      if a write access is allowed for the calling process.
      
      MADV_PAGEOUT cannot be applied to locked pages, Huge TLB pages, or
      VM_PFNMAP pages.
      
      [1] https://lore.kernel.org/lkml/20190710194719.GS29695@dhcp22.suse.cz/
      
      [minchan@kernel.org: clear PG_active on MADV_PAGEOUT]
        Link: http://lkml.kernel.org/r/20190802200643.GA181880@google.com
      [akpm@linux-foundation.org: resolve conflicts with hmm.git]
      Link: http://lkml.kernel.org/r/20190726023435.214162-5-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
      23757dcc
    • M
      mm: introduce MADV_COLD · 1af766e8
      Minchan Kim 提交于
      commit 9c276cc65a58faf98be8e56962745ec99ab87636 upstream
      
      Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.
      
      - Background
      
      The Android terminology used for forking a new process and starting an app
      from scratch is a cold start, while resuming an existing app is a hot
      start.  While we continually try to improve the performance of cold
      starts, hot starts will always be significantly less power hungry as well
      as faster so we are trying to make hot start more likely than cold start.
      
      To increase hot start, Android userspace manages the order that apps
      should be killed in a process called ActivityManagerService.
      ActivityManagerService tracks every Android app or service that the user
      could be interacting with at any time and translates that into a ranked
      list for lmkd(low memory killer daemon).  They are likely to be killed by
      lmkd if the system has to reclaim memory.  In that sense they are similar
      to entries in any other cache.  Those apps are kept alive for
      opportunistic performance improvements but those performance improvements
      will vary based on the memory requirements of individual workloads.
      
      - Problem
      
      Naturally, cached apps were dominant consumers of memory on the system.
      However, they were not significant consumers of swap even though they are
      good candidate for swap.  Under investigation, swapping out only begins
      once the low zone watermark is hit and kswapd wakes up, but the overall
      allocation rate in the system might trip lmkd thresholds and cause a
      cached process to be killed(we measured performance swapping out vs.
      zapping the memory by killing a process.  Unsurprisingly, zapping is 10x
      times faster even though we use zram which is much faster than real
      storage) so kill from lmkd will often satisfy the high zone watermark,
      resulting in very few pages actually being moved to swap.
      
      - Approach
      
      The approach we chose was to use a new interface to allow userspace to
      proactively reclaim entire processes by leveraging platform information.
      This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
      that are known to be cold from userspace and to avoid races with lmkd by
      reclaiming apps as soon as they entered the cached state.  Additionally,
      it could provide many chances for platform to use much information to
      optimize memory efficiency.
      
      To achieve the goal, the patchset introduce two new options for madvise.
      One is MADV_COLD which will deactivate activated pages and the other is
      MADV_PAGEOUT which will reclaim private pages instantly.  These new
      options complement MADV_DONTNEED and MADV_FREE by adding non-destructive
      ways to gain some free memory space.  MADV_PAGEOUT is similar to
      MADV_DONTNEED in a way that it hints the kernel that memory region is not
      currently needed and should be reclaimed immediately; MADV_COLD is similar
      to MADV_FREE in a way that it hints the kernel that memory region is not
      currently needed and should be reclaimed when memory pressure rises.
      
      This patch (of 5):
      
      When a process expects no accesses to a certain memory range, it could
      give a hint to kernel that the pages can be reclaimed when memory pressure
      happens but data should be preserved for future use.  This could reduce
      workingset eviction so it ends up increasing performance.
      
      This patch introduces the new MADV_COLD hint to madvise(2) syscall.
      MADV_COLD can be used by a process to mark a memory range as not expected
      to be used in the near future.  The hint can help kernel in deciding which
      pages to evict early during memory pressure.
      
      It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves
      
      	active file page -> inactive file LRU
      	active anon page -> inacdtive anon LRU
      
      Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file
      LRU's head because MADV_COLD is a little bit different symantic.
      MADV_FREE means it's okay to discard when the memory pressure because the
      content of the page is *garbage* so freeing such pages is almost zero
      overhead since we don't need to swap out and access afterward causes just
      minor fault.  Thus, it would make sense to put those freeable pages in
      inactive file LRU to compete other used-once pages.  It makes sense for
      implmentaion point of view, too because it's not swapbacked memory any
      longer until it would be re-dirtied.  Even, it could give a bonus to make
      them be reclaimed on swapless system.  However, MADV_COLD doesn't mean
      garbage so reclaiming them requires swap-out/in in the end so it's bigger
      cost.  Since we have designed VM LRU aging based on cost-model, anonymous
      cold pages would be better to position inactive anon's LRU list, not file
      LRU.  Furthermore, it would help to avoid unnecessary scanning if system
      doesn't have a swap device.  Let's start simpler way without adding
      complexity at this moment.  However, keep in mind, too that it's a caveat
      that workloads with a lot of pages cache are likely to ignore MADV_COLD on
      anonymous memory because we rarely age anonymous LRU lists.
      
      * man-page material
      
      MADV_COLD (since Linux x.x)
      
      Pages in the specified regions will be treated as less-recently-accessed
      compared to pages in the system with similar access frequencies.  In
      contrast to MADV_FREE, the contents of the region are preserved regardless
      of subsequent writes to pages.
      
      MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
      pages.
      
      [akpm@linux-foundation.org: resolve conflicts with hmm.git]
      Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
      1af766e8
    • S
      efi: Make efi_rts_work accessible to efi page fault handler · 95fc4624
      Sai Praneeth 提交于
      [ Upstream commit 9dbbedaa6171247c4c7c40b83f05b200a117c2e0 ]
      
      After the kernel has booted, if any accesses by firmware causes a page
      fault, the efi page fault handler would freeze efi_rts_wq and schedules
      a new process. To do this, the efi page fault handler needs
      efi_rts_work. Hence, make it accessible.
      
      There will be no race conditions in accessing this structure, because
      all the calls to efi runtime services are already serialized.
      
      [ Wen: This patch also fixes a memory corruption:
             #define efi_queue_work(_rts, _arg1, _arg2, _arg3, _arg4, _arg5)\
             ({                                                             \
              struct efi_runtime_work efi_rts_work;                           \
             …
              init_completion(&efi_rts_work.efi_rts_comp);                    \
              INIT_WORK(&efi_rts_work.work, efi_call_rts);                    \
             …
      
             efi_rts_work is on the stack, registering it to workqueue will cause
             the following error:
      
             ODEBUG: object (____ptrval____) is on stack (____ptrval____),
             but NOT annotated.
             ------------[ cut here ]------------
             WARNING: CPU: 6 PID: 1 at lib/debugobjects.c:368
             __debug_object_init+0x218/0x538
             Modules linked in:
             CPU: 6 PID: 1 Comm: swapper/0 Tainted: G        W         4.19.91 #19
             …
             Call trace:
             __debug_object_init+0x218/0x538
             debug_object_init+0x20/0x28
             __init_work+0x34/0x58
             virt_efi_get_time.part.5+0x6c/0x12c
             virt_efi_get_time+0x4c/0x58
             efi_read_time+0x40/0x9c
             __rtc_read_time+0x50/0x118
             rtc_read_time+0x60/0x1f0
             rtc_hctosys+0x74/0x124
             do_one_initcall+0xac/0x3d4
             kernel_init_freeable+0x49c/0x59c
             kernel_init+0x18/0x110 ]
      Tested-by: NBhupesh Sharma <bhsharma@redhat.com>
      Suggested-by: NMatt Fleming <matt@codeblueprint.co.uk>
      Based-on-code-from: Ricardo Neri <ricardo.neri@intel.com>
      Signed-off-by: NSai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
      Signed-off-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Fixes: 3eb420e7 ("efi: Use a work queue to invoke EFI Runtime Services")
      Signed-off-by: NWen Yang <wenyang@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      95fc4624
    • F
      netfilter: conntrack: udp: only extend timeout to stream mode after 2s · 617c7456
      Florian Westphal 提交于
      commit d535c8a69c1924e70186d80be0a9cecaf475f166 upstream
      
      Currently DNS resolvers that send both A and AAAA queries from same source port
      can trigger stream mode prematurely, which results in non-early-evictable conntrack entry
      for three minutes, even though DNS requests are done in a few milliseconds.
      
      Add a two second grace period where we continue to use the ordinary
      30-second default timeout.  Its enough for DNS request/response traffic,
      even if two request/reply packets are involved.
      
      ASSURED is still set, else conntrack (and thus a possible
      NAT mapping ...) gets zapped too in case conntrack table runs full.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NTony Lu <tonylu@linux.alibaba.com>
      Acked-by: NDust Li <dust.li@linux.alibaba.com>
      617c7456
    • J
      iomap: Allow forcing of waiting for running DIO in iomap_dio_rw() · 96d39291
      Jan Kara 提交于
      commit 13ef954445df4fd1d7c003a500ec5ce49573e14b upstream
      
      Notes from Xiaoguang Wang:
          Indeed this patch should be appled before "ext4: introduce direct I/O
      read using iomap infrastructure", but given that we have already appled
      "ext4: introduce direct I/O read using iomap infrastructure" previously,
      we need to update iomap_dio_rw() calls with the new argument  in ext4.
      
      Filesystems do not support doing IO as asynchronous in some cases. For
      example in case of unaligned writes or in case file size needs to be
      extended (e.g. for ext4). Instead of forcing filesystem to wait for AIO
      in such cases, add argument to iomap_dio_rw() which makes the function
      wait for IO completion. This also results in executing
      iomap_dio_complete() inline in iomap_dio_rw() providing its return value
      to the caller as for ordinary sync IO.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      96d39291
    • J
      cpuidle: allow governor switch on cpuidle_register_driver() · 0176005e
      Joao Martins 提交于
      commit 11c59eae6633b8a7e77b8ee1cf908964d80c78cd upstream
      
      The recently introduced haltpoll driver is largely only useful with
      haltpoll governor. To allow drivers to associate with a particular idle
      behaviour, add a @governor property to 'struct cpuidle_driver' and thus
      allow a cpuidle driver to switch to a *preferred* governor on idle driver
      registration. We save the previous governor, and when an idle driver is
      unregistered we switch back to that.
      
      The @governor can be overridden by cpuidle.governor= boot param or
      alternatively be ignored if the governor doesn't exist.
      Signed-off-by: NJoao Martins <joao.m.martins@oracle.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
      0176005e
    • M
      governors: unify last_state_idx · dc59f8b0
      Marcelo Tosatti 提交于
      commit 7d4daeedd575bbc3c40c87fc6708a8b88c50fe7e upstream
      
      Since this field is shared by all governors, move it to
      cpuidle device structure.
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
      dc59f8b0
    • M
      cpuidle: add poll_limit_ns to cpuidle_device structure · 2163e221
      Marcelo Tosatti 提交于
      commit 259231a045616c4101d023a8f4dcc8379af265a6 upstream
      
      Add a poll_limit_ns variable to cpuidle_device structure.
      
      Calculate and configure it in the new cpuidle_poll_time
      function, in case its zero.
      
      Individual governors are allowed to override this value.
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
      2163e221
    • W
      alinux: mm: memcontrol: introduce memcg priority oom · 52e375fc
      Wenwei Tao 提交于
      Under memory pressure reclaim and oom would happen, with multiple
      cgroups exist in one system, we might want some of their memory
      or tasks survived the reclaim and oom while there are other
      candidates.
      
      The @memory.low and @memory.min have make that happen during reclaim,
      this patch introduces memcg priority oom to meet above requirement in
      the oom.
      
      The priority is from 0 to 12, the higher number the higher priority.
      When oom happens it always choose victim from low priority memcg.
      And it works both for memcg oom and global oom, it can be enabled/disabled
      through @memory.use_priority_oom, for global oom through the root
      memcg's @memory.use_priority_oom, it is disabled by default.
      Signed-off-by: NWenwei Tao <wenwei.tao@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      52e375fc
    • W
      alinux: kernel: cgroup: account number of tasks in the css and its descendants · 1e91d392
      Wenwei Tao 提交于
      Account number of the tasks in the css and its descendants, this is
      prepared for the incoming memcg priority patch.
      
      In memcg priority oom, we will select victim cgroup which has victim
      tasks in it. We need to know whether the memcg and its descendants
      have tasks before the selection can move on.
      Signed-off-by: NWenwei Tao <wenwei.tao@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      1e91d392
    • X
      alinux: memcg: Account throttled time due to memory.wmark_min_adj · ef467b9d
      Xunlei Pang 提交于
      Accessing original memory.stat turned out to be one heavy
      operation which has been caused many real product problems.
      
      Introduce new cgroup memory.exstat, memory.exstat stands
      for "extra/extended memory.stat", which contains dedicated
      statistics from Alibaba Clould Kernel.
      
      memory.exstat is supposed to provide hierarchical statistics.
      
      Export its first "wmark_min_throttled_ms", and will add more
      like direct reclaim, direct compaction, etc.
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
      Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
      ef467b9d
    • X
      alinux: memcg: Introduce memory.wmark_min_adj · 60be0f54
      Xunlei Pang 提交于
      In co-location environment, there are more or less some memory
      overcommitment, then BATCH tasks may break the shared global min
      watermark resulting in all types of applications falling into
      the direct reclaim slow path hurting the RT of LS tasks.
      (NOTE: BATCH tasks tolerate big latency spike even in seconds
      as long as doesn't hurt its overal throughput. While LS tasks
      are very Latency-Sensitive, they may time out or fail in case
      of sudden latency spike lasts like hundreds of ms typically.)
      
      Actually BATCH tasks are not sensitive to memory latency, they
      can be assigned a strict min watermark which is different from
      that of LS tasks(which can be aissgned a lenient min watermark
      accordingly), thus isolating each other in case of global memory
      allocation. This is kind of like the idea behind ALLOC_HARDER
      for rt_task(), see gfp_to_alloc_flags().
      
      memory.wmark_min_adj stands for memcg global WMARK_MIN adjustment,
      it is used to realize separate min watermarks above-mentioned for
      memcgs, its valid value is within [-25, 50], specifically:
      negative value means to be relative to [0, WMARK_MIN],
      positive value means to be relative to [WMARK_MIN, WMARK_LOW].
      For examples,
        -25 means "WMARK_MIN + (WMARK_MIN - 0) * (-25%)"
         50 means "WMARK_MIN + (WMARK_LOW - WMARK_MIN) * 50%"
      
      Note that the minimum -25 is what ALLOC_HARDER uses which is safe
      for us to adopt, and the maximum 50 is one experienced value.
      
      Negative memory.wmark_min_adj means high QoS requirements, it can
      allocate below the global WMARK_MIN, which is kind of like the idea
      behind ALLOC_HARDER, see gfp_to_alloc_flags().
      
      Positive memory.wmark_min_adj means low QoS requirements, thus when
      allocation broke memcg min watermark, it should trigger direct reclaim
      traditionally, and we trigger throttle instead to further prevent
      them from disturbing others.
      
      With this interface, we can assign positive values for BATCH memcgs
      and negative values for LS memcgs.
      
      memory.wmark_min_adj default value is 0, and inherit from its parent,
      Note that the final effective wmark_min_adj will consider all the
      hierarchical values, its value is the maximal(most conservative)
      wmark_min_adj along the hierarchy but excluding intermediate default
      values(zero).
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
      Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
      60be0f54
    • X
      alinux: memcg: Provide users the ability to reap zombie memcgs · 63442ea9
      Xunlei Pang 提交于
      After memcg was deleted, page caches still reference to this memcg
      causing large number of dead(zombie) memcgs in the system. Then it
      slows down access to "/sys/fs/cgroup/cpu/memory.stat", etc due to
      tons of iterations, further causing various latencies.
      
      This patch introduces two ways to reclaim these zombie memcgs.
      1) Background kthread reaper
      Introduce a kernel thread "memcg_zombie_reaper" to reclaim zombie
      memcgs at background periodically.
      
      Several knobs are also added to control the reaper scan frequency:
      - /sys/kernel/mm/memcg_reaper/scan_interval
        The scan period in second. Default 5s.
      - /sys/kernel/mm/memcg_reaper/pages_scan
        The scan rate of pages per scan. Default 1310720(5GiB for 4KiB page).
      - /sys/kernel/mm/memcg_reaper/verbose
        Output some zombie memcg information for debug purpose. Default off.
      - /sys/kernel/mm/memcg_reaper/reap_background
        "on/off" switch. Default "0" means off. Write "1" to switch it on.
      
      2) One-shot trigger by users
      - /sys/kernel/mm/memcg_reaper/reap
        Write "1" to trigger one round of zombie memcg reaping, but without
        any guarantee, you may need to launch multiple rounds as needed.
      Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
      Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
      63442ea9