1. 08 Aug 2020, 5 commits
  2. 26 Jun 2020, 1 commit
  3. 05 Jun 2020, 1 commit
  4. 04 Jun 2020, 15 commits
  5. 03 Jun 2020, 1 commit
    • mm/writeback: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE · a37b0715
      Committed by NeilBrown
      PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the
      loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a
      daemon needs to write to one bdi (the final bdi) in order to free up
      writes queued to another bdi (the client bdi).
      
      The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
      pages, so that it can still dirty pages after other processes have been
      throttled.  The purpose of this is to avoid deadlocks that happen when
      the PF_LESS_THROTTLE process must write for any dirty pages to be freed,
      but it is being throttled and cannot write.
      
      This approach was designed when all threads were blocked equally,
      independent of which device they were writing to or how fast it was.
      Since that time the writeback algorithm has changed substantially with
      different threads getting different allowances based on non-trivial
      heuristics.  This means the simple "add 25%" heuristic is no longer
      reliable.
      
      The important issue is not that the daemon needs a *larger* dirty page
      allowance, but that it needs a *private* dirty page allowance, so that
      dirty pages for the "client" bdi that it is helping to clear (the bdi
      for an NFS filesystem or loop block device etc) do not affect the
      throttling of the daemon writing to the "final" bdi.
      
      This patch changes the heuristic so that the task is not throttled when
      the bdi it is writing to has a dirty page count below (or equal
      to) the free-run threshold for that bdi.  This ensures it will always be
      able to have some pages in flight, and so will not deadlock.
      
      In a steady state, it is expected that PF_LOCAL_THROTTLE tasks might
      still be throttled by the global threshold, but that is acceptable as
      it is only the deadlock state that is interesting for this flag.
      
      This approach of "only throttle when target bdi is busy" is consistent
      with the other use of PF_LESS_THROTTLE in current_may_throttle(), where
      it causes attention to be focussed only on the target bdi.
      
      So this patch
       - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE,
       - removes the 25% bonus that that flag gives, and
       - delays a PF_LOCAL_THROTTLE task only when both the global and the
         local free-run thresholds are exceeded (sketched below).
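
      A minimal user-space sketch of that decision, assuming a mid-point
      free-run formula and made-up numbers; this illustrates the rule above
      and is not the actual balance_dirty_pages() code:

          #include <stdbool.h>
          #include <stdio.h>

          /* Free-run ceiling modelled as the mid-point between the
           * background and hard dirty thresholds (illustrative). */
          static unsigned long freerun(unsigned long thresh, unsigned long bg_thresh)
          {
                  return (thresh + bg_thresh) / 2;
          }

          /* A PF_LOCAL_THROTTLE task is delayed only when BOTH the global
           * and the per-bdi ("local") dirty counts exceed their free-run
           * ceilings; other tasks follow the usual global-based throttling,
           * grossly simplified here. */
          static bool must_throttle(bool pf_local_throttle,
                                    unsigned long gdirty, unsigned long gthresh,
                                    unsigned long gbg,
                                    unsigned long wb_dirty, unsigned long wb_thresh,
                                    unsigned long wb_bg)
          {
                  bool global_busy = gdirty > freerun(gthresh, gbg);
                  bool local_busy = wb_dirty > freerun(wb_thresh, wb_bg);

                  if (pf_local_throttle)
                          return global_busy && local_busy; /* cannot deadlock */
                  return global_busy;
          }

          int main(void)
          {
                  /* Global dirty is over its ceiling, but the target (final)
                   * bdi is idle: nfsd/loop may keep writing, others may not. */
                  printf("flusher: %d\n", must_throttle(true, 900, 1000, 500,
                                                        10, 100, 50));
                  printf("normal:  %d\n", must_throttle(false, 900, 1000, 500,
                                                        10, 100, 50));
                  return 0;
          }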
      
      Note that previously realtime threads were treated the same as
      PF_LESS_THROTTLE threads.  This patch does *not* change the behaviour
      for real-time threads, so it is now different from the behaviour of nfsd
      and loop tasks.  I don't know what is wanted for realtime.
      
      [akpm@linux-foundation.org: coding style fixes]
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Chuck Lever <chuck.lever@oracle.com>	[nfsd]
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Link: http://lkml.kernel.org/r/87ftbf7gs3.fsf@notabene.neil.brown.name
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 08 May 2020, 1 commit
  7. 08 Apr 2020, 1 commit
  8. 03 Apr 2020, 7 commits
  9. 22 Feb 2020, 1 commit
    • mm/vmscan.c: don't round up scan size for online memory cgroup · 76073c64
      Committed by Gavin Shan
      Commit 68600f62 ("mm: don't miss the last page because of round-off
      error") makes the scan size round up to @denominator regardless of the
      memory cgroup's state, online or offline.  This affects the overall
      reclaiming behaviour: with the former formula, the corresponding LRU
      list was eligible for reclaiming only when its size, logically right
      shifted by @sc->priority, was bigger than zero.
      
      For example, the inactive anonymous LRU list should have at least 0x4000
      pages to be eligible for reclaiming when we have 60/12 for
      swappiness/priority and without taking scan/rotation ratio into account.
      
      After the roundup is applied, the inactive anonymous LRU list becomes
      eligible for reclaiming when its size is bigger than or equal to 0x1000
      in the same condition.
      
          (0x4000 >> 12) * 60 / (60 + 140 + 1) = 1
          ((0x1000 >> 12) * 60 + 200) / (60 + 140 + 1) = 1
      
      aarch64 has 512MB huge page size when the base page size is 64KB.  The
      memory cgroup that has a huge page is always eligible for reclaiming in
      that case.
      
      The reclaiming is likely to stop after the huge page is reclaimed,
      meaning the further iteration on @sc->priority and the sibling and child
      memory cgroups will be skipped.  The overall behaviour has been changed.
      This fixes the issue by applying the roundup to offlined memory cgroups
      only, giving more preference to reclaiming memory from offlined memory
      cgroups.  That seems reasonable, as their memory is unlikely to be used
      by anyone.
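
      Both the round-off arithmetic above and the shape of the fix can be
      modelled in plain C.  This is an illustrative sketch with made-up helper
      names, not the actual get_scan_count() code; the numbers mirror the
      example above (swappiness 60, priority 12, denominator 60 + 140 + 1):

          #include <stdbool.h>
          #include <stdio.h>

          /* Scan target for one LRU list: truncating division for online
           * memory cgroups, round-up division (so the last partial batch is
           * not lost) for offline ones -- the gating described above. */
          static unsigned long scan_target(bool memcg_online, unsigned long lru_size,
                                           int priority, unsigned long fraction,
                                           unsigned long denominator)
          {
                  unsigned long scan = (lru_size >> priority) * fraction;

                  if (memcg_online)
                          return scan / denominator;
                  return (scan + denominator - 1) / denominator;
          }

          int main(void)
          {
                  /* denominator = 60 + 140 + 1 = 201, as in the example above */
                  printf("%lu\n", scan_target(true,  0x4000, 12, 60, 201)); /* 1 */
                  printf("%lu\n", scan_target(true,  0x1000, 12, 60, 201)); /* 0 */
                  printf("%lu\n", scan_target(false, 0x1000, 12, 60, 201)); /* 1 */
                  return 0;
          }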
      
      The issue was found by starting up 8 VMs on an Ampere Mustang machine,
      which has 8 CPUs and 16 GB memory.  Each VM is given 2 vCPUs and 2GB of
      memory.  It took 264 seconds for all VMs to be completely up, and 784MB
      of swap was consumed after that.  With this patch applied, it took 236
      seconds and 60MB of swap to do the same thing.  So there is a 10%
      performance improvement in my case.  Note that KSM is disabled while THP
      is enabled in the testing.
      
      Without the patch:
                total     used    free   shared  buff/cache   available
          Mem:  16196    10065    2049       16        4081        3749
          Swap:  8175      784    7391

      With the patch:
                total     used    free   shared  buff/cache   available
          Mem:  16196    11324    3656       24        1215        2936
          Swap:  8175       60    8115
      
      Link: http://lkml.kernel.org/r/20200211024514.8730-1-gshan@redhat.com
      Fixes: 68600f62 ("mm: don't miss the last page because of round-off error")
      Signed-off-by: Gavin Shan <gshan@redhat.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: <stable@vger.kernel.org>	[4.20+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 01 Feb 2020, 3 commits
  11. 18 Dec 2019, 1 commit
  12. 02 Dec 2019, 3 commits
    • mm: vmscan: enforce inactive:active ratio at the reclaim root · b91ac374
      Committed by Johannes Weiner
      We split the LRU lists into an inactive and an active part to maximize
      workingset protection while allowing just enough inactive cache space to
      facilitate readahead and writeback for one-off file accesses (e.g.  a
      linear scan through a file, or logging); or just enough inactive anon to
      maintain recent reference information when reclaim needs to swap.
      
      With cgroups and their nested LRU lists, we currently don't do this
      correctly.  While recursive cgroup reclaim establishes a relative LRU
      order among the pages of all involved cgroups, inactive:active size
      decisions are done on a per-cgroup level.  As a result, we'll reclaim a
      cgroup's workingset when it doesn't have cold pages, even when one of
      its siblings has plenty of cold pages that should be reclaimed first.
      
      For example: workload A has 50M worth of hot cache but doesn't do any
      one-off file accesses; meanwhile, parallel workload B scans files and
      rarely accesses the same page twice.
      
      If these workloads were to run in an uncgrouped system, A would be
      protected from the high rate of cache faults from B.  But if they were put
      in parallel cgroups for memory accounting purposes, B's fast cache fault
      rate would push out the hot cache pages of A.  This is unexpected and
      undesirable - the "scan resistance" of the page cache is broken.
      
      This patch moves inactive:active size balancing decisions to the root of
      reclaim - the same level where the LRU order is established.
      
      It does this by looking at the recursive size of the inactive and the
      active file sets of the cgroup subtree at the beginning of the reclaim
      cycle, and then making a decision - scan or skip active pages - that
      applies throughout the entire run and to every cgroup involved.
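
      A hedged sketch of that idea with made-up structures (not the kernel's
      lruvec/memcg types): total the inactive and active file pages of the
      whole subtree once, at the root of reclaim, and let every cgroup in the
      run follow the single resulting decision.  The "inactive smaller than
      active" rule below merely stands in for the kernel's real
      inactive:active ratio heuristic:

          #include <stdbool.h>

          struct cgroup_node {
                  unsigned long inactive_file;
                  unsigned long active_file;
                  struct cgroup_node *child;      /* first child */
                  struct cgroup_node *sibling;    /* next sibling */
          };

          /* Recursive file LRU sizes of the subtree rooted at @node. */
          static void subtree_file_sizes(const struct cgroup_node *node,
                                         unsigned long *inactive,
                                         unsigned long *active)
          {
                  const struct cgroup_node *c;

                  if (!node)
                          return;
                  *inactive += node->inactive_file;
                  *active += node->active_file;
                  for (c = node->child; c; c = c->sibling)
                          subtree_file_sizes(c, inactive, active);
          }

          /* Decided once per reclaim cycle, at the reclaim root, and then
           * applied to every cgroup scanned during that cycle. */
          static bool may_deactivate_file(const struct cgroup_node *reclaim_root)
          {
                  unsigned long inactive = 0, active = 0;

                  subtree_file_sizes(reclaim_root, &inactive, &active);
                  return inactive < active;
          }

          int main(void)
          {
                  /* A: hot workingset, mostly active; B: streaming one-off
                   * cache, mostly inactive. */
                  struct cgroup_node b = { .inactive_file = 80, .active_file = 5 };
                  struct cgroup_node a = { .inactive_file = 10, .active_file = 50,
                                           .sibling = &b };
                  struct cgroup_node root = { .child = &a };

                  /* Subtree-wide there is plenty of inactive cache (B's one-off
                   * pages), so the shared decision is "don't deactivate": A's
                   * hot pages stay protected while B's cache is reclaimed. */
                  return may_deactivate_file(&root) ? 1 : 0;  /* 0 here */
          }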
      
      With that in place, in the test above, the VM will recognize that there
      are plenty of inactive pages in the combined cache set of workloads A and
      B and prefer the one-off cache in B over the hot pages in A.  The scan
      resistance of the cache is restored.
      
      [cai@lca.pw: fix some -Wenum-conversion warnings]
        Link: http://lkml.kernel.org/r/1573848697-29262-1-git-send-email-cai@lca.pw
      Link: http://lkml.kernel.org/r/20191107205334.158354-4-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: detect file thrashing at the reclaim root · b910718a
      Committed by Johannes Weiner
      We use refault information to determine whether the cache workingset is
      stable or transitioning, and dynamically adjust the inactive:active file
      LRU ratio so as to maximize protection from one-off cache during stable
      periods, and minimize IO during transitions.
      
      With cgroups and their nested LRU lists, we currently don't do this
      correctly.  While recursive cgroup reclaim establishes a relative LRU
      order among the pages of all involved cgroups, refaults only affect the
      local LRU order in the cgroup in which they are occurring.  As a result,
      cache transitions can take longer in a cgrouped system as the active pages
      of sibling cgroups aren't challenged when they should be.
      
      [ Right now, this is somewhat theoretical, because the siblings, under
        continued regular reclaim pressure, should eventually run out of
        inactive pages - and since inactive:active *size* balancing is also
        done on a cgroup-local level, we will challenge the active pages
        eventually in most cases. But the next patch will move that relative
        size enforcement to the reclaim root as well, and then this patch
        here will be necessary to propagate refault pressure to siblings. ]
      
      This patch moves refault detection to the root of reclaim.  Instead of
      remembering the cgroup owner of an evicted page, remember the cgroup that
      caused the reclaim to happen.  When refaults later occur, they'll
      correctly influence the cross-cgroup LRU order that reclaim follows.
      
      I.e.  if global reclaim kicked out pages in some subgroup A/B/C, the
      refault of those pages will challenge the global LRU order, and not just
      the local order down inside C.
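
      A toy model of that bookkeeping change, with made-up structures rather
      than the kernel's shadow-entry encoding: the record left behind at
      eviction names the cgroup that drove the reclaim, and the later refault
      is accounted against that cgroup's LRU order:

          #include <stdio.h>

          struct memcg {
                  const char *name;
                  unsigned long refaults;   /* refault pressure seen here */
          };

          /* What an evicted page leaves behind.  Previously this would have
           * pointed at the page's own memcg; now it points at the cgroup at
           * whose level the reclaim (and thus the LRU order) was driven. */
          struct shadow_entry {
                  struct memcg *reclaim_root;
                  unsigned long eviction_clock;
          };

          static struct shadow_entry remember_eviction(struct memcg *reclaim_root,
                                                       unsigned long lru_clock)
          {
                  struct shadow_entry s = { reclaim_root, lru_clock };
                  return s;
          }

          /* On refault, charge the pressure where the eviction decision was
           * made, so it challenges the cross-cgroup LRU order there. */
          static void handle_refault(struct shadow_entry s)
          {
                  s.reclaim_root->refaults++;
          }

          int main(void)
          {
                  struct memcg root_cg = { "root", 0 };

                  /* Global reclaim evicted a page owned by subgroup A/B/C at
                   * clock 1234; its refault now counts at the root, not in C. */
                  struct shadow_entry s = remember_eviction(&root_cg, 1234);
                  handle_refault(s);
                  printf("%s refaults: %lu\n", root_cg.name, root_cg.refaults);
                  return 0;
          }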
      
      [hannes@cmpxchg.org:  use page_memcg() instead of another lookup]
        Link: http://lkml.kernel.org/r/20191115160722.GA309754@cmpxchg.org
      Link: http://lkml.kernel.org/r/20191107205334.158354-3-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>