• Y
    mm: memcg: fix stale protection of reclaim target memcg · adb82130
    Yosry Ahmed 提交于
    Patch series "mm: memcg: fix protection of reclaim target memcg", v3.
    
    This series fixes a bug in calculating the protection of the reclaim
    target memcg where we end up using stale effective protection values from
    the last reclaim operation, instead of completely ignoring the protection
    of the reclaim target as intended.  More detailed explanation and examples
    in patch 1, which includes the fix.  Patches 2 & 3 introduce a selftest
    case that catches the bug.
    
    
    This patch (of 3):
    
    When we are doing memcg reclaim, the intended behavior is that we
    ignore any protection (memory.min, memory.low) of the target memcg (but
    not its children).  Ever since the patch pointed to by the "Fixes" tag,
    we actually read a stale value for the target memcg protection when
    deciding whether to skip the memcg or not because it is protected.  If
    the stale value happens to be high enough, we don't reclaim from the
    target memcg.
    
    Essentially, in some cases we may falsely skip reclaiming from the
    target memcg of reclaim because we read a stale protection value from
    last time we reclaimed from it.
    
    
    During reclaim, mem_cgroup_calculate_protection() is used to determine the
    effective protection (emin and elow) values of a memcg.  The protection of
    the reclaim target is ignored, but we cannot set their effective
    protection to 0 due to a limitation of the current implementation (see
    comment in mem_cgroup_protection()).  Instead, we leave their effective
    protection values unchaged, and later ignore it in
    mem_cgroup_protection().
    
    However, mem_cgroup_protection() is called later in
    shrink_lruvec()->get_scan_count(), which is after the
    mem_cgroup_below_{min/low}() checks in shrink_node_memcgs().  As a result,
    the stale effective protection values of the target memcg may lead us to
    skip reclaiming from the target memcg entirely, before calling
    shrink_lruvec().  This can be even worse with recursive protection, where
    the stale target memcg protection can be higher than its standalone
    protection.  See two examples below (a similar version of example (a) is
    added to test_memcontrol in a later patch).
    
    (a) A simple example with proactive reclaim is as follows. Consider the
    following hierarchy:
    ROOT
     |
     A
     |
     B (memory.min = 10M)
    
    Consider the following scenario:
    - B has memory.current = 10M.
    - The system undergoes global reclaim (or memcg reclaim in A).
    - In shrink_node_memcgs():
      - mem_cgroup_calculate_protection() calculates the effective min (emin)
        of B as 10M.
      - mem_cgroup_below_min() returns true for B, we do not reclaim from B.
    - Now if we want to reclaim 5M from B using proactive reclaim
      (memory.reclaim), we should be able to, as the protection of the
      target memcg should be ignored.
    - In shrink_node_memcgs():
      - mem_cgroup_calculate_protection() immediately returns for B without
        doing anything, as B is the target memcg, relying on
        mem_cgroup_protection() to ignore B's stale effective min (still 10M).
      - mem_cgroup_below_min() reads the stale effective min for B and we
        skip it instead of ignoring its protection as intended, as we never
        reach mem_cgroup_protection().
    
    (b) An more complex example with recursive protection is as follows.
    Consider the following hierarchy with memory_recursiveprot:
    ROOT
     |
     A (memory.min = 50M)
     |
     B (memory.min = 10M, memory.high = 40M)
    
    Consider the following scenario:
    - B has memory.current = 35M.
    - The system undergoes global reclaim (target memcg is NULL).
    - B will have an effective min of 50M (all of A's unclaimed protection).
    - B will not be reclaimed from.
    - Now allocate 10M more memory in B, pushing it above it's high limit.
    - The system undergoes memcg reclaim from B (target memcg is B).
    - Like example (a), we do nothing in mem_cgroup_calculate_protection(),
      then call mem_cgroup_below_min(), which will read the stale effective
      min for B (50M) and skip it. In this case, it's even worse because we
      are not just considering B's standalone protection (10M), but we are
      reading a much higher stale protection (50M) which will cause us to not
      reclaim from B at all.
    
    This is an artifact of commit 45c7f7e1 ("mm, memcg: decouple
    e{low,min} state mutations from protection checks") which made
    mem_cgroup_calculate_protection() only change the state without returning
    any value.  Before that commit, we used to return MEMCG_PROT_NONE for the
    target memcg, which would cause us to skip the
    mem_cgroup_below_{min/low}() checks.  After that commit we do not return
    anything and we end up checking the min & low effective protections for
    the target memcg, which are stale.
    
    Update mem_cgroup_supports_protection() to also check if we are reclaiming
    from the target, and rename it to mem_cgroup_unprotected() (now returns
    true if we should not protect the memcg, much simpler logic).
    
    Link: https://lkml.kernel.org/r/20221202031512.1365483-1-yosryahmed@google.com
    Link: https://lkml.kernel.org/r/20221202031512.1365483-2-yosryahmed@google.com
    Fixes: 45c7f7e1 ("mm, memcg: decouple e{low,min} state mutations from protection checks")
    Signed-off-by: NYosry Ahmed <yosryahmed@google.com>
    Reviewed-by: NRoman Gushchin <roman.gushchin@linux.dev>
    Cc: Chris Down <chris@chrisdown.name>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Vasily Averin <vasily.averin@linux.dev>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
    adb82130
memcontrol.h 45.4 KB