1. 18 8月, 2018 2 次提交
    • S
      fs: fsnotify: account fsnotify metadata to kmemcg · d46eb14b
      Shakeel Butt 提交于
      Patch series "Directed kmem charging", v8.
      
      The Linux kernel's memory cgroup allows limiting the memory usage of the
      jobs running on the system to provide isolation between the jobs.  All
      the kernel memory allocated in the context of the job and marked with
      __GFP_ACCOUNT will also be included in the memory usage and be limited
      by the job's limit.
      
      The kernel memory can only be charged to the memcg of the process in
      whose context kernel memory was allocated.  However there are cases
      where the allocated kernel memory should be charged to the memcg
      different from the current processes's memcg.  This patch series
      contains two such concrete use-cases i.e.  fsnotify and buffer_head.
      
      The fsnotify event objects can consume a lot of system memory for large
      or unlimited queues if there is either no or slow listener.  The events
      are allocated in the context of the event producer.  However they should
      be charged to the event consumer.  Similarly the buffer_head objects can
      be allocated in a memcg different from the memcg of the page for which
      buffer_head objects are being allocated.
      
      To solve this issue, this patch series introduces mechanism to charge
      kernel memory to a given memcg.  In case of fsnotify events, the memcg
      of the consumer can be used for charging and for buffer_head, the memcg
      of the page can be charged.  For directed charging, the caller can use
      the scope API memalloc_[un]use_memcg() to specify the memcg to charge
      for all the __GFP_ACCOUNT allocations within the scope.
      
      This patch (of 2):
      
      A lot of memory can be consumed by the events generated for the huge or
      unlimited queues if there is either no or slow listener.  This can cause
      system level memory pressure or OOMs.  So, it's better to account the
      fsnotify kmem caches to the memcg of the listener.
      
      However the listener can be in a different memcg than the memcg of the
      producer and these allocations happen in the context of the event
      producer.  This patch introduces remote memcg charging API which the
      producer can use to charge the allocations to the memcg of the listener.
      
      There are seven fsnotify kmem caches and among them allocations from
      dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
      inotify_inode_mark_cachep happens in the context of syscall from the
      listener.  So, SLAB_ACCOUNT is enough for these caches.
      
      The objects from fsnotify_mark_connector_cachep are not accounted as
      they are small compared to the notification mark or events and it is
      unclear whom to account connector to since it is shared by all events
      attached to the inode.
      
      The allocations from the event caches happen in the context of the event
      producer.  For such caches we will need to remote charge the allocations
      to the listener's memcg.  Thus we save the memcg reference in the
      fsnotify_group structure of the listener.
      
      This patch has also moved the members of fsnotify_group to keep the size
      same, at least for 64 bit build, even with additional member by filling
      the holes.
      
      [shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
        Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
      Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d46eb14b
    • R
      mm: introduce mem_cgroup_put() helper · dc0b5864
      Roman Gushchin 提交于
      Introduce the mem_cgroup_put() helper, which helps to eliminate guarding
      memcg css release with "#ifdef CONFIG_MEMCG" in multiple places.
      
      Link: http://lkml.kernel.org/r/20180623000600.5818-2-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@kernel.org>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dc0b5864
  2. 09 7月, 2018 1 次提交
  3. 15 6月, 2018 1 次提交
  4. 08 6月, 2018 6 次提交
    • A
      mem_cgroup: make sure moving_account, move_lock_task and stat_cpu in the same cacheline · e81bf979
      Aaron Lu 提交于
      The LKP robot found a 27% will-it-scale/page_fault3 performance
      regression regarding commit e27be240("mm: memcg: make sure
      memory.events is uptodate when waking pollers").
      
      What the test does is:
       1 mkstemp() a 128M file on a tmpfs;
       2 start $nr_cpu processes, each to loop the following:
         2.1 mmap() this file in shared write mode;
         2.2 write 0 to this file in a PAGE_SIZE step till the end of the file;
         2.3 unmap() this file and repeat this process.
       3 After 5 minutes, check how many loops they managed to complete, the
         higher the better.
      
      The commit itself looks innocent enough as it merely changed some event
      counting mechanism and this test didn't trigger those events at all.
      Perf shows increased cycles spent on accessing root_mem_cgroup->stat_cpu
      in count_memcg_event_mm()(called by handle_mm_fault()) and in
      __mod_memcg_state() called by page_add_file_rmap().  So it's likely due
      to the changed layout of 'struct mem_cgroup' that either make stat_cpu
      falling into a constantly modifying cacheline or some hot fields stop
      being in the same cacheline.
      
      I verified this by moving memory_events[] back to where it was:
      
      : --- a/include/linux/memcontrol.h
      : +++ b/include/linux/memcontrol.h
      : @@ -205,7 +205,6 @@ struct mem_cgroup {
      :  	int		oom_kill_disable;
      :
      :  	/* memory.events */
      : -	atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
      :  	struct cgroup_file events_file;
      :
      :  	/* protect arrays of thresholds */
      : @@ -238,6 +237,7 @@ struct mem_cgroup {
      :  	struct mem_cgroup_stat_cpu __percpu *stat_cpu;
      :  	atomic_long_t		stat[MEMCG_NR_STAT];
      :  	atomic_long_t		events[NR_VM_EVENT_ITEMS];
      : +	atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
      :
      :  	unsigned long		socket_pressure;
      
      And performance restored.
      
      Later investigation found that as long as the following 3 fields
      moving_account, move_lock_task and stat_cpu are in the same cacheline,
      performance will be good.  To avoid future performance surprise by other
      commits changing the layout of 'struct mem_cgroup', this patch makes
      sure the 3 fields stay in the same cacheline.
      
      One concern of this approach is, moving_account and move_lock_task could
      be modified when a process changes memory cgroup while stat_cpu is a
      always read field, it might hurt to place them in the same cacheline.  I
      assume it is rare for a process to change memory cgroup so this should
      be OK.
      
      Link: https://lkml.kernel.org/r/20180528114019.GF9904@yexl-desktop
      Link: http://lkml.kernel.org/r/20180601071115.GA27302@intel.comSigned-off-by: NAaron Lu <aaron.lu@intel.com>
      Reported-by: Nkernel test robot <xiaolong.ye@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e81bf979
    • R
      memcg: introduce memory.min · bf8d5d52
      Roman Gushchin 提交于
      Memory controller implements the memory.low best-effort memory
      protection mechanism, which works perfectly in many cases and allows
      protecting working sets of important workloads from sudden reclaim.
      
      But its semantics has a significant limitation: it works only as long as
      there is a supply of reclaimable memory.  This makes it pretty useless
      against any sort of slow memory leaks or memory usage increases.  This
      is especially true for swapless systems.  If swap is enabled, memory
      soft protection effectively postpones problems, allowing a leaking
      application to fill all swap area, which makes no sense.  The only
      effective way to guarantee the memory protection in this case is to
      invoke the OOM killer.
      
      It's possible to handle this case in userspace by reacting on MEMCG_LOW
      events; but there is still a place for a fail-safe in-kernel mechanism
      to provide stronger guarantees.
      
      This patch introduces the memory.min interface for cgroup v2 memory
      controller.  It works very similarly to memory.low (sharing the same
      hierarchical behavior), except that it's not disabled if there is no
      more reclaimable memory in the system.
      
      If cgroup is not populated, its memory.min is ignored, because otherwise
      even the OOM killer wouldn't be able to reclaim the protected memory,
      and the system can stall.
      
      [guro@fb.com: s/low/min/ in docs]
      Link: http://lkml.kernel.org/r/20180510130758.GA9129@castle.DHCP.thefacebook.com
      Link: http://lkml.kernel.org/r/20180509180734.GA4856@castle.DHCP.thefacebook.comSigned-off-by: NRoman Gushchin <guro@fb.com>
      Reviewed-by: NRandy Dunlap <rdunlap@infradead.org>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bf8d5d52
    • W
      memcg: writeback: use memcg->cgwb_list directly · 9ccc3617
      Wang Long 提交于
      mem_cgroup_cgwb_list is a very simple wrapper and it will never be used
      outside of code under CONFIG_CGROUP_WRITEBACK.  so use memcg->cgwb_list
      directly.
      
      Link: http://lkml.kernel.org/r/1524406173-212182-1-git-send-email-wanglong19@meituan.comSigned-off-by: NWang Long <wanglong19@meituan.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Acked-by: NTejun Heo <tj@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9ccc3617
    • R
      mm: memory.low hierarchical behavior · 23067153
      Roman Gushchin 提交于
      This patch aims to address an issue in current memory.low semantics,
      which makes it hard to use it in a hierarchy, where some leaf memory
      cgroups are more valuable than others.
      
      For example, there are memcgs A, A/B, A/C, A/D and A/E:
      
        A      A/memory.low = 2G, A/memory.current = 6G
       //\\
      BC  DE   B/memory.low = 3G  B/memory.current = 2G
               C/memory.low = 1G  C/memory.current = 2G
               D/memory.low = 0   D/memory.current = 2G
      	 E/memory.low = 10G E/memory.current = 0
      
      If we apply memory pressure, B, C and D are reclaimed at the same pace
      while A's usage exceeds 2G.  This is obviously wrong, as B's usage is
      fully below B's memory.low, and C has 1G of protection as well.  Also, A
      is pushed to the size, which is less than A's 2G memory.low, which is
      also wrong.
      
      A simple bash script (provided below) can be used to reproduce
      the problem. Current results are:
        A:    1430097920
        A/B:  711929856
        A/C:  717426688
        A/D:  741376
        A/E:  0
      
      To address the issue a concept of effective memory.low is introduced.
      Effective memory.low is always equal or less than original memory.low.
      In a case, when there is no memory.low overcommittment (and also for
      top-level cgroups), these two values are equal.
      
      Otherwise it's a part of parent's effective memory.low, calculated as a
      cgroup's memory.low usage divided by sum of sibling's memory.low usages
      (under memory.low usage I mean the size of actually protected memory:
      memory.current if memory.current < memory.low, 0 otherwise).  It's
      necessary to track the actual usage, because otherwise an empty cgroup
      with memory.low set (A/E in my example) will affect actual memory
      distribution, which makes no sense.  To avoid traversing the cgroup tree
      twice, page_counters code is reused.
      
      Calculating effective memory.low can be done in the reclaim path, as we
      conveniently traversing the cgroup tree from top to bottom and check
      memory.low on each level.  So, it's a perfect place to calculate
      effective memory low and save it to use it for children cgroups.
      
      This also eliminates a need to traverse the cgroup tree from bottom to
      top each time to check if parent's guarantee is not exceeded.
      
      Setting/resetting effective memory.low is intentionally racy, but it's
      fine and shouldn't lead to any significant differences in actual memory
      distribution.
      
      With this patch applied results are matching the expectations:
        A:    2147930112
        A/B:  1428721664
        A/C:  718393344
        A/D:  815104
        A/E:  0
      
      Test script:
        #!/bin/bash
      
        CGPATH="/sys/fs/cgroup"
      
        truncate /file1 --size 2G
        truncate /file2 --size 2G
        truncate /file3 --size 2G
        truncate /file4 --size 50G
      
        mkdir "${CGPATH}/A"
        echo "+memory" > "${CGPATH}/A/cgroup.subtree_control"
        mkdir "${CGPATH}/A/B" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/E"
      
        echo 2G > "${CGPATH}/A/memory.low"
        echo 3G > "${CGPATH}/A/B/memory.low"
        echo 1G > "${CGPATH}/A/C/memory.low"
        echo 0 > "${CGPATH}/A/D/memory.low"
        echo 10G > "${CGPATH}/A/E/memory.low"
      
        echo $$ > "${CGPATH}/A/B/cgroup.procs" && vmtouch -qt /file1
        echo $$ > "${CGPATH}/A/C/cgroup.procs" && vmtouch -qt /file2
        echo $$ > "${CGPATH}/A/D/cgroup.procs" && vmtouch -qt /file3
        echo $$ > "${CGPATH}/cgroup.procs" && vmtouch -qt /file4
      
        echo "A:   " `cat "${CGPATH}/A/memory.current"`
        echo "A/B: " `cat "${CGPATH}/A/B/memory.current"`
        echo "A/C: " `cat "${CGPATH}/A/C/memory.current"`
        echo "A/D: " `cat "${CGPATH}/A/D/memory.current"`
        echo "A/E: " `cat "${CGPATH}/A/E/memory.current"`
      
        rmdir "${CGPATH}/A/B" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/E"
        rmdir "${CGPATH}/A"
        rm /file1 /file2 /file3 /file4
      
      Link: http://lkml.kernel.org/r/20180405185921.4942-2-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      23067153
    • R
      mm: rename page_counter's count/limit into usage/max · bbec2e15
      Roman Gushchin 提交于
      This patch renames struct page_counter fields:
        count -> usage
        limit -> max
      
      and the corresponding functions:
        page_counter_limit() -> page_counter_set_max()
        mem_cgroup_get_limit() -> mem_cgroup_get_max()
        mem_cgroup_resize_limit() -> mem_cgroup_resize_max()
        memcg_update_kmem_limit() -> memcg_update_kmem_max()
        memcg_update_tcp_limit() -> memcg_update_tcp_max()
      
      The idea behind this renaming is to have the direct matching
      between memory cgroup knobs (low, high, max) and page_counters API.
      
      This is pure renaming, this patch doesn't bring any functional change.
      
      Link: http://lkml.kernel.org/r/20180405185921.4942-1-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bbec2e15
    • T
      mm, memcontrol: implement memory.swap.events · f3a53a3a
      Tejun Heo 提交于
      Add swap max and fail events so that userland can monitor and respond to
      running out of swap.
      
      I'm not too sure about the fail event.  Right now, it's a bit confusing
      which stats / events are recursive and which aren't and also which ones
      reflect events which originate from a given cgroup and which targets the
      cgroup.  No idea what the right long term solution is and it could just
      be that growing them organically is actually the only right thing to do.
      
      Link: http://lkml.kernel.org/r/20180416231151.GI1911913@devbig577.frc2.facebook.comSigned-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: <linux-api@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f3a53a3a
  5. 12 4月, 2018 2 次提交
    • J
      mm: memcg: make sure memory.events is uptodate when waking pollers · e27be240
      Johannes Weiner 提交于
      Commit a983b5eb ("mm: memcontrol: fix excessive complexity in
      memory.stat reporting") added per-cpu drift to all memory cgroup stats
      and events shown in memory.stat and memory.events.
      
      For memory.stat this is acceptable.  But memory.events issues file
      notifications, and somebody polling the file for changes will be
      confused when the counters in it are unchanged after a wakeup.
      
      Luckily, the events in memory.events - MEMCG_LOW, MEMCG_HIGH, MEMCG_MAX,
      MEMCG_OOM - are sufficiently rare and high-level that we don't need
      per-cpu buffering for them: MEMCG_HIGH and MEMCG_MAX would be the most
      frequent, but they're counting invocations of reclaim, which is a
      complex operation that touches many shared cachelines.
      
      This splits memory.events from the generic VM events and tracks them in
      their own, unbuffered atomic counters.  That's also cleaner, as it
      eliminates the ugly enum nesting of VM and cgroup events.
      
      [hannes@cmpxchg.org: "array subscript is above array bounds"]
        Link: http://lkml.kernel.org/r/20180406155441.GA20806@cmpxchg.org
      Link: http://lkml.kernel.org/r/20180405175507.GA24817@cmpxchg.org
      Fixes: a983b5eb ("mm: memcontrol: fix excessive complexity in memory.stat reporting")
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: NTejun Heo <tj@kernel.org>
      Acked-by: NTejun Heo <tj@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e27be240
    • A
      mm/vmscan: don't mess with pgdat->flags in memcg reclaim · e3c1ac58
      Andrey Ryabinin 提交于
      memcg reclaim may alter pgdat->flags based on the state of LRU lists in
      cgroup and its children.  PGDAT_WRITEBACK may force kswapd to sleep
      congested_wait(), PGDAT_DIRTY may force kswapd to writeback filesystem
      pages.  But the worst here is PGDAT_CONGESTED, since it may force all
      direct reclaims to stall in wait_iff_congested().  Note that only kswapd
      have powers to clear any of these bits.  This might just never happen if
      cgroup limits configured that way.  So all direct reclaims will stall as
      long as we have some congested bdi in the system.
      
      Leave all pgdat->flags manipulations to kswapd.  kswapd scans the whole
      pgdat, only kswapd can clear pgdat->flags once node is balanced, thus
      it's reasonable to leave all decisions about node state to kswapd.
      
      Why only kswapd? Why not allow to global direct reclaim change these
      flags? It is because currently only kswapd can clear these flags.  I'm
      less worried about the case when PGDAT_CONGESTED falsely not set, and
      more worried about the case when it falsely set.  If direct reclaimer
      sets PGDAT_CONGESTED, do we have guarantee that after the congestion
      problem is sorted out, kswapd will be woken up and clear the flag? It
      seems like there is no such guarantee.  E.g.  direct reclaimers may
      eventually balance pgdat and kswapd simply won't wake up (see
      wakeup_kswapd()).
      
      Moving pgdat->flags manipulation to kswapd, means that cgroup2 recalim
      now loses its congestion throttling mechanism.  Add per-cgroup
      congestion state and throttle cgroup2 reclaimers if memcg is in
      congestion state.
      
      Currently there is no need in per-cgroup PGDAT_WRITEBACK and PGDAT_DIRTY
      bits since they alter only kswapd behavior.
      
      The problem could be easily demonstrated by creating heavy congestion in
      one cgroup:
      
          echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
          mkdir -p /sys/fs/cgroup/congester
          echo 512M > /sys/fs/cgroup/congester/memory.max
          echo $$ > /sys/fs/cgroup/congester/cgroup.procs
          /* generate a lot of diry data on slow HDD */
          while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
          ....
          while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
      
      and some job in another cgroup:
      
          mkdir /sys/fs/cgroup/victim
          echo 128M > /sys/fs/cgroup/victim/memory.max
      
          # time cat /dev/sda > /dev/null
          real    10m15.054s
          user    0m0.487s
          sys     1m8.505s
      
      According to the tracepoint in wait_iff_congested(), the 'cat' spent 50%
      of the time sleeping there.
      
      With the patch, cat don't waste time anymore:
      
          # time cat /dev/sda > /dev/null
          real    5m32.911s
          user    0m0.411s
          sys     0m56.664s
      
      [aryabinin@virtuozzo.com: congestion state should be per-node]
        Link: http://lkml.kernel.org/r/20180406135215.10057-1-aryabinin@virtuozzo.com
      [ayabinin@virtuozzo.com: make congestion state per-cgroup-per-node instead of just per-cgroup[
        Link: http://lkml.kernel.org/r/20180406180254.8970-2-aryabinin@virtuozzo.com
      Link: http://lkml.kernel.org/r/20180323152029.11084-5-aryabinin@virtuozzo.comSigned-off-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e3c1ac58
  6. 22 2月, 2018 1 次提交
  7. 01 2月, 2018 3 次提交
  8. 07 9月, 2017 1 次提交
  9. 19 8月, 2017 1 次提交
    • J
      mm: memcontrol: fix NULL pointer crash in test_clear_page_writeback() · 739f79fc
      Johannes Weiner 提交于
      Jaegeuk and Brad report a NULL pointer crash when writeback ending tries
      to update the memcg stats:
      
          BUG: unable to handle kernel NULL pointer dereference at 00000000000003b0
          IP: test_clear_page_writeback+0x12e/0x2c0
          [...]
          RIP: 0010:test_clear_page_writeback+0x12e/0x2c0
          Call Trace:
           <IRQ>
           end_page_writeback+0x47/0x70
           f2fs_write_end_io+0x76/0x180 [f2fs]
           bio_endio+0x9f/0x120
           blk_update_request+0xa8/0x2f0
           scsi_end_request+0x39/0x1d0
           scsi_io_completion+0x211/0x690
           scsi_finish_command+0xd9/0x120
           scsi_softirq_done+0x127/0x150
           __blk_mq_complete_request_remote+0x13/0x20
           flush_smp_call_function_queue+0x56/0x110
           generic_smp_call_function_single_interrupt+0x13/0x30
           smp_call_function_single_interrupt+0x27/0x40
           call_function_single_interrupt+0x89/0x90
          RIP: 0010:native_safe_halt+0x6/0x10
      
          (gdb) l *(test_clear_page_writeback+0x12e)
          0xffffffff811bae3e is in test_clear_page_writeback (./include/linux/memcontrol.h:619).
          614		mod_node_page_state(page_pgdat(page), idx, val);
          615		if (mem_cgroup_disabled() || !page->mem_cgroup)
          616			return;
          617		mod_memcg_state(page->mem_cgroup, idx, val);
          618		pn = page->mem_cgroup->nodeinfo[page_to_nid(page)];
          619		this_cpu_add(pn->lruvec_stat->count[idx], val);
          620	}
          621
          622	unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
          623							gfp_t gfp_mask,
      
      The issue is that writeback doesn't hold a page reference and the page
      might get freed after PG_writeback is cleared (and the mapping is
      unlocked) in test_clear_page_writeback().  The stat functions looking up
      the page's node or zone are safe, as those attributes are static across
      allocation and free cycles.  But page->mem_cgroup is not, and it will
      get cleared if we race with truncation or migration.
      
      It appears this race window has been around for a while, but less likely
      to trigger when the memcg stats were updated first thing after
      PG_writeback is cleared.  Recent changes reshuffled this code to update
      the global node stats before the memcg ones, though, stretching the race
      window out to an extent where people can reproduce the problem.
      
      Update test_clear_page_writeback() to look up and pin page->mem_cgroup
      before clearing PG_writeback, then not use that pointer afterward.  It
      is a partial revert of 62cccb8c ("mm: simplify lock_page_memcg()")
      but leaves the pageref-holding callsites that aren't affected alone.
      
      Link: http://lkml.kernel.org/r/20170809183825.GA26387@cmpxchg.org
      Fixes: 62cccb8c ("mm: simplify lock_page_memcg()")
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: NJaegeuk Kim <jaegeuk@kernel.org>
      Tested-by: NJaegeuk Kim <jaegeuk@kernel.org>
      Reported-by: NBradley Bolen <bradleybolen@gmail.com>
      Tested-by: NBrad Bolen <bradleybolen@gmail.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org>	[4.6+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      739f79fc
  10. 07 7月, 2017 5 次提交
  11. 04 5月, 2017 6 次提交
  12. 01 4月, 2017 1 次提交
  13. 23 2月, 2017 2 次提交
    • T
      slab: use memcg_kmem_cache_wq for slab destruction operations · 17cc4dfe
      Tejun Heo 提交于
      If there's contention on slab_mutex, queueing the per-cache destruction
      work item on the system_wq can unnecessarily create and tie up a lot of
      kworkers.
      
      Rename memcg_kmem_cache_create_wq to memcg_kmem_cache_wq and make it
      global and use that workqueue for the destruction work items too.  While
      at it, convert the workqueue from an unbound workqueue to a per-cpu one
      with concurrency limited to 1.  It's generally preferable to use per-cpu
      workqueues and concurrency limit of 1 is safe enough.
      
      This is suggested by Joonsoo Kim.
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-11-tj@kernel.orgSigned-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NJay Vana <jsvana@fb.com>
      Acked-by: NVladimir Davydov <vdavydov@tarantool.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      17cc4dfe
    • T
      slab: link memcg kmem_caches on their associated memory cgroup · bc2791f8
      Tejun Heo 提交于
      With kmem cgroup support enabled, kmem_caches can be created and
      destroyed frequently and a great number of near empty kmem_caches can
      accumulate if there are a lot of transient cgroups and the system is not
      under memory pressure.  When memory reclaim starts under such
      conditions, it can lead to consecutive deactivation and destruction of
      many kmem_caches, easily hundreds of thousands on moderately large
      systems, exposing scalability issues in the current slab management
      code.  This is one of the patches to address the issue.
      
      While a memcg kmem_cache is listed on its root cache's ->children list,
      there is no direct way to iterate all kmem_caches which are assocaited
      with a memory cgroup.  The only way to iterate them is walking all
      caches while filtering out caches which don't match, which would be most
      of them.
      
      This makes memcg destruction operations O(N^2) where N is the total
      number of slab caches which can be huge.  This combined with the
      synchronous RCU operations can tie up a CPU and affect the whole machine
      for many hours when memory reclaim triggers offlining and destruction of
      the stale memcgs.
      
      This patch adds mem_cgroup->kmem_caches list which goes through
      memcg_cache_params->kmem_caches_node of all kmem_caches which are
      associated with the memcg.  All memcg specific iterations, including
      stat file access, are updated to use the new list instead.
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-6-tj@kernel.orgSigned-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NJay Vana <jsvana@fb.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bc2791f8
  14. 11 1月, 2017 1 次提交
    • M
      mm, memcg: fix the active list aging for lowmem requests when memcg is enabled · b4536f0c
      Michal Hocko 提交于
      Nils Holland and Klaus Ethgen have reported unexpected OOM killer
      invocations with 32b kernel starting with 4.8 kernels
      
      	kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
      	kworker/u4:5 cpuset=/ mems_allowed=0
      	CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
      	[...]
      	Mem-Info:
      	active_anon:58685 inactive_anon:90 isolated_anon:0
      	 active_file:274324 inactive_file:281962 isolated_file:0
      	 unevictable:0 dirty:649 writeback:0 unstable:0
      	 slab_reclaimable:40662 slab_unreclaimable:17754
      	 mapped:7382 shmem:202 pagetables:351 bounce:0
      	 free:206736 free_pcp:332 free_cma:0
      	Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
      	DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB writepending:96kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
      	lowmem_reserve[]: 0 813 3474 3474
      	Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
      	lowmem_reserve[]: 0 0 21292 21292
      	HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB
      
      the oom killer is clearly pre-mature because there there is still a lot
      of page cache in the zone Normal which should satisfy this lowmem
      request.  Further debugging has shown that the reclaim cannot make any
      forward progress because the page cache is hidden in the active list
      which doesn't get rotated because inactive_list_is_low is not memcg
      aware.
      
      The code simply subtracts per-zone highmem counters from the respective
      memcg's lru sizes which doesn't make any sense.  We can simply end up
      always seeing the resulting active and inactive counts 0 and return
      false.  This issue is not limited to 32b kernels but in practice the
      effect on systems without CONFIG_HIGHMEM would be much harder to notice
      because we do not invoke the OOM killer for allocations requests
      targeting < ZONE_NORMAL.
      
      Fix the issue by tracking per zone lru page counts in mem_cgroup_per_node
      and subtract per-memcg highmem counts when memcg is enabled.  Introduce
      helper lruvec_zone_lru_size which redirects to either zone counters or
      mem_cgroup_get_zone_lru_size when appropriate.
      
      We are losing empty LRU but non-zero lru size detection introduced by
      ca707239 ("mm: update_lru_size warn and reset bad lru_size") because
      of the inherent zone vs. node discrepancy.
      
      Fixes: f8d1a311 ("mm: consider whether to decivate based on eligible zones inactive ratio")
      Link: http://lkml.kernel.org/r/20170104100825.3729-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NNils Holland <nholland@tisys.org>
      Tested-by: NNils Holland <nholland@tisys.org>
      Reported-by: NKlaus Ethgen <Klaus@Ethgen.de>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>	[4.8+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b4536f0c
  15. 08 10月, 2016 2 次提交
  16. 29 7月, 2016 5 次提交