1. 06 Nov 2015, 4 commits
    • mm/memcontrol.c: fix order calculation in try_charge() · 3608de07
      Committed by Jerome Marchand
      Since commit 6539cc05 ("mm: memcontrol: fold mem_cgroup_do_charge()"),
      the order to pass to mem_cgroup_oom() is calculated by passing the
      number of pages to get_order() instead of the expected size in bytes.
      AFAICT, it only affects the value displayed in the oom warning message.
      This patch fixes this.
      
      Michal said:
      
      : We haven't noticed that just because the OOM is enabled only for page
      : faults of order-0 (single page) and get_order work just fine.  Thanks for
      : noticing this.  If we ever start triggering OOM on different orders this
      : would be broken.
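
      A hedged sketch of the kind of change described above (reusing the
      try_charge() locals nr_pages and mem_over_limit for illustration, not the
      verbatim diff): get_order() expects a size in bytes, so the page count has
      to be scaled by PAGE_SIZE first.

      	/* before: a raw page count was passed where a byte size is expected */
      	mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(nr_pages));

      	/* after: convert pages to bytes before computing the order */
      	mem_cgroup_oom(mem_over_limit, gfp_mask,
      		       get_order(nr_pages * PAGE_SIZE));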
      Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3608de07
    • memcg: ratify and consolidate over-charge handling · 10d53c74
      Committed by Tejun Heo
      try_charge() is the main charging logic of memcg.  When it hits the limit
      but either can't fail the allocation due to __GFP_NOFAIL or the task is
      likely to free memory very soon, being OOM killed, has SIGKILL pending or
      exiting, it "bypasses" the charge to the root memcg and returns -EINTR.
      While this is one approach which can be taken for these situations, it has
      several issues.
      
      * It unnecessarily lies about the reality.  The number itself doesn't
        go over the limit but the actual usage does.  memcg is either forced
        to or actively chooses to go over the limit because that is the
        right behavior under the circumstances, which is completely fine,
        but, if at all avoidable, it shouldn't be misrepresenting what's
        happening by sneaking the charges into the root memcg.
      
      * Despite trying, we already do over-charge.  kmemcg can't deal with
        switching over to the root memcg by the point try_charge() returns
        -EINTR, so it open-codes over-charging.
      
      * It complicates the callers.  Each try_charge() user has to handle
        the weird -EINTR exception.  memcg_charge_kmem() does the manual
        over-charging.  mem_cgroup_do_precharge() performs unnecessary
        uncharging of root memcg, which BTW is inconsistent with what
        memcg_charge_kmem() does but not broken as [un]charging are noops on
        root memcg.  mem_cgroup_try_charge() needs to switch the returned
        cgroup to the root one.
      
      The reality is that in memcg there are cases where we are forced and/or
      willing to go over the limit.  Each such case needs to be scrutinized and
      justified but there definitely are situations where that is the right
      thing to do.  We already do this but with a superficial and inconsistent
      disguise which leads to unnecessary complications.
      
      This patch updates try_charge() so that it over-charges and returns 0 when
      deemed necessary.  -EINTR return is removed along with all special case
      handling in the callers.
      
      While at it, remove the local variable @ret, which was initialized to zero
      and never changed, along with the done: label, which just returned the
      always-zero @ret.
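
      A hedged sketch of the resulting behaviour (illustrative, not the exact
      diff): instead of bypassing to the root memcg and returning -EINTR,
      try_charge() charges the counters past the limit and reports success.

      	force:
      		/* deliberately exceed the limit and account it truthfully */
      		page_counter_charge(&memcg->memory, nr_pages);
      		if (do_swap_account)
      			page_counter_charge(&memcg->memsw, nr_pages);
      		css_get_many(&memcg->css, nr_pages);
      		return 0;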
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      10d53c74
    • memcg: punt high overage reclaim to return-to-userland path · b23afb93
      Committed by Tejun Heo
      Currently, try_charge() tries to reclaim memory synchronously when the
      high limit is breached; however, if the allocation doesn't have
      __GFP_WAIT, synchronous reclaim is skipped.  If a process performs only
      speculative allocations, it can blow way past the high limit.  This is
      actually easily reproducible by simply doing "find /".  slab/slub
      allocator tries speculative allocations first, so as long as there's
      memory which can be consumed without blocking, it can keep allocating
      memory regardless of the high limit.
      
      This patch makes try_charge() always punt the over-high reclaim to the
      return-to-userland path.  If try_charge() detects that high limit is
      breached, it adds the overage to current->memcg_nr_pages_over_high and
      schedules execution of mem_cgroup_handle_over_high() which performs
      synchronous reclaim from the return-to-userland path.
      
      As long as the kernel doesn't have a runaway allocation spree, this should
      provide enough protection while making kmemcg behave more consistently.
      It also has the following benefits.
      
      - All over-high reclaims can use GFP_KERNEL regardless of the specific
        gfp mask in use, e.g. GFP_NOFS, when the limit was breached.
      
      - It avoids priority inversion.  Previously, a low-priority task with a
        small memory.high might perform over-high reclaim with a bunch of
        locks held.  If a higher-priority task needed any of these locks, it
        would have to wait until the low-priority task finished reclaim and
        released the locks.  Handing over-high reclaim to the
        return-to-userland path avoids this issue.
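
      A hedged sketch of the mechanism (the field and function names are taken
      from this changelog; the surrounding details are illustrative):

      	/* in try_charge(): record the overage instead of reclaiming here */
      	if (page_counter_read(&memcg->memory) > memcg->high) {
      		current->memcg_nr_pages_over_high += nr_pages;
      		set_notify_resume(current);
      	}

      	/* run on return to userland; free to reclaim with GFP_KERNEL */
      	void mem_cgroup_handle_over_high(void)
      	{
      		unsigned int nr_pages = current->memcg_nr_pages_over_high;

      		if (likely(!nr_pages))
      			return;
      		current->memcg_nr_pages_over_high = 0;
      		/* ... synchronously reclaim nr_pages from the task's memcg ... */
      	}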
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b23afb93
    • memcg: flatten task_struct->memcg_oom · 626ebc41
      Committed by Tejun Heo
      task_struct->memcg_oom is a sub-struct containing fields which are used
      for async memcg oom handling.  Most task_struct fields aren't packaged
      this way and it can lead to unnecessary alignment paddings.  This patch
      flattens it.
      
      * task.memcg_oom.memcg    -> task.memcg_in_oom
      * task.memcg_oom.gfp_mask -> task.memcg_oom_gfp_mask
      * task.memcg_oom.order    -> task.memcg_oom_order
      * task.memcg_oom.may_oom  -> task.memcg_may_oom
      
      In addition, task.memcg_may_oom is relocated next to the other bitfields,
      which reduces the size of task_struct.
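
      A hedged sketch of the resulting task_struct layout (only the fields from
      the rename list above; everything else elided):

      	struct task_struct {
      		...
      		/* sits with the other bitfields to avoid padding */
      		unsigned			memcg_may_oom:1;
      		...
      		struct mem_cgroup		*memcg_in_oom;
      		gfp_t				memcg_oom_gfp_mask;
      		int				memcg_oom_order;
      		...
      	};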
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      626ebc41
  2. 17 Oct 2015, 1 commit
  3. 13 Oct 2015, 1 commit
    • writeback: fix incorrect calculation of available memory for memcg domains · c5edf9cd
      Committed by Tejun Heo
      For memcg domains, the amount of available memory was calculated as
      
       min(the amount currently in use + headroom according to memcg,
           total clean memory)
      
      This isn't quite correct as what should be capped by the amount of
      clean memory is the headroom, not the sum of memory in use and
      headroom.  For example, if a memcg domain has a significant amount of
      dirty memory, the above can lead to a value which is lower than the
      current amount in use which doesn't make much sense.  In most
      circumstances, the above leads to a number which is somewhat but not
      drastically lower.
      
      As the amount of memory which can be readily allocated to the memcg
      domain is capped by the amount of system-wide clean memory which is
      not already assigned to the memcg itself, the number we want is
      
       the amount currently in use +
       min(headroom according to memcg, clean memory elsewhere in the system)
      
      This patch updates mem_cgroup_wb_stats() to return the number of
      filepages and headroom instead of the calculated available pages.
      mdtc_cap_avail() is renamed to mdtc_calc_avail() and performs the
      above calculation from file, headroom, dirty and globally clean pages.
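
      A hedged arithmetic sketch of the corrected calculation (variable names
      are illustrative, not copied from mdtc_calc_avail()):

      	/* clean memory that lives outside this memcg domain */
      	other_clean = global_clean - min(global_clean, memcg_clean);
      	/* the headroom, not the total, is what clean memory caps */
      	avail = filepages + min(headroom, other_clean);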
      
      v2: Dummy mem_cgroup_wb_stats() implementation wasn't updated leading
          to build failure when !CGROUP_WRITEBACK.  Fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: c2aa723a ("writeback: implement memcg writeback domain based throttling")
      Signed-off-by: Jens Axboe <axboe@fb.com>
      c5edf9cd
  4. 02 Oct 2015, 2 commits
  5. 11 Sep 2015, 2 commits
    • memcg: zap try_get_mem_cgroup_from_page · e993d905
      Committed by Vladimir Davydov
      It is only used in mem_cgroup_try_charge, so fold it in and zap it.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e993d905
    • memcg: add page_cgroup_ino helper · 2fc04524
      Committed by Vladimir Davydov
      This patchset introduces a new user API for tracking user memory pages
      that have not been used for a given period of time.  The purpose of this
      is to provide the userspace with the means of tracking a workload's
      working set, i.e.  the set of pages that are actively used by the
      workload.  Knowing the working set size can be useful for partitioning the
      system more efficiently, e.g.  by tuning memory cgroup limits
      appropriately, or for job placement within a compute cluster.
      
      ==== USE CASES ====
      
      The unified cgroup hierarchy has memory.low and memory.high knobs, which
      are defined as the low and high boundaries for the workload working set
      size.  However, the working set size of a workload may be unknown or
      change in time.  With this patch set, one can periodically estimate the
      amount of memory unused by each cgroup and tune their memory.low and
      memory.high parameters accordingly, therefore optimizing the overall
      memory utilization.
      
      Another use case is balancing workloads within a compute cluster.  Knowing
      how much memory is not really used by a workload unit may help take a more
      optimal decision when considering migrating the unit to another node
      within the cluster.
      
      Also, as noted by Minchan, this would be useful for per-process reclaim
      (https://lwn.net/Articles/545668/).  With idle tracking, a smart userspace
      memory manager could reclaim only the idle pages.
      
      ==== USER API ====
      
      The user API consists of two new files:
      
       * /sys/kernel/mm/page_idle/bitmap.  This file implements a bitmap where each
         bit corresponds to a page, indexed by PFN. When the bit is set, the
         corresponding page is idle. A page is considered idle if it has not been
         accessed since it was marked idle. To mark a page idle one should set the
         bit corresponding to the page by writing to the file. A value written to the
         file is OR-ed with the current bitmap value. Only user memory pages can be
         marked idle, for other page types input is silently ignored. Writing to this
         file beyond max PFN results in the ENXIO error. Only available when
         CONFIG_IDLE_PAGE_TRACKING is set.
      
         This file can be used to estimate the amount of pages that are not
         used by a particular workload as follows:
      
         1. mark all pages of interest idle by setting corresponding bits in the
            /sys/kernel/mm/page_idle/bitmap
         2. wait until the workload accesses its working set
         3. read /sys/kernel/mm/page_idle/bitmap and count the number of bits set
      
       * /proc/kpagecgroup.  This file contains a 64-bit inode number of the
         memory cgroup each page is charged to, indexed by PFN. Only available when
         CONFIG_MEMCG is set.
      
         This file can be used to find all pages (including unmapped file pages)
         accounted to a particular cgroup. Using /sys/kernel/mm/page_idle/bitmap, one
         can then estimate the cgroup working set size.
      
      For an example of using these files to estimate the number of unused
      memory pages per memory cgroup, please see the script attached below.
      
      ==== REASONING ====
      
      The reason to introduce the new user API instead of using
      /proc/PID/{clear_refs,smaps} is that the latter has two serious
      drawbacks:
      
       - it does not count unmapped file pages
       - it affects the reclaimer logic
      
      The new API attempts to overcome them both. For more details on how it
      is achieved, please see the comment to patch 6.
      
      ==== PATCHSET STRUCTURE ====
      
      The patch set is organized as follows:
      
       - patch 1 adds page_cgroup_ino() helper for the sake of
         /proc/kpagecgroup and patches 2-3 do related cleanup
       - patch 4 adds /proc/kpagecgroup, which reports cgroup ino each page is
         charged to
       - patch 5 introduces a new mmu notifier callback, clear_young, which is
         a lightweight version of clear_flush_young; it is used in patch 6
       - patch 6 implements the idle page tracking feature, including the
         userspace API, /sys/kernel/mm/page_idle/bitmap
       - patch 7 exports idle flag via /proc/kpageflags
      
      ==== SIMILAR WORKS ====
      
      Originally, the patch for tracking idle memory was proposed back in 2011
      by Michel Lespinasse (see http://lwn.net/Articles/459269/).  The main
      difference between Michel's patch and this one is that Michel implemented
      a kernel space daemon for estimating idle memory size per cgroup while
      this patch only provides the userspace with the minimal API for doing the
      job, leaving the rest up to the userspace.  However, they both share the
      same idea of Idle/Young page flags to avoid affecting the reclaimer logic.
      
      ==== PERFORMANCE EVALUATION ====
      
      SPECjvm2008 (https://www.spec.org/jvm2008/) was used to evaluate the
      performance impact introduced by this patch set.  Three runs were carried
      out:
      
       - base: kernel without the patch
       - patched: patched kernel, the feature is not used
       - patched-active: patched kernel, 1 minute-period daemon is used for
         tracking idle memory
      
      For tracking idle memory, idlememstat utility was used:
      https://github.com/locker/idlememstat
      
      testcase            base            patched        patched-active
      
      compiler       537.40 ( 0.00)%   532.26 (-0.96)%   538.31 ( 0.17)%
      compress       305.47 ( 0.00)%   301.08 (-1.44)%   300.71 (-1.56)%
      crypto         284.32 ( 0.00)%   282.21 (-0.74)%   284.87 ( 0.19)%
      derby          411.05 ( 0.00)%   413.44 ( 0.58)%   412.07 ( 0.25)%
      mpegaudio      189.96 ( 0.00)%   190.87 ( 0.48)%   189.42 (-0.28)%
      scimark.large   46.85 ( 0.00)%    46.41 (-0.94)%    47.83 ( 2.09)%
      scimark.small  412.91 ( 0.00)%   415.41 ( 0.61)%   421.17 ( 2.00)%
      serial         204.23 ( 0.00)%   213.46 ( 4.52)%   203.17 (-0.52)%
      startup         36.76 ( 0.00)%    35.49 (-3.45)%    35.64 (-3.05)%
      sunflow        115.34 ( 0.00)%   115.08 (-0.23)%   117.37 ( 1.76)%
      xml            620.55 ( 0.00)%   619.95 (-0.10)%   620.39 (-0.03)%
      
      composite      211.50 ( 0.00)%   211.15 (-0.17)%   211.67 ( 0.08)%
      
      time idlememstat:
      
      17.20user 65.16system 2:15:23elapsed 1%CPU (0avgtext+0avgdata 8476maxresident)k
      448inputs+40outputs (1major+36052minor)pagefaults 0swaps
      
      ==== SCRIPT FOR COUNTING IDLE PAGES PER CGROUP ====
      #! /usr/bin/python
      #
      
      import os
      import stat
      import errno
      import struct
      
      CGROUP_MOUNT = "/sys/fs/cgroup/memory"
      BUFSIZE = 8 * 1024  # must be multiple of 8
      
      def get_hugepage_size():
          with open("/proc/meminfo", "r") as f:
              for s in f:
                  k, v = s.split(":")
                  if k == "Hugepagesize":
                      return int(v.split()[0]) * 1024
      
      PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")
      HUGEPAGE_SIZE = get_hugepage_size()
      
      def set_idle():
          f = open("/sys/kernel/mm/page_idle/bitmap", "wb", BUFSIZE)
          while True:
              try:
                  f.write(struct.pack("Q", pow(2, 64) - 1))
              except IOError as err:
                  if err.errno == errno.ENXIO:
                      break
                  raise
          f.close()
      
      def count_idle():
          f_flags = open("/proc/kpageflags", "rb", BUFSIZE)
          f_cgroup = open("/proc/kpagecgroup", "rb", BUFSIZE)
      
          with open("/sys/kernel/mm/page_idle/bitmap", "rb", BUFSIZE) as f:
              while f.read(BUFSIZE): pass  # update idle flag
      
          idlememsz = {}
          while True:
              s1, s2 = f_flags.read(8), f_cgroup.read(8)
              if not s1 or not s2:
                  break
      
              flags, = struct.unpack('Q', s1)
              cgino, = struct.unpack('Q', s2)
      
              unevictable = (flags >> 18) & 1
              huge = (flags >> 22) & 1
              idle = (flags >> 25) & 1
      
              if idle and not unevictable:
                  idlememsz[cgino] = idlememsz.get(cgino, 0) + \
                      (HUGEPAGE_SIZE if huge else PAGE_SIZE)
      
          f_flags.close()
          f_cgroup.close()
          return idlememsz
      
      if __name__ == "__main__":
          print "Setting the idle flag for each page..."
          set_idle()
      
          raw_input("Wait until the workload accesses its working set, "
                    "then press Enter")
      
          print "Counting idle pages..."
          idlememsz = count_idle()
      
          for dir, subdirs, files in os.walk(CGROUP_MOUNT):
              ino = os.stat(dir)[stat.ST_INO]
              print dir + ": " + str(idlememsz.get(ino, 0) / 1024) + " kB"
      ==== END SCRIPT ====
      
      This patch (of 8):
      
      Add page_cgroup_ino() helper to memcg.
      
      This function returns the inode number of the closest online ancestor of
      the memory cgroup a page is charged to.  It is required for exporting
      information about which page is charged to which cgroup to userspace,
      which will be introduced by a following patch.
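
      A hedged sketch of the helper (close in spirit to the description above,
      not necessarily the verbatim kernel code):

      	ino_t page_cgroup_ino(struct page *page)
      	{
      		struct mem_cgroup *memcg;
      		unsigned long ino = 0;

      		rcu_read_lock();
      		memcg = READ_ONCE(page->mem_cgroup);
      		/* walk up to the closest online ancestor */
      		while (memcg && !(memcg->css.flags & CSS_ONLINE))
      			memcg = parent_mem_cgroup(memcg);
      		if (memcg)
      			ino = cgroup_ino(memcg->css.cgroup);
      		rcu_read_unlock();
      		return ino;
      	}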
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2fc04524
  6. 09 Sep 2015, 6 commits
  7. 05 Sep 2015, 1 commit
    • mm: memcontrol: bring back the VM_BUG_ON() in mem_cgroup_swapout() · ce9ce665
      Committed by Sebastian Andrzej Siewior
      Clark stumbled over a VM_BUG_ON() in -RT which was then removed by
      Johannes in commit f371763a ("mm: memcontrol: fix false-positive
      VM_BUG_ON() on -rt").  The comment before that patch was a tiny bit better
      than it is now.  While the patch claimed to fix a false positive on -RT,
      this was not the case.  None of the -RT folks ACKed it, and it was not a
      false-positive report.  That was a *real* problem.
      
      This patch updates the improper comment, which refers to "disabled
      preemption" as a consequence of that lock being taken.  A spin_lock()
      disables preemption, true, but in this case the code relies on the fact
      that the lock _also_ disables interrupts once it is acquired.  And that is
      the important detail (which was checked by the VM_BUG_ON()) that needs to
      be pointed out.  It is the hint one needs while looking at the code.  It
      was explained by Johannes on the list that the per-CPU variables are
      protected by local_irq_save().  The BUG_ON() was helpful.  This code has
      been worked around in -RT in the meantime.  I wouldn't mind running into
      more of those if the code in question uses a *special* kind of locking,
      since now there is no verification (in terms of lockdep or BUG_ON()), and
      therefore I bring the VM_BUG_ON() check back in.
      
      The two functions after the comment could also have a "local_irq_save()"
      dance around them in order to serialize access to the per-CPU variables.
      This has been avoided because the interrupts should be off.
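
      A hedged sketch of the restored check in mem_cgroup_swapout() (the
      statistics updates themselves are elided):

      	/*
      	 * The caller holds a lock taken with interrupts off; the per-CPU
      	 * statistics updates below rely on exactly that.
      	 */
      	VM_BUG_ON(!irqs_disabled());
      	/* ... per-CPU charge statistics and event updates ... */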
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Clark Williams <williams@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ce9ce665
  8. 25 Jun 2015, 4 commits
    • memcg: convert mem_cgroup->under_oom from atomic_t to int · c2b42d3c
      Committed by Tejun Heo
      memcg->under_oom tracks whether the memcg is under OOM conditions and is
      an atomic_t counter managed with mem_cgroup_[un]mark_under_oom().  While
      atomic_t appears to be simple synchronization-wise, when used as a
      synchronization construct like here it is trickier and more error-prone
      due to weak memory ordering rules, especially around atomic_read(), and
      the false sense of security it provides.

      For example, both non-trivial read sites of memcg->under_oom are a bit
      problematic, although not actually broken.
      
      * mem_cgroup_oom_register_event()
      
        It isn't explicit what guarantees the memory ordering between event
        addition and memcg->under_oom check.  This isn't broken only because
        memcg_oom_lock is used for both event list and memcg->oom_lock.
      
      * memcg_oom_recover()
      
        The lockless test doesn't have any explanation why this would be
        safe.
      
      mem_cgroup_[un]mark_under_oom() are very cold paths and there's no point
      in avoiding locking memcg_oom_lock there.  This patch converts
      memcg->under_oom from atomic_t to int, puts their modifications under
      memcg_oom_lock and documents why the lockless test in
      memcg_oom_recover() is safe.
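
      A hedged sketch of the new locking on the mark path (the unmark path is
      symmetrical; not the exact diff):

      	static void mem_cgroup_mark_under_oom(struct mem_cgroup *memcg)
      	{
      		struct mem_cgroup *iter;

      		spin_lock(&memcg_oom_lock);
      		for_each_mem_cgroup_tree(iter, memcg)
      			iter->under_oom++;
      		spin_unlock(&memcg_oom_lock);
      	}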
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c2b42d3c
    • memcg: remove unused mem_cgroup->oom_wakeups · f4b90b70
      Committed by Tejun Heo
      Since commit 49426420 ("mm: memcg: handle non-error OOM situations
      more gracefully"), nobody uses mem_cgroup->oom_wakeups.  Remove it.
      
      While at it, also fold memcg_wakeup_oom() into memcg_oom_recover() which
      is its only user.  This cleanup was suggested by Michal.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f4b90b70
    • mm: oom_kill: simplify OOM killer locking · dc56401f
      Committed by Johannes Weiner
      The zonelist locking and the oom_sem are two overlapping locks that are
      used to serialize global OOM killing against different things.
      
      The historical zonelist locking serializes OOM kills from allocations with
      overlapping zonelists against each other to prevent killing more tasks
      than necessary in the same memory domain.  Only when neither tasklists nor
      zonelists from two concurrent OOM kills overlap (tasks in separate memcgs
      bound to separate nodes) are OOM kills allowed to execute in parallel.
      
      The younger oom_sem is a read-write lock to serialize OOM killing against
      the PM code trying to disable the OOM killer altogether.
      
      However, the OOM killer is a fairly cold error path; there is really no
      reason to optimize for highly performant and concurrent OOM kills.  And
      the oom_sem is just flat-out redundant.
      
      Replace both locking schemes with a single global mutex serializing OOM
      kills regardless of context.
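
      A hedged sketch of the consolidated scheme (illustrative call site, not
      the exact diff):

      	static DEFINE_MUTEX(oom_lock);

      	/* allocation path: at most one OOM kill proceeds at a time */
      	if (!mutex_trylock(&oom_lock))
      		return false;	/* another OOM kill is already in flight */
      	/* ... select and kill a victim ... */
      	mutex_unlock(&oom_lock);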
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc56401f
    • mm: oom_kill: clean up victim marking and exiting interfaces · 16e95196
      Committed by Johannes Weiner
      Rename unmark_oom_victim() to exit_oom_victim().  Marking and unmarking
      are related in functionality, but the interface is not symmetrical at
      all: one is an internal OOM killer function used during the killing, the
      other is for an OOM victim to signal its own death on exit later on.
      This has locking implications, see follow-up changes.
      
      While at it, rename mark_tsk_oom_victim() to mark_oom_victim(), which
      is easier on the eye.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      16e95196
  9. 11 Jun 2015, 2 commits
  10. 02 Jun 2015, 8 commits
    • writeback: implement memcg writeback domain based throttling · c2aa723a
      Committed by Tejun Heo
      While cgroup writeback support now connects memcg and blkcg so that
      writeback IOs are properly attributed and controlled, the IO back
      pressure propagation mechanism implemented in balance_dirty_pages()
      and its subroutines wasn't aware of cgroup writeback.
      
      Processes belonging to a memcg may have access to only a subset of the
      total memory available in the system.  Not factoring this into dirty
      throttling rendered it completely ineffective for processes under memcg
      limits, and memcg ended up building a separate, ad-hoc, degenerate
      mechanism directly into the vmscan code to limit page dirtying.
      
      The previous patches updated balance_dirty_pages() and its subroutines
      so that they can deal with multiple wb_domain's (writeback domains)
      and defined per-memcg wb_domain.  Processes belonging to a non-root
      memcg are bound to two wb_domains, global wb_domain and memcg
      wb_domain, and should be throttled according to IO pressures from both
      domains.  This patch updates dirty throttling code so that it repeats
      similar calculations for the two domains - the differences between the
      two are few and minor - and applies the lower of the two sets of
      resulting constraints.
      
      wb_over_bg_thresh(), which controls when background writeback
      terminates, is also updated to consider both global and memcg
      wb_domains.  It returns true if dirty is over bg_thresh for either
      domain.
      
      This makes the dirty throttling mechanism operational for memcg
      domains including writeback-bandwidth-proportional dirty page
      distribution inside them but the ad-hoc memcg throttling mechanism in
      vmscan is still in place.  The next patch will rip it out.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      c2aa723a
    • writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes · 2529bb3a
      Committed by Tejun Heo
      The amount of available memory to a memcg wb_domain can change as
      memcg configuration changes.  A domain's ->dirty_limit exists to
      smooth out sudden drops in the dirty threshold; however, when a domain's
      size actually drops significantly, it hinders the dirty throttling from
      adjusting to the new configuration, leading to unexpected behaviors
      including unnecessary OOM kills.
      
      This patch resolves the issue by adding wb_domain_size_changed() which
      resets ->dirty_limit[_tstmp] and making memcg call it on configuration
      changes.
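
      A hedged sketch of the helper described above (field names follow the
      changelog's ->dirty_limit[_tstmp] shorthand and may differ slightly):

      	static inline void wb_domain_size_changed(struct wb_domain *dom)
      	{
      		spin_lock_bh(&dom->lock);
      		dom->dirty_limit_tstamp = jiffies;
      		dom->dirty_limit = 0;
      		spin_unlock_bh(&dom->lock);
      	}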
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      2529bb3a
    • writeback: implement memcg wb_domain · 841710aa
      Committed by Tejun Heo
      Dirtyable memory is distributed to a wb (bdi_writeback) according to
      the relative bandwidth the wb is writing out in the whole system.
      This distribution is global - each wb is measured against all other
      wb's and gets the proportionately sized portion of the memory in the
      whole system.
      
      For cgroup writeback, the amount of dirtyable memory is scoped by
      memcg and thus each wb would need to be measured and controlled in its
      memcg.  IOW, a wb will belong to two writeback domains - the global
      and memcg domains.
      
      The previous patches laid the groundwork to support the two wb_domains
      and this patch implements memcg wb_domain.  memcg->cgwb_domain is
      initialized on css online and destroyed on css release,
      wb->memcg_completions is added, and __wb_writeout_inc() is updated to
      increment completions against both global and memcg wb_domains.
      
      The following patches will update balance_dirty_pages() and its
      subroutines to actually consider memcg wb_domain for throttling.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      841710aa
    • memcg: make mem_cgroup_read_{stat|event}() iterate possible cpus instead of online · 733a572e
      Committed by Tejun Heo
      cpu_possible_mask represents the CPUs which are actually possible
      during that boot instance.  For systems which don't support CPU
      hotplug, this will match cpu_online_mask exactly in most cases.  Even
      for systems which support CPU hotplug, the number of possible CPU
      slots is highly unlikely to diverge greatly from the number of online
      CPUs.  The only cases where the difference between possible and online
      caused problems were when the boot code failed to initialize the
      possible mask and left it fully set at NR_CPUS - 1.
      
      As such, most per-cpu constructs allocate for all possible CPUs and
      often iterate over the possibles, which also has the benefit of
      avoiding the blocking CPU hotplug synchronization.
      
      memcg open-codes per-cpu stat counting for mem_cgroup_read_stat() and
      mem_cgroup_read_events(), which iterate over online CPUs and handle
      CPU hotplug operations explicitly.  This complexity doesn't actually
      buy anything.  Switch to iterating over the possibles and drop the
      explicit CPU hotplug handling.
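
      A hedged sketch of the resulting read side (the per-cpu stat field names
      are illustrative):

      	static unsigned long mem_cgroup_read_stat(struct mem_cgroup *memcg,
      						  enum mem_cgroup_stat_index idx)
      	{
      		unsigned long val = 0;
      		int cpu;

      		/* iterate possibles; no get/put_online_cpus() dance needed */
      		for_each_possible_cpu(cpu)
      			val += per_cpu_ptr(memcg->stat, cpu)->count[idx];
      		return val;
      	}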
      
      Eventually, we want to convert memcg to use percpu_counter instead of
      its own custom implementation which also benefits from quick access
      w/o summing for cases where larger error margin is acceptable.
      
      This will allow mem_cgroup_read_stat() to be called from non-sleepable
      contexts which will be used by cgroup writeback.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      733a572e
    • writeback: make backing_dev_info host cgroup-specific bdi_writebacks · 52ebea74
      Committed by Tejun Heo
      For the planned cgroup writeback support, on each bdi
      (backing_dev_info), each memcg will be served by a separate wb
      (bdi_writeback).  This patch updates bdi so that a bdi can host
      multiple wbs (bdi_writebacks).
      
      On the default hierarchy, blkcg implicitly enables memcg.  This allows
      using memcg's page ownership for attributing writeback IOs, and every
      memcg - blkcg combination can be served by its own wb by assigning a
      dedicated wb to each memcg.  This means that there may be multiple
      wb's of a bdi mapped to the same blkcg.  As congested state is per
      blkcg - bdi combination, those wb's should share the same congested
      state.  This is achieved by tracking congested state via
      bdi_writeback_congested structs which are keyed by blkcg.
      
      bdi->wb remains unchanged and will keep serving the root cgroup.
      cgwb's (cgroup wb's) for non-root cgroups are created on-demand or
      looked up while dirtying an inode according to the memcg of the page
      being dirtied or current task.  Each cgwb is indexed on bdi->cgwb_tree
      by its memcg id.  Once an inode is associated with its wb, it can be
      retrieved using inode_to_wb().
      
      Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
      pages will keep being associated with bdi->wb.
      
      v3: inode_attach_wb() in account_page_dirtied() moved inside
          mapping_cap_account_dirty() block where it's known to be !NULL.
          Also, an unnecessary NULL check before kfree() removed.  Both
          detected by the kbuild bot.
      
      v2: Updated so that wb association is per inode and wb is per memcg
          rather than blkcg.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: kbuild test robot <fengguang.wu@intel.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      52ebea74
    • memcg: implement mem_cgroup_css_from_page() · ad7fa852
      Committed by Tejun Heo
      Implement mem_cgroup_css_from_page() which returns the
      cgroup_subsys_state of the memcg associated with a given page on the
      default hierarchy.  This will be used by cgroup writeback support.
      
      This function assumes that page->mem_cgroup association doesn't change
      until the page is released, which is true on the default hierarchy as
      long as replace_page_cache_page() is not used.  As the only user of
      replace_page_cache_page() is FUSE which won't support cgroup writeback
      for the time being, this works for now, and replace_page_cache_page()
      will soon be updated so that the invariant actually holds.
      
      Note that the RCU protected page->mem_cgroup access is consistent with
      other usages across memcg but ultimately incorrect.  These unlocked
      accesses are missing required barriers.  page->mem_cgroup should be
      made an RCU pointer and updated and accessed using RCU operations.
      
      v4: Instead of triggering WARN, return the root css on the traditional
          hierarchies.  This makes the function a lot easier to deal with
          especially as there's no light way to synchronize against
          hierarchy rebinding.
      
      v3: s/mem_cgroup_migrate()/mem_cgroup_css_from_page()/
      
      v2: Trigger WARN if the function is used on the traditional
          hierarchies and add comment about the assumed invariant.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      ad7fa852
    • memcg: add mem_cgroup_root_css · 56161634
      Committed by Tejun Heo
      Add global mem_cgroup_root_css which points to the root memcg css.
      This will be used by cgroup writeback support.  If memcg is disabled,
      it's defined as ERR_PTR(-EINVAL).
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      56161634
    • memcg: add per cgroup dirty page accounting · c4843a75
      Committed by Greg Thelen
      When modifying PG_Dirty on cached file pages, update the new
      MEM_CGROUP_STAT_DIRTY counter.  This is done in the same places where
      global NR_FILE_DIRTY is managed.  The new memcg stat is visible in the
      per memcg memory.stat cgroupfs file.  The most recent past attempt at
      this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632
      
      The new accounting supports future efforts to add per cgroup dirty
      page throttling and writeback.  It also helps an administrator break
      down a container's memory usage and provides evidence to understand
      memcg oom kills (the new dirty count is included in memcg oom kill
      messages).
      
      The ability to move page accounting between memcg
      (memory.move_charge_at_immigrate) makes this accounting more
      complicated than the global counter.  The existing
      mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
      accounting with stat updates.
      Typical update operation:
      	memcg = mem_cgroup_begin_page_stat(page)
      	if (TestSetPageDirty()) {
      		[...]
      		mem_cgroup_update_page_stat(memcg)
      	}
      	mem_cgroup_end_page_stat(memcg)
      
      Summary of mem_cgroup_end_page_stat() overhead:
      - Without CONFIG_MEMCG it's a no-op
      - With CONFIG_MEMCG and no inter memcg task movement, it's just
        rcu_read_lock()
      - With CONFIG_MEMCG and inter memcg  task movement, it's
        rcu_read_lock() + spin_lock_irqsave()
      
      A memcg parameter is added to several routines because their callers
      now grab mem_cgroup_begin_page_stat(), which returns the memcg later
      needed by mem_cgroup_update_page_stat().
      
      Because mem_cgroup_begin_page_stat() may disable interrupts, some
      adjustments are needed:
      - move __mark_inode_dirty() from __set_page_dirty() to its caller.
        __mark_inode_dirty() locking does not want interrupts disabled.
      - use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
        __delete_from_page_cache(), replace_page_cache_page(),
        invalidate_complete_page2(), and __remove_mapping().
      
         text    data     bss      dec    hex filename
      8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
      8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
                                  +192 text bytes
      8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
      8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
                                  +773 text bytes
      
      Performance tests run on v4.0-rc1-36-g4f671fe2.  Lower is better for
      all metrics, they're all wall clock or cycle counts.  The read and write
      fault benchmarks just measure fault time, they do not include I/O time.
      
      * CONFIG_MEMCG not set:
                                  baseline                              patched
        kbuild                 1m25.030000(+-0.088% 3 samples)       1m25.426667(+-0.120% 3 samples)
        dd write 100 MiB          0.859211561 +-15.10%                  0.874162885 +-15.03%
        dd write 200 MiB          1.670653105 +-17.87%                  1.669384764 +-11.99%
        dd write 1000 MiB         8.434691190 +-14.15%                  8.474733215 +-14.77%
        read fault cycles       254.0(+-0.000% 10 samples)            253.0(+-0.000% 10 samples)
        write fault cycles     2021.2(+-3.070% 10 samples)           1984.5(+-1.036% 10 samples)
      
      * CONFIG_MEMCG=y root_memcg:
                                  baseline                              patched
        kbuild                 1m25.716667(+-0.105% 3 samples)       1m25.686667(+-0.153% 3 samples)
        dd write 100 MiB          0.855650830 +-14.90%                  0.887557919 +-14.90%
        dd write 200 MiB          1.688322953 +-12.72%                  1.667682724 +-13.33%
        dd write 1000 MiB         8.418601605 +-14.30%                  8.673532299 +-15.00%
        read fault cycles       266.0(+-0.000% 10 samples)            266.0(+-0.000% 10 samples)
        write fault cycles     2051.7(+-1.349% 10 samples)           2049.6(+-1.686% 10 samples)
      
      * CONFIG_MEMCG=y non-root_memcg:
                                  baseline                              patched
        kbuild                 1m26.120000(+-0.273% 3 samples)       1m25.763333(+-0.127% 3 samples)
        dd write 100 MiB          0.861723964 +-15.25%                  0.818129350 +-14.82%
        dd write 200 MiB          1.669887569 +-13.30%                  1.698645885 +-13.27%
        dd write 1000 MiB         8.383191730 +-14.65%                  8.351742280 +-14.52%
        read fault cycles       265.7(+-0.172% 10 samples)            267.0(+-0.000% 10 samples)
        write fault cycles     2070.6(+-1.512% 10 samples)           2084.4(+-2.148% 10 samples)
      
      As expected anon page faults are not affected by this patch.
      
      tj: Updated to apply on top of the recent cancel_dirty_page() changes.
      Signed-off-by: Sha Zhengju <handai.szj@gmail.com>
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      c4843a75
  11. 16 Apr 2015, 3 commits
  12. 15 Apr 2015, 3 commits
  13. 13 Mar 2015, 1 commit
  14. 01 Mar 2015, 2 commits
    • mm: memcontrol: use "max" instead of "infinity" in control knobs · d2973697
      Committed by Johannes Weiner
      The memcg control knobs indicate the highest possible value using the
      symbolic name "infinity", which is long and awkward to type.
      
      Switch to the string "max", which is just as descriptive but shorter and
      sweeter.
      
      This changes a user interface, so do it before the release and before
      the development flag is dropped from the default hierarchy.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d2973697
    • memcg: fix low limit calculation · 4e54dede
      Committed by Michal Hocko
      A memcg is considered low limited even when the current usage is equal to
      the low limit.  This leads to interesting side effects, e.g.
      groups/hierarchies with no memory accounted are considered protected and
      so the reclaim will emit the MEMCG_LOW event when encountering them.
      
      Another and much bigger issue was reported by Joonsoo Kim.  He hit a
      NULL ptr dereference with the legacy cgroup API, which doesn't even have
      the low limit exposed.  The limit is 0 by default, but the initial check
      fails for a memcg with 0 consumption, and parent_mem_cgroup() would return
      NULL if use_hierarchy is 0, so page_counter_read would try to dereference
      NULL.
      
      I suppose that the current implementation is just an overlook because the
      documentation in Documentation/cgroups/unified-hierarchy.txt says:
      
        "The memory.low boundary on the other hand is a top-down allocated
        reserve.  A cgroup enjoys reclaim protection when it and all its
        ancestors are below their low boundaries"
      
      Fix the usage and the low limit comparison in mem_cgroup_low accordingly.
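
      A hedged sketch of the corrected comparison (illustrative of the check,
      not the full mem_cgroup_low()):

      	for (; memcg != root; memcg = parent_mem_cgroup(memcg)) {
      		/* protection requires usage strictly below memory.low */
      		if (page_counter_read(&memcg->memory) >= memcg->low)
      			return false;
      	}
      	return true;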
      
      Fixes: 241994ed (mm: memcontrol: default hierarchy interface for memory)
      Reported-by: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4e54dede