1. 13 Dec 2012 (3 commits)
  2. 12 Dec 2012 (1 commit)
  3. 11 Dec 2012 (1 commit)
    • mm: numa: Add THP migration for the NUMA working set scanning fault case. · b32967ff
      Authored by Mel Gorman
      Note: This is very heavily based on a patch from Peter Zijlstra with
      	fixes from Ingo Molnar, Hugh Dickins and Johannes Weiner.  That patch
      	put a lot of migration logic into mm/huge_memory.c where it does
      	not belong. This version tries to share some of the migration
      	logic with migrate_misplaced_page.  However, it should be noted
      	that now migrate.c is doing more with the pagetable manipulation
      	than is preferred. The end result is barely recognisable so as
      	before, the signed-offs had to be removed but will be re-added if
      	the original authors are ok with it.
      
      Add THP migration for the NUMA working set scanning fault case.
      
      It uses the page lock to serialize. No migration pte dance is
      necessary because the pte is already unmapped when we decide
      to migrate.
      
      [dhillf@gmail.com: Fix memory leak on isolation failure]
      [dhillf@gmail.com: Fix transfer of last_nid information]
       Signed-off-by: Mel Gorman <mgorman@suse.de>
      b32967ff
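       To make the serialization idea concrete, a minimal sketch (not the
       actual patch: the real code added to mm/migrate.c also handles
       statistics, installs the new PMD and covers all error paths; the
       helper name below is made up):

         /*
          * Illustrative sketch only: the NUMA-fault THP migration can
          * serialize on the page lock because the faulting PMD is already
          * clear, so no migration-PTE dance is needed.
          */
         static bool numa_migrate_thp_sketch(struct page *page, int target_nid)
         {
                 struct page *new_page;

                 if (!trylock_page(page))
                         return false;   /* someone else is migrating it */

                 new_page = alloc_pages_node(target_nid,
                                             GFP_TRANSHUGE | __GFP_THISNODE,
                                             HPAGE_PMD_ORDER);
                 if (!new_page) {
                         unlock_page(page);
                         return false;   /* isolation failure, fall back */
                 }

                 copy_highpage(new_page, page);  /* real code copies the whole THP */

                 /* The real patch installs a PMD pointing at new_page here,
                  * transfers last_nid and frees the old page. */
                 unlock_page(page);
                 return true;
         }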
  4. 20 Nov 2012 (1 commit)
  5. 17 Nov 2012 (2 commits)
    • memcg: fix hotplugged memory zone oops · bea8c150
      Authored by Hugh Dickins
      When MEMCG is configured on (even when it's disabled by boot option),
      when adding or removing a page to/from its lru list, the zone pointer
      used for stats updates is nowadays taken from the struct lruvec.  (On
      many configurations, calculating zone from page is slower.)
      
      But we have no code to update all the lruvecs (per zone, per memcg) when
      a memory node is hotadded.  Here's an extract from the oops which
      results when running numactl to bind a program to a newly onlined node:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000f60
        IP:  __mod_zone_page_state+0x9/0x60
        Pid: 1219, comm: numactl Not tainted 3.6.0-rc5+ #180 Bochs Bochs
        Process numactl (pid: 1219, threadinfo ffff880039abc000, task ffff8800383c4ce0)
        Call Trace:
          __pagevec_lru_add_fn+0xdf/0x140
          pagevec_lru_move_fn+0xb1/0x100
          __pagevec_lru_add+0x1c/0x30
          lru_add_drain_cpu+0xa3/0x130
          lru_add_drain+0x2f/0x40
         ...
      
      The natural solution might be to use a memcg callback whenever memory is
      hotadded; but that solution has not been scoped out, and it happens that
      we do have an easy location at which to update lruvec->zone.  The lruvec
      pointer is discovered either by mem_cgroup_zone_lruvec() or by
      mem_cgroup_page_lruvec(), and both of those do know the right zone.
      
      So check and set lruvec->zone in those; and remove the inadequate
      attempt to set lruvec->zone from lruvec_init(), which is called before
      NODE_DATA(node) has been allocated in such cases.
      
       Ah, there was one exception.  For no particularly good reason,
      mem_cgroup_force_empty_list() has its own code for deciding lruvec.
      Change it to use the standard mem_cgroup_zone_lruvec() and
      mem_cgroup_get_lru_size() too.  In fact it was already safe against such
      an oops (the lru lists in danger could only be empty), but we're better
      proofed against future changes this way.
      
      I've marked this for stable (3.6) since we introduced the problem in 3.5
      (now closed to stable); but I have no idea if this is the only fix
      needed to get memory hotadd working with memcg in 3.6, and received no
      answer when I enquired twice before.
       Reported-by: Tang Chen <tangchen@cn.fujitsu.com>
       Signed-off-by: Hugh Dickins <hughd@google.com>
       Acked-by: Johannes Weiner <hannes@cmpxchg.org>
       Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: <stable@vger.kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bea8c150
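       For illustration, a sketch of the check-and-set described above; the
       function name comes from the message, but the body is simplified and
       assumes the 3.7-era mem_cgroup_per_zone layout:

         struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
                                               struct mem_cgroup *memcg)
         {
                 struct mem_cgroup_per_zone *mz;
                 struct lruvec *lruvec;

                 if (mem_cgroup_disabled())
                         return &zone->lruvec;

                 mz = mem_cgroup_zoneinfo(memcg, zone_to_nid(zone), zone_idx(zone));
                 lruvec = &mz->lruvec;

                 /*
                  * A node can be onlined after this memcg was created, in
                  * which case lruvec_init() ran too early to know the zone;
                  * fix it up lazily here.
                  */
                 if (unlikely(lruvec->zone != zone))
                         lruvec->zone = zone;
                 return lruvec;
         }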
    • memcg: oom: fix totalpages calculation for memory.swappiness==0 · 9a5a8f19
      Authored by Michal Hocko
      oom_badness() takes a totalpages argument which says how many pages are
      available and it uses it as a base for the score calculation.  The value
      is calculated by mem_cgroup_get_limit which considers both limit and
      total_swap_pages (resp.  memsw portion of it).
      
      This is usually correct but since fe35004f ("mm: avoid swapping out
      with swappiness==0") we do not swap when swappiness is 0 which means
      that we cannot really use up all the totalpages pages.  This in turn
      confuses oom score calculation if the memcg limit is much smaller than
      the available swap because the used memory (capped by the limit) is
       negligible compared to totalpages, so the resulting score is too small
       if adj!=0 (typically a task with CAP_SYS_ADMIN or a non-zero oom_score_adj).
       A wrong process might be selected as a result.
      
      The problem can be worked around by checking mem_cgroup_swappiness==0
      and not considering swap at all in such a case.
       Signed-off-by: Michal Hocko <mhocko@suse.cz>
       Acked-by: David Rientjes <rientjes@google.com>
       Acked-by: Johannes Weiner <hannes@cmpxchg.org>
       Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
       Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: <stable@vger.kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9a5a8f19
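       A sketch of the workaround described above: when the group's
       swappiness is 0, base totalpages on the memory limit alone
       (hypothetical helper name, simplified from the 3.7-era
       mem_cgroup_get_limit, which returns bytes):

         static u64 memcg_oom_base_pages_sketch(struct mem_cgroup *memcg)
         {
                 u64 limit = res_counter_read_u64(&memcg->res, RES_LIMIT);

                 if (mem_cgroup_swappiness(memcg)) {
                         u64 memsw;

                         limit += total_swap_pages << PAGE_SHIFT;
                         memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
                         /* memsw already includes the memory limit */
                         limit = min(limit, memsw);
                 }
                 /* swappiness == 0: swap can never be used, so ignore it */

                 return limit >> PAGE_SHIFT;     /* oom_badness() wants pages */
         }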
  6. 06 Nov 2012 (4 commits)
    • cgroup: make ->pre_destroy() return void · bcf6de1b
      Authored by Tejun Heo
       All ->pre_destroy() implementations return 0 now, which is the only
      allowed return value.  Make it return void.
       Signed-off-by: Tejun Heo <tj@kernel.org>
       Reviewed-by: Michal Hocko <mhocko@suse.cz>
       Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
       Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      bcf6de1b
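       The change boils down to a signature change on the cgroup_subsys
       callback; an abridged sketch of the 3.7-era struct for illustration:

         struct cgroup_subsys {
                 struct cgroup_subsys_state *(*create)(struct cgroup *cgrp);
                 /* was: int (*pre_destroy)(struct cgroup *cgrp); can no longer fail */
                 void (*pre_destroy)(struct cgroup *cgrp);
                 void (*destroy)(struct cgroup *cgrp);
                 /* ... many other callbacks and fields omitted ... */
         };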
    • memcg: make mem_cgroup_reparent_charges non failing · ab5196c2
      Authored by Michal Hocko
       Now that pre_destroy callbacks are called from a context where no task
       can attach to the group and no child group can be added, there is no
       way left for mem_cgroup_pre_destroy to fail.
      mem_cgroup_pre_destroy doesn't have to take a reference to memcg's css
      because all css' are marked dead already.
      
      tj: Remove now unused local variable @cgrp from
          mem_cgroup_reparent_charges().
       Signed-off-by: Michal Hocko <mhocko@suse.cz>
       Reviewed-by: Glauber Costa <glommer@parallels.com>
       Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
       Signed-off-by: Tejun Heo <tj@kernel.org>
      ab5196c2
    • cgroup: remove CGRP_WAIT_ON_RMDIR, cgroup_exclude_rmdir() and cgroup_release_and_wakeup_rmdir() · b25ed609
      Authored by Tejun Heo
      CGRP_WAIT_ON_RMDIR is another kludge which was added to make cgroup
      destruction rollback somewhat working.  cgroup_rmdir() used to drain
      CSS references and CGRP_WAIT_ON_RMDIR and the associated waitqueue and
      helpers were used to allow the task performing rmdir to wait for the
      next relevant event.
      
      Unfortunately, the wait is visible to controllers too and the
      mechanism got exposed to memcg by 88703267 ("cgroup avoid permanent
      sleep at rmdir").
      
      Now that the draining and retries are gone, CGRP_WAIT_ON_RMDIR is
      unnecessary.  Remove it and all the mechanisms supporting it.  Note
      that memcontrol.c changes are essentially revert of 88703267
      ("cgroup avoid permanent sleep at rmdir").
       Signed-off-by: Tejun Heo <tj@kernel.org>
       Reviewed-by: Michal Hocko <mhocko@suse.cz>
       Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
       Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      b25ed609
    • cgroup: kill CSS_REMOVED · e9316080
      Authored by Tejun Heo
      CSS_REMOVED is one of the several contortions which were necessary to
      support css reference draining on cgroup removal.  All css->refcnts
      which need draining should be deactivated and verified to equal zero
      atomically w.r.t. css_tryget().  If any one isn't zero, all refcnts
      needed to be re-activated and css_tryget() shouldn't fail in the
      process.
      
      This was achieved by letting css_tryget() busy-loop until either the
      refcnt is reactivated (failed removal attempt) or CSS_REMOVED is set
      (committing to removal).
      
       Now that css refcnt draining is no longer used, there's no need for the
       atomic rollback mechanism.  css_tryget() can simply look at the
       reference count and fail if it's deactivated - it's never getting
       re-activated.
      
      This patch removes CSS_REMOVED and updates __css_tryget() to fail if
      the refcnt is deactivated.  As deactivation and removal are a single
      step now, they no longer need to be protected against css_tryget()
      happening from irq context.  Remove local_irq_disable/enable() from
      cgroup_rmdir().
      
      Note that this removes css_is_removed() whose only user is VM_BUG_ON()
      in memcontrol.c.  We can replace it with a check on the refcnt but
      given that the only use case is a debug assert, I think it's better to
      simply unexport it.
      
      v2: Comment updated and explanation on local_irq_disable/enable()
          added per Michal Hocko.
       Signed-off-by: Tejun Heo <tj@kernel.org>
       Reviewed-by: Michal Hocko <mhocko@suse.cz>
       Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
       Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      e9316080
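       A hedged sketch of the simpler tryget this describes: deactivation
       biases the refcount negative, so the tryget just fails instead of
       busy-waiting for a rollback (helper name and exact bias handling are
       simplified):

         static bool css_tryget_sketch(struct cgroup_subsys_state *css)
         {
                 while (true) {
                         int v = atomic_read(&css->refcnt);

                         if (v < 0)
                                 return false;   /* deactivated, removal is committed */
                         if (atomic_cmpxchg(&css->refcnt, v, v + 1) == v)
                                 return true;
                         cpu_relax();            /* lost a race, retry */
                 }
         }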
  7. 30 Oct 2012 (3 commits)
    • memcg: Simplify mem_cgroup_force_empty_list error handling · 2ef37d3f
      Authored by Michal Hocko
       mem_cgroup_force_empty_list currently tries to remove all pages from
       the given LRU. To guard against temporary failures (EBUSY returned by
       mem_cgroup_move_parent) it keeps a margin relative to the current LRU
       pages and returns true if there are still some pages left on the list.
      
      If we consider that mem_cgroup_move_parent fails only when it is racing
      with somebody else removing (uncharging) the page or when the page is
       migrated, then it is obvious that all those failures are only temporary
      and so we can safely retry later.
      Let's get rid of the safety margin and make the loop really wait for
      the empty LRU. The caller should still make sure that all charges have
      been removed from the res_counter because mem_cgroup_replace_page_cache
      might add a page to the LRU after the list_empty check (it doesn't touch
      res_counter though).
      This catches most of the cases except for shmem which might call
      mem_cgroup_replace_page_cache with a page which is not charged and on
      the LRU yet but this was the case also without this patch. In order to
      fix this we need a guarantee that try_get_mem_cgroup_from_page falls
      back to the current mm's cgroup so it needs css_tryget to fail. This
       will be fixed up in a later patch because it needs help from the cgroup
      core (pre_destroy has to be called after css is cleared).
      
      Although mem_cgroup_pre_destroy can still fail (if a new task or a new
      sub-group appears) there is no reason to retry pre_destroy callback from
      the cgroup core. This means that __DEPRECATED_clear_css_refs has lost
      its meaning and it can be removed.
      
      Changes since v2
      - remove __DEPRECATED_clear_css_refs
      
      Changes since v1
      - use kerndoc
      - be more specific about mem_cgroup_move_parent possible failures
       Signed-off-by: Michal Hocko <mhocko@suse.cz>
       Reviewed-by: Tejun Heo <tj@kernel.org>
       Reviewed-by: Glauber Costa <glommer@parallels.com>
       Signed-off-by: Tejun Heo <tj@kernel.org>
      2ef37d3f
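       For illustration, a simplified sketch of the resulting loop shape: keep
       reparenting pages and simply retry transient EBUSY until the list is
       really empty (locking, page_cgroup lookup and the busy-page rotation of
       the real code are omitted; move_page_to_parent is a stand-in name):

         static void force_empty_list_sketch(struct mem_cgroup *memcg,
                                             struct list_head *list)
         {
                 while (!list_empty(list)) {
                         struct page *page;

                         page = list_entry(list->prev, struct page, lru);

                         /* -EBUSY only means racing with uncharge or migration */
                         if (move_page_to_parent(page, memcg) == -EBUSY)
                                 cond_resched(); /* temporary failure, retry */
                 }
         }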
    • memcg: root_cgroup cannot reach mem_cgroup_move_parent · d8423011
      Authored by Michal Hocko
      The root cgroup cannot be destroyed so we never hit it down the
      mem_cgroup_pre_destroy path and mem_cgroup_force_empty_write shouldn't
      even try to do anything if called for the root.
      
      This means that mem_cgroup_move_parent doesn't have to bother with the
      root cgroup and it can assume it can always move charges upwards.
       Signed-off-by: Michal Hocko <mhocko@suse.cz>
       Reviewed-by: Tejun Heo <tj@kernel.org>
       Reviewed-by: Glauber Costa <glommer@parallels.com>
       Signed-off-by: Tejun Heo <tj@kernel.org>
      d8423011
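       The user-visible part is just a guard in the force_empty handler; a
       sketch, assuming the 3.7-era handler shape and a hypothetical wrapper
       name:

         static int mem_cgroup_force_empty_write_sketch(struct mem_cgroup *memcg)
         {
                 if (mem_cgroup_is_root(memcg))
                         return -EINVAL; /* the root has no parent to move charges to */

                 return mem_cgroup_force_empty(memcg);
         }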
    • memcg: split mem_cgroup_force_empty into reclaiming and reparenting parts · c26251f9
      Authored by Michal Hocko
       mem_cgroup_force_empty did two separate things depending on the
       free_all parameter from the very beginning. It either reclaimed as many
       pages as possible and moved the rest to the parent, or just moved
       charges to the parent. The first variant is used as the
       memory.force_empty callback while the latter is used from
       mem_cgroup_pre_destroy.
      
       The whole game around gotos is far from nice and there is no
       reason to keep those two functions inside one. Let's split them and
       also move the responsibility for css reference counting to their
       callers to make the code easier.
      
      This patch doesn't have any functional changes.
       Signed-off-by: Michal Hocko <mhocko@suse.cz>
       Reviewed-by: Tejun Heo <tj@kernel.org>
       Reviewed-by: Glauber Costa <glommer@parallels.com>
       Signed-off-by: Tejun Heo <tj@kernel.org>
      c26251f9
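       A sketch of the split described above: a non-failing reparenting helper
       for the destroy path and a separate reclaim-then-reparent function
       behind memory.force_empty (signatures and retry budget simplified;
       reparent_all_lru_pages is a stand-in for the per-node/zone/LRU loop):

         /* Destroy path: move every remaining charge to the parent; cannot fail. */
         static void mem_cgroup_reparent_charges_sketch(struct mem_cgroup *memcg)
         {
                 do {
                         reparent_all_lru_pages(memcg);  /* stand-in helper */
                         cond_resched();
                 } while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0);
         }

         /* memory.force_empty: reclaim as much as possible, reparent the rest. */
         static int mem_cgroup_force_empty_sketch(struct mem_cgroup *memcg)
         {
                 int retries = 5;        /* arbitrary budget for this sketch */

                 while (retries-- && res_counter_read_u64(&memcg->res, RES_USAGE) > 0) {
                         if (signal_pending(current))
                                 return -EINTR;  /* triggered from userspace */
                         try_to_free_mem_cgroup_pages(memcg, GFP_KERNEL, false);
                 }
                 mem_cgroup_reparent_charges_sketch(memcg);
                 return 0;
         }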
  8. 09 Oct 2012 (2 commits)
  9. 15 Sep 2012 (1 commit)
    • cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them · 8c7f6edb
      Authored by Tejun Heo
      Currently, cgroup hierarchy support is a mess.  cpu related subsystems
      behave correctly - configuration, accounting and control on a parent
      properly cover its children.  blkio and freezer completely ignore
      hierarchy and treat all cgroups as if they're directly under the root
      cgroup.  Others show yet different behaviors.
      
       These differing interpretations of cgroup hierarchy make using cgroup
       confusing and make it impossible to co-mount controllers into the same
       hierarchy and obtain sane behavior.
      
      Eventually, we want full hierarchy support from all subsystems and
      probably a unified hierarchy.  Users using separate hierarchies
      expecting completely different behaviors depending on the mounted
       subsystem is detrimental to making any progress on this front.
      
      This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
      for controllers which are lacking in hierarchy support.  The goal of
      this patch is two-fold.
      
      * Move users away from using hierarchy on currently non-hierarchical
        subsystems, so that implementing proper hierarchy support on those
        doesn't surprise them.
      
      * Keep track of which controllers are broken how and nudge the
        subsystems to implement proper hierarchy support.
      
      For now, start with a single warning message.  We can whine louder
      later on.
      
      v2: Fixed a typo spotted by Michal. Warning message updated.
      
      v3: Updated memcg part so that it doesn't generate warning in the
          cases where .use_hierarchy=false doesn't make the behavior
          different from root.use_hierarchy=true.  Fixed a typo spotted by
          Glauber.
      
      v4: Check ->broken_hierarchy after cgroup creation is complete so that
          ->create() can affect the result per Michal.  Dropped unnecessary
          memcg root handling per Michal.
       Signed-off-by: Tejun Heo <tj@kernel.org>
       Acked-by: Michal Hocko <mhocko@suse.cz>
       Acked-by: Li Zefan <lizefan@huawei.com>
       Acked-by: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      8c7f6edb
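       For illustration, a sketch of the one-shot nesting warning described
       above (the warning text is paraphrased and warned_broken_hierarchy is
       assumed as the one-shot latch field):

         static void warn_broken_hierarchy_sketch(struct cgroup_subsys *ss,
                                                  struct cgroup *parent)
         {
                 /* Nesting starts once the new cgroup's parent has a parent. */
                 if (!ss->broken_hierarchy || ss->warned_broken_hierarchy ||
                     !parent->parent)
                         return;

                 pr_warning("cgroup: \"%s\" has incomplete hierarchy support; "
                            "nested cgroups may change behavior in the future\n",
                            ss->name);
                 ss->warned_broken_hierarchy = true;
         }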
  10. 01 Aug 2012 (22 commits)