1. 09 9月, 2017 3 次提交
    • J
      mm/memcontrol: support MEMORY_DEVICE_PRIVATE · c733a828
      Jérôme Glisse 提交于
      HMM pages (private or public device pages) are ZONE_DEVICE page and thus
      need special handling when it comes to lru or refcount.  This patch make
      sure that memcontrol properly handle those when it face them.  Those pages
      are use like regular pages in a process address space either as anonymous
      page or as file back page.  So from memcg point of view we want to handle
      them like regular page for now at least.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-11-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
      Acked-by: NBalbir Singh <bsingharora@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c733a828
    • J
      mm/memcontrol: allow to uncharge page without using page->lru field · a9d5adee
      Jérôme Glisse 提交于
      HMM pages (private or public device pages) are ZONE_DEVICE page and
      thus you can not use page->lru fields of those pages. This patch
      re-arrange the uncharge to allow single page to be uncharge without
      modifying the lru field of the struct page.
      
      There is no change to memcontrol logic, it is the same as it was
      before this patch.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-10-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a9d5adee
    • Z
      mm: thp: check pmd migration entry in common path · 84c3fc4e
      Zi Yan 提交于
      When THP migration is being used, memory management code needs to handle
      pmd migration entries properly.  This patch uses !pmd_present() or
      is_swap_pmd() (depending on whether pmd_none() needs separate code or
      not) to check pmd migration entries at the places where a pmd entry is
      present.
      
      Since pmd-related code uses split_huge_page(), split_huge_pmd(),
      pmd_trans_huge(), pmd_trans_unstable(), or
      pmd_none_or_trans_huge_or_clear_bad(), this patch:
      
      1. adds pmd migration entry split code in split_huge_pmd(),
      
      2. takes care of pmd migration entries whenever pmd_trans_huge() is present,
      
      3. makes pmd_none_or_trans_huge_or_clear_bad() pmd migration entry aware.
      
      Since split_huge_page() uses split_huge_pmd() and pmd_trans_unstable()
      is equivalent to pmd_none_or_trans_huge_or_clear_bad(), we do not change
      them.
      
      Until this commit, a pmd entry should be:
      1. pointing to a pte page,
      2. is_swap_pmd(),
      3. pmd_trans_huge(),
      4. pmd_devmap(), or
      5. pmd_none().
      Signed-off-by: NZi Yan <zi.yan@cs.rutgers.edu>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      84c3fc4e
  2. 07 9月, 2017 6 次提交
  3. 19 8月, 2017 1 次提交
    • J
      mm: memcontrol: fix NULL pointer crash in test_clear_page_writeback() · 739f79fc
      Johannes Weiner 提交于
      Jaegeuk and Brad report a NULL pointer crash when writeback ending tries
      to update the memcg stats:
      
          BUG: unable to handle kernel NULL pointer dereference at 00000000000003b0
          IP: test_clear_page_writeback+0x12e/0x2c0
          [...]
          RIP: 0010:test_clear_page_writeback+0x12e/0x2c0
          Call Trace:
           <IRQ>
           end_page_writeback+0x47/0x70
           f2fs_write_end_io+0x76/0x180 [f2fs]
           bio_endio+0x9f/0x120
           blk_update_request+0xa8/0x2f0
           scsi_end_request+0x39/0x1d0
           scsi_io_completion+0x211/0x690
           scsi_finish_command+0xd9/0x120
           scsi_softirq_done+0x127/0x150
           __blk_mq_complete_request_remote+0x13/0x20
           flush_smp_call_function_queue+0x56/0x110
           generic_smp_call_function_single_interrupt+0x13/0x30
           smp_call_function_single_interrupt+0x27/0x40
           call_function_single_interrupt+0x89/0x90
          RIP: 0010:native_safe_halt+0x6/0x10
      
          (gdb) l *(test_clear_page_writeback+0x12e)
          0xffffffff811bae3e is in test_clear_page_writeback (./include/linux/memcontrol.h:619).
          614		mod_node_page_state(page_pgdat(page), idx, val);
          615		if (mem_cgroup_disabled() || !page->mem_cgroup)
          616			return;
          617		mod_memcg_state(page->mem_cgroup, idx, val);
          618		pn = page->mem_cgroup->nodeinfo[page_to_nid(page)];
          619		this_cpu_add(pn->lruvec_stat->count[idx], val);
          620	}
          621
          622	unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
          623							gfp_t gfp_mask,
      
      The issue is that writeback doesn't hold a page reference and the page
      might get freed after PG_writeback is cleared (and the mapping is
      unlocked) in test_clear_page_writeback().  The stat functions looking up
      the page's node or zone are safe, as those attributes are static across
      allocation and free cycles.  But page->mem_cgroup is not, and it will
      get cleared if we race with truncation or migration.
      
      It appears this race window has been around for a while, but less likely
      to trigger when the memcg stats were updated first thing after
      PG_writeback is cleared.  Recent changes reshuffled this code to update
      the global node stats before the memcg ones, though, stretching the race
      window out to an extent where people can reproduce the problem.
      
      Update test_clear_page_writeback() to look up and pin page->mem_cgroup
      before clearing PG_writeback, then not use that pointer afterward.  It
      is a partial revert of 62cccb8c ("mm: simplify lock_page_memcg()")
      but leaves the pageref-holding callsites that aren't affected alone.
      
      Link: http://lkml.kernel.org/r/20170809183825.GA26387@cmpxchg.org
      Fixes: 62cccb8c ("mm: simplify lock_page_memcg()")
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: NJaegeuk Kim <jaegeuk@kernel.org>
      Tested-by: NJaegeuk Kim <jaegeuk@kernel.org>
      Reported-by: NBradley Bolen <bradleybolen@gmail.com>
      Tested-by: NBrad Bolen <bradleybolen@gmail.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org>	[4.6+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      739f79fc
  4. 21 7月, 2017 1 次提交
    • T
      cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS · bc2fb7ed
      Tejun Heo 提交于
      css_task_iter currently always walks all tasks.  With the scheduled
      cgroup v2 thread support, the iterator would need to handle multiple
      types of iteration.  As a preparation, add @flags to
      css_task_iter_start() and implement CSS_TASK_ITER_PROCS.  If the flag
      is not specified, it walks all tasks as before.  When asserted, the
      iterator only walks the group leaders.
      
      For now, the only user of the flag is cgroup v2 "cgroup.procs" file
      which no longer needs to skip non-leader tasks in cgroup_procs_next().
      Note that cgroup v1 "cgroup.procs" can't use the group leader walk as
      v1 "cgroup.procs" doesn't mean "list all thread group leaders in the
      cgroup" but "list all thread group id's with any threads in the
      cgroup".
      
      While at it, update cgroup_procs_show() to use task_pid_vnr() instead
      of task_tgid_vnr().  As the iteration guarantees that the function
      only sees group leaders, this doesn't change the output and will allow
      sharing the function for thread iteration.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      bc2fb7ed
  5. 11 7月, 2017 2 次提交
    • M
      mm, memcg: fix potential undefined behavior in mem_cgroup_event_ratelimit() · 6a1a8b80
      Michal Hocko 提交于
      Alice has reported the following UBSAN splat:
      
        UBSAN: Undefined behaviour in mm/memcontrol.c:661:17
        signed integer overflow:
        -2147483644 - 2147483525 cannot be represented in type 'long int'
        CPU: 1 PID: 11758 Comm: mybibtex2filena Tainted: P           O 4.9.25-gentoo #4
        Hardware name: XXXXXX, BIOS YYYYYY
        Call Trace:
          dump_stack+0x59/0x87
          ubsan_epilogue+0xe/0x40
          handle_overflow+0xbb/0xf0
          __ubsan_handle_sub_overflow+0x12/0x20
          memcg_check_events.isra.36+0x223/0x360
          mem_cgroup_commit_charge+0x55/0x140
          wp_page_copy+0x34e/0xb80
          do_wp_page+0x1e6/0x1300
          handle_mm_fault+0x88b/0x1990
          __do_page_fault+0x2de/0x8a0
          do_page_fault+0x1a/0x20
          error_code+0x67/0x6c
      
      The reason is that we subtract two signed types.  Let's fix this by
      truly mimicing time_after and cast the result of the subtraction.
      
      Link: http://lkml.kernel.org/r/20170616150057.GQ30580@dhcp22.suse.czSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NAlice Ferrazzi <alicef@gentoo.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6a1a8b80
    • S
      mm/memcontrol: exclude @root from checks in mem_cgroup_low · 34c81057
      Sean Christopherson 提交于
      Make @root exclusive in mem_cgroup_low; it is never considered low when
      looked at directly and is not checked when traversing the tree.  In
      effect, @root is handled identically to how root_mem_cgroup was
      previously handled by mem_cgroup_low.
      
      If @root is not excluded from the checks, a cgroup underneath @root will
      never be considered low during targeted reclaim of @root, e.g.  due to
      memory.current > memory.high, unless @root is misconfigured to have
      memory.low > memory.high.
      
      Excluding @root enables using memory.low to prioritize memory usage
      between cgroups within a subtree of the hierarchy that is limited by
      memory.high or memory.max, e.g.  when ROOT owns @root's controls but
      delegates the @root directory to a USER so that USER can create and
      administer children of @root.
      
      For example, given cgroup A with children B and C:
      
          A
         / \
        B   C
      
      and
      
        1. A/memory.current > A/memory.high
        2. A/B/memory.current < A/B/memory.low
        3. A/C/memory.current >= A/C/memory.low
      
      As 'A' is high, i.e.  triggers reclaim from 'A', and 'B' is low, we
      should reclaim from 'C' until 'A' is no longer high or until we can no
      longer reclaim from 'C'.  If 'A', i.e.  @root, isn't excluded by
      mem_cgroup_low when reclaming from 'A', then 'B' won't be considered low
      and we will reclaim indiscriminately from both 'B' and 'C'.
      
      Here is the test I used to confirm the bug and the patch.
      
      20:00:55@sjchrist-vm ? ~ $ cat ~/.bin/memcg_low_test
      #!/bin/bash
      
      x62mb=$((62<<20))
      x66mb=$((66<<20))
      x94mb=$((94<<20))
      x98mb=$((98<<20))
      
      setup() {
          set -e
      
          if [[ -n $DEBUG ]]; then
              set -x
          fi
      
          trap teardown EXIT HUP INT TERM
      
          if [[ ! -e /mnt/1gb.swap ]]; then
              sudo fallocate -l 1G /mnt/1gb.swap > /dev/null
              sudo mkswap /mnt/1gb.swap > /dev/null
          fi
          if ! swapon --show=NAME | grep -q "/mnt/1gb.swap"; then
              sudo swapon /mnt/1gb.swap
          fi
      
          if [[ ! -e /cgroup/cgroup.controllers ]]; then
              sudo mount -t cgroup2 none /cgroup
          fi
      
          grep -q memory /cgroup/cgroup.controllers
      
          sudo sh -c "echo '+memory' > /cgroup/cgroup.subtree_control"
      
          sudo mkdir /cgroup/A && sudo chown $USER:$USER /cgroup/A
          sudo sh -c "echo '+memory' > /cgroup/A/cgroup.subtree_control"
          sudo sh -c "echo '96m' > /cgroup/A/memory.high"
      
          mkdir /cgroup/A/0
          mkdir /cgroup/A/1
      
          echo 64m > /cgroup/A/0/memory.low
      }
      
      teardown() {
          set +e
      
          trap - EXIT HUP INT TERM
      
          if [[ -z $1 ]]; then
              printf "\n"
              printf "%0.s*" {1..35}
              printf "\nFAILED!\n\n"
              tail /cgroup/A/**/memory.current
              printf "%0.s*" {1..35}
              printf "\n\n"
          fi
      
          ps | grep stress | tr -s ' ' | cut -f 2 -d ' ' | xargs -I % kill %
      
          sleep 2
      
          if [[ -e /cgroup/A/0 ]]; then
              rmdir /cgroup/A/0
          fi
          if [[ -e /cgroup/A/1 ]]; then
              rmdir /cgroup/A/1
          fi
          if [[ -e /cgroup/A ]]; then
              sudo rmdir /cgroup/A
          fi
      }
      
      stress_test() {
          sudo sh -c "echo $$ > /cgroup/A/$1/cgroup.procs"
          stress --vm 1 --vm-bytes 64M --vm-keep > /dev/null &
      
          sudo sh -c "echo $$ > /cgroup/A/$2/cgroup.procs"
          stress --vm 1 --vm-bytes 64M --vm-keep > /dev/null &
      
          sudo sh -c "echo $$ > /cgroup/cgroup.procs"
      
          sleep 1
      
          # A/0 should be consuming more memory than A/1
          [[ $(cat /cgroup/A/0/memory.current) -ge $(cat /cgroup/A/1/memory.current) ]]
      
          # A/0 should be consuming ~64mb
          [[ $(cat /cgroup/A/0/memory.current) -ge $x62mb ]] && [[ $(cat /cgroup/A/0/memory.current) -le $x66mb ]]
      
          # A should cumulatively be consuming ~96mb
          [[ $(cat /cgroup/A/memory.current) -ge $x94mb ]] && [[ $(cat /cgroup/A/memory.current) -le $x98mb ]]
      
          # Stop the stressors
          ps | grep stress | tr -s ' ' | cut -f 2 -d ' ' | xargs -I % kill %
      }
      
      teardown 1
      setup
      
      for ((i=1;i<=$1;i++)); do
          printf "ITERATION $i of $1 - stress_test 0 1"
          stress_test 0 1
          printf "\x1b[2K\r"
      
          printf "ITERATION $i of $1 - stress_test 1 0"
          stress_test 1 0
          printf "\x1b[2K\r"
      
          printf "ITERATION $i of $1 - PASSED\n"
      done
      
      teardown 1
      
      echo PASSED!
      
      20:11:26@sjchrist-vm ? ~ $ memcg_low_test 10
      
      Link: http://lkml.kernel.org/r/1496434412-21005-1-git-send-email-sean.j.christopherson@intel.comSigned-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Acked-by: NBalbir Singh <bsingharora@gmail.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      34c81057
  6. 07 7月, 2017 5 次提交
    • J
      mm: memcontrol: per-lruvec stats infrastructure · 00f3ca2c
      Johannes Weiner 提交于
      lruvecs are at the intersection of the NUMA node and memcg, which is the
      scope for most paging activity.
      
      Introduce a convenient accounting infrastructure that maintains
      statistics per node, per memcg, and the lruvec itself.
      
      Then convert over accounting sites for statistics that are already
      tracked in both nodes and memcgs and can be easily switched.
      
      [hannes@cmpxchg.org: fix crash in the new cgroup stat keeping code]
        Link: http://lkml.kernel.org/r/20170531171450.GA10481@cmpxchg.org
      [hannes@cmpxchg.org: don't track uncharged pages at all
        Link: http://lkml.kernel.org/r/20170605175254.GA8547@cmpxchg.org
      [hannes@cmpxchg.org: add missing free_percpu()]
        Link: http://lkml.kernel.org/r/20170605175354.GB8547@cmpxchg.org
      [linux@roeck-us.net: hexagon: fix build error caused by include file order]
        Link: http://lkml.kernel.org/r/20170617153721.GA4382@roeck-us.net
      Link: http://lkml.kernel.org/r/20170530181724.27197-6-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NGuenter Roeck <linux@roeck-us.net>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      00f3ca2c
    • J
      mm: memcontrol: use the node-native slab memory counters · 32049296
      Johannes Weiner 提交于
      Now that the slab counters are moved from the zone to the node level we
      can drop the private memcg node stats and use the official ones.
      
      Link: http://lkml.kernel.org/r/20170530181724.27197-4-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      32049296
    • K
      mm/oom_kill: count global and memory cgroup oom kills · 8e675f7a
      Konstantin Khlebnikov 提交于
      Show count of oom killer invocations in /proc/vmstat and count of
      processes killed in memory cgroup in knob "memory.events" (in
      memory.oom_control for v1 cgroup).
      
      Also describe difference between "oom" and "oom_kill" in memory cgroup
      documentation.  Currently oom in memory cgroup kills tasks iff shortage
      has happened inside page fault.
      
      These counters helps in monitoring oom kills - for now the only way is
      grepping for magic words in kernel log.
      
      [akpm@linux-foundation.org: fix for mem_cgroup_count_vm_event() rename]
      [akpm@linux-foundation.org: fix comment, per Konstantin]
      Link: http://lkml.kernel.org/r/149570810989.203600.9492483715840752937.stgit@buzzSigned-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Roman Guschin <guroan@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8e675f7a
    • R
      mm: per-cgroup memory reclaim stats · 2262185c
      Roman Gushchin 提交于
      Track the following reclaim counters for every memory cgroup: PGREFILL,
      PGSCAN, PGSTEAL, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE and PGLAZYFREED.
      
      These values are exposed using the memory.stats interface of cgroup v2.
      
      The meaning of each value is the same as for global counters, available
      using /proc/vmstat.
      
      Also, for consistency, rename mem_cgroup_count_vm_event() to
      count_memcg_event_mm().
      
      Link: http://lkml.kernel.org/r/1494530183-30808-1-git-send-email-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com>
      Suggested-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2262185c
    • H
      mm, THP, swap: delay splitting THP during swap out · 38d8b4e6
      Huang Ying 提交于
      Patch series "THP swap: Delay splitting THP during swapping out", v11.
      
      This patchset is to optimize the performance of Transparent Huge Page
      (THP) swap.
      
      Recently, the performance of the storage devices improved so fast that
      we cannot saturate the disk bandwidth with single logical CPU when do
      page swap out even on a high-end server machine.  Because the
      performance of the storage device improved faster than that of single
      logical CPU.  And it seems that the trend will not change in the near
      future.  On the other hand, the THP becomes more and more popular
      because of increased memory size.  So it becomes necessary to optimize
      THP swap performance.
      
      The advantages of the THP swap support include:
      
       - Batch the swap operations for the THP to reduce lock
         acquiring/releasing, including allocating/freeing the swap space,
         adding/deleting to/from the swap cache, and writing/reading the swap
         space, etc. This will help improve the performance of the THP swap.
      
       - The THP swap space read/write will be 2M sequential IO. It is
         particularly helpful for the swap read, which are usually 4k random
         IO. This will improve the performance of the THP swap too.
      
       - It will help the memory fragmentation, especially when the THP is
         heavily used by the applications. The 2M continuous pages will be
         free up after THP swapping out.
      
       - It will improve the THP utilization on the system with the swap
         turned on. Because the speed for khugepaged to collapse the normal
         pages into the THP is quite slow. After the THP is split during the
         swapping out, it will take quite long time for the normal pages to
         collapse back into the THP after being swapped in. The high THP
         utilization helps the efficiency of the page based memory management
         too.
      
      There are some concerns regarding THP swap in, mainly because possible
      enlarged read/write IO size (for swap in/out) may put more overhead on
      the storage device.  To deal with that, the THP swap in should be turned
      on only when necessary.  For example, it can be selected via
      "always/never/madvise" logic, to be turned on globally, turned off
      globally, or turned on only for VMA with MADV_HUGEPAGE, etc.
      
      This patchset is the first step for the THP swap support.  The plan is
      to delay splitting THP step by step, finally avoid splitting THP during
      the THP swapping out and swap out/in the THP as a whole.
      
      As the first step, in this patchset, the splitting huge page is delayed
      from almost the first step of swapping out to after allocating the swap
      space for the THP and adding the THP into the swap cache.  This will
      reduce lock acquiring/releasing for the locks used for the swap cache
      management.
      
      With the patchset, the swap out throughput improves 15.5% (from about
      3.73GB/s to about 4.31GB/s) in the vm-scalability swap-w-seq test case
      with 8 processes.  The test is done on a Xeon E5 v3 system.  The swap
      device used is a RAM simulated PMEM (persistent memory) device.  To test
      the sequential swapping out, the test case creates 8 processes, which
      sequentially allocate and write to the anonymous pages until the RAM and
      part of the swap device is used up.
      
      This patch (of 5):
      
      In this patch, splitting huge page is delayed from almost the first step
      of swapping out to after allocating the swap space for the THP
      (Transparent Huge Page) and adding the THP into the swap cache.  This
      will batch the corresponding operation, thus improve THP swap out
      throughput.
      
      This is the first step for the THP swap optimization.  The plan is to
      delay splitting the THP step by step and avoid splitting the THP
      finally.
      
      In this patch, one swap cluster is used to hold the contents of each THP
      swapped out.  So, the size of the swap cluster is changed to that of the
      THP (Transparent Huge Page) on x86_64 architecture (512).  For other
      architectures which want such THP swap optimization,
      ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file for
      the architecture.  In effect, this will enlarge swap cluster size by 2
      times on x86_64.  Which may make it harder to find a free cluster when
      the swap space becomes fragmented.  So that, this may reduce the
      continuous swap space allocation and sequential write in theory.  The
      performance test in 0day shows no regressions caused by this.
      
      In the future of THP swap optimization, some information of the swapped
      out THP (such as compound map count) will be recorded in the
      swap_cluster_info data structure.
      
      The mem cgroup swap accounting functions are enhanced to support charge
      or uncharge a swap cluster backing a THP as a whole.
      
      The swap cluster allocate/free functions are added to allocate/free a
      swap cluster for a THP.  A fair simple algorithm is used for swap
      cluster allocation, that is, only the first swap device in priority list
      will be tried to allocate the swap cluster.  The function will fail if
      the trying is not successful, and the caller will fallback to allocate a
      single swap slot instead.  This works good enough for normal cases.  If
      the difference of the number of the free swap clusters among multiple
      swap devices is significant, it is possible that some THPs are split
      earlier than necessary.  For example, this could be caused by big size
      difference among multiple swap devices.
      
      The swap cache functions is enhanced to support add/delete THP to/from
      the swap cache as a set of (HPAGE_PMD_NR) sub-pages.  This may be
      enhanced in the future with multi-order radix tree.  But because we will
      split the THP soon during swapping out, that optimization doesn't make
      much sense for this first step.
      
      The THP splitting functions are enhanced to support to split THP in swap
      cache during swapping out.  The page lock will be held during allocating
      the swap cluster, adding the THP into the swap cache and splitting the
      THP.  So in the code path other than swapping out, if the THP need to be
      split, the PageSwapCache(THP) will be always false.
      
      The swap cluster is only available for SSD, so the THP swap optimization
      in this patchset has no effect for HDD.
      
      [ying.huang@intel.com: fix two issues in THP optimize patch]
        Link: http://lkml.kernel.org/r/87k25ed8zo.fsf@yhuang-dev.intel.com
      [hannes@cmpxchg.org: extensive cleanups and simplifications, reduce code size]
      Link: http://lkml.kernel.org/r/20170515112522.32457-2-ying.huang@intel.comSigned-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Suggested-by: Andrew Morton <akpm@linux-foundation.org> [for config option]
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> [for changes in huge_memory.c and huge_mm.h]
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      38d8b4e6
  7. 20 6月, 2017 2 次提交
    • I
      sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming · 2055da97
      Ingo Molnar 提交于
      So I've noticed a number of instances where it was not obvious from the
      code whether ->task_list was for a wait-queue head or a wait-queue entry.
      
      Furthermore, there's a number of wait-queue users where the lists are
      not for 'tasks' but other entities (poll tables, etc.), in which case
      the 'task_list' name is actively confusing.
      
      To clear this all up, name the wait-queue head and entry list structure
      fields unambiguously:
      
      	struct wait_queue_head::task_list	=> ::head
      	struct wait_queue_entry::task_list	=> ::entry
      
      For example, this code:
      
      	rqw->wait.task_list.next != &wait->task_list
      
      ... is was pretty unclear (to me) what it's doing, while now it's written this way:
      
      	rqw->wait.head.next != &wait->entry
      
      ... which makes it pretty clear that we are iterating a list until we see the head.
      
      Other examples are:
      
      	list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
      	list_for_each_entry(wq, &fence->wait.task_list, task_list) {
      
      ... where it's unclear (to me) what we are iterating, and during review it's
      hard to tell whether it's trying to walk a wait-queue entry (which would be
      a bug), while now it's written as:
      
      	list_for_each_entry_safe(pos, next, &x->head, entry) {
      	list_for_each_entry(wq, &fence->wait.head, entry) {
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      2055da97
    • I
      sched/wait: Rename wait_queue_t => wait_queue_entry_t · ac6424b9
      Ingo Molnar 提交于
      Rename:
      
      	wait_queue_t		=>	wait_queue_entry_t
      
      'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
      but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
      which had to carry the name.
      
      Start sorting this out by renaming it to 'wait_queue_entry_t'.
      
      This also allows the real structure name 'struct __wait_queue' to
      lose its double underscore and become 'struct wait_queue_entry',
      which is the more canonical nomenclature for such data types.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      ac6424b9
  8. 13 5月, 2017 1 次提交
  9. 04 5月, 2017 6 次提交
  10. 10 3月, 2017 2 次提交
    • T
      mm: do not call mem_cgroup_free() from within mem_cgroup_alloc() · 40e952f9
      Tahsin Erdogan 提交于
      mem_cgroup_free() indirectly calls wb_domain_exit() which is not
      prepared to deal with a struct wb_domain object that hasn't executed
      wb_domain_init().  For instance, the following warning message is
      printed by lockdep if alloc_percpu() fails in mem_cgroup_alloc():
      
        INFO: trying to register non-static key.
        the code is fine but needs lockdep annotation.
        turning off the locking correctness validator.
        CPU: 1 PID: 1950 Comm: mkdir Not tainted 4.10.0+ #151
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
        Call Trace:
         dump_stack+0x67/0x99
         register_lock_class+0x36d/0x540
         __lock_acquire+0x7f/0x1a30
         lock_acquire+0xcc/0x200
         del_timer_sync+0x3c/0xc0
         wb_domain_exit+0x14/0x20
         mem_cgroup_free+0x14/0x40
         mem_cgroup_css_alloc+0x3f9/0x620
         cgroup_apply_control_enable+0x190/0x390
         cgroup_mkdir+0x290/0x3d0
         kernfs_iop_mkdir+0x58/0x80
         vfs_mkdir+0x10e/0x1a0
         SyS_mkdirat+0xa8/0xd0
         SyS_mkdir+0x14/0x20
         entry_SYSCALL_64_fastpath+0x18/0xad
      
      Add __mem_cgroup_free() which skips wb_domain_exit().  This is used by
      both mem_cgroup_free() and mem_cgroup_alloc() clean up.
      
      Fixes: 0b8f73e1 ("mm: memcontrol: clean up alloc, online, offline, free functions")
      Link: http://lkml.kernel.org/r/20170306192122.24262-1-tahsin@google.comSigned-off-by: NTahsin Erdogan <tahsin@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      40e952f9
    • L
      mm/cgroup: avoid panic when init with low memory · bfc7228b
      Laurent Dufour 提交于
      The system may panic when initialisation is done when almost all the
      memory is assigned to the huge pages using the kernel command line
      parameter hugepage=xxxx.  Panic may occur like this:
      
        Unable to handle kernel paging request for data at address 0x00000000
        Faulting instruction address: 0xc000000000302b88
        Oops: Kernel access of bad area, sig: 11 [#1]
        SMP NR_CPUS=2048 [    0.082424] NUMA
        pSeries
        Modules linked in:
        CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.9.0-15-generic #16-Ubuntu
        task: c00000021ed01600 task.stack: c00000010d108000
        NIP: c000000000302b88 LR: c000000000270e04 CTR: c00000000016cfd0
        REGS: c00000010d10b2c0 TRAP: 0300   Not tainted (4.9.0-15-generic)
        MSR: 8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>[ 0.082770]   CR: 28424422  XER: 00000000
        CFAR: c0000000003d28b8 DAR: 0000000000000000 DSISR: 40000000 SOFTE: 1
        GPR00: c000000000270e04 c00000010d10b540 c00000000141a300 c00000010fff6300
        GPR04: 0000000000000000 00000000026012c0 c00000010d10b630 0000000487ab0000
        GPR08: 000000010ee90000 c000000001454fd8 0000000000000000 0000000000000000
        GPR12: 0000000000004400 c00000000fb80000 00000000026012c0 00000000026012c0
        GPR16: 00000000026012c0 0000000000000000 0000000000000000 0000000000000002
        GPR20: 000000000000000c 0000000000000000 0000000000000000 00000000024200c0
        GPR24: c0000000016eef48 0000000000000000 c00000010fff7d00 00000000026012c0
        GPR28: 0000000000000000 c00000010fff7d00 c00000010fff6300 c00000010d10b6d0
        NIP mem_cgroup_soft_limit_reclaim+0xf8/0x4f0
        LR do_try_to_free_pages+0x1b4/0x450
        Call Trace:
          do_try_to_free_pages+0x1b4/0x450
          try_to_free_pages+0xf8/0x270
          __alloc_pages_nodemask+0x7a8/0xff0
          new_slab+0x104/0x8e0
          ___slab_alloc+0x620/0x700
          __slab_alloc+0x34/0x60
          kmem_cache_alloc_node_trace+0xdc/0x310
          mem_cgroup_init+0x158/0x1c8
          do_one_initcall+0x68/0x1d0
          kernel_init_freeable+0x278/0x360
          kernel_init+0x24/0x170
          ret_from_kernel_thread+0x5c/0x74
        Instruction dump:
        eb81ffe0 eba1ffe8 ebc1fff0 ebe1fff8 4e800020 3d230001 e9499a42 3d220004
        3929acd8 794a1f24 7d295214 eac90100 <e9360000> 2fa90000 419eff74 3b200000
        ---[ end trace 342f5208b00d01b6 ]---
      
      This is a chicken and egg issue where the kernel try to get free memory
      when allocating per node data in mem_cgroup_init(), but in that path
      mem_cgroup_soft_limit_reclaim() is called which assumes that these data
      are allocated.
      
      As mem_cgroup_soft_limit_reclaim() is best effort, it should return when
      these data are not yet allocated.
      
      This patch also fixes potential null pointer access in
      mem_cgroup_remove_from_trees() and mem_cgroup_update_tree().
      
      Link: http://lkml.kernel.org/r/1487856999-16581-2-git-send-email-ldufour@linux.vnet.ibm.comSigned-off-by: NLaurent Dufour <ldufour@linux.vnet.ibm.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NBalbir Singh <bsingharora@gmail.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bfc7228b
  11. 02 3月, 2017 1 次提交
    • I
      sched/headers: Prepare for new header dependencies before moving code to <linux/sched/mm.h> · 6e84f315
      Ingo Molnar 提交于
      We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
      will have to be picked up from other headers and a couple of .c files.
      
      Create a trivial placeholder <linux/sched/mm.h> file that just
      maps to <linux/sched.h> to make this patch obviously correct and
      bisectable.
      
      The APIs that are going to be moved first are:
      
         mm_alloc()
         __mmdrop()
         mmdrop()
         mmdrop_async_fn()
         mmdrop_async()
         mmget_not_zero()
         mmput()
         mmput_async()
         get_task_mm()
         mm_access()
         mm_release()
      
      Include the new header in the files that are going to need it.
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      6e84f315
  12. 25 2月, 2017 1 次提交
  13. 23 2月, 2017 2 次提交
    • T
      slab: use memcg_kmem_cache_wq for slab destruction operations · 17cc4dfe
      Tejun Heo 提交于
      If there's contention on slab_mutex, queueing the per-cache destruction
      work item on the system_wq can unnecessarily create and tie up a lot of
      kworkers.
      
      Rename memcg_kmem_cache_create_wq to memcg_kmem_cache_wq and make it
      global and use that workqueue for the destruction work items too.  While
      at it, convert the workqueue from an unbound workqueue to a per-cpu one
      with concurrency limited to 1.  It's generally preferable to use per-cpu
      workqueues and concurrency limit of 1 is safe enough.
      
      This is suggested by Joonsoo Kim.
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-11-tj@kernel.orgSigned-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NJay Vana <jsvana@fb.com>
      Acked-by: NVladimir Davydov <vdavydov@tarantool.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      17cc4dfe
    • T
      slab: link memcg kmem_caches on their associated memory cgroup · bc2791f8
      Tejun Heo 提交于
      With kmem cgroup support enabled, kmem_caches can be created and
      destroyed frequently and a great number of near empty kmem_caches can
      accumulate if there are a lot of transient cgroups and the system is not
      under memory pressure.  When memory reclaim starts under such
      conditions, it can lead to consecutive deactivation and destruction of
      many kmem_caches, easily hundreds of thousands on moderately large
      systems, exposing scalability issues in the current slab management
      code.  This is one of the patches to address the issue.
      
      While a memcg kmem_cache is listed on its root cache's ->children list,
      there is no direct way to iterate all kmem_caches which are assocaited
      with a memory cgroup.  The only way to iterate them is walking all
      caches while filtering out caches which don't match, which would be most
      of them.
      
      This makes memcg destruction operations O(N^2) where N is the total
      number of slab caches which can be huge.  This combined with the
      synchronous RCU operations can tie up a CPU and affect the whole machine
      for many hours when memory reclaim triggers offlining and destruction of
      the stale memcgs.
      
      This patch adds mem_cgroup->kmem_caches list which goes through
      memcg_cache_params->kmem_caches_node of all kmem_caches which are
      associated with the memcg.  All memcg specific iterations, including
      stat file access, are updated to use the new list instead.
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-6-tj@kernel.orgSigned-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NJay Vana <jsvana@fb.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bc2791f8
  14. 25 1月, 2017 1 次提交
  15. 11 1月, 2017 1 次提交
    • M
      mm, memcg: fix the active list aging for lowmem requests when memcg is enabled · b4536f0c
      Michal Hocko 提交于
      Nils Holland and Klaus Ethgen have reported unexpected OOM killer
      invocations with 32b kernel starting with 4.8 kernels
      
      	kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
      	kworker/u4:5 cpuset=/ mems_allowed=0
      	CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
      	[...]
      	Mem-Info:
      	active_anon:58685 inactive_anon:90 isolated_anon:0
      	 active_file:274324 inactive_file:281962 isolated_file:0
      	 unevictable:0 dirty:649 writeback:0 unstable:0
      	 slab_reclaimable:40662 slab_unreclaimable:17754
      	 mapped:7382 shmem:202 pagetables:351 bounce:0
      	 free:206736 free_pcp:332 free_cma:0
      	Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
      	DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB writepending:96kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
      	lowmem_reserve[]: 0 813 3474 3474
      	Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
      	lowmem_reserve[]: 0 0 21292 21292
      	HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB
      
      the oom killer is clearly pre-mature because there there is still a lot
      of page cache in the zone Normal which should satisfy this lowmem
      request.  Further debugging has shown that the reclaim cannot make any
      forward progress because the page cache is hidden in the active list
      which doesn't get rotated because inactive_list_is_low is not memcg
      aware.
      
      The code simply subtracts per-zone highmem counters from the respective
      memcg's lru sizes which doesn't make any sense.  We can simply end up
      always seeing the resulting active and inactive counts 0 and return
      false.  This issue is not limited to 32b kernels but in practice the
      effect on systems without CONFIG_HIGHMEM would be much harder to notice
      because we do not invoke the OOM killer for allocations requests
      targeting < ZONE_NORMAL.
      
      Fix the issue by tracking per zone lru page counts in mem_cgroup_per_node
      and subtract per-memcg highmem counts when memcg is enabled.  Introduce
      helper lruvec_zone_lru_size which redirects to either zone counters or
      mem_cgroup_get_zone_lru_size when appropriate.
      
      We are losing empty LRU but non-zero lru size detection introduced by
      ca707239 ("mm: update_lru_size warn and reset bad lru_size") because
      of the inherent zone vs. node discrepancy.
      
      Fixes: f8d1a311 ("mm: consider whether to decivate based on eligible zones inactive ratio")
      Link: http://lkml.kernel.org/r/20170104100825.3729-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NNils Holland <nholland@tisys.org>
      Tested-by: NNils Holland <nholland@tisys.org>
      Reported-by: NKlaus Ethgen <Klaus@Ethgen.de>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>	[4.8+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b4536f0c
  16. 25 12月, 2016 1 次提交
  17. 13 12月, 2016 1 次提交
  18. 10 11月, 2016 1 次提交
  19. 28 10月, 2016 1 次提交
    • J
      mm: memcontrol: do not recurse in direct reclaim · 89a28483
      Johannes Weiner 提交于
      On 4.0, we saw a stack corruption from a page fault entering direct
      memory cgroup reclaim, calling into btrfs_releasepage(), which then
      tried to allocate an extent and recursed back into a kmem charge ad
      nauseam:
      
        [...]
        btrfs_releasepage+0x2c/0x30
        try_to_release_page+0x32/0x50
        shrink_page_list+0x6da/0x7a0
        shrink_inactive_list+0x1e5/0x510
        shrink_lruvec+0x605/0x7f0
        shrink_zone+0xee/0x320
        do_try_to_free_pages+0x174/0x440
        try_to_free_mem_cgroup_pages+0xa7/0x130
        try_charge+0x17b/0x830
        memcg_charge_kmem+0x40/0x80
        new_slab+0x2d9/0x5a0
        __slab_alloc+0x2fd/0x44f
        kmem_cache_alloc+0x193/0x1e0
        alloc_extent_state+0x21/0xc0
        __clear_extent_bit+0x2b5/0x400
        try_release_extent_mapping+0x1a3/0x220
        __btrfs_releasepage+0x31/0x70
        btrfs_releasepage+0x2c/0x30
        try_to_release_page+0x32/0x50
        shrink_page_list+0x6da/0x7a0
        shrink_inactive_list+0x1e5/0x510
        shrink_lruvec+0x605/0x7f0
        shrink_zone+0xee/0x320
        do_try_to_free_pages+0x174/0x440
        try_to_free_mem_cgroup_pages+0xa7/0x130
        try_charge+0x17b/0x830
        mem_cgroup_try_charge+0x65/0x1c0
        handle_mm_fault+0x117f/0x1510
        __do_page_fault+0x177/0x420
        do_page_fault+0xc/0x10
        page_fault+0x22/0x30
      
      On later kernels, kmem charging is opt-in rather than opt-out, and that
      particular kmem allocation in btrfs_releasepage() is no longer being
      charged and won't recurse and overrun the stack anymore.
      
      But it's not impossible for an accounted allocation to happen from the
      memcg direct reclaim context, and we needed to reproduce this crash many
      times before we even got a useful stack trace out of it.
      
      Like other direct reclaimers, mark tasks in memcg reclaim PF_MEMALLOC to
      avoid recursing into any other form of direct reclaim.  Then let
      recursive charges from PF_MEMALLOC contexts bypass the cgroup limit.
      
      Link: http://lkml.kernel.org/r/20161025141050.GA13019@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      89a28483
  20. 08 10月, 2016 1 次提交