1. 13 February 2015, 6 commits
    • mm: numa: avoid unnecessary TLB flushes when setting NUMA hinting entries · 10c1045f
      Authored by Mel Gorman
      If a PTE or PMD is already marked NUMA when scanning to mark entries for
      NUMA hinting then it is not necessary to update the entry and incur a TLB
      flush penalty.  Avoid the overhead where possible.
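      A minimal sketch of the idea, assuming a change_pte_range()-style prot_numa
      path; this is illustrative rather than the literal diff:

          if (prot_numa) {
                  /* Already PROT_NONE for NUMA hinting: nothing to change,
                   * so skip the update and the TLB flush it would cost. */
                  if (pte_protnone(oldpte))
                          continue;
          }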
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: numa: add paranoid check around pte_protnone_numa · c0e7cad9
      Authored by Mel Gorman
      pte_protnone_numa is only safe to use after VMA checks for PROT_NONE are
      complete.  Treating a real PROT_NONE PTE as a NUMA hinting fault is going
      to result in strangeness so add a check for it.  BUG_ON looks like
      overkill but if this is hit then it's a serious bug that could result in
      corruption so do not even try recovering.  It would have been more
      comprehensive to check VMA flags in pte_protnone_numa but it would have
      made the API ugly just for a debugging check.
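      A hedged sketch of the kind of check described (the exact placement in the
      fault paths may differ):

          /* A real PROT_NONE vma must never reach the NUMA hinting fault path */
          BUG_ON(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)));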
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: numa: do not trap faults on the huge zero page · e944fd67
      Authored by Mel Gorman
      Faults on the huge zero page are pointless and there is a BUG_ON to catch
      them during fault time.  This patch reintroduces a check that avoids
      marking the zero page PAGE_NONE.
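      A sketch of the reintroduced check, assuming the prot_numa path of
      change_huge_pmd(); illustrative rather than the literal hunk:

          /* Never mark the huge zero page for NUMA hinting faults */
          if (prot_numa && is_huge_zero_pmd(*pmd))
                  goto unlock;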
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: convert p[te|md]_mknonnuma and remaining page table manipulations · 4d942466
      Authored by Mel Gorman
      With PROT_NONE, the traditional page table manipulation functions are
      sufficient.
      
      [andre.przywara@arm.com: fix compiler warning in pmdp_invalidate()]
      [akpm@linux-foundation.org: fix build with STRICT_MM_TYPECHECKS]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Tested-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: convert p[te|md]_numa users to p[te|md]_protnone_numa · 8a0516ed
      Authored by Mel Gorman
      Convert existing users of pte_numa and friends to the new helper.  Note
      that the kernel is broken after this patch is applied until the other page
      table modifiers are also altered.  This patch layout is to make review
      easier.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Tested-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: numa: do not dereference pmd outside of the lock during NUMA hinting fault · 5d833062
      Authored by Mel Gorman
      Automatic NUMA balancing depends on being able to protect PTEs to trap a
      fault and gather reference locality information.  Very broadly speaking
      it would mark PTEs as not present and use another bit to distinguish
      between NUMA hinting faults and other types of faults.  It was
      universally loved by everybody and caused no problems whatsoever.  That
      last sentence might be a lie.
      
      This series is very heavily based on patches from Linus and Aneesh to
      replace the existing PTE/PMD NUMA helper functions with normal change
      protections.  I did alter and add parts of it but I consider them
      relatively minor contributions.  At their suggestion, acked-bys are in
      there but I've no problem converting them to Signed-off-by if requested.
      
      AFAIK, this has received no testing on ppc64 and I'm depending on Aneesh
      for that.  I tested trinity under kvm-tool and passed and ran a few
      other basic tests.  At the time of writing, only the short-lived tests
      have completed but testing of V2 indicated that long-term testing had no
      surprises.  In most cases I'm leaving out detail as it's not that
      interesting.
      
      specjbb single JVM: There was negligible performance difference in the
      	benchmark itself for short runs. However, system activity is
      	higher and interrupts are much higher over time -- possibly TLB
      	flushes. Migrations are also higher. Overall, this is more overhead
      	but considering the problems faced with the old approach I think
      	we just have to suck it up and find another way of reducing the
      	overhead.
      
      specjbb multi JVM: Negligible performance difference to the actual benchmark
      	but like the single JVM case, the system overhead is noticeably
      	higher.  Again, interrupts are a major factor.
      
      autonumabench: This was all over the place and about all that can be
      	reasonably concluded is that it's different but not necessarily
      	better or worse.
      
      autonumabench
                                           3.18.0-rc5            3.18.0-rc5
                                       mmotm-20141119         protnone-v3r3
      User    NUMA01               32380.24 (  0.00%)    21642.92 ( 33.16%)
      User    NUMA01_THEADLOCAL    22481.02 (  0.00%)    22283.22 (  0.88%)
      User    NUMA02                3137.00 (  0.00%)     3116.54 (  0.65%)
      User    NUMA02_SMT            1614.03 (  0.00%)     1543.53 (  4.37%)
      System  NUMA01                 322.97 (  0.00%)     1465.89 (-353.88%)
      System  NUMA01_THEADLOCAL       91.87 (  0.00%)       49.32 ( 46.32%)
      System  NUMA02                  37.83 (  0.00%)       14.61 ( 61.38%)
      System  NUMA02_SMT               7.36 (  0.00%)        7.45 ( -1.22%)
      Elapsed NUMA01                 716.63 (  0.00%)      599.29 ( 16.37%)
      Elapsed NUMA01_THEADLOCAL      553.98 (  0.00%)      539.94 (  2.53%)
      Elapsed NUMA02                  83.85 (  0.00%)       83.04 (  0.97%)
      Elapsed NUMA02_SMT              86.57 (  0.00%)       79.15 (  8.57%)
      CPU     NUMA01                4563.00 (  0.00%)     3855.00 ( 15.52%)
      CPU     NUMA01_THEADLOCAL     4074.00 (  0.00%)     4136.00 ( -1.52%)
      CPU     NUMA02                3785.00 (  0.00%)     3770.00 (  0.40%)
      CPU     NUMA02_SMT            1872.00 (  0.00%)     1959.00 ( -4.65%)
      
      System CPU usage of NUMA01 is worse but it's an adverse workload on this
      machine so I'm reluctant to conclude that it's a problem that matters.  On
      the other workloads that are sensible on this machine, system CPU usage is
      great.  Overall time to complete the benchmark is comparable
      
                3.18.0-rc5  3.18.0-rc5
              mmotm-20141119  protnone-v3r3
      User        59612.50    48586.44
      System        460.22     1537.45
      Elapsed      1442.20     1304.29
      
      NUMA alloc hit                 5075182     5743353
      NUMA alloc miss                      0           0
      NUMA interleave hit                  0           0
      NUMA alloc local               5075174     5743339
      NUMA base PTE updates        637061448   443106883
      NUMA huge PMD updates          1243434      864747
      NUMA page range updates     1273699656   885857347
      NUMA hint faults               1658116     1214277
      NUMA hint local faults          959487      754113
      NUMA hint local percent             57          62
      NUMA pages migrated            5467056    61676398
      
      The NUMA pages migrated look terrible but when I looked at a graph of the
      activity over time I see that the massive spike in migration activity was
      during NUMA01.  This correlates with high system CPU usage and could be
      simply down to bad luck but any modifications that affect that workload
      would be related to scan rates and migrations, not the protection
      mechanism.  For all other workloads, migration activity was comparable.
      
      Overall, headline performance figures are comparable but the overhead is
      higher, mostly in interrupts.  To some extent, higher overhead from this
      approach was anticipated but not to this degree.  It's going to be
      necessary to reduce this again with a separate series in the future.  It's
      still worth going ahead with this series though as it's likely to avoid
      constant headaches with Xen and is probably easier to maintain.
      
      This patch (of 10):
      
      A transhuge NUMA hinting fault may find the page is migrating and should
      wait until migration completes.  The check is race-prone because the pmd
      is dereferenced outside of the page lock and, while the race window is tiny,
      it will be larger if the PMD is cleared while marking PMDs for hinting
      faults.  This patch closes the race.
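      A sketch of the locking pattern used to close such races (names follow
      mm/huge_memory.c, but treat this as illustrative rather than the diff):

          ptl = pmd_lock(mm, pmdp);
          if (unlikely(!pmd_same(pmd, *pmdp))) {
                  /* the pmd changed under us (e.g. was cleared): bail out */
                  spin_unlock(ptl);
                  goto out;
          }
          /* only now is it safe to test for migration and decide to wait */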
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 12 February 2015, 34 commits
    • mm/nommu.c: fix arithmetic overflow in __vm_enough_memory() · 8138a67a
      Authored by Roman Gushchin
      I noticed that "allowed" can easily overflow by falling below 0, because
      (total_vm / 32) can be larger than "allowed".  The problem occurs in
      OVERCOMMIT_NONE mode.

      In this case, a huge allocation can succeed and overcommit the system
      (despite OVERCOMMIT_NONE mode).  All subsequent allocations will fail
      (system-wide), so the system becomes unusable.
      
      The problem was masked out by commit c9b1d098
      ("mm: limit growth of 3% hardcoded other user reserve"),
      but it's easy to reproduce it on older kernels:
      1) set overcommit_memory sysctl to 2
      2) mmap() large file multiple times (with VM_SHARED flag)
      3) try to malloc() large amount of memory
      
      It can also be reproduced on newer kernels, but a misconfigured
      sysctl_user_reserve_kbytes is required.
      
      Fix this issue by switching to signed arithmetic here.
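      A small userspace demonstration of the failure mode (purely illustrative,
      not the kernel diff): with unsigned types the subtraction wraps around to a
      huge value, while with signed types it goes negative and a later
      "pages <= allowed" style check fails as intended.

          #include <stdio.h>

          int main(void)
          {
                  unsigned long allowed = 1000;   /* pages the commit limit permits */
                  unsigned long reserve = 4000;   /* e.g. total_vm / 32, larger than allowed */

                  /* unsigned arithmetic wraps instead of going negative */
                  printf("unsigned: %lu\n", allowed - reserve);

                  /* signed arithmetic goes negative, as the check expects */
                  printf("signed:   %ld\n", (long)allowed - (long)reserve);
                  return 0;
          }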
      Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Andrew Shewmaker <agshew@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mmap.c: fix arithmetic overflow in __vm_enough_memory() · 5703b087
      Authored by Roman Gushchin
      I noticed that "allowed" can easily overflow by falling below 0,
      because (total_vm / 32) can be larger than "allowed".  The problem
      occurs in OVERCOMMIT_NONE mode.

      In this case, a huge allocation can succeed and overcommit the system
      (despite OVERCOMMIT_NONE mode).  All subsequent allocations will fail
      (system-wide), so the system becomes unusable.
      
      The problem was masked out by commit c9b1d098
      ("mm: limit growth of 3% hardcoded other user reserve"),
      but it's easy to reproduce it on older kernels:
      1) set overcommit_memory sysctl to 2
      2) mmap() large file multiple times (with VM_SHARED flag)
      3) try to malloc() large amount of memory
      
      It can also be reproduced on newer kernels, but a misconfigured
      sysctl_user_reserve_kbytes is required.
      
      Fix this issue by switching to signed arithmetic here.
      
      [akpm@linux-foundation.org: use min_t]
      Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Andrew Shewmaker <agshew@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vmstat: Reduce time interval to stat update on idle cpu · 57c2e36b
      Authored by Christoph Lameter
      It was noted that the vm stat shepherd runs every 2 seconds and that the
      vmstat update is then scheduled 2 seconds in the future.
      
      This yields an effective interval of double the intended one, which is not desired.
      
      Change the shepherd so that it does not delay the vmstat update on the
      other cpu.  We still have to use schedule_delayed_work since we are using a
      delayed_work_struct but we can set the delay to 0.
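      A hedged sketch of what the shepherd ends up doing (helper names follow
      mm/vmstat.c of that era, but this is not the literal patch):

          /* kick the per-cpu update immediately rather than a full
           * interval into the future */
          if (need_update(cpu))
                  schedule_delayed_work_on(cpu, &per_cpu(vmstat_work, cpu), 0);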
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_owner.c: remove unnecessary stack_trace field · 94f759d6
      Authored by Sergei Rogachev
      Page owner uses the page_ext structure to keep meta-information for every
      page in the system.  The structure also contains a field of type 'struct
      stack_trace', which page owner uses when calling save_stack_trace().  It is
      easy to notice that keeping a copy of this structure for every page in the
      system is very inefficient in terms of memory.
      
      The patch removes this unnecessary field of page_ext and forces page owner
      to use a stack_trace structure allocated on the stack.
      
      [akpm@linux-foundation.org: use struct initializers]
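      A sketch of the on-stack approach with a designated initializer (the entry
      count and surrounding code are illustrative; the field names come from
      struct stack_trace in <linux/stacktrace.h>):

          unsigned long entries[8];
          struct stack_trace trace = {
                  .nr_entries     = 0,
                  .max_entries    = ARRAY_SIZE(entries),
                  .entries        = entries,
                  .skip           = 3,
          };

          save_stack_trace(&trace);
          /* copy trace.entries[0 .. trace.nr_entries) into page_ext instead
           * of embedding a whole struct stack_trace in every page_ext */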
      Signed-off-by: Sergei Rogachev <rogachevsergei@gmail.com>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: incorporate read-only pages into transparent huge pages · 10359213
      Authored by Ebru Akagunduz
      This patch aims to improve THP collapse rates, by allowing THP collapse in
      the presence of read-only ptes, like those left in place by do_swap_page
      after a read fault.
      
      Currently THP can collapse 4kB pages into a THP when there are up to
      khugepaged_max_ptes_none pte_none ptes in a 2MB range.  This patch applies
      the same limit for read-only ptes.
      
      The patch was tested with a test program that allocates 800MB of memory,
      writes to it, and then sleeps.  I force the system to swap out all but
      190MB of the program by touching other memory.  Afterwards, the test
      program does a mix of reads and writes to its memory, and the memory gets
      swapped back in.
      
      Without the patch, only the memory that did not get swapped out remained
      in THPs, which corresponds to 24% of the memory of the program.  The
      percentage did not increase over time.
      
      With this patch, after 5 minutes of waiting khugepaged had collapsed 50%
      of the program's memory back into THPs.
      
      Test results:
      
      With the patch:
      After swapped out:
      cat /proc/pid/smaps:
      Anonymous:      100464 kB
      AnonHugePages:  100352 kB
      Swap:           699540 kB
      Fraction:       99,88
      
      cat /proc/meminfo:
      AnonPages:      1754448 kB
      AnonHugePages:  1716224 kB
      Fraction:       97,82
      
      After swapped in:
      In a few seconds:
      cat /proc/pid/smaps:
      Anonymous:      800004 kB
      AnonHugePages:  145408 kB
      Swap:           0 kB
      Fraction:       18,17
      
      cat /proc/meminfo:
      AnonPages:      2455016 kB
      AnonHugePages:  1761280 kB
      Fraction:       71,74
      
      In 5 minutes:
      cat /proc/pid/smaps
      Anonymous:      800004 kB
      AnonHugePages:  407552 kB
      Swap:           0 kB
      Fraction:       50,94
      
      cat /proc/meminfo:
      AnonPages:      2456872 kB
      AnonHugePages:  2023424 kB
      Fraction:       82,35
      
      Without the patch:
      After swapped out:
      cat /proc/pid/smaps:
      Anonymous:      190660 kB
      AnonHugePages:  190464 kB
      Swap:           609344 kB
      Fraction:       99,89
      
      cat /proc/meminfo:
      AnonPages:      1740456 kB
      AnonHugePages:  1667072 kB
      Fraction:       95,78
      
      After swapped in:
      cat /proc/pid/smaps:
      Anonymous:      800004 kB
      AnonHugePages:  190464 kB
      Swap:           0 kB
      Fraction:       23,80
      
      cat /proc/meminfo:
      AnonPages:      2350032 kB
      AnonHugePages:  1667072 kB
      Fraction:       70,93
      
      I waited 10 minutes; the fractions did not change without the patch.
      Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vmstat: do not use deferrable delayed work for vmstat_update · ba4877b9
      Authored by Michal Hocko
      Vinayak Menon has reported that an excessive number of tasks was throttled
      in the direct reclaim inside too_many_isolated() because NR_ISOLATED_FILE
      was relatively high compared to NR_INACTIVE_FILE.  However it turned out
      that the real number of NR_ISOLATED_FILE was 0 and the per-cpu
      vm_stat_diff wasn't transferred into the global counter.
      
      vmstat_work which is responsible for the sync is defined as deferrable
      delayed work which means that the defined timeout doesn't wake up an idle
      CPU.  A CPU might stay in an idle state for a long time, and the general
      effort is to keep such a CPU in this state as long as possible.  This can
      lead to all sorts of trouble for vmstat consumers, as can be seen with the
      excessive direct reclaim throttling.
      
      This patch basically reverts 39bf6270 ("VM statistics: Make timer
      deferrable") but it shouldn't cause any problems for idle CPUs because
      only CPUs with an active per-cpu drift are woken up since 7cc36bbd
      ("vmstat: on-demand vmstat workers v8") and CPUs which are idle for a
      longer time shouldn't have per-cpu drift.
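      A sketch of the behavioural difference, assuming the per-cpu work item from
      mm/vmstat.c (illustrative, not the literal revert):

          /* deferrable: the timer does not wake an idle CPU, so per-cpu
           * vm_stat_diff values can stay unfolded for a long time */
          INIT_DEFERRABLE_WORK(this_cpu_ptr(&vmstat_work), vmstat_update);

          /* plain delayed work (this patch): the timer fires on time even
           * on an idle CPU, so the diffs get folded into the global counters */
          INIT_DELAYED_WORK(this_cpu_ptr(&vmstat_work), vmstat_update);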
      
      Fixes: 39bf6270 (VM statistics: Make timer deferrable)
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Reported-by: Vinayak Menon <vinmenon@codeaurora.org>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: more aggressive page stealing for UNMOVABLE allocations · 9c0415eb
      Authored by Vlastimil Babka
      When allocation falls back to stealing free pages of another migratetype,
      it can decide to steal extra pages, or even the whole pageblock in order
      to reduce fragmentation, which could happen if further allocation
      fallbacks pick a different pageblock.  In try_to_steal_freepages(), one of
      the situations where extra pages are stolen happens when we are trying to
      allocate a MIGRATE_RECLAIMABLE page.
      
      However, MIGRATE_UNMOVABLE allocations are not treated the same way,
      although spreading such allocation over multiple fallback pageblocks is
      arguably even worse than it is for RECLAIMABLE allocations.  To minimize
      fragmentation, we should minimize the number of such fallbacks, and thus
      steal as much as is possible from each fallback pageblock.
      
      Note that in theory this might put more pressure on movable pageblocks and
      cause movable allocations to steal back from unmovable pageblocks.
      However, movable allocations are not as aggressive with stealing, and do
      not cause permanent fragmentation, so the tradeoff is reasonable, and
      evaluation seems to support the change.
      
      This patch thus adds a check for MIGRATE_UNMOVABLE to the decision to
      steal extra free pages.  When evaluating with stress-highalloc from
      mmtests, this has reduced the number of MIGRATE_UNMOVABLE fallbacks to
      roughly 1/6.  The number of these fallbacks stealing from MIGRATE_MOVABLE
      block is reduced to 1/3.  There was no observation of growing number of
      unmovable pageblocks over time, and also not of increased movable
      allocation fallbacks.
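      A hedged sketch of the extended decision (the shape follows
      try_to_steal_freepages(), but variable names are illustrative and this is
      not the literal diff):

          /* steal the whole pageblock, not just one page, when fallbacks
           * would otherwise spread fragmentation */
          bool steal_whole_block = current_order >= pageblock_order / 2 ||
                                   start_type == MIGRATE_RECLAIMABLE ||
                                   start_type == MIGRATE_UNMOVABLE ||  /* new */
                                   page_group_by_mobility_disabled;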
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: always steal split buddies in fallback allocations · 3a1086fb
      Authored by Vlastimil Babka
      When allocation falls back to another migratetype, it will steal a page
      with highest available order, and (depending on this order and desired
      migratetype), it might also steal the rest of free pages from the same
      pageblock.
      
      Given the preference of highest available order, it is likely that it will
      be higher than the desired order, and result in the stolen buddy page
      being split.  The remaining pages after split are currently stolen only
      when the rest of the free pages are stolen.  This can however lead to
      situations where for MOVABLE allocations we split e.g.  order-4 fallback
      UNMOVABLE page, but steal only order-0 page.  Then on the next MOVABLE
      allocation (which may be batched to fill the pcplists) we split another
      order-3 or higher page, etc.  By stealing all pages that we have split, we
      can avoid further stealing.
      
      This patch therefore adjusts the page stealing so that buddy pages created
      by split are always stolen.  This has effect only on MOVABLE allocations,
      as RECLAIMABLE and UNMOVABLE allocations already always do that in
      addition to stealing the rest of free pages from the pageblock.  The
      change also allows to simplify try_to_steal_freepages() and factor out CMA
      handling.
      
      According to Mel, it has been intended since the beginning that buddy
      pages after split would be stolen always, but it doesn't seem like it was
      ever the case until commit 47118af0 ("mm: mmzone: MIGRATE_CMA
      migration type added").  The commit has unintentionally introduced this
      behavior, but was reverted by commit 0cbef29a ("mm:
      __rmqueue_fallback() should respect pageblock type").  Neither included
      evaluation.
      
      My evaluation with stress-highalloc from mmtests shows about 2.5x
      reduction of page stealing events for MOVABLE allocations, without
      affecting the page stealing events for other allocation migratetypes.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: when stealing freepages, also take pages created by splitting buddy page · 99592d59
      Authored by Vlastimil Babka
      When studying page stealing, I noticed some weird looking decisions in
      try_to_steal_freepages().  The first I assume is a bug (Patch 1), the
      following two patches were driven by evaluation.
      
      Testing was done with stress-highalloc of mmtests, using the
      mm_page_alloc_extfrag tracepoint and postprocessing to get counts of how
      often page stealing occurs for individual migratetypes, and what
      migratetypes are used for fallbacks.  Arguably, the worst case of page
      stealing is when UNMOVABLE allocation steals from MOVABLE pageblock.
      RECLAIMABLE allocation stealing from MOVABLE allocation is also not ideal,
      so the goal is to minimize these two cases.
      
      The evaluation of v2 wasn't always a clear win and Joonsoo questioned the
      results.  Here I used a different baseline which includes the RFC compaction
      improvements from [1].  I found that the compaction improvements reduce
      variability of stress-highalloc, so there's less noise in the data.
      
      First, let's look at stress-highalloc configured to do sync compaction,
      and how these patches reduce page stealing events during the test.  First
      column is after fresh reboot, other two are reiterations of test without
      reboot.  That was all accumulated over 5 re-iterations (so the benchmark
      was run 5x3 times with 5 fresh restarts).
      
      Baseline:
      
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                        5-nothp-1       5-nothp-2       5-nothp-3
      Page alloc extfrag event                               10264225     8702233    10244125
      Extfrag fragmenting                                    10263271     8701552    10243473
      Extfrag fragmenting for unmovable                         13595       17616       15960
      Extfrag fragmenting unmovable placed with movable          7989       12193        8447
      Extfrag fragmenting for reclaimable                         658        1840        1817
      Extfrag fragmenting reclaimable placed with movable         558        1677        1679
      Extfrag fragmenting for movable                        10249018     8682096    10225696
      
      With Patch 1:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                        6-nothp-1       6-nothp-2       6-nothp-3
      Page alloc extfrag event                               11834954     9877523     9774860
      Extfrag fragmenting                                    11833993     9876880     9774245
      Extfrag fragmenting for unmovable                          7342       16129       11712
      Extfrag fragmenting unmovable placed with movable          4191       10547        6270
      Extfrag fragmenting for reclaimable                         373        1130         923
      Extfrag fragmenting reclaimable placed with movable         302         906         738
      Extfrag fragmenting for movable                        11826278     9859621     9761610
      
      With Patch 2:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                        7-nothp-1       7-nothp-2       7-nothp-3
      Page alloc extfrag event                                4725990     3668793     3807436
      Extfrag fragmenting                                     4725104     3668252     3806898
      Extfrag fragmenting for unmovable                          6678        7974        7281
      Extfrag fragmenting unmovable placed with movable          2051        3829        4017
      Extfrag fragmenting for reclaimable                         429        1208        1278
      Extfrag fragmenting reclaimable placed with movable         369         976        1034
      Extfrag fragmenting for movable                         4717997     3659070     3798339
      
      With Patch 3:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                        8-nothp-1       8-nothp-2       8-nothp-3
      Page alloc extfrag event                                5016183     4700142     3850633
      Extfrag fragmenting                                     5015325     4699613     3850072
      Extfrag fragmenting for unmovable                          1312        3154        3088
      Extfrag fragmenting unmovable placed with movable          1115        2777        2714
      Extfrag fragmenting for reclaimable                         437        1193        1097
      Extfrag fragmenting reclaimable placed with movable         330         969         879
      Extfrag fragmenting for movable                         5013576     4695266     3845887
      
      In v2 we saw an apparent regression with Patch 1 for unmovable events;
      this is now gone, suggesting it was indeed noise.  Here, each patch
      improves the situation for unmovable events.  Reclaimable is improved by
      patch 1 and then either stays the same modulo noise, or is perhaps slightly
      worse, a small price for the unmovable improvements, IMHO.  The number of movable
      allocations falling back to other migratetypes is most noisy, but it's
      reduced to half at Patch 2 nevertheless.  These are least critical as
      compaction can move them around.
      
      If we look at success rates, the patches don't affect them, that didn't change.
      
      Baseline:
                                   3.19-rc4              3.19-rc4              3.19-rc4
                                  5-nothp-1             5-nothp-2             5-nothp-3
      Success 1 Min         49.00 (  0.00%)       42.00 ( 14.29%)       41.00 ( 16.33%)
      Success 1 Mean        51.00 (  0.00%)       45.00 ( 11.76%)       42.60 ( 16.47%)
      Success 1 Max         55.00 (  0.00%)       51.00 (  7.27%)       46.00 ( 16.36%)
      Success 2 Min         53.00 (  0.00%)       47.00 ( 11.32%)       44.00 ( 16.98%)
      Success 2 Mean        59.60 (  0.00%)       50.80 ( 14.77%)       48.20 ( 19.13%)
      Success 2 Max         64.00 (  0.00%)       56.00 ( 12.50%)       52.00 ( 18.75%)
      Success 3 Min         84.00 (  0.00%)       82.00 (  2.38%)       78.00 (  7.14%)
      Success 3 Mean        85.60 (  0.00%)       82.80 (  3.27%)       79.40 (  7.24%)
      Success 3 Max         86.00 (  0.00%)       83.00 (  3.49%)       80.00 (  6.98%)
      
      Patch 1:
                                   3.19-rc4              3.19-rc4              3.19-rc4
                                  6-nothp-1             6-nothp-2             6-nothp-3
      Success 1 Min         49.00 (  0.00%)       44.00 ( 10.20%)       44.00 ( 10.20%)
      Success 1 Mean        51.80 (  0.00%)       46.00 ( 11.20%)       45.80 ( 11.58%)
      Success 1 Max         54.00 (  0.00%)       49.00 (  9.26%)       49.00 (  9.26%)
      Success 2 Min         58.00 (  0.00%)       49.00 ( 15.52%)       48.00 ( 17.24%)
      Success 2 Mean        60.40 (  0.00%)       51.80 ( 14.24%)       50.80 ( 15.89%)
      Success 2 Max         63.00 (  0.00%)       54.00 ( 14.29%)       55.00 ( 12.70%)
      Success 3 Min         84.00 (  0.00%)       81.00 (  3.57%)       79.00 (  5.95%)
      Success 3 Mean        85.00 (  0.00%)       81.60 (  4.00%)       79.80 (  6.12%)
      Success 3 Max         86.00 (  0.00%)       82.00 (  4.65%)       82.00 (  4.65%)
      
      Patch 2:
      
                                   3.19-rc4              3.19-rc4              3.19-rc4
                                  7-nothp-1             7-nothp-2             7-nothp-3
      Success 1 Min         50.00 (  0.00%)       44.00 ( 12.00%)       39.00 ( 22.00%)
      Success 1 Mean        52.80 (  0.00%)       45.60 ( 13.64%)       42.40 ( 19.70%)
      Success 1 Max         55.00 (  0.00%)       46.00 ( 16.36%)       47.00 ( 14.55%)
      Success 2 Min         52.00 (  0.00%)       48.00 (  7.69%)       45.00 ( 13.46%)
      Success 2 Mean        53.40 (  0.00%)       49.80 (  6.74%)       48.80 (  8.61%)
      Success 2 Max         57.00 (  0.00%)       52.00 (  8.77%)       52.00 (  8.77%)
      Success 3 Min         84.00 (  0.00%)       81.00 (  3.57%)       79.00 (  5.95%)
      Success 3 Mean        85.00 (  0.00%)       82.40 (  3.06%)       79.60 (  6.35%)
      Success 3 Max         86.00 (  0.00%)       83.00 (  3.49%)       80.00 (  6.98%)
      
      Patch 3:
                                   3.19-rc4              3.19-rc4              3.19-rc4
                                  8-nothp-1             8-nothp-2             8-nothp-3
      Success 1 Min         46.00 (  0.00%)       44.00 (  4.35%)       42.00 (  8.70%)
      Success 1 Mean        50.20 (  0.00%)       45.60 (  9.16%)       44.00 ( 12.35%)
      Success 1 Max         52.00 (  0.00%)       47.00 (  9.62%)       47.00 (  9.62%)
      Success 2 Min         53.00 (  0.00%)       49.00 (  7.55%)       48.00 (  9.43%)
      Success 2 Mean        55.80 (  0.00%)       50.60 (  9.32%)       49.00 ( 12.19%)
      Success 2 Max         59.00 (  0.00%)       52.00 ( 11.86%)       51.00 ( 13.56%)
      Success 3 Min         84.00 (  0.00%)       80.00 (  4.76%)       79.00 (  5.95%)
      Success 3 Mean        85.40 (  0.00%)       81.60 (  4.45%)       80.40 (  5.85%)
      Success 3 Max         87.00 (  0.00%)       83.00 (  4.60%)       82.00 (  5.75%)
      
      While there's no improvement here, I consider the reduced fragmentation
      events to be worthwhile on their own.  Patch 2 also seems to reduce scanning for free
      pages, and migrations in compaction, suggesting it has somewhat less work
      to do:
      
      Patch 1:
      
      Compaction stalls                 4153        3959        3978
      Compaction success                1523        1441        1446
      Compaction failures               2630        2517        2531
      Page migrate success           4600827     4943120     5104348
      Page migrate failure             19763       16656       17806
      Compaction pages isolated      9597640    10305617    10653541
      Compaction migrate scanned    77828948    86533283    87137064
      Compaction free scanned      517758295   521312840   521462251
      Compaction cost                   5503        5932        6110
      
      Patch 2:
      
      Compaction stalls                 3800        3450        3518
      Compaction success                1421        1316        1317
      Compaction failures               2379        2134        2201
      Page migrate success           4160421     4502708     4752148
      Page migrate failure             19705       14340       14911
      Compaction pages isolated      8731983     9382374     9910043
      Compaction migrate scanned    98362797    96349194    98609686
      Compaction free scanned      496512560   469502017   480442545
      Compaction cost                   5173        5526        5811
      
      As with v2, /proc/pagetypeinfo appears unaffected with respect to numbers
      of unmovable and reclaimable pageblocks.
      
      Configuring the benchmark to allocate like THP page fault (i.e.  no sync
      compaction) gives much noisier results for iterations 2 and 3 after
      reboot.  This is not so surprising given that [1] offers lower improvements
      in this scenario due to fewer restarts after deferred compaction, which
      would change the compaction pivot.
      
      Baseline:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                          5-thp-1         5-thp-2         5-thp-3
      Page alloc extfrag event                                8148965     6227815     6646741
      Extfrag fragmenting                                     8147872     6227130     6646117
      Extfrag fragmenting for unmovable                         10324       12942       15975
      Extfrag fragmenting unmovable placed with movable          5972        8495       10907
      Extfrag fragmenting for reclaimable                         601        1707        2210
      Extfrag fragmenting reclaimable placed with movable         520        1570        2000
      Extfrag fragmenting for movable                         8136947     6212481     6627932
      
      Patch 1:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                          6-thp-1         6-thp-2         6-thp-3
      Page alloc extfrag event                                8345457     7574471     7020419
      Extfrag fragmenting                                     8343546     7573777     7019718
      Extfrag fragmenting for unmovable                         10256       18535       30716
      Extfrag fragmenting unmovable placed with movable          6893       11726       22181
      Extfrag fragmenting for reclaimable                         465        1208        1023
      Extfrag fragmenting reclaimable placed with movable         353         996         843
      Extfrag fragmenting for movable                         8332825     7554034     6987979
      
      Patch 2:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                          7-thp-1         7-thp-2         7-thp-3
      Page alloc extfrag event                                3512847     3020756     2891625
      Extfrag fragmenting                                     3511940     3020185     2891059
      Extfrag fragmenting for unmovable                          9017        6892        6191
      Extfrag fragmenting unmovable placed with movable          1524        3053        2435
      Extfrag fragmenting for reclaimable                         445        1081        1160
      Extfrag fragmenting reclaimable placed with movable         375         918         986
      Extfrag fragmenting for movable                         3502478     3012212     2883708
      
      Patch 3:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                          8-thp-1         8-thp-2         8-thp-3
      Page alloc extfrag event                                3181699     3082881     2674164
      Extfrag fragmenting                                     3180812     3082303     2673611
      Extfrag fragmenting for unmovable                          1201        4031        4040
      Extfrag fragmenting unmovable placed with movable           974        3611        3645
      Extfrag fragmenting for reclaimable                         478        1165        1294
      Extfrag fragmenting reclaimable placed with movable         387         985        1030
      Extfrag fragmenting for movable                         3179133     3077107     2668277
      
      The improvements for the first iteration are clear; the rest is much noisier
      and can look like a regression for Patch 1.  Anyway, patch 2 rectifies it.
      
      Allocation success rates are again unaffected so there's no point in
      making this e-mail any longer.
      
      [1] http://marc.info/?l=linux-mm&m=142166196321125&w=2
      
      This patch (of 3):
      
      When __rmqueue_fallback() is called to allocate a page of order X, it will
      find a page of order Y >= X of a fallback migratetype, which is different
      from the desired migratetype.  With the help of try_to_steal_freepages(),
      it may change the migratetype (to the desired one) also of:
      
      1) all currently free pages in the pageblock containing the fallback page
      2) the fallback pageblock itself
      3) buddy pages created by splitting the fallback page (when Y > X)
      
      These decisions take the order Y into account, as well as the desired
      migratetype, with the goal of preventing multiple fallback allocations
      that could e.g.  distribute UNMOVABLE allocations among multiple
      pageblocks.
      
      Originally, the decision for 1) implied the decision for 3).  Commit
      47118af0 ("mm: mmzone: MIGRATE_CMA migration type added") changed that
      (probably unintentionally) so that the buddy pages in case 3) are always
      changed to the desired migratetype, except for CMA pageblocks.
      
      Commit fef903ef ("mm/page_allo.c: restructure free-page stealing code
      and fix a bug") did some refactoring and added a comment that the case of
      3) is intended.  Commit 0cbef29a ("mm: __rmqueue_fallback() should
      respect pageblock type") removed the comment and tried to restore the
      original behavior where 1) implies 3), but due to the previous
      refactoring, the result is instead that only 2) implies 3) - and the
      conditions for 2) are less frequently met than conditions for 1).  This
      may increase fragmentation in situations where the code decides to steal
      all free pages from the pageblock (case 1)), but then gives back the buddy
      pages produced by splitting.
      
      This patch restores the original intended logic where 1) implies 3).
      During testing with stress-highalloc from mmtests, this has shown to
      decrease the number of events where UNMOVABLE and RECLAIMABLE allocations
      steal from MOVABLE pageblocks, which can lead to permanent fragmentation.
      In some cases it has increased the number of events when MOVABLE
      allocations steal from UNMOVABLE or RECLAIMABLE pageblocks, but these are
      fixable by sync compaction and thus less harmful.
      
      Note that evaluation has shown that the behavior introduced by
      47118af0 for buddy pages in case 3) is actually even better than the
      original logic, so the following patch will introduce it properly once
      again.  For stable backports of this patch it makes thus sense to only fix
      versions containing 0cbef29a.
      
      [iamjoonsoo.kim@lge.com: tracepoint fix]
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: <stable@vger.kernel.org>	[3.13+ containing 0cbef29a]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mincore: apply page table walker on do_mincore() · 1e25a271
      Authored by Naoya Horiguchi
      This patch makes do_mincore() use walk_page_vma(), which reduces many
      lines of code by using common page table walk code.
      
      [daeseok.youn@gmail.com: remove unneeded variable 'err']
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: pagewalk: fix misbehavior of walk_page_range for vma(VM_PFNMAP) · 48684a65
      Authored by Naoya Horiguchi
      walk_page_range() silently skips any vma that has VM_PFNMAP set, which leads
      to undesirable behaviour for its callers.  For example, when no callbacks
      are invoked for a VM_PFNMAP vma, pagemap_read() may prepare pagemap data for
      the next virtual address range at the wrong index.  That could confuse
      and/or break userspace applications.

      This patch avoids this misbehavior for vma(VM_PFNMAP) as follows:
      - for pagemap_read() which has its own ->pte_hole(), call the ->pte_hole()
        over vma(VM_PFNMAP),
      - for clear_refs and queue_pages which have their own ->tests_walk,
        just return 1 and skip vma(VM_PFNMAP). This is no problem because
        these are not interested in hole regions,
      - for other callers, just skip the vma(VM_PFNMAP) as a default behavior.
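      A hedged sketch of the default policy from the list above (illustrative
      only; the real logic lives in the walker's per-vma test):

          if (walk->vma->vm_flags & VM_PFNMAP) {
                  int err = 0;

                  if (walk->pte_hole)
                          err = walk->pte_hole(start, end, walk);
                  return err ? err : 1;   /* 1 => skip this vma, keep walking */
          }
          return 0;                       /* walk this vma normally */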
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Shiraz Hashim <shashim@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mempolicy: apply page table walker on queue_pages_range() · 6f4576e3
      Authored by Naoya Horiguchi
      queue_pages_range() does page table walking in its own way now, but there
      is some code duplication.  This patch applies the page table walker to
      reduce the lines of code.
      
      queue_pages_range() has to do some prechecking to determine whether we
      really walk over the vma or just skip it.  Now we have the test_walk()
      callback in mm_walk for this purpose, so we can do this replacement cleanly.
      queue_pages_test_walk() depends not only on the current vma but also on the
      previous one, so queue_pages->prev is introduced to remember it.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: cleanup preparation for page table walk · 26bcd64a
      Authored by Naoya Horiguchi
      pagewalk.c can handle the vma by itself, so we don't have to pass the vma
      via walk->private.  Both mem_cgroup_count_precharge() and
      mem_cgroup_move_charge() run their own for-each-vma loops, but now that is
      done in pagewalk.c, so let's clean them up.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • pagewalk: add walk_page_vma() · 900fc5f1
      Authored by Naoya Horiguchi
      Introduce walk_page_vma(), which is useful for the callers which want to
      walk over a given vma.  It's used by later patches.
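      A hedged usage sketch (the callback and counting are hypothetical; only
      walk_page_vma() and the mm_walk fields come from the pagewalk API of this
      series):

          /* count present ptes in one vma */
          static int count_pte(pte_t *pte, unsigned long addr,
                               unsigned long next, struct mm_walk *walk)
          {
                  if (pte_present(*pte))
                          (*(unsigned long *)walk->private)++;
                  return 0;
          }

          static unsigned long count_present(struct vm_area_struct *vma)
          {
                  unsigned long count = 0;
                  struct mm_walk walk = {
                          .pte_entry      = count_pte,
                          .mm             = vma->vm_mm,
                          .private        = &count,
                  };

                  walk_page_vma(vma, &walk);
                  return count;
          }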
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • pagewalk: improve vma handling · fafaa426
      Authored by Naoya Horiguchi
      The current implementation of the page table walker has a fundamental
      problem in vma handling, which started when we tried to handle
      vma(VM_HUGETLB).  Because it's done in the pgd loop, taking vma boundaries
      into account makes the code complicated and bug-prone.

      From the user's viewpoint, a user checks some vma-related condition to
      determine whether it really does a page walk over the vma.

      In order to solve these problems, this patch moves the vma check outside
      the pgd loop and introduces a new callback ->test_walk().
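      A hedged sketch of a ->test_walk() callback (hypothetical user code; the
      return-value convention is the one described in this series):

          /* 0 = walk this vma, 1 = skip it silently, <0 = abort the walk */
          static int my_test_walk(unsigned long start, unsigned long end,
                                  struct mm_walk *walk)
          {
                  if (walk->vma->vm_flags & VM_HUGETLB)
                          return 1;
                  return 0;
          }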
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/pagewalk: remove pgd_entry() and pud_entry() · 0b1fbfe5
      Authored by Naoya Horiguchi
      Currently no user of the page table walker sets ->pgd_entry() or
      ->pud_entry(), so checking their existence in each loop is just wasting
      CPU cycles.  So let's remove them to reduce the overhead.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: gup: use get_user_pages_unlocked · 7e339128
      Authored by Andrea Arcangeli
      This allows those get_user_pages calls to pass FAULT_FLAG_ALLOW_RETRY to
      the page fault in order to release the mmap_sem during the I/O.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: gup: use get_user_pages_unlocked within get_user_pages_fast · a7b78075
      Authored by Andrea Arcangeli
      This allows the get_user_pages_fast slow path to release the mmap_sem
      before blocking.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: gup: add __get_user_pages_unlocked to customize gup_flags · 0fd71a56
      Authored by Andrea Arcangeli
      Some callers (like KVM) may want to set gup_flags like FOLL_HWPOISON
      to get a proper -EHWPOISON retval instead of -EFAULT and take a more
      appropriate action if get_user_pages runs into a memory failure.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: gup: add get_user_pages_locked and get_user_pages_unlocked · f0818f47
      Authored by Andrea Arcangeli
      FAULT_FLAG_ALLOW_RETRY allows the page fault to drop the mmap_sem for
      reading to reduce the mmap_sem contention (for writing), like while
      waiting for I/O completion.  The problem is that right now practically no
      get_user_pages call uses FAULT_FLAG_ALLOW_RETRY, so we're not leveraging
      that nifty feature.
      
      Andres fixed it for the KVM page fault.  However get_user_pages_fast
      remains uncovered, and 99% of other get_user_pages calls aren't using it
      either (the only exception being FOLL_NOWAIT in KVM, which is really
      nonblocking and in fact doesn't even release the mmap_sem).
      
      So this patchset extends the optimization Andres did for the KVM page
      fault to the whole kernel.  It makes the most important places
      (including gup_fast) use FAULT_FLAG_ALLOW_RETRY to reduce mmap_sem hold
      times during I/O.
      
      The few places that remain uncovered are drivers like v4l and other
      exceptions that tend to work on their own memory rather than on random
      user memory (unlike, for example, O_DIRECT, which uses gup_fast and is
      fully covered by this patch).
      
      A follow-up patch should probably also add a printk_once warning to
      get_user_pages, which should become obsolete and eventually be phased
      out.  The "vmas" parameter of get_user_pages makes it fundamentally
      incompatible with FAULT_FLAG_ALLOW_RETRY (the vmas array becomes
      meaningless the moment the mmap_sem is released).
      
      While this is just an optimization here, it becomes an absolute
      requirement for the userfaultfd feature http://lwn.net/Articles/615086/ .
      
      Userfaultfd allows blocking the page fault, and in order to do so the
      mmap_sem has to be dropped first.  So this patch also ensures that, for
      all memory where userfaultfd could be registered by KVM, the very first
      fault (no matter whether it is a regular page fault or a
      get_user_pages) always has FAULT_FLAG_ALLOW_RETRY set.  The userfaultfd
      then blocks and is woken only when the pagetable is already mapped.
      The second fault attempt after the wakeup doesn't need
      FAULT_FLAG_ALLOW_RETRY, so it's ok to retry without it.
      
      This patch (of 5):
      
      We can leverage the VM_FAULT_RETRY functionality in the page fault paths
      better by using either get_user_pages_locked or get_user_pages_unlocked.
      
      The former allows conversion of get_user_pages invocations that will have
      to pass a "&locked" parameter to know if the mmap_sem was dropped during
      the call.  Example from:
      
          down_read(&mm->mmap_sem);
          do_something()
          get_user_pages(tsk, mm, ..., pages, NULL);
          up_read(&mm->mmap_sem);
      
      to:
      
          int locked = 1;
          down_read(&mm->mmap_sem);
          do_something()
          get_user_pages_locked(tsk, mm, ..., pages, &locked);
          if (locked)
              up_read(&mm->mmap_sem);
      
      The latter is suitable only as a drop-in replacement of the form:
      
          down_read(&mm->mmap_sem);
          get_user_pages(tsk, mm, ..., pages, NULL);
          up_read(&mm->mmap_sem);
      
      into:
      
          get_user_pages_unlocked(tsk, mm, ..., pages);
      
      Here tsk, mm, the intermediate "..." parameters and "pages" can be any
      value, as before.  Only the last parameter of get_user_pages (vmas)
      must be NULL for get_user_pages_locked|unlocked to be usable (the
      latter original form wouldn't have been safe anyway if vmas wasn't
      NULL; for the former we just make it explicit by dropping the
      parameter).
      
      If vmas is not NULL these two methods cannot be used.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: NAndres Lagar-Cavilla <andreslc@google.com>
      Reviewed-by: NPeter Feiner <pfeiner@google.com>
      Reviewed-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f0818f47
    • V
      mm/mempolicy.c: merge alloc_hugepage_vma to alloc_pages_vma · be97a41b
      Committed by Vlastimil Babka
      The previous commit ("mm/thp: Allocate transparent hugepages on local
      node") introduced alloc_hugepage_vma() to mm/mempolicy.c to perform a
      special policy for THP allocations.  The function has the same interface
      as alloc_pages_vma(), shares a lot of boilerplate code and a long
      comment.
      
      This patch merges the hugepage special case into alloc_pages_vma.  The
      extra if condition should be a cheap enough price to pay.  We also
      prevent a (however unlikely) race with a parallel mems_allowed update,
      which could make the hugepage allocation restart only within the
      fallback call to alloc_hugepage_vma() and not reconsider the special
      rule in alloc_hugepage_vma().
      
      Also by making sure mpol_cond_put(pol) is always called before actual
      allocation attempt, we can use a single exit path within the function.
      
      Also update the comment about the missing node parameter and the
      obsolete reference to mm_sem.
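      
      A rough sketch of the resulting call shape (illustrative only; the
      hugepage flag and the thin wrapper follow my reading of the merged
      interface, not the exact kernel code):
      
      	/* THP allocation is now just alloc_pages_vma() with a flag ... */
      	page = alloc_pages_vma(gfp, HPAGE_PMD_ORDER, vma, haddr,
      			numa_node_id(), true /* hugepage */);
      
      	/* ... and the old entry point can remain a thin wrapper */
      	#define alloc_hugepage_vma(gfp, vma, addr, order) \
      		alloc_pages_vma(gfp, order, vma, addr, numa_node_id(), true)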
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      be97a41b
    • A
      mm/thp: allocate transparent hugepages on local node · 077fcf11
      Committed by Aneesh Kumar K.V
      This makes sure that we try to allocate hugepages from the local node
      if allowed by the mempolicy.  If we can't, we fall back to small page
      allocation based on the mempolicy.  This is based on the observation
      that allocating pages on the local node is more beneficial than
      allocating hugepages on a remote node.
      
      With this patch applied we may see transparent huge page allocation
      failures if the current node doesn't have enough free hugepages.
      Before this patch such failures resulted in retrying the allocation on
      other nodes in the NUMA node mask.
      
      [akpm@linux-foundation.org: fix comment, add CONFIG_TRANSPARENT_HUGEPAGE dependency]
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      077fcf11
    • J
      mm/compaction: add tracepoint to observe behaviour of compaction defer · 24e2716f
      Committed by Joonsoo Kim
      The compaction deferring logic is a heavy hammer that blocks the way to
      compaction.  It doesn't consider overall system state, so it could
      falsely prevent the user from doing compaction.  In other words, even
      if the system has enough memory range to compact, compaction would be
      skipped due to the deferring logic.  This patch adds a new tracepoint
      to help understand how the deferring logic works.  It will also help to
      check compaction success and failure.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      24e2716f
    • J
      mm/compaction: more trace to understand when/why compaction start/finish · 837d026d
      Committed by Joonsoo Kim
      It is not well understood when/why compaction starts or finishes.  With
      these new tracepoints, we can learn much more about the start/finish
      reasons of compaction.  I found the following bug with these
      tracepoints:
      
      http://www.spinics.net/lists/linux-mm/msg81582.html
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      837d026d
    • J
      mm/compaction: print current range where compaction work · e34d85f0
      Committed by Joonsoo Kim
      It'd be useful to know the current range where compaction is working,
      for detailed analysis.  With it, we can see which pageblock we actually
      scan and isolate from, how many pages we try in that pageblock, and
      roughly guess why it doesn't become a free page of pageblock order.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e34d85f0
    • J
      mm/compaction: enhance tracepoint output for compaction begin/end · 16c4a097
      Committed by Joonsoo Kim
      We now have a tracepoint for the compaction begin event that prints the
      start position of both scanners, but the tracepoint for the compaction
      end event doesn't print the finish position of the scanners.  It'd also
      be useful to know where both scanners finish, so this patch adds that.
      It will help to find odd behavior or problems in compaction's internal
      logic.
      
      The mode is also added to both begin/end tracepoint output, since
      compaction behavior differs quite a bit depending on the mode.
      
      Lastly, the status format is changed to a string rather than a status
      number for readability.
      
      [akpm@linux-foundation.org: fix sparse warning]
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      16c4a097
    • K
      page_writeback: put account_page_redirty() after set_page_dirty() · 8d38633c
      Committed by Konstantin Khebnikov
      The helper account_page_redirty() fixes the dirty pages counter for
      redirtied pages.  This patch puts it after dirtying, which prevents
      temporary underflows of the dirtied pages counters on the zone/bdi and
      in current->nr_dirtied.
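      
      The intended caller pattern, as a hedged sketch (illustrative only, not
      a quote of the changed call site):
      
      	/* redirty a page that could not be written back right now */
      	set_page_dirty(page);		/* bumps the "dirtied" counters */
      	account_page_redirty(page);	/* compensate them afterwards, so
      					 * they never temporarily underflow */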
      Signed-off-by: NKonstantin Khebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8d38633c
    • K
      mm: fix false-positive warning on exit due mm_nr_pmds(mm) · b30fe6c7
      Committed by Kirill A. Shutemov
      The problem is that we check nr_ptes/nr_pmds in exit_mmap(), which
      happens *before* pgd_free().  And if an arch does pte/pmd allocation in
      pgd_alloc() and frees them in pgd_free(), we see an offset in the
      counters by the time of the checks.
      
      We tried to work around this by offsetting the expected counter value
      according to FIRST_USER_ADDRESS for both nr_ptes and nr_pmds in
      exit_mmap().  But that doesn't work in some cases:
      
      1. ARM with LPAE enabled also has a non-zero USER_PGTABLES_CEILING, but
         the upper addresses are occupied by huge pmd entries, so the trick
         of offsetting the expected counter value gets really ugly: we would
         have to apply it to nr_pmds, but not to nr_ptes.
      
      2. Metag has a non-zero FIRST_USER_ADDRESS, but doesn't do pte/pmd page
         table allocation in pgd_alloc(); it just sets up a pgd entry which
         is allocated at boot and shared across all processes.
      
      The proposal is to move the check to check_mm(), which happens *after*
      pgd_free(), and to do proper accounting during pgd_alloc() and
      pgd_free(), which brings the counters to zero if nothing leaked.
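      
      A hedged sketch of the arch-side pairing this implies (illustrative
      only; __arch_pgd_alloc()/__arch_pgd_free() are hypothetical
      placeholders for whatever the architecture already does):
      
      	pgd_t *pgd_alloc(struct mm_struct *mm)
      	{
      		pgd_t *pgd = __arch_pgd_alloc(mm); /* may preallocate a pmd */
      
      		if (pgd)
      			mm_inc_nr_pmds(mm);	/* account the preallocation */
      		return pgd;
      	}
      
      	void pgd_free(struct mm_struct *mm, pgd_t *pgd)
      	{
      		mm_dec_nr_pmds(mm);		/* matches pgd_alloc() above */
      		__arch_pgd_free(mm, pgd);
      	}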
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: NTyler Baker <tyler.baker@linaro.org>
      Tested-by: NTyler Baker <tyler.baker@linaro.org>
      Tested-by: NNishanth Menon <nm@ti.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b30fe6c7
    • K
      mm: account pmd page tables to the process · dc6c9a35
      Committed by Kirill A. Shutemov
      Dave noticed that an unprivileged process can allocate a significant
      amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by the
      oom-killer and the memory cgroup.  The trick is to allocate a lot of
      PMD page tables.  The Linux kernel doesn't account PMD tables to the
      process, only PTE tables.
      
      The test program below uses a few tricks to allocate a lot of PMD page
      tables while keeping VmRSS and VmPTE low.  The oom_score for the
      process will be 0.
      
      	#include <errno.h>
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <sys/mman.h>
      	#include <sys/prctl.h>
      
      	#define PUD_SIZE (1UL << 30)
      	#define PMD_SIZE (1UL << 21)
      
      	#define NR_PUD 130000
      
      	int main(void)
      	{
      		char *addr = NULL;
      		unsigned long i;
      
      		prctl(PR_SET_THP_DISABLE);
      		for (i = 0; i < NR_PUD ; i++) {
      			addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
      					MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
      			if (addr == MAP_FAILED) {
      				perror("mmap");
      				break;
      			}
      			/* fault in one byte: the kernel allocates a pmd
      			 * table (plus a pte table and one data page) */
      			*addr = 'x';
      			/* unmap and remap the first 2M: the data page and
      			 * pte table are freed, keeping VmRSS/VmPTE low,
      			 * while the pmd table for the mapping stays */
      			munmap(addr, PMD_SIZE);
      			addr = mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
      					MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
      			if (addr == MAP_FAILED)
      				perror("re-mmap"), exit(1);
      		}
      		printf("PID %d consumed %lu KiB in PMD page tables\n",
      				getpid(), i * 4096 >> 10);
      		return pause();
      	}
      
      The patch addresses the issue by accounting PMD tables to the process
      the same way we account PTE tables.
      
      The main places where PMD tables are accounted are __pmd_alloc() and
      free_pmd_range().  But there are a few corner cases:
      
       - HugeTLB can share PMD page tables.  The patch handles this by
         accounting the table to all processes that share it.
      
       - x86 PAE pre-allocates a few PMD tables on fork.
      
       - Architectures with FIRST_USER_ADDRESS > 0.  We need to adjust the
         sanity check on exit(2).
      
      Accounting only happens on configurations where the PMD page table
      level is present (PMD is not folded).  As with nr_ptes we use a per-mm
      counter.  The counter value is used to calculate the baseline for the
      badness score by the oom-killer.
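      
      As a hedged illustration of that last point (the exact expression in
      oom_badness() may differ), the new per-mm counter simply joins the
      existing terms of the baseline:
      
      	/* sketch of the badness baseline with the new counter included */
      	points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
      		 atomic_long_read(&p->mm->nr_ptes) + mm_nr_pmds(p->mm);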
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Reviewed-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: David Rientjes <rientjes@google.com>
      Tested-by: NSedat Dilek <sedat.dilek@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dc6c9a35
    • J
      mm: memcontrol: consolidate swap controller code · 21afa38e
      Committed by Johannes Weiner
      The swap controller code is scattered all over the file.  Gather all
      the code that isn't directly needed by the memory controller at the
      end of the file in its own CONFIG_MEMCG_SWAP section.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: NVladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      21afa38e
    • J
      mm: memcontrol: consolidate memory controller initialization · 95a045f6
      Committed by Johannes Weiner
      The initialization code for the per-cpu charge stock and the soft
      limit tree is compact enough to inline it into mem_cgroup_init().
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NVladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      95a045f6
    • J
      mm: memcontrol: simplify soft limit tree init code · 9c608dbe
      Committed by Johannes Weiner
      - No need to test the node for N_MEMORY.  node_online() is enough for
        node fallback to work in slab; use NUMA_NO_NODE for everything else.
      
      - Remove the BUG_ON() for allocation failure.  A NULL pointer crash is
        just as descriptive, and the absent return value check is obvious.
      
      - Move local variables to the inner-most blocks.
      
      - Point to the tree structure after it's initialized, not before; it's
        just more logical that way.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9c608dbe
    • G
      mm: cma: fix totalcma_pages to include DT defined CMA regions · 94737a85
      Committed by George G. Davis
      The totalcma_pages variable is not updated to account for CMA regions
      defined via device tree reserved-memory sub-nodes.  Fix this omission by
      moving the calculation of totalcma_pages into cma_init_reserved_mem()
      instead of cma_declare_contiguous() such that it will include reserved
      memory used by all CMA regions.
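      
      A hedged sketch of where the accounting ends up (illustrative only; the
      actual statement in cma_init_reserved_mem() may differ slightly):
      
      	/* every registered CMA range, DT-defined or not, passes through
      	 * cma_init_reserved_mem(), so account it here */
      	totalcma_pages += (size / PAGE_SIZE);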
      Signed-off-by: NGeorge G. Davis <george_davis@mentor.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Acked-by: NMichal Nazarewicz <mina86@mina86.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      94737a85
    • M
      oom, PM: make OOM detection in the freezer path raceless · c32b3cbe
      Committed by Michal Hocko
      Commit 5695be14 ("OOM, PM: OOM killed task shouldn't escape PM
      suspend") left a race window where the OOM killer manages to call
      note_oom_kill() after freeze_processes() has checked the counter.  The
      race window is quite small and really unlikely, and a partial solution
      was deemed sufficient at the time of submission.
      
      Tejun wasn't happy about this partial solution, though, and insisted on
      a full one.  That requires full exclusion between the OOM killer and
      the freezer's task freezing.  This is done by this patch, which
      introduces the oom_sem RW lock and turns oom_killer_disable() into a
      full OOM barrier.
      
      The oom_killer_disabled check is moved from the allocation path to the
      OOM level, and oom_sem is taken for reading for both the check and the
      whole OOM invocation.
      
      oom_killer_disable() takes oom_sem for writing, so it waits for all
      currently running OOM killer invocations.  Then it disables all further
      OOMs by setting oom_killer_disabled and checks for any OOM victims.
      Victims are counted via mark_tsk_oom_victim() and unmark_oom_victim()
      respectively.  The last victim wakes up all waiters enqueued by
      oom_killer_disable().  Therefore this function acts as the full OOM
      barrier.
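      
      A hedged sketch of that flow (illustrative only; the helper and
      variable names follow the description above rather than the exact
      kernel code):
      
      	bool oom_killer_disable(void)
      	{
      		/* wait for any OOM invocation already in flight */
      		down_write(&oom_sem);
      		oom_killer_disabled = true;
      		up_write(&oom_sem);
      
      		/* the last victim's unmark_oom_victim() wakes us up */
      		wait_event(oom_victims_wait, !atomic_read(&oom_victims));
      		return true;
      	}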
      
      The page fault path is now covered as well, although it was assumed to
      be safe before.  As per Tejun, "We used to have freezing points deep in
      file system code which may be reachable from page fault," so it is
      better and more robust not to rely on freezing points here.  The same
      applies to the memcg OOM killer.
      
      out_of_memory() tells the caller whether the OOM killer was allowed to
      trigger, and the callers are supposed to handle the situation.  The
      page allocation path simply fails the allocation, same as before.  The
      page fault path will retry the fault (more on that later) and the Sysrq
      OOM trigger will simply complain to the log.
      
      Normally there wouldn't be any unfrozen user tasks after
      try_to_freeze_tasks, so the function will not block.  But if an OOM
      killer was racing with try_to_freeze_tasks and the OOM victim hasn't
      finished yet, then we have to wait for it.  This should complete in a
      finite time, though, because
      
      	- the victim cannot loop in the page fault handler (it would die
      	  on the way out from the exception)
      	- it cannot loop in the page allocator because all further
      	  allocations would fail and __GFP_NOFAIL allocations are not
      	  acceptable at this stage
      	- it shouldn't be blocked on any locks held by frozen tasks
      	  (try_to_freeze expects lockless context) and kernel threads and
      	  work queues are not frozen yet
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Suggested-by: NTejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c32b3cbe