提交 · bfe63d3beabfac93521c8b7ccd40befd7a90148e · openeuler / Kernel

07 7月, 2017 1 次提交

mm: drop page_initialized check from get_nid_for_pfn · bfe63d3b

由 Michal Hocko 提交于 7月 06, 2017

Commit c04fc586 ("mm: show node to memory section relationship with
symlinks in sysfs") has added means to export memblock<->node
association into the sysfs.  It has also introduced get_nid_for_pfn
which is a rather confusing counterpart of pfn_to_nid which checks also
whether the pfn page is already initialized (page_initialized).

This is done by checking page::lru != NULL which doesn't make any sense
at all.  Nothing in this path really relies on the lru list being used
or initialized.  Just remove it because this will become a problem with
later patches.

Thanks to Reza Arbab for testing which revealed this to be a problem
(http://lkml.kernel.org/r/20170403202337.GA12482@dhcp22.suse.cz)

Link: http://lkml.kernel.org/r/20170515085827.16474-4-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Daniel Kiper <daniel.kiper@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Tobias Regnery <tobias.regnery@gmail.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

bfe63d3b

23 5月, 2017 1 次提交

mm: Adjust system_state check · 8cdde385

由 Thomas Gleixner 提交于 5月 16, 2017

To enable smp_processor_id() and might_sleep() debug checks earlier, it's
required to add system states between SYSTEM_BOOTING and SYSTEM_RUNNING.

get_nid_for_pfn() checks for system_state == BOOTING to decide whether to
use early_pfn_to_nid() when CONFIG_DEFERRED_STRUCT_PAGE_INIT=y.

That check is dubious, because the switch to state RUNNING happes way after
page_alloc_init_late() has been invoked.

Change the check to less than RUNNING state so it covers the new
intermediate states as well.
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20170516184735.528279534@linutronix.deSigned-off-by: NIngo Molnar <mingo@kernel.org>

8cdde385

03 8月, 2016 1 次提交

treewide: replace obsolete _refok by __ref · bd721ea7

由 Fabian Frederick 提交于 8月 02, 2016

There was only one use of __initdata_refok and __exit_refok

__init_refok was used 46 times against 82 for __ref.

Those definitions are obsolete since commit 312b1485 ("Introduce new
section reference annotations tags: __ref, __refdata, __refconst")

This patch removes the following compatibility definitions and replaces
them treewide.

/* compatibility defines */
#define __init_refok     __ref
#define __initdata_refok __refdata
#define __exit_refok     __ref

I can also provide separate patches if necessary.
(One patch per tree and check in 1 month or 2 to remove old definitions)

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1466796271-3043-1-git-send-email-fabf@skynet.beSigned-off-by: NFabian Frederick <fabf@skynet.be>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

bd721ea7

29 7月, 2016 6 次提交

mm: track NR_KERNEL_STACK in KiB instead of number of stacks · d30dd8be

由 Andy Lutomirski 提交于 7月 28, 2016

Currently, NR_KERNEL_STACK tracks the number of kernel stacks in a zone.
This only makes sense if each kernel stack exists entirely in one zone,
and allowing vmapped stacks could break this assumption.

Since frv has THREAD_SIZE < PAGE_SIZE, we need to track kernel stack
allocations in a unit that divides both THREAD_SIZE and PAGE_SIZE on all
architectures. Keep it simple and use KiB.

Link: http://lkml.kernel.org/r/083c71e642c5fa5f1b6898902e1b2db7b48940d4.1468523549.git.luto@kernel.orgSigned-off-by: NAndy Lutomirski <luto@kernel.org>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: NVladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d30dd8be

mm: move most file-based accounting to the node · 11fb9989

由 Mel Gorman 提交于 7月 28, 2016

There are now a number of accounting oddities such as mapped file pages
being accounted for on the node while the total number of file pages are
accounted on the zone.  This can be coped with to some extent but it's
confusing so this patch moves the relevant file-based accounted.  Due to
throttling logic in the page allocator for reliable OOM detection, it is
still necessary to track dirty and writeback pages on a per-zone basis.

[mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
  Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

11fb9989

mm: rename NR_ANON_PAGES to NR_ANON_MAPPED · 4b9d0fab

由 Mel Gorman 提交于 7月 28, 2016

NR_FILE_PAGES  is the number of        file pages.
NR_FILE_MAPPED is the number of mapped file pages.
NR_ANON_PAGES  is the number of mapped anon pages.

This is unhelpful naming as it's easy to confuse NR_FILE_MAPPED and
NR_ANON_PAGES for mapped pages.  This patch renames NR_ANON_PAGES so we
have

NR_FILE_PAGES  is the number of        file pages.
NR_FILE_MAPPED is the number of mapped file pages.
NR_ANON_MAPPED is the number of mapped anon pages.

Link: http://lkml.kernel.org/r/1467970510-21195-19-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4b9d0fab

mm: move page mapped accounting to the node · 50658e2e

由 Mel Gorman 提交于 7月 28, 2016

Reclaim makes decisions based on the number of pages that are mapped but
it's mixing node and zone information.  Account NR_FILE_MAPPED and
NR_ANON_PAGES pages on the node.

Link: http://lkml.kernel.org/r/1467970510-21195-18-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

50658e2e

mm, vmscan: move LRU lists to node · 599d0c95

由 Mel Gorman 提交于 7月 28, 2016

This moves the LRU lists from the zone to the node and related data such
as counters, tracing, congestion tracking and writeback tracking.

Unfortunately, due to reclaim and compaction retry logic, it is
necessary to account for the number of LRU pages on both zone and node
logic. Most reclaim logic is based on the node counters but the retry
logic uses the zone counters which do not distinguish inactive and
active sizes. It would be possible to leave the LRU counters on a
per-zone basis but it's a heavier calculation across multiple cache
lines that is much more frequent than the retry checks.

Other than the LRU counters, this is mostly a mechanical patch but note
that it introduces a number of anomalies. For example, the scans are
per-zone but using per-node counters. We also mark a node as congested
when a zone is congested. This causes weird problems that are fixed
later but is easier to review.

In the event that there is excessive overhead on 32-bit systems due to
the nodes being on LRU then there are two potential solutions

1. Long-term isolation of highmem pages when reclaim is lowmem

When pages are skipped, they are immediately added back onto the LRU
list. If lowmem reclaim persisted for long periods of time, the same
highmem pages get continually scanned. The idea would be that lowmem
keeps those pages on a separate list until a reclaim for highmem pages
arrives that splices the highmem pages back onto the LRU. It potentially
could be implemented similar to the UNEVICTABLE list.

That would reduce the skip rate with the potential corner case is that
highmem pages have to be scanned and reclaimed to free lowmem slab pages.

2. Linear scan lowmem pages if the initial LRU shrink fails

This will break LRU ordering but may be preferable and faster during
memory pressure than skipping LRU pages.

Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

599d0c95

mm, vmstat: add infrastructure for per-node vmstats · 75ef7184

由 Mel Gorman 提交于 7月 28, 2016

Patchset: "Move LRU page reclaim from zones to nodes v9"

This series moves LRUs from the zones to the node.  While this is a
current rebase, the test results were based on mmotm as of June 23rd.
Conceptually, this series is simple but there are a lot of details.
Some of the broad motivations for this are;

1. The residency of a page partially depends on what zone the page was
   allocated from.  This is partially combatted by the fair zone allocation
   policy but that is a partial solution that introduces overhead in the
   page allocator paths.

2. Currently, reclaim on node 0 behaves slightly different to node 1. For
   example, direct reclaim scans in zonelist order and reclaims even if
   the zone is over the high watermark regardless of the age of pages
   in that LRU. Kswapd on the other hand starts reclaim on the highest
   unbalanced zone. A difference in distribution of file/anon pages due
   to when they were allocated results can result in a difference in
   again. While the fair zone allocation policy mitigates some of the
   problems here, the page reclaim results on a multi-zone node will
   always be different to a single-zone node.
   it was scheduled on as a result.

3. kswapd and the page allocator scan zones in the opposite order to
   avoid interfering with each other but it's sensitive to timing.  This
   mitigates the page allocator using pages that were allocated very recently
   in the ideal case but it's sensitive to timing. When kswapd is allocating
   from lower zones then it's great but during the rebalancing of the highest
   zone, the page allocator and kswapd interfere with each other. It's worse
   if the highest zone is small and difficult to balance.

4. slab shrinkers are node-based which makes it harder to identify the exact
   relationship between slab reclaim and LRU reclaim.

The reason we have zone-based reclaim is that we used to have
large highmem zones in common configurations and it was necessary
to quickly find ZONE_NORMAL pages for reclaim. Today, this is much
less of a concern as machines with lots of memory will (or should) use
64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are
rare. Machines that do use highmem should have relatively low highmem:lowmem
ratios than we worried about in the past.

Conceptually, moving to node LRUs should be easier to understand. The
page allocator plays fewer tricks to game reclaim and reclaim behaves
similarly on all nodes.

The series has been tested on a 16 core UMA machine and a 2-socket 48
core NUMA machine. The UMA results are presented in most cases as the NUMA
machine behaved similarly.

pagealloc
---------

This is a microbenchmark that shows the benefit of removing the fair zone
allocation policy. It was tested uip to order-4 but only orders 0 and 1 are
shown as the other orders were comparable.

                                           4.7.0-rc4                  4.7.0-rc4
                                      mmotm-20160623                 nodelru-v9
Min      total-odr0-1               490.00 (  0.00%)           457.00 (  6.73%)
Min      total-odr0-2               347.00 (  0.00%)           329.00 (  5.19%)
Min      total-odr0-4               288.00 (  0.00%)           273.00 (  5.21%)
Min      total-odr0-8               251.00 (  0.00%)           239.00 (  4.78%)
Min      total-odr0-16              234.00 (  0.00%)           222.00 (  5.13%)
Min      total-odr0-32              223.00 (  0.00%)           211.00 (  5.38%)
Min      total-odr0-64              217.00 (  0.00%)           208.00 (  4.15%)
Min      total-odr0-128             214.00 (  0.00%)           204.00 (  4.67%)
Min      total-odr0-256             250.00 (  0.00%)           230.00 (  8.00%)
Min      total-odr0-512             271.00 (  0.00%)           269.00 (  0.74%)
Min      total-odr0-1024            291.00 (  0.00%)           282.00 (  3.09%)
Min      total-odr0-2048            303.00 (  0.00%)           296.00 (  2.31%)
Min      total-odr0-4096            311.00 (  0.00%)           309.00 (  0.64%)
Min      total-odr0-8192            316.00 (  0.00%)           314.00 (  0.63%)
Min      total-odr0-16384           317.00 (  0.00%)           315.00 (  0.63%)
Min      total-odr1-1               742.00 (  0.00%)           712.00 (  4.04%)
Min      total-odr1-2               562.00 (  0.00%)           530.00 (  5.69%)
Min      total-odr1-4               457.00 (  0.00%)           433.00 (  5.25%)
Min      total-odr1-8               411.00 (  0.00%)           381.00 (  7.30%)
Min      total-odr1-16              381.00 (  0.00%)           356.00 (  6.56%)
Min      total-odr1-32              372.00 (  0.00%)           346.00 (  6.99%)
Min      total-odr1-64              372.00 (  0.00%)           343.00 (  7.80%)
Min      total-odr1-128             375.00 (  0.00%)           351.00 (  6.40%)
Min      total-odr1-256             379.00 (  0.00%)           351.00 (  7.39%)
Min      total-odr1-512             385.00 (  0.00%)           355.00 (  7.79%)
Min      total-odr1-1024            386.00 (  0.00%)           358.00 (  7.25%)
Min      total-odr1-2048            390.00 (  0.00%)           362.00 (  7.18%)
Min      total-odr1-4096            390.00 (  0.00%)           362.00 (  7.18%)
Min      total-odr1-8192            388.00 (  0.00%)           363.00 (  6.44%)

This shows a steady improvement throughout. The primary benefit is from
reduced system CPU usage which is obvious from the overall times;

           4.7.0-rc4   4.7.0-rc4
        mmotm-20160623nodelru-v8
User          189.19      191.80
System       2604.45     2533.56
Elapsed      2855.30     2786.39

The vmstats also showed that the fair zone allocation policy was definitely
removed as can be seen here;

                             4.7.0-rc3   4.7.0-rc3
                         mmotm-20160623 nodelru-v8
DMA32 allocs               28794729769           0
Normal allocs              48432501431 77227309877
Movable allocs                       0           0

tiobench on ext4
----------------

tiobench is a benchmark that artifically benefits if old pages remain resident
while new pages get reclaimed. The fair zone allocation policy mitigates this
problem so pages age fairly. While the benchmark has problems, it is important
that tiobench performance remains constant as it implies that page aging
problems that the fair zone allocation policy fixes are not re-introduced.

                                         4.7.0-rc4             4.7.0-rc4
                                    mmotm-20160623            nodelru-v9
Min      PotentialReadSpeed        89.65 (  0.00%)       90.21 (  0.62%)
Min      SeqRead-MB/sec-1          82.68 (  0.00%)       82.01 ( -0.81%)
Min      SeqRead-MB/sec-2          72.76 (  0.00%)       72.07 ( -0.95%)
Min      SeqRead-MB/sec-4          75.13 (  0.00%)       74.92 ( -0.28%)
Min      SeqRead-MB/sec-8          64.91 (  0.00%)       65.19 (  0.43%)
Min      SeqRead-MB/sec-16         62.24 (  0.00%)       62.22 ( -0.03%)
Min      RandRead-MB/sec-1          0.88 (  0.00%)        0.88 (  0.00%)
Min      RandRead-MB/sec-2          0.95 (  0.00%)        0.92 ( -3.16%)
Min      RandRead-MB/sec-4          1.43 (  0.00%)        1.34 ( -6.29%)
Min      RandRead-MB/sec-8          1.61 (  0.00%)        1.60 ( -0.62%)
Min      RandRead-MB/sec-16         1.80 (  0.00%)        1.90 (  5.56%)
Min      SeqWrite-MB/sec-1         76.41 (  0.00%)       76.85 (  0.58%)
Min      SeqWrite-MB/sec-2         74.11 (  0.00%)       73.54 ( -0.77%)
Min      SeqWrite-MB/sec-4         80.05 (  0.00%)       80.13 (  0.10%)
Min      SeqWrite-MB/sec-8         72.88 (  0.00%)       73.20 (  0.44%)
Min      SeqWrite-MB/sec-16        75.91 (  0.00%)       76.44 (  0.70%)
Min      RandWrite-MB/sec-1         1.18 (  0.00%)        1.14 ( -3.39%)
Min      RandWrite-MB/sec-2         1.02 (  0.00%)        1.03 (  0.98%)
Min      RandWrite-MB/sec-4         1.05 (  0.00%)        0.98 ( -6.67%)
Min      RandWrite-MB/sec-8         0.89 (  0.00%)        0.92 (  3.37%)
Min      RandWrite-MB/sec-16        0.92 (  0.00%)        0.93 (  1.09%)

           4.7.0-rc4   4.7.0-rc4
        mmotm-20160623 approx-v9
User          645.72      525.90
System        403.85      331.75
Elapsed      6795.36     6783.67

This shows that the series has little or not impact on tiobench which is
desirable and a reduction in system CPU usage. It indicates that the fair
zone allocation policy was removed in a manner that didn't reintroduce
one class of page aging bug. There were only minor differences in overall
reclaim activity

                             4.7.0-rc4   4.7.0-rc4
                          mmotm-20160623nodelru-v8
Minor Faults                    645838      647465
Major Faults                       573         640
Swap Ins                             0           0
Swap Outs                            0           0
DMA allocs                           0           0
DMA32 allocs                  46041453    44190646
Normal allocs                 78053072    79887245
Movable allocs                       0           0
Allocation stalls                   24          67
Stall zone DMA                       0           0
Stall zone DMA32                     0           0
Stall zone Normal                    0           2
Stall zone HighMem                   0           0
Stall zone Movable                   0          65
Direct pages scanned             10969       30609
Kswapd pages scanned          93375144    93492094
Kswapd pages reclaimed        93372243    93489370
Direct pages reclaimed           10969       30609
Kswapd efficiency                  99%         99%
Kswapd velocity              13741.015   13781.934
Direct efficiency                 100%        100%
Direct velocity                  1.614       4.512
Percentage direct scans             0%          0%

kswapd activity was roughly comparable. There were differences in direct
reclaim activity but negligible in the context of the overall workload
(velocity of 4 pages per second with the patches applied, 1.6 pages per
second in the baseline kernel).

pgbench read-only large configuration on ext4
---------------------------------------------

pgbench is a database benchmark that can be sensitive to page reclaim
decisions. This also checks if removing the fair zone allocation policy
is safe

pgbench Transactions
                        4.7.0-rc4             4.7.0-rc4
                   mmotm-20160623            nodelru-v8
Hmean    1       188.26 (  0.00%)      189.78 (  0.81%)
Hmean    5       330.66 (  0.00%)      328.69 ( -0.59%)
Hmean    12      370.32 (  0.00%)      380.72 (  2.81%)
Hmean    21      368.89 (  0.00%)      369.00 (  0.03%)
Hmean    30      382.14 (  0.00%)      360.89 ( -5.56%)
Hmean    32      428.87 (  0.00%)      432.96 (  0.95%)

Negligible differences again. As with tiobench, overall reclaim activity
was comparable.

bonnie++ on ext4
----------------

No interesting performance difference, negligible differences on reclaim
stats.

paralleldd on ext4
------------------

This workload uses varying numbers of dd instances to read large amounts of
data from disk.

                               4.7.0-rc3             4.7.0-rc3
                          mmotm-20160623            nodelru-v9
Amean    Elapsd-1       186.04 (  0.00%)      189.41 ( -1.82%)
Amean    Elapsd-3       192.27 (  0.00%)      191.38 (  0.46%)
Amean    Elapsd-5       185.21 (  0.00%)      182.75 (  1.33%)
Amean    Elapsd-7       183.71 (  0.00%)      182.11 (  0.87%)
Amean    Elapsd-12      180.96 (  0.00%)      181.58 ( -0.35%)
Amean    Elapsd-16      181.36 (  0.00%)      183.72 ( -1.30%)

           4.7.0-rc4   4.7.0-rc4
        mmotm-20160623 nodelru-v9
User         1548.01     1552.44
System       8609.71     8515.08
Elapsed      3587.10     3594.54

There is little or no change in performance but some drop in system CPU usage.

                             4.7.0-rc3   4.7.0-rc3
                        mmotm-20160623  nodelru-v9
Minor Faults                    362662      367360
Major Faults                      1204        1143
Swap Ins                            22           0
Swap Outs                         2855        1029
DMA allocs                           0           0
DMA32 allocs                  31409797    28837521
Normal allocs                 46611853    49231282
Movable allocs                       0           0
Direct pages scanned                 0           0
Kswapd pages scanned          40845270    40869088
Kswapd pages reclaimed        40830976    40855294
Direct pages reclaimed               0           0
Kswapd efficiency                  99%         99%
Kswapd velocity              11386.711   11369.769
Direct efficiency                 100%        100%
Direct velocity                  0.000       0.000
Percentage direct scans             0%          0%
Page writes by reclaim            2855        1029
Page writes file                     0           0
Page writes anon                  2855        1029
Page reclaim immediate             771        1628
Sector Reads                 293312636   293536360
Sector Writes                 18213568    18186480
Page rescued immediate               0           0
Slabs scanned                   128257      132747
Direct inode steals                181          56
Kswapd inode steals                 59        1131

It basically shows that kswapd was active at roughly the same rate in
both kernels. There was also comparable slab scanning activity and direct
reclaim was avoided in both cases. There appears to be a large difference
in numbers of inodes reclaimed but the workload has few active inodes and
is likely a timing artifact.

stutter
-------

stutter simulates a simple workload. One part uses a lot of anonymous
memory, a second measures mmap latency and a third copies a large file.
The primary metric is checking for mmap latency.

stutter
                             4.7.0-rc4             4.7.0-rc4
                        mmotm-20160623            nodelru-v8
Min         mmap     16.6283 (  0.00%)     13.4258 ( 19.26%)
1st-qrtle   mmap     54.7570 (  0.00%)     34.9121 ( 36.24%)
2nd-qrtle   mmap     57.3163 (  0.00%)     46.1147 ( 19.54%)
3rd-qrtle   mmap     58.9976 (  0.00%)     47.1882 ( 20.02%)
Max-90%     mmap     59.7433 (  0.00%)     47.4453 ( 20.58%)
Max-93%     mmap     60.1298 (  0.00%)     47.6037 ( 20.83%)
Max-95%     mmap     73.4112 (  0.00%)     82.8719 (-12.89%)
Max-99%     mmap     92.8542 (  0.00%)     88.8870 (  4.27%)
Max         mmap   1440.6569 (  0.00%)    121.4201 ( 91.57%)
Mean        mmap     59.3493 (  0.00%)     42.2991 ( 28.73%)
Best99%Mean mmap     57.2121 (  0.00%)     41.8207 ( 26.90%)
Best95%Mean mmap     55.9113 (  0.00%)     39.9620 ( 28.53%)
Best90%Mean mmap     55.6199 (  0.00%)     39.3124 ( 29.32%)
Best50%Mean mmap     53.2183 (  0.00%)     33.1307 ( 37.75%)
Best10%Mean mmap     45.9842 (  0.00%)     20.4040 ( 55.63%)
Best5%Mean  mmap     43.2256 (  0.00%)     17.9654 ( 58.44%)
Best1%Mean  mmap     32.9388 (  0.00%)     16.6875 ( 49.34%)

This shows a number of improvements with the worst-case outlier greatly
improved.

Some of the vmstats are interesting

                             4.7.0-rc4   4.7.0-rc4
                          mmotm-20160623nodelru-v8
Swap Ins                           163         502
Swap Outs                            0           0
DMA allocs                           0           0
DMA32 allocs                 618719206  1381662383
Normal allocs                891235743   564138421
Movable allocs                       0           0
Allocation stalls                 2603           1
Direct pages scanned            216787           2
Kswapd pages scanned          50719775    41778378
Kswapd pages reclaimed        41541765    41777639
Direct pages reclaimed          209159           0
Kswapd efficiency                  81%         99%
Kswapd velocity              16859.554   14329.059
Direct efficiency                  96%          0%
Direct velocity                 72.061       0.001
Percentage direct scans             0%          0%
Page writes by reclaim         6215049           0
Page writes file               6215049           0
Page writes anon                     0           0
Page reclaim immediate           70673          90
Sector Reads                  81940800    81680456
Sector Writes                100158984    98816036
Page rescued immediate               0           0
Slabs scanned                  1366954       22683

While this is not guaranteed in all cases, this particular test showed
a large reduction in direct reclaim activity. It's also worth noting
that no page writes were issued from reclaim context.

This series is not without its hazards. There are at least three areas
that I'm concerned with even though I could not reproduce any problems in
that area.

1. Reclaim/compaction is going to be affected because the amount of reclaim is
   no longer targetted at a specific zone. Compaction works on a per-zone basis
   so there is no guarantee that reclaiming a few THP's worth page pages will
   have a positive impact on compaction success rates.

2. The Slab/LRU reclaim ratio is affected because the frequency the shrinkers
   are called is now different. This may or may not be a problem but if it
   is, it'll be because shrinkers are not called enough and some balancing
   is required.

3. The anon/file reclaim ratio may be affected. Pages about to be dirtied are
   distributed between zones and the fair zone allocation policy used to do
   something very similar for anon. The distribution is now different but not
   necessarily in any way that matters but it's still worth bearing in mind.

VM statistic counters for reclaim decisions are zone-based.  If the kernel
is to reclaim on a per-node basis then we need to track per-node
statistics but there is no infrastructure for that.  The most notable
change is that the old node_page_state is renamed to
sum_zone_node_page_state.  The new node_page_state takes a pglist_data and
uses per-node stats but none exist yet.  There is some renaming such as
vm_stat to vm_zone_stat and the addition of vm_node_stat and the renaming
of mod_state to mod_zone_state.  Otherwise, this is mostly a mechanical
patch with no functional change.  There is a lot of similarity between the
node and zone helpers which is unfortunate but there was no obvious way of
reusing the code and maintaining type safety.

Link: http://lkml.kernel.org/r/1467970510-21195-2-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

75ef7184

27 7月, 2016 1 次提交

mm, rmap: account shmem thp pages · 65c45377

由 Kirill A. Shutemov 提交于 7月 26, 2016

Let's add ShmemHugePages and ShmemPmdMapped fields into meminfo and
smaps. It indicates how many times we allocate and map shmem THP.

NR_ANON_TRANSPARENT_HUGEPAGES is renamed to NR_ANON_THPS.

Link: http://lkml.kernel.org/r/1466021202-61880-27-git-send-email-kirill.shutemov@linux.intel.comSigned-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

65c45377

14 10月, 2015 1 次提交

Revert "mm: Check if section present during memory block (un)registering" · 8346aa76

由 Greg Kroah-Hartman 提交于 10月 13, 2015

This reverts commit 7568fb63 as it's
already in Linus's tree through a different patch.
Reported-by: NTony Luck <tony.luck@intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: stable@vger.kernel.org #v3.15
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

8346aa76

05 10月, 2015 1 次提交

mm: Check if section present during memory block (un)registering · 7568fb63

由 Yinghai Lu 提交于 8月 25, 2015

Tony found on his setup, if memory block size 512M will cause crash
during booting.

 BUG: unable to handle kernel paging request at ffffea0074000020
 IP: [<ffffffff81670527>] get_nid_for_pfn+0x17/0x40
 PGD 128ffcb067 PUD 128ffc9067 PMD 0
 Oops: 0000 [#1] SMP
 Modules linked in:
 CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.2.0-rc8 #1
...
 Call Trace:
  [<ffffffff81453b56>] ? register_mem_sect_under_node+0x66/0xe0
  [<ffffffff81453eeb>] register_one_node+0x17b/0x240
  [<ffffffff81b1f1ed>] ? pci_iommu_alloc+0x6e/0x6e
  [<ffffffff81b1f229>] topology_init+0x3c/0x95
  [<ffffffff8100213d>] do_one_initcall+0xcd/0x1f0

The system has non continuous RAM address:
 BIOS-e820: [mem 0x0000001300000000-0x0000001cffffffff] usable
 BIOS-e820: [mem 0x0000001d70000000-0x0000001ec7ffefff] usable
 BIOS-e820: [mem 0x0000001f00000000-0x0000002bffffffff] usable
 BIOS-e820: [mem 0x0000002c18000000-0x0000002d6fffefff] usable
 BIOS-e820: [mem 0x0000002e00000000-0x00000039ffffffff] usable

So there are start sections in memory block not present.
For example:
memory block : [0x2c18000000, 0x2c20000000) 512M
first three sections are not present.

Current register_mem_sect_under_node() assume first section is present,
but memory block section number range [start_section_nr, end_section_nr]
would include not present section.

For arch that support vmemmap, we don't setup memmap for struct page area
within not present sections area.

So skip the pfn range that belong to absent section.

Also fixes unregister_mem_sect_under_nodes() that assume one section per
memory block.
Reported-by: NTony Luck <tony.luck@intel.com>
Tested-by: NTony Luck <tony.luck@intel.com>
Fixes: bdee237c ("x86: mm: Use 2GB memory block size on large memory x86-64 systems")
Fixes: 982792c7 ("x86, mm: probe memory block size for generic x86 64bit")
Signed-off-by: NYinghai Lu <yinghai@kernel.org>
Cc: stable@vger.kernel.org #v3.15
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

7568fb63

05 9月, 2015 1 次提交

mm: check if section present during memory block registering · 04697858

由 Yinghai Lu 提交于 9月 04, 2015

Tony Luck found on his setup, if memory block size 512M will cause crash
during booting.

  BUG: unable to handle kernel paging request at ffffea0074000020
  IP: get_nid_for_pfn+0x17/0x40
  PGD 128ffcb067 PUD 128ffc9067 PMD 0
  Oops: 0000 [#1] SMP
  Modules linked in:
  CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.2.0-rc8 #1
  ...
  Call Trace:
     ? register_mem_sect_under_node+0x66/0xe0
     register_one_node+0x17b/0x240
     ? pci_iommu_alloc+0x6e/0x6e
     topology_init+0x3c/0x95
     do_one_initcall+0xcd/0x1f0

The system has non continuous RAM address:
 BIOS-e820: [mem 0x0000001300000000-0x0000001cffffffff] usable
 BIOS-e820: [mem 0x0000001d70000000-0x0000001ec7ffefff] usable
 BIOS-e820: [mem 0x0000001f00000000-0x0000002bffffffff] usable
 BIOS-e820: [mem 0x0000002c18000000-0x0000002d6fffefff] usable
 BIOS-e820: [mem 0x0000002e00000000-0x00000039ffffffff] usable

So there are start sections in memory block not present.  For example:

    memory block : [0x2c18000000, 0x2c20000000) 512M

first three sections are not present.

The current register_mem_sect_under_node() assume first section is
present, but memory block section number range [start_section_nr,
end_section_nr] would include not present section.

For arch that support vmemmap, we don't setup memmap for struct page
area within not present sections area.

So skip the pfn range that belong to absent section.

[akpm@linux-foundation.org: simplification]
[rientjes@google.com: more simplification]
Fixes: bdee237c ("x86: mm: Use 2GB memory block size on large memory x86-64 systems")
Fixes: 982792c7 ("x86, mm: probe memory block size for generic x86 64bit")
Signed-off-by: NYinghai Lu <yinghai@kernel.org>
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Reported-by: NTony Luck <tony.luck@intel.com>
Tested-by: NTony Luck <tony.luck@intel.com>
Cc: Greg KH <greg@kroah.com>
Cc: Ingo Molnar <mingo@elte.hu>
Tested-by: NDavid Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org>	[3.15+]
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

04697858

01 7月, 2015 1 次提交

mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set · 3a80a7fa

由 Mel Gorman 提交于 6月 30, 2015

This patch initalises all low memory struct pages and 2G of the highest
zone on each node during memory initialisation if
CONFIG_DEFERRED_STRUCT_PAGE_INIT is set.  That config option cannot be set
but will be available in a later patch.  Parallel initialisation of struct
page depends on some features from memory hotplug and it is necessary to
alter alter section annotations.
Signed-off-by: NMel Gorman <mgorman@suse.de>
Tested-by: NNate Zimmer <nzimmer@sgi.com>
Tested-by: NWaiman Long <waiman.long@hp.com>
Tested-by: NDaniel J Blueman <daniel@numascale.com>
Acked-by: NPekka Enberg <penberg@kernel.org>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Nate Zimmer <nzimmer@sgi.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3a80a7fa

25 3月, 2015 3 次提交

drivers: base: node: Delete space after pointer declaration · 518d3f38

由 Ana Nedelcu 提交于 3月 08, 2015

This patch fixes the following error found by checkpatch.pl:
ERROR: "foo * bar" should be "foo *bar"
Signed-off-by: NAna Nedelcu <anafnedelcu@gmail.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

518d3f38

drivers/base/node: clean up attribute group conversion · 7ca7ec40

由 Greg Kroah-Hartman 提交于 3月 25, 2015

We can use the ATTRIBUTE_GROUPS() macro here, so use it, saving some
lines of code.

Cc: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org

7ca7ec40

drivers/base/node: Avoid manual device_create_file() calls · 3c9b8aaf

由 Takashi Iwai 提交于 1月 29, 2015

Instead of manual calls of multiple device_create_file() and
device_remove_file(), use the static attribute groups assigned to the
new device.  This also fixes the possible races, too.
Signed-off-by: NTakashi Iwai <tiwai@suse.de>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

3c9b8aaf

14 2月, 2015 1 次提交

drivers/base: use %*pb[l] to print bitmaps including cpumasks and nodemasks · f799b1a7

由 Tejun Heo 提交于 2月 13, 2015

printk and friends can now format bitmaps using '%*pb[l]'.  cpumask
and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
respectively which can be used to generate the two printf arguments
necessary to format the specified cpu/nodemask.

* Line termination only requires one extra space at the end of the
  buffer.  Use PAGE_SIZE - 1 instead of PAGE_SIZE - 2 when formatting.
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f799b1a7

08 11月, 2014 1 次提交

cpumask: factor out show_cpumap into separate helper function · 5aaba363

由 Sudeep Holla 提交于 9月 30, 2014

Many sysfs *_show function use cpu{list,mask}_scnprintf to copy cpumap
to the buffer aligned to PAGE_SIZE, append '\n' and '\0' to return null
terminated buffer with newline.

This patch creates a new helper function cpumap_print_to_pagebuf in
cpumask.h using newly added bitmap_print_to_pagebuf and consolidates
most of those sysfs functions using the new helper function.
Signed-off-by: NSudeep Holla <sudeep.holla@arm.com>
Suggested-by: NStephen Boyd <sboyd@codeaurora.org>
Tested-by: NStephen Boyd <sboyd@codeaurora.org>
Acked-by: N"Rafael J. Wysocki" <rjw@rjwysocki.net>
Acked-by: NBjorn Helgaas <bhelgaas@google.com>
Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: x86@kernel.org
Cc: linux-acpi@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

5aaba363

10 10月, 2014 1 次提交

mm: remove noisy remainder of the scan_unevictable interface · 1f13ae39

由 Johannes Weiner 提交于 10月 09, 2014

The deprecation warnings for the scan_unevictable interface triggers by
scripts doing `sysctl -a | grep something else'.  This is annoying and not
helpful.

The interface has been defunct since 264e56d8 ("mm: disable user
interface to manually rescue unevictable pages"), which was in 2011, and
there haven't been any reports of usecases for it, only reports that the
deprecation warnings are annying.  It's unlikely that anybody is using
this interface specifically at this point, so remove it.
Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

1f13ae39

04 10月, 2014 1 次提交

driver/base/node: remove unnecessary kfree of node struct from unregister_one_node · 33ead538

由 Yasuaki Ishimatsu 提交于 10月 03, 2014

Commit 92d585ef ("numa: fix NULL pointer access and memory
leak in unregister_one_node()") added kfree() of node struct in
unregister_one_node(). But node struct is freed by node_device_release()
which is called in  unregister_node(). So by adding the kfree(),
node struct is freed two times.

While hot removing memory, the commit leads the following BUG_ON():

  kernel BUG at mm/slub.c:3346!
  invalid opcode: 0000 [#1] SMP
  [...]
  Call Trace:
   [...] unregister_one_node
   [...] try_offline_node
   [...] remove_memory
   [...] acpi_memory_device_remove
   [...] acpi_bus_trim
   [...] acpi_bus_trim
   [...] acpi_device_hotplug
   [...] acpi_hotplug_work_fn
   [...] process_one_work
   [...] worker_thread
   [...] ? rescuer_thread
   [...] kthread
   [...] ? kthread_create_on_node
   [...] ret_from_fork
   [...] ? kthread_create_on_node

This patch removes unnecessary kfree() from unregister_one_node().
Signed-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: stable@vger.kernel.org # v3.16+
Fixes: 92d585ef "numa: fix NULL pointer access and memory leak in unregister_one_node()"
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

33ead538

07 8月, 2014 1 次提交

mm: export NR_SHMEM via sysinfo(2) / si_meminfo() interfaces · cc7452b6

由 Rafael Aquini 提交于 8月 06, 2014

Historically, we exported shared pages to userspace via sysinfo(2)
sharedram and /proc/meminfo's "MemShared" fields.  With the advent of
tmpfs, from kernel v2.4 onward, that old way for accounting shared mem
was deemed inaccurate and we started to export a hard-coded 0 for
sysinfo.sharedram.  Later on, during the 2.6 timeframe, "MemShared" got
re-introduced to /proc/meminfo re-branded as "Shmem", but we're still
reporting sysinfo.sharedmem as that old hard-coded zero, which makes the
"shared memory" report inconsistent across interfaces.

This patch leverages the addition of explicit accounting for pages used
by shmem/tmpfs -- "4b02108a mm: oom analysis: add shmem vmstat" -- in
order to make the users of sysinfo(2) and si_meminfo*() friends aware of
that vmstat entry and make them report it consistently across the
interfaces, as well to make sysinfo(2) returned data consistent with our
current API documentation states.
Signed-off-by: NRafael Aquini <aquini@redhat.com>
Acked-by: NRik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

cc7452b6

09 3月, 2014 1 次提交

numa: fix NULL pointer access and memory leak in unregister_one_node() · 92d585ef

由 Xishi Qiu 提交于 3月 06, 2014

When doing socket hot remove, "node_devices[nid]" is set to NULL;
acpi_processor_remove()
	try_offline_node()
		unregister_one_node()

Then hot add a socket, but do not echo 1 > /sys/devices/system/cpu/cpuXX/online,
so register_one_node() will not be called, and "node_devices[nid]"
is still NULL.

If doing socket hot remove again, NULL pointer access will be happen.
unregister_one_node()
	unregister_node()

Another, we should free the memory used by "node_devices[nid]" in
unregister_one_node().
Signed-off-by: NXishi Qiu <qiuxishi@huawei.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

92d585ef

13 9月, 2013 1 次提交

thp: account anon transparent huge pages into NR_ANON_PAGES · 3cd14fcd

由 Kirill A. Shutemov 提交于 9月 12, 2013

We use NR_ANON_PAGES as base for reporting AnonPages to user.  There's
not much sense in not accounting transparent huge pages there, but add
them on printing to user.

Let's account transparent huge pages in NR_ANON_PAGES in the first place.
Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: NDave Hansen <dave.hansen@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Ning Qu <quning@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3cd14fcd

30 4月, 2013 1 次提交

drivers/base/node.c: switch to register_hotmemory_notifier() · 6e259e7d

由 Andrew Morton 提交于 4月 29, 2013

Squishes a warning which my change to hotplug_memory_notifier() added.

I want to keep that warning, because it is punishment for failnig to check
the hotplug_memory_notifier() return value.

Cc: Greg KH <greg@kroah.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

6e259e7d

13 12月, 2012 2 次提交

numa: add CONFIG_MOVABLE_NODE for movable-dedicated node · 20b2f52b

由 Lai Jiangshan 提交于 12月 12, 2012

We need a node which only contains movable memory.  This feature is very
important for node hotplug.  If a node has normal/highmem, the memory may
be used by the kernel and can't be offlined.  If the node only contains
movable memory, we can offline the memory and the node.

All are prepared, we can actually introduce N_MEMORY.
add CONFIG_MOVABLE_NODE make we can use it for movable-dedicated node

[akpm@linux-foundation.org: fix Kconfig text]
Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
Tested-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: NWen Congyang <wency@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

20b2f52b

hugetlb: use N_MEMORY instead N_HIGH_MEMORY · 8cebfcd0

由 Lai Jiangshan 提交于 12月 12, 2012

N_HIGH_MEMORY stands for the nodes that has normal or high memory.
N_MEMORY stands for the nodes that has any memory.

The code here need to handle with the nodes which have memory, we should
use N_MEMORY instead.
Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: NHillf Danton <dhillf@gmail.com>
Signed-off-by: NWen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8cebfcd0

12 12月, 2012 4 次提交

drivers/base/node.c: cleanup node_state_attr[] · fcf07d22

由 Lai Jiangshan 提交于 12月 11, 2012

use [index] = init_value
use N_xxxxx instead of hardcode.

Make it more readability and easier to add new state.
Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: NWen Congyang <wency@cn.fujitsu.com>
Acked-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

fcf07d22

mm: cleanup register_node() · fa264375

由 Yasuaki Ishimatsu 提交于 12月 11, 2012

register_node() is defined as extern in include/linux/node.h.  But the
function is only called from register_one_node() in driver/base/node.c.

So the patch defines register_node() as static.
Signed-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Acked-by: NDavid Rientjes <rientjes@google.com>
Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

fa264375

memory-hotplug: suppress "Device nodeX does not have a release() function" warning · 8c7b5b4e

由 Yasuaki Ishimatsu 提交于 12月 11, 2012

When calling unregister_node(), the function shows following message at
device_release().

"Device 'node2' does not have a release() function, it is broken and must
be fixed."

The reason is node's device struct does not have a release() function.

So the patch registers node_device_release() to the device's release()
function for suppressing the warning message.  Additionally, the patch
adds memset() to initialize a node struct into register_node().  Because
the node struct is part of node_devices[] array and it cannot be freed by
node_device_release().  So if system reuses the node struct, it has a
garbage.
Signed-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: NWen Congyang <wency@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8c7b5b4e

numa: convert static memory to dynamically allocated memory for per node device · 8732794b

由 Wen Congyang 提交于 12月 11, 2012

We use a static array to store struct node.  In many cases, we don't have
too many nodes, and some memory will be unused.  Convert it to per-device
dynamically allocated memory.
Signed-off-by: NWen Congyang <wency@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8732794b

30 5月, 2012 1 次提交

mm: fix off-by-one bug in print_nodes_state() · f6238818

由 Ryota Ozaki 提交于 5月 29, 2012

/sys/devices/system/node/{online,possible} outputs a garbage byte
because print_nodes_state() returns content size + 1.  To fix the bug,
the patch changes the use of cpuset_sprintf_cpulist to follow the use at
other places, which is clearer and safer.

This bug was introduced in v2.6.24 (commit bde631a5: "mm: add node
states sysfs class attributeS").
Signed-off-by: NRyota Ozaki <ozaki.ryota@gmail.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f6238818

03 2月, 2012 1 次提交

drivers/base/memory.c: fix memory_dev_init() long delay · 321bf4ed

由 Yinghai Lu 提交于 1月 30, 2012

One system with 2048g ram, reported soft lockup on recent kernel.

[   34.426749] cpu_dev_init done
[   61.166399] BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]
[   61.166733] Modules linked in:
[   61.166904] irq event stamp: 1935610
[   61.178431] hardirqs last  enabled at (1935609): [<ffffffff81ce8c05>] mutex_lock_nested+0x299/0x2b4
[   61.178923] hardirqs last disabled at (1935610): [<ffffffff81cf2bab>] apic_timer_interrupt+0x6b/0x80
[   61.198767] softirqs last  enabled at (1935476): [<ffffffff8106e59c>] __do_softirq+0x195/0x1ab
[   61.218604] softirqs last disabled at (1935471): [<ffffffff81cf359c>] call_softirq+0x1c/0x30
[   61.238408] CPU 0
[   61.238549] Modules linked in:
[   61.238744]
[   61.238825] Pid: 1, comm: swapper/0 Not tainted 3.3.0-rc1-tip-yh-02076-g962f689-dirty #171
[   61.278212] RIP: 0010:[<ffffffff810b3e3a>]  [<ffffffff810b3e3a>] lock_release+0x90/0x9c
[   61.278627] RSP: 0018:ffff883f64dbfd70  EFLAGS: 00000246
[   61.298287] RAX: ffff883f64dc0000 RBX: 0000000000000000 RCX: 000000000000008b
[   61.298690] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[   61.318383] RBP: ffff883f64dbfda0 R08: 0000000000000001 R09: 000000000000008b
[   61.338215] R10: 0000000000000000 R11: 0000000000000000 R12: ffff883f64dbfd10
[   61.338610] R13: ffff883f64dc0708 R14: ffff883f64dc0708 R15: ffffffff81095657
[   61.358299] FS:  0000000000000000(0000) GS:ffff883f7d600000(0000) knlGS:0000000000000000
[   61.378118] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   61.378450] CR2: 0000000000000000 CR3: 00000000024af000 CR4: 00000000000007f0
[   61.398144] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   61.417918] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   61.418260] Process swapper/0 (pid: 1, threadinfo ffff883f64dbe000, task ffff883f64dc0000)
[   61.445358] Stack:
[   61.445511]  0000000000000002 ffff897f649ba168 ffff883f64dbfe10 ffff88ff64bb57a8
[   61.458040]  0000000000000000 0000000000000000 ffff883f64dbfdc0 ffffffff81ceb1b4
[   61.458491]  000000000011608c ffff88ff64bb58a8 ffff883f64dbfdf0 ffffffff81c57638
[   61.478215] Call Trace:
[   61.478367]  [<ffffffff81ceb1b4>] _raw_spin_unlock+0x21/0x2e
[   61.497994]  [<ffffffff81c57638>] klist_next+0x9e/0xbc
[   61.498264]  [<ffffffff8148ba99>] next_device+0xe/0x1e
[   61.517867]  [<ffffffff8148c0cc>] subsys_find_device_by_id+0xb7/0xd6
[   61.518197]  [<ffffffff81498846>] find_memory_block_hinted+0x3d/0x66
[   61.537927]  [<ffffffff8149887f>] find_memory_block+0x10/0x12
[   61.538193]  [<ffffffff814988b6>] add_memory_section+0x35/0x9e
[   61.557932]  [<ffffffff827fecef>] memory_dev_init+0x68/0xda
[   61.558227]  [<ffffffff827fec01>] driver_init+0x97/0xa7
[   61.577853]  [<ffffffff827cdf3c>] kernel_init+0xf6/0x1c0
[   61.578140]  [<ffffffff81cf34a4>] kernel_thread_helper+0x4/0x10
[   61.597850]  [<ffffffff81ceb59d>] ? retint_restore_args+0xe/0xe
[   61.598144]  [<ffffffff827cde46>] ? start_kernel+0x3ab/0x3ab
[   61.617826]  [<ffffffff81cf34a0>] ? gs_change+0xb/0xb
[   61.618060] Code: 10 48 83 3b 00 eb e8 4c 89 f2 44 89 fe 4c 89 ef e8 e1 fe ff ff 65 48 8b 04 25 40 bc 00 00 c7 80 cc 06 00 00 00 00 00 00 41 54 9d <5e> 5b 41 5c 41 5d 41 5e 41 5f 5d c3 55 48 89 e5 41 57 41 89 cf
[   89.285380] memory_dev_init done

Finally it takes about 55s to create 16400 memory entries.

Root cause: for x86_64, 2048g (with 2g hole at [2g,4g), and TOP2 will be 2050g), will have 16400 memory block.

find_memory_block/subsys_find_device_by_id will be expensive with that many entries.

Actually, we don't need to find that memory block for BOOT path.

Skip that finding make it get back to normal.

[   34.466696] cpu_dev_init done
[   35.290080] memory_dev_init done

Also solved the delay with topology_init when sections_per_block is not 1.
Signed-off-by: NYinghai Lu <yinghai@kernel.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Nathan Fontenot <nfont@austin.ibm.com>
Cc: Robin Holt <holt@sgi.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

321bf4ed

22 12月, 2011 2 次提交

convert 'memory' sysdev_class to a regular subsystem · 10fbcf4c

由 Kay Sievers 提交于 12月 21, 2011

This moves the 'memory sysdev_class' over to a regular 'memory' subsystem
and converts the devices to regular devices. The sysdev drivers are
implemented as subsystem interfaces now.

After all sysdev classes are ported to regular driver core entities, the
sysdev implementation will be entirely removed from the kernel.
Signed-off-by: NKay Sievers <kay.sievers@vrfy.org>
Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>

10fbcf4c

cpu: convert 'cpu' and 'machinecheck' sysdev_class to a regular subsystem · 8a25a2fd

由 Kay Sievers 提交于 12月 21, 2011

This moves the 'cpu sysdev_class' over to a regular 'cpu' subsystem
and converts the devices to regular devices. The sysdev drivers are
implemented as subsystem interfaces now.

After all sysdev classes are ported to regular driver core entities, the
sysdev implementation will be entirely removed from the kernel.

Userspace relies on events and generic sysfs subsystem infrastructure
from sysdev devices, which are made available with this conversion.

Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Borislav Petkov <bp@amd64.org>
Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
Cc: Len Brown <lenb@kernel.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: NKay Sievers <kay.sievers@vrfy.org>
Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>

8a25a2fd

19 11月, 2011 1 次提交

drivers/base/node.c: fix compilation error with older versions of gcc · 91a13c28

由 Claudio Scordino 提交于 11月 17, 2011

Patch to fix the error message "directives may not be used inside a macro
argument" which appears when the kernel is compiled for the cris architecture.
Signed-off-by: NClaudio Scordino <claudio@evidence.eu.com>
Acked-by: NDavid Rientjes <rientjes@google.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>

91a13c28

25 5月, 2011 1 次提交

mm: per-node vmstat: show proper vmstats · fa25c503

由 KOSAKI Motohiro 提交于 5月 24, 2011

commit 2ac39037 ("writeback: add
/sys/devices/system/node/<node>/vmstat") added vmstat entry.  But
strangely it only show nr_written and nr_dirtied.

        # cat /sys/devices/system/node/node20/vmstat
        nr_written 0
        nr_dirtied 0

Of course, It's not adequate.  With this patch, the vmstat show all vm
stastics as /proc/vmstat.

        # cat /sys/devices/system/node/node0/vmstat
	nr_free_pages 899224
	nr_inactive_anon 201
	nr_active_anon 17380
	nr_inactive_file 31572
	nr_active_file 28277
	nr_unevictable 0
	nr_mlock 0
	nr_anon_pages 17321
	nr_mapped 8640
	nr_file_pages 60107
	nr_dirty 33
	nr_writeback 0
	nr_slab_reclaimable 6850
	nr_slab_unreclaimable 7604
	nr_page_table_pages 3105
	nr_kernel_stack 175
	nr_unstable 0
	nr_bounce 0
	nr_vmscan_write 0
	nr_writeback_temp 0
	nr_isolated_anon 0
	nr_isolated_file 0
	nr_shmem 260
	nr_dirtied 1050
	nr_written 938
	numa_hit 962872
	numa_miss 0
	numa_foreign 0
	numa_interleave 8617
	numa_local 962872
	numa_other 0
	nr_anon_transparent_hugepages 0

[akpm@linux-foundation.org: no externs in .c files]
Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Acked-by: NDavid Rientjes <rientjes@google.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

fa25c503

04 2月, 2011 1 次提交

memory hotplug: Update phys_index to [start|end]_section_nr · d3360164

由 Nathan Fontenot 提交于 1月 20, 2011

Update the 'phys_index' property of a the memory_block struct to be
called start_section_nr, and add a end_section_nr property.  The
data tracked here is the same but the updated naming is more in line
with what is stored here, namely the first and last section number
that the memory block spans.

The names presented to userspace remain the same, phys_index for
start_section_nr and end_phys_index for end_section_nr, to avoid breaking
anything in userspace.

This also updates the node sysfs code to be aware of the new capability for
a memory block to contain multiple memory sections and be aware of the memory
block structure name changes (start_section_nr).  This requires an additional
parameter to unregister_mem_sect_under_nodes so that we know which memory
section of the memory block to unregister.
Signed-off-by: NNathan Fontenot <nfont@austin.ibm.com>
Reviewed-by: NRobin Holt <holt@sgi.com>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>

d3360164

14 1月, 2011 1 次提交

thp: transparent hugepage sysfs meminfo · 05b258e9

由 David Rientjes 提交于 1月 13, 2011

Add hugepage statistics to per-node sysfs meminfo
Reviewed-by: NRik van Riel <riel@redhat.com>
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

05b258e9

27 10月, 2010 1 次提交

writeback: add /sys/devices/system/node/<node>/vmstat · 2ac39037

由 Michael Rubin 提交于 10月 26, 2010

For NUMA node systems it is important to have visibility in memory
characteristics.  Two of the /proc/vmstat values "nr_written" and
"nr_dirtied" are added here.

	# cat /sys/devices/system/node/node20/vmstat
	nr_written 0
	nr_dirtied 0
Signed-off-by: NMichael Rubin <mrubin@google.com>
Reviewed-by: NWu Fengguang <fengguang.wu@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

2ac39037

openeuler / Kernel 大约 2 年 前同步成功

openeuler / Kernel
大约 2 年前同步成功