1. 12 Jul, 2012 1 commit
    • memory hotplug: fix invalid memory access caused by stale kswapd pointer · d8adde17
      Jiang Liu committed
      kswapd_stop() is called to destroy the kswapd work thread when all memory
      of a NUMA node has been offlined.  But kswapd_stop() only terminates the
      work thread without resetting NODE_DATA(nid)->kswapd to NULL.  The stale
      pointer will prevent kswapd_run() from creating a new work thread when
      adding memory to the memory-less NUMA node again.  Eventually the stale
      pointer may cause invalid memory access.
      
      An example stack dump is shown below.  It was reproduced with 2.6.32, but
      the latest kernel has the same issue.
      
        BUG: unable to handle kernel NULL pointer dereference at (null)
        IP: [<ffffffff81051a94>] exit_creds+0x12/0x78
        PGD 0
        Oops: 0000 [#1] SMP
        last sysfs file: /sys/devices/system/memory/memory391/state
        CPU 11
        Modules linked in: cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq microcode fuse loop dm_mod tpm_tis rtc_cmos i2c_i801 rtc_core tpm serio_raw pcspkr sg tpm_bios igb i2c_core iTCO_wdt rtc_lib mptctl iTCO_vendor_support button dca bnx2 usbhid hid uhci_hcd ehci_hcd usbcore sd_mod crc_t10dif edd ext3 mbcache jbd fan ide_pci_generic ide_core ata_generic ata_piix libata thermal processor thermal_sys hwmon mptsas mptscsih mptbase scsi_transport_sas scsi_mod
        Pid: 7949, comm: sh Not tainted 2.6.32.12-qiuxishi-5-default #92 Tecal RH2285
        RIP: 0010:exit_creds+0x12/0x78
        RSP: 0018:ffff8806044f1d78  EFLAGS: 00010202
        RAX: 0000000000000000 RBX: ffff880604f22140 RCX: 0000000000019502
        RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000
        RBP: ffff880604f22150 R08: 0000000000000000 R09: ffffffff81a4dc10
        R10: 00000000000032a0 R11: ffff880006202500 R12: 0000000000000000
        R13: 0000000000c40000 R14: 0000000000008000 R15: 0000000000000001
        FS:  00007fbc03d066f0(0000) GS:ffff8800282e0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 0000000000000000 CR3: 000000060f029000 CR4: 00000000000006e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Process sh (pid: 7949, threadinfo ffff8806044f0000, task ffff880603d7c600)
        Stack:
         ffff880604f22140 ffffffff8103aac5 ffff880604f22140 ffffffff8104d21e
         ffff880006202500 0000000000008000 0000000000c38000 ffffffff810bd5b1
         0000000000000000 ffff880603d7c600 00000000ffffdd29 0000000000000003
        Call Trace:
          __put_task_struct+0x5d/0x97
          kthread_stop+0x50/0x58
          offline_pages+0x324/0x3da
          memory_block_change_state+0x179/0x1db
          store_mem_state+0x9e/0xbb
          sysfs_write_file+0xd0/0x107
          vfs_write+0xad/0x169
          sys_write+0x45/0x6e
          system_call_fastpath+0x16/0x1b
        Code: ff 4d 00 0f 94 c0 84 c0 74 08 48 89 ef e8 1f fd ff ff 5b 5d 31 c0 41 5c c3 53 48 8b 87 20 06 00 00 48 89 fb 48 8b bf 18 06 00 00 <8b> 00 48 c7 83 18 06 00 00 00 00 00 00 f0 ff 0f 0f 94 c0 84 c0
        RIP  exit_creds+0x12/0x78
         RSP <ffff8806044f1d78>
        CR2: 0000000000000000
      
      [akpm@linux-foundation.org: add pglist_data.kswapd locking comments]
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d8adde17
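The fix described in this commit amounts to clearing the per-node pointer once the worker thread has been stopped. The following is a minimal user-space sketch of that idea, not the kernel code: `struct worker`, `struct node_data`, `node_kswapd_run()` and `node_kswapd_stop()` are hypothetical stand-ins for the kthread handle, `pg_data_t`, `kswapd_run()` and `kswapd_stop()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a kernel thread handle. */
struct worker {
    int nid;
};

/* Hypothetical stand-in for pg_data_t; only the field discussed above. */
struct node_data {
    struct worker *kswapd;
};

/*
 * Start a worker for the node unless one is already recorded.
 * Returns 1 if a worker was created, 0 if one already existed,
 * -1 if allocation failed.
 */
int node_kswapd_run(struct node_data *node, int nid)
{
    assert(node != NULL);
    if (node->kswapd != NULL)
        return 0;   /* a stale pointer would wrongly take this path */
    node->kswapd = malloc(sizeof(*node->kswapd));
    if (node->kswapd == NULL)
        return -1;
    node->kswapd->nid = nid;
    return 1;
}

/*
 * Stop the worker and reset the pointer, so that a later
 * node_kswapd_run() creates a fresh worker and a second stop
 * does not touch freed memory.
 */
void node_kswapd_stop(struct node_data *node)
{
    assert(node != NULL);
    if (node->kswapd != NULL) {
        free(node->kswapd);
        node->kswapd = NULL;
    }
}
```

Without the `node->kswapd = NULL` assignment, the second `node_kswapd_run()` after an offline/online cycle would return 0 and leave the node without a worker, and a second stop would free the same memory again, which mirrors the crash in the stack dump.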
  2. 30 May, 2012 3 commits
  3. 21 May, 2012 2 commits
  4. 15 Apr, 2012 1 commit
  5. 22 Mar, 2012 1 commit
  6. 13 Jan, 2012 3 commits
  7. 11 Jan, 2012 1 commit
    • mm: exclude reserved pages from dirtyable memory · ab8fabd4
      Johannes Weiner committed
      Per-zone dirty limits try to distribute page cache pages allocated for
      writing across zones in proportion to the individual zone sizes, to reduce
      the likelihood of reclaim having to write back individual pages from the
      LRU lists in order to make progress.
      
      This patch:
      
      The amount of dirtyable pages should not include the full number of free
      pages: there are a number of reserved pages that the page allocator and
      kswapd always try to keep free.
      
      The closer (reclaimable pages - dirty pages) is to the number of reserved
      pages, the more likely it becomes for reclaim to run into dirty pages:
      
             +----------+ ---
             |   anon   |  |
             +----------+  |
             |          |  |
             |          |  -- dirty limit new    -- flusher new
             |   file   |  |                     |
             |          |  |                     |
             |          |  -- dirty limit old    -- flusher old
             |          |                        |
             +----------+                       --- reclaim
             | reserved |
             +----------+
             |  kernel  |
             +----------+
      
      This patch introduces a per-zone dirty reserve that takes both the lowmem
      reserve as well as the high watermark of the zone into account, and a
      global sum of those per-zone values that is subtracted from the global
      amount of dirtyable pages.  The lowmem reserve is unavailable to page
      cache allocations and kswapd tries to keep the high watermark free.  We
      don't want to end up in a situation where reclaim has to clean pages in
      order to balance zones.
      
      Not treating reserved pages as dirtyable on a global level is only a
      conceptual fix.  In reality, dirty pages are not distributed equally
      across zones and reclaim runs into dirty pages on a regular basis.
      
      But it is important to get this right before tackling the problem on a
      per-zone level, where the distance between reclaim and the dirty pages is
      mostly much smaller in absolute numbers.
      
      [akpm@linux-foundation.org: fix highmem build]
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ab8fabd4
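The accounting described in this commit can be illustrated with a small sketch. It assumes a simplified, hypothetical `struct zone_info` holding only page counts and a single lowmem reserve value per zone; the helper names are illustrative and are not the kernel's.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified zone: page counts only. */
struct zone_info {
    unsigned long free_pages;
    unsigned long reclaimable_pages;
    unsigned long lowmem_reserve;   /* unavailable to page cache allocations */
    unsigned long high_wmark;       /* pages kswapd tries to keep free */
};

/* Per-zone dirty reserve: lowmem reserve plus the high watermark. */
unsigned long zone_dirty_reserve(const struct zone_info *z)
{
    assert(z != NULL);
    return z->lowmem_reserve + z->high_wmark;
}

/*
 * Global dirtyable pages: free plus reclaimable pages over all zones,
 * minus the sum of the per-zone dirty reserves (clamped at zero).
 */
unsigned long global_dirtyable_pages(const struct zone_info *zones, size_t n)
{
    unsigned long avail = 0;
    unsigned long reserve = 0;

    for (size_t i = 0; i < n; i++) {
        avail += zones[i].free_pages + zones[i].reclaimable_pages;
        reserve += zone_dirty_reserve(&zones[i]);
    }
    return avail > reserve ? avail - reserve : 0;
}
```

With two zones holding 7000 free or reclaimable pages in total and a combined reserve of 600 pages, the sketch reports 6400 dirtyable pages rather than 7000, which moves the dirty limit away from the point where reclaim starts.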
  8. 09 Dec, 2011 1 commit
    • memblock: Kill early_node_map[] · 0ee332c1
      Tejun Heo committed
      Now all ARCH_POPULATES_NODE_MAP archs select HAVE_MEMBLOCK_NODE_MAP -
      there's no user of early_node_map[] left.  Kill early_node_map[] and
      replace ARCH_POPULATES_NODE_MAP with HAVE_MEMBLOCK_NODE_MAP.  Also,
      relocate for_each_mem_pfn_range() and helper from mm.h to memblock.h
      as page_alloc.c would no longer host an alternative implementation.
      
      This change is ultimately one to one mapping and shouldn't cause any
      observable difference; however, after the recent changes, there are
      some functions which now would fit memblock.c better than page_alloc.c
      and dependency on HAVE_MEMBLOCK_NODE_MAP instead of HAVE_MEMBLOCK
      doesn't make much sense on some of them.  Further cleanups for
      functions inside HAVE_MEMBLOCK_NODE_MAP in mm.h would be nice.
      
      -v2: Fix compile bug introduced by mis-spelling
       CONFIG_HAVE_MEMBLOCK_NODE_MAP to CONFIG_MEMBLOCK_HAVE_NODE_MAP in
       mmzone.h.  Reported by Stephen Rothwell.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Chen Liqin <liqin.chen@sunplusct.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      0ee332c1
  9. 01 Nov, 2011 5 commits
    • mm: vmscan: immediately reclaim end-of-LRU dirty pages when writeback completes · 49ea7eb6
      Mel Gorman committed
      When direct reclaim encounters a dirty page, it gets recycled around the
      LRU for another cycle.  This patch marks the page PageReclaim similar to
      deactivate_page() so that the page gets reclaimed almost immediately after
      the page gets cleaned.  This is to avoid reclaiming clean pages that are
      younger than a dirty page encountered at the end of the LRU that might
      have been something like a use-once page.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <jweiner@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      49ea7eb6
    • mm: vmscan: do not writeback filesystem pages in direct reclaim · ee72886d
      Mel Gorman committed
      Testing from the XFS folk revealed that there is still too much I/O from
      the end of the LRU in kswapd.  Previously it was considered acceptable by
      VM people for a small number of pages to be written back from reclaim with
      testing generally showing about 0.3% of pages reclaimed were written back
      (higher if memory was low).  That writing back a small number of pages is
      ok has been heavily disputed for quite some time and Dave Chinner
      explained it well;
      
      	It doesn't have to be a very high number to be a problem. IO
      	is orders of magnitude slower than the CPU time it takes to
      	flush a page, so the cost of making a bad flush decision is
      	very high. And single page writeback from the LRU is almost
      	always a bad flush decision.
      
      To complicate matters, filesystems respond very differently to requests
      from reclaim according to Christoph Hellwig;
      
      	xfs tries to write it back if the requester is kswapd
      	ext4 ignores the request if it's a delayed allocation
      	btrfs ignores the request
      
      As a result, each filesystem has different performance characteristics
      when under memory pressure and there are many pages being dirtied.  In
      some cases, the request is ignored entirely so the VM cannot depend on the
      IO being dispatched.
      
      The objective of this series is to reduce writing of filesystem-backed
      pages from reclaim, play nicely with writeback that is already in progress
      and throttle reclaim appropriately when writeback pages are encountered.
      The assumption is that the flushers will always write pages faster than if
      reclaim issues the IO.
      
      A secondary goal is to avoid the problem whereby direct reclaim splices
      two potentially deep call stacks together.
      
      There is a potential new problem as reclaim has less control over how long
      before a page in a particular zone or container is cleaned and direct
      reclaimers depend on kswapd or flusher threads to do the necessary work.
      However, as filesystems sometimes ignore direct reclaim requests already,
      it is not expected to be a serious issue.
      
      Patch 1 disables writeback of filesystem pages from direct reclaim
      	entirely. Anonymous pages are still written.
      
      Patch 2 removes dead code in lumpy reclaim as it is no longer able
      	to synchronously write pages. This hurts lumpy reclaim but
      	there is an expectation that compaction is used for hugepage
      	allocations these days and lumpy reclaim's days are numbered.
      
      Patches 3-4 add warnings to XFS and ext4 if called from
      	direct reclaim. With patch 1, this "never happens" and is
      	intended to catch regressions in this logic in the future.
      
      Patch 5 disables writeback of filesystem pages from kswapd unless
      	the priority is raised to the point where kswapd is considered
      	to be in trouble.
      
      Patch 6 throttles reclaimers if too many dirty pages are being
      	encountered and the zones or backing devices are congested.
      
      Patch 7 invalidates dirty pages found at the end of the LRU so they
      	are reclaimed quickly after being written back rather than
      	waiting for a reclaimer to find them.
      
      I consider this series to be orthogonal to the writeback work but it is
      worth noting that the writeback work affects the viability of patch 8 in
      particular.
      
      I tested this on ext4 and xfs using fs_mark, a simple writeback test based
      on dd and a micro benchmark that does a streaming write to a large mapping
      (exercises use-once LRU logic) followed by streaming writes to a mix of
      anonymous and file-backed mappings.  The command line for fs_mark when
      booted with 512M looked something like
      
      ./fs_mark -d  /tmp/fsmark-2676  -D  100  -N  150  -n  150  -L  25  -t  1  -S0  -s  10485760
      
      The number of files was adjusted depending on the amount of available
      memory so that the total size of the files created was about 3xRAM.  For
      multiple threads, the -d switch is specified multiple times.
      
      The test machine is x86-64 with an older generation of AMD processor with
      4 cores.  The underlying storage was 4 disks configured as RAID-0 as this
      was the best configuration of storage I had available.  Swap is on a
      separate disk.  Dirty ratio was tuned to 40% instead of the default of
      20%.
      
      Testing was run with and without monitors to both verify that the patches
      were operating as expected and that any performance gain was real and not
      due to interference from monitors.
      
      Here is a summary of results based on testing XFS.
      
      512M1P-xfs           Files/s  mean                 32.69 ( 0.00%)     34.44 ( 5.08%)
      512M1P-xfs           Elapsed Time fsmark                    51.41     48.29
      512M1P-xfs           Elapsed Time simple-wb                114.09    108.61
      512M1P-xfs           Elapsed Time mmap-strm                113.46    109.34
      512M1P-xfs           Kswapd efficiency fsmark                 62%       63%
      512M1P-xfs           Kswapd efficiency simple-wb              56%       61%
      512M1P-xfs           Kswapd efficiency mmap-strm              44%       42%
      512M-xfs             Files/s  mean                 30.78 ( 0.00%)     35.94 (14.36%)
      512M-xfs             Elapsed Time fsmark                    56.08     48.90
      512M-xfs             Elapsed Time simple-wb                112.22     98.13
      512M-xfs             Elapsed Time mmap-strm                219.15    196.67
      512M-xfs             Kswapd efficiency fsmark                 54%       56%
      512M-xfs             Kswapd efficiency simple-wb              54%       55%
      512M-xfs             Kswapd efficiency mmap-strm              45%       44%
      512M-4X-xfs          Files/s  mean                 30.31 ( 0.00%)     33.33 ( 9.06%)
      512M-4X-xfs          Elapsed Time fsmark                    63.26     55.88
      512M-4X-xfs          Elapsed Time simple-wb                100.90     90.25
      512M-4X-xfs          Elapsed Time mmap-strm                261.73    255.38
      512M-4X-xfs          Kswapd efficiency fsmark                 49%       50%
      512M-4X-xfs          Kswapd efficiency simple-wb              54%       56%
      512M-4X-xfs          Kswapd efficiency mmap-strm              37%       36%
      512M-16X-xfs         Files/s  mean                 60.89 ( 0.00%)     65.22 ( 6.64%)
      512M-16X-xfs         Elapsed Time fsmark                    67.47     58.25
      512M-16X-xfs         Elapsed Time simple-wb                103.22     90.89
      512M-16X-xfs         Elapsed Time mmap-strm                237.09    198.82
      512M-16X-xfs         Kswapd efficiency fsmark                 45%       46%
      512M-16X-xfs         Kswapd efficiency simple-wb              53%       55%
      512M-16X-xfs         Kswapd efficiency mmap-strm              33%       33%
      
      Up until 512-4X, the FSmark improvements were statistically significant.
      For the 4X and 16X tests the results were within standard deviations but
      just barely.  The time to completion for all tests is improved which is an
      important result.  In general, kswapd efficiency is not affected by
      skipping dirty pages.
      
      1024M1P-xfs          Files/s  mean                 39.09 ( 0.00%)     41.15 ( 5.01%)
      1024M1P-xfs          Elapsed Time fsmark                    84.14     80.41
      1024M1P-xfs          Elapsed Time simple-wb                210.77    184.78
      1024M1P-xfs          Elapsed Time mmap-strm                162.00    160.34
      1024M1P-xfs          Kswapd efficiency fsmark                 69%       75%
      1024M1P-xfs          Kswapd efficiency simple-wb              71%       77%
      1024M1P-xfs          Kswapd efficiency mmap-strm              43%       44%
      1024M-xfs            Files/s  mean                 35.45 ( 0.00%)     37.00 ( 4.19%)
      1024M-xfs            Elapsed Time fsmark                    94.59     91.00
      1024M-xfs            Elapsed Time simple-wb                229.84    195.08
      1024M-xfs            Elapsed Time mmap-strm                405.38    440.29
      1024M-xfs            Kswapd efficiency fsmark                 79%       71%
      1024M-xfs            Kswapd efficiency simple-wb              74%       74%
      1024M-xfs            Kswapd efficiency mmap-strm              39%       42%
      1024M-4X-xfs         Files/s  mean                 32.63 ( 0.00%)     35.05 ( 6.90%)
      1024M-4X-xfs         Elapsed Time fsmark                   103.33     97.74
      1024M-4X-xfs         Elapsed Time simple-wb                204.48    178.57
      1024M-4X-xfs         Elapsed Time mmap-strm                528.38    511.88
      1024M-4X-xfs         Kswapd efficiency fsmark                 81%       70%
      1024M-4X-xfs         Kswapd efficiency simple-wb              73%       72%
      1024M-4X-xfs         Kswapd efficiency mmap-strm              39%       38%
      1024M-16X-xfs        Files/s  mean                 42.65 ( 0.00%)     42.97 ( 0.74%)
      1024M-16X-xfs        Elapsed Time fsmark                   103.11     99.11
      1024M-16X-xfs        Elapsed Time simple-wb                200.83    178.24
      1024M-16X-xfs        Elapsed Time mmap-strm                397.35    459.82
      1024M-16X-xfs        Kswapd efficiency fsmark                 84%       69%
      1024M-16X-xfs        Kswapd efficiency simple-wb              74%       73%
      1024M-16X-xfs        Kswapd efficiency mmap-strm              39%       40%
      
      All FSMark tests up to 16X had statistically significant improvements.
      For the most part, tests are completing faster with the exception of the
      streaming writes to a mixture of anonymous and file-backed mappings which
      were slower in two cases.
      
      In the cases where the mmap-strm tests were slower, there was more
      swapping due to dirty pages being skipped.  The number of additional pages
      swapped is almost identical to the fewer number of pages written from
      reclaim.  In other words, roughly the same number of pages were reclaimed
      but swapping was slower.  As the test is a bit unrealistic and stresses
      memory heavily, the small shift is acceptable.
      
      4608M1P-xfs          Files/s  mean                 29.75 ( 0.00%)     30.96 ( 3.91%)
      4608M1P-xfs          Elapsed Time fsmark                   512.01    492.15
      4608M1P-xfs          Elapsed Time simple-wb                618.18    566.24
      4608M1P-xfs          Elapsed Time mmap-strm                488.05    465.07
      4608M1P-xfs          Kswapd efficiency fsmark                 93%       86%
      4608M1P-xfs          Kswapd efficiency simple-wb              88%       84%
      4608M1P-xfs          Kswapd efficiency mmap-strm              46%       45%
      4608M-xfs            Files/s  mean                 27.60 ( 0.00%)     28.85 ( 4.33%)
      4608M-xfs            Elapsed Time fsmark                   555.96    532.34
      4608M-xfs            Elapsed Time simple-wb                659.72    571.85
      4608M-xfs            Elapsed Time mmap-strm               1082.57   1146.38
      4608M-xfs            Kswapd efficiency fsmark                 89%       91%
      4608M-xfs            Kswapd efficiency simple-wb              88%       82%
      4608M-xfs            Kswapd efficiency mmap-strm              48%       46%
      4608M-4X-xfs         Files/s  mean                 26.00 ( 0.00%)     27.47 ( 5.35%)
      4608M-4X-xfs         Elapsed Time fsmark                   592.91    564.00
      4608M-4X-xfs         Elapsed Time simple-wb                616.65    575.07
      4608M-4X-xfs         Elapsed Time mmap-strm               1773.02   1631.53
      4608M-4X-xfs         Kswapd efficiency fsmark                 90%       94%
      4608M-4X-xfs         Kswapd efficiency simple-wb              87%       82%
      4608M-4X-xfs         Kswapd efficiency mmap-strm              43%       43%
      4608M-16X-xfs        Files/s  mean                 26.07 ( 0.00%)     26.42 ( 1.32%)
      4608M-16X-xfs        Elapsed Time fsmark                   602.69    585.78
      4608M-16X-xfs        Elapsed Time simple-wb                606.60    573.81
      4608M-16X-xfs        Elapsed Time mmap-strm               1549.75   1441.86
      4608M-16X-xfs        Kswapd efficiency fsmark                 98%       98%
      4608M-16X-xfs        Kswapd efficiency simple-wb              88%       82%
      4608M-16X-xfs        Kswapd efficiency mmap-strm              44%       42%
      
      Unlike the other tests, the fsmark results are not statistically
      significant but the min and max times are both improved and for the most
      part, tests completed faster.
      
      There are other indications that this is an improvement as well.  For
      example, in the vast majority of cases, there were fewer pages scanned by
      direct reclaim implying in many cases that stalls due to direct reclaim
      are reduced.  kswapd is scanning more due to skipping dirty pages, which is
      unfortunate, but the CPU usage is still acceptable.
      
      In an earlier set of tests, I used blktrace and in almost all cases
      throughput throughout the entire test was higher.  However, I ended up
      discarding those results as recording blktrace data was too heavy for my
      liking.
      
      On a laptop, I plugged in a USB stick and ran a similar set of tests
      using it as backing storage.  A desktop environment was running and for
      the entire duration of the tests, firefox and gnome terminal were
      launching and exiting to vaguely simulate a user.
      
      1024M-xfs            Files/s  mean               0.41 ( 0.00%)        0.44 ( 6.82%)
      1024M-xfs            Elapsed Time fsmark               2053.52   1641.03
      1024M-xfs            Elapsed Time simple-wb            1229.53    768.05
      1024M-xfs            Elapsed Time mmap-strm            4126.44   4597.03
      1024M-xfs            Kswapd efficiency fsmark              84%       85%
      1024M-xfs            Kswapd efficiency simple-wb           92%       81%
      1024M-xfs            Kswapd efficiency mmap-strm           60%       51%
      1024M-xfs            Avg wait ms fsmark                5404.53     4473.87
      1024M-xfs            Avg wait ms simple-wb             2541.35     1453.54
      1024M-xfs            Avg wait ms mmap-strm             3400.25     3852.53
      
      The mmap-strm results were hurt because firefox launching had a tendency
      to push the test out of memory.  On the positive side, firefox launched
      marginally faster with the patches applied.  Time to completion for many
      tests was faster but more importantly - the "Avg wait" time as measured by
      iostat was far lower implying the system would be more responsive.  It was
      also the case that "Avg wait ms" on the root filesystem was lower.  I
      tested it manually and while the system felt slightly more responsive
      while copying data to a USB stick, it was marginal enough that it could be
      my imagination.
      
      This patch: do not writeback filesystem pages in direct reclaim.
      
      When kswapd is failing to keep zones above the min watermark, a process
      will enter direct reclaim in the same manner kswapd does.  If a dirty page
      is encountered during the scan, this page is written to backing storage
      using mapping->writepage.
      
      This causes two problems.  First, it can result in very deep call stacks,
      particularly if the target storage or filesystem are complex.  Some
      filesystems ignore write requests from direct reclaim as a result.  The
      second is that a single-page flush is inefficient in terms of IO.  While
      there is an expectation that the elevator will merge requests, this does
      not always happen.  Quoting Christoph Hellwig;
      
      	The elevator has a relatively small window it can operate on,
      	and can never fix up a bad large scale writeback pattern.
      
      This patch prevents direct reclaim writing back filesystem pages by
      checking if current is kswapd.  Anonymous pages are still written to swap
      as there is not the equivalent of a flusher thread for anonymous pages.
      If the dirty pages cannot be written back, they are placed back on the LRU
      lists.  There is now a direct dependency on dirty page balancing to
      prevent too many pages in the system being dirtied which would prevent
      reclaim making forward progress.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ee72886d
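The rule introduced by this patch can be summarised as a small decision function. This is a hedged sketch rather than the vmscan code: `struct page_info`, `enum reclaim_action` and `reclaim_decide()` are hypothetical names, and "free" for clean pages is a simplification of what reclaim actually does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome of examining one page during reclaim. */
enum reclaim_action {
    RECLAIM_FREE,          /* clean page: can be reclaimed directly */
    RECLAIM_WRITEBACK,     /* issue writeback (or swap-out) now */
    RECLAIM_KEEP_ON_LRU    /* put back on the LRU; flushers will clean it */
};

/* Hypothetical, simplified page state. */
struct page_info {
    bool dirty;
    bool file_backed;      /* false means anonymous memory */
};

/*
 * Dirty filesystem pages are only written back when the caller is
 * kswapd.  Direct reclaim puts them back on the LRU.  Dirty anonymous
 * pages are always written to swap because they have no flusher thread.
 */
enum reclaim_action reclaim_decide(const struct page_info *p, bool is_kswapd)
{
    assert(p != NULL);
    if (!p->dirty)
        return RECLAIM_FREE;
    if (p->file_backed && !is_kswapd)
        return RECLAIM_KEEP_ON_LRU;
    return RECLAIM_WRITEBACK;
}
```

The `is_kswapd` argument stands in for the "is current kswapd" check mentioned in the message; later patches in the series restrict the kswapd case further, which this sketch does not model.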
    • mm: zone_reclaim: make isolate_lru_page() filter-aware · f80c0673
      Minchan Kim committed
      In the __zone_reclaim case, we don't want to shrink mapped pages.
      Nonetheless, we isolate mapped pages and re-add them to the head of the
      LRU.  This is unnecessary CPU overhead and causes LRU churning.
      
      Of course, a page might be mapped when we isolate it but no longer mapped
      by the time we try to migrate it, so it could have been migrated.  But
      this race is rare, and even when it happens it's no big deal.
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f80c0673
    • mm: compaction: make isolate_lru_page() filter-aware · 39deaf85
      Minchan Kim committed
      In async mode, compaction doesn't migrate dirty or writeback pages.  So
      it's pointless to pick such a page and re-add it to the LRU list.
      
      Of course, a page might be dirty or under writeback when we isolate it in
      compaction but clean by the time we try to migrate it, so it could have
      been migrated.  But that's very unlikely, as the isolate and migration
      cycle is much faster than writeout.
      
      So this patch reduces CPU overhead and prevents unnecessary LRU churning.
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      39deaf85
    • mm: change isolate mode from #define to bitwise type · 4356f21d
      Minchan Kim committed
      Replace the ISOLATE_XXX macros with a bitwise isolate_mode_t type.
      Normally, macros aren't recommended as they are type-unsafe and make
      debugging harder because the symbols cannot be passed through to the debugger.
      
      Quote from Johannes
      " Hmm, it would probably be cleaner to fully convert the isolation mode
      into independent flags.  INACTIVE, ACTIVE, BOTH is currently a
      tri-state among flags, which is a bit ugly."
      
      This patch moves isolate mode from swap.h to mmzone.h by memcontrol.h
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4356f21d
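The three isolation commits above (independent mode flags, plus filters that skip dirty/writeback pages for async compaction and mapped pages for zone reclaim) can be pictured with a short sketch. The type `iso_mode_t`, the `ISO_*` flag names and `may_isolate()` are illustrative assumptions and are not taken from the kernel headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag type; the kernel's own type is isolate_mode_t. */
typedef unsigned int iso_mode_t;

/* Independent flags instead of an INACTIVE/ACTIVE/BOTH tri-state. */
#define ISO_INACTIVE  (1u << 0)   /* may take pages from the inactive list */
#define ISO_ACTIVE    (1u << 1)   /* may take pages from the active list */
#define ISO_CLEAN     (1u << 2)   /* skip dirty or writeback pages */
#define ISO_UNMAPPED  (1u << 3)   /* skip mapped pages */

/* Hypothetical, simplified page state. */
struct lru_page {
    bool active;
    bool dirty;
    bool writeback;
    bool mapped;
};

/*
 * Decide up front whether a page is worth isolating, so that pages the
 * caller would only put straight back are never taken off the LRU.
 */
bool may_isolate(const struct lru_page *p, iso_mode_t mode)
{
    assert(p != NULL);
    if (p->active && !(mode & ISO_ACTIVE))
        return false;
    if (!p->active && !(mode & ISO_INACTIVE))
        return false;
    if ((mode & ISO_CLEAN) && (p->dirty || p->writeback))
        return false;
    if ((mode & ISO_UNMAPPED) && p->mapped)
        return false;
    return true;
}
```

In this picture an async compaction caller would pass something like `ISO_INACTIVE | ISO_ACTIVE | ISO_CLEAN`, and a zone reclaim caller that must not touch mapped pages would add `ISO_UNMAPPED` to its mode.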
  10. 27 Jul, 2011 2 commits
    • atomic: use <linux/atomic.h> · 60063497
      Arun Sharma committed
      This allows us to move duplicated code in <asm/atomic.h>
      (atomic_inc_not_zero() for now) to <linux/atomic.h>
      Signed-off-by: Arun Sharma <asharma@fb.com>
      Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: David Miller <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: Mike Frysinger <vapier@gentoo.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      60063497
    • K
      memcg: consolidate memory cgroup lru stat functions · bb2a0de9
      KAMEZAWA Hiroyuki committed
      mm/memcontrol.c has many LRU stat functions, such as:
      
        mem_cgroup_zone_nr_lru_pages
        mem_cgroup_node_nr_file_lru_pages
        mem_cgroup_nr_file_lru_pages
        mem_cgroup_node_nr_anon_lru_pages
        mem_cgroup_nr_anon_lru_pages
        mem_cgroup_node_nr_unevictable_lru_pages
        mem_cgroup_nr_unevictable_lru_pages
        mem_cgroup_node_nr_lru_pages
        mem_cgroup_nr_lru_pages
        mem_cgroup_get_local_zonestat
      
      Some of them are under #ifdef MAX_NUMNODES >1 and others are not.
      This seems bad. This patch consolidates all functions into
      
        mem_cgroup_zone_nr_lru_pages()
        mem_cgroup_node_nr_lru_pages()
        mem_cgroup_nr_lru_pages()
      
      For these functions, "which LRU?" information is passed by a mask.
      
      example:
        mem_cgroup_nr_lru_pages(mem, BIT(LRU_ACTIVE_ANON))
      
      I also added the macros ALL_LRU, ALL_LRU_FILE and ALL_LRU_ANON.
      
      example:
        mem_cgroup_nr_lru_pages(mem, ALL_LRU)
      
      BTW, this patch also seems better with respect to the NUMA memory
      placement of the counters.

      Currently, when we gather all LRU information, we scan in the following
      order:
          for_each_lru -> for_each_node -> for_each_zone

      This means we touch cache lines on different nodes in turn.

      After the patch, we scan:
          for_each_node -> for_each_zone -> for_each_lru(mask)

      So we gather the information in the same cache line at once.
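
      The mask-based interface and the new loop order can be sketched in
      userspace as follows.  This is an illustration only: the enum
      lru_sketch, the array layout, NR_NODES, NR_ZONES and ALL_LRU_SKETCH are
      assumptions made for the example, not the memcg data structures.

```c
#include <assert.h>

#define BIT(n) (1UL << (n))

/* Hypothetical LRU indices and topology sizes for this sketch. */
enum lru_sketch {
    LRU_INACTIVE_ANON, LRU_ACTIVE_ANON,
    LRU_INACTIVE_FILE, LRU_ACTIVE_FILE,
    LRU_UNEVICTABLE, NR_LRU
};
#define NR_NODES 2
#define NR_ZONES 2
#define ALL_LRU_SKETCH (BIT(NR_LRU) - 1)

/*
 * counts[node][zone][lru]: all counters of one zone are adjacent in memory,
 * so the innermost loop stays within the same zone's data.
 */
static unsigned long nr_lru_pages(unsigned long counts[NR_NODES][NR_ZONES][NR_LRU],
                                  unsigned long mask)
{
    unsigned long total = 0;

    for (int node = 0; node < NR_NODES; node++)
        for (int zone = 0; zone < NR_ZONES; zone++)
            for (int lru = 0; lru < NR_LRU; lru++)
                if (mask & BIT(lru))   /* "which LRU?" is selected by the mask */
                    total += counts[node][zone][lru];
    return total;
}
```

      A single function then serves every combination of LRU lists: the caller
      passes BIT(LRU_ACTIVE_ANON) for one list or ALL_LRU_SKETCH for all of
      them.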
      
      [akpm@linux-foundation.org: fix warnings, build error]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb2a0de9
  11. 28 Jun, 2011 1 commit
    • K
      Fix node_start/end_pfn() definition for mm/page_cgroup.c · c6830c22
      KAMEZAWA Hiroyuki committed
      Commit 21a3c964 uses node_start/end_pfn(nid) to detect the start/end
      of nodes.  However, these are not defined in linux/mmzone.h but in
      /arch/???/include/mmzone.h, which is included only under
      CONFIG_NEED_MULTIPLE_NODES=y.
      
      Then, we see
        mm/page_cgroup.c: In function 'page_cgroup_init':
        mm/page_cgroup.c:308: error: implicit declaration of function 'node_start_pfn'
        mm/page_cgroup.c:309: error: implicit declaration of function 'node_end_pfn'
      
      So, fixing page_cgroup.c is one option...

      But node_start_pfn()/node_end_pfn() are very generic macros and
      should be implemented in the same manner for all archs.
      (m32r has a different implementation...)

      This patch removes the definitions of node_start/end_pfn() in each arch
      and defines a unified one in linux/mmzone.h.  It is no longer under
      CONFIG_NEED_MULTIPLE_NODES.
      
      A result of macro expansion is here (mm/page_cgroup.c)
      
      for !NUMA
        start_pfn = ((&contig_page_data)->node_start_pfn);
        end_pfn = ({ pg_data_t *__pgdat = (&contig_page_data); __pgdat->node_start_pfn + __pgdat->node_spanned_pages;});
      
      for NUMA (x86-64)
        start_pfn = ((node_data[nid])->node_start_pfn);
        end_pfn = ({ pg_data_t *__pgdat = (node_data[nid]); __pgdat->node_start_pfn + __pgdat->node_spanned_pages;});
      
      Changelog:
       - fixed to avoid using "nid" twice in node_end_pfn() macro.
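
      The changelog item about using "nid" only once can be illustrated with a
      GNU C statement expression, as in the expansions above.  This is a
      userspace sketch: struct pgdat_sketch, node_data_sketch and the macro
      names carrying a _sketch/_SKETCH suffix are hypothetical stand-ins, and
      the code relies on the GCC/Clang statement-expression extension.

```c
#include <assert.h>

/* Hypothetical, reduced stand-in for pg_data_t. */
struct pgdat_sketch {
    unsigned long node_start_pfn;
    unsigned long node_spanned_pages;
};

static struct pgdat_sketch node_data_sketch[2] = {
    { 0, 1000 },     /* node 0: pfns [0, 1000) */
    { 1000, 500 },   /* node 1: pfns [1000, 1500) */
};
#define NODE_DATA_SKETCH(nid) (&node_data_sketch[(nid)])

/*
 * The argument is captured once in __pgdat, so an argument with side
 * effects (for example nid++) is evaluated exactly once.
 */
#define node_end_pfn_sketch(nid) ({                             \
    struct pgdat_sketch *__pgdat = NODE_DATA_SKETCH(nid);       \
    __pgdat->node_start_pfn + __pgdat->node_spanned_pages;      \
})
```

      A naive definition that expands NODE_DATA_SKETCH(nid) twice would
      evaluate the argument twice, which is what the changelog entry avoids.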
      Reported-and-acked-by: Randy Dunlap <randy.dunlap@oracle.com>
      Reported-and-tested-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c6830c22
  12. 27 May, 2011 1 commit
    • K
      memcg: fix get_scan_count() for small targets · 246e87a9
      KAMEZAWA Hiroyuki committed
      During memory reclaim we determine the number of pages to be scanned per
      zone as
      
      	(anon + file) >> priority.
      Assume
      	scan = (anon + file) >> priority.
      
      If scan < SWAP_CLUSTER_MAX, the scan is skipped this time and the
      priority gets higher.  This has some problems.

        1. This raises the priority by 1 without any scan.
           To scan at this priority, the amount of pages must be larger than 512M.
           If pages >> priority < SWAP_CLUSTER_MAX, it's recorded and the scan is
           batched later.  (But we lose 1 priority.)
           If the memory size is below 16M, pages >> priority is 0 and no scan
           happens at DEF_PRIORITY, ever.

        2. If zone->all_unreclaimable==true, it's scanned only when priority==0.
           So, x86's ZONE_DMA will never be recovered until the user of the pages
           frees memory by itself.

        3. With memcg, the memory limit can be small.  A small memcg reaches
           priority < DEF_PRIORITY-2 very easily and needs to call
           wait_iff_congested().
           To scan before priority=9, 64MB of memory must be in use.

      So this patch forces a scan of SWAP_CLUSTER_MAX pages when

        1. the target is small enough.
        2. it's kswapd or memcg reclaim.

      This way we can avoid a rapid priority drop and may be able to recover
      all_unreclaimable in small zones.  This patch also removes nr_saved_scan.
      This allows scanning at this priority even when pages >> priority is
      very small.
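
      The 512M, 16M and 64MB figures follow from the shift arithmetic.  The
      sketch below reproduces them under stated assumptions: 4 KiB pages,
      SWAP_CLUSTER_MAX == 32 and DEF_PRIORITY == 12.  The helper names
      min_mib_for_scan() and scan_target() are hypothetical, and scan_target()
      only mirrors the description above, not the actual get_scan_count()
      code.

```c
#include <assert.h>

#define DEF_PRIORITY     12      /* assumed default reclaim priority */
#define SWAP_CLUSTER_MAX 32UL    /* assumed batch size in pages */
#define PAGE_SIZE_SKETCH 4096UL  /* assumed 4 KiB pages */

/* Memory (in MiB) needed so that pages >> priority reaches SWAP_CLUSTER_MAX. */
static unsigned long min_mib_for_scan(int priority)
{
    unsigned long pages = SWAP_CLUSTER_MAX << priority;

    return (pages * PAGE_SIZE_SKETCH) >> 20;
}

/* Scan target for one pass; `force` models "kswapd or memcg reclaim". */
static unsigned long scan_target(unsigned long lru_pages, int priority, int force)
{
    unsigned long scan = lru_pages >> priority;

    if (scan < SWAP_CLUSTER_MAX && force)
        scan = SWAP_CLUSTER_MAX;   /* small target: scan one batch anyway */
    return scan;
}
```

      Under these assumptions min_mib_for_scan(DEF_PRIORITY) is 512 and
      min_mib_for_scan(9) is 64, and any LRU smaller than 4096 pages (16 MiB)
      shifts down to a scan target of 0 unless the scan is forced.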
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Ying Han <yinghan@google.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      246e87a9
  13. 26 May, 2011 1 commit
    • W
      ARM: 6913/1: sparsemem: allow pfn_valid to be overridden when using SPARSEMEM · 7b7bf499
      Will Deacon committed
      In commit eb33575c ("[ARM] Double check memmap is actually valid with a
      memmap has unexpected holes V2"), a new function, memmap_valid_within,
      was introduced to mmzone.h so that holes in the memmap which pass
      pfn_valid in SPARSEMEM configurations can be detected and avoided.
      
      The fix to this problem checks that the pfn <-> page linkages are
      correct by calculating the page for the pfn and then checking that
      page_to_pfn on that page returns the original pfn. Unfortunately, in
      SPARSEMEM configurations, this results in reading from the page flags to
      determine the correct section. Since the memmap here has been freed,
      junk is read from memory and the check is no longer robust.
      
      In the best case, reading from /proc/pagetypeinfo will give you the
      wrong answer. In the worst case, you get SEGVs, Kernel OOPses and hung
      CPUs. Furthermore, ioremap implementations that use pfn_valid to
      disallow the remapping of normal memory will break.
      
      This patch allows architectures to provide their own pfn_valid function
      instead of using the default implementation used by sparsemem. The
      architecture-specific version is aware of the memmap state and will
      return false when passed a pfn for a freed page within a valid section.
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Tested-by: H Hartley Sweeten <hsweeten@visionengravers.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
      7b7bf499
  14. 25 May, 2011 2 commits
  15. 04 Feb, 2011 1 commit
  16. 14 Jan, 2011 3 commits
    • A
      thp: transparent hugepage vmstat · 79134171
      Andrea Arcangeli committed
      Add hugepage stat information to /proc/vmstat and /proc/meminfo.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      79134171
    • M
      mm: kswapd: stop high-order balancing when any suitable zone is balanced · 99504748
      Mel Gorman committed
      Simon Kirby reported the following problem
      
         We're seeing cases on a number of servers where cache never fully
         grows to use all available memory.  Sometimes we see servers with 4 GB
         of memory that never seem to have less than 1.5 GB free, even with a
         constantly-active VM.  In some cases, these servers also swap out while
         this happens, even though they are constantly reading the working set
         into memory.  We have been seeing this happening for a long time; I
         don't think it's anything recent, and it still happens on 2.6.36.
      
      After some debugging work by Simon, Dave Hansen and others, the prevailing
      theory became that kswapd is too aggressive when reclaiming the order-3
      pages requested by SLUB.
      
      There are two apparent problems here.  On the target machine, there is a
      small Normal zone in comparison to DMA32.  As kswapd tries to balance all
      zones, it would continually try reclaiming for Normal even though DMA32
      was balanced enough for callers.  The second problem is that
      sleeping_prematurely() does not use the same logic as balance_pgdat() when
      deciding whether to sleep or not.  This keeps kswapd artificially awake.
      
      A number of tests were run and the figures from previous postings will
      look very different for a few reasons.  One, the old figures were forcing
      my network card to use GFP_ATOMIC in an attempt to replicate Simon's
      problem.  Second, I previously specified slub_min_order=3, again in an
      attempt to
      reproduce Simon's problem.  In this posting, I'm depending on Simon to say
      whether his problem is fixed or not and these figures are to show the
      impact to the ordinary cases.  Finally, the "vmscan" figures are taken
      from /proc/vmstat instead of the tracepoints.  There is less information
      but recording is less disruptive.
      
      The first test of relevance was postmark with a process running in the
      background reading a large amount of anonymous memory in blocks.  The
      objective was to vaguely simulate what was happening on Simon's machine
      and it's memory intensive enough to have kswapd awake.
      
      POSTMARK
                                                  traceonly          kanyzone
      Transactions per second:              156.00 ( 0.00%)   153.00 (-1.96%)
      Data megabytes read per second:        21.51 ( 0.00%)    21.52 ( 0.05%)
      Data megabytes written per second:     29.28 ( 0.00%)    29.11 (-0.58%)
      Files created alone per second:       250.00 ( 0.00%)   416.00 (39.90%)
      Files create/transact per second:      79.00 ( 0.00%)    76.00 (-3.95%)
      Files deleted alone per second:       520.00 ( 0.00%)   420.00 (-23.81%)
      Files delete/transact per second:      79.00 ( 0.00%)    76.00 (-3.95%)
      
      MMTests Statistics: duration
      User/Sys Time Running Test (seconds)         16.58      17.4
      Total Elapsed Time (seconds)                218.48    222.47
      
      VMstat Reclaim Statistics: vmscan
      Direct reclaims                                  0          4
      Direct reclaim pages scanned                     0        203
      Direct reclaim pages reclaimed                   0        184
      Kswapd pages scanned                        326631     322018
      Kswapd pages reclaimed                      312632     309784
      Kswapd low wmark quickly                         1          4
      Kswapd high wmark quickly                      122        475
      Kswapd skip congestion_wait                      1          0
      Pages activated                             700040     705317
      Pages deactivated                           212113     203922
      Pages written                                 9875       6363
      
      Total pages scanned                         326631    322221
      Total pages reclaimed                       312632    309968
      %age total pages scanned/reclaimed          95.71%    96.20%
      %age total pages scanned/written             3.02%     1.97%
      
      proc vmstat: Faults
      Major Faults                                   300       254
      Minor Faults                                645183    660284
      Page ins                                    493588    486704
      Page outs                                  4960088   4986704
      Swap ins                                      1230       661
      Swap outs                                     9869      6355
      
      Performance is mildly affected because kswapd is no longer doing as much
      work and the background memory consumer process is getting in the way.
      Note that kswapd scanned and reclaimed fewer pages as it's less aggressive
      and overall fewer pages were scanned and reclaimed.  Swap in/out is
      particularly reduced again reflecting kswapd throwing out fewer pages.
      
      The slight performance impact is unfortunate here but it looks like a
      direct result of kswapd being less aggressive.  As the bug report is about
      too many pages being freed by kswapd, it may have to be accepted for now.
      
      The second test is a streaming IO benchmark that was previously used by
      Johannes to show regressions in page reclaim.
      
      MICRO
                                               traceonly  kanyzone
      User/Sys Time Running Test (seconds)         29.29     28.87
      Total Elapsed Time (seconds)                492.18    488.79
      
      VMstat Reclaim Statistics: vmscan
      Direct reclaims                               2128       1460
      Direct reclaim pages scanned               2284822    1496067
      Direct reclaim pages reclaimed              148919     110937
      Kswapd pages scanned                      15450014   16202876
      Kswapd pages reclaimed                     8503697    8537897
      Kswapd low wmark quickly                      3100       3397
      Kswapd high wmark quickly                     1860       7243
      Kswapd skip congestion_wait                    708        801
      Pages activated                               9635       9573
      Pages deactivated                             1432       1271
      Pages written                                  223       1130
      
      Total pages scanned                       17734836  17698943
      Total pages reclaimed                      8652616   8648834
      %age total pages scanned/reclaimed          48.79%    48.87%
      %age total pages scanned/written             0.00%     0.01%
      
      proc vmstat: Faults
      Major Faults                                   165       221
      Minor Faults                               9655785   9656506
      Page ins                                      3880      7228
      Page outs                                 37692940  37480076
      Swap ins                                         0        69
      Swap outs                                       19        15
      
      Again fewer pages are scanned and reclaimed as expected and this time the
      test completed faster.  Note that kswapd is hitting its watermarks faster
      (low and high wmark quickly) which I expect is due to kswapd reclaiming
      fewer pages.
      
      I also ran fs-mark, iozone and sysbench but there is nothing interesting
      to report in the figures.  Performance is not significantly changed and
      the reclaim statistics look reasonable.
      
      This patch:
      
      When the allocator enters its slow path, kswapd is woken up to balance the
      node.  It continues working until all zones within the node are balanced.
      For order-0 allocations, this makes perfect sense but for higher orders it
      can have unintended side-effects.  If the zone sizes are imbalanced,
      kswapd may reclaim heavily within a smaller zone discarding an excessive
      number of pages.  The user-visible behaviour is that kswapd is awake and
      reclaiming even though plenty of pages are free from a suitable zone.
      
      This patch alters the "balance" logic for high-order reclaim allowing
      kswapd to stop if any suitable zone becomes balanced to reduce the number
      of pages it reclaims from other zones.  kswapd still tries to ensure that
      order-0 watermarks for all zones are met before sleeping.
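
      The altered "balance" decision can be sketched as below.  This is only
      a model of the two rules stated above; struct zone_sketch and
      kswapd_may_stop() are hypothetical, and the real patch evaluates zone
      watermarks and zone suitability rather than precomputed booleans.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-zone state; the real checks use zone watermarks. */
struct zone_sketch {
    bool order0_ok;      /* order-0 watermark is met */
    bool high_order_ok;  /* watermark is met at the requested order */
};

/* Decide whether kswapd may stop reclaiming for this node. */
static bool kswapd_may_stop(const struct zone_sketch *zones, int nr_zones,
                            int order)
{
    bool any_high_order_ok = false;

    for (int i = 0; i < nr_zones; i++) {
        if (!zones[i].order0_ok)
            return false;            /* every zone must pass at order 0 */
        if (zones[i].high_order_ok)
            any_high_order_ok = true;
    }
    /* For high orders, one suitable balanced zone is enough. */
    return order == 0 || any_high_order_ok;
}
```

      In the imbalanced case described above (a small Normal zone next to a
      large DMA32 zone), this lets kswapd sleep once DMA32 can satisfy the
      high-order request, instead of continuing to reclaim from Normal.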
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Eric B Munson <emunson@mgebm.net>
      Cc: Simon Kirby <sim@hostway.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      99504748
    • M
      mm: page allocator: adjust the per-cpu counter threshold when memory is low · 88f5acf8
      Mel Gorman committed
      Commit aa454840 ("calculate a better estimate of NR_FREE_PAGES when memory
      is low") noted that watermarks were based on the vmstat NR_FREE_PAGES.  To
      avoid synchronization overhead, these counters are maintained on a per-cpu
      basis and drained both periodically and when a delta is above a
      threshold.  On large CPU systems, the difference between the estimate and
      real value of NR_FREE_PAGES can be very high.  The system can get into a
      case where pages are allocated far below the min watermark potentially
      causing livelock issues.  The commit solved the problem by taking a better
      reading of NR_FREE_PAGES when memory was low.
      
      Unfortunately, as reported by Shaohua Li, this accurate reading can consume a
      large amount of CPU time on systems with many sockets due to cache line
      bouncing.  This patch takes a different approach.  For large machines
      where counter drift might be unsafe and while kswapd is awake, the per-cpu
      thresholds for the target pgdat are reduced to limit the level of drift to
      what should be a safe level.  This incurs a performance penalty in heavy
      memory pressure by a factor that depends on the workload and the machine
      but the machine should function correctly without accidentally exhausting
      all memory on a node.  There is an additional cost when kswapd wakes and
      sleeps but the event is not expected to be frequent - in Shaohua's test
      case, there was one recorded sleep and wake event at least.
      
      To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is
      introduced that takes a more accurate reading of NR_FREE_PAGES when called
      from wakeup_kswapd, when deciding whether it is really safe to go back to
      sleep in sleeping_prematurely() and when deciding if a zone is really
      balanced or not in balance_pgdat().  We are still using an expensive
      function but limiting how often it is called.
      
      When the test case is reproduced, the time spent in the watermark
      functions is reduced.  The following report is on the percentage of time
      cumulatively spent in the functions zone_nr_free_pages(),
      zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(),
      zone_page_state_snapshot(), zone_page_state().
      
      vanilla                      11.6615%
      disable-threshold            0.2584%
      
      David said:
      
      : We had to pull aa454840 "mm: page allocator: calculate a better estimate
      : of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
      : internally because tests showed that it would cause the machine to stall
      : as the result of heavy kswapd activity.  I merged it back with this fix as
      : it is pending in the -mm tree and it solves the issue we were seeing, so I
      : definitely think this should be pushed to -stable (and I would seriously
      : consider it for 2.6.37 inclusion even at this late date).
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reported-by: Shaohua Li <shaohua.li@intel.com>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Tested-by: Nicolas Bareil <nico@chdir.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: <stable@kernel.org>		[2.6.37.1, 2.6.36.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88f5acf8
  17. 27 Oct, 2010 2 commits
    • M
      writeback: do not sleep on the congestion queue if there are no congested BDIs... · 0e093d99
      Mel Gorman committed
      writeback: do not sleep on the congestion queue if there are no congested BDIs or if significant congestion is not being encountered in the current zone
      
      If congestion_wait() is called with no BDI congested, the caller will
      sleep for the full timeout and this may be an unnecessary sleep.  This
      patch adds a wait_iff_congested() that checks congestion and only sleeps
      if a BDI is congested; otherwise, it calls cond_resched() to ensure the
      caller is not hogging the CPU longer than its quota, but it will not sleep.
      
      This is aimed at reducing some of the major desktop stalls reported during
      IO.  For example, while kswapd is operating, it calls congestion_wait()
      but it could just have been reclaiming clean page cache pages with no
      congestion.  Without this patch, it would sleep for a full timeout but
      after this patch, it'll just call schedule() if it has been on the CPU too
      long.  Similar logic applies to direct reclaimers that are not making
      enough progress.
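
      The decision described here (and in the commit title) can be modelled
      as a pure function.  This is a sketch only: the enum, the function name
      wait_iff_congested_sketch() and its two inputs are hypothetical, and the
      real function sleeps or yields rather than returning an action.

```c
#include <assert.h>
#include <stdbool.h>

/* What the caller ends up doing in this model. */
enum wait_action {
    ACTION_YIELD,  /* kernel analogue: cond_resched(), no sleep */
    ACTION_SLEEP   /* kernel analogue: sleep on the congestion queue */
};

/*
 * Sleep only if at least one BDI is congested and the current zone is
 * seeing significant congestion; otherwise just offer the CPU to others.
 */
static enum wait_action wait_iff_congested_sketch(int nr_congested_bdis,
                                                  bool zone_congested)
{
    if (nr_congested_bdis == 0 || !zone_congested)
        return ACTION_YIELD;
    return ACTION_SLEEP;
}
```

      By contrast, congestion_wait() in this model would always return
      ACTION_SLEEP, which is the unnecessary sleep the patch removes.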
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0e093d99
    • M
      writeback: add nr_dirtied and nr_written to /proc/vmstat · ea941f0e
      Michael Rubin committed
      To help developers and applications gain visibility into writeback
      behaviour, this adds two entries to vm_stat_items and /proc/vmstat.  This
      will allow us to track the "written" and "dirtied" counts.
      
         # grep nr_dirtied /proc/vmstat
         nr_dirtied 3747
         # grep nr_written /proc/vmstat
         nr_written 3618
      Signed-off-by: Michael Rubin <mrubin@google.com>
      Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ea941f0e
  18. 10 Sep, 2010 1 commit
    • C
      mm: page allocator: calculate a better estimate of NR_FREE_PAGES when memory... · aa454840
      Christoph Lameter committed
      mm: page allocator: calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
      
      Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is
      cheaper than scanning a number of lists.  To avoid synchronization
      overhead, counter deltas are maintained on a per-cpu basis and drained
      both periodically and when the delta is above a threshold.  On large CPU
      systems, the difference between the estimated and real value of
      NR_FREE_PAGES can be very high.  If NR_FREE_PAGES is much higher than the
      real number of free pages in the buddy allocator, the VM can allocate pages
      below the min watermark, at worst reducing the real number of pages to
      zero.  Even if
      the OOM killer kills some victim for freeing memory, it may not free
      memory if the exit path requires a new page resulting in livelock.
      
      This patch introduces a zone_page_state_snapshot() function (courtesy of
      Christoph) that takes a slightly more accurate view of an arbitrary vmstat
      counter.  It is used to read NR_FREE_PAGES while kswapd is awake to avoid
      the watermark being accidentally broken.  The estimate is not perfect and
      may result in cache line bounces but is expected to be lighter than the
      IPI calls necessary to continually drain the per-cpu counters while kswapd
      is awake.
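
      The difference between the cheap read and the snapshot read can be
      sketched as follows.  This is a userspace model under assumptions: a
      fixed NR_CPUS_SKETCH, a plain array instead of real per-cpu storage, and
      hypothetical names (struct counter_sketch, counter_read_fast(),
      counter_read_snapshot()).

```c
#include <assert.h>

#define NR_CPUS_SKETCH 4

/* One vmstat-like counter: a global value plus undrained per-cpu deltas. */
struct counter_sketch {
    long global;
    signed char cpu_delta[NR_CPUS_SKETCH];
};

/* Cheap read: ignores the per-cpu deltas, so it can drift. */
static long counter_read_fast(const struct counter_sketch *c)
{
    return c->global < 0 ? 0 : c->global;
}

/* Snapshot read: folds in every CPU's pending delta (touches more cache lines). */
static long counter_read_snapshot(const struct counter_sketch *c)
{
    long x = c->global;

    for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
        x += c->cpu_delta[cpu];
    return x < 0 ? 0 : x;
}
```

      With a global value of 100 and pending deltas of -10, -20, +5 and -30,
      the fast read reports 100 while the snapshot reports 45, which is the
      kind of overestimate that can let allocations go below the min
      watermark.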
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aa454840
  19. 10 Aug, 2010 2 commits
    • K
      vmscan: kill prev_priority completely · 25edde03
      KOSAKI Motohiro committed
      zone->prev_priority has been unused since 2.6.28, so it can be removed
      safely.  This reduces stack usage slightly.

      Now I have to say that I'm sorry.  Two years ago, I thought prev_priority
      could be integrated again and would be useful, but four (or more) attempts
      have not produced good performance numbers, so I am giving up on that
      approach.
      
      The rest of this changelog is notes on prev_priority, why it existed in
      the first place and why it might not be necessary any more.  This
      information is based heavily on discussions between Andrew Morton, Rik van
      Riel and Kosaki Motohiro, who is heavily quoted.
      
      Historically prev_priority was important because it determined when the VM
      would start unmapping PTE pages. i.e. there are no balances of note within
      the VM, Anon vs File and Mapped vs Unmapped. Without prev_priority, there
      is a potential risk of unnecessarily increasing minor faults as a large
      amount of read activity of use-once pages could push mapped pages to the
      end of the LRU and get unmapped.
      
      There is no proof this is still a problem but currently it is not considered
      to be. Active files are not deactivated if the active file list is smaller
      than the inactive list reducing the likelihood that file-mapped pages are
      being pushed off the LRU and referenced executable pages are kept on the
      active list to avoid them getting pushed out by read activity.
      
      Even if it is a problem, prev_priority wouldn't work nowadays.  First of
      all, the current vmscan still has a lot of UP-centric code, which exposes
      weaknesses on machines with some dozens of CPUs.  I think we need more and
      more improvement.

      The problem is that the current vmscan mixes up per-system pressure,
      per-zone pressure and per-task pressure a bit.  For example, prev_priority
      tries to boost the priority to that of other concurrent reclaimers, but if
      the other task has a mempolicy restriction, this is unnecessary and also
      causes large latency and excessive reclaim.  Per-task priority plus the
      prev_priority adjustment emulates per-system pressure, but it has two
      issues: 1) the emulation is too rough and brutal, and 2) we need per-zone
      pressure, not per-system.

      Another example: currently DEF_PRIORITY is 12, which means the LRU rotates
      about 2 cycles (1/4096 + 1/2048 + 1/1024 + .. + 1) before invoking the
      OOM killer.  But if 10,000 threads enter DEF_PRIORITY reclaim at the same
      time, the system has higher memory pressure than priority==0
      (1/4096*10,000 > 2).  prev_priority can't solve such multithreaded
      workload issues.  In other words, the prev_priority concept assumes the
      system doesn't have lots of threads.
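
      The arithmetic in the last example can be checked with integer math in
      units of 1/4096 of the LRU.  This is a worked sketch, assuming each
      priority level p scans lru >> p pages; the function names are
      hypothetical.

```c
#include <assert.h>

#define DEF_PRIORITY 12

/*
 * Total LRU fraction scanned by one reclaimer that walks every priority
 * from DEF_PRIORITY down to 0, in units of 1/4096 of the LRU.
 * Priority p contributes 1/2^p of the LRU, i.e. 2^(12 - p) units.
 */
static unsigned long one_reclaimer_units(void)
{
    unsigned long units = 0;

    for (int prio = DEF_PRIORITY; prio >= 0; prio--)
        units += 1UL << (DEF_PRIORITY - prio);
    return units;
}

/* N threads that each scan 1/4096 of the LRU at DEF_PRIORITY. */
static unsigned long concurrent_units(unsigned long threads)
{
    return threads;
}
```

      The single reclaimer totals 8191 units, just under two full LRU
      rotations (8192 units), while 10,000 concurrent threads at DEF_PRIORITY
      already account for 10,000 units, about 2.44 rotations.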
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michael Rubin <mrubin@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      25edde03
    • A
      mmzone.h: remove dead prototype · b645bd12
      Alexander Nevenchannyy committed
      get_zone_counts() was dropped from kernel tree, see:
      http://www.mail-archive.com/mm-commits@vger.kernel.org/msg07313.html but
      its prototype remains.
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b645bd12
  20. 28 May, 2010 1 commit
    • L
      numa: introduce numa_mem_id()- effective local memory node id · 7aac7898
      Lee Schermerhorn committed
      Introduce numa_mem_id(), based on generic percpu variable infrastructure
      to track "nearest node with memory" for archs that support memoryless
      nodes.
      
      Define API in <linux/topology.h> when CONFIG_HAVE_MEMORYLESS_NODES
      defined, else stubs.  Architectures will define HAVE_MEMORYLESS_NODES
      if/when they support them.
      
      Archs can override definitions of:
      
      numa_mem_id() - returns node number of "local memory" node
      set_numa_mem() - initialize [this cpus'] per cpu variable 'numa_mem'
      cpu_to_mem()  - return numa_mem for specified cpu; may be used as lvalue
      
      Generic initialization of 'numa_mem' occurs in __build_all_zonelists().
      This will initialize the boot cpu at boot time, and all cpus on change of
      numa_zonelist_order, or when node or memory hot-plug requires zonelist
      rebuild.  Archs that support memoryless nodes will need to initialize
      'numa_mem' for secondary cpus as they're brought on-line.
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  21. 25 May, 2010 4 commits
    • H
      mem-hotplug: fix potential race while building zonelist for new populated zone · 4eaf3f64
      Haicheng Li committed
      Add global mutex zonelists_mutex to fix the possible race:
      
           CPU0                                  CPU1                    CPU2
      (1) zone->present_pages += online_pages;
      (2)                                       build_all_zonelists();
      (3)                                                               alloc_page();
      (4)                                                               free_page();
      (5) build_all_zonelists();
      (6)   __build_all_zonelists();
      (7)     zone->pageset = alloc_percpu();
      
      In steps (3) and (4), zone->pageset still points to boot_pageset, so bad
      things may happen if two or more nodes are in this state.  Even if only
      one node is accessing the boot_pageset, step (3) may still consume enough
      memory to make the memory allocations in step (7) fail.

      In addition, performing these steps atomically ensures that alloc_percpu()
      in step (7) will never fail, since a fresh memory block has just been
      added in step (6).
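
      The locking pattern can be sketched in user space with a pthread mutex.
      This is a simplified illustration under stated assumptions, not the
      kernel code: the demo_* names, the reduced zone structure and the
      rebuild counter are hypothetical, and a flag stands in for replacing
      boot_pageset with a freshly allocated per-cpu pageset.

```c
#include <pthread.h>

/* Hypothetical, heavily reduced stand-in for struct zone. */
struct demo_zone {
    unsigned long present_pages;
    int pageset_is_private;   /* 0 while still sharing the boot pageset */
};

/* Plays the role of the global zonelists_mutex. */
static pthread_mutex_t demo_zonelists_mutex = PTHREAD_MUTEX_INITIALIZER;
static int demo_rebuild_count;

/* Must be called with demo_zonelists_mutex held. */
static void demo_build_all_zonelists_locked(struct demo_zone *zone)
{
    /* A newly populated zone gets its own pageset during the rebuild. */
    if (zone->present_pages && !zone->pageset_is_private)
        zone->pageset_is_private = 1;
    demo_rebuild_count++;
}

/*
 * Growing the zone and rebuilding the zonelists happen under one lock,
 * so no other rebuild can run between the two steps.
 */
static void demo_online_pages(struct demo_zone *zone, unsigned long nr_pages)
{
    pthread_mutex_lock(&demo_zonelists_mutex);
    zone->present_pages += nr_pages;
    demo_build_all_zonelists_locked(zone);
    pthread_mutex_unlock(&demo_zonelists_mutex);
}
```

      Holding the lock across both the page count update and the rebuild is
      what closes the window in which a concurrent rebuild on another CPU
      could observe the zone as populated while it still shares the boot
      pageset.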
      
      [haicheng.li@linux.intel.com: hold zonelists_mutex when build_all_zonelists]
      Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: Andi Kleen <andi.kleen@intel.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • H
      mem-hotplug: avoid multiple zones sharing same boot strapping boot_pageset · 1f522509
      Haicheng Li committed
      For each newly populated zone of a hot-added node, its pagesets need to
      be updated with dynamically allocated per_cpu_pageset structs for all
      possible CPUs:
      
          1) Detach zone->pageset from the shared boot_pageset
             at end of __build_all_zonelists().
      
          2) Use mutex to protect zone->pageset when it's still
             shared in onlined_pages()
      
      Otherwise, multiple zones of different nodes would share the same
      bootstrapping boot_pageset for the same CPU, which eventually causes the
      kernel panic below:
      
        ------------[ cut here ]------------
        kernel BUG at mm/page_alloc.c:1239!
        invalid opcode: 0000 [#1] SMP
        ...
        Call Trace:
         [<ffffffff811300c1>] __alloc_pages_nodemask+0x131/0x7b0
         [<ffffffff81162e67>] alloc_pages_current+0x87/0xd0
         [<ffffffff81128407>] __page_cache_alloc+0x67/0x70
         [<ffffffff811325f0>] __do_page_cache_readahead+0x120/0x260
         [<ffffffff81132751>] ra_submit+0x21/0x30
         [<ffffffff811329c6>] ondemand_readahead+0x166/0x2c0
         [<ffffffff81132ba0>] page_cache_async_readahead+0x80/0xa0
         [<ffffffff8112a0e4>] generic_file_aio_read+0x364/0x670
         [<ffffffff81266cfa>] nfs_file_read+0xca/0x130
         [<ffffffff8117b20a>] do_sync_read+0xfa/0x140
         [<ffffffff8117bf75>] vfs_read+0xb5/0x1a0
         [<ffffffff8117c151>] sys_read+0x51/0x80
         [<ffffffff8103c032>] system_call_fastpath+0x16/0x1b
        RIP  [<ffffffff8112ff13>] get_page_from_freelist+0x883/0x900
         RSP <ffff88000d1e78a8>
        ---[ end trace 4bda28328b9990db ]
      
      [akpm@linux-foundation.org: merge fix]
      Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: Andi Kleen <andi.kleen@intel.com>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mm: fix NR_SECTION_ROOTS == 0 when using using sparsemem extreme. · 0faa5638
      Marcelo Roberto Jimenez committed
      Got this while compiling for ARM/SA1100:
      
      mm/sparse.c: In function '__section_nr':
      mm/sparse.c:135: warning: 'root' is used uninitialized in this function
      
      This patch follows Russell King's suggestion for a new calculation for
      NR_SECTION_ROOTS.  Thanks also to Sergei Shtylyov for pointing out the
      existence of the macro DIV_ROUND_UP.
      
      Atsushi Nemoto observed:
      : This fix doesn't just silence the warning - it fixes a real problem.
      :
      : Without this fix, mem_section[] might have 0 size so mem_section[0]
      : will share other variable area.  For example, I got:
      :
      : c030c700 b __warned.16478
      : c030c700 B mem_section
      : c030c701 b __warned.16483
      :
      : This might cause very strange behavior.  Your patch actually fixes it.
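
      The arithmetic behind the fix can be shown with a small sketch.  The
      section counts below are made-up values chosen so that there are fewer
      sections than fit in one root, and the DEMO_* and demo_* names are
      hypothetical; only the round-up formula mirrors the DIV_ROUND_UP macro
      mentioned above.

```c
/* Same formula as the kernel's DIV_ROUND_UP macro. */
#define DEMO_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Hypothetical configuration: fewer sections than fit in one root. */
#define DEMO_NR_MEM_SECTIONS   64UL
#define DEMO_SECTIONS_PER_ROOT 256UL

/* Truncating division: 64 / 256 is 0, so the root array would be empty. */
static unsigned long demo_roots_truncating(void)
{
    return DEMO_NR_MEM_SECTIONS / DEMO_SECTIONS_PER_ROOT;
}

/* Rounding up guarantees at least one root for any nonzero section count. */
static unsigned long demo_roots_rounded_up(void)
{
    return DEMO_DIV_ROUND_UP(DEMO_NR_MEM_SECTIONS, DEMO_SECTIONS_PER_ROOT);
}
```

      With these values the truncating form yields zero roots, which matches
      the zero-sized mem_section[] described in the observation, while the
      rounded-up form yields one.
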
      Signed-off-by: Marcelo Roberto Jimenez <mroberto@cpti.cetuc.puc-rio.br>
      Cc: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Sergei Shtylyov <sshtylyov@mvista.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mm: compaction: defer compaction using an exponential backoff when compaction fails · 4f92e258
      Mel Gorman committed
      The fragmentation index may indicate that a failure is due to external
      fragmentation but after a compaction run completes, it is still possible
      for an allocation to fail.  There are two obvious reasons why:
      
        o Page migration cannot move all pages so fragmentation remains
        o A suitable page may exist but watermarks are not met
      
      In the event of compaction followed by an allocation failure, this patch
      defers further compaction in the zone (1 << compact_defer_shift) times.
      If the next compaction attempt also fails, compact_defer_shift is
      increased up to a maximum of 6.  If compaction succeeds, the defer
      counters are reset again.
      
      The zone that is deferred is the first zone in the zonelist - i.e.  the
      preferred zone.  To defer compaction in the other zones, the information
      would need to be stored in the zonelist or implemented similar to the
      zonelist_cache.  This would impact the fast-paths and is not justified at
      this time.
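
      The backoff described above can be sketched in user space as follows.
      This is an illustration under assumptions rather than the patch
      itself: the demo_* names and the reduced structure are hypothetical,
      and only the two counters and the cap of 6 come from the description
      above.

```c
#include <stdbool.h>

#define DEMO_MAX_DEFER_SHIFT 6

/* Hypothetical per-zone state holding the two defer counters. */
struct demo_zone_compact {
    unsigned int compact_considered;
    unsigned int compact_defer_shift;
};

/* Called after compaction ran but the allocation still failed. */
static void demo_defer_compaction(struct demo_zone_compact *z)
{
    z->compact_considered = 0;
    if (z->compact_defer_shift < DEMO_MAX_DEFER_SHIFT)
        z->compact_defer_shift++;
}

/* Returns true while compaction in this zone should still be skipped. */
static bool demo_compaction_deferred(struct demo_zone_compact *z)
{
    unsigned long limit = 1UL << z->compact_defer_shift;

    /* Count this attempt, saturating at the limit to avoid overflow. */
    if (++z->compact_considered > limit)
        z->compact_considered = limit;

    return z->compact_considered < limit;
}

/* Called when compaction led to a successful allocation. */
static void demo_compaction_succeeded(struct demo_zone_compact *z)
{
    z->compact_considered = 0;
    z->compact_defer_shift = 0;
}
```

      In this sketch each consecutive failure doubles the length of the
      skip window until the shift reaches its cap, and a single success
      resets both counters so the next request compacts immediately.
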
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  22. 07 March, 2010 1 commit