1. 24 2月, 2013 1 次提交
  2. 13 12月, 2012 3 次提交
    • J
      mm: introduce new field "managed_pages" to struct zone · 9feedc9d
      Jiang Liu 提交于
      Currently a zone's present_pages is calcuated as below, which is
      inaccurate and may cause trouble to memory hotplug.
      
      	spanned_pages - absent_pages - memmap_pages - dma_reserve.
      
      During fixing bugs caused by inaccurate zone->present_pages, we found
      zone->present_pages has been abused.  The field zone->present_pages may
      have different meanings in different contexts:
      
      1) pages existing in a zone.
      2) pages managed by the buddy system.
      
      For more discussions about the issue, please refer to:
        http://lkml.org/lkml/2012/11/5/866
        https://patchwork.kernel.org/patch/1346751/
      
      This patchset tries to introduce a new field named "managed_pages" to
      struct zone, which counts "pages managed by the buddy system".  And revert
      zone->present_pages to count "physical pages existing in a zone", which
      also keep in consistence with pgdat->node_present_pages.
      
      We will set an initial value for zone->managed_pages in function
      free_area_init_core() and will adjust it later if the initial value is
      inaccurate.
      
      For DMA/normal zones, the initial value is set to:
      
      	(spanned_pages - absent_pages - memmap_pages - dma_reserve)
      
      Later zone->managed_pages will be adjusted to the accurate value when the
      bootmem allocator frees all free pages to the buddy system in function
      free_all_bootmem_node() and free_all_bootmem().
      
      The bootmem allocator doesn't touch highmem pages, so highmem zones'
      managed_pages is set to the accurate value "spanned_pages - absent_pages"
      in function free_area_init_core() and won't be updated anymore.
      
      This patch also adds a new field "managed_pages" to /proc/zoneinfo
      and sysrq showmem.
      
      [akpm@linux-foundation.org: small comment tweaks]
      Signed-off-by: NJiang Liu <jiang.liu@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
      Tested-by: NChris Clayton <chris2553@googlemail.com>
      Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9feedc9d
    • L
      vmstat: use N_MEMORY instead N_HIGH_MEMORY · a47b53c5
      Lai Jiangshan 提交于
      N_HIGH_MEMORY stands for the nodes that has normal or high memory.
      N_MEMORY stands for the nodes that has any memory.
      
      The code here need to handle with the nodes which have memory, we should
      use N_MEMORY instead.
      Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Signed-off-by: NWen Congyang <wency@cn.fujitsu.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Lin Feng <linfeng@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a47b53c5
    • K
      thp, vmstat: implement HZP_ALLOC and HZP_ALLOC_FAILED events · d8a8e1f0
      Kirill A. Shutemov 提交于
      hzp_alloc is incremented every time a huge zero page is successfully
      	allocated. It includes allocations which where dropped due
      	race with other allocation. Note, it doesn't count every map
      	of the huge zero page, only its allocation.
      
      hzp_alloc_failed is incremented if kernel fails to allocate huge zero
      	page and falls back to using small pages.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d8a8e1f0
  3. 11 12月, 2012 3 次提交
    • M
      mm: numa: Add pte updates, hinting and migration stats · 03c5a6e1
      Mel Gorman 提交于
      It is tricky to quantify the basic cost of automatic NUMA placement in a
      meaningful manner. This patch adds some vmstats that can be used as part
      of a basic costing model.
      
      u    = basic unit = sizeof(void *)
      Ca   = cost of struct page access = sizeof(struct page) / u
      Cpte = Cost PTE access = Ca
      Cupdate = Cost PTE update = (2 * Cpte) + (2 * Wlock)
      	where Cpte is incurred twice for a read and a write and Wlock
      	is a constant representing the cost of taking or releasing a
      	lock
      Cnumahint = Cost of a minor page fault = some high constant e.g. 1000
      Cpagerw = Cost to read or write a full page = Ca + PAGE_SIZE/u
      Ci = Cost of page isolation = Ca + Wi
      	where Wi is a constant that should reflect the approximate cost
      	of the locking operation
      Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma)
      	where Wnuma is the approximate NUMA factor. 1 is local. 1.2
      	would imply that remote accesses are 20% more expensive
      
      Balancing cost = Cpte * numa_pte_updates +
      		Cnumahint * numa_hint_faults +
      		Ci * numa_pages_migrated +
      		Cpagecopy * numa_pages_migrated
      
      Note that numa_pages_migrated is used as a measure of how many pages
      were isolated even though it would miss pages that failed to migrate. A
      vmstat counter could have been added for it but the isolation cost is
      pretty marginal in comparison to the overall cost so it seemed overkill.
      
      The ideal way to measure automatic placement benefit would be to count
      the number of remote accesses versus local accesses and do something like
      
      	benefit = (remote_accesses_before - remove_access_after) * Wnuma
      
      but the information is not readily available. As a workload converges, the
      expection would be that the number of remote numa hints would reduce to 0.
      
      	convergence = numa_hint_faults_local / numa_hint_faults
      		where this is measured for the last N number of
      		numa hints recorded. When the workload is fully
      		converged the value is 1.
      
      This can measure if the placement policy is converging and how fast it is
      doing it.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      03c5a6e1
    • M
      mm: compaction: Add scanned and isolated counters for compaction · 397487db
      Mel Gorman 提交于
      Compaction already has tracepoints to count scanned and isolated pages
      but it requires that ftrace be enabled and if that information has to be
      written to disk then it can be disruptive. This patch adds vmstat counters
      for compaction called compact_migrate_scanned, compact_free_scanned and
      compact_isolated.
      
      With these counters, it is possible to define a basic cost model for
      compaction. This approximates of how much work compaction is doing and can
      be compared that with an oprofile showing TLB misses and see if the cost of
      compaction is being offset by THP for example. Minimally a compaction patch
      can be evaluated in terms of whether it increases or decreases cost. The
      basic cost model looks like this
      
      Fundamental unit u:	a word	sizeof(void *)
      
      Ca  = cost of struct page access = sizeof(struct page) / u
      
      Cmc = Cost migrate page copy = (Ca + PAGE_SIZE/u) * 2
      Cmf = Cost migrate failure   = Ca * 2
      Ci  = Cost page isolation    = (Ca + Wi)
      	where Wi is a constant that should reflect the approximate
      	cost of the locking operation.
      
      Csm = Cost migrate scanning = Ca
      Csf = Cost free    scanning = Ca
      
      Overall cost =	(Csm * compact_migrate_scanned) +
      	      	(Csf * compact_free_scanned)    +
      	      	(Ci  * compact_isolated)	+
      		(Cmc * pgmigrate_success)	+
      		(Cmf * pgmigrate_failed)
      
      Where the values are read from /proc/vmstat.
      
      This is very basic and ignores certain costs such as the allocation cost
      to do a migrate page copy but any improvement to the model would still
      use the same vmstat counters.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      397487db
    • M
      mm: compaction: Move migration fail/success stats to migrate.c · 5647bc29
      Mel Gorman 提交于
      The compact_pages_moved and compact_pagemigrate_failed events are
      convenient for determining if compaction is active and to what
      degree migration is succeeding but it's at the wrong level. Other
      users of migration may also want to know if migration is working
      properly and this will be particularly true for any automated
      NUMA migration. This patch moves the counters down to migration
      with the new events called pgmigrate_success and pgmigrate_fail.
      The compact_blocks_moved counter is removed because while it was
      useful for debugging initially, it's worthless now as no meaningful
      conclusions can be drawn from its value.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      5647bc29
  4. 09 10月, 2012 4 次提交
  5. 22 8月, 2012 1 次提交
  6. 01 8月, 2012 1 次提交
    • M
      mm: account for the number of times direct reclaimers get throttled · 68243e76
      Mel Gorman 提交于
      Under significant pressure when writing back to network-backed storage,
      direct reclaimers may get throttled.  This is expected to be a short-lived
      event and the processes get woken up again but processes do get stalled.
      This patch counts how many times such stalling occurs.  It's up to the
      administrator whether to reduce these stalls by increasing
      min_free_kbytes.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      68243e76
  7. 30 5月, 2012 1 次提交
  8. 21 5月, 2012 1 次提交
  9. 26 4月, 2012 1 次提交
  10. 13 1月, 2012 1 次提交
  11. 01 11月, 2011 3 次提交
    • D
      mm/vmstat.c: cache align vm_stat · a1cb2c60
      Dimitri Sivanich 提交于
      Avoid false sharing of the vm_stat array.
      
      This was found to adversely affect tmpfs I/O performance.
      
      Tests run on a 640 cpu UV system.
      
      With 120 threads doing parallel writes, each to different tmpfs mounts:
      No patch:		~300 MB/sec
      With vm_stat alignment:	~430 MB/sec
      Signed-off-by: NDimitri Sivanich <sivanich@sgi.com>
      Acked-by: NChristoph Lameter <cl@gentwo.org>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a1cb2c60
    • M
      mm: vmscan: immediately reclaim end-of-LRU dirty pages when writeback completes · 49ea7eb6
      Mel Gorman 提交于
      When direct reclaim encounters a dirty page, it gets recycled around the
      LRU for another cycle.  This patch marks the page PageReclaim similar to
      deactivate_page() so that the page gets reclaimed almost immediately after
      the page gets cleaned.  This is to avoid reclaiming clean pages that are
      younger than a dirty page encountered at the end of the LRU that might
      have been something like a use-once page.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NJohannes Weiner <jweiner@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      49ea7eb6
    • M
      mm: vmscan: do not writeback filesystem pages in direct reclaim · ee72886d
      Mel Gorman 提交于
      Testing from the XFS folk revealed that there is still too much I/O from
      the end of the LRU in kswapd.  Previously it was considered acceptable by
      VM people for a small number of pages to be written back from reclaim with
      testing generally showing about 0.3% of pages reclaimed were written back
      (higher if memory was low).  That writing back a small number of pages is
      ok has been heavily disputed for quite some time and Dave Chinner
      explained it well;
      
      	It doesn't have to be a very high number to be a problem. IO
      	is orders of magnitude slower than the CPU time it takes to
      	flush a page, so the cost of making a bad flush decision is
      	very high. And single page writeback from the LRU is almost
      	always a bad flush decision.
      
      To complicate matters, filesystems respond very differently to requests
      from reclaim according to Christoph Hellwig;
      
      	xfs tries to write it back if the requester is kswapd
      	ext4 ignores the request if it's a delayed allocation
      	btrfs ignores the request
      
      As a result, each filesystem has different performance characteristics
      when under memory pressure and there are many pages being dirtied.  In
      some cases, the request is ignored entirely so the VM cannot depend on the
      IO being dispatched.
      
      The objective of this series is to reduce writing of filesystem-backed
      pages from reclaim, play nicely with writeback that is already in progress
      and throttle reclaim appropriately when writeback pages are encountered.
      The assumption is that the flushers will always write pages faster than if
      reclaim issues the IO.
      
      A secondary goal is to avoid the problem whereby direct reclaim splices
      two potentially deep call stacks together.
      
      There is a potential new problem as reclaim has less control over how long
      before a page in a particularly zone or container is cleaned and direct
      reclaimers depend on kswapd or flusher threads to do the necessary work.
      However, as filesystems sometimes ignore direct reclaim requests already,
      it is not expected to be a serious issue.
      
      Patch 1 disables writeback of filesystem pages from direct reclaim
      	entirely. Anonymous pages are still written.
      
      Patch 2 removes dead code in lumpy reclaim as it is no longer able
      	to synchronously write pages. This hurts lumpy reclaim but
      	there is an expectation that compaction is used for hugepage
      	allocations these days and lumpy reclaim's days are numbered.
      
      Patches 3-4 add warnings to XFS and ext4 if called from
      	direct reclaim. With patch 1, this "never happens" and is
      	intended to catch regressions in this logic in the future.
      
      Patch 5 disables writeback of filesystem pages from kswapd unless
      	the priority is raised to the point where kswapd is considered
      	to be in trouble.
      
      Patch 6 throttles reclaimers if too many dirty pages are being
      	encountered and the zones or backing devices are congested.
      
      Patch 7 invalidates dirty pages found at the end of the LRU so they
      	are reclaimed quickly after being written back rather than
      	waiting for a reclaimer to find them
      
      I consider this series to be orthogonal to the writeback work but it is
      worth noting that the writeback work affects the viability of patch 8 in
      particular.
      
      I tested this on ext4 and xfs using fs_mark, a simple writeback test based
      on dd and a micro benchmark that does a streaming write to a large mapping
      (exercises use-once LRU logic) followed by streaming writes to a mix of
      anonymous and file-backed mappings.  The command line for fs_mark when
      botted with 512M looked something like
      
      ./fs_mark -d  /tmp/fsmark-2676  -D  100  -N  150  -n  150  -L  25  -t  1  -S0  -s  10485760
      
      The number of files was adjusted depending on the amount of available
      memory so that the files created was about 3xRAM.  For multiple threads,
      the -d switch is specified multiple times.
      
      The test machine is x86-64 with an older generation of AMD processor with
      4 cores.  The underlying storage was 4 disks configured as RAID-0 as this
      was the best configuration of storage I had available.  Swap is on a
      separate disk.  Dirty ratio was tuned to 40% instead of the default of
      20%.
      
      Testing was run with and without monitors to both verify that the patches
      were operating as expected and that any performance gain was real and not
      due to interference from monitors.
      
      Here is a summary of results based on testing XFS.
      
      512M1P-xfs           Files/s  mean                 32.69 ( 0.00%)     34.44 ( 5.08%)
      512M1P-xfs           Elapsed Time fsmark                    51.41     48.29
      512M1P-xfs           Elapsed Time simple-wb                114.09    108.61
      512M1P-xfs           Elapsed Time mmap-strm                113.46    109.34
      512M1P-xfs           Kswapd efficiency fsmark                 62%       63%
      512M1P-xfs           Kswapd efficiency simple-wb              56%       61%
      512M1P-xfs           Kswapd efficiency mmap-strm              44%       42%
      512M-xfs             Files/s  mean                 30.78 ( 0.00%)     35.94 (14.36%)
      512M-xfs             Elapsed Time fsmark                    56.08     48.90
      512M-xfs             Elapsed Time simple-wb                112.22     98.13
      512M-xfs             Elapsed Time mmap-strm                219.15    196.67
      512M-xfs             Kswapd efficiency fsmark                 54%       56%
      512M-xfs             Kswapd efficiency simple-wb              54%       55%
      512M-xfs             Kswapd efficiency mmap-strm              45%       44%
      512M-4X-xfs          Files/s  mean                 30.31 ( 0.00%)     33.33 ( 9.06%)
      512M-4X-xfs          Elapsed Time fsmark                    63.26     55.88
      512M-4X-xfs          Elapsed Time simple-wb                100.90     90.25
      512M-4X-xfs          Elapsed Time mmap-strm                261.73    255.38
      512M-4X-xfs          Kswapd efficiency fsmark                 49%       50%
      512M-4X-xfs          Kswapd efficiency simple-wb              54%       56%
      512M-4X-xfs          Kswapd efficiency mmap-strm              37%       36%
      512M-16X-xfs         Files/s  mean                 60.89 ( 0.00%)     65.22 ( 6.64%)
      512M-16X-xfs         Elapsed Time fsmark                    67.47     58.25
      512M-16X-xfs         Elapsed Time simple-wb                103.22     90.89
      512M-16X-xfs         Elapsed Time mmap-strm                237.09    198.82
      512M-16X-xfs         Kswapd efficiency fsmark                 45%       46%
      512M-16X-xfs         Kswapd efficiency simple-wb              53%       55%
      512M-16X-xfs         Kswapd efficiency mmap-strm              33%       33%
      
      Up until 512-4X, the FSmark improvements were statistically significant.
      For the 4X and 16X tests the results were within standard deviations but
      just barely.  The time to completion for all tests is improved which is an
      important result.  In general, kswapd efficiency is not affected by
      skipping dirty pages.
      
      1024M1P-xfs          Files/s  mean                 39.09 ( 0.00%)     41.15 ( 5.01%)
      1024M1P-xfs          Elapsed Time fsmark                    84.14     80.41
      1024M1P-xfs          Elapsed Time simple-wb                210.77    184.78
      1024M1P-xfs          Elapsed Time mmap-strm                162.00    160.34
      1024M1P-xfs          Kswapd efficiency fsmark                 69%       75%
      1024M1P-xfs          Kswapd efficiency simple-wb              71%       77%
      1024M1P-xfs          Kswapd efficiency mmap-strm              43%       44%
      1024M-xfs            Files/s  mean                 35.45 ( 0.00%)     37.00 ( 4.19%)
      1024M-xfs            Elapsed Time fsmark                    94.59     91.00
      1024M-xfs            Elapsed Time simple-wb                229.84    195.08
      1024M-xfs            Elapsed Time mmap-strm                405.38    440.29
      1024M-xfs            Kswapd efficiency fsmark                 79%       71%
      1024M-xfs            Kswapd efficiency simple-wb              74%       74%
      1024M-xfs            Kswapd efficiency mmap-strm              39%       42%
      1024M-4X-xfs         Files/s  mean                 32.63 ( 0.00%)     35.05 ( 6.90%)
      1024M-4X-xfs         Elapsed Time fsmark                   103.33     97.74
      1024M-4X-xfs         Elapsed Time simple-wb                204.48    178.57
      1024M-4X-xfs         Elapsed Time mmap-strm                528.38    511.88
      1024M-4X-xfs         Kswapd efficiency fsmark                 81%       70%
      1024M-4X-xfs         Kswapd efficiency simple-wb              73%       72%
      1024M-4X-xfs         Kswapd efficiency mmap-strm              39%       38%
      1024M-16X-xfs        Files/s  mean                 42.65 ( 0.00%)     42.97 ( 0.74%)
      1024M-16X-xfs        Elapsed Time fsmark                   103.11     99.11
      1024M-16X-xfs        Elapsed Time simple-wb                200.83    178.24
      1024M-16X-xfs        Elapsed Time mmap-strm                397.35    459.82
      1024M-16X-xfs        Kswapd efficiency fsmark                 84%       69%
      1024M-16X-xfs        Kswapd efficiency simple-wb              74%       73%
      1024M-16X-xfs        Kswapd efficiency mmap-strm              39%       40%
      
      All FSMark tests up to 16X had statistically significant improvements.
      For the most part, tests are completing faster with the exception of the
      streaming writes to a mixture of anonymous and file-backed mappings which
      were slower in two cases
      
      In the cases where the mmap-strm tests were slower, there was more
      swapping due to dirty pages being skipped.  The number of additional pages
      swapped is almost identical to the fewer number of pages written from
      reclaim.  In other words, roughly the same number of pages were reclaimed
      but swapping was slower.  As the test is a bit unrealistic and stresses
      memory heavily, the small shift is acceptable.
      
      4608M1P-xfs          Files/s  mean                 29.75 ( 0.00%)     30.96 ( 3.91%)
      4608M1P-xfs          Elapsed Time fsmark                   512.01    492.15
      4608M1P-xfs          Elapsed Time simple-wb                618.18    566.24
      4608M1P-xfs          Elapsed Time mmap-strm                488.05    465.07
      4608M1P-xfs          Kswapd efficiency fsmark                 93%       86%
      4608M1P-xfs          Kswapd efficiency simple-wb              88%       84%
      4608M1P-xfs          Kswapd efficiency mmap-strm              46%       45%
      4608M-xfs            Files/s  mean                 27.60 ( 0.00%)     28.85 ( 4.33%)
      4608M-xfs            Elapsed Time fsmark                   555.96    532.34
      4608M-xfs            Elapsed Time simple-wb                659.72    571.85
      4608M-xfs            Elapsed Time mmap-strm               1082.57   1146.38
      4608M-xfs            Kswapd efficiency fsmark                 89%       91%
      4608M-xfs            Kswapd efficiency simple-wb              88%       82%
      4608M-xfs            Kswapd efficiency mmap-strm              48%       46%
      4608M-4X-xfs         Files/s  mean                 26.00 ( 0.00%)     27.47 ( 5.35%)
      4608M-4X-xfs         Elapsed Time fsmark                   592.91    564.00
      4608M-4X-xfs         Elapsed Time simple-wb                616.65    575.07
      4608M-4X-xfs         Elapsed Time mmap-strm               1773.02   1631.53
      4608M-4X-xfs         Kswapd efficiency fsmark                 90%       94%
      4608M-4X-xfs         Kswapd efficiency simple-wb              87%       82%
      4608M-4X-xfs         Kswapd efficiency mmap-strm              43%       43%
      4608M-16X-xfs        Files/s  mean                 26.07 ( 0.00%)     26.42 ( 1.32%)
      4608M-16X-xfs        Elapsed Time fsmark                   602.69    585.78
      4608M-16X-xfs        Elapsed Time simple-wb                606.60    573.81
      4608M-16X-xfs        Elapsed Time mmap-strm               1549.75   1441.86
      4608M-16X-xfs        Kswapd efficiency fsmark                 98%       98%
      4608M-16X-xfs        Kswapd efficiency simple-wb              88%       82%
      4608M-16X-xfs        Kswapd efficiency mmap-strm              44%       42%
      
      Unlike the other tests, the fsmark results are not statistically
      significant but the min and max times are both improved and for the most
      part, tests completed faster.
      
      There are other indications that this is an improvement as well.  For
      example, in the vast majority of cases, there were fewer pages scanned by
      direct reclaim implying in many cases that stalls due to direct reclaim
      are reduced.  KSwapd is scanning more due to skipping dirty pages which is
      unfortunate but the CPU usage is still acceptable
      
      In an earlier set of tests, I used blktrace and in almost all cases
      throughput throughout the entire test was higher.  However, I ended up
      discarding those results as recording blktrace data was too heavy for my
      liking.
      
      On a laptop, I plugged in a USB stick and ran a similar tests of tests
      using it as backing storage.  A desktop environment was running and for
      the entire duration of the tests, firefox and gnome terminal were
      launching and exiting to vaguely simulate a user.
      
      1024M-xfs            Files/s  mean               0.41 ( 0.00%)        0.44 ( 6.82%)
      1024M-xfs            Elapsed Time fsmark               2053.52   1641.03
      1024M-xfs            Elapsed Time simple-wb            1229.53    768.05
      1024M-xfs            Elapsed Time mmap-strm            4126.44   4597.03
      1024M-xfs            Kswapd efficiency fsmark              84%       85%
      1024M-xfs            Kswapd efficiency simple-wb           92%       81%
      1024M-xfs            Kswapd efficiency mmap-strm           60%       51%
      1024M-xfs            Avg wait ms fsmark                5404.53     4473.87
      1024M-xfs            Avg wait ms simple-wb             2541.35     1453.54
      1024M-xfs            Avg wait ms mmap-strm             3400.25     3852.53
      
      The mmap-strm results were hurt because firefox launching had a tendency
      to push the test out of memory.  On the postive side, firefox launched
      marginally faster with the patches applied.  Time to completion for many
      tests was faster but more importantly - the "Avg wait" time as measured by
      iostat was far lower implying the system would be more responsive.  It was
      also the case that "Avg wait ms" on the root filesystem was lower.  I
      tested it manually and while the system felt slightly more responsive
      while copying data to a USB stick, it was marginal enough that it could be
      my imagination.
      
      This patch: do not writeback filesystem pages in direct reclaim.
      
      When kswapd is failing to keep zones above the min watermark, a process
      will enter direct reclaim in the same manner kswapd does.  If a dirty page
      is encountered during the scan, this page is written to backing storage
      using mapping->writepage.
      
      This causes two problems.  First, it can result in very deep call stacks,
      particularly if the target storage or filesystem are complex.  Some
      filesystems ignore write requests from direct reclaim as a result.  The
      second is that a single-page flush is inefficient in terms of IO.  While
      there is an expectation that the elevator will merge requests, this does
      not always happen.  Quoting Christoph Hellwig;
      
      	The elevator has a relatively small window it can operate on,
      	and can never fix up a bad large scale writeback pattern.
      
      This patch prevents direct reclaim writing back filesystem pages by
      checking if current is kswapd.  Anonymous pages are still written to swap
      as there is not the equivalent of a flusher thread for anonymous pages.
      If the dirty pages cannot be written back, they are placed back on the LRU
      lists.  There is now a direct dependency on dirty page balancing to
      prevent too many pages in the system being dirtied which would prevent
      reclaim making forward progress.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ee72886d
  12. 15 9月, 2011 1 次提交
  13. 25 5月, 2011 2 次提交
  14. 15 4月, 2011 2 次提交
  15. 23 3月, 2011 1 次提交
    • A
      mm: add __GFP_OTHER_NODE flag · 78afd561
      Andi Kleen 提交于
      Add a new __GFP_OTHER_NODE flag to tell the low level numa statistics in
      zone_statistics() that an allocation is on behalf of another thread.  This
      way the local and remote counters can be still correct, even when
      background daemons like khugepaged are changing memory mappings.
      
      This only affects the accounting, but I think it's worth doing that right
      to avoid confusing users.
      
      I first tried to just pass down the right node, but this required a lot of
      changes to pass down this parameter and at least one addition of a 10th
      argument to a 9 argument function.  Using the flag is a lot less
      intrusive.
      
      Open: should be also used for migration?
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      78afd561
  16. 14 1月, 2011 3 次提交
    • A
      thp: transparent hugepage vmstat · 79134171
      Andrea Arcangeli 提交于
      Add hugepage stat information to /proc/vmstat and /proc/meminfo.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      79134171
    • M
      mm: vmstat: use a single setter function and callback for adjusting percpu thresholds · b44129b3
      Mel Gorman 提交于
      reduce_pgdat_percpu_threshold() and restore_pgdat_percpu_threshold() exist
      to adjust the per-cpu vmstat thresholds while kswapd is awake to avoid
      errors due to counter drift.  The functions duplicate some code so this
      patch replaces them with a single set_pgdat_percpu_threshold() that takes
      a callback function to calculate the desired threshold as a parameter.
      
      [akpm@linux-foundation.org: readability tweak]
      [kosaki.motohiro@jp.fujitsu.com: set_pgdat_percpu_threshold(): don't use for_each_online_cpu]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NChristoph Lameter <cl@linux.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b44129b3
    • M
      mm: page allocator: adjust the per-cpu counter threshold when memory is low · 88f5acf8
      Mel Gorman 提交于
      Commit aa454840 ("calculate a better estimate of NR_FREE_PAGES when memory
      is low") noted that watermarks were based on the vmstat NR_FREE_PAGES.  To
      avoid synchronization overhead, these counters are maintained on a per-cpu
      basis and drained both periodically and when a threshold is above a
      threshold.  On large CPU systems, the difference between the estimate and
      real value of NR_FREE_PAGES can be very high.  The system can get into a
      case where pages are allocated far below the min watermark potentially
      causing livelock issues.  The commit solved the problem by taking a better
      reading of NR_FREE_PAGES when memory was low.
      
      Unfortately, as reported by Shaohua Li this accurate reading can consume a
      large amount of CPU time on systems with many sockets due to cache line
      bouncing.  This patch takes a different approach.  For large machines
      where counter drift might be unsafe and while kswapd is awake, the per-cpu
      thresholds for the target pgdat are reduced to limit the level of drift to
      what should be a safe level.  This incurs a performance penalty in heavy
      memory pressure by a factor that depends on the workload and the machine
      but the machine should function correctly without accidentally exhausting
      all memory on a node.  There is an additional cost when kswapd wakes and
      sleeps but the event is not expected to be frequent - in Shaohua's test
      case, there was one recorded sleep and wake event at least.
      
      To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is
      introduced that takes a more accurate reading of NR_FREE_PAGES when called
      from wakeup_kswapd, when deciding whether it is really safe to go back to
      sleep in sleeping_prematurely() and when deciding if a zone is really
      balanced or not in balance_pgdat().  We are still using an expensive
      function but limiting how often it is called.
      
      When the test case is reproduced, the time spent in the watermark
      functions is reduced.  The following report is on the percentage of time
      spent cumulatively spent in the functions zone_nr_free_pages(),
      zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(),
      zone_page_state_snapshot(), zone_page_state().
      
      vanilla                      11.6615%
      disable-threshold            0.2584%
      
      David said:
      
      : We had to pull aa454840 "mm: page allocator: calculate a better estimate
      : of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
      : internally because tests showed that it would cause the machine to stall
      : as the result of heavy kswapd activity.  I merged it back with this fix as
      : it is pending in the -mm tree and it solves the issue we were seeing, so I
      : definitely think this should be pushed to -stable (and I would seriously
      : consider it for 2.6.37 inclusion even at this late date).
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Reported-by: NShaohua Li <shaohua.li@intel.com>
      Reviewed-by: NChristoph Lameter <cl@linux.com>
      Tested-by: NNicolas Bareil <nico@chdir.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: <stable@kernel.org>		[2.6.37.1, 2.6.36.x]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      88f5acf8
  17. 18 12月, 2010 1 次提交
    • C
      vmstat: User per cpu atomics to avoid interrupt disable / enable · 7c839120
      Christoph Lameter 提交于
      Currently the operations to increment vm counters must disable interrupts
      in order to not mess up their housekeeping of counters.
      
      So use this_cpu_cmpxchg() to avoid the overhead. Since we can no longer
      count on preremption being disabled we still have some minor issues.
      The fetching of the counter thresholds is racy.
      A threshold from another cpu may be applied if we happen to be
      rescheduled on another cpu.  However, the following vmstat operation
      will then bring the counter again under the threshold limit.
      
      The operations for __xxx_zone_state are not changed since the caller
      has taken care of the synchronization needs (and therefore the cycle
      count is even less than the optimized version for the irq disable case
      provided here).
      
      The optimization using this_cpu_cmpxchg will only be used if the arch
      supports efficient this_cpu_ops (must have CONFIG_CMPXCHG_LOCAL set!)
      
      The use of this_cpu_cmpxchg reduces the cycle count for the counter
      operations by %80 (inc_zone_page_state goes from 170 cycles to 32).
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      7c839120
  18. 17 12月, 2010 2 次提交
  19. 15 12月, 2010 1 次提交
    • T
      workqueue: convert cancel_rearming_delayed_work[queue]() users to cancel_delayed_work_sync() · afe2c511
      Tejun Heo 提交于
      cancel_rearming_delayed_work[queue]() has been superceded by
      cancel_delayed_work_sync() quite some time ago.  Convert all the
      in-kernel users.  The conversions are completely equivalent and
      trivial.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: N"David S. Miller" <davem@davemloft.net>
      Acked-by: NGreg Kroah-Hartman <gregkh@suse.de>
      Acked-by: NEvgeniy Polyakov <zbr@ioremap.net>
      Cc: Jeff Garzik <jgarzik@pobox.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
      Cc: netdev@vger.kernel.org
      Cc: Anton Vorontsov <cbou@mail.ru>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: xfs-masters@oss.sgi.com
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: netfilter-devel@vger.kernel.org
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: linux-nfs@vger.kernel.org
      afe2c511
  20. 03 12月, 2010 1 次提交
  21. 04 11月, 2010 1 次提交
    • W
      vmstat: fix offset calculation on void* · ff8b16d7
      Wu Fengguang 提交于
      Fix regression introduced by commit 79da826a ("writeback: report
      dirty thresholds in /proc/vmstat").
      
      The incorrect pointer arithmetic can result in problems like this:
      
        BUG: unable to handle kernel paging request at 07c06d16
        IP: [<c050c336>] strnlen+0x6/0x20
        Call Trace:
         [<c050a249>] ? string+0x39/0xe0
         [<c042be6b>] ? __wake_up_common+0x4b/0x80
         [<c050afcc>] ? vsnprintf+0x1ec/0x380
         [<c04b380e>] ? seq_printf+0x2e/0x60
         [<c04829a6>] ? vmstat_show+0x26/0x30
         [<c04b3bb6>] ? seq_read+0xa6/0x380
         [<c04b3b10>] ? seq_read+0x0/0x380
         [<c04d5d2f>] ? proc_reg_read+0x5f/0x90
         [<c049c4a1>] ? vfs_read+0xa1/0x140
         [<c04d5cd0>] ? proc_reg_read+0x0/0x90
         [<c049c981>] ? sys_read+0x41/0x70
         [<c0402bd0>] ? sysenter_do_call+0x12/0x26
      Reported-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Michael Rubin <mrubin@google.com>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ff8b16d7
  22. 27 10月, 2010 3 次提交
  23. 10 9月, 2010 2 次提交