1. 16 Jun 2011, 1 commit
  2. 27 May 2011, 1 commit
    • memcg: count the soft_limit reclaim in global background reclaim · 0ae5e89c
      Authored by Ying Han
      The global kswapd scans the per-zone LRU and reclaims pages regardless of the
      cgroup. It breaks memory isolation since one cgroup can end up reclaiming
      pages from another cgroup. Instead we should rely on memcg-aware target
      reclaim including per-memcg kswapd and soft_limit hierarchical reclaim under
      memory pressure.
      
      In the global background reclaim, we do soft reclaim before scanning the
      per-zone LRU. However, the return value is ignored. This patch is the first
      step to skip shrink_zone() if soft_limit reclaim does enough work.
      
      This is part of an effort to reduce reclaim of memcg pages through the global
      LRU.  The per-memcg background reclaim patchset further enhances per-cgroup
      targeted reclaim; I should have V4 of it posted shortly.
      
      To see the effect, run multiple memory-intensive workloads within separate
      memcgs and watch the soft_steal counters in memory.stat:
      
        $ cat /dev/cgroup/A/memory.stat | grep 'soft'
        soft_steal 240000
        soft_scan 240000
        total_soft_steal 240000
        total_soft_scan 240000
      
      This patch:
      
      In the global background reclaim, we do soft reclaim before scanning the
      per-zone LRU.  However, the return value is ignored.
      
      We would like to skip shrink_zone() if soft_limit reclaim does enough work.
      We also need to balance memory pressure across per-memcg zones, mirroring
      the core VM logic.  This patch is the first step: it counts the nr_scanned
      and nr_reclaimed from soft_limit reclaim into the global scan_control.
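      A minimal sketch of that accounting as it might appear in balance_pgdat()
      (kernel-style fragment; the extra *total_scanned argument is how soft_limit
      reclaim reports its scan activity back -- treat the exact signature as
      approximate):

      	unsigned long nr_soft_reclaimed;
      	unsigned long nr_soft_scanned = 0;

      	nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone, order,
      							  sc.gfp_mask,
      							  &nr_soft_scanned);
      	/* count soft-limit work in the global scan_control ... */
      	sc.nr_reclaimed += nr_soft_reclaimed;
      	sc.nr_scanned   += nr_soft_scanned;
      	/*
      	 * ... so that a later step can skip shrink_zone() when soft_limit
      	 * reclaim already did enough work for this zone.
      	 */
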
      Signed-off-by: Ying Han <yinghan@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 23 Mar 2011, 2 commits
    • mm: vmscan: kswapd should not free an excessive number of pages when balancing small zones · 8afdcece
      Authored by Mel Gorman
      When reclaiming for order-0 pages, kswapd requires that all zones be
      balanced.  Each cycle through balance_pgdat() does background ageing on
      all zones if necessary and applies equal pressure to the inactive list of
      each zone unless a lot of pages are free already.
      
      A "lot of free pages" is defined as a "balance gap" above the high
      watermark which is currently 7*high_watermark.  Historically this was
      reasonable as min_free_kbytes was small.  However, on systems using huge
      pages, it is recommended that min_free_kbytes is higher and it is tuned
      with hugeadm --set-recommended-min_free_kbytes.  With the introduction of
      transparent huge page support, this recommended value is also applied.  On
      X86-64 with 4G of memory, min_free_kbytes becomes 67584 so one would
      expect around 68M of memory to be free.  The Normal zone is approximately
      35000 pages so under even normal memory pressure such as copying a large
      file, it gets exhausted quickly.  As it is getting exhausted, kswapd
      applies pressure equally to all zones, including the DMA32 zone.  DMA32 is
      approximately 700,000 pages with a high watermark of around 23,000 pages.
      In this situation, kswapd will reclaim around 23000*8 pages (the high
      watermark plus a balance gap of 7 * high watermark), or roughly 718M,
      before the zone is ignored.  What the user sees is free memory far
      higher than it should be.
      
      To avoid an excessive number of pages being reclaimed from the larger
      zones, explicitly define the "balance gap" to be either 1% of the zone
      or the low watermark for the zone, whichever is smaller.  While kswapd
      will still check all zones to apply pressure, it ignores zones that meet
      the (high_wmark + balance_gap) watermark.
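      A sketch of the gap calculation described above (kernel-style fragment;
      the ratio constant follows the "1% of the zone" rule in the description,
      the helper is only for illustration):

      	/* gap above high_wmark before a zone is considered balanced */
      	#define KSWAPD_ZONE_BALANCE_GAP_RATIO	100	/* i.e. 1% of the zone */

      	static unsigned long kswapd_balance_gap(struct zone *zone)
      	{
      		unsigned long one_percent;

      		one_percent = (zone->present_pages +
      			       KSWAPD_ZONE_BALANCE_GAP_RATIO - 1) /
      			      KSWAPD_ZONE_BALANCE_GAP_RATIO;

      		/* whichever is smaller: 1% of the zone or its low watermark */
      		return min(low_wmark_pages(zone), one_percent);
      	}

      balance_pgdat() then only applies pressure to a zone that is still below
      high_wmark_pages(zone) plus this gap.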
      
      To test this, 80G were copied from a partition and the amount of memory
      being used was recorded.  A comparison of patched and unpatched kernels can
      be seen at
      http://www.csn.ul.ie/~mel/postings/minfree-20110222/memory-usage-hydra.ps
      and shows that kswapd is not reclaiming as much memory with the patch
      applied.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: "Chen, Tim C" <tim.c.chen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: deactivate invalidated pages · 31560180
      Authored by Minchan Kim
      Recently, a thrashing problem was reported
      (http://marc.info/?l=rsync&m=128885034930933&w=2).  It is triggered by backup
      workloads (e.g. nightly rsync).  Such a workload creates use-once pages but
      touches each page twice, which promotes the pages onto the active list and
      results in working set eviction.
      
      Some application developers want POSIX_FADV_NOREUSE support, but other OSes
      don't implement it either.
      (http://marc.info/?l=linux-mm&m=128928979512086&w=2)
      
      As an alternative, application developers use POSIX_FADV_DONTNEED, but it
      has a problem: if the kernel finds the page under writeback during
      invalidate_mapping_pages, it cannot drop it.  That makes the hint hard to
      use, since applications always have to sync their data before calling
      fadvise(..., POSIX_FADV_DONTNEED) to make sure the pages are discardable.
      In the end they cannot benefit from the kernel's deferred writes and may
      see a performance loss.
      (http://insights.oetiker.ch/linux/fadvise.html)
      
      In fact, invalidation is a very strong hint to the reclaimer: it means we
      don't use the page any more.  So let's move a page we cannot truncate right
      now to the head of the inactive list.

      Why the head of the LRU?  A dirty or writeback page will be flushed sooner
      or later anyway, so by the time it reaches the tail of the inactive list the
      flusher has likely cleaned it, avoiding writeout via pageout(), which is
      less efficient than the flusher's writeout.
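      A sketch of the resulting flow in invalidate_mapping_pages() (fragment;
      deactivate_page() stands for the helper this patch adds to move a page to
      the head of the inactive list -- treat the details as illustrative):

      	ret = invalidate_inode_page(page);
      	unlock_page(page);
      	if (!ret)
      		/*
      		 * Could not be truncated right now (dirty, under writeback
      		 * or still mapped): at least demote it, so reclaim finds it
      		 * soon after the flusher has cleaned it.
      		 */
      		deactivate_page(page);
      	count += ret;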
      
      Originally, I reused Peter's lru_demote with some changes, so his
      Signed-off-by is included.
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Reported-by: Ben Gamari <bgamari.foss@gmail.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 10 Mar 2011, 1 commit
  5. 14 Jan 2011, 1 commit
    • thp: transparent hugepage core · 71e3aac0
      Authored by Andrea Arcangeli
      Lately I've been working to make KVM use hugepages transparently without
      the usual restrictions of hugetlbfs.  Some of the restrictions I'd like to
      see removed:
      
      1) hugepages have to be swappable or the guest physical memory remains
         locked in RAM and can't be paged out to swap
      
      2) if a hugepage allocation fails, regular pages should be allocated
         instead and mixed in the same vma without any failure and without
         userland noticing
      
      3) if some task quits and more hugepages become available in the
         buddy, guest physical memory backed by regular pages should be
         relocated on hugepages automatically in regions under
         madvise(MADV_HUGEPAGE) (ideally event driven by waking up the
         kernel daemon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes
         non-empty)
      
      4) avoidance of reservation and maximization of use of hugepages whenever
         possible. Reservation (needed to avoid runtime fatal failures) may be ok for
         1 machine with 1 database with 1 database cache with 1 database cache size
         known at boot time. It's definitely not feasible with a virtualization
         hypervisor usage like RHEV-H that runs an unknown number of virtual machines
         with an unknown size of each virtual machine with an unknown amount of
         pagecache that could be potentially useful in the host for guests not using
         O_DIRECT (aka cache=off).
      
      hugepages in the virtualization hypervisor (and also in the guest!) are
      much more important than in a regular host not using virtualization,
      because with NPT/EPT they decrease the tlb-miss cacheline accesses from 24
      to 19 in case only the hypervisor uses transparent hugepages, and from 19 to
      15 in case both the Linux hypervisor and the Linux guest use this patch
      (though the guest will limit the additional speedup to anonymous regions
      only for now...).  Even more important is that the tlb miss handler is much slower
      on a NPT/EPT guest than for a regular shadow paging or no-virtualization
      scenario.  So maximizing the amount of virtual memory cached by the TLB
      pays off significantly more with NPT/EPT than without (even if there would
      be no significant speedup in the tlb-miss runtime).
      
      The first (and more tedious) part of this work requires allowing the VM to
      handle anonymous hugepages mixed with regular pages transparently on
      regular anonymous vmas.  This is what this patch tries to achieve in the
      least intrusive possible way.  We want hugepages and hugetlb to be used in
      a way so that all applications can benefit without changes (as usual we
      leverage the KVM virtualization design: by improving the Linux VM at
      large, KVM gets the performance boost too).
      
      The most important design choice is: always fallback to 4k allocation if
      the hugepage allocation fails!  This is the _very_ opposite of some large
      pagecache patches that failed with -EIO back then if a 64k (or similar)
      allocation failed...
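      A conceptual sketch of that fallback rule at anonymous fault time (heavily
      simplified; thp_alloc_hugepage() and thp_map_huge_pmd() are hypothetical
      helpers used only for illustration, and the real fallback happens inside
      the huge fault handler itself):

      	/* conceptual fragment only -- not the real fault-path structure */
      	if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
      		struct page *hpage = thp_alloc_hugepage(vma);	/* hypothetical */

      		if (hpage)
      			return thp_map_huge_pmd(mm, vma, address,	/* hypothetical */
      						pmd, hpage);
      		/*
      		 * Hugepage allocation failed: fall through and satisfy the
      		 * fault with regular 4k pages in the same vma, invisibly to
      		 * userland.
      		 */
      	}
      	/* ... ordinary 4k pte fault path ... */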
      
      Second important decision (to reduce the impact of the feature on the
      existing pagetable handling code) is that at any time we can split an
      hugepage into 512 regular pages and it has to be done with an operation
      that can't fail.  This way the reliability of the swapping isn't decreased
      (no need to allocate memory when we are short on memory to swap) and it's
      trivial to plug a split_huge_page* one-liner where needed without
      polluting the VM.  Over time we can teach mprotect, mremap and friends to
      handle pmd_trans_huge natively without calling split_huge_page*.  The fact
      it can't fail isn't just for swap: if split_huge_page would return -ENOMEM
      (instead of the current void) we'd need to rollback the mprotect from the
      middle of it (ideally including undoing the split_vma) which would be a
      big change and in the very wrong direction (it'd likely be simpler not to
      call split_huge_page at all and to teach mprotect and friends to handle
      hugepages instead of rolling them back from the middle).  In short the
      very value of split_huge_page is that it can't fail.
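      A sketch of that "one-liner plug" (fragment; this is the shape used by code
      that only understands ptes until it learns to handle huge pmds natively):

      	/*
      	 * split_huge_page_pmd() cannot fail, so no error handling or
      	 * rollback is needed at the call site.
      	 */
      	split_huge_page_pmd(mm, pmd);
      	/* ... continue walking and modifying individual pte entries ... */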
      
      The collapsing and madvise(MADV_HUGEPAGE) part will remain separated and
      incremental and it'll just be an "harmless" addition later if this initial
      part is agreed upon.  It also should be noted that locking-wise replacing
      regular pages with hugepages is going to be very easy if compared to what
      I'm doing below in split_huge_page, as it will only happen when
      page_count(page) matches page_mapcount(page) if we can take the PG_lock
      and mmap_sem in write mode.  collapse_huge_page will be a "best effort"
      that (unlike split_huge_page) can fail at the minimal sign of trouble and
      we can try again later.  collapse_huge_page will be similar to how KSM
      works and the madvise(MADV_HUGEPAGE) will work similar to
      madvise(MADV_MERGEABLE).
      
      The default I like is that transparent hugepages are used at page fault
      time.  This can be changed with
      /sys/kernel/mm/transparent_hugepage/enabled.  The control knob can be set
      to three values "always", "madvise", "never" which mean respectively that
      hugepages are always used, or only inside madvise(MADV_HUGEPAGE) regions,
      or never used.  /sys/kernel/mm/transparent_hugepage/defrag instead
      controls if the hugepage allocation should defrag memory aggressively
      "always", only inside "madvise" regions, or "never".
      
      The pmd_trans_splitting/pmd_trans_huge locking is very solid.  The
      put_page (from get_user_page users that can't use mmu notifier like
      O_DIRECT) that runs against a __split_huge_page_refcount instead was a
      pain to serialize in a way that would result always in a coherent page
      count for both tail and head.  I think my locking solution with a
      compound_lock taken only after the page_first is valid and is still a
      PageHead should be safe but it surely needs review from SMP race point of
      view.  In short there is no current existing way to serialize the O_DIRECT
      final put_page against split_huge_page_refcount so I had to invent a new
      one (O_DIRECT loses knowledge on the mapping status by the time gup_fast
      returns so...).  And I didn't want to impact all gup/gup_fast users for
      now, maybe if we change the gup interface substantially we can avoid this
      locking, I admit I didn't think too much about it because changing the gup
      unpinning interface would be invasive.
      
      If we ignored O_DIRECT we could stick to the existing compound refcounting
      code, by simply adding a get_user_pages_fast_flags(foll_flags) where KVM
      (and any other mmu notifier user) would call it without FOLL_GET (and if
      FOLL_GET isn't set we'd just BUG_ON if nobody registered itself in the
      current task mmu notifier list yet).  But O_DIRECT is fundamental for
      decent performance of virtualized I/O on fast storage so we can't avoid it
      to solve the race of put_page against split_huge_page_refcount to achieve
      a complete hugepage feature for KVM.
      
      Swap and oom work fine (well, just like with regular pages ;).  MMU
      notifier is handled transparently too, with the exception of the young bit
      on the pmd, that didn't have a range check but I think KVM will be fine
      because the whole point of hugepages is that EPT/NPT will also use a huge
      pmd when they notice gup returns pages with PageCompound set, so they
      won't care of a range and there's just the pmd young bit to check in that
      case.
      
      NOTE: in some cases, if the L2 cache is small, this may slow things down and
      waste memory during COWs because 4M of memory are accessed in a single fault
      instead of 8k (the payoff is that after the COW the program can run faster).
      So we might want to switch copy_huge_page (and clear_huge_page too) to
      non-temporal stores.  I also extensively researched ways to avoid this cache
      thrashing with a full prefault logic that would cow in 8k/16k/32k/64k chunks
      up to 1M (I can send those patches that fully implemented prefault) but I
      concluded they're not worth it: they add huge additional complexity and they
      remove all tlb benefits until the full hugepage has been faulted in, to save
      a little bit of memory and some cache during app startup, but they still
      don't substantially improve the cache thrashing during startup if the
      prefault happens in >4k chunks.  One reason is that those copied 4k pte
      entries are still mapped on a perfectly cache-colored hugepage, so the
      thrashing is the worst one can generate in those copies (COWs of 4k page
      copies aren't so well colored so they thrash less, but again this results
      in software running faster after the page fault).  Those prefault patches
      allowed things like a pte where post-COW pages were local 4k regular anon
      pages and the not-yet-COWed pte entries were pointing into the middle of
      some hugepage mapped read-only.  If it doesn't pay off substantially with
      today's hardware it will pay off even less in the future with larger L2
      caches, and the prefault logic would bloat the VM a lot.  For embedded use,
      transparent_hugepage can be disabled during boot with sysfs or with the
      boot commandline parameter transparent_hugepage=0 (or transparent_hugepage=2
      to restrict hugepages inside madvise regions), which will ensure not a
      single hugepage is allocated at boot time.  It is simple enough to just
      disable transparent hugepage globally and let transparent hugepages be
      allocated selectively by applications in the MADV_HUGEPAGE regions (both at
      page fault time, and if enabled, through collapse_huge_page via the kernel
      daemon).
      
      This patch supports only hugepages mapped in the pmd, archs that have
      smaller hugepages will not fit in this patch alone.  Also some archs like
      power have certain tlb limits that prevent mixing different page sizes in
      the same regions so they will not fit in this framework that requires
      "graceful fallback" to basic PAGE_SIZE in case of physical memory
      fragmentation.  hugetlbfs remains a perfect fit for those because its
      software limits happen to match the hardware limits.  hugetlbfs also
      remains a perfect fit for hugepage sizes like 1GByte that cannot be hoped
      to be found not fragmented after a certain system uptime and that would be
      very expensive to defragment with relocation, so requiring reservation.
      hugetlbfs is the "reservation way", the point of transparent hugepages is
      not to have any reservation at all and maximizing the use of cache and
      hugepages at all times automatically.
      
      Some performance result:
      
      vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
      memset page fault 1566023
      memset tlb miss 453854
      memset second tlb miss 453321
      random access tlb miss 41635
      random access second tlb miss 41658
      vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
      memset page fault 1566471
      memset tlb miss 453375
      memset second tlb miss 453320
      random access tlb miss 41636
      random access second tlb miss 41637
      vmx andrea # ./largepages3
      memset page fault 1566642
      memset tlb miss 453417
      memset second tlb miss 453313
      random access tlb miss 41630
      random access second tlb miss 41647
      vmx andrea # ./largepages3
      memset page fault 1566872
      memset tlb miss 453418
      memset second tlb miss 453315
      random access tlb miss 41618
      random access second tlb miss 41659
      vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
      vmx andrea # ./largepages3
      memset page fault 2182476
      memset tlb miss 460305
      memset second tlb miss 460179
      random access tlb miss 44483
      random access second tlb miss 44186
      vmx andrea # ./largepages3
      memset page fault 2182791
      memset tlb miss 460742
      memset second tlb miss 459962
      random access tlb miss 43981
      random access second tlb miss 43988
      
      ============
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/time.h>
      
      #define SIZE (3UL*1024*1024*1024)
      
      int main()
      {
      	char *p = malloc(SIZE), *p2;
      	struct timeval before, after;
      
      	gettimeofday(&before, NULL);
      	memset(p, 0, SIZE);
      	gettimeofday(&after, NULL);
      	printf("memset page fault %Lu\n",
      	       (after.tv_sec-before.tv_sec)*1000000UL +
      	       after.tv_usec-before.tv_usec);
      
      	gettimeofday(&before, NULL);
      	memset(p, 0, SIZE);
      	gettimeofday(&after, NULL);
      	printf("memset tlb miss %Lu\n",
      	       (after.tv_sec-before.tv_sec)*1000000UL +
      	       after.tv_usec-before.tv_usec);
      
      	gettimeofday(&before, NULL);
      	memset(p, 0, SIZE);
      	gettimeofday(&after, NULL);
      	printf("memset second tlb miss %Lu\n",
      	       (after.tv_sec-before.tv_sec)*1000000UL +
      	       after.tv_usec-before.tv_usec);
      
      	gettimeofday(&before, NULL);
      	for (p2 = p; p2 < p+SIZE; p2 += 4096)
      		*p2 = 0;
      	gettimeofday(&after, NULL);
      	printf("random access tlb miss %Lu\n",
      	       (after.tv_sec-before.tv_sec)*1000000UL +
      	       after.tv_usec-before.tv_usec);
      
      	gettimeofday(&before, NULL);
      	for (p2 = p; p2 < p+SIZE; p2 += 4096)
      		*p2 = 0;
      	gettimeofday(&after, NULL);
      	printf("random access second tlb miss %Lu\n",
      	       (after.tv_sec-before.tv_sec)*1000000UL +
      	       after.tv_usec-before.tv_usec);
      
      	return 0;
      }
      ============
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 27 Oct 2010, 1 commit
  7. 10 Sep 2010, 2 commits
    • swap: discard while swapping only if SWAP_FLAG_DISCARD · 33994466
      Authored by Hugh Dickins
      Tests with recent firmware on Intel X25-M 80GB and OCZ Vertex 60GB SSDs
      show a shift since I last tested in December: in part because of firmware
      updates, in part because of the necessary move from barriers to awaiting
      completion at the block layer.  While discard at swapon still shows as
      slightly beneficial on both, discarding each 1MB swap cluster when allocating
      is now disadvantageous: it adds 25% overhead on Intel and 230% on OCZ (YMMV).
      
      Surrender: discard as presently implemented is more hindrance than help
      for swap; but might prove useful on other devices, or with improvements.
      So continue to do the discard at swapon, but make discard while swapping
      conditional on a new SWAP_FLAG_DISCARD flag passed to sys_swapon() (which
      has so far been using only the lower 16 bits of its int flags).
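      A sketch of the resulting check in the swapon path (fragment; the flag sits
      above the 16 bits historically used for priority -- treat the surrounding
      details as illustrative):

      	#define SWAP_FLAG_DISCARD	0x10000	/* enable discard while swapping */

      	/*
      	 * The one-off discard of the whole area still happens at swapon
      	 * time; the per-cluster discard during swap allocation is now
      	 * opt-in via SWAP_FLAG_DISCARD.
      	 */
      	if (p->bdev && discard_swap(p) == 0 &&
      	    (swap_flags & SWAP_FLAG_DISCARD))
      		p->flags |= SWP_DISCARDABLE;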
      
      We can add a --discard or -d to swapon(8), and a "discard" to swap in
      /etc/fstab: matching the mount option for btrfs, ext4, fat, gfs2, nilfs2.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Nigel Cunningham <nigel@tuxonice.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <jaxboe@fusionio.com>
      Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swap: revert special hibernation allocation · 910321ea
      Authored by Hugh Dickins
      Please revert 2.6.36-rc commit d2997b10
      "hibernation: freeze swap at hibernation".  It complicated matters by
      adding a second swap allocation path, just for hibernation; without in any
      way fixing the issue that it was intended to address - page reclaim after
      fixing the hibernation image might free swap from a page already imaged as
      swapcache, letting its swap be reallocated to store a different page of
      the image: resulting in data corruption if the imaged page were freed as
      clean then swapped back in.  Pages freed to si->swap_map were still in
      danger of being reallocated by the alternative allocation path.
      
      I guess it inadvertently fixed slow SSD swap allocation for hibernation,
      as reported by Nigel Cunningham: by missing out the discards that occur on
      the usual swap allocation path; but that was unintentional, and needs a
      separate fix.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Ondrej Zary <linux@rainbow-software.org>
      Cc: Andrea Gelmini <andrea.gelmini@gmail.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nigel Cunningham <nigel@tuxonice.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 11 Aug 2010, 1 commit
  9. 10 Aug 2010, 1 commit
    • hibernation: freeze swap at hibernation · d2997b10
      Authored by KAMEZAWA Hiroyuki
      When taking a memory snapshot in hibernate_snapshot(), all (directly
      called) memory allocations use GFP_ATOMIC.  Hence swap misusage during
      hibernation never occurs.
      
      But from a pessimistic point of view, there is no guarantee that no page
      allocation has __GFP_WAIT.  It is better to have a global indication "we
      are entering hibernation, don't use swap!".

      This patch tries to freeze new swap allocation during hibernation.  (All
      user processes are frozen, so swapin is not a concern.)
      
      This way, no updates will happen to swap_map[] between
      hibernate_snapshot() and save_image().  Swap is thawed when swsusp_free()
      is called.  We can be assured that swap corruption will not occur.
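      A minimal sketch of the idea (illustrative only; the flag and helper names
      here are hypothetical, not the exact symbols added by the patch):

      	/* hypothetical global "hibernation in progress" indication */
      	static bool swap_frozen;

      	void freeze_swap_for_hibernation(void)	/* before hibernate_snapshot() */
      	{
      		swap_frozen = true;
      	}

      	void thaw_swap_for_hibernation(void)	/* from swsusp_free() */
      	{
      		swap_frozen = false;
      	}

      	/*
      	 * The swap slot allocator behaves as if swap were full while frozen,
      	 * so swap_map[] cannot change between snapshot and save_image().
      	 */
      	static inline bool swap_allocation_allowed(void)
      	{
      		return !swap_frozen;
      	}
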
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Ondrej Zary <linux@rainbow-software.org>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 28 May 2010, 1 commit
    • memcg: move charge of file pages · 87946a72
      Authored by Daisuke Nishimura
      This patch adds support for moving charge of file pages, which include
      normal file, tmpfs file and swaps of tmpfs file.  It's enabled by setting
      bit 1 of <target cgroup>/memory.move_charge_at_immigrate.
      
      Unlike the case of anonymous pages, file pages (and swaps) in the range
      mmapped by the task will be moved even if the task hasn't faulted them in
      yet, i.e.  they might not be part of the task's "RSS" but of another task's
      "RSS" that maps the same file.  The mapcount of the page is also ignored
      (the page can be moved even if page_mapcount(page) > 1).  So the conditions
      for a page/swap to be moved are that it must be in the range mmapped by the
      target task and it must be charged to the old cgroup.
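      An illustrative fragment of the resulting move test (the bit macro and the
      charge-check helper here are hypothetical shorthand, not necessarily the
      in-tree names):

      	#define MOVE_CHARGE_TYPE_FILE	(1 << 1)  /* bit 1 of move_charge_at_immigrate */

      	static bool file_page_is_move_target(struct page *page,
      					     struct mem_cgroup *from,
      					     unsigned long move_flags)
      	{
      		if (!(move_flags & MOVE_CHARGE_TYPE_FILE))
      			return false;		/* file-page moving not enabled */
      		/* mapcount is deliberately ignored; only the charge matters */
      		return page_is_charged_to(page, from);	/* hypothetical helper */
      	}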
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix warning]
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 25 May 2010, 3 commits
  12. 19 May 2010, 1 commit
  13. 13 Mar 2010, 1 commit
    • memcg: move charges of anonymous swap · 02491447
      Authored by Daisuke Nishimura
      This patch is another core part of this move-charge-at-task-migration
      feature.  It enables moving charges of anonymous swaps.
      
      To move the charge of swap, we need to exchange swap_cgroup's record.
      
      In current implementation, swap_cgroup's record is protected by:
      
        - page lock: if the entry is on swap cache.
        - swap_lock: if the entry is not on swap cache.
      
      This works well in usual swap-in/out activity.
      
      But this behavior makes the move-swap-charge feature check many conditions
      in order to exchange swap_cgroup's record safely.

      So I changed the modification of swap_cgroup's record (swap_cgroup_record())
      to use xchg, and defined a new function to cmpxchg swap_cgroup's record.

      This patch also enables moving the charge of swap caches that are not
      pte_present but not yet uncharged, which can exist on the swap-out path, by
      getting the target pages via find_get_page() as do_mincore() does.
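      A sketch of how the cmpxchg variant is used when moving one swap entry's
      charge (fragment; the cmpxchg function name is the obvious one implied
      above, and the surrounding usage is illustrative):

      	/*
      	 * Try to move the record for @entry from @old_id to @new_id.
      	 * If a concurrent swap-in or swap-free changed the record first,
      	 * the cmpxchg fails and the charge simply stays where it is.
      	 */
      	static bool move_swap_record(swp_entry_t entry,
      				     unsigned short old_id,
      				     unsigned short new_id)
      	{
      		return swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id;
      	}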
      
      [kosaki.motohiro@jp.fujitsu.com: fix ia64 build]
      [akpm@linux-foundation.org: fix typos]
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 16 Dec 2009, 10 commits
  15. 24 Sep 2009, 2 commits
  16. 22 Sep 2009, 1 commit
  17. 16 Sep 2009, 1 commit
    • HWPOISON: Add support for poison swap entries v2 · a7420aa5
      Authored by Andi Kleen
      Memory migration uses special swap entry types to trigger special actions on
      page faults. Extend this mechanism to also support poisoned swap entries, to
      trigger poison handling on page faults. This allows follow-on patches to
      prevent processes from faulting in poisoned pages again.
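      A sketch of the entry helpers this enables (fragment; modeled on the
      existing migration-entry helpers, so treat the exact definitions as
      illustrative):

      	/*
      	 * A poisoned page is represented in a pte by a software swap entry
      	 * of a reserved type, just like a migration entry, so the next fault
      	 * on that address is routed to the poison handler instead of
      	 * swapping anything in.
      	 */
      	static inline swp_entry_t make_hwpoison_entry(struct page *page)
      	{
      		return swp_entry(SWP_HWPOISON, page_to_pfn(page));
      	}

      	static inline int is_hwpoison_entry(swp_entry_t entry)
      	{
      		return swp_type(entry) == SWP_HWPOISON;
      	}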
      
      v2: Fix overflow in MAX_SWAPFILES (Fengguang Wu)
      v3: Better overflow fix (Hidehiro Kawai)
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
  18. 24 Jun 2009, 1 commit
  19. 19 Jun 2009, 2 commits
  20. 17 Jun 2009, 4 commits
  21. 29 May 2009, 1 commit
  22. 01 Apr 2009, 1 commit
    • shmem: writepage directly to swap · 9fab5619
      Authored by Hugh Dickins
      Synopsis: if shmem_writepage calls swap_writepage directly, most shmem
      swap loads benefit, and a catastrophic interaction between SLUB and some
      flash storage is avoided.
      
      shmem_writepage() has always been peculiar in making no attempt to write:
      it has just transferred a shmem page from file cache to swap cache, then
      let that page make its way around the LRU again before being written and
      freed.
      
      The idea was that people use tmpfs because they want those pages to stay
      in RAM; so although we give it an overflow to swap, we should resist
      writing too soon, giving those pages a second chance before they can be
      reclaimed.
      
      That was always questionable, and I've toyed with this patch for years;
      but never had a clear justification to depart from the original design.
      
      It became more questionable in 2.6.28, when the split LRU patches classed
      shmem and tmpfs pages as SwapBacked rather than as file_cache: that in
      itself gives them more resistance to reclaim than normal file pages.  I
      prepared this patch for 2.6.29, but the merge window arrived before I'd
      completed gathering statistics to justify sending it in.
      
      Then while comparing SLQB against SLUB, running SLUB on a laptop I'd
      habitually used with SLAB, I found SLUB to run my tmpfs kbuild swapping
      tests five times slower than SLAB or SLQB - other machines slower too, but
      nowhere near so bad.  Simpler "cp -a" swapping tests showed the same.
      
      slub_max_order=0 brings sanity to all, but heavy swapping is too far from
      normal to justify such a tuning.  The crucial factor on that laptop turns
      out to be that I'm using an SD card for swap.  What happens is this:
      
      By default, SLUB uses order-2 pages for shmem_inode_cache (and many other
      fs inodes), so creating tmpfs files under memory pressure brings lumpy
      reclaim into play.  One subpage of the order is chosen from the bottom of
      the LRU as usual, then the other three picked out from their random
      positions on the LRUs.
      
      In a tmpfs load, many of these pages will be ones which already passed
      through shmem_writepage, so already have swap allocated.  And though their
      offsets on swap were probably allocated sequentially, now that the pages
      are picked off at random, their swap offsets are scattered.
      
      But the flash storage on the SD card is very sensitive to having its
      writes merged: once swap is written at scattered offsets, performance
      falls apart.  Rotating disk seeks increase too, but less disastrously.
      
      So: stop giving shmem/tmpfs pages a second pass around the LRU, write them
      out to swap as soon as their swap has been allocated.
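      A sketch of the change in shmem_writepage() once swap has been allocated
      for the page (fragment; error handling and the shmem index update are
      elided, details illustrative):

      	if (add_to_swap_cache(page, swap, GFP_ATOMIC) == 0) {
      		/* ... record the swap entry in the shmem inode's index ... */
      		/*
      		 * Previously: mark dirty, unlock, and let the page circle
      		 * the LRU again before being written.  Now: write it out
      		 * immediately, keeping its swap offset close to its
      		 * neighbours' so the device sees merged, sequential writes.
      		 */
      		return swap_writepage(page, wbc);
      	}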
      
      It's surely possible to devise an artificial load which runs faster the
      old way, one whose sizing is such that the tmpfs pages on their second
      pass are the ones that are wanted again, and other pages not.
      
      But I've not yet found such a load: on all machines, under the loads I've
      tried, immediate swap_writepage speeds up shmem swapping: especially when
      using the SLUB allocator (and more effectively than slub_max_order=0), but
      also with the others; and it also reduces the variance between runs.  How
      much faster varies widely: a factor of five is rare, 5% is common.
      
      One load which might have suffered: imagine a swapping shmem load in a
      limited mem_cgroup on a machine with plenty of memory.  Before 2.6.29 the
      swapcache was not charged, and such a load would have run quickest with
      the shmem swapcache never written to swap.  But now swapcache is charged,
      so even this load benefits from shmem_writepage directly to swap.
      
      Apologies for the #ifndef CONFIG_SWAP swap_writepage() stub in swap.h:
      it's silly because that will never get called; but refactoring shmem.c
      sensibly according to CONFIG_SWAP will be a separate task.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>