1. 25 Jan 2016, 1 commit
    • vmstat: Remove BUG_ON from vmstat_update · 587198ba
      Authored by Christoph Lameter
      If we detect that there is nothing to do just set the flag and do not
      check if it was already set before.  Races really do not matter.  If the
      flag is set by any code then the shepherd will start dealing with the
      situation and reenable the vmstat workers when necessary again.
      
      Since commit 0eb77e98 ("vmstat: make vmstat_updater deferrable again
      and shut down on idle") quiet_vmstat might update cpu_stat_off and mark
      a particular cpu to be handled by vmstat_shepherd.  This might trigger a
      VM_BUG_ON in vmstat_update because the work item might have been
      sleeping during the idle period and see the cpu_stat_off updated after
      the wake up.  The VM_BUG_ON is therefore misleading and no longer
      appropriate.  Moreover it doesn't really provide any protection from real
      bugs because vmstat_shepherd will simply reschedule the vmstat_work
      anytime it sees a particular cpu set or vmstat_update would do the same
      from the worker context directly.  Even when the two would race the
      result wouldn't be incorrect as the counters update is fully idempotent.
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      587198ba
  2. 16 Jan 2016, 2 commits
    • mm: support madvise(MADV_FREE) · 854e9ed0
      Authored by Minchan Kim
      Linux doesn't have the ability to free pages lazily, while other OSes
      have supported this for a long time via madvise(MADV_FREE).

      The gain is clear: the kernel can discard freed pages rather than
      swapping them out or going OOM if memory pressure happens.

      Without memory pressure, freed pages can be reused by userspace
      without any additional overhead (e.g., page fault + allocation +
      zeroing).
      
      Jason Evans said:
      
      : Facebook has been using MAP_UNINITIALIZED
      : (https://lkml.org/lkml/2012/1/18/308) in some of its applications for
      : several years, but there are operational costs to maintaining this
      : out-of-tree in our kernel and in jemalloc, and we are anxious to retire it
      : in favor of MADV_FREE.  When we first enabled MAP_UNINITIALIZED it
      : increased throughput for much of our workload by ~5%, and although the
      : benefit has decreased using newer hardware and kernels, there is still
      : enough benefit that we cannot reasonably retire it without a replacement.
      :
      : Aside from Facebook operations, there are numerous broadly used
      : applications that would benefit from MADV_FREE.  The ones that immediately
      : come to mind are redis, varnish, and MariaDB.  I don't have much insight
      : into Android internals and development process, but I would hope to see
      : MADV_FREE support eventually end up there as well to benefit applications
      : linked with the integrated jemalloc.
      :
      : jemalloc will use MADV_FREE once it becomes available in the Linux kernel.
      : In fact, jemalloc already uses MADV_FREE or equivalent everywhere it's
      : available: *BSD, OS X, Windows, and Solaris -- every platform except Linux
      : (and AIX, but I'm not sure it even compiles on AIX).  The lack of
      : MADV_FREE on Linux forced me down a long series of increasingly
      : sophisticated heuristics for madvise() volume reduction, and even so this
      : remains a common performance issue for people using jemalloc on Linux.
      : Please integrate MADV_FREE; many people will benefit substantially.
      
      How it works:
      
      When the madvise syscall is called, the VM clears the dirty bit of the
      ptes in the range.  If memory pressure happens, the VM checks the dirty
      bit in the page table and, if it is still "clean", the page is a
      "lazyfree" page, so the VM can discard it instead of swapping it out.
      If a store to the page happened before the VM picked it for reclaim,
      the dirty bit is set, so the VM swaps the page out instead of
      discarding it.

      One thing to note is that MADV_FREE basically relies on the dirty bit
      in the page table entry to decide whether the VM is allowed to discard
      the page or not.  IOW, if the page table entry has the dirty bit set,
      the VM shouldn't discard the page.

      However, if, for example, a swap-in happens via a read fault, the page
      table entry doesn't have the dirty bit set, so MADV_FREE could wrongly
      discard the page.

      To avoid that problem, MADV_FREE does additional checks on PageDirty
      and PageSwapCache.  This works because a swapped-in page lives in the
      swap cache, and once it is evicted from the swap cache the page carries
      the PG_dirty flag.  Both page flag checks therefore effectively prevent
      wrong discarding by MADV_FREE.

      However, a problem with the above logic is that a swapped-in page keeps
      PG_dirty even after it is removed from the swap cache, so the VM cannot
      consider the page freeable any more even if madvise_free is called
      later.

      Look at the example below for details.
      
          ptr = malloc();
          memset(ptr);
          ..
          ..
          .. heavy memory pressure so all of pages are swapped out
          ..
          ..
          var = *ptr; -> a page swapped-in and could be removed from
                         swapcache. Then, page table doesn't mark
                         dirty bit and page descriptor includes PG_dirty
          ..
          ..
          madvise_free(ptr); -> It doesn't clear PG_dirty of the page.
          ..
          ..
          ..
          .. heavy memory pressure again.
          .. In this time, VM cannot discard the page because the page
          .. has *PG_dirty*
      
      To solve the problem, this patch clears PG_dirty when madvise is
      called, but only if the page is owned exclusively by the current
      process, because PG_dirty represents the ptes' dirtiness across several
      processes, so we may clear it only when we own the page exclusively.

      The first heavy users will be general allocators (e.g., jemalloc,
      tcmalloc, and hopefully glibc) and jemalloc/tcmalloc already support
      this feature on other OSes (e.g., FreeBSD).
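
      As a rough userspace illustration of the semantics described above
      (not part of this patch; it assumes a kernel and libc that expose
      MADV_FREE, and the buffer size is arbitrary):

        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>

        #ifndef MADV_FREE
        #define MADV_FREE 8   /* Linux UAPI value if the libc header lacks it */
        #endif

        int main(void)
        {
                size_t len = 64UL << 20;   /* 64MB of anonymous memory */
                char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                if (buf == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }

                memset(buf, 0xaa, len);    /* dirty the pages */

                /* Hint that the range may be discarded lazily instead of
                 * being swapped out; the mapping itself stays valid. */
                if (madvise(buf, len, MADV_FREE))
                        perror("madvise(MADV_FREE)");

                /* Reuse is fine: a later store redirties the page (or
                 * faults in a fresh zero page if it was discarded). */
                buf[0] = 1;

                munmap(buf, len);
                return 0;
        }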
      
        barrios@blaptop:~/benchmark/ebizzy$ lscpu
        Architecture:          x86_64
        CPU op-mode(s):        32-bit, 64-bit
        Byte Order:            Little Endian
        CPU(s):                12
        On-line CPU(s) list:   0-11
        Thread(s) per core:    1
        Core(s) per socket:    1
        Socket(s):             12
        NUMA node(s):          1
        Vendor ID:             GenuineIntel
        CPU family:            6
        Model:                 2
        Stepping:              3
        CPU MHz:               3200.185
        BogoMIPS:              6400.53
        Virtualization:        VT-x
        Hypervisor vendor:     KVM
        Virtualization type:   full
        L1d cache:             32K
        L1i cache:             32K
        L2 cache:              4096K
        NUMA node0 CPU(s):     0-11
        ebizzy benchmark(./ebizzy -S 10 -n 512)
      
        Higher avg is better.
      
         vanilla-jemalloc             MADV_free-jemalloc
      
        1 thread
        records: 10                   records: 10
        avg:   2961.90                avg:  12069.70
        std:     71.96(2.43%)         std:    186.68(1.55%)
        max:   3070.00                max:  12385.00
        min:   2796.00                min:  11746.00
      
        2 thread
        records: 10                   records: 10
        avg:   5020.00                avg:  17827.00
        std:    264.87(5.28%)         std:    358.52(2.01%)
        max:   5244.00                max:  18760.00
        min:   4251.00                min:  17382.00
      
        4 thread
        records: 10                   records: 10
        avg:   8988.80                avg:  27930.80
        std:   1175.33(13.08%)        std:   3317.33(11.88%)
        max:   9508.00                max:  30879.00
        min:   5477.00                min:  21024.00
      
        8 thread
        records: 10                   records: 10
        avg:  13036.50                avg:  33739.40
        std:    170.67(1.31%)         std:   5146.22(15.25%)
        max:  13371.00                max:  40572.00
        min:  12785.00                min:  24088.00
      
        16 thread
        records: 10                   records: 10
        avg:  11092.40                avg:  31424.20
        std:    710.60(6.41%)         std:   3763.89(11.98%)
        max:  12446.00                max:  36635.00
        min:   9949.00                min:  25669.00
      
        32 thread
        records: 10                   records: 10
        avg:  11067.00                avg:  34495.80
        std:    971.06(8.77%)         std:   2721.36(7.89%)
        max:  12010.00                max:  38598.00
        min:   9002.00                min:  30636.00
      
      In summary, MADV_FREE is much faster than MADV_DONTNEED.
      
      This patch (of 12):
      
      Add core MADV_FREE implementation.
      
      [akpm@linux-foundation.org: small cleanups]
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Mika Penttil <mika.penttila@nextfour.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jason Evans <je@fb.com>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: <yalin.wang2010@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: "Shaohua Li" <shli@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      854e9ed0
    • mm, vmstats: new THP splitting event · 122afea9
      Authored by Kirill A. Shutemov
      The patch replaces THP_SPLIT with three events: THP_SPLIT_PAGE,
      THP_SPLIT_PAGE_FAILED and THP_SPLIT_PMD.  It reflects the fact that we
      are going to be able to split a PMD without splitting the compound page
      and that split_huge_page() can fail.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Tested-by: Sasha Levin <sasha.levin@oracle.com>
      Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Jerome Marchand <jmarchan@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      122afea9
  3. 15 Jan 2016, 1 commit
    • vmstat: make vmstat_updater deferrable again and shut down on idle · 0eb77e98
      Authored by Christoph Lameter
      Currently the vmstat updater is not deferrable as a result of commit
      ba4877b9 ("vmstat: do not use deferrable delayed work for
      vmstat_update").  This in turn can cause multiple interruptions of the
      applications because the vmstat updater may run at any time.

      Make vmstat_update deferrable again and provide a function that folds
      the differentials when the processor is going to idle mode thus
      addressing the issue of the above commit in a clean way.
      
      Note that the shepherd thread will continue scanning the differentials
      from another processor and will reenable the vmstat workers if it
      detects any changes.
      
      Fixes: ba4877b9 ("vmstat: do not use deferrable delayed work for vmstat_update")
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0eb77e98
  4. 09 Jan 2016, 1 commit
    • vmstat: allocate vmstat_wq before it is used · 751e5f5c
      Authored by Michal Hocko
      kernel test robot has reported the following crash:
      
        BUG: unable to handle kernel NULL pointer dereference at 00000100
        IP: [<c1074df6>] __queue_work+0x26/0x390
        *pdpt = 0000000000000000 *pde = f000ff53f000ff53 *pde = f000ff53f000ff53
        Oops: 0000 [#1] PREEMPT PREEMPT SMP SMP
        CPU: 0 PID: 24 Comm: kworker/0:1 Not tainted 4.4.0-rc4-00139-g373ccbe5 #1
        Workqueue: events vmstat_shepherd
        task: cb684600 ti: cb7ba000 task.ti: cb7ba000
        EIP: 0060:[<c1074df6>] EFLAGS: 00010046 CPU: 0
        EIP is at __queue_work+0x26/0x390
        EAX: 00000046 EBX: cbb37800 ECX: cbb37800 EDX: 00000000
        ESI: 00000000 EDI: 00000000 EBP: cb7bbe68 ESP: cb7bbe38
         DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
        CR0: 8005003b CR2: 00000100 CR3: 01fd5000 CR4: 000006b0
        Stack:
        Call Trace:
          __queue_delayed_work+0xa1/0x160
          queue_delayed_work_on+0x36/0x60
          vmstat_shepherd+0xad/0xf0
          process_one_work+0x1aa/0x4c0
          worker_thread+0x41/0x440
          kthread+0xb0/0xd0
          ret_from_kernel_thread+0x21/0x40
      
      The reason is that start_shepherd_timer schedules the shepherd work item
      which uses vmstat_wq (vmstat_shepherd) before setup_vmstat allocates
      that workqueue so if the further initialization takes more than HZ we
      might end up scheduling on a NULL vmstat_wq.  This is really unlikely
      but not impossible.
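
      A minimal, self-contained module sketching the ordering the fix
      enforces (allocate the workqueue before anything can queue onto it).
      The demo_* names are made up; this illustrates the pattern, not the
      actual vmstat patch:

        #include <linux/module.h>
        #include <linux/workqueue.h>

        static struct workqueue_struct *demo_wq;

        static void demo_fn(struct work_struct *work)
        {
                pr_info("demo work ran\n");
        }
        static DECLARE_DELAYED_WORK(demo_work, demo_fn);

        static int __init demo_init(void)
        {
                /* Allocate first; queueing on a NULL workqueue would oops,
                 * much like the report above. */
                demo_wq = alloc_workqueue("demo_wq", WQ_MEM_RECLAIM, 0);
                if (!demo_wq)
                        return -ENOMEM;

                /* Only now is it safe to queue delayed work on it. */
                queue_delayed_work(demo_wq, &demo_work, HZ);
                return 0;
        }

        static void __exit demo_exit(void)
        {
                cancel_delayed_work_sync(&demo_work);
                destroy_workqueue(demo_wq);
        }

        module_init(demo_init);
        module_exit(demo_exit);
        MODULE_LICENSE("GPL");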
      
      Fixes: 373ccbe5 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress")
      Reported-by: kernel test robot <ying.huang@linux.intel.com>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: stable@vger.kernel.org
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      751e5f5c
  5. 30 Dec 2015, 1 commit
    • mm/vmstat: fix overflow in mod_zone_page_state() · 6cdb18ad
      Authored by Heiko Carstens
      mod_zone_page_state() takes a "delta" integer argument.  delta contains
      the number of pages that should be added or subtracted from a struct
      zone's vm_stat field.
      
      If a zone is larger than 8TB this will cause overflows.  E.g.  for a
      zone with a size slightly larger than 8TB the line
      
          mod_zone_page_state(zone, NR_ALLOC_BATCH, zone->managed_pages);
      
      in mm/page_alloc.c:free_area_init_core() will result in a negative
      result for the NR_ALLOC_BATCH entry within the zone's vm_stat, since 8TB
      contain 0x8xxxxxxx pages which will be sign extended to a negative
      value.
      
      Fix this by changing the delta argument to long type.
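
      A small userspace demonstration of the arithmetic (illustrative only;
      assumes 4 KB pages and a 64-bit unsigned long):

        #include <stdio.h>
        #include <limits.h>

        int main(void)
        {
                unsigned long zone_bytes = 8UL << 40;          /* 8 TB zone */
                unsigned long nr_pages   = zone_bytes >> 12;   /* 4 KB pages */

                printf("pages in an 8 TB zone: %lu\n", nr_pages); /* 2147483648 */
                printf("INT_MAX:               %d\n", INT_MAX);   /* 2147483647 */

                /* What an 'int delta' parameter ends up holding is
                 * implementation-defined; on common ABIs it wraps negative. */
                printf("as int:                %d\n", (int)nr_pages);
                return 0;
        }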
      
      This could fix an early boot problem seen on s390, where we have a 9TB
      system with only one node.  ZONE_DMA contains 2GB and ZONE_NORMAL the
      rest.  The system is trying to allocate a GFP_DMA page but ZONE_DMA is
      completely empty, so it tries to reclaim pages in an endless loop.
      
      This was seen on a heavily patched 3.10 kernel.  One possible
      explanation seems to be the overflows caused by mod_zone_page_state().
      Unfortunately I did not have the chance to verify that this patch
      actually fixes the problem, since I don't have access to the system
      right now.  However the overflow problem does exist anyway.
      
      Given the description that a system with slightly less than 8TB does
      work, this seems to be a candidate for the observed problem.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6cdb18ad
  6. 13 Dec 2015, 2 commits
    • mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress · 373ccbe5
      Authored by Michal Hocko
      Tetsuo Handa has reported that the system might basically livelock in
      OOM condition without triggering the OOM killer.
      
      The issue is caused by internal dependency of the direct reclaim on
      vmstat counter updates (via zone_reclaimable) which are performed from
      the workqueue context.  If all the current workers get assigned to an
      allocation request, though, they will be looping inside the allocator
      trying to reclaim memory but zone_reclaimable can see stalled numbers so
      it will consider a zone reclaimable even though it has been scanned way
      too much.  WQ concurrency logic will not consider this situation as a
      congested workqueue because it relies on workers having to sleep in
      such a situation.  This also means that it doesn't try to spawn new
      workers or invoke the rescuer thread if one is assigned to the
      queue.
      
      In order to fix this issue we need to do two things.  First we have to
      let wq concurrency code know that we are in trouble so we have to do a
      short sleep.  In order to prevent the issues handled by 0e093d99
      ("writeback: do not sleep on the congestion queue if there are no
      congested BDIs or if significant congestion is not being encountered in
      the current zone") we limit the sleep only to worker threads, which are
      the ones of interest anyway.
      
      The second thing to do is to create a dedicated workqueue for vmstat and
      mark it WQ_MEM_RECLAIM to note it participates in the reclaim and to
      have a spare worker thread for it.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Cristopher Lameter <clameter@sgi.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Arkadiusz Miskiewicz <arekm@maven.pl>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      373ccbe5
    • mm: fix swapped Movable and Reclaimable in /proc/pagetypeinfo · 475a2f90
      Authored by Vlastimil Babka
      Commit 016c13da ("mm, page_alloc: use masks and shifts when
      converting GFP flags to migrate types") has swapped MIGRATE_MOVABLE and
      MIGRATE_RECLAIMABLE in the enum definition.  However, migratetype_names
      wasn't updated to reflect that.
      
      As a result, the file /proc/pagetypeinfo shows the counts for Movable as
      Reclaimable and vice versa.
      
      Additionally, commit 0aaa29a5 ("mm, page_alloc: reserve pageblocks
      for high-order atomic allocations on demand") introduced
      MIGRATE_HIGHATOMIC, but did not add a letter to distinguish it into
      show_migration_types(), so it doesn't appear in the listing of free
      areas during page alloc failures or oom kills.
      
      This patch fixes both problems.  The atomic reserves will show with a
      letter 'H' in the free areas listings.
      
      Fixes: 016c13da ("mm, page_alloc: use masks and shifts when converting GFP flags to migrate types")
      Fixes: 0aaa29a5 ("mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand")
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      475a2f90
  7. 07 Nov 2015, 2 commits
    • mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand · 0aaa29a5
      Authored by Mel Gorman
      High-order watermark checking exists for two reasons -- kswapd high-order
      awareness and protection for high-order atomic requests.  Historically the
      kernel depended on MIGRATE_RESERVE to preserve min_free_kbytes as
      high-order free pages for as long as possible.  This patch introduces
      MIGRATE_HIGHATOMIC that reserves pageblocks for high-order atomic
      allocations on demand and avoids using those blocks for order-0
      allocations.  This is more flexible and reliable than MIGRATE_RESERVE was.
      
      A MIGRATE_HIGHATOMIC pageblock is created when an atomic high-order
      allocation request steals a pageblock but limits the total number to 1% of
      the zone.  Callers that speculatively abuse atomic allocations for
      long-lived high-order allocations to access the reserve will quickly fail.
       Note that SLUB is currently not such an abuser as it reclaims at least
      once.  It is possible that the pageblock stolen has few suitable
      high-order pages and will need to steal again in the near future but there
      would need to be strong justification to search all pageblocks for an
      ideal candidate.
      
      The pageblocks are unreserved if an allocation fails after a direct
      reclaim attempt.
      
      The watermark checks account for the reserved pageblocks when the
      allocation request is not a high-order atomic allocation.
      
      The reserved pageblocks can not be used for order-0 allocations.  This may
      allow temporary wastage until a failed reclaim reassigns the pageblock.
      This is deliberate as the intent of the reservation is to satisfy a
      limited number of atomic high-order short-lived requests if the system
      requires them.
      
      The stutter benchmark was used to evaluate this but while it was running
      there was a systemtap script that randomly allocated between 1 high-order
      page and 12.5% of memory's worth of order-3 pages using GFP_ATOMIC.  This
      is much larger than the potential reserve and it does not attempt to be
      realistic.  It is intended to stress random high-order allocations from an
      unknown source, show that there is a reduction in failures without
      introducing an anomaly where atomic allocations are more reliable than
      regular allocations.  The amount of memory reserved varied throughout the
      workload as reserves were created and reclaimed under memory pressure.
      The allocation failures once the workload warmed up were as follows;
      
      4.2-rc5-vanilla		70%
      4.2-rc5-atomic-reserve	56%
      
      The failure rate was also measured while building multiple kernels.  The
      failure rate was 14% but is 6% with this patch applied.
      
      Overall, this is a small reduction but the reserves are small relative to
      the number of allocation requests.  In early versions of the patch, the
      failure rate reduced by a much larger amount but that required much larger
      reserves and perversely made atomic allocations seem more reliable than
      regular allocations.
      
      [yalin.wang2010@gmail.com: fix redundant check and a memory leak]
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: yalin wang <yalin.wang2010@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0aaa29a5
    • mm, page_alloc: remove MIGRATE_RESERVE · 974a786e
      Authored by Mel Gorman
      MIGRATE_RESERVE preserves an old property of the buddy allocator that
      existed prior to fragmentation avoidance -- min_free_kbytes worth of pages
      tended to remain contiguous until the only alternative was to fail the
      allocation.  At the time it was discovered that high-order atomic
      allocations relied on this property so MIGRATE_RESERVE was introduced.  A
      later patch will introduce an alternative MIGRATE_HIGHATOMIC so this patch
      deletes MIGRATE_RESERVE and supporting code so it'll be easier to review.
      Note that this patch in isolation may look like a false regression if
      someone was bisecting high-order atomic allocation failures.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      974a786e
  8. 06 Nov 2015, 1 commit
  9. 16 Oct 2015, 1 commit
    • vmstat: explicitly schedule per-cpu work on the CPU we need it to run on · 176bed1d
      Authored by Linus Torvalds
      The vmstat code uses "schedule_delayed_work_on()" to do the initial
      startup of the delayed work on the right CPU, but then once it was
      started it would use the non-cpu-specific "schedule_delayed_work()" to
      re-schedule it on that CPU.
      
      That just happened to schedule it on the same CPU historically (well, in
      almost all situations), but the code _requires_ this work to be per-cpu,
      and should say so explicitly rather than depend on the non-cpu-specific
      scheduling to schedule on the current CPU.
      
      The timer code is being changed to not be as single-minded in always
      running things on the calling CPU.
      
      See also commit 874bbfe6 ("workqueue: make sure delayed work run in
      local cpu") that for now maintains the local CPU guarantees just in case
      there are other broken users that depended on the accidental behavior.
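
      For reference, a minimal module sketch of the CPU-pinned variant the
      patch switches to (CPU 0 is picked arbitrarily here; the real vmstat
      code re-arms the work on the CPU it is already running on):

        #include <linux/module.h>
        #include <linux/workqueue.h>

        static void percpu_demo_fn(struct work_struct *work)
        {
                pr_info("per-cpu delayed work ran\n");
        }
        static DECLARE_DELAYED_WORK(percpu_demo_work, percpu_demo_fn);

        static int __init percpu_demo_init(void)
        {
                /* The _on() variant queues the work for a specific CPU
                 * instead of relying on schedule_delayed_work() picking
                 * the local one. */
                schedule_delayed_work_on(0, &percpu_demo_work,
                                         round_jiffies_relative(HZ));
                return 0;
        }

        static void __exit percpu_demo_exit(void)
        {
                cancel_delayed_work_sync(&percpu_demo_work);
        }

        module_init(percpu_demo_init);
        module_exit(percpu_demo_exit);
        MODULE_LICENSE("GPL");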
      
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      176bed1d
  10. 12 Feb 2015, 2 commits
    • vmstat: Reduce time interval to stat update on idle cpu · 57c2e36b
      Authored by Christoph Lameter
      It was noted that the vm stat shepherd runs every 2 seconds and that the
      vmstat update is then scheduled 2 seconds in the future.
      
      This yields an interval of double the time interval which is not desired.
      
      Change the shepherd so that it does not delay the vmstat update on the
      other cpu.  We still have to use schedule_delayed_work since we are using a
      delayed_work_struct but we can set the delay to 0.
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      57c2e36b
    • vmstat: do not use deferrable delayed work for vmstat_update · ba4877b9
      Authored by Michal Hocko
      Vinayak Menon has reported that an excessive number of tasks was throttled
      in the direct reclaim inside too_many_isolated() because NR_ISOLATED_FILE
      was relatively high compared to NR_INACTIVE_FILE.  However it turned out
      that the real number of NR_ISOLATED_FILE was 0 and the per-cpu
      vm_stat_diff wasn't transferred into the global counter.
      
      vmstat_work which is responsible for the sync is defined as deferrable
      delayed work which means that the defined timeout doesn't wake up an idle
      CPU.  A CPU might stay in an idle state for a long time and general effort
      is to keep such a CPU in this state as long as possible which might lead
      to all sorts of troubles for vmstat consumers as can be seen with the
      excessive direct reclaim throttling.
      
      This patch basically reverts 39bf6270 ("VM statistics: Make timer
      deferrable") but it shouldn't cause any problems for idle CPUs because
      only CPUs with an active per-cpu drift are woken up since 7cc36bbd
      ("vmstat: on-demand vmstat workers v8") and CPUs which are idle for a
      longer time shouldn't have per-cpu drift.
      
      Fixes: 39bf6270 (VM statistics: Make timer deferrable)
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Reported-by: Vinayak Menon <vinmenon@codeaurora.org>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ba4877b9
  11. 11 Feb 2015, 1 commit
    • mm/vmstat.c: fix/cleanup ifdefs · 3c486871
      Authored by Andrew Morton
      CONFIG_COMPACTION=y, CONFIG_DEBUG_FS=n:
      
        mm/vmstat.c:690: warning: 'frag_start' defined but not used
        mm/vmstat.c:702: warning: 'frag_next' defined but not used
        mm/vmstat.c:710: warning: 'frag_stop' defined but not used
        mm/vmstat.c:715: warning: 'walk_zones_in_node' defined but not used
      
      It's all a bit of a tangly mess and it's unclear why CONFIG_COMPACTION
      figures in there at all.  Move frag_start/frag_next/frag_stop and
      migratetype_names[] into the existing CONFIG_PROC_FS block.
      
      walk_zones_in_node() gets a special ifdef.
      
      Also move the #include lines up to where #include lines live.
      
      [axel.lin@ingics.com: fix build error when !CONFIG_PROC_FS]
      Signed-off-by: Axel Lin <axel.lin@ingics.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Tested-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3c486871
  12. 14 Dec 2014, 2 commits
    • mm,vmacache: count number of system-wide flushes · f5f302e2
      Authored by Davidlohr Bueso
      These flushes deal with sequence number overflows, such as for long lived
      threads.  These are rare, but interesting from a debugging PoV.  As such,
      display the number of flushes when vmacache debugging is enabled.
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f5f302e2
    • mm/page_owner: keep track of page owners · 48c96a36
      Authored by Joonsoo Kim
      This is the page owner tracking code which was introduced a long time
      ago.  It has been resident in Andrew's tree, but nobody tried to
      upstream it, so it has remained as is.  Our company uses this feature
      actively to debug memory leaks or to find memory hoggers, so I decided
      to upstream this feature.
      
      This functionality helps us to know who allocated a page.  When
      allocating a page, we store some information about the allocation in
      extra memory.  Later, if we need to know the status of all pages, we
      can get and analyze it from this stored information.
      
      In previous version of this feature, extra memory is statically defined in
      struct page, but, in this version, extra memory is allocated outside of
      struct page.  It enables us to turn on/off this feature at boottime
      without considerable memory waste.
      
      Although we already have tracepoints for tracing page allocation/free,
      using them to analyze page owners is rather complex.  We need to enlarge
      the trace buffer to prevent overwriting until the userspace program is
      launched.  And the launched program continually dumps out the trace
      buffer for later analysis, which is more likely to change system
      behaviour than just keeping the information in memory, so it is bad for
      debugging.
      
      Moreover, we can use the page_owner feature further for various
      purposes.  For example, we can use it for the fragmentation statistics
      implemented in this patch.  And I also plan to implement some CMA
      failure debugging features using this interface.

      I'd like to give credit to all the developers who contributed to this
      feature, but it's not easy because I don't know the exact history.
      Sorry about that.  Below are the people who have a "Signed-off-by" in
      the patches in Andrew's tree.
      
      Contributor:
      Alexander Nyberg <alexn@dsv.su.se>
      Mel Gorman <mgorman@suse.de>
      Dave Hansen <dave@linux.vnet.ibm.com>
      Minchan Kim <minchan@kernel.org>
      Michal Nazarewicz <mina86@mina86.com>
      Andrew Morton <akpm@linux-foundation.org>
      Jungsoo Son <jungsoo.son@lge.com>
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      48c96a36
  13. 10 Oct 2014, 2 commits
    • vmstat: on-demand vmstat workers V8 · 7cc36bbd
      Authored by Christoph Lameter
      vmstat workers are used for folding counter differentials into the zone,
      per node and global counters at certain time intervals.  They currently
      run at defined intervals on all processors which will cause some holdoff
      for processors that need minimal intrusion by the OS.
      
      The current vmstat_update mechanism depends on a deferrable timer firing
      every other second by default which registers a work queue item that runs
      on the local CPU, with the result that we have 1 interrupt and one
      additional schedulable task on each CPU every 2 seconds.  If a workload
      indeed causes VM activity or multiple tasks are running on a CPU, then
      there are probably bigger issues to deal with.
      
      However, some workloads dedicate a CPU for a single CPU bound task.  This
      is done in high performance computing, in high frequency financial
      applications, in networking (Intel DPDK, EZchip NPS) and with the advent
      of systems with more and more CPUs over time, this may become more and
      more common to do since when one has enough CPUs one cares less about
      efficiently sharing a CPU with other tasks and more about efficiently
      monopolizing a CPU per task.
      
      The difference made by having this timer firing and a workqueue kernel
      thread scheduled every second can be enormous.  An artificial test
      measuring the worst case time to do a simple "i++" in an endless loop on
      a bare metal system and under Linux on an isolated CPU with dynticks,
      with and without this patch, has Linux match the bare metal performance
      (~700 cycles) with this patch and lose by a couple of orders of
      magnitude (~200k cycles) without it[*].  The loss occurs for something
      that just calculates
      statistics.  For networking applications, for example, this could be the
      difference between dropping packets or sustaining line rate.
      
      Statistics are important and useful, but it would be great if there
      were a way to keep statistics gathering from producing a huge
      performance difference.  This patch does just that.
      
      This patch creates a vmstat shepherd worker that monitors the per cpu
      differentials on all processors.  If there are differentials on a
      processor then a vmstat worker local to the processor with the
      differentials is created.  That worker will then start folding the diffs
      in regular intervals.  Should the worker find that there is no work to be
      done then it will make the shepherd worker monitor the differentials
      again.
      
      With this patch it is possible then to have periods longer than
      2 seconds without any OS event on a "cpu" (hardware thread).
      
      The patch shows a very minor increased in system performance.
      
      hackbench -s 512 -l 2000 -g 15 -f 25 -P
      
      Results before the patch:
      
      Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
      Each sender will pass 2000 messages of 512 bytes
      Time: 4.992
      Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
      Each sender will pass 2000 messages of 512 bytes
      Time: 4.971
      Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
      Each sender will pass 2000 messages of 512 bytes
      Time: 5.063
      
      Hackbench after the patch:
      
      Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
      Each sender will pass 2000 messages of 512 bytes
      Time: 4.973
      Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
      Each sender will pass 2000 messages of 512 bytes
      Time: 4.990
      Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
      Each sender will pass 2000 messages of 512 bytes
      Time: 4.993
      
      [fengguang.wu@intel.com: cpu_stat_off can be static]
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Reviewed-by: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: Max Krasnyansky <maxk@qti.qualcomm.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7cc36bbd
    • mm/balloon_compaction: add vmstat counters and kpageflags bit · 09316c09
      Authored by Konstantin Khlebnikov
      Always mark pages with PageBalloon even if balloon compaction is disabled
      and expose this mark in /proc/kpageflags as KPF_BALLOON.
      
      Also this patch adds three counters into /proc/vmstat: "balloon_inflate",
      "balloon_deflate" and "balloon_migrate".  They accumulate balloon
      activity.  Current size of balloon is (balloon_inflate - balloon_deflate)
      pages.
      
      All generic balloon code is now gathered under the option
      CONFIG_MEMORY_BALLOON.  It should be selected by any ballooning driver
      which wants to use this feature.  Currently virtio-balloon is the only
      user.
      Signed-off-by: Konstantin Khlebnikov <k.khlebnikov@samsung.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      09316c09
  14. 07 Aug 2014, 3 commits
    • mm: vmscan: only update per-cpu thresholds for online CPU · bb0b6dff
      Authored by Mel Gorman
      When kswapd is awake reclaiming, the per-cpu stat thresholds are lowered
      to get more accurate counts to avoid breaching watermarks.  This
      threshold update iterates over all possible CPUs which is unnecessary.
      Only online CPUs need to be updated.  If a new CPU is onlined,
      refresh_zone_stat_thresholds() will set the thresholds correctly.
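
      A tiny module sketch of the distinction the patch relies on (names
      are made up; it just counts the two CPU sets):

        #include <linux/module.h>
        #include <linux/cpumask.h>

        static int __init cpu_sets_init(void)
        {
                unsigned int cpu, possible = 0, online = 0;

                for_each_possible_cpu(cpu)  /* every CPU the system could ever have */
                        possible++;
                for_each_online_cpu(cpu)    /* only the CPUs that are up right now */
                        online++;

                pr_info("possible CPUs: %u, online CPUs: %u\n", possible, online);
                return 0;
        }

        static void __exit cpu_sets_exit(void)
        {
        }

        module_init(cpu_sets_init);
        module_exit(cpu_sets_exit);
        MODULE_LICENSE("GPL");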
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb0b6dff
    • mm: move zone->pages_scanned into a vmstat counter · 0d5d823a
      Authored by Mel Gorman
      zone->pages_scanned is a write-intensive cache line during page reclaim
      and it's also updated during page free.  Move the counter into vmstat to
      take advantage of the per-cpu updates and do not update it in the free
      paths unless necessary.
      
      On a small UMA machine running tiobench the difference is marginal.  On
      a 4-node machine the overhead is more noticeable.  Note that automatic
      NUMA balancing was disabled for this test as otherwise the system CPU
      overhead is unpredictable.
      
                3.16.0-rc3  3.16.0-rc3  3.16.0-rc3
                   vanillarearrange-v5   vmstat-v5
      User          746.94      759.78      774.56
      System      65336.22    58350.98    32847.27
      Elapsed     27553.52    27282.02    27415.04
      
      Note that the overhead reduction will vary depending on where exactly
      pages are allocated and freed.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0d5d823a
    • mm: rearrange zone fields into read-only, page alloc, statistics and page reclaim lines · 3484b2de
      Authored by Mel Gorman
      The arrangement of struct zone has changed over time and now it has
      reached the point where there is some inappropriate sharing going on.
      On x86-64 for example
      
      o The zone->node field is shared with the zone lock and zone->node is
        accessed frequently from the page allocator due to the fair zone
        allocation policy.
      
      o span_seqlock is almost never used but shares a line with free_area
      
      o Some zone statistics share a cache line with the LRU lock so
        reclaim-intensive and allocator-intensive workloads can bounce the cache
        line on a stat update
      
      This patch rearranges struct zone to put read-only and read-mostly
      fields together and then splits the page allocator intensive fields, the
      zone statistics and the page reclaim intensive fields into their own
      cache lines.  Note that the type of lowmem_reserve changes due to the
      watermark calculations being signed and avoiding a signed/unsigned
      conversion there.
      
      On the test configuration I used the overall size of struct zone shrunk
      by one cache line.  On smaller machines, this is not likely to be
      noticeable.  However, on a 4-node NUMA machine running tiobench the
      system CPU overhead is reduced by this patch.
      
                3.16.0-rc3  3.16.0-rc3
                   vanillarearrange-v5r9
      User          746.94      759.78
      System      65336.22    58350.98
      Elapsed     27553.52    27282.02
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3484b2de
  15. 05 Jun 2014, 3 commits
  16. 04 Apr 2014, 3 commits
    • drop_caches: add some documentation and info message · 5509a5d2
      Authored by Dave Hansen
      There is plenty of anecdotal evidence and a load of blog posts
      suggesting that using "drop_caches" periodically keeps your system
      running in "tip top shape".  Perhaps adding some kernel documentation
      will increase the amount of accurate data on its use.
      
      If we are not shrinking caches effectively, then we have real bugs.
      Using drop_caches will simply mask the bugs and make them harder to
      find, but certainly does not fix them, nor is it an appropriate
      "workaround" to limit the size of the caches.  On the contrary, there
      have been bug reports on issues that turned out to be misguided use of
      cache dropping.
      
      Dropping caches is a very drastic and disruptive operation that is good
      for debugging and running tests, but if it creates bug reports from
      production use, kernel developers should be aware of its use.
      
      Add a bit more documentation about it, a syslog message to track down
      abusers, and vmstat drop counters to help analyze problem reports.
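
      For reference, the interface the message refers to can be exercised
      from C roughly as below (requires root; dirty data is not freeable, so
      sync first; this is illustrative, not an endorsement of routine use):

        #include <stdio.h>
        #include <fcntl.h>
        #include <unistd.h>

        int main(void)
        {
                int fd;

                sync();  /* flush dirty data; drop_caches only frees clean objects */

                fd = open("/proc/sys/vm/drop_caches", O_WRONLY);
                if (fd < 0) {
                        perror("open");
                        return 1;
                }
                /* "1" = pagecache, "2" = dentries and inodes, "3" = both */
                if (write(fd, "3\n", 2) != 2)
                        perror("write");
                close(fd);
                return 0;
        }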
      
      [akpm@linux-foundation.org: checkpatch fixes]
      [hannes@cmpxchg.org: add runtime suppression control]
      Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5509a5d2
    • mm: keep page cache radix tree nodes in check · 449dd698
      Authored by Johannes Weiner
      Previously, page cache radix tree nodes were freed after reclaim emptied
      out their page pointers.  But now reclaim stores shadow entries in their
      place, which are only reclaimed when the inodes themselves are
      reclaimed.  This is problematic for bigger files that are still in use
      after they have a significant amount of their cache reclaimed, without
      any of those pages actually refaulting.  The shadow entries will just
      sit there and waste memory.  In the worst case, the shadow entries will
      accumulate until the machine runs out of memory.
      
      To get this under control, the VM will track radix tree nodes
      exclusively containing shadow entries on a per-NUMA node list.  Per-NUMA
      rather than global because we expect the radix tree nodes themselves to
      be allocated node-locally and we want to reduce cross-node references of
      otherwise independent cache workloads.  A simple shrinker will then
      reclaim these nodes on memory pressure.
      
      A few things need to be stored in the radix tree node to implement the
      shadow node LRU and allow tree deletions coming from the list:
      
      1. There is no index available that would describe the reverse path
         from the node up to the tree root, which is needed to perform a
         deletion.  To solve this, encode in each node its offset inside the
         parent.  This can be stored in the unused upper bits of the same
         member that stores the node's height at no extra space cost.
      
      2. The number of shadow entries needs to be counted in addition to the
         regular entries, to quickly detect when the node is ready to go to
         the shadow node LRU list.  The current entry count is an unsigned
         int but the maximum number of entries is 64, so a shadow counter
         can easily be stored in the unused upper bits.
      
      3. Tree modification needs tree lock and tree root, which are located
         in the address space, so store an address_space backpointer in the
         node.  The parent pointer of the node is in a union with the 2-word
         rcu_head, so the backpointer comes at no extra cost as well.
      
      4. The node needs to be linked to an LRU list, which requires a list
         head inside the node.  This does increase the size of the node, but
         it does not change the number of objects that fit into a slab page.
      
      [akpm@linux-foundation.org: export the right function]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      449dd698
    • mm: thrash detection-based file cache sizing · a528910e
      Authored by Johannes Weiner
      The VM maintains cached filesystem pages on two types of lists.  One
      list holds the pages recently faulted into the cache, the other list
      holds pages that have been referenced repeatedly on that first list.
      The idea is to prefer reclaiming young pages over those that have shown
      to benefit from caching in the past.  We call the recently used list the
      "inactive list" and the frequently used list the "active list".  The
      previous approach to balancing these lists, however, ultimately was not
      significantly better than a FIFO policy and still thrashed cache based
      on eviction speed, rather than actual demand for cache.
      
      This patch solves one half of the problem by decoupling the ability to
      detect working set changes from the inactive list size.  By maintaining
      a history of recently evicted file pages it can detect frequently used
      pages with an arbitrarily small inactive list size, and subsequently
      apply pressure on the active list based on actual demand for cache, not
      just overall eviction speed.
      
      Every zone maintains a counter that tracks inactive list aging speed.
      When a page is evicted, a snapshot of this counter is stored in the
      now-empty page cache radix tree slot.  On refault, the minimum access
      distance of the page can be assessed, to evaluate whether the page
      should be part of the active list or not.
      
      This fixes the VM's blindness towards working set changes in excess of
      the inactive list.  And it's the foundation to further improve the
      protection ability and reduce the minimum inactive list size of 50%.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a528910e
  17. 20 Mar 2014, 1 commit
    • mm, vmstat: Fix CPU hotplug callback registration · 0be94bad
      Authored by Srivatsa S. Bhat
      Subsystems that want to register CPU hotplug callbacks, as well as perform
      initialization for the CPUs that are already online, often do it as shown
      below:
      
      	get_online_cpus();
      
      	for_each_online_cpu(cpu)
      		init_cpu(cpu);
      
      	register_cpu_notifier(&foobar_cpu_notifier);
      
      	put_online_cpus();
      
      This is wrong, since it is prone to ABBA deadlocks involving the
      cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
      with CPU hotplug operations).
      
      Instead, the correct and race-free way of performing the callback
      registration is:
      
      	cpu_notifier_register_begin();
      
      	for_each_online_cpu(cpu)
      		init_cpu(cpu);
      
      	/* Note the use of the double underscored version of the API */
      	__register_cpu_notifier(&foobar_cpu_notifier);
      
      	cpu_notifier_register_done();
      
      Fix the vmstat code in the MM subsystem by using this latter form of callback
      registration.
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Ingo Molnar <mingo@kernel.org>
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      0be94bad
  18. 25 Jan 2014, 1 commit
  19. 13 Nov 2013, 3 commits
    • mm: numa: return the number of base pages altered by protection changes · 72403b4a
      Authored by Mel Gorman
      Commit 0255d491 ("mm: Account for a THP NUMA hinting update as one
      PTE update") was added to account for the number of PTE updates when
      marking pages prot_numa.  task_numa_work was using the old return value
      to track how much address space had been updated.  Altering the return
      value causes the scanner to do more work than it is configured or
      documented to in a single unit of work.
      
      This patch reverts that commit and accounts for the number of THP
      updates separately in vmstat.  It is up to the administrator to
      interpret the pair of values correctly.  This is a straight-forward
      operation and likely to only be of interest when actively debugging NUMA
      balancing problems.
      
      The impact of this patch is that the NUMA PTE scanner will scan slower
      when THP is enabled and workloads may converge slower as a result.  On
      the flip side, system CPU usage should be lower than recent tests
      reported.  This is an illustrative example of a short single JVM specjbb
      test
      
      specjbb
                             3.12.0                3.12.0
                            vanilla      acctupdates
      TPut 1      26143.00 (  0.00%)     25747.00 ( -1.51%)
      TPut 7     185257.00 (  0.00%)    183202.00 ( -1.11%)
      TPut 13    329760.00 (  0.00%)    346577.00 (  5.10%)
      TPut 19    442502.00 (  0.00%)    460146.00 (  3.99%)
      TPut 25    540634.00 (  0.00%)    549053.00 (  1.56%)
      TPut 31    512098.00 (  0.00%)    519611.00 (  1.47%)
      TPut 37    461276.00 (  0.00%)    474973.00 (  2.97%)
      TPut 43    403089.00 (  0.00%)    414172.00 (  2.75%)
      
                    3.12.0      3.12.0
                   vanilla acctupdates
      User         5169.64     5184.14
      System        100.45       80.02
      Elapsed       252.75      251.85
      
      Performance is similar, but note the reduction in system CPU time.  While
      this run showed a performance gain, that will not be universal, but at
      least the scanner will behave as documented.  The vmstats are obviously
      different, but here is a straightforward interpretation of them from
      mmtests:
      
                                      3.12.0      3.12.0
                              vanilla acctupdates
      NUMA page range updates        1408326    11043064
      NUMA huge PMD updates                0       21040
      NUMA PTE updates               1408326      291624
      
      "NUMA page range updates" == nr_pte_updates and is the value returned to
      the NUMA pte scanner.  "NUMA huge PMD updates" is the number of THP
      updates; in combination, the two can be used to calculate how many PTEs
      were updated from userspace.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reported-by: Alex Thorlton <athorlton@sgi.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72403b4a
    • T
      mm: clear N_CPU from node_states at CPU offline · 807a1bd2
      Authored by Toshi Kani
      vmstat_cpuup_callback() is a CPU notifier callback which sets N_CPU for a
      node at CPU online events.  However, it does not update this N_CPU info
      at CPU offline events.

      Changed vmstat_cpuup_callback() to clear N_CPU when the last CPU in the
      node goes offline, i.e. when the node no longer has any online CPUs.
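
      A sketch of the offline handling in the notifier (cpumask_empty() stands
      in for "no online CPUs remain in this node"; the exact code may differ):

      	/* sketch: in vmstat_cpuup_callback() */
      	case CPU_DEAD:
      	case CPU_DEAD_FROZEN:
      		refresh_zone_stat_thresholds();
      		node = cpu_to_node(cpu);
      		/* clear N_CPU once the node has no online CPUs left */
      		if (cpumask_empty(cpumask_of_node(node)))
      			node_clear_state(node, N_CPU);
      		break;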
      Signed-off-by: Toshi Kani <toshi.kani@hp.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Tested-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      807a1bd2
    • T
      mm: set N_CPU to node_states during boot · d7e0b37a
      Authored by Toshi Kani
      After a system has booted, N_CPU is not set for any node, as has_cpu
      shows an empty line.
      
        # cat /sys/devices/system/node/has_cpu
        (show-empty-line)
      
      setup_vmstat() registers its CPU notifier callback,
      vmstat_cpuup_callback(), which sets N_CPU for a node when a CPU comes
      online.  However, setup_vmstat() is called after all CPUs have already
      been launched in the boot sequence.

      Changed setup_vmstat() to set N_CPU for the nodes with online CPUs at
      boot, which is consistent with the other operations in
      vmstat_cpuup_callback(), i.e. start_cpu_timer() and
      refresh_zone_stat_thresholds().
      
      Also added get_online_cpus() to protect the for_each_online_cpu() loop.
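
      A minimal sketch of the boot-time marking (start_cpu_timer() is the
      per-cpu timer helper vmstat used at the time):

      	/* sketch: in setup_vmstat() */
      	get_online_cpus();
      	for_each_online_cpu(cpu) {
      		start_cpu_timer(cpu);
      		node_set_state(cpu_to_node(cpu), N_CPU);
      	}
      	put_online_cpus();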
      Signed-off-by: Toshi Kani <toshi.kani@hp.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Tested-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d7e0b37a
  20. 12 Sep 2013, 7 commits
    • L
      mm: vmscan: fix do_try_to_free_pages() livelock · 6e543d57
      Authored by Lisa Du
      This patch is based on KOSAKI's work and I add a little more description;
      please refer to https://lkml.org/lkml/2012/6/14/74.

      I found that the system can enter a state where a zone has lots of free
      pages but only order-0 and order-1 pages, which means the zone is heavily
      fragmented.  A high-order allocation can then cause long stalls in the
      direct reclaim path (e.g. 60 seconds), especially in an environment with
      no swap and no compaction.  This problem happened on v3.4, but the issue
      seems to still exist in the current tree; the reason is that
      do_try_to_free_pages() enters a livelock:
      
      kswapd will go to sleep if the zones have been fully scanned and are still
      not balanced, since kswapd thinks there is little point in trying all over
      again and wants to avoid an infinite loop.  Instead it changes the order
      from high-order to order-0, because kswapd considers order-0 the most
      important; see 73ce02e9 for the details.  If the watermarks are ok, kswapd
      will go back to sleep and may leave zone->all_unreclaimable = 0.  It
      assumes high-order users can still perform direct reclaim if they wish.

      Direct reclaim continues to reclaim for a high order which is not a
      COSTLY_ORDER, without invoking the OOM killer, until kswapd turns on
      zone->all_unreclaimable.  This is to avoid a too-early OOM kill.  So
      direct reclaim depends on kswapd to break this loop.
      
      In the worst case, direct reclaim may continue reclaiming pages forever
      while kswapd sleeps forever, until something like a watchdog detects it
      and finally kills the process.  As described in:
      http://thread.gmane.org/gmane.linux.kernel.mm/103737
      
      We can't turn on zone->all_unreclaimable from the direct reclaim path,
      because the direct reclaim path doesn't take any locks and such an
      approach would be racy.  Thus this patch removes the
      zone->all_unreclaimable field completely and recalculates the zone's
      reclaimable state every time.
      
      Note: we can't take the approach of having direct reclaim look at
      zone->pages_scanned directly while kswapd continues to use
      zone->all_unreclaimable, because that is racy.  Commit 929bea7c ("vmscan:
      all_unreclaimable() use zone->all_unreclaimable as a name") describes the
      details.
      
      [akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()]
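
      Those helpers end up roughly as follows (a sketch; the factor of 6
      reflects the long-standing heuristic of giving up after scanning six
      times the reclaimable pages):

      	static unsigned long zone_reclaimable_pages(struct zone *zone)
      	{
      		unsigned long nr;

      		nr = zone_page_state(zone, NR_ACTIVE_FILE) +
      		     zone_page_state(zone, NR_INACTIVE_FILE);
      		if (get_nr_swap_pages() > 0)
      			nr += zone_page_state(zone, NR_ACTIVE_ANON) +
      			      zone_page_state(zone, NR_INACTIVE_ANON);
      		return nr;
      	}

      	static bool zone_reclaimable(struct zone *zone)
      	{
      		return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
      	}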
      Cc: Aaditya Kumar <aaditya.kumar.30@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Nick Piggin <npiggin@gmail.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Bob Liu <lliubbo@gmail.com>
      Cc: Neil Zhang <zhangwm@marvell.com>
      Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Lisa Du <cldu@marvell.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6e543d57
    • C
      vmstat: use this_cpu() to avoid irqon/off sequence in refresh_cpu_vm_stats · fbc2edb0
      Authored by Christoph Lameter
      Disabling interrupts repeatedly can be avoided in the inner loop if we use
      a this_cpu operation.
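
      The inner loop then looks roughly like this sketch, where p is the zone's
      per-cpu pageset and global_diff accumulates per-item totals for one global
      update at the end:

      	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
      		/* read-and-clear the per-cpu diff without an irq off/on pair */
      		int v = this_cpu_xchg(p->vm_stat_diff[i], 0);

      		if (v) {
      			atomic_long_add(v, &zone->vm_stat[i]);
      			global_diff[i] += v;
      		}
      	}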
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      CC: Tejun Heo <tj@kernel.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fbc2edb0
    • C
      vmstat: create fold_diff · 4edb0748
      Authored by Christoph Lameter
      Both functions that update global counters use the same mechanism.
      
      Create a function that contains the common code.
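
      The common helper is essentially this (a sketch; vm_stat is the global
      zone-counter array):

      	static void fold_diff(int *diff)
      	{
      		int i;

      		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
      			if (diff[i])
      				atomic_long_add(diff[i], &vm_stat[i]);
      	}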
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      CC: Tejun Heo <tj@kernel.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4edb0748
    • C
      vmstat: create separate function to fold per cpu diffs into local counters · 2bb921e5
      Authored by Christoph Lameter
      The main idea behind this patchset is to reduce the vmstat update overhead
      by avoiding interrupt enable/disable and the use of per cpu atomics.
      
      This patch (of 3):
      
      It is better to have a separate folding function because
      refresh_cpu_vm_stats() also does other things like expire pages in the
      page allocator caches.
      
      If we have a separate function then refresh_cpu_vm_stats() is only called
      from the local cpu which allows additional optimizations.
      
      The folding function is only called when a cpu is being downed and
      therefore no other processor will be accessing the counters.  Also
      simplifies synchronization.
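
      A sketch of the downed-CPU fold (no irq fiddling or per-cpu atomics are
      needed because the CPU being folded is offline; names follow the vmstat
      code of that era):

      	void cpu_vm_stats_fold(int cpu)
      	{
      		struct zone *zone;
      		int i;
      		int global_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };

      		for_each_populated_zone(zone) {
      			struct per_cpu_pageset *p = per_cpu_ptr(zone->pageset, cpu);

      			for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
      				if (p->vm_stat_diff[i]) {
      					int v = p->vm_stat_diff[i];

      					p->vm_stat_diff[i] = 0;
      					atomic_long_add(v, &zone->vm_stat[i]);
      					global_diff[i] += v;
      				}
      		}

      		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
      			if (global_diff[i])
      				atomic_long_add(global_diff[i], &vm_stat[i]);
      	}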
      
      [akpm@linux-foundation.org: fix UP build]
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      CC: Tejun Heo <tj@kernel.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2bb921e5
    • J
      mm: page_alloc: fair zone allocator policy · 81c0a2bb
      Authored by Johannes Weiner
      Each zone that holds userspace pages of one workload must be aged at a
      speed proportional to the zone size.  Otherwise, the time an individual
      page gets to stay in memory depends on the zone it happened to be
      allocated in.  Asymmetry in the zone aging creates rather unpredictable
      aging behavior and results in the wrong pages being reclaimed, activated
      etc.
      
      But exactly this happens right now because of the way the page allocator
      and kswapd interact.  The page allocator uses per-node lists of all zones
      in the system, ordered by preference, when allocating a new page.  When
      the first iteration does not yield any results, kswapd is woken up and the
      allocator retries.  Due to the way kswapd reclaims zones below the high
      watermark while a zone can be allocated from when it is above the low
      watermark, the allocator may keep kswapd running while kswapd reclaim
      ensures that the page allocator can keep allocating from the first zone in
      the zonelist for extended periods of time.  Meanwhile the other zones
      rarely see new allocations and thus get aged much slower in comparison.
      
      The result is that the occasional page placed in lower zones gets
      relatively more time in memory, even gets promoted to the active list
      after its peers have long been evicted.  Meanwhile, the bulk of the
      working set may be thrashing on the preferred zone even though there may
      be significant amounts of memory available in the lower zones.
      
      Even the most basic test -- repeatedly reading a file slightly bigger than
      memory -- shows how broken the zone aging is.  In this scenario, no single
      page should be able to stay in memory long enough to get referenced twice
      and activated, but activation happens in spades:
      
        $ grep active_file /proc/zoneinfo
            nr_inactive_file 0
            nr_active_file 0
            nr_inactive_file 0
            nr_active_file 8
            nr_inactive_file 1582
            nr_active_file 11994
        $ cat data data data data >/dev/null
        $ grep active_file /proc/zoneinfo
            nr_inactive_file 0
            nr_active_file 70
            nr_inactive_file 258753
            nr_active_file 443214
            nr_inactive_file 149793
            nr_active_file 12021
      
      Fix this with a very simple round robin allocator.  Each zone is allowed a
      batch of allocations that is proportional to the zone's size, after which
      it is treated as full.  The batch counters are reset when all zones have
      been tried and the allocator enters the slowpath and kicks off kswapd
      reclaim.  Allocation and reclaim is now fairly spread out to all
      available/allowable zones:
      
        $ grep active_file /proc/zoneinfo
            nr_inactive_file 0
            nr_active_file 0
            nr_inactive_file 174
            nr_active_file 4865
            nr_inactive_file 53
            nr_active_file 860
        $ cat data data data data >/dev/null
        $ grep active_file /proc/zoneinfo
            nr_inactive_file 0
            nr_active_file 0
            nr_inactive_file 666622
            nr_active_file 4988
            nr_inactive_file 190969
            nr_active_file 937
      
      When zone_reclaim_mode is enabled, allocations will now spread out to all
      zones on the local node, not just the first preferred zone (which on a 4G
      node might be a tiny Normal zone).
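
      In rough outline, the batch accounting works like this (a sketch of
      fragments from the allocator paths; NR_ALLOC_BATCH is the per-zone
      counter introduced here, and the surrounding variables come from the
      allocator context):

      	/* fast path: skip a zone whose fair-share batch is used up */
      	if ((alloc_flags & ALLOC_WMARK_LOW) &&
      	    zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
      		continue;

      	/* charge each successful allocation against the zone's batch */
      	mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));

      	/* slowpath: reset every zone's batch in proportion to its size */
      	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
      		mod_zone_page_state(zone, NR_ALLOC_BATCH,
      				    high_wmark_pages(zone) - low_wmark_pages(zone) -
      				    zone_page_state(zone, NR_ALLOC_BATCH));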
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Paul Bolle <paul.bollee@gmail.com>
      Cc: Zlatko Calusic <zcalusic@bitsync.net>
      Tested-by: Kevin Hilman <khilman@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      81c0a2bb
    • D
      mm: vmstats: track TLB flush stats on UP too · 6df46865
      Authored by Dave Hansen
      The previous patch doing vmstats for TLB flushes ("mm: vmstats: tlb flush
      counters") effectively missed UP since arch/x86/mm/tlb.c is only compiled
      for SMP.
      
      UP systems do not do remote TLB flushes, so compile those counters out on
      UP.
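
      Roughly, the guard looks like this fragment of the vm_event_item list
      (a sketch; neighbouring entries are elided):

      	#ifdef CONFIG_SMP
      		NR_TLB_REMOTE_FLUSH,		/* cpu tried to flush others' tlbs */
      		NR_TLB_REMOTE_FLUSH_RECEIVED,	/* cpu received ipi for flush */
      	#endif
      		NR_TLB_LOCAL_FLUSH_ALL,
      		NR_TLB_LOCAL_FLUSH_ONE,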
      
      arch/x86/kernel/cpu/mtrr/generic.c calls __flush_tlb() directly.  This is
      probably an optimization since both the mtrr code and __flush_tlb() write
      cr4.  It would probably be safe to make that a flush_tlb_all() (and then
      get these statistics), but the mtrr code is ancient and I'm hesitant to
      touch it other than to just stick in the counters.
      
      [akpm@linux-foundation.org: tweak comments]
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6df46865
    • D
      mm: vmstats: tlb flush counters · 9824cf97
      Authored by Dave Hansen
      I was investigating some TLB flush scaling issues and realized that we do
      not have any good methods for figuring out how many TLB flushes we are
      doing.
      
      It would be nice to be able to do these in generic code, but the
      arch-independent calls don't explicitly specify whether we actually need
      to do remote flushes or not.  In the end, we really need to know if we
      actually _did_ global vs.  local invalidations, so that leaves us with few
      options other than to muck with the counters from arch-specific code.
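
      The new events are bumped from arch-specific code along these lines (a
      sketch; the x86 call sites live in arch/x86/mm/tlb.c and related files):

      	count_vm_event(NR_TLB_REMOTE_FLUSH);		/* initiating a remote shootdown */
      	count_vm_event(NR_TLB_REMOTE_FLUSH_RECEIVED);	/* handling the shootdown IPI */
      	count_vm_event(NR_TLB_LOCAL_FLUSH_ALL);		/* flushing the whole local TLB */
      	count_vm_event(NR_TLB_LOCAL_FLUSH_ONE);		/* flushing a single local entry */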
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9824cf97