1. 17 Jul 2018 (1 commit)
• x86/mm/tlb: Leave lazy TLB mode at page table free time · 2ff6ddf1
  Committed by Rik van Riel
Andy discovered that speculative memory accesses while in lazy
TLB mode can crash a system when a CPU tries to dereference a
speculative access using memory contents that used to be valid
page table memory, but have since been reused for something else
and now point into la-la land.

This problem can be prevented in two ways. The first is to
always send a TLB shootdown IPI to CPUs in lazy TLB mode, while
the second is to only send the TLB shootdown at page table
freeing time.

The second should result in fewer IPIs, since operations like
mprotect and madvise are very common with some workloads, but
do not involve page table freeing. Also, on munmap, batching
of page table freeing covers much larger ranges of virtual
memory than the batching of unmapped user pages. (A simplified
user-space sketch of this decision follows this entry.)
Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: efault@gmx.de
      Cc: kernel-team@fb.com
      Cc: luto@kernel.org
Link: http://lkml.kernel.org/r/20180716190337.26133-3-riel@surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2ff6ddf1
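A minimal user-space sketch of the shootdown policy described above, using a toy cpu_state type and needs_shootdown_ipi() helper (both hypothetical, not kernel code): a CPU in lazy TLB mode is skipped for ordinary flushes and only gets an IPI when page tables are actually being freed.

----
#include <stdbool.h>
#include <stdio.h>

/* Toy model only: lazy-mode CPUs skip ordinary TLB shootdowns, but must be
 * flushed when page tables are freed, because speculative accesses could
 * otherwise walk page tables that have been reused for something else. */
struct cpu_state {
	bool lazy_tlb;	/* running a kernel thread, borrowing the previous mm */
};

static bool needs_shootdown_ipi(const struct cpu_state *cpu, bool freeing_page_tables)
{
	if (!cpu->lazy_tlb)
		return true;		/* CPU actively uses the mm: always flush */
	return freeing_page_tables;	/* lazy mode: only when tables go away */
}

int main(void)
{
	struct cpu_state lazy = { .lazy_tlb = true };
	struct cpu_state busy = { .lazy_tlb = false };

	printf("mprotect/madvise, lazy CPU: IPI=%d\n", needs_shootdown_ipi(&lazy, false));
	printf("page table free,  lazy CPU: IPI=%d\n", needs_shootdown_ipi(&lazy, true));
	printf("mprotect/madvise, busy CPU: IPI=%d\n", needs_shootdown_ipi(&busy, false));
	return 0;
}
----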
2. 08 Jun 2018 (3 commits)
3. 01 Jun 2018 (1 commit)
4. 06 Apr 2018 (2 commits)
5. 18 Mar 2018 (1 commit)
6. 17 Feb 2018 (1 commit)
• mm: hide a #warning for COMPILE_TEST · af27d940
  Committed by Arnd Bergmann
      We get a warning about some slow configurations in randconfig kernels:
      
        mm/memory.c:83:2: error: #warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_cpupid. [-Werror=cpp]
      
The warning is reasonable by itself, but gets in the way of randconfig
build testing, so I'm hiding it whenever CONFIG_COMPILE_TEST is set.
(The guard pattern is sketched after this entry.)
      
      The warning was added in 2013 in commit 75980e97 ("mm: fold
      page->_last_nid into page->flags where possible").
      
      Cc: stable@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      af27d940
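A minimal sketch of the guard pattern described above (the exact surrounding condition in mm/memory.c is abbreviated here): the diagnostic stays for normal builds but is suppressed when CONFIG_COMPILE_TEST marks a build-coverage configuration.

----
/* Emit the configuration warning only for real builds, not for
 * randconfig/compile-test coverage builds. */
#if defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS) && !defined(CONFIG_COMPILE_TEST)
#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_cpupid.
#endif
----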
7. 07 Feb 2018 (1 commit)
8. 01 Feb 2018 (3 commits)
9. 20 Jan 2018 (2 commits)
10. 16 Dec 2017 (1 commit)
• Revert "mm: replace p??_write with pte_access_permitted in fault + gup paths" · f6f37321
  Committed by Linus Torvalds
      This reverts commits 5c9d2d5c, c7da82b8, and e7fe7b5c.
      
      We'll probably need to revisit this, but basically we should not
      complicate the get_user_pages_fast() case, and checking the actual page
      table protection key bits will require more care anyway, since the
      protection keys depend on the exact state of the VM in question.
      
Particularly when doing a "remote" page lookup (i.e. in somebody else's
VM, not your own), you need to be much more careful than this was.  Dave
Hansen says:
      
       "So, the underlying bug here is that we now a get_user_pages_remote()
        and then go ahead and do the p*_access_permitted() checks against the
        current PKRU. This was introduced recently with the addition of the
        new p??_access_permitted() calls.
      
        We have checks in the VMA path for the "remote" gups and we avoid
        consulting PKRU for them. This got missed in the pkeys selftests
        because I did a ptrace read, but not a *write*. I also didn't
        explicitly test it against something where a COW needed to be done"
      
      It's also not entirely clear that it makes sense to check the protection
      key bits at this level at all.  But one possible eventual solution is to
      make the get_user_pages_fast() case just abort if it sees protection key
      bits set, which makes us fall back to the regular get_user_pages() case,
      which then has a vma and can do the check there if we want to.
      
We'll see.  (A small user-space pkeys example after this entry shows why
the access rights are per-thread state.)
      
      Somewhat related to this all: what we _do_ want to do some day is to
      check the PAGE_USER bit - it should obviously always be set for user
      pages, but it would be a good check to have back.  Because we have no
      generic way to test for it, we lost it as part of moving over from the
      architecture-specific x86 GUP implementation to the generic one in
      commit e585513b ("x86/mm/gup: Switch GUP to the generic
      get_user_page_fast() implementation").
      
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f6f37321
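A hedged user-space illustration (not the reverted kernel code): protection-key access rights live in a per-thread register (PKRU on x86), so what the current thread may access says nothing about what a remote task may access, which is exactly the pitfall described above for remote gup. Requires glibc >= 2.27 and pkeys-capable hardware; pkey_alloc() simply fails elsewhere.

----
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	char *buf = mmap(NULL, page, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int pkey = pkey_alloc(0, 0);

	if (buf == MAP_FAILED || pkey < 0) {
		perror("pkeys unavailable");
		return 1;
	}
	strcpy(buf, "hello");
	pkey_mprotect(buf, page, PROT_READ | PROT_WRITE, pkey);

	/* Drop write rights for *this thread only*; other threads keep theirs,
	 * which is why consulting the current PKRU for a remote task is wrong. */
	pkey_set(pkey, PKEY_DISABLE_WRITE);
	printf("read still works: %s\n", buf);

	/* Restore full rights before writing to the page again. */
	pkey_set(pkey, 0);
	buf[0] = 'H';
	printf("after re-enabling writes: %s\n", buf);

	pkey_free(pkey);
	munmap(buf, page);
	return 0;
}
----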
11. 15 Dec 2017 (1 commit)
12. 30 Nov 2017 (4 commits)
13. 28 Nov 2017 (1 commit)
• mm, thp: Do not make pmd/pud dirty without a reason · 152e93af
  Committed by Kirill A. Shutemov
Currently we make page table entries dirty all the time regardless of
access type and don't even consider whether the mapping is
write-protected.  The reasoning is that we don't really need dirty
tracking on THP and making the entry dirty upfront may save some time on
the first write to the page.

Unfortunately, such an approach may result in a false-positive
can_follow_write_pmd() for the huge zero page or a read-only shmem file.

Let's only make the page dirty if we are about to write to the page
anyway (as we do for small pages).

I've restructured the code to make the entry dirty inside
maybe_p[mu]d_mkwrite(), which also takes into account whether the vma is
write-protected.  (A simplified model of this follows the entry.)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      152e93af
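A simplified user-space model of the policy above, using a toy entry type and maybe_mkwrite() helper (both hypothetical): an entry is made writable and dirty only when the fault is a write and the mapping permits writes, instead of dirtying every new huge entry upfront.

----
#include <stdbool.h>
#include <stdio.h>

struct entry {
	bool present, writable, dirty;
};

/* Mark the entry writable and dirty only if we are actually about to
 * write and the VMA allows it; otherwise leave it clean. */
static struct entry maybe_mkwrite(struct entry e, bool vma_writable, bool write_fault)
{
	if (write_fault && vma_writable) {
		e.writable = true;
		e.dirty = true;
	}
	return e;
}

int main(void)
{
	struct entry e = { .present = true };

	e = maybe_mkwrite(e, /*vma_writable=*/true, /*write_fault=*/false);
	printf("read fault:  dirty=%d writable=%d\n", e.dirty, e.writable);

	e = maybe_mkwrite(e, /*vma_writable=*/true, /*write_fault=*/true);
	printf("write fault: dirty=%d writable=%d\n", e.dirty, e.writable);
	return 0;
}
----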
14. 16 Nov 2017 (6 commits)
15. 25 Oct 2017 (1 commit)
• locking/atomics, mm: Convert ACCESS_ONCE() to READ_ONCE()/WRITE_ONCE() · b03a0fe0
  Committed by Paul E. McKenney
      For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
      preference to ACCESS_ONCE(), and new code is expected to use one of the
      former. So far, there's been no reason to change most existing uses of
      ACCESS_ONCE(), as these aren't currently harmful.
      
      However, for some features it is necessary to instrument reads and
      writes separately, which is not possible with ACCESS_ONCE(). This
      distinction is critical to correct operation.
      
It's possible to transform the bulk of kernel code using the Coccinelle
script below. However, this doesn't handle comments, leaving references
to ACCESS_ONCE() instances which have been removed. As a preparatory
step, this patch converts the mm code and comments to use
{READ,WRITE}_ONCE() consistently. (A small before/after example follows
this entry.)
      
      ----
      virtual patch
      
      @ depends on patch @
      expression E1, E2;
      @@
      
      - ACCESS_ONCE(E1) = E2
      + WRITE_ONCE(E1, E2)
      
      @ depends on patch @
      expression E;
      @@
      
      - ACCESS_ONCE(E)
      + READ_ONCE(E)
      ----
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: davem@davemloft.net
      Cc: linux-arch@vger.kernel.org
      Cc: mpe@ellerman.id.au
      Cc: shuah@kernel.org
      Cc: snitzer@redhat.com
      Cc: thor.thayer@linux.intel.com
      Cc: tj@kernel.org
      Cc: viro@zeniv.linux.org.uk
Link: http://lkml.kernel.org/r/1508792849-3115-15-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b03a0fe0
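The shape of the conversion produced by the Coccinelle patch, made compilable in user space with simplified stand-ins for the kernel macros (the real READ_ONCE()/WRITE_ONCE() definitions live in the kernel headers and do considerably more):

----
#include <stdio.h>

/* Simplified user-space stand-ins, for illustration only. */
#define WRITE_ONCE(x, val)	(*(volatile __typeof__(x) *)&(x) = (val))
#define READ_ONCE(x)		(*(volatile __typeof__(x) *)&(x))

static int shared_flag;

int main(void)
{
	/* Before: ACCESS_ONCE(shared_flag) = 1;  v = ACCESS_ONCE(shared_flag);
	 * After the conversion: */
	WRITE_ONCE(shared_flag, 1);
	int v = READ_ONCE(shared_flag);

	printf("flag=%d\n", v);
	return 0;
}
----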
16. 04 Oct 2017 (1 commit)
17. 09 Sep 2017 (6 commits)
18. 07 Sep 2017 (4 commits)
• mm: hugetlb: clear target sub-page last when clearing huge page · c79b57e4
  Committed by Huang Ying
Huge pages help to reduce the TLB miss rate, but they have a higher
cache footprint, which can sometimes cause problems.  For example, when
clearing a huge page on x86_64, the cache footprint is 2M.  But a Xeon
E5 v3 2699 CPU has 18 cores, 36 threads, and only 45M of LLC (last level
cache).  That is, on average, there is 2.5M of LLC per core and 1.25M
per thread.

If cache pressure is heavy while clearing the huge page, and we clear
the huge page from beginning to end, the beginning of the huge page may
already have been evicted from the cache by the time we finish clearing
the end.  And the application is likely to access the beginning of the
huge page right after it has been cleared.

To help with this situation, this patch changes the order in which the
sub-pages of a huge page are cleared.  In quite a few situations we know
the address the application will access after the huge page has been
cleared, for example in a page fault handler.  Instead of clearing the
huge page from beginning to end, we clear the sub-pages farthest from
the to-be-accessed sub-page first, and clear that sub-page last.  This
keeps the sub-page to be accessed, and the sub-pages around it, the most
cache-hot.  If we cannot know the address the application will access,
the beginning of the huge page is assumed to be the address the
application will access.  (A small user-space sketch of this clearing
order follows this entry.)

With this patch, throughput increases by ~28.3% in the vm-scalability
anon-w-seq test case with 72 processes on a 2-socket Xeon E5 v3 2699
system (36 cores, 72 threads).  The test case creates 72 processes, each
of which mmaps a big anonymous memory area and writes to it from
beginning to end.  From each process's point of view, the other
processes act as background workload generating heavy cache pressure.
At the same time, the cache miss rate drops from ~33.4% to ~31.7%, the
IPC (instructions per cycle) increases from 0.56 to 0.74, and the time
spent in user space is reduced by ~7.9%.

Christopher Lameter suggested clearing the bytes inside each sub-page
from end to beginning as well, but tests show no visible performance
difference, probably because the page size is small compared with the
cache size.

Thanks to Andi Kleen for proposing to use the to-be-accessed address to
determine the order in which the sub-pages are cleared.

The hugetlbfs access address could be improved further; that will be
done in another patch.
      
      [ying.huang@intel.com: improve readability of clear_huge_page()]
        Link: http://lkml.kernel.org/r/20170830051842.1397-1-ying.huang@intel.com
Link: http://lkml.kernel.org/r/20170815014618.15842-1-ying.huang@intel.com
Suggested-by: Andi Kleen <andi.kleen@intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Jan Kara <jack@suse.cz>
Reviewed-by: Michal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Nadia Yvette Chambers <nyc@holomorphy.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c79b57e4
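A user-space sketch (not the kernel implementation) of the clearing order described above, with a toy SUBPAGES count standing in for the number of sub-pages in a huge page: sub-pages are cleared from both ends toward the target, farthest first, so the target and its neighbours are cleared last and stay cache-hot.

----
#include <stdio.h>

#define SUBPAGES 8	/* stand-in for the number of 4K sub-pages in a huge page */

static void clear_subpage(int idx)
{
	printf("clearing sub-page %d\n", idx);	/* real code would zero 4K here */
}

/* Clear whichever remaining end is farther from the target sub-page,
 * so the target itself is cleared last. */
static void clear_huge_page_towards(int target)
{
	int left = 0, right = SUBPAGES - 1;

	while (left <= right) {
		if (target - left >= right - target)
			clear_subpage(left++);
		else
			clear_subpage(right--);
	}
}

int main(void)
{
	clear_huge_page_towards(5);	/* e.g. the fault address fell in sub-page 5 */
	return 0;
}
----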
• mm, swap: VMA based swap readahead · ec560175
  Committed by Huang Ying
Swap readahead is an important mechanism to reduce swap-in latency.
Although a purely sequential memory access pattern isn't very common
for anonymous memory, spatial locality is still considered valid.

In the original swap readahead implementation, consecutive blocks in the
swap device are read ahead based on a global spatial locality estimate.
But consecutive blocks in the swap device merely reflect the order of
page reclaim; they don't necessarily reflect the access pattern in
virtual memory.  And different tasks in the system may have different
access patterns, which makes the global estimate inaccurate.

With this patch, when a page fault occurs, the virtual pages near the
fault address are read ahead instead of the swap slots near the faulting
swap slot in the swap device.  This avoids reading ahead unrelated swap
slots.  At the same time, swap readahead is changed from global to
per-VMA, so that the different access patterns of different VMAs can be
distinguished and a different readahead policy applied to each.  The
original core readahead detection and scaling algorithm is reused,
because it is an effective algorithm for detecting spatial locality.
(A small sketch of a VMA-bounded readahead window follows this entry.)
      
The test setup and results are as follows.
      
      Common test condition
      =====================
      
      Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM) Swap device:
      NVMe disk
      
      Micro-benchmark with combined access pattern
      ============================================
      
vm-scalability, sequential swap test case: 4 processes eat 50G of
virtual memory space and repeat sequential memory writes for 300
seconds.  The first round of writing triggers swap-out; the following
rounds trigger sequential swap-in and swap-out.

At the same time, the vm-scalability random swap test case runs in the
background: 8 processes eat 30G of virtual memory space and repeat
random memory writes for 300 seconds.  This triggers random swap-in in
the background.

This is a combined workload with sequential and random memory accesses
at the same time.  The results (for the sequential workload) are as
follows:
      
      			Base		Optimized
      			----		---------
      throughput		345413 KB/s	414029 KB/s (+19.9%)
      latency.average		97.14 us	61.06 us (-37.1%)
      latency.50th		2 us		1 us
      latency.60th		2 us		1 us
      latency.70th		98 us		2 us
      latency.80th		160 us		2 us
      latency.90th		260 us		217 us
      latency.95th		346 us		369 us
      latency.99th		1.34 ms		1.09 ms
      ra_hit%			52.69%		99.98%
      
The original swap readahead algorithm is confused by the background
random access workload, so its readahead hit rate is lower.  The
VMA-based readahead algorithm works much better.
      
      Linpack
      =======
      
      The test memory size is bigger than RAM to trigger swapping.
      
      			Base		Optimized
      			----		---------
      elapsed_time		393.49 s	329.88 s (-16.2%)
      ra_hit%			86.21%		98.82%
      
The scores of the base and optimized kernels show no visible change, but
the elapsed time is reduced and the readahead hit rate is improved, so
the optimized kernel performs better during the startup and tear-down
stages.  And the high absolute readahead hit rate shows that spatial
locality is still valid in some practical workloads.
      
Link: http://lkml.kernel.org/r/20170807054038.1843-4-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ec560175
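A hedged sketch of the idea (names and the fixed window size are illustrative, not the kernel code): on a fault, a window of virtual pages around the fault address is chosen and clamped to the VMA boundaries, instead of reading ahead swap slots that merely happen to be adjacent on the swap device. In the real implementation the window size comes from the hit-rate based scaling logic.

----
#include <stdio.h>

#define PAGE_SHIFT 12

struct vma {
	unsigned long start, end;	/* page-aligned virtual address range */
};

/* Centre a readahead window of win_pages pages on the fault address,
 * then clamp it so it never leaves the faulting VMA. */
static void readahead_window(const struct vma *vma, unsigned long fault_addr,
			     unsigned long win_pages,
			     unsigned long *ra_start, unsigned long *ra_end)
{
	unsigned long half = (win_pages / 2) << PAGE_SHIFT;
	unsigned long start = fault_addr > half ? fault_addr - half : 0;
	unsigned long end = fault_addr + half;

	if (start < vma->start)
		start = vma->start;
	if (end > vma->end)
		end = vma->end;
	*ra_start = start;
	*ra_end = end;
}

int main(void)
{
	struct vma vma = { .start = 0x700000000000UL, .end = 0x700000100000UL };
	unsigned long s, e;

	readahead_window(&vma, 0x700000004000UL, 8, &s, &e);
	printf("read ahead [%lx, %lx)\n", s, e);
	return 0;
}
----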
• mm, THP, swap: make reuse_swap_page() works for THP swapped out · ba3c4ce6
  Committed by Huang Ying
Now that splitting of a THP (Transparent Huge Page) can be delayed until
after it has been swapped out, it is possible that some page table
mappings of the THP have been turned into swap entries.  So
reuse_swap_page() needs to check the swap count in addition to the map
count it checked before.  This patch does that.  (A simplified model of
the check follows this entry.)

In the huge PMD write-protect fault handler, the swap count needs to be
checked in addition to the page map count, so the page lock needs to be
acquired as well as the page table lock when calling reuse_swap_page().
      
      [ying.huang@intel.com: silence a compiler warning]
        Link: http://lkml.kernel.org/r/87bmnzizjy.fsf@yhuang-dev.intel.com
Link: http://lkml.kernel.org/r/20170724051840.2309-4-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c]
      Cc: Vishal L Verma <vishal.l.verma@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ba3c4ce6
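A simplified user-space model of the reuse decision described above (types and the helper are hypothetical, and the real check involves locking and per-entry swap counts): a page may be reused for a write, rather than copied, only if the faulting mapping is the sole remaining reference, counting both page table mappings and swap entries.

----
#include <stdbool.h>
#include <stdio.h>

struct page_refs {
	int map_count;	/* remaining page table mappings */
	int swap_count;	/* swap entries still referencing the page's swapped copy */
};

/* Reuse in place only when we hold the single reference overall. */
static bool can_reuse_for_write(const struct page_refs *p)
{
	return p->map_count + p->swap_count == 1;
}

int main(void)
{
	struct page_refs only_us = { .map_count = 1, .swap_count = 0 };
	struct page_refs also_swapped = { .map_count = 1, .swap_count = 1 };

	printf("sole mapping:            reuse=%d\n", can_reuse_for_write(&only_us));
	printf("swap entry also remains: reuse=%d\n", can_reuse_for_write(&also_swapped));
	return 0;
}
----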
• mm: always flush VMA ranges affected by zap_page_range · 4647706e
  Committed by Mel Gorman
Nadav Amit reports that zap_page_range() only specifies that the caller
must protect the VMA list, but does not specify whether it is held for
read or write, and callers use either.  madvise holds mmap_sem for read,
meaning that a parallel zap operation can unmap PTEs which are then
potentially skipped by madvise, which can then return with stale TLB
entries present.  While the API could be extended, it would be a
difficult API to use.  This patch causes zap_page_range() to always
consider flushing the full affected range.  For small ranges or sparsely
populated mappings, this may result in one additional spurious TLB
flush.  For larger ranges, it is possible that the TLB has already been
flushed and the overhead is negligible.  Either way, this approach is
safer overall and avoids stale entries being present when madvise
returns.

This can be illustrated with the following program provided by Nadav
Amit and slightly modified.  With the patch applied, it exits with code
0, indicating that a stale TLB entry did not leak to userspace.
      
---8<---

/* The includes and the two constants below were not part of the original
 * snippet; they are assumed here so the test builds stand-alone
 * (compile with -lpthread). */
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define PAGE_SIZE	4096	/* assumed x86-64 base page size */
#define N_PAGES		65536	/* assumed mapping size (256MB) */

volatile int sync_step = 0;
volatile char *p;

static inline unsigned long rdtsc(void)
{
	unsigned long hi, lo;

	__asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
	return lo | (hi << 32);
}

static inline void wait_rdtsc(unsigned long cycles)
{
	unsigned long tsc = rdtsc();

	while (rdtsc() - tsc < cycles);
}

void *big_madvise_thread(void *ign)
{
	sync_step = 1;
	while (sync_step != 2);
	madvise((void *)p, PAGE_SIZE * N_PAGES, MADV_DONTNEED);
	return NULL;
}

int main(void)
{
	pthread_t aux_thread;

	p = mmap(0, PAGE_SIZE * N_PAGES, PROT_READ|PROT_WRITE,
		 MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);

	memset((void *)p, 8, PAGE_SIZE * N_PAGES);

	pthread_create(&aux_thread, NULL, big_madvise_thread, NULL);
	while (sync_step != 1);

	*p = 8;		/* cache the translation in the TLB */
	sync_step = 2;
	wait_rdtsc(100000);
	madvise((void *)p, PAGE_SIZE, MADV_DONTNEED);
	printf("data: %d (%s)\n", *p, (*p == 8 ? "stale, broken" : "cleared, fine"));
	return *p == 8 ? -1 : 0;
}
---8<---
      
Link: http://lkml.kernel.org/r/20170725101230.5v7gvnjmcnkzzql3@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Nadav Amit <nadav.amit@gmail.com>
      Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4647706e